The UK Biobank (UKBB) is a long-term resource arising from a prospective study that recruited approximately 500,000 participants from across the United Kingdom, with the aim of investigating the contributions of genetics and environment to the development of disease. Genotype data for approximately 490,000 participants, together with a wide range of phenotype data, are available as described by Sudlow et al.
Medications taken by participants can be a valuable indicator of disease status. Medication data were captured at the time of participant assessment centre visits in two steps. Participants first indicated, via a touch-screen questionnaire, whether they were currently taking any of certain important classes of medication. These were blood pressure lowering drugs, cholesterol lowering drugs and hormone replacement therapy; participants were also asked whether they were regularly taking over-the-counter medications, including vitamins and supplements. In the interview which followed, a trained member of staff confirmed the information indicated via the touch-screen questionnaire and selected details of the particular drugs taken from a list of possible alternatives. Other medications currently taken at the time of the interview were also captured during the interview. “Current” did not include antibiotics or other short-term medications.
There is no published structure in the UKBB-assigned medication codes and no means of grouping medicines into categories based on, for example, disease area or drug class. Furthermore, a mixture of trade names and generic drug names has been included in the descriptions without any direct information on their equivalences. While formulations are given in some cases (for example, Lipitor 10mg tablet), this does not indicate dosage (how many tablets and how many times per day).
The UKBB medications coding table, which lists codes against medication descriptions and participant report counts, is publicly available from the UKBB data showcase as a separate download. Table 1 shows an excerpt from this table, in which a number of groupings could be made. For example, Atorvastatin and Lipitor, respectively the generic name and a brand name for the same cholesterol lowering drug, have separate UKBB codes and were each reported by significant numbers of participants, of whom only 326 reported taking both.
Similarly, the excerpt lists two items for salbutamol, and Salbuvent is also a brand name for salbutamol.
Information on medication can be used to derive a range of phenotypes, either as stand-alone proxies for clinical phenotypes or as additional evidence in combination with other UKBB-derived phenotypes. For example, a participant might report currently taking a medication (for example, a long-acting beta-adrenoceptor agonist) that indicates a diagnosis of asthma. In addition, there might be a record of an ICD-10 code, assigned at the time of a hospital episode, indicating an asthma exacerbation.
A recent study has used the UKBB medication codes and descriptions directly as proxies for phenotypes, and UKBB medication codes have been used singly in fine-grained unstructured PheWAS data. Such analyses are prone to a problem: participants taking a related medication appear as controls for any given medication. In this paper we show that our method for grouping related medications under ATC or BNF classifications can eliminate these false controls.
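The false-control problem can be illustrated with a minimal sketch. The participant identifiers and medication code strings below are hypothetical placeholders, not real UKBB codes, and the function is an illustration of the idea rather than the software described in this paper.

```python
def case_control_split(reports, case_codes):
    """Split participants into cases and controls.

    reports: dict mapping participant id -> set of reported medication codes.
    case_codes: set of codes that define a case.
    A participant is a case if they report at least one code in case_codes;
    everyone else is treated as a control.
    """
    cases = {pid for pid, codes in reports.items() if codes & case_codes}
    controls = set(reports) - cases
    return cases, controls


# Hypothetical self-reports; the code strings are placeholders only.
reports = {
    "p1": {"code_atorvastatin"},
    "p2": {"code_lipitor"},      # brand name for the same drug
    "p3": {"code_paracetamol"},
    "p4": set(),
}

# Grouping of equivalent medications (assumed to come from a classification).
atorvastatin_group = {"code_atorvastatin", "code_lipitor"}

# Analysis on a single code: p2 is counted as a control although
# they take the same drug under its brand name.
single_cases, single_controls = case_control_split(reports, {"code_atorvastatin"})
false_controls = {pid for pid in single_controls if reports[pid] & atorvastatin_group}

# Analysis on the grouped codes: p2 is now a case.
group_cases, group_controls = case_control_split(reports, atorvastatin_group)
```

In the single-code analysis `false_controls` contains `p2`; with the grouped codes no participant taking the drug remains among the controls.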
Drug Classification Systems
Drug classification systems are published by the World Health Organisation as the Anatomical Therapeutic Chemical (ATC) system and by the British National Formulary (BNF).
The ATC system consists of a hierarchy in which levels defined by prefixes of the full designation string for a compound are used to group drug therapies. For example:
ATC level1: R = Respiratory System
ATC level2: R03 = Drugs for Obstructive Airway Diseases
ATC level3: R03A = Adrenergics, Inhalants
ATC level4: R03AC = Selective beta-2-adrenoreceptor agonists
ATC level5: R03AC02 = Salbutamol
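Because each ATC level is a prefix of the full code, grouping at any level reduces to truncating the code string. The following is a minimal sketch of that idea; the function names are illustrative assumptions, and the prefix lengths (1, 3, 4, 5 and 7 characters) are read off the example above.

```python
# ATC level number -> length of the code prefix that identifies that level.
ATC_PREFIX_LENGTHS = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def atc_levels(code):
    """Return the prefix of a full (level 5) ATC code at each of the five levels."""
    code = code.strip().upper()
    if len(code) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {code!r}")
    return {level: code[:n] for level, n in ATC_PREFIX_LENGTHS.items()}


def same_group(code_a, code_b, level):
    """True if two ATC codes fall in the same group at the given level."""
    n = ATC_PREFIX_LENGTHS[level]
    return code_a.strip().upper()[:n] == code_b.strip().upper()[:n]


levels = atc_levels("R03AC02")  # salbutamol, as in the example above
```

For instance, `same_group("R03AC02", "R03AC12", 4)` compares salbutamol with salmeterol (R03AC12) at level 4, where both share the prefix R03AC.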
For testing purposes, ATC codes and descriptions were sourced from the ChEMBL database, which is “a curated database of bioactive chemicals with drug-like properties” maintained at the European Bioinformatics Institute (EBI). The ChEMBL database is available as a free download in several database formats.
The BNF system uses a hierarchy of chapters, sections and subsections followed by codes to uniquely identify a drug. For example:
BNF chapter 3: Respiratory System
BNF section 3.1: Bronchodilators
BNF subsection 3.1.1: Adrenoceptor Agonists
BNF code 0301011R0BPADAW: SALBUTAMOL 400 CYCLOCAPS_CAP 400MCG
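The chapter, section and subsection can be recovered from the leading characters of a BNF code. The sketch below assumes the 15-character layout used in NHS prescribing data (two characters each for chapter, section and paragraph, one for subparagraph, then two each for chemical substance, product, strength/formulation and generic equivalent). This layout is an assumption not stated in the text above, and the names used are illustrative only.

```python
from typing import NamedTuple


class BnfCode(NamedTuple):
    chapter: str
    section: str
    paragraph: str
    subparagraph: str
    chemical: str
    product: str
    strength_formulation: str
    generic_equivalent: str


def parse_bnf_code(code):
    """Split a 15-character BNF code into its fixed-width fields (assumed layout)."""
    code = code.strip().upper()
    if len(code) != 15:
        raise ValueError(f"expected a 15-character BNF code, got {code!r}")
    return BnfCode(code[0:2], code[2:4], code[4:6], code[6:7],
                   code[7:9], code[9:11], code[11:13], code[13:15])


def bnf_subsection(code):
    """Return the dotted chapter.section.subsection label, e.g. '3.1.1'."""
    parts = parse_bnf_code(code)
    return f"{int(parts.chapter)}.{int(parts.section)}.{int(parts.paragraph)}"


def bnf_chemical_prefix(code):
    """Return the first nine characters, which identify the chemical substance."""
    parse_bnf_code(code)  # validates the length
    return code.strip().upper()[:9]


parsed = parse_bnf_code("0301011R0BPADAW")
```

Under this assumed layout, all salbutamol presentations share the nine-character prefix returned by `bnf_chemical_prefix`, which gives a natural grouping key analogous to ATC level 5.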
BNF coding is available at the Health Informatics Centre (HIC) of the University of Dundee as part of NHS-supplied data, and this paper describes HIC’s use of this resource in classifying UKBB self-reported medication data.
The project aim was to design, write and test software to facilitate classification of self-reported medications in such a way that the data can effectively contribute to defining or refining clinical phenotypes, either alone or assembled from multiple sources within the UKBB data.