British surname origins, population structure and health outcomes – an observational study of English hospital records

Abstract

Background
Population and social structure act as important confounders on pathways linking genotypes to health outcomes. This study examines whether the historical, geographical origins of British surnames, as markers of population structure, are associated with differential health outcomes today.

Methods
We coded the hospital admissions of more than 30 million patients in England between 1999 and 2013 to their surname origin and divided their diagnoses into 125 major disease categories. A base population dataset was constructed with patients' first admission of any kind. Age- and sex-standardised odds ratios were calculated with logistic regression using patients with ubiquitous English surnames such as "Smith" as reference. Using a data mining approach, we scanned the results for "signals", where a branch of related surname origins all had significantly higher or lower risk than the reference group. We subsequently studied the age- and sex-standardised incidence for each signal across the density of the surname origin (quintiles) as well as quintiles of area deprivation.

Results
We identified a signal with Scottish surnames (alcohol-related disorders) and three with different branches of English surnames (disorders of teeth and jaw, fractures, upper gastrointestinal disorders). For the three English surname groups, the risk differed from that of patients with other surnames only in the quintile with the highest density of that group. Differential risk remained when studied across quintiles of area deprivation.

Conclusions
The study shows that surname origins are associated with diverse health outcomes and may thus act as combined markers of population structure over and above area deprivation.

Background
It is well established that the distribution of surnames correlates with that of genetic population structure [1][2][3]. In Britain, surnames have been passed down from generation to generation for more than seven centuries [4]. When the geographical distributions of surnames are mapped with data from 19th century censuses, it becomes clear that many surnames can be traced back to very specific localities prior to the large-scale urbanisation and migration that characterise the people of the British Isles today [5]. Despite migration and mixing of populations, many of these surnames are still most common in the same heartlands as they were in the 19th century [6]. Genetic studies have found that the rarer the surname, the more likely the bearers of that surname are related [4]. This reflects the fact that many of the more common surnames have multiple apparent origins, indicating different, unrelated ancestors. Among the more widespread surnames with multiple origins are names taken from professions (e.g., Smith) or landscape features (e.g., Ford). Previous work has characterised the regionality of British surnames, defining so-called isonymy regions based on the 1881 Census geography [6-8]. Isonymy regions are geographical regions whose populations bear a distinct constellation of surnames. One study identified direct correspondence between isonymy regions in 1881 and contemporary genetic population structure [9] (we will refer to isonymy regions as surname origins hereinafter). Surname geographies can reveal patterns of migration and social mobility [10]. The 1881 regional analysis does not, however, accommodate the apparent multiple origins of surnames imported to Great Britain prior to the 1881 Census, with many names borne by migrants from Ireland being prominent examples.
In theory, bearers with the same surname origin are more likely to be related genetically and culturally. How related they are and whether surname origins could be a useful marker of co-ancestry in health studies is an ongoing research concern [4], but evidence from multiple countries suggests that present-day surname bearers resident in high-density heartlands are more likely to be related than bearers scattered across lower-density regions, where the chance of admixture is greater.
Population structure also acts as a strong confounder which continues to limit the validity of genome-wide association studies [11], and there is a need for methods that can systematically identify population structure and inform the sampling design of such studies.
The aim of the study is to test whether surname origins are associated with health outcomes and thus inform more detailed investigations untangling biological and social factors in health. We investigate whether British surname origins can be associated with health outcomes in a large database study of hospital admissions in England today. We divide the admissions of over 30 million patients into 125 major disease categories for the study. Even though administrative hospital data are sparse on contextual data, we make provisions for this in two ways. First, we develop a dose-response relationship between health outcomes and density of different surname groups. Second, we study confounding socio-economic factors by coding patients' residences to a national index of area deprivation.

Methods
Diagnoses were coded using the International Classification of Diseases (ICD-10) system [12], which for analytical purposes was aggregated into 125 Clinical Classifications Software (CCS) categories [13]. A denominator dataset was created with the first admission (of any kind) from each of 32,860,835 patients with known residence recorded between April 1999 and March 2014. In this process, patients without surname records or with non-British surnames were excluded. Only patients' first admission for each disease category was kept. In the interest of keeping specificity and avoiding dilution of any associations, the analyses were carried out in an initial "signal detection" round followed by a second round of more detailed analyses of potential signals. Age- and sex-adjusted odds ratios for each of the 125 CCS disease categories were estimated for all 74 surname origins using logistic regression. For these analyses, patients from the dominant Cluster-13, whose origin is coterminous with England as a whole, were used as the reference population. Only 94 out of the 125 CCS categories had sufficient data for analysis.
"Signals" were identified from this first round where the names making up a whole branch of related surname origins had admission risks significantly above or below the reference population. The rationale for defining signals in this way was that branch-level clustering suggests a lineage effect between individuals sharing surname origin and potentially also genetic and cultural roots.
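The signal definition above can be illustrated with a minimal sketch. It is not the authors' code: the origin labels, branch membership and confidence intervals are hypothetical inputs, and "significantly above or below" is read here, as an assumption, as a 95% confidence interval for the odds ratio that excludes 1.

```python
from typing import Dict, List, Optional, Tuple

# (lower, upper) bounds of a 95% confidence interval for an odds ratio
Interval = Tuple[float, float]


def origin_direction(ci: Interval) -> int:
    """Return +1 if the interval lies wholly above 1, -1 if wholly below, else 0."""
    lower, upper = ci
    if lower > 1.0:
        return 1
    if upper < 1.0:
        return -1
    return 0


def branch_signal(branch: List[str], intervals: Dict[str, Interval]) -> Optional[str]:
    """Flag a branch only if every member origin differs from the
    reference population in the same direction."""
    directions = {origin_direction(intervals[origin]) for origin in branch}
    if directions == {1}:
        return "higher"
    if directions == {-1}:
        return "lower"
    return None


# Hypothetical example values, not results from the study.
intervals = {
    "A1": (1.10, 1.30),
    "A2": (1.05, 1.40),
    "B1": (0.80, 0.95),
    "B2": (0.90, 1.10),
}
branches = {"A": ["A1", "A2"], "B": ["B1", "B2"]}
signals = {name: branch_signal(members, intervals) for name, members in branches.items()}
print(signals)
```

Requiring agreement across the whole branch means that a single origin whose interval spans 1 (such as the hypothetical "B2") is enough to prevent the branch from being flagged.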
Age- and sex-standardised incidence per 100,000 population was then calculated for the identified signals, weighted according to the 2013 European Standard Population [14], and broken down by origin density quintiles and area deprivation quintiles. The origin density was constructed as the percentage of patients with a given regional origin relative to all patients resident in each local authority district, separated into quintiles. Local authority was chosen as the unit of analysis because all the signal origins would have an unbroken, non-zero distribution. A dispersion ratio for each origin aggregate was calculated as Q1 divided by Q5, so that a fully dispersed origin would have a ratio of 1 and the least dispersed a ratio approaching 0. Area deprivation quintiles were coded at neighbourhood level (2011 Middle Layer Super Output Area) [15].
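The two summary measures described above can be sketched as follows. The sketch assumes direct standardisation (stratum-specific rates weighted by a standard population) and takes Q1 and Q5 to be the origin densities in the lowest and highest density quintiles. The strata, weights and counts are made-up illustrative values, not the 2013 European Standard Population or study data.

```python
from typing import Dict


def standardised_incidence(
    cases: Dict[str, int],
    population: Dict[str, int],
    weights: Dict[str, float],
    per: int = 100_000,
) -> float:
    """Directly standardised incidence: weighted mean of stratum-specific rates."""
    total_weight = sum(weights.values())
    weighted = sum(weights[s] * cases[s] / population[s] for s in weights)
    return per * weighted / total_weight


def dispersion_ratio(density_q1: float, density_q5: float) -> float:
    """Q1 / Q5: 1 for an evenly spread origin, approaching 0 for a concentrated one."""
    return density_q1 / density_q5


# Hypothetical two-stratum example (real analyses would use age-sex strata).
cases = {"young": 10, "old": 90}
population = {"young": 10_000, "old": 30_000}
weights = {"young": 0.6, "old": 0.4}

rate = standardised_incidence(cases, population, weights)
ratio = dispersion_ratio(density_q1=0.5, density_q5=5.0)
print(round(rate, 1), round(ratio, 2))
```

In this toy example the stratum rates are 100 and 300 per 100,000, so the weighted rate is 0.6 × 100 + 0.4 × 300 = 180 per 100,000, and the dispersion ratio of 0.1 is of the same order as the values reported for the English origin groups.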

Results
Associations between 125 CCS disease categories and 74 different surname origins were studied and plotted during the screening round. The denominators for the analyses are shown in Table 1 and the numerators (case patient numbers) for each signal in Table 2. The case numbers varied from 6,665 teeth and jaw disorder patients with Branch-44 (Southern) surnames to 25,218 fracture patients with Branch-34 (Northern) surnames (Table 2). The Scottish origin group was more evenly dispersed, with a dispersion ratio (Q1/Q5) of 0.42 compared to 0.09-0.10 for the English origin groups (Table 1). The full results of the initial signal detection round can be found in the Supplementary Materials Tables S1 (examples with variable labels) and S2 (all signals). The hospitalisation odds ratios for the signals are shown in Fig. 2. The age- and sex-standardised incidence for each signal disease is shown broken down by origin density and area deprivation quintile in Fig. 3.

Discussion
The main objective of this research is to examine whether the origin of patients' surnames in 1881 acts as a plausible marker of population structure that is associated with health outcomes today. Given that most origin groups have limited statistical power, we decided to screen the results for signals where an entire branch of related origins had significantly higher or lower risk than the reference population of pan-English surname bearers.
We found four such signals. All the signals were associated with greater (dis-)advantage in the high-density regions compared to the low-density regions. For the signal with alcohol-related disorders and Branch-6 (Scottish) surnames, this was also the case for patients with any other surname origin. For the three other signals, the (dis-)advantage was only significant in the highest density quintile.
The analysis of area deprivation showed that, for alcohol-related disorders and teeth and jaw disorders, disadvantage relative to patients with all other surname origins increased with area deprivation. For fractures and upper gastrointestinal disorders, the relative (dis-)advantage changed little across deprivation quintiles.
The results for the three English signals suggest that it is possible to detect differential health outcomes linked to the ancestral "heartland" and that the effects persist even when considering variation across different levels of area deprivation.
The patients with Scottish surnames were more dispersed than the three English groups. The fact that we did not find a density effect could be because the heartland, in this case Scotland, was outside the study area.
From a genetic perspective, it could be hypothesised that the higher risk was caused by inbreeding depression or genetic drift.
The fact that the identified signals are not headline diseases with a big impact on patients' lives or healthcare budgets could arise precisely because they are marginal diseases under low selection pressure. Inbreeding depression is classically associated with consanguineous relationships and monogenic diseases, but nascent research suggests that it can also be associated with complex diseases and other complex traits even in population-based samples [16,17].
From a social science perspective, an alternative explanation could be that the surname classification is likely to capture a wide range of variables associated with "nurture", which may again differ between heartlands and the "host" regions. The area deprivation index used in this study is admittedly a very reductionist measure conflating diverse conditions and experiences that can lead to poorer health outcomes. Further research should therefore include more detailed data on individual socio-economic factors.
As posited by dual inheritance theory, genetic and cultural roots are intertwined [18]. While genetic inheritance can only be vertical, cultural inheritance can be both horizontal and vertical. Cultural practices can thus be passed down vertically from forebears or quickly adopted from peers horizontally. Wealth accumulated in families can be an example of vertical, cultural transmission. Which set of factors, vertical or horizontal, genetic or cultural, is dominant will depend on the particular aetiology of the health condition in question.
A motivation for further studies would either be to contextualise surname origins as a new geography or to study the fine-scale population structure in case it has implications for the design of genetic studies [19]. The former approach should refine the regional geography used here in order to accommodate the likely effects of migration of family groups prior to the collection of censuses that have been digitally encoded.
A number of limitations should be acknowledged. Hospital Episode Statistics (HES) was created for administrative and billing purposes, but the data have been validated for research [20,21]. It cannot be ruled out that there could be sub-national coding differences, especially when studying the entire range of diagnoses as in this case. The purpose of a medical classification such as CCS is to break down the analyses into meaningful categories. As with any categorisation, important variation may be lost, and the results may vary depending on the system deployed.
The base population was created from HES itself, i.e., as patients with any diagnosis by surname origin. In this way it was possible to study individual surname origin groups against the dominant group with ubiquitous English surnames. The results may be biased if the patient populations are not generalisable to the populations with the same surname origin and we were not able to validate this aspect of the analyses due to the lack of external reference population data.
To avoid distorting representations of health conditions associated with multiple admissions, we only used the first admission for each condition and for each patient in the base population. With data collection spanning fifteen years, there could be a bias towards exposures in patients' younger years. We assumed that any potential biases from this source would cancel each other out, although there could be residual net bias if the age of onset varied markedly between the surname origin and the reference population. In addition, HES does not contain identifiers for households, which for our analyses may mean that there could be residual clustering at household level, e.g., poisoning incidents involving multiple household members with identical surname origin.
A patient may change surname, e.g., following marriage, and end up being coded to a different surname origin than the one at birth. This would make the dataset noisier, but we reduced this by only using the maiden or first-recorded surname for each patient. In addition, local intermarriage is still common and thus it can be expected that the newly adopted surname may belong to the same or a closely related surname origin [22].
Retirement migration may mean that exposures causing some health problems are systematically attributed to retirement regions. This is a limitation, although in this study it would only apply if the retiree moves into a different area deprivation quintile since all other variables would remain the same.
We studied 94 different disease categories across 74 surname origins. Especially for the first round, this constitutes a multiple comparison problem. We decided to focus on signals where groups of surname origins differed from the reference population on the basis that "lineage" would provide a higher degree of plausibility and crucially reduce the number of simultaneous analyses.
In the Supplementary Materials, we have provided the results both with and without Bonferroni corrections for 94 simultaneous comparisons.
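For orientation, a Bonferroni correction for 94 simultaneous comparisons can be sketched as below. The nominal significance level of 0.05 is an assumption, since the text does not state it, and the p-values are hypothetical.

```python
from typing import List


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


def significant_after_bonferroni(
    p_values: List[float], alpha: float, n_tests: int
) -> List[bool]:
    """Mark each p-value as significant only if it is below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return [p < threshold for p in p_values]


N_CATEGORIES = 94  # disease categories with sufficient data, as stated in the text
ALPHA = 0.05       # assumed nominal level, not taken from the source

threshold = bonferroni_threshold(ALPHA, N_CATEGORIES)
flags = significant_after_bonferroni([0.0001, 0.001, 0.04], ALPHA, N_CATEGORIES)
print(threshold, flags)
```

Under these assumptions the per-test threshold is 0.05 / 94, roughly 0.00053, so a result with p = 0.001 would count as significant without the correction but not with it.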

Conclusion
The study shows that surname origins are associated with diverse health outcomes and may thus act as combined markers of biosocial population structure over and above area deprivation. If related surname origins were studied, assuming either a genetic or a cultural lineage effect, it was again possible to find correlates with certain health outcomes in the areas with the highest density of the surname origin aggregate. The fact that correlates were only present in the highest density quintile suggests that ancestral heartlands vary in some aspect from the rest of the country. Hypothetically, this pattern could be explained by a combination of factors related to both nature and nurture, in part depending on the nature of the health outcome itself. The results may thus inform more detailed investigations untangling biological and social factors in health.

REC: Research Ethics Committee

Declarations

Ethics approval and consent to participate
Ethical approval was obtained from Bromley REC (Reference: 13/LO/1355). The study was conducted in accordance with relevant guidelines and regulations. As for other non-identifiable database studies, it was neither practical nor desirable to obtain consent from each patient. The study only used routinely collected, secondary data and as such involved no experimental components requiring additional protocols and approvals. The HES data licence reference is DARS-NIC-28051-Q3K7L.

Consent for Publication
Not applicable.

Availability of data and material
All the data that support the findings of this study are available from NHS Digital subject to ethical and scientific approval of a study protocol. Researchers wishing to get access to the study data can visit this website: https://digital.nhs.uk/data-and-information/data-tools-and-services/data-services/hospital-episode-statistics. The authors of this study had no special access privileges others would not have.

Competing interests
None

Funding
The UK Economic and Social Research Council is acknowledged for its support for the UCL Consumer Data Research Centre (CDRC) enabling this research (Grant ES/L011840/1). The funder had no direct role in relation to the specific study.

Authors' contributions
All authors contributed to conception and design, critically revised the manuscript, gave final approval, and agreed to be accountable for all aspects of work ensuring integrity and accuracy (JK, PAL, JP). JK and PAL contributed to data acquisition. JP contributed to analysis and the first draft of the manuscript.

Figure 1
Dendrogram of the British isonymy regions (surname origins). The map shows regions at the top of the hierarchy (k=19). Regions with identified signals are highlighted.

Figure 2
Hospital admission risk by surname origin for four disease-origin branch signals. The x-axis has the order of each surname origin in the dendrogram shown in Fig. 1 putting each "branch" next to its distal sub-branch.