Enhancing understandings of social determinants of health in China: Linkage and analysis of a national multilevel population health surveillance with routinely collected mortality records for 98 058 people in 31 provinces of mainland China

Background: The aim of this study was to enhance capability in research on social determinants of health in China by linking and analyzing routinely collected death records over 5 years with national population health surveillance. Methods: Linkage of 98 058 participants in the 2010 China Chronic Disease and Risk Factor Surveillance (CCDRFS) to records in the national death surveillance data from 2011 to 2015 was conducted through a matching program involving identification numbers, name, gender and residential address, followed by a structured checking process. Multilevel regressions were used to investigate five-year odds of all-cause, non-communicable disease (NCD), infectious disease and injury mortality in relation to person- and county-level factors. Results: A total of 3 365 deaths were observed in the linked mortality and population health surveillance data. Cross-checks and comparisons with national mortality distributions provided assurance that the linkage was reasonable. Geographic variation in mortality was observed via age and gender adjusted median odds ratios for all-cause mortality (>1.30), infectious disease (>2.01), NCD (>1.24) and injury (>1.12). The odds of all-cause mortality and of all three cause-specific mortality outcomes were higher with age and among men. Low educational attainment was a predictor of all-cause, NCD and injury mortality. Longer mean years of education at the county level was associated only with lower injury mortality. Divorced people had higher odds of all-cause and NCD mortality than single people. Rurality was a predictor of all-cause and NCD mortality. Conclusion: The results of this study provide utility for future investigations of social determinants of health and mortality using linked data in China.

Background
Non-communicable diseases (NCDs) have become the leading causes of death in China, brought about by decades of economic development, rapid urbanization and improvements in health care, living conditions and nutrition [1,2]. Many studies around the world have evaluated the impact of risk factors on NCDs, such as unhealthy lifestyles [3-5], socioeconomic status [6-9], family history [10] and abnormal biological indicators [11-13]. In recent decades, there have also been studies on determinants of NCDs in China, such as the China Kadoorie Biobank (CKB) study [14], the Jinchang cohort on cancer [15,16], and a 15-year, nationally representative prospective study [17,18]. These studies have found meaningful associations between risk factors and NCDs, although they are not without limitations.
For example, some studies were implemented in specific regions and were not nationally representative [14]. Some studies focused on particular risk factors, certain groups of people or isolated diseases, providing a limited scope of results [15,16]. Perhaps most importantly, given the abovementioned structural changes sweeping China, there has been insufficient focus on social determinants of health. There is a need for more general, nationally representative and multilevel population health data that enable researchers to investigate social and spatial determinants of NCDs and related mortality in China.
Record linkage based on existing data has proven to be a cost-effective means of integrating information from different sources [19]. In recent decades, it has been applied extensively to generate databases for epidemiological studies in higher income countries, such as the United States of America [20,21], Australia [22,23], Canada [24] and the United Kingdom [25], but it is much less common in China.
Obstacles include a lack of attention paid to record linkage, missing identification numbers, variation in the permanent or residence address coding, imprecision in the reporting information as well as some other issues related to data quality.
In this study, we link records from 98 058 participants in the 2010 Chronic Disease and Risk Factor Surveillance to the routinely-collected national mortality surveillance system in China from 2011 to 2015 inclusive. We also tested different matching strategies used to link records from both of the databases based on conventional personal identifiers (e.g., name, age, identification number, address).
The linkage was evaluated by comparing the proportion of different causes of death based on the matched records and with data from the national mortality surveillance system. To test the utility of the data and to provide a demonstration of its potential, multilevel logistic regressions were used to investigate associations between the main cause of death categories and social determinants.

Data source
Two separate, nationally representative databases were used in this study. [...] where data are consolidated. Therefore, the national mortality surveillance system can provide data on total mortality, the broad cause-of-death distribution and the geographic distribution of deaths.
From 2013, the identification number became compulsory information in each death record. From 2015, the accuracy of the identification number was checked against the corresponding information held by the police department. Nevertheless, some death records lack an identification number or contain an inaccurate one.

Record linkage
Two matching steps were adopted to link the records from the two data sources. Before matching, identifiers were collected from both data sources, including identification number, full name, gender, date of birth, permanent and residential addresses, and the corresponding address codes. However, the identification number and detailed address information were only partially available in each of the two data sources.
In the first step, a computer program comprising several strategies was written to match all possible records. The first program matched records in the population health surveillance with death records through the identification number, a unique 18-digit identifier. Records without an identification number, or whose identification number could not be matched, were passed to a second program comprising four linkage strategies that were applied one by one. Once a record was matched by a strategy, it was not passed to the remaining strategies. The four linkage strategies were: i) name, gender and permanent address code; ii) name, gender and residence address code; iii) name spelling, gender and permanent address code; iv) name spelling, gender and residence address code.
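The sequential strategies described above can be sketched in Python as a deterministic linkage cascade. This is an illustration only, not the authors' SAS program: the `Record` fields, the strategy labels and the `link_records` function are hypothetical names, and the sketch assumes that a key with any missing component is never used for matching. Each survey record keeps the label of the first strategy that found candidates, together with all candidate death records for later checking.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical layout for a survey or death record."""
    record_id: str
    gender: str
    name: str            # full name in Chinese characters
    name_spelling: str   # romanized spelling of the name
    id_number: Optional[str] = None
    permanent_code: Optional[str] = None
    residence_code: Optional[str] = None


Key = Tuple[Optional[str], ...]

# Identification number first, then the four fallback strategies i) to iv).
STRATEGIES: List[Tuple[str, Callable[[Record], Key]]] = [
    ("id_number", lambda r: (r.id_number,)),
    ("name+gender+permanent", lambda r: (r.name, r.gender, r.permanent_code)),
    ("name+gender+residence", lambda r: (r.name, r.gender, r.residence_code)),
    ("spelling+gender+permanent",
     lambda r: (r.name_spelling, r.gender, r.permanent_code)),
    ("spelling+gender+residence",
     lambda r: (r.name_spelling, r.gender, r.residence_code)),
]


def _usable(key: Key) -> bool:
    """A key is usable only if every component is present (assumption)."""
    return all(part for part in key)


def link_records(
    survey: Sequence[Record], deaths: Sequence[Record]
) -> Dict[str, Tuple[str, List[str]]]:
    """Map survey record_id -> (strategy label, candidate death record_ids)."""
    matches: Dict[str, Tuple[str, List[str]]] = {}
    for label, key_of in STRATEGIES:
        # Index the death records by the key used in this strategy.
        index: Dict[Key, List[str]] = {}
        for death in deaths:
            key = key_of(death)
            if _usable(key):
                index.setdefault(key, []).append(death.record_id)
        for person in survey:
            if person.record_id in matches:
                continue  # already matched by an earlier strategy
            key = key_of(person)
            if _usable(key) and key in index:
                matches[person.record_id] = (label, index[key])
    return matches
```

Returning every candidate rather than a single match keeps ambiguous links visible, so they can be resolved in the subsequent checking step.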
In the second step, a program check and a manual check were used to confirm the matched records. First, the name and identification number were combined in the program check to confirm whether the records matched through the identification number were correct. Second, all records matched through the other four strategies were checked manually using one of two methods. If the detailed village address was available, the linked records were confirmed using name or name spelling, gender and permanent or residence village address. If the village address was not available, the linked records were confirmed using name or name spelling, gender, permanent or residence town address and an age difference of less than 5 years. Finally, the database of confirmed matches was obtained from the program check and the manual check. The work flow of the whole matching process can be seen in Figure 1.
(Figure 1. The work flow of the matching process)
To guarantee the quality of matching, both steps were performed independently by two statisticians. All of the programs involved in the program match and program check were run in SAS v9.4 (SAS Institute Inc., Cary, USA).
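The two confirmation rules used for records matched without an identification number can be written as a small predicate. The sketch below is in Python rather than SAS and rests on several assumptions that the text does not settle: the dictionary keys are hypothetical, an address field counts as agreeing only when the same field is non-empty and equal in both records, and the ages of the two records are assumed to be on a comparable basis.

```python
from typing import Any, Iterable, Mapping

NAME_FIELDS = ("name", "name_spelling")
VILLAGE_FIELDS = ("permanent_village", "residence_village")
TOWN_FIELDS = ("permanent_town", "residence_town")


def _has_any(record: Mapping[str, Any], fields: Iterable[str]) -> bool:
    """True if the record carries a non-empty value in any listed field."""
    return any(record.get(f) for f in fields)


def _shared(a: Mapping[str, Any], b: Mapping[str, Any],
            fields: Iterable[str]) -> bool:
    """True if any listed field is non-empty and identical in both records."""
    return any(a.get(f) and a.get(f) == b.get(f) for f in fields)


def confirm_match(survey: Mapping[str, Any], death: Mapping[str, Any],
                  max_age_gap: int = 5) -> bool:
    """Apply the village-level rule if possible, else the town-level rule."""
    # Both rules require the same gender and the same name or name spelling.
    if not survey.get("gender") or survey.get("gender") != death.get("gender"):
        return False
    if not _shared(survey, death, NAME_FIELDS):
        return False

    # Rule 1: village address available in both records.
    if _has_any(survey, VILLAGE_FIELDS) and _has_any(death, VILLAGE_FIELDS):
        return _shared(survey, death, VILLAGE_FIELDS)

    # Rule 2: fall back to town address plus an age gap of less than 5 years.
    age_s, age_d = survey.get("age"), death.get("age")
    if age_s is None or age_d is None:
        return False
    return _shared(survey, death, TOWN_FIELDS) and abs(age_s - age_d) < max_age_gap
```

In the study the non-identifier matches were checked manually; a predicate like this could at most pre-sort candidate pairs for the reviewers.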

Statistical analysis
Two statistical analyses were performed. The first was to evaluate the matching effect and to assess the degree of completeness in the linked records. This included comparing mortality and the proportions of different causes of death in the linked database with those from the national mortality surveillance system using the chi-square test.
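As an illustration of the comparison described above, a Pearson chi-square test on a 2x2 table can be computed with the Python standard library alone. The text does not state how the contingency table was laid out, so the layout here (for example deaths versus survivors, cohort versus national surveillance) is an assumption, the function names are hypothetical, and no continuity correction is applied. For one degree of freedom the p-value equals erfc(sqrt(X²/2)).

```python
import math
from typing import Tuple


def chi2_p_value_df1(statistic: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value); no continuity correction is applied.
    """
    observed = ((a, b), (c, d))
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    n = a + b + c + d
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column must contain observations")

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic, chi2_p_value_df1(statistic)
```

A call such as `chi_square_2x2(deaths_cohort, alive_cohort, deaths_national, alive_national)` would use counts taken from the two data sources; those variable names are placeholders, not values reported in the paper.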
The second statistical analysis was to estimate, and attempt to explain, geographical variation in the odds of death over a discrete 5-year period using multilevel logistic regression [31]. A five-level model was fitted, with the individual as level 1, village as level 2, town as level 3, county as level 4, and province as level 5. An initial model included fixed effects for gender and age group (three categories). We then adjusted this model sequentially for educational attainment, mean years of education, marital status and place of residence. Dummy variables for age, educational attainment and marital status were added as fixed effects. The corresponding subcategories are shown in Table 3. All fixed effect parameters were expressed as odds ratios (OR) and 95% confidence intervals (95%CI). Geographic variation was expressed through [...] year in this cohort, as expected due to ageing (Table 1).
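The quantities reported from such models can be derived from the fitted parameters with a few lines of Python. The sketch below assumes the commonly used definition of the median odds ratio (MOR) for a random-intercept logistic model, MOR = exp(sqrt(2 × variance) × z), where z is the 75th percentile of the standard normal distribution; the text does not spell out the formula, so this is an assumption, and the function names are hypothetical. Odds ratios and their confidence intervals are obtained by exponentiating a log-odds coefficient and its Wald limits.

```python
import math
from statistics import NormalDist
from typing import Tuple

# 75th percentile of the standard normal distribution (about 0.6745).
_Z75 = NormalDist().inv_cdf(0.75)


def median_odds_ratio(variance: float) -> float:
    """MOR implied by an area-level random-intercept variance (logit scale)."""
    if variance < 0:
        raise ValueError("a variance cannot be negative")
    return math.exp(math.sqrt(2.0 * variance) * _Z75)


def variance_from_mor(mor: float) -> float:
    """Invert the MOR formula to recover the random-intercept variance."""
    if mor < 1:
        raise ValueError("the MOR is at least 1 by construction")
    return (math.log(mor) / _Z75) ** 2 / 2.0


def odds_ratio_ci(beta: float, se: float,
                  level: float = 0.95) -> Tuple[float, float, float]:
    """Odds ratio with a Wald confidence interval from a log-odds coefficient."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)
```

An MOR of 1 corresponds to no variation between areas at that level; larger values indicate greater differences in the odds of death between otherwise comparable people living in different areas.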

Matching results
All-cause mortality based on linked participants in the cohort was 6.15‰, 6.62‰, 6.81‰, 7.30‰ and 7.57‰ in 2011, 2012, 2013, 2014 and 2015, respectively. These mortality rates were close to the corresponding data from the national death surveillance system, at 7.04‰, 7.26‰, 7.87‰, 7.93‰ and 7.97‰, respectively (Table 2). Based on a chi-square test, there was no statistically significant difference in mortality measured in the cohort as compared with the national death surveillance (χ²=0.004, P=0.947).
(See Table 2 and Table 3)
[...] There was no geographic variation observed in injury mortality between provinces, counties or districts, or villages, but there was evidence of variation between towns (MOR=1.499).

Multilevel analyses
Odds ratios (ORs) from the multilevel logistic regression indicate that all-cause mortality was significantly lower among females (OR=0.745, 95%CI 0.694, 0.799). Compared with people aged 18 to 44 years, the odds of all-cause mortality were higher in people aged 45 to 59 years (OR=3.300, 95%CI 2.917, 3.734) and in people aged 60 years and older (OR=14.296, 95%CI 12.735, 16.049). After adding educational attainment, mean years of education, marital status and place of residence to the fixed part of the models, 23.42% of the province-level variation, 36.60% of the variation between counties or districts, 10.71% of the variation between towns and 7.50% of the variation between villages was explained.
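The percentages of variation explained reported above are the kind of quantity usually computed as a proportional change in variance (PCV) between the initial and the adjusted model. The paper does not name the measure, so treating it as the PCV is an assumption, and the function name and any variances passed to it are hypothetical.

```python
def proportional_change_in_variance(var_initial: float,
                                    var_adjusted: float) -> float:
    """Percentage of the initial area-level variance explained after adjustment.

    var_initial:  random-intercept variance in the age and gender adjusted model
    var_adjusted: variance at the same level after adding further covariates
    """
    if var_initial <= 0:
        raise ValueError("the initial variance must be positive")
    return 100.0 * (var_initial - var_adjusted) / var_initial
```

For example, a variance that falls from 0.20 to 0.15 after adjustment corresponds to a PCV of 25%. The value can be negative when the area-level variance increases after covariates are added.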
All-cause mortality was lower in participants with primary education or above compared to those with no educational qualifications. However, there were no significant differences in all-cause mortality by mean years of education. Compared to single people, the odds of all-cause mortality were higher in divorced people (OR=1.462, 95%CI 1.158, 1.846). People living in rural areas had higher all-cause mortality than people living in urban areas (OR=1.461, 95%CI 1.196, 1.642).
(See Table 3)
Both all-cause and NCD mortality were significantly higher among men, older people, people with lower educational attainment, divorced people and people living in rural areas. Infectious disease mortality was associated only with gender and age, with men and older people having higher mortality. Injury mortality was higher among men, older people, people with lower educational attainment and people in counties with fewer mean years of education. Full details are available in additional file 1.

Discussion
In recent years, the Chinese government has paid significant attention to the control and prevention of NCDs and allocated resources for cohort study data collections to help enhance understandings of the scale of the challenges at hand [33]. Linkage of survey data to other forms of data, such as routinely collected mortality records, as was done in this study with the 2010 Chronic Disease and Risk Factor Surveillance, can be a low-cost means of strengthening the capacity for evidence-based decision-making.
In our study the results from the linked cohort and mortality data were consistent with the wider routinely collected mortality surveillance. This provided some assurance towards its reliability for future use [34], even while accounting for the multiple approaches used to perform the data linkage in the absence of an identification number for every person. Matching by a combination of gender, address and age helped in the absence of an identification number.
Convergence in mortality between people in the cohort and the routine mortality surveillance was observed in the years after baseline. The initial difference may be attributable to subject bias, wherein healthier people tend to participate in face-to-face surveys [35]. This convergence can be confirmed in subsequent studies based on the two databases.
Further assurance of the data quality was provided by the results of the multilevel logistic regressions. These models revealed geographic variation and associations in the main cause of death with several social determinants. These results are in agreement with previous reports [36], confirming the utility of the linkage database.
The multilevel logistic regression model afforded insights into geographic variations across multiple spatial scales. These models further demonstrate the importance of multilevel modelling of health data in large countries with varied populations, socioeconomics and topography, such as China. The results showed that geographic variation in all-cause and NCD mortality differed across area levels. These results are aligned with findings from previous studies [37-39].
Enhancing understandings of social determinants and inequities in all-cause and NCD-related mortality in China is critical to give a more complete, in-depth picture of the public health challenges that decision-makers face, while also providing data that can be used to evaluate specific policies and interventions.
To our knowledge, this is the first study to link the national chronic disease and risk factor surveillance with routinely collected cause-specific mortality data in China.
These linked data will provide opportunities for researchers and policy makers. There are, however, some limitations to acknowledge. First, matching on the detailed Chinese characters took a long time to implement and confirm. This work could potentially be conducted via machine learning to improve efficiency [40]. Second, the linked social data were cross-sectional, so common changes in social determinants over time in positive (e.g. educational attainment) and negative (e.g. job loss) directions could not be investigated. Therefore, collection of longitudinal data and linkage to mortality records is needed in future work to build a more comprehensive understanding of the social determinants of mortality in China.

Conclusions
In this study, we linked 98 058 participants in the 2010 Chronic Disease and Risk Factor Surveillance to records in the national death surveillance data from 2011 to 2015. A program match, a program check and a manual check were used to link the records from the two data sources. Cross-checks and comparisons with national mortality distributions provided assurance that the linkage was reasonable.
Multilevel logistic regression models revealed geographic variation and associations in the main cause of death with several social determinants. The results of this study provide utility for future investigations of social determinants of health and mortality using linked data in China.

Ethics approval and consent to participate
The ethical review committee of the Chinese Center for Disease Control and Prevention approved the 2010 CCDRFS, and written informed consent was obtained from each participant before data collection.
The records in the national death surveillance data from 2011 to 2015 were obtained from the national mortality surveillance system. The information on individual deaths in all population catchment areas has been reported to the national mortality surveillance system according to the national guidelines since 1992.

Consent for publication
Not applicable.

Availability of data and materials
The datasets used and/or analyzed during the study will be made available by the corresponding author following a reasonable request.

Conflict of Interest
The authors declare that they have no conflict of interest.

Figure 1. The work flow of the matching process

Supplementary Files
This is a list of supplementary files associated with the primary manuscript.
additional file 1.pdf