The use of surnames to impute missing ethnicity data in the South African National Cancer Registry database

doi:10.21203/rs.3.rs-2033699/v1

Method Article

This work is licensed under a CC BY 4.0 License

Version 1

Abstract

The National Cancer Registry (NCR) of South Africa (SA) calculates cancer incidence rates based on full pathology reports from South African private and public health care laboratories and presents the cancer incidence data by ethnic group. The sensitivity surrounding the collection of ethnicity data by reporting sources in post-apartheid South Africa has resulted in large proportions of cancer cases being reported without population group/ethnicity information. The absence of ethnicity data is a significant challenge to cancer incidence reporting.

An imputation method was developed to impute the missing ethnicities by using surnames with known patient-reported ethnicities. A hold-out test done by masking the ethnicities of 50% (n = 332232) of the NCR dataset with known ethnicities, from 1986 to 2014, was used to evaluate this imputation method. The masked ethnicities were imputed and then compared to the patient-reported ethnicities. 94.31% of ethnicities were correctly classified using this imputation method. Sensitivities and specificities were calculated per ethnicity group (Asian, Black, Coloured, White). The imputation method performed well for the Asian, Black and White ethnic groups, but performed poorly for the Coloured ethnic group.

The strong relationship between surnames and ethnic groups, as evidenced by the results, mitigates the significant concern of whether surname itself is predictive of ethnicity. Despite the increasing proportion of missing data over the years, the percentage of correctly classified individuals remains high across the test dataset. This study demonstrates the strength of the imputation methodology. However, given the large disparities between the private and public healthcare sectors in SA, all cancer cases should be reported with complete information, from all sources, so that cancer incidence can be reported accurately without the need to impute missing data. There are still challenges around collecting sensitive data such as ethnicity in SA that warrant further discussion.

Epidemiology

Statistical Epidemiology

cancer registration

multiple imputation

ethnicity

Introduction

The National Cancer Registry (NCR) of South Africa (SA) is a pathology-based cancer surveillance registry that utilizes laboratory-confirmed diagnoses of cancer (cytological, histological and haematological) amongst SA residents to calculate national cancer incidence rates and perform national cancer surveillance [1]. Patient demographic information together with cancer topography and morphology data are extracted from laboratory reports. This data is used to describe Age Standardised Incidence Rates (ASIR) per cancer, by gender and ethnicity for a given year.

The institutionalised racial segregation and discrimination that was established during the apartheid era in South Africa still affects present-day society. As such, ethnicity is a sensitive topic, and citizens may choose not to declare their ethnicity in official documentation in the post-apartheid era. This applies to medical records as well. Hence, there has been an increasing trend of missing ethnicity data in the cancer laboratory reports since 1994 (Fig. 1). It has therefore become increasingly difficult to produce accurate cancer incidence data representative of the demographics of SA, given the increasingly large proportions of missing ethnicity data post-1994.

Missing data is a common problem in the field of epidemiology, particularly in surveillance systems, where routinely collected data may be used for study purposes. Often complete case analysis is performed using available data. However, not only does this method lead to selection bias in the analysis, as the missingness pattern may not be completely random, but the cumulative effects of missing data and the exclusion of incomplete records would also significantly reduce the power of the analysis [4]. Therefore, an imputation method was needed to allocate the missing ethnicity in the SA cancer surveillance data.

SA is a culturally diverse and multilingual country with 11 official spoken and written languages [5]. Despite efforts by the SA government to reduce health inequities post-democracy, large disparities still exist in patient access to healthcare services. Approximately 16%-17% of the South African population have medical insurance and access to private healthcare services [2, 3], and another 9% of the population funds private healthcare services out-of-pocket [3]. The vast majority of the South African population depends on over-burdened public services to access healthcare. The NCR collects its data from both private and public healthcare laboratories in South Africa.

Given SA’s unique political history and relatively young democracy, surnames are still highly correlated with ethnicity. This presents a unique opportunity to use surnames to impute the ethnicity of patients where such data is not available. This paper describes an imputation method developed at the NCR and evaluates its effectiveness for imputing missing ethnicity data in the NCR cancer surveillance data using known South African surname-ethnicity pairings.

Methodology

The South African NCR dataset from 1986 to 2014 was available for this study. This dataset consists of patients’ demographic information (surnames, first names, gender, date of birth) and cancer diagnostic information (cancer topography, morphology, date of diagnosis) extracted from laboratory reports. The full methodology of the NCR has been previously described [1]. In its reports, the NCR uses the ethnicity classification of Asian, Black, Coloured, and White as defined by the official statistical bureau of South Africa, Statistics South Africa (Stats SA) [5, 6]. According to Stats SA, in 2015, Blacks, Whites, Coloured (mixed ancestry) and Asians/Indians comprised 80.5%, 8.3%, 8.8% and 2.5% of the SA population respectively [5]. This study is covered under the ethical clearance waiver obtained by the NCR for performing routine cancer surveillance.

Study sample:

A total of 664607 records from the NCR dataset where patient-reported ethnicities were available were used for this study. The ethnicity for the full sample dataset consisted of 47.27% Black, 43.26% White, 6.52% Coloured and 2.95% Asian.

Test-dataset:

Random masking of the patient-reported ethnicity groups for ~ 50% of the full sample dataset (n = 332232) was done to create the test dataset. The ethnicity constitution of the test dataset was representative of the full dataset.
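The masking step can be sketched as follows. This is a minimal illustration in Python rather than the Stata code used in the study; the function name `mask_ethnicity`, the record layout and the fixed seed are assumptions made for the example. The source does not state how the reference panel relates to the unmasked half, so the sketch only hides labels and sets the true values aside for later comparison.

```python
import random


def mask_ethnicity(records, fraction=0.5, seed=0):
    """Randomly hide the ethnicity of a fraction of records (hold-out test).

    `records` is a list of dicts with "surname" and "ethnicity" keys
    (a hypothetical layout). Returns (kept, masked): `kept` records are
    unchanged copies; `masked` records have ethnicity set to None and the
    patient-reported value stored under "truth" for later unmasking.
    """
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(len(records)))
    rng.shuffle(indices)
    n_masked = round(len(records) * fraction)
    masked_idx = set(indices[:n_masked])

    kept, masked = [], []
    for i, rec in enumerate(records):
        if i in masked_idx:
            masked.append({"surname": rec["surname"],
                           "ethnicity": None,
                           "truth": rec["ethnicity"]})
        else:
            kept.append(dict(rec))
    return kept, masked
```

Because the selection is uniformly random, the ethnicity composition of the masked half should be close to that of the full sample, which is what the test dataset figures suggest.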

Imputation reference panel:

A surname list with known, patient-reported ethnicity was obtained from the National Cancer Registry data repository and used to construct the imputation reference panel.

Imputation methodology:

The imputation methodology was developed in Stata MP 15.1 (StataCorp, USA). The first step uses known surname-ethnicity pairs from the imputation reference panel to create a unique surname-ethnicity lookup table, in which the frequency of occurrence of each ethnicity for a given surname is calculated. The most prevalent ethnicity for a given surname is then chosen as the imputed ethnicity. The strength of imputation is calculated as the number of occurrences of the surname with the imputed ethnicity divided by the total number of occurrences of that surname, expressed as a percentage. Surnames with low imputation strength or tied percentages were clerically reviewed before analysis.
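The lookup-table step can be illustrated with a short Python sketch. It is not the NCR's Stata implementation: the function names, the upper-casing of surnames and the placeholder surnames are assumptions for the example, and ties are only flagged here (the study resolved them by clerical review).

```python
from collections import Counter, defaultdict


def build_lookup(pairs):
    """Build a surname -> imputed-ethnicity table from (surname, ethnicity) pairs.

    For each surname, the most frequent ethnicity is chosen, and the
    imputation strength is its share of all occurrences of that surname,
    as a percentage. Ties are flagged so they can be reviewed manually.
    """
    counts = defaultdict(Counter)
    for surname, ethnicity in pairs:
        # Normalising case and whitespace is an assumption of this sketch.
        counts[surname.strip().upper()][ethnicity] += 1

    lookup = {}
    for surname, counter in counts.items():
        ranked = counter.most_common()
        top_ethnicity, top_count = ranked[0]
        total = sum(counter.values())
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        lookup[surname] = {
            "ethnicity": top_ethnicity,
            "strength": 100.0 * top_count / total,
            "tied": tied,
        }
    return lookup


def impute(surname, lookup):
    """Return the imputed ethnicity, or None if the surname is not in the table."""
    entry = lookup.get(surname.strip().upper())
    return None if entry is None else entry["ethnicity"]


# Placeholder surnames, not real data.
panel = [("Alpha", "Black")] * 3 + [("Alpha", "White")] \
      + [("Beta", "Asian"), ("Beta", "Coloured")]
table = build_lookup(panel)
```

In this toy panel, `ALPHA` is imputed as Black with a strength of 75%, `BETA` is flagged as tied at 50%, and a surname absent from the panel returns `None`, which corresponds to the records that could not be imputed.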

Statistical analysis:

After the imputation using the reference panel, the test dataset's patient-reported ethnicities were unmasked and statistically compared to the imputed ethnicities. Mean imputation strength was calculated. Cohen’s kappa statistic for agreement between the patient-reported and imputed ethnicity variables was calculated using the kappa command in Stata MP 15.1 (StataCorp, USA). The sensitivity and specificity of the imputation method were calculated using the roctab command in Stata MP 15.1 (StataCorp, USA).
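For readers without Stata, the comparison statistics can be sketched in Python. The functions below are illustrative only and are not the Stata routines used in the study; sensitivity and specificity are computed one group against all others, which is an assumption about how the per-group figures were derived.

```python
from collections import Counter


def cohen_kappa(truth, imputed):
    """Return (observed agreement, Cohen's kappa) for two label lists."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, imputed)) / n
    truth_counts, imputed_counts = Counter(truth), Counter(imputed)
    # Agreement expected by chance from the two sets of marginal totals.
    expected = sum(truth_counts[k] * imputed_counts[k] for k in truth_counts) / (n * n)
    return observed, (observed - expected) / (1 - expected)


def sensitivity_specificity(truth, imputed, group):
    """One-vs-rest sensitivity and specificity for a single group.

    Assumes the group and at least one other group both occur in `truth`.
    """
    tp = sum(t == group and p == group for t, p in zip(truth, imputed))
    fn = sum(t == group and p != group for t, p in zip(truth, imputed))
    tn = sum(t != group and p != group for t, p in zip(truth, imputed))
    fp = sum(t != group and p == group for t, p in zip(truth, imputed))
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels, not study data.
truth = ["A", "A", "B", "B"]
imputed = ["A", "B", "B", "B"]
observed, kappa = cohen_kappa(truth, imputed)
sens_a, spec_a = sensitivity_specificity(truth, imputed, "A")
```

With these toy labels the observed agreement is 0.75 and kappa is 0.5; group A has a sensitivity of 0.5 and a specificity of 1.0.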

Results

The percentage of missing ethnicity data in cancer laboratory reports increased significantly from 1998 onwards (Fig. 1). When the missing ethnicity data was analysed further, a large disparity between sectors emerged: in the 1998–2008 period, 97% of cases from privately funded laboratories, which primarily service the private healthcare sector, were missing ethnicity, compared to 59% of cases from state-funded laboratories, which mainly service the public healthcare sector (Table 1).

Table 1

Percentage of ethnicity data not supplied by healthcare sectors.
Year period:	Public healthcare sector:	Private healthcare sector:
1986–1992	1%	1%
1993–1997	7%	9%
1998–2008	59%	97%

The test dataset consisted of 47.37% Black, 43.19% White, 6.55% Coloured and 2.89% Asian records. Of these records, 30.38% were private healthcare sector data and 69.62% were public healthcare sector data. In the private healthcare data, 90.45% were White, 6.78% Black, 1.86% Asian and 0.90% Coloured. In the public healthcare data, 65.23% were Black, 22.22% White, 9.15% Coloured and 3.39% Asian.

The imputation reference panel consisted of 406642 unique surname-ethnicity pairs that were 66.63% Black, 25.05% White, 4.56% Asian and 3.75% Coloured.

Imputation:

The ethnicity for the test dataset was imputed using the imputation reference panel, and 99.52% (n = 330627) of the test dataset was successfully imputed. The remaining 1605 records (0.48%) could not be imputed because their surnames did not appear in the imputation reference panel. The mean imputation strength was 91.48% (± 12.17%). Only 1% of the imputed records had an imputation strength below 50%. A Cohen’s kappa statistic (k) of 0.9031 (P < 0.00001) was achieved, with an observed agreement of 94.35% between the patient-reported ethnicities and the imputed ethnicities.
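The relationship between the reported agreement and kappa can be checked with a little arithmetic. Kappa is defined as (observed − expected) / (1 − expected), so the two reported figures imply a chance-expected agreement. The sketch below rearranges that formula and compares the result with a rough estimate from the test dataset composition; that estimate assumes the imputed groups had the same proportions as the patient-reported ones, which the source does not state.

```python
def implied_chance_agreement(observed, kappa):
    """Rearrange kappa = (po - pe) / (1 - pe) to recover pe."""
    return (observed - kappa) / (1 - kappa)


# Figures reported in the text above.
p_expected = implied_chance_agreement(observed=0.9435, kappa=0.9031)

# Rough cross-check: if imputed and reported marginals were identical
# (an assumption), chance agreement is the sum of squared group shares.
shares = [0.4737, 0.4319, 0.0655, 0.0289]  # Black, White, Coloured, Asian
p_expected_from_shares = sum(s * s for s in shares)
```

Both values come out at roughly 0.42, so the reported kappa and observed agreement are consistent with one another under that assumption.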

Table 2 lists the sensitivity and specificity of the imputation by ethnicity groups. High sensitivities (> 90%) were achieved for the Asian, Black and White ethnic groups, while high specificities (> 95%) were achieved for all ethnic groups.

Table 2

Sensitivities and specificities of the imputation by ethnicity.
Ethnicity Groups:	Sensitivity:	Specificity:
Asian	92.04%	99.66%
Black	97.30%	98.78%
Coloured	63.79%	97.97%
White	95.81%	95.04%

Discussion

This paper sought to validate the imputation method employed by the South African National Cancer Registry for allocating missing ethnicity data in the cancer surveillance system. Of the test data, 99.5% was successfully imputed with a mean imputation strength of 91%. An observed agreement of 94.35% (Cohen’s kappa 0.9031) was achieved between the patient-reported and imputed ethnicities.

This imputation method worked well for the Asian, Black and White ethnic groups, achieving sensitivity and specificity above 90% for each of the three groups (Table 2), but had limitations for the Coloured ethnic group (sensitivity of 63.79%), where a large percentage of cases were misallocated to the White ethnic group. This problem has historical roots in SA. The Coloured population is a mixed population with major ancestral components from Khoisan, Bantu-speaking African, European and Asian groups [7]. Surnames are derived from this mixed ancestry, and many are similar to those of the White population group with European ancestry.

One limitation of this imputation method is that it relies on known surname-ethnicity pairs. This hinders its ability to impute ethnicities for previously unknown or uncommon surnames. Considering that uncommon surnames may make up a very small percentage of overall missing data, one could use other imputation methodologies, such as multiple imputation, to overcome this limitation. However, this would require further validation. Additionally, the trade-off between the percentage of missing data and the accuracy of imputed data should be considered carefully before embarking on further data analysis. Since the strength of this imputation method relies heavily on known surname-ethnicity pairs, the construction of the imputation reference panel and the validation of the imputation method would have been more robust had the researchers had access to an independent SA dataset with surnames linked to ethnicities, against which a comparison to a “gold standard” could have been performed. However, there are no publicly available datasets of this nature in SA.

Another limitation of this analysis lies in the inability to confirm the “missing at random” supposition. The imputation reference panel dataset originated from cancer pathology reports from both public and private healthcare sectors; however, a large proportion of ethnicity data was either withheld or not collected by the private healthcare sector (Table 1). This non-collection of the ethnicity variable by the private healthcare sector could bias the imputation for the White, Coloured and Asian population groups, whose ethnicity data are more likely to be missing than those of the Black population group, who were more likely to access public healthcare facilities.

Imputation of ethnicity using a reference panel is not a novel method. A previous study used surname and geocoding data from the U.S. census in a similar approach to impute missing ethnicity in electronic health record systems and found the approach useful (88% correctly imputed), thus reducing outcome bias [8].

Racial/ethnic/population group classification in health data has been discussed extensively in SA and international health literature [9–11]. The NCR is fully cognisant of the arguments against such categorisation of health data: that it may entrench “race-based mindsets” and may be perceived as showing approval of race-based segregation [9]. Importantly for health researchers, race-based segregation may result in the actual social, political and cultural determinants of health being missed. Marmot cautions that “if two groups, however defined, have different rates of disease, a productive aetiological investigation may follow” [11]. The NCR aims to motivate this productive aetiological investigation through its description of cancer incidence estimates by age, gender and ethnicity. Researchers agree that in SA, ethnicity is a proxy for socio-economic determinants and access to healthcare. Because of institutionalised racial segregation under apartheid in SA, socioeconomic status and access to healthcare are largely divided according to ethnicity, with a large percentage of the Black population in the lower socioeconomic strata [12, 13].

The large number of cases with missing ethnicity data post-1998 can be ascribed to societal sensitivity to any form of racial classification post-apartheid. The system of apartheid before 1994 ensured the oppression of non-White individuals through social, economic and political policies based on race segregation [12]. During democracy, “people are more likely to perceive the obligation to declare one’s “race” on a form as an affirmation of the validity of “race” classification” [9]. Therefore, fewer people supplied information about ethnicity on the pathology request form. In addition, recent pathology request forms from private healthcare sector laboratories have excluded the ethnicity variable entirely.

Given that the NCR is a national surveillance organisation, a pragmatic, cost-effective approach to the collection of variables is adopted, with the number of variables kept to a minimum. In addition, because the surveillance data are sourced from pathology reports, the NCR does not have access to the true socioeconomic and cultural determinants of cancer. Therefore, the presentation of data classified by population group is meant to stimulate policy discussions and research into the underlying aetiological factors that present themselves as population group differences.

Conclusion

There were increasing proportions of missing ethnicity data in the cancer surveillance data reported to the NCR because of socio-political changes in SA during democracy. Using this imputation method, missing ethnicities could be imputed accurately across the test dataset. However, the classification rate, particularly for the Coloured population, could be improved by using a more robust imputation reference panel. Although this methodology could prove useful for researchers working with ethnicity data from SA, the accurate and complete reporting of patients’ demographic data remains essential for comprehensive surveillance.

Declarations

Competing interests

The authors declare no competing interests.

References

1. Singh E, Underwood JM, Nattey C, et al. (2015) South African National Cancer Registry: Effect of withheld data from private health systems on cancer incidence estimates. South African Med J 105:107. doi: 10.7196/samj.8858
2. Council for Medical Schemes (2016) Annual report 2015/2016. ISBN:978-0-621-44536-7.
3. Statistics South Africa (2017) Statistical release P0318 General Household Survey 2016. Statistics South Africa, Pretoria
4. Sterne JAC, White IR, Carlin JB, et al. (2009) Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 338:b2393.
5. Statistics South Africa (2015) Statistical release P0302 Mid-year population estimates 2015. pp20. Statistics South Africa, Pretoria
6. Statistics South Africa (2011) Census 2011 Municipal report Gauteng, Report No 03-01-55. Pretoria.
7. de Wit E, Delport W, Rugamika CE, et al. (2010) Genome-wide analysis of the structure of the South African Coloured Population in the Western Cape. Hum Genet 128:145–53. doi: 10.1007/s00439-010-0836-1
8. Grundmeier RW, Song L, Ramos MJ, et al. (2015) Imputing missing race/ethnicity in pediatric electronic health records: reducing bias with use of U.S. census location and surname data. Health Serv Res. 50(4): 946–960. doi: 10.1111/1475-6773.12295
9. Ncayiyana D (2007) Racial profiling in medical research: What are we measuring? SAMJ 97:1225–1226. doi: 10.1101/gr99292
10. Rothberg AD (2008) Equity and quality of care through racial profiling. 98:435–437.
11. Ellison GTH (1996) Desegregating health statistics and health research in South Africa. 1257–1262.
12. Coovadia H, Jewkes R, Barron P, et al. (2009) The health and health system of South Africa: historical roots of current public health challenges. Lancet 374:817–34. doi: 10.1016/S0140-6736(09)60951-X
13. Ndletyana M (2014) Middle-class in South Africa: significance, role and impact. In: BRICS 6th Acad. Forum, Brazil. pp 1–21
14. Statistics South Africa (2015) General Household Survey; Statistical Release P0318.
