This paper sought to validate the imputation method employed by the South African National Cancer Registry for allocating missing ethnicity data in the cancer surveillance system. Of the test data, 99.5% was successfully imputed with a mean imputation strength of 91%. A Cohen’s kappa agreement of 94.35% was achieved between the patient-reported and imputed ethnicities.
This imputation method worked well for the Asian, Black and White ethnic groups, achieving sensitivity and specificity above 90% for each of the three groups (Table 2), but had limitations for the Coloured ethnic group (sensitivity of 63.79%), where a large percentage of cases were misallocated to the White ethnic group. This problem has historical roots in SA. The Coloured population is of mixed ancestry, with major ancestral components from Khoisan, Bantu-speaking African, European and Asian groups [7]. Surnames are derived from this mixed ancestry, and many are similar to those of the White population group, which has European ancestry.
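The agreement measures discussed above can be illustrated with a minimal Python sketch. It computes one-vs-rest sensitivity and specificity for a single group, and Cohen's kappa across all groups, from pairs of (patient-reported, imputed) ethnicity. The function names and the toy data are assumptions made for illustration; they are not the registry's code or the study's data.

```python
from collections import Counter


def one_vs_rest_rates(pairs, group):
    """Sensitivity and specificity for one group, treating all other groups as negative.

    `pairs` is a list of (patient_reported, imputed) labels. The group must
    appear at least once as both a positive and a negative, otherwise a
    division by zero occurs.
    """
    tp = sum(1 for true, pred in pairs if true == group and pred == group)
    fn = sum(1 for true, pred in pairs if true == group and pred != group)
    tn = sum(1 for true, pred in pairs if true != group and pred != group)
    fp = sum(1 for true, pred in pairs if true != group and pred == group)
    return tp / (tp + fn), tn / (tn + fp)


def cohens_kappa(pairs):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(pairs)
    observed = sum(1 for true, pred in pairs if true == pred) / n
    true_counts = Counter(true for true, _ in pairs)
    pred_counts = Counter(pred for _, pred in pairs)
    # Chance agreement: product of the marginal proportions, summed over groups.
    expected = sum(true_counts[g] * pred_counts[g] for g in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy data (invented): two Coloured cases are misallocated to White.
toy_pairs = (
    [("Black", "Black")] * 4
    + [("White", "White")] * 3
    + [("Coloured", "White")] * 2
    + [("Coloured", "Coloured")] * 1
)

coloured_sens, coloured_spec = one_vs_rest_rates(toy_pairs, "Coloured")
white_sens, white_spec = one_vs_rest_rates(toy_pairs, "White")
kappa = cohens_kappa(toy_pairs)
```

In this toy example the misallocated Coloured cases lower the sensitivity for the Coloured group and the specificity for the White group, while leaving sensitivity for the White group untouched, which mirrors the pattern of misclassification described above.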
One limitation of this imputation method is that it relies on known surname-ethnicity pairs. This hinders its ability to impute ethnicities for previously unknown or uncommon surnames. Considering that uncommon surnames may make up a very small percentage of overall missing data, one could use other imputation methodologies, such as multiple imputation, to overcome this limitation. However, this would require further validation. Additionally, the trade-off between the percentage of missing data and the accuracy of imputed data should be considered carefully before embarking on further data analysis. Since the strength of this imputation method relies heavily on known surname-ethnicity pairs, the construction of the imputation reference panel and the validation of the imputation method would have been more robust had the researchers had access to an independent SA dataset with surnames linked to ethnicities, against which a comparison to a “gold standard” could have been performed. However, there are no publicly available datasets of this nature in SA.
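The dependence on known surname-ethnicity pairs can be made concrete with a minimal Python sketch of a reference-panel lookup. This is an illustration under stated assumptions, not the registry's implementation: the function names are hypothetical, the surnames are placeholders, and "imputation strength" is assumed here to mean the share of reference-panel records for a surname that carry its most frequent ethnicity.

```python
from collections import Counter, defaultdict


def build_reference_panel(records):
    """Count ethnicities per surname from (surname, ethnicity) records."""
    panel = defaultdict(Counter)
    for surname, ethnicity in records:
        panel[surname.strip().lower()][ethnicity] += 1
    return panel


def impute_ethnicity(panel, surname):
    """Return (most frequent ethnicity, strength) for a surname.

    Strength is an assumed definition: the proportion of panel records for
    this surname that carry the most frequent ethnicity. A surname absent
    from the panel cannot be imputed and yields (None, 0.0).
    """
    counts = panel.get(surname.strip().lower())
    if not counts:
        return None, 0.0
    ethnicity, top = counts.most_common(1)[0]
    return ethnicity, top / sum(counts.values())


# Placeholder records (invented); real panels hold surname-ethnicity pairs.
records = (
    [("surname_a", "Black")] * 9
    + [("surname_a", "Coloured")] * 1
    + [("surname_b", "White")] * 3
    + [("surname_b", "Coloured")] * 2
)
panel = build_reference_panel(records)

known = impute_ethnicity(panel, "Surname_A ")    # surname present in the panel
ambiguous = impute_ethnicity(panel, "surname_b")  # surname shared across groups
unknown = impute_ethnicity(panel, "surname_z")    # surname absent from the panel
```

A surname shared across groups returns a low strength, and a surname absent from the panel returns no imputed value at all, which is the gap that an alternative method such as multiple imputation would need to fill.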
Another limitation of this analysis lies in the inability to confirm the “missing at random” assumption. The imputation reference panel dataset originated from cancer pathology reports from both the public and private healthcare sectors; however, a large proportion of ethnicity data from the private healthcare sector was either withheld or not collected (Table 1). This non-collection of the ethnicity variable from the private healthcare sector could bias imputation for the White, Coloured and Asian population groups, whose ethnicity data are more likely to be missing than those of the Black population group, who were more likely to access public healthcare facilities.
Imputation of ethnicity using a reference panel is not a novel method. A previous study used surname and geocoding data from the U.S. census in a similar approach to impute missing ethnicity in electronic health record systems and found the approach useful (88% correctly imputed), thus reducing outcome bias [8].
Racial/ethnic/population group classification in health data has been discussed extensively in the SA and international health literature [9–11]. The NCR is fully cognisant of the arguments against such categorisation of health data: that it may entrench “race-based mindsets” and may be perceived as showing approval of race-based segregation [9]. Importantly for health researchers, race-based segregation may result in the actual social, political and cultural determinants of health being missed. Marmot cautions that “if two groups, however defined, have different rates of disease, a productive aetiological investigation may follow” [11]. The NCR aims to motivate this productive aetiological investigation through its description of cancer incidence estimates by age, gender and ethnicity. Researchers agree that in SA, ethnicity is a proxy for socio-economic determinants and access to healthcare. Because of institutionalised racial segregation under apartheid in SA, socioeconomic status and access to healthcare are largely divided according to ethnicity, with a large percentage of the Black population in the lower socioeconomic strata [12, 13].
The large number of cases with missing ethnicity data post-1998 can be ascribed to societal sensitivity to any form of racial classification post-apartheid. The system of apartheid before 1994 ensured the oppression of non-White individuals through social, economic and political policies based on race segregation [12]. Under democracy, “people are more likely to perceive the obligation to declare one’s ‘race’ on a form as an affirmation of the validity of ‘race’ classification” [9]. Therefore, fewer people supplied information about ethnicity on the pathology request form. In addition, recent pathology request forms from private healthcare sector laboratories have excluded the ethnicity variable entirely.
Given that the NCR is a national surveillance organisation, a pragmatic, cost-effective approach to the collection of variables is adopted, with the number of variables kept to a minimum. In addition, the source of our surveillance data (pathology reports) means that the NCR does not have access to the true socioeconomic and cultural determinants of cancer. Therefore, the presentation of data classified by population group is meant to stimulate policy discussions and research into the underlying aetiological factors that present themselves as population group differences.