Derivation and Validation of Risk Scores for Cause-Specific Mortality using Verbal Autopsy Data


Abstract

Background
Death certification remains a challenge, mostly in low-resource countries, and results in poor availability and incompleteness of vital statistics. In such settings, public health and developmental policies concerning the burden of diseases are limited in their derivation and application. The study aimed to develop and evaluate appropriate cause-specific mortality risk scores using Verbal Autopsy (VA) data.

Methods
A logistic regression model was used to identify independent predictors of specific causes of death: non-communicable diseases (NCDs), AIDS/TB, and communicable diseases (CDs). Risk scores were derived using a point scoring system. Receiver operating characteristic (ROC) curves were used to validate the models by matching the number of reported deaths to the number of deaths predicted by the models.

Conclusion
This study has shown that, in low- and middle-income countries, simple risk scores based on information collected with a verbal autopsy questionnaire could adequately be used to assign causes of death for Non-Communicable Diseases and AIDS/TB.

Background
The correct identification of the cause of death is crucial for measuring the burden of diseases in a country. Statistics on mortality and cause of death are an essential component of population health status and are required for identifying priority areas and for policymaking, planning, and programme implementation 1-3. Although vital registration (VR) and civil statistics remain the main source of data for demographic and health researchers 4, challenges such as incomplete registration and high proportions of ill-defined deaths persist in low- and middle-income countries compared to high-income countries 5-7. This problem is particularly alarming in rural areas and less developed countries, where a proper and accurate assessment can be very expensive to undertake. Well-trained personnel and/or doctors also play a crucial role in death certification using the World Health Organization (WHO) standardized International Classification of Diseases (ICD-10) guide. In populations lacking vital registration and qualified medical certification, Verbal Autopsy (VA) has become a primary source of information on causes of death 1.
Verbal autopsy is a procedure used to determine the probable cause of death, often in societies where medical certification of death is poor and limited 8. It has its origins in the 19th century 1,9 and involves a detailed questionnaire administered to relatives or caregivers of the deceased to collect information on the symptoms and events during the illness leading to death 8,10. It is now widely used in the developing world for estimating cause-specific mortality rates, disease surveillance, and sample registration. Data arising from the verbal autopsy procedure have been used to assess time trends, evaluate health interventions, and compare the burden of disease locally as well as globally. However, the usefulness of VA data relies on the correct identification and certification of death 11.
Physician-review VA analysis has been validated and has provided useful information for correctly assigning the cause of death. An alternative procedure that is growing in use and has also been validated is computer-coded verbal autopsy analysis. Physician-certified verbal autopsy is still more widely used than computer-coded verbal autopsy, even though computer-coded VA can be implemented easily by skilled individuals.
Computer-coded VA analysis ranges from Inter-VA and Tariff to probabilistic methods 5,9,12-15,17,18. However, no study has been conducted to develop a risk score for assigning the cause of death using signs and symptoms identified before death. This study aimed to develop and validate appropriate cause-specific mortality risk scores using Verbal Autopsy (VA) data.

Data
The data for this study are from the Population Health Metrics Research Consortium (PHMRC) Verbal Autopsy dataset, which is obtainable at http://ghdx.healthdata.org/record/population-health-metrics-researchconsortium-gold-standard-verbal-autopsy-data-2005-2011. Data were collected at six sites in four countries (India, Mexico, Tanzania, and the Philippines) as part of the PHMRC project. The data included demographic characteristics, the period of illness, history of chronic conditions, and signs and symptoms. The data were cleaned to make them suitable for our expert algorithm analysis. Murray et al 8 provide a full description of the PHMRC verbal autopsy data.
Specific causes of death were divided into three categories: AIDS/TB, communicable diseases (CDs), and non-communicable diseases (NCDs). AIDS and TB were combined to avoid cause misclassification, since TB is the most common opportunistic infection among people infected with HIV. Communicable diseases included malaria, diarrhea, pneumonia, and other infectious diseases. Non-communicable diseases included diabetes, cancer, heart diseases, stroke, asthma, epilepsy, and other unspecified non-communicable diseases.
For analysis, demographic information, signs and symptoms, and the assigned specific cause of death were extracted from the dataset. The ages of the deceased were grouped into 10 to 34 years, 35 to 64 years, and 65 years or older. Since some sites had small or even zero death counts, sites were classified as Africa, Asia, and North America sites. Furthermore, some signs and symptoms were combined to reduce the number of symptoms; for example, fever and sweating were combined, as were cough and whether the cough produced sputum. The Dar-es-Salaam sample was considered for the analysis since it contained more data points and is situated in Africa.
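The recoding steps described above could be sketched roughly as follows. This is only an illustration, not the authors' procedure: the record field names (`age`, `country`, `fever`, `sweating`, `cough`, `sputum`) are hypothetical, and the three-level coding of the combined symptoms (none, symptom only, symptom with qualifier) is an assumption based on the categories named later in the Discussion.

```python
# Country-to-region mapping follows the four countries named in the text.
REGION_BY_COUNTRY = {
    "India": "Asia",
    "Philippines": "Asia",
    "Tanzania": "Africa",
    "Mexico": "North America",
}


def age_group(age_years):
    """Map an age in years to the three age bands used in the analysis."""
    if age_years < 10:
        raise ValueError("ages below 10 years fall outside the stated bands")
    if age_years <= 34:
        return "10-34"
    if age_years <= 64:
        return "35-64"
    return "65+"


def combine_symptoms(primary, secondary, name, qualifier):
    """Collapse two yes/no items into one categorical variable.

    ASSUMPTION: the three-level coding is illustrative; the source only
    says that the two items were combined.
    """
    if not primary:
        return "none"
    return f"{name} with {qualifier}" if secondary else f"{name} only"


def recode(record):
    """Recode one VA record (hypothetical field names) for modeling."""
    return {
        "age_group": age_group(record["age"]),
        "region": REGION_BY_COUNTRY[record["country"]],
        "fever": combine_symptoms(record["fever"], record["sweating"],
                                  "fever", "sweating"),
        "cough": combine_symptoms(record["cough"], record["sputum"],
                                  "cough", "sputum"),
    }


example = recode({"age": 40, "country": "Tanzania", "fever": True,
                  "sweating": False, "cough": True, "sputum": True})
print(example)
```

Combining each pair into a single categorical variable keeps a clear reference level ("none") for the regression, which is one plausible reason for reducing the symptom list in this way.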

Statistical analysis
Descriptive statistics, including frequency counts and proportions of deaths, were reported for each specific cause. Proportions were compared using the Chi-square test for categorical data. Multivariable logistic regression was used to create a predictive model based on the derivation dataset, which included 1,196, 1,202, and 1,217 NCD, AIDS/TB, and CD deaths, respectively; all covariates with a p-value less than 0.25 were considered for modeling. Validation was done on a 25% dataset for each specific cause using the Africa site data. The discriminating ability of the predictive models was evaluated using the area under the ROC curve (AUROC). Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were reported for each cut-point, and the best cut-point was then chosen. A point scoring system was used to derive the risk score associated with each specific cause. Statistical analyses were undertaken using Stata.
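As a rough illustration of the validation steps described above, the sketch below holds out 25% of the records and computes the AUROC of a set of predicted scores. The analysis itself was run in Stata; this Python version is only a sketch. The random split with a fixed seed is an assumption (the text does not say how the 25% validation set was drawn), and the helper names are hypothetical. The AUROC is computed from its rank interpretation: the probability that a randomly chosen death from the cause of interest receives a higher score than a randomly chosen death from another cause, with ties counted as one half.

```python
import random


def split_derivation_validation(records, validation_fraction=0.25, seed=0):
    """Randomly hold out a validation set (ASSUMPTION: simple random split)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 if the death is from the cause of interest, 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes are needed to compute AUROC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data for illustration only; not taken from the PHMRC dataset.
derivation, validation = split_derivation_validation(range(8))
print(len(derivation), len(validation))
print(auroc([0.9, 0.8, 0.4, 0.3, 0.2], [1, 0, 1, 0, 0]))
```

The pairwise form is quadratic in the number of deaths, which is acceptable for a few thousand records; a rank-based implementation would be preferable for much larger datasets.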

Results
A total of 6,224 adult deaths were included in the verbal autopsy dataset across three sites. Males accounted for the majority of deaths (54.7%), with females accounting for the remainder. Of the deceased, 50.9% (3,166) were aged between 35 and 64 years, and 56.6% (3,213) were from Asia sites. The highest proportion of deaths was attributable to NCDs (69%), followed by CD and AIDS/TB deaths. Gender, age at death, and site were statistically significantly associated with deaths. Results are presented in Table 1. Table 2.
Developed risk scores
Table 3 presents the NCD, AIDS/TB, and CD risk score values defined from the coefficients of the multivariable logistic regression models. A lower (more negative) negative coefficient was allocated a score of -4, while a higher negative coefficient was allocated a score of -2. A lower positive coefficient was allocated a score of 2, while a higher positive coefficient was allocated a score of 4. Each reference category was allocated a score of 0, and the minimum and maximum scores were calculated for each specific cause. The NCD score ranged from -8 to 24, the AIDS/TB score from -4 to 22, and the CD score from -16 to 18. The discriminating ability (area under the curve, AUC) of the developed risk scores for the three specific causes decreased compared to the derivation dataset. For NCD deaths, the AUC decreased from 0.81 to 0.76 and from 0.81 to 0.75; for AIDS/TB deaths, from 0.81 to 0.77 and from 0.79 to 0.76; and for CD deaths, from 0.74 to 0.69 and from 0.68 to 0.56, for the derivation and validation sets respectively.

Performance characteristics of the risk scoring system: Cut-offs
Various thresholds for predicting the probability of correctly assigning the cause of death were assessed for each risk scoring system. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were reported for the best cut-offs, which were chosen on the basis of high sensitivity and high negative predictive value. At a threshold of ≥6, the NCD risk score had a sensitivity of 84%, a specificity of 54%, a PPV of 34%, and an NPV of 92%, which increased by 2% using the validation set. The AIDS/TB and CD thresholds had a higher specificity than sensitivity. At a cut-off of ≥6, the probability of correctly assigning AIDS/TB deaths was 44%, with a negative predictive value of 86%, while at a threshold of ≥4 the CD score had a sensitivity of 59% and an NPV of 99%. The chosen cut-off values for the three studied causes performed better in assigning the correct cause of death on the validation set, where the NCD model performed better than the other cause-specific models. Information is presented in Table 4.
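The point allocation and cut-off evaluation described in this section could be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation. The text does not give the boundary between "lower" and "higher" coefficients, so the `band` parameter is hypothetical, as are the coefficient values and scores in the example, which are not taken from Table 3 or Table 4.

```python
def points_for_coefficient(beta, band=1.0):
    """Map a logistic regression coefficient to -4, -2, 0, 2, or 4 points.

    ASSUMPTION: `band` (the split between lower and higher coefficients)
    is hypothetical; the source does not state it.
    """
    if beta == 0:
        return 0  # reference category
    if beta < 0:
        return -4 if beta <= -band else -2
    return 4 if beta >= band else 2


def build_point_table(coefficients, band=1.0):
    """coefficients: {variable: {level: beta}}, reference levels have beta 0."""
    return {var: {lvl: points_for_coefficient(b, band)
                  for lvl, b in levels.items()}
            for var, levels in coefficients.items()}


def score_range(point_table):
    """Minimum and maximum attainable total score (one level per variable)."""
    low = sum(min(levels.values()) for levels in point_table.values())
    high = sum(max(levels.values()) for levels in point_table.values())
    return low, high


def total_score(point_table, record):
    """Sum the points for the levels observed in one record."""
    return sum(point_table[var][record[var]] for var in point_table)


def cutoff_metrics(scores, labels, cutoff):
    """Sensitivity, specificity, PPV, and NPV for the rule score >= cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)

    def ratio(num, den):
        return num / den if den else None  # undefined when denominator is 0

    return {"sensitivity": ratio(tp, tp + fn),
            "specificity": ratio(tn, tn + fp),
            "ppv": ratio(tp, tp + fp),
            "npv": ratio(tn, tn + fn)}


# Hypothetical coefficients, for illustration only.
coefficients = {
    "age_group": {"10-34": 0.0, "35-64": 0.6, "65+": 1.4},
    "cough": {"none": 0.0, "cough only": -0.5, "cough with sputum": -1.2},
    "weight_loss": {"no": 0.0, "yes": 0.8},
}
table = build_point_table(coefficients)
print(score_range(table))
print(total_score(table, {"age_group": "65+", "cough": "none",
                          "weight_loss": "yes"}))
print(cutoff_metrics([8, 6, 2, 6, 0, 4, 2], [1, 1, 1, 0, 0, 0, 0], cutoff=6))
```

Choosing the cut-off by scanning candidate thresholds with `cutoff_metrics` and favoring high sensitivity and high NPV would mirror the selection criterion stated above.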

Discussion
This study used binary logistic regression to derive and validate risk scores for specific causes of death. The models were applied to verbal autopsy data collected in Dar-es-Salaam, Tanzania, the site with more data points than the other five sites. Our scoring tool achieved good predictive ability and can easily be implemented to identify NCD deaths from VA responses. Cough with or without sputum and itching skin were associated with a decreased risk of NCD death, while age, vomiting blood, weight loss, pallor, lumps, and a protruding belly were associated with an increased risk of NCD-related mortality. Our risk score for NCD deaths performed well (AUC ranging from 0.77 to 0.85). Using a cut-off of 6 and above, a maximum of 86% of true positive NCD deaths were correctly assigned, while only 56% of true negatives were correctly identified. The final risk score model for AIDS/TB deaths included gender, age, cough and cough which produces sputum, itching skin, a protruding belly, and fever and fever with sweating. The AIDS/TB model also showed good predictive ability (AUC ranging from 0.73 to 0.85). The risk factors for CD included cough with sputum, fever with sweating, fever only, and frequent loose stools. The predictive ability of this model was only sufficient (AUC ranging from 0.60 to 0.75).
Our tool proved reliable in assigning a probable cause of death. The models predicting NCD-, AIDS/TB-, and CD-related causes of death had acceptable AUCs, and their true positive rates were high compared to the accuracy of Inter-VA in assigning adult deaths 13,17. The true positive rates for our scoring system are comparable to those of Flaxman et al 19, with higher specificity observed. According to Desai et al 20, the positive predictive value for computer-coded VA and physician-coded VA ranged from 44 to 69, which is higher than our PPV for all three models. Additionally, the sensitivity and PPV for accurately assigning AIDS/TB-related deaths were lower than those reported by Polprasert et al 21; however, they are comparable to those of Lopmari et al 22. Though some methods outperformed ours in assigning a cause of death, the strength of this method is that it can be used to assign an individual cause of death 23,24.
This study was subject to some limitations. One limitation is that accurate information on signs and symptoms depends on the ability of the relative or caregiver to recall what happened before the death.
Most African and South-East Asian countries lack complete death registration data and, even worse, do not record cause-of-death data. In most cases, Verbal Autopsy is the only option for assessing causes of death. We have shown that simple risk scores using information collected in the verbal autopsy questionnaire could adequately be used to assign causes of death, especially in developing countries.