COVID-19 mortality and BCG vaccination: de ning the link using machine learning

Nathan A. Brooks The University of Texas MD Anderson Cancer Center, Department of Urology Ankur Puri McKinsey & Company Sanya Garg (  sanya_garg@mckinsey.com ) McKinsey & Company Swapnika Nag (  Swapnika_Nag@mckinsey.com ) McKinsey & Company Jacomo Corbo McKinsey & Company Anas El Turabi McKinsey & Company Noshir Kaka McKinsey & Company Rodney W. Zemmel McKinsey & Company Paul K. Hegarty Mater Private Hospital, Department of Urology, Cork, Ireland Ashish M. Kamat The University of Texas MD Anderson Cancer Center, Department of Urology


Introduction
The severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) and the resulting clinical condition coronavirus disease (COVID-19) have caused a worldwide pandemic. There have been 4.8 million con rmed infections and 318,000 deaths worldwide as of May 19, 2020 1 resulting in signi cant global and personal insecurity 2,3 . Mitigation of the pandemic requires a multifaceted strategy to reduce clinical morbidity/mortality, disease spread, and ultimately vaccination. The bacille Calmette-Guerin (BCG) vaccine has been administered to almost 4 billion people worldwide for over 100 years for the prevention of tuberculosis (TB) 4 . When given in conjunction with anti-viral vaccinations including yellow fever and in uenza, patients pre-treated with BCG have demonstrated reduced viremia, decreased levels of circulating cytokines associated with cytokine storms, and no difference in, or an improved, anti-viral antibody response 5,6 . These observations may be associated with a shift in the T-cell mediated response to pathogens, enhanced trained innate immunity, and/or an as yet undiscovered pathway 7 . However, they provide an immunologic foundation which suggests BCG vaccination is associated with clinically meaningful immunomodulatory function. Several studies have observed reduced CSM/CFR in countires with active BCG vaccination programs, suggesting that there is some degree of protection from severe COVID-19 infection, especially in elderly populations 8,9,10 . Employing unsupervised machine learning methods with adjustment for numerous variables and potential established confounders associated with mortality, we evaluated the association between covariates designated a priori including BCG vaccination programs and mortality associated with COVID-19 at a country level utilizing pre-speci ed inclusion criteria.

Reagents
None were used Equipment To execute the proposed protocol, researchers need access to data and computational resources to run the software packages required for the implementation of this protocol. There are several software packages available which help in processing large datasets and running ML algorithms for pattern recognition. Proposed below is one of the possible methods of implementing this protocol which corresponds to the methodology detailed in the manuscript.

Selection of countries for inclusion in analysis
Inclusion criteria included: more than 2,000 cases as of May 5, 2020, population greater than 5 million, and land area greater than 1,000 km 2 (to exclude city-states with the potential for non-representative population densities). Exclusion criteria included countries where BCG program start year could not be ascertained.

Data collection and consolidation
All data leveraged originated from publicly available data sources (Supplementary Table 1). A set of potential disease related mortality drivers spanning seven domains -socio-economic, health system readiness, environmental, existing disease burden, demographics, vaccination programs, and response to the pandemic were selected a priori (Supplementary Table 2). COVID-19 speci c mortality (CSM) was the primary outcome, de ned as deaths related to COVID-19 per million population assessed 30 days after 100 reported cases.

Data Treatment
All variables were uniformly capped at the 97'th percentile and oored at the 1'st percentile.
Missing value treatment is detailed in Supplementary Table 5. 4. Feature selection using exploratory factor analysis We sought to group countries into comparable clusters based on previously described CSM drivers. To do this, we rst assessed the correlation amongst pre-determined variables related to CSM ( Supplementary  Figure 1) which demonstrated substantial correlation between several explanatory variables. Therefore, exploratory factor analysis, an unsupervised machine learning method to reduce the original set of explanatory variables, was performed. The optimum number of factors were chosen using the scree plot (Supplementary Figure 2). An elbow was observed between 7 and 8 factors (Supplementary Tables 3a   and 3b) 11 . Varimax rotation was used to maximize the loading of each variable on a single factor. From each factor group, variables were chosen as inputs for subsequent clustering and multiple regression analysis based on loading characteristics and expert consensus where loading values were similar. Given the large size of the rst factor group, three variables were selected from the group. Population density was considered as a distinct group given low loading (below 0.3) value and included in addition to one other variable from group 6. There was low variation of values for factors in group 7 thus no variables were included from this group. The variables selected included GDP per capita, population, population density, temperature (Celsius), percentage of the population above 65 years of age, and stringency index (SI) (a measure of country level interventions in response to COVID-19) 12 .

Grouping of countries based on k-means clustering
Countries were clustered utilizing the k-means algorithm 13 . The optimal number of clusters was determined using the average silhouette coe cient and Dunn Index (Supplementary Table 4 Countries within each cluster demonstrated lower coe cients of variation in testing rates compared to the whole population, and therefore normalization of testing rates was not performed.

Validation using step wise linear regression
To explore whether the ndings were robust compared to alternate analytical approaches, we performed sensitivity analyses using linear regression models analyzing variables from each of the factor groups and CSM as the dependent variable.
Step wise linear regression was used to retain variables with a statistically signi cant impact on CSM.
Detailed R code can be found in the Supplementary Files.

Troubleshooting
The code has been uploaded in supplementary les. Any issues with loading of packages can be resolved usually by checking the Java version and reinstalling R and RStudio if needed.

Time Taken 1) Selection of countries for inclusion in analysis -1 day 2) Collection of data -2 weeks
3) Data treatment -3 to 4 days 4) Feature selection using exploratory factor analysis -2 to 3 days 5) Grouping of countries based on k-means clustering -1 to 1.5 weeks 6) Validation using step wise linear regression -2 to 3 days

Anticipated Results
Of 212 countries/territories, 57 countries were included in analysis (Figure 1). Nine city states with insu cient land area or population and 141 countries with insu cient cases were excluded. Four countries met inclusion criteria but start dates for BCG vaccination programs were not available. China was excluded from the analysis as it was the rst country to report widespread cases of the virus and therefore might have introduced a lead time bias.
Factor analysis resulted in the identi cation of six, distinct variables including GDP per capita, population, population density, temperature, percent population above 65 years, and stringency index (Table 1). Variables related to BCG administration were part of a distinct factor group. Countries within clusters had lower variation of both COVID-19 testing rates and Global Health Security Agenda (GHSA) scores, compared to the overall population. Two cluster solutions, with 6 and 9 clusters, demonstrated the highest scores (Dunn Index and Silhouette Score). Since ndings were similar between the 6 and 9 cluster groups and cluster 9 only included 1 country in the 9-cluster solution (Supplementary Table 6), data for the remainder of the manuscript is presented from the six-cluster solution.
Deaths per million related to COVID-19 (CSM) was assessed 30 days after each included country reported 100 cases. Five of 6 clusters allowed division and comparison of CSM by the presence or absence of BCG vaccination programs for the preceding 15 years (BCG15) (Figure 2a). The remaining cluster composed exclusively countries with BCG vaccination programs (no comparison group-cluster 2). All 6 clusters allowed division and comparison of CSM by the presence or absence of BCG vaccination programs in the preceding 40 years (BCG40) (Figure 2b). Four of 5 clusters demonstrated lower mortality when they had BCG15 and 4 of 6 clusters demonstrated the same association with BCG40. For BCG40, speci city, clusters 1, 3, 5, and 6 demonstrated improved CSM with hazard ratios of 0.03, 0.01, 0.17, and 0.47, respectively. Cluster 2 and 4 demonstrated worse CSM with hazard ratios of 2.43 and 2.24, respectively. The results from the 9-cluster analysis were similar (supplementary table 7). Granular data regarding clustering is presented in supplemental tables 8a/b.
Using multivariate regression analysis, the presence of BCG15 (reduction of CSM by 71% (95% CI: 53 to 89%), total population (for every 1 million person increase there was a 1% decrease in CSM (95% CI: 0.53 to 1.47%), and share of the population above 65 years (CSM increased by 10% for each percent increase in population over 65 (95% CI: 2 to 18%) were shown to be signi cantly associated with CSM. Percent coverage metrics for vaccinations including RCV1 (Rubella), MCV1 (Measles) and OPV (Polio) were forced into the model and were not signi cantly associated with CSM. Figure 1 Consort diagram for country selection