Ranking the explanatory power of factors associated with worldwide new Covid-19 cases

Disease spread is a complex phenomenon requiring an interdisciplinary approach. Covid-19 exhibited a global spatial spread in a very short time frame, resulting in a global pandemic. Data of new Covid-19 cases per million were analysed worldwide at the spatial scale of a country and time replicated from the end of December 2019 to late May 2020. Data driven analysis of epidemiological, economic, public health, and governmental intervention variables was performed in order to select the optimal variables for explaining new Covid-19 cases across all countries in time. Subsequently, hierarchical variance partitioning of the optimal variables was performed in order to quantify the independent contribution of each variable to the total variance of new Covid-19 cases per million. Results indicated that, of the variables available, new tests per thousand explained the largest share of the total variance in new cases (51.6%), followed by the governmental stringency index (15.2%). Availability of hospital beds per 100k inhabitants explained 9%, extreme poverty explained 8.8%, hand washing facilities 5.3%, the fraction of the population aged 65 or older explained 3.9%, and other disease prevalence (cardiovascular diseases plus diabetes) explained 2.9%. The percentage of smokers within the population explained 2.6% of the total variance, while population density explained 0.6%.


Introduction
Patterns of infectious diseases across spatial and temporal scales are fundamental for understanding their dynamics and for designing eradication strategies [1,2]. Disease spread is a complex phenomenon and requires an interdisciplinary approach spanning from medicine to statistics and social sciences [3]. Covid-19 exhibited a global spatial spread in a very short time frame, which led the World Health Organisation to characterize it as a pandemic [4].
To date there is no known anti-viral treatment or vaccine against Covid-19 [5]. Therefore the available options against the virus are the characteristics of the immune system and health status of each individual, the health system of the country that the individual has access to, the social behaviour of the other individuals forming the society, public interventions of movement or public campaigns, and testing [6][7][8]. Thus, there are relatively few options to be employed in order to diminish the spread of the disease. This scarcity of options makes quantifying the factors associated with disease spread more important and ranking the relative contribution of each factor on Covid-19 spread may facilitate diminishing it [9].
Analysis so far has often been dominated either by one-factor-at-a-time analysis (e.g. new cases per testing effort or total cases per age structure) or by country-based analysis (e.g. doubling time in one country in comparison to other countries). While such comparisons are straightforward and comprehensible, from a statistical perspective they examine one factor at a time and mask underlying characteristics within countries under a variable 'Country' [10]. As a result, the underlying factors of disease spread remain hidden and causality is often discussed in a speculative manner. Admittedly the problem is complex due to the large differences in the potential underlying factors across countries. A small variance implies that the mean contains virtually all of the information, while a large variance implies that more information than the mean is present [11]. When examining several countries together and across time, quantifying the variance may be more informative than the mean [12]. Variance is introduced both by space (e.g. large climatic differences between locations on the same date) and by time (e.g. climatic or behavioural changes within locations across seasons) [13][14][15].
The potential effect of economic, epidemiological, public health, or governmental intervention factors may become clearer when the contribution of these factors to new Covid-19 cases is analysed in a way that accounts for the fact that the data derive from different countries and from different time snapshots [16], while still quantifying the effect of each factor in conjunction with the effects of the other factors. To that end, methods that can account for both spatial and temporal autocorrelation [17] in the data of new Covid-19 cases, and that can quantify the effect of each epidemiological, economic, public health, and governmental intervention variable, are key to our understanding of how the disease spreads in populations worldwide [18,19].
In this study spatio-temporal worldwide data of new Covid-19 cases from the "Our world in data" database [20] were analyzed using computational statistics. The spatial replicate of the dataset included over 150 countries, time replicated for each country across a period of ≈five months. Data driven analysis was performed in order to select the optimal variables in cases where several candidate variables were available. During the data driven variable selection, the fact that data derived from different countries and were time replicated was accounted for by nesting the variance of time within the variance of country and treating both as random effects [21,22]. The percentage of the total variance explained by each epidemiological, economic, public health, and governmental intervention variable associated with new Covid-19 cases was quantified using hierarchical variance partitioning [23,24], thereby ranking the relative contribution of the variance of each factor to new Covid-19 cases worldwide.

Data
The objective was to quantify epidemiological, public health, economic, and governmental intervention factors associated with Covid-19 spread worldwide. New Covid-19 cases per million per country per time step were used as a proxy of disease spread. New cases per million in each country were chosen instead of new cases per country because this estimator is less biased by the total population of each country: countries with larger populations are more likely to record higher new or total case counts, yet these counts may be low relative to the total population. Data regarding new Covid-19 cases per million from the "Our world in data" database were analyzed. The dataset was last accessed on 25/05/2020 and the download location is https://github.com/owid/covid-19data/tree/master/public/data. The data derive from the European Centre for Disease Prevention and Control (ECDC), an EU agency that aims to strengthen Europe's defense against infectious diseases. The ECDC collects and aggregates data from countries around the world. The most up-to-date data for any particular country are therefore typically available earlier via the national health agencies than via the ECDC. This lag between nationally available data and the ECDC data is short because the ECDC publishes new data daily; it is typically a matter of hours and less than a day. The ECDC collects, compiles, and harmonizes data from around the world in a consistent way, which allows comparison of what is happening in different countries. The spatial replicate of the dataset comprised 160 countries, while the temporal replicate spans from 31/12/2019 to 25/05/2020 inclusive. The variables included: From the available variables male and female smokers were averaged as 'smokers'.
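The per-million scaling and the averaging of the two smoking variables described above can be illustrated with a minimal sketch. The function names and the two example countries are hypothetical and are not taken from the dataset.

```python
def per_million(new_cases, population):
    """Scale a raw count of new cases to cases per million inhabitants."""
    return new_cases * 1_000_000 / population


def average_smokers(female_smokers, male_smokers):
    """Average the female and male smoking percentages into one 'smokers' value."""
    return (female_smokers + male_smokers) / 2


# Hypothetical countries: identical raw counts, very different population sizes.
small_country = per_million(new_cases=500, population=5_000_000)
large_country = per_million(new_cases=500, population=500_000_000)
smokers = average_smokers(female_smokers=20.0, male_smokers=30.0)

print(small_country, large_country, smokers)
```

The same raw count of 500 new cases corresponds to a per-million value one hundred times higher in the smaller country, which is the bias that the per-million estimator is meant to reduce.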

Data analytics
We employed generalised linear mixed effects models (LME; [26]) with new Covid-19 cases per million as the dependent variable. As the dataset contained several potential indexes of testing, population density, or age structure within each country, an initial analysis was conducted in order to select the most informative index of each.
We initially sought to identify the most parsimonious data driven index of testing among the candidate fixed effects of (i) new tests, (ii) total tests, (iii) new tests per thousand, (iv) total tests per thousand, (v) new tests smoothed, and (vi) new tests smoothed per thousand. This was achieved by fitting six LMEs, each with new cases per million as the dependent variable and one of (i) to (vi) as the single independent variable. The random effect structure of each LME included the nested variance of time within each country (Random~Country/Time). In this way the fitted LMEs accounted for both temporal and spatial autocorrelation in the time replicated data deriving from different geographic locations [21,22]. LMEs were fitted with Maximum Likelihood (ML) estimation to allow comparisons between models with different fixed effects, and the LME with the lowest Akaike Information Criterion (AIC) value was selected [27,28].
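The logic of selecting one index by lowest AIC among single-predictor models can be sketched as follows. This is a deliberately simplified illustration: it fits ordinary least squares models (the Gaussian ML fit) and omits the Country/Time random effects used in the study, and the data are synthetic rather than taken from the database. The function name and the candidate values are assumptions made for the example.

```python
import math
import random


def simple_ols_aic(x, y):
    """Fit y = a + b*x by least squares (the Gaussian ML fit) and return its AIC."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Maximised Gaussian log-likelihood with the ML variance estimate rss / n.
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    k = 3  # intercept, slope, residual variance
    return 2 * k - 2 * log_lik


random.seed(1)
n = 200
# Synthetic stand-ins for two candidate testing indices (not real data).
candidates = {
    "new_tests_per_thousand": [random.gauss(0, 1) for _ in range(n)],
    "total_tests": [random.gauss(0, 1) for _ in range(n)],
}
# The synthetic response is built to depend on the first candidate only.
y = [2.0 * v + random.gauss(0, 1) for v in candidates["new_tests_per_thousand"]]

aic = {name: simple_ols_aic(x, y) for name, x in candidates.items()}
best = min(aic, key=aic.get)
print(aic, best)
```

Because every candidate model here has the same number of parameters, the AIC ranking reduces to a ranking by maximised likelihood; the penalty term matters only when models differ in complexity.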
Here and throughout the analysis there were 19,709 data points, but some variables had missing values at some time steps or for some countries. Missing values were omitted from the statistical analysis. Therefore AIC values are compared between models fitted with different fixed effects but also with potentially different sample sizes.
Similarly, LMEs with new cases per million as the dependent variable, the fixed effect of (i) population or (ii) population density, and the nested random effects of time within country were fitted with ML and compared by AIC value to select the optimal data driven population index.
Regarding age structure of the population within each country, the available variables were (i) median age of the population, (ii) the percentage of the population aged 65 or older, and (iii) the percentage of the population aged 70 or older. The analysis proceeded by selecting the ML fitted LME with the lowest AIC among the three available age structure variables. All three fitted LMEs contained the random effects of time nested within country.
Regarding economic status of the population within each country, the available variables were (i) GDP per capita, and (ii) percentage of the population under extreme poverty. The analysis proceeded by selecting the ML fitted LME with the lowest AIC between the two available economic status variables. The two fitted LMEs contained the random effects of time nested within country.
Having selected the optimal indices of testing, population, age structure, and economic status, the analysis proceeded with the following independent variables: (1) population density, (2) new tests per thousand, (3) governmental stringency index, (4) percentage of the population aged 65 or older, (5) percentage of the population under extreme poverty, (6) cardiovascular disease (CVD) death rate, (7) diabetes prevalence, (8) percentage of smokers, (9) percentage of the population with access to hand washing facilities, and (10) hospital beds per 100k inhabitants within each country.
Hierarchical Variance Partitioning (HVP) statistical modelling was implemented to quantify the contribution of each data driven epidemiological, economic, public health, and governmental intervention explanatory variable to the total variance of new Covid-19 cases per million [29,30]. HVP is a statistical framework that is capable of handling correlated independent variables whilst providing a reliable ranking of the importance of each predictor [29]. Variance partitioning is calculated from the Akaike (AIC) weights of each explanatory variable and is based upon the number of times that a variable was significant in all possible combinations of the explanatory variables. The HVP function produces a minor rounding error for hierarchies constructed from more than nine variables [31], and ten data driven variables were available. To check whether this error affected the inference, the analysis was repeated several times with the variables entered in a different order [31]. The derived results changed when the order of the variables was changed. The analysis therefore proceeded by creating a new variable that merged the two other-disease variables (other diseases = cvd_death_rate + diabetes_prevalence), which together with the remaining eight variables gave a total of nine variables. There is no known statistical bias in HVP when nine or fewer variables are used [31].
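The idea of averaging a variable's contribution over all possible combinations of the other variables can be sketched in a few lines. The sketch below makes several assumptions that differ from the study: it uses R² from ordinary least squares as the goodness-of-fit measure rather than AIC weights, it ignores the random effects, and it runs on three synthetic variables. All function and variable names are hypothetical.

```python
import itertools
import random


def center(values):
    """Subtract the mean so that the intercept drops out of the regression."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def r_squared(columns, y):
    """R^2 of an OLS fit of y on the given predictor columns (with intercept)."""
    if not columns:
        return 0.0
    yc = center(y)
    xc = [center(col) for col in columns]
    p = len(xc)
    # Normal equations on centred data, (X'X) b = X'y, as an augmented matrix.
    a = [[sum(u * v for u, v in zip(xc[i], xc[j])) for j in range(p)]
         + [sum(u * v for u, v in zip(xc[i], yc))] for i in range(p)]
    xty = [row[p] for row in a]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(p):
        pivot = max(range(i, p), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(p):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [vr - f * vi for vr, vi in zip(a[r], a[i])]
    beta = [a[i][p] / a[i][i] for i in range(p)]
    explained = sum(b * c for b, c in zip(beta, xty))
    return explained / sum(v * v for v in yc)


def hierarchical_partitioning(predictors, y):
    """Return each predictor's independent contribution and the full-model R^2."""
    names = list(predictors)
    fit = {}
    for k in range(len(names) + 1):
        for subset in itertools.combinations(names, k):
            fit[frozenset(subset)] = r_squared([predictors[v] for v in subset], y)
    independent = {}
    for name in names:
        others = [v for v in names if v != name]
        level_means = []
        for k in range(len(others) + 1):
            # Gain in fit from adding `name` to every subset of size k.
            gains = [fit[frozenset(s) | {name}] - fit[frozenset(s)]
                     for s in itertools.combinations(others, k)]
            level_means.append(sum(gains) / len(gains))
        independent[name] = sum(level_means) / len(level_means)
    return independent, fit[frozenset(names)]


random.seed(0)
n = 300
tests = [random.gauss(0, 1) for _ in range(n)]
stringency = [0.5 * t + random.gauss(0, 1) for t in tests]  # correlated with tests
density = [random.gauss(0, 1) for _ in range(n)]
cases = [3.0 * t + 1.0 * s + random.gauss(0, 1) for t, s in zip(tests, stringency)]

independent, full_r2 = hierarchical_partitioning(
    {"tests": tests, "stringency": stringency, "density": density}, cases)
shares = {name: 100 * value / full_r2 for name, value in independent.items()}
print(shares)
```

In this formulation the independent contributions sum to the fit of the full model, so expressing them as percentages gives a ranking of the kind reported in the Results. The number of models grows as 2 to the power of the number of variables, which is 512 for nine variables.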

Results
The optimal data driven index of testing for explaining new cases per million was new tests per thousand, as fitted by LMEs and selected by the lowest AIC value (Table 2a). New tests per thousand smoothed could not be fitted as the LME did not converge (Table 2a). The optimal data driven population index for new cases per million was population density (Table 2b). The optimal data driven index for age structure within the population was the percentage of the population within each country aged 65 or older (Table 2c). The optimal AIC selected LME regarding the economic status of the population within each country in relation to new Covid-19 cases per million was the one containing extreme poverty (Table 2d).

INSERT TABLE 2
Results from HVP indicated that new tests per thousand explained 51.6% of the total variance of new cases per million, while the governmental stringency index explained 15.2% (Fig. 1, Table 3). Availability of hospital beds per 100k inhabitants explained 9% (Fig. 1, Table 3). Extreme poverty explained 8.8% of the total variance of new cases per million, hand washing facilities 5.3%, the fraction of the population aged 65 or older 3.9%, and other disease prevalence (cardiovascular diseases plus diabetes) 2.9% (Fig. 1, Table 3). The percentage of smokers within the population explained 2.6% of the total variance of new Covid-19 cases per million, while population density explained 0.6% (Fig. 1, Table 3).

Discussion
The best model fit regarding new Covid-19 cases per million and the economic status of the country where the new cases are recorded indicated that extreme poverty was a better predictor of new cases than GDP per capita. This suggests that it is the poorest individuals within each country who are affected, rather than poor countries as a whole. From the data available, the fraction of the population aged 65 or older, and not median population age, best explained new cases per million. New tests per thousand, rather than new tests, new tests smoothed, or the other available indexes, was the best predictor of new cases per million. This is perhaps unsurprising: the number of new cases is already normalized by the population, so the number of tests normalized by the population explains the pattern better. The same applies to population density, rather than total population, as the best available predictor of new cases per million. In summary, the data-driven analysis shows that new Covid-19 cases per million are best explained by extreme poverty prevalence within each country as well as by the fraction of the population aged 65 or older, thereby indicating an association of the spread of the disease with the poor and the older.
Results from variance partitioning of the nine data-driven selected epidemiological, public health, economic, and governmental intervention variables explaining new Covid-19 cases per million across countries through time indicated that the largest share of the variance in new cases per million is explained by the number of tests conducted. The number of new tests per thousand explains over 50% of the total variance through time and across countries, and thus the message regarding the efficacy of testing against Covid-19 spread is strong, at least from the results derived here. The efficacy of testing has been highlighted as the best strategy against other diseases too, across humans, agriculture, and wildlife [1,24,32,33]. It therefore seems that the optimal strategies against Covid-19 spread should include a high number of tests, both of suspected cases and of random population samples.
Would increasing the number of tests result in detecting more Covid-19 cases? Missed Covid-19 cases are not uncommon [34,35]. From a statistical perspective, variance partitioning does not provide information on the sign of the effects (positive or negative); it simply shows in how many cases a variable could not be excluded from the final optimal statistical model explaining new Covid-19 cases. The slope between new tests per thousand inhabitants and new cases per million varies between countries (Fig. 2a). Indeed, there are countries where this slope is positive, indicating that testing more would result in identifying more cases (Fig. 2a) [32]. However, there are also countries with a negative slope between new tests and new cases, indicating that testing frequency is saturated (Fig. 2a). Overall, using data from all countries and time steps, the relationship between testing and new cases is positive, indicating that worldwide more tests would result in identifying a higher number of new cases (Fig. 2b). Therefore the efficacy of testing has not been saturated.
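The distinction between country-level slopes and the pooled slope can be illustrated with a small sketch. The two countries and their values are hypothetical, chosen only to show that a positive pooled slope can coexist with a negative slope in an individual country; they are not taken from the dataset.

```python
def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical (new tests per thousand, new cases per million) series per country.
records = {
    # Cases rise with testing: more tests identify more cases.
    "Country A": ([0.1, 0.2, 0.3, 0.4], [5.0, 9.0, 14.0, 18.0]),
    # Cases fall as testing rises: testing appears saturated.
    "Country B": ([0.5, 0.6, 0.7, 0.8], [30.0, 27.0, 25.0, 22.0]),
}

slopes = {country: ols_slope(x, y) for country, (x, y) in records.items()}

# Pool all countries and time steps into a single regression.
pooled_x = [v for x, _ in records.values() for v in x]
pooled_y = [v for _, y in records.values() for v in y]
pooled_slope = ols_slope(pooled_x, pooled_y)

print(slopes, pooled_slope)
```

Here the pooled slope is positive even though one country has a negative slope, which is why the sign of the relationship has to be inspected per country as well as overall.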
The second best variable in explaining new Covid-19 cases per million was the governmental stringency index. The governmental stringency index combines several measures taken by governments, including school closures, national and international movement restrictions, restrictions on public gatherings and public events, restrictions on leaving home, as well as testing policies and financial measures [25]. Testing policy is therefore in part contained in the stringency index as a weighted percentage of the overall index. However, the relationship between the stringency index and new tests per thousand is very weak, with both R² and slope close to zero. The fitted linear regression was new_tests_per_thousand = 0.1959 + 0.002479 × stringency_index (S = 0.597, R² = 1.1%, P < 0.001). Therefore, of the available governmental measures summarized in the stringency index, testing frequency can be treated independently.
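To give a sense of how small the reported slope is, the fitted line can be evaluated at the two ends of the stringency scale. The coefficients are those reported in the text; the 0 to 100 range of the index is an assumption of this sketch, and the function name is hypothetical.

```python
INTERCEPT = 0.1959  # intercept reported in the text
SLOPE = 0.002479    # new tests per thousand per stringency point, as reported


def predicted_new_tests_per_thousand(stringency_index):
    """Evaluate the reported regression line at a given stringency value."""
    return INTERCEPT + SLOPE * stringency_index


# Assumed scale end points of the stringency index (0 = none, 100 = strictest).
low = predicted_new_tests_per_thousand(0)
high = predicted_new_tests_per_thousand(100)
change = high - low

print(low, high, change)
```

Under that assumption, moving across the whole stringency scale changes the predicted value by about 0.25 new tests per thousand, which is less than half of the reported residual standard deviation (S = 0.597).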
In general, countries increased their level of stringency as their number of confirmed Covid-19 cases rose; however, there is significant variation in the rate and timing of this relationship [25]. Another study indicated that in the early and accelerating stages of the pandemic, many citizens across 58 countries viewed their governments' response as insufficient [36]. In general the status of the infection spread and policy implementation influence restrictions uniformly across all countries [37]. Given the overall large effect of testing on new cases, it has been investigated whether there exists a testing frequency for Covid-19 such that the shutdown could have been avoided [38]. The study concluded that there is indeed an optimal testing frequency at which lockdown, and thus governmental stringency, may not be deemed necessary [38]. The test for Covid-19 is known to be imperfect, although its accuracy is not precisely known [39], and testing strategies to surmount this problem have been proposed [40].
The importance of the availability of hospital beds per 100k inhabitants, hand washing facilities, the effect of Covid-19 on older people, the prevalence of other diseases, and smoking has been highlighted previously [19,41,42], and this study confirms it. Environmental factors have also been reported to play an important role in Covid-19 dynamics [43]; however, this study did not explore their relative contribution.

Table 2. Data driven variable selection in cases where several candidate variables were available. Variable selection was conducted by fitting LMEs with new Covid-19 cases per million as the dependent variable, the candidate explanatory variables as single fixed effect variables, and the variance introduced by time nested within the variance of countries as random effects. Models were fitted with ML to allow comparisons between LMEs fitted with different fixed effects. The optimal model is bolded and italicized in each case.