Distribution of the environmental and socioeconomic risk factors on COVID-19 death rate across continental United States: A spatial nonlinear analysis

The COVID-19 outbreak has become a global pandemic. Spatial variation in the environmental, health, socioeconomic, and demographic risk factors of the COVID-19 death rate is not well understood. Global models and local linear models have been used to estimate the impact of COVID-19 risk factors, but these do not account for the nonlinear relationships between the risk factors and the COVID-19 death rate at various geographical locations. We proposed a local nonlinear nonparametric regression model named geographically weighted random forest (GW-RF) to estimate the nonlinear relationship between the COVID-19 death rate and 47 risk factors derived from the US Environmental Protection Agency, the National Center for Environmental Information, the Centers for Disease Control and the US Census. The COVID-19 data were fitted with a global regression model, random forest (RF), and a local model, GW-RF. The adjusted $R^2$ of the RF is 0.69. The adjusted $R^2$ of the proposed GW-RF is 0.78. The results of the GW-RF showed that six risk factors (going to work by walking, airborne benzene concentration, householder with a mortgage, unemployment, airborne PM2.5 concentration and percentage of Black or African American population) have a high correlation with the spatial distribution of the COVID-19 death rate. These key factors derived from the GW-RF were mapped, which could provide useful implications for controlling the spread of the COVID-19 pandemic. The local $R^2$ of the GW-RF is 0.59. In the GW-RF, the local $R^2$ is higher than 0.4 in 89.4% of the counties and higher than 0.6 in 50.5% of the counties, indicating that the GW-RF performs well in most of the study area. This shows that the local nonlinear nonparametric model GW-RF can accurately estimate the relationship between the risk factors and the COVID-19 death rate at various geographical locations.


The 2019 novel coronavirus disease (COVID-19), caused by SARS-CoV-2, is a rapidly spreading infectious disease that mainly affects the respiratory system (Landi et al. 2020). Because the disease is

Bashir et al. (2020) found that air pollutants including PM10, PM2.5, SO2, NO2, and CO are significant risk factors for the COVID-19 epidemic. Tosepu et al. (2020) analysed the correlation between weather and COVID-19, and found that the average temperature was highly correlated with COVID-19. Virus

(1) The data sets $D_1, D_2, \cdots, D_k$ are extracted by repeatedly applying the bootstrap method to randomly resample the whole data set $D$, and the corresponding decision trees $T_1, T_2, \cdots, T_k$ are generated.

(2) At each node of the decision tree, randomly select $m$ ($m < M$) variables from all $M$ variables of the decision tree; each node is then split on the selected variables using the optimal split determined by a splitting criterion.

(3) The value of $m$ remains unchanged while the forest grows. Each tree grows to its largest extent, without pruning, until it cannot be split further.

Thus the correlation between the decision trees in the forest decreases through the random selection of variables at each node of the tree, and the optimal split of each node is determined by the selected variables only, instead of all variables. Each tree can grow to its largest extent without pruning.
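The three steps above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the study: the function names, the squared-error splitting criterion, the parameters `k` (number of trees) and `m` (variables tried per node), and the toy data are all assumptions made for illustration.

```python
import random
from statistics import mean


def sse(ys):
    """Sum of squared errors around the mean; 0 for an empty list."""
    if not ys:
        return 0.0
    mu = mean(ys)
    return sum((y - mu) ** 2 for y in ys)


def best_split(rows, ys, features):
    """Return the (feature, threshold) that most reduces SSE, or None."""
    best, best_score = None, sse(ys)
    for f in features:
        # Dropping the largest value keeps both children non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, ys) if r[f] <= t]
            right = [y for r, y in zip(rows, ys) if r[f] > t]
            score = sse(left) + sse(right)
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, ys, m, rng):
    """Steps (2)-(3): try m random variables per node, grow without pruning."""
    candidates = rng.sample(range(len(rows[0])), m)
    split = best_split(rows, ys, candidates)
    if split is None:  # no candidate improves the fit: make a leaf
        return mean(ys)
    f, t = split
    left = [(r, y) for r, y in zip(rows, ys) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, ys) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], m, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], m, rng))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(rows, ys, k=25, m=1, seed=0):
    """Step (1): draw k bootstrap samples and grow one tree on each."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [ys[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean(predict_tree(tree, row) for tree in forest)


# Toy data (hypothetical): the response depends on the first variable only.
rows = [[x, (x * 7) % 5] for x in range(20)]
ys = [0.0 if x < 10 else 1.0 for x in range(20)]
forest = fit_forest(rows, ys, k=25, m=1, seed=0)
pred = predict_forest(forest, [15, 2])
```

Because `m` is smaller than the number of variables, different trees split on different variables, which is what lowers the correlation between trees.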

Therefore, the algorithm can deal with excessive redundant features and avoid overfitting.

$i, j \in (1, 2, \cdots, n)$

As the local random forest of an individual unit needs to consider the unit itself, the value of $w_{ii}$ is set to 1 ($w_{ii} = 1$). According to the spatial weight rule, for spatial unit $i$, if sample $j$ ($j \in (1, 2, \cdots, n) \wedge j \neq i$) is a "neighbour" of unit $i$, the value of the spatial weight between them is set to 1, that is, $w_{ij} = 1$. If spatial unit $j$ is far away from spatial unit $i$ and is not a "neighbour" of spatial unit $i$, then $w_{ij} = 0$.
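A binary spatial weight matrix of this kind can be sketched as follows. The passage does not state which rule decides who counts as a "neighbour", so the fixed distance threshold (`bandwidth`), the function name, and the coordinates below are assumptions for illustration only.

```python
from math import hypot


def binary_weights(coords, bandwidth):
    """Build a 0/1 spatial weight matrix from point coordinates.

    w[i][i] is always 1 (a unit belongs to its own local model);
    w[i][j] is 1 when unit j lies within `bandwidth` of unit i, else 0.
    The distance-threshold rule is an assumed neighbour definition.
    """
    n = len(coords)
    w = [[0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        for j, (xj, yj) in enumerate(coords):
            if i == j or hypot(xi - xj, yi - yj) <= bandwidth:
                w[i][j] = 1
    return w


# Three hypothetical units: the first two are close, the third is far away.
coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
w = binary_weights(coords, bandwidth=1.5)
```

Other neighbour rules (for example, a fixed number of nearest units) would change only the condition inside the loop; the matrix keeps the same 0/1 form.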

(2) Select all the "neighbours" of each spatial unit according to the spatial weight matrix. For unit $i$, its "neighbours" can be selected from the spatial weight matrix where $w_{ij} \neq 0$, ($j \in$

(3) The spatial unit $i$ and its "neighbours" are used as the inputs to construct a local RF for unit $i$ (RF($i$)).

By executing RF($i$), the variable importance for spatial unit $i$ can be computed.

(4) Repeat steps (2) and (3) to construct a local RF for each spatial unit in the study area and estimate the local variable importance for each spatial unit.

We used the local $R^2$ to estimate the performance of the GW-RF.

The proportion of counties with each local primary risk factor (the risk factor with the highest local variable importance) at county level in the GW-RF was calculated (see Table 3). Going to work by walking was the most influential risk factor in 35% of the counties. The airborne benzene concentration was the leading risk factor in 24% of the counties. 13% of counties were most affected by householder with a mortgage and 12% of counties were most affected by unemployment. Figure 4, Figure 5 and Figure 6 provide a detailed spatial distribution of the local variable importance on the COVID-19 death rate for the six factors with the highest average variable importance, using the GW-RF.
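The neighbourhood loop of steps (2) to (4) and the tally of primary risk factors can be sketched as below. This is a hedged illustration, not the study's code: `stump_importance` is a deliberately simple stand-in for the variable importance of a local random forest, and the variable names, weight matrix and data are hypothetical toy values.

```python
from collections import Counter


def sse(ys):
    """Sum of squared errors around the mean; 0 for an empty list."""
    if not ys:
        return 0.0
    mu = sum(ys) / len(ys)
    return sum((y - mu) ** 2 for y in ys)


def neighbours(w, i):
    """Step (2): indices j with w[i][j] != 0 (unit i included, as w[i][i] = 1)."""
    return [j for j, wij in enumerate(w[i]) if wij != 0]


def stump_importance(rows, ys):
    """Stand-in importance: best single-split SSE reduction per variable.

    A real GW-RF would fit a local random forest here and read its
    variable importance instead.
    """
    total = sse(ys)
    scores = []
    for f in range(len(rows[0])):
        best = 0.0
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, ys) if r[f] <= t]
            right = [y for r, y in zip(rows, ys) if r[f] > t]
            best = max(best, total - sse(left) - sse(right))
        scores.append(best)
    return scores


def local_importances(w, rows, ys, importance_fn):
    """Steps (3)-(4): fit one local model per unit on its neighbourhood."""
    result = []
    for i in range(len(rows)):
        idx = neighbours(w, i)
        result.append(importance_fn([rows[j] for j in idx],
                                    [ys[j] for j in idx]))
    return result


def primary_factor_share(importances, names):
    """Share of units whose highest local importance is each variable."""
    counts = Counter(
        names[max(range(len(imp)), key=imp.__getitem__)]
        for imp in importances
    )
    return {name: counts[name] / len(importances) for name in names}


# Toy example: two neighbourhoods of three units each (block-diagonal w).
names = ["x0", "x1"]  # hypothetical variable names
w = [[1, 1, 1, 0, 0, 0]] * 3 + [[0, 0, 0, 1, 1, 1]] * 3
rows = [[0, 5], [1, 5], [2, 5], [7, 0], [7, 1], [7, 2]]
ys = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
imps = local_importances(w, rows, ys, stump_importance)
share = primary_factor_share(imps, names)
```

In this toy setup the response follows `x0` in the first neighbourhood and `x1` in the second, so each variable is the primary factor for half of the units. Passing the importance function as an argument keeps the loop independent of the local learner, so a random forest importance can be swapped in without changing the rest.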