Ambient temperature change vs. confirmed cases of COVID-19: A machine learning model


Background: The relationship between ambient air temperature and the prevalence of viral infection has been under investigation in recent years. The present study aimed to provide a statistical and machine learning based analysis of the influence of climatic factors on the frequency of confirmed COVID-19 cases in Iran. Method: Data on confirmed COVID-19 cases and on selected climatic factors for the 31 provinces of Iran, from 4 March 2020 to 5 May 2020, were gathered from official sources. To investigate which climatic factors are important for the frequency of confirmed COVID-19 cases in all studied cities, a model based on an artificial neural network (ANN) was developed. In addition, statistical analysis was used to assess the trend of positive cases against the fluctuations of the climatic factors. Results: The proposed ANN model achieved accuracy rates of 87.25% and 86.4% in the training and testing stages, respectively, for the classification of confirmed COVID-19 cases. Multiple linear regression analysis yielded R² values of 0.40 and 0.68 for the cities of Qom and Ahvaz, respectively. In Ahvaz, the coefficient of determination R² increased despite the increase in temperature. Conclusion: This study showed that as outdoor temperature rises, the use of air conditioning systems to maintain a comfort-zone temperature becomes unavoidable, and the number of positive COVID-19 cases increases accordingly. The study also shows the role of closed air circulation in the indoor environments of tropical cities, the impact of climatic factors on the frequency of positive COVID-19 cases, and the capacity of ANN classification in such surveys.


Background
The COVID-19 pandemic is caused by a beta-coronavirus called SARS-CoV-2, which was first identified in Wuhan, China. The disease is highly prevalent in low-temperature areas of the Northern Hemisphere (1). SARS-CoV-2 is a respiratory virus that is inactivated by 15 minutes of heating at 56 °C and by ultraviolet rays; alkaline (pH > 12) and acidic (pH < 3) environments are also able to eliminate it (2). Temperature and humidity play a major role in the transmission of respiratory viruses.
Many respiratory viruses also have a specific seasonal spread, but there is no clear evidence for a seasonal spread of the coronavirus; in fact, positive COVID-19 cases in Southern Hemisphere regions such as Latin America and Australia have been reported during the summer (3). However, the seasonal emergence of the SARS virus, which belongs to the family of cold viruses, especially in spring and winter, may indicate a seasonal outbreak cycle for the spread of these viruses. Climate change and rising temperatures are affecting the incidence of SARS, which is caused by a virus of the same family as the COVID-19 virus (4)(5)(6). The World Health Organization (WHO) emphasizes that only 4% of COVID-19 outbreaks occur in tropical countries; however, because it is difficult to assess the behavior and structure of the COVID-19 virus accurately, there is no scientific evidence that a rise in temperature affects how long the virus survives. Some recent studies have reported a relationship between changes in ambient temperature or relative humidity and reported positive COVID-19 cases (7)(8)(9)(10)(11).
Prediction models of the relationship between variables are generally based on statistical models. A newer approach is to use machine learning methods to predict changes in the response variables in various problems (12). The artificial neural network (ANN) is one of the most successful data mining methods; it predicts the relationship between phenomena through a series of linked models inspired by the behavior of the human brain, and it has useful capabilities for medical research (13)(14)(15). An ANN is a data processing system that consists of a large number of simple, interconnected processing components, similar to the biological neural system, and it can learn from experimental and real data sets to describe interaction and nonlinear effects with great success (16,17).
The purpose of this study is to provide a statistical analysis to assess the relationship between temperature conditions and the number of cases of coronavirus in Iran. Then, a model based on ANN is presented to predict the incidence of COVID-19 disease according to meteorological factors.

Methods
The data for the present study cover the 31 provinces of Iran from 4 March 2020 to 5 May 2020. Maximum outdoor temperature, minimum outdoor temperature and relative humidity (RH) were used to investigate the relationship between urban ambient temperature and the frequency of positive COVID-19 cases. The initial data set included the air temperature and RH of the center of each province over this period. To investigate the frequency of positive cases in relation to climatic factors in each province more accurately, the following method was used:
1- Instead of relating the daily environmental factors to the frequency of positive cases, these factors were first considered on a weekly basis.
2- The environmental factors were calculated as the minimum, maximum and average outdoor temperature per week, and the average number of patients per week was also calculated.
3- The frequency of positive cases was considered in two categories (above 500 and below 500 people).
To examine the trend in the frequency of positive cases in the center of each province, a multilayer perceptron (MLP) artificial neural network was used (18). With this network, the classification accuracy and the assignment of the center of each province to one of the two case-frequency categories were calculated.
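The three preprocessing steps above can be illustrated with a minimal Python sketch. It is not the authors' code: the record keys (`t_min`, `t_max`, `rh`), the function name `weekly_row`, the way each weekly temperature is aggregated, and the handling of a value of exactly 500 are all assumptions made for illustration. The weekly case figure is passed in by the caller, because the text does not fix how it is computed.

```python
from statistics import mean

CASE_THRESHOLD = 500  # boundary between the two case-frequency categories


def weekly_row(days, weekly_cases):
    """Collapse one week of daily weather records into a single input row.

    `days` is a list of dicts with hypothetical keys "t_min", "t_max"
    (outdoor temperature, deg C) and "rh" (relative humidity, %).
    `weekly_cases` is the weekly case figure for the same province.
    """
    row = {
        # Assumption: weekly minimum/maximum are the extremes over the week.
        "t_min": min(d["t_min"] for d in days),
        "t_max": max(d["t_max"] for d in days),
        # Assumption: daily mean is the midpoint of the daily min and max.
        "t_avg": mean((d["t_min"] + d["t_max"]) / 2 for d in days),
        "rh": mean(d["rh"] for d in days),
    }
    # Assumption: exactly 500 cases falls into the lower category (label 0).
    row["label"] = 1 if weekly_cases > CASE_THRESHOLD else 0
    return row


example_week = [
    {"t_min": 10, "t_max": 20, "rh": 40},
    {"t_min": 12, "t_max": 24, "rh": 60},
]
example_row = weekly_row(example_week, weekly_cases=620)
```

Applied to nine weeks for each of the 31 provincial centers, a function of this kind would produce the 279 labelled cases used by the classifier.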

Neural networks
To perform the classification in the present study, an MLP was implemented in MATLAB 2017. The network is a feedforward net, with the maximum hidden-layer size taking the values 5, 10, 15, 20 and 25. The outline of the neural network is depicted in Figure 1.
Neural networks are composed of a set of processing components called neurons or nodes and can be described as a directed graph in which each node i applies a transfer function according to formula 1:
$$y_i = f_i\left(\sum_{j} w_{ij} x_j - \theta_i\right) \qquad (1)$$
where $y_i$ is the output of node i, $x_j$ is the jth input of the node, $w_{ij}$ is the weight between nodes i and j, and $\theta_i$ is the bias threshold. $f_i$ is a nonlinear function and can be a Gaussian or sigmoid function. The transfer function in the architecture used in this study was of the sigmoid type (19).
In this study, 279 input cases, each containing four predictor variables, were examined. Each case represents the values for one week; each province contributes nine weeks, which gives a total of 31 × 9 = 279 cases in the dataset. In each run, 223 cases (80%) were selected randomly as the training dataset and 56 cases (20%) as the testing dataset.
The neural network was trained with the error backpropagation algorithm. This algorithm uses the steepest descent of the error gradient to adjust the weights and bias thresholds in order to train the network. The updates can be expressed as formulas 2 and 3:
$$w_{ij} \leftarrow w_{ij} - \lambda \frac{\partial E}{\partial w_{ij}} \qquad (2)$$
$$\theta_i \leftarrow \theta_i - \lambda \frac{\partial E}{\partial \theta_i} \qquad (3)$$
where $E$ is the network error and $\lambda$ is the learning rate ($\lambda > 0$) (18). Four models with 20 different types of architecture were examined; they are listed in Table 1. To find the optimum architecture for the ANN models, including the maximum number of hidden layers and neurons, expert knowledge and trial and error were used (20).
Before the classification was performed with the neural network, the relationship between the independent variables (minimum temperature, average temperature, maximum temperature and RH) and the dependent variable (frequency of positive cases) was investigated. For this purpose, a scatter plot, the Pearson correlation coefficient and multiple regression were used. SPSS (version 20) was used to draw the scatter plot.
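The node transfer function, the gradient-based adjustment of weights and bias threshold, and the random 80/20 split described above can be sketched in a few lines of Python. This is an illustrative sketch and not the MATLAB implementation used in the study: it trains a single sigmoid node, assumes a squared-error loss E = ½(y − t)², and all function names, the seed, and the example values are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(weights, inputs, theta):
    """Output of one node: f(sum_j w_j * x_j - theta), with f = sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) - theta)


def gradient_step(weights, theta, inputs, target, lam):
    """One steepest-descent update of a single sigmoid node.

    Assumes E = 0.5 * (y - target) ** 2; `lam` is the learning rate (> 0).
    """
    y = node_output(weights, inputs, theta)
    delta = (y - target) * y * (1.0 - y)        # dE/dz, z = sum(w*x) - theta
    new_weights = [w - lam * delta * x for w, x in zip(weights, inputs)]
    new_theta = theta - lam * (-delta)          # dE/dtheta = -delta
    return new_weights, new_theta


def split_indices(n_cases, train_fraction, seed):
    """Randomly split case indices into training and testing sets."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    n_train = int(n_cases * train_fraction)
    return indices[:n_train], indices[n_train:]


# 31 provinces x 9 weeks = 279 cases, split 80% / 20%.
train_idx, test_idx = split_indices(31 * 9, 0.8, seed=0)
```

A full MLP repeats the same update for every node, propagating the error term backwards from the output layer through the hidden layers.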
Table 1 Types of ANN architecture in four models
No.  Max.  Neurons in each hidden layer
…
8    15    (15 14 14 15 13 13 15 12 11 11)
9    20    (20 19 18 20 17 16 17 17 16 16)
10   25    (25 23 21 22 23 21 23 22 21 21)
11   5     (5 3 5 4 5 5 5 5 4 5 4 5 4 4 3)
12   10    (10 8 9 7 6 6 6 9 9 10 6 6 6 6)
13   15    (15 14 13 14 12 12 11 13 15 12 14 12 12 11 11)
14   20    (20 17 16 18 16 17 19 20 17 17 16 18 17 16 16)
15   25    (25 24 22 24 23 22 24 25 22 22 24 23 22 21 21)
16   5     (5 3 4 4 5 4 5 4 4 5 4 5 4 4 3 4 3 4 3 3)
17   10    (10 8 7 9 6 8 7 6 6 8 7 7 6 6 7 7 8 6 6 6)
18   15    (12 12 13 14 12 12 14 13 13 12 14 13 12 12 12 13 12 15 14 13)
19   …     (16 16 17 18 17 16 18 17 17 16 19 18 16 16 16 16 17 16 …
…

Results
Figure 2 shows the results of the linear relationship analysis between the independent and dependent variables as a matrix scatter plot. It indicates that there is no linear relationship between the frequency of positive cases in the 31 provinces and the independent parameters. It should be noted, however, that the scatter plot shows only the overall relationships in the data and not the details of the relationships in all dimensions (21). Therefore, to examine the exact relationship between environmental factors and the frequency of positive cases, the graphs for each province should be drawn separately. In this study, Khuzestan province was selected as a sample for studying the trend in the frequency of positive cases and environmental factors. In addition to the matrix scatter plot, building a suitable neural network model requires studying the relationships among the independent variables (the input parameters), so the Pearson correlation coefficient was used for this purpose. The Pearson correlation coefficient, also called the product-moment or zero-order correlation coefficient, is used to determine the existence, type and direction of the relationship between two interval or ratio variables, or between an interval variable and a ratio variable (22).
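The Pearson correlation coefficient was calculated from the following equation:
$$r = \frac{\sum_{i} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i} (X_i - \bar{X})^2 \, \sum_{i} (Y_i - \bar{Y})^2}}$$
In this formula, $X_i$ and $Y_i$ are the values of each variable, and $\bar{X}$ and $\bar{Y}$ are their averages.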
The denominator of the fraction is based on the product of the two sums of squared deviations. The closer the Pearson correlation coefficient is to one, the stronger the direct relationship between the two variables. According to Table 2, the obtained correlation coefficients are consistent with the lack of a relationship between the independent and the dependent variables (Figure 2). The correlation coefficients between the frequency of positive cases and the minimum, average and maximum temperatures and RH are -0.021, -0.133, -0.091 and 0.037, respectively, which indicates a weak inverse relationship or no relationship. In contrast, the correlation coefficients among the independent variables indicate a strong relationship between maximum temperature and average temperature. The obtained coefficient is 0.817, which shows a very high dependence between these two variables. Because a correlation coefficient above 0.8 indicates a strong correlation between variables, the input variables were considered to have been selected correctly in this study (23).
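As a worked illustration of the correlation coefficient discussed above, the following Python sketch computes Pearson's r from two equal-length sequences. The function name `pearson_r` and the example values are hypothetical; the study itself used SPSS for this step.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of the products of deviations from the means.
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations.
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical weekly temperatures: the two series move together exactly.
r_direct = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
# The second series falls as the first rises.
r_inverse = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

Values near +1 or -1 correspond to a strong direct or inverse linear relationship, while values near zero, such as those reported between case frequency and the climatic factors, indicate no linear relationship.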

Artificial neural network model
The MLP neural network architecture was implemented in the MATLAB environment according to the topologies described in the Methods section, and the average accuracy over repeated runs is listed in Table 3 for the training and testing stages. The maximum number of hidden layers and the maximum number of neurons in each hidden layer were determined by trial and error. The best average accuracy in the training stage was 87.25%, obtained with model number 19, and the best average accuracy in the testing stage was 86.4%, obtained with models number 10 and 15.
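The average accuracy reported for each architecture can be understood through a short Python sketch: accuracy is the percentage of cases assigned to the correct category, averaged over repeated runs. The function names and the example labels below are hypothetical and do not reproduce the values in Table 3.

```python
from statistics import mean


def accuracy(predicted, actual):
    """Percentage of cases whose predicted category matches the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def average_accuracy(runs):
    """Mean accuracy over several runs, each given as (predicted, actual)."""
    return mean(accuracy(p, a) for p, a in runs)


# Hypothetical labels: 1 = above 500 cases, 0 = below 500 cases.
single_run = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
two_runs = average_accuracy([([1, 1], [1, 1]), ([1, 0], [1, 1])])
```

Reporting the training and testing averages separately, as in the text, helps reveal whether an architecture fits the training cases much better than unseen ones.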

Correlation analysis charts
The relationship between the frequency of positive cases and environmental factors was plotted for Ahvaz, the capital city of Khuzestan province, and is presented in Figure 3. Figure 3 shows that from 4 March to 13 March 2020, coinciding with the initial outbreak of the disease, the number of positive cases increased despite the increase in temperature and RH. A growing trend in the frequency of positive cases was also reported from 1 April to 22 April 2020.

Discussion
In the present study, statistical methods were applied to analyze the effect of weather factors on the frequency of positive COVID-19 cases in Iran. In addition, an ANN was used to classify each city into one of two categories (below 500 and above 500 positive weekly cases). The accuracy of the ANN classification was 87.25% and 86.4% in the training and testing stages, respectively, which indicates high accuracy and supports the use of this method for prediction. However, because the scatter plot pooled the data from all 31 provinces, it did not show a strong relationship between the frequency of positive cases and environmental factors; to achieve accurate results, the data of each city should be examined separately. The comparison between machine learning and statistical methods in this study confirmed that the developed MLP model might be an appropriate method for predicting the frequency of positive COVID-19 cases from daily climatic factors as inputs. The relationship between the predictors and the frequency of positive cases was investigated for the cities of Ahvaz and Qom, and the regression equations for these two cities were obtained. In Ahvaz, the coefficient of determination R² increased despite the increase in temperature, which shows a growth in the frequency of positive COVID-19 cases in that city. According to a study by Poirier et al., temperature and humidity alone cannot indicate the exact trend of coronavirus outbreaks, and further studies are recommended to investigate the effects of environmental factors on coronavirus outbreaks (24). That study may explain the situation in Ahvaz, and it points to other interfering and confounding factors in the study of the frequency of positive cases.
According to the collinearity diagnostics and the overlap among some predictive factors, it is not possible to make a definite statement about the effect of any predictive factor alone. The cause could be the interaction of outdoor temperature with indoor temperature: as outdoor temperature rises, indoor temperature is lowered to maintain the range of comfort and optimal performance. According to the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) standard, this range is 21 to 24 degrees Celsius with an RH of 50% (25).
Studies on the effect of temperature and RH on the persistence of viruses have also stated that low temperature and humidity provide the best conditions for the spread and stability of the virus (26)(27)(28). The evidence showed that, in addition to temperature and RH, population density in the city of Ahvaz, a lack of social distancing, failure to use masks and to follow other health protocols, and participation in ceremonies and close contact with people in the community can accelerate the spread of the virus. The results of the present study have also been confirmed in a review study by Zhan

Conclusion
In the present study, the relationship between urban ambient temperature and the frequency of positive coronavirus cases was investigated through machine learning and statistical analysis. The developed MLP model showed suitable performance in predicting and classifying positive COVID-19 cases. Moreover, the results showed a growth in the frequency of confirmed COVID-19 cases with increasing ambient temperature in the tropical city. Considering the inverse effect of outdoor temperature on indoor temperature (comfort zone), suitable conditions for the spread of the virus are provided in sheltered environments where daily life goes on and the virus finds more hosts. It is recommended that other factors, such as indoor ambient temperature and the effectiveness of the ventilation system, be taken into account in future studies.

Figure 1. Neural network architecture with four input factors, ten neurons and two output categories
Figure 2. The relationship between independent and dependent variables