COVID-19 morbidity and mortality forecast in megalopolis: a data approach to public health management in São Paulo, Brazil

The first wave of COVID-19 made clear worldwide the need for a scientific approach that can support governments and public health authorities in decision making when facing pandemics, epidemics and endemic diseases. This article presents a methodology for applying data mining as a support tool for coping with epidemic diseases. The methodological approach was applied in the city of São Paulo, Brazil, with the aim of predicting the evolution of COVID-19 in the metropolis and identifying air quality and meteorological variables correlated with confirmed cases and deaths from the coronavirus. Forecasting public health conditions is useful for preparing health teams in advance of a pandemic and for preventing the health system from collapsing. The statistical analyses indicated the most important explanatory environmental variables, while the cluster analyses showed which input variables are best for the forecasting models. The forecast models were built with two different algorithms, J48 (C4.5) and CBA, and their results were compared. The models developed can be used to predict new cases and deaths from COVID-19 in São Paulo. The methodological approach can be applied in other cities and to other epidemic diseases.
(isol_avg_index and maximum (drought - without rain), agricultural drought (dry_drought - days without rain) at 10 mm), wind speed (wind_spe - km h⁻¹), minimum dew point (dew_pointMin - °C) and maximum (dew_pointMax - °C), minimum atmospheric pressure (atm_pressMin - hPa) and maximum (atm_pressMax - hPa), potential evapotranspiration (pot_eva - mm d⁻¹), real evapotranspiration (real_evapo - mm d⁻¹), minimum (urMin - %) and maximum (urMax - %) humidity, soil water availability (soil_wat_avail - %), particulate matter in the air (pm25 and pm10), ozone (o3) and nitrogen dioxide (no2).
The air quality variables (pm25, pm10, o3 and no2) are normalized in the format of the air quality index (AQI), for each pollutant, using the United States Environmental Protection Agency (USEPA) standard.
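As an illustration of this normalization, the USEPA index maps a pollutant concentration to an AQI value by linear interpolation between breakpoints. The sketch below covers only 24-hour PM2.5. The function name is hypothetical, and the breakpoint table is an assumption: it holds the values in force in 2020, which the agency has since revised. It is not the procedure used by the data platform cited in this study.

```python
# Hypothetical helper: USEPA piecewise-linear AQI for 24-hour PM2.5.
# Assumption: breakpoints (in µg/m³) are those in force in 2020.
PM25_BREAKPOINTS = [
    # (c_low, c_high, aqi_low, aqi_high)
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_to_aqi(concentration):
    """Convert a PM2.5 concentration (µg/m³) to an AQI value."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    for c_low, c_high, aqi_low, aqi_high in PM25_BREAKPOINTS:
        if concentration <= c_high:
            # Values falling in the small gap between two segments are
            # clamped to the lower bound of the segment above.
            c = max(concentration, c_low)
            slope = (aqi_high - aqi_low) / (c_high - c_low)
            return round(slope * (c - c_low) + aqi_low)
    raise ValueError("concentration is above the AQI scale")
```

Each pollutant has its own breakpoint table, so pm10, o3 and no2 would need analogous tables before the same interpolation could be applied.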

to achieve this hypothetical scenario 1,22. A third challenge has been to understand which social, economic and environmental variables are correlated with viral dynamics 23,24,25,26,27,28. Temperature, for example, has been speculated to be important since the virus emerged: experts pointed out that low temperatures are more conducive to viral transmission, and subsequent studies confirmed this hypothesis 23,24,25,26,27,28. The rapid response of Asian countries was a global exception in a context where most countries acted too late. This shows a lack of global preparedness to face diseases, marked by the absence of an agile, fast and efficient methodology to support managers.
This article proposes a methodological approach that has been used to support decision making in several areas 30,31,32, including, recently, urban health 33. Although this approach has been little used, it is a potential tool in the current health crisis caused by COVID-19 and can contribute to decision making in facing the virus and other epidemic diseases. It makes it possible to identify variables (environmental, climatic, social, etc.) correlated with the disease and to predict its evolution in a coordinated and agile way with a high degree of accuracy. The scientific community has been working intensively to identify variables related to COVID-19 23,24,25,26,27,28 and to predict the evolution of the disease 11,12,13,14,15,16,17,18,19,20,21,22; however, these efforts have taken place separately. This article proposes an approach that integrates prediction with the characterization of explanatory variables and quickly provides important information to managers. The pressing need to improve preparedness for outbreaks of epidemic diseases is a legacy of this first wave of COVID-19. The proposed approach was applied to data from the city of São Paulo, Brazil, the most populous city in the Southern Hemisphere and, with more than 12 million inhabitants, the eighth largest city in the world by population 29. The contributions of this study are to: provide a methodology for using data mining as a tool for public health management in metropolises; identify climatic and air quality variables that are correlated with the number of COVID-19 cases and deaths; present and compare the first forecasting models for COVID-19 based on association rules; and provide forecasting models of new cases and daily deaths from COVID-19 in São Paulo.
Study area context. As of the second half of September 2020, the USA, India and Brazil were, respectively, the 1st, 2nd and 3rd countries in the world by number of COVID-19 cases 34, 36. The virus spreads through the urban fabric; consequently, large cities are the first focus of outbreaks. Brazil has more than 139 thousand deaths caused by the disease and 4.6 million reported cases 35. The city of São Paulo is the epicenter of COVID-19 in Brazil, with more than 285 thousand cases and 12.4 thousand deaths according to official data 36. The Brazilian city with the second highest number of deaths from the virus is Rio de Janeiro, with 10.7 thousand deaths but a much smaller number of confirmed cases, about 99 thousand 36. The significant difference in cases is apparently the result of higher underreporting in the city of Rio de Janeiro. The lethality rate of the virus (proportion of infected people who died) in that city, about 10.7% 36, is the highest in Brazil and supports this hypothesis. The remaining Brazilian cities have far fewer deaths and cases of COVID-19 than São Paulo and Rio de Janeiro. Fortaleza, for example, the city with the third highest number of cases in the country, has about 3,800 deaths and 48,700 confirmed cases 36.
In the city of São Paulo, the total numbers of confirmed COVID-19 cases and deaths in 2020 had, by July, surpassed all cases and deaths from compulsory notification diseases in 2019. The diseases with the highest numbers of cases and deaths in 2019 were, respectively, dengue, with about 17 thousand cases, and severe acute respiratory syndrome (SARS), with 235 deaths 37. These data put the relevance of the new coronavirus into context for the metropolis of São Paulo. Due to low testing and to technical and political problems with the counting of cases in Brazil, underreporting is quite high. Investigations indicate that the number of cases can be up to 12 times greater than that reported 38. The notified cases of SARS in the city of São Paulo at the beginning of July 2020 were already around 22 thousand, about ten times higher than in the whole of 2019, and deaths already exceeded 7 thousand 37, about 30 times more than deaths from the disease in the previous year, which suggests that these cases are probably COVID-19.

Research Methodology
Data acquisition and preparation. The data analyzed for the city of São Paulo cover February 25, 2020 to July 1, 2020. The patterns between climatic, air quality and epidemiological data (related to COVID-19) were analyzed. The epidemiological database is made available in real time by the state health departments and can be accessed directly at its electronic address (https://brasil.io/COVID-19/). Data on the isolation index are also made available in real time by the state government at its own electronic address (https://www.saopaulo.sp.gov.br/coronavirus/isolamento/). Climatic data were obtained from the online platform of the Agrometeorological Monitoring System (Agritempo) of the National Institute of Meteorology (INMET) (https://www.agritempo.gov.br/agritempo/produtos.jsp?siglaUF=SP), and air quality data were collected from the international platform "Air Quality Historical Data Platform" (https://aqicn.org/data-platform/register/), with data provided by CETESB (State of São Paulo Environmental Company). The climatic data are from the INMET automatic meteorological station (23 K 333498.53 m E; 7405721.27 m S), which among the six Agritempo stations in SP has the highest data continuity and number of monitored variables. The air quality data are from the CETESB Parque D. Pedro II automatic air quality station (23 K 333573.00 m E; 7394924.00 m S), which among the 18 air quality stations in the city is in a strategic position: it represents the central region of the city, is relatively close to the chosen weather station and has good data continuity.
In the data preparation stage, in addition to unifying and standardizing the database according to the requirements of the statistical and modeling software, the data were discretized, since the modeling software requires it. Discretization, which consists of transforming a continuous variable into a categorical one, was done using the statistical tertiles (low / 0-33%, medium / 34-66% and high / 67-100%) of each variable.
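The tertile discretization described above can be sketched as follows. This is a minimal illustration using only the Python standard library. The function name, the choice of the "inclusive" quantile method and the sample series are assumptions, not the procedure of the software used in the study.

```python
from statistics import quantiles


def tertile_discretize(values):
    """Map each value to 'low', 'medium' or 'high' using the series' own tertiles."""
    # quantiles(..., n=3) returns the two cut points that split the data in thirds.
    lower_cut, upper_cut = quantiles(values, n=3, method="inclusive")
    labels = []
    for value in values:
        if value <= lower_cut:
            labels.append("low")
        elif value <= upper_cut:
            labels.append("medium")
        else:
            labels.append("high")
    return labels


# Made-up daily counts, for illustration only.
daily_cases = [1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = tertile_discretize(daily_cases)
```

Applying the same function to every column gives each variable its own low, medium and high intervals, which is what both modeling tools then receive as input.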
Multivariate analyses. Statistical analysis and data clustering serve to characterize the database and identify patterns of association between variables 32. This step, in addition to guiding the development of the models 31, can indicate the most important variables from the management point of view. Four analyses were performed (linear correlation, factor analysis, similarity dendrograms and k-means) using the Statistica software (developed by StatSoft). For factor analysis and linear correlation, strong correlations were considered to be positive or negative values with magnitude greater than or equal to 0.6 39. The similarity dendrograms were constructed from the Euclidean distance.
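To make the 0.6 criterion concrete, the sketch below computes a Pearson linear correlation coefficient and flags it as strong when its magnitude reaches the threshold. The function names and the two short series are hypothetical. The study itself used the Statistica software, so this is only an illustration of the rule, not a reproduction of the analysis.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


def is_strong(r, threshold=0.6):
    """A correlation counts as strong if |r| >= threshold, whatever its sign."""
    return abs(r) >= threshold


# Made-up values: temperature falls while new cases rise.
temperature = [24.0, 22.5, 21.0, 19.5, 18.0]
new_cases = [100, 180, 260, 330, 420]
r = pearson_r(temperature, new_cases)
```

In this made-up example r is negative and close to -1, so the pair would be reported as a strong inverse correlation.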
Data modeling. The modeling step consisted of developing predictive models for deaths and new cases on seven consecutive days (t+1, t+2, t+3, t+4, t+5, t+6, t+7). The two tools used in this stage were CBA (Classification Based on Associations), from the School of Computing, National University of Singapore 40, and J48, an open implementation of the C4.5 algorithm 41,42 in the Waikato Environment for Knowledge Analysis (Weka) tool developed by the University of Waikato, New Zealand 43. These two modeling tools have similar principles, as they use association rules to generate a classifier. However, the essential difference is that J48 is purely a classifier, expressed as a decision tree (a set of rules that classify the target variable) 41. Its algorithm, C4.5, is one of the most important and widespread in the field of data mining 42. Because it is a classifier, however, a predetermined target is needed to generate the predictive model. CBA, in contrast, integrates classification and association rules, which allows the patterns existing in the database, the association rules, to also guide the construction of classifiers 40. CBA only works with discrete intervals, while the J48 decision tree can work with continuous data, although with a significant reduction in accuracy, which can reach up to 20% in these cases 44. Therefore, J48 was also run on the same categorical intervals developed for CBA.
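One way to prepare targets for the seven forecast horizons is to pair each day's value with the values observed one to seven days later. The sketch below assumes this layout. The function name, the column labels and the sample series are hypothetical and are not taken from the study's database.

```python
def build_horizon_targets(series, max_horizon=7):
    """For each day t, pair the value at t with the values at t+1 .. t+max_horizon."""
    rows = []
    # The last max_horizon days have no complete set of future values.
    for t in range(len(series) - max_horizon):
        row = {"t": series[t]}
        for h in range(1, max_horizon + 1):
            row[f"t+{h}"] = series[t + h]
        rows.append(row)
    return rows


# Made-up daily new-case counts, for illustration only.
new_cases = [5, 8, 13, 21, 34, 55, 89, 144, 233, 377]
rows = build_horizon_targets(new_cases)
```

Each "t+h" column would then be discretized and used as the target of a separate model, giving seven models for new cases and seven for deaths.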

The classification rules have the format IF (A) THEN (B), in which an interval (categorical variable) or a set of intervals is used to predict another. The rules therefore express an antecedent value (A) and a consequent value (B), for example IF A > 1 THEN B > 3. Each rule has a support (S%), which corresponds to the percentage of records, relative to all records in the database, in which A and B were classified correctly. Accuracy, or reliability, indicates the percentage of records for which the rule's forecast was correct. The confusion matrix is a table that presents the classification frequencies for each class of the model (true positive, false positive, true negative and false negative). The classifier's accuracy is the sum of the main diagonal of the matrix (correct classifications), divided by the total of the values and multiplied by 100.
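These definitions can be illustrated with a short sketch. Support is taken as the share of all records in which both A and B hold, and rule accuracy as the share of records matching A in which B also holds. The function names, the variable name temp_min and the records are hypothetical. The code is not the CBA or Weka implementation.

```python
def rule_support_confidence(records, antecedent, consequent):
    """Return (support %, accuracy %) of the rule IF antecedent THEN consequent."""
    matches_a = [
        r for r in records
        if all(r.get(key) == value for key, value in antecedent.items())
    ]
    matches_ab = [
        r for r in matches_a
        if all(r.get(key) == value for key, value in consequent.items())
    ]
    support = 100.0 * len(matches_ab) / len(records)
    accuracy = 100.0 * len(matches_ab) / len(matches_a) if matches_a else 0.0
    return support, accuracy


def accuracy_from_confusion(matrix):
    """Classifier accuracy: main diagonal sum over total, times 100."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Made-up discretized records, for illustration only.
records = [
    {"temp_min": "low", "NC1": "high"},
    {"temp_min": "low", "NC1": "high"},
    {"temp_min": "low", "NC1": "medium"},
    {"temp_min": "high", "NC1": "low"},
]
support, accuracy = rule_support_confidence(
    records, {"temp_min": "low"}, {"NC1": "high"}
)

# Rows are the true classes and columns the predicted classes (low, medium, high).
confusion = [[8, 1, 1], [2, 7, 1], [0, 1, 9]]
model_accuracy = accuracy_from_confusion(confusion)
```

In this made-up example the rule covers two of the four records (support of 50%) and is right in two of the three records where the antecedent holds, while the three-class confusion matrix gives a classifier accuracy of 80%.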

Results
Methodological approach.
The data approach proposed by this research to support public health in coping with pandemics, epidemics and endemic diseases can be seen in the sequential diagram in Figure 1. Association rules are also useful tools for discovering patterns between the variables involved and can be used, if necessary, when those patterns have not been revealed by statistical analysis. In applying this approach in São Paulo, the statistical analyses were conclusive and revealed the associations between the variables through factor analysis and linear correlation. The analysis of the input variables of the models can also be supported by the association rules if the data cluster analysis (k-means and dendrograms) is not clear.

Relevant variables.
The linear correlation coefficients showed a correlation between cases of COVID-19 and climatic variables (Table 1). The same did not happen with air quality variables.
Air temperature showed an inversely proportional correlation with new confirmed cases and deaths caused by the virus, in line with several studies that have already been carried out. However, it was the climatic variables related to humidity that were most prominent in this statistical analysis. Agricultural drought and the total amount of water available in the soil were the main highlights, but actual evapotranspiration also proved to be important.
In line with the linear correlation coefficients, the PCA showed that exactly the same climatic variables are grouped in the principal component that contains the new confirmed cases and deaths from COVID-19 (Table 1). The factor loadings of these variables also have the greatest correlation with the principal components, as can be seen in the values highlighted in gray in Table 1. Therefore, factor 1 corresponds to the principal component of the COVID-19 variables, and in this component the most important climatic variables have significant factor loadings, greater than or equal to 0.6. The principal component that corresponds to factor 2 is represented by climatic and air quality variables.
The two cluster analyses, in addition to complementing the characterization of the data begun with the statistical analyses, can indicate which input variables are best for generating the models. In the similarity dendrogram, constructed with the Euclidean distance between the variables, it is possible to identify 4 clusters, represented by the dashed lines in Figure 2. Cluster 3, which groups total deaths, minimum and maximum atmospheric pressure, new cases (NC) and the forecasts of confirmed cases for the seven consecutive days (NC1, NC2, NC3, NC4, NC5, NC6 and NC7), shows the input variables for the NC forecasting models. However, the variables deaths (ND), ND1, ND2, ND3, ND4, ND5, ND6 and ND7, which are focal variables of interest, fall in a grouping of the dendrogram (cluster 4) that is difficult to separate visually. In the k-means method, which is iterative and which classifies the distance between variables in constant spaces, the variables were divided into five groups instead of four.
Thus, the variables in cluster 4 (Figure 2) were divided into two groups: one formed by pm25, maximum relative humidity, isolation index, new deaths (ND) and the death forecasts for the seven days (ND1, ND2, ND3, ND4, ND5, ND6, ND7), and another formed by the remaining variables. With the k-means method, groupings 1, 2 and 3 of the dendrogram (Figure 2) remained identical. In this way, it was possible to identify that, to predict new confirmed cases, the model's input variables may include, in addition to NC, the total number of deaths and the minimum and maximum atmospheric pressure. Likewise, to predict deaths from COVID-19, the isolation index, maximum relative humidity and pm25 can be used as input variables in addition to ND.
Predictive models. Figure 3 shows the accuracy of the two modeling tools used to predict new cases and deaths from COVID-19 on seven consecutive days in São Paulo. The forecasts identify the NC and ND of the following days based on the categorical intervals of the tertiles. Thus it is possible to know, for the next few days, whether the number of new cases (patients) and deaths will fall in the low (33%), medium (66%) or high (100%) tertile (NC: 115, 650 and 3500 patients; ND: 10, 45 and 150 deaths). Overall, CBA's performance is slightly higher than J48's. To predict NC, the accuracy of the two models is the same on the 4th day, 85%, while on the other days CBA performs better. To predict ND, the two modeling tools have equivalent accuracy, 89%, only on the 5th day.
The CBA generated 166 classifiers for predicting deaths within a week and 152 classifiers for predicting new cases of COVID-19. These rules were assessed for their support, accuracy and the environmental and epidemiological coherence of their relationships. The choice of classifiers also prioritized those that combined variables related to COVID-19 with environmental variables (climatic and air quality), especially the input variables pointed out by the cluster analysis (atmospheric pressure, relative humidity, isolation index and pm25).
In selecting the rules in Table 2, a diversity of output intervals was sought, so that the three tertiles (low, medium and high) were well represented. A total of thirty-eight models were selected to predict new cases and deaths up to seven days ahead (Table 2). All selected predictive rules can be used as a decision support tool by managers and authorities in the city of São Paulo. Fourteen decision trees were generated by J48, seven for new cases and seven for new deaths, up to seven days ahead. Considering the accuracy, support and consistency of the classification rules, two decision trees were selected, one to predict new cases, NC1, and one to predict deaths, ND1 (Table 3). The support of each rule that makes up the decision tree can be seen in parentheses after the output interval of each rule.

Discussion
The proposed methodology (Figure 1) can be used to face the COVID-19 pandemic and other important diseases such as dengue, malaria and different strains of influenza. This approach is agile and efficient and can quickly support the response to these diseases, with an integrated perspective capable of covering the main issues that have been the subject of quantitative research on COVID-19. It can therefore be an important support tool for public health authorities and governments. The reliability of the support and forecasts that the approach provides is proportional to the reliability of the data. Therefore, comprehensive tracking initiatives and regular, authentic notifications are essential.
The air temperature showed an inversely proportional correlation with new cases and deaths from COVID-19, in line with the several studies that have tried to understand the correlation of these variables 23,24,25,26,27,28. The minimum and average temperature showed linear correlation coefficients with absolute values greater than 0.60 with the numbers of deaths and new daily cases of coronavirus in São Paulo (Table 1). This indicates that more people are affected by COVID-19 when temperatures are lower, the same pattern as for other respiratory diseases. Agricultural drought (number of days without precipitation greater than 10 mm) and the percentage of water available in the soil also showed significant linear correlation coefficients (absolute values greater than 0.60, with p < 0.05) with the epidemiological variables of COVID-19 (Table 1). However, water availability in the soil was inversely correlated with the epidemiological variables, while agricultural drought was directly correlated with them. The relationship of these two meteorological variables with the epidemiological variables indicates that a dry climate is a factor that exacerbates the number of COVID-19 infections. This can be explained by the dryness of the airways, which compromises the nasal function of preventing viruses and bacteria from entering the human body. Managers need to consider these variables when making decisions, since the total numbers of deaths and confirmed cases had the best correlations identified in this research with agricultural drought (0.98 and 0.97, respectively) and with the availability of water in the soil (-0.90 and -0.89, respectively).
In the same direction, real evapotranspiration presented a high and inversely proportional correlation coefficient with the total number of deaths from the coronavirus. This indicates that the lower the actual evapotranspiration (loss of water by evaporation from the soil and transpiration of plants), the greater the total number of deaths from the virus. It is difficult to infer the reason for this correlation because of the complexity of the factors involved in the evapotranspiration process (solar radiation, wind, temperature and humidity), but a plausible explanation is temperature: the higher the temperature, the greater the capacity of the air to hold water vapor. Thus, the lower the temperature, the lower the evapotranspiration and the greater the number of deaths from COVID-19, considering that the total number of deaths was also inversely proportional to the average temperature (Table 1). The same meteorological variables that presented significant linear correlation coefficients (absolute values greater than 0.60, with p < 0.05) with the epidemiological variables of interest were those that presented correlations greater than 0.60 with factor 1 in the factor analysis (Table 1).
Therefore, it is these meteorological variables (average and minimum temperature, agricultural drought, maximum dew point, real evapotranspiration and availability of water in the soil) that need to be monitored and studied to understand the dynamics of COVID-19 in megacities from a modeling perspective.
The data cluster analyses made it possible to identify other environmental variables that also need to be measured. The total number of deaths and atmospheric pressure are the best input variables for forecasting models of new confirmed cases, while the isolation index, relative humidity and pm25 are the best input variables for predictive models of new deaths (Figure 2). Tables 2 and 3 show that these variables were in fact recurrently selected by the algorithms as input variables of the models. This occurred mainly for new confirmed cases: all the J48 rules used the total of confirmed cases as input, and eight of the twenty selected CBA rules used the total of confirmed cases or atmospheric pressure.
The predictive models generated by the two algorithms performed similarly in terms of accuracy, but the support of most J48 decision tree rules was below the statistically representative value (greater than 8%) needed to validate them. Therefore, only the CBA models are recommended for the case of São Paulo. For future studies, however, it is recommended to reuse these two modeling tools comparatively and to use other data mining algorithms in an exploratory way. Discretization using quartiles is another possibility that other studies can test to verify the behavior of the results.
Uncertainty about future outbreaks is a problem in recurrent epidemics and pandemics, especially in developing countries with few resources and poor health systems. An accurate predictive tool, capable of anticipating the levels of hospitalizations and deaths, can be useful for health managers.
This work elucidates the temporal patterns of morbidity and mortality from COVID-19 in São Paulo, a Brazilian megalopolis. A history of confirmed cases and deaths was organized together with meteorological and air quality variables. Multivariate analyses were performed to understand the relationships between the variables involved. Predictive models with high and satisfactory accuracy were built to predict morbidity and mortality.
Forecasting public health conditions is useful for preparing health teams in advance of an outbreak and for preventing the system from collapsing. In addition, prior information can optimize the resources invested in COVID-19 or in outbreaks of other urban diseases.
Figure 1 Sequential diagram of the proposed data approach for public health management.
Figure 3 Comparison of the accuracy of the CBA and J48 models.