The quest for better machine learning models to forecast COVID-19-related infections: A case study in the state of Pará-Brazil

COVID-19 disease has become an unprecedented public health crisis. Although a relatively small percentage of infected people require intensive care, the high degree of contagion of the disease quickly overwhelms public health systems. Due to the highly complex nature of this disease and the variation in its behavior depending on the characteristics of each geographic region, in this work we analyze data from the Amazon region of Brazil (the state of Pará). We applied several machine learning models to forecast the contagion curve up to 10 days ahead. The Linear SVM and Multilayer Perceptron models presented the best overall performances. Until a vaccine is available, every effort is needed to understand and anticipate this disease.


Introduction
COVID-19 disease has become an unprecedented public health crisis. Due to its speed of proliferation, the totals of confirmed cases and deaths increase every day. According to the World Health Organization (WHO) situation report 122 1 , as of May 21, 2020, the worldwide totals of confirmed cases and deaths were 4,893,186 and 323,256, respectively. Comparing these data with the data released in the first WHO situation report 2 , the number of deaths increased by 32,325,500% from January to May. With the exponential growth of the number of cases and deaths, Brazil is becoming the new epicenter of COVID-19. These data are extremely worrying and, for this reason, research efforts and public measures are being adopted to combat this pandemic. While a vaccine for this disease has not been found, one of the most efficient ways to mitigate COVID-19 is through information. Informing people about the importance of social isolation and preventive measures is essential to reduce the number of deaths. Besides, COVID-19-related data can be analyzed by information systems to better understand the pattern of viral spread. For example, the growing trend in the number of confirmed cases and deaths can be analyzed by information systems that apply machine learning techniques. The predictions made by machine learning models can provide important information for strategic decisions by the authorities, for example, to determine the construction of hospitals, the acquisition of equipment, the loosening or restriction of social isolation, or even the imposition of a lockdown.
In this context, some studies have proposed mathematical and machine learning models to predict the trend of the growth curve in the number of confirmed cases and deaths due to COVID-19. Works such as [3][4][5][6][7] are examples of research that uses machine learning (ML) models along these lines. From the mentioned works, due to the highly complex nature of the COVID-19 disease and the variation in its behavior from nation to nation, we note that two questions are central to the models' accuracy: (a) the geographic context of the data, i.e., different countries and states show different patterns in the COVID-19 outbreak; (b) the prediction window size in number of days, i.e., shorter prediction windows yield better accuracy. Considering these two factors, we understand that it is extremely relevant to further investigate ML models to predict COVID-19-related cases and deaths in specific regions of Brazil. In this work, we investigated the pattern of infection in the state of Pará. According to IBGE data 8 , Pará is a state of Brazil, located in the Amazon Rainforest, that has 144 counties distributed over a large area of 1,247,955 km 2 , approximately 8 million inhabitants, and 13,720 hospital beds between public and private facilities, of which 580 are ICU beds. As of May 21, 2020, the state of Pará had 20,537 confirmed cases and 1,893 deaths due to COVID-19 9 . In a statement on April 21, 2020, authorities said that 90% of the hospital beds in the state of Pará were already occupied. This news is alarming and highlights the idea that an information system such as the one proposed in this work is promising to assist future public health care interventions.
This work aims to compare different machine learning models for forecasting COVID-19 infection cases in the state of Pará, Brazil. Considering the collected data, we also analyze the ideal prediction window size. In summary, this work evaluates different prediction windows Ω and the generalization capacity of machine learning models Ψ in the task of estimating the number of infected due to COVID-19 in the state of Pará, Φ, i.e., we evaluate how Φ can be solved by Ψ(Ω) and also the influence of each ω ∈ Ω for every ψ ∈ Ψ. For the Ω set, we consider windows of up to 10 days, i.e., given the values of the attributes for a given day N, we build the target value to predict the total number of infected on day N + d, where d goes from 1 to 10.
Regarding the Ψ set, we consider six machine learning models and two statistical models, namely: Linear Regression, Linear SVM, Random Forest, Gradient Boosting, Convolutional Neural Network, Multilayer Perceptron, AutoRegression, and Prophet. Among these machine learning models, we have three classes of algorithms: linear models (Linear Regression and Linear SVM), ensemble models (Random Forest and Gradient Boosting), and neural network models (Convolutional Neural Network and Multilayer Perceptron). Ensemble and deep learning models were taken into account mainly because the state of the art shows that these models are the best techniques to solve most problems where machine learning can be employed. Linear models were applied on the grounds of Occam's razor principle 10 , which states that the explanation of any phenomenon must assume the smallest possible number of premises. In machine learning, this often means that when faced with two algorithms with the same training performance and testing capacity, the simpler model will probably be the best choice. Taking into account the approach described, we consider that our work has three central contributions: (1) the construction of efficient models to predict the number of infected due to COVID-19 in the state of Pará; (2) the performance evaluation of the models for different prediction windows; (3) the comparison of different classes of machine learning models to apply Occam's razor principle. We use the metrics RMSE, RMSRE, MAE, MAPE, and R² to examine the performance of the models. The results show that Occam's razor principle could be applied and that, although ensemble and deep learning models are the best techniques to solve most problems where machine learning can be employed, there are exceptions.
The remainder of the article is structured in the following way. In Section 2, there is a brief review of the literature on the most common ML employed to forecast COVID-19 outbreak. In Section 3, we describe how the data collection was carried out, and the database was formed. The designing and analysis of the models are carried out in Section 4. Finally, Section 5 summarizes the main conclusions of the research study.

Related Works
Machine learning is the field of artificial intelligence whose objective is the development of computational algorithms capable of transforming experience into expertise 11 . In other words, machine learning algorithms aim to map patterns from an input domain to an output domain and subsequently recognize patterns from the output domain, even those not exemplified during training. Given this ability to generalize, ML techniques are applied in many areas to solve classification or prediction problems. For example, in the COVID-19 outbreak, ML techniques can be used to understand the pattern of viral spread, improve diagnostic accuracy, develop novel effective therapeutic approaches, and identify the most susceptible people based on personalized genetic and physiological characteristics 12 . In the review conducted by Bullock et al. 13 , the authors present an overview of recent studies using machine learning and, more broadly, artificial intelligence to tackle many aspects of the COVID-19 crisis at different scales, including molecular, clinical, and societal applications. In total, the review covers 20 application areas of machine learning and 82 works that developed applications.
According to the applications cited by Bullock et al. 13 , our work is in the category Societal scale - Epidemiology and infodemiology. Research on this subject seeks to understand how the virus is transmitted and its likely effect on different demographics and geographic locations. Such studies are therefore crucial for public health policy interventions. In this type of problem, many well-established classical models, such as susceptible-infected-recovered (SIR) models, are fine-tuned to the COVID-19 situation 14 . Moreover, most ML applications developed for epidemiological modeling have presented promising results in forecasting national and local statistics such as the total number of confirmed cases, mortality, and recovery rates. In this context, we correlate our work with research that developed ML models with a methodology similar to the one presented here. Table 1 shows studies that have developed machine learning models in the context of forecasting the COVID-19 global pandemic.
Ardabili et al. 3 showed promising results when using a multilayer perceptron (MLP) and an adaptive network-based fuzzy inference system (ANFIS). In that work, they collected data on total cases over 30 days from five countries: Italy, Germany, Iran, the USA, and China. The evaluation was conducted using the root mean square error (RMSE) and the correlation coefficient. The results show that the accuracy of the developed models is better than that of classic mathematical models such as logistic and linear ones.
Huang et al. 4 proposed a Convolutional Neural Network (CNN) to analyze and predict the number of confirmed cases in China. In this study, the input data were obtained from the Surging News Network and the WHO. To compare the overall efficacy of the different algorithms, the indicators of mean absolute error (MAE) and root mean square error (RMSE) were applied in the experiment. The results show that the solution they proposed achieves high predictive precision, even using small datasets.
Al-qaness et al. 5 conducted a case study in China to explore the COVID-19 prediction problem over a 10-day horizon. They developed an improved adaptive neuro-fuzzy inference system (ANFIS) using an enhanced flower pollination algorithm (FPA) and the salp swarm algorithm (SSA). For training and evaluation of the models, they used official data from the World Health Organization (WHO). To evaluate the performance of the models, they used the metrics mean absolute percentage error (MAPE), root mean squared relative error (RMSRE), and the coefficient of determination (R²). In comparison with different ML methods, the study shows that the proposed model has a high ability to forecast the COVID-19 dataset.
Ceylan 6 developed an Auto-Regressive Integrated Moving Average (ARIMA) model to predict the epidemiological trend of COVID-19. They took into account the cases reported in Italy, Spain, and France. In this study, they used official data released by the World Health Organization, from February 2020 to 15 April 2020. To evaluate the performance of the models, they used the metrics root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Considering the analysis of the results, they argue that the ARIMA model performs well in predicting future cases of COVID-19.
Dutta and Bandyopadhyay 7 proposed a deep learning neural network to predict confirmed, negative, released, and deceased COVID-19 cases. In this study, the authors compare the performance of Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models. To conduct the training and evaluation of the models, they used a public dataset from the Kaggle repository containing reported cases of COVID-19 from 20th January 2020 to 12th March 2020. To evaluate the performance of the models, they used root mean square error (RMSE) and accuracy. According to the analysis of the results, the predictions of the models are compatible with the results predicted by clinical doctors, and the proposed deep learning models have a high prediction rate for disease identification.
Looking at the related works, we can see that neural networks and deep learning techniques are being used to solve the forecasting problem related to COVID-19. In addition, the evaluation metrics and the analysis of the results are performed in a similar way in all the works. Considering this scenario, our work aims to compare different machine learning models for forecasting COVID-19 infection cases in the state of Pará, Brazil. We also analyze the ideal prediction window size, and we use the metrics RMSE, RMSRE, MAE, MAPE, and R² to analyze the performance of the models.

Building dataset and choosing the models
COVID-19 cases are reported daily by the public health department of the state of Pará. Thus, we built our dataset by collecting the cases reported by this department 9 . Data were collected from March 17, 2020 to May 21, 2020. In total, we collected 66 instances containing the following attributes: total confirmed cases, total deaths, total discarded cases, and total recovered cases. Figure 1 shows the behavior of the collected data over the days. In the last ten days, we can see considerable growth in the number of confirmed cases, recoveries, and deaths. However, as the number of confirmed cases is greater than the number of people recovered, the situation is still worrying. Analyzing Figure 1, it may not be possible to notice the growth in the number of deaths. However, when we analyze Figure 2, it is possible to observe that the number of deaths also increased in the last ten days. This behavior reinforces the hypothesis that COVID-19-related problems tend to increase in the state of Pará. Considering the information collected, we built our dataset with 8 attributes, as shown in Table 2. To build the target variable, we re-frame the time series dataset as a supervised learning problem using the sliding window method. That is, given the values of the attributes for a given day N, we build the target value to predict the total number of infected on day N + d, where d goes from 1 to 10. Altogether, we have 10 different target values, which imply 10 rounds of training and analysis of the models.
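The sliding-window re-framing described above can be sketched in a few lines of pandas; the column name and values below are illustrative stand-ins, not the real Pará data:

```python
import pandas as pd

# Hypothetical cumulative totals; the real dataset uses the attributes in Table 2.
df = pd.DataFrame({
    "confirmed": [10, 15, 23, 40, 66, 91, 130, 180, 250, 310, 400, 500],
})

def make_supervised(df, target_col, d):
    """Shift the target d days ahead so that day N's attributes
    predict the total confirmed on day N + d."""
    out = df.copy()
    out["target"] = out[target_col].shift(-d)
    return out.dropna()  # the last d rows have no future value to predict

windowed = make_supervised(df, "confirmed", d=3)
print(len(windowed))               # 12 - 3 = 9 usable instances
print(windowed["target"].iloc[0])  # day 0 predicts day 3's total: 40.0
```

Repeating this for d = 1, ..., 10 yields the 10 target variables mentioned above, one per prediction window.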
To build the most appropriate machine learning models, we analyzed the linear correlation of the attributes of our dataset (see Figure 3). As we can see in the heatmap, many attributes are linearly dependent on our target attribute (A). This behavior suggests that linear machine learning models will be more efficient in solving our problem. Although the state of the art shows that ensemble and deep learning models are the best techniques to solve most problems where machine learning can be employed, there are exceptions. In fact, the best ML model depends on the patterns of the input data. In our case, due to the correlation of the data, we assume the hypothesis that linear models are better than ensemble and deep learning models.
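The linear-correlation check behind Figure 3 amounts to computing a Pearson correlation matrix; the sketch below uses hypothetical cumulative totals, not the collected dataset:

```python
import pandas as pd

# Toy daily cumulative totals (hypothetical values; the real data come
# from the Pará health department panel).
df = pd.DataFrame({
    "confirmed": [10, 23, 66, 130, 250, 400],
    "deaths":    [0, 1, 3, 8, 15, 29],
    "recovered": [0, 2, 10, 30, 80, 150],
})

# Pearson correlation matrix, as rendered in the heatmap of Figure 3.
# Cumulative epidemic series are all monotonically increasing, so the
# off-diagonal correlations come out strongly positive.
corr = df.corr()
print(corr.loc["confirmed", "deaths"])
```

High pairwise correlations like these are what motivate the hypothesis that linear models can compete with the ensembles here.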
To test our hypothesis, we evaluated the performance of two linear models (Linear Regression and Linear SVM), two ensemble models (Random Forest and Gradient Boosting), and two neural network models (Convolutional Neural Network and Multilayer Perceptron). In addition, we also compared these machine learning models with two statistical prediction models: the AutoRegression and Prophet models 15 .

Modeling and evaluation
To model the ML models, we built a pipeline with grid search and cross-validation techniques. In machine learning, we can configure hyperparameters manually from the designer's experience, using heuristics, or using brute-force techniques such as grid search. In grid search, we must define the hyperparameters and the range of values that are combined when adjusting the model 16 . In summary, in the grid search technique, each combination of values represents a model that must be validated. In this experiment, we define the range of values of the hyperparameters to verify which configuration provides the best generalization capacity for the models.
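A minimal version of such a pipeline can be sketched with scikit-learn, assuming a Linear SVM and a C grid in the spirit of Table 3 (the data and ranges here are illustrative, and `TimeSeriesSplit` stands in for the walk-forward scheme):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import LinearSVR

# Toy trend data standing in for the Pará series (hypothetical values).
X = np.arange(30, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 5.0

# TimeSeriesSplit keeps chronological order: each fold trains only on
# observations that precede its test block.
cv = TimeSeriesSplit(n_splits=5)
grid = GridSearchCV(
    LinearSVR(max_iter=10000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # range style of Table 3
    cv=cv,
    scoring="neg_mean_absolute_error",
)
grid.fit(X, y)
print(grid.best_params_["C"])
```

Every combination in `param_grid` is fitted and scored on each chronological fold, and `best_params_` retains the winning configuration, mirroring the selection reported in Table 4.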
Regarding cross-validation, we use it to evaluate the models and maximize the number of validation samples. The idea of cross-validation is that each sample of the data set is used to test the model's performance. Due to the nature of the data we are using, we apply the time series cross-validation known as walk-forward cross-validation. According to Hyndman and Athanasopoulos 17 , in this procedure there is a series of test sets, each consisting of a single observation. The corresponding training set consists only of observations that occurred before the observation that forms the test set. Thus, no future observations can be used in constructing the forecast. This approach is interesting because it allows the model to be evaluated with the largest number of samples and eliminates the chronological-order problem caused by other cross-validation methods, such as k-fold cross-validation. Formally, we can define walk-forward cross-validation as an iterative assessment that works as follows: let {t_1, t_2, ..., t_N} be a time series; the training and test sets are given by training = {t_1, t_2, ..., t_{k+i}} and test = {t_{k+i+w}}, ∀ 0 ≤ i ≤ N − k − w, where k is the initial size of the training set and w is the prediction window.

Our methodology consists of four steps:

Step 1: split the data set into training and testing subsets.
Step 2: application of grid search and cross-validation to train and select the hyperparameters of the models.
Step 3: evaluation of the models with the test subset.
Step 4: hypothesis test to verify the statistical significance of the best models evaluated.

Figure 4 describes the methodology that we use for the construction and evaluation of the prediction models. We use this methodology for each prediction window. In step 1, we split the data into training and test sets. The test set was built to test the prediction performance of the models for the last ten days. The set's shape depends on the prediction window to be tested. For example, when the window is N + 1, we use the data collected on May 1st to predict the number of cases on May 2nd, the data of May 2nd to predict the cases of May 3rd, and so on. When the window is N + 10, we use the data collected on April 29th to predict the number of cases on May 9th, the data of April 30th to predict the cases of May 10th, and so on. This split procedure was used for each N + d window, where 1 ≤ d ≤ 10.
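The walk-forward splitting scheme formalized earlier (train on t_1..t_{k+i}, test on the single point t_{k+i+w}) can be sketched as a small generator, using 0-based indices and the n, k, and w symbols from the text:

```python
def walk_forward_splits(n, k, w):
    """Yield (train_indices, test_index) pairs following the formal
    definition: train on t_1..t_{k+i}, test on the single observation
    t_{k+i+w}, for 0 <= i <= n - k - w (indices here are 0-based)."""
    for i in range(n - k - w + 1):
        train = list(range(k + i))  # t_1 .. t_{k+i}
        test = k + i + w - 1        # t_{k+i+w}, shifted to 0-based
        yield train, test

# 66 collected instances; k = 20 is an assumed initial training size.
splits = list(walk_forward_splits(n=66, k=20, w=10))
print(len(splits))                     # 66 - 20 - 10 + 1 = 37 evaluations
first_train, first_test = splits[0]
print(len(first_train), first_test)    # 20 training points, test index 29
```

Each iteration grows the training set by one day and tests a single future point, so no future observation ever leaks into training.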
In step 2, we perform the pipeline for each machine learning model. In this step, we define the range of values for each hyperparameter and provide it as input to the pipeline together with the training data and the model. The grid search is performed for each hyperparameter combination and evaluated with walk-forward cross-validation. At the end of the pipeline, for each ML model, we have the best combination of hyperparameters, as well as the cross-validation performance. In step 3, we use the best configuration for each model and test its generalization capacity. We use the data from the test set and evaluate the results with the metrics RMSE, RMSRE, MAE, MAPE, and R². In this step, the AutoRegressive and Prophet statistical models were also evaluated. Finally, in step 4, we conducted a hypothesis test to check the equivalence between the models. In this test, the objective is to verify whether the simplest models are equivalent to the more complex ones. If so, we can apply Occam's razor principle and choose the simplest models to solve the problem.

Table 3 shows the configuration parameters and the range of values that we pre-determined for each model. The Linear Regression model was the only one that we did not submit to the grid search because it does not have hyperparameters; its parameters are all adjusted during supervised training.

In the SVM model, we tuned the C parameter. A large C value makes the model use a smaller-margin hyperplane, while a small C value makes the model look for a larger-margin separating hyperplane. In Table 3, we configure the C range to test both tightly fitted and smoother hyperplanes. As noted in Table 4, for all prediction windows, the grid search selected small values for C. These small values indicate that the hyperplanes of the SVM models are not overfitted and, for this reason, the resulting models can achieve better generalization capacity.

In the Random Forest model, we configured 4 hyperparameters. The N estimators parameter determines the number of trees in the forest: the higher its value, the larger the ensemble. In the Criterion parameter, we configure the function that measures the quality of the splits performed during the construction of the trees. As this is a regression problem, we selected the MAE and MSE criteria. Max depth and Min sample split are parameters used to determine the tree configuration. In Table 3, we configured the hyperparameters to search for forests with both shallow and deep trees; however, as noted in Table 4, the grid search selected trees of intermediate depth.
In the Gradient Boosting models, the N estimators parameter defines the number of boosting stages to perform. Boosting models are fairly robust to overfitting, so a large number usually results in better performance. The learning rate defines the contribution of each classifier: the closer the learning rate is to zero, the smaller the individual contribution of each classifier to the ensemble, i.e., for a small learning rate, the ensemble takes into account the joint performance of the classifiers. The parameters Min sample split and Max depth were also configured; they have the same purpose as in the Random Forest model. Considering the ranges of values presented in Table 3, we can see that the grid search selected values that configure Gradient Boosting models with many estimators, a low learning rate, and moderate depth. Table 4 shows the configuration selected for each prediction window.
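In scikit-learn-style grid-search notation, the ensemble search spaces discussed above can be sketched as plain dictionaries; the ranges below are illustrative, not the exact values of Table 3:

```python
# Hypothetical grids in the spirit of Table 3. Note that recent
# scikit-learn spells the MAE/MSE criteria "absolute_error" and
# "squared_error".
rf_grid = {
    "n_estimators": [50, 100, 200],
    "criterion": ["absolute_error", "squared_error"],  # MAE / MSE
    "max_depth": [2, 4, 8, None],
    "min_samples_split": [2, 4, 8],
}
gb_grid = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
    "min_samples_split": [2, 4, 8],
}

# Number of candidate Random Forest models the grid search must validate:
n_candidates = (len(rf_grid["n_estimators"]) * len(rf_grid["criterion"])
                * len(rf_grid["max_depth"]) * len(rf_grid["min_samples_split"]))
print(n_candidates)  # 3 * 2 * 4 * 3 = 72
```

The candidate count multiplies quickly, which is why the paper restricts each range rather than searching exhaustively.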
For the Convolutional Neural Network and Multilayer Perceptron models, we configured the grid search to test the best architecture taking into account the number of layers and the number of neurons. As noted in Table 3, we varied the number of layers from 1 to 6 and the number of neurons from 10 to 60. To decrease the complexity of the grid search, we defined that, for each tested architecture, the number of neurons would be the same in all layers. In the case of the CNN, we did not perform a grid search for the filter size. Taking into account the observation of Szegedy et al. 18 , the size of each filter was fixed at 2. In the development of CNNs, Szegedy et al. show that the use of small kernels is more efficient than the use of larger kernels. In addition to decreasing the processing load, they also emphasize that the use of multiple small filters can match the representativeness of larger filters. The concept of representativeness is related to the capacity of the convolution to detect structural changes in the analyzed problem. Table 4 shows the hyperparameter settings for each model and prediction window combination.

Evaluation
After training the models, we assess their generalization capacity. In this step, we use the metrics MSE (Eq. 1), RMSE (Eq. 2), MAE (Eq. 3), MAPE (Eq. 4), and R² (Eq. 5) to measure the algorithms' performance. For better interpretation of the results, we apply Min-Max normalization to the expected and predicted values. Thus, for the metrics MSE, RMSE, MAE, and MAPE, values closer to zero indicate better performance of the evaluated model. For the metric R², the best performance is observed when the values are close to one. Table 5 shows the performance of each model by prediction window.
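This evaluation step can be sketched as follows, assuming Min-Max normalization uses the range of the true values (a detail not fully specified in the text) and hypothetical true/predicted totals for a short test horizon:

```python
import numpy as np

def min_max(v, lo, hi):
    """Scale values into [0, 1] using a fixed range."""
    return (v - lo) / (hi - lo)

# Hypothetical true and predicted cumulative totals for the test horizon.
y_true = np.array([100.0, 120.0, 150.0, 190.0, 240.0])
y_pred = np.array([ 98.0, 125.0, 145.0, 200.0, 230.0])

# Normalize both series with the range of the true values so errors
# land on a comparable [0, 1] scale across models.
lo, hi = y_true.min(), y_true.max()
t, p = min_max(y_true, lo, hi), min_max(y_pred, lo, hi)

mae  = np.mean(np.abs(t - p))
rmse = np.sqrt(np.mean((t - p) ** 2))
r2   = 1 - np.sum((t - p) ** 2) / np.sum((t - t.mean()) ** 2)
print(round(mae, 4), round(rmse, 4), round(r2, 4))
```

After normalization, MAE and RMSE near zero and R² near one indicate a well-adjusted model, as described above.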
Of all the models analyzed, we can see that the Random Forest and Gradient Boosting models did not reach generalization in any prediction window. This behavior can be explained by the way these algorithms work. Random Forest and Gradient Boosting are tree-based algorithms that operate by if-then rules that recursively split the input space. These algorithms consider the observations to be independent and identically distributed and, therefore, they are unable to predict values that fall outside the range of the target values in the training set, i.e., they are unable to predict a trend. To try to solve this problem, we used data transformation techniques that are widely used in time series 17 . In the pre-processing step, we applied the Box-Cox transformation to stabilize the variance and differencing between the reported cases to make the series stationary in the mean. However, the Random Forest and Gradient Boosting models still failed to generalize. The performance of the Random Forest and Gradient Boosting models supports our hypothesis that, although ensemble models are among the best techniques to solve most problems where machine learning can be employed, there are exceptions.
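The inability of tree-based models to extrapolate a trend is easy to reproduce. The sketch below trains a Random Forest and a Linear Regression on a purely increasing synthetic series and queries a point beyond the training range:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# A purely increasing series: train on days 0-49, test on day 59,
# whose target lies outside the training range.
X = np.arange(60, dtype=float).reshape(-1, 1)
y = 10.0 * X.ravel()  # hypothetical growing case counts

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:50], y[:50])
lr = LinearRegression().fit(X[:50], y[:50])

# The forest averages training targets, so it cannot predict above the
# largest value it has seen (490), while the linear model follows the
# trend to 590.
print(rf.predict([[59.0]])[0], lr.predict([[59.0]])[0])
```

This is exactly the failure mode described above: on a monotonically growing epidemic curve, the test-set targets always lie beyond the training range, so the ensembles plateau.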
About the other models, we can see that there was variation in performance due to the prediction window. To ease the visualization of these variations, we grouped the performance of the models and plotted the graphs presented in Figures 5 and 6. In Figure 5, we have the radar charts comparing the performance of the models for each prediction window. We plotted the radar charts taking into account the 5 evaluation metrics. However, for better visualization, we inverted the MAE, MAPE, MSE, and RMSE metrics, i.e., we applied 1 − metric. With this transformation, the models with the best performance should present values closer to 1 for all metrics. In the visualization, the best model is the one with the largest pentagon. In Figure 6, we plot each model's prediction according to the prediction window used. This plot refers to the test data that were not used in the training and validation phases. The black line corresponds to the true confirmed cases and the dashed lines to the models' predictions. Predictions were made for the 12th to the 21st of May 2020. In window 1, to forecast the 12th we use the data from the 11th, to forecast the 13th we use the data from the 12th, and so on. Generalizing, considering a window w, the prediction for day p_d is performed using the data of day p_{d−w}.
Looking at the results, we can see that the Linear Regression model lost performance as the prediction window grew. In Figure 5, we can see that the pentagon of Linear Regression is similar to those of the best models for the first 5 windows. However, for windows 6 to 10, the low performance of this model is notable. This behavior can also be seen in the predictions in Figure 6. As the prediction window increases, the (blue) curve of the Linear Regression model moves away from the true (black) cases. The results of the Linear Regression show that this model is only efficient for forecasting short horizons. In summary, to predict horizons > 5 days, we do not recommend using the Linear Regression model.
The Convolutional Neural Network models showed performance variation between the prediction windows. In Figure 5, in prediction windows 5, 7, 8, and 10, we can see that the CNN pentagon (purple) is significantly smaller than the pentagons of the best models. This behavior indicates a certain inconsistency in the predictability of the CNN model. When looking at windows 5, 7, 8, and 10 in Figure 6, we can also observe this inconsistency: the CNN prediction curve deviates from the true cases throughout the time series. Although the CNN model shows good results for the other prediction windows, in general it showed inferior performance compared to the Linear SVM and Multilayer Perceptron models. This reinforces our assumption that, depending on the problem and the data type, other machine learning models may perform better than ensemble and deep learning models. In this experiment, the Linear SVM and Multilayer Perceptron presented the best overall performances. In Figure 5, we can see that the Linear SVM (red) model showed some performance variation in the smaller prediction windows. However, in the larger prediction windows, its performance was similar to that of the Multilayer Perceptron (green). In Figure 6, when we analyze the fit of the predictions to the true cases, we find that the Linear SVM and Multilayer Perceptron predictions are well adjusted in almost all prediction windows. Besides, for windows ≥ 7, these models have the best predictions for the longest horizon (21/05/2020). The only exception is window 8, where the CNN model has the best prediction for the longest horizon. In a real scenario, obtaining the best prediction for the longest horizon is ideal because this is the information used for decision-making by the authorities. This also reinforces our analysis that points out the Linear SVM and Multilayer Perceptron as the best models evaluated in our experiment.
In window 10 of Figure 6, we also compare the performance of the AutoRegressive and Prophet 15 statistical models. As can be seen in this figure and in Table 6, these models did not achieve the desired generalization in our experiment. This result suggests that statistical models are not always sufficient to solve prediction problems like the one presented in this work.

Table 6. Error metrics for the AutoRegressive and Prophet models.

Checking the models' equivalence
Following the methodology presented in Figure 4, the last step of our experiment is to conduct a hypothesis test to verify the equivalence between the models. In this case, we verified the equivalence between the two best evaluated models. The null hypothesis (H0) of our test considers that the Linear SVM and Multilayer Perceptron models are statistically equivalent, and the alternative hypothesis (H1) that the models are different. Student's t-test was employed to determine the significance of these differences. If we get a p-value ≤ 0.05, it means that we can reject the null hypothesis, and the machine learning models are significantly different with 95% confidence. To perform this test, we use window 10 to predict the cases from May 22, 2020, to May 31, 2020. Until the submission of this article, this prediction period had not yet been reported in the official panel. Thus, we check the equivalence of the best models for future cases. Figures 7 and 8 show the estimated prediction density of these two models and the prediction curve for the ten-day horizon. Looking at these two plots, we can see the similarity between the two models. When applying Student's t-test, we get a p-value = 0.438, indicating that we must accept the H0 hypothesis: the models are statistically equivalent. This result confirms the analysis carried out in the previous section, in which we selected these two models as the best.
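The equivalence check can be sketched with SciPy's two-sample t-test; the prediction values below are synthetic stand-ins for the two models' ten-day forecasts, built as the same trend plus small independent noise:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Hypothetical ten-day forecasts from two near-equivalent models:
# the same underlying trend plus small independent noise.
svm_pred = np.linspace(21000, 26000, 10) + rng.normal(0, 50, 10)
mlp_pred = np.linspace(21000, 26000, 10) + rng.normal(0, 50, 10)

t_stat, p_value = ttest_ind(svm_pred, mlp_pred)
print(p_value > 0.05)  # True: we fail to reject H0 (models equivalent)
```

With a p-value above 0.05 the null hypothesis is not rejected, mirroring the p = 0.438 result reported for the Linear SVM and Multilayer Perceptron predictions.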

Conclusion
COVID-19's problems are growing every day, especially in the poorest countries. In Brazil, the number of infected people is growing rapidly, and health, economic, and social problems are becoming more difficult in almost all states. In the state of Pará, where we chose to conduct this research, the government declared a lockdown on May 7, 2020. In an attempt to slow COVID-19's progress, the lockdown suspends all non-essential activities and prohibits the movement of people through the streets in the capital, Belém, and 9 other cities in the state. To contain the progress of this epidemic in the state, it is important to have access to information from a predictability perspective. In this sense, considering the official data reported by the Pará Department of Health, we built a dataset to analyze the performance of various machine learning models in the task of predicting the number of infected due to COVID-19. Based on the prediction of the number of infected, we believe that the authorities can use this information to make the best decisions. In our experiment, we analyzed the performance of six machine learning models and two statistical models: Linear Regression, Linear SVM, Random Forest, Gradient Boosting, Convolutional Neural Network, Multilayer Perceptron, AutoRegressive, and Prophet. We chose these models because we had the hypothesis that, although ensemble and deep learning models are used to solve almost any problem where machine learning models can be applied, there are exceptions depending on the nature of the problem and the data type. We analyzed the prediction performance of these models taking into account 10 prediction windows. This analysis was performed to verify whether there was a change in the models' performance due to the prediction horizon.
At the end of our analysis, we applied Occam's razor principle, which, in the context of machine learning, says that when faced with two algorithms with the same training performance and testing capacity, the simpler model will probably be the best choice.
Returning to our problem formulation, we evaluated different prediction windows Ω = {1, 2, ..., 10} and the generalization capacity of the machine learning models Ψ = {Linear Regression, Linear SVM, Random Forest, Gradient Boosting, CNN, Multilayer Perceptron} in the task of estimating the number of infected due to COVID-19 in the state of Pará, Φ, i.e., we evaluated how Φ can be solved by Ψ(Ω) and also the influence of each ω ∈ Ω for every ψ ∈ Ψ. From this formulation, we can conclude that the Ω set influenced the performance of the Linear Regression and Convolutional Neural Network models. The performance of Linear Regression was inversely proportional to the values of the Ω set, i.e., the larger the prediction window, the lower the performance of Linear Regression. CNN, on the other hand, presented performance variation for some values of the Ω set. This performance fluctuation demonstrates the unpredictability of the model and, for this reason, CNN cannot be compared to the best prediction models analyzed in this work.
In our analysis, we also saw that Φ cannot be solved by Ψ*(Ω), where Ψ* = {Random Forest, Gradient Boosting}. These two models failed to generalize the problem for all prediction windows. As previously explained, the poor performance of these models can be explained by the way decision tree algorithms work. Finally, the two best models analyzed in this experiment were Linear SVM and Multilayer Perceptron. The Multilayer Perceptron showed excellent performance for all the prediction windows analyzed. The Linear SVM model, on the other hand, showed some variation in the intermediate windows but excellent performance in the largest windows, which are the most important. Considering the performance equivalence of these models for the largest prediction windows, we applied Student's t-test to verify the statistical significance of the difference between them. The result of the hypothesis test showed that the Linear SVM and Multilayer Perceptron models are equivalent. Given this equivalence, we applied the principle of Occam's razor and concluded that the Linear SVM model is the most recommended for predicting the number of infected due to COVID-19 in the state of Pará.