This section describes the research methodology: the dataset, the analysis process, the newly identified integrated input factors, the machine learning models used for prediction and the evaluation metrics. Fig. 1 illustrates the conceptual framework of our research.
Fig. 1: Conceptual Framework for Identifying Significant Climate Factors in Dengue Outbreak Prediction
The following sections provide a detailed description of each process involved in this framework.
Dataset
Data were collected from two sources. We obtained weekly data on dengue cases for two federal territories, namely, Kuala Lumpur and Putrajaya, from January 2010 to December 2013. These data were taken from the reports of the Disease Control Division of MOH, which are available from its official portal [23]. The weather data of Kuala Lumpur and Putrajaya for the same period were retrieved from MMD and are available upon request. In total, 209 weeks of confirmed dengue cases and meteorological data were evaluated in this study. However, approximately 8% of the data were missing in the MMD datasheets for the study period. We therefore obtained the missing data for this period from the US Weather Channel Interactive, which provides Malaysian meteorological data. The data were fitted simultaneously with the Putrajaya–Cyberjaya Station in Malaysia. Only minimum temperature, maximum temperature, average temperature, minimum humidity and rainfall were selected because many studies have emphasised that these are the most important risk factors for dengue outbreak prediction models, as shown in Table 1.
Analysis
MMD provides daily weather information, whereas the incidence of dengue cases is published weekly by MOH. Thus, data were normalised and classified into two levels, namely, ‘low risk’ and ‘high risk’, on a weekly basis. Weather and meteorological factors play important roles in the incidence of dengue fever. The dataset was therefore analysed week by week, and the relationship between the incidence of dengue cases and weather information was determined using the Pearson correlation coefficient (PCC). The Pearson product–moment correlation coefficient (occasionally referred to as PPMCC, PCC or Pearson’s r) is a measure of linear dependence between two variables X and Y (Equation 1). It takes a value between +1 and −1, where +1 indicates total positive linear correlation, 0 indicates no linear correlation and −1 indicates total negative linear correlation. This measure is widely used in various science fields [50].
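\[ r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \tag{1} \]
where \bar{X} and \bar{Y} are the sample means of X and Y, and n is the number of paired observations.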
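As an illustration of how the PCC between a weekly weather series and weekly dengue counts can be computed, the following minimal Python sketch implements the standard definition of Pearson's r. The function name `pearson_r` and the sample values are hypothetical and are not taken from the study data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator terms: sums of squared deviations.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical illustrative values, not data from the study.
weekly_min_temp = [23.1, 23.8, 24.0, 24.6, 25.2, 24.9]
weekly_cases = [40, 46, 51, 60, 72, 65]
r = pearson_r(weekly_min_temp, weekly_cases)
```

The result always lies in the interval from −1 to +1, and the function raises an error for a constant series, for which the coefficient is undefined.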
Identification of Significant Factors
The most significant climate factors were identified based on the correlation analysis of the dataset, as shown in Table 2. The analysis indicated that minimum temperature and cumulative rainfall, taken over different numbers of weeks, exhibit the highest correlation with the incidence of dengue cases.
Table 2: Correlation between Dengue Incidence Cases and Climate Factors
Minimum temperature and daily rainfall are the most significant dengue weather-based risk factors [38,51,52,53]. The average minimum temperature can be calculated as follows (See Supplementary Files for Equation 2):
where i is the number of weeks over which the average minimum temperature is calculated, and [Week(i−1)] is the minimum temperature of the weeks prior to the current week, to which the minimum temperature of the current week is added. The cumulative rainfall for week i can be calculated using Equation 3 (see Supplementary Files), as follows:
where i is the week for which the total rainfall is calculated, cumulative rainfall week(i) is the resulting total and week(i−1) is the week prior to week(i).
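Because Equations 2 and 3 are given only in the Supplementary Files, the following Python sketch shows one plausible reading of the two derived factors and should not be taken as the exact formulas. It assumes that the average minimum temperature is taken over the current week plus a number of prior weeks, and that cumulative rainfall is a sum over a similar window. The helper names and the `include_current` flag are hypothetical, and the values are illustrative only.

```python
def window(values, week, n_prior, include_current=True):
    """Values for the n_prior weeks before `week`, optionally plus `week` itself."""
    start = week - n_prior
    if start < 0:
        raise ValueError("not enough history for the requested window")
    end = week + 1 if include_current else week
    return values[start:end]


def average_min_temperature(min_temps, week, n_prior):
    """Mean weekly minimum temperature over the current week and n_prior prior weeks."""
    w = window(min_temps, week, n_prior, include_current=True)
    return sum(w) / len(w)


def cumulative_rainfall(rainfall, week, n_prior, include_current=True):
    """Total rainfall over n_prior prior weeks (assumption: current week optional)."""
    return sum(window(rainfall, week, n_prior, include_current))


# Hypothetical weekly series, indexed from week 0; not data from the study.
min_temps = [23.0, 24.0, 25.0, 24.0, 23.0, 25.0]
rainfall = [10.0, 0.0, 5.0, 20.0, 15.0, 30.0]

avg_temp = average_min_temperature(min_temps, week=5, n_prior=5)
rain_with_current = cumulative_rainfall(rainfall, week=5, n_prior=2)
rain_prior_only = cumulative_rainfall(rainfall, week=5, n_prior=2, include_current=False)
```

Whether the current week is included in the rainfall window is left as a parameter here because the text does not state it explicitly.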
Table 3 provides the PCCs between the weather variables and the incidence of dengue cases. The underlined and highlighted values mark the highest positive correlation coefficients between the weather parameters and the incidence of dengue fever. Table 3 presents the results for the 7 weeks prior to the current week, including the optimum value for the average minimum temperature (0.499).
The highest value for cumulative rainfall (0.0071) was obtained for 2 weeks prior to the current week (Table 3).
Table 3: PCC between Climatic Factors and Incidence of Dengue Cases
Thus, in accordance with the correlation analysis, the average minimum temperature of Week 5 (plus the current week) and the cumulative rainfall for Week 2 (prior to the current week) exhibit high correlation with dengue cases. These two factors are referred to as TRF and are used as input parameters for dengue outbreak risk level prediction. The combination of factors is shown in Fig. 2.
Fig. 2: Components of TRF
The cumulative rainfall for 2 weeks prior to the current week is identified as a significant factor because it coincides with the life cycle of the Aedes aegypti mosquito, i.e. approximately 2 weeks [38,51,52,53,54,55]. This result suggests that a dengue outbreak can occur soon after A. aegypti mosquitoes complete their life cycle and become adults.
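The lag selection described in this subsection, in which a weather factor is aggregated over different numbers of prior weeks and the window with the highest PCC is retained, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used to produce Table 3: the function names are hypothetical, the aggregate is a simple trailing sum, and the series are synthetic and constructed so that a 2-week lag is the best window by design.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(sx * sy)


def trailing_sum(values, week, n_prior):
    """Sum over the current week and the n_prior weeks before it."""
    return sum(values[week - n_prior: week + 1])


def scan_lags(weather, cases, max_prior, aggregate):
    """Correlate an aggregated weather feature with cases for each window size."""
    # Use the same weeks for every window so the coefficients are comparable.
    weeks = range(max_prior, len(cases))
    target = [cases[w] for w in weeks]
    scores = {}
    for n_prior in range(max_prior + 1):
        feature = [aggregate(weather, w, n_prior) for w in weeks]
        scores[n_prior] = pearson_r(feature, target)
    best = max(scores, key=scores.get)
    return best, scores


# Synthetic series (not study data): cases follow the 3-week trailing rainfall sum.
rainfall = [3, 0, 12, 7, 1, 0, 9, 14, 2, 5, 0, 8, 11, 4, 6, 1]
cases = [trailing_sum(rainfall, w, min(w, 2)) for w in range(len(rainfall))]
best, scores = scan_lags(rainfall, cases, max_prior=4, aggregate=trailing_sum)
```

Restricting every window size to the same set of weeks avoids comparing coefficients computed on samples of different lengths.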
Prediction Using Machine Learning Models
Once the significant factors had been identified, the research proceeded to predict the risk level of dengue incidence, defined as ‘high risk’ or ‘low risk’. To predict this level, we tested five machine learning models using input factors with and without TRF. Table 4 provides the detailed input factors and descriptions.
Table 4: Input Factors with and without TRF
On the basis of their strong reported results [16,56], we selected five models, namely, Bayes network (BN), support vector machine (SVM), RBF tree, decision table and naive Bayes, and evaluated the factors using WEKA version 3.8.0 [57]. The models were evaluated using 10-fold cross-validation.
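For readers unfamiliar with k-fold cross-validation, the following Python sketch shows how a dataset can be partitioned into 10 folds so that each sample is used for testing exactly once. It is a plain, unstratified illustration and is not WEKA's implementation; the function name `k_fold_indices` and the seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 209 weekly records, as in the dataset described above.
splits = list(k_fold_indices(209, k=10, seed=42))
```

In each of the 10 rounds, a model would be trained on `train_idx` and scored on `test_idx`, and the scores would then be averaged over the rounds.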
Evaluation Metrics
The performance of classifiers can be evaluated on the basis of several accuracy measures and parameters. Moreover, some accuracy and error measures are used to determine the distance between the predicted and the actual known values [58]. The confusion matrix used in WEKA is a useful tool for analysing how well a classifier recognises tuples of different classes.
Sensitivity and specificity measures can be used to calculate the accuracy of classifiers. Sensitivity is also referred to as the true positive rate (i.e. the proportion of positive tuples that are correctly identified). Specificity is the true negative rate (i.e. the proportion of negative tuples that are correctly identified).
Equation 4 (see Supplementary Files) shows how accuracy was calculated using the confusion matrix.
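The measures described above can be sketched in Python as follows. The sketch assumes that Equation 4 is the standard definition of accuracy, i.e. the proportion of correctly classified tuples, and treats ‘high risk’ as the positive class; both are assumptions, and the function names and labels are hypothetical illustrations rather than study output.

```python
def confusion_counts(actual, predicted, positive="high risk"):
    """Return (tp, fp, tn, fn) for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """True positive rate: share of positive tuples correctly identified."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of negative tuples correctly identified."""
    return tn / (tn + fp)


def accuracy(tp, fp, tn, fn):
    """Share of all tuples that are correctly classified."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical labels, not results from the study.
H, L = "high risk", "low risk"
actual = [H, H, H, L, L, L, L, H]
predicted = [H, L, H, L, L, H, L, H]
tp, fp, tn, fn = confusion_counts(actual, predicted)
```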
We used the root-mean-square error (RMSE) to demonstrate the error rate [50,58]. RMSE was also adopted to identify the strengths of model evaluation. Optimising RMSE during model calibration may provide a small error variance at the expense of a significant model bias [50,59]. This statistic is determined as follows (Equation 5 in the Supplementary Files):
where Pi and Oi are the experimental and forecasted values, respectively, and n is the total number of test data.
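A minimal Python sketch of RMSE is given below, assuming that Equation 5 is the standard definition, i.e. the square root of the mean squared difference between the paired values Pi and Oi over n test data. The function name `rmse` and the sample values are hypothetical.

```python
from math import sqrt


def rmse(p, o):
    """Root-mean-square error between two equal-length sequences."""
    if len(p) != len(o) or not p:
        raise ValueError("sequences must be non-empty and of equal length")
    # Mean of squared differences, then square root.
    return sqrt(sum((pi - oi) ** 2 for pi, oi in zip(p, o)) / len(p))


# Hypothetical values, not results from the study.
error = rmse([0.9, 0.2, 0.7, 0.4], [1.0, 0.0, 1.0, 0.0])
```

Because the differences are squared, the measure is symmetric in its two arguments, and larger deviations are penalised more heavily than smaller ones.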