DOI: https://doi.org/10.21203/rs.2.15755/v2
Pandemic infectious diseases are spreading in many geographical areas. The World Health Organization (WHO) has identified dengue fever, a rapidly spreading illness caused by the dengue virus, as one of the most important mosquito-borne diseases and one of the deadliest infectious diseases worldwide. Accordingly, this disease poses a severe risk to human populations in tropical and subtropical regions [1-6]. Health organisations should have a prediction and early warning system to control and monitor dengue fever [7]. Member states in the three WHO regions regularly reported an increase in the annual number of cases from 2.2 million in 2010 to 3.2 million in 2015 [8].
Moreover, WHO estimated that 50–100 million dengue infections occur worldwide each year. An annual mortality of approximately 20,000–22,000 deaths caused by dengue fever has also been reported [8,9]. In contrast with yellow fever or other mosquito-borne diseases, no vaccine or treatment is available against all serotypes of the dengue virus, and no antiviral drug for treating dengue fever has been reported yet [10]. The only alternative is to prevent or control outbreaks of this disease.
The accuracy of a prediction system for outbreaks is the primary and important concern for controlling dengue fever [11]. Thus, establishing related risk factors is critical for prediction systems [12]. Given that climate factors play a key role in this disease, identifying the relation between weather information and dengue outbreak incidence is a major task in establishing an accurate prediction system for future outbreaks [13,14]. In the current study, important climatic risk factors, such as temperature, relative humidity and rainfall amount, were examined. The current accuracy for prediction systems based on climate factors ranges from 82.39% to 90.5% [12,15-20].
This study is essential because it identifies the critical climatic risk factors in dengue outbreak prediction, i.e. the TempeRain factor (TRF). Then, the identified critical factors (TRF) were applied to prediction models, increasing the accuracy of prediction and reducing the error of prediction models. This process is expected to particularly help authorised organisations or decision makers in health organisations, governments and other concerned groups to become aware and develop improved prevention programmes in the near future.
Related works
A recent study from WHO indicated that 390 million dengue infections occur annually (95% credible interval of 284–528 million); among which, 96 million (67–136 million) are manifested clinically with any severity of the disease [21]. Another study on the prevalence of dengue fever has estimated that 3.9 billion people in 128 countries are at risk of infection from dengue virus [22]. As of December 2018, the Ministry of Health (MOH) of Malaysia has recorded approximately 80,615 dengue cases with 147 deaths compared with 19,884 cases in December 2011 with 36 deaths [23]. The number of cases increased approximately fourfold. By the end of March 2019, 39,805 cases of dengue with 64 deaths were reported in Malaysia compared with 16,917 cases with 34 deaths in March 2018 [24].
Various early warning and monitoring systems are currently implemented to monitor dengue outbreaks worldwide. Dengue prediction models have been previously investigated, but some of these models still exhibit limitations in achieving high accuracy in dengue outbreak prediction [11,25]. Different models and techniques have been integrated into the design of several models for predicting dengue outbreaks. A number of studies have also established prediction models for dengue outbreaks using artificial neural networks [12].
Hybrid models have been used in outbreak prediction research. A hybrid model is an example of an integrated model, and many models based on genetic algorithms are available to determine the weights in a neural network model [11,13,14,20,26]. In Singapore, researchers found that dengue cases were significantly correlated with climatic variables by using a Poisson regression model [27]. One researcher [17] developed a dengue outbreak prediction system in Singapore and obtained 90% accuracy. Thitiprayoonwongse established another prediction system based on a decision tree [18]. Different models of dengue outbreak prediction systems in Malaysia have achieved different accuracies [12,20].
Vulnerability maps of dengue incidences have been generated in Malaysia, resulting in the development and implementation of visualised and predictive modelling using geographic information systems (GIS) for dengue fever in Selangor, Malaysia [28]. One study in Indonesia was concerned with dengue outbreak prediction using a GIS-based early warning system [15]. Another study from the National Taipei University of Technology used C-support vector classification to forecast dengue fever epidemics in Taiwan, and the accuracy of the radial basis function (RBF) model was 90.5% [16]. In 2015, Loshini et al. predicted localised dengue incidences in Malaysia using an ensemble system for identification and found that ensemble models exhibit better prediction power than a single model [29].
The prediction of dengue outbreaks is crucial worldwide because this infectious disease remains a major issue in many countries [11,26,30,31]. Table 1 lists studies on different models of dengue outbreak prediction with distinct climatic risk factors. The asterisk (*) in the columns of the table denotes the risk factors used in different studies.
Table 1: Risk Factors for Dengue Outbreak Prediction Models
Most studies on dengue fever were conducted in Asia, particularly in East–West Asia and the Pacific Ocean regions. WHO reported that countries in East–West Asia, such as Malaysia, Singapore, Taiwan, Indonesia, Bangladesh and Thailand, are critical areas for dengue fever. Most studies have shown that temperature and rainfall directly and significantly affect dengue outbreaks [14,20,26,30,31].
Moreover, changing climatic factors, such as increasing temperature, rainfall and humidity, are the most influential driving forces of dengue virus transmission [31]. One study correlated dengue cases with climatic variables in the city of Singapore, treating the number of dengue cases as the dependent variable and climatic variables, such as rainfall, maximum and minimum temperatures and relative humidity, as independent variables [27]. On the basis of the risk factors used in the 22 references listed in Table 1, most studies primarily used total rainfall (17 studies), average temperature (16 studies), relative humidity (15 studies), minimum temperature (11 studies) and maximum temperature (10 studies) as inputs of prediction models. However, none of the studies analysed these factors in detail or investigated the relationships that may exist amongst them.
This research aims to describe the dengue prediction system accuracy and the level of risk factors that contribute to a dengue outbreak prediction system and identify the associations amongst new climate risk factors. The detailed factors are then used as inputs for predicting dengue outbreaks.
Methodology
This section explains the methodology used for this research, including the dataset used, the analysis process, the newly identified integrated input factors, the evaluation with machine learning models and the evaluation method. Fig. 1 illustrates the conceptual framework of our research.
Fig. 1: Conceptual Framework for Identifying Significant Climate Factors in Dengue Outbreak Prediction
The following sections provide a detailed description of each process involved in this framework.
Dataset
Data were collected from two sources. We obtained weekly data on dengue cases for two federal territories, namely, Kuala Lumpur and Putrajaya, from January 2010 to December 2013. The data were obtained from the reports of the Disease Control Division of MOH, which are available from their official portal [23]. The weather data of Kuala Lumpur and Putrajaya for the same period were retrieved from the Malaysian Meteorological Department (MMD) and are available upon request. Thus, 209 weeks of confirmed dengue cases and meteorological data were evaluated in this study. However, approximately 8% of the data were missing in the MMD datasheets for the study period. We therefore obtained the missing data for this period from the US Weather Channel Interactive, which provides Malaysian meteorological data. The data were fitted simultaneously with the Putrajaya–Cyberjaya Station in Malaysia. Only minimum temperature, maximum temperature, average temperature, mean relative humidity and rainfall were selected because many studies have emphasised that these factors are the most important risk factors for dengue outbreak prediction models, as shown in Table 1.
Analysis
Weather data from MMD provide daily weather information, and the incidence of dengue cases is published weekly by MOH. Thus, data were normalised and classified into two levels, namely, ‘low risk’ and ‘high risk’, on a weekly basis. Weather and meteorological factors play important roles in the incidence of dengue fever. Thus, the dataset was analysed, and the relationship between the incidence of dengue cases and weather information was determined every week using the Pearson correlation coefficient (PCC). The Pearson product–moment correlation coefficient (occasionally referred to as PPMCC, PCC or Pearson’s r) is a measure of linear dependence between two variables X and Y (Equation 1). This method is an important evaluation technique, providing a value between +1 and −1, where +1 indicates total positive linear correlation, 0 indicates no linear correlation and −1 indicates total negative linear correlation. This measure is widely used in various science fields [50].
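$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \tag{1} $$
where n is the number of paired observations and \bar{X} and \bar{Y} are the sample means of X and Y.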
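As an illustration of the correlation step, the following is a minimal sketch of Pearson's r in plain Python. The function name `pearson_r` and the sample values are hypothetical and are not taken from the study's dataset; the study itself does not state which software was used for this calculation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical illustrative values, NOT data from the study.
weekly_min_temp = [23.1, 23.8, 24.0, 24.6, 25.2]
weekly_cases = [40, 48, 47, 60, 71]
r = pearson_r(weekly_min_temp, weekly_cases)
```

In practice, one such coefficient would be computed for each climate factor against the weekly case counts.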
Identification of Significant Factors
The most significant climate factors were identified based on the correlation analysis of the dataset, as shown in Table 2. The analysis result indicated that the highest correlation exists between minimum temperature and cumulative rainfall, with the incidence of dengue cases determined in different weeks.
Table 2: Correlation between Dengue Incidence Cases and Climate Factors
Minimum temperature and daily rainfall are the most significant dengue weather-based risk factors [38,51,52,53]. The average minimum temperature can be calculated as follows (See Supplementary Files for Equation 2):
where i is the number of weeks over which the average minimum temperature is calculated and [Week(i−1)] is the minimum temperature of the weeks prior to the current week plus the minimum temperature of the current week. The cumulative rainfall for week i can be calculated using Equation 3 (see Supplementary Files), as follows:
where i is the desired week for which the total rainfall will be calculated, cumulative rainfall week (i) is the final calculation and week (i−1) is the week prior to week (i).
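Because Equations 2 and 3 are given only in the Supplementary Files, the sketch below shows one plausible reading of the two lagged quantities: a trailing mean of weekly minimum temperature and a trailing sum of weekly rainfall. It assumes that both windows include the current week; the function names and sample values are hypothetical and are not the authors' implementation.

```python
def trailing_mean(values, week, lag):
    """Mean of `values` over weeks week-lag .. week (inclusive).

    Returns None when fewer than `lag` earlier weeks are available.
    """
    if lag < 0 or week < 0 or week >= len(values):
        raise ValueError("week out of range or negative lag")
    if week - lag < 0:
        return None
    window = values[week - lag : week + 1]
    return sum(window) / len(window)


def trailing_sum(values, week, lag):
    """Sum of `values` over weeks week-lag .. week (inclusive)."""
    if lag < 0 or week < 0 or week >= len(values):
        raise ValueError("week out of range or negative lag")
    if week - lag < 0:
        return None
    return sum(values[week - lag : week + 1])


# Hypothetical weekly series, NOT data from the study.
weekly_min_temp = [24.0, 23.0, 25.0, 24.0, 26.0, 25.0, 23.0]  # degrees C
weekly_rainfall = [10.0, 0.0, 5.0, 20.0, 0.0, 15.0, 30.0]     # mm

week = 5
avg_min_temp = trailing_mean(weekly_min_temp, week, lag=5)  # weeks 0..5
cum_rainfall = trailing_sum(weekly_rainfall, week, lag=2)   # weeks 3..5
```

If the cumulative rainfall window is meant to exclude the current week, the slice bounds would shift by one week; the text does not settle this.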
Table 3 provides the PCCs between the weather variables and the incidence of dengue cases. The highest positive coefficients (underlined and highlighted in the table) mark the strongest correlation between the weather parameters and the incidence of dengue fever. Table 3 presents the results for up to 7 weeks prior to the current week; the optimum value for the average minimum temperature (0.499) was obtained for 5 weeks prior to the current week.
The highest value for cumulative rainfall (0.0071) was obtained for 2 weeks prior to the current week (Table 3).
Table 3: PCC between Climatic Factors and Incidence of Dengue Cases
Thus, in accordance with the correlation analysis, the average minimum temperature of Week 5 (plus the current week) and the cumulative rainfall for Week 2 (prior to the current week) exhibit the highest correlation with dengue cases amongst the tested lags. These two factors are regarded as TRF and used as input parameters for dengue outbreak risk level prediction. The combination of factors is shown in Fig. 2.
Fig. 2: Components of TRF
The cumulative rainfall for 2 weeks prior to the current week is identified as a significant factor because it coincides with the life cycle of an Aedes aegypti mosquito, i.e. approximately 2 weeks [38,51,52,53,54,55]. This result suggests that a dengue outbreak can occur soon after an A. aegypti mosquito completes its life cycle and becomes an adult.
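The lag selection described in this section can be sketched as a simple search for the largest coefficient. The coefficients below are copied from Table 3; the helper `best_lag` is a hypothetical name introduced for illustration, and choosing the largest positive value is an assumption about how the optimum was picked.

```python
# PCC values copied from Table 3, indexed by lag in weeks (0 = current week).
pcc_avg_min_temp = [0.447, 0.465, 0.480, 0.494, 0.498, 0.499, 0.489, 0.476]
pcc_cum_rainfall = [-0.0201, 0.0065, 0.0071, -0.0005,
                    -0.0123, -0.0139, -0.0045, 0.0020]


def best_lag(pccs):
    """Return (lag, coefficient) for the largest coefficient in the list."""
    lag = max(range(len(pccs)), key=lambda k: pccs[k])
    return lag, pccs[lag]


temp_lag, temp_pcc = best_lag(pcc_avg_min_temp)  # lag 5, coefficient 0.499
rain_lag, rain_pcc = best_lag(pcc_cum_rainfall)  # lag 2, coefficient 0.0071
```

The two selected lags, 5 weeks for average minimum temperature and 2 weeks for cumulative rainfall, match the components of TRF.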
Prediction using machine learning models
Once the significant factors had been identified, we proceeded to predict the risk level of dengue incidence, defined as ‘high risk’ or ‘low risk’. To predict this level, we tested five machine learning models using input factors with and without TRF. Table 4 provides the detailed input factors and descriptions.
Table 4: Input Factors with and without TRF
On the basis of their high reported performance [16,56], we selected Bayes network (BN) models, support vector machine (SVM), RBF tree, decision table and naive Bayes to evaluate the factors using WEKA version 3.8.0 [57]. We used the 10-fold cross-validation technique to evaluate the models.
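To make the evaluation protocol concrete, the following is a minimal sketch of how 10-fold cross-validation partitions a dataset of 209 weekly records. It is a plain shuffled split written for illustration, not WEKA's own procedure (which may, for example, stratify folds by class); the function name and the seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_no, fold in enumerate(folds) if f_no != i
                     for j in fold]
        yield train_idx, test_idx


# 209 weekly records, as in the dataset described above.
splits = list(k_fold_indices(209, k=10))
fold_sizes = [len(test_idx) for _, test_idx in splits]
```

Each record is used for testing exactly once, and a model is trained ten times on the remaining folds; the reported accuracy is then aggregated over the ten test folds.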
Evaluation Metrics
We can evaluate the performance of classifiers on the basis of several accuracy measures and parameters. Moreover, some accuracy and error measures are used to determine the distance between the predicted and the actual known values [58]. The confusion matrix, as reported by WEKA, is a useful tool for analysing how well a classifier recognises tuples of different classes.
Sensitivity and specificity measures can be used to calculate the accuracy of classifiers. Sensitivity is also referred to as the true positive rate (i.e. the proportion of positive tuples that are correctly identified). Specificity is the true negative rate (i.e. the proportion of negative tuples that are correctly identified).
Equation 4 (see Supplementary Files) shows how accuracy was calculated using the confusion matrix.
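Because Equation 4 is given only in the Supplementary Files, the sketch below uses the standard confusion-matrix definitions of accuracy, sensitivity and specificity. Treating ‘high risk’ as the positive class is an assumption, and the function names and labels are hypothetical examples rather than study data.

```python
def confusion_counts(actual, predicted, positive="high risk"):
    """Count TP, TN, FP and FN for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def _ratio(numerator, denominator):
    """Safe division; returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity (TP rate) and specificity (TN rate)."""
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


# Hypothetical labels, NOT results from the study.
actual = ["high risk", "high risk", "low risk", "low risk", "low risk"]
predicted = ["high risk", "low risk", "low risk", "low risk", "high risk"]
counts = confusion_counts(actual, predicted)
scores = metrics(*counts)
```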
We used the root-mean-square error (RMSE) to demonstrate the error rate [50,58]. RMSE was also adopted to identify the strengths of model evaluation. Optimising RMSE during model calibration may provide a small error variance at the expense of a significant model bias [50,59]. This statistic is determined as follows (Equation 5 in the Supplementary Files):
where Pi and Oi are known as the experimental and forecasted values, respectively; and n is the total number of test data.
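Equation 5 is likewise available only in the Supplementary Files; the following minimal sketch implements the standard RMSE definition, the square root of the mean squared difference between paired values. The function name and sample values are hypothetical, and whether the study's RMSE was computed on class labels or on predicted probabilities is not stated here.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values (1 = high risk, 0 = low risk), NOT study data.
error = rmse([1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 1.0, 0.0])
```

One of the four pairs differs by 1, so the mean squared error is 0.25 and the RMSE is 0.5.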
Results and Discussion
Table 5 presents the results of the five machine learning models with and without TRF inputs. Adding the newly identified factors (TRF) to the weather data, which serve as external risk factors for the dengue outbreak prediction model, improved the accuracy of most models and reduced the error of the BN model.
Table 5: Machine Learning Classifier Models Using Cross-validation (10-fold) with TRF
Thus, the proposed factors and machine learning model are beneficial for predicting the dengue risk level. The results also showed that four of the five models achieved higher accuracy with TRF than without it, whereas the accuracy of the RBF tree was unchanged (89.47%). The highest accuracy was obtained by the BN classifier with TRF (92.35%), which also had the lowest RMSE (0.26).
Other studies report different accuracies based on their own private databases, which consist of data collected from patients in hospitals [15,18,20,61]. In contrast, our research used accessible and open-source data for climate factors.
Table 6: Benchmarking with Previous Studies
Table 6 shows the accuracy of the BN classifier with TRF compared with the other models that used climate factors. The proposed model with TRF achieved the highest accuracy of 92.35% compared with the other models.
Conclusion and Future Work
We identified a new significant risk factor, called TRF, which combined the average minimum temperature at 5 weeks plus the current week and the cumulative rainfall at 2 weeks prior to the current week. TRF significantly contributed to dengue outbreak prediction. The use of accurate and appropriate input factors for outbreak prediction can also provide enhanced and precise results for model output. We used various machine learning models to apply the identified significant factors to predicting dengue outbreak risk.
The integration of factors into the BN model resulted in a significant accuracy of 92.35%. This accuracy showed that using TRF in the BN model outperformed all other outbreak prediction models. Moreover, the RMSE of 0.26 of the proposed system was lower than those of the other models. We strongly believe that using TRF can improve outbreak prediction systems. In our future study, we will test our model with different prediction systems and models. Moreover, future research should emphasise the exploration of other hidden and important risk factors for predicting dengue outbreaks.
This research has several limitations, and the most important one is data availability, which is restricted by privacy issues and the regulations set by MOH Malaysia. Although many risk factors for dengue outbreak are available, we focused only on the detailed analysis of temperature and rainfall, which have been emphasised as the most important factors, because of their importance and the data access limitations.
Abbreviations
ANN: Artificial Neural Networks
BN: Bayes Network
DLNM: Distributed Lag Non-linear Model
FN: False Negative
FP: False Positive
GA: Genetic Algorithm
GEE: Generalised Estimating Equation
GIS: Geographic Information System
GLM: Generalised Linear Model
MCMC: Markov Chain Monte Carlo
MMD: Malaysian Meteorological Department
MOH: Ministry of Health Malaysia
NBR: Negative Binomial Regression
PCC: Pearson correlation coefficient
PPMCC: Pearson Product-Moment Correlation Coefficient
RBF: Radial basis function
RMSE: Root Mean Squared Error
SRCC: Spearman's rank correlation coefficient
SVM: Support Vector Machine
TN: True Negative
TP: True Positive
TRF: TempeRain Factor
WHO: World Health Organization
Availability of data and material
The completed combined datasets generated and analysed during the current study are available from the corresponding author on reasonable request.
The confirmed dengue case data that support the findings of this study are available from the Ministry of Health Malaysia [http://www.moh.gov.my/index.php/database_stores/store_view/1].
The weather data that support the findings of this study are obtained from Malaysian Meteorological Department. Data are available from the authors upon reasonable request.
Competing interests
The authors declare that they have no competing interests
Funding
Research University Grant-Faculty Program (GPF011D-2019).
Authors' contributions
Felestin Yavari Nejad contributed to the related works, method, experiments and analysis of the studies. Kasturi Dewi Varathan contributed to the method and discussions.
Acknowledgements
We would like to thank Research University Grant-Faculty Program (GPF011D-2019) for funding this research.
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Table 1: Risk Factors for Dengue Outbreak Prediction Models
| Reference | Technique | Year | Geographical Data Used | Temperature: Min | Temperature: Avg | Temperature: Max | Humidity: Relative (Mean) | Rainfall: Cumulative | Rainfall: Total | Rainfall: Max 24-h | Rainfall: Max 1-h | Rainfall: Bi-Weekly | Rainfall: Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [32] | Wavelet coherence analysis / quasi-Poisson regression combined with distributed lag nonlinear model (DLNM) | 2018 | Philippines | | * | | | * | | | | | |
| [33] | Generalized linear model | 2018 | Bangladesh | | * | | * | | * | | | | |
| [34] | Negative binomial regression (NBR) / generalized estimating equation (GEE) | 2017 | Vietnam | | * | | | | * | | | | |
| [35] | Artificial Neural Network (ANN) | 2016 | Philippines | | * | | * | | * | | | | |
| [36] | Distributed lag non-linear models (DLNM) / Generalised estimating equation models (GEE) | 2016 | China | * | | * | * | | | | | | * |
| [37] | Spearman rank correlation / Distributed Lag Non-linear Model (DLNM) | 2014 | Singapore | * | * | * | * | | * | | | | * |
| [38] | Distributed lag nonlinear model (DLNM) and Markov random fields | 2014 | Taiwan | * | * | * | | | * | * | * | * | |
| [39] | Generalized Additive Model (GAM) | 2014 | Europe | * | | * | * | | * | | | | |
| [40] | Generalized Additive Model (GAM) | 2013 | Mexico | * | | * | | | * | | | | |
| [41] | Poisson generalized additive model / distributed lag non-linear model (DLNM) | 2013 | Malaysia | * | * | * | * | | | | | * | * |
| [17] | Poisson multivariate regression models | 2013 | Singapore | | * | | | * | | | | | |
| [42] | Autoregressive Integrated Moving Average (ARIMA) | 2013 | Malaysia | | * | | * | | * | | | | |
| [43] | Poisson multivariate regression | 2012 | Singapore | | * | | | * | | | | | |
| [44] | Spearman's rank correlation coefficient (SRCC) | 2012 | Singapore | * | | | * | | * | | | | |
| [3] | Vector–host transmission model | 2012 | Taiwan | * | | * | * | | * | | | | |
| [11] | Neural Network and Genetic Algorithm | 2012 | Malaysia | | | | | | * | | | | |
| [45] | Generalised linear model (GLM) / Bayesian framework using Markov Chain Monte Carlo (MCMC) | 2011 | Brazil | | * | | * | | * | | | | |
| [12] | Artificial Neural Networks (ANN) | 2010 | Singapore | | * | | * | | * | | | | |
| [46] | Multiple regression and discriminant analysis techniques / Peirce skill score | 2010 | Indonesia | * | * | * | * | | * | | | | |
| [47] | Artificial Neural Networks (ANN) | 2009 | Turkey | | * | | * | | * | | | | |
| [48] | Entropy and Artificial Neural Network | 2008 | Thailand | * | * | * | * | | * | | | | |
| [49] | Kolmogorov-Smirnov test / Pearson’s correlation coefficient / Stepwise regression techniques | 2005 | Thailand | * | * | * | * | | * | | | | |
| Total | | | | 11 | 16 | 10 | 15 | 3 | 17 | 1 | 1 | 2 | 3 |
Table 2: Correlation between Dengue Incidence Cases and Climate Factors
| Minimum Temperature | Mean Temperature | Maximum Temperature | Mean Relative Humidity | Rainfall |
|---|---|---|---|---|
| 0.447 | 0.339 | 0.316 | -0.176 | -0.020 |
Table 3: PCC between Climatic Factors and Incidence of Dengue Cases
| Week | Average Minimum Temperature | Cumulative Rainfall |
|---|---|---|
| Current Week | 0.447 | -0.0201 |
| 1 Week Prior | 0.465 | 0.0065 |
| 2 Weeks Prior | 0.480 | 0.0071 |
| 3 Weeks Prior | 0.494 | -0.0005 |
| 4 Weeks Prior | 0.498 | -0.0123 |
| 5 Weeks Prior | 0.499 | -0.0139 |
| 6 Weeks Prior | 0.489 | -0.0045 |
| 7 Weeks Prior | 0.476 | 0.0020 |
Table 4: Input Factors with and without TRF
| Type (without TRF) | Parameter Description (without TRF) | Type (with TRF) | Parameter Description (with TRF) |
|---|---|---|---|
| Weather Factors | Minimum temperature (°C) | Weather Factors | |
| | Mean temperature (°C) | | Mean temperature (°C) |
| | Maximum temperature (°C) | | Maximum temperature (°C) |
| | Mean relative humidity (%) | | Mean relative humidity (%) |
| | Cumulative rainfall (mm) | | |
| | | TRF Factors | Average of minimum temperature 5 weeks before the current week (°C) |
| | | | Cumulative rainfall for 2 weeks prior to the current week (mm) |
Table 5: Machine Learning Classifier Models Using Cross-validation (10-fold) with TRF
| Model | Input | Accuracy (%) | Root Mean Squared Error (RMSE) |
|---|---|---|---|
| Bayes Net | With TRF | 92.35 | 0.26 |
| Bayes Net | Without TRF | 91.39 | 0.28 |
| SVM | With TRF | 88.04 | 0.35 |
| SVM | Without TRF | 88.00 | 0.33 |
| RBF Tree | With TRF | 89.47 | 0.29 |
| RBF Tree | Without TRF | 89.47 | 0.28 |
| Decision Table | With TRF | 90.41 | 0.28 |
| Decision Table | Without TRF | 89.95 | 0.28 |
| Naive Bayes | With TRF | 89.4737 | 0.3064 |
| Naive Bayes | Without TRF | 88.9952 | 0.2904 |
Table 6: Benchmarking with Previous Studies
| Reference | Year | Model | Accuracy (%) |
|---|---|---|---|
| [60] | 2018 | Correlation and Autoregressive Distributed Lag Model | 84.90 |
| [16] | 2016 | C-SVC Kernel and RBF | 90.50 |
| [17] | 2013 | Poisson Multivariate Regression Models | 90.00 |
| [12] | 2010 | Artificial Neural Networks | 82.39 |
| [48] | 2008 | Automatic Prediction System by Using Entropy and Artificial Neural Network | 85.92 |
| Our Proposed Model | | Bayes Network Model using TempeRain Factor (TRF) | 92.35 |