Data
We used data from a historical cohort study of measles incidence in Markazi Province, central Iran, from April 1997 to February 2020. The data were extracted from the database of the Vice-chancellor of Health Services, Arak University of Medical Sciences, Markazi, Iran.
The data contained each individual's measles test result (positive/negative), gender (male/female), age (years), location (urban/rural), contact with measles patients (yes/no), ethnicity (Iranian/non-Iranian), history of vaccination (yes/no), and clinical signs: rhinorrhea (yes/no), fever (yes/no), conjunctivitis (yes/no), and cough (yes/no).
The measles test result was taken as the binary response variable, and the independent variables were used to classify cases into the two outcome levels. The classification approaches were Logistic Regression (LR), Linear Discriminant Analysis (LDA), Random Forest (RF), Artificial Neural Network (ANN), Bagging, Support Vector Machine (SVM), and Naïve Bayes.
Moreover, because new measles cases were recorded monthly over the study period, time series models were also evaluated. Given the excess zeros in the series and the count nature of the response variable (measles frequency), Zero-Inflated Negative Binomial (ZINB) regression for time series was used. The negative binomial distribution was chosen because the series is overdispersed.
Statistical Analysis
Logistic Regression (LR)
The response variable (measles) follows a binomial distribution, and the effect of the predictors on the outcome is assessed via a logit link function. The model formula is as follows:
$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi}$$
In this model, $\pi_i$ is the probability of measles for individual $i$, the $X$'s are the covariates, and the $\beta$'s are the regression coefficients. The odds ratio is used to report the effect of each variable on the outcome [15].
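As a minimal sketch of how a fitted logit model yields probabilities and odds ratios, the following Python snippet applies the inverse logit to a linear predictor and checks that the odds ratio for one covariate equals the exponential of its coefficient. The coefficient values and the two covariates (fever, contact) are hypothetical and are not estimates from this study.

```python
import math

def predicted_probability(intercept, coefs, x):
    """Inverse logit of the linear predictor b0 + sum(b_j * x_j)."""
    eta = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

def odds_ratio(coef):
    """Odds ratio for a one-unit increase in a covariate."""
    return math.exp(coef)

# Hypothetical coefficients: intercept, fever (1 = yes), contact (1 = yes).
b0, b = -2.0, [0.7, 1.2]
p_contact = predicted_probability(b0, b, [1, 1])     # fever, with contact
p_no_contact = predicted_probability(b0, b, [1, 0])  # fever, no contact

# The ratio of the two odds equals exp(coefficient of contact).
empirical_or = odds(p_contact) / odds(p_no_contact)
```

Here `empirical_or` agrees with `odds_ratio(1.2)`, which is why exponentiated coefficients can be reported directly as odds ratios.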
Linear Discriminant Analysis (LDA)
LDA is analogous to LR: it relates the dependent variable to a linear combination of predictors and classifies the outcome based on the independent variables. LDA approaches the problem through the conditional distribution of the predictors given the outcome class. The method minimizes the dispersion among cases within the same category and maximizes the dispersion between categories [16]. Standardized coefficients are used to determine the most important independent variables.
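The within-category and between-category dispersion idea is commonly written as Fisher's criterion. As a sketch for two classes, in generic notation introduced here for illustration ($w$ is the projection direction, $S_B$ and $S_W$ are the between-class and within-class scatter matrices, and $\mu_0$, $\mu_1$ are the class mean vectors):

```latex
J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w},
\qquad
\hat{w} \propto S_W^{-1}\,(\mu_1 - \mu_0)
```

Maximizing $J(w)$ gives the direction along which the two outcome classes are best separated relative to their internal spread.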
Random Forest (RF)
Random Forest was first presented by Leo Breiman as a method that combines classification and regression trees. It achieves powerful and fast computation over large datasets. In RF, the dataset is sampled with replacement to grow each tree, and a random subset of predictors is selected at each node. The most important predictors can be identified with the mean decrease in Gini and mean decrease in accuracy measures. The key variables determine the binary outcome so that classification is carried out as accurately as possible [17].
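To illustrate what a decrease in Gini measures, the sketch below computes the Gini impurity of a node with a binary outcome and the impurity reduction achieved by one split. This is only the per-split quantity; how a particular package accumulates it over nodes and trees is not shown, and the toy labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a node with 0/1 labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Hypothetical node: 3 positive and 5 negative cases, split on a yes/no predictor.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left = [1, 1, 1, 0]    # predictor = yes
right = [0, 0, 0, 0]   # predictor = no
decrease = gini_decrease(parent, left, right)
```

A predictor whose splits produce large decreases across many trees would rank high on the Gini-based importance measure.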
Artificial Neural Network (ANN)
This approach is inspired by the functioning of the human brain. The multilayer perceptron (MLP) is the most frequently adopted form of artificial neural network. It consists of input, hidden, and output layers, each containing multiple nodes. An activation function introduces nonlinearity as the data pass from each layer to the next. The input layer comprises all risk factors that influence the outcome, and measles, as the binary outcome, appears in the output layer. To obtain the network's optimal results, a nonlinear mapping between the input and output layers is learned, which depends on the number of nodes [18, 19]. The normalized importance of the independent variables is used to identify the most influential factors.
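The layer-by-layer computation can be sketched as a forward pass through an MLP with a single hidden layer. The sigmoid activation, the layer sizes, and the weights below are assumptions chosen for illustration; the study's actual network configuration is not specified here, and training (weight estimation) is not shown.

```python
import math

def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: activation of (weights . inputs + bias) per node."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer -> single output node (probability of measles)."""
    hidden = layer(x, hidden_w, hidden_b)
    return layer(hidden, [out_w], [out_b])[0]

# Hypothetical network: 3 inputs (e.g. fever, cough, contact), 2 hidden nodes.
hidden_w = [[0.5, -0.2, 0.8], [-0.3, 0.9, 0.1]]
hidden_b = [0.0, -0.1]
out_w, out_b = [1.2, -0.7], 0.05
prob = mlp_forward([1, 0, 1], hidden_w, hidden_b, out_w, out_b)
```

A case would then be classified as positive when `prob` exceeds a chosen cut-off such as 0.5.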
Bagging
Bagging is a machine learning technique that combines bootstrapping and aggregating. In this method, B bootstrap samples are drawn from the training set. Bootstrapping reduces the influence of noisy observations and may even exclude them from a sample. A classifier is then trained on each bootstrap sample, and these classifiers are expected to behave better than one trained on the original set. This makes bagging a valuable strategy for building a stronger classifier when the training set contains noisy observations. The approach also provides the importance of the independent variables, which ranks the factors affecting the outcome [20].
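The bootstrap-then-aggregate procedure can be sketched as follows. The base learner here is a one-variable threshold rule (a decision stump), chosen only to keep the example short; it is an assumption and not the base classifier used in the study. The toy data and the function names are likewise hypothetical.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw a sample of the same size as data, with replacement."""
    return [rng.choice(data) for _ in data]

def fit_stump(sample):
    """Base learner: pick the threshold t so that 'x >= t -> class 1' makes the fewest errors."""
    best = None
    for thr in sorted({x for x, _ in sample}):
        errors = sum((x >= thr) != bool(y) for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, thr)
    return best[1]

def bagging_fit(data, n_estimators, seed=0):
    """Train one stump per bootstrap sample (B = n_estimators)."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap(data, rng)) for _ in range(n_estimators)]

def bagging_predict(thresholds, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical data: (numeric predictor, binary outcome).
toy = [(1, 0), (2, 0), (3, 0), (4, 0), (5, 0),
       (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]
model = bagging_fit(toy, n_estimators=25)
```

Using an odd number of estimators avoids tied votes for a binary outcome.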
Naïve Bayes
This method is based on Bayes' theorem and yields a simple and fast classification. Using Bayes' theorem, the prior probability of belonging to each category of the outcome is updated conditional on the predictor variables. In the final step, the subject is assigned to the category with the highest posterior probability [21]. Naïve Bayes also provides a quality estimate of the attributes, which is used to identify the significant factors affecting measles [22].
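The posterior computation can be sketched for binary predictors as below. The prior and conditional probabilities are hypothetical numbers for illustration; in practice they would be estimated from the training data.

```python
def naive_bayes_posterior(priors, likelihoods, x):
    """Posterior class probabilities under the naive independence assumption.

    priors:      {class: P(class)}
    likelihoods: {class: [P(feature_j = 1 | class) for each feature j]}
    x:           list of 0/1 feature values for one subject
    """
    joint = {}
    for cls, prior in priors.items():
        p = prior
        for pj, xj in zip(likelihoods[cls], x):
            p *= pj if xj else (1.0 - pj)
        joint[cls] = p
    total = sum(joint.values())
    return {cls: value / total for cls, value in joint.items()}

# Hypothetical probabilities for two signs: fever and cough.
priors = {"positive": 0.2, "negative": 0.8}
likelihoods = {"positive": [0.9, 0.8], "negative": [0.5, 0.3]}

posterior = naive_bayes_posterior(priors, likelihoods, [1, 1])
predicted = max(posterior, key=posterior.get)  # class with the highest posterior
```

With these numbers a subject with both fever and cough is assigned to the positive class even though the prior favours the negative class.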
Support Vector Machine (SVM)
The goal of the support vector machine is to find a hyperplane in a P-dimensional space (where P is the number of attributes) that separates the two levels of the binary outcome. Many different hyperplanes could separate the two levels. The aim is to find the one with the maximum margin, that is, the largest gap between the categories of the dependent variable. Maximizing the margin allows new observations to be classified with greater confidence [23].
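For a linear SVM in its usual canonical form, where the closest points satisfy |w . x + b| = 1, the margin width is 2 / ||w||. The sketch below evaluates the decision rule and the margin for a given hyperplane; the weight vector and intercept are hypothetical, and finding the optimal hyperplane (the actual training step) is not shown.

```python
import math

def decision_value(w, b, x):
    """Signed score w . x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign class 1 on the non-negative side of the hyperplane, else class 0."""
    return 1 if decision_value(w, b, x) >= 0 else 0

def margin_width(w):
    """Distance between the two margin boundaries in canonical form: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane in a 2-dimensional attribute space.
w, b = [3.0, 4.0], -5.0
width = margin_width(w)            # ||w|| = 5, so the width is 0.4
label_a = classify(w, b, [3, 4])   # score 20 -> class 1
label_b = classify(w, b, [0, 0])   # score -5 -> class 0
```

A smaller norm of `w` corresponds to a wider margin, which is why training minimizes the norm subject to the separation constraints.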
Comparing the methods
Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and overall accuracy were used to assess the discriminative quality of the models. We split the dataset into a training set (70 percent of cases) and a testing set (30 percent of cases). For each model, we repeated this validation 500 times and reported each assessment criterion as the average over the 500 iterations.
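The evaluation scheme can be sketched as a repeated hold-out loop. The function names, the `fit` and `predict` callables, and the toy labels are hypothetical placeholders for any of the classifiers above. The sketch assumes every test set contains both outcome classes and both predicted labels, so that no denominator is zero.

```python
import random

def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV and accuracy from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(pairs),
    }

def split_indices(n, train_fraction, rng):
    """Random split of range(n) into training and testing indices."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(round(train_fraction * n))
    return idx[:cut], idx[cut:]

def repeated_holdout(xs, ys, fit, predict, n_repeats=500, train_fraction=0.7, seed=1):
    """Average each metric over n_repeats random train/test splits."""
    rng = random.Random(seed)
    totals = {}
    for _ in range(n_repeats):
        train, test = split_indices(len(ys), train_fraction, rng)
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [predict(model, xs[i]) for i in test]
        metrics = diagnostic_metrics([ys[i] for i in test], preds)
        for name, value in metrics.items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / n_repeats for name, total in totals.items()}

# Hypothetical test-set labels and predictions.
example = diagnostic_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
```

Averaging over many random splits reduces the dependence of the reported criteria on any single partition of the data.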
Zero-Inflated Negative Binomial (ZINB) regression for time series
This model assumes a negative binomial distribution for observations with excess zeros and fits a two-part model. The first part assesses the impact of the predictors on the count observations, as in a usual negative binomial regression, and estimates the coefficients using a logarithm link function. The second part uses a logit link function to evaluate the impact of the independent variables on non-occurrence of the outcome [24-26].
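The two-part structure can be made concrete through the ZINB probability mass function. In the sketch below, `mu` is the negative binomial mean (the exponential of a linear predictor under the log link), `k` is the dispersion parameter with variance mu + mu^2 / k, and `omega` is the probability of a structural zero (the inverse logit of a second linear predictor). These symbols and the parameter values are assumptions introduced for illustration, and the time series (autoregressive) component is not shown.

```python
import math

def nb_pmf(y, mu, k):
    """Negative binomial probability of count y, with mean mu and dispersion k."""
    log_p = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
             + k * math.log(k / (k + mu)) + y * math.log(mu / (k + mu)))
    return math.exp(log_p)

def zinb_pmf(y, mu, k, omega):
    """Mixture: structural zero with probability omega, otherwise a negative binomial count."""
    base = (1.0 - omega) * nb_pmf(y, mu, k)
    return omega + base if y == 0 else base

def log_link_mean(eta):
    """Count part: mean from the linear predictor under the log link."""
    return math.exp(eta)

def logit_link_prob(eta):
    """Zero part: probability of a structural zero under the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical parameter values.
mu, k, omega = 2.0, 1.5, 0.3
p_zero = zinb_pmf(0, mu, k, omega)
expected_count = sum(y * zinb_pmf(y, mu, k, omega) for y in range(500))
```

The probability of a zero exceeds that of the plain negative binomial, and the expected count shrinks to (1 - omega) * mu, which is how the model absorbs excess zeros.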
To assess the effect of the categories of the independent variables on the series of measles incidence frequencies over time, each independent variable was expressed as the proportion of a given level within the sample. For the variables shown in Table 5, sex (male proportion) is the percentage of males in the study population, ethnic (Iranian proportion) is the percentage of Iranian cases, location (urban proportion) is the percentage living in urban areas, vaccination (yes proportion) is the percentage of vaccinated cases, contact (yes proportion) is the percentage with contact, fever (yes proportion) is the percentage with fever, cough (yes proportion) is the percentage with cough, rhinorrhea (yes proportion) is the percentage with rhinorrhea, and conjunctivitis (yes proportion) is the percentage with conjunctivitis. Moreover, for the ZINB time series model, age was categorized into six levels (<1, 1-4, 4-9, 9-15, 15-19, >19) so that maximization of the likelihood function could converge and to ease interpretation [27].
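One way such a series could be assembled from individual records is sketched below. It assumes that proportions are computed per calendar month among all records of that month, which is an interpretation and not something stated explicitly above. The record keys (`month`, `positive`, `male`, `urban`) and the toy records are hypothetical, and months without any record are not handled.

```python
from collections import defaultdict

def monthly_series(records, binary_keys):
    """Aggregate individual records into monthly case counts and level proportions."""
    by_month = defaultdict(list)
    for record in records:
        by_month[record["month"]].append(record)

    rows = []
    for month in sorted(by_month):
        group = by_month[month]
        # Response: number of measles-positive cases in the month.
        row = {"month": month, "cases": sum(r["positive"] for r in group)}
        # Covariates: share of records at the 'yes' (coded 1) level of each variable.
        for key in binary_keys:
            row[key + "_prop"] = sum(r[key] for r in group) / len(group)
        rows.append(row)
    return rows

# Hypothetical individual-level records.
records = [
    {"month": "1997-04", "positive": 1, "male": 1, "urban": 1},
    {"month": "1997-04", "positive": 0, "male": 0, "urban": 1},
    {"month": "1997-05", "positive": 0, "male": 1, "urban": 0},
    {"month": "1997-05", "positive": 0, "male": 1, "urban": 1},
]
series = monthly_series(records, ["male", "urban"])
```

Each row of `series` then holds one month's count together with its covariate proportions, which is the shape a count time series regression expects.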
Software
In this study, all analyses were implemented in R version 3.6.3 using the randomForest, e1071, rpart, CORElearn, ZIM, and rminer packages.