Data
The data used in this study were extracted from an online cross-sectional survey of 15,366 university students from ASEAN countries. The target universities consisted of 17 ASEAN University Network (AUN) member universities across seven ASEAN countries, namely Brunei Darussalam, Indonesia, Malaysia, the Philippines, Singapore, Thailand, and Vietnam.
The questionnaire was developed through several rounds of consultation meetings with experts from the AUN Health Promotion Network committee and member universities. The selected measurement tools were widely used and validated in multiple countries (Appendix A). Features were extracted based on the focus of this study. Mental well-being was measured using the shortened Warwick-Edinburgh Mental Well-being Scale (WEMWBS), a reliable and valid tool for university students. The WEMWBS score was dichotomized into “poor well-being” (7.00-17.99) and “good well-being” (≥18.00).
Physical activity (PA) was measured using the Global Physical Activity Questionnaire (GPAQ) version 2.0. Low PA was defined as less than 600 metabolic equivalent (MET)-minutes/week, i.e., failing to meet the minimum energy-expenditure recommendation for physical activity. The number of sport activities was also collected and categorized into none, one to three, four to six, and more than six activities per week.
Health-risk behaviors, including consumption of tobacco, alcohol, fruits and vegetables, salt, and sugar-sweetened beverages, were measured using items from existing instruments. For tobacco consumption, smoking status was dichotomized into “Yes” (current daily smokers) and “No” (not current smokers). For alcohol consumption, students were asked whether or not they drink alcohol. For fruit/vegetable consumption, students were asked how many servings of fruits/vegetables they usually eat each day, and consumption of ≥5 servings/day was considered healthy. Consumption of snacks/fast food was assessed by asking how many days per week students eat fast food; students who consumed fast food every day were categorized as “Yes” and the remaining responses were collapsed into “No.” Salt intake was assessed by asking whether they added salt to their food before eating (<1 teaspoon to ≥3 teaspoons); adding ≥1 teaspoon, or 6 g/day, was considered excessive sodium intake. Students were also asked how many days per week they drank sugar-sweetened beverages, and responses were handled in the same way as fast-food consumption. Participants provided demographic information including age, gender, GPA (the grading system for students’ academic performance), and body mass index (BMI). An open-ended question about opinions on physical activity was asked to obtain textual data.
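For illustration, these dichotomization rules can be applied in R roughly as follows; the data frame `students` and its column names (`wemwbs_score`, `met_minutes`, `smoke_freq`, `fv_servings`) are hypothetical stand-ins for the actual survey codebook.

```r
# Minimal recoding sketch; all column names are illustrative, not the
# actual survey variable names.
students$wellbeing    <- ifelse(students$wemwbs_score >= 18, "good", "poor")
students$low_pa       <- ifelse(students$met_minutes < 600, "yes", "no")
students$daily_smoker <- ifelse(students$smoke_freq == "daily", "Yes", "No")
students$healthy_fv   <- ifelse(students$fv_servings >= 5, "yes", "no")
```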
Ethical approval was obtained from the institutional review board of each university prior to conducting the study (See Declarations).
Data preprocessing
Data-cleaning procedures were employed, including removal of ineligible cases, duplicate responses, responses with more than 50% missing values (listwise deletion), and invalid questionnaire responses. A total of 15,366 remaining cases were used in the subsequent analysis. Missing data in these valid cases were handled with multiple imputation, specifically MICE (Multivariate Imputation via Chained Equations) with 10 imputations to replace missing values with predicted values, using the R package mice (Zhang 2016). The dataset was unbalanced with respect to the binary outcome of negative or poor mental well-being. To avoid potential bias in the AI/ML modeling, the dataset was re-balanced using the Synthetic Minority Oversampling TEchnique (SMOTE) (Chawla et al. 2002).
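A minimal sketch of these two steps, assuming the cleaned cases sit in a data frame `students` with the binary outcome `wellbeing` (names are illustrative); the SMOTE step is shown with the smotefamily package, one of several available R implementations:

```r
library(mice)
library(smotefamily)

# MICE: 10 multiple imputations, then extract one completed dataset
# (analyses can also be pooled across all 10 imputed datasets)
imp <- mice(students, m = 10, seed = 2022)
students_complete <- complete(imp)

# SMOTE re-balancing of the binary outcome (smotefamily expects numeric
# features, so categorical variables would need encoding first)
predictors <- students_complete[, setdiff(names(students_complete), "wellbeing")]
bal <- SMOTE(X = predictors, target = students_complete$wellbeing)
students_balanced <- bal$data   # original + synthetic cases; outcome in `class`
names(students_balanced)[names(students_balanced) == "class"] <- "wellbeing"
```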
Feature selection
According to the principle of parsimony, a simple a priori model often provides a better explanation of a problem than more complex models, because the inclusion of unnecessary features creates intrinsic and extrinsic noise (Naser 2021). Accounting only for key data elements avoids model overfitting, provides better predictive accuracy and generalization, and facilitates practical application (Guan and Loew 2020). Because each type of feature-selection method has its own limitations, three strategies were used to validate the selection of salient variables or features used to train the models in this study. The first strategy was based on the Benjamini-Hochberg False Discovery Rate method, which controls the expected proportion of falsely rejected features in multiple significance testing (Benjamini and Hochberg 1995). With the \(m\) per-feature p-values sorted in ascending order, \({p}_{(1)}\le {p}_{(2)}\le ...\le {p}_{(m)}\), and a target false discovery rate \(\alpha\), the procedure retains the features corresponding to \({p}_{(1)},...,{p}_{(k)}\), where
$$k=\text{max}\left\{i:{p}_{(i)}\le \frac{i}{m}\alpha \right\}.$$
Second, a deterministic wrapper method based on stepwise selection was computed: an iterative process of adding important features to a null set and removing the worst-performing features from the complete feature list (Naser 2021). The final strategy utilized a randomized wrapper method, Boruta, which iteratively removes features that are relatively less statistically significant than random probes (Kursa and Rudnicki 2010). Our aggregate feature-selection technique took the intersection of these three variable-elimination strategies and generated a smaller collection of variables used in the subsequent AI modeling.
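A sketch of the three strategies and their intersection, under the same illustrative names as above; the univariate chi-squared tests stand in for whatever per-feature significance tests were used, and `wellbeing` is assumed to be a two-level factor:

```r
library(MASS)
library(Boruta)

# 1. Benjamini-Hochberg FDR control over per-feature p-values
features <- setdiff(names(students_balanced), "wellbeing")
pvals <- sapply(features, function(v)
  chisq.test(table(students_balanced[[v]], students_balanced$wellbeing))$p.value)
bh_keep <- features[p.adjust(pvals, method = "BH") < 0.05]

# 2. Deterministic wrapper: stepwise selection on a logistic model
full_fit <- glm(wellbeing ~ ., data = students_balanced, family = binomial)
step_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)
step_keep <- attr(terms(step_fit), "term.labels")

# 3. Randomized wrapper: Boruta compares features against random probes
bor <- Boruta(wellbeing ~ ., data = students_balanced)
boruta_keep <- getSelectedAttributes(bor)

# Aggregate: keep only features retained by all three strategies
selected <- Reduce(intersect, list(bh_keep, step_keep, boruta_keep))
```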
Training Machine Learning Classifiers
Classification is a supervised machine learning technique that groups records into sets of homogeneous observations associated with particular classes. Different classifiers, or classification algorithms, are available. In this study, six different classifiers were trained: generalized linear model (glm), k-nearest neighbor (knn), naïve Bayes (nb), neural network (nnet), random forest (rf), and recursive partitioning (rpart).
The generalized linear model, specifically logistic regression, is a linear probabilistic classifier. It models the probabilities of the two classes, in this case positive and negative mental well-being, and estimates class probabilities directly using the logit transform function (Myers and Montgomery 1997).
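As a sketch, using the illustrative `students_balanced` data from above with `wellbeing` as a two-level factor:

```r
# Logistic regression via glm; fitted values are estimated class
# probabilities for the second factor level of the outcome
fit_glm <- glm(wellbeing ~ ., data = students_balanced, family = binomial)
head(predict(fit_glm, type = "response"))
```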
Naïve Bayes predicts class membership probabilities based on the Bayes theorem and the naive assumption that all features are equally important and independent (Dinov 2018). Bayes conditional probability can be expressed as:
$$Posterior\hspace{0.17em}Probability=\frac{likelihood\times Prior\hspace{0.17em}Probability}{Marginal\hspace{0.17em}Likelihood} .$$
Essentially, we want the probability of class level \(L\) given an observation represented as a set of independent features \({F}_{1},{F}_{2},...,{F}_{n}\). The posterior probability that the observation is in class \(L\) is equal to:
$$P({C}_{L}\mid {F}_{1},...,{F}_{n})=\frac{P({C}_{L})\prod _{i=1}^{n}P({F}_{i}\mid {C}_{L})}{\prod _{i=1}^{n}P({F}_{i})},$$
where the denominator, \(\prod _{i=1}^{n}P\left({F}_{i}\right)\), is a scaling factor that represents the marginal probability of observing all features jointly.
For a given case \(X=({F}_{1},{F}_{2},...,{F}_{n})\), i.e., a given vector of features, the naive Bayes classifier computes \(\frac{P({C}_{L})\prod _{i=1}^{n}P({F}_{i}\mid {C}_{L})}{\prod _{i=1}^{n}P({F}_{i})}\) for every class label \(L\) and assigns the most likely class \(\widehat{C}\), i.e., the class with the maximum posterior probability. Analytically, \(\widehat{C}\) is defined by:
$$\widehat{C}=\text{arg}\underset{L}{\text{max}}\frac{P({C}_{L})\prod _{i=1}^{n}P({F}_{i}\mid {C}_{L})}{\prod _{i=1}^{n}P({F}_{i})}.$$
As the denominator is constant across \(L\), the posterior probability above is maximized when the numerator is maximized, i.e., \(\widehat{C}=\text{arg}{\text{max}}_{L}\,P({C}_{L})\prod _{i=1}^{n}P({F}_{i}\mid {C}_{L})\).
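A naïve Bayes sketch with the e1071 package (one common R implementation; data names as above). The `raw` predictions correspond to the posterior probabilities in the formula, and the default class predictions to \(\widehat{C}\):

```r
library(e1071)

fit_nb <- naiveBayes(wellbeing ~ ., data = students_balanced)
predict(fit_nb, students_balanced[1:5, ])                 # argmax class C-hat
predict(fit_nb, students_balanced[1:5, ], type = "raw")   # posterior probabilities
```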
Artificial neural networks, or simply neural nets, simulate the underlying intelligence of the human brain by using a synthetic network of interconnected neurons (nodes) to train the model. The features are weighted by importance, the weighted sum is passed through an activation function, and an output (y) is generated at the end of the process (Dinov 2018). A typical output could be expressed as:
$$y\left(x\right)=f\left(\sum _{i=1}^{n}{w}_{i}{x}_{i}+{w}_{o}b\right).$$
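A worked version of this expression for a single neuron, with a logistic activation and illustrative weights:

```r
f <- function(z) 1 / (1 + exp(-z))   # activation function (logistic)
w <- c(0.4, -0.2, 0.7)               # feature weights w_i
b <- 0.1                             # bias (the w_0 * b term, folded into one constant)
x <- c(1.0, 2.0, 0.5)                # one input case
y <- f(sum(w * x) + b)               # network output y(x)
```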
The random forest classifier is a randomized ensemble of decision trees that recursively partitions the dataset into roughly homogeneous terminal nodes. It may contain hundreds to thousands of trees, each grown on a bootstrapped sample of the original data. The final decision is obtained when the tree-branching process terminates, yielding the expected prediction given the series of splits in the tree (Dinov 2018; Nguyen, Wang, and Nguyen 2013).
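A minimal random forest sketch with the randomForest package (illustrative data names as above):

```r
library(randomForest)

# Ensemble of 500 trees, each grown on a bootstrap sample of the data;
# `wellbeing` must be a factor for classification
fit_rf <- randomForest(wellbeing ~ ., data = students_balanced, ntree = 500)
```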
Recursive partitioning (RPART) is another decision-tree classification technique that works well with variables with a definite ordering and unequal distances. The tree is built similarly to random forest, resulting in a complex model; however, the RPART procedure then prunes the full tree back into nested sub-trees based on cross-validation. The final sub-tree model provides the decision with the ‘best’, i.e., lowest, estimated cross-validation error (Therneau and Atkinson 1997).
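A sketch of this grow-then-prune procedure with the rpart package: the full tree's complexity table records the cross-validated error of each nested sub-tree, and the tree is pruned at the complexity parameter that minimizes this error:

```r
library(rpart)

fit_full <- rpart(wellbeing ~ ., data = students_balanced, method = "class")
best_cp  <- fit_full$cptable[which.min(fit_full$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit_full, cp = best_cp)   # sub-tree with lowest CV error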
The caret package was used for automated parameter tuning, with the repeatedcv re-sampling method set to 15-fold cross-validation repeated over 10 iterations (Kuhn 2009).
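A sketch of this tuning setup; the method string selects the learner ("rf" is shown, but any of the six listed above can be substituted):

```r
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 15, repeats = 10)
fit  <- train(wellbeing ~ ., data = students_balanced,
              method = "rf",      # or "glm", "knn", "nb", "nnet", "rpart"
              trControl = ctrl)
```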
In this study, random forest outperformed the other machine learners. However, general decision trees may overfit to noise in the training dataset. To overcome this, we implemented bootstrap aggregation (bagging) and boosting to reduce variance and bias, respectively.
Bagging decreases the variance of the prediction model by essentially generating additional training data from the original dataset using bootstrapping methods. Boosting reduces bias in parameter estimation by sub-setting the original data to produce a series of models and boosting their performance (in this case, measured by accuracy) by combining them (Dinov 2018).
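Both strategies can be sketched via caret method strings, reusing the `ctrl` resampling scheme above; "treebag" (bagged trees) and "gbm" (gradient-boosted trees) are two of several possible implementations:

```r
fit_bag   <- train(wellbeing ~ ., data = students_balanced,
                   method = "treebag", trControl = ctrl)   # bagging
fit_boost <- train(wellbeing ~ ., data = students_balanced,
                   method = "gbm", trControl = ctrl,
                   verbose = FALSE)                        # boosting
```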
Model performance metrics
Classification model performance cannot be evaluated with a single metric; therefore, a number of metrics were used to assess model performance, including Accuracy, Error rate, Kappa, Sensitivity, Specificity, Area Under the Receiver Operating Characteristic Curve (AUC), and Gini Index.
In binary classification, accuracy is calculated using the \(2\times 2\) confusion matrix, which can be expressed as:
$$accuracy=\frac{TP+TN}{TP+TN+FP+FN}=\frac{TP+TN}{\text{Total number of observations}} .$$
where a true positive (TP) is an observation correctly classified as “yes” or “success”; a true negative (TN) is an observation correctly classified as “no” or “failure”; a false positive (FP) is an observation incorrectly classified as “yes” or “success”; and a false negative (FN) is an observation incorrectly classified as “no” or “failure” (Dinov 2018).
The error rate, in contrast, is the proportion of misclassified observations, calculated as:
$$error\hspace{0.17em}rate=\frac{FP+FN}{TP+TN+FP+FN}=\frac{FP+FN}{\text{Total number of observations}}=1-accuracy .$$
The accuracy and error rate add up to 1; therefore, 95% accuracy implies a 5% error rate (Dinov 2018).
The Kappa statistic adjusts for the possibility of a correct prediction by chance alone and evaluates the agreement between the expected truth and the machine learning prediction. When kappa = 1, there is perfect agreement between the computed prediction and the expected truth; when kappa = 0, the agreement is no better than random, by-chance prediction. The Kappa statistic can be expressed as (Dinov 2018):
$$kappa=\frac{P\left(a\right)-P\left(e\right)}{1-P\left(e\right)}.$$
where P(a) and P(e) denote the probabilities of actual and expected (chance) agreement between the classifier and the true values, respectively.
A common interpretation of the Kappa statistic is as follows (Dinov 2018):
- Poor agreement: less than 0.20
- Fair agreement: 0.20-0.40
- Moderate agreement: 0.40-0.60
- Good agreement: 0.60-0.80
- Very good agreement: 0.80-1.00
Sensitivity, also known as the true positive rate, measures the proportion of “success” observations that are correctly classified (Dinov 2018). This can be expressed as:
$$sensitivity=\frac{TP}{TP+FN}.$$
On the other hand, specificity, also known as the true negative rate, measures the proportion of “failure” observations that are correctly classified (Dinov 2018). This can be expressed as:
$$specificity=\frac{TN}{TN+FP}.$$
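Accuracy, error rate, Kappa, sensitivity, and specificity can all be read from the \(2\times 2\) confusion matrix, for example with caret::confusionMatrix (illustrative model `fit` and data names as above):

```r
pred <- predict(fit, newdata = students_balanced)
cm   <- confusionMatrix(pred, students_balanced$wellbeing)

cm$overall["Accuracy"]                       # (TP + TN) / total
1 - cm$overall["Accuracy"]                   # error rate
cm$overall["Kappa"]                          # chance-corrected agreement
cm$byClass[c("Sensitivity", "Specificity")]  # TP and TN rates
```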
The receiver operating characteristic (ROC) curve plots the trade-off between correctly classifying true positives (sensitivity) and avoiding false positives (specificity). The area under this curve (AUC) serves as a proxy for classifier performance, with values near 1 indicating excellent discrimination and 0.5 indicating performance no better than chance (Dinov 2018).
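An ROC/AUC sketch with the pROC package; the predicted-probability column name ("poor") is an assumption about the outcome's factor levels:

```r
library(pROC)

prob <- predict(fit, newdata = students_balanced, type = "prob")[, "poor"]
roc_obj <- roc(response = students_balanced$wellbeing, predictor = prob)
plot(roc_obj)   # sensitivity against 1 - specificity
auc(roc_obj)    # area under the ROC curve
```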
The Gini index is the basis of the variable-importance measure in tree-based models and evaluates information gain using the estimated class probabilities (Dinov 2018). This can be expressed as:
$$GI=\sum _{k}{p}_{k}(1-{p}_{k})=1-\sum _{k}{p}_{k}^{2}.$$
where the summation runs over the \(k\) classes and \({p}_{k}\) denotes the estimated probability of class \(k\).
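The index is straightforward to compute directly from a vector of class proportions:

```r
gini <- function(p) 1 - sum(p^2)   # GI = 1 - sum_k p_k^2
gini(c(0.5, 0.5))   # 0.50: maximally impure two-class node
gini(c(1.0, 0.0))   # 0.00: pure node
```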