Predicting Dental Caries Outcomes in Young Adults Using Machine Learning Approach

OBJECTIVES To predict dental caries outcomes in young adults from a set of longitudinally obtained predictor variables and to identify the most important predictors using machine learning techniques.
METHODS This study was conducted using the Iowa Fluoride Study dataset. The predictor variables (sex, mother’s education, family income, composite socio-economic status (SES), caries experience at ages 9, 13, and 17, and the cumulative estimates of risk and protective factors, including fluoride, dietary, and behavioral variables, from ages 5–9, 9–13, 13–17, and 17–23) were used to predict the age 23 D2+MFS count. Four machine learning models (LASSO regression, generalized boosting machines (GBM), negative binomial regression (NegGLM), and extreme gradient boosting (XGBOOST)) were compared under 5-fold cross-validation with nested resampling.
RESULTS The mean cavitated-level caries experience at age 23 (D2+MFS count) was 4.75. The predictive analysis found LASSO to be the best-performing model (compared to GBM, NegGLM, and XGBOOST), with a root mean square error (RMSE) of 0.70 and a coefficient of determination (R2) of 0.44. After dichotomization of the predicted and observed values of the LASSO regression, the classification results showed accuracy, precision, recall, and ROC AUC of 83.7%, 85.9%, 93.1%, and 68.2%, respectively. Previous caries experience at age 13 and age 17 and sugar-sweetened beverage intakes at age 13 and age 17 were found to be the four most important predictors of cavitated caries count at age 23.
CONCLUSION Our machine learning model showed high accuracy and precision in the prediction of caries in young adults from longitudinally obtained predictor variables.
Our model could, in the future, after further development and validation with other diverse population data, be used by public health specialists and policy-makers as a screening tool to identify the risk of caries in young adults and apply more targeted interventions. However, data from a more diverse population are needed to improve the quality and generalizability of caries prediction.


INTRODUCTION
Dental caries is a chronic infectious disease that destroys tooth structure and has significant public health implications, including in young adults. 1 The etiology of dental caries is multifactorial, with the most central etiological factors being a cariogenic diet, the action of bacteria, susceptible tooth structure, and time. 1 Few studies have explored the prevalence of cavitated caries in young adults and associated etiological factors. Brown et al.'s study using data from two National Health and Nutrition Examination Surveys (NHANES I and NHANES III) found mean Decayed, Missing, and Filled Surfaces (DMFS) scores of 24.78 and 13.85 among participants aged 18 to 25 years in NHANES I and NHANES III, respectively. 2 Also, Ismail et al.'s study using the 1982-1983 Hispanic Health and Nutrition Examination Survey (HHANES) found a total mean DMFT score of 6.0 among Mexican-Americans aged 18 to 24. 3 Garcia-Cortes et al. 4 saw a high caries prevalence of 86.3% (mean DMFT of 5.76) among applicants aged 22 to 25 to San Luis Potosi University, Mexico, with females having significantly higher DMFT than males (4.32 ± 4.01 vs 3.78 ± 3.78; p = 0.04). Drachev et al. 5 found high caries prevalence (96.0%; mean DMFT of 7.58) among Russian students, with higher mean DMFT in high socioeconomic status (SES) students compared to low SES students. A cohort study of Swedish children following clinical and radiographic examinations (age 20 mean DFS = 5.8) showed that previous caries experience at a younger age (ages 3, 6 and 15) was associated with caries experience at age 20 (p < 0.05). 6 Jamieson et al.'s study on Australian Aboriginal young adults aged 16 to 20 found a mean DMFT of 4.84 and that sex and sweets intake were significantly associated with higher mean DMFT. 7
Given the multifactorial and complex etiology of caries, there is a need for studies that use robust predictive modeling techniques like machine learning (ML) to accurately identify the best predictors of caries from complex datasets. Supervised machine learning is a type of artificial intelligence used to predict the value of an outcome measure based on several input measures. An ideal ML model has a favorable bias-variance trade-off (i.e., no model underfitting or overfitting). 8 ML provides a robust approach for the identification and selection of the most important predictors, while avoiding convergence issues and some aspects of the curse of dimensionality (Hughes phenomenon), 9 which are common issues in traditional statistical modeling with a large number of variables.
There are substantial gaps in our understanding of the predictive effects of longitudinally obtained dietary/behavioral and fluoride variables on caries outcomes, especially in young adulthood, which is one of the most active stages of life. Previous ML studies 10,11 have focused on prediction of caries outcomes in children, and we found no studies on the prediction of dental caries in young adults with a very wide range of comprehensive and cumulative (childhood) exposure variables using a machine learning approach. The objectives of this study were: 1) to predict the dental caries outcome in young adults using machine learning techniques and 2) to identify the most important predictors of the dental caries outcome from a large set of sociodemographic, dietary, fluoride, and behavioral variables.

METHODS
This was a secondary analysis of data collected from age 5 to 23 within the Iowa Fluoride Study (IFS), a prospective cohort study that completed data collection in February 2019. The recruitment of IFS participants was done in the post-partum wards of eight Iowa hospitals from March 1992 to February 1995. 12 The participants had dental exams approximately every four years (except for ages 17 to 23, an interval of 6 years) and received oral health questionnaires every six months. Approval for the IFS was obtained from the University of Iowa Institutional Review Board for all components and procedures before the initiation of the study and for each examination, with annual renewal, as well as review when any modifications were made. 13

(Appendix II)
The IFS dental examinations were done by one of three trained and calibrated dentists using portable dental equipment and halogen headlights, with ongoing inter-examiner reliability assessment. 13 After drying the teeth, a DenLite® mirror (Welsh-Allyn Medical Products, Inc., Skaneateles Falls, NY) was used to enhance lighting and provide transillumination. The examinations were based primarily on visual assessment, without radiographs; however, gentle explorer probing was used to confirm scoring when in doubt. They were performed either at the University of Iowa College of Dentistry (Iowa City, IA) or at remote locations (Waterloo, IA, and Des Moines, IA) for those who could not make it to Iowa City. Caries status of each surface was recorded as either sound (S), arrested (D0), non-cavitated (D1), or cavitated (D2+); those with restorations were recorded as filled (F); missing teeth due to caries were recorded as missing (M) surfaces; and dental sealants were recorded separately. 13
The inclusion criteria for these analyses were 1) completion of the dental exams at age 23 and 2) having sufficient cumulative exposure trapezoidal AUC estimate data (see Appendix I) for at least 35 out of the 51 independent variables for the time periods from ages 5 to 9, 9 to 13, 13 to 17, and 17 to 23.
The primary outcome variable, the age 23 cavitated caries (D2+MFS) count, was defined as the sum of decayed (D2+, cavitated), missing (M), and filled (F) surfaces at age 23. A total of 51 independent variables were considered, including four sociodemographic variables and 47 other predictor (cumulative exposure) variables. The sociodemographic variables were sex, family income level, mother's level of education, and composite SES, with the last three assessed with data from a questionnaire in 2007. The main predictor variables were the cumulative exposure AUC variables for the periods from ages 5 to 9, 9 to 13, 13 to 17, and 17 to 23. They were defined for daily brushing frequency category, daily fluoride intake from combined sources, concentration of fluoride from home water, and the beverage variables (daily sugar-free beverage intake (no sugar added), daily milk intake, daily 100% juice intake, daily sugar-sweetened beverage intake, frequency of sugar-free (water-based) beverage consumption, frequency of milk consumption, frequency of 100% juice consumption, and frequency of sugar-sweetened beverage consumption). Additional variables were dental caries experience at ages 9, 13, and 17. Details of the variable definitions are provided in Appendix II.
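As a rough illustration of how a cumulative exposure estimate of this kind can be computed, the sketch below applies the trapezoidal rule to repeated questionnaire responses within one age period. It is written in Python for brevity (the study's analyses used R); the function names, the example values, and the division by period length to obtain a time-weighted average are assumptions made for illustration. The study's actual definitions are those given in Appendices I and II.

```python
def trapezoidal_auc(ages, values):
    """Area under the exposure-versus-age curve by the trapezoidal rule."""
    if len(ages) != len(values) or len(ages) < 2:
        raise ValueError("need at least two (age, value) pairs of equal length")
    # Sort by age so that irregularly ordered questionnaire returns are handled.
    pairs = sorted(zip(ages, values))
    area = 0.0
    for (age0, val0), (age1, val1) in zip(pairs, pairs[1:]):
        # Each interval contributes width * average height.
        area += (age1 - age0) * (val0 + val1) / 2.0
    return area


def time_weighted_mean(ages, values):
    """AUC divided by the length of the period (assumed normalization)."""
    span = max(ages) - min(ages)
    if span == 0:
        raise ValueError("all observations share the same age")
    return trapezoidal_auc(ages, values) / span


# Hypothetical daily sugar-sweetened beverage intake (cups/day), ages 9 to 13.
example_ages = [9, 10, 11, 12, 13]
example_intake = [1.0, 2.0, 2.0, 1.0, 0.0]
example_auc = trapezoidal_auc(example_ages, example_intake)
example_mean = time_weighted_mean(example_ages, example_intake)
```

Dividing by the period length keeps periods of different duration (for example, 13 to 17 versus 17 to 23) on a comparable scale, which is one plausible reason to normalize in this way.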

Exploratory data analysis:
Descriptive statistics were determined for the person-level age 23 D2+MFS count and all independent variables. Bivariate analyses were conducted to ascertain the relationships between the dependent variable and each of the 51 independent variables. Mann-Whitney U tests were used to explore the relationships between age 23 D2+MFS count and sex and brushing frequency category; Kruskal-Wallis tests were used to explore the relationships between age 23 D2+MFS count and family income, mother's level of education, and composite SES. Spearman (rho) correlation tests were conducted to assess the relationships between the age 23 D2+MFS count and each of the continuous independent variables (home fluoride concentration, total fluoride intake, and the beverage variables). All statistical analyses were performed with R software version 4.1.2, with the two-tailed alpha level set at 0.05 for bivariate analyses.

Multivariable Predictive Modeling:
Multivariable predictive modeling was performed using four ML models: Least Absolute Shrinkage and Selection Operator (LASSO) regression, 8 negative binomial regression, generalized boosting machines (GBM), 14,15 and extreme gradient boosting (XGBOOST). 16 All models were fitted using the MachineShop 17 package for R (see Appendix III for descriptions of LASSO, GBM and XGBOOST). These models were chosen because of their abilities to 1) perform well with high-dimensional data, 2) perform variable selection, and 3) handle different data types and distributions with very few assumptions (details in Appendix III).
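For reference, LASSO regression in its standard form estimates the coefficients by minimizing a penalized least-squares criterion. The expression below is the textbook formulation rather than one taken from the study's appendix; the symbols (n participants, p predictors, penalty λ) and the 1/(2n) scaling are notational assumptions, and software packages differ in how they scale the loss.

```latex
\hat{\beta}^{\mathrm{LASSO}}
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2n}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j\Bigr)^{2}
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert ,
  \qquad \lambda \ge 0
```

The L1 penalty can shrink some coefficients exactly to zero, which is what allows LASSO to perform variable selection; λ is the tuning parameter that would be chosen by cross-validation.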
Data preprocessing: Prior to fitting the ML models, the k-nearest neighbor (KNN) imputation technique was used to handle the remaining missing data for these participants. 18 Additional data preprocessing (scaling and normalization) was performed using the recipes package in R. 19
Model fitting (training and testing): The training and testing of all models were done using the nested resampling technique with 5-fold cross-validation, which consists of an inner resampling loop for model tuning and an outer resampling loop for testing model performance. 20 We chose the nested resampling technique because it uses different portions of the data to iteratively perform training and testing, thereby giving a less optimistically biased performance estimate. In the outer resampling loop, we had five training/test sets (each with an 80 to 20 ratio). On each of these outer training sets, we optimized the model by performing parameter tuning and feature selection in the inner resampling loop. The optimized models then were fitted on the outer training sets and their performances were evaluated on the outer test sets. This technique gives a more honest estimation of model performance, although it is computationally expensive. 20 The models were tuned using the TunedModel function in the MachineShop package, with the tuning parameters chosen by cross-validation. 17
Model evaluation: Model performance was assessed using root mean square error (RMSE), mean absolute error (MAE), and the R-squared value (coefficient of determination). Lower RMSE and MAE values indicate better model performance, while a higher R-squared value indicates better model performance. The best-performing model was selected based on the RMSE and R2. MAE was also reported to better understand the overall model performance.
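The structure of nested resampling described above can be sketched as index bookkeeping. The following Python example uses only the standard library and is not the study's code, which relied on the MachineShop package in R. The function names, the random seeds, and the choice of five inner folds are assumptions; the text states 5-fold cross-validation but does not spell out the inner loop's fold count.

```python
import random


def k_fold_indices(indices, k, seed):
    """Shuffle the given indices and deal them into k roughly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=5, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    inner_splits is a list of (inner_train, inner_validation) pairs drawn
    only from outer_train, so tuning never sees the outer test set.
    """
    outer_folds = k_fold_indices(range(n_samples), outer_k, seed)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [idx for j, fold in enumerate(outer_folds)
                       if j != i for idx in fold]
        inner_folds = k_fold_indices(outer_train, inner_k, seed + 1 + i)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [idx for n, fold in enumerate(inner_folds)
                           if n != m for idx in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


# 258 matches the number of participants reported in the Results.
splits = list(nested_cv_splits(258))
```

In use, hyperparameters would be chosen by comparing candidate settings across `inner_splits`, the chosen setting refitted on `outer_train`, and the error measured once on `outer_test`.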
The metrics for model performance were obtained by averaging the scores obtained from nested resampling with 5-fold cross-validation.
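The three regression metrics named above have standard definitions, sketched below in plain Python together with a simple average across folds. The helper names and the toy numbers are illustrative assumptions, and the R-squared shown is the common 1 − SS_res/SS_tot form, which may differ in detail from the implementation used by the study's software.

```python
import math


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def average_over_folds(fold_scores):
    """Average a metric over the outer test folds."""
    return sum(fold_scores) / len(fold_scores)


# Toy outer-fold example (hypothetical values, not study data).
toy_observed = [0, 1, 2, 3]
toy_predicted = [0, 1, 2, 5]
toy_rmse = rmse(toy_observed, toy_predicted)
toy_mae = mae(toy_observed, toy_predicted)
toy_r2 = r_squared(toy_observed, toy_predicted)
```

Because RMSE squares the errors before averaging, a single large miss raises it more than it raises MAE, which is one reason the two are often reported together.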
For easier interpretability, the observed and predicted values from the selected best model were first discretized and then dichotomized into dental caries as Yes (if values were above zero, indicating cavitated caries) or No (if values were zero, indicating no caries present). The following metrics then were used to show the model performance: accuracy, receiver operating characteristic area under the curve (ROC AUC), positive predictive value (precision), and sensitivity (recall). Details of the code are provided in Appendix X. This study was reported using both the STROBE (Appendix XI) and TRIPOD (Appendix XII) guidelines.
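The dichotomization step and the resulting classification metrics can be illustrated as follows. This is a minimal Python sketch, not the code in Appendix X. It assumes that "discretized" means rounding to the nearest non-negative integer, with halves rounded up; the function names and example values are hypothetical, and ROC AUC is left out for brevity.

```python
import math


def dichotomize(count_value):
    """Round to the nearest non-negative integer, then code as caries yes/no."""
    discretized = max(0, math.floor(count_value + 0.5))  # assumed rounding rule
    return discretized > 0


def classification_metrics(observed, predicted):
    """Accuracy, precision (PPV), and recall (sensitivity) after dichotomizing."""
    obs = [dichotomize(v) for v in observed]
    pred = [dichotomize(v) for v in predicted]
    tp = sum(o and p for o, p in zip(obs, pred))
    tn = sum((not o) and (not p) for o, p in zip(obs, pred))
    fp = sum((not o) and p for o, p in zip(obs, pred))
    fn = sum(o and (not p) for o, p in zip(obs, pred))
    return {
        "accuracy": (tp + tn) / len(obs),
        # Precision: share of predicted caries cases that truly had caries.
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        # Recall: share of true caries cases that were predicted as caries.
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }


# Hypothetical observed counts and model predictions (not study data).
observed_counts = [0, 2, 5, 0, 1, 3]
predicted_counts = [0.2, 1.4, 3.9, 0.8, 0.3, 2.2]
metrics = classification_metrics(observed_counts, predicted_counts)
```

In this toy example one caries-free participant is predicted to have caries and one participant with caries is missed, so precision and recall are both 0.75.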

RESULTS
There were 258 participants who fulfilled the inclusion criteria, with 41 participants (16%) having at least one imputed data point and 3,458 out of 18,126 data points (19%) imputed using the k-nearest neighbor technique. There was favorable tooth-level inter-examiner reliability, with kappa statistics of 0.73, 0.71, 0.77, and 0.82 at ages 9, 13, 17, and 23, respectively. Table 1 summarizes the frequency distributions of the categorical predictor variables. Fifty-eight percent of participants were female, and 13% of the subjects' family income levels were below $40,000, with 48% at $80,000 and above. About 14%, 32% and 54% of participants were in the lower, middle, and higher SES groups, respectively. The assessment of variable importance (Table 5) showed that 4 of the 51 independent variables (age 13 caries count, age 17 caries count, the amount of sugar-sweetened beverage intake from age 9 to 13, and the frequency of sugar-sweetened beverage intake from age 13 to 17) were important in the prediction of, and all were positively associated with, the cavitated caries count at age 23. The age 17 caries count was the most important variable in the prediction of the age 23 D2+MFS count (see Appendix VIII for the variable importance plot).
Relative influence = the percentage contribution of the predictor variable in the prediction of the outcome variable relative to other variables in the model.

DISCUSSION
Dental caries is a chronic infectious disease with significant public health implications, including in young adults. Our study is one of the first to use machine learning to predict cavitated caries outcomes in young adults using longitudinally obtained behavioral and dietary variables.
Our study found a relatively high prevalence of cavitated caries, similar to the findings from the Garcia-Cortes et al. 4 and Jamieson et al. 7 studies conducted within the same age group. However, other studies had much higher mean DMFT/S and percentage prevalence (D2+MFS > 0) for this age group compared to our study. 5,6 These variations might have been due to differences in the studies' caries assessment methods, geographic locations, and time periods, with caries rates now generally lower overall than in the past.
Exploratory data analysis showed that the age 23 D2+MFS count was significantly correlated with family income and composite SES, agreeing with Ismail et al.'s study 3 but contradicting Drachev et al.'s study. 5 Also, the correlations between the age 23 D2+MFS count and previous caries experience at ages 9, 13, and 17 found in our study are consistent with conventional knowledge and the findings of other studies. 21,22,23
Out of all four of the ML models assessed, LASSO regression was the best-performing model, followed by GBM, then GLM (negative binomial), and lastly the XGBOOST model. The LASSO model had the lowest error (RMSE and MAE) and the highest R-squared compared to the rest of the models. This is contrary to the conventional approach in traditional statistics, where count data are usually analyzed using Poisson regression or negative binomial regression models. This demonstrates one of the capabilities of ML: objectively identifying the models that best fit and explain the variability in the data, rather than relying on statistical assumptions as in regular statistics. Based on the R-squared, only about 44% of the variability in the age 23 caries counts was explained by the variables in the model. A limitation of using only R-squared as a performance metric is that it cannot indicate prediction bias in a model (i.e., bias-variance trade-off). It does not tell whether or not the model adequately fits the data.
With the discretization and dichotomization of the observed and predicted values of the LASSO model, the model was 84% accurate overall in predicting whether or not a young adult will have caries given their previous caries experience and exposure to dietary, fluoride and behavioral elements. Our study's precision (86%) and recall (93%) mean that only 14% of those predicted to have had caries experience did not actually have it, while only 7% of those who had caries experience were misclassified as having had no caries. There are no other similar studies in children, adolescents, young adults, middle-aged, or older adults with which to compare our findings.
We identified four variables (age 13 caries experience, age 17 caries experience, the amount of sugar-sweetened beverage intake from age 9 to 13, and the frequency of sugar-sweetened beverage intake from age 13 to 17) as the most important ones in the prediction of age 23 cavitated caries counts. Age 17 caries experience was the most important predictor of caries counts in young adults, followed by the age 13 caries count, then the amount of sugar-sweetened beverage intake at age 13, and finally, the frequency of sugar-sweetened beverage intake at age 17. This agrees with our hypotheses and conventional knowledge that there are positive associations between caries outcomes and consumption of sugar-sweetened beverages and previous caries experience. Other variables, like total fluoride intake, SES, and brushing frequency, which were significant in the bivariate analysis, were not selected in the final model. Our finding also suggests that it takes about 5 to 10 years for the teeth to show obvious cavitation following exposure to sugar-sweetened beverages. A policy implication of this finding is that it may take about 5 to 10 years to truly observe the effects of preventive oral health interventions, such as sugar taxes, on caries outcomes at a population level.
The limitations of the study include the moderate sample size, the inability to include all possible explanatory variables, such as genetic variables, and the non-generalizability of the findings due to the local nature of the data (mostly non-Hispanic white and higher than average SES Iowans). We attempted to address the issue of limited sample size by using the nested resampling technique with cross-validation. The addition of other variables, like genetic factors, oral bacterial profiles, dental visits, malocclusion, and other systemic diseases, might help improve the accuracy and precision of the predictive models.
This study is unique and innovative because it is the first study to use machine learning to predict a cavitated caries experience outcome in young adults using longitudinally obtained fluoride, dietary, and behavioral variables. The longitudinal predictor variables and the use of data from prior years to make predictions add some level of temporality to our study, allowing us to attribute some level of causality to our study findings and prediction. The use of nested resampling with cross-validation helped minimize bias in prediction by ensuring multiple portions of the data were prospectively used in the prediction of the caries outcome. Finally, unlike regular statistical modeling, the choice of an ML model like LASSO regression allowed for dimensionality reduction and feature (variable) selection, as well as assessment of variable collinearity and possible interactions among predictor variables.

CONCLUSION
Our ML approach generated an accurate, sensitive, and precise model for the prediction of caries in young adults using longitudinally obtained exposure variables. Our model suggests that continued exposure to a sugary diet for about 5 to 10 years could result in cavitated caries. Our ML algorithm could, in the future, after further development and validation with other diverse population data, be used by dentists and non-dentists as a screening tool to identify the risk of caries in young adults. This will facilitate the translation of caries research into actionable insights that can help improve the quality of life of young adults.

Declarations
Ethics approval and consent to participate
Approval for the Iowa Fluoride Study was obtained from the University of Iowa Institutional Review Board for all components and procedures of the study. Informed consent was obtained from the participants prior to the examinations and questionnaires during age 23 assessments, with assent obtained at ages 13 and 17. Consent also

Table 1
Descriptive analyses of the categorical independent variables.
Cumulative exposure variable based on trapezoidal AUC estimates and transformed into two categories. Composite SES was defined based on the combination of two variables, mother's educational level and family income (see Appendix II for details of the variable definition).
As shown in Table 2, the prevalence of cavitated caries at age 23 (age 23 D2+MFS count) was 69.1%, with mean age 23 D2+MFS [...] milk intake per day (cups/day); 0.61, 1.16, 1.42, and 1.76, respectively, for 100% juice intake (cups/day); and 0.61, 1.16, 1.42 and 1.76, respectively, for intake of sugar-sweetened beverages (cups/day). Also, mean caries (D2+MFS) experience at ages 9, 13 and 17 was 0.46, 1.15 and 2.94, respectively (see Appendix IV for more details about the descriptive statistics).
As shown in Table 4, the best-performing model was the LASSO regression, with an RMSE of 0.70, R2 of 0.44, and MAE of 0.48. The GBM and the negative binomial GLM also performed fairly well, with RMSE scores of 0.74 and 0.76, respectively. The worst-performing model was XGBOOST, with an RMSE score of 0.79. More details on the model performance are provided in Appendix V. A boxplot showing the comparison of the performance metrics (RMSE, R2, and MAE) across all four ML models can be found in Appendix VI. The observed values were found to be calibrated well with the predicted values, as shown in the calibration plot (Appendix VII). After dichotomization of the LASSO model's values, the classification results (Table 4) showed an accuracy of 83.7%, precision (positive predictive value) of 85.9%, recall (sensitivity) of 93.1%, and ROC AUC of 80.6%.

Table 4
Generalization performance of all the predictive models and performance of the LASSO regression model (best performing model) on a binary scale.

Table 5
Variable importance and beta-coefficients from the LASSO regression model.