This study was approved by the Ethics Committee of Tehran University of Medical Sciences (ethical code: IR.TUMS.DENTISTRY.REC.1398.099). After the study objectives had been described, written informed consent was obtained from a parent or guardian of each participant under 16 years old.
A cross-sectional case-control study was undertaken on 20 children with ECC (cases) and 20 CF children (controls). The participants were selected randomly from children referred to the dental clinic of Tehran University of Medical Sciences (Tehran, Iran) for routine oral examinations. The oral health status of each participant was determined by three professional dentists. Children aged between 48 and 72 months were eligible for inclusion. CF controls were matched to cases by age and gender.
After the oral examination, a researcher completed a checklist by interviewing the parents. The checklist collected information on the demographic characteristics of the child and parents, dietary intake, birth weight, oral hygiene behaviors, parental education level, medicine intake, and night breastfeeding. The exclusion criteria were: chronic systemic diseases or syndromes; influenza or respiratory tract infection; taking medicine or having received antibiotic therapy within the previous three months; fluoride prophylaxis within the previous year; a medical history of congenital diseases; refusal of the parents to participate or to sign the informed consent form; and refusal or lack of cooperation by the child. Children were assessed using the decayed, missing, and filled teeth (dmft) index based on the WHO Oral Health Surveys: Basic Methods, and ECC was diagnosed according to the diagnostic criteria of the American Academy of Pediatric Dentistry. Children with a dmft index of zero were considered CF. After screening and application of the exclusion criteria, 20 children were selected as cases and 20 as controls.
Unstimulated whole saliva samples were collected by the suction method. Before sampling, the children were in a resting position and had not eaten anything for 30 minutes. Saliva was sampled between 9:00 am and 11:00 am to avoid circadian variation in the case and control groups. A protease inhibitor cocktail (Roche Diagnostics GmbH, Mannheim, Germany) was added immediately after saliva collection was complete. The saliva samples were centrifuged at 10,000×g for 15 min at 4°C, and the supernatant was collected and stored at −80°C.
Determination of salivary cystatin S levels
Cystatin S concentrations were determined using a human cystatin enzyme-linked immunosorbent assay (ELISA) kit (ZellBio GmbH, Ulm, Germany). The samples were thawed at 25°C and assayed in accordance with the manufacturer’s instructions. The absorbance of the samples at 450 nm was measured using a Hyperion ELISA microplate reader. The concentrations of cystatin S were determined by the spectrometer software based on standard curves.
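The conversion from absorbance to concentration by means of a standard curve can be illustrated with a minimal sketch. It assumes piecewise-linear interpolation between standards; the curve model actually applied by the reader software is not stated in this section (ELISA software often fits a four-parameter logistic curve instead). The function name, the standard values, and the ng/mL unit are hypothetical and are not taken from the kit.

```python
from bisect import bisect_left


def concentration_from_absorbance(absorbance, standards):
    """Estimate a concentration by linear interpolation on a standard curve.

    standards: (concentration, absorbance) pairs whose absorbance increases
    with concentration. Readings outside the curve raise ValueError instead
    of being extrapolated.
    """
    points = sorted(standards, key=lambda p: p[1])
    absorbances = [a for _, a in points]
    if not absorbances[0] <= absorbance <= absorbances[-1]:
        raise ValueError("absorbance outside the range of the standard curve")
    hi = bisect_left(absorbances, absorbance)
    if absorbances[hi] == absorbance:
        return points[hi][0]
    (c0, a0), (c1, a1) = points[hi - 1], points[hi]
    return c0 + (c1 - c0) * (absorbance - a0) / (a1 - a0)


# Hypothetical standards: (concentration in ng/mL, absorbance at 450 nm).
STANDARDS = [(0.0, 0.05), (5.0, 0.25), (10.0, 0.45), (20.0, 0.85), (40.0, 1.65)]

# A reading of 0.65 lies halfway between the 10 and 20 ng/mL standards.
sample_concentration = concentration_from_absorbance(0.65, STANDARDS)
```

Refusing to extrapolate is a deliberate choice in this sketch: a reading above the highest standard would normally call for dilution and re-assay rather than a guess.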
Statistical analyses were performed with SPSS software (version 22; SPSS Inc., Chicago, IL, USA) and GraphPad Prism 8.2.1 for Windows (GraphPad Software, San Diego, California). Assuming normally distributed data, the t-test and Levene's test were used to assess the relationship of dental caries and age with salivary cystatin S level. The mean and standard deviation of the cystatin S level were reported. Regression analyses were performed for cystatin S level using the backward stepwise method. The contribution of each variable (including ECC, demographic and clinical characteristics, and nutrition habits) was expressed by the p-value (p) and the standardized coefficient beta (β). Receiver operating characteristic (ROC) curve analysis was performed to evaluate the potential of salivary cystatin S level, alone and in combination with birth weight, to predict ECC. All results are presented as mean ± standard deviation (SD), and p ≤ 0.05 was considered statistically significant.
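For readers unfamiliar with ROC analysis, the following sketch shows how a ROC curve and its area under the curve (AUC) are obtained from a single marker. It is an illustration in plain Python and not the SPSS or GraphPad procedure used in the study. The function names and the data are made up, and the sketch assumes that higher marker values indicate a case; if lower values indicate a case, the scores would be negated first.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    labels: 1 for cases, 0 for controls.
    scores: marker values; higher values are treated as more case-like.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up marker values for illustration only (not study data).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

curve = roc_points(labels, scores)
area = auc(curve)
```

An AUC of 0.5 corresponds to a marker with no discriminating ability, and an AUC of 1.0 to perfect separation of cases and controls.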
Machine learning analysis
To evaluate the effectiveness of the extracted features, including cystatin S, demographic and clinical characteristics, and nutrition habits, we used several supervised machine learning methods: a feed-forward neural network, XGBoost, random forest, support vector machine (SVM), and logistic regression. The feed-forward neural network was implemented in Python (Python Software Foundation, version 3.7) with the open-source deep learning package Keras, a high-level neural network API.
The feed-forward neural network has two hidden layers, each with 32 neurons (Additional file 1: Figures S1, S2). We used the ReLU activation function for the hidden layers and the sigmoid activation function for the output layer. To evaluate the result of the feed-forward neural network model, the binary cross-entropy loss function with a regularization term was used in the following form:
$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[L_i\log\hat{L}_i + (1-L_i)\log\left(1-\hat{L}_i\right)\right] + \lambda\,\Omega(W)$$
where N is the total number of training samples, L_i is the ground-truth label of sample i, \hat{L}_i is the predicted label, λ is the regularization parameter for the loss function, and Ω(W) is the regularization penalty on the network weights W. We used 5-fold cross-validation and the Adam optimizer for both datasets. The hyper-parameter settings for both datasets were 100 for the batch size, 0.001 for the learning rate, 0.2 for the dropout probability, and λ = 0.01 for the regularization parameter of the loss function. The other supervised machine learning methods were implemented with the scikit-learn library, and all classifiers were run with default parameter settings.
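As an illustration of the loss described above, the following sketch computes the mean binary cross-entropy plus a regularization term in plain Python. It is not the Keras code used in the study. The L2 (sum of squared weights) form of the penalty is an assumption of this sketch, since the text only names λ as the regularization parameter. The function name, the clipping constant `eps`, and the example values are hypothetical.

```python
import math


def regularized_bce(labels, predictions, weights, lam=0.01, eps=1e-12):
    """Mean binary cross-entropy plus lam times the sum of squared weights.

    The L2 form of the penalty is an assumption made for illustration.
    """
    n = len(labels)
    log_likelihood = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        log_likelihood += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    penalty = lam * sum(w * w for w in weights)
    return -log_likelihood / n + penalty


# Made-up labels, sigmoid outputs, and weights for illustration only.
labels = [1, 0, 1, 0]
predictions = [0.9, 0.2, 0.6, 0.4]
weights = [0.5, -0.3, 0.1]

loss = regularized_bce(labels, predictions, weights)  # lam = 0.01 as in the text
```

Confident correct predictions drive the cross-entropy term toward zero, while the penalty term discourages large weights, which matters with a sample as small as 40 children.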
During the training phase of the constructed models, the extracted features of each training sample (including salivary cystatin S, demographic and clinical characteristics, oral health status, and dietary intakes) were entered into the models, and the output value indicating the label of the sample was computed. Eighty percent of the data samples were used during the training and validation phase, and the remaining 20% were used during the test phase. ROC curve analysis was performed with the machine learning methods using demographic data alone and using a combination of demographic data and salivary cystatin S level.
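The data partitioning can be sketched as follows. The sketch assumes that the 5-fold cross-validation was applied within the 80% training-and-validation portion, which the text does not state explicitly. It also uses a simple random shuffle without stratification by case or control status. The function names and the seed are hypothetical.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the sample indices and hold out a fraction for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 40 samples in total: 20 cases and 20 controls.
train_val, test = train_test_split_indices(40)
splits = list(k_fold_indices(train_val, k=5))
```

With 40 samples this gives 32 samples for training and validation and 8 for testing; each of the five folds then holds out 6 or 7 of the 32 samples for validation.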