Study population
Data were obtained from a cross-sectional study of 1546 CSs conducted at the Shanghai Cancer Rehabilitation Club (SCRC) from June to September 2018. All recruited participants were new members who registered with SCRC in 2018, had a pathologically confirmed cancer diagnosis, and were able to participate independently in SCRC activities. Participants completed a self-reported structured questionnaire covering basic socio-demographic factors, socioeconomic status, life behavior, health conditions, social support, anxiety and depression. Informed consent was obtained from each study participant. The study was approved by the Medical Research Ethics Committee of the School of Public Health, Fudan University (international registry No. IRB00002408 & FWA00002399).
Target variable: Comorbid anxiety and depression (CAD)
The dependent variable, CAD, was defined as the simultaneous presence of anxiety and depression in a CS. Anxiety and depression were assessed using the Zung Self-Rating Anxiety Scale (SAS)(13) and the Zung Self-Rating Depression Scale (SDS)(14), respectively. Both SAS and SDS are 20-item self-administered scales, and each item is scored from 1 to 4 (rarely, sometimes, frequently, and always). The raw total score of each scale ranges from 20 to 80 and was multiplied by 1.25 to obtain the standard score. An SAS standard score ≥50 indicated anxiety and an SDS standard score ≥53 indicated depression. Respondents who met both criteria were assigned to the CAD group.
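The scoring rule above can be sketched in a few lines of Python. This is an illustration only, not the code used in the study: the function names are hypothetical, the item scores are assumed to be already reverse-coded where the scales require it, and the product with 1.25 is kept unrounded because the text does not state a rounding rule.

```python
SAS_CUTOFF = 50  # standard-score threshold for anxiety (from the text)
SDS_CUTOFF = 53  # standard-score threshold for depression (from the text)


def standard_score(item_scores):
    """Convert 20 item scores (each 1-4) to a Zung standard score."""
    if len(item_scores) != 20:
        raise ValueError("expected exactly 20 item scores")
    if any(score not in (1, 2, 3, 4) for score in item_scores):
        raise ValueError("each item score must be 1, 2, 3 or 4")
    raw_total = sum(item_scores)  # ranges from 20 to 80
    return raw_total * 1.25       # ranges from 25 to 100


def has_cad(sas_items, sds_items):
    """Return True when both the anxiety and depression cut-offs are met."""
    anxious = standard_score(sas_items) >= SAS_CUTOFF
    depressed = standard_score(sds_items) >= SDS_CUTOFF
    return anxious and depressed
```

For example, a respondent answering 2 on every SAS item reaches exactly the anxiety cut-off (standard score 50), but the same answers on the SDS give 50, which is below the depression cut-off of 53.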
Contributing features
Contributing features included basic socio-demographic factors (age, gender and marital status), socioeconomic status (education level, working status and income), life behavior (smoking, drinking, dietary intake frequency of vegetables, fruits, fish, shrimp/crab/shell, eggs, milk, bean products and nuts), health conditions (BMI, comorbid chronic disease, cancer treatment, time since cancer diagnosis, recurrence and metastasis), and social support.
Marital status was classified as married or divorced/widowed/separated/single. Education level was categorized as less than senior high school, senior high school, or above senior high school. Income was categorized as <2000 yuan/month, 2000-4000 yuan/month and ≥4000 yuan/month. BMI was categorized as <18.5 kg/m², 18.5–22.9 kg/m², 23.0–27.4 kg/m² and ≥27.5 kg/m² according to the World Health Organization (WHO) recommendation for Asians. Surgery, radiotherapy, chemotherapy, traditional Chinese medicine, biotherapy, recurrence and metastasis were each coded as yes or no. Time since cancer diagnosis was categorized as <1 year, 1~3 years, 3~5 years and ≥5 years.
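As an illustration of the BMI coding, the following Python sketch maps a BMI value to the four categories listed above. The function name and label strings are hypothetical, and values that fall between the printed bounds (for example 22.95) are assumed to belong to the lower category, i.e. the cut-points are treated as 18.5, 23.0 and 27.5.

```python
from bisect import bisect_right

# Lower bounds of the second, third and fourth categories (kg/m^2).
BMI_CUTS = [18.5, 23.0, 27.5]
BMI_LABELS = ["<18.5", "18.5-22.9", "23.0-27.4", ">=27.5"]


def bmi_category(bmi):
    """Return the category label for a BMI value in kg/m^2."""
    if bmi <= 0:
        raise ValueError("BMI must be positive")
    # bisect_right counts how many cut-points are <= bmi,
    # which is exactly the index of the matching category.
    return BMI_LABELS[bisect_right(BMI_CUTS, bmi)]
```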
The questionnaire included a list of comorbid chronic diseases (CCD): hypertension, hyperlipidemia, hyperuricemia, diabetes mellitus, heart and cardiovascular diseases, stroke, respiratory diseases, digestive diseases, and musculoskeletal diseases. Each type of CCD was coded as “yes” or “no”. Each CCD had to have been clinically diagnosed by a physician at a secondary or tertiary hospital in China.
Smoking was categorized as never smoked, former smoker and current smoker. Drinking frequency was categorized as no, occasionally and usually. Dietary intake frequency of each food item (vegetables, fruits, eggs, fish and nuts) was obtained through a food frequency questionnaire (FFQ) with 4 frequency categories for each food (<1 time/week, 1-2 times/week, 3-4 times/week and ≥5 times/week).
The level of physical activity was measured with the long form of the International Physical Activity Questionnaire (IPAQ)(15). According to the IPAQ scoring guideline, physical activity level was then categorized into three groups: high, moderate, and low.
Social support was assessed with the Multidimensional Scale of Perceived Social Support (MSPSS)(16). The MSPSS comprises 12 items measuring perceived social support from family, friends and a significant other. Respondents rate each item on a 7-point Likert-type scale (from “very strongly disagree” to “very strongly agree”). The total MSPSS score was calculated by summing all item scores and dividing by 12; higher total scores represent higher social support.
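The MSPSS total score described above is a simple item mean. A minimal Python sketch follows; the function name is hypothetical, and the items are assumed to be coded 1 ("very strongly disagree") to 7 ("very strongly agree").

```python
def mspss_total(item_scores):
    """Mean of the 12 MSPSS items, each assumed to be coded 1-7."""
    if len(item_scores) != 12:
        raise ValueError("expected exactly 12 item scores")
    if any(not 1 <= score <= 7 for score in item_scores):
        raise ValueError("each item score must lie between 1 and 7")
    return sum(item_scores) / 12
```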
Machine learning algorithms to predict CAD
Three machine learning algorithms were used to train models to predict CAD: Support Vector Machine (SVM)(17), Decision Tree (DT)(18) and Random Forest (RF)(19). The data set (n=1546) was randomly divided into a training set (n=1160, 75%) used to train the prediction models and a testing set (n=386, 25%) used to evaluate their performance on unseen data. Because building models on a selected subset of features is more efficient than entering every candidate predictor into the model, we applied filter-based feature selection with cross-validation to the entire training set using the R package caret. Thirteen features (gender, cancer site, hypertension, hyperlipidemia, heart disease, stroke, respiratory diseases, digestive diseases, musculoskeletal diseases, smoking, fish intake frequency, egg intake frequency, and social support) were finally selected using simple univariate statistical methods and were then used to build the machine learning models.
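The random 75%/25% partition can be illustrated with a short Python sketch. It is not the procedure used in the study, which was carried out in R: the function name, the seed and the round-half-up rule for the training-set size are assumptions made for the example.

```python
import random


def train_test_split_indices(n, train_fraction=0.75, seed=2018):
    """Shuffle row indices 0..n-1 and cut them into training and testing sets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(n))
    rng.shuffle(indices)
    n_train = int(n * train_fraction + 0.5)  # round half up (assumed rule)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split_indices(1546)
```

With n=1546 this yields 1160 training rows and 386 testing rows, matching the set sizes reported above.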
A 10-fold cross-validation was implemented to tune hyper-parameters and to limit overfitting. The training dataset was split into 10 random folds of equal size; in each iteration, nine folds (90% of the training data) were used to train a prediction model and the remaining fold (10%) was used for validation. This process was repeated 10 times until every fold had served as the validation set. Within the 10-fold cross-validation, the optimal hyper-parameters of each machine learning method were identified by grid search. The area under the receiver operating characteristic (ROC) curve (AUC) was used to assess performance during parameter selection. The optimal parameters selected by grid search were: SVM (cost=3, gamma=0.005); DT (maxdepth=5, minbucket=5, cp=0.005, xval=5); RF (maxnodes=13, mtry=6, ntree=500). Models with these final hyper-parameters were then applied to the testing set to evaluate the performance of the prediction methods on new data. Sensitivity, specificity, accuracy and AUC were used as evaluation measures for the prediction of CAD among CSs. For the machine learning models with optimal hyper-parameters, model-based variable importance evaluation was conducted to quantify the importance of each feature.
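The fold construction and the four evaluation measures can be sketched in plain Python as follows. This is an illustrative sketch, not the mlr implementation used in the study: all function names and the seed are hypothetical, labels are assumed to be coded 1 (CAD) and 0 (no CAD), and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half).

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, validation) index lists; each fold validates exactly once."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal random folds
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),   # requires at least one positive case
        "specificity": tn / (tn + fp),   # requires at least one negative case
        "accuracy": (tp + tn) / len(pairs),
    }


def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In a grid search, each candidate hyper-parameter combination would be fitted on the `train` indices of every fold, scored with `auc` on the matching `validation` indices, and the combination with the highest mean AUC retained.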
Statistical Analysis
Means and standard deviations were calculated for continuous variables, and numbers and percentages were computed for categorical variables. The distribution of CAD across socio-demographic factors, socioeconomic status, life behavior and health conditions was compared using the Chi-square test or Student's t-test. Multivariate logistic regression was used to identify factors associated with CAD, reported as odds ratios (OR) with corresponding 95% confidence intervals (95% CI) adjusted for all other confounders. A stepwise logistic regression(20) was used to build the final model for predicting CAD. The machine learning algorithms were developed with the R package mlr 2.15. All analyses were performed in R version 3.6.1. A two-sided P value < 0.05 was considered significant.
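For readers unfamiliar with how an OR and its 95% CI follow from a fitted logistic regression, the relationship can be sketched as below. The function name is hypothetical and a Wald-type interval is assumed; the text does not state which interval type was used.

```python
import math


def odds_ratio_ci(beta, se, z=1.959964):
    """Return (OR, lower, upper) from a logistic coefficient and its standard error.

    z defaults to the two-sided 95% normal quantile (Wald interval, assumed).
    """
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))
```

A coefficient of 0 corresponds to an OR of 1 (no association), and the association is significant at the two-sided 0.05 level when the interval excludes 1.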