Study design and participants
This study retrospectively collected data from the Chinese Longitudinal Healthy Longevity Survey (CLHLS). The CLHLS is an ongoing, prospective cohort study of community-dwelling older Chinese adults [16, 17]. It covers 22 of the 31 mainland provinces, which together account for 85% of the total population of China. Since the survey started in 1998, follow-up investigations have been conducted every 2 to 4 years, giving 8 waves so far (1998, 2000, 2002, 2005, 2008, 2011, 2014, and 2018). The CLHLS was approved by the Research Ethics Committee of Peking University (IRB00001052-13074), and its data are publicly available from Peking University Open Research Data (https://opendata.pku.edu.cn/dataverse/CHADS). More details of the study design have been described in previous studies, and the survey data have been widely reported to be of high quality [18].
The first 2 waves (1998 and 2000) were excluded because they mainly targeted participants older than 80 years. Thus, the six most recent waves, from 2002 to 2018, were used for analysis, with the 2002 wave serving as the baseline. For trajectory analysis, participants who had complete information on the Chinese version of the Mini-Mental State Examination (MMSE) in at least two waves were included. In total, 3502 participants aged ≥ 65 years were included. The flow chart of the CLHLS follow-up and the sample selection for the current analysis is presented in Fig. 1.
Cognitive assessment
The CLHLS used the Chinese version of the Mini-Mental State Examination (MMSE), whose validity and reliability have been verified [17, 19], as a measure of global cognitive function at each wave. The MMSE contains 24 questions covering 7 dimensions: orientation, food counting within one minute, memory, calculation, drawing, recall, and language. The food counting item was scored as 1 point for each food named, up to a maximum of 7 points. All other questions were scored as 1 point for a correct answer and 0 points for a wrong answer. The total MMSE score ranges from 0 to 30, with higher scores representing better cognitive function. The MMSE scores were used to identify the potential trajectories of cognitive function in the current analysis.
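The scoring rule above can be illustrated with a minimal sketch. The function name `mmse_total` and the constant names are hypothetical, and the count of 23 binary items is inferred from the text (24 questions, one of which is the food counting item); this is not the survey's official scoring code.

```python
from typing import Sequence

FOOD_ITEM_MAX = 7     # cap on the food counting item, as stated in the text
N_BINARY_ITEMS = 23   # inferred: 24 questions minus the food counting item


def mmse_total(binary_items: Sequence[int], foods_named: int) -> int:
    """Return a total MMSE score in the range 0-30.

    binary_items: 23 answers coded 1 (correct) or 0 (wrong).
    foods_named:  number of foods named within one minute.
    """
    if len(binary_items) != N_BINARY_ITEMS:
        raise ValueError(f"expected {N_BINARY_ITEMS} binary items")
    if any(v not in (0, 1) for v in binary_items):
        raise ValueError("binary items must be coded 0 or 1")
    if foods_named < 0:
        raise ValueError("foods_named must be non-negative")
    # One point per food, capped at 7, plus one point per correct answer.
    return sum(binary_items) + min(foods_named, FOOD_ITEM_MAX)
```

Under these assumptions, a participant who answers every binary item correctly and names 12 foods receives the maximum of 30 points, because the food counting item is capped at 7.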
Measurement of predictors
Predictors in this study included sociodemographic characteristics, lifestyles, psychological well-being (PWB), physical activity, and chronic diseases. Sociodemographic characteristics comprised age, sex (man, woman), ethnicity (Han ancestry, minority), education (illiterate, primary school, junior high and above), marital status (unmarried, separated or divorced or widowed, married), residence (rural, urban), and co-residence (yes, no). Lifestyle factors comprised regular fruit intake (yes, no), regular vegetable intake (yes, no), regular tea consumption (yes, no), smoker (yes, no), alcohol drinker (yes, no), regular exercise (yes, no), and leisure activity (yes, no). Leisure activities included housework, personal outdoor activities, garden work, reading newspapers or books, raising domestic animals or pets, playing cards or mahjong, watching TV or listening to the radio, and taking part in social activities. Following a previous study [5], each item had 5 levels, coded as follows: “almost every day” (coded as 1), “not daily, but once a week” (coded as 2), “not weekly, but at least once a month” (coded as 3), “not monthly, but sometimes” (coded as 4), and “never” (coded as 5). The total score of these 8 items ranged from 8 to 40. Because lower codes denote more frequent participation, a high frequency was defined as a total score at or below the 40th percentile, and a low frequency as a total score above the 40th percentile. In the CLHLS waves from 1998 to 2005, 7 items (optimism, conscientiousness, personal control, happiness, neuroticism, loneliness, and self-esteem) were used to assess psychological state. These items have been used to measure PWB in several previous studies [20–22]. Items that are beneficial to PWB were coded as 5 (“always”), 4 (“often”), 3 (“sometimes”), 2 (“seldom”), and 1 (“never”); items that are harmful to PWB were coded in the reverse order. For all items, 0 points were assigned for “unable to answer”.
Therefore, the total PWB score ranged from 0 to 35, with a higher score indicating a better psychological state. Physical activity included activities of daily living (ADL) and instrumental ADL (IADL). ADL was assessed with 6 daily activities (bathing, dressing, continence, using the toilet, indoor transferring, and feeding oneself), and IADL was assessed with 8 instrumental activities (shopping, cooking, visiting neighbors, doing laundry, walking continuously for 1 km, continuously crouching and standing up 3 times, lifting a weight of 5 kg, and taking public transportation) [23]. Each item was coded as 0 (“with complete assistance”), 1 (“with partial assistance”), or 2 (“independently”), so the total scores of ADL and IADL ranged from 0 to 12 and from 0 to 16, respectively. For chronic diseases, self-reported diagnoses of hypertension (yes, no), diabetes (yes, no), and stroke (yes, no) were selected. Detailed measurements of the variables are summarized in Supplementary Table S1. Multiple imputation was applied to reduce the influence of missing predictor values on the analyses.
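The PWB coding rule described above (direct coding for beneficial items, reverse coding for harmful items, and 0 for “unable to answer”) can be sketched as follows. The function names are hypothetical, and the choice of which items count as harmful (`EXAMPLE_HARMFUL_ITEMS`) is an assumption made only for illustration; the text does not list them.

```python
from typing import Dict, Set

# Direct coding for items that are beneficial to PWB, as given in the text.
DIRECT_CODES = {"always": 5, "often": 4, "sometimes": 3, "seldom": 2, "never": 1}

# ASSUMPTION for illustration only: the source does not state which of the
# 7 items are treated as harmful to PWB.
EXAMPLE_HARMFUL_ITEMS: Set[str] = {"neuroticism", "loneliness"}


def pwb_item_score(response: str, harmful: bool) -> int:
    """Score one PWB item; harmful items are reverse-coded (5 becomes 1, etc.)."""
    if response == "unable to answer":
        return 0
    code = DIRECT_CODES[response]
    return 6 - code if harmful else code


def pwb_total(responses: Dict[str, str], harmful_items: Set[str]) -> int:
    """Sum the item scores; with 7 items the total lies between 0 and 35."""
    return sum(
        pwb_item_score(resp, item in harmful_items)
        for item, resp in responses.items()
    )
```

With this coding, the maximum of 35 is reached when every beneficial item is answered “always” and every harmful item is answered “never”.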
Feature selection
Feature selection was performed to reduce the dimensionality of the variables and improve model performance. In the current study, recursive feature elimination (RFE) was used. RFE is a machine learning (ML) method for feature selection that is combined with a classifier to eliminate redundant variables, thus identifying the most important factors for each classifier [24]. To select the best combination of predictors, 10-fold cross validation was combined with RFE; that is, RFE was performed on each training subset of the input data, the validation error of each candidate feature subset was calculated, and the feature subset with the smallest validation error was selected as the optimal combination.
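A minimal sketch of RFE combined with 10-fold cross validation, using scikit-learn's `RFECV`, is given below. The synthetic data, the logistic regression estimator, and the `balanced_accuracy` scoring are assumptions made for illustration; the study's actual predictors and settings are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the predictor matrix (NOT the CLHLS data).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=2, random_state=0
)

# RFECV drops one feature per step and uses 10-fold cross validation to
# score each candidate number of features.
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="balanced_accuracy",  # assumed metric, not stated in the text
    min_features_to_select=1,
)
selector.fit(X, y)

# Column indices of the features retained in the best-scoring subset.
selected = np.flatnonzero(selector.support_)
print("number of selected features:", selector.n_features_)
print("selected column indices:", selected.tolist())
```

Because RFE depends on the classifier's feature importances, the selected subset may differ between LR and SVM, which is consistent with selecting the most important factors for each classifier.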
Trajectories of cognitive function
The growth mixture model (GMM), which divides a population into several groups according to differences in their growth trajectories, was used to explore the heterogeneity of cognitive trajectories. Previous studies suggested that a latent growth curve model (LGCM) and a latent class growth model (LCGM) should be used to explore the shape of the growth curve and the number of potential trajectory classes before GMM analysis [25, 26]. Once the optimal LCGM had been selected, the GMM was fitted. Model selection was based on statistical indices and interpretability. The statistical indices were the sample-size adjusted Bayesian information criterion (SABIC), entropy, the Vuong-Lo-Mendell-Rubin likelihood ratio test (VLMR-LRT), and the proportion of the smallest class. SABIC is an information criterion for which a lower value indicates a better-fitting model. Entropy is a measure of classification accuracy that ranges from 0 to 1; the larger the entropy, the better the trajectory classification. The VLMR-LRT compares the k-1 class model with the k class model, and a significant p value (< 0.05) indicates that the k class model is better than the k-1 class model. In addition, each trajectory class had to contain a sufficient number of participants, defined as no less than 5% of the total sample. Following previous studies [5, 23], models with 1 to 5 trajectory classes were tested, and the most favorable model was selected according to the above indices.
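The statistical part of the selection rule above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the class `ClassSolution` and the function `pick_solution` are hypothetical names, the fit indices are assumed to have been exported from the modelling software, and interpretability, which the text also weighs, is not captured. The `sabic` helper uses the standard definition, in which the sample size n in the BIC penalty is replaced by (n + 2) / 24.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClassSolution:
    n_classes: int
    sabic: float
    entropy: Optional[float]      # not defined for the 1-class model
    vlmr_p: Optional[float]       # p value of k vs. k-1 classes; None for k = 1
    smallest_class_prop: float    # proportion of the smallest class


def sabic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Sample-size adjusted BIC: -2 logL + p * ln((n + 2) / 24)."""
    return -2.0 * log_likelihood + n_params * math.log((n_obs + 2) / 24)


def pick_solution(
    candidates: List[ClassSolution], alpha: float = 0.05, min_prop: float = 0.05
) -> ClassSolution:
    """Keep solutions whose smallest class holds at least min_prop of the
    sample and whose VLMR-LRT is significant (k > 1), then take the lowest
    SABIC among them."""
    eligible = [
        c for c in candidates
        if c.smallest_class_prop >= min_prop
        and (c.n_classes == 1 or (c.vlmr_p is not None and c.vlmr_p < alpha))
    ]
    if not eligible:
        raise ValueError("no candidate satisfies the selection criteria")
    return min(eligible, key=lambda c: c.sabic)
```

Entropy is stored but not used as a hard filter here, because the text gives no cut-off for it; in practice it would be inspected alongside the other indices.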
Derivation and evaluation of cognitive trajectories prediction models
For the prediction of cognitive trajectories, logistic regression (LR) and support vector machine (SVM), two classifiers commonly used in the field of psychology, were selected. Because LR and SVM are both single classifiers, a further model that combines LR and SVM by stacking was constructed to distinguish the different trajectories. Stacking is an ensemble learning algorithm that integrates several ML algorithms to obtain a more powerful learner [27]. For LR, two hyperparameters were tuned with 10-fold cross validation [24] in the training set: the type of penalty (“none”, “L1”, “L2”, “elasticnet”) and the regularization parameter C (searched over 0.01, 0.1, 1, 10, 100). The final penalty and C were “L1” and 10, respectively. For SVM, the kernel (“linear”, “rbf”, “poly”, “sigmoid”) and the regularization parameter C (searched over 0.01, 0.1, 1, 10, 100) were tuned, and the “linear” kernel with C = 1 was selected. For stacking, LR and SVM served as the base classifiers in the first layer, and LR was used for the final prediction in the second layer.
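The two-layer stacking design described above can be sketched with scikit-learn's `StackingClassifier`, using the final hyperparameters reported in the text. The synthetic binary data, the `liblinear` solver (needed for an L1 penalty), the default settings of the second-layer LR, and the internal 10-fold split are assumptions made for illustration, not the authors' confirmed configuration; the hyperparameter search itself is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (NOT the CLHLS data).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_clusters_per_class=1,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# First layer: LR (L1 penalty, C = 10) and SVM (linear kernel, C = 1).
base_lr = LogisticRegression(penalty="l1", C=10, solver="liblinear")
base_svm = SVC(kernel="linear", C=1, random_state=0)

# Second layer: LR combines the cross-validated outputs of the base models.
stack = StackingClassifier(
    estimators=[("lr", base_lr), ("svm", base_svm)],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=10,  # assumed: out-of-fold predictions from 10 folds feed the second layer
)
stack.fit(X_train, y_train)

# Predicted probability of the positive (minority) class on held-out data.
proba = stack.predict_proba(X_test)[:, 1]
print("held-out accuracy:", round(stack.score(X_test, y_test), 3))
```

Training the second layer on out-of-fold predictions, rather than on predictions for data the base classifiers have already seen, limits the leakage that would otherwise favor the more overfitted base model.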
Because the classes of cognitive trajectories were imbalanced, the synthetic minority oversampling technique (SMOTE) [28] was applied to the training data before prediction. For performance evaluation, balanced accuracy, which takes both the positive and negative classes into account, was chosen to measure the overall accuracy of each prediction model. The F1 score, which combines precision and recall, was also calculated. The area under the receiver operating characteristic curve (AUROC) and its 95% confidence interval (95% CI) were used to evaluate the discrimination of the prediction models, and calibration was evaluated with the Brier score. To obtain stable estimates of model performance, the procedure was repeated 1000 times for LR and, to limit computation time, 100 times for SVM and stacking.
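The evaluation metrics named above can be computed over repeated random splits roughly as follows. This is a sketch under several assumptions: the data are synthetic and binary, the helper `evaluate_once` is hypothetical, only 20 repetitions are run to keep it fast, and the 95% interval is taken as the 2.5th to 97.5th percentiles across repetitions, which may differ from how the interval was obtained in the study. Oversampling is omitted to keep the example dependency-light; in a full pipeline it would be applied to the training split only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    balanced_accuracy_score, brier_score_loss, f1_score, roc_auc_score,
)
from sklearn.model_selection import train_test_split


def evaluate_once(X, y, seed):
    """Fit LR on one stratified split and return the four metrics."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    # Oversampling (e.g. SMOTE) would be applied to (X_tr, y_tr) here,
    # never to the held-out split.
    model = LogisticRegression(penalty="l1", C=10, solver="liblinear")
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    pred = (proba >= 0.5).astype(int)
    return {
        "balanced_accuracy": balanced_accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "auroc": roc_auc_score(y_te, proba),   # discrimination
        "brier": brier_score_loss(y_te, proba),  # calibration (lower is better)
    }


# Synthetic, imbalanced stand-in data (NOT the CLHLS data).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_clusters_per_class=1,
    weights=[0.8, 0.2], random_state=0,
)

runs = [evaluate_once(X, y, seed) for seed in range(20)]
auroc = np.array([r["auroc"] for r in runs])
auroc_mean = auroc.mean()
auroc_ci = np.percentile(auroc, [2.5, 97.5])
print(f"AUROC {auroc_mean:.3f} (95% interval {auroc_ci[0]:.3f} to {auroc_ci[1]:.3f})")
```

Balanced accuracy averages the recall of each class, so a model that always predicts the majority trajectory scores 0.5 rather than the majority proportion.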
Sensitivity analysis
Given that trajectory analysis is more stable for participants with 3 or more observations over time, we conducted a sensitivity analysis restricted to participants who had complete information on the MMSE for at least 3 waves. Applying the selection criteria of this study, 1668 participants were included in this sensitivity analysis. In addition, the trajectory of cognitive function may be influenced by factors such as age, sex, and education [29]. Therefore, a further sensitivity analysis was performed with age, sex, and education as covariates among participants who had complete information on the MMSE for at least 2 waves (n = 3502).
Statistical analysis
Continuous variables are presented as mean ± standard deviation, and categorical variables as percentages. Baseline characteristics were compared among the different trajectories with analysis of variance (ANOVA) or the chi-square test, as appropriate. These analyses were conducted with SPSS 25.0. Trajectory class analyses were performed with Mplus 8.3 (Muthén and Muthén, 2019). Feature selection, model derivation, and model evaluation were performed with the scikit-learn package in Python 3.7.6. A two-sided p value of < 0.05 was considered statistically significant.