Can Loneliness be Predicted? Development of a Risk Prediction Model for Loneliness among Elderly Chinese: A Study Based on CLHLS

Background: Loneliness is prevalent among the elderly, worsened by global aging trends. It impacts mental and physiological health. Traditional scales for measuring loneliness may be biased due to cognitive decline and varying definitions. Machine learning advancements offer potential improvements in risk prediction models. Methods: Data from the 2018 Chinese Longitudinal Healthy Longevity Survey (CLHLS), involving over 16,000 participants aged ≥65 years, were used. The study examined the relationships between loneliness and factors such as cognitive function, functional limitations, living conditions, environmental influences, age-related health issues, and health behaviors. Using R 4.4.1, seven predictive models were developed: logistic regression, ridge regression, support vector machines, K-nearest neighbors, decision trees, random forests, and multi-layer perceptron. Models were evaluated based on ROC curves, accuracy, precision, recall, F1 scores, and AUC. Results: Loneliness prevalence among elderly Chinese was 23.4%. Analysis identified 16 predictive factors and evaluated seven models. Logistic regression was the most effective model for predicting loneliness risk due to its economic and operational advantages. Conclusion: The study found a 23.4% prevalence of loneliness among elderly individuals in China. SHAP values indicated that higher MMSE scores correlate with lower loneliness levels. Logistic regression was the superior model for predicting loneliness risk in this population.

alienation, highlighting how social isolation can contribute to loneliness (Wang et al., 2023). Current studies predominantly focus on interventions targeting loneliness among the elderly and on individual differences in loneliness-related factors across elderly populations. Another research direction examines the impact of loneliness on physiological illnesses.
In recent years, machine learning-based risk prediction models have emerged in the field of emotional and affective research (Richter et al., 2021). Developing a machine learning-based risk prediction model for loneliness among elderly individuals is therefore a promising endeavor. Factors influencing loneliness among the elderly are still under exploration and encompass demographic and social factors, physiological conditions, psychological factors, and social and living environments.
Demographic and social factors include age, gender, marital status, and education level. Increasing age often accompanies transitions in social roles, such as retirement and children leaving home, potentially reducing social opportunities. Gender differences manifest significantly in experiences of loneliness; for instance, women may be more susceptible to loneliness following widowhood (Lim et al., 2022). Marital status significantly influences loneliness; married elderly individuals generally experience less loneliness than unmarried, divorced, or widowed individuals (Koren et al., 2024). Education level may affect the breadth and quality of social networks, with more highly educated elderly individuals typically possessing more social resources (Sánchez et al., 2024).
Physiological factors encompass health status, chronic illnesses, and physical impairments. Poor health may reduce social activity through decreased physical ability, exacerbating loneliness (Sunwoo, 2020). Chronic illnesses such as heart disease and diabetes not only affect physical health but also negatively impact mental health. Physical impairments, such as hearing or vision impairments, may hinder communication with the outside world, further exacerbating loneliness (Thompson et al., 2024).
Psychological factors such as depression, anxiety, and self-esteem directly influence feelings of loneliness. Depressive symptoms often co-occur with loneliness, and the two may mutually reinforce each other (Liang et al., 2023). Anxiety may 
reduce social interactions among the elderly, increasing loneliness. Additionally, elderly individuals with low self-esteem may perceive social rejection more readily, contributing to loneliness (Perlman & Peplau, 1981). Cognitive decline, including memory loss and cognitive impairment, may limit social capabilities and opportunities, thereby increasing loneliness (Camacho et al., 2024).
The social environment involves social support networks and levels of social participation. Strong social support networks, such as close family relationships and friendships, effectively alleviate loneliness (Hogan et al., 2002). Social participation, such as engagement in community activities, volunteering, or social gatherings, correlates negatively with loneliness (Zhang et al., 2018). Social support and participation not only provide emotional support but also alleviate loneliness through practical assistance.
Living environment factors include housing conditions, community safety, and neighbor relationships. Good housing conditions, such as comfortable living environments and convenient community facilities, improve elderly individuals' quality of life and reduce loneliness. Community safety encourages elderly individuals to engage in social activities. Positive neighbor relationships provide daily social interactions and emotional support, reducing loneliness (Bhuyan & Yuen, 2022).
Existing studies have extensively explored the causes and impacts of loneliness through methods such as questionnaires, interviews, and psychological assessments. These studies offer valuable insights into the determinants of loneliness and its effects on elderly populations. However, most conduct post-hoc analyses to identify the presence and severity of loneliness, which limits their ability to support timely and effective intervention. Prospective prediction of loneliness risk among the elderly is noticeably lacking in the literature. Accurate prediction models can 
assist healthcare providers, policymakers, and caregivers in identifying high-risk individuals before loneliness becomes a severe issue, thereby promoting early intervention to potentially mitigate the negative impacts of loneliness.
The primary aim of this study is to develop a machine learning-based predictive model to assess loneliness risk among elderly individuals. The model considers demographic and social factors, physiological factors, psychological factors, social environment, and living environment to provide a comprehensive risk assessment. By accurately predicting loneliness risk, the model aims to guide the development of targeted intervention strategies to improve the quality of life for elderly individuals.
In this paper, we first review existing literature on loneliness among the elderly, highlighting the strengths and limitations of current methodologies. This review provides background for understanding the complexity of loneliness and the necessity for predictive models. Following the literature review, we detail our research methodology, including the dataset used, variable selection process, and model construction techniques. The dataset comprises multiple variables assumed to influence loneliness, selected based on theoretical considerations and empirical evidence. We constructed seven machine learning models: logistic regression, ridge regression, support vector machines, decision trees, random forests, K-nearest neighbors, and multilayer perceptron. The model construction section outlines the statistical and computational techniques for each model. We employed cross-validation methods to evaluate model performance and used multiple evaluation metrics (such as accuracy, recall, F1 score, etc.) 
for comprehensive model comparison. Through detailed performance evaluation, we identified the optimal model, its practical application value, and potential applications. Finally, we discuss the limitations of the study and propose future research directions. The limitations section acknowledges challenges encountered in data availability and the generalizability of research findings. Despite these limitations, this study contributes to the existing knowledge base by offering a new approach to predicting loneliness risk among elderly individuals. By identifying high-risk individuals early, this model has the potential to inform targeted intervention strategies, thereby improving elderly well-being. The study underscores the importance of adopting proactive approaches in addressing loneliness and contributes to broader efforts aimed at enhancing the quality of life for elderly individuals.

Study Design and Participants
The Chinese Longitudinal Healthy Longevity Survey (CLHLS) is a longitudinal survey of the elderly organized by the Center for Healthy Aging and Development Studies at Peking University/National Development Research Institute. It covers 23 provinces, municipalities, and autonomous regions nationwide, targeting individuals aged 65 and older as well as their adult children aged 35-64. The survey consists of two types of questionnaires: one for living participants and another for deceased participants' family members. The living participant questionnaire covers basic elderly and family conditions, socio-economic background, family structure, economic sources and status, self-rated health and quality of life, cognitive function, personality traits, daily activity ability, lifestyle, caregiving, disease treatment, and the burden of medical expenses. The questionnaire for deceased participants' family members collects information on the time and cause of death. The survey commenced with a baseline study in 1998, and subsequent follow-ups were conducted in 2000, 2002, 2005, 2008-2009, 2011-2012, 2014, and 2017-2018.
For this study, we utilized the cross-sectional database from 2018, comprising approximately 16,000 elderly individuals aged 65 and above. During the data processing stage, selected questions were recoded and corresponding scale scores were computed. Loneliness was assessed by the question "Do you often feel lonely?" with responses categorized into "always," "often," "sometimes," "rarely," and "never," as validated in previous research for loneliness assessment.
For statistical analysis, responses were recoded into a binary variable: "always," "often," and "sometimes" were defined as "feeling lonely (FL, 23.4%)," while "rarely" and "never" were defined as "not feeling lonely (NFL, 76.6%)" (Wei et al., 2022). Functional limitations were assessed using the Katz Activities of Daily Living (ADL) scale and the Lawton Instrumental Activities of Daily Living (IADL) scale. Difficulty performing any of the ADL tasks (bathing, dressing, toileting, transferring, continence, eating) or IADL tasks (visiting neighbors, shopping, cooking, doing laundry, walking 1 km, carrying 5 kg weight, kneeling and standing 3 times, using public transportation) was defined as having ADL or IADL limitations, respectively (Wei et al., 2022).
Covariates included baseline demographic measurements: gender (0 = female/1 = male), marital status (1 = married/2 = separated but married/3 = divorced/4 = widowed/5 = single), years of education, adequacy of life sources (0 = insufficient/1 = sufficient), satisfaction with current life (0 = dissatisfied/1 = satisfied), satisfaction with health status (0 = dissatisfied/1 = satisfied), elderly personality (0 = introverted/1 = extroverted), living conditions of elderly respondents (0 = urban/1 = rural), and current living arrangements (1 = with family/2 = living alone/3 = in care institutions). Environmental factors included musty odor at home (0 = no/1 = yes) and use of air purification devices or activated carbon to improve indoor air quality at home (0 = no/1 = yes). Age-related health issues included visual impairment (0 = unimpaired/1 = impaired), hearing impairment (0 = unimpaired/1 = impaired), toothache, cheek or jaw pain in the past six months (0 = no/1 = yes), history of falls in the past year (0 = no/1 = yes), and chronic diseases (summarized as the number of chronic diseases). Personal health behaviors included current smoking status (0 = no/1 = yes), regular alcohol consumption (0 = no/1 = yes), regular physical exercise (0 = no/1 = yes), regular 
calcium supplement intake (0 = no/1 = yes), and social engagement. For social engagement, responses were recoded into a binary variable: "almost every day," "at least once a week but not daily," "at least once a month but not weekly," and "sometimes" were defined as participating in social activities, while "never" was defined as not participating in social activities.
Data cleaning involved imputation for missing data: continuous variables (actual age, MMSE score, Katz ADL score, Lawton IADL score, years of education) were imputed with mean values, and unordered discrete variables, such as the covariate questions, were imputed with mode values. Data for participants lost to follow-up were excluded from analysis. The constructed machine learning models were evaluated using metrics such as ROC-AUC, precision, accuracy, recall, and F1 score (Alghamdi et al., 2020) to comprehensively assess and identify the optimal risk prediction model for loneliness among Chinese elderly.
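The recoding and imputation rules described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the helper names and the toy records are hypothetical, and leaving a missing loneliness response unimputed is an assumption, since the text only describes imputation for predictors.

```python
from statistics import mean, mode

# Response categories follow the CLHLS loneliness item quoted in the text.
LONELY = {"always", "often", "sometimes"}
NOT_LONELY = {"rarely", "never"}


def recode_loneliness(response):
    """Map the five-level response to 1 (FL), 0 (NFL), or None if missing."""
    if response is None:
        return None
    answer = response.strip().lower()
    if answer in LONELY:
        return 1
    if answer in NOT_LONELY:
        return 0
    return None  # unrecognized answers are treated as missing (assumption)


def impute_mean(values):
    """Replace missing entries of a continuous variable with the observed mean."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace missing entries of a discrete variable with the observed mode."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


# Hypothetical toy records, not CLHLS data.
responses = ["Always", "never", "sometimes", "rarely", None]
mmse = [28, None, 22, 30, 24]
lives_alone = [0, 1, None, 0, 0]

lonely = [recode_loneliness(r) for r in responses]
mmse_filled = impute_mean(mmse)
alone_filled = impute_mode(lives_alone)
```

Mean and mode imputation are simple to apply but shrink the variance of the imputed variables, which is worth keeping in mind when interpreting model coefficients.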

Loneliness
The occurrence rates of FL (Feeling Lonely) and NFL (Not Feeling Lonely) among the elderly were 23.4% and 76.6%, respectively. Table 1 presents the results of the feature selection of variables adopted in this study. A total of 15 variables were included, namely Katz scale, IADL, number of chronic diseases, current living arrangements, presence of mildew odor in the home, self-reported quality of life, self-reported health, optimistic outlook, current drinking status, current exercise status, social interaction with friends, sufficiency of financial support for daily expenses, current marital status, visual function regarding perceiving a break in a circle, and calcium supplement intake.

Development and Evaluation of Risk Prediction Models
In this study, we utilized data from 15,874 individuals to construct and evaluate risk prediction models for loneliness among the elderly. The data were randomly divided into training and validation sets at a ratio of 7:3, with 11,112 cases in the training set and the remaining 4,762 cases in the validation set. Homogeneity of the training and validation sets was verified in R, aiming to reduce statistical bias and ensure results closer to real values (see Table 2). During the initial model construction phase, all 16 predictor variables were used as input, with loneliness (FL) as the output variable.
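As a quick check on the reported set sizes, the sketch below reproduces a 7:3 random split of 15,874 records in Python. The seed and the rounding rule are assumptions made for illustration; the paper only states the ratio and the resulting counts, and the original split was done in R.

```python
import random


def split_indices(n_total, train_fraction=0.7, seed=2018):
    """Shuffle record indices and cut them into training and validation parts."""
    indices = list(range(n_total))
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    rng.shuffle(indices)
    n_train = round(n_total * train_fraction)  # rounding rule is an assumption
    return indices[:n_train], indices[n_train:]


train_idx, valid_idx = split_indices(15874)
print(len(train_idx), len(valid_idx))  # 11112 4762
```

With rounding to the nearest integer, 70% of 15,874 gives 11,112 training cases and leaves 4,762 for validation, matching the counts in the text.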
Using the training set, we employed the glmnet, svmLinear, knn, rpart, rf, and nnet packages in R 4.4.1 to build seven machine learning models: Logistic Regression, Ridge Regression, SVM-Linear, K-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF), and Multi-Layer Perceptron (MLP). K-fold cross-validation was employed to ensure model stability.
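The text does not report the value of K, so the following Python sketch of K-fold partitioning uses K = 5 purely as an assumed example. It also assumes the training indices were already shuffled, and it only generates the index folds; model fitting is omitted.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, held_out) index lists for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, held_out
        start += size


# 11,112 is the training-set size reported in the text; k = 5 is an assumption.
folds = list(k_fold_indices(11112, 5))
```

Each record is held out exactly once, so averaging a metric over the folds gives an estimate of performance that does not depend on a single lucky or unlucky partition.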
To assess the predictive performance of the constructed models, we used the validation set data and plotted Receiver Operating Characteristic (ROC) curves, calculating the Area Under the Curve (AUC) values. The ROC curve is a graphical method that displays model performance as the discrimination threshold changes, while the AUC represents the area under the ROC curve, ranging from 0.5 (random prediction) to 1 (perfect prediction). An AUC of 1.0 indicates a perfect test, whereas in general, an AUC of 0.9-0.99 is an excellent test, 0.8-0.89 a good test, 0.7-0.79 a fair test, and < 0.7 a nonuseful test (Carter et al., 2016). Accuracy is an intuitive and easily understood metric but may be misleading in cases of imbalanced sample classes (where positive and negative samples differ greatly in number) (Wang et al., 2021). For instance, if a disease is very rare in the overall population, a model that simply predicts everyone as non-diseased could achieve high accuracy but would be of little practical use in identifying actual disease cases. To address this issue, additional evaluation metrics such as Precision, Recall, and F1 Score are typically combined to comprehensively assess model performance. This figure presents the evaluation metrics used for assessing the performance of seven risk prediction models constructed using machine learning techniques.
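The point about accuracy under class imbalance can be made concrete with the prevalence reported in this study. The Python sketch below uses hypothetical helper names and synthetic labels rather than the study's predictions. It computes the metrics from confusion-matrix counts and estimates AUC as the probability that a randomly chosen lonely person receives a higher score than a randomly chosen non-lonely person.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from binary labels (1 = FL, 0 = NFL)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def auc_by_ranking(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 1,000 synthetic people with the reported 23.4% prevalence of loneliness.
y_true = [1] * 234 + [0] * 766
always_nfl = [0] * 1000  # a trivial model that never predicts loneliness
baseline = classification_metrics(y_true, always_nfl)
print(baseline)
```

The trivial model reaches an accuracy of 0.766 while its recall and F1 score are both 0, which is why the comparison of models should not rest on accuracy alone.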

Interpretability Analysis Based on the SHAP Algorithm
SHAP (SHapley Additive exPlanations) is a game-theoretic method for interpreting machine learning model outputs, typically presented through visualizations (Petch et al., 2022). Some researchers have successfully employed this algorithm to overcome the "black box" nature of machine learning, providing consistent interpretability for models.
Studies (Fainberg et al., 2024) have shown that interpretable machine learning offers new insights for explaining complex and heterogeneous biological data. This study employs Python 3.12 and PyCharm to compute and visualize SHAP values, using the SHAP algorithm to provide interpretability analysis for the MLP model and visualize all features. The SHAP feature importance ranking indicates that marital status has the strongest predictive value across all forecasting periods. Specifically, elderly individuals who are never married, widowed, divorced, or separated are more likely to experience loneliness compared to their married counterparts. Additionally, features such as cohabitation status and self-rated health of the elderly also hold significant predictive value. (Fig. 3 SHAP summary plot.) Prior to this visualization, the study utilized the Pairplot function in Python to examine the relationships and data distribution characteristics between loneliness and independent variables, and to display the most significant features. The top five features depicted are the Katz scale, IADL scale, number of chronic diseases, living alone, and residing in care facilities. (Fig. 4 the Pairplot of the top five important features.)
This figure illustrates the SHAP summary plot, which provides an overview of feature importance in the model. This plot visualizes the impact of each feature on the model's predictions, showing the relative importance of features such as marital status, co-residence, and self-rated health. The summary plot helps in understanding how different features contribute to the prediction of loneliness among the elderly.
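To illustrate what a Shapley value measures, the sketch below computes exact Shapley values for a toy three-feature logistic model by enumerating all feature coalitions. Everything in it is hypothetical: the weights, the feature names, and the choice to represent an "absent" feature by a fixed baseline value. It is not the SHAP library and not the model fitted in this study, which would estimate these quantities from background data.

```python
from itertools import combinations
from math import exp, factorial


def make_model(weights, intercept):
    """Return a toy logistic model: predicted probability of feeling lonely."""
    def predict(x):
        z = intercept + sum(w * v for w, v in zip(weights, x))
        return 1.0 / (1.0 + exp(-z))
    return predict


def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical features: [widowed, lives alone, good self-rated health].
predict = make_model(weights=[0.9, 0.6, -0.7], intercept=-1.5)
x = [1, 1, 0]          # widowed, living alone, poor self-rated health
baseline = [0, 0, 1]   # married, living with family, good self-rated health
phi = shapley_values(predict, x, baseline)
```

The values in `phi` sum to the difference between the prediction for `x` and the prediction for `baseline`, which is the additivity property that makes a SHAP summary plot readable as a decomposition of each prediction.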

4. Discussion
This study developed a predictive model based on machine learning algorithms to assess the risk of loneliness among elderly individuals in China.By integrating various demographic, physiological, psychological, social, and environmental variables, we constructed and compared seven machine learning models: logistic regression, ridge regression, support vector machine, random forest, decision tree, k-nearest neighbors, and multi-layer perceptron (MLP).
Our findings indicate that logistic regression and MLP performed best in predicting loneliness risk among the elderly, demonstrating high accuracy and recall rates. Logistic regression excelled due to its simplicity and interpretability, particularly advantageous for handling high-dimensional data and multicollinearity issues. MLP, as a neural network model, effectively captured complex non-linear relationships within the data, demonstrating outstanding performance in complex loneliness prediction tasks.
To identify significant features associated with loneliness among elderly individuals in China, this study employs the SHAP algorithm to interpret a logistic regression model. The SHAP algorithm estimates feature importance by assigning Shapley values, which reflect the average marginal contribution of each feature (Bifarin, 2023). In the model, the five features with the most substantial impact on elderly loneliness are: marital status, cohabitation status, self-rated health (a protective factor), frequency of social interactions with friends, and functional limitations. Pairplot visualization of these key features reveals that functional limitations (as measured by the Katz and IADL scales) and lack of social support (living alone and residing in care facilities) are prominent. These findings provide additional support for Weiss's theory of loneliness, which differentiates between emotional and social loneliness.
This study highlights a significant relationship between loneliness and marital status. Chaya's phenomenological research on elderly individuals who divorced in later life explores their experiences of freedom and loneliness from a dual-family perspective through semi-structured interviews. The study identifies generational gaps regarding the benefits and costs of late-life divorce. While most elderly individuals view late-life divorce as emphasizing the benefits of freedom, their adult children often describe the drawbacks of loneliness, perceiving both loneliness and freedom as negative aspects.
recommendations for designing conversational companion robots for the elderly, utilizing foundational models such as LLMs and visual-language models. These models aim to offer social and emotional support to alleviate loneliness and social isolation in elderly individuals. Chiang Liang Kok (Kok et al., 2024) has developed a social robot designed to bridge the gap between humans and machines, integrating embedded systems, robotics, and basic soft skills to enable effective interactions. This technological solution is anticipated to address caregiver shortages, reduce elderly individuals' feelings of isolation, and potentially transform elderly care through innovative applications, thereby improving their overall well-being.
Despite these significant findings, several limitations should be acknowledged. Firstly, the data originated from China, so model accuracy requires further validation among elderly populations with diverse cultural backgrounds and living environments. Future research should consider cross-cultural and cross-regional comparative studies to validate model generalizability and stability. Secondly, our study relied on cross-sectional data, limiting insights into the dynamic changes and causal relationships of loneliness. Future studies should adopt longitudinal designs to track changes in loneliness among the elderly, providing deeper insights into its mechanisms and causal relationships. Furthermore, while our predictive models demonstrated high ...
The most recent follow-up in 2017-2018 included 15,874 elderly participants aged 65 and older, collecting information on 2,226 deceased elderly individuals between 2014 and 2018 (Center for Healthy & Development, 2020).

(Schrodi et al., 2014) Precision, also known as Positive Predictive Value (PPV), is a metric for evaluating model performance in classification problems (Tharwat, 2021). It measures the proportion of samples predicted as positive that are actually positive. In other words, precision evaluates the accuracy of the model's positive predictions. F1 Score is the harmonic mean of precision and recall, used to provide a balanced assessment of model performance (Xu et al., 2022).
Fig. 1 presents the comprehensive evaluation results of seven machine learning models used in this study for predicting loneliness among the elderly. The evaluation metrics used include ROC curves with corresponding AUC values, precision, recall, and F1 scores. These metrics collectively assess the predictive performance of each model based on the validation dataset. The figure provides a comparative analysis of how each model performed across these metrics, highlighting strengths and potential areas for improvement in their predictive capabilities for loneliness among older adults.

Figure 2: Summary of ROC Curves for Seven Models.

Figure 4: The top five important features. This figure presents the Pairplot visualization of the top five important features identified in the analysis. This plot illustrates the relationships and distributions of the Katz scale, IADL scale, number of chronic diseases, living alone, and residing in care facilities. The Pairplot effectively highlights the correlations and patterns among these features with respect to loneliness in the elderly population.
(Koren et al., 2024) For single elderly individuals, the experience of loneliness is closely linked to their social support networks (Golden et al., 2009). The absence of a spouse often results in a smaller social circle and less frequent interaction with friends and family, thereby increasing social loneliness (Teater et al., 2021). Additionally, single elderly individuals may experience lower life satisfaction due to a lack of intimate relationships, exacerbating emotional loneliness (Park et al., 2021). Widowhood entails the loss of long-term companionship and support, which can lead to profound feelings of loneliness and grief (King et al., 2021). The death of a partner often results in feelings of isolation, especially in the absence of other close relationships (Pietromonaco & Overall, 2022). Widowhood itself acts as a stressor, requiring elderly individuals to cope with the emotional trauma of losing a partner while adapting to a new life phase (Carr, 2020). Reduced physical and cognitive functions in the elderly further hinder their ability to adjust effectively. Interventions for single, divorced, separated, and widowed elderly individuals typically focus on enhancing emotional and social support to enrich their internal well-being. With advancements in artificial intelligence, robots and generative AI are emerging as potential tools for emotional support. Bahar Irfan (Irfan et al., 2024) has provided actionable

Figure 1: Evaluation Metrics for Seven Risk Prediction Models Based on Machine Learning

Table 1: Summary of Variable Feature Selection: Integration of Univariate and Multivariate Analysis. This table consolidates the findings from both univariate and multivariate analyses to highlight key variable features for predicting loneliness among elderly individuals.
2.2.2 Data Standardization: To enhance feature interpretability and standardize variable units, Min-Max normalization was employed in this study. This normalization method ensures that all non-binary variables are uniformly scaled. After Min-Max normalization, the importance of features in regression-based machine learning models (such as logistic regression) becomes easily interpretable (Cabello-Solorzano et al., 2023).
Construction of a Risk Prediction Model for Loneliness Among Chinese Elderly:
2.2.3 Data Splitting: The study utilized R version 4.4.1 code to split data into training and validation sets at a ratio of 7:3.
2.2.4 Model Selection: This study considered several machine learning models: Logistic Regression, Ridge Regression (Ridge C), Linear Support Vector Machine (SVM-Linear), K-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF), and Multi-Layer Perceptron (MLP). Logistic Regression is widely used in medicine for analyzing and predicting binary outcomes due to its strong ability to handle binary classification problems, its interpretable model coefficients, its probability outputs, and its simplicity in computation and implementation (Hassan et al., 2021). Ridge Regression introduces a penalty term in the loss function to constrain the size of regression coefficients, reducing model complexity and overfitting risks, making it suitable for medical statistical analysis and decision support (van Wieringen, 2015). SVM is a popular machine learning algorithm used for classification and regression analyses, with SVM-Linear specifically effective in handling linearly separable data and high-dimensional data, and performing well on small sample datasets in medical research and practice (Salcedo-Sanz et al., 2014). KNN 
is a simple and widely used supervised learning algorithm for classification and regression that measures distances between new samples and training samples and predicts outcomes from the nearest neighbors' categories or values (Taunk et al., 2019). DT is a common supervised learning algorithm applicable to classification and regression problems, using a hierarchical approach to partition data into subsets, forming a tree structure where nodes represent features and edges represent feature values, with leaf nodes indicating final classification or prediction results (Maimon & Rokach, 2014). RF is an ensemble learning method that constructs multiple decision trees and combines their results for classification and regression, enhancing model accuracy, robustness, and handling of high-dimensional data and missing values, and is thus effective in medical research and practice (Qi, 2012). MLP is a feedforward neural network composed of an input layer, one or more hidden layers, and an output layer, using weighted connections and activation functions for non-linear transformations (Sharma et al., 2017). MLP finds broad applications in medicine for disease diagnosis (e.g., cardiovascular disease (Deepika & Balaji, 2022) and diabetes classification (Theerthagiri et al., 2022)), medical image analysis (Jiang et al.