Proteomics and Machine Learning in the Prediction and Explanation of Low Pectoralis Muscle Area

Low muscle mass is associated with numerous adverse outcomes independent of other associated comorbid diseases. We aimed to predict and understand an individual’s risk for developing low muscle mass using proteomics and machine learning. We identified 8 biomarkers associated with low pectoralis muscle area (PMA). We built 3 random forest classification models that used either clinical measures, feature selected biomarkers, or both to predict development of low PMA. The area under the receiver operating characteristic curve for each model was: clinical-only = 0.646, biomarker-only = 0.740, and combined = 0.744. We displayed the heterogeneous nature of an individual’s risk for developing low PMA and identified 2 distinct subtypes of participants who developed low PMA. While additional validation is required, our methods for identifying and understanding individual and group risk for low muscle mass could be used to enable developments in the personalized prevention of low muscle mass.

Posted Date: March 4th, 2024
DOI: https://doi.org/10.21203/rs.3.rs-3957125/v1

Introduction
Sarcopenia is a clinical syndrome characterized by low muscle strength and low muscle quality or quantity, and its presence is often associated with low physical performance.1,2 While sarcopenia is often considered a result or a complication of age and comorbid conditions, sarcopenia as a disease in and of itself is independently associated with numerous adverse outcomes including injury, disease, and mortality.1 Thus, it is crucial to identify those at risk for developing sarcopenia in order to intervene before adverse outcomes occur.3
One approach to measuring the low muscle quantity aspect of sarcopenia is the use of computed tomography (CT), including the measurement of pectoralis muscle area (PMA) on CT imaging of the chest. Prior work has demonstrated the utility of these measurements for predicting adverse outcomes such as exacerbations of respiratory disease and death.4,56[9] However, little research has been conducted evaluating the prediction of incident low muscle mass, a key problem that must be addressed in order to help prevent it from occurring, and the studies that do exist are often limited by a small sample size or a lack of longitudinal data.5,6,9 Additionally, more work needs to be done examining what drives the risk for low muscle mass on the individual level. This is especially relevant as the benefits of precision-based approaches to medicine over disease-based approaches have become more widely recognized in the medical community. Muscular dystrophies, sarcopenia, and cachexia have all been viewed as appropriate for precision-based care due to the variability of patients' genetic makeup, health, and exposure to therapies.11
We leveraged longitudinal data collected from a large cohort of current and/or former smokers to identify peripheral protein blood biomarkers associated with the development of CT-derived PMA.12 In hopes of identifying those at highest risk for developing low PMA, we hypothesized that we could predict the development of low PMA by using a machine learning classification model that utilizes the identified biomarkers in conjunction with clinical measures and demographics. Additionally, we aimed to not only predict low muscle mass but also to illustrate and understand individual and group risk for it.

Participant Characteristics
The Genetic Epidemiology of COPD (COPDGene) study enrolled 10,305 participants at baseline. For this study the analysis was limited to the 598 current and/or former smoking participants and 98 never-smoking control participants with complete data available (e-Figure 1). The current and/or former smoking cohort was made up of 48% men and 52% women. The cohort was 10.7% Black and 89.3% White. The mean age and BMI were 61.8 and 28.9 respectively. Of the cohort, 36.3% were current smokers, 63.7% were former smokers, and the mean pack years was 42.9. Among the never-smoking control group, the 25th percentile of gender-stratified PMA at baseline was 44.9 cm^2 for men (n = 32) and 24.5 cm^2 for women (n = 66). Based on these values, there were 415 current and/or former smoking participants who did not have low PMA at baseline and 183 who did. Of the 415 current and/or former smoking participants who did not have low PMA at baseline, 22.9% developed low PMA at phase 2 (Table 1).

Individual Risk
For the combined model, the order of importance of the predictors was GDF15, EFEMP1, CDON, Lymphotoxin a1/b2, VCAM-1, age, ON (osteonectin, also known as SPARC), NXPH1, Hat1, gender, pack years, height, and weight (Fig. 2). Feature importance analyses of the clinical-only and biomarker-only models are found in the supplementary material (e-Figures 5-6).
Visual evaluation of the relationships between the measurements of the predictors in each model's training set (n = 168) and their respective Shapley additive explanation (SHAP) values suggests that several may have definable thresholds. For example, for the combined model, GDF15 and EFEMP1 had breakpoints near the middle of their range. (Combined model:
predominantly made up of participants who did not develop low PMA, and the remaining 2 clusters were predominantly made up of participants who did develop low PMA (Fig. 5). All the feature selected biomarkers' SHAP values were significantly different between the 3 clusters via one-way ANOVA (P < 0.001). The clusters that were predominantly made up of participants who developed low PMA had different SHAP profiles from one another despite having the same outcome. The cluster that was predominantly made up of participants who did not develop low PMA had consistently low SHAP values (Fig. 6).
Feature Selected Biomarkers Relationship with PMA
Finally, of the 5 most important feature selected biomarkers, baseline EFEMP1 was significantly (P = 0.008) negatively correlated (r = -0.129) with PMA change. Baseline CDON was significantly (P = 0.009) positively correlated (r = 0.127) with PMA change. The remaining 3 biomarkers at baseline were not significantly correlated with PMA change (Table 3).
Table 3 Relationships between the 5 most important feature selected biomarkers at baseline for predicting low pectoralis muscle area and the change in pectoralis muscle area (cm^2) between baseline and phase 2 (n = 415).

Discussion
Leveraging longitudinal data from the COPDGene study, we developed a machine learning classification model that predicted the development of low PMA in smokers using clinical measures, demographics, and peripheral protein blood biomarkers. This model outperformed a model that utilized only clinical measures and demographics as predictors and performed similarly to one that incorporated biomarker information only. In addition, subsequent analysis of the models suggests that there may be specific cutpoints of interest for the biomarkers identified, and that there is a large amount of heterogeneity in what drives an individual patient's risk for developing low PMA. This heterogeneity was used to cluster the participants into distinct subtypes.
This work has several strengths, one of which is the use of a large-scale longitudinal research cohort that enabled the prediction of low muscle mass utilizing an abundance of protein biomarkers in the initial panel. Prior efforts to predict low muscle mass using biomarkers have often been cross-sectional with relatively small and non-diverse cohorts and with relatively small candidate biomarker panels.7,10,13 Also, by utilizing all-relevant feature selection tools such as Boruta, we were able to select a small number of relevant biomarkers of interest. Subsequent evaluation using SHAP analysis and K-Means clustering provided insights into potential threshold values for those biomarkers and demonstrated the heterogeneity in what contributes to a specific individual's probability of developing low PMA. We believe our methods for biomarker selection and analyzing patient risk are novel to the issue of low muscle mass.
In terms of specific findings, the 8 biomarkers that were deemed important for predicting low PMA were surprisingly diverse, with roles ranging from leukocyte migration regulation to histone acetylation.14,15 Some of the biomarkers found validated prior research. For example, serum GDF15 has been identified as a potential biomarker for sarcopenia due to it being negatively correlated with muscle mass 16 and muscle power 17 in humans. Although we could not find any research relating circulating CDON to muscle mass, it has been shown that mice with satellite cell-specific CDON ablation had impaired muscle generation 18 and it is believed that CDON positively regulates skeletal myogenesis.19,20 Interestingly, some of the biomarkers found contradicted prior research. For example, Hat1-haplodeficient mice have been revealed to have a shorter lifespan and more premature age-related phenotypes, including muscle atrophy, than wildtype mice.21 Moreover, satellite cell VCAM-1 null mice had delayed or decreased myofibril growth compared to wildtype mice.22 These contradictions may be due to species differences and contrasts in function between circulating biomarkers and biomarkers' expression in muscle, a notable weakness of our current work, which relies on peripheral biomarkers.
Some of the biomarkers found may help clarify previously inconclusive research. For example, a cross-species meta-analysis identified EFEMP1 as consistently overexpressed in the muscle with age, and even consistently overexpressed in all studied tissues in their analyses.23 However, there are areas where EFEMP1 appears to be reduced during aging, such as the superficial zone of the articular cartilage,24 and mice with inactivated EFEMP1 appear to age prematurely.25 In our study, EFEMP1 was found to increase the likelihood of developing low PMA in our model, and it was found that EFEMP1 measurements were higher in the cohort that had low PMA at baseline. Altogether, this suggests that the upregulation of EFEMP1 may be an adaptive response to delay the inevitable aging and muscle loss processes. Similarly, conflicting data also exist for the role of SPARC in muscle biology and sarcopenia. For example, there has been evidence that SPARC both positively and negatively affects the differentiation of myoblasts.26,27 Moreover, one group found that serum SPARC was significantly higher in a sarcopenic cohort compared to a non-sarcopenic cohort, while another group found the opposite, although the latter finding was not statistically significant and there were concurrent disease processes.7,28 In our study, SPARC was found to decrease the likelihood of developing low PMA in our model, and it was found that SPARC measurements were higher in the cohort that did not have low PMA at baseline. Together, this suggests that SPARC likely plays a negative (that is, inhibitory) role in the complex muscle loss process. Hopefully, our results concerning EFEMP1 and SPARC will help reduce the ambiguity surrounding these biomarkers.
With regard to the identification of novel biomarkers related to low muscle mass, neither NXPH1 nor Lymphotoxin a1/b2 appears to have a connection with low muscle mass in the literature. Whether our findings reflect true associations or confounding is unclear, and further work is needed to better elucidate what roles, if any, these proteins may play in the development of low muscle mass.
Interestingly, when assessing the feature importance of the combined model's predictors, we noticed that the protein biomarkers appeared more important than most of the clinical predictors. While this could be taken to support the use of proteomics for identifying those at risk for low muscle mass, it is important to caution that there are numerous other clinical predictors that can and should be evaluated, including both complicated screening tools as well as simple clinical questions related to weight loss and exercise capacity. These extensive analyses are beyond the scope of this current investigation but should be done to better explore these issues.
Notably, for the quantitative predictors in our models there is a greater range of positive impact values than negative impact values. In other words, the models avoid giving strong negative impact values regardless of the predictors' actual values, suggesting that there is not one realistic predictor value that can drastically negatively affect the model's outcome. Interestingly, the 5 most important biomarkers for predicting low PMA, when assessed individually at baseline, were not highly correlated with change in PMA between baseline and phase 2. This highlights the potential strength of tools such as machine learning to identify predictors that may not be readily apparent when using more traditional statistical analyses. Similarly, tools such as SHAP analysis may enable insights into specific relationships between predictors and outcomes. For example, plotting the SHAP values against the predictor measurements allowed us to examine the threshold at which the impact direction changes. The plots for age and pack years are especially illustrative. This information may help determine threshold values for concern in clinical applications. The SHAP force plots also help illustrate what is happening on the individual level and show the multifactorial nature of low muscle mass. This could be especially helpful when considering personalized medicine approaches to specific patients, as different patients may have different pathobiological processes responsible for the same phenotype, and thus they may respond differently to targeted therapy. Our cluster analysis supports this theory, as it illustrated 2 distinct subtypes of participants who developed low PMA. This could be due to differences in biomarker profiles, or perhaps due to underlying conditions, for example, aging and smoking-related disease. Interestingly, of the 3 clusters, it appears that the cluster that mostly did not develop low PMA is the densest cluster, and therefore has less variance than the other 2 clusters. Perhaps this consistency is indicative of a "normal" profile subtype. As expected, when comparing the biomarkers' SHAP profiles between the 3 clusters, the cluster that was mostly composed of those who did not develop low PMA consistently had the lowest SHAP values (when examining the median). The other 2 clusters had considerably different biomarker SHAP profiles from one another. For example, the participants in Cluster 1 developed low PMA with CDON and Lymphotoxin a1/b2 having a negative impact on their predicted probability for developing low PMA. On the other hand, Cluster 3 developed low PMA with CDON and Lymphotoxin a1/b2 having a positive impact on their predicted probability for developing low PMA. Surprisingly, the most important biomarkers overall, GDF15 and EFEMP1, had similar SHAP values in both clusters, indicating that it may be the less important biomarkers that are the most responsible for this stratification.
Clinically, this study demonstrates that it may be possible to identify patients at highest risk for low muscle mass before it develops, potentially enabling targeted interventions ranging from diet and exercise to current and novel pharmacologic therapies. This is especially important given both the growing recognition of the benefits of personalized medicine and the growing recognition that muscle loss, while often related to other co-morbid diseases, is a distinct process independently associated with morbidity and mortality. Finally, our approach to biomarker selection and risk analysis is not unique to low muscle mass and could be expanded to other domains as well, potentially enabling the identification of important biomarkers and underlying pathways for other clinical problems.
Unfortunately, this project had several limitations. We did not have a validation cohort and the participants enrolled in this study were less diverse than the general population, which may reduce its generalizability. In addition, there is likely collinearity between some of the biomarkers and clinical measures. For example, plasma GDF15 has been shown to be significantly positively associated with age.29 It is therefore difficult to separate the effects of age from the effects of specific protein biomarkers. Moreover, SHAP analyses assume independence between the predictors, which may not be the case. Finally, although the feature importance results are interesting, they do not indicate causality, only association, significantly limiting their interpretation.
In summary, using proteomics and machine learning, we identified protein biomarkers associated with low PMA in smokers, developed risk prediction tools able to predict the development of low PMA over 5 years of follow-up, and analyzed individual risk and group risk for developing low PMA.

Parent Study
Data was acquired through the COPDGene study: an ongoing longitudinal observational study that examines the development of chronic obstructive pulmonary disease in smokers. There were 10,198 current and/or former smokers and 107 non-smoking control participants initially enrolled in COPDGene (e-Figure 1). All participants were non-Hispanic white or African American, and all current and/or former smokers had a minimum of 10 pack years. Data was collected at baseline (phase 1) and after 5 years of follow-up (phase 2). Additional phase 3, 10-year follow-up visits are currently in progress and are not included in this current study. Data used for this study included an extensive questionnaire at baseline, CT of the chest at baseline and phase 2, and peripheral protein blood biomarker measurements via the SomaScan assay at baseline. The biomarkers were measured in relative fluorescent units and the measurements were normalized and natural log transformed.30 PMA (cm^2) was derived using a single axial CT image at the level of the aortic arch and the suprasternal notch using a previously described method.5 All research was performed in accordance with relevant guidelines. All participants provided written informed consent, and the study was approved by the institutional review board at each of the 21 centers including Brigham and Women's Hospital.12

Defining Low PMA
For this study, we defined the current and/or former smokers as having low PMA if they had a PMA that was less than the 25th percentile of baseline never-smoking control participants, stratified by gender. This definition was applied to the current and/or former smokers at both baseline and phase 2.
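The low PMA definition can be illustrated with a short sketch. This is not the study's code: the function names and the toy control values are hypothetical, and the "inclusive" quantile interpolation is an assumption, since the text does not state how the 25th percentile was computed.

```python
from statistics import quantiles


def low_pma_thresholds(controls):
    """Return the 25th-percentile PMA per gender among control participants.

    `controls` is a list of (gender, pma_cm2) pairs (hypothetical layout).
    The "inclusive" interpolation method is an assumption for illustration.
    """
    by_gender = {}
    for gender, pma in controls:
        by_gender.setdefault(gender, []).append(pma)
    # quantiles(..., n=4) returns the three quartile cut points; [0] is Q1.
    return {g: quantiles(v, n=4, method="inclusive")[0] for g, v in by_gender.items()}


def has_low_pma(gender, pma, thresholds):
    """A participant has low PMA if strictly below the gender-specific cutoff."""
    return pma < thresholds[gender]


# Toy control values (made up, not study data).
toy_controls = [("M", 40), ("M", 44), ("M", 46), ("M", 50), ("M", 60),
                ("F", 20), ("F", 24), ("F", 26), ("F", 30), ("F", 36)]
toy_thresholds = low_pma_thresholds(toy_controls)
```

Because the cutoff is derived only from the never-smoking controls, the same thresholds can be reused unchanged when classifying smokers at baseline and at phase 2.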

Biomarker Feature Selection
To identify protein biomarkers of interest, we performed an initial univariate screen comparing mean biomarker measurements in current and/or former smokers with (n = 183) and without (n = 415) low PMA at baseline. There were 1,317 initial biomarkers and only the biomarkers with a Welch's t-test false discovery rate (FDR) q < 0.10 were retained. We then utilized Boruta feature selection with a one-step correction to identify the most relevant biomarkers for predicting the development of low PMA, i.e., the change from not having low PMA at baseline to having low PMA at the 5-year follow-up visit. The default parameters were used except for the number of estimators, which was set to 'auto', and the maximum depth, which was set to 8. Boruta was chosen due to it being an all-relevant feature selection method, meaning that it aims to uncover all the relevant features as opposed to uncovering the minimal number of features that score well.31,32

Predicting Low PMA with Machine Learning
To identify participants at highest risk for developing low PMA and to determine the utility of clinical and biomarker data to predict low PMA, we built 3 random forest classification models to predict the development of low PMA, i.e., the change from not having low PMA at baseline to having low PMA at the 5-year follow-up visit.33 The first was a clinical-only model that incorporated easily attainable baseline clinical measures (height, weight, pack years) and demographics (age and gender). The second was a biomarker-only model that incorporated the baseline protein biomarkers selected using the feature selection process. The third model incorporated both the clinical measures/demographics and the selected biomarkers. All models were trained on the same 2/3 random sample and tested on the remaining 1/3. Finally, 2:1 down-sampling was performed to account for event prevalence. Model hyperparameters were tuned using Bayesian optimization. The models' performances were summarized by the AUROC, the calibration curve, and the Brier score ("the mean squared difference between the predicted probability and the actual outcome") of their respective testing sets.33 The calibration curves were calculated using 10 bins.
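The three performance summaries named above can be written out directly. The sketch below is a plain-Python illustration under stated assumptions, not the study's implementation: the function names are hypothetical, the AUROC uses the pairwise (Mann-Whitney) formulation, and the calibration curve assumes equal-width probability bins, which the text does not specify.

```python
def auroc(y_true, y_prob):
    """Fraction of (event, non-event) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_curve(y_true, y_prob, n_bins=10):
    """Return (mean predicted, observed fraction) for each non-empty bin.

    Equal-width bins over [0, 1] are an assumption for illustration.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    return [(sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b))
            for b in bins if b]


# Toy outcomes and predicted probabilities (made up, not study data).
toy_y = [0, 0, 1, 1]
toy_p = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_y, toy_p)
toy_brier = brier_score(toy_y, toy_p)
toy_curve = calibration_curve(toy_y, toy_p)
```

A lower Brier score and a calibration curve close to the diagonal indicate better-calibrated probabilities, whereas the AUROC reflects only how well events are ranked above non-events.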

Individual Risk
To assess the importance of the combined model's individual predictors and to examine the predictors' impact (strength and direction) on the predicted probability for developing low PMA, a SHAP summary plot was built.34 SHAP plots utilize SHAP values, which are assigned to each predictor and indicate how much the predictor, alone, contributes to a model's prediction. This is based on the game theory idea of Shapley values, which represent the average marginal contribution of a predictor across all possible combinations of predictors. In other words, on the individual level, the difference between the predicted probability and the expected (base) probability is the sum of the SHAP values for every predictor.34,35 To determine if there were possible threshold values for the predictors, the clinical measurements and the 5 most important biomarker measurements were then plotted against their respective SHAP values. In addition, to visualize how SHAP values were affecting the prediction on the individual level, SHAP force plots were built for 10 randomly selected individuals: 5 predicted to develop low PMA and 5 predicted to not develop low PMA (using the mean predicted probability of the combined model's training set as the cutoff point).36 All SHAP analyses focused on the training set of the combined model unless otherwise specified.
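The additivity property described above can be checked on a toy model. The following sketch computes exact Shapley values by brute-force enumeration of predictor subsets; it is an illustration only and is not the algorithm used by the SHAP library for tree models. The function name, the toy model, and the choice to average "missing" predictors over a background sample are all assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, background):
    """Exact Shapley values for one instance `x` (brute force, small n only).

    The value of a subset S is the mean model output when predictors in S
    take x's values and the rest take each background row's values
    (an assumed, illustrative way to treat absent predictors).
    Returns (base_value, list_of_contributions).
    """
    n = len(x)

    def value(subset):
        total = 0.0
        for row in background:
            z = [x[j] if j in subset else row[j] for j in range(n)]
            total += model(z)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for k in range(n):
            # Shapley weight for subsets of size k that exclude predictor i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for combo in combinations(others, k):
                s = set(combo)
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)
    return value(set()), phi


def toy_model(z):
    # Hypothetical model: one additive term and one interaction term.
    return 0.2 + 0.5 * z[0] + 0.1 * z[1] * z[2]


toy_x = [2.0, 1.0, 3.0]
toy_background = [[0.0, 0.0, 0.0], [1.0, 2.0, 1.0]]
toy_base, toy_phi = shapley_values(toy_model, toy_x, toy_background)
```

Here the base value plus the sum of the contributions equals the model output for `toy_x`, which is the same decomposition a force plot displays for a single participant.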

Group Risk
Additionally, to examine whether there were any distinguishable groups within the participants, we clustered the combined model's training set based on the biomarkers' standardized SHAP values. This was done using PCA, to reduce dimensionality, and K-Means clustering. The optimal number of clusters was based on the silhouette coefficient of the raw SHAP values. We then stratified the clusters based on whether the participants developed low PMA in phase 2. Differences in the biomarkers' raw SHAP values between the 3 clusters were then assessed using a one-way ANOVA and visualized using box plots. All SHAP analyses focused on the training set of the combined model unless otherwise specified.
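Choosing the number of clusters by silhouette coefficient can be sketched as follows. This is a simplified plain-Python illustration, not the study's pipeline: it omits the standardization and PCA steps, uses made-up 2-dimensional points in place of SHAP values, and uses a deterministic farthest-point initialization that is an assumption chosen only to make the example reproducible.

```python
from math import dist


def kmeans(points, k, max_iter=100):
    """Basic K-Means with farthest-point initialization (illustrative choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)  # mean intra-cluster distance
        b = min(sum(dist(p, q) for q in other) / len(other)  # nearest other cluster
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def best_k(points, candidates=(2, 3, 4)):
    """Return the candidate k with the highest mean silhouette, plus all scores."""
    scored = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scored, key=scored.get), scored


# Made-up points forming three well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
              (20, 0), (20, 1), (21, 0)]
toy_k, toy_scores = best_k(toy_points)
```

On these toy points the silhouette coefficient is highest for 3 clusters, mirroring how a silhouette-based criterion would favor the number of clusters with the tightest, best-separated groups.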
Feature Selected Biomarkers with PMA
Finally, to explore the relevance of the 5 most important biomarkers, Pearson correlation coefficients were calculated between the biomarkers at baseline and the change in PMA between the 2 phases (cm^2) amongst participants without low PMA at baseline.
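As a brief illustration of the correlation analysis, the sketch below computes a Pearson coefficient and the t statistic commonly used to test it. The function names and toy vectors are hypothetical; converting the t statistic to a P value requires a t distribution with n - 2 degrees of freedom, which is not included here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient; always lies between -1 and 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))


# Toy vectors (made up): a perfectly linear relationship gives r = 1.
toy_r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
# A weak correlation can still be significant in a large sample.
toy_t = t_statistic(0.127, 415)
```

With the reported CDON correlation (r = 0.127) and n = 415, the t statistic is about 2.6, which shows how a small coefficient can reach statistical significance when the sample is large.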

Statistics
All analyses were conducted using Python 3.9.7 and R 4.0.3. All statistical tests were 2-tailed and P values < 0.05 were considered statistically significant unless otherwise specified. The initial univariate screen included a Welch's t-test where FDR q < 0.10 (calculated using the Benjamini-Hochberg procedure) was considered statistically significant. The AUROCs were compared using a t-test. A one-way ANOVA and boxplots were used to examine and visualize the differences in biomarker SHAP values between clusters. Boxplots included means (red triangles), medians (black lines) and error bars (1.5x the interquartile range). Pearson correlation coefficients were calculated to examine the relationship between biomarkers and change in PMA between baseline and phase 2.
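The Benjamini-Hochberg procedure mentioned above converts a set of P values into FDR-adjusted q values. A minimal sketch follows; the function name and the toy P values are hypothetical, and only the 0.10 cutoff comes from the text.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so each q is the minimum of
    # p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Toy P values (made up), screened at the q < 0.10 cutoff used in the text.
toy_p_values = [0.01, 0.04, 0.03, 0.20]
toy_q = benjamini_hochberg(toy_p_values)
toy_retained = [i for i, qv in enumerate(toy_q) if qv < 0.10]
```

In this toy example the first three P values are retained and the fourth is not, since its q value of 0.20 exceeds the cutoff.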

Fig. 3, e-Figure 7; clinical-only and biomarker-only models: e-Figures 8-9.) In addition, visual evaluation of the force plots from 10 randomly selected participants revealed a large amount of heterogeneity in the covariates that drive the individual participant's final predicted probability. The mean predicted probabilities of the combined, biomarker-only, and clinical-only models' training sets were 0.337, 0.337, and 0.333 respectively (combined model: Fig. 4, e-Figure 10; clinical-only and biomarker-only models: e-Figures 11-14).

Group Risk
K-Means clustering resulted in 3 distinct clusters of participants based on the silhouette coefficient. Performing principal component analysis (PCA) on the combined model's biomarkers' standardized SHAP values resulted in the first component explaining 27.6% of the variance and the second component explaining 20.5% of the variance. When stratified for the development of low PMA, one cluster was

Figures

Figure 1 Predicting
Figure 2 Combined
Figure 4 Force

Table 1
Baseline characteristics of COPDGene participants used in this study, non-stratified and stratified by low pectoralis muscle area at baseline.

Table 2
Biomarkers that passed a univariate screen (Welch's t-test, FDR q < 0.10) between those without and with low PMA at baseline and were considered relevant for predicting the development of low pectoralis muscle area via Boruta feature selection.