A Machine Learning Approach to Predict Body Composition in Advanced Cancer Patients.

Body composition and its changes affect cancer patient outcomes. Its determination requires specific and expensive devices. We designed a study to evaluate machine learning approaches to predict fat and skeletal muscle mass using clinical variables from daily practice. METHODS: We designed a cross-sectional study in advanced gastrointestinal cancer patients. Response variables were skeletal muscle mass and body fat mass, measured by bioimpedance analysis. Predictors were laboratory and anthropometric variables. Imputation methods were applied. Six approaches were analyzed: (1) multicollinearity analysis, best subset selection (BSS) and multiple linear regression; (2) multicollinearity, BSS and generalized additive models (GAM); (3) multicollinearity, lasso to perform variable selection and GAM; (4) ridge regression; (5) lasso regression; (6) random forest. Model selection was performed by evaluating the Mean Squared Error calculated by leave-one-out cross-validation. RESULTS: We included 101 patients under chemotherapy treatment. For skeletal muscle mass, the best approach was the combination of multicollinearity analysis followed by BSS and GAM using smoothing splines with 6 variables (albumin, Hb, height, weight, sex, lymphocytes). The adjusted R² was 0.895. The best approach for fat mass was multicollinearity analysis, variable selection by lasso, and GAM using smoothing splines with 3 variables (waist-hip ratio, weight, sex). The adjusted R² was 0.917. CONCLUSION: We developed the first accurate predictive models for body composition in cancer patients applying daily practice clinical variables. This study shows that machine learning is a useful tool to apply in body composition. This is a starting point to evaluate these approaches in research and clinical practice.
Statistical significance was defined as a two-sided p-value < 0.05. Statistical analysis was performed using R software Version 3.6.3. The following packages were used: Matrix, tidyverse, carData, car, glmnet, boot, foreach, leaps, mgcv, caret, randomForest, lattice, mice.


Introduction
Body composition is an important subject in oncology because its modifications reflect the way in which cancer affects body mass status, having implications in patients' nutrition, symptoms, and treatments. While weight loss has been considered a prognostic factor, the decrease in specific compartments could be more relevant. Losses in lean body mass (LBM) can result in a wide range of physiological impairments. This metabolically active compartment plays a role in immune function, glucose metabolism, protein synthesis and mobility [1][2][3]. Additionally, body composition has been linked to cancer patients' outcomes [3,4]. Sarcopenia was associated with worse overall survival [5] and post-surgical complications [6][7][8][9]. Sarcopenia and LBM could be better to determine drug dose than body-surface area (BSA) or flat-fixed dosing [10,11]. There is a significant association between sarcopenia and a decrease in LBM with toxicity across different oncology treatments, tumor types, and stages, suggesting an effect of sarcopenia on pharmacokinetics [3,4,10-16].
Several methods are available to determine body composition [17], such as anthropometry, computed tomography, and magnetic resonance [18][19][20][21], dual-energy X-ray absorptiometry (DXA) [22], and bioelectrical impedance analysis (BIA) [23]. Most of them imply high costs and are not applied in clinical practice. Anthropometry, through predictive models, has been used in some clinical fields, but these models were developed only on samples of healthy people and their implementation needs training, specific supplies, and time [24]. Additionally, these predictive models have methodological issues that should be considered. For instance, all of them use linear models [25][26][27][28][29]. Nowadays, there is a wide set of statistical and computational tools, such as machine learning, to reach a better understanding of data (see Supplementary Material: summary of machine learning and imputation concepts) [30]. There is a variety of modern statistical learning techniques which can be used to predict quantitative variables, such as ridge regression, lasso regression, and generalized additive models (GAM) [30]. On the other hand, machine learning methods, which share with statistical learning the concept of learning from the data, apply computational algorithms to resolve their tasks [31]. One example is random forests (RF). All these techniques, from the classical to the most sophisticated ones, have a key point: the way in which they can model the data. There are more restrictive methods, such as linear regression, and other more flexible ones, such as ridge, lasso, GAM with smoothing splines and RF [30,32]. With regard to selecting the most important variables, techniques such as best subset selection (BSS) or lasso were developed to solve this problem [30]. Cross-validation, a strategy to avoid overfitting, is another important aspect to consider when a predictive model is developed. Finally, missing data is a frequent problem that could introduce bias and weaken generalizability [33]. Thus, imputation methods are useful tools to handle it [34,35].
In conclusion, body composition has prognostic value, treatment implications, and is related to patients' symptoms and care. Specific devices, such as DXA or BIA, can measure it. However, these are used in research and they are not implemented in daily clinical practice. Furthermore, neither equations nor predictive models applying clinical variables have been built in cancer patients to estimate body composition, especially considering current machine learning methods.
We performed a study to develop two predictive models to estimate body fat mass and skeletal muscle mass with clinical variables, applying several modern statistical techniques, to analyze the performance of machine learning methods and to develop a practical everyday tool.

Study design and patients
Considering the impact of body composition in cancer patients, a cross-sectional study was designed to evaluate several machine learning approaches to estimate skeletal muscle mass and body fat mass using variables obtained in clinical practice. The goal of developing a predictive model using these variables is to facilitate body composition determination in this scenario. This cross-sectional study was nested in a larger prospective study: the data for the cross-sectional study were the first measurement of the prospective one. The development of this study and its reporting followed the EQUATOR Network guidelines [36].
Patients aged 18 years or over were eligible for enrollment if they had histologically confirmed advanced gastric, hepatobiliary, pancreatic or colorectal adenocarcinoma, had an Eastern Cooperative Oncology Group (ECOG) performance-status score of 0 to 2, had weight loss in the last six months (≥ 5%) (non-refractory cachexia) [37], and had adequate hepatic, renal, and bone marrow function. Patients were ineligible if they were receiving systemic glucocorticoids, had dehydration, severe edema, or a cardiac pacemaker.
All the participants provided written informed consent. The study was approved by The Ethics Board of the Gastroenterology National Hospital "Dr. Bonorino Udaondo" Buenos Aires, Argentina, and met the recommendations stated in the Helsinki Declaration.

Dependent variables
Two variables were considered: skeletal muscle mass and body fat mass, both measured in kg. They were determined by multi-frequency BIA (Inbody 120) according to the manufacturer's specifications.

Predictor variables
Thirteen predictor variables were chosen. They were selected considering their potential association with body composition. Five were anthropometric variables: height, weight, waist-hip ratio (WHr), body mass index (BMI) and BSA (calculated as [(height (cm) × weight (kg)) / 3600]^(1/2)). Two were sex and age.
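The two derived anthropometric predictors can be computed directly from height and weight. The following is a minimal sketch, not the authors' code: the function names and the example patient are hypothetical, the BSA expression is the one quoted above, and BMI is assumed to follow its standard definition (weight in kg divided by squared height in m).

```python
import math


def body_surface_area(height_cm: float, weight_kg: float) -> float:
    """BSA in m^2, using sqrt(height (cm) * weight (kg) / 3600) as in the text."""
    return math.sqrt(height_cm * weight_kg / 3600.0)


def body_mass_index(height_cm: float, weight_kg: float) -> float:
    """BMI in kg/m^2 (standard definition, assumed here)."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


# Hypothetical patient, for illustration only.
height_cm, weight_kg = 170.0, 65.0
print(f"BSA: {body_surface_area(height_cm, weight_kg):.2f} m^2")
print(f"BMI: {body_mass_index(height_cm, weight_kg):.1f} kg/m^2")
```

Because BMI and BSA are both functions of height and weight, they are natural candidates for the collinearity check described in the statistical analysis.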
Statistical and data analysis
A descriptive analysis was performed. We evaluated the distribution of each variable and summarized it accordingly: with the mean and standard deviation, or with the median and the first and third quartiles.
Some key aspects need to be considered to build a predictive model or algorithm: the problem of variable selection, the collinearity between these variables, and the flexibility to model the shape of the data (see the summary in the Supplementary Material).
Therefore, different approaches were implemented to solve these issues. Two techniques were used to deal with variable selection: BSS and lasso regression. In this setting, multicollinearity was analyzed, and variables with a Variance Inflation Factor (VIF) above 5 were considered collinear. Two regression techniques were combined with these variable selection methods: a restrictive one, linear regression, and a more flexible one, GAM with smoothing splines. In addition, two other approaches were carried out separately: ridge and lasso regression. Here, a variable selection method was not added, considering the capacity of each one to detect variable importance. This task is solved by adjusting the variables' weights (ridge) or by dropping the less relevant ones (lasso). Finally, random forests were employed, a machine learning technique which belongs to the family of classification and regression trees. It is an accurate classification and prediction tool, and it can handle the variable importance issue.
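To illustrate the collinearity screen, the sketch below computes VIFs as the diagonal of the inverse of the predictors' correlation matrix, which equals 1 / (1 - R²) from regressing each predictor on the others. It is an illustration under stated assumptions, not the authors' code: the data are invented, the helper names are hypothetical, and only the threshold of 5 comes from the text.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long numeric columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    p = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(p)]
        for i in range(p)
    ]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Invented data: weight and BMI move together, age does not.
names = ["weight", "bmi", "age"]
weight = [55.0, 62.0, 70.0, 78.0, 85.0, 93.0]
bmi = [19.5, 21.8, 24.9, 27.2, 30.1, 32.6]
age = [61.0, 66.0, 70.0, 48.0, 52.0, 55.0]

vifs = variance_inflation_factors([weight, bmi, age])
collinear = [name for name, vif in zip(names, vifs) if vif > 5]  # threshold from the text
print(collinear)
```

In practice, one of a pair of flagged variables would be dropped before variable selection; which one to keep is a modelling decision the sketch does not address.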
Thus, six different approaches were applied for each response variable and their predictive performance was compared. They were: (1) multicollinearity, BSS and multiple linear regression; (2) multicollinearity, BSS and generalized additive models (GAM); (3) multicollinearity, lasso to perform variable selection and GAM; (4) ridge regression; (5) lasso regression; (6) random forest (RF). We applied smoothing splines with GAM using cross-validation to find the best degrees of freedom for each variable. The tuning parameter λ for (4) and (5) was the value that produced the minimum Mean Squared Error (MSE) calculated by cross-validation. However, the value of λ for variable selection in (3) was the one that produces the minimum MSE plus 1 standard deviation.
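The two choices of λ can be sketched as follows. This is a hedged illustration: the cross-validation curve is invented, the function name is hypothetical, and the "plus one" rule is implemented the way glmnet reports it (`lambda.1se`, the largest λ whose cross-validated error is within one standard error of the minimum), which is assumed to correspond to the rule described above.

```python
def select_lambdas(lambdas, cv_mse, cv_se):
    """Return (lambda_min, lambda_1se) from a cross-validation curve.

    lambda_min minimizes the cross-validated MSE; lambda_1se is the largest
    lambda whose MSE does not exceed the minimum MSE plus its standard error.
    """
    i_min = min(range(len(lambdas)), key=lambda i: cv_mse[i])
    threshold = cv_mse[i_min] + cv_se[i_min]
    lambda_1se = max(lam for lam, mse in zip(lambdas, cv_mse) if mse <= threshold)
    return lambdas[i_min], lambda_1se


# Invented cross-validation results for a grid of penalties.
lambdas = [0.01, 0.05, 0.1, 0.5, 1.0, 2.0]
cv_mse = [5.3, 5.0, 4.9, 5.1, 5.6, 7.0]
cv_se = [0.4, 0.4, 0.3, 0.3, 0.4, 0.5]

lambda_min, lambda_1se = select_lambdas(lambdas, cv_mse, cv_se)
print(lambda_min, lambda_1se)
```

A larger λ shrinks more coefficients to zero in the lasso, so the second rule favors a sparser variable set, which suits its use as a selection step before fitting the GAM.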
Leave-one-out cross-validation (LOOCV) was used to calculate the MSE for each model to compare them. With each final model, we measured the agreement between the model and observed values by BIA according to the Bland and Altman method [38]. The 95% limits of agreement, with their 95% confidence interval, were determined. The normal distribution assumption of differences was previously checked. Besides, we plotted the difference between measurements (observed - predicted) against their mean.
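As an illustration of how LOOCV yields a test MSE, the sketch below refits a model n times, each time leaving one patient out and predicting that patient. It assumes a simple one-predictor linear regression in place of the models compared in the study, and the data and function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def loocv_mse(x, y):
    """Mean squared prediction error over n leave-one-out fits."""
    squared_errors = []
    for i in range(len(x)):
        # Train on every observation except the i-th one.
        x_train, y_train = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_train, y_train)
        prediction = intercept + slope * x[i]
        squared_errors.append((y[i] - prediction) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Invented data: body weight (kg) and skeletal muscle mass (kg).
weight_kg = [52.0, 58.0, 63.0, 70.0, 76.0, 81.0, 90.0]
muscle_kg = [21.5, 23.9, 25.0, 28.4, 30.1, 31.8, 36.0]
print(f"LOOCV MSE: {loocv_mse(weight_kg, muscle_kg):.3f}")
```

Every observation is used once as a one-patient test set, which is why the procedure suits small samples that cannot be split into separate training and test sets.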

Study population and general characteristics
From August 2016 to January 2018, 101 patients with advanced upper or lower digestive adenocarcinomas were evaluated. The most frequent diagnoses were colorectal cancer (n = 52, 51.5%) and pancreatic cancer (n = 27, 26.7%). Performance Status was ECOG 0-1 in 67.3% (n = 68) and median age was 59.5. All the patients were under chemotherapy treatment with regimens based on fluoropyrimidines. Table 1 shows patients' characteristics. Overall, 1.8% of the data were missing (Figure S1). Lymphocytes had the highest percentage of missing values (14.9%); all other variables were below 3%. The predictive mean matching method with 5 iterations was applied to impute each variable considering the same dataset: both response variables and predictors [35,39].
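The idea behind predictive mean matching can be sketched as follows: a regression is fitted on the complete cases, and each missing value is replaced by an observed value borrowed from a donor whose predicted value is close to the prediction for the incomplete case. This is a deliberately simplified, single-predictor illustration with invented data and hypothetical function names. It omits steps of the full algorithm, such as drawing the regression parameters at random and iterating over several incomplete variables, and it is not the authors' code.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(x, y, k=3, seed=0):
    """Fill None entries of y with observed values taken from nearby donors."""
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_line([o[0] for o in observed], [o[1] for o in observed])
    donor_predictions = [intercept + slope * xi for xi, _ in observed]
    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)
            continue
        target = intercept + slope * xi
        # The k complete cases whose predicted values are closest to the target.
        nearest = sorted(range(len(observed)),
                         key=lambda i: abs(donor_predictions[i] - target))[:k]
        completed.append(observed[rng.choice(nearest)][1])
    return completed


# Invented data: albumin (g/dL) fully observed, lymphocyte count with gaps.
albumin = [3.1, 3.4, 3.6, 3.8, 4.0, 4.1, 4.3, 4.5]
lymphocytes = [900.0, 1100.0, None, 1400.0, 1500.0, None, 1800.0, 2000.0]
print(pmm_impute(albumin, lymphocytes, k=3, seed=1))
```

Because imputed values are always taken from observed ones, the method cannot produce implausible values such as negative counts.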
Skeletal muscle mass predictive model
Considering the six approaches described previously, the test MSE over the validation set is presented in Fig. 1A and in Figures S2 to S6 of the Supplementary Material. Table 2 shows the role of each variable in every approach. The best model was obtained with approach 2, as described below. BSS showed four sets of variables with similar values for R² (Figure S2). These sets were: with 5 variables (albumin, Hb, height, weight, sex) (R² = 0.8771), with 6 (same plus lymphocytes) (R² = 0.8814), with 7 (same plus WHr) (R² = 0.8824) and with 8 (same plus age) (R² = 0.8818). Although the highest adjusted R² was obtained for the combination of 7 variables, all four combinations were analyzed because their values were close. The lowest test MSE was obtained applying GAM using smoothing splines with the combination of 6 variables. In this setting, the best model found was the result of the combination of linear models (1 effective degree of freedom (edf)) for albumin (p = 0.001), height (p < 0.0001), weight (p < 0.0001), and lymphocytes (p = 0.019); a positive parametric coefficient for sex (male) (p < 0.0001); and a smoothing splines model with 6.68 edf for Hb (p = 0.013). The adjusted R² obtained for this model was 0.895.
In Fig. 2A, predicted values and observed values for skeletal muscle are plotted. Dots are close to the line and homogeneously distributed on both sides of it. The plot of differences against the mean (Fig. 2B) shows a random distribution of the observations, and no relationship was seen between discrepancies (difference) and the values (mean). We formally examined this relationship with Spearman's rank correlation coefficient, which was 0.132, indicating no significant correlation (p = 0.19). The 95% limits of agreement were −4.20 and 4.12 (Fig. 2B).

Body fat mass predictive model
Out of the six analysis approaches, the lowest MSE, calculated on the validation set, was obtained for approach 2 with multicollinearity analysis, BSS, and GAM with 5 variables (MSE 11.01) (Fig. 2A and Figures S7 to S11 in the Supplementary Material).
However, an MSE value of 11.04 was the result of approach 3: multicollinearity analysis, variable selection by lasso regression, and GAM with 3 variables (Figure S10). This model was selected as the best regression model, considering the parsimony principle and the small difference between both MSE values. This approach is described below. In Table 3, as for skeletal muscle, all the approaches and their variables are displayed. Finally, Fig. 2C shows the scatter plot of the observed values and predicted values of body fat mass. Dots are tightly clustered around the equality line. The plot of differences against the mean (Fig. 2D) does not show a specific pattern, revealing no relationship between discrepancies and values. Additionally, Spearman's rank correlation coefficient was 0.084, formally indicating no significant correlation (p = 0.40). The 95% limits of agreement were −6.51 and 6.58 (Fig. 2D).
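The limits of agreement reported for both models follow the Bland and Altman construction: the mean of the differences (observed minus predicted) plus and minus 1.96 times their standard deviation. A minimal sketch with invented values is given below. The function name is hypothetical, and the confidence intervals of the limits are not computed here.

```python
import math


def limits_of_agreement(observed, predicted):
    """Return (bias, lower, upper): mean difference and 95% limits of agreement."""
    differences = [o - p for o, p in zip(observed, predicted)]
    n = len(differences)
    bias = sum(differences) / n
    # Sample standard deviation of the differences (n - 1 denominator).
    sd = math.sqrt(sum((d - bias) ** 2 for d in differences) / (n - 1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Invented fat mass values (kg): BIA measurement versus model prediction.
observed_kg = [18.2, 22.5, 15.1, 30.4, 25.0, 12.8]
predicted_kg = [17.0, 24.1, 16.0, 28.9, 26.2, 13.5]

bias, lower, upper = limits_of_agreement(observed_kg, predicted_kg)
print(f"bias = {bias:.2f} kg, limits of agreement = {lower:.2f} to {upper:.2f} kg")

# Coordinates for the plot of differences against means.
points = [((o + p) / 2, o - p) for o, p in zip(observed_kg, predicted_kg)]
```

The interval is expected to contain about 95% of the differences only if they are approximately normally distributed, which is why that assumption was checked beforehand.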

Discussion
In this cross-sectional study with digestive cancer patients, we built two predictive models to estimate fat mass and muscle mass with high accuracy. These models were developed applying current machine learning methods throughout the whole process, from variable selection to model building. Additionally, these models used clinical variables which can easily be obtained during daily clinical practice. Accordingly, it is the first article in this field with this kind of approach.
Machine learning assembles a wide range of tools which could be used to explore data, to find patterns in patients' characteristics, to understand relationships between variables or to predict an outcome [30]. The implementation of these techniques in biomedical research has been growing during the last decade; however, they are not widely known and used [40]. Thus, we added a summary of some machine learning concepts in the supplementary material to bring this knowledge closer. In this study, a variety of methods was applied to analyze their usefulness and accuracy in the body composition field. Variable selection is one of the first problems to handle when a model is built. While manual variable selection is strongly influenced by our knowledge, techniques such as best subset selection or lasso apply an algorithm and output a measure of their performance, making it easy to find the best set of variables. Besides, these techniques could reveal relevant variables not previously considered. Hence, this step becomes a part of the research process, allowing new predictors to be identified. Here, we found similar results during the selection process with different methods, reaching the best variable set for skeletal muscle with BSS and for fat mass with lasso regression. These comparable results, obtained with processes as different as BSS and RF, highlight the importance of those variables. Additionally, these variable selection methods improved the accuracy of regression techniques. Current predictive models, developed with anthropometric measurements, are built with linear regression [25][26][27][28][29]. We showed how the incorporation of more flexible approaches can outperform linear models. Finally, cross-validation, used for model selection, avoided overfitting and contributed to variable selection, allowing us to find the best approach.
Skeletal muscle and fat mass have been considered central in several issues in cancer. Patients with sarcopenia who are under oncology treatment have more toxicity, worse clinical outcomes, and shorter survival [41][42][43]. Currently, knowledge of skeletal muscle mass and fat mass can identify a bad prognosis in cancer patients, and that may also be true in a wider patient population [41,44].
Some aspects of our work should be considered. In this study, all patients showed weight loss because this cross-sectional study was nested in a prospective one. Although DXA is usually considered the reference method, BIA is a valid and widespread methodology in patients without dehydration or edema. Concerning the methodology, we used LOOCV to test each model because the sample could not be split to have a test set due to its small size. Incomplete data is a common setting in health data sets. Leaving those as missing values implies dropping the whole case when techniques such as regression methods are used. Besides, this is a way to introduce bias. Thus, we used imputation methods as a way to obtain more accurate predictions.
This work shows the accuracy and utility of machine learning approaches to predict body composition. These findings prove the efficacy of these methods as well as the accuracy of our models, allowing the possibility of them being used as an investigation tool for pharmacokinetic models [45]. No general guidelines exist for drug adjustment in cachectic patients. Our models could be useful to adjust antineoplastic doses, first in clinical research and later in clinical practice.
It is now clear that body composition has an impact on oncology patients and their outcomes. Up to the present, no tool which allows determining body composition in daily practice without specific devices has been established. Working with current statistical learning techniques, we were able to develop the first predictive models which use daily practice clinical variables and can accurately predict fat mass and skeletal muscle mass. We believe that this could be a useful tool in the clinical setting, allowing oncologists to obtain relevant information in an easy way and enabling more adequate patient management and treatment.

Funding
The longitudinal study in which data were collected (from August 2016 to January 2018) was conducted with a grant from the National Cancer Institute, Argentina (number 15001837). The current study was developed without grant support.