Establishment of metabolic syndrome prediction model for occupational population based on the Lasso regression algorithm

Background: Screening for metabolic syndrome (MS) is important for early detection in the occupational population. This study aimed to screen out biomarkers related to MS and establish a risk assessment and prediction model based on the routine physical examination of an occupational population. Methods: The least absolute shrinkage and selection operator (Lasso) regression algorithm of machine learning was used to screen biomarkers related to MS. The accuracy of the logistic regression model built on the Lasso-selected biomarkers was then verified. Finally, the screened biomarkers were used to establish a logistic regression model and calculate the odds ratio (OR) of each biomarker. Results: A total of 2844 occupational workers were included, and 10 biomarkers related to MS were screened. The area under the curve (AUC) values for non-Lasso and Lasso regression were 0.652 and 0.907, respectively. The established risk assessment model revealed that the main risk factors were basophil absolute count (OR: 3.38), platelet packed volume (OR: 2.63), leukocyte count (OR: 2.01), red blood cell count (OR: 1.99), and alanine aminotransferase level (OR: 1.53). Conclusion: The risk assessment model based on the Lasso regression algorithm helped identify MS with high accuracy in the physical examination of an occupational population.


Introduction
Metabolic syndrome (MS) refers to a group of metabolism-related diseases, including obesity, dyslipidemia, diabetes/impaired glucose tolerance, hypertension, and other diseases [1]. The number of patients with MS has increased with the increasing number of obese patients worldwide [2]. At present, the global prevalence of MS is about 25%, indicating that nearly one billion people are affected. The occupational population accounts for a significant share of these patients, and that share continues to increase [3]. MS has imposed a huge economic burden and has become a serious public health problem.
China has the largest occupational population in the world, nearly 900 million people. Every year, nearly 25 million workers are affected by occupational hazards, and MS is already an important risk factor that seriously affects the health of the occupational population [4]. Many studies have examined the relationship between the working environment of the occupational population and MS. Ma et al. confirmed that exposure to heavy metal elements in the work environment affected the body's metabolic function and increased the risk of MS in the Chinese population [5]. Huang et al. confirmed that long-term exposure to noise in the work environment increased the likelihood of MS in the Chinese occupational population [6]. Other studies confirmed a relationship between MS and the type of work in different occupational groups [7][8][9]. Therefore, early MS screening for the occupational population is of great significance.
Machine learning, whereby a computer algorithm learns from prior experience, was recently shown to outperform traditional statistical modeling approaches [10][11]. With the rapid development of artificial intelligence, machine learning algorithms have been widely used to screen biomarkers for related diseases [12][13][14]. Various supervised machine learning models based on the least absolute shrinkage and selection operator (Lasso) regression algorithm have been successfully applied to medical data [15]. However, no relevant studies have used the Lasso regression algorithm to screen biomarkers for MS. The risk of MS could be predicted better if MS-related biomarkers were screened from routine physical examination markers and a risk prediction model were established on them. In this study, the Lasso regression feature selection algorithm of machine learning was used to screen the biomarkers related to MS, and a risk prediction model was established.
The objective of the study was to provide early warning and preventive measures for MS in an occupational population.

Materials and Methods
Population and data collection
This study included occupational workers with high-temperature operations in Zhejiang Province, China (referring to operations with an average wet bulb globe temperature (WBGT) index of ≥25°C at the workplace during the production process) between September 2010 and September 2020. The working environments covered the metallurgical industry (steelmaking, ironmaking, steel rolling, coking, and so forth); the machinery manufacturing industry (casting, forging, heat treatment, and so forth); and kiln and furnace work in the glass and refractory industries. A total of 3577 workers were examined, of which 733 workers were excluded due to incomplete records and errors. Finally, 2844 workers were selected for the study. This study included 32 basic biomarkers for routine physical examination in the population (Table 1).

Identification of MS
MS was diagnosed according to the criteria set by the Diabetes Branch

Lasso regression
Lasso regression feature selection is a shrinkage (biased) estimation method used to process high-dimensional data with complex collinearity. The basic idea is to construct a penalty function that selects, from the input variables, the main variables strongly correlated with the output parameters and builds a refined regression model [17]. The penalty function constructed is as follows: is the dependent variable,
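For orientation, the textbook form of a Lasso-penalized least-squares objective can be written as below. This is a general sketch and an assumption about the shape of the penalty function, not a reproduction of the formula used by the authors. The symbols are introduced here for illustration: $y_i$ is the dependent variable for subject $i$, $x_{ij}$ is the value of the $j$-th candidate biomarker, $\beta_0$ and $\beta_j$ are the regression coefficients, $n$ is the number of subjects, $p$ is the number of candidate biomarkers, and $\lambda \ge 0$ is the penalty coefficient.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}
  \left\{ \frac{1}{2n} \sum_{i=1}^{n}
    \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
```

For a binary outcome such as the presence or absence of MS, the squared-error term is typically replaced by the negative log-likelihood of the logistic model, while the $\lambda \sum_j \lvert \beta_j \rvert$ term is unchanged. Larger values of $\lambda$ shrink more coefficients exactly to zero, which is what removes biomarkers from the model.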

Statistical analysis
A one-way analysis of variance was used to compare routine physical examination biomarkers between workers with and without MS. The random sampling method was used to deal with the sample imbalance between workers with and without MS [18]. Tenfold cross-validation was used to determine the best Lasso penalty coefficient. Based on the selected biomarkers, the logistic regression risk prediction model was established, and the OR of each biomarker was calculated. A P value of less than 0.05 indicated a statistically significant difference. The Lasso algorithm was run with the "glmnet" package. The receiver operating characteristic (ROC) curve was used to evaluate the accuracy of the predictive risk model. All analyses were performed using the statistical programming environment R (version 3.6.0).
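The three generic steps named above (rebalancing the classes by random sampling, tenfold cross-validation, and ROC-based evaluation) can be illustrated with a minimal sketch. It is written in Python with the standard library only, purely for illustration; it is not the authors' R/glmnet workflow. The function names are hypothetical, and reading "random sampling" as random undersampling of the majority class is an assumption.

```python
import random


def undersample(labels, seed=0):
    """Return sorted row indices that keep every minority-class row and an
    equally sized random subset of the majority class (assumed reading of
    'random sampling'; labels are 1 = MS, 0 = no MS)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))
    return sorted(minority + kept)


def kfold_indices(n, k=10, seed=0):
    """Shuffle range(n) and split it into k folds whose sizes differ by at
    most one; each fold serves once as the validation set."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive scores higher than a randomly chosen negative
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))
```

In a cross-validated search, a model would be fitted for each candidate penalty coefficient on nine folds and scored on the held-out fold, and the coefficient with the best average score would be kept. The pairwise AUC above is quadratic in the sample size, which is acceptable for a few thousand subjects.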

Results
Basic characteristics of the population and biomarkers in physical examination
A total of 2844 occupational workers were involved (Table 2)

Selection of physical examination biomarkers
The biomarkers were selected using the Lasso binary logistic regression model. [19][20]. They both confirmed that routine examination biomarkers such as serum cholesterol, triglyceride, and blood glucose levels, height, weight, blood pressure, and so forth, analyzed with multivariate logistic regression, could be used as effective predictors of MS. In this study, 10 biomarkers related to MS were further screened, including red blood cell count, total protein level, percentage of neutrophils, red blood cell distribution width CV, absolute number of neutrophils, leukocyte count, absolute value of basophils, alanine aminotransferase level, monocyte count, and platelet count. These potential biomarkers could be used to assess the risk of MS.
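A logistic regression risk model expresses the contribution of each biomarker as an odds ratio, which is the exponential of the fitted coefficient. The sketch below illustrates that relationship in Python, for illustration only. The function names are hypothetical, and the coefficient is simply back-calculated from a reported OR rather than taken from the authors' fitted model.

```python
import math


def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a predictor whose logistic
    regression coefficient is beta."""
    return math.exp(beta)


def predicted_probability(intercept, betas, values):
    """Predicted probability from a logistic model:
    p = 1 / (1 + exp(-(intercept + sum(beta_j * x_j))))."""
    z = intercept + sum(b * x for b, x in zip(betas, values))
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


# Illustration: an OR of 3.38 corresponds to a coefficient of ln(3.38).
beta_example = math.log(3.38)
```

Raising a predictor by one unit multiplies the odds of the outcome by its OR while the other predictors are held fixed. The change in predicted probability, by contrast, depends on the baseline risk.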
A low-level inflammatory state is considered to be a major potential mechanism of MS. Recent studies have found that the leukocyte count is associated with MS and cardiovascular disease. A longitudinal cohort study of a healthy population in China showed a significant correlation between white blood cell count and MS (relative risk = 2.66). In the same study, the total counts of white blood cells, neutrophils, monocytes, and basophils were risk factors for obesity [21]. Liu et al. found a significant positive correlation between alanine aminotransferase level and the risk of MS through quantitative and qualitative analyses, indicating that this level had predictive value for the incidence of MS [22]. Further, a positive correlation was reported between red blood cell parameters, hematocrit, and MS in a large longitudinal cohort in China [23]. Laufer et al. found that the prevalence of MS was 29% when the red blood cell distribution width was less than 14% and 34% when it was more than 14% [24]. The biomarkers identified in the aforementioned studies were consistent with those screened in the present study.
In this study, the Lasso feature selection algorithm was used to accurately screen the physical examination markers related to MS for an occupational population. The study showed that Lasso feature selection made the screened biomarkers more explanatory and reduced the complexity of the subsequent risk model. In future studies, the accuracy of other machine learning algorithms, such as decision trees [25], random forests [26], and neural networks [27], can be compared with that of this method.