3.1 Clinical information of patients
We retrospectively collected and summarized the information (Table 1) about the relevant metabolic information, tests and examination results during their hospitalization. Then the clinical data with significant differences were selected, and a summary table of relevant difference information (Table 2) was established. We constructed the subsequent GSD risk prediction model based on their metabolism-related tests and examination results.
Table 1
Clinical features of GSD cases.
Covariates | Characteristics |
GSD case | NGSD case |
Sex(M/F) | 23/41 | 42/34 |
Age(years) | 50.9 ± 14.1 | 54.9 ± 15.2 |
BMI(kg/m2) | 23.8 ± 2.8 | 22.8 ± 3.4 |
Smoke(Y/N) | 9/55 | 14/62 |
Drink(Y/N) | 3/61 | 10/66 |
HBsAg(pos/neg) | 3/54 | 1/75 |
HCV-RNA(pos/neg) | 0/57 | 0/76 |
TG | 1.31(0.52–17.12) | 1.19(0.46–6.5) |
HDL-C | 1.23(0.54–2.25) | 1.21(0.55–2.21) |
LDL-C | 2.39(1.36–4.85) | 2.31(1.00-5.46) |
VLDL | 0.70(0.21–1.32) | 0.60(0.09–2.31) |
FBG | 5.36(3.38–13.57) | 5.28(3.78–10.87) |
SBP | 123(90–168) | 129.5(99–182) |
DBP | 76(56–105) | 80.5(56–100) |
ALT | 21(6-454) | 15(4-109) |
AST | 20(7-268) | 18.5(9-108) |
ALP | 81(15–711) | 77.5(39–257) |
GGT | 26(8-782) | 19(6-231) |
WBC | 5.71(1.5-12.42) | 6.15(3.67–18.2) |
N% | 59(6.78–84.2) | 63.7(35–92) |
PLT | 219(67.3–365) | 212(82–385) |
NAFLD(No/light/severe) | 33/23/8 | 61/15/0 |
liver·cirrhosis(Y/N) | 1/63 | 1/75 |
Data in normal distribution was presented by mean ± SD, and data in nonnormal distribution was presented by median (IQR (interquartile range)). Abbreviations: BMI: body mass index; TG: triglyceride; HDL-C: High density liptein cholesterol; LDL-C: Low-Density Lipoprotein Cholesterol; VLDL: very low density lipoprotein; FBG: Fasting Blood Glucose; SBP: Systolic Blood Pressure; DBP: diastolic blood pressure; ALT: alanine aminotransferase; AST: aspartate aminotransferase; ALP: alkaline phosphatase; GGT: γ-glutamyl transpeptidase; WBC: white blood cell; N%: neutrophil; PLT: platelet; FLD: fatty liver disease.
We classified the collected clinical information, analyzed the difference of counting data (age, BMI, TG, etc.) through independent sample T-test. We used ratio comparison to statistically describe the two categorical variable (gender, smoking and alcohol history, fatty liver, etc.). Then We selected relevant indicators with a P value (OR value) less than 0.05 to establish a summary table of relevant difference information (Table 2).
Table 2
Summary of Differential Clinical Information
Covariates | Pr(> F) | P value |
Age(years) | 0.6865 | 0.0546 |
BMI(kg/m2) | 0.4639 | 0.0361 |
SBP | 0.698 | 0.0461 |
ALP | 0.03515 | 0.0390 |
GGT | 0.005381 | 0.0030 |
WBC | 0.02455 | 0.0142 |
N% | 0.02455 | 0.0004 |
NAFLD | ** | 0.0149 |
Sex(M/F) | ** | 0.0244 |
Compile a table for statistical analysis of clinical information with significant differences and their P-values. Pr (> F): The test P value in the analysis of homogeneity of variance. When it is greater than 0.05, the degree of dispersion of the data variance represented by it does not differ significantly, indicating homogeneity of variance; When it is less than 0.05, the degree of dispersion of the data variance represented by it differs greatly, indicating uneven variance.The mark “**” indicates that this type of data is a second categorical variable.The P-value is a parameter used to determine the difference in clinical information between the GSD and NGSD groups. When it is less than 0.05, there is a significant difference in this clinical indicator between the two groups.
3.2 The clinical risk prediction model
We applied the nomogram to predict the risk of gallstone disease in individuals, screened the clinical data of the patients through lasso regression analysis[25]. Furthermore, logistic regression was used for further screening. The clinical variables significantly related to the incidence rate of GSD in the lasso regression and logistic regression models were selected to be included in the risk prediction nomogram of GSD disease by using the R language pack “rms”. We reflected the risk degree of each clinical index in the process of GSD formation by assigning the score of each variable on the nomogram model (Fig. 2A,B,C). The AUC values of clinical nomogram were calculated respectively, and the ROC curve and calibration curve were made by using the R language pack "rms" and “ROCR” to verify the accuracy of the nomogram prediction model.
The AUC value of the prediction model is 0.812 (Fig. 2D,E), indicating that the model has strong predictive value. In this prediction model, the total risk points for most patients in this study ranged from 125 to 175. The indicators such as NAFLD severity, BMI size, and FBG have a higher risk points on the incidence of GSD.
3.3 Analysis of serum metabolomics data
We screened significantly different metabolites by using differential analysis of serum non-targeted metabolomics data results from GSD and NGSD patients. 245 differential metabolites were obtained by HMDB annotation. We conducted PCA analysis on these 245 metabolites (Fig. 3) and visualized them using volcanic maps.
Through principal component analysis, we found significant differences in the expression of these 245 metabolites in peripheral blood samples of GSD and NGSD patients. The PCA model (R2X1 = 0.243,R2X2 = 0.117) and OPLS-DA model (R2 = 0.12,Q2=-0.39) both showed significant separation between the two groups of samples(Fig. 3A,B,C).
3.4 The clinical risk prediction model for metabolomics data
Using the R language package "rms", we screened 245 annotated metabolites through lasso regression and Logistic regression models, and obtained 23 significantly related metabolite data. We visualized these metabolites by using heatmaps and pods (Fig. 4).Through the pod plot, we can visually see the difference in expression levels of various metabolites between the GSD group and the NGSD group(Fig. 4A). In the heatmap, 8 metabolites were highly enriched in NGSD group data and 15 metabolites were highly enriched in GSD group data (Fig. 4B).
We included these 23 metabolites in the risk prediction column chart for GSD diseases (Fig. 5). Generated a metabolite related GSD risk column chart(Fig. 5A,B,C).The AUC values of clinical nomogram were calculated. The ROC curve and calibration curve were made by using the R language pack "rms" and “ROCR” to verify the accuracy of the nomogram prediction model.
The AUROC value of the prediction model is 1 (Fig. 5D,E), indicating the model has strong predictive value. In this prediction model, the total risk points for most patients in this study ranged from 206 to 228. The indicators such as eplerenone (HMDB0014838), Citrusinine I (HMDB0030374) and oleamine (HMDB0002117) have a higher risk points on the incidence of GSD.
3.5 Pathway enrichment for the overall metabolomics data
We have compiled overall non-targeted metabolomics data for both groups of patients. We annotated the omics data of peripheral blood serum after non-targeted metabolomics testing using HMDB and KEGG, conducted the metabolic pathway enrichment analysis through the online tool "MetaboAnalyst" (https://www.metaboanalyst.ca/) to identify the pathway with significant differences and correlations (Fig. 6A,B).
In the pathway map, we found that the increased expression of Citrate cycle (TCA cycle)(P < 0.01) and Pantothenate and CoA biosynthesis pathways(P < 0.01) are significantly correlated with the incidence of GSD. The decreased expression of Glycerophospholipid metabolism pathway(P < 0.05) is significantly correlated with the incidence of GSD.
3.6 Weighted Metabolome Coexpression Network Analysis of GSD risk
We used R language software to draw WGCNA related module diagrams between 245 selected annotated metabolites with significant statistical differences and clinical data through the “WGCNA” language package. The co-expression status of the genes was analyzed by clustering, and 10 co-expression modules (Fig. 7A,B) were detected based on a predefined cut-off value of 0.75. We analyzed the topological overlap heatmap between each module, with light colored areas representing lower overlap. This analysis result indicates that the metabolites between each module are relatively independent(Fig. 7C). Modular trait plot was obtained through the correlation analysis of co-expression module and clinical phenotype. And we found the grey module (0.74 2e-24) and turquoise module (0.48 3e-09) with high correlation with GSD onset from the figure. In addition, the grey module also showed a significant correlation with the NAFLD (0.34 6e-05) (Fig. 7).
3.7 The module pathway enrichment analysis with high correlation
We selected gray modules (0.74 2e-24) and turquoise modules (0.48 3e-09) with high correlation with GSD pathogenesis, extracted the metabolites for KEGG enrichment analysis, and obtained the enrichment pathway map (Fig. 8, Fig. 9) in the high correlation modules.
In the metabolic pathway enrichment results of the gray module in the WGCNA model (Fig. 7) of GSD, the Pantothenate and CoA biosynthesis(P < 0.05) and Linoleic acid metabolism pathways showed a high degree of enrichment association with the incidence of GSD (Fig. 8A,B); In the Pantothenate and CoA biosynthesis pathway, Pantothenoylcysteine (C04079) in the peripheral blood of GSD patients shows an elevated state(Fig. 8C,E). In the Linoleic acid metabolism pathway, the levels of Linoleic acid (C01595) in the peripheral blood of GSD patients showed an upward trend(Fig. 8D,F).
The enrichment of turquoise module reflects the importance of Citrate cycle (TCA cycle)(P < 0.01) and Glyoxylate and dicarboxylate metabolism pathway(P < 0.01) (Fig. 9A,B). In the Citate cycle (TCA cycle), the levels of Pyruvic acid (C00022), cis Aconitiate (C00417), and 2-Oxoglutarate (C00026) in the peripheral blood of GSD patients are increasing(Fig. 9C,E). In the Glyoxylate and dicarboxylate metabololism pathway, the levels of Pyruvic acid (C00022), Acetate (C00417), and Tartrate semialdehyde (C01146) in the peripheral blood of GSD patients showed an upward trend(Fig. 9D,F).