Identification of Coagulation-Related Genes in LUAD Patients
To begin, Figure 1 presents the flowchart outlining the entire study. We obtained gene expression data, clinical information data, and mutation data for lung adenocarcinoma patients and normal tissues from the TCGA database. By intersecting the coagulation-related genes (138) with the lung adenocarcinoma genes in the TCGA database. Furthermore, based on the gene expression levels, we identified 51 DEGs(FDR<0.05, |log2FC|>1.5), which consisted of 37 high-expression genes and 14 low-expression genes. The expression patterns of these differential genes in normal tissue and LUAD tissue are presented in the heatmap and volcano plot (Fig 2a and 2b).Finally, we found that DEGs in LUAD were all CRGs(Fig 2c)
Enrichment Analysis of Coagulation-Related Genes
To elucidate the underlying mechanisms associated with the identified 51 genes, we conducted Gene Ontology (GO) analysis. The results, depicted in a bar plot, demonstrated that coagulation-related genes are primarily enriched in processes such as blood coagulation, coagulation, hemostasis, wound healing, and regulation of body fluid levels. Furthermore, in the Cellular Component (CC) category, these genes showed enrichment in collagen-containing extracellular matrix, secretory granule lumen, cytoplasmic vesicle lumen, and vesicle lumen. In terms of Molecular Function (MF), they were mainly associated with serine-type endopeptidase activity, serine-type peptidase activity, serine hydrolase activity, and endopeptidase activity (Fig 2d). Subsequently, we performed KEGG analysis, which revealed that these coagulation-related genes were predominantly enriched in complement and coagulation cascades as well as in Coronavirus disease - COVID-19 pathways (Fig 2e).Collectively, these findings support the association between the coagulation process and the development of lung adenocarcinoma. Moreover, our risk model, based on coagulation genes, demonstrates a relationship with coagulation metabolism.
Construction of a Coagulation Metabolism-Related 4-Gene Risk Model
Through expression difference analysis, we identified 51 genes showing differential expression. Univariate Cox analysis (p<0.05) was performed on these 51 genes, revealing that 13 genes were significantly associated with overall survival (OS) prognosis. Among them, 12 genes (MMP1, MMP10, CTSV, F2, FGA, KLK8, GDA, HRG, HNF4A, F2RL2, PROZ, APOC3) were identified as high risk genes, while one gene (DPP4) was classified as a low risk gene (Fig 3a). Subsequently, using the glmnet package in R, we conducted LASSO regression analysis on these 14 genes in the training set to identify the most informative genes for prognostic purposes. As a result, a 4-gene model consisting of MMP1, MMP10, CTSV, and F2 was obtained (Fig 3b, c). The risk score formula for this model is as follows: risk score = (0.0837569167784308 * MMP1) + (0.0878985239886888 * MMP10) + (0.159310542027166 * CTSV) + (0.258137401792528 * F2).
We performed a random division of patients with lung adenocarcinoma from the TCGA database into three groups: the total set, train set, and test set. Utilizing the risk score calculation formula, we categorized them into high risk and low risk groups. In the total set, there were 253 patients in the high risk group and 254 patients in the low risk group, whereas the train set consisted of 127 patients in the high risk group and 127 patients in the low risk group. Similarly, the test set comprised 126 patients in the medium and high risk group and 127 patients in the low risk group. We compared different clinical information across the total set, train set, and test set (Table 2).
Table1: Primer sequences for qRT–PCR.
Genes
|
Forward (5′-3′)
|
Reverse (5′-3′)
|
ACTIN
|
CCTTCCTGGGCATGGAGTC
|
TGATCTTCATTGTGCTGGGTG
|
MMP10
|
GAGTTTGACCCCAATGCCAG
|
TCTTCCCCCTATCTCGCCTA
|
CTSV
|
TCAGGCAGATGATGGGTTGC
|
GCCCAACAAGAACCACACTG
|
F2
|
CACGGCTACGGATGTGTTCT
|
AGTTCGTACCCAGACCCTCAG
|
Table2: Clinical characteristics of 3 sets of data randomly generated by the TCGA database
Clinical Features
|
Type
|
Total set
n= 507
|
Train set
n= 254
|
Test set
n= 253
|
P-value
|
Age
|
<= 65
|
239 (47.14)
|
120 (47.24)
|
119 (47.04)
|
1
|
>65
|
258 (50.89)
|
130 (51.18)
|
128 (50.59)
|
Gender
|
FEMALE
|
272 (53.65)
|
135 (53.15)
|
137 (54.15)
|
0.8912
|
MALE
|
235 (46.35)
|
119 (46.85)
|
116 (45.85)
|
Stage
|
Stage I
|
272 (53.65)
|
127 (50)
|
145 (57.31)
|
0.3882
|
Stage II
|
120 (23.67)
|
65 (25.59)
|
55 (21.74)
|
Stage III
|
81 (15.98)
|
45 (17.72)
|
36 (14.23)
|
Stage IV
|
26 (5.13)
|
13 (5.12)
|
13 (5.14)
|
T
|
T1
|
169 (33.33)
|
80 (31.5)
|
89 (35.18)
|
0.6613
|
T2
|
271 (53.45)
|
141 (55.51)
|
130 (51.38)
|
T3
|
45 (8.88)
|
21 (8.27)
|
24 (9.49)
|
T4
|
19 (3.75)
|
11 (4.33)
|
8 (3.16)
|
N
|
N0
|
327 (64.5)
|
155 (61.02)
|
172 (67.98)
|
0.2544
|
N1
|
95 (18.74)
|
52 (20.47)
|
43 (17)
|
N2
|
71 (14)
|
42 (16.54)
|
29 (11.46)
|
N3
|
2 (0.39)
|
1 (0.39)
|
1 (0.4)
|
M
|
M0
|
338 (66.67)
|
170 (66.93)
|
168 (66.4)
|
0.9886
|
M1
|
25 (4.93)
|
12 (4.72)
|
13 (5.14)
|
By examining the K-M curve, we found that in the total set, train set, and test set, the OS of high risk patients was significantly lower than that of the low risk group (Fig 4a, c, e). Additionally, when assessing the Progression-Free Survival (PFS), the high risk group exhibited significantly lower survival rates compared to the low risk group (Fig 4b, d, f). The risk heatmap displayed a gradual increase in the levels of MMP1, MMP10, CTSV, and F2 across the total set, train set, and test set, which are considered high risk genes (Fig 5a, d, g). Furthermore, the relationship between the risk score and survival time between the high risk and low risk groups was plotted, clearly demonstrating that as the risk score increases, the patient's survival time significantly decreases (Fig 5b, e, h). A higher risk score is associated with a significant decrease in the patient's survival time (Fig 5c,f,i).
Independent Analysis of Prognostic Factors by Predictive Models
The study employed univariate Cox regression analysis (Fig 6a) and multivariate Cox regression analysis (Fig 6b) to determine whether risk factors could serve as independent prognostic factors differentiating other clinical traits. The results of univariate Cox regression analysis in the population sample revealed several significant factors: Stage (P < 0.001, Hazard ratio = 1.595), T (P < 0.001, Hazard ratio = 1.618), M (P = 0.034, Hazard ratio = 1.858), N (P < 0.001, Hazard ratio = 1.732), and risk score (P < 0.001, Hazard ratio = 1.517), all of which were considered high risk factors. Moreover, based on the results of multivariate Cox regression analysis in the population sample, T (P = 0.010, Hazard ratio = 1.362) and risk score (P < 0.001, Hazard ratio = 1.415) were identified as independent risk factors and both were classified as high risk factors. To assess the accuracy of the risk model, the ROC curve was utilized. The area under the curve (AUC) was calculated as 0.687 for 1 year, 0.634 for 3 years, and 0.622 for 5 years (Fig 6c). Furthermore, the one-year survival prediction demonstrated that the AUC under the risk score curve (0.687) was significantly higher than that of Age (0.490), Gender (0.550), T (0.652), M (0.501), and N (0.666). Therefore, the risk model exhibited significantly higher accuracy (Fig 6d).
The Principal Component Analysis (PCA) and the Construction of Predictive Nomograms
PCA was utilized to examine the distribution between high risk and low risk groups in lung adenocarcinoma patients. Significant differences were observed among all genes (Fig 6e), coagulation-related genes (Fig 6f), and model-building genes (Fig 6g), with the most pronounced differences observed in the model-building genes,which genes exhibited the highest discriminatory power.Subsequently, a nomogram model was constructed incorporating age, sex, TNM stage, T-stage, N stage, M stage, and risk score of lung adenocarcinoma patients (Fig 6h). This model aimed to predict the 1-year, 3-year, and 5-year survival rates (Fig 6i). This demonstrates the accuracy of the signature model and its ability to compare patient survival via various clinical characteristics.
Relationship between Different Clinical Traits and Risk Scores
To gain deeper insights into the relationship between the different clinical traits and risk scores, we constructed heat maps to illustrate the association between the expression of model genes and various clinical traits and pathological features. (Fig 7a). Furthermore, we compared the risk scores across the different clinical traits and confirmed their higher accuracy in distinguishing clinical stages (stage I-II or stage III-IV) and lymph node involvement (N0 or N1-N3) (Fig 7b-m).
Effects of different clinical features on Kaplan–Meier survival curves between high risk and low risk groups.risk groups: age (b, c), sex (d, e), stage (f, g), T (h, i), N (j, k), M (l, m).
GO, KEGG, GSVA Enrichment Analysis
Based on the aforementioned results, it can be inferred that the gene model constructed using the four genes exhibits a reliable predictive value for distinguishing between high risk and low risk groups. GO enrichment analysis revealed that the coagulation-related genes were primarily enriched in sulfur compound binding, glycosaminoglycan binding, serine hydrolase activity, serine-type peptidase activity, and serine-type endopeptidase activity. In terms of CC, these genes were mainly enriched in the collagen-containing extracellular matrix, apical part of the cell, vesicle lumen, and cytoplasmic vesicle lumen. Furthermore, they were enriched in the secretory granule lumen. Additionally,MF analysis showed enrichment primarily in epidermis development, negative regulation of proteolysis, skin development, and antimicrobial humoral response(Fig 8a, b). Subsequently, KEGG analysis was conducted, indicating that these coagulation-related genes were predominantly enriched in Neuroactive ligand-receptor interaction (Fig 8c, d). To further validate the distinction between high risk and low risk groups, GSVA enrichment analysis was performed. The enrichment pathways were compared between the two groups, and five pathways were selected based on p-value ranking. The results demonstrated that the cell cycle, neuroactive ligand-receptor interaction, pentose and glucuronate interconversions, porphyrin and chlorophyll metabolism, and starch and sucrose metabolism were primarily enriched in the high risk group. Conversely, allograft rejection, asthma, autoimmune thyroid disease, ribosome, and systemic lupus erythematosus were primarily enriched in the low risk group (Fig 8e, f).
Immune-related Differences
In terms of TME scores, including StromalScore, ImmuneScore, and ESTIMATEScore, no significant differences were observed between the high risk and low risk groups (Fig 9a). However, distinct variations were observed in the composition of immune cell infiltrates between the two groups, including B cells naive, T cells CD4 memory resting, NK cells resting, Macrophages M0, Dendritic cells resting, Mast cells resting, and Mast cells activated (Fig 9b, c). Furthermore, an analysis of immune function between the high risk and low risk groups revealed notable disparities in B cells, CCR, HLA, iDCs, Macrophages, Mast cells, MHC class I, NK cells, Parainflammation, T cell co-stimulation, Type I IFN Response, and Type II IFN Response (Fig 9d).
TMB and TIDE
Initially, we obtained somatic mutation data from the TCGA database and examined the disparities in somatic mutations between the low risk group (n=245) and the high risk group (n=251). Notably, the high risk group exhibited a higher prevalence of somatic mutations compared to the low risk group, as evidenced by the following percentages: TP53 (low risk 40%, high risk 51%), TTN (low risk 40%, high risk 47%), MUC16 (low risk 40%, high risk 39%), CSMD3 (low risk 36%, high risk 40%), RYR2 (low risk 34%, high risk 37%), LRP1B (low risk 32%, high risk 32%), ZFHX4 (low risk 27%, high risk 35%), USH2A (low risk 28%, high risk 33%), KRAS (low risk 25%, high risk 31%), XIRP2 (low risk 20%, high risk 26%), FLG (low risk 21%, high risk 25%), SPTA1 (low risk 18%, high risk 26%), NAV3 (low risk 19%, high risk 21%), ZNF536 (low risk 19%, high risk 21%), COL11A1 (low risk 18%, high risk 20%) (Fig 10a, b).Subsequently, we performed tumor mutation burden (TMB) calculations, which demonstrated significant distinctions between the high risk and low risk groups (Fig 10c). Furthermore, we categorized patients into the high TMB group versus the low TMB group, with the former exhibiting a higher OS rate compared to the latter (Fig 10d). Interestingly, the high risk low-TMB group displayed the poorest prognosis (Fig 10e).To assess TIDE, we conducted an analysis using TIDE scores, where higher scores indicate a stronger immune evasion mechanism. In our study, the TIDE score was significantly higher in the high risk group compared to the low risk group (Fig 10f). However, further investigation is required to fully understand the differences in immunotherapy between the high risk and low risk groups. Additionally, by comparing drug susceptibility, we identified clear discrepancies between the high risk and low risk groups.
Drug Susceptibility Analysis
We conducted a comparison of drug sensitivity between the high risk and low risk groups and observed distinct variations in the IC50 values between the two groups. Specifically, the high risk group exhibited greater drug sensitivity and showed enhanced efficacy in BI-2536 (p=0.00011) (Fig 11a), Dasatinib (p=3.3e-06) (Fig 11b), PD0325901 (p=7.3e-08) (Fig 11c), SCH772984 (p=3.4e-16) (Fig 11d), and Trametinib (p=5.2e-07) (Fig 11e). Conversely, in the low risk group, BMS-754807 (p=9.9e-08) (Fig 11f), Doramapimod (p=4e-10) (Fig 11g), JAK1_8709 (p=0.00059) (Fig 11h), Ribociclib (p=2.5e-05) (Fig 11i), and SB216763 (p=1.4e-05) (Fig 11j) exhibited higher drug sensitivity and demonstrated greater effectiveness against these drugs.
Verification of the expression level of CRGs
By evaluating the expression of MMP1, MMP10, CTSV, and F2 in normal tissues and lung adenocarcinoma (LUAD), we confirmed that these genes are indeed high risk genes (Fig 12a-h).To further validate our findings, we compared the protein expression encoded by these four genes between LUAD and normal tissues using the Human Protein Atlas (HPA) database. Consistent with the mRNA expression levels, the protein expression levels of MMP10, CTSV, and F2 were significantly higher in lung adenocarcinoma than in normal tissues (Fig 12i-k). However, the protein expression of MMP1 was not observed. To further validate the accuracy of the CRGs diagnostic model, we conducted additional RT-qPCR experiments to verify the expression of mRNA in both normal lung tissue and lung adenocarcinoma tissue. The RT-qPCR results revealed significantly higher mRNA expression levels of MMP10,CTSV, and F2 in A549 and H1299 cells (lung adenocarcinoma cell line) compared to BEAS-2B (the human normal bronchial epithelial tissue cell line) (Fig 12l-n). Furthermore, these results are fully consistent with our bioinformatics analysis based on the TCGA database.