A prognostic score model based on eight metabolism-associated genes predicts the survival of adult females with lung adenocarcinoma

As a non-small cell lung cancer, lung adenocarcinoma (LUAD) is common in women and non-smokers. This study is aimed to construct a prognostic score (PS) model for adult females with LUAD. Methods The gene expression data of adult females with LUAD from The Cancer Genome Atlas database was obtained as the training set, and GSE50081 and GSE37745 from Gene Expression Omnibus database were downloaded as the validation sets. The differentially expressed genes (DEGs) between LUAD and normal samples were screened by limma package. The metabolism-associated DEGs were selected by Gene Set Enrichment Analysis, and were conducted with enrichment analysis using DAVID tool. After the independent prognosis-associated genes were identied by survival package, the optimal gene combination was screened using penalized package to build the PS model. Besides, the nomogram survival model based on the independent prognostic clinical factors was constructed by rms package. Using HPAanalyze package, the protein expression levels of the optimal genes were mined. There were 2388 DEGs between and normal Totally, 150 metabolism-associated DEGs were screened, for which PPAR signaling pathway was enriched. The optimal gene combination (involving CYP17A1, ASPG, DUOX1, CIDEC, TH, B4GALNT1, APOA2, and GCKR) was selected, based on which the PS model was built. Combined with pathologic stage and PS model status, the nomogram survival model was constructed. Moreover, CIDEC The PS model and the nomogram survival model might be applied for the prognostic prediction of adult females with LUAD.


Introduction
Lung adenocarcinoma (LUAD) belongs to non-small cell lung cancer (NSCLC) and makes up 40% of lung cancers (1). LUAD has no special symptoms in early stage, which only presents as common respiratory symptoms (such as cough, low fever, phlegm blood, chest pain, and chest tightness (2). Majority of LUADs originate from the mucosal epithelium of the small bronchioles, and a few of LUADs derives from the mucous glands of the large bronchioles (3). LUAD is more common in women and non-smokers, which has a lot to do with air pollution, lampblack, and working environment (4). LUAD generally grows slowly, but it sometimes occurs hematogenous metastasis or lymphatic metastasis (5,6). Therefore, exploring the biomolecules signi cantly associated with LUAD in women is important for improving the outcomes of LUAD patients. Some genes implicated in the prognosis of LUAD patients have been reported by several researches in recent years. For example, glucose-6-phosphate dehydrogenase (G6PD) expression has correlations with lymph node metastasis, invasion, higher TNM stage, poorer differentiation, and adverse outcome of LUAD, and thus G6PD may be a poor prognostic marker for the disease (7). Increased collagen type V alpha 1 (COL5A1) is related to LUAD metastasis, and the overexpression of COL5A1 is detected in LUAD patients with recurrence and poor prognosis (8). Kinesin family member 18A (KIF18A) overexpression has independent correlation with adverse recurrence-free survival and overall survival (OS) of LUAD patients, which is a promising prognostic marker and therapeutic target for the tumor (9,10). Up-regulated S100 calcium binding protein A14 (S100A14) promotes the invasion and migration of LUAD cells, and may be applied for the diagnosis, prognostic prediction, and treatment of LUAD (11).
However, the prognostic mechanisms of LUAD still needed to be studied deeply.
Several gene signatures for LUAD have been built to predict the outcomes of LUAD patients. Yun-Yong et al nd a 193-gene gene expression signature that can independently predict the OS of LUAD patients, helping to identify the high risk patients in stage I and the patients suitable for adjuvant chemotherapy (12). Wang et al develop a fourgene signature for the LUAD patients with lymph node metastasis, contributing to recognizing the high risk patients and improving their treatment and clinical outcomes (13). Besides, oncogenic signalling pathways have in uences on metabolic processes, and glucose, lipid, and protein metabolisms play roles in tumor cell growth and survival (14,15). Nevertheless, the risk model based on metabolism-associated genes has not been constructed for LUAD patients.
In this study, the gene expression data of LUAD samples from adult females were extracted from The Cancer Genome Atlas (TCGA) database. The differentially expressed genes (DEGs) were compared with the genes correlated with metabolism of amino acids and derivatives, metabolism of carbohydrates, and metabolism of lipids to identify the metabolism-associated DEGs. From the independent prognosis-associated genes screened from the metabolism-associated DEGs, the optimal genes were further selected for constructing the prognostic score (PS) model. Furthermore, the protein expression levels of the optimal genes in different tumors were analyzed. The PS model constructed in this study might be valuable for predicting the prognosis of LUAD patients.

Materials And Methods
Data downloading and data preprocessing From TCGA (https://cancergenome.nih.gov/) database, the gene expression data of LUAD (platform: Illumina HiSeq 2000 RNA Sequencing; including 585 samples; downloaded in March 10, 2020) was extracted. According to the clinical information of the samples, 300 samples from females older than 18 years (including 266 LUAD samples and 34 normal samples) were included in the training set.
Using "Homo sapiens" and "lung cancer" as the searching words, eligible datasets satisfying the following criteria were searched from Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) database: (1) the samples were provided with histology type information; (2) the total sample size was equal to or larger than 150, and the LUAD samples was no less than 100; (3) LUAD samples had age and gender information; (4) LUAD samples had survival prognosis information, and the valid samples should be no less than 50. After searching, GSE50081 (platform: GPL570 Affymetrix Human Genome U133 Plus 2.0 Array; including 62 LUAD samples from adult females with survival prognosis information; the validation set 1) and GSE37745 (platform: GPL570 Affymetrix Human Genome U133 Plus 2.0 Array; including 60 LUAD samples from adult females with survival prognosis information; the validation set 2) were selected as the eligible datasets.

Differential expression analysis
Based on the limma package (16) (https://bioconductor.org/packages/release/bioc/html/limma.html, version 3.34.7) in R, the DEGs between the LUAD samples and normal samples in the training set were analyzed. The false discovery rate (FDR) < 0.05 and |log 2 fold change (FC)| > 1 were de ned as the thresholds for selecting the DEGs.
Combined with the expression of the DEGs in the training set, bidirectional hierarchical clustering was conducted by the pheatmap package (17) (https://cran.r-project.org/web/packages/pheatmap/index.html, version 1.0.8) in R.

Screening of metabolism-associated DEGs
From Gene Set Enrichment Analysis (GSEA, http://software.broadinstitute.org/gsea/downloads.jsp) database (18), the genes correlated with metabolism of amino acids and derivatives, metabolism of carbohydrates, and metabolism of lipids were downloaded. The downloaded genes were compared with the identi ed DEGs, and the intersected genes were taken as metabolism-associated DEGs. For the metabolism-associated DEGs, Gene Ontology (GO)_biological process (BP) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed using DAVID tool (19) (version 6.8, https://david.ncifcrf.gov/). The threshold of enrichment analysis was the p-value < 0.05.

Construction of PS model
Using the univariate and multivariate Cox regression analyses in the R package survival (20) (version 2.41-1, http://bioconductor.org/packages/survivalr/), the DEGs signi cantly correlated with OS and independent prognosis were selected from the metabolism-associated DEGs. The log-rank p-value < 0.05 was the threshold.
Combined with the independent prognosis-associated genes, the optimal gene combination was screened using the LASSO Cox regression model in the R package penalized (21)(version 0.9.50, https://cran.rproject.org/web/packages/penalized/index.html). The optimized parameter "lambda" in the model was calculated by 1000 cross-validation likelihood (cvl). Based on the prognostic coe cients and expression levels of the optimal genes in the training set, the following PS model was constructed: Prognostic score (PS) = ∑β DEGs × Exp DEGs β DEGs and Exp DEGs separately represent the prognostic coe cients and expression levels of the optimal genes.
The PSs of the samples in the training set were calculated, and then their median was applied for dividing the samples into high (PS ≥ median) and low (PS < median) risk groups. Using the Kaplan-Meier (KM) curve method in survival package (20), the correlation between the actual survival prognosis and the risk grouping was evaluated.
Meanwhile, the expression levels of the optimal genes were extracted from the validation sets, and the PSs of the samples in the validation sets were calculated. Subsequently, the samples in the validation sets were classi ed into high and low risk groups, and the correlation between the actual survival and the grouping status was assessed.

Establishment of nomogram survival model
From the training set, the independent clinical prognostic factors were screened using the univariate and multivariate Cox regression analyses in the survival package (20). The log-rank p-value < 0.05 was de ned for screening the independent clinical prognostic factors. To reveal the correlation between the risk grouping and the independent clinical prognostic factors, the independent clinical prognostic factors were performed with strati ed analysis. The samples were classi ed into different groups based on clinical factors, and the correlation analysis of PS model was carried out in these different groups. Combined with independent prognostic clinical factors and the PS model, nomogram survival model was established by the rms package (22) (version 5.1-2, https://cran.rproject.org/web/packages/rms/index.html) in R.

Searching of the protein expression levels of the optimal genes
The Human Protein Atlas (HPA) database (23) is developed for providing the tissue and cell distribution information of all 24,000 human proteins and examining the distribution and expression of each protein in 48 normal human tissues, 20 tumor tissues, 47 cell lines and 12 blood cells. Using the HPAanalyze package (24) (version 1.4.3, http://www.bioconductor.org/packages/release/bioc/html/HPAanalyze.html) in R, "tissue atlas" and "pathology atlas" were mined for the optimal genes.

Differential expression analysis
There were a total of 2388 DEGs (including 1574 up-regulated genes and 814 down-regulated genes) between the LUAD samples and normal samples in the training set (Fig. 1A). Besides, hierarchical clustering heatmap was draw based on the expression of the DEGs, indicating that the samples were clearly clustered in two directions ( Fig. 1B).

Screening of metabolism-associated DEGs
Based on GSEA database, 372 genes correlated with metabolism of amino acids and derivatives, 293 genes correlated with metabolism of carbohydrates, and 738 genes correlated with metabolism of lipids were downloaded. Through comparing the downloaded genes and the DEGs, 150 intersected genes were obtained as metabolism-associated DEGs (Fig. 2). According to the results of enrichment analysis, the metabolism-associated DEGs were involved in 17 GO_BP functional terms (such as oxidation-reduction process, p-value = 1.270E-19; cellular amino acid biosynthetic process, p-value = 8.430E-08; and hormone biosynthetic process, p-value = 2.710E-06) and 20 KEGG pathways (such as Metabolic pathways, p-value = 3.630E-32; Biosynthesis of amino acids, pvalue = 1.050E-08; PPAR signaling pathway, p-value = 6.420E-08) ( Table 1). Table 1 The biological processes and KEGG pathways enriched for the metabolism-associated DEGs.
Moreover, the samples were divided into high and low risk groups based on PS model, and KM curve analysis of pathologic stage was conducted in the low risk (p = 3.076e-06, HR: 2.401 (1.615-3.567)) ( Fig. 5A)   Note: TCGA, The Cancer Genome Atlas; HR, hazard ratio; CI, con dence interval; PS, prognostic score.
In the training set, the nomogram survival model was established to analyze the correlations between the independent clinical prognostic factors and survival prognosis (Fig. 6A). The "Total points" axis of the nomogram was used to predict the survival of the samples by integrating pathologic stage and PS model status. Through comparing the survival probabilities predicted by the nomogram with the actual survival probabilities, the Cindexes for three-year survival probability and ve-year survival probability separately were found to be 0.731 and 0.702 (Fig. 6B).

Searching of the protein expression levels of the optimal genes
Combined with HPA database, the protein expression levels of the optimal genes in different tissues, different cell components, and different kinds of tumors were analyzed. Since the subjects of this study were adult females, several tumors with female characteristics (including lung cancer, breast cancer, cervical cancer and ovarian cancer) were selected. According to the results, CIDEC was found to be a characteristic gene in lung tissue. In addition, the expression of CIDEC was also characteristic in other cancers, especially in breast cancer (Fig. 7).

Discussion
In this study, 2388 DEGs (including 1574 up-regulated genes and 814 down-regulated genes) between the LUAD samples and normal samples in the training set were identi ed. For the 150 metabolism-associated DEGs, 17 GO_BP functional terms and 20 KEGG pathways (such as PPAR signaling pathway) were enriched. Based on the optimal gene combination involving eight genes (CYP17A1, ASPG, DUOX1, CIDEC, TH, B4GALNT1, APOA2, and GCKR), the PS model was constructed. Combined with pathologic stage and PS model status, the nomogram survival model was established.
Through mediating the p(38)/β-catenin/PPARγ pathway, transforming growth factor β (TGFβ) enhances the invasion, migration, and epithelial mesenchymal transition of NSCL CH460 cells (25,26). Therefore, the PPAR signaling pathway might function in the pathogenesis of LUAD. The silver nanobiocomposite of ASPG has lower cytotoxicity against lung cancer cells, and thus it can be applied as an promising anticancer agent for treating the tumor (27). DUOX1 affect epithelial homeostasis and innate airway host defense, and its silencing accelerates epithelial-to-mesenchymal transition in lung epithelial cancer and may be related to invasive and metastatic lung cancer (28, 29). DUOX1 expression is inhibited by epigenetic mechanisms in lung cancer, which may be implicated in the diagnosis and therapy of the tumor (30). Thus, ASPG and DUOX1 might be implicated in the prognosis of LUAD patients.
CIDEC, a member of the cell death-inducing DNA fragmentation factor-like effector family, could promote lipid droplet formation in adipocytes and mediate adipocyte apoptosis (31). It was reported that the CIDE family regulated lipid metabolism and played an important role in the development of metabolic disorders such as obesity (32), insulin resistance (33) and hepatic steatosis(34)Ming Yu et al.According to HPA database, CIDEC was a characteristic gene in lung cancer and other cancers. A four-gene signature (involving CIDEC; dickkopf WNT signaling pathway inhibitor 1, DKK1; ubiquitin speci c peptidase 4, USP4; and ZFP3 zinc nger protein, ZFP3) is an independent prognostic marker, which can be used to predict the outcomes of LUAD patients (35). Through CIDEC/extracellular regulated MAP kinase (ERK)/p38 pathway, ADP ribosylation factor like GTPase 14 (ARL14) is involved in the mechanisms of LUAD and can be novel prognostic marker and therapeutic target of the disease (36). B3GALNT1 and tubulointerstitial nephritis antigen-like 1 (TINAGL1) are found to be key genes mediating the metastasis of NSCLC, which are candidate target genes for the treatment of the disease (37,38). A model involving six biomarkers (including APOA2) and age has high sensitivity and speci city, which is effective for the diagnosis of the patients with lung cancer (39). These suggested that CIDEC, B4GALNT1, and APOA2 might also be correlated with the prognosis of LUAD patients.
Although the roles of CYP17A1, TH, and GCKR in LUAD have not been studied, they have in uences on other tumors. For example, CYP17A1 and CYP19A1 are important for estrogen biosynthesis, and their expression may contribute to improving the accuracy of diagnosis and selecting the proper treatment for invasive ductal breast cancer (40).Recently,one research reported that CYP17A1 signi cantly distinguished between non-smoking and smoking-associated adenocarcinomas. (41)The CT and CC genotypes of TH have higher frequencies in gastric cancer patients compared with the controls, therefore, they are correlated with a signi cantly higher risk of the tumor (42). The rs780093 and rs780094 polymorphisms in GCKR have signi cant correlations with OS and progression-free survival, and thus GCKR polymorphisms may be independent prognostic marker in metastatic gastric cancer patients receiving EOF chemotherapy (43). Therefore, CYP17A1, TH, and GCKR might play roles in the prognosis of LUAD patients.
This study constructed a PS model based on metabolism-associated genes and a nomogram survival model combined with multiple bioinformatics methods. However, no experiments were conducted to validate these ndings. Therefore, these results still needed to be con rmed by experimental studies.
In conclusion, 2388 DEGs between the LUAD and normal samples were obtained. Besides, the PPAR signaling pathway might affect the pathogenesis of LUAD. Furthermore, the PS model (involving CYP17A1, ASPG, DUOX1, CIDEC, TH, B4GALNT1, APOA2, and GCKR) and the nomogram survival model might be effective for the prognostic prediction of LUAD patients.   The Venn diagram showing the intersected genes of the metabolic genes and the differentially expressed genes (DEGs).   black and red represent 3-year prediction and 5-year prediction, respectively).