Identication of a lncRNA-mRNA Regulatory Module for Prognosis, Classication and Potential Treatment of Lung Adenocarcinoma

Background: Lung cancer remains the leading cause of cancer death worldwide, with lung adenocarcinoma (LUAD) being the most prevalent subtype of lung cancer. This study aimed to identify a lncRNA-mRNA regulatory module related to the prognosis, classication, and potential treatment of LUAD. Methods: Publicly available gene expression data of three cohorts were downloaded from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO) databases. Differential expression analysis between LUAD and normal samples, as well as the survival analysis, was performed. Protein-protein interaction (PPI) network and co-expression analyses were conducted to further identify key genes. A least absolute shrinkage and selection operator (LASSO) Cox regression model was developed to predict overall survival. Five machine learning models, including logistic regression, K-nearest neighbor (KNN), support vector machine (SVM), random forest, and extremely gradient boosting (XGBOOST), were trained to distinguish early-stage or epidermal growth factor receptor (EGFR)-mutation LUAD from others. Furthermore, connectivity map (CMap) and molecular docking analyses were performed to identify compounds with the ability to reverse the expression proles of the key genes. Results: A cohort comprised of 535 LUAD and 59 normal samples in TCGA was used as the training set, while the GSE31210 and GSE30219 datasets were used as validation sets. 189 mRNAs and 11 lncRNAs were differentially expressed and associated with the overall survival. 43 hub mRNAs were further identied from the PPI network, and 3 lncRNAs were signicantly correlated to the expression of hub mRNAs. Six genes with nonzero coecients were selected by using the LASSO COX regression analysis, and the corresponding risk score was derived. The time-dependent ROC and Kaplan Meier analysis demonstrated that the risk score accurately discriminates the patients with a high or low risk. KNN and XGBOOST were the best models to recognize early-stage and EGFR-mutation LUAD, respectively. Purvalanol-a and Etoposide obtained a score of -99.98 in CMap and were successfully docked with three key genes.

need to identify novel risk genes with prognostic value and explore new potential therapeutic targets for LUAD.
In the last decade, numerous studies have attempted to predict and stratify prognosis of LUAD. An increasing number of genes associated with the survival of LUAD patients have been identi ed and validated, including immune-related genes (4,5), glycolysis-related genes (6), hypoxia-related genes (7) and ferroptosis-related genes (8). Besides, long non-coding RNAs (lncRNAs) were also proved to play a critical role in the pathogenesis of LUAD and other various cancers (9)(10)(11)(12). However, the research on integrated analysis of messenger RNA (mRNA) and lncRNA in LUAD still remains limited, although they are closely related in involved biological processes. Several studies attempted to develop a combined signature but some of them failed to validate their results in independent datasets (13,14), while some only used machine learning methods to predict survival, limiting thus their clinical applicability (15,16).
Moreover, most of the studies aiming to identify prognosis-related gene signatures failed to explore the potential treatment of LUAD, partly limiting the clinical impact of their research ndings. Therefore, deeper research is essential to promote the understanding of lncRNA-mRNA interactions and explore potential treatments for LUAD.
In this study, we aimed to use bioinformatic analysis and machine learning to identify a lncRNA-mRNA regulatory module associated with prognosis, classi cation and potential treatment of LUAD. As shown in Fig. 1, publicly available gene expression data of three cohorts were downloaded from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO) databases. Differential expression analysis and survival analysis were performed to identify differentially expressed genes (DEGs) with prognostic value. Then, mRNAs and lncRNAs were further screened by using protein-protein interaction (PPI) network and co-expression analyses. A least absolute shrinkage and selection operator (LASSO) Cox regression model was developed to predict survival. Five machine learning models, including logistic regression, K-nearest neighbor (KNN), support vector machine (SVM), random forest and extremely gradient boosting (XGBOOST), were trained to distinguish early-stage or epidermal growth factor receptor (EGFR)-mutation LUAD from others. Furthermore, connectivity map (CMap) and molecular docking analyses were performed to identify several drugs that can reverse the expression pro les of the key genes demonstrating thus that they can serve as a potential treatment for LUAD.

Datasets
Raw read count data of gene expressions and clinical information were downloaded from TCGA, including 535 LUAD and 59 normal samples (17). Then, the read count data were normalized by using the voom function in the limma R package (version 3.42.2) (18). Besides, two independent cohorts from GEO were downloaded and analyzed for validation (GSE31210, GSE30219). mRNAs and lncRNAs were identi ed according to Genome Reference Consortium Human Build 38 patch release 13 (GRCh38.p13), and only common genes in the three cohorts were analyzed.

Differential expression and survival analyses
The differentially expressed mRNAs (DEmRNAs) and lncRNAs (DElncRNAs) between LUAD and normal groups in the TCGA cohort were identi ed using the limma R package (18). Those with | log2 fold change | > 2 and adjusted p-value < 0.01 were prioritized as DEGs. Heatmaps and volcano maps were drawn to visualize the DEGs.
A univariate COX regression model was developed using the TCGA LUAD cohort to assess the prognostic value of each DEmRNA and DElncRNA using the survival R package (version 3.2-7). The genes signi cantly associated with the prognosis of LUAD (p<0.01), were selected for the subsequent analysis.

Additional ltering of mRNAs and lncRNAs
The DEmRNAs with prognostic value were uploaded to the STRING database (19) and a PPI network was constructed. We visualized the network using the Cytoscape software (20). In the network, each node represents a protein or a gene and an edge represents the interaction between two nodes. The degree of a node is the number of its connections with other nodes. Based on the PPI network, the MCODE algorithm was used to cluster the network into important modules and then the cytoHubba tool was used to calculate the degree value of each node. In line with the previous research, only the nodes with a degree value > 20 were considered as hub mRNAs (21).
The co-expression analysis was conducted by assessing the Pearson correlation between mRNAs and DElncRNAs with prognostic value for LUAD. A lncRNA-mRNA co-expression network was derived according to the criteria of | Correlation Coe cient | > 0.4 and P < 0.01 using the limma R package (18).
The DElncRNAs that correlated to the hub mRNAs were screened out.
The hub mRNAs selected from the PPI network and the lncRNAs co-expressed with the hub mRNAs were then combined. It is noteworthy that these genes had been screened by univariate COX regression, and therefore, were associated with prognosis.

COX regression
A LASSO Cox regression model was developed based on the combined genes using the glmnet R package (version 4.0-2). The genes with a regression coe cient of zero were removed, and then a risk score was generated from the regression model. The genes included in the risk score were considered as the nal lncRNA-mRNA regulatory module. The formula of the risk score is: RiskScore = coef 1 ×gene 1 + coef 2 ×gene 2 + coef 3 ×gene 3 + ⋯ where the gene represents the normalized expression of a given mRNA or lncRNA in the nal model and the coef represents its coe cient in the LASSO COX regression model. LUAD patients were split into high-and low-risk groups according to whether their risk scores were greater than the median risk score in TCGA. The Kaplan Meier analysis with a log-rank test was applied to evaluate the prognostic difference between the two groups. The time-dependent receiver operator characteristic curve (ROC) was used to assess the accuracy of the suggested risk score. Univariate and multivariate COX regression analyses were successively performed among the risk score and available clinical variables to determine whether the risk score is an independent predictor of survival. In addition, the risk score was validated in two GEO cohorts.
Further, gene set enrichment analysis using the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) was conducted to assess the DEGs between high-and low-risk groups, using the clusterPro ler R package (version 3.14.3) (22). The in ltrating score of 16 immune cells and the activity of 13 immune-related pathways were calculated with the single-sample gene set enrichment analysis (ssGSEA) (23) using the GSVA R package (version 1.34.0) (24). The annotated gene set le was obtained from previous research (25).

Machine learning
We designed two classi cation tasks to explore the ability of the nal regulatory module to discriminate different types of LUAD by using machine learning models. Five representative models were assessed, including logistic regression (26), KNN (27), SVM (28), random forest (29) and XGBOOST (30,31). In the rst task, these ve machine learning models were applied in the TCGA cohort to build classi ers for the discrimination of early-stage (pathological Stage I-II) LUAD from others (pathological Stage III-IV). In the second task, these models were trained to recognize LUAD with EGFR mutation, based on the GSE31210 cohort. Only the genes of the nal regulatory module were used as features in both tasks. Ten-fold cross validation was performed taking into consideration the limited sample size (31), randomly splitting the dataset into 10 subsets, and using in each iteration, 9 of them to train the models and last 1 for validation. After 10 iterations, each subset had been validated and the validation results were combined to robustly assess the model performance. Then, the SHapley Additive exPlanation (SHAP) values were used to illustrate the classi cation results of the XGBOOST model, according to a game theory approach (32).
A web-based tool to use our score and models A web-based tool was developed using HTML and javascript language, and the back-end of this tool was managed using Django python package. When a user enters the normalized expression data of genes, the risk score and classi cation results of machine learning models are calculated and provided. This tool will allow clinicians or cancer patients to learn more about the condition and bene t from the results of our study.

Connectivity map-based drug screening and molecular docking
The L1000-based next-generation CMap (33) (Touchstone v1) was used to assess the functional connections between drugs and genes. The L1000-based CMap expanded the original CMap by using nearly 28,000 perturbagens including over 19,000 small molecules and about 7000 genetic modulations (33). The hub mRNAs were split into upregulated and downregulated gene groups, and were uploaded to the CMap web server. The LUAD cell line A549 and all compounds in the database were chosen for this analysis, and a list of compounds with connectivity scores ranging from -100 to 100 was obtained. In brief, a positive score indicates the similarity between a compound's signature and the signature of the query, while a negative score indicates that the two signatures are opposing (21,34). The magnitude of the score corresponds to the magnitude of similarity or dissimilarity. Therefore, compounds with a connectivity score of -90 or lower were considered as potential drugs for LUAD.
Molecular docking was used to verify the targeting association between the compounds and the revealed proteins. The 3-dimension models of targeted proteins and compounds were downloaded from the RCSB PDB and the ZINC database, respectively. Pre-processing such as removing solvent, adding hydrogens and choosing torsion, was conducted using the AutoDock Tool (version 1.5.6). Molecular docking was simulated by the AutoDock Vina program (version 1.1.2) (35). The interaction energy between the ligand and the receptor was calculated for the entire binding site and expressed as a nity. In line with previous research, a docking result with an a nity score < -5 kcal/mol is considered a strong blinding between the compound and protein (36). The docking results were visualized by the PyMOL program (version 2.4).

Results
Differentially expressed mRNAs and lncRNAs associated with prognosis Expression data and clinical information of three cohorts were downloaded from TCGA and GEO. The baseline characteristics were summarized in Table 1. The rst cohort consisted of 535 LUAD and 59 normal samples from TCGA, and was used for differential expression analysis. 17122 mRNAs and 1826 lncRNAs were screened, and 1456 mRNAs and 92 lncRNAs were differentially expressed between LUAD and normal groups. The expression pro les of the DEmRNAs and DElncRNAs were visualized in Fig Table S1 (Supplementary Materials).

Additional ltering of mRNAs and lncRNAs
A PPI network was constructed by uploading the 189 prognosis-associated DEmRNAs on the STRING database. Fig. 3A depicts the derived PPI network, which was visualized by the Cytoscape program and contained 77 nodes and 919 edges. Then, 2 modules were identi ed using the MCODE algorithm (shown in Fig. 3B-C). 43 mRNAs were further characterized as hub genes by the cytoHubba tool. Co-expression analysis of mRNAs and lncRNAs was also performed, as shown in Fig. 3D. 3 lncRNAs were signi cantly correlated to the expression of hub mRNAs. As a result, 46 genes (43 hub mRNAs and 3 lncRNAs) were used for the following risk modeling analysis.
Risk scores derived from COX regression 6 genes with nonzero coe cients were identi ed in the LASSO COX regression model. The risk score was calculated for each sample using the following formula: Samples were classi ed into high-and low-risk groups by comparing their risk scores to the median score in the TCGA cohort. As shown in Fig . It is noteworthy that a different time interval was assessed for GSE31210 because LUAD was relatively mild (mostly in stage I) and survival time was generally longer in this cohort. Additionally, the Kaplan Meier analysis with log-rank tests revealed a signi cant difference in the overall survival (p<0.01) between the two groups in the TCGA cohort as well as in the two validation cohorts (shown in Fig. 4D-F). Moreover, as shown in Fig. 4G-I, the proposed risk score was proven to be an independent prognostic factor for LUAD (HR>1, p<0.05) by successively using univariate and multivariate COX regression.
GO and KEGG pathway enrichment analyses were performed on the DEGs between the high-and low-risk groups to elucidate the biological functions and pathways associated with the risk score. As shown in Fig. 5A-B, the DEGs between different risk groups are mainly enriched in the functions associated with mitosis and cell cycle, such as DNA replication, chromosome segregation and spindle. KEGG pathway enrichment analysis (Fig. 5D-E) also revealed that DNA replication and cell cycle pathways were enriched. Interestingly, the DEGs are signi cantly enriched in pathways of neurodegeneration and related diseases, such as Huntington disease and Alzheimer disease. Fig. 5C&F shows the correlation between the risk score and immune status. As seen, the scores of aDCs, pDCs, APC co-stimulation, and MHC class I, are signi cantly different between the low-risk and high-risk groups. Most of the immune functions are signi cantly higher expressed in the high-risk group with the exception of the Type II IFN Response.

Machine learning modeling
The distributions of gene expression and risk scores in the different stages of LUAD are presented in Fig.  6A. Gene expression and risk scores were all signi cantly higher in pathological Stage III-IV LUAD, suggesting poorer survival for these LUAD stages. Five machine learning models, including logistic regression, KNN, SVM, random forest and XGBOOST, were applied to classify early-stage LUAD (pathological Stage I-II) from others (pathological Stage III-IV) based on the 6 risk genes. The AUROCs were assessed and are shown in Fig. 6B. The KNN model presented the highest AUROC (0.790) outperforming the other machine learning models. Fig. 6C shows the SHAP illustration of the XGBOOST model. Blue through red color indicates low to high expression level. SHAP values depict the effect of gene expression on the classi cation model. A positive SHAP value means that gene expression leads to an increase in the possibility of being Stage I-II, while a negative value represents that gene expression reduces this possibility.
Interestingly, the 6 genes' expression and risk scores were lower in EGFR mutation than other LUAD. Five machine learning models were trained to recognize EGFR mutation, and the XGBOOST model was the one that presented the best performance (AUROC: 0.778). The AUROCs and SHAP values are shown in Fig. 6F&G.
The web-based tool A web-based tool was developed (http://www.aimedicallab.com/tool/aiml-luad.html), and its interface is shown in Fig. 6D&H. Users only need to enter the normalized expression data of the 6 genes for each patient and click the 'predict' button. Then, the risk score and classi cation results of the machine learning models will be calculated and shown. Note that the early-stage classi cation was performed by KNN and the EGFR mutation was recognized by XGBOOST, since these were the best performing models in the two classi cation problems, respectively. Fig. 6D shows the input and results of a high-risk patient from the TCGA cohort; he was 71 years old with stage-III LUAD and survived only about 2 months in the study. In contrast, Fig. 6E is an example of low-risk patients. Her condition was more moderate (Stage I) and she survived more than 6 years.

CMap analysis and molecular docking
The forty-three hub mRNAs were all upregulated in LUAD samples and were uploaded to the CMap web server. A connectivity list of thousand compounds was obtained, with 20 of them at the top or bottom presented in Table S2  Molecular docking was conducted using the AutoDock Vina. The a nity scores of TOP2A-Purvalanola, Ki67-Etoposide and CCNB1-Etoposide were -6.7, -6.4 and -6.8 kcal/mol, respectively. These scores were all less than -5 kcal/mol suggesting, therefore, a strong blinding. The docking results are visualized using the PyMOL program and they are presented in Fig. 7.

Discussion
A lncRNA-mRNA regulatory module associated with prognosis, staging, EGFR mutation, and a potential therapeutic target for LUAD, was identi ed in the present study. The risk score derived from this module presented high AUROCs across the three independent cohorts mined from TCGA and GEO. LUAD patients were classi ed into low-and high-risk groups which presented a signi cant difference in the overall survival. Early-stage and EGFR mutation were accurately recognized by the 5 machine learning models.
CMap and molecular docking analysis provided a prioritized list of drugs that may be useful for treating LUAD.
The identi ed genes, including 4 mRNAs (DLGAP5, MKI67, TOP2A, and CCNB1) and 2 lncRNAs (PICSAR and SCAT1), are signi cantly related to the prognosis of LUAD. Previous studies have demonstrated that these genes are involved in the pathogenesis of LUAD. For instance, the prognostic value of DLGAP5 has been assessed in other cancers, such as colorectal cancer (37), pancreatic cancer (38) and breast cancer (39). Prior studies on LUAD have also identi ed prognostic signatures including DLGAP5 (40); however, little is known about how this gene affects the development of LUAD (41,42). MKI67 is the protein coding gene for the Ki67 protein, and this protein is known for decades to be a proliferation marker for human tumor cells. In fact, Ki-67 has roles in both interphase and mitotic cells (43) and was found to be associated with survival outcomes in lung cancer (44,45). TOP2A was found to be broadly expressed in many types of cancers (46). Several studies have reported that TOP2A expression was aberrantly upregulated in LUAD, targeting cell proliferation, migration and invasion (46, 47), and associated with poor prognosis (48). Wenxia Ma, et al. have found that TOP2A can potentially regulate the development of NSCLC working together with TPX2 (49). Shweta Arora, et al. reported that CCNB1 and other genes target cell cycle and immunosurveillance mechanisms via HMGA2 and E2F7 in NSCLC (50). Besides, Fan Kou, et al reported that inhibition of TOP2A reduces the expression levels of CCNB1 and CCNB2, which indicates that TOP2A targeting CCNB1 and CCNB2 promotes GLC82 and A549 cells proliferation and metastasis. Our study con rmed the important role that the 4 mRNAs have in LUAD development.
The two lncRNAs included in our signature, namely, PICSAR and SCAT1 were also associated with the development and prognosis of cancers. It was reported that PICSAR promotes the growth of cutaneous squamous cell carcinoma by regulating ERK1/2 activity (51) and PICSAR/miR-588/EIF6 axis regulates tumorigenesis of hepatocellular carcinoma by activating PI3K/AKT/mTOR signaling pathway (52). Besides, Chaoqi Zhang, et al have identi ed a three-lncRNA signature from pretreatment biopsies, including SCAT1 to predict pathological response and outcome in esophageal squamous cell carcinoma with neoadjuvant chemoradiotherapy (53). However, PICSAR and SCAT1 have received much less attention in LUAD research. In the present study, two lncRNAs were found to have the potential to be prognostic biomarkers and act as potential therapeutic targets for LUAD. This result should be further validated in prospective studies and may help in the strati cation and treatment of LUAD patients.
A risk score was developed to predict LUAD survival based on the selected prognostic genes and the cohorts were accordingly divided into high-and low-risk groups. Kaplan Meier analysis demonstrated that there was a signi cant difference in the overall survival between the two risk groups. The time-dependent ROC and COX regression proved that the risk score was a valuable and independent prognostic factor for LUAD. Therefore, this risk score can help predict and stratify the prognosis for LUAD, and also assist other future studies.
GO and KEGG pathway enrichment analyses were performed to further elucidate the differences between high-and low-risk patients. DNA replication, chromosome segregation, cell cycle G2/M phase transition were signi cantly enriched functions, as shown in Fig. 5. KEGG enrichment analysis also con rmed that homologous recombination, DNA replication, cell cycle were the signi cantly enriched pathways. Interestingly, pathways of neurodegeneration, Huntington disease and Alzheimer disease were also enriched. In fact, epidemiological studies indicated that patients suffering from Alzheimer's disease had a lower risk of developing lung cancer (54,55). Jon Sánchez-Valle, et al have found inverse patterns of expression in Alzheimer's disease and lung cancer by using transcriptomic meta-analyses (56). However, little is understood about the inverse relationship between lung cancer and neurodegenerative disease. More research is needed to shed light on it.
Additionally, the correlation between the risk score and immune status was assessed. Compared with low-risk group, the high-risk group had signi cantly higher scores of B cells, CD8 + T cells, Macrophages, pDCs, Tfh, Th1 cells, Th2 cells, and Treg. In contrast, the scores of aDCs, Mast cells, and Neutrophils were signi cantly lower in the high-risk group. The present study con rmed the conclusion of previous research that tumor-in ltrating immune cells affect LUAD outcome (57)(58)(59). Besides, the scores of immune functions were signi cantly higher in the high-risk group, with the exception of cytolytic activity, HLA and Type II IFN response. These ndings suggest that immune status was aberrantly changed in high-risk LUAD compared to low-risk group. As the future work, immunotherapy guided by risk scores targeting the abnormal immune status can be a promising treatment for LUAD patients.
Besides risk strati cation, the regulatory module was found to be differentially expressed in different stages of LUAD or LUAD with EGFR mutation, in relation to pathogenesis and prognosis. In this study, ve machine learning models were developed and applied to recognize early-stage LUAD or EGFR mutation in LUAD, based on the 6 selected genes. Despite using a limited gene number, the models presented a remarkable performance (AUROC > 0.75). Models based on broader sets of genes were also developed to further assess the potential of machine learning. The highest AUROCs of the 5 models were 0.815 for early-stage classi cation and 0.806 for EGFR mutation recognition based on the 43 hub mRNAs. The AUROCs exceeded 0.85 in both classi cation tasks based on all DEmRNAs and DElncRNAs with prognostic potential. However, these models and results are not shown in the present study, in consideration of the inconvenience and high cost in clinical practice. Currently, gene signatures comprised of several or a dozen of genes are still the research focus. Integrated analysis and clinical application of genomics using advanced algorithms are awaiting technical development and more studies.
CMap and molecular docking were performed to nd potential drugs with the ability to reverse the expression pro les of the key genes. Among the drugs with the top connectivity scores, several have already been studied in cancer treatment. For example, doxorubicin has been used to treat various types of cancers. Studies were conducted to overcome the toxic side effects and multidrug resistance during doxorubicin chemotherapy by using nanomedicine and combination treatment (60, 61). Teniposide and etoposide have been established as active compounds in the treatment of small cell lung cancer (62). The occurrence of these drugs in the table showed that the CMap results were reasonable. Furthermore, since the a nity scores were all below − 5 kcal/mol, this indicated a strong blinding between the compounds and selected genes.
Several limitations of this study should be considered. Firstly, data used in the bioinformatic analysis were obtained from public datasets, without veri cation of in vitro or in vivo biochemical experiments. Thus, the robustness of the six-gene signature requires further validation in large-scale prospective investments. Secondly, only mRNA and lncRNA were analyzed in our study. Multi-omics data may help us deeply understand the pathogenesis and better predict survival for LUAD. Lastly, potential drugs were selected by using CMap and molecular docking, which cannot replace biochemical experiments or clinical trials. There is still a long way to go before using these potential drugs.

Conclusions
In this study, a lncRNA-mRNA regulatory module, including 4 mRNAs and 2 lncRNAs, was identi ed, which is related to prognosis, staging, EGFR mutation and potential treatment of LUAD. Availability of data and materials TCGA data were available on the project website at https://www.cancer.gov/aboutnci/organization/ccg/research/structural-genomics/tcga, while the GEO datasets were available at https://www.ncbi.nlm.nih.gov/geo/.

Ethics approval and consent to participate
The study was an analysis of third-party anonymized publicly available datasets with pre-existing institutional review board (IRB) approvals.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. Supplementarymaterial.docx