Identification of the potential signature genes between lung adenocarcinoma and lung squamous cell carcinoma

Objectives: The purpose of this study was to identify and compare the potential signature genes along with the key pathways related to the pathogenesis of lung adenocarcinoma and lung squamous cell carcinoma through bioinformatics analysis. Methods: The differential expressions of miRNAs (DEmiRNAs) and mRNAs (DEmRNAs) in lung adenocarcinoma (AD) and lung squamous cell carcinoma (SC) were identified using the microarray data of 2 miRNA and 6 mRNA from the Gene Expression Omnibus (GEO) database. The interaction between the DEmiRNAs and DEmRNAs was explored, followed by the KEGG pathway enrichment, Gene Ontology annotation (GO) enrichment analysis, DEmiRNA-gene interactive network, and protein-protein interaction (PPI) network to predict the hub genes and identify the key pathways. The sequencing data was then downloaded from The Cancer Genome Atlas (TCGA) to validate the hub genes, perform survival analysis, construct the ROC curve along with identifying the regulatory networks of mRNA-miRNA-lncRNA. Finally, the resulting signature genes and significant regulatory networks in AD and SC were filtrated across these analyses. Results: The DEmiRNAs and DEmRNAs were found found that the hsa-miR-21-5p, hsa-miR-200a-5p, and COL1A1 may act as potential meaningful biomarkers for the AD while hsa-miR-141-3p, hsa-miR-429, hsa-miR-130b-3p, hsa-miR-205-5p, and FRMD6 may play an important role in SC. The HMGA1/hsa-miR-424-5p/PVT1 may be a significant regulatory network of AD while KIF11/hsa-miR-30d-5p/DLEU2, CCNE2/hsa-miR-30d-5p/DLEU2, and SPTBN1/hsa-miR-218-5p/MAGI2-AS3 may be the useful networks in regulating the SC progression. Conclusion: This study provided the signature targets and theoretical basis for further research on the biomarkers and molecular mechanisms of the SC, AD, and NSCLC.


Introduction
Non-small cell lung cancer (NSCLC) accounts for approximately 80-85% of lung malignancy cases, including squamous cell carcinoma (SC), adenocarcinoma (AD), and large cell carcinoma [1]. Lung adenocarcinoma comprises 40% of all the lung cancer (LC) cases, and it arises from small airway epithelial type II alveolar cells, which secrete mucus and other substances [2,3]. It tends to occur in the periphery of the lung and remains the main histological subtype of lung cancer in both smokers and nonsmokers, irrespective of their age and gender [4]. Lung squamous cell carcinoma accounts for about 30% of lung cancer, and it develops from early versions of squamous cells or epithelial cells in the airway of the bronchial tubes found in the center of the lungs. A large amount of data indicates that smoking, gender, and age are the crucial factors that influence the risk of SC. For example, 80% of lung cancer deaths are attributed to cigarette smoking, where most of them are male and aged around 50 years [5,6].
Compared to SCLC, the cells in NSCLC tend to grow and divide more slowly, leading to relatively late spreading and metastasizing of the tumor. However, more than 70% of these cases are diagnosed as unresectable advanced stage tumors [7]. Despite many intervention methods being proposed, the prognosis for the NSCLC patients remains poor, with a high recurrence rate even after the treatment. Recently, both targeted therapy and immunotherapy have appeared as genetic alteration-guided targeted therapies. These include TK inhibitors (TKIs) that target the EGFR and the antibody-directed therapies that target CTLA-4, PD-1, and PD-L1 to show remarkable early success in the management of advanced NSCLC [8][9][10]. To improve the clinical outcomes of lung cancer patients, biomarkers are urgently required to accurately subclassify lung cancer and also design precisely targeted therapy for lesions.
Technologies such as DNA microarray and sequencing can easily and quickly obtain a lot of data. However, due to the differences in the production methods, heterogeneity becomes a certain influencing factor between different technologies, which may hinder a high-throughput comprehensive analysis [11]. Therefore, to reduce the heterogeneity, we used 4 miRNA microarray and 6 mRNA microarray data to synthetically select hub genes for AD and SC, which was followed by downloading of the sequencing profiles from TCGA to validate the previously selected hub genes by microarray. After processing and strict analysis, the signature genes and key signaling pathways were screened out to explore an in-depth understanding of the molecular mechanism of lung cancer, which provided therapeutic targets for drug designing and molecular markers for clinical diagnosis.

Materials and methods Collection of Datasets and analyses of DEGs
The public gene expression profiles of AD and SC were obtained from the GEO database (http://www.ncbi.nlm.nih.gov/geo/), which included miRNA microarray datasets from GSE74190 and GSE51855 for both AD and SC along with mRNA microarray datasets from GSE32863, GSE75037, GSE116959 and GSE30219, GSE33479, GSE51852 for AD and SC, respectively. The data were collected, and all the samples of lung cancer tissues were compared with the normal samples. The detailed information of the microarray datasets is shown in Table 1. The workflow demonstrating the identification of core genes and pathways is presented in Figure 1.
The statistical analyses were carried out using the R (version 4.0.3) software. All the data normalization and differential gene expression analyses were performed using the limma package [12]. The non-paired t-test was used to calculate the p-values to screen differentially expressed miRNA and mRNA (DEmiRNA and DEmRNA), and the threshold was set as |logFC| ≥1 and adjusted P-values < 0.05.

Relationship pairing between DEmiRNA and DEmRNA
The predicted target genes from the DEmiRNA were obtained by using at least five out of the seven commonly used databases, including miRWalk [13], starBase [14], miRDB [15], miRanda, Targetscan [16], Tarbase v.8 [17], and miTarbase [18]. The predicted target genes from DEmiRNA were matched with the DEmRNA selected by the microarray analysis to obtain the target gene from DEmiRNA.

KEGG pathway and GO analysis
The Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis and Gene Ontology annotation enrichment (GO) analysis of DEmiRNA and DEmRNA were performed using the R clusterprofiler package [19], and the threshold values were selected as gene count ≥ 2 and adjusted P-values < 0.05.

Constructing the DEmiRNA-gene interaction and PPI network
As per the interaction information obtained between the DEmiRNA and its target genes, the interaction network was constructed using Cytoscape 3.7.0 software to obtain a genetic relationship map [20].
The STRING database was used to predict the protein-protein interaction (PPI) network for the the target genes of DEmiRNA. The species used was homo sapiens and the parameter of PPI was set as 0.4 [21].

Validation of hub genes
The significant mRNA and miRNA were selected based on the KEGG analysis, GO analysis, interaction network, and PPI network analysis. Then, the RNA-seq and miRNA-seq data were downloaded from The Cancer Genome Atlas (TCGA) to validate the hub genes, which were previously selected by the microarray analysis from the GEO database. The resulting data were treated using the edgeR package [22]. The genes having a threshold false discovery rate (FDR) of < 0.05 and |logFC| of ≥ 1 were considered as statistically significant differential genes.

Survival analysis and ROC curve of the hub genes
The hub mRNA and miRNA obtained after the validation by TCGA were used to perform the overall survival analysis using the R package of survival. The HR and log-rank p-value was calculated and presented on the Kaplan-Meier curves.
The receiver operating characteristic (ROC) curve was used to judge the diagnostic value of the hub genes previously validated by TCGA. The area under the curve (AUC) and the p-value were calculated and presented on the plot.

Prediction and construction of the regulatory networks of mRNA-miRNA-lncRNA
The DEmiRNAs obtained after the validation by TCGA were submitted to the starbase database (http://starbase.sysu.edu.cn/) to predict the lncRNA, with the chip dataset at medium stringency (>= 2) [14]. Then the selected lncRNAs were used to predict the interactive mRNA by Multi Experiment Matrix (MEM, https://biit.cs.ut.ee/mem/) [23] with the output limit set as the top 50. Finally, the target mRNA predicted by the lncRNA was matched with the target mRNA of miRNA to get the common target genes that might interact with both miRNA and lncRNA.
A correlation analysis was conducted among the selected mRNAs, miRNAs, and lncRNAs using the R software through Pearson's correlation analysis, and p < 0.05 was defined as the statistically significant threshold. The obtained statistically significant mRNAs, miRNAs, and lncRNAs were then used to create a network between them (mRNA-miRNA-lncRNA).
To explore the cancer-related pathways associated with the gene expression levels in these regulatory networks, the Gene Set Enrichment Analysis (GSEA) was performed. The reference gene set used in this study was the c2.cp.kegg.v7.2.symbols.gmt. A gene set was considered as significantly enriched if the normal p-value and FDR were both found to be less than 0.05.

Validation of the hub genes
The significant genes in the AD and SC, previously obtained from the microarray data were further selected by combining the KEGG analysis, GO analysis, interaction network, and PPI network analysis. Table 3 shows these hub mRNAs and miRNAs after the validation by TCGA. To select the mRNAs in AD, 510 patients and 58 healthy controls were analyzed while for SC, 496 patients and 49 healthy controls were analyzed. For the selection of miRNAs in AD, 510 patients with 45 healthy controls were analyzed, whereas for miRNA in SC, 473 patients with 45 healthy control were analyzed. The miRNAs and mRNAs with higher differential expression levels in AD compared to SC were observed in 3 miRNAs, including hsa-miR-21-5p, hsa-miR-625-5p, and hsa-miR-200a-5p, and 5 mRNAs, including PIK3R1, CCND2, EFNB2, GJA1, and COL1A1. Similarly, the miRNAs and mRNAs with higher differential expression levels in SC compared to AD were observed in 15 miRNAs, including hsa-miR-149-5p, hsa-miR-200c-3p, hsa-miR-29c-3p, hsa-miR-30b-5p, hsa-miR-145-5p, hsa-miR-181a-5p, hsa-miR-181c-5p, hsa-miR-497-5p, hsa-miR-30d-5p, etc., and 6 mRNAs, including FRMD6, FOXP1, DDAH1, PRUNE2, ENPP4 and NAA50. Supplementary Figure S3 shows the expression plots of the selected hub genes in both AD and SC, where the data was derived from TCGA.
[b] the genes expressed both in AD and SC but the difference in the expression levels of hub genes between the patients and healthy control in AD was higher than SC.
[c] the genes that had the same difference in the expression levels of hub genes in AD and SC.
[d] the genes expressed both in AD and SC with the difference in the expression levels of hub genes in SC being higher than the AD. [e] the hub genes with FDR < 0.05 and |logFC| ≥1 expressed differentially just in the SC.

Survival analysis
The hub genes validated by TCGA were used to perform the overall survival analysis. Table 4 shows the meaningful hub genes in the overall survival of the patients with AD and SC. The genes with the expression in AD > SC including hsa-miR-21-5p, hsa-miR-200a-5p, COL1A1, PIK3R1, and CCND2 are closely related to the overall survival in AD, while the genes with the gene expression in AD < SC including hsa-miR-141-3p, hsa-miR-429, hsa-miR-130b-3p, hsa-miR-205-5p, FRMD6, FOXP1, DDAH1, and ENPP4 are associated with the overall survival in SC. Figure 5 presents the Kaplan-Meier curves of these hub genes with either gene expression being AD > SC or AD < SC.

Construction of regulatory networks for mRNA-miRNA-lncRNA
Based on Pearson's correlation analysis, the probable interactive relationship among the mRNAs, miRNAs, and lncRNAs are shown in Table 6 and Figure 7. In AD, the correlation among HMGA1, hsa-miR-424-5p, and PVT1 mostly revealed a probable significant regulatory network while in the SC, the correlations among KIF11, hsa-miR-30d-5p, and DLEU2; CCNE2, hsa-miR-30d-5p, and DLEU2; SPTBN1, hsa-miR-218-5p, and MAGI2-AS3 were more obvious than the others, suggesting that all of them might function as regulators in SC. Figure 8 shows the expression and correlation plots of the regulatory network in AD, while Figure 9 shows the expression and correlation plots of regulatory networks in SC. Categorizing each regulatory network as one group, the KEGG pathway was found to be enriched in all the mRNA, miRNA, and lncRNA of each group. The related GSEA of each regulatory network is presented in Table 7     represents the predicted direction. The rhombus represents miRNA, the rectangle represents mRNA, while the ellipse represents lncRNA. Red represents up-regulated expression, while blue represents down-regulated expression. Figure 9. The expression and correlation plots of the ceRNA network in the SC. a-b, k-m: the expression plots of KIF11, hsa-miR-30d-5p, DLEU2, CCNE2, SPTBN1, hsa-miR-218-5p, and MAGI2-AS3 in the SC. e-i, n-p: the correlations found in the SC. j-q: the regulatory networks in the SC. The arrow direction represents the predicted direction. The rhombus represents miRNA, the rectangle represents mRNA, while the ellipse represents lncRNA. Red represents up-regulated expression, while blue represents down-regulated expression.  Table 7. The related GSEA of each regulatory network (KEGG pathway).

Discussion
The KEGG pathway analysis was enriched by the DEmiRNA and DEmRNA. Phagosome, Human T-cell leukemia virus 1 infection, and ECM-receptor interaction were significantly enriched in the AD, which was closely associated with the immune microenvironment and the maintenance of cell structure and function [24,25]. The cell cycle and p53 signaling pathways were significantly enriched in the SC, which was closely connected to the basic biological processes and signaling pathways. According to the GO analysis, cardiac septum morphogenesis (related to BP), proteinaceous extracellular matrix (related to CC), and the growth factor binding (related to MF) coexisted in the AD and SC, suggesting that they are essential points for the formation of NSCLC.
Thus, hsa-miR-21-5p, hsa-miR-200a-5p, COL1A1 might be signature regulators for AD while hsa-miR-141-3p, hsa-miR-429, hsa-miR-130b-3p, hsa-miR-205-5p, and FRMD6 might be important for SC. Several articles have proved that miR-21-5p and miR-200a regulate the progression of AD. The miR-21-5p targets SET/TAF-Iα, WWC2, etc [26,27] while the miR-200a correlates with MALAT1, DNMT1, GOLM1, etc. [28,29]. Collagen 1A1 (COL1A1) is an extracellular matrix protein [30], which participates in the process of focal adhesion and may influence the metastatic ability of the cells [31]. The abnormal expression of COL1A1 has been reported in several cancers, including AD [31][32][33]. The miR-429 gene can bind to SNHG22 and SESN3 in the SC cells [34], and several reports have even proved that the expression of miR-205 can distinguish between squamous cell carcinoma and adenocarcinoma in lung cancer biopsies [35,36]. To summarize, the selected differentially expressed genes between the AD and SC, especially with statistically significant ROC curve and association with overall survival, may act as significant biomarkers to subclassify NSCLC or important targets helping in designing a precise therapy for lung cancer.
Besides, miR-424-5p was filtrated in the previous research since it coexpressed with lncRNA PVT1 and was associated with NSCLC radiosensitivity [37]. Similarly, we also found that the regulatory network of HMGA1/hsa-miR-424-5p/PVT1 might play important biological functions in regulating the development of AD. Meanwhile, the regulatory networks of KIF11/hsa-miR-30d-5p/DLEU2, CCNE2/hsa-miR-30d-5p/DLEU2, and SPTBN1/hsa-miR-218-5p/MAGI2-AS3 might also be closely related to the tumor formation and progression of SC, where the values of cor > 0.3 and p < 0.05 in the SC regulatory groups indicated much more powerful connections. However, these regulatory networks were derived from one certain lung cancer (AD or SC), and we did not prove if it contributed to another cancer. Thus, these regulatory networks may not be able to distinguish between the two cancers, but these could be one of the obstacles during the treatment of lung cancer.