LUAD And LUSC Cancer Classification, Biomarker Identification, and Pathway Analysis Using Overlapping Feature Selection Methods


 Lung cancer is one of the deadliest cancers in the world. Two of the most common subtypes, lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), have drastically different biological signatures, yet they are often treated similarly and classified together as non-small cell lung cancer (NSCLC). LUAD and LUSC biomarkers are scarce and their distinct biological mechanisms have yet to be elucidated. Many studies have attempted to improve traditional machine learning algorithms or develop novel algorithms to identify biomarkers, but few have used overlapping machine learning or feature selection methods for cancer classification, biomarker identification, or pathway analysis. This study proposes selecting overlapping features as a way to differentiate between cancer subtypes, especially between LUAD and LUSC. Overall, this method achieved classification results comparable to, if not better than, the traditional algorithms. It also identified multiple known biomarkers, and five potentially novel biomarkers with high discriminating values between the two subtypes. Many of the biomarkers also exhibit significant prognostic potential, particularly in LUAD. Our study also unraveled distinct biological pathways between LUAD and LUSC.


Introduction
Lung cancer is the most commonly diagnosed malignant tumor and is a leading cause of cancerassociated mortality. It is the second highest cause of new cancer cases in both genders in the United States and is the second leading cause of cancer deaths in females globally [1,2]. The most common subtypes of lung cancers are lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), classi ed together as non-small cell lung cancer (NSCLC) [3,4]. However, recent studies have suggested that LUAD and LUSC should be classi ed and treated as different cancers [5].
Identifying the mechanisms underlying LUAD and LUSC is needed to develop useful biomarkers for better diagnosis, and to design therapeutic interventions. Multiple gene expression and immunohistochemistry studies have identi ed biological pathways and biomarkers that differentiate between LUAD and LUSC [6][7][8]. Other studies classi ed cancers using both novel and traditional machine learning or feature selection methods [9][10][11][12]. However, few have investigated cancers by applying multiple feature selection methods and selecting the overlapping features. This approach is promising because different feature selection methods select features using different criteria; nevertheless, each method has its strengths and weaknesses. Focusing on overlapping features, will optimize the strength of each method and minimize the probability of false positive features.
In this study, we downloaded LUAD and LUSC RNA-Seq datasets from The Cancer Genome Atlas (TCGA) [13] and analyzed them with ve feature selection methods with ranking abilities: Differential Gene Expression Analysis (DGE), Principal Component Analysis (PCA), Least absolute shrinkage and selection operator (Lasso), minimal-Redundancy-Maximal Relevance (mRMR), and Extreme Gradient boosting (XGboost). We overlapped the outcome of the ve methods and identi ed 131 genes which were used for LUAD and LUSC classi cation and pathway analysis. The classi cation results were validated with random forest, a well-known and effective machine learning method. Among the 131 genes, 17 genes were proposed to be biomarkers and their diagnostic potential was assessed using Area Under Curve (AUC) value in Receiving Operating Characteristics (ROC) curve analysis. The prognostic values of the 17 genes were also assessed. The pathway analysis results were used to elucidate the potential clinical differences between LUAD and LUSC.

Study Design
We obtained LUAD and LUSC RNA-Seq data from TCGA [13] and selected discriminatory genes by overlapping DGE, PCA, mRMR, XGboost, and lasso. The genes that were overlapped by two or more algorithms were validated and used for LUAD and LUSC classi cation as well as pathway analysis. The genes that were overlapped by three or more algorithms were selected as biomarker candidates, and their diagnostic values were assessed using ROC analysis and AUC value. The prognostic values of the biomarker candidates were also assessed using Kaplan Meier Plotter [14].

Selection of genes
Top 500 genes from DGE (Supplementary Table S1) were selected as top features based on their lowest p-values. Similarly, top 500 genes from the rst principal component in PCA and the top 500 genes from mRMR (Supplementary Table S1) were selected based on the ranking of the algorithm. Also, 148 genes in Xgboost (Supplementary Table S1) and 68 genes in lasso (Supplementary Table S1) with threshold of 0.5 were identi ed and selected. Since each of these methods has its own selection criteria, the overlapping genes must satisfy multiple selection criteria, making them signi cant candidate biomarkers that differentiate LUAD and LUSC. Therefore, the ve independent sets of top genes were compared with a Venn diagram to identify the overlapping genes detected by multiple algorithms. Venn diagram ( Fig. 1) comparison detected 131 genes (Supplementary Table S2) overlapped by two or more algorithms and 17 genes (Table 1) overlapped by three or more algorithms.

Validation of Selected Genes
To evaluate how effective the selected genes are in classifying LUAD and LUSC, we used random forest to validate the top 500 genes selected from PCA, mRMR, and DGE, as well as the 148 genes from xgboost and 68 genes from lasso (Supplementary Table S1). All of the validation results for each feature selection method returned high classi cation accuracies of over 90% (Table 2). To compare to the previous feature selection methods, the overlapping 131 genes were validated the same way as the other algorithms. The binary classi cation statistics ( Table 2) were calculated using LUAD as 'positive' and LUSC as 'negative'.
The overlapping 131 genes showed comparable, if not better, results to the other algorithms (Table 2). Heatmaps for the top 131 and the top 17 genes were also generated ( Fig. 2a and 2b). Both heatmaps, in particular the heatmap with 17 genes, displayed clear borders separating LUAD from LUSC.  Among the downregulated genes, however, ARHGEF38 has the highest AUC of 0.9574, which is on par with the upregulated genes and signi cantly higher than all the downregulated genes reported in Xiao's study, indicating that ARHGEF38 can potentially be a novel biomarker in determining the two cancers.

Kaplan Meier Plotter Analysis of the 17 Potential Biomarkers
Of the 17 potential biomarkers (Table 1), only CELSR2 shows a signi cant prognostic value in LUSC, with its higher expression corresponding to a more favorable prognosis in LUSC (Table 3). In contrast, many genes show signi cant prognostic potential in LUAD. High expressions of KRT17, KRT6A, S100A2, TRIM29, REPS1, and GPC1 correspond to an unfavorable prognosis in LUAD, while high expressions of PERP, ELFN2, ARHGAP12, and QSOX1 correspond to a favorable prognosis in LUAD (Table 3).

GO Term Enrichment Analysis
To further understand the biological differences between LUAD and LUSC, we performed pathway analysis by splitting the identi ed 131 genes into two groups: 57 genes downregulated and 74 upregulated in LUSC compared to LUAD. Functional pathway annotation of these two groups of genes was performed using The Database for Annotation, Visualization and Integrated Discovery (DAVID) [15] analysis tool with Gene Ontology (GO) biological pathway enrichments. GO terms with P-value < 0.05 were obtained (Supplementary Tables S3 and S4). The top 10 most signi cantly upregulated and downregulated GO terms ranked by p-value are shown in Table 4. In addition, DAVID has the functionality to group similar GO terms into clusters of the same biological pathway. To elucidate the potential biological differences between LUAD and LUSC, the top ve most signi cantly upregulated and downregulated clusters ranked by enrichment scores were determined ( Table 5, and Supplementary  Tables S5 and S6).  In the upregulated group, most pathways are concentrated in cell adhesion, intermediate lament organization, and cell junction assembly. In the downregulated group, the most signi cant cluster is platelet degranulation and cell exocytosis, as well as other pathways such as tyrosine kinase signaling pathway, homeostasis, protein translation and circulatory system. These results suggest that LUSC tends to express more genes related to cell adhesion and cytoskeleton organization, and LUAD tends to express more genes involved in platelet degranulation and exocytosis, along with other signaling pathways.

Reactome Pathway Analysis
Reactome pathways [16] were also generated for both the upregulated and the downregulated groups. The most signi cantly upregulated pathway is the corni cation, or the keratinization pathway (Fig. 4, Supplementary Table S7), along with other similar pathways related to cell adhesion, which is consistent with GO term analysis. TP53 regulation pathway, which is often implicated in cancer, is among the top enriched pathways as well (Supplementary Table S7). For the downregulated group, the most signi cant pathway is peptide elongation synthesis (Fig. 5, Supplementary Table S8), which GO term analysis also reveals to be signi cant.

KEGG Pathway Analysis
Only the p53 signaling pathway is signi cant in the upregulated group (Table 6) in Kyoto Encyclopedia of Genes and Genomes (KEGG) [17] pathway analysis; this is consistent with Reactome analysis which ranks TP53 regulation as the second most upregulated pathway after keratinization and other cell junction related pathways. Only the lysosome and ribosome pathways are signi cant in the downregulated group (Table 6). The lysosomal pathway is coherent with platelet degranulation and exocytosis, as reported in GO term analysis. Even though the ribosomal pathway has a p-value slightly greater than 0.05, it is most likely signi cant as it is also shown to be signi cantly enriched in both GO and Reactome term analyses (Supplementary Tables S3 and S8).

Discussion
Lung cancer remains the second most common cancer in the world [18]. Among the lung cancers, the two most common subtypes LUAD and LUSC are often categorized together as Non-small cell lung cancer.
However, increasing evidence suggests that LUAD and LUSC should be considered as different diseases due to their vastly different biological and clinical signature [5]. Identifying biomarkers and unraveling the biological differences between the two can therefore provide a future direction in reaching better diagnosis and treatment for each condition.
Previous studies have utilized traditional feature selection and machine learning methods for cancer diagnosis, detection, and classi cation [10,11,19], but few have extended them to study potential biomarkers and biological pathways to discriminate between LUAD and LUSC. To improve cancer classi cation accuracy, novel machine learning, and feature selection methods have been developed [12,[20][21][22]. However, few studies have used overlapping features from different methods for classi cation, pathway analysis, and biomarker discovery, especially for LUAD and LUSC.
Here we took advantage of the ranking capabilities and the strengths of PCA, mRMR, XGboost, DGE, and Lasso to select 131 overlapping genes for classi cation and pathway analysis and identify 17 overlapping genes as potential biomarkers. Overall, the overlapping 131 genes showed several highranking metrics with lasso and PCA methods. Though the best method may vary depending on the metric, the classi cation result of using the overlapping 131 genes was by many metrics comparable if not better than the other methods that use more genes. The 131 overlapped genes achieved the highest sensitivity with PCA, the second highest accuracy with lasso, and the second highest F-measure overall, indicating that overlapping feature selection methods can be used to perform cancer classi cation.
NECTIN1 is a cell adhesion protein that plays a key role in herpes simplex virus type 1 (HSV-1) viral entry and has been shown to be sensitive to herpes oncolytic therapy in squamous cell carcinomas [31,32]. ELFN2 (extracellular leucine-rich repeat and bronectin type III domain-containing 2) is also known as protein phosphatase 1 regulatory subunit 29 and belongs to the leucine-rich repeat family. Studies show that ELN2 is prevalent in tumors of the brain and granular cells [33]. REPS1 is a gene that codes for RALBP1 associated Eps domain containing protein 1 and is associated with the endocytosis pathway [34]. ARHGEF38 and ARHGAP12 are both part of the Rho family GTPase regulators. Rho GTPases are essential to cell cytoskeletal structure, motility, and morphogenesis, and they have been implicated in many cancer metastases [35,36]. ARHGEF38, in particular, has been associated with aggressive prostate cancer [37]. ARHGAP12, though not shown to exhibit invasion potential, has also been implicated in cell proliferation [38].The other upregulated genes ELFN2, QSOX1, and MUC1 have been shown to directly promote metastasis in various cancers [39][40][41][42][43], including lung cancer. Intriguingly, all 5 biomarker candidates (ARHGAP12, ARHGEF38, ELFN2, QSOX1, MUC1) that are upregulated in LUAD are involved in cancer proliferation and metastasis; the most enriched pathway in LUAD, which is platelet degranulation, is associated with metastasis as well [44]. Furthermore, the loss of certain genes upregulated in LUSC such as TRIM29 and KRT6A is associated with more cellular invasion [45,46]. REPS1 and KRT6A genes which were upregulated in LUSC, were also shown to contribute to metastasis in cancer cells lines [47,48]. Clinical differences between LUAD and LUSC are well known. In particular, LUAD has a higher metastatic rate than LUSC [49]. Studying these potential biomarkers may provide insight into tumor progression, metastatic, and therapeutic differences between LUAD and LUSC. Overall, the mechanisms by which many of these genes may regulate NSCLC development and metastasis remain unknown; therefore, studies to elucidate the exact mechanisms are warranted.
Different tumor subtypes arise from different types of cells located within each speci c region, and consequently, the tumor transcriptome and morphology are thought to re ect this idea. In support of this view, our study, along with previous studies [6,8,24,50], found that pathways speci c to squamous cell tumors, including cell adhesion and keratinization, were associated with LUSC (Tables 4 and 5, Fig. 4), and pathways related exocytosis and surfactant homeostasis were associated with LUAD (Table 4).
Aside from cell adhesion or cytoskeleton organization, LUSC demonstrates higher regulation of p53 signaling in both KEGG and Reactome analyses. It is known that TP53 mutation is more common in LUSC than in LUAD [51][52][53], and that such mutation may predominantly be a non-truncated mutation in LUSC leading to higher expression levels of genes involved in the p53 regulation pathway [54]. Moreover, P53 mutations often lose their tumor suppression function while gaining oncogenic abilities, leading to increased cell growth and proliferation compared to LUAD [55].
The most prominent pathway associated with LUAD, compared to LUSC, is platelet degranulation and exocytosis (Table 4, Table 5). Interestingly, lung cancer is the most common malignancy to coexist with venous thromboembolism, especially pulmonary embolism [56]. LUAD, in particular, has been shown to be an independent risk factor for pulmonary embolism even among lung cancers [57,58]. One of the top downregulated clusters also show circulatory system regulation (Supplementary Table S6). Because platelet granulation directly causes thrombus formation, the differential enrichment of platelet granulation pathway can therefore help explain a more active and a more common hypercoagulation and thrombotic process in LUAD compared to LUSC [59]. In addition, platelets have been implicated in both the innate and the adaptive immune systems; platelet degranulation can modulate innate immunity via the release of cytokine, and platelet-leukocyte interactions can lead to leukocyte recruitment and activation in cancer [60]. In fact, CD63, one of the genes in the platelet degranulation pathway (Supplementary Tables S3 and S6), is directly involved in leukocyte recruitment through endothelial Pselectin [61]. LUSC has been associated with a relatively more suppressed immune response, implying a more active immune response in LUAD, which supports our result [55,62]. The other top enriched LUAD pathways include tyrosine kinase signaling pathways and protein translation, which are known pathogenic pathways in cancers [63][64][65][66].
There are several limitations of this study. One of them is that this study does not take into account the RNA expression fold changes, which some groups have used to rank differentially expressed genes [67,68]. Also, although this study aims to minimize the discovery of false positive biomarkers by overlapping different feature selection methods, the proposed biomarker candidates in this study still lack experimental veri cation. Nevertheless, these results may shed light into the biological differences between LUAD and LUSC, as well as aid the discovery of better diagnosis and treatment for each [63,69].
In conclusion, we designed and implemented a work ow of overlapping ve different feature selection methods to perform cancer classi cation, identify novel biomarkers, and study biological differences in NSCLC. We identi ed ARHGAP12, ARHGEF38, ELFN2, NECTIN1, and REPS1 as novel biomarkers, along with 12 other biomarker candidates. We also provided insight into potential explanations for different clinical ndings and biological characteristics between LUSC and LUAD through pathway analysis. Further validation studies of these biomarkers and biological mechanisms are therefore warranted.

RNA-Seq data processing
The LUAD and LUSC HTSeq read counts data were downloaded from TCGA [13] using TCGAbiolinks from R [70,71]. As of June 2020, there were 533 LUAD and 502 LUSC samples. The samples were normalized using TMM method and standardized using the CPM (read counts per million) function in R. Genes < 1 CPM in over 600 samples were considered noise and discarded to obtain 14,010 genes. The ltered genes were analyzed with different gene selection methods to further narrow down potential gene candidates for biomarkers and pathway analyses.

Gene Selection and Cancer Classi cation
Gene selection analysis was performed using ve different selection methods to generate ve independent sets of top genes. The 5 independent sets were compared, and the resulting overlapped genes were used for cancer classi cation, biomarker identi cation, and pathway analysis. The selection methods used were DGE, PCA, xgboost, lasso, and mRMR. DGE between LUAD and LUSC was performed using the edgeR package [72]. Top 500 differentially expressed genes (Supplementary Table 1) were selected as top features based on their lowest p-values; validation of these genes was performed using random forest with the ranger package [73]. The top 500 genes from the rst principle component in PCA and the top 500 genes from mRMR [74] were selected and validated the same way as the differentially expressed genes. Genes selected from Xgboost [75] and lasso [76] (Supplementary Table 1) were validated in a similar manner. For each validation, the data were randomly split into a training set and a testing set in a 7:3 ratio, where the training set was used to construct the model and the testing set was used to evaluate the model's performance. To compare each selection method more effectively, we split the training sets and testing sets the same way for all validations. We applied 5-fold cross validation to decide the optimal parameters for each training model and estimated its accuracy by applying the best determined parameters to the test set.
For classi cation and pathway analysis, we selected genes that were detected by at least two methods, and they were validated using ranger [73]. We also used bootstrapping [77] with 10,000 replicates to calculate the con dence interval for the accuracy of each method, including the proposed method of classi cation. The genes that were detected by at least 3 methods were considered candidate biomarkers. Their diagnostic potential was determined assessed using receiving operating characteristics (ROC) curve analysis.

Prognostic Value Analysis Using Kaplan-Meier Plotter
Prognostic values of the identi ed biomarkers in LUAD and LUSC were evaluated using Kaplan-Meier Plotter. Kaplan-Meier Plotter is an online database that contains comprehensive clinical and microarray data for various cancers, including lung cancer [14].

Pathway Analysis of Selected Genes
To further investigate and understand the biological difference between LUAD and LUSC, we performed pathway enrichment analysis using KEGG [17], Gene Ontology (GO), and Reactome [16]. Modi ed Fisher's exact tests were performed using DAVID v6.8 [15]. Pathways with false discovery rate (FDR) < 5% or pvalue less than 0.05 were considered signi cant. These databases were all accessed in November 2020. Authors' contributions JC proposed the method of overlapping feature selection methods to investigate LUAD and LUSC. JC obtained, analyzed, and interpreted the data. JC wrote the manuscript and generated the gures and tables. JD supervised the study and prepared the gures. JD made substantial suggestions and revisions of the manuscript. All authors read and approved the nal manuscript.