Lung cancer remains the second most common cancer in the world. Among lung cancers, the two most common subtypes, LUAD and LUSC, are often grouped together as non-small cell lung cancer (NSCLC). However, increasing evidence suggests that LUAD and LUSC should be considered different diseases because of their vastly different biological and clinical signatures. Identifying biomarkers and unraveling the biological differences between the two can therefore point toward better diagnosis and treatment for each condition.
Previous studies have used traditional feature selection and machine learning methods for cancer diagnosis, detection, and classification [10, 11, 19], but few have extended them to study potential biomarkers and biological pathways that discriminate between LUAD and LUSC. To improve cancer classification accuracy, novel machine learning and feature selection methods have been developed [12, 20–22]. However, few studies have used features that overlap across different methods for classification, pathway analysis, and biomarker discovery, especially for LUAD and LUSC.
Here we took advantage of the ranking capabilities and strengths of PCA, mRMR, XGBoost, DGE, and lasso to select 131 overlapping genes for classification and pathway analysis and to identify 17 overlapping genes as potential biomarkers. Overall, the 131 overlapping genes ranked highly on several metrics with the lasso and PCA methods. Although the best method varies depending on the metric, classification using the 131 overlapping genes was, by many metrics, comparable to, if not better than, the other methods that use more genes. The 131 overlapping genes achieved the highest sensitivity with PCA, the second highest accuracy with lasso, and the second highest F-measure overall, indicating that overlapping feature selection methods can be used to perform cancer classification.
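As an illustration of the overlap step, the following is a minimal sketch that assumes each method yields a ranked gene list and that the overlap is taken across the top-ranked genes of every method. The function name `overlapping_features`, the `top_k` and `min_methods` parameters, and the toy rankings are hypothetical; the gene symbols are borrowed from this section only as labels and the orderings are not results from this study.

```python
from collections import Counter


def overlapping_features(rankings, top_k, min_methods=None):
    """Return genes found in the top_k of at least min_methods ranked lists.

    rankings: dict mapping a method name to a list of genes, best first.
    min_methods: defaults to the number of methods (strict intersection).
    """
    if min_methods is None:
        min_methods = len(rankings)
    counts = Counter()
    for ranked in rankings.values():
        # set() guards against a gene being counted twice within one method
        counts.update(set(ranked[:top_k]))
    return sorted(gene for gene, n in counts.items() if n >= min_methods)


# Toy rankings for illustration only (hypothetical orderings)
rankings = {
    "pca": ["KRT5", "TP63", "NAPSA", "MUC1", "GPC1"],
    "mrmr": ["TP63", "KRT5", "MUC1", "DSC3", "NAPSA"],
    "xgboost": ["KRT5", "MUC1", "TP63", "QSOX1", "PERP"],
    "dge": ["MUC1", "KRT5", "TP63", "NAPSA", "DSC3"],
    "lasso": ["TP63", "MUC1", "KRT5", "PERP", "NAPSA"],
}

shared = overlapping_features(rankings, top_k=3)
relaxed = overlapping_features(rankings, top_k=3, min_methods=4)
print(shared)   # genes selected by all five methods
print(relaxed)  # genes selected by at least four methods
```

Requiring agreement across all methods gives the most conservative gene set; relaxing `min_methods` trades some of that stringency for a larger set, which may suit classification better than biomarker nomination.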
Moreover, this method may prove valuable in biomarker discovery. In agreement with our results, previous studies have reported that the levels of several genes are greatly elevated in LUSC compared to LUAD; these genes include KRT6 [6, 8, 23, 24], KRT5 [6, 8, 25], KRT14 [8, 23, 24], KRT17 [8, 23], PERP [8, 23], TRIM29 [8, 23], GPC1, CELSR2, S100A2, and TUBA1C. Also consistent with our results, levels of QSOX1 and MUC1 were reported to be lower in LUSC than in LUAD. Many current biomarkers, such as Tumor Protein P63 (TP63), Napsin A Aspartic Peptidase (NAPSA), Melanophilin (MLPH), Desmocollin 3 (DSC3), and others, are also among the top 131 genes selected by our method [23, 27–30]. To our knowledge, this is the first study to identify ARHGAP12, ARHGEF38, ELFN2, NECTIN1, and REPS1, all of which are among the top 17 genes, as biomarkers. Notably, ARHGEF38 and NECTIN1 have two of the highest diagnostic values among the selected genes based on ROC curve analysis, with ARHGEF38 having the highest AUC value among upregulated genes and the second highest AUC value overall. Furthermore, many of the 17 genes show significant prognostic importance, particularly in LUAD (Table 3).
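For readers unfamiliar with single-gene ROC analysis, the sketch below shows one way the AUC of an individual gene could be computed, using the equivalence between the AUC and the normalized Mann-Whitney U statistic. It is not necessarily the procedure used in this study; the function name `single_gene_auc`, the label coding (1 for one subtype, 0 for the other), and the expression values are all hypothetical.

```python
def single_gene_auc(expr, labels):
    """AUC of one gene's expression used as a score for the class labelled 1.

    Computed as the fraction of (positive, negative) sample pairs in which
    the positive sample has the higher expression; ties count as one half.
    """
    pos = [x for x, y in zip(expr, labels) if y == 1]
    neg = [x for x, y in zip(expr, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy expression values for one gene across six samples (hypothetical)
expr = [8.1, 7.4, 6.9, 3.2, 2.8, 7.0]
labels = [1, 1, 1, 0, 0, 0]

auc = single_gene_auc(expr, labels)
print(round(auc, 3))
```

Under this coding, a gene that is lower in the class labelled 1 yields an AUC below 0.5, so its discriminative value is better read as 1 minus the AUC.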
NECTIN1 is a cell adhesion protein that plays a key role in herpes simplex virus type 1 (HSV-1) entry, and it has been linked to sensitivity to herpes oncolytic therapy in squamous cell carcinomas [31, 32]. ELFN2 (extracellular leucine-rich repeat and fibronectin type III domain-containing 2) is also known as protein phosphatase 1 regulatory subunit 29 and belongs to the leucine-rich repeat family. Studies show that ELFN2 is prevalent in tumors of the brain and granular cells. REPS1 codes for RALBP1-associated Eps domain-containing protein 1 and is associated with the endocytosis pathway. ARHGEF38 and ARHGAP12 are both Rho family GTPase regulators. Rho GTPases are essential to cell cytoskeletal structure, motility, and morphogenesis, and they have been implicated in the metastasis of many cancers [35, 36]. ARHGEF38, in particular, has been associated with aggressive prostate cancer. ARHGAP12, though not shown to exhibit invasion potential, has also been implicated in cell proliferation. The other upregulated genes, ELFN2, QSOX1, and MUC1, have been shown to directly promote metastasis in various cancers [39–43], including lung cancer. Intriguingly, all 5 biomarker candidates that are upregulated in LUAD (ARHGAP12, ARHGEF38, ELFN2, QSOX1, MUC1) are involved in cancer proliferation and metastasis; the most enriched pathway in LUAD, platelet degranulation, is associated with metastasis as well. Furthermore, the loss of certain genes upregulated in LUSC, such as TRIM29 and KRT6A, is associated with increased cellular invasion [45, 46]. REPS1 and KRT6A, which were upregulated in LUSC, were also shown to contribute to metastasis in cancer cell lines [47, 48]. Clinical differences between LUAD and LUSC are well known. In particular, LUAD has a higher metastatic rate than LUSC. Studying these potential biomarkers may provide insight into the differences in tumor progression, metastasis, and therapeutic response between LUAD and LUSC.
Overall, the mechanisms by which many of these genes may regulate NSCLC development and metastasis remain unknown; therefore, studies to elucidate the exact mechanisms are warranted.
Different tumor subtypes arise from different types of cells located within each specific region, and consequently, the tumor transcriptome and morphology are thought to reflect the cell of origin. In support of this view, our study, along with previous studies [6, 8, 24, 50], found that pathways specific to squamous cell tumors, including cell adhesion and keratinization, were associated with LUSC (Tables 4 and 5, Fig. 4), and pathways related to exocytosis and surfactant homeostasis were associated with LUAD (Table 4).
Aside from cell adhesion and cytoskeleton organization, LUSC shows greater enrichment of p53 signaling regulation in both KEGG and Reactome analyses. TP53 mutation is known to be more common in LUSC than in LUAD [51–53], and in LUSC such mutations may be predominantly non-truncating, leading to higher expression levels of genes involved in the p53 regulation pathway. Moreover, mutant p53 proteins often lose their tumor suppressor function while gaining oncogenic abilities, leading to increased cell growth and proliferation compared to LUAD.
The most prominent pathway associated with LUAD, compared to LUSC, is platelet degranulation and exocytosis (Tables 4 and 5). Interestingly, lung cancer is the most common malignancy to coexist with venous thromboembolism, especially pulmonary embolism. LUAD, in particular, has been shown to be an independent risk factor for pulmonary embolism even among lung cancers [57, 58]. One of the top downregulated clusters also shows circulatory system regulation (Supplementary Table S6). Because platelet degranulation directly causes thrombus formation, the differential enrichment of the platelet degranulation pathway can help explain the more active and more common hypercoagulation and thrombotic processes in LUAD compared to LUSC. In addition, platelets have been implicated in both the innate and the adaptive immune systems; platelet degranulation can modulate innate immunity via the release of cytokines, and platelet-leukocyte interactions can lead to leukocyte recruitment and activation in cancer. In fact, CD63, one of the genes in the platelet degranulation pathway (Supplementary Tables S3 and S6), is directly involved in leukocyte recruitment through endothelial P-selectin. LUSC has been associated with a relatively more suppressed immune response, implying a more active immune response in LUAD, which is consistent with our results [55, 62]. The other top enriched LUAD pathways include tyrosine kinase signaling and protein translation, which are known pathogenic pathways in cancers [63–66].
This study has several limitations. First, it does not take into account RNA expression fold changes, which some groups have used to rank differentially expressed genes [67, 68]. Second, although this study aims to minimize the discovery of false positive biomarkers by overlapping different feature selection methods, the proposed biomarker candidates still lack experimental verification. Nevertheless, these results may shed light on the biological differences between LUAD and LUSC and aid the development of better diagnosis and treatment for each [63, 69].
In conclusion, we designed and implemented a workflow that overlaps five different feature selection methods to perform cancer classification, identify novel biomarkers, and study biological differences in NSCLC. We identified ARHGAP12, ARHGEF38, ELFN2, NECTIN1, and REPS1 as novel biomarkers, along with 12 other biomarker candidates. Through pathway analysis, we also provided potential explanations for the differing clinical findings and biological characteristics of LUSC and LUAD. Further validation studies of these biomarkers and biological mechanisms are therefore warranted.