Identi cation of Molecular Biomarkers Associated with Non-Small-Cell Lung Carcinoma (NSCLC) Using Whole-Exome Sequencing Analysis

Varsha Singh All India Institute of Medical Sciences Amit Katiyar All India Institute of Medical Sciences Prabhat Malik All India Institute of Medical Sciences Sunil Kumar All India Institute of Medical Sciences Anant Mohan All India Institute of Medical Sciences Harpreet Singh ICMR-AIIMS Computational Genomics Center, Indian Council of Medical Research Deepali Jain (  deepalijain76@gmail.com ) All India Institute of Medical Sciences


Introduction
Various studies have proven NSCLC to be a histologically and molecularly heterogeneous group of cancer. The two main histological NSCLC subtypes are adenocarcinoma (ADCA) and squamous cell carcinoma (SQCC). Although the incidence of ADCA is on the rise, SQCC is currently the second most frequent histologic subtype. Distinct subtypes of NSCLC are driven by a speci c genetic alteration, the molecular mechanisms of which remain to be fully elucidated. The Cancer Genome Atlas (TCGA) has conducted comprehensive genome studies of NSCLC, displaying a great diversity of molecular variations. Some of the mutated genes were common in both the histology subtypes and some were group speci c. ADCA shows more complex and heterogeneous molecular patterns than SQCC, with a greater number of associated genomic aberrations (1,2). Tumor genotype analysis has identi ed driver alterations in 50-80% of NSCLC patients according to demographics, and particularly ethnicity. Asian people have unique clinical characteristics, tumor histology and show different prevalence of oncogenic mutations (3).
Signi cant advancement has been made in the treatment of patients with pulmonary ADCA on the basis of the molecular pro le. The discovery of EGFR mutations and ALK rearrangement has opened a new era of targeted therapy in ADCA. However, no such molecular target exists for squamous cell carcinoma (SQCC). Whole exome sequencing (WES) has been in wide use for the discovery of new genetic markers which may offer more information for the development of personalized medicine for all subtypes of lung cancer (4). WES has been widely used in clinical research for the discovery of new genetic markers. The key objective of this study is to nd out novel genetic markers for NSCLC which can be used as a universal biomarker for the treatment. This study also identi es and compares the genomic alterations of ADCA subtype with SQCC subtype.  Variant calling and quality control Quality of raw reads (FASTQ les) were examined using FastQC (6). Adaptor and low-quality sequences were trimmed using Trimmomatic software (5). Paired clean reads with longer than 50 bases were aligned against the Genome Reference Consortium Human Build 38 patch release 7 (GRCh38.p7) using BWA-MEM algorithm of Burrows-Wheeler Aligner (BWA 0.6.1) (7). SAM/BAM post-processing steps including SAMto BAM conversion, sorting, adding read group information, mark duplicates, and base quality score recalibration were performed using the Genome Analysis Toolkit (GATK 4.0.6.0)(8, 9). The quality of the recalibrated BAM les was checked with QualiMap v2.0.2(10). Finally, a genomic variation, including single-nucleotide polymorphisms (SNPs) and small INDELs (insertion and deletion) were detected for each sample individually using GATK HaplotypeCaller in GVCF mode (-ERC GVCF), and the results were combined using GenotypeGVCFs. Raw variant calls were soft ltered using GATK VariantFiltration based on the following parameters: LowCoverage (DP < 5), LowQual (Q < 50), StrandBias (FS P-value > 60), SNV cluster (three or more SNVs within 10 bp), Poor Mapping Quality (>10% of reads have nonunique alignments).
Additional lters to reduce false positives somatic variants A high-con dence somatic variants for tumor samples without a matched normal control were selected based on the following criteria: 1) mutations were considered true positives if they have a) QUAL ≥ 20, b) genotype quality (GQ) ≥ 20, c) mapping quality (MQ) ≥ 20, d) coverage depth at candidate site (DP) ≥ 20, e) QualByDepth (QD)≥ 2.0, and g) frequency ≥ 25% in tumor samples (20), 2) all common variants with minor allele frequency (MAF) of >1% in the germline/population databases (ExAC, and 1000 Genomes) were ltered out since those variants are deemed polymorphic/benign rather than pathogenic somatic driver mutations, 3) known germline variants reported at dbSNP (version 151) were excluded, and alterations listed as known somatic variations in COSMIC, The International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) were retain (21), 4) MAF threshold of 0.0001 was used in the gnomAD or TopMed database to lter variants for the somatic mutation (22), 5) variants were considered if the variant allele frequency (VAF; also known as variant allele fraction) is deviate from germline polymorphisms (~0.5/1 for heterozygous/homozygous) (23,24), 6) variants were considered if they are truncating variants (nonsense mutations, frameshift deletions/insertions, mutations located at exon-anking regions, and highly conserved intronic splice sites), or apparent missense mutations predicted to be pathogenic by in-silico prediction tools, 7) synonymous variants that were not previously reported as pathogenic and not predicted to alter splicing were ltered out.

Enrichment analysis and candidate gene prioritization
Known or predicted variants to be involved in the lung or in related cancers were predicted using DisGenNET (25). The Human Gene Damage Index server (http://lab.rockefeller.edu/casanova/GDI) was used to predict LoF-intolerant genes. The gene product physically interact with a protein encoded by a known disease gene was explored using NetworkAnalyst (26). Gene product in a pathway associated with the disease and gene expressed in the tissue or organ of interest was retrieved from the literature.

Sanger sequencing
PCR was carried out on ABI Palm thermal cycler (Applied Biosystem, California), using both forward and reverse primers for greater accuracy and the results were analyzed using SeqMan II software (DNASTAR 5.07). The mutations with both base count more than 10% and QV (Quality value) more than 20 were considered to be trusted mutations.

Ethical clearance
The study on 19 NSCLC patients retrieved from the Department of Pathology, A.I.I.M.S., New Delhi was conducted in accordance with the ethical guidelines and regulations of the AIIMS and after obtaining approval from the AIIMS ethics committee. The ethical approval number is IECPG No. 480/29.08.2016. Study participants were enrolled following their voluntary written informed consent.

Patient characteristics and sequencing statistics
A case study of 19 patients was included in this study, where 10 were ADCA (EGFR, ALK and ROS1 wild type), and 9 were SQCC histological subtypes of NSCLC. Patients were early-onset with the average diagnostic age of 56 years, where male: female ratio was 5.6: 1. Among, three patients were non-smoker, whereas one case was with unknown smoking history. Nearly 75% of patients were present with co-morbidities, where all patients were either in stage III or IV. A total of 13.48 GB raw data and 11.98 GB processed data were generated per exome for the tumor sample of ADCA, whereas on an average 13.76 GB raw data and 12.24 GB processed data were obtained for the tumor sample of SQCC using WES. A higher percentage of reads were aligned to the human reference genome (GRCh38) in the tumors of ADCA patients (98.87%; range 96.48-99.95%) compared to tumors of SQCC patients (97.40%; range 85.03-99.51%), indicating that the generated dataset was highly relevant with the reference genome. The average GC content (49.74%) in the tumors of ADCA patients ranged from 42.10 to 64.63%, whereas average GC content (48.50%) in the tumors of SQCC patients ranged from 41.81 to 52.28%. Clinical characteristics and sequencing summary of lung cancer patient's participants in this study are listed in Table S1 in Supplementary File 1.

Detection and characterization of SNVs
After initial variant ltering (as described in methods), a total of 1,157,921 (single-allelic 1,157,792 and multi-allelic 129) variants were retained in the tumors of ADCA patients (n=10) which was slightly higher than total variants (1,076,209; single-allelic 1,076,069 and multi-allelic 140) detected in SQCC patients (n=9). The number of variants per chromosomes ranged from 6,227 (chrY) to 98,788 (chr4) in the tumors of ADCA subtype, whereas ranged from 8,385 (chrY) to 104,645 (chr4) in the tumors of SQCC subtype. The variants rate per chromosomes varied from 1,925 (chr4) to 9,190 (chrY) and revealed on average 1 variants after every 2,667 bases in ADCA subtype, whereas it was after every 2,869 base in SQCC subtype. We observed higher known variants i.e. 616,999 (53.285%) in ADCA subtype as compared to 537,980 (49.988%) in SQCC subtype. The distribution of variants by their type disclosed 1,066,489 SNPs, 38,751 insertions and 52,681 deletions in ADCA subtype. However, the different distributions of insertions/deletions (35,799/49,010) and SNP (991,400) was observed in SQCC subtype. The identi ed SNPs from NSCLC were categorized into two clusters based on nucleotide substitutions i.e. transitions (A/G and C/T) and transversions (A/C, A/T, C/G, and G/T). The transition-to-transversion (Ts/Tv) ratio was slightly higher (1.698 for all SNPs and 2.214 for known SNPs) in ADCA subtype compared to Ts/Tv ratio of 1.4859 (all SNPs) and 2.137 (known SNPs) in SQCC subtype. Among detected SNPs, 23.05% were heterozygous, and 76.93% were homozygous in ADCA subtype, whereas it was 21.35% and 78.64% in SQCC subtype, respectively. The ratio of heterozygous SNVs to homozygous SNVs (Het/Hom ratio) was 0.29 and 0.27 in ADCA and ADCA, respectively where the lower value was associated with true positive variants. The ratio of nonsense to missense mutations (0.007), and missense to silent mutations (0.829) in ADCA subtype was nearly similar to 0.007 and 0.863, respectively in SQCC subtype. The ratio of nonsense to missense and missense to silent mutations in the human genome may re ect a role for natural selection, especially purifying selection. The 'GAT' codons have been replaced maximum times by 'GAC' codons in both ADCA (588) and SQCC (665) subtype. The characterization of SNVs in ADCA and SQCC subtype, revealed that ADCA was more genetically unstable compared to SQCC. Variant's summary identi ed by whole exome sequencing are listed in Figure 1 and Table S2 in Supplementary File 1.

Detection of somatic variants in tumor only samples
To detect somatic SNV, the present study focused on missense variants in the exonic regions or splice sites. The downstream ltering by genomic location revealed a total of 6,712 and 8,000 exonic variants in ADCA and SQCC subtype, respectively. Among exonic SNVs (ADCA subtype), 2,113 were synonymous, 1,985 were nonsynonymous, 28 were frame-shift indels, 51 were nonframe-shift indels, 15 were stop-gain, 2 were stop-loss, and 2,518 were non-coding SNVs. In SQCC-subtype, 3,249 were synonymous, 2,656 were nonsynonymous, 52 were frame-shift indels, 59 were nonframe-shift indels, 36 splicing-variant, 1 was gene_fusion, and 1,947 were non-coding SNVs. Exonic missense, nonsense, stop-loss, frameshift and splice site variants all have potential to affect protein function. Therefore, we excluded the synonymous variants that have no functional impact and retained 4,599 and 4,751 variants from ADCA and SQCC subtype, respectively. As rarity (27) is the key criterion to have a functional effect on the encoded protein, the ltered variants were used to eliminate the common germline mutations (minor allele frequency below 5% in population/germline databases) and as a outcome 1,642 and 2,141 variants were retained in ADCA and SQCC subtype, respectively. After excluding false positive mutations (based on additional lter criteria's 1 a-g given in methodology), 500 and 734 variants in ADCA and SQCC subtype, respectively was observed. To exclude deemed polymorphic/benign variants, high quality rare variants (MAF ≤ 1% and QUAL ≥ 500) were excavated which revealed 94 variants in ADCA subtype, whereas 87 variants in SQCC subtype. To identify candidates likely to have deleterious effects, combination of multiple variant annotation tools were applied that revealed the impact of amino acid changes on protein function based on the combine scores. The variants were classi ed as damaging (predict pathogenic by maximum number of tools), probably damaging (predict pathogenic or benign by an equal number of tools), benign (predict benign by maximum number of tools) and uncertain signi cance (unknown) as per the variant assessment guidelines by the American College of Medical Genetics. The alterations listed in COSMIC, ICGC and TCGA were considered as known somatic variations in this study. The nal outcome, revealed a total of 24 high con dence somatic variants (ADCA=14, SQCC=10) associated with 18 genes and were classi ed as known somatic variant (n=10), deleterious variant (n=8), and variant of uncertain signi cance (VUS) (n=6) (Table 1, Figure 2). The gene (n=11) namely CTBP2, ESRRA, FDFT1, FOLR3, GPRIN2, HRNR, KCNJ18, KRT18, LILRA2, MTRNR2L8, and TEKT4 were observed to be mutated in both ADCA and SQCC histologic subtype of lung cancer. In addition, mutated gene (n=4) namely LRP2, MPRIP, NYX, and TBP were observed in histologic subtype of ADCA only, whereas FBXO6, MIR3689F, and UMPS mutated genes were found in SQCC subtype only ( Figure 3). Out of 24 high con dence somatic variants, 19 (79.17%) variants had a previously known dbSNP ID while the remaining 5 (20.83%) were unassigned, new variants. The 17 variants had higher proportion of driver mutations (≥25% allele frequency) in tumor samples, whereas 21 variants were observed to be true somatic mutation based on variant allele frequency which is used to infer whether a variant comes from somatic cells or inherited from parents when a matched normal sample is not provided.

Knowledge-driven variant prioritization
Though, we followed the guidelines suggested for experimental design and variant ltering, yet we obtained more candidates with likely functional effects than can be veri ed experimentally. High priority candidates based on the biological hypothesis de ned for this study were selected. The analysis of known or predicted variants to be involved in the lung or in a related cancers revealed eight genes (CTBP2, ESRRA, FBXO6, FDFT1, KRT18, MPRIP, TBP, and UMPS) associated with the lung carcinoma. In addition, four genes i.e. LRP2 (bone, breast, colorectal, pancreatic, prostate, and renal cell carcinoma), HRNR (breast, liver and pancreatic carcinoma), FOLR3 (breast and ovarian carcinoma) and TEKT4 (breast and thyroid carcinoma) were involved in other's cancer types. The genes namely MIR3689F, MTRNR2L8, NYX, GPRIN2 and KCNJ18 were observed to be not associated with any type of cancer and considered as low priority genes for the validation. Loss-of-function (LoF) variants in three genes i.e. FOLR3, UMPS and LILRA2 were observed which damage or eliminate the function of the encoded protein. The LoF-intolerant genes (CTBP2, GPRIN2, HRNR and TEKT4) were classi ed as extremely loss of function intolerant (pLI ≥ 0.9), whereas gene (MTRNR2L8) with low pLI scores (≤ 0.1) was considered as LoF-tolerant (common loss-of-function variants) and was not selected for the validation. We also prioritized genes based on the interactome of known disease-associated proteins. to be overexpressed in lung cancer which might have a potential role in cancer development, proliferation, and metastasis. Remarkably, most of the genes were involved in important cancer-related pathways. Close observation showed that the variant in gene MIR3689F (miRNA) and MTRNR2L8 (It is unclear if this is a transcribed protein-coding gene, or if it is a nuclear pseudogene of the mitochondrial MT-RNR2 gene) was incorrect for this study and hence not selected for the validation. The characteristics of candidate variants from public resources and published literature are given in Table 2.

Somatic variants validation by Sanger sequencing
To eliminate false-positive rates of the identi ed high con dence somatic mutations from WES data, we selected 10 genes for Sanger sequencing validations. Interestingly we observed that 7 genes were mutated in more than 60% samples and 3 genes were mutated in either one or two samples. Further, Sanger sequencing results showed 100% concordance in 7 genes and the remaining 3 genes concordances were found only in 80% cases. The mutations observed along with WES and Sanger sequencing data have been depicted in the Table 3 and Figure S1 in Supplementary File 1.

Discussion
In this study, whole-exome sequencing was used to predict genomic alterations in ADCA and SQCC histological subtypes of NSCLC. Overall, we detected 24 high con dence somatic variants (ADCA=14, SQCC=10) in 18 genes. Many of the gene alterations were common in both subtypes whereas few were group speci c, these ndings will throw more light on personalized medicine. Of interest, 16 genes (≥50% mutation frequency) were observed to be mutated in lung cancer, where gene GPRIN2, KCNJ18 and TEKT4 was found mutated in all the patients (100% mutation frequency). Pathway enrichment analysis con rmed that the majority is involved in processes relevant for tumorigenesis such as cell differentiation and proliferation. In the end, 10 novel somatic variants (affecting 10 genes i.e. CTBP2, ESRRA, FBXO6, FOLR3, GPRIN2, HRNR, KCNJ18, MPRIP, TBP, and TEKT4) that were identi ed for the rst time were validated by Sanger sequencing. Our data expand the mutation spectrum for NSCLC and will be a useful resource for the NSCLC research community. Each biomarker has been discussed in details in the following paragraphs.
Mutated genes present in all samples of ADCA and SQCC subtype Of interest, the three genes, KCNJ18, TEKT4, and GRIPN2 are mutated in all NSCLC samples and can serve as common diagnostic markers for both subtypes.
Potassium inwardly rectifying channel subfamily J member 18 (KCNJ18) gene encodes a member of the inwardly rectifying potassium channel family and plays a role in resting membrane potential maintenance.(28) The potassium channel involvement in tumour cell proliferation has been studied previously in colorectal carcinoma cell line DLD-1 and human prostate cancer cell line LNCaP by modulating calcium in ux. (29,30) The E139K (rs76265595), G145S (rs75029097) and A185V (rs73979896) mutations in KCNJ12/KCNJ18 gene were identi ed in esophageal SQCC (31). Mutations were found in KCNJ18 gene in all the NSCLC patients studied but the amino acid variations (c.C631T, p.L211F) were different from those reported earlier. So, it can be postulated that KCNJ18 might be involved in p53 pathway, and it may be investigated in larger cohort of patients.
G protein-regulated inducer of neurite outgrowth 2 (GPRIN2) gene is located on chromosome 16 and encodes glutamate NMDA receptor (32). Variations in this gene have been found in malignant as well as non-malignant diseases (31,33). Rare damaging novel mutations in GPRIN2 genes has been found in 33% melanoma patients (somatic)(34), familial human esophagheal SQCC (germline/somatic)(31) as well as 501Mel melanoma cell line.(34) Mutated GPRIN2 might play a major role in tumorigenesis via glutamate pathway where excess release of glutamate showed more aggressive growth (35,36). In the present study we found p.V241M (c.G721A) mutation in all the NSCLC cases, although mutation observed was different from those reported in the literature (p.A233S, rs11204659). The role of this mutation in tumorigenesis is unclear; however, high frequency observed in our study hints that it may be explored in other studies.
Tektin 4 gene (TEKT4) is present on chromosome 2, encodes tektin4, a constitutive protein of microtubules in cilia, agella, basal bodies, and centrioles (37). The biological function of TEKT4 has not been well explained in cancer initiation and development. Variations in the TEKT4 gene play an important role in papillary thyroid cancer progression. TEKT4 knockdown in papillary thyroid cancer cell lines inhibits tumorigenesis by impairing cell proliferation, colony formation, migration, and invasion via blocking the activity of PI3K/AKT pathway(38). We found TEKT4 gene mutations in all 20 cases studied. However, mutations (c.G1213A, A405T) were different from those reported in papillary thyroid cancer (c.1276_1279delinsACCC). Mutated TEKT4 is associated with increased paclitaxel resistance and poor prognosis in breast cancer patients (39). This mutation might play a vital role in the pathogenesis of lung cancer however, the role of TEKT4 gene in PI3K/AKT pathway signalling and treatment resistance require further investigations.

Mutated genes present in 80% samples of ADCA and SQCC subtype
In addition to the three aforementioned genes, HRNR, FOLR3, CTBP2 and ESSRA are signi cantly mutated in more than 80% of the NSCLC samples. All these mutated genes are directly or indirectly play a role in tumorigenesis and can additionally serve as common pathogenetic link for subtypes of NSCLC.
C-terminal-binding protein 2 (CTBP2) is a member of the CTBP family protein located in the human chromosome 10. CTBP2 is evolutionarily conserved transcriptional co-regulator that interact with DNA binding transcription factors and chromatin remodelers. CTBP2 represses a number of tumour suppressor genes (E-cadherin, PTEN, INK4), induces the epithelial-to-mesenchymal transition and functions as an apoptosis antagonist. Aberrant expression of CTBP2 has been found to be associated with tumorigenesis, cancer progression, and poor prognosis (40,41). Accumulating evidences indicated that CTBP2 expression is elevated in several types of malignancies which include gastric cancer, melanoma, breast cancer, esophageal SQCC, prostate cancer, hepatocellular carcinoma, and ovarian cancer. High expression of CTBP2 results in progression of esophageal SQCC through negatively regulating p16 (INK4A). CTBP2 is considered as a co-factor of TGF-β-signalling pathway in promoting cancer metastasis and also participates in the regulation of WNT signalling. CTBP2 modulated the androgen receptor to promote prostate cancer cell proliferation through c-MYC signalling and also promoted its progression. CTBP2 can be considered as driver oncogene in solid tumours and also as an emerging target in cancer as it encodes a druggable dehydrogenase domain for which rst and second generation inhibitors have already been identi ed (42). CTBP2 plays a crucial role in NSCLC progression, and its depletion can provide a new target for NSCLC treatment (43). CTBP2 was mutated (c.G2292T, Q764H) in 17 cases in the present analysis. We believe that CTBP2 has the potential to become a high-e cacy target however, it warrants further research.
Estrogen related receptor alpha (ESRRA) is evolutionarily related to estrogen receptor and can e ciently bind to estrogen receptor that are commonly shared by many target genes. Over-expression of ESRRA has been found in carcinoma of thyroid, ovary, breast, prostate, colon and endometrium (44,45). It is correlated with poor prognosis. ESRRA suggested being a molecular target for treatment of endometrial cancer. Other investigators reported ESRRA as one of the negative prognostic factors in human prostate cancer. ESRRA is also over-expressed in lung cancer patients and cell line A549 while some studies report low or undetectable (46), estrogen receptor expression in NSCLC cells. ESRRA is up-regulated in NSCLC tissues and promotes the progression, proliferation and invasion via NF-κB mediated up-regulation of IL-6 (47). ESRRA knockdown xenografts sensitized cells to paclitaxel and reduce tumour growth and angiogenesis. Overall review of literature and our preliminary experience with ESRRA suggest that it can be studied in detail in NSCLC patients.
The hornerin gene (HRNR) is clustered on the chromosome region 1q21 and it is a member of the S100 protein family. The function of HRNR is poorly clari ed in the development of human tumours. Altered expression of HRNR was reported to be involved in cancer development, malignant transformation and invasion. Elevated HRNR has been found in many tumours viz lung SQCC, hepatocellular carcinoma, colorectal cancer, prostate cancer, glioblastoma and cell lines, breast carcinoma and cell linesand acute myeloid leukemia (48). HRNR has been found to contribute to hepatocellular carcinoma progression via the regulation of the AKT pathway (49). In lung SQCC and colorectal carcinoma, altered HRNR expression has been associated with disease recurrence (50). In the current study we have also found recurrence occurred in 9 patients all of whom were mutated with the HRNR gene.
The gene for folate receptor gamma(FOLR3) is located on chromosome 11 and consists of ve exons. The FOLR3 receptor is a constitutively secreted form of the folate receptor. FOLR3 is one of the key genes involved in the pemetrexed pathway. Variation in FOLR3 gene affects pemetrexed uptake, metabolism, treatment tolerability, response and survival (51). In NSCLC and mesothelioma patients, variation in the FOLR3 gene has been reported. FOLR3 germline mutation (rs61734430, c.292C>T variant) has been associated with an increased rate of disease progression (51). Pemetrexed is a folate antimetabolite (52) approved for the treatment of advanced NSCLC in the rst line, second line setting as well as for maintenance therapy. We found c.46_47del, Y16fs mutation which is different from reported mutation type. Future studies are required to know the role of FOLR3 as predictive marker for personalised pemetrexed therapy (which can improve both e cacy and tolerability).

Conclusions
In this study, novel somatic mutations and subtype-speci c mutations were found using WES and subsequently con rmed by Sanger sequencing. TBP and MPRIP mutated genes were solely associated with ADCA subtype whereas FBOX6 with SQCC. In addition, GPRIN2, KCNJ18 and TEKT4 genes were found mutated in all the patients (70). Although the mechanism of GPRIN2, KCNJ12 and TEKT4 in tumorigenesis is unclear; our results suggest that these may play a major role in NSCLC and it is worth to be investigated in future. The identi ed target genes from our study can be used as biomarkers for the detection and diagnosis of NSCLC. This study demonstrates that WES can be applied to FFPE clinical samples for nding or validation of biomarkers in cancer research.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. SupplementaryFile1SciRep.docx