Patient characteristics and sequencing statistics
A case study of 19 patients was included in this study, where 10 were ADCA (EGFR, ALK and ROS1 wild type), and 9 were SQCC histological subtypes of NSCLC. Patients were early-onset with the average diagnostic age of 56 years, where male: female ratio was 5.6: 1. Among, three patients were non-smoker, whereas one case was with unknown smoking history. Nearly 75% of patients were present with co-morbidities, where all patients were either in stage III or IV. A total of 13.48 GB raw data and 11.98 GB processed data were generated per exome for the tumor sample of ADCA, whereas on an average 13.76 GB raw data and 12.24 GB processed data were obtained for the tumor sample of SQCC using WES. A higher percentage of reads were aligned to the human reference genome (GRCh38) in the tumors of ADCA patients (98.87%; range 96.48-99.95%) compared to tumors of SQCC patients (97.40%; range 85.03-99.51%), indicating that the generated dataset was highly relevant with the reference genome. The average GC content (49.74%) in the tumors of ADCA patients ranged from 42.10 to 64.63%, whereas average GC content (48.50%) in the tumors of SQCC patients ranged from 41.81 to 52.28%. Clinical characteristics and sequencing summary of lung cancer patient’s participants in this study are listed in Table S1 in Supplementary File 1.
Detection and characterization of SNVs
After initial variant filtering (as described in methods), a total of 1,157,921 (single-allelic 1,157,792 and multi-allelic 129) variants were retained in the tumors of ADCA patients (n=10) which was slightly higher than total variants (1,076,209; single-allelic 1,076,069 and multi-allelic 140) detected in SQCC patients (n=9). The number of variants per chromosomes ranged from 6,227 (chrY) to 98,788 (chr4) in the tumors of ADCA subtype, whereas ranged from 8,385 (chrY) to 104,645 (chr4) in the tumors of SQCC subtype. The variants rate per chromosomes varied from 1,925 (chr4) to 9,190 (chrY) and revealed on average 1 variants after every 2,667 bases in ADCA subtype, whereas it was after every 2,869 base in SQCC subtype. We observed higher known variants i.e. 616,999 (53.285%) in ADCA subtype as compared to 537,980 (49.988%) in SQCC subtype. The distribution of variants by their type disclosed 1,066,489 SNPs, 38,751 insertions and 52,681 deletions in ADCA subtype. However, the different distributions of insertions/deletions (35,799/49,010) and SNP (991,400) was observed in SQCC subtype. The identified SNPs from NSCLC were categorized into two clusters based on nucleotide substitutions i.e. transitions (A/G and C/T) and transversions (A/C, A/T, C/G, and G/T). The transition-to-transversion (Ts/Tv) ratio was slightly higher (1.698 for all SNPs and 2.214 for known SNPs) in ADCA subtype compared to Ts/Tv ratio of 1.4859 (all SNPs) and 2.137 (known SNPs) in SQCC subtype. Among detected SNPs, 23.05% were heterozygous, and 76.93% were homozygous in ADCA subtype, whereas it was 21.35% and 78.64% in SQCC subtype, respectively. The ratio of heterozygous SNVs to homozygous SNVs (Het/Hom ratio) was 0.29 and 0.27 in ADCA and ADCA, respectively where the lower value was associated with true positive variants. The ratio of nonsense to missense mutations (0.007), and missense to silent mutations (0.829) in ADCA subtype was nearly similar to 0.007 and 0.863, respectively in SQCC subtype. The ratio of nonsense to missense and missense to silent mutations in the human genome may reflect a role for natural selection, especially purifying selection. The ‘GAT’ codons have been replaced maximum times by ‘GAC’ codons in both ADCA (588) and SQCC (665) subtype. The characterization of SNVs in ADCA and SQCC subtype, revealed that ADCA was more genetically unstable compared to SQCC. Variant’s summary identified by whole exome sequencing are listed in Figure 1 and Table S2 in Supplementary File 1.
Detection of somatic variants in tumor only samples
To detect somatic SNV, the present study focused on missense variants in the exonic regions or splice sites. The downstream filtering by genomic location revealed a total of 6,712 and 8,000 exonic variants in ADCA and SQCC subtype, respectively. Among exonic SNVs (ADCA subtype), 2,113 were synonymous, 1,985 were nonsynonymous, 28 were frame-shift indels, 51 were nonframe-shift indels, 15 were stop-gain, 2 were stop-loss, and 2,518 were non-coding SNVs. In SQCC-subtype, 3,249 were synonymous, 2,656 were nonsynonymous, 52 were frame-shift indels, 59 were nonframe-shift indels, 36 splicing-variant, 1 was gene_fusion, and 1,947 were non-coding SNVs. Exonic missense, nonsense, stop-loss, frameshift and splice site variants all have potential to affect protein function. Therefore, we excluded the synonymous variants that have no functional impact and retained 4,599 and 4,751 variants from ADCA and SQCC subtype, respectively. As rarity(27) is the key criterion to have a functional effect on the encoded protein, the filtered variants were used to eliminate the common germline mutations (minor allele frequency below 5% in population/germline databases) and as a outcome 1,642 and 2,141 variants were retained in ADCA and SQCC subtype, respectively. After excluding false positive mutations (based on additional filter criteria’s 1a-ggiven in methodology), 500 and 734 variants in ADCA and SQCC subtype, respectively was observed. To exclude deemed polymorphic/benign variants, high quality rare variants (MAF ≤ 1% and QUAL ≥ 500) were excavated which revealed 94 variants in ADCA subtype, whereas 87 variants in SQCC subtype. To identify candidates likely to have deleterious effects, combination of multiple variant annotation tools were applied that revealed the impact of amino acid changes on protein function based on the combine scores. The variants were classified as damaging (predict pathogenic by maximum number of tools), probably damaging (predict pathogenic or benign by an equal number of tools), benign (predict benign by maximum number of tools) and uncertain significance (unknown) as per the variant assessment guidelines by the American College of Medical Genetics. The alterations listed in COSMIC, ICGC and TCGA were considered as known somatic variations in this study. The final outcome, revealed a total of 24 high confidence somatic variants (ADCA=14, SQCC=10) associated with 18 genes and were classified as known somatic variant (n=10), deleterious variant (n=8), and variant of uncertain significance (VUS) (n=6) (Table 1, Figure 2). The gene (n=11) namely CTBP2, ESRRA, FDFT1, FOLR3, GPRIN2, HRNR, KCNJ18, KRT18, LILRA2, MTRNR2L8, and TEKT4 were observed to be mutated in both ADCA and SQCC histologic subtype of lung cancer. In addition, mutated gene (n=4) namely LRP2, MPRIP, NYX, and TBP were observed in histologic subtype of ADCA only, whereas FBXO6, MIR3689F, and UMPS mutated genes were found in SQCC subtype only (Figure 3). Out of 24 high confidence somatic variants, 19 (79.17%) variants had a previously known dbSNP ID while the remaining 5 (20.83%) were unassigned, new variants. The 17 variants had higher proportion of driver mutations (≥25% allele frequency) in tumor samples, whereas 21 variants were observed to be true somatic mutation based on variant allele frequency which is used to infer whether a variant comes from somatic cells or inherited from parents when a matched normal sample is not provided.
Knowledge-driven variant prioritization
Though, we followed the guidelines suggested for experimental design and variant filtering, yet we obtained more candidates with likely functional effects than can be verified experimentally. High priority candidates based on the biological hypothesis defined for this study were selected. The analysis of known or predicted variants to be involved in the lung or in a related cancers revealed eight genes (CTBP2, ESRRA, FBXO6, FDFT1, KRT18, MPRIP, TBP, and UMPS) associated with the lung carcinoma. In addition, four genes i.e. LRP2 (bone, breast, colorectal, pancreatic, prostate, and renal cell carcinoma), HRNR (breast, liver and pancreatic carcinoma), FOLR3 (breast and ovarian carcinoma) and TEKT4 (breast and thyroid carcinoma) were involved in other’s cancer types. The genes namely MIR3689F, MTRNR2L8, NYX, GPRIN2 and KCNJ18 were observed to be not associated with any type of cancer and considered as low priority genes for the validation. Loss-of-function (LoF) variants in three genes i.e. FOLR3, UMPS and LILRA2 were observed which damage or eliminate the function of the encoded protein. The LoF-intolerant genes (CTBP2, GPRIN2, HRNR and TEKT4) were classified as extremely loss of function intolerant (pLI ≥ 0.9), whereas gene (MTRNR2L8) with low pLI scores (≤ 0.1) was considered as LoF-tolerant (common loss-of-function variants) and was not selected for the validation. We also prioritized genes based on the interactome of known disease-associated proteins. The genes, FBXO6 (degree 153; betweenness centrality 77253.04), TBP (degree 152; betweenness centrality 78171.64), KRT18 (degree 89; betweenness centrality 44603.52), CTBP2 (degree 64; betweenness centrality 33732.54), ESRRA (degree 44; betweenness centrality 21895.15), LRP2 (degree 38; betweenness centrality 20031.73), MPRIP (degree 24; betweenness centrality 10184.34), FDFT1 (degree 17; betweenness centrality 7452.33), UMPS (degree 15; betweenness centrality 7266.93), and HRNR (degree 14; betweenness centrality 6258.26) were observed to be the most highly ranked hub genes in this study. Moreover, relevant information for the genes of interest was retrieved from the literature. Nine gene/proteins including HRNR, KCNJ18, ESRRA, MPRIP, FBXO6, FOLR3, FDFT1, KRT18, and LILRA were observed to be overexpressed in lung cancer which might have a potential role in cancer development, proliferation, and metastasis. Remarkably, most of the genes were involved in important cancer-related pathways. Close observation showed that the variant in gene MIR3689F (miRNA) and MTRNR2L8 (It is unclear if this is a transcribed protein-coding gene, or if it is a nuclear pseudogene of the mitochondrial MT-RNR2 gene) was incorrect for this study and hence not selected for the validation. The characteristics of candidate variants from public resources and published literature are given in Table 2.
Somatic variants validation by Sanger sequencing
To eliminate false-positive rates of the identified high confidence somatic mutations from WES data, we selected 10 genes for Sanger sequencing validations. Interestingly we observed that 7 genes were mutated in more than 60% samples and 3 genes were mutated in either one or two samples. Further, Sanger sequencing results showed 100% concordance in 7 genes and the remaining 3 genes concordances were found only in 80% cases. The mutations observed along with WES and Sanger sequencing data have been depicted in the Table 3 and Figure S1 in Supplementary File 1.
The gene-wise results of Sanger sequencing are given below:
TEKT1, GPRIN2 and KCNJ18 point mutation: These three genes were found mutated in all the samples by WES. On further validation by Sanger sequencing, we also found TEKT1 (exon6:c.G1213A:p.A405T) were positive in 18 cases, GPRIN2 (exon3:c.G721A:p.V241M) in 16 cases and KCNJ18 (c.C631T:p.L211F) point mutations in all samples.
Hornerin (HRNR) and FOLR3 mutation: These two genes were found mutated in 18 samples by WES. Validation by Sanger sequencing revealed 100% concordance. The Hornerin gene was present with the point mutation in exon 3 (c.C5050G:p.R1684G) and FOLR3 gene with deletion in exon 3 (c.46_47del:p.Y16fs).
ESSRA and CTBP2 mutation: WES revealed ESSRA gene was mutated in 16 cases and CTBP2 in 17 cases. Sanger sequencing revealed ESRRA gene exon 7 point mutation (c.G1127T:p.R376L) and deletion (c.1130_1132del:p.377_378del) in 12 and 5 cases whereas CTBP2 (exon5:c.G2292T:p.Q764H) point mutations in 15 cases.
MPRIP, TBP and FBXO6 mutation: Exon 6 deletion of MPRIP gene (c.537_539del:p.179_180del) was found in 2 samples whereas TBP gene (exon3:c.222_223insCAG:p.Q74delinsQQ) deletion and FBXO6 gene (exon2:c.A151G:p.M51V) point mutation were found in one case each by WES and Sanger sequencing.