Bioinformatics analysis of NSCLC multi-omics data


 The associated publication reports proteogenomic analysis of non-small cell lung cancer (NSCLC), where we identified molecular subtypes with distinct immune evasion mechanisms and therapeutic targets, and validated our classification method in separate clinical cohorts. This protocol describes sections of the bioinformatics analysis of the multi-omics data, namely, data analysis and processing for panel sequencing, identification of cancer- and driver-related proteins in proteomics data, proteogenomics search, and machine learning-based classifiers for NSCLC subtyping. Specifically, a cohort classifier was built using support-vector machine-recursive feature elimination (SVM-RFE) algorithm applied to in-depth proteomics data from a cohort of 141 samples. The classifier was then validated in three external datasets. Another classifier, suitable for single-sample subtyping, was built using k-top scoring pairs (k-TSP) algorithm applied to label-free data from a cohort of 136 samples. The k-TSP-based classifier was validated in two independent cohorts and an additional external dataset.


Introduction
Reagents Equipment Procedure Panel sequencing of early-stage NSCLC cohort: data analysis BALSAMIC work ow v4.0.0 1 was used to analyze each of the FASTQ les. In summary, we rst quality controlled FASTQ les using FastQC v0.11.5 2 . Adapter sequences and low-quality bases were trimmed using fastp v0.20.0 3 . Trimmed reads were mapped to the reference genome hg19 using BWA MEM v0.7.15 4 . The resulted SAM les were converted to BAM les and sorted using samtools v1.6 5,6 .
Results of the quality-controlled steps were summarized by MultiQC v1.7 7 . For each sample, somatic mutations were called using VarDict v2019.06.04 8 in tumor-only mode and annotated using Ensembl VEP v94.5 9 . Variants recurrently found (more than 10 cases) in the cohort and not previously described as oncogenic were manually reviewed to detect likely artifacts, which were removed from downstream analyses together with variants showing low quality calls. Variants were classi ed as putative functional versus passengers by using the interpretation pipeline developed by the Molecular Tumor Board Portal (accessed 2/2020), a clinical decision support tool that evaluates the functional and predictive relevance of genomic alterations 10 . Brie y, the portal classi es a variant as biologically relevant combining up-todate results from clinical and preclinical studies, bona de biological assumptions and bioinformatics calculations.
For tumor mutational load calculations, rst all low-quality variants were removed via a hard lter of total read depth (DP) > 50 and alternative allele depth (AD) > 5. Thereafter, we followed the procedure demonstrated by Chalmers et al 11 .
Downstream analysis of proteomics and proteogenomics data Cancer-and driver-related proteins (CDRPs) CDRPs were de ned based on membership in 10 cancer-related signaling pathways as previously described 12 , and/or if causally linked to cancer according to the COSMIC cancer gene census effort 13 . In total 832 CDRPs were identi ed and quanti ed in the current early-stage NSCLC cohort. CDRP annotation was performed using previously published information related to protein function as transcription factors, chromatin remodeling factor or transcription factor co-factor according to AnimalTFdb 14 ; protein kinase 15 ; protein phosphatase 16 ; ubiquitin E3 ligase 17 ; protein subcellular localization according to SubCellBarCode resource (www.subcellbarcode.org) 18 ; and annotation as drug target 19 .

Proteogenomics 6FT search
The IPAW proteogenomics pipeline for novel peptides was implemented as previously described 20 . Novel peptides from the 6-reading frame translation (6FT) search that passed SpectrumAI lter in the majority of TMT sets and lacked a SNPdb match were retained for outlier detection. Assuming that such peptides should be present in one or in a few samples and that the per set quanti cation depends on the sample composition, ratios to the reference pool were re-centered by the median and log2 transformed. Outlying peptides were determined by the same threshold used for the cancer-testis antigen analysis (i.e., ratio > 3).
Machine-learning based classi ers for NSCLC proteomics data First, we partitioned the dataset randomly into two parts: 80% for training and 20% for testing. To select the most important features in each iteration, support-vector machine-recursive feature elimination (SVM-RFE) algorithm was applied 32 . The algorithm was implemented using scikit-learn library (v0.21.2) in python (version 3) 30 . The model with the 200 most important features was then applied to the testing data to estimate the accuracy.
Finally, the overall accuracy was reported as the average accuracy from the 100 MCCV iterations, and we selected the most frequently used 200 features from the output of MCCV (100 iterations) to build the nal model and deploy it.
Applying SVM classi er to external data As the model was built on normalized proteomics data, training and testing data should be in the same scale in order to estimate the evaluation of the model robustly. Therefore, the model was built on Z-scoredistributed data and the external data (GEO 33 , TCGA 34 , and Gillette et al. 35 ) were transformed to Z-score distribution.

k-Top Scoring Pairs (k-TSP)-based single-sample classi er
The k-TSP algorithm 36 , developed for solving binary classi cation problems, was used here for development of a diagnostic single-sample classi er intended for a clinical setting. The classi er was trained and applied on label-free data-independent acquisition mass spectrometry (DIA-MS) data generated as described in Lehtiö et al. (associated publication). To remove samples with low-quality DIA data, sample-wise correlation (Spearman) analysis between the in-depth TMT-HiRIEF-LC-MS-DDA data (Lehtiö et al. associated publication) and the DIA-MS analysis was performed for overlapping proteins. This analysis revealed ve samples with low correlation, possibly due to low amounts of available starting material for the DIA-MS analysis, and these samples were excluded from downstream analysis.
For an initial ltering to remove uninformative proteins (features) and to reduce computation time for downstream analysis, we applied DEqMS 29 as described above (BH adjusted p-value < 0.01 and |log2(FC)| > 0.5). Comparison between differentially abundant 5,872 proteins and the 6,717 proteins identi ed in the DIA analysis resulted in an overlap of 3,028 proteins.
Missing values in DIA data were imputed by lling baselines signals for each protein, individually. We assumed that any resulting missing value was due to the lack of protein abundance in the sample. Therefore, we imputed the missing values with baseline signals instead of inferring the missing value based on protein abundance of other samples. We sampled value from a Gaussian distribution N(μ, σ) where μ is half of the minimum MS1 peak area of the protein abundance and σ is 2 in order to replace missing values with baseline signals for each sample independently.
Protein-wise correlation (Spearman and Pearson) between TMT-HiRIEF-LC-MS-DDA and imputed DIA-MS was computed for these 3,028 proteins, and proteins with greater than 0.3 Spearman and 0.5 Pearson correlations were included, resulting in a list of 2003 proteins. Next, for each comparison, the most upregulated and downregulated 100 (50 × 2) proteins were included in subsequent analysis resulting in a list of 760 proteins.
For k-TSP classi cation, we modi ed the 'switchbox' R package (v1.24.0) 37 for multi-class classi cation problems. The only parameter to tune is the number of feature pairs (k) used in the k-TSP algorithm (optimized k = 13). One-versus-one classi ers were built to classify samples (in total 15 classi ers for the 6 subtypes), and for each classi er the sample was classi ed into either of the subtypes. Consequently, each sample was classi ed 15 times and the nal decision was made based on a majority vote. In case of a tie in classi cations, direct comparison between the subtypes with the highest equal votes determined the nal classi cation. When there was a tie between direct subtypes comparison, the sample was labeled "unclassi ed" to prevent nal ambiguous calls.
As for the SVM classi er we used the MCCV method 31 to provide an unbiased performance estimation and to optimize the classi er. The whole process (described below) was repeated 100 times and for each iteration the testing performance (accuracy) and 195 (15 × 13) most important feature pairs were reported.
First, we partitioned the dataset randomly into two parts: 80% for training and 20% for testing. In the training data, 15 classi ers (Subtype 1 vs. Subtype 2, Subtype 1 vs. Subtype 3, etc.) were built independently, while simultaneously determining the 13 feature pairs for each classi er. Next, the corresponding classi ers were applied to the testing data to estimate the classi er accuracy.
Finally, the overall accuracy was reported as the average accuracy from the 100 MCCV iterations. To build the nal model and deploy it, all feature pairs from the MCCV iterations were sorted based on frequency and the top 13 most frequent pairs for each of the 15 classi ers were selected resulting in a total of 195 feature pairs (244 marker proteins).
Applying k-TSP classi er to independent validation-and late-stage cohorts datasets The k-TSP algorithm does not require any data normalization steps. It only compares the quantitative values of the proteins in each pair and assign samples to subtypes based on rules established during training. Thus, the k-TSP algorithm was directly applied to new DIA-MS sample data from two independent NSCLC cohorts after imputation of the missing values with 1. Furthermore, the classi er was applied to an external data-dependent acquisition (DDA)-MS data from a NSCLC adenocarcinoma cohort 38 .

Time Taken
Anticipated Results