Expression profiles and acquisition of clinical data
The RNA expression profiles for LUAD patients was obtained from Gene Expression Omnibus (GEO) database [29]. This dataset was sequenced on the GPL16791 (Illumina HiSeq 2500 (Homo sapiens)) platform and can be accessed using accession number GSE81089. The database contained RNA expression profiles of 199 LUAD patients as well as other relevant clinical information such as age, gender, tumor stage, life style and ps who (WHO performance status). This data-set was mainly used for subsequent acquisition of immune-related co-expression modules. TCGAbiolinks [30] package was then used to download RNA-seq data and related clinical information of LUAD patients from the TCGA database. The dataset contained expression profiles of 524 patients and was mainly used for construction of prognostic models based on immune-related lncRNAs. RNA sequence data for these samples were generated using the IlluminaHiSeq platform. Clinical metadata of these patients included their age, TNM stage, stage, overall survival data and outcome.
Acquisition of immune-related genes
Immune-related genes were downloaded from innateDB[31] and ImmPort[32] (Immunology Database and Analysis Portal). Innate immune-related genes from 35,747 publications were included in the two databases (innateDB and ImmPort) and included all genes involved in different immune processes from a variety of experimental studies. We combined the immune-related RNAs from these two databases to obtain a total of 1,238 genes (Table S1).
Weighted gene co-expression network analysis
We employed the Weighted gene co-expression network analysis (WGCNA) [33, 34] to obtain the candidate immune-related lncRNAs and GSE81089 to construct a co-expression network. In order to establish a reliable RNA co-expression network, we first screened genes and lncRNAs according to the following criteria: 1) for each gene and lncRNA, RNA with zero expression on more than 60% of LUAD samples was deleted. 2) RNA expression data with a variance equal to 0 was deleted. Then, the WGCNA package [34] was used to construct a topological overlap matrix (TOM) of immune-related RNA and lncRNAs which represents the relative interconnectedness between each pair of genes. To enable better characterization of the biological network in the constructed TOM, we selected the optimal soft threshold here to satisfy the scale-free topology of the network as previously described [34]. Then, we proceeded to do modular analysis of immune-related gene and lncRNAs, and calculated Pearson correlation coefficients between different modules as well as clinically related indices. If the proportion of immune-related genes contained in a module was greater than 1/4, and the module was significantly correlated to prognostic indicators (p-value < 0.05), then the module was selected for immune-related prognostic module. LncRNAs in these modules are defined as candidate immune-related lncRNAs.
Identification of immune-related lncRNAs
To further screen the immune-related lncRNAs controlling LUAD prognosis, the Cox proportional-hazards (Coxph) model, Least Absolute Shrinkage and Selection Operator (LASSO) regression models were applied during WGCNA analysis. Briefly, the TCGA-LUAD samples were split into a training and testing set in the ratio 2:1 for developing and validating the prognostic model. The univariate Coxph model, adjusting for age, gender, stage, T stage, N stage and M stage of the patients, was used to select the lncRNAs with prognostic significance (p <0.05). Then,these prognostic biomarkers were screened using multivariate LASSO regression via the glment [35] and survival[36] packages implemented in the R version 3.5.2 [37].
Establishment of a risk assessment model
Multivariate Coxph analysis was first performed on each immune-related lncRNA based on the TCGA-LUAD training dataset. For further construction of the prognostic risk score model, a coefficient for each immune-related lncRNA was extracted and the model constructed based on linear combination of their expression levels as follows:
where γi is the coefficient of lncRNA i obtained from multivariate Coxph regression, Ei is the expression of RNA i, and N is the number of lncRNAs in the Coxph model. Then, the risk score for each patient was calculated based on expression of the selected lncRNAs. The patients in the TCGA-LUAD training, testing and entire dataset were divided into high and low-risk subgroups for further analysis according to the median cut-off value obtained from the risk scores. Association between the risk score and each clinical feature was assessed using Fisher’s exact test, with p < 0.05 considered a significant cut-off. TimeROC package [38] was further used to separately draw receiver operator characteristics (ROC) curves for the datasets with the area under the ROC curve (AUC) calculated to examine and compare performance of the classifier.
Determination of immune cell infiltration and leukocyte subtypes in the LUAD samples
To investigate the relationship between ILRS and the degree of immune cell infiltration, we used the ESTIMATE package [39] to generate stromal and immune scores as well as tumor purity of TCGA-LUAD patients. The stromal and immune scores represented presence of stromal and immune cells, respectively in tumor tissues while tumor purity was calculated based on the combination of these scores. They were obtained on the basis of RNA expression profiles of the LUAD samples. Furthermore, we used CIBERSORTx [40] to assess the relative fraction of 22 leukocyte subtypes based on all mRNA transcripts in the TCGA-LUAD cohort, with samples showing statistical significance (p < 0.05) kept for further analysis. A student’s t test was conducted using the ggpubr package [41] to compare differences in immune cell infiltration and leukocyte fractions across the high and low ILRS groups.
Enrichment analysis
We used the Limma package [42] to obtain differentially expressed genes (DEGs) at FDR< 0.01, and |fold change| > 1.5 between patients with a high and low risk scores. We then used the DAVID Functional Annotation Bioinformatics Microarray Analysis [43] to perform gene ontology (GO) and KEGG enrichment analyses on the DEGs. GO enrichment analysis included biological process (BP), molecular function (MF) and subcellular localization (CC). Enrichment analysis and visualization were conducted using the TCGAanalyze_EAcomplete and TCGAvisualize_EAbarplot2 functions implemented in TCGAbiolinck package [30] respectively.
Survival analysis
To evaluate the prognostic value of different variables for LUAD, including lncNRA expression level, risk score and immune infiltration, we performed the Kaplan-Meier (KM) analysis and considered p <0.05, obtained by log-rank test to be statistically significant. KM analysis and log-rank test were mainly implemented by the "sruvival" package [36] with the "survminer" package [44] used for drawing the KM curve.
Patient stratification
We used expression profiles from immune-related lncRNAs to classify the LUAD patients. First, a z-score transformation was used to normalize immune-related lncRNA expression in the TCGA-LUAD cohort as previously described [45]. Then, a Pearson’s correlation coefficient (PCC) between samples was computed to assess their similarity followed by the use of partition around medoids algorithm to subsequently divide the samples into k subgroups [63]. The maximizing cluster reliability approach was adopted to determine the optimal number of subgroups as previously described. Finally, a consensus clustering analysis was carried out using the “ConsensusClusterPlus” package[46].
To further validate and visualize the immune-related lncRNA derived clustering, a principal component analysis (PCA) was performed among the TCGA-LUAD tumor samples. We conducted the PCA analysis based on normalized expression data using the “prcomp” function from the “stats” module implemented in R 3.5.2 software. We then selected the first two principal components that had most variance and projected each sample into two-dimensional space. Finally, “ggplot2” package [47] was used to visualize the expression pattern of all samples as well as the immune-related lncRNA -derived clusters.