Data selection and process
As is well know that the majority of NSCLC is composed of lung squamous carcinoma (LUSC) and lung adenocarcinoma (LUAD). So we obtained the raw counts of transcriptome profiling and clinical data from TCGA-LUAD including 535 cancer samples and 59 non-tumor tissues, and data from TCGA-LUSC including 502 cancer samples and 49 non-tumor tissues. And we divided those data into three groups including the NSCLC group, LUAD group, and LUSC group. All the transcriptome data of lncRNAs, miRNAs, and mRNAs were obtained from The Cancer Genome Atlas (TCGA) database and were annotated through the Ensemble database [14, 15]. Stepwise, all the raw count data were log2 (x+1) transformed and normalized using the "LIMMA" package [16]. Then, we screened out differentially expressed miRNAs (DEMIs) in three groups by the “LIMMA” package with the criteria of |log2FC| > 1, average expression > 1, and adjust P-value < 0.05, and identified differentially expressed lncRNAs (DELs) and differentially expressed mRNAs (DEMs) in three groups with the criteria of |log2FC| > 2, average expression > 2, and adjust P-value < 0.05, respectively.
Survival analysis
Additionally, we removed samples without survival time or survival time less than 7 days to improve the reliability of our study. We first estimated the association between overall survival (OS) and clinical parameters through univariate and multivariate CPHR analysis. Furthermore, we evaluated the relationship between survival time and common DELs expression through Kaplan Meier analysis and univariate CPHR method. Only DELs that their P-value was lower than 0.05 and their expression consistent with prognosis were regarded as candidate survival-related lncRNAs. Combing with clinical risk factors, we performed LASSON Cox regression analysis to obtain the best fitting variables. After selecting the best-fit of OS-related variables by the calculation mentioned above, we further verified their prognostic value through multivariate CPHR analysis. And only variables that P-value was lower than 0.05 in univariate and multivariate CPHR calculations were selected as qualified lncRNAs and were chose for the next analysis.
Constructing a risk score formula and nomogram model
Combining with clinical risk variables and qualified lncRNAs, we performed univariate and multivariate CPHR analysis to identify OS-related biomarkers in 970 patients with NSCLC. Also, we calculated the prognostic risk score of each patient through multivariate CPHR analysis according to the formula as follows: risk score = X1α1 + X2α2 + X3α3 +...+ Xnαn. And patients in the NSCLC group were divided into high-risk groups and low-risk groups based on the median value of risk score. C-indexes were performed to assess the predictive performances of our risk score formula. Next, we evaluated the prognostic differences in high-risk and low-risk groups by Kaplan Meier analysis and T-test. We assessed the prediction performance of the risk formula through ROC curves at 3 and 5 years and computed their AUC values in three groups. Moreover, we established a nomogram to vividly depict the predictive relationship among clinical factors, lncRNAs, and OS. Calibration curves of 3 and 5 years were calculated to assess the reliability of OS prediction between predicted performance and actual ability. All the analyses mentioned above were conducted in NSCLC, LUAD, and LUSC groups, respectively.
Function enrichment analysis
We performed the Gene Ontology (GO) terms and Kyoto Encyclopedia of Gene and Genomes (KEGG) pathways enrichment analysis to elucidate the potential functions of lncRNAs in the nomogram. The common DEMs in the three groups were first identified. Next, the correlation coefficient of each lncRNA with common DEMs was calculated, respectively. To obtain an accurate result, we only selected DEMs with a correlation coefficient greater than 0.2 for further enrichment analysis. Then, we performed the GO and KEGG analysis of lncRNA-related DEMs via the DAVID database, and the Enrichr database [17, 18]. And we depicted the top 10 enriched GO terms and KEGG pathways through a bar plot with the criteria of adjusting P-value < 0.05.
Establishing a ceRNA network
To explore the potential interaction between 5 lincRNAs and miRNAs, the LncBase database that provided miRNAs and lncRNAs interactions according to MREs sites was applied to predict the downstream miRNA of 5 lncRNAs with the criteria of Prediction score > 0.8 [19]. Only miRNAs expressed on the NSCLC group, LUAD group, LUSC group, and LncBase database were chosen as qualified miRNAs. Additionally, the miRDB database and the miRTarBase database are a widely-used tool for miRNA target prediction that were employed to find out potential mRNAs binding to qualified miRNAs [20, 21]. Only miRNAs that validated in miRDB database, miRTarBase database, and DEMs were considered as candidate mRNAs. Finally, to vividly display the interaction of 5 lncRNAs with qualified miRNAs and candidate mRNAs, we constructed a ceRNA network by Cytoscape software (Version 3.7.2) [22].
Statistical analysis
We performed all the statistical analyses mentioned above using R software (version 3.6.1). Briefly, OS was analyzed using the Kaplan–Meier test, and the log-rank T-test was applied to calculate the statistical significance. Univariate and multivariate CPHR analyses were conducted through "Survival" packages. LASSON CPHR was performed using "Survival" and "glmnet" packages. A nomogram was constructed by "Survival" and "rms" packages. And a time-dependent ROC analysis was performed by the ‘‘survivalROC’’ package, C-index by ‘‘survival’’ package, and calibration curve by ‘‘rms’’ package.
The ceRNA network was constructed by Cytoscape software. And we set a p-value lower than 0.05 as s statistical significance.