Raw data acquisition and preprocessing
RNA-seq datasets related to NPC (GSE12452, GSE53819, GSE64634, GSE61218, GSE102349) were sourced from the GEO database. For biomarker screening, datasets GSE12452 and GSE53819 were combined to create a new training set, while datasets GSE64634 and GSE61218 were utilized to establish the validation set. The dataset GSE102349, which contains NPC clinical information, was utilized for survival analysis and subsequent bioinformatics analysis. RNA-seq data for 33 solid tumors and corresponding normal tissues from the TCGA database. The raw data were subjected to probe annotation, normalization, and batch effect removal using the “sva” package(18). The characteristics of the GEO datasets are summarized in Supplementary Table 1.
Differential expression analysis and machine learning algorithms
Differential expression analysis was performed on the training set using the “limma” package, with |log2FC| > 1 and false discovery rate (FDR) < 0.05 as screening criteria for identifying differentially expressed genes (DEGs)(19). The DEGs were further screened utilizing Random Forest (RF), Least Absolute Shrinkage and Selection Operator (LASSO) logistic regression, and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) algorithms. RF was performed using the “randomForest” package, genes with a score > 1 were considered as key genes based on the “importance” function. The LASSO regression model was constructed utilizing the “glmnet” package(20) for variable selection, and a cross-validation method was used to determine the optimal lambda value corresponding to the number of key genes. SVM-RFE was performed using the “e1071” package to conduct 10-fold cross-validation and feature elimination and identified a set of key genes based on the principle of error minimization.
Selection of the key biomarker
The overlapped genes identified by the three algorithms were considered highly related to NPC. The diagnostic value of these genes was assessed by receiver operating characteristic curve (ROC) analysis and validated using the validation set. 88 patients from GSE102349 with complete progression-free survival (PFS) information were used for survival analysis. The RNA-seq data and PFS information from the TCGA were used to conduct a pan-cancer survival analysis. Grouping for survival analysis was based on median expression levels of the overlapped genes (GNA14 and LRRC34). Through the above process, we successfully identified a key biomarker, GNA14, highly related to the diagnosis and prognosis of NPC.
Enrolled patients and follow-up
Tissue specimens for immunohistochemical (IHC) examination of GNA14 were obtained from 165 diagnosed NPC patients treated at Zhongshan City People's Hospital (Guangdong, China) from January 2015 to December 2017, along with 30 patients diagnosed with chronic rhinosinusitis during the same period. This study was conducted in accordance with the Declaration of Helsinki. The study received approval from the Clinical Research Ethics Committee of the Zhongshan City People's Hospital. All NPC tissues were collected before anti-cancer treatment. All patients following these criteria were retrospectively enrolled: (a) histopathologically confirmed NPC; (b) clinical stages I-IVa according to the 8th edition AJCC/UICC staging system; (c) received either solely intensity-modulated radiation therapy, concurrent chemoradiotherapy, with or without induction chemotherapy or adjuvant chemotherapy; (d) had complete baseline data; (e) had no severe heart, lung, liver, kidney diseases, or other cancers at NPC diagnosis. NPC patients who completed treatment were followed monthly for the first 3 months, every 3 months for the next 3 years, every 6 months for the next 2 years, and annually thereafter. Follow-up ended in December 2023.
IHC and scoring strategies
All Tissues were fixed in 4% formaldehyde and embedded in paraffin. To assess the expression level of GAN14, IHC examinations were conducted using the GNA14 antibody (Polyclonal, rabbit, 13350-1-AP-50UL, 1:200 dilution, Wuhan Sanying). Tissues were cut into 4 µm sections, deparaffinized with xylene, and subsequently rehydrated with graded ethanol. The slides were then incubated with a 3% H2O2 solution for 10 minutes to quench endogenous peroxidase activity, and 0.01 mmol/L citrate buffer (pH 6.0) was used for antigen retrieval in a high-pressure cooker. The sections were incubated with the primary antibody for 3 hours at room temperature, followed by rinsing with TBS. Following incubation with the secondary antibody (Anti-Mouse/Rabbit Universal Immunohistochemical Test Kit, PK10006, Wuhan Sanying), the sections were stained with DAB. All slides were then re-stained with hematoxylin, examined under the microscope, and photographed. Positive control sections were provided by the antibody manufacturer. Cell staining intensity was scored based on a previous study(21, 22). The IHC results were assessed by calculating the total score (0–12) by multiplying the intensity of positive staining (negative, 0; weak, 1; moderate, 2; or strong, 3) by the proportion of target immunopositive cells (< 25%, 1; 25–49%, 2; 50–75%, 3; or > 75%, 4). IHC results were evaluated independently by two pathologists, and any discrepancies were resolved through consensus. Based on the median IHC score, the high GNA14 expression group was defined as samples with an IHC score > 4, and the low GNA14 expression group was defined as samples with an IHC score ≤ 4.
Expression profile of GNA14
RNA-seq data were extracted from the training and validation sets to compare the expression levels of GNA14 in NPC and normal tissue. IHC staining was employed to detect and compare the expression of GNA14 in nasopharyngeal tissues from patients with chronic sinusitis and NPC. Furthermore, we assessed the differential expression of GNA14 in various solid tumors by analyzing RNA-Seq data from TCGA database, covering 33 different types of tumors and their corresponding normal tissues.
Functional Enrichment Analysis
NPC patients in GSE102349 were categorized into high and low GNA14 expression groups. Identification of DEGs between two groups using the “limma” package (|log2FC| >1, FDR < 0.05). Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways of the DEGs were performed using the “clusterProfiler” package. We selected “C5.go.Hs.symbols.gmt” from the Molecular Signatures Database (MSigDB) as the reference gene set, and gene set enrichment analysis (GSEA) was performed for genes in the GNA14 high and low expression groups.
Immune infiltration and drug sensitivity analysis
The “estimate” package in R was employed to predict the ImmuneScore, StromalScore, and their combined scores (ESTIMATE score) in tumor samples. Correlations between GNA14 expression and 79 immune checkpoint genes were analyzed(23), and the immune checkpoint genes strongly associated with GNA14 were selected (p < 0.001). The single-sample gene set enrichment analysis (ssGSEA) algorithm was utilized to assess the relationship between the proportions of various immune cell types in NPC and GAN14 expression. The gene annotation file contains 28 tumor-infiltrating immune cells from TISIDB. Furthermore, the “oncoPredict” package was used to estimate the chemotherapeutic response of patients from high and low GNA14 groups. The chemotherapeutic response was determined by the half maximal inhibitory concentration (IC50) of each NPC patient and the IC50 data was sourced from the GDSC website (https://www.cancerrxgene.org/).
Statistical Analysis
Descriptive statistical analysis was performed on the collected data, expressed as mean ± standard deviation (SD) or percentage (%). Continuous variables were compared using the independent samples t-test or Mann-Whitney U test. Correlations between variables were assessed using Pearson or Spearman correlation analysis. The division of high and low GNA14 expression groups was based on the median GNA14 expression level. Survival analysis was performed using the Kaplan-Meier method, and log-rank tests were used to compare differences. Correlations between GNA14 expression and clinicopathological features were analyzed using the chi-square test or Fisher exact test. The cutoff value for the high and low EBV DNA groups was chosen as 4000 copies per milliliter based on a previous study of the prognostic value of EBV DNA(24). Univariate and multivariate Cox regression models were used to identify the risk factors influencing the prognosis of NPC. A nomogram was constructed to predict the 3-year and 5-year DMFS of NPC patients based on the results of multivariate analysis, and the predictive ability of the models was assessed by calculating the consistency index (C-index). All tests were two-tailed, with significance levels set at p < 0.05. All statistical analyses were performed using R software (version 4.0.0) or SPSS software (version 20.0, IBM, New York, USA).