Literature on the prognosis and treatment response of ESCC published up to 31 December 2018 was retrieved from the NCBI PubMed, Web of Science and Embase databases by two independent researchers. The search keywords were “esophageal squamous cell cancer”, “prognosis or recurrence or resistance or sensitivity” and “chemotherapy or chemoradiotherapy”. All relevant studies were retrieved.
Inclusion and exclusion criteria
We selected studies using the following criteria: (1) clinical prognosis of patients with ESCC; (2) prediction of clinical response to chemotherapy or chemoradiotherapy; (3) clinical recurrence of ESCC; (4) retrospective and prospective cohort studies; (5) studies published in English. When the reviewers disagreed, a third reviewer was invited to discuss the eligibility of the studies in question.
Publicly available mRNA transcriptome data of ESCC were obtained from the Gene Expression Omnibus (GEO) and TCGA, namely the GSE53625 and TCGA-ESCC data sets. GSE53625 included 179 patients with ESCC, who were randomly divided into a training cohort of 134 patients and a test cohort of 45 patients. Because the GSE53625 data had been normalized in the original study, and all samples in the data set were paired, the difference between the expression value in cancer tissue and that in the corresponding adjacent tissue was taken as the input data for all subsequent calculations. TCGA-ESCC contained 82 patients with ESCC, of whom 37 Vietnamese patients were used for independent validation.
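The two preprocessing steps described above, taking the tumor-minus-adjacent difference and randomly splitting 179 patients into cohorts of 134 and 45, can be sketched as follows. This is a minimal illustration in Python, not the authors' R code; the function names, gene names, expression values and random seed are all hypothetical.

```python
import random

def paired_difference(tumor, adjacent):
    """Per-gene difference: tumor expression minus matched adjacent-tissue expression."""
    return {gene: tumor[gene] - adjacent[gene] for gene in tumor}

def split_cohort(patient_ids, n_train, seed=0):
    """Randomly split patient IDs into a training cohort and a test cohort."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    return ids[:n_train], ids[n_train:]

# Hypothetical normalized values for one patient (not data from the study).
tumor = {"GENE_A": 8.2, "GENE_B": 5.1}
adjacent = {"GENE_A": 6.9, "GENE_B": 5.6}
diff = paired_difference(tumor, adjacent)

train_ids, test_ids = split_cohort(range(179), n_train=134)
print(len(train_ids), len(test_ids))  # 134 45
```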
Patients and clinical samples
Eighty-six fresh-frozen ESCC samples with matched noncancerous mucosa were collected from the First Affiliated Hospital of Henan University of Science and Technology between 2012 and 2017. All patients with ESCC underwent curative esophagectomy without neoadjuvant chemoradiotherapy. Written informed consent was obtained from all patients. This study was approved by the Ethics Committee of the First Affiliated Hospital of Henan University of Science and Technology.
In this study, 48 molecules related to the prognosis of ESCC were used to establish a molecular interaction subnetwork with NetBox. The shortest-path threshold between molecules in the network was set to 1, so that only molecules with direct interactions were selected as nodes of the subnetwork. NetBox, a Java-based software tool, integrates four databases: the Human Protein Reference Database (HPRD), Reactome, the NCI-Nature Pathway Interaction Database (PID), and the MSKCC Cancer Cell Map.
Introduction of machine learning algorithms
This study used five machine learning algorithms, logistic regression (LR), support vector machine (SVM), artificial neural network (ANN), random forest (RF) and XGBoost, to develop classifiers for prognostic classification.
The LR model is a generalized linear model that applies a sigmoid function to the output of a linear regression. It is one of the most commonly used methods in medical research [46, 47].
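To make the sigmoid mapping concrete, the following minimal sketch computes an LR probability from a linear score. The function names, weights and inputs are hypothetical and are not taken from the study.

```python
import math

def sigmoid(z):
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Logistic regression: sigmoid applied to the linear combination bias + w.x."""
    linear_score = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(linear_score)

# Hypothetical coefficients and feature values, for illustration only.
print(predict_proba([0.8, -0.5], 0.1, [1.2, 0.4]))
```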
SVM is a supervised learning method developed by Cortes and Vapnik in 1995. Support vectors are used to find the optimal hyperplane that separates samples with different labels. Nonlinear features are mapped into a new high-dimensional space by a mapping function, and a kernel function replaces the inner product in that space with an equivalent, simpler computation, so that the samples become linearly separable. In this study, the radial basis function (RBF) kernel was used, defined as follows:
$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$
where $\gamma$ is the hyper-parameter that controls the trade-off between bias and variance.
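As an illustration of the RBF kernel, the sketch below assumes the common form K(x, y) = exp(-gamma * ||x - y||^2). The function name `rbf_kernel` and the parameter name `gamma` are assumptions made here for illustration; the study's models were built in R, not with this code.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel value, assuming K(x, y) = exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Identical points give the maximum similarity of 1.0.
print(rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.5))  # 1.0
# A larger gamma makes similarity decay faster with distance.
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=0.1))
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=10.0))
```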
Neural networks are an important machine learning technology and have found widespread application as scientific computing capabilities have advanced. In general, a neural network consists of an input layer, one or more hidden layers, and an output layer. The most important elements of a neural network are the design of the hidden layers and the connection weights between neurons. Logistic regression can be regarded as a neural network with no hidden layers.
RF and XGBoost are two ensemble learning algorithms, based on bagging and boosting, respectively. Ensemble learning trains multiple weak classifiers that differ from one another and then combines them. If the error rate of each weak classifier is below 0.5, combining them yields a strong classifier whose predictive ability increases and whose classification error decreases as more classifiers are added.
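The role of the 0.5 error-rate threshold can be illustrated with a simple majority vote. The sketch below assumes that the weak classifiers err independently and with the same error rate, which is an idealization that neither RF nor XGBoost strictly satisfies (and boosting combines classifiers by weighted addition rather than a plain vote); the function name is hypothetical.

```python
from math import comb

def majority_vote_error(n_classifiers, error_rate):
    """Probability that a strict majority of n independent classifiers is wrong."""
    k_min = n_classifiers // 2 + 1  # smallest number of wrong votes forming a majority
    return sum(
        comb(n_classifiers, k)
        * error_rate ** k
        * (1 - error_rate) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )

# With a per-classifier error of 0.4, the ensemble error shrinks as n grows.
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_error(n, 0.4), 4))
```

Under the same assumptions, an individual error rate above 0.5 has the opposite effect: the vote amplifies the errors instead of cancelling them.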
Development of classifiers
For the 179 patients with ESCC, labels were assigned according to survival time: cases with a survival time of more than 3 years were labeled 1 and the remaining cases were labeled 0. In the training cohort, cross-validation and parameter optimization were used to develop the models, and the test cohort was used for validation. Receiver operating characteristic (ROC) curve analysis was used to assess the predictive value of the machine learning classifiers, and the area under the ROC curve (AUC) was calculated.
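The labeling rule and the AUC can be sketched as follows. The AUC is computed here through its rank interpretation, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which is one standard way to obtain it and not necessarily how the study computed it. The function names, survival times and scores are hypothetical.

```python
def label_by_survival(survival_years, threshold=3.0):
    """Label 1 for survival longer than the threshold (3 years here), otherwise 0."""
    return [1 if t > threshold else 0 for t in survival_years]

def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical survival times (years) and classifier scores for five patients.
labels = label_by_survival([5.0, 1.2, 3.5, 2.0, 0.8])
print(labels)  # [1, 0, 1, 0, 0]
print(auc(labels, [0.9, 0.3, 0.4, 0.6, 0.1]))  # 5 of 6 pairs ranked correctly
```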
For each machine learning algorithm, 131,071 models were established, one for each non-empty combination of the 17 selected features (2^17 - 1 = 131,071), and the AUCs of the models in the training and test cohorts were calculated. Candidate classifiers were those with AUCs greater than the average AUC across all classifiers. Among the candidate classifiers, the 1000 models with the highest AUC values in the test cohort were selected, and the occurrence frequency of each molecule in these 1000 classifiers was counted. The 5 molecules with the highest occurrence frequency were regarded as the important molecules for the corresponding machine learning algorithm.
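The selection procedure can be sketched as below. The enumeration confirms that 17 features give 131,071 non-empty subsets. The text does not say which AUC the candidate filter uses, so this sketch assumes the training-cohort AUC; the feature names, toy AUC values and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def all_feature_subsets(features):
    """Yield every non-empty combination of the given features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

def important_features(models, top_models=1000, top_features=5):
    """models: list of (feature_tuple, train_auc, test_auc) triples."""
    mean_train_auc = sum(m[1] for m in models) / len(models)
    # Assumption: the candidate filter compares training AUC with the overall mean.
    candidates = [m for m in models if m[1] > mean_train_auc]
    best = sorted(candidates, key=lambda m: m[2], reverse=True)[:top_models]
    counts = Counter(feature for m in best for feature in m[0])
    return [feature for feature, _ in counts.most_common(top_features)]

features = [f"F{i}" for i in range(1, 18)]  # 17 placeholder feature names
n_models = sum(1 for _ in all_feature_subsets(features))
print(n_models)  # 131071

# Toy example with four models instead of 131,071.
toy_models = [
    (("A",), 0.60, 0.55),
    (("A", "B"), 0.80, 0.75),
    (("B", "C"), 0.85, 0.70),
    (("C",), 0.50, 0.90),
]
print(important_features(toy_models, top_models=2, top_features=1))  # ['B']
```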
The classifiers in this study were constructed and tested in R 3.6.3. The weak classifiers were built with R packages such as bestglm, e1071, and nnet, and the ensemble learning algorithms with randomForest and xgboost.
RNA extraction and quantitative RT-PCR
Total RNA from the 86 pairs of ESCC samples and matched noncancerous tissues was isolated using Trizol reagent (Invitrogen, Carlsbad, CA), and reverse transcription was performed using 1 μg of total RNA (Promega, USA). The primer pair for stratifin (SFN) was as follows: forward primer, 5’-GACTACTACCGCTACCTGGC-3’, and reverse primer, 5’-GTTGGCGATCTCGTAGTGGA-3’. GAPDH was used as an internal standard and its primer pair was as follows: forward primer, 5’-GCCACATCGCTCAGACACC-3’, and reverse primer, 5’-GATGGCAACAATATCCACTTTACC-3’. Quantitative RT-PCR was performed in triplicate on an Applied Biosystems 7900 quantitative PCR system (Foster City, CA, USA). Relative expression was calculated from the Ct values using the 2^(-ΔΔCt) method, with GAPDH as the internal standard.
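The 2^(-ΔΔCt) calculation can be illustrated with a short sketch. It normalizes the target gene's Ct to the reference gene in each tissue and then compares tumor with matched normal tissue. The Ct values and the function name are hypothetical, and averaging over triplicate wells is omitted for brevity.

```python
def relative_expression(ct_target_tumor, ct_ref_tumor, ct_target_normal, ct_ref_normal):
    """Fold change of the target gene in tumor versus normal tissue by 2^(-ddCt)."""
    delta_ct_tumor = ct_target_tumor - ct_ref_tumor      # normalize to reference gene
    delta_ct_normal = ct_target_normal - ct_ref_normal
    delta_delta_ct = delta_ct_tumor - delta_ct_normal    # tumor relative to normal
    return 2 ** (-delta_delta_ct)

# Hypothetical Ct values: target gene and GAPDH in tumor, then in matched normal tissue.
fold_change = relative_expression(22.0, 18.0, 25.0, 18.0)
print(fold_change)  # 8.0, i.e. eightfold higher in tumor
```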
Differences in quantitative data between two groups were assessed using the unpaired or paired Student's t-test. The relationship between protein abundance measured by western blot and the expression level of SFN was analyzed by linear regression. Kaplan-Meier survival curves were plotted and log-rank tests were used to assess differences in overall survival. All P values were two-tailed, and P values <0.05 were considered statistically significant.
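For readers unfamiliar with the Kaplan-Meier estimator, the following minimal sketch shows how the survival curve is built from follow-up times and event indicators. It is an illustration only, with hypothetical data and function name; it does not implement the log-rank test.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times: follow-up time per patient; events: 1 if death observed, 0 if censored.
    """
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)  # still under observation at t
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical follow-up times (years); the third and fifth patients are censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
for time, probability in curve:
    print(time, round(probability, 3))
```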