Screening of differentially expressed genes
Differentially expressed genes (DEGs) were screened using the limma package and defined as genes with |log2FC| > 1 and adj.P.Val < 0.01. From the limma comparison, a total of 15814 genes were up-regulated in ESCA and 6176 genes were down-regulated. The volcano plot and heatmap of the differential expression analysis (top 50 by significance) are shown in Figures 2A and 2B.
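As an illustrative sketch only (the study used limma in R; the Python data frame, column names and toy values below are assumptions mirroring limma's topTable output), the screening step reduces to two threshold filters:

```python
import pandas as pd

def screen_degs(results: pd.DataFrame, lfc_cut: float = 1.0, p_cut: float = 0.01):
    """Split a limma-style result table into up- and down-regulated DEGs.

    Assumes 'logFC' and 'adj.P.Val' columns, as produced by limma's topTable.
    """
    sig = results[results["adj.P.Val"] < p_cut]     # keep significant genes
    up = sig[sig["logFC"] > lfc_cut]                # up-regulated in tumor
    down = sig[sig["logFC"] < -lfc_cut]             # down-regulated in tumor
    return up, down

# toy table standing in for the real limma output
toy = pd.DataFrame(
    {"logFC": [2.3, -1.8, 0.4, 1.6], "adj.P.Val": [1e-5, 1e-4, 0.5, 0.2]},
    index=["g1", "g2", "g3", "g4"],
)
up, down = screen_degs(toy)
```

Note that a gene must pass both cutoffs: g4 above has |log2FC| > 1 but is not significant, so it is excluded.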
Weighted gene co-expression network analysis
The ESCA gene co-expression network was constructed by combining the gene expression profiles of tumor samples and normal controls, and the correlations between gene modules and tissue specimen type and other clinical information were analyzed together with the clinical information of the study subjects, in order to find gene modules closely related to ESCA. This analysis was performed with the WGCNA package in R.
Remove outliers
As mentioned above, after removing batch effects and normalizing the transcriptional count data, the data were used for WGCNA, and a hierarchical clustering tree was built from the gene expression data of the 1558 samples, as shown in Figure 3A. As can be seen from the figure, some samples at both ends are clear outliers, so the cut height was set to 275 and the outlier samples were removed. The IDs of the 10 outlier samples are: TCGA.LN.A9FO.01, TCGA.VR.A8ET.01, TCGA.VR.A8EO.01, TCGA.IG.A4P3.01, TCGA.Z6.A8JE.01, TCGA.VR.A8Q7.01, TCGA.IG.A50L.01, TCGA.LN.A49W.01, TCGA.VR.A8EP.01, TCGA.V5.A7RE.11. The results in Figure 3A show that the tumor and normal tissues clustered reasonably and could be used for further analysis.
Select the soft threshold β
As mentioned above, we chose an appropriate weighting parameter for the adjacency function, namely the soft threshold β, according to the scale-free network criterion. A soft threshold is selected that makes the adjacency function satisfy the scale-free condition well, i.e. the correlation coefficient (SFT.R.sq) between the logarithm of node connectivity (log(k)) and the logarithm of the probability of that connectivity (log(p(k))) is above 0.8. The higher SFT.R.sq is, the more closely the network follows a scale-free distribution. Generally, the soft threshold at which SFT.R.sq approaches 0.9 and reaches a plateau is selected. The calculation results of this study are shown in Figure 3B. The left panel of Figure 3B shows how SFT.R.sq varies with the soft threshold, and the right panel shows the mean connectivity of genes in the network at each soft threshold, which reflects the average connection level of the network. When the soft threshold is 18, SFT.R.sq reaches 0.85 and is essentially at the plateau.
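The SFT.R.sq criterion described above is simply the R² of a straight-line fit on the log-log connectivity distribution. A minimal Python sketch of that fit statistic (the study itself used WGCNA's soft-threshold routine in R; the toy power-law inputs here are assumptions):

```python
import numpy as np

def scale_free_fit(k_values, frequencies):
    """R^2 of the regression log10(p(k)) ~ log10(k), the analogue of
    WGCNA's scale-free topology fit index SFT.R.sq."""
    x = np.log10(np.asarray(k_values, dtype=float))
    p = np.asarray(frequencies, dtype=float)
    p = p / p.sum()                       # empirical probability of each connectivity
    y = np.log10(p)
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

# toy distribution following p(k) ~ k^-2 exactly, so the fit is perfect
k = np.array([1.0, 2.0, 4.0, 8.0])
freq = np.array([64.0, 16.0, 4.0, 1.0])
fit = scale_free_fit(k, freq)
```

In practice one would evaluate this fit for a range of candidate β values and, as described above, pick the smallest β whose fit approaches 0.9 and plateaus.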
Step-by-step method to construct gene co-expression network
As mentioned earlier, a soft threshold of 18 was selected for step-by-step network construction. The basic idea is to compute the correlation matrix and adjacency matrix of the expression profiles, transform them into a topological overlap matrix, and build a hierarchical clustering tree from the pairwise dissimilarity between genes. The minimum number of genes per module was set to 30, and the dynamic tree cut method was used to merge modules whose dissimilarity was less than 0.25, yielding a total of 7 gene modules (Figure 4A, 4B). The gray module contains the genes that were not assigned to any module.
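The correlation -> adjacency -> topological overlap chain described above can be sketched as follows (a Python analogue of WGCNA's unsigned TOM, not the R code actually used; the random expression matrix is a stand-in):

```python
import numpy as np

def tom_dissimilarity(expr, beta=18):
    """Unsigned TOM dissimilarity from a samples-x-genes expression matrix.

    Mirrors the chain: correlation -> soft-thresholded adjacency ->
    topological overlap -> dissimilarity (1 - TOM) used for clustering.
    """
    corr = np.corrcoef(expr, rowvar=False)      # gene-gene correlation matrix
    a = np.abs(corr) ** beta                    # soft-thresholded adjacency
    np.fill_diagonal(a, 0.0)
    k = a.sum(axis=1)                           # connectivity of each gene
    shared = a @ a                              # sum over shared neighbors u: a_iu * a_uj
    num = shared + a
    den = np.minimum.outer(k, k) + 1.0 - a
    tom = num / den
    np.fill_diagonal(tom, 1.0)
    return 1.0 - tom                            # feed this to hierarchical clustering

rng = np.random.default_rng(0)
expr = rng.standard_normal((20, 10))            # 20 samples x 10 genes (toy data)
diss = tom_dissimilarity(expr, beta=6)
```

The resulting dissimilarity matrix is symmetric with zeros on the diagonal and values in [0, 1]; hierarchical clustering of this matrix plus dynamic tree cut yields the modules.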
Analysis of relationship between gene module and clinical information
The clinical information included in the study comprised sample type, ESCA status, Gender, Weight, Height, BMI, and so on. The Pearson correlation coefficient between each module eigengene (ME) and the corresponding variable represents the correlation between the module and the clinical information. The results are shown in Figure 4D. As can be seen from Figure 4D, there is a strong positive correlation between the brown module and the sample type (r = 0.87, P < 0.001), while the lightcyan module is moderately negatively correlated with the sample type (r = -0.75, P < 0.001). Figure 5 shows that in terms of MS (gene significance across modules), i.e. the mean significance of all genes contained in a module, the brown module is also the module most closely related to ESCA, indicating that our analysis results are reliable.
The greenyellow module was negatively correlated with race (r = -0.45, P < 0.001), suggesting that genes in this module may be associated with race in some way. However, since the key purpose of this study is to establish an ESCA prediction model, the lightcyan and greenyellow modules were not studied further.
The module eigengene (ME), the first principal component of a given module's expression data, represents the overall gene expression level of that module. Module membership (MM) is the correlation between a gene's expression profile and the module eigengene. The module eigengene dendrogram and eigengene adjacency heatmap were drawn to find gene modules closely related to ESCA. The results are shown in Figure 4B and 4C.
As seen in Figure 4C, the brown module has the closest relationship with ESCA, so the known genes in this module were selected for subsequent GO and KEGG analysis. Gene modules significantly correlated with age, smoking history, BMI, weight, height and other traits were also found; however, since the objective of this study was to find gene modules closely related to ESCA, these findings were not analyzed further.
Hub genes closely related to ESCA
Hub genes are the genes with the highest connectivity in a module and summarize the characteristics of the module to a certain extent. Compared with hub genes in the global network, hub genes within a module tend to have more biological significance [52-54]. In the brown module, module membership (MM) and gene significance (GS) are highly correlated (0.87) with a highly significant P-value (< 1E-200) (Figure 6A), suggesting that the module is suitable for identifying hub genes associated with ESCA. We selected the 30 genes with the highest intramodular connectivity (top 30 by kWithin) from the brown module and mapped the gene interaction network using Cytoscape software (Figure 6C). Intersection analysis with the 15,814 significantly up-regulated DEGs showed that all 30 hub genes were significantly up-regulated DEGs (Figure 6D). The 30 differentially expressed hub genes in this module were used to construct the ESCA prediction model.
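Ranking by kWithin amounts to summing each gene's adjacency to the other genes of its own module and taking the top of that ranking. A hypothetical Python sketch (WGCNA computes this in R; the small adjacency matrix and gene names below are invented for illustration):

```python
import numpy as np

def top_hub_genes(adjacency, gene_names, module_mask, n=30):
    """Rank module genes by intramodular connectivity (kWithin) and return the top n."""
    idx = np.where(module_mask)[0]                 # positions of genes in this module
    sub = adjacency[np.ix_(idx, idx)].copy()       # module-internal adjacency
    np.fill_diagonal(sub, 0.0)                     # a gene is not its own neighbor
    k_within = sub.sum(axis=1)                     # kWithin for each module gene
    order = np.argsort(k_within)[::-1][:n]         # descending connectivity
    return [gene_names[idx[i]] for i in order]

# toy 4-gene module: "a" is clearly the most connected gene
adj = np.array([
    [0.0, 0.9, 0.8, 0.7],
    [0.9, 0.0, 0.1, 0.0],
    [0.8, 0.1, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.0],
])
names = ["a", "b", "c", "d"]
mask = np.array([True, True, True, True])
hubs = top_hub_genes(adj, names, mask, n=2)
```

With real data the adjacency would be the soft-thresholded matrix from the network construction step, and `n=30` would recover the 30 hub genes discussed above.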
GO classification and enrichment analysis results
Brown module differentially expressed genes
The brown module contained a total of 6838 genes, of which 4419 were significantly differentially expressed (|log2FC| > 1 and adj.P.Val < 0.01) (Figure 6B). Because clusterProfiler requires ENTREZ IDs as input, GO and KEGG enrichment analysis was performed on the 3497 of these 4419 genes that have ENTREZ IDs, and the results are shown in Figure 7A, B and C.
These 3497 genes were significantly enriched in the biological processes adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains, mitotic nuclear division, lymphocyte mediated immunity, nuclear division, and DNA replication (Figure 7A). The most enriched cellular component terms were immunoglobulin complex, "chromosome, centromeric region", condensed chromosome, "immunoglobulin complex, circulating", and "condensed chromosome, centromeric region" (Figure 7B). The most representative molecular function terms were antigen binding, immunoglobulin receptor binding, ATPase activity, cadherin binding, and DNA helicase activity (Figure 7C). In the KEGG enrichment analysis, Cell cycle, Pathogenic Escherichia coli infection, DNA replication, IL-17 signaling pathway, and Human T-cell leukemia virus 1 infection were the most over-represented pathways (Figure 7D).
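At its core, over-representation analysis of the kind clusterProfiler performs tests each term with a one-sided hypergeometric probability. A minimal, stdlib-only Python sketch of that test (the gene counts below are a toy example, not the study's data):

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """One-sided hypergeometric P(X >= k): drawing n genes of interest from a
    universe of N genes, of which K are annotated with the term, and observing
    at least k annotated genes among the draw."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / comb(N, n)

# toy: all 5 selected genes carry a term held by 5 of the 10 universe genes
p = enrichment_pvalue(k=5, n=5, K=5, N=10)   # = 1/252, highly enriched
```

Real enrichment pipelines additionally adjust these per-term p-values for multiple testing (e.g. Benjamini-Hochberg) before reporting significant terms.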
Brown module hub genes
Interestingly, according to the WGCNA and differential expression results, all 30 hub genes in the brown module were significantly up-regulated (log2FC > 1 and adj.P.Val < 0.01) (Figure 6D). Of the 30 differentially expressed hub genes, 29 had ENTREZ IDs; these 29 genes were used for GO and KEGG functional enrichment analysis.
The results show that these differentially expressed hub genes were enriched in the biological processes regulation of microtubule cytoskeleton organization, regulation of microtubule-based process, nuclear division, centrosome cycle, and microtubule organizing center organization (Figure 8A). The most enriched cellular component terms were spindle, chromosomal region, spindle pole, microtubule, and mitotic spindle (Figure 8B). The most representative molecular function terms were protein serine/threonine kinase activity, DNA helicase activity, single-stranded DNA binding, helicase activity, and catalytic activity, acting on DNA (Figure 8C). In the KEGG enrichment analysis, Cell cycle and DNA replication were the most over-represented pathways (Figure 8D).
Prediction model construction
Training group and validation group data
In machine learning research, improving prediction accuracy and processing efficiency are the two main concerns. If the training set is too small, the model's predictive performance may be poor; if it is too large, processing efficiency drops and overfitting becomes likely. Cross validation (CV) was developed to address this: it can avoid both underfitting and overfitting while improving processing efficiency without sacrificing prediction accuracy [22, 23]. Cross validation splits the data into smaller subsets, using some as training sets to fit the model and the remaining subsets as validation or test sets to evaluate it. Common cross-validation methods include k-fold cross-validation and leave-one-out cross-validation. In this study, we adopted the 10-fold cross validation recommended in the literature to build the prediction model [23, 24], set up with the trainControl function in the caret package and repeated three times. As a large number of subjects were included in this study, the training group was set to 50% of the total data.
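The resampling scheme described above (10-fold CV repeated three times, as in caret's trainControl with method "repeatedcv") can be sketched in plain Python; the sample count here is arbitrary:

```python
import random

def repeated_kfold_indices(n_samples, n_splits=10, n_repeats=3, seed=42):
    """Train/test index pairs for repeated k-fold cross-validation, analogous to
    caret's trainControl(method = "repeatedcv", number = 10, repeats = 3)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                               # fresh shuffle each repeat
        folds = [idx[i::n_splits] for i in range(n_splits)]
        for i in range(n_splits):
            test = folds[i]                            # one fold held out
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            splits.append((train, test))
    return splits

splits = repeated_kfold_indices(100)   # 10 folds x 3 repeats = 30 train/test pairs
```

Each sample appears in exactly one held-out fold per repeat, so every observation is used for both training and validation across the 30 splits.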
The predictors used to construct the ESCA prediction model were the 30 hub genes (all significantly up-regulated) from the brown module closely related to ESCA. Testing showed no near-zero-variance variables among them, but 26 highly correlated variables; after removing these, the ESCA prediction model was constructed from CHEK1, MOB1A, PTBP3 and G3BP1. The distributions of the four markers in the training and validation groups were plotted, and the results are shown in Figure 9 and Figure 10. As can be seen from the figures, the distributions of these four indicators were essentially consistent between the training and validation groups, indicating that the random grouping was reasonable and the data composition of the two groups was comparable.
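The correlation-filtering step above (caret's findCorrelation) greedily drops one column of every pair whose absolute correlation exceeds a cutoff. A hypothetical Python analogue (the three-column toy matrix is invented; the study filtered its 30 hub-gene predictors this way in R):

```python
import numpy as np

def find_correlated(X, cutoff=0.9):
    """Greedy analogue of caret's findCorrelation: return indices of columns to
    drop so that no remaining pair has |correlation| above the cutoff."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n = corr.shape[0]
    drop = set()
    for i in range(n):
        if i in drop:
            continue
        for j in range(i + 1, n):
            if j not in drop and corr[i, j] > cutoff:
                # drop whichever column has the larger mean absolute correlation
                if corr[i].mean() > corr[j].mean():
                    drop.add(i)
                    break
                drop.add(j)
    return sorted(drop)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(200)   # near-duplicate column
to_drop = find_correlated(X, cutoff=0.9)              # one of columns 0/1 is flagged
```

Dropping redundant predictors this way is what reduced the 30 hub genes to the four markers CHEK1, MOB1A, PTBP3 and G3BP1.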
Evaluation of prediction effect of different prediction models
As mentioned above, 12 commonly used machine learning algorithms were used to build the prediction model; the optimal parameters of each model are listed below:
Optimal parameters of the SVM model: sigma = 0.475461684041232, C = 4.
Optimal parameters of the XGBoost model: nrounds = 150, max_depth = 3, eta = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1, subsample = 1.
Optimal parameters of the C5.0 model: trials = 60, model = rules, winnow = FALSE.
Optimal parameter of the PLS model: ncomp = 1.
Optimal parameter of the LMT model: iter = 1.
Optimal parameters of the boostedGLM model: mstop = 50, prune = no.
Optimal parameter of the parRF model: mtry = 2.
Optimal parameter of the rpart model: cp = 0.870257037943696.
Optimal parameters of the JRip model: NumOpt = 7, NumFolds = 10, MinWeights = 2.
Optimal parameters of the PART model: threshold = 0.5, pruned = yes.
Optimal parameters of the GBM model: n.trees = 150, interaction.depth = 3, shrinkage = 0.1, n.minobsinnode = 10.
Optimal parameters of the AdaBoost model: nIter = 150, method = Adaboost.M1.
Although the optimal parameters of these models have been determined, how to select the best model among them must still be considered. This depends largely on the characteristics and type of the data and is generally approached from the following perspectives: first, try boosted trees and Support Vector Machines (SVM), which are the least interpretable but most flexible models and are likely to give the best prediction results (the best accuracy); then explore simpler, more interpretable models, such as partial least squares, generalized linear models, or naive Bayes; finally, the simplest model that achieves, or comes close to, the predictive performance of the most complex model can be taken as the final model [55].
Of the 12 machine learning algorithms used in this study, 9 produce black-box models. The three transparent or interpretable models are PART, rpart and JRip; the specific models are as follows:
PART algorithm prediction model:
PART decision tree rules (there are four rules):
MOB1A > -0.032404: ESCA (85.0/3.0)
CHEK1 <= -0.141327: Normal (75.0)
G3BP1 > -0.324787: Normal (12.0)
: ESCA (4.0)
rpart decision tree optimal model: n = 176
node), split, n, loss, yval, (yprob) * denotes terminal node
1) root 176 86 Normal (0.48863636 0.51136364)
2) MOB1A>=0.02317645 85 3 ESCA (0.96470588 0.03529412) *
3) MOB1A< 0.02317645 91 4 Normal (0.04395604 0.95604396) *
JRip optimal model (3 decision rules):
(MOB1A >= 0.078756) => .outcome=ESCA (85.0/3.0)
(CHEK1 >= -0.117953) and (G3BP1 <= -0.378487) => .outcome=ESCA (4.0/0.0)
=> .outcome=Normal (87.0/0.0)
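For illustration, the three JRip rules above can be transcribed directly into code; this is a hypothetical Python transcription (the thresholds are the normalized expression values from the model output, and the function name is invented):

```python
def jrip_predict(mob1a: float, chek1: float, g3bp1: float) -> str:
    """Apply the three JRip decision rules, in order, to one sample's
    normalized marker expression values."""
    if mob1a >= 0.078756:                              # rule 1
        return "ESCA"
    if chek1 >= -0.117953 and g3bp1 <= -0.378487:      # rule 2
        return "ESCA"
    return "Normal"                                    # default rule
```

This transcription makes the model's transparency concrete: the full classifier is three inequality checks on three genes.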
Then the validation group data were used to evaluate the predictive performance of the models built by the various algorithms, mainly in terms of accuracy, Kappa value (agreement rate), optimal sensitivity and specificity. The performance of the different models is shown in Table 1. The results show that the models built by the gbm and BoostGLM algorithms perform best, followed by the SVM, LMT and parRF algorithms. Although JRip, PART and rpart, which are highly readable and interpretable models, have high specificity, their sensitivity is relatively low; choosing these models would lead to about 10%~11% of individuals being misclassified. Therefore, these three models are not suitable as the final model in this study. The sensitivity and specificity of the models built by the gbm and BoostGLM algorithms reached 95% and above 97%, respectively, on the validation group data, so the models obtained by these two algorithms were taken as the final models. In this study, an ESCA prediction model with accuracy above 0.97 was successfully constructed.
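The four evaluation quantities named above all derive from a 2x2 confusion matrix. A small Python sketch (treating ESCA as the positive class, which is an assumption; the counts below are invented):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, Cohen's kappa, sensitivity and specificity from a 2x2
    confusion matrix (tp/fn = positive class, fp/tn = negative class)."""
    n = tp + fn + fp + tn
    acc = (tp + tn) / n                    # overall agreement
    sens = tp / (tp + fn)                  # true positive rate
    spec = tn / (tn + fp)                  # true negative rate
    # chance agreement expected from the marginal totals, for kappa
    p_pos = ((tp + fn) / n) * ((tp + fp) / n)
    p_neg = ((fp + tn) / n) * ((fn + tn) / n)
    pe = p_pos + p_neg
    kappa = (acc - pe) / (1 - pe)
    return acc, kappa, sens, spec

# toy validation counts: 45 of 50 ESCA and 47 of 50 normals correctly called
acc, kappa, sens, spec = confusion_metrics(tp=45, fn=5, fp=3, tn=47)
```

Kappa is preferred alongside raw accuracy because it discounts the agreement expected by chance from the class proportions.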
Validation of the four markers
Intersection analysis of the DEGs from GSE33426 with the four markers showed that all four markers are differentially expressed in that dataset (Figure 11A), whereas in GSE23400-GPL96 only G3BP1, MOB1A and CHEK1 were differentially expressed (Figure 11B). To further test the value of the four candidate markers as biomarkers of ESCA, ROC curves were plotted and the AUCs (95% CIs) were calculated. As shown in Figure 12A, the AUCs of G3BP1, MOB1A, CHEK1 and PTBP3 in GSE33426 were 0.99, 0.962, 0.958 and 0.856, respectively, while in GSE23400-GPL96 they were 0.819, 0.77, 0.952 and 0.543 (Figure 12B). These results suggest G3BP1, MOB1A and CHEK1 as potential biomarkers of ESCA. In addition, based on overall survival analysis, we found that G3BP1 may be a potential therapeutic target for ESCA (Figure 13B).
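The AUC values reported above have a simple probabilistic reading: the probability that a randomly chosen tumor sample scores higher on the marker than a randomly chosen normal sample. A stdlib-only Python sketch of that rank-based (Mann-Whitney) computation, with invented toy scores:

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a random positive (tumor) sample scores
    above a random negative (normal) sample; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# toy expression scores: mostly separated, with one tie across groups
auc = auc_from_scores([3, 4, 5], [1, 2, 3])
```

An AUC of 0.5 (as PTBP3 approaches in GSE23400-GPL96, at 0.543) means the marker barely discriminates tumor from normal, which is why PTBP3 was not retained as a candidate biomarker.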