Screening of differentially expressed genes
Differentially expressed genes (DEGs) were screened using the limma package and defined as genes with |log2FC| > 1 and adj.P.Val < 0.01. From the limma comparison, a total of 15814 genes were up-regulated in ESCA and 6176 genes were down-regulated. The volcano plot and heatmap of the differential expression analysis (top 50 by significance) are shown in Figures 2A and 2B.
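As an illustrative sketch only (the study used limma in R; the Python data frame, column names and toy values below are assumptions mirroring limma's topTable output), the screening step reduces to two threshold filters:

```python
import pandas as pd

def screen_degs(results: pd.DataFrame, lfc_cut: float = 1.0, p_cut: float = 0.01):
    """Split a limma-style result table into up- and down-regulated DEGs.

    Assumes 'logFC' and 'adj.P.Val' columns, as produced by limma's topTable.
    """
    sig = results[results["adj.P.Val"] < p_cut]     # keep significant genes
    up = sig[sig["logFC"] > lfc_cut]                # up-regulated in tumor
    down = sig[sig["logFC"] < -lfc_cut]             # down-regulated in tumor
    return up, down

# toy table standing in for the real limma output
toy = pd.DataFrame(
    {"logFC": [2.3, -1.8, 0.4, 1.6], "adj.P.Val": [1e-5, 1e-4, 0.5, 0.2]},
    index=["g1", "g2", "g3", "g4"],
)
up, down = screen_degs(toy)
```

Note that a gene must pass both cutoffs: g4 above has |log2FC| > 1 but is not significant, so it is excluded.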
Weighted gene co-expression network analysis
The ESCA gene co-expression network was constructed by combining the gene expression profiles of tumor samples and normal controls, and the correlations between gene modules and tissue specimen type and other clinical information were analyzed together with the clinical information of the study subjects, in order to find gene modules closely related to ESCA. This analysis was performed with the WGCNA package in R.
Remove outliers
As mentioned above, after removing batch effects and normalizing the transcriptional count data, the data were used for WGCNA, and a hierarchical clustering tree was built from the gene expression data of the 1558 samples, as shown in Figure 3A. As can be seen from the figure, some samples at both ends are clear outliers, so the cut height was set to 275 and the outlier samples were removed. The IDs of the 10 outlier samples are: TCGA.LN.A9FO.01, TCGA.VR.A8ET.01, TCGA.VR.A8EO.01, TCGA.IG.A4P3.01, TCGA.Z6.A8JE.01, TCGA.VR.A8Q7.01, TCGA.IG.A50L.01, TCGA.LN.A49W.01, TCGA.VR.A8EP.01, TCGA.V5.A7RE.11. The results in Figure 3A show that the tumor and normal tissues clustered reasonably and could be used for further analysis.
Select the soft threshold β
As mentioned above, we chose an appropriate weighting parameter for the adjacency function, namely the soft threshold β, according to the scale-free network criterion. A soft threshold is selected that makes the adjacency function satisfy the scale-free condition well, i.e. the correlation coefficient (SFT.R.sq) between the logarithm of node connectivity (log(k)) and the logarithm of the probability of that connectivity (log(p(k))) is above 0.8. The higher SFT.R.sq is, the more closely the network follows a scale-free distribution. Generally, the soft threshold at which SFT.R.sq approaches 0.9 and reaches a plateau is selected. The calculation results of this study are shown in Figure 3B. The left panel of Figure 3B shows how SFT.R.sq varies with the soft threshold, and the right panel shows the mean connectivity of genes in the network at each soft threshold, which reflects the average connection level of the network. When the soft threshold is 18, SFT.R.sq reaches 0.85 and is essentially at the plateau.
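The SFT.R.sq criterion described above is simply the R² of a straight-line fit on the log-log connectivity distribution. A minimal Python sketch of that fit statistic (the study itself used WGCNA's soft-threshold routine in R; the toy power-law inputs here are assumptions):

```python
import numpy as np

def scale_free_fit(k_values, frequencies):
    """R^2 of the regression log10(p(k)) ~ log10(k), the analogue of
    WGCNA's scale-free topology fit index SFT.R.sq."""
    x = np.log10(np.asarray(k_values, dtype=float))
    p = np.asarray(frequencies, dtype=float)
    p = p / p.sum()                       # empirical probability of each connectivity
    y = np.log10(p)
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

# toy distribution following p(k) ~ k^-2 exactly, so the fit is perfect
k = np.array([1.0, 2.0, 4.0, 8.0])
freq = np.array([64.0, 16.0, 4.0, 1.0])
fit = scale_free_fit(k, freq)
```

In practice one would evaluate this fit for a range of candidate β values and, as described above, pick the smallest β whose fit approaches 0.9 and plateaus.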
Step-by-step method to construct gene co-expression network
As mentioned earlier, a soft threshold of 18 was selected for step-by-step network construction. The basic idea is to compute the correlation matrix and adjacency matrix of the expression profiles, transform them into a topological overlap matrix, and build a hierarchical clustering tree from the pairwise dissimilarity between genes. The minimum number of genes per module was set to 30, and the dynamic tree cut method was used to merge modules whose dissimilarity was less than 0.25, yielding a total of 7 gene modules (Figure 4A, 4B). The gray module contains the genes that were not assigned to any module.
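The correlation -> adjacency -> topological overlap chain described above can be sketched as follows (a Python analogue of WGCNA's unsigned TOM, not the R code actually used; the random expression matrix is a stand-in):

```python
import numpy as np

def tom_dissimilarity(expr, beta=18):
    """Unsigned TOM dissimilarity from a samples-x-genes expression matrix.

    Mirrors the chain: correlation -> soft-thresholded adjacency ->
    topological overlap -> dissimilarity (1 - TOM) used for clustering.
    """
    corr = np.corrcoef(expr, rowvar=False)      # gene-gene correlation matrix
    a = np.abs(corr) ** beta                    # soft-thresholded adjacency
    np.fill_diagonal(a, 0.0)
    k = a.sum(axis=1)                           # connectivity of each gene
    shared = a @ a                              # sum over shared neighbors u: a_iu * a_uj
    num = shared + a
    den = np.minimum.outer(k, k) + 1.0 - a
    tom = num / den
    np.fill_diagonal(tom, 1.0)
    return 1.0 - tom                            # feed this to hierarchical clustering

rng = np.random.default_rng(0)
expr = rng.standard_normal((20, 10))            # 20 samples x 10 genes (toy data)
diss = tom_dissimilarity(expr, beta=6)
```

The resulting dissimilarity matrix is symmetric with zeros on the diagonal and values in [0, 1]; hierarchical clustering of this matrix plus dynamic tree cut yields the modules.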
Analysis of relationship between gene module and clinical information
The clinical information included in the study comprised sample type, ESCA status, Gender, Weight, Height, BMI, and so on. The Pearson correlation coefficient between each module eigengene (ME) and the corresponding variable represents the correlation between the module and the clinical information. The results are shown in Figure 4D. As can be seen from Figure 4D, there is a strong positive correlation between the brown module and the sample type (r = 0.87, P < 0.001), while the lightcyan module is moderately negatively correlated with the sample type (r = -0.75, P < 0.001). Figure 5 shows that in terms of MS (gene significance across modules), i.e. the mean significance of all genes contained in a module, the brown module is also the module most closely related to ESCA, indicating that our analysis results are reliable.
The greenyellow module was negatively correlated with race (r = -0.45, P < 0.001), suggesting that genes in this module may be associated with race in some way. However, since the key purpose of this study is to establish an ESCA prediction model, the lightcyan and greenyellow modules were not studied further.
The module eigengene (ME), the first principal component of a given module's expression data, represents the overall gene expression level of that module. Module membership (MM) is the correlation between a gene's expression profile and the module eigengene. The module eigengene dendrogram and eigengene adjacency heatmap were drawn to find gene modules closely related to ESCA. The results are shown in Figure 4B and 4C.
As seen in Figure 4C, the brown module has the closest relationship with ESCA, so the known genes in this module were selected for subsequent GO and KEGG analysis. Gene modules significantly correlated with age, smoking history, BMI, weight, height and other traits were also found; however, since the objective of this study was to find gene modules closely related to ESCA, these findings were not analyzed further.
Hub genes closely related to ESCA
Hub genes are the genes with the highest connectivity in a module and summarize the characteristics of the module to a certain extent. Compared with hub genes in the global network, hub genes within a module tend to have more biological significance [52-54]. In the brown module, module membership (MM) and gene significance (GS) are highly correlated (0.87) with a highly significant P-value (< 1E-200) (Figure 6A), suggesting that the module is suitable for identifying hub genes associated with ESCA. We selected the 30 genes with the highest intramodular connectivity (top 30 by kWithin) from the brown module and mapped the gene interaction network using Cytoscape software (Figure 6C). Intersection analysis with the 15,814 significantly up-regulated DEGs showed that all 30 hub genes were significantly up-regulated DEGs (Figure 6D). The 30 differentially expressed hub genes in this module were used to construct the ESCA prediction model.
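Ranking by kWithin amounts to summing each gene's adjacency to the other genes of its own module and taking the top of that ranking. A hypothetical Python sketch (WGCNA computes this in R; the small adjacency matrix and gene names below are invented for illustration):

```python
import numpy as np

def top_hub_genes(adjacency, gene_names, module_mask, n=30):
    """Rank module genes by intramodular connectivity (kWithin) and return the top n."""
    idx = np.where(module_mask)[0]                 # positions of genes in this module
    sub = adjacency[np.ix_(idx, idx)].copy()       # module-internal adjacency
    np.fill_diagonal(sub, 0.0)                     # a gene is not its own neighbor
    k_within = sub.sum(axis=1)                     # kWithin for each module gene
    order = np.argsort(k_within)[::-1][:n]         # descending connectivity
    return [gene_names[idx[i]] for i in order]

# toy 4-gene module: "a" is clearly the most connected gene
adj = np.array([
    [0.0, 0.9, 0.8, 0.7],
    [0.9, 0.0, 0.1, 0.0],
    [0.8, 0.1, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.0],
])
names = ["a", "b", "c", "d"]
mask = np.array([True, True, True, True])
hubs = top_hub_genes(adj, names, mask, n=2)
```

With real data the adjacency would be the soft-thresholded matrix from the network construction step, and `n=30` would recover the 30 hub genes discussed above.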
GO classification and enrichment analysis results
Brown module differentially expressed genes
The brown module contained a total of 6838 genes, of which 4419 were significantly differentially expressed (|log2FC| > 1 and adj.P.Val < 0.01) (Figure 6B). Because clusterProfiler requires ENTREZ IDs as input, GO and KEGG enrichment analysis was performed on the 3497 of these 4419 genes that have ENTREZ IDs, and the results are shown in Figure 7A, B and C.
These 3497 genes were significantly enriched in the biological processes adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains, mitotic nuclear division, lymphocyte mediated immunity, nuclear division, and DNA replication (Figure 7A). The most enriched cellular component terms were immunoglobulin complex, "chromosome, centromeric region", condensed chromosome, "immunoglobulin complex, circulating", and "condensed chromosome, centromeric region" (Figure 7B). The most representative molecular function terms were antigen binding, immunoglobulin receptor binding, ATPase activity, cadherin binding, and DNA helicase activity (Figure 7C). In the KEGG enrichment analysis, Cell cycle, Pathogenic Escherichia coli infection, DNA replication, IL-17 signaling pathway, and Human T-cell leukemia virus 1 infection were the most over-represented pathways (Figure 7D).
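At its core, over-representation analysis of the kind clusterProfiler performs tests each term with a one-sided hypergeometric probability. A minimal, stdlib-only Python sketch of that test (the gene counts below are a toy example, not the study's data):

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """One-sided hypergeometric P(X >= k): drawing n genes of interest from a
    universe of N genes, of which K are annotated with the term, and observing
    at least k annotated genes among the draw."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / comb(N, n)

# toy: all 5 selected genes carry a term held by 5 of the 10 universe genes
p = enrichment_pvalue(k=5, n=5, K=5, N=10)   # = 1/252, highly enriched
```

Real enrichment pipelines additionally adjust these per-term p-values for multiple testing (e.g. Benjamini-Hochberg) before reporting significant terms.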
Brown module hub genes
Interestingly, according to the WGCNA and differential expression results, all 30 hub genes in the brown module were significantly up-regulated (log2FC > 1 and adj.P.Val < 0.01) (Figure 6D). Of the 30 differentially expressed hub genes, 29 had ENTREZ IDs; these 29 genes were used for GO and KEGG functional enrichment analysis.
The results show that these differentially expressed hub genes were enriched in the biological processes regulation of microtubule cytoskeleton organization, regulation of microtubule-based process, nuclear division, centrosome cycle, and microtubule organizing center organization (Figure 8A). The most enriched cellular component terms were spindle, chromosomal region, spindle pole, microtubule, and mitotic spindle (Figure 8B). The most representative molecular function terms were protein serine/threonine kinase activity, DNA helicase activity, single-stranded DNA binding, helicase activity, and catalytic activity, acting on DNA (Figure 8C). In the KEGG enrichment analysis, Cell cycle and DNA replication were the most over-represented pathways (Figure 8D).
Prediction model construction
Training group and validation group data
In machine learning research, improving prediction accuracy and processing efficiency are the two main concerns. If the training set is too small, the model's predictive performance may be poor; if it is too large, processing efficiency drops and overfitting becomes likely. Cross validation (CV) was developed to address this: it can avoid both underfitting and overfitting while improving processing efficiency without sacrificing prediction accuracy [22, 23]. Cross validation splits the data into smaller subsets, using some as training sets to fit the model and the remaining subsets as validation or test sets to evaluate it. Common cross-validation methods include k-fold cross-validation and leave-one-out cross-validation. In this study, we adopted the 10-fold cross validation recommended in the literature to build the prediction model [23, 24], set up with the trainControl function in the caret package and repeated three times. As a large number of subjects were included in this study, the training group was set to 50% of the total data.
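The resampling scheme described above (10-fold CV repeated three times, as in caret's trainControl with method "repeatedcv") can be sketched in plain Python; the sample count here is arbitrary:

```python
import random

def repeated_kfold_indices(n_samples, n_splits=10, n_repeats=3, seed=42):
    """Train/test index pairs for repeated k-fold cross-validation, analogous to
    caret's trainControl(method = "repeatedcv", number = 10, repeats = 3)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                               # fresh shuffle each repeat
        folds = [idx[i::n_splits] for i in range(n_splits)]
        for i in range(n_splits):
            test = folds[i]                            # one fold held out
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            splits.append((train, test))
    return splits

splits = repeated_kfold_indices(100)   # 10 folds x 3 repeats = 30 train/test pairs
```

Each sample appears in exactly one held-out fold per repeat, so every observation is used for both training and validation across the 30 splits.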
The predictors used to construct the ESCA prediction model were the 30 hub genes (all significantly up-regulated) from the brown module closely related to ESCA. Testing showed no near-zero-variance variables among them, but 26 highly correlated variables; after removing these, the ESCA prediction model was constructed from CHEK1, MOB1A, PTBP3 and G3BP1. The distributions of the four markers in the training and validation groups were plotted, and the results are shown in Figure 9 and Figure 10. As can be seen from the figures, the distributions of these four indicators were essentially consistent between the training and validation groups, indicating that the random grouping was reasonable and the data composition of the two groups was comparable.
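The correlation-filtering step above (caret's findCorrelation) greedily drops one column of every pair whose absolute correlation exceeds a cutoff. A hypothetical Python analogue (the three-column toy matrix is invented; the study filtered its 30 hub-gene predictors this way in R):

```python
import numpy as np

def find_correlated(X, cutoff=0.9):
    """Greedy analogue of caret's findCorrelation: return indices of columns to
    drop so that no remaining pair has |correlation| above the cutoff."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n = corr.shape[0]
    drop = set()
    for i in range(n):
        if i in drop:
            continue
        for j in range(i + 1, n):
            if j not in drop and corr[i, j] > cutoff:
                # drop whichever column has the larger mean absolute correlation
                if corr[i].mean() > corr[j].mean():
                    drop.add(i)
                    break
                drop.add(j)
    return sorted(drop)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(200)   # near-duplicate column
to_drop = find_correlated(X, cutoff=0.9)              # one of columns 0/1 is flagged
```

Dropping redundant predictors this way is what reduced the 30 hub genes to the four markers CHEK1, MOB1A, PTBP3 and G3BP1.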
Evaluation of prediction effect of different prediction models
As mentioned above, 12 commonly used machine learning algorithms were used to build the prediction model; the optimal parameters of each model are listed below:
Optimal parameters of the SVM model: sigma = 0.475461684041232, C = 4.
Optimal parameters of the XGBoost model: nrounds = 150, max_depth = 3, eta = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1, subsample = 1.
Optimal parameters of the C5.0 model: trials = 60, model = rules, winnow = FALSE.
Optimal parameter of the PLS model: ncomp = 1.
Optimal parameter of the LMT model: iter = 1.
Optimal parameters of the boostedGLM model: mstop = 50, prune = no.
Optimal parameter of the parRF model: mtry = 2.
Optimal parameter of the rpart model: cp = 0.870257037943696.
Optimal parameters of the JRip model: NumOpt = 7, NumFolds = 10, MinWeights = 2.
Optimal parameters of the PART model: threshold = 0.5, pruned = yes.
Optimal parameters of the GBM model: n.trees = 150, interaction.depth = 3, shrinkage = 0.1, n.minobsinnode = 10.
Optimal parameters of the AdaBoost model: nIter = 150, method = Adaboost.M1.
Although the optimal parameters of these models have been determined, how to select the best model among them must still be considered. This depends largely on the characteristics and type of the data and is generally approached from the following perspectives: first, try boosted trees and Support Vector Machines (SVM), which are the least interpretable but most flexible models and are likely to give the best prediction results (the best accuracy); then explore simpler, more interpretable models, such as partial least squares, generalized linear models, or naive Bayes; finally, the simplest model that achieves, or comes close to, the predictive performance of the most complex model can be taken as the final model [55].
Of the 12 machine learning algorithms used in this study, 9 produce black-box models. The three transparent or interpretable models are PART, rpart and JRip; the specific models are as follows:
PART algorithm prediction model:
PART decision tree rules (there are four rules):
MOB1A > -0.032404: ESCA (85.0/3.0)
CHEK1 <= -0.141327: Normal (75.0)
G3BP1 > -0.324787: Normal (12.0)
: ESCA (4.0)
rpart decision tree optimal model: n = 176
node), split, n, loss, yval, (yprob) * denotes terminal node
1) root 176 86 Normal (0.48863636 0.51136364)
2) MOB1A>=0.02317645 85 3 ESCA (0.96470588 0.03529412) *
3) MOB1A< 0.02317645 91 4 Normal (0.04395604 0.95604396) *
JRip optimal model (3 decision rules):
(MOB1A >= 0.078756) => .outcome=ESCA (85.0/3.0)
(CHEK1 >= -0.117953) and (G3BP1 <= -0.378487) => .outcome=ESCA (4.0/0.0)
=> .outcome=Normal (87.0/0.0)
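For illustration, the three JRip rules above can be transcribed directly into code; this is a hypothetical Python transcription (the thresholds are the normalized expression values from the model output, and the function name is invented):

```python
def jrip_predict(mob1a: float, chek1: float, g3bp1: float) -> str:
    """Apply the three JRip decision rules, in order, to one sample's
    normalized marker expression values."""
    if mob1a >= 0.078756:                              # rule 1
        return "ESCA"
    if chek1 >= -0.117953 and g3bp1 <= -0.378487:      # rule 2
        return "ESCA"
    return "Normal"                                    # default rule
```

This transcription makes the model's transparency concrete: the full classifier is three inequality checks on three genes.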
Then the validation group data were used to evaluate the predictive performance of the models built by the various algorithms, mainly in terms of accuracy, Kappa value (agreement rate), optimal sensitivity and specificity. The performance of the different models is shown in Table 1. The results show that the models built by the gbm and BoostGLM algorithms perform best, followed by the SVM, LMT and parRF algorithms. Although JRip, PART and rpart, which are highly readable and interpretable models, have high specificity, their sensitivity is relatively low; choosing these models would lead to about 10%~11% of individuals being misclassified. Therefore, these three models are not suitable as the final model in this study. The sensitivity and specificity of the models built by the gbm and BoostGLM algorithms reached 95% and above 97%, respectively, on the validation group data, so the models obtained by these two algorithms were taken as the final models. In this study, an ESCA prediction model with accuracy above 0.97 was successfully constructed.
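The four evaluation quantities named above all derive from a 2x2 confusion matrix. A small Python sketch (treating ESCA as the positive class, which is an assumption; the counts below are invented):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, Cohen's kappa, sensitivity and specificity from a 2x2
    confusion matrix (tp/fn = positive class, fp/tn = negative class)."""
    n = tp + fn + fp + tn
    acc = (tp + tn) / n                    # overall agreement
    sens = tp / (tp + fn)                  # true positive rate
    spec = tn / (tn + fp)                  # true negative rate
    # chance agreement expected from the marginal totals, for kappa
    p_pos = ((tp + fn) / n) * ((tp + fp) / n)
    p_neg = ((fp + tn) / n) * ((fn + tn) / n)
    pe = p_pos + p_neg
    kappa = (acc - pe) / (1 - pe)
    return acc, kappa, sens, spec

# toy validation counts: 45 of 50 ESCA and 47 of 50 normals correctly called
acc, kappa, sens, spec = confusion_metrics(tp=45, fn=5, fp=3, tn=47)
```

Kappa is preferred alongside raw accuracy because it discounts the agreement expected by chance from the class proportions.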
Validation of the four markers
Intersection analysis of the DEGs from GSE33426 with the four markers showed that all four markers are differentially expressed in that dataset (Figure 11A), whereas in GSE23400-GPL96 only G3BP1, MOB1A and CHEK1 were differentially expressed (Figure 11B). To further test the value of the four candidate markers as biomarkers of ESCA, ROC curves were plotted and the AUCs (95% CIs) were calculated. As shown in Figure 12A, the AUCs of G3BP1, MOB1A, CHEK1 and PTBP3 in GSE33426 were 0.99, 0.962, 0.958 and 0.856, respectively, while in GSE23400-GPL96 they were 0.819, 0.77, 0.952 and 0.543 (Figure 12B). These results suggest G3BP1, MOB1A and CHEK1 as potential biomarkers of ESCA. In addition, based on overall survival analysis, we found that G3BP1 may be a potential therapeutic target for ESCA (Figure 13B).
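The AUC values reported above have a simple probabilistic reading: the probability that a randomly chosen tumor sample scores higher on the marker than a randomly chosen normal sample. A stdlib-only Python sketch of that rank-based (Mann-Whitney) computation, with invented toy scores:

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a random positive (tumor) sample scores
    above a random negative (normal) sample; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# toy expression scores: mostly separated, with one tie across groups
auc = auc_from_scores([3, 4, 5], [1, 2, 3])
```

An AUC of 0.5 (as PTBP3 approaches in GSE23400-GPL96, at 0.543) means the marker barely discriminates tumor from normal, which is why PTBP3 was not retained as a candidate biomarker.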