A model of molecular characteristics to classify patients with WHO Grade III gliomas

Background: The World Health Organization (WHO) Grade III gliomas display large within-group differences, and no studies have attempted to classify Grade III gliomas to our knowledge. We aimed to classify Grade III gliomas and complement the existing molecular classication system. Methods: We used microarrays to prole glioma progression from normal brains (n = 5) to WHO Grade II (n = 97) to Grade III (n = 34) to Grade IV samples (n = 89). Using bioinformatics analysis to construct a model to evaluate the diagnosis and prognosis of patients with Grade III gliomas. Results: We identied 27 gene expression patterns of glioma progression and found Grade III glioma samples had large within-group differences. There were 506 differentially expressed genes in Grade III glioma samples and used to construct a network. Three clusters were identied, we chose the rst two genes with the most signicant P values in each cluster, age and Isocitrate Dehydrogenase 1 (IDH1) mutations to construct a model to score and high risk scores were signicantly associated with poor survival in our data and two independent datasets for patients with Grade III gliomas (P = 1.12e−03, P = 2.55e−04, P = 5.93e−06, respectively). The ROC area under the curve was 80.0%, 80.9% and 76.6%, respectively. Conclusions: Our results revealed that the model was able to predict the diagnosis and prognosis of patients with Grade III gliomas, assisting with post-surgical treatment management and providing a priori molecular targeting treatment.


Background
Glioma represents the majority of malignancies in the central nervous system and is often associated with the progressive deterioration of emotional and cognitive health and poor quality of life [1,2].
Although multiple modes of treatment, such as surgery, chemotherapy, radiation and biotherapy, improves survival, death inevitably occurs from either recurrent or progressive disease, thus, scienti c and clinical advances are desperately needed [3]. The WHO Classi cation of Tumors of the Central Nervous System is currently the most common classi cation approach. The updated 2016 revision of the WHO classi cation mandates the inclusion of molecular-genetic alterations in the diagnosis of diffuse glial tumors and is a signi cant departure from the morphology-based 2007 classi cation [4]. The WHO classi cation has been clinically applied for years and is a practical and effective approach for brain tumor analysis; it provides a means for placing tumors into speci c, relevant categories [5]. Although histopathological grading is subjective and associated with signi cant intra-observer variability [6], it has implications not only for survival prediction but also for tailoring post-surgical management and patient selection in the evaluation of relatively new molecular targeting treatments when it is combined with molecular data [7,8].
In recent decades, the molecular biology of gliomas has made tremendous progress. Mutations in the IDH gene are primarily located at codon R132 in IDH1 and R172 in IDH2. IDH gene mutations commonly indicate a favorable independent prognostic factor of patients with gliomas [9][10][11]. The combined loss of chromosomal arms 1p and 19q resulting from an unbalanced t(1;19)(q10;p10) mutation leads to loss of heterozygosity, which associated with improved prognosis and responsiveness to therapy in patients with gliomas [12]. Recently, several published studies reported that speci c combinations have the ability to reclassify gliomas into rational subsets [13][14][15].
An increasing number of studies have reported that the WHO classi cation of glioma is subjective, and new classi cation schemes are emerging [16]. However, a new approach for WHO Grade III glioma classi cation has not been reported.
In this study, we used microarrays to pro le glioma progression from normal brains to WHO Grade II to Grade III to Grade IV samples. We identi ed 27 gene expression patterns by comparing datasets from different glioma grades and found that the WHO classi cation could predict the prognosis of patients with gliomas and that the prognoses were worse for higher grades. However, patients with WHO Grade III gliomas had signi cantly different prognoses, so we divided the Grade III samples into two groups and identi ed their differentially expressed genes. We selected 3 clusters for further analysis based on proteinprotein interaction (PPI) networks and the Molecular Complex Detection (MCODE) plugin in the Cytoscape. The most enriched Gene Ontology (GO) biological process terms in each respective cluster were cell proliferation, immune response and smooth muscle cell migration. At last, we constructed a model to score and evaluate patients' diagnosis and prognosis.

Patients and samples
The glioma samples were obtained from Beijing Tiantan Hospital between January 2005 and December 2009. The samples included 5 cases of normal brain tissues samples (hereafter referred to as "grade0"), 97 cases of WHO Grade II samples (hereafter referred to as "grade2"), 34 cases of Grade III samples (hereafter referred to as "grade3") and 89 cases of Grade IV samples (hereafter referred to as "grade4"). Patients were graded according to the WHO classi cation guidelines by two neuropathologists (detailed information about the tissue samples is presented in table S1). Overall survival (OS) was calculated from the date of diagnosis until death or the end of follow-up, while progression-free survival (PFS) was de ned as the time between diagnosis and the rst unequivocal clinical or radiological sign of disease progression. All of samples were rinsed with normal saline after surgical resection and divided into two parts, one of which was placed in liquid nitrogen immediately and then stored at -80°C until use, and the other was placed in formaldehyde solution and for hematoxylin-eosin stained to assess the percentage of tumor cells.
RNA isolation and microarray expression pro ling Total RNA was extracted using the mirVana miRNA Isolation kit (Ambion 1561) according to the manufacturer's protocol. Only samples with >80% tumor cells were selected for RNA extraction and subsequent experiments. The Agilent 4×44K Whole Human Genome Oligo Microarrays were implemented and normalized as our previous description [17]. The raw and processed data are publicly available at the Gene Expression Omnibus (GEO) website under the accession number GSE109857.
Expression patterns of multi-stage glioma samples The tissue samples were divided into the following four stages: grade0, grade2, grade3 and grade4. Gene expression patterns were de ned by expression levels changed relative to those changes found in neighboring stages. There are three scenarios for differences in gene expression between adjacent stages: upregulation (u), downregulation (d), and no signi cant change (n). Accordingly, the genes were divided into 27 groups de ned by the four stages. The group in which gene expression increasing steadily occurred was named as "uuu'', the group of genes with sharply decreasing expression was referred to as "ddd", and the group consisting of genes with no signi cant changes were called as "nnn".

PPI network construction and subnetwork analysis
Search Tool for the Retrieval of Interacting Genes/Proteins (STRING; https://string-db.org) was employed to construct PPI networks for differentially expressed genes [18]. The tab-delimited format of the PPI network was exported from STRING and imported into Cytoscape, which was used to visualize the network [19]. Then, we used the MCODE plugin in Cytoscape to identify discrete clusters from the former PPI network [20]. The GO functional enrichment analysis of clusters were conducted in the Database of Annotation Visualization and Integrated Discovery (DAVID; http://david.abcc.ncifcrf.gov/).

Prognostic analysis
We proposed a model that considered several factors to generate a score for each patient to predict the prognosis of patients with WHO Grade III gliomas. In detail, we used univariate Cox regression analysis to evaluate the association between survival time and the expression levels of selected genes, IDH1 mutations and age. A mathematical formula was constructed for the model [21]. The risk score of each patient was calculated as follows: Risk _ Score = i* exp (gene(i)) + β* mutation(IDH1) + γ * (Age) i, β and γ represent the regression coe cients of genes expression values, IDH1 mutations and age. All patients in the dataset were thus assigned to high-risk and low-risk groups using the median risk score as the cut-off point. The Kaplan-Meier method was used to estimate the overall survival time for the two subgroups, and differences in survival time were analyzed using the log rank test (R package "survival").

Analysis of public datasets
Two independent datasets of WHO Grade III glioma transcriptome data were used for validation, namely, the Chinese Glioma Genome Atlas RNA sequencing dataset (CGGA data) and GEO microarray dataset (GSE16011), as well as their corresponding clinical information (Table S2-S3). Because the raw data were not provided for CGGA data, the two sets of normalized data were directly downloaded and analyzed.

Statistical analysis
All statistical analyses in this study were performed using R software (http://www.r-project.org). T-Distributed Stochastic Neighbor Embedding (t-SNE) were implemented in the R package Rtsne. Gene expression pattern plots, circus plots and ROC curves were generated by the R packages ggplot2, OmicCircos, pROC, respectively. All statistical tests were two-sided, and a P value less than 0.05 was considered to be statistically signi cant.

Discovery of different outcomes of patients with Grade III gliomas
The gene expression pro les of all 225 samples were analyzed. After we normalized and ltered the probes, 20,358 genes were used in subsequent analyses. These genes were divided into 27 groups according to their expression, ranging from normal to Grade II to Grade III to Grade IV samples (Table S4 and Fig.1a). The expression patterns "uuu" (n=119), "nuu" (n=216), "ddd" (n=97), "ndd" (n=190) were used for further analysis. We performed a principle component analysis (PCA) with the 622 genes and found that patients with WHO Grade II gliomas and patients with Grade IV gliomas were clearly separated; however, patients with Grade III gliomas were mixed with patients in the other two grades (Fig.1b). Further analysis indicated that the prognoses of patients with Grade II or Grade IV were not signi cantly different within their group ( Figure S1a-d); however, Grade III glioma samples displayed large within-group differences (Fig.1c-d). Consideration of the prognostic outcomes of WHO grading could be a good indicator for predicting the survival of patients with gliomas ( Fig.1e-f). However, patients with Grade III gliomas had a range of outcomes, so we attempted to establish a model to assist the diagnosis and treatment of patients with Grade III gliomas.
Identi cation of differential genes in Grade III glioma samples We employed t-SNE and unsupervised clustering to divide the Grade III samples into two groups. Group1 had 22 samples and Group2 had 12 samples (Fig.2a). Then, we identi ed 506 differentially expressed genes through an unpaired Student's t-test; 298 genes were highly expressed in Group1, and 208 genes were highly expressed in Group2. The distribution of the differentially expressed genes on chromosomes is shown in Fig.2b. Based on the GO analysis, the signi cant biological processes of genes in Group1 were primarily associated with cell proliferation, cell adhesion, angiogenesis; however, highly expressed genes in Group2 were signi cantly enriched in cerebellar granule cell differentiation, central nervous system development and adult behavior. Survival analysis of the groupings of Grade III samples is shown in Fig.S1e-f. PPI analysis of the differentially expressed genes in patients with Grade III gliomas We used all differentially expressed genes in patients with Grade III gliomas to construct a PPI network in STRING. The largest connected subnetwork contained 340 nodes and 1,830 edges (Fig.3a). The nodes degree distribution in the subnetwork displayed a scale-free feature, and the power was 0.8 (Fig.S2). Then, we extracted discrete clusters from the subnetwork using the MCODE plugin in Cytoscape software. Three clusters were identi ed in the subnetwork, which included a total of 82 genes. The respective clusters had 44, 15, and 23 genes, and the corresponding gene numbers are listed in Table S5. Three clusters interaction networks are shown in Fig.3b-d. Then, we examined the functions of these 3 clusters using GO enrichment analysis. Up to 10 of the most signi cantly enriched GO biological process terms corresponding to each cluster are shown in Fig.3e-g. The most signi cantly enriched terms in Cluster1 were DNA repair and cell cycle, including microtubule-based movement, chromosome segregation, mitotic G2 DNA damage checkpoint, and mitotic chromosome condensation and so on; in Cluster2, the most enriched terms were chemotaxis, cell-cell signaling, in ammatory response and immune response, and in Cluster3, the terms were smooth muscle cell migration and protein ubiquitination, and patterning of blood vessels.

A model to evaluate the diagnoses and prognoses of patients with Grade III gliomas
We performed univariate Cox regression analysis to evaluate the association between survival time and all genes in the 3 clusters. In each cluster, we chose the rst two genes with the most signi cant P values. We ultimately chose CENPE and DEPDC1 in Cluster1, VIM and ANXA1 in Cluster2, and PLAU and ASB1 in Cluster3. In addition to these 6 genes, age and IDH1 mutations were used to construct a model to evaluate the diagnosis and prognosis of patients with Grade III gliomas (detailed information about the model is presented in Table 1). Regression coe cients with a plus sign indicate that increased expression was associated with decreased survival (risk factors), and conversely, a minus sign indicates that increased expression was associated with increased survival (protective factors). Survival analysis indicated that the model could predict the prognosis of patients with Grade III gliomas (P = 1.12e−03), and high risk scores were associated with poor survival (Fig.4a). In addition, we used two independent datasets of WHO Grade III transcriptome data to validate this model. The results revealed that high risk scores were signi cantly associated with poor survival in the CGGA data for patients with Grade III gliomas (P = 2.55e−04) and in GSE16011 for patients with Grade III gliomas (P =5.93e−06) (Fig.4b-c). ROC curve analysis for the model in patients with Grade III gliomas was performed, and the area under the curve was 80.0% (Fig.4d). The ROC area under the curve was 80.9% and 76.6%, respectively, in CGGA data and GSE16011 (Fig.4e-f). These results suggested that the model was able to predict the diagnosis and prognosis of patients with Grade III gliomas, assisting with post-surgical treatment management and providing a priori molecular targeting treatment.

Discussion
Using of gene expression data from glioma tissues to determine better treatment options is becoming increasingly common in clinical practice [22]. In our study, transcriptomic data from multiple stages in a total of 225 samples were used to analyze glioma progression.
Survival analysis of all glioma samples using only the WHO classi cation could clearly distinguish different grades. We found that WHO Grade III samples were mixed with Grade II and Grade IV samples and that the survival times of patients with Grade III gliomas were clearly different. That is, patients with Grade III gliomas may have completely different prognoses. Due to urgent need for individualized treatment, a method to assist with diagnosis and treatment is needed, especially for tumors that appear to be different within the same group.
We discovered that highly expressed genes in Group1 were associated with cell proliferation, immune response and angiogenesis; these biological behaviors are similar to those of malignant cells. In contrast, highly expressed genes in Group2 were mainly enriched in nervous development and neurological function, their molecular phenotype partially retained the characteristics of normal glial cells. Furthermore, the median OS of Group2 patients was 1,412 days with a 10-year survival rate of 50%, whereas the median OS of Group1 patients was 651 days with a 10-year survival rate of only 18%. Thus, the better the differentiation of tumor cells, the closer they are to the characteristics of normal cells, and the smaller the impact on patients.
For patients with Grade III gliomas, we proposed a model that included 6 genes, age and IDH1 mutations to evaluate diagnoses and prognoses. By consulting literatures, we know that age is associated with glioma patients' prognosis [23,24] and DH1 mutations or not commonly is thought to be a favorable independent prognostic indicator of glioma patients [25], although the regression coe cients of age and IDH1 mutations in our study were not signi cant, the sample size of Grade III was probably small. CENPE, a cell cycle related gene, has been reported to signi cantly increase with increasing WHO grades [26]. DEPDC1 expression is increased in glioma cell lines, tissues, and brain tumor initiating cells, and the suppression of endogenous DEPDC1 expression by small interfering RNA inhibited glioma cell viability and induced apoptosis through NFκB signaling [27]. VIM encodes a type III intermediate lament protein and has been reported participate in the epithelial-to-mesenchymal transition phenotype in human glioblastoma [28]. ANXA1, an invasion-associated gene, exhibits increased expression levels in glioblastoma multiforme (GBM) [29,30]. PLAU induced increased expression by Oncostatin M, which is involved in cell matrix remodeling, migration, proliferation control and angiogenesis [31]. ASB1 encodes a protein that contains an ankyrin repeat sequence and a SOCS box domain and is related to the innate immune system, but as far as we know, it has not been reported to be associated with gliomas.
To the best of our knowledge, this is the rst report to combine WHO classi cations with a molecular model to evaluate the diagnosis and prognosis of patients with Grade III gliomas. Furthermore, multiple, rather than single, factors are considered to drive the disease process in complex glioma tumors. Of course, the transcriptome represents only one level of biology, and we should account for other "-omics" data types, such as proteomics and epigenomics data.

Conclusions
Our results provide suggestions for the future development of a mechanistically based molecular risk estimation model for Grade III gliomas. The signature might better lend itself and be accessible to translation into clinical practice than other classi cation approaches, which could aid in the implementation of individualized medicine and treatment with optimal regimens. All patients signed informed consent forms and the consent was obtained in written. The use of human tissue samples and experimental procedures for this study were reviewed and approved by the Ethics Committee of the Cancer Institute and Hospital, Chinese Academy of Medical Sciences. The ethics committee approved this procedure. The dataset was anonymized prior to analysis.    PPI network analysis of the differentially expressed genes in Grade III samples. (a) PPI network of associated proteins encoded by the differentially expressed genes. The blaze orange nodes and the persian green nodes represent genes that were highly expressed in Group1 and Group2, respectively. The node size represents the fold change of genes. The edges represent correlation coe cients, and the colors of the edges represent the combined scores from STRING, where blue represents a high combined score and atomic tangerine represents a low combined score. (b-d) Three discrete clusters identi ed from the network. (e-g) The most signi cantly enriched GO terms in each cluster.