3.1 Characteristics of UC samples
To obtain reliable sample data, we selected three groups of ulcerative colitis mucosal biopsy samples in the GEO database, a total of 237 samples, including 197 UC samples and 40 healthy samples. The genetic data derives from the GEO database, and the three data sets are GSE87466(Telesco S et al,2019༌n = 108༌UC = 87), GSE48958༈Van der Goten J et al༌2018༌n = 21༌UC = 13༉, GSE59071༈Arijs I et al༌2018༌n = 97༌UC = 13༉respectively. Besides, the GSE87466 dataset provides information on age, and the GSE48958 provides data on whether the state of the disease is active or not. Information on gender and race was not available.
3.2 Eliminate batch effects through cross-platform standardization
We applied the combat method to batch calibration on different platforms of data sets. The WGCNA package constructed the modules. At last, we chose 16454 genes on 237 samples. First, we select the first two primary components to cluster the samples (Fig. 1A), which were determined according to the expression value unnormalized and carried out in batches. Then batch effect was eliminated. Then through the normalization process, a scatter diagram was obtained (Fig. 1B), which indicated batch effect was eliminated.
FIGURE 1 Primary component analysis on samples from a different platform. Different colors represent samples from different data sets. The scatter plot points are due to the first two primary components of the gene expression profile (PC1 and PC2) to visualize the sample. (A) Before batch-effect removal, while (B) removes batch effects. Figure 1A is the result before batch calibration, while Fig. 1B is the result after batch calibration. The batch effect has been eliminated successfully.
3.3 Consensus clustering of UC cases
We obtained expression (GSE48958, GSE59071, GSE87466) data from the Gene Expression Omnibus (GEO) database of NCBI. We applied the R package to recognize the abnormally methylated differentially expressed genes (MDEG). The clusterProfiler software package was executed to perform enrichment analysis on functions and pathways. Gene Expression Synthesis (GEO) objects were extracted by using R/Bioconductor package GEOquery. The GEO object consists of probe sets, gene array, and clinical features. The followed-up analyses were performed according to the GEO sample data. After removing the batch effect of gene expression profiles, consensus clustering (unsupervised clustering method) divided 237 UC cases into subgroups. According to the cluster consensus matrix, when the cluster consensus score (> 0.8), it had the highest stability. Therefore, the samples were divided into three groups. Then we performed consensus clustering and obtained three subtypes, among which there were 99, 58, and 48 cases in these groups, respectively. The analysis was based on gene expression profiles. Their expression patterns were significantly different. In contrast, based on the consensus matrix, a high degree of similarity in gene expression patterns was observed in each subgroup (Fig. 2A). According to the cluster consensus matrix result, when the cluster consensus score is higher than 0.8, it had the highest stability.
FIGURE 2 Consensus cluster on gene expression profiles in patients with ulcerative colitis (UC). (A) A consensus matrix that the cluster number is 3. The heat map is up to the minimum consensus score of the subtypes (> 0.8). (B)The bar graph represents the consensus scores of the subgroups with the number of clusters ranging from 2 to 10.
3.4 Comparison of clinical characteristics of subgroups
We conduct research on the clinical features of the three subtypes. The disease state and age data of UC cases were selected from the GSE48958 data set. Precisely, the active proportion in subtype III was markedly lower than subtype I and II,but it is evident that no remarkable divergence between subtype I and II on the proportion of active (Fig. 3A). However,there are no apparent differences between the three subgroups on age (Fig. 3B). Then the proportion of active diseases was visualized. Compare whether there was a difference in the proportion of active diseases between different types (Fig. 3A). The calculation of the pairwise function shows the proportion of diseases in the active state had a significant difference when the difference of p-value < 0.001.
FIGURE 3 The paired comparison between three subtypes on clinical features (PCA). (A) The bar-plot represents the proportion of active state. (B) displays the age of each subgroup. The abscissa is the classification of the samples into three groups, Ⅰ, Ⅱ, and Ⅲ. The ordinate represents the proportion of active diseases in figure A and the age of figure B. ***p-value < 0.001.
3.5 Transcriptome classification on GSEA
Gene Set Enrichment Analysis (GSEA) is an effective calculation means. It can determine the difference between gene sets. We mainly focus on the statistical discrepancy between biological features. There is no doubt that the discrepancy should be markedly and consistently. We used the predefined gene sets from functional explanation and ranked the genes of double sorts of samples. The ranking was mainly based on the differential expression. Subsequently, we checked the previous gene set enriched on the sorting table and identified if they were at the top or bottom. Through GSEA enrichment analysis, the specific differential genes in each type can be obtained, and whether the type is still different from the normal sample. The differential gene in the typing was the differential gene compared with the normal sample, indicating that the differential gene in the typing was also different when the typing compared with the normal sample (Fig. 4). The original chip data contained 16454 probes. No probe set = > gene symbol collapsing was requested, so all 16454 features were used. The genome size filter (15 ≤ n ≤ 5000) leads to screening out one gene set. One gene set was remarkably enriched on the condition of p-value < 1%.
The heat map was drawn with the limma and pheatmap packages. Analysis of the difference of classification was performed. Set the absolute value of the mean filter > 0.2 and the corrected p-value filter < 0.05 as the differential gene. The average value of multiple rows of genes was taken, deleted the genes whose expression was 0 in all samples. Limma was applied to analyze the difference and extract the up-regulated genes in each type.
Take the intersection of expression data and typing data to get the module to which each gene belongs. The samples were typed, and genes are annotated. The subgroup-I samples were up-regulated in the green-yellow gene module and salmon gene module. The subgroup-II samples were up-regulated in the brown gene module and cyan gene module and down-regulated in the black gene module. The subgroup-III samples showed up-regulation in the black gene module and down-regulation in the brown and salmon gene modules.
In order to reveal the differences between the transcriptome UC subgroups, each subgroup made up the WGCNA level of gene expression. Differential expression analysis pairwise between every two subtypes. The number of genes up-regulated in subgroups I, II, and III was determined to be 497, 2839, and 2835 (mean deviation > 0.2, p < 0.05). In addition, differential expression analysis was performed by contrasting the gene expression profile of every subtype with the control profile. The subpopulation, specifically up-regulated genes, was significantly up-regulated in the casecontrol (Fig. 4A,4B,4C).
FIGURE 4 Overview of ES scores in operation and the position of gene set members on the ranking list. Three subgroups specifically up-regulated gene expression patterns. In the enrichment map(A), (B), and (C), compared with the control group, the up-regulated genes of the specific subgroup have higher expression. (D) Gene expression value was shown in the heat map for six WGCNA modules. Different colors represent the level of gene expression. Low expression was shown in blue, and high expression was shown in red. The vertical line represents the differential gene of each type, and the abscissa represents the difference between the type n and the normal sample. The closer to the left lies, the more significant the genetic differentiation.
3.6 WGCNA analysis and modules
Before establishing an appropriate gene co-expression module, a suitable power value was chosen for the comparative value. The target gene expression level was extracted. Clustering was performed in the absence of missing values and power index range 1:20 to find the best power value. Obtained a scatter plot of the fitting index and power value and a scatter plot of connectivity and power value (Fig. 5A). Therefore, according to the R2 value is not less than 0.8, the slope is close to -1, we chose the best power value to be 7 in subsequent analysis.
Meanwhile, it can be found that when the scale independence was > 0.8, there was higher average connectivity. The gene distance was calculated, and the genes were clustered. Then dynamic cutting module identification was performed. Each module identifies at least 60 genes and obtains the module to which each gene belongs. Find similar modules, clustering was performed, and the module was gotten that each gene ultimately belongs to. The module data and the clinical shape data were intersected, the clinical data and the module data were tested, the p-value of the correlation test was obtained, and the visualization was performed to obtain the heat map of the module and the trait data. Find the module where the gene was located, and classify the gene. The limma and WGCNA packages were used to perform WGCNA analysis. Subsequently, the amount of the genes was calculated in every module. At last, according to genes amount, we depict these modules. We used different colors to represent and arranged from high to low.
FIGURE 5 WGCNA analysis and modules. (A) Scale independence and mean connectivity, based on which the optimal power value can be obtained. (B) Functional characterization on WGCNA modules and clinical peculiarity. The correlation coefficients of UC's age and disease state are shown in orange and blue, respectively. The ordinate represents the module to which the gene belongs, and the abscissa represents the clinical traits. The correlation analysis between the module and clinical traits shows that if p-value < 0.05, it means that the module is correlated with clinical traits. Blue represents a negative correlation, and orange represents a positive correlation.
3.7 Association of WGCNA modules and Clinical features.
We computed the pertinence coefficient or p-value between each module's activity level or age and the characteristic gene so that the correlation between clinical features and WGCNA modules could be studied (Fig. 5B). It is worth noting that the feature vector represents the feature gene on the gene presentation array of every module. It was not difficult to see that module 2 was positively correlated with age, while module 6 was negatively correlated. Modules 1, 5, and 6 were related to disease states actively, while modules 2, 3, and 4 were passively correlated with disease states. In addition, 1,3,4 modules were displayed regardless of age, which may be indicated that there was no correlation between disease progression and age. It can be seen from our research that WGCNA was correlated with certain clinical characteristics, such as disease states.
3.8 GO and KEGG enrichment analysis
The enrichment analysis was performed by using clusterProfiler and enrichplot in the R package. Set p-value༜0.05 and perform GO enrichment analysis on BP, CC, and MF, respectively. We compared the functions between modules through GO enrichment analysis. KEGG was a comparison between module pathways, and the enrichment conditions were the same as those of GO. BH or Benjamini & Hochberg(1995)was chosen as the calibration method. Then obtain the GO (Fig. 6) and KEGG (Fig. 7) enrichment pathway bubble chart.
FIGURE 6 GO enrichment bubble graph. Figure 7 KEGG enrichment bubble graph. The ordinate of bubble chart represents the name of the GO, and the abscissa represents the name of a module. The color is displayed according to the p-value. The redder the color, the more significant the enrichment of GO in the module. ▲: p-value༜0.05,Significantly enriched;●༚p-value༞0.05༌Enrichment is not significantly.