3.1 microarray data characteristics
Three independent microarray information were included in this study, involving three independent clinical trials, which were taken from geo database, including 277 samples of gse37768, gse76925 and gse130928 (including 151 patients with chronic obstructive pulmonary disease and 126 healthy subjects). All provided clinical information such as age, lung function, smoking and body mass index.
3.2 Eliminate batch effect through cross platform standardization
In order to eliminate the batch effect from different platforms and different batches, we use the combat method to eliminate the batch effect between data sets. A total of 15326 genes were detected on three microarray platforms. Before eliminating the batch effect, the samples were clustered by batch according to the first two main components (PCS) of the non standardized expression value (Fig. 1a). In contrast, the scatter diagram is standardized based on principal component analysis. The results show that the samples of three batches are mixed together, indicating that the batch effect caused by different platforms has been thoroughly understood (Fig. 1b). The results show that cross platform standardization successfully eliminates the batch effect after batch correction.
3.3 Consensus clustering of COPD cases
The expression files corrected by batch effect and the sample information of disease group were used for cluster analysis. 151 patients with COPD were divided into subgroups.According to the consistency scoring in data statistics, the consistency cluster analysis of gene expression profile is divided into three subgroups, in which the number of cases in subgroups I, II and III are 55, 49 and 47 respectively, which have significantly different expression patterns (Fig. 2a).On the contrary, based on the consistency matrix, a high similarity of gene expression patterns was observed in each subgroup. Generally speaking, the higher the consistency score, the more stable the high score type is.In the results of this study, when divided into 2 groups and 3 groups, the cluster consistency score of each subgroup is higher than 0.8, which indicates that these classifications are more robust than other clusters, and it is better when considering more groups (Fig. 2b). Therefore, 151 patients with chronic obstructive pulmonary disease were divided into 3 subgroups.By describing the clinical characteristics of the three subgroups and analyzing the data of 151 cases of chronic obstructive pulmonary disease, it was found that the age of subgroup I was younger (Fig. 3a), and the age of subgroup II and III was older. Smoking time in subgroups II and III was significantly longer than that in subgroup I (Fig. 3b), indicating that subjects in subgroups II and III were heavier in subgroup I, while subgroup I developed earlier. However, no differences were found in age and smoking time in subgroups II and III.
Table 1 The number of differentially expressed genes by case-control and case‐case comparisons and weighted gene co‐expression analysis modules in each subgroup
Transciptome classification
|
#of upregulated genes by case-control comparison
|
#of downregulated genes by case-control comparison
|
#of subgroup-specific upregulated genes by case-control comparison
|
Modules
|
Ⅰ
|
74
|
5220
|
1076
|
Brown
|
Ⅱ
|
418
|
1257
|
4271
|
Blue and Yellow
|
Ⅲ
|
562
|
28
|
2339
|
Turquoise
|
3.4 Identification of gene coexpression modules of subgroups
In order to reveal the gene differences between COPD subgroups, WGCNA analysis was carried out under the expression level of specifically up-regulated genes in each subgroup.Through the paired differential expression analysis between each two subgroups, 1076, 4271 and 2339 genes were specifically up-regulated in subgroups I, II and III respectively (the corrected threshold was p < 0.05, and the absolute difference of the average value was > 0.2).Differential expression analysis was performed by comparing the gene expression profiles of each subgroup with those of the normal control group.The results of GSEA enrichment analysis showed that there was also significant difference between the specific up-regulated genes in the subtypes and the normal samples (Fig. 4a-c, FDR < 0.05). It is worth noting that the number of up-regulated genes in each subgroup I is relatively less than that in the control group.
Based on the expression levels of 7686 specifically up-regulated genes in the subgroup, a gene heat map expression network was constructed, which identified five WGCNA modules (Fig. 4D、Table 1).By analyzing the relationship between WGCNA module and corresponding subgroups and KEGG enrichment, it was found that PPAR signaling pathway, regulation of actin cytoskeleton and platelet activation pathway were significantly enriched only in brown module (Fig. 5a), and brown module was significantly enriched in sub I, indicating that subjects in sub group I showed airway remodeling characteristics; Glycerol phospholipid metabolism, peroxisome and neuroactive ligand receptor interaction are only significantly enriched in the blue module, trap interaction in vesicle transport and bacterial invasion of epithelial cells are only significantly enriched in the Yellow module, vascular smooth muscle contraction is only significantly enriched in the blue and yellow modules, and yellow and blue modules are significantly enriched in subgroup II, The results showed that the subjects in subgroup II showed the characteristics of metabolic activity; JAK-STAT signal pathway, PI3K Akt signal pathway and HIF-1 signal pathway are significantly enriched in the green module, and the green module is significantly enriched in subgroup III, indicating that the subjects in subgroup III show inflammatory characteristics.However, the gray module did not find significant enrichment in the corresponding subgroups.These findings suggest that each subgroup has its specific functional modules or pathways that can regulate the occurrence or progression of COPD.
3.5 Clinical features and WGCNA module discussion
Through the correlation module, correlation coefficient and corresponding p value of clinical features and WGCNA, the relationship between the characteristic genes of each module and BMI index or FEV1 is analyzed and calculated. It is found that the characteristic genes are represented by the characteristic vector of gene expression matrix.Brown and gray modules were not associated with BMI index or FEV1 (Fig. 5b), indicating that the airway remodeling process in subgroup I was not significantly correlated with BMI index or FEV1. In contrast, blue and yellow modules were negatively correlated with BMI index or FEV1, and the difference was statistically significant, indicating that patients with low BMI index and high FEV1 were metabolically active in subgroup II; The green module was positively correlated with BMI index or FEV1, and the difference was statistically significant, indicating that patients in subgroup III showed a continuously enhanced inflammatory response with the increase of BMI index and the decrease of FEV1.Our results further suggest that the WGCNA module is associated with some clinical features, such as BMI index or FEV1.