Identication of hub genes associated with accumulation of extracellular matrix as biomarkers of diabetic nephropathy

FSGS: RPGN:Rapidly progressive glomerular nephritis; LN:Lupus nephritis.


Background
Diabetic nephropathy (DN), a severe microvascular complication of diabetes mellitus (DM), represents a crucial cause of chronic kidney disease (CKD) which frequently leads to end stage renal disease (ESRD). DN develops in about 30% of patients with type1 DM (T1DM) and 40% of patients with type 2 DM (T2DM) [1]. Clinically, DN is characterized by progressive renal damage re ected by persistent albuminuria, decline of glomerular ltration rate (GFR), hypertension, and excess mortality due to ESRD or cardiovascular complications. A report from American Diabetes Association (ADA) indicated that DN was the most common cause of ESRD in many countries in the world including the United States, Japan and Europe.
Patients with DN accounted for 25-45% of patients enrolled in ESRD programs [2]. Taking into account the high incidence and heavy health and public burden of ESRD, the early diagnosis and justi ed management of DN are of great clinical importance.
The clinical diagnosis of DN is based on estimated GFR (eGFR) and albuminuria, combined with the clinical characteristics of DM, such as the course of disease and the existence of diabetic retinopathy [3]. DN is de ned as persistently urinary albumin to creatinine ratio (ACR) ≥ 30 mg/g and/or sustained decline in eGFR < 60 ml/min/1.73 m 2 [1,4]. For patients with atypical clinical manifestations of DM and renal involvement, renal biopsy remains the "golden standard" of nal diagnosis. Morphological changes include thickness of glomerular and tubular basement membranes, mesangial expansion, and typical glomerulosclerosis with nodular mesangial lesions (Kimmelstiel-Wilson nodule). However, the elevated level of albuminuria and the reduction of eGFR are not exclusive to DN, which are challenged when patients with DM develop CKD with atypical clinical features. The application of renal biopsy subjects to great restrictions because of its strict indications and the risk of internal bleeding [5]. There is an increasing need for development of novel biomarkers of DN to allow for more accurate diagnosis.
Technologies of microarray and RNA sequencing have made great progress in the past two decades. As an algorithm of systems biology, weighted gene co-expression network analysis (WGCNA) enables the identi cation of clusters of biologically related genes and it can associate the gene clusters (often called gene modules) with speci c phenotypes or clinical traits, such as the pathological classi cation of disease. WGCNA has emerged as a powerful approach to understand mechanisms and identify potential biomarkers or therapeutic targets [6][7][8]. In view of its advantages, analyzing the whole transcriptional characteristics is of great signi cance for understanding pathogenesis and discovery of speci c biomarkers for glomerular diseases. In 2012, by comparing the differentially expressed genes (DEGs) between DN and normal subjects, Wang et al [9] reported gene modules and several transcriptional factors related to the development of DN. To our best knowledge, WGCNA has so far not applied to identify speci c genes that distinguish DN from other pathological types of CKD.
In the present study, we analyzed whole-transcriptome gene expression data of microdissected glomeruli samples from various types of CKD including DN. WGCNA was used to group global genes into modules in an unsupervised manner and the modules speci c for DN were identi ed. Further bioinformatics tools, like gene ontology (GO) analysis and protein-protein interaction (PPI) network, allowed the identi cation of the most connected and central genes (called hub genes). The main goal of this study was to expand our understanding of the mechanism of DN, unravelling speci c novel candidate biomarkers that can differentiate DN from other types of glomerular diseases.

Microarray data processing
The mRNA expression matrix of GSE99339 [10] was downloaded from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/). This dataset was performed on the GPL19109 platform (Affymetrix Human Genome U133 Plus 2.0 Array). mRNA expression data was preprocessed and analyzed in R statistical environment (version 3.6.1). Background correction and normalization were performed by "normalize-BetweenArrays" function in the R package "limma" (version 3.6) in Bioconductor project (version 3.1) (http://www.-bioconductor.org/). After the initial ltering, the expression data of 10,947 probes were yielded. Then, the array annotation R package "hgu133plus2.db" (version 3.6) was used to map the microarray probes to the o cial gene symbol.
Construction of gene co-expression networks and identi cation of gene modules related to DN.
Gene co-expression network analysis was performed using the R package "WGCNA" (version 1.69), a coexpression network analysis approach proposed by Steve Horvath in 2005 [11]. The details of algorithm referred to the WGCNA manual (https://cran.r-project.org/web/packages/WGCNA/WGCNA.pdf). Brie y, Pearson correlation coe cient was calculated for all the genes. Then, a soft power threshold was used to transform the correlation matrix into a weighted adjacency matrix. Next, the adjacency matrix was transformed into a topological overlap matrix (TOM) and 1-TOM was calculated. Modules identi cation was accomplished with the method of dynamic tree cut by hierarchically clustering genes taking 1-TOM as the distance measure with a minimum module size cut-off of 30 genes. Highly similar modules were recognized by clustering and then merged together with a height cut-off of 0.25. Then, the module eigengenes (MEs) were regarded as gene expression pro les that best characterize the overall levels of modules. MEs were then analyzed for correlations with clinical traits of subjects, that was, the pathological diagnosis of renal biopsy. The gene signi cance was de ned as the correlation between gene and clinical traits [12]. The MEs of interests were selected by clinical traits of diagnosis of DN based on GS > 0.4 and with a threshold of P-value < 0.05.

Gene ontology (GO) enrichment analysis
To interpret the biological function of the interested modules, the visualization of GO enrichment terms of the selected module associated with DN was performed using R package "clusterPro ler" (version 3.11) [13] in the Bioconductor project. Adjust P value < 0.05 was considered as statistically signi cant.

Recognition of hub genes and further analysis
The online platform, STRING (http://string-db.org), was utilized to construct protein-protein interaction (PPI) networks within the interested modules. Those with the highest degrees, de ned as the edges' number incident to the nodes, were recognized as hub genes. "Cytoscape" software (version 3.7.2) was utilized to analyze the characteristics of each node of PPI and hub genes were identi ed by Maximal Clique Centrality (MCC) method in "cyto-hubba", a plug-in of Cytoscape software. The log2 transformed expression levels of hub genes were compared in different groups of glomerular diseases. Receiver operator characteristic (ROC) curves and the areas under ROC (AUCs) were generated to evaluate the e cacy of hub genes in the diagnosis of DN different from normal control and other pathological types of glomerular diseases.

External veri cation
Dataset GSE30528, a study consisted of mRNA pro les of 9 cases of glomeruli samples with DN and 13 healthy living donors, was downloaded from GEO database and used as external veri cation. The log2 transformed mRNA expression levels of hub genes were compared between DN and control group (healthy living donors). ROC curves were used to evaluate the e cacy of hub genes in diagnosis of DN. The clinical data of GSE30528 was downloaded from "Nephroseq" online open-access platform (http://v5.nephroseq.org). Pearson correlation was performed between mRNA levels of hub genes and eGFR values.

Statistical analysis
Statistical analysis was performed in R and GraphPad Prism 7.0 (GraphPad Software, Inc.). One-way ANOVA analysis or Kruskal-Wallis test was performed for comparisons among groups. Dunnett's or Dunn's test was used for multiple comparisons. Unpaired t test or Mann-Whitney test was used to evaluate statistical signi cance between the values of patients with DN and control group. The ROC curves were established and AUCs were calculated to evaluate the e ciency of hub genes in diagnosing DN. Pearson correlation coe cient was performed to assess the correlations between mRNA levels and eGFR. All tests were two tailed, with a value of P < 0.05 considered as statistically signi cant. Gene modules relevant to DN was identi ed via WCGNA To identify gene modules associated with pathological types of DN, WGCNA was applied based on the whole-transcriptome data of microdissected glomeruli samples. Power of 8 was selected to approximate scale-free topology (Fig. 1a). Modules were generated by dynamic tree cut and were merged with the number of genes less than 30 and the cutting heights of 0.25 (Fig. 1b, c). Then we got 23 modules for the whole transcriptome (Fig. 1d). MEs were calculated as the representative for modules. Pearson correlation coe cients were calculated between MEs and the clinical traits of pathological type. From the heatmap of module-trait correlations, we identi ed that the saddle brown module was the highest correlated with DN (r = 0.54, P = 9e-5), assigning 64 genes (Additional le 1: Table S1). The correlations with other groups of glomerular diseases were relatively weak (Fig. 1d).

Gene expression data
GO terms were mainly enriched in ECM accumulation and 6 genes were identi ed as hub genes In order to annotate the biological functions of genes in saddle brown module, the GO enrichment analysis was performed. GO analysis showed genes were mainly enriched in the component and biological process of extracellular matrix (ECM) (Fig. 2a). In addition, we constructed a PPI network (Fig. 2b) of the module genes. Hub genes with the highest degrees of connectivity in the PPI network were applied to MCC method (Additional le 1: Table S2). Genes of LUM, ELN, FBLN1, MMP2, FBLN5 and FMOD were recognized as hub genes with the top 6 MCC score (Fig. 2c, Table 1). These 6 hub genes showed relatively high AUCs levels for differential diagnosis of DN from other glomerular diseases. The AUCs of LUM, FBLN1, MMP2, FBLN5 and FMOD in the differential diagnosis of DN from other groups were higher than 0.7 (0.7078 ~ 0.9524). (Table 2). FMOD showed the best e ciency of discrimination (AUC, 0.8864 ~ 0.9381).

External validation
Another independent microarray dataset GSE30528, including 9 cases of glomeruli samples with DN and 13 cases of normal control, was used for external validation. The clinical data, including age, sex, serum creatinine and eGFR was downloaded from Nephroseq platform. The log2 transformed mRNA levels of hub genes and values of eGFR of subjects in GSE30528 were downloaded from Nephroseq platform (Additional le 1: Table S3). As shown in Fig. 4a-f. Except ELN, the expression levels of other 5 hub genes in DN group were statistically higher than that of normal control group (P < 0.01). The ROC curves showed that except ELN (AUC, 0.624), other 5 genes had relatively high AUCs in the diagnosis of DN (AUC, 0.778 ~ 0.992) (Fig. 5a-f). Furthermore, Pearson correlation analysis showed the expression levels of 6 hub genes were negatively correlated with eGFR ( Fig. 6a-f). Except ELN (r=-0.176, P = 0.533), the correlations of other 5 hub genes with eGFR were statistically signi cant (r = -0.487 ~-0.658, P < 0.05).

Discussion
In the present study, microarray data from 179 pathologically con rmed renal biopsy samples was analyzed, covering a variety of primary or secondary glomerular diseases. By de ning the pathological types as clinical traits and analyzing the correlation with gene modules, a cluster of genes with high expression speci city in DN was generated. WGCNA was performed in an independent study of large number of samples of multiple kidney diseases, avoiding the heterogeneity among different studies caused by a comprehensive analysis of multiple GEO datasets. Different from the analysis of DEGs [14][15][16], the genome-wide transcriptional co-expression network analysis has provided an unbiased description of gene expression network in DN. A gene module speci c for DN was identi ed re ecting the unique pathogenesis of DN. GO analysis revealed genes assigned by the module were mainly enriched in the components and biological process of ECM, which were consistent with the pathological characteristics of DN. Next, we identi ed 6 hub genes including LUM, ELN, FBLN1, MMP2, FBLN5 and FMOD in the speci c module and found that DN showed the highest expression levels. Furthermore, LUM, FBLN1, MMP2, FBLN5 and FMOD showed e cient diagnostic capacity of DN. In addition, these ndings were proved by external validation.
The hallmark of the pathogenesis of DN is an increased ECM accumulation which causes thickness of the glomerular and tubular basement membranes, followed by mesangial expansion, glomerular sclerosis, and tubulointerstitial brosis. The main components of ECM in DN include elevated expression of collagen I, collagen III, collagen IV (α1 and α2 chains), collagen V, collagen VI, bronectin, laminin, and small leucinerich proteoglycans (SLRP) [17][18]. The major physiologic regulators of ECM degeneration in glomeruli were matrix metalloproteinases (MMPs), mainly MMP-2 and MMP-9 [19]. Interestingly, among the 6 hub genes identi ed in this study, LUM and FMOD encode members of SLR, lumican and bromodulin and MMP-2 was critical in the ECM accumulation of DN.
Lumican and bromodulin, members of class II SLRP family [20] encoded by LUM and FMOD, were basically expressed in kidney [21][22]. In DN, lumican and bromodulin were mainly localized in areas of sclerosis scars and become progressively more evident with the extent of brosis in glomeruli and tubulointerstitium [23]. As soluble forms of lumican and bromodulin could be released into body uid [23], Ahn et al [24] reported lumican was identi ed belong to 13 up-regulated glycoproteins in DN plasma by multi-lectin a nity chromatography. Fibulin-1, a secreted glycoprotein encoded by FBLN1, was highly expressed in the capillary loop of DN glomeruli [25]. The plasma level in glomerulonephritis, DN and obstructive nephropathy were elevated compared with normal subjects [26]. In addition, Scholze, et al [27] found that plasma bulin-1 was related to cardiovascular risk in CKD patients, especially in patients of CKD and DM. MMP-2 degrades the components of ECM, including col-IV, bronectin, aggrecan, laminin, elastin, gelatin, and non-matrix substrates. In patients with DN, MMP-2 was increased in kidney samples and urinary MMP-2 was elevated in patients with albuminuria and established renal injury [28,29]. Elastin, encoded by ELN, was found accumulated in DN kidney samples and involved in the deposition of ECM in DN [30]. In the present study, its diagnostic value was not so satisfactory as that of the other 5 hub genes.
DN is usually a clinical diagnosis according to albuminuria and/or reduced eGFR in the absence of signs of other causes of renal disease. As renal biopsy is not indispensable to DN diagnosis, a proportion of nondiabetic renal disease indicated by renal biopsy is among patients clinically diagnosed with DN, especially MGN and IgAN [31,32]. The development of DN diagnostic biomarkers in plasma or urine will contribute to accurate clinical diagnosis. As mentioned above, lumican, bromodulin, bulin and MMP-2 can exist in blood or urine in soluble form, and the levels were elevated in patients with DKD or DM compared with normal subjects. It is of great signi cance to verify whether these hub genes have speci c diagnostic values for DN in clinical utility.

Limitations
The main limitation of this study lies in the lack of established clinical data of GSE99339. In addition, it is a data mining research based on bioinformatic analysis. There is still a need of validation cohort and direct evidences that hub genes can be used as biomarkers of DN, although the external validation in an independent dataset in our study supported our ndings.

Conclusions
In conclusion, through WGCNA and other bioinformatics tools, we identi ed a gene module speci cally

Declarations
Availability of data and materials The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Ethics approval and consent to participate Not applicable.