Data Collection and Preprocessing
We extracted the gene expression profiles of DKD (GSE96804,GSE30566,GSE99339 and GSE30528) from the GEO database using the GEOquery package43. The GSE96804 dataset 44 from Homo sapiens, based on the GPL17586 platform, contains 61 samples, including 20 normal glomerular and 41 diabetic glomerular tissues. All the samples were included in this study. The GSE30566 dataset45 from Homo sapiens based on the GPL571 platform contains 26 samples, including 13 normal glomerular and 13 normal renal tubular control; 13 normal glomerular samples were included in this study. The GSE30528 dataset45 from Homo sapiens, based on the GPL571 platform, contains 26 samples, including 13 normal glomerular and 9 diabetic glomerular tissues. All the samples were included in this study. The GSE99339 dataset46 was obtained from Homo sapiens, and the data platforms were GPL19109 and GPL19184. It contained 184 samples, including 13 diabetic glomerular tissues, all of which were included in this study. The data were normalized and de-batch processed using the sva package47 and standardized using the limma package48.Subsequently, 382 FRGs were obtained from the FerrDb49. Our study is based on open-source data; therefore, there are no ethical issues or conflicts of interest.
Construction and Verification of the LASSO Model
Currently, LASSO regression is a commonly used machine learning algorithm for the construction of diagnostic models. Regularization was used to solve the occurrence of overfitting in the process of curve fitting and to improve the accuracy of the model. The model was built using the GLMnet package50 with a parameter set.seed (2), family = "binomial."
Difference Expression Analysis
We used the limma package to calculate the differential expression of genes between the normal and DKD groups in the GEO microarray data with log fold change (logFC) > 0.5 and adjusted P-value < 0.01 as the threshold. The genes were upregulated in the DKD group if logFC > 0.5, and downregulated if logFC < 0.5. The results of the differential expression analysis are shown in the heat map using the R package pheatmap and the volcano map using the ggplot2 package51.
Functional Enrichment Analysis
GO analysis was used to conduct large-scale functional enrichment, including biological process (BP), molecular function (MF), and cellular component (CC) analyses. The KEGG database stores the genomes, biological processes, diseases, and medical information. GO biological process and KEGG pathway enrichment analyses of the DEGs were performed using the R clusterProfiler package52. The critical value of FDR< 0.05 was considered statistically significant52.
To study the differences in biological processes among different groups, we used GSEA for enrichment analysis according to the logFC arrangement based on the GSE96804 profile. GSEA is a computational method used to analyze whether a particular gene set has a statistical difference between the two biological states. In our study, we used GSEA to explore the differences in pathways and biological processes of the samples in the datasets. The "msigdb.v7.0.symbols" gene set were downloaded from the MSigDB53database for GSEA analysis.
In addition, the enrichment scores of related pathways in the MSigDB database were calculated according to the gene expression matrix of each sample using the GSVA method using the R-packet GSVA54.The differences in samples were screened using the limma package48. Enrichment items with statistically significant differences are shown in the heat map. According to the gene expression matrix of each sample, the enrichment scores of the FRGs were calculated using the ssGSEA method and are displayed in boxplots.
Gene Co-expression Analysis
WGCNA55 aims to identify co-expressed gene modules, analyze core genes in the network, and explore the relationships between modules and phenotypes. First, the soft threshold was calculated using the pickSoftTreshold function, and five was the best soft threshold. We then built a scale-free network based on the soft threshold. Next, a topology matrix and hierarchical clustering were constructed. We dynamically cut and identified the gene module and calculated the eigengenes, and the number of genes in each module was at least 50. The correlation between modules was constructed according to the eigengenes of the modules, and hierarchical clustering was performed. The modules were merged again with a correlation of more than 0.7, and finally, 10 modules were obtained. The correlations between modules and clinical features were explored using Pearson’s correlation analysis.
Consistent Cluster Analysis
Consistent clustering is a method that can be used to determine the members and number of possible clusters in a dataset (microarray gene expression). In order to distinguish different subtypes of DKD, we carried out a consensus clustering analysis related to FRGs in the DKD group in the GSE96804 dataset using "ConsensusClusterPlus" R package56. In this process, the number of clusters was set between 2 and 10; 80% of the samples were taken each time and calculated 100 times, clusterAlg = "hc” and distance = "euclidean."
Immune Infiltration (CIBERSORT)
CIBERSORT57 is an algorithm for deconvolution of the transcriptome expression matrix, according to the principle of linear support vector regression, to estimate the composition and abundance of immune cells among mixed cells. The gene expression matrix data (TPM) were uploaded to CIBERSORT, combined with the LM22 gene matrix, and an immune cell infiltration matrix with filtration (P < 0.05) was obtained. A bar chart was drawn using the ggplot2 package in R language to show the distribution of 22 immune cell infiltrations in each sample.
Statistical Analyses
All data calculations and statistical analysis were conducted using R programming (https://www.r-projec t.org/, 4.0.2 version). For two groups of continuous variables, the statistical difference in normal distribution variables was estimated using an independent Student’s t-test, and non-normally distributed variables were analyzed using the Mann–Whitney U test (Wilcoxon rank sum test). A two-tailed P < 0.05 was considered statistically significant.