3.1 Identification of DEGs using RRA Method
Table 3 describes the sample characteristics of the datasets included in our study, including dataset ID, platform ID, total number of samples, number of normal samples, number of OA samples, and grouping information. Differential expression was analyzed for each dataset, and based on the results of DEGs from each dataset, the RRA method identified a total of 202 significantly up-regulated and 49 significantly down-regulated DEGs. A heatmap displays the top 50 up-regulated and top 10 down-regulated DEGs (Figure 1A).
Table 3 Sample Characteristics of Included Datasets
GEO dataset
|
platform
|
samples
|
Total
|
normal
|
OA
|
group
|
GSE51588
|
GPL13497
|
Subchondral Bone
|
50
|
10
|
40
|
Train
|
GSE12021
|
GPL96
|
Synovium
|
19
|
9
|
10
|
Train
|
GSE55457
|
GPL96
|
Synovium
|
20
|
10
|
10
|
Train
|
GSE56409
|
GPL570
|
Synovium
|
22
|
11
|
11
|
Train
|
GSE114007
|
GPL11154/GPL18573
|
Cartilage
|
38
|
18
|
20
|
Train
|
GSE168505
|
GPL16791
|
Cartilage
|
7
|
3
|
4
|
Train
|
GSE169077
|
GPL96
|
Cartilage
|
11
|
5
|
6
|
Train
|
GSE55235
|
GPL96
|
Synovium
|
20
|
10
|
10
|
Test
|
GSE129147
|
GPL6947
|
Cartilage
|
40
|
7
|
33
|
Test
|
GSE48556
|
GPL6947
|
Blood
|
139
|
33
|
106
|
Test
|
3.2 Identification of DEGs using SVA Method
Box plots (Figures S1A-S1B) and PCA cluster plots (Figures S1C-S1D) present the gene expression matrix of the seven combined datasets before and after removing batch effects. The results indicate that, after normalization and batch effect adjustment, the batch effects between different datasets are not obvious, suggesting that the final combined dataset is suitable for subsequent analysis. The volcano plot and heatmap in Figures 1B-1C show the genetic differences between OA and normal controls in the combined dataset. Among these, 30 genes were found to be down-regulated and 64 genes were up-regulated. A total of 83 genes, identified by both RRA and SVA methods, were considered key genes for OA and retained for further analysis (Figure 1D).
3.3 Functional Enrichment and Pathway Analysis of Key Genes
We analyzed the GO annotations, KEGG pathways, and DO enrichment for these key genes to explore their potential biological information. GO analysis results indicate that, in the biological process (BP) category, the genes are involved in extracellular matrix organization, extracellular structure organization, and external encapsulating structure organization. In the cellular component (CC) category, they are associated with the collagen-containing extracellular matrix, complex of collagen trimers, and fibrillar collagen trimer. The enriched molecular function (MF) terms include extracellular matrix structural constituent, extracellular matrix structural constituent conferring tensile strength, and platelet-derived growth factor binding (Figures 1E,1H). KEGG enrichment results show that key genes are primarily enriched in the Relaxin signaling pathway, Protein digestion and absorption, AGE-RAGE signaling pathway in diabetic complications, and Rheumatoid arthritis (Figures 1F,1I). DO enrichment analysis results (Figures 1G,1J) indicate that the key genes are enriched in diseases including degenerative disc disease, connective tissue cancer, cell type benign neoplasm, Osteoarthritis, and musculoskeletal system cancer.
3.4 Construction of PPI Network for Key Genes
We used the key genes to construct a PPI network in STRING to understand the potential connections between proteins, with a minimum required interaction score of 0.4 and a PPI enrichment p-value <1.0e-16. This score indicates that the connections have a high level of confidence. To understand the expression of interacting proteins in OA, we scored the protein-protein interactions and combined the PPI network with scores greater than 0.4. We used Cytoscape software to visualize the PPI network constructed in STRING (Figure 2A). To understand the interactions between hub genes, we used the 'cytoHubba' algorithm to identify hub genes. We considered all computational methods in the cytoHubba algorithm ("MCC", "DMNC", "MNC", "Degree", "EPC", "BottleNeck", "EcCentricity", "Closeness", "Radiality", "Betweenness", "Stress", "ClusteringCoefficient") as hub genes, and only those identified by all methods were included as final hub genes. In the end, we identified 25 hub genes (Figure 2B): COL1A2, COL3A1, POSTN, COL11A1, CDH11, SULF1, FAP, MMP13, MMP1, OGN, THY1, JUN, CDH2, WNT5A, ATF3, CDKN1A, GAP43, PTN, DDIT4, STMN2, TOP2A, IRS2, IGFBP1, TUBB3, and SPOCK1. The interactions of these genes were visualized, with yellow representing up-regulated genes and pink representing down-regulated genes (Figure 2C). We also plotted the volcano plot and heatmap of the hub genes (Figures 2D-2E). GeneMANIA was used alongside PPI to evaluate the 25 hub genes and 20 interacting genes to predict relationships in terms of co-expression, shared protein domains, co-localization, and pathways (Figure 2F). The outer circle represents predicted genes, and the inner circle represents hub genes. Network analysis indicated that these genes are related to fibroblast proliferation, regulation of fibroblast proliferation, sensory organ morphogenesis, axonogenesis, extracellular matrix organization, banded collagen fibril, and collagen-containing extracellular matrix.
To further understand the function of hub genes in OA, we performed enrichment analysis on the hub genes. According to GO analysis, the biological processes (BP) enriched include extracellular matrix organization, extracellular structure organization, external encapsulating structure organization, axon development, and regeneration. The cellular components (CC) enriched include collagen-containing extracellular matrix, fibrillar collagen trimer, banded collagen fibril, endoplasmic reticulum lumen, and complex of collagen trimers. The molecular functions (MF) enriched include extracellular matrix structural constituent, extracellular matrix structural constituent conferring tensile strength, integrin binding, platelet-derived growth factor binding, and SMAD binding (Figures 2G-2H). KEGG analysis indicated that the hub genes are enriched in the Relaxin signaling pathway, IL-17 signaling pathway, AGE-RAGE signaling pathway in diabetic complications, Protein digestion and absorption, and PI3K-Akt signaling pathway (Figures 2I-2J).
3.5 Association between Hub Genes and TFs
Transcription factors (TFs) are involved in gene regulation. To explore the role of TFs, we used TRRUST to predict key TFs influencing OA through hub genes. The analysis of interactions between TFs and hub genes indicated that 49 TFs coordinated the regulation of 18 common DEGs (JUN is both a differentially expressed gene and a transcription factor), indicating a complex regulatory relationship (Figure 3A). The top-ranked (TFs), based on the complexity of gene regulation, include JUN, NFKB1, SP1, ETS1, AR, RELA, HDAC4, ATF2, ATF4, ESR1, EGR1, SP3, and TP53. Additionally, we validated the expression of several of these top-ranked TFs in OA (Figures 3B-3F). These findings reveal significant relationships between OA hub genes and TFs.
3.6 Identifying Diagnostic Biomarkers through Machine Learning
To identify key diagnostic biomarkers for osteoarthritis (OA), we conducted a thorough feature selection process in our study. Initially, we employed the `train` function from the caret package to predict the optimal machine learning algorithms. The dataset was divided into training and validation sets, with training set proportions of either 0.7 or 0.8. Among the six machine learning methods evaluated, XGBoost, SVM, and Lasso regression consistently ranked high in terms of residuals and root mean square error (Figures 4A, 4C), with areas under the ROC curve exceeding 0.9 (Figures 4B, 4D), indicating superior diagnostic efficacy and learning performance. Consequently, we selected XGBoost, SVM, and Lasso regression to screen for key diagnostic biomarkers for OA.
Using the XGBoost algorithm, we identified 23 candidate genes (Figure 4E). The SVM algorithm identified 23 key genes (Figures 4F-4G), and Lasso regression selected 26 candidate genes (Figures 4H-4I). By intersecting the feature genes identified by these three machine learning methods, we pinpointed 15 candidate genes: IRS2, ADM, SIK1, PTN, CX3CR1, WNT5A, IL21R, APOD, CRLF1, FKBP5, PNMAL1, NPR3, RARRES1, ASPN, and POSTN. Further intersecting these candidate genes with the hub genes, we identified IRS2, PTN, POSTN, and WNT5A as diagnostic biomarkers for OA (Figure 4J).
Next, we constructed box plots for these candidate genes, which showed differential expression in the dataset. We used ROC curves to evaluate the specificity and sensitivity of these four features in distinguishing OA from normal tissues in the test set. The diagnostic values of the four genes were as follows: IRS2 (AUC = 0.879, 95% CI: 0.826-0.927), PTN (AUC = 0.784, 95% CI: 0.707-0.852), POSTN (AUC = 0.668, 95% CI: 0.583-0.748), and WNT5A (AUC = 0.783, 95% CI: 0.706-0.857) (Figure 5A). Since the AUC of the POSTN gene in the test dataset was 0.668, which is less than 0.7, we ultimately selected IRS2, PTN, and WNT5A as the diagnostic biomarkers for OA.We then constructed a diagnostic nomogram for OA based on the training dataset to develop a clinically applicable diagnostic model for OA (Figure 5B). The clinical calibration curve (Figure 5C) and clinical decision curve (Figure 5D) of the model clearly demonstrated its high predictive ability for OA (AUC = 0.913, 95% CI: 0.866-0.953, Figure 5E).
3.7 Validation of Relevant Screening Genes
To better validate the clinical diagnostic capability of the candidate genes, we used the cartilage tissue dataset GSE129147, the synovial tissue dataset GSE55235, and the blood sample dataset from OA patients GSE48556 for verification, with sample sizes of 40, 20, and 139, respectively. We constructed violin plots of the candidate genes, which showed differential expression in the datasets (Figures 6A-6C). Additionally, we plotted ROC curves to evaluate the diagnostic value of each gene in different tissue samples (Figures 6D-6F).
Among these, PTN had lower diagnostic efficiency in cartilage samples, POSTN had poor diagnostic efficiency in synovial samples, and only IRS2 exhibited high differential expression (p<0.01) in all samples with good diagnostic value, indicating a high clinical value.
3.8 Experimental Validation of IRS2 Gene
To test the reliability of the candidate gene IRS2 in OA patients, we simulated OA by adding IL-1β to human SW1353 chondrocytes. We identified the protein level expression and relative mRNA expression level of the IRS2 gene in this sample through WB and qRT-PCR. Additionally, we checked OA-related phenotypes to ensure the accuracy of the OA simulation. The study data indicated that, compared to the control group, OA-related markers COL2A1 and SOX9 decreased, while MMP13 and COX2 significantly increased, consistent with OA characteristics. In the OA group, the protein and mRNA expression levels of the IRS2 gene were significantly reduced, showing statistical differences (Figures 7A-7C).