Bioinformatics approach for potential genes associated with osteoarthritis

Background: Osteoarthritis is the main cause of disability and pain. Due to limited understanding of the disease mechanism, it is generally dicult to diagnose early and eliminate sequelae. The aim of this study was to identify novel biomarkers for osteoarthritis (OA) using a bioinformatics approach. Method: The gene expression dataset GSE82107 were obtained from the Gene Expression Omnibus (GEO) database. Matrix quality assessment was performed using corrplot packages and principal component analysis. The differentially expressed genes (DEGs) picked out using GEO2R tool. Enrichment analyses were performed using The Database for Annotation, Visualization and Integrated Discovery and Gene Set Enrichment Analysis (GSEA). Weighted correlation network analysis (WGCNA) was used to nd gene modules highly associated with OA. Cytoscape with Molecular Complex Detection (MCODE) plug-in was utilized to analyze protein-protein interaction of these DEGs. Receiver operator characteristic curve analysis was used to evaluate the diagnostic effectiveness of genes. Results: Samples with different conditions (HC and OA) are obviously distinguished, and the distance between biological replicates is relatively close. In total, 1676 DEGs were identied. Enrichment analysis showed that there were some gene sets related to OA pathology, such as chondrocyte development. The results of WGCNA analysis showed that 298 genes were most positive associated with OA. 10 common genes obtained were selected as candidate core genes. ROC results showed that 5 of these genes had the greatest diagnostic value. Conclusion: This is the rst study to identify biomarkers related to OA by combining multiple algorithms such as GSEA, WGCNA, MCODE and ROC. We suggest that GLG1, PAPSS2, CTSK, TIMP1 and SDC1 could serve as valuable biomarkers. Further studies are needed to examine the precise role and mechanism of these genes in OA.

gene sets related to OA pathology, such as chondrocyte development. The results of WGCNA analysis showed that 298 genes were most positive associated with OA. 10 common genes obtained were selected as candidate core genes. ROC results showed that 5 of these genes had the greatest diagnostic value. Conclusion: This is the rst study to identify biomarkers related to OA by combining multiple algorithms such as GSEA, WGCNA, MCODE and ROC. We suggest that GLG1, PAPSS2, CTSK, TIMP1 and SDC1 could serve as valuable biomarkers. Further studies are needed to examine the precise role and mechanism of these genes in OA.

Background
Osteoarthritis (OA) is considered to be the most major chronic joint disease, which is the main cause of pain and disability [1,2]. It is estimated that the percentage of the aging population over the age of 65 will more than double in the next 20 years in Asia and the incidence of OA is rising due to the increase in the ageing population [3]. Clinicians recognize that OA is usually diagnosed later in the process of the disease, and the incidence of serious adverse events caused by treatment such as total knee arthroplasty is higher than non-surgical treatment [4]. Therefore, despite efforts to develop disease markers in the past, it is still necessary to identify effective biochemical markers to reliably describe OA pathological processes, or to make an early diagnosis of OA, or to follow the course and therapeutic effects of OA.
Bioinformatics has become an important part of scienti c research to analyze, identify and interpret biomarkers and therapeutic targets [5]. Many bioinformatic studies on OA identi ed uncover differentially expressed genes (DEGs) directly, but quite a few bioinformatics studies on OA have not previously conducted quality assessments on gene expression matrix, which may lead to reduce the reliability of results [6]. Though DEGs usually have a signi cantly different expression, enrichment analysis only for DEGs may also miss important genes with insigni cant expression in OA. In addition, the interesting gene sets enriched by Gene Set Enrichment Analysis (GSEA) are often considered as candidate for hub genes.
But these genes may be in a non-concentrated interaction region of the whole biological network leading to the poor linkage between genes. Weighted correlation network analysis (WGCNA) uses gene expression datasets to construct weighted gene co-expression network mining synergistically expressed gene modules and exploring the relationship between gene modules and biological phenotypes [7]. However, few studies have used WGNCA to identify the co-expression gene modules of OA to nd core genes. The bioinformatic studies using this method on OA only included DEGs for WGCNA analysis, which may also lead to bias of the results. Besides, gene modules associated with biological phenotypes and statistical signi cance analyzed by WGCNA are performed in GSEA to identify core genes. These designs focus on preset subsets of the gene expression dataset may cause some implicit information to be ignored in all gene sets.
In the present study, the gene expression dataset GSE82107 was selected from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo) [8]. Data quality assessment was performed using correlation analysis and principal component analysis (PCA) rstly, then the GEO2R online analysis software (https://www.ncbi.nlm.nih.gov/geo/geo2r/) was used to uncover DEGs. In addition, the functions of the DEGs and all genes were analyzed respectively, using the Database for Annotation, Visualization and Integrated Discovery (DAVID) and GSEA [9][10][11]. The study also used WGCNA to construct gene co-expression network and explore the correlation within gene modules and biological phenotypes of OA screening important gene modules [7]. Using these loci, a protein-protein interaction (PPI) network of DEGs was obtained and network analysis and Molecular Complex Detection (MCODE) were performed in order to get degrees of connectivity for each gene and interactive modules [12]. We detected common genes in each gene set obtained based on the above methods and identi ed the relationship between genes and musculoskeletal diseases using comparative toxicogenomics database (CTD) [13]. Receiver operator characteristic (ROC) analysis was used to evaluate the diagnostic effectiveness of genes providing a basis for further studies.

Data
The gene expression dataset GSE82107 was obtained from the publicly accessible GEO database (http://www.ncbi.nlm.nih.gov/geo/). The study was based on the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array; Affymetrix; Thermo Fisher Scientifc, Inc., Waltham, MA, USA). The series matrix le of GSE82107 was downloaded from the GEO database. The dataset includes synovium samples from 7 individuals without a joint disease and 10 patients suffering from end-stage OA [14].

Matrix quality assessment
After the gene expression matrix is obtained, it is important to test and analyze the relationship between biological samples and experimental design. Usually, the methods of sample clustering and PCA are used to calculate and visualize the matrix. In this study, generate the numerical correlation matrix using cor, and we use the corrplot packages to graph the correlation matrix [15]. At the same time, p-values were calculated, which was concidered to be statistically signi cant at P < 0.01. In addition, PCA online tool was used to show the distribution of sample points on the two-dimensional plane, and the distribution of sample points indicates whether samples in the same group are united, or samples in different groups are obviously separated.
Data processing for DEG identi cation GEO2R was used to identify DEGs among experimental samples. GEO2R offers a convenient interface enabling sophisticated R-based analyses of GEO data and is useful to identify and evaluate DEGs [8]. P < 0.05 and |logFC| ≥ 1.5 were set as the thresholds for DEGs detection.

Enrichment analysis of DEGs
Gene Ontology (GO) analyses used to annotate genes or gene products and to determine biological characteristics of high-throughput genome or transcriptome data and Kyoto Encyclopedia of Genes and Genomes (KEGG) that is a group of databases for various biological data, including genomes and biological pathways were carried out using DAVID v6.8 (https://david.ncifcrf.gov/) [16]. The Functional Annotation Tool of DAVID was used and then the upregulated and downregulated DEGs were inserted into the tool for GO and KEGG analysis, respectively. The data was downloaded and P < 0.05 was used to indicate a statistically signifcant difference. Besides, GSEA (http://www.broadinstitute.org/gsea/index.jsp) was performed to explore whether the identi ed sets of genes showed statistical differences. | Normalized enrichment score (NES) | > 1 and NOM P value < 0.05 were used to determine the statistical signi cance.

Construction of the gene co-expression network
Because a sequencing data set may have tens of thousands of probes, many of these genes may not be expressed very differently in each sample. In order to reduce the computational cost, the top 25% of variance genes were screened and weighted co-expression modules were constructed using the WGCNA package [7]. Firstly, we clustered the gene sets and removed them if there were signi cant outliers. Next, an appropriate soft threshold based on the standard scale-free network was selected and network and module detection were constructed by one-step method. Subsequently, gene modules were correlated with the biological phenotype, and 1000 genes were randomly selected to visualize the correlation of the genes in the modules. In addition, we performed cluster analysis based on gene expression and obtained correlations between each module.

PPI network and module analysis
The Search Tool for the Retrieval of Interacting Genes (STRING) v11 (https://string-db.org/) is an online tool designed to evaluate PPI information and this was used to detect potential associations among the DEGs [17]. The results were input into Cytoscape v3.7.1 (http://www.cytoscape.org/) to map the associations among the DEGs [18]. A con dence score > 0.4 was set as criteria. MCODE v1.5.1 (Bader Lab, University of Toronto, Toronto, Ontario, Canada) was used to evaluate the interactive clusters in the PPI network in Cytoscape. The study combined degree, GSEA functional enrichment modules, gene coexpression network and DEGs interactive clusters to identify the common genes. The CTD (http://ctdbase.org/) was used to nd integrated chemical-gene, chemical-disease, and gene-disease interactions to generate expanded networks and predict novel associations. We used these data to analyze relationships between gene products and musculoskeletal diseases. Thus, relationships between genes and OA and an implied association or association were identi ed. Receiver operator characteristic (ROC) curve analysis was used to evaluate the diagnostic effectiveness of genes using expression pro ling datasets in GraphPad Prism 7.0 (GraphPad Software, Inc., La Jolla, CA) and P < 0.05 was considered as a statistical alteration.

Access of expression patterns and identi cation of DEGs.
The study calculated the correlation based on the gene expression matrix in the chip and performed cluster analysis in samples from GSM2183532 to GSM2183548 to explore the composition of the dataset, the potential classi cation and the internal relationship using the corrplot package to draw the correlation matrix. The results show that 17 samples can be basically distinguished in hierarchical clustering, and samples with the same phenotype are mainly clustered together (Fig. 1A). Meanwhile, dimensionality reduction is performed by PCA, and the distribution of sample points is displayed on twodimensional plane. The study shows that the samples with different conditions (HC and OA) are obviously distinguished, and the distance between biological replicates is relatively close, indicating that the consistency of biological replicates and the difference of groups are comparatively obvious (Fig. 1B).
Subsequently, the gene expression pro les of 17 samples in the gene expression dataset GSE82107, including 7 control and 10 OA samples, were analyzed. A total of 1676 DEGs between the control and OA samples were identi ed, including 218 upregulated and 1458 downregulated genes (Fig. 1C).

GO function and KEGG pathway enrichment analysis.
GO and KEGG pathway enrichment analyses were performed using DAVID and GSEA in order to gain a comprehensive understanding of the functions of the genes. DAVID focuses on the gene sets contained upregulated or downregulated genes, while GSEA can observe the consistency of gene expression in entire dataset with speci c functional gene sets to interpret the biological information. The results of the GO analysis from DAVID indicated that upregulated and downregulated genes were enriched for various BP terms, which are showed in Supplemental Fig. 1 and Supplemental Table 1 (top 5). For upregulated DEGs these included 'response to organic substance', 'protein modi cation process', 'cellular protein modi cation process', 'cellular response to chemical stimulus' and 'cellular response to organic substance', and for downregulated DEGs they included 'regulation of signaling', 'regulation of cell communication', 'positive regulation of metabolic process', 'positive regulation of cellular metabolic process', 'regulation of signal transduction'. In the molecular functions category, the upregulated genes were enriched for 'receptor binding', 'protein complex binding', 'glycosaminoglycan binding', 'cytokine activity' and 'anion binding', while the downregulated genes were enriched for 'adenyl ribonucleotide binding', 'adenyl nucleotide binding', 'ATP binding', 'molecular function regulator' and 'cytoskeletal protein binding'. A cellular components analysis further demonstrated that the upregulated genes were enriched for 'extracellular region', 'extracellular region part', 'membrane-bounded vesicle', 'extracellular exosome' and 'extracellular vesicle', while the downregulated genes were enriched for 'intrinsic component of plasma membrane', 'integral component of plasma membrane', 'cell junction', 'neuron part' and 'plasma membrane region'. In addition, 10 KEGG pathways were identi ed, as listed in Supplemental Fig. 2 and Supplemental Table 2 (top 5), involving the 'cytokine-cytokine receptor interaction', 'chemokine signaling pathway', 'ECM-receptor interaction', 'TNF signaling pathway' and 'osteoclast differentiation' for upregulated DEGs, and 'neuroactive ligand-receptor interaction', 'calcium signaling pathway', 'cAMP signaling pathway', 'oxytocin signaling pathway' and 'dopaminergic synapse' for downregulated DEGs.
GSEA analysis revealed that 8 GO terms, including 'bone development' and 'bone morphogenesis', 'bone remodeling', 'cartilage development involved in endochondral bone morphogenesis', 'chondrocyte development', 'chondrocyte differentiation', as well as 'positive regulation of osteoblast differentiation', and 'regulation of bone development', shown signi cantly differential enrichment in OA phenotype based on NES and NOM P value (Fig. 2, Table 1). NES: normalized enrichment score. Size represents the total number of genes in the gene set. NOM P value indicates the credibility of enrichment results. FDR q value represents the corrected P value of multiple hypothesis testing. Tags represents the proportion of core genes in the total number of genes in the gene set, while list represents the proportion of core genes in the total number of genes. For a gene set, when the number of core genes is the same as the total number of genes under the gene set, the signal value is the largest.
3. Construction of weighted gene correlation network analysis.
In this study, WGCNA was used to construct the gene correlation module associated with the sample trait. In total, 5879 genes were included for analysis, and the soft threshold β was calculated before construction the weighted co-expression network. We set the correlation coe cient to 0.9 as screening criteria and calculated the value of soft threshold β as 6 (Fig. 3A). Then a total of 13 gene modules with different colors were recognized by hierarchical clustering (Fig. 3B) (Fig. 3C), and then transformed them into a topological overlap matrix (TOM) and visualized the system clustering tree of gene (Fig. 3D). Eigengenes was correlated with external traits in order to search for the signi cant associated modules. It was clear that the MEgreen (298 genes) was most positive associated with OA (Fig. 3E).

Sub-modules in the PPI network and identi cation of core genes.
We detected densely linked regions in large PPI networks evaluated by STRING online tool that may represent molecular complexes using an MCODE analysis, and several signi cant modules were identi ed. The 3 modules with the highest score were selected (Fig. 4A ~ Fig. 4C). The study also calculated degree using NetworkAnalyzer, and degree > 30 was set as the thresholds. So far, the gene sets obtained by the analysis of MCODE, Degree, GSEA and WGCNA were put together and the common genes were detected (Fig. 4D). The results showed that golgi glycoprotein 1 (GLG1), secreted frizzled related protein 2 (SFRP2), secreted protein acidic and cysteine rich (SPARC), 3'-phosphoadenosine 5'phosphosulfate synthase 2 (PAPSS2), vitamin K epoxide reductase complex subunit 1 (VKORC1) and cathepsin K (CTSK) were coexisted in GSEA and WGCNA gene sets, as well as TIMP metallopeptidase inhibitor 1 (TIMP1) and syndecan 1 (SDC1) were commonly contained in Degree and WGNCA gene sets, and C-C motif chemokine receptor 5 (CCR5) was covered by Degree, MCODE, and WGCNA gene sets. The CTD database showed that these genes targeted several musculoskeletal diseases and these data appear in Fig. 4E. ROC analysis showed that 5 genes such as GLG1, PAPSS2, CTSK, TIMP1 and SDC1 could serve as valuable biomarkers for distinguishing patients with OA from healthy controls (Fig. 4F, Table 2).

Discussion
OA is characterized by cartilage degeneration, subchondral bone changes and bone marrow lesions [19]. Although non-pharmacological interventions have made some breakthroughs in treatment of OA, improved treatment of non-invasively is lacking in the early or mid-stage OA phase [20]. One of the reasons is complexity of OA pathology and di culty of early diagnosis. Bioinformatics analysis reveals the biological mechanisms form large and complex biological data through a combination of biology, computer science, and information technology. In the present study, the gene expression dataset GSE82107 was obtained from the GEO database to explore core genes.
We analyzed the gene expression matrix to observe the relationship between samples. If the quality assessment results show that the biological repeatability of the samples is very poor or there are many interlaced regions between samples and groups, this may lead to a decrease in the reliability of the bioinformatics analysis results, even if the analysis results are nally obtained. However, quite a few bioinformatics studies on OA have not performed quality assessments. Our results indicated that the repeatability of the samples in the same group is relatively stable, and there are also obvious biological differences between different groups, thus laying relatively credible reliability for subsequent analysis. Afterward, the study applied different calculation methods to deal with the gene expression matrix. By comparing the difference of gene expression between two groups and setting a threshold, 1676 DEGs were ltered out, including 218 upregulated genes and 1458 downregulated genes. Gene functional annotation enrichment analysis is a high-throughput research strategy that increases the likelihood for researchers to pinpoint the biological processes most relevant to them [21]. However, the differences in expression of most genes may not be signi cant, and focusing only on these genes with signi cant changes may leave out or ignore a lot of information. Therefore, we also applied GSEA to analyze the enrichment properties of the entire gene expression data in speci c functional gene sets, and the results showed that some genes were enriched in bone development, bone remodeling, and cartilage development and differentiation. OA is associated with bone loss due to bone remodeling, and the slowing of bone turnover in the later stage of OA leads to the densi cation of the subchondral plate, which further hinders the signal exchange between cartilage and bone and inhibits the repair of damaged cartilage [22]. These genes included in the functional annotation sets may be involved in the pathological process of OA. In addition, PPI networks based on DEGs are important in most OA biological functions and processes, and most proteins appear to activate their function through their interactions [23], and WGCNA algorithm represents a new biological approach for detecting key genes associated with sample traits in gene co-expression networks among all datasets [24]. Besides, MCODE can also obtain interactive clusters by calculating the PPI network based on DEGs. In this study, the common genes in the gene set obtained by the above algorithm were searched as candidate core genes, including GLG1, SFRP2, SPARC, PAPSS2, VKORC1, CTSK, TIMP1, SDC1 and CCR5. Further, we explored the relationship between these genes and musculoskeletal diseases and found that all genes were involved in different musculoskeletal diseases such as osteoporosis, OA and arthritis. Moreover, SPARC, CTSK and PAPSS2 may be biomarkers of some diseases or play a role in the etiology of the diseases. However, when we carried out ROC analysis on these candidate genes, we found that only GLG1, PAPSS2, CTSK, TIMP1 and SDC1 could be served as potential biomarkers to distinguish patients with OA from healthy people.
GLG1 located in golgi apparatus belongs to the cysteine-rich broblast growth factor receptor family [25,26]. GLG1 is expressed widely, mainly in skeletal muscle, placenta, bone marrow, ovary and testis [27]. It may have a chaperone function that participates in the processing and targeting of growth factors in cells [28]. Bone morphogenetic proteins 4 (BMP4) promotes the healing of fractures by stimulating the synthesis of extracellular matrix in chondrocytes and BMP4 is reduced in patients with OA and rheumatoid arthritis [29]. Studies have shown that silencing GLG1 induces hormone transcription, suggesting that GLG1 may be a negative regulator of pituitary hormone transcription. And GLG1 has a negative impact on the expression of BMP4 [30]. Sulfation is a ubiquitous modi cation of exogenous and endogenous compounds. In mammals, PAPS is the only source of sulfate, which is produced by ATP and inorganic sulfate. PAPSS2 encodes PAPS synthase. It is speci cally expressed in cartilage and adrenal gland and may play an important role in bone development during postnatal growth [31,32]. The researchers found that the lack of PAPSS2 activity lead to degenerative knee joint disease in mice, while people who lacked normal PAPSS2 activity showed shortening and bending of long bones, as well as degenerative arthropathy [33]. In addition, PAPSS2 may be a candidate gene transactivated by sexdetermining region Y-box containing gene 9 (SOX9), and thus it may act as a regulator of SOX9 expression in cartilage [34]. The protein encoded by CTSK is lysosomal cysteine protease, which is a member of the peptidase C1 protein family and is mainly expressed in osteoclasts. CTSK is closely involved in bone remodeling and bone resorption [35,36]. Researchers performed a study on OA for four phenotypes, including hip OA, knee OA, knee and/or hip OA, and any OA. A genome-wide association study (GWAS) found that CTSK is a possible OA effector gene, which has mechanisms to support the evaluation of the e cacy of OA [37]. And this is similar to our ndings. Additionally, experimental studies have con rmed that CTSK −/− mice exhibit a reduction in the remodeling of subchondral bone and calci ed cartilage in the destabilization of the medial meniscus induced OA [38]. TIMP1 belongs to TIMP gene family, which encodes a protein that is a natural inhibitor of matrix metalloproteinase (MMP) involved in the pathology of OA [39]. The inhibitor works by forming a one-to-one complex with target metalloproteinases, such as collagenase [40]. By comparing the gene expression of human OA synovial broblasts stimulated by transforming growth factor β, experimental OA mouse synovium and end-stage OA human synovium, the results showed that the expression of TIMP1 was up-regulated under all these conditions [41]. Besides, clinical studies have found that TIMP1 level is associated with at least one radiographic grading in the evaluation of generalized OA [42]. SDC1 encodes a cell surface proteoglycan that is a member of the syndecan proteoglycan family. Syndecans mediate cytoskeletal organization, cell signaling and cell binding [43]. There is evidence that SDC1 expression changes in the meniscus in the early stages of OA, possibly because the pathological changes of the meniscus precede the lesions of cartilage [44]. Although there is relatively rare research on SDC1 in OA, it is related to the cytoskeleton that affects the shape of cells. According to the role of SDC1 in decentralized mechanical transmission [45], we speculate that it may be involved in the physiological or pathological process of OA by regulating the bone mechanostat by affecting the tensegrity of the cytoskeleton.

Conclusions
The hub-genes of GLG1, PAPSS2, CTSK, TIMP1 and SDC1 may be potentially valuable biomarkers or therapeutic targets for OA. However, in order to determine the precise role of these genes, further experimental veri cation is needed.