Data download, quality control, data filtering, standardisation of single cell sequencing
We downloaded three AS datasets from the Gene Expression Omnibus (GEO) database: the single-cell RNA-seq profile GSE157595 and the microarray datasets GSE25101 and GSE73754 and. The single-cell RNA-seq profile GSE157595 contains six AS patients extracted and mixed into one experimental sample and six healthy patients extracted and mixed into one control sample. The X-axis represents the names of the samples and the dots in the graph represent the cells for each assessment. The Y-axis reflects the number of genes in each cell. The distribution of genes in each cell is shown in Figure 1A and the depth of sequencing of the samples is shown in Figure 1B, the mitochondrial gene content was not screened in this study (Figure 1C). The X-axis represents the sequencing depth of each sample, and the Y-axis represents the number of mitochondrial genes. A correlation coefficient of 0 means that there is no direct relationship between the sequencing depth of the sample and the number of mitochondrial genes (Figure 2A). x-axis represents the sequencing depth of each sample, and the Y-axis represents the number of sequenced genes. A correlation coefficient of 0.13 means that there is a positive relationship between the sequencing depth of the sample and the number of sequenced genes (Figure 2B). x-axis represents the average expression of each gene. The X-axis represents the mean expression of each gene. The Y-axis represents the normalized variance. Genes with large normalised variances were screened and selected for the following analysis as they represent heterogeneity between cells. A total of 581 genes with large normalised variances were obtained (Figure 3A). The top 10 genes with normalised variance are shown in Figure 3B. The top 10 genes with normalised variance can be seen as: IGHA, IGHG3, IGLC2, LCN2, SRGN, CAMP, IGHG4, JCHAIN, IGHM, IGHG1.
Principal Component Analysis (PCA)
The main sources of variation were calculated in this study using variable genes as shown in PCA. the aim of PCA was to identify the characteristics of these variable genes. Figure 4A shows the overall distribution of cells in the samples in PC1 and PC2. Figure 4B shows the p-value values for each PC, with smaller p-values indicating greater importance in the main components of the samples. Figure 4C shows the top 20 signature genes expressed by each PC. Figure 4D shows the expression of the top 20 signature genes for each cell in each PC expression. pc1 was considered to be one of the most dominant principal components in PCA, with the top 20 genes being:DENND4B, KLK3, CCDC14, AR, RCE1, SAE1, ZNF577, TXNDC16, AMACR, ASH1L FBXO41, HNRNPC, SBNO1, KRR1, MALAT1, KLK2, COMMD2, ZNF254, C22orf29, FAM219B.
TSNE analysis of cell types and identification of marker genes
The results of the TSNE analysis classified the cells into 14 clusters (Figure 5C). After annotation (Figure 5D), we found that these cells could be divided into cluster 3 for T cells, clusters 0 and 12 for tissue stem cells, clusters 2 and 11 for monocytes, cluster 10 for progenitor B cells, cluster 13 for macrophages, clusters 1, 5 and 8 for B cells, clusters 4, 7 and 9 for erythrocytes and cluster 6 for myeloid cells. The scatter plot (Fig. 5A) shows the expression of the top 10 genes with the highest normalised variance in each cluster. The violin plot (Figure 5B) shows the expression of the top two genes in cluster 0, LEPR and COL1A2, in each cluster. The heat map shows the expression of the signature genes in each cluster 0-13 (Fig. 5E).
Pseudo-time and trajectory analysis
Fourteen clusters were annotated according to marker genes: clusters 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13. Figure 6 shows the pseudo-temporal and trajectory analysis, clusters 0, 1 due to stem cells; clusters 11, 12, 13 by macrophages, monocytes; clusters 2, 5, 6 by T cells; clusters 3, 4, 10 by B cells; clusters 7, 8, 9 by erythrocytes.
GO and KEGG analysis
We performed GO enrichment analysis on these differentially expressed genes to analyse their BP, CC and MF and obtained the results of all enrichment analyses while visualising the top 10 most important genes (Fig. 7A). Figure 7A shows that GO entries were mainly enriched for neutrophil, leukocyte chemotaxis, granulocyte, interferon, extracellular matrix, intracellular matrix, cell adhesion molecule binding, myocardin, actin and endoplasmic reticulum composition. The clustering diagram in Figure 7B demonstrates the enriched GO term. The KEGG pathway (Figure 7C) is mainly enriched for phagosomes, regulatory actin, atherosclerosis, cell adhesion molecules, Rap1 signalling pathway, rheumatoid arthritis, Th17 cell differentiation, haematopoietic cell lineage and allograft rejection. The clustering diagram in Figure 7D demonstrates the enriched KEGG term.
Disease Ontology Enrichment Analysis (DO) and Gene Set Enrichment Analysis (GSEA)
Disease ontology enrichment analysis (Figure 8E) showed enrichment mainly in primary immunodeficiency diseases, hereditary coagulation, atherosclerosis, bacterial infections, fungal infections, renal diseases, and urinary tract diseases. Analysis of GSEA enrichment in the AS group showed that the main pathways of GO enrichment were cotranslational protein targeting to membrane, establishment of protein localization to endoplasmic reticulum, mitochondrial translation, ncrna metabolic process and ncrna processing (Figure 8A), whereas the KEGG pathway was predominantly enriched. in complement and coagulation cascades, citrate cycle tca cycle, graft versus host disease, oxidative phosphorylation and proteasome (Fig. 8C).
CAPZA2 and TRIB1 are genes used as diagnostic models for AS
We obtained the intersection by constructing venn diagrams from the single-cell dataset marker genes and the microarray dataset SVM, and the two genes CAPZA2 and TRIB1 met all our screening requirements (Figure 8G). We then constructed ROC diagnostic model curves to test our analysis and found that the area under the curve for the AS diagnostic model constructed by CAPZA2 was 0.741 with 95% CI of 0.637-0.835, (Figure 9C). The area under the curve for the diagnostic model of AS constructed by TRIB1 was 0.727, 95% CI 0.617-0.828 (Figure 9D). the area under the curve for the diagnostic models of CAPZA2 and TRIB1 was much greater than 0.5, indicating a more accurate diagnostic curve. box plot analysis of the differences between the CAPZA2 and TRIB1 diagnostic models in AS and controls (Figure 9A, 9B), the The differences were statistically significant (p<0.05), suggesting the significance of the CAPZA2 and TRIB1 diagnostic models in the diagnosis of AS. Pseudotime and trajectory analysis demonstrated a possible correlation between the development of ankylosing spondylitis and inflammatory cell differentiation. Pathogenic enrichment analysis revealed enrichment mainly in primary immunodeficiency disease, hereditary coagulation, atherosclerosis, and bacterial infection.
Differential analysis of immune cell correlations
Fitted curves were constructed for the correlation between the two genes and immune cells for the diagnostic model (Figures 10, 11). Figure 12A shows that 7 immune cells were associated with CAPZA2 (p < 0.05) and Figure 12B shows that 6 immune cells were associated with TRIB1 (p < 0.05).
We performed immunohistochemical staining of CAPZA2 and TRIB1 in the interspinous ligaments of five patients with AS and three patients with spinal fractures, respectively. The results showed that the specific expression of each of CAPZA2 and TRIB1was significantly higher in AS than in the control group (Figure 13A1-D2). After detecting the positive rate of all immunohistochemical images with Image J software, the positive rate data of CAPZA2 and TRIB1 were imported into R software for statistical analysis. The positive rate of CAPZA2 and TRIB1 were statistically found to be much higher in AS than in the control group, and the difference was statistically significant (P<0.05), indicating that CAPZA2 and TRIB1 in AS and the control group there was a significant difference.