Identification of novel classes for patients with lupus nephritis using two-step cluster model

This study aimed to identify and reclassify patients in a lupus nephritis (LN) cohort and to analyze the prominent clinical features and clinical significance of each cluster. In this retrospective cross-sectional study, we used a two-step clustering method to classify 635 patients with LN into clusters, then examined the main differences between the clusters and their clinical significance. Cluster 1 (20.5%) presented with the highest disease severity: patients in this group had a longer disease duration and a higher systemic lupus erythematosus disease activity index (SLEDAI) score, with multiple positive autoantibodies and lower complement levels. Patients in cluster 2 (20.8%) had lower levels of IgG, IgA and IgM, and their renal function was relatively worse than that of patients in clusters 1 and 3. Cluster 3 was the largest group (58.7%), and its patients showed mild disease severity. This study reclassified LN patients in a large cohort into three clusters. Our classification might help in implementing targeted therapy at various stages of systemic lupus erythematosus.


Introduction
Systemic lupus erythematosus (SLE) is an autoimmune disease involving multiple organs, and renal involvement significantly increases its morbidity and mortality [1]. Effective management of the clinical manifestations of lupus nephritis (LN) could therefore reduce the burden of SLE and improve patient prognosis [2]. Currently, the Systemic Lupus International Collaborating Clinics (SLICC)/American College of Rheumatology (ACR) damage index (SDI) [3] is used to assess the disease burden, and the 2003 International Society of Nephrology/Renal Pathology Society (ISN/RPS) classification system is used to apply targeted treatment strategies [4,5]. Despite these efforts, 10-30% of patients with LN progress to end-stage kidney disease within 15 years of diagnosis [6]. The classification of LN could be improved by incorporating clinical indices and pathological features for more effective treatment.
Therefore, the aim of our study was to use a reasonable clustering method to identify clusters of prominent clinical features among patients with LN and to evaluate the clinical significance of the different clusters identified.

Statement of ethics
Our study was approved by the Medical Ethics Committee of Xiangya Hospital of Central South University in Changsha, China. All patients provided written informed consent. The study was conducted in accordance with the Declaration of Helsinki.

Study group
This was a retrospective, cross-sectional study conducted at the Rheumatology Department of Xiangya Hospital of Central South University between November 2006 and September 2020. Both newly diagnosed patients and those already undergoing therapy were included. All patients met the ACR 1997 revised classification criteria for SLE. Data analysis was based on the online registry of CSTAR, an academic union first funded by the Chinese Ministry of Science & Technology in 2009.

Data collection
The following data were retrieved from patients' medical records for analysis: demographic information, including sex, age and duration of disease at the time of recruitment; laboratory findings, including the white blood cell (WBC) count, neutrophil (N) count, lymphocyte (L) count, hemoglobin (Hb) concentration, blood platelet (PLT) count, urinary albumin/creatinine ratio (UACR), and creatinine (Cr) level; the estimated glomerular filtration rate (eGFR); levels of immunoglobulin G (IgG), immunoglobulin A (IgA), immunoglobulin M (IgM), C3, and C4; and the presence of antinuclear antibodies (ANA), antibodies to double-stranded DNA (dsDNA), anti-Smith antibodies, anti-Sjögren syndrome A (SSA) and B (SSB) antibodies, antinucleosome antibodies, antihistone antibodies, and antibodies to ribosomal P protein. The clinical evaluation included the physician's global assessment (PGA) of disease activity, the systemic lupus erythematosus disease activity index (SLEDAI), and the SDI score [7], with the scores calculated by experienced clinicians. The PGA was scored prior to reviewing the complement and anti-DNA antibody results [8].

Statistical analysis
A two-step cluster model was used to identify groups of patients with LN who had similar patterns of SLE clinical presentation. The analysis proceeded as follows.
In the first step, we established a cluster feature tree by placing the first record of the data set on a leaf node at the root of the tree. All variables for this record were included. A log-likelihood criterion was then used to measure the distance, or likelihood of similarity, between records. Records with high similarity were placed on the same node, while those with low similarity generated new nodes. The similarity criterion assumes that the variables follow a certain probability distribution and that the variables in the clustering model are independent of one another. With respect to the probability distribution, continuous variables were assumed to have a normal (Gaussian) distribution and categorical variables a multinomial distribution. Empirical internal tests have shown that the clustering process is robust against violations of the assumptions of independence and distribution. We evaluated these assumptions for our own data set, using bivariate correlation to test the independence of pairs of continuous variables and cross-tabulation to test the independence of pairs of categorical variables. Exploratory procedures were used to confirm the normal distribution of continuous variables, and the χ²-test was used to confirm the multinomial distribution of categorical variables.
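The similarity criterion described above is commonly formulated as a log-likelihood distance: the drop in log-likelihood that results from merging two groups of records, under the normal and multinomial assumptions. The following is a minimal sketch of that idea, not the SPSS implementation. The function names, the data layout and the toy C3/ANA values are hypothetical and are not taken from the study data.

```python
import math
from collections import Counter

def variance(values):
    """Population variance (divide by N)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def entropy(labels):
    """Shannon entropy (natural log) of a list of category labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def xi(cont_cols, cat_cols, overall_var):
    """Log-likelihood term of one group of records.

    cont_cols:   one list of values per continuous variable
    cat_cols:    one list of labels per categorical variable
    overall_var: variance of each continuous variable over the whole data set
                 (added so the logarithm stays finite for tiny groups)
    """
    n = len(cont_cols[0]) if cont_cols else len(cat_cols[0])
    cont = sum(0.5 * math.log(overall_var[k] + variance(col))
               for k, col in enumerate(cont_cols))
    cat = sum(entropy(col) for col in cat_cols)
    return -n * (cont + cat)

def log_likelihood_distance(a, b, overall_var):
    """Distance = xi(a) + xi(b) - xi(a merged with b)."""
    merged_cont = [x + y for x, y in zip(a[0], b[0])]
    merged_cat = [x + y for x, y in zip(a[1], b[1])]
    return (xi(a[0], a[1], overall_var) + xi(b[0], b[1], overall_var)
            - xi(merged_cont, merged_cat, overall_var))

# Hypothetical toy groups: one continuous variable (C3) and one categorical (ANA).
a = ([[0.50, 0.60]], [["pos", "pos"]])
b = ([[0.55, 0.65]], [["pos", "pos"]])
c = ([[1.20, 1.30]], [["neg", "neg"]])
overall_var = [variance(a[0][0] + b[0][0] + c[0][0])]

d_ab = log_likelihood_distance(a, b, overall_var)
d_ac = log_likelihood_distance(a, c, overall_var)
print(d_ab < d_ac)
```

In this toy example, groups a and b have similar C3 values and the same ANA status, so their distance is smaller than that between a and c. Records or nodes separated by a small distance would be placed together.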
In the second step, we used an agglomerative (merging) clustering algorithm to combine leaf nodes. A set of clustering schemes was generated for different numbers of clusters, and the Bayesian information criterion (BIC) was used to compare the schemes and select the optimal one.
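The BIC comparison can be sketched as follows, assuming the general definition BIC = -2 ln L + m ln N, where m is the number of free parameters and N the number of records. The helper names and the log-likelihood values are hypothetical illustrations, not results from this study. Software implementations may refine the plain minimum-BIC rule shown here.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def n_free_params(n_clusters, n_continuous, categories_per_var):
    """Free parameters of a clustering scheme.

    Per cluster: a mean and a variance for each continuous variable, plus
    (levels - 1) probabilities for each categorical variable.
    """
    per_cluster = 2 * n_continuous + sum(levels - 1 for levels in categories_per_var)
    return n_clusters * per_cluster

def select_n_clusters(log_likelihoods, n_continuous, categories_per_var, n_obs):
    """Return the cluster count with the lowest BIC, plus all BIC scores."""
    scores = {
        j: bic(ll, n_free_params(j, n_continuous, categories_per_var), n_obs)
        for j, ll in log_likelihoods.items()
    }
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical log-likelihoods for schemes with 1 to 4 clusters
# (two continuous variables, one binary categorical variable, 635 records).
candidate_ll = {1: -9000.0, 2: -8600.0, 3: -8300.0, 4: -8290.0}
best, scores = select_n_clusters(candidate_ll, 2, [2], 635)
print(best)
```

With these made-up values, a fourth cluster improves the log-likelihood only slightly, so the penalty term outweighs the gain and the three-cluster scheme has the lowest BIC.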
The two-step cluster analysis automatically determined that three clusters were to be generated. Continuous variables within each cluster were reported as the mean and standard deviation (SD), and categorical variables as a count and percentage. Differences in continuous variables between the three clusters were evaluated using analysis of variance (ANOVA), and the χ²-test was used for categorical variables.
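The two between-cluster comparisons can be illustrated with a short sketch that computes the one-way ANOVA F statistic and the Pearson χ² statistic directly from their definitions. The function names and the small data sets are hypothetical. Converting the statistics to P-values requires the F and χ² distributions, which statistical software provides.

```python
def anova_f(groups):
    """One-way ANOVA: return (F statistic, df between, df within)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

def chi_square(table):
    """Pearson chi-square for a contingency table: return (statistic, dof)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical continuous variable measured in three clusters.
f_stat, df_b, df_w = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])

# Hypothetical 2 x 2 table: antibody positive/negative counts in two clusters.
chi2_stat, dof = chi_square([[10, 20], [20, 10]])
print(f_stat, df_b, df_w, chi2_stat, dof)
```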
All analyses were performed using SPSS (version 26.0, SPSS Inc., Chicago, IL, USA), with significance defined by a P-value < 0.05.

Demographic and clinical characteristics of patients with LN
A total of 635 patients with LN were enrolled in the study; of them, 599 (94.3%) were women, and the mean age was 33.8 ± 10.4 years. The baseline characteristics of our study group are described in Table 1. Since most of the patients had been treated, the average results of routine blood, renal function and immunological tests were within the normal range; however, C3 levels were still lower than normal. Both ANA and anti-dsDNA autoantibodies were available for 606 (95.4%) patients. Among them, anti-Sm antibodies were found in 6%, anti-RNP antibodies in 1.4%, anti-SSA antibodies in 5%, anti-SSB antibodies in 2.2%, anti-rRNP antibodies in 0.8%, and antinucleosome and antihistone antibodies in 3% and 1.3%, respectively. The PGA score ranged from 0 to 1.5, and the mean SDI score was 0.3.
Table 1 footnote: Data are given as mean (SD), median, or as number and percentage. WBC, white blood cell; N, neutrophils; L, lymphocytes; Hb, hemoglobin; PLT, blood platelets; UACR, urinary albumin/creatinine ratio; Cr, creatinine; eGFR, estimated glomerular filtration rate; IgG, immunoglobulin G; IgA, immunoglobulin A; IgM, immunoglobulin M; C3, complement 3; C4, complement 4; ANA, antinuclear antibodies; anti-dsDNA antibody, antibodies to double-stranded DNA; anti-Sm antibodies, antibodies to Smith; anti-rRNP antibodies, antibodies to ribosomal P protein; PGA, physician's global assessment; SDI, Systemic Lupus International Collaborating Clinics/ACR damage index.

Cluster analysis
The three clusters identified are shown in Fig. 1, with the characteristics of patients in each of the three clusters summarized in Table 2. [...] lower for cluster 3 than those for clusters 1 and 2; however, the levels of IgG and IgA in cluster 3 were the highest among the three clusters. All patients in cluster 3 had positive ANA and anti-dsDNA antibodies. Cluster 1 included the smallest group of patients, who had the highest SLEDAI and SDI scores among the three clusters, indicative of high disease activity and severe damage. The complement level in cluster 1 was the lowest among the clusters, with evidence of multiple positive autoantibodies. The duration of disease was significantly longer in cluster 1 (9.0 ± 4.9 years) than in cluster 2 (5.4 ± 4.4 years) and cluster 3 (6.0 ± 3.3 years), p = 0.008. The treatment effect in this cluster was unsatisfactory, even after treatment with belimumab; however, more follow-up data are needed to support this conclusion. Patients in cluster 2 had the highest levels of granulocytes and the lowest levels of IgG, IgA, and IgM among the three clusters, and their renal function was relatively worse than that of patients in clusters 1 and 3.
Fig. 1 Labels of important clinical variables for each cluster of patients. We analyzed and scored the degree of influence (contribution degree) of each variable on the clustering results. The higher the score, the greater the influence of the variable on the classification, and the farther its position in the figure from the center of the circle. For example, the variables that most affected classification were Sm, RNP, SSA, SSB, rRNP, nucleosome, histone and SDI in cluster 1; C3, C4 and Cr in cluster 2; and ANA and dsDNA in cluster 3. WBC, white blood cell; N, neutrophils; L, lymphocytes; Hb, hemoglobin; PLT, blood platelets; UACR, urinary albumin/creatinine ratio; Cr, creatinine; eGFR, estimated glomerular filtration rate; IgG, immunoglobulin G; IgA, immunoglobulin A; IgM, immunoglobulin M; C3, complement 3; C4, complement 4; ANA, antinuclear antibodies; anti-dsDNA antibody, antibodies to double-stranded DNA; anti-Sm antibodies, antibodies to Smith; anti-rRNP antibodies, antibodies to ribosomal P protein; PGA, physician's global assessment; SDI, Systemic Lupus International Collaborating Clinics/ACR damage index.

Discussion
Our analysis shows that the 635 patients with LN in our study group could be classified into three clusters, each with a different clinical profile. Disease severity differed among the three clusters and was highest for patients in cluster 1. These patients had a relatively high lymphocyte count, low complement level, multiple positive autoantibodies, higher SLEDAI and SDI scores, and longer disease duration. Compared with the other two clusters, cluster 1 included individuals with significant damage to multiple organs, resulting in a correspondingly higher SDI score; however, the number of patients with multiple organs involved was not high, so we did not list the specific conditions of organ damage in the table.
Differences in the WBC count played an important role in the cluster analysis. According to previous literature, abnormalities in T and B lymphocytes are the major adaptive immune responses in SLE, with anomalous complement activation and autoantibody production contributing to the pathogenesis of SLE and LN [9-12]. We also cannot rule out that the high levels of granulocytes were a consequence of glucocorticoid treatment. The presence of multiple positive autoantibodies in the ANA profile is the main clinical feature of patients in cluster 1. Previous studies suggest that a variety of positive ANA profile antibodies might be related to the development or organ involvement in SLE. For example, anti-Smith antibodies, which are associated with SLE disease activity, have also been associated with neurologic disorders and cardiovascular diseases [13-15]; anti-SS-A antibodies have been associated with higher frequencies of musculoskeletal involvement [16,17]; and anti-RNP antibodies are more prevalent among patients with renal involvement [18,19]. This might be the reason for the higher SLEDAI and SDI scores in cluster 1.
We conducted statistical tests on the LN pathological results of the three clusters and found no statistically significant differences. Each cluster included patients with class III, IV, and V lupus nephritis, and some patients were clinically diagnosed with LN without undergoing renal biopsy; hence, we did not show this part of the results in Table 2.
The limitations of the study are as follows. First, further longitudinal research is required to better explain the clinical meaning of this classification. Second, although most of the selected variables were continuous, several were categorical, which may have affected the clustering results.
In summary, the two-step cluster analysis we used in our study was successful in classifying patients with LN into three clusters, with laboratory indicators being different among these clusters. This classification may assist in providing targeted therapy to patients.

Conclusion
By applying a two-step clustering model, we were able to classify a large cohort of patients with LN into three clusters, presenting with different clinical and laboratory profiles and disease severity.