Prediction of Renal Damage in Children with Henoch-Schönlein Purpura Based on Machine Learning


Background

This study aimed to explore the value of machine learning algorithms in predicting the risk of renal damage in children with Henoch-Schönlein Purpura, to construct a predictive model of Henoch-Schönlein Purpura Nephritis in children, and to analyze the related risk factors.
Methods

Case data of 288 hospitalized children with Henoch-Schönlein Purpura from November 2018 to October 2021 were collected. The data included 42 indicators covering demographic characteristics, clinical symptoms and laboratory tests. Univariate feature selection was used for feature extraction, and logistic regression, support vector machine, decision tree and random forest algorithms were each used for classification prediction. Finally, the performance of the four algorithms was compared using accuracy and recall.
Results

The accuracy, recall and AUC of the random forest model were 0.83, 0.86 and 0.91 respectively, higher than those of the logistic regression model (0.74, 0.80 and 0.89), the support vector machine model (0.70, 0.80 and 0.89) and the decision tree model (0.74, 0.80 and 0.81). The top 10 features ranked by the random forest model were Persistent purpura≥4weeks, Cr, Clinic time, ALB, WBC, TC, Relapse, TG, Recurrent purpura and EB-DNA.
Conclusion

The model based on the random forest algorithm performed better in predicting renal damage in children with Henoch-Schönlein Purpura, as indicated by higher classification accuracy, a better classification effect and better generalization performance.


Background
Henoch-Schönlein Purpura (HSP), also known as IgA vasculitis (IgAV), is one of the most common forms of systemic vasculitis in children [1]. HSP often has an acute onset, and its incidence is about 3 to 27 per 100,000 [2][3]. The clinical features of HSP are skin purpura, abdominal pain, joint swelling and pain, gastrointestinal bleeding, and renal involvement such as hematuria and proteinuria, among which renal involvement is the key factor affecting the prognosis. It has been reported that about 20% to 80% of children with HSP develop Henoch-Schönlein Purpura Nephritis (HSPN) in the first 1 to 2 months after onset, and 1% of patients may even develop end-stage renal failure [4][5], which seriously affects the quality of life of affected children. Because renal involvement is a key factor in the long-term prognosis of children with HSP, early and accurate diagnosis of HSPN is crucial for prognosis and individualized treatment.
In recent years, big data analysis and mining have attracted increasing attention. Machine learning, as an emerging statistical analysis method, is capable of deep mining and analysis of big data, and has been widely used in predicting disease occurrence and prognosis, with some success [6][7].
Therefore, this study aims to build a prediction model of renal damage in children with HSP based on machine learning algorithms, identify related risk factors, and support the early diagnosis and individualized intervention of children with HSPN.

Research Objects
This study included 288 children with HSP who were treated from November 2018 to October 2021. After at least 6 months of follow-up, the patients were divided into an HSP group and an HSPN group according to urine test results.

Diagnostic Criteria
The diagnosis of HSP is based on the 2010 EULAR/PRINTO/PRES diagnostic criteria for HSP [8]. The diagnosis of HSP is defined as palpable rash with at least one of the following four clinical symptoms: abdominal pain, arthritis/joint pain, renal involvement, histopathological findings suggesting IgA deposition.
According to the "Evidence-based Guideline for Diagnosis and Treatment of Purpura Nephritis (2016)" issued by the Nephrology Group of the Pediatric Branch of the Chinese Medical Association, HSPN is diagnosed when hematuria and/or proteinuria occurs within 6 months of the course of HSP [9].
Hematuria is defined as gross hematuria, or microscopic hematuria with more than 3 red blood cells per high-power field (HP) on 3 occasions within 1 week. Proteinuria is defined as meeting any of the following: 3 routine urine tests within 1 week that are qualitatively positive for urine protein; 24 h quantitative urine protein > 150 mg, or urine protein/creatinine ratio (mg/mg) > 0.2; or urinary microalbumin higher than normal on 3 occasions within 1 week.

Inclusion Criteria
Patients were included if they met the diagnostic criteria above, were under 18 years old, and their parents gave consent.

Exclusion Criteria
Patients were excluded if they had severe heart, liver, kidney, brain, immune system or other diseases, or consumptive diseases, or did not cooperate with the observer.

Statistical analysis
All data in this study were processed and analyzed with Python 3.9. Features measured on different scales can differ greatly in magnitude, and some take negative values, which would affect the subsequent feature selection and machine learning. To remove the effect of inconsistent scales, the data were standardized with the min-max method, which rescales every feature to a common range. Missing values may distort model fitting and make the output unreliable, and features with more than 60% missing values contain little usable information, so they were deleted directly. Because some features contain outliers, the remaining missing values were filled with the median, which is less sensitive to outliers, to avoid affecting the overall performance of the model [10].
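The preprocessing steps described above can be sketched in plain Python as follows. This is a minimal illustration, not the study's actual code: the function name `preprocess`, the row-of-dicts input layout, the use of `None` for missing values, and the order of operations (drop sparse features, impute, then scale) are all assumptions made for the example.

```python
from statistics import median


def preprocess(rows, max_missing=0.60):
    """Clean a table given as a list of dicts (feature name -> float or None).

    Hypothetical helper: returns a dict mapping each kept feature name to
    its list of scaled values, one per row.
    """
    n = len(rows)
    cleaned = {}
    for name in rows[0]:
        column = [row[name] for row in rows]
        observed = [v for v in column if v is not None]
        # Drop features whose fraction of missing values exceeds the threshold.
        if (n - len(observed)) / n > max_missing:
            continue
        # Fill the remaining gaps with the median, which is robust to outliers.
        fill = median(observed)
        filled = [fill if v is None else v for v in column]
        # Min-max scaling to [0, 1]; a constant column is mapped to 0.0.
        lo, hi = min(filled), max(filled)
        span = hi - lo
        cleaned[name] = [(v - lo) / span if span else 0.0 for v in filled]
    return cleaned
```

In this sketch a feature with exactly 60% missing values is kept, since the text speaks of features "exceeding" that level; whether the study applied the threshold strictly is not stated.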
Univariate feature selection was used to screen out variables that are likely to be irrelevant. The chi-square test was used for discrete features, and correlation analysis was used for continuous features. In this feature selection, P < 0.05 was considered statistically significant.
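For a binary feature and a binary outcome, a univariate screen of this kind reduces to a test on a 2x2 contingency table. The sketch below assumes a Pearson chi-square test without continuity correction, which is one common choice and not necessarily the one used in the study; the function name and the cell layout are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Rows could be feature present / absent and columns HSPN / HSP.
    Returns (statistic, p_value). No continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the upper-tail probability of the
    # chi-square distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

A feature would be retained when the returned p-value falls below 0.05, the threshold stated in the text.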

Machine Learning Methods
Classical classification algorithms in machine learning include logistic regression, decision tree, support vector machine, naive Bayes and K-nearest neighbor, as well as ensemble learning algorithms such as AdaBoost, Random Forest (RF), GBDT and XGBoost. This study uses logistic regression, support vector machine, decision tree and random forest algorithms for modeling analysis. RF is a Bagging algorithm based on decision trees. A traditional decision tree chooses the optimal attribute among all attributes of the current node when selecting a partition attribute, whereas RF introduces random attribute selection into the training process of each decision tree [11]. When a decision tree in RF selects the partition attribute of a node, a subset of k attributes is first drawn at random from the attribute set of the node, and the optimal attribute in this subset is then used for the partition. If k = n (where n is the number of attributes of the current node), a decision tree in RF is built in the same way as a traditional decision tree. If k = 1, a single attribute is selected at random for the partition. A value of k = log2 n is usually recommended.
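The random attribute selection step can be illustrated with a short sketch. The function names, the `score` callback (standing in for a split criterion such as information gain) and the default of k = log2 n are assumptions for illustration; the study does not state which value of k it used.

```python
import math
import random


def candidate_attributes(attributes, k=None, rng=random):
    """Draw the random subset of k attributes considered at one node.

    When k is omitted, the floor of log2(n) is used (at least 1), a
    common default in the random forest literature.
    """
    n = len(attributes)
    if k is None:
        k = max(1, int(math.log2(n)))
    return rng.sample(attributes, k)


def choose_split(attributes, score, k=None, rng=random):
    """Return the best-scoring attribute within the random subset.

    `score` is a hypothetical callable mapping an attribute to its split
    quality; higher is better.
    """
    return max(candidate_attributes(attributes, k, rng), key=score)
```

With k equal to the number of attributes, `choose_split` always returns the globally best attribute, as in a traditional decision tree; with k = 1 the choice is purely random.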
The whole model construction process of risk prediction and feature analysis of renal damage in children with Henoch-Schönlein Purpura based on machine learning algorithm is shown in Figure 1.

Basic Information
A total of 288 samples were included in the study. After 6 months of follow-up, 174 cases (60.42%) had no renal damage and 114 cases (39.58%) had renal damage.
Univariate feature selection of renal damage
Univariate feature selection was carried out on the 288 samples collected, and variables likely to be irrelevant were preliminarily screened out. The results are shown in Table 1 and Table 2.

Ranking of feature importance in random forests
The 11 features obtained from univariate feature selection were ranked according to the feature importance provided by the random forest model, as shown in Figure 2. By setting appropriate thresholds, the top 10, 8 and 5 features were selected as independent variables for model input. As shown in Figure 3, the corresponding AUCs of the model were 0.916, 0.880 and 0.886, respectively. Therefore, the model performs best when the ten features with the highest random forest importance scores are used as input. According to Figure 2, the top 10 features are: Persistent purpura≥4weeks, Cr, Clinic time, ALB, WBC, TC, Relapse, TG, Recurrent purpura, EB-DNA.

Model Construction
To optimize the performance of the model, the hyperparameters need to be tuned during modeling. First, the number of decision trees (n_estimators) and the maximum depth of each decision tree (max_depth) are determined by grid search, and then the minimum number of samples needed to split a node (min_samples_split) and the minimum number of samples at a leaf node (min_samples_leaf) are determined. As shown in Table 3, the random forest model performs better than the other models, including the linear model. ROC curves of the four models were drawn for comparison, as shown in Figure 4. The AUC of the random forest model was 0.912, higher than that of the other three models, indicating that the random forest model had higher classification accuracy, a better classification effect and good generalization performance. Therefore, the model based on random forest is considered to perform better in predicting renal damage in children with HSP.
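The two-stage tuning order described above can be sketched as a generic exhaustive grid search. Everything here is illustrative: `score` is a hypothetical callable that would train a random forest with the given parameters and return a validation metric such as cross-validated AUC, and the candidate values in any grid are assumptions rather than the values used in the study.

```python
from itertools import product


def grid_search(score, grid, fixed=None):
    """Evaluate every combination in `grid` and return (best_params, best_score).

    `grid` maps a parameter name to its candidate values; `fixed` holds
    parameters already chosen in an earlier stage.
    """
    fixed = dict(fixed or {})
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = {**fixed, **dict(zip(names, values))}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def tune_in_two_stages(score, stage1_grid, stage2_grid):
    """Tune the stage-1 parameters first, then the stage-2 ones with stage 1 fixed."""
    stage1_best, _ = grid_search(score, stage1_grid)
    return grid_search(score, stage2_grid, fixed=stage1_best)
```

Here `stage1_grid` would hold candidates for `n_estimators` and `max_depth`, and `stage2_grid` candidates for `min_samples_split` and `min_samples_leaf`. Tuning in stages is cheaper than one joint grid but can miss interactions between the two groups of parameters.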

Discussion
Machine learning can effectively learn the characteristics of large amounts of data, which provides new research ideas and methods for accurate prediction. Machine learning algorithms include conventional algorithms (K-nearest neighbor, decision tree, support vector machine, etc.) and ensemble algorithms (random forest, XGBoost, extremely randomized trees, etc.). In this study, logistic regression, support vector machine, decision tree and random forest algorithms were used to construct prediction models of renal damage in children with HSP. Comparing the precision, accuracy, recall and F1 value of each model shows that the random forest model performs better, with values of 0.83, 0.87, 0.86 and 0.85 respectively, and that it is more stable than the other three models.
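For reference, the four metrics compared above follow directly from the confusion matrix of a binary classifier. The sketch below uses their standard definitions; the function name and the coding of labels (1 for HSPN, 0 for HSP without nephritis) are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Recall is the fraction of true HSPN cases that the model flags, which is why it matters most when missed renal involvement is the costlier error.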
The ROC curves of the four models were drawn for comparison, and the AUC of the random forest model was 0.912, higher than that of the other three models, indicating that the random forest model classified more accurately, had a better classification effect and generalized well. Random forest is a collection of multiple decision trees, which can make up for the weak generalization ability of a single decision tree [11]. This method relies on computers to learn the complex nonlinear interactions between variables by minimizing the error between observed and predicted results [12]. With low computational overhead, it shows strong performance in many practical tasks.
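The AUC has a simple interpretation that a short sketch makes concrete: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The function below computes it by direct pairwise comparison, which is adequate for a sample of a few hundred patients; the function name and label coding are assumptions.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    y_true holds 1 for positive and 0 for negative cases; scores are the
    predicted probabilities or decision values.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))
```

Because the AUC depends only on how cases are ranked, it does not change with the classification threshold, unlike accuracy or recall.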
Henoch-Schönlein Purpura is a systemic vasculitis mediated by immune complexes and is characteristically self-limited. Its pathogenesis is related to genetic, immune and other factors. Renal involvement is the key determinant of its prognosis. Clinical judgment of renal involvement in children mainly depends on urine tests, renal function tests and renal biopsy. However, because renal biopsy carries relatively high risk and is poorly accepted, and routine urine tests lag behind the disease, many researchers in recent years have studied the high-risk factors for renal damage in HSP and methods of preventing it. These studies mainly analyze the epidemiological characteristics, clinical manifestations, auxiliary examinations, treatment and medication of the disease. In this study, features were ranked according to the feature importance provided by the random forest model, and the top 10 features were: Persistent purpura≥4weeks, Cr, Clinic time, ALB, WBC, TC, Relapse, TG, Recurrent purpura, EB-DNA. These features may be important risk factors associated with HSPN in children.
Skin purpura is the most common clinical manifestation of HSP in children [13]. Studies have shown that about 78% to 100% of children present with skin purpura at the onset of disease, and the accuracy of initial diagnosis is high [14]. Persistent purpura usually refers to a rash lasting more than 1 month. Recurrent purpura refers to the recurrence of a typical purpura-like rash in groups (more than 3 times) after the previous rash has completely subsided. Chan H et al [15] found that the risk of renal damage in children with HSP and persistent purpura was 1.22 to 13.25 times higher than in those without persistent purpura. Rigante D et al [16] reported that a skin rash persisting for more than one month was an important predictor of renal involvement and disease recurrence in children with HSP. Ma DQ et al [17] found that recurrence of rash ≥ 3 times was a risk factor for renal involvement in children with HSP.
A possible explanation is that recurrent or persistent skin purpura reflects recurrent or persistent small-vessel vasculitis, which amplifies the inflammatory cascade of the body. Immune complex deposition and complement activation then remain widespread and sustained, and because the kidney is rich in capillaries, renal involvement follows.
Serum creatinine is an important indicator of renal function, and an increase in serum creatinine caused by decreased creatinine clearance is a sign of renal insufficiency. AlKhater et al [18] showed that elevated serum creatinine was related to renal damage in children with HSP. Although some reports put the average time to renal disease at about one month after the onset of symptoms, the risk can persist for up to six months after the initial symptoms of HSP appear. In the report of Gupta et al [19], 57.8% of the patients developed renal symptoms within 4 weeks, 84.4% developed renal symptoms within 8 weeks, and the remaining patients developed renal symptoms within 6 months after diagnosis. This indicates the need for adequate follow-up and monitoring of patients to assess renal involvement. It is generally believed that hypoproteinemia in nephropathy is caused by the loss of a large amount of protein in urine. The results of this study showed that decreased serum albumin was one of the risk factors for renal involvement, consistent with the study by Mao et al [20]. This is related to damage to the charge barrier of the glomerular filtration membrane and increased permeability in children with HSPN, which leads to albuminuria.
In recent years, studies have shown that elevated serum total cholesterol (TC) is more common in children with HSP, especially those with renal damage. For example, Xu et al [21] showed that age, creatinine and TC levels were higher in children with HSPN than in children without HSPN. Multivariate logistic regression analysis showed that TC level was one of the independent risk factors for HSPN (P < 0.05), consistent with the study of Ma et al [17].
Wang et al [22] showed that patients with an interval of less than 4 days from the onset of symptoms to diagnosis had a higher risk of developing kidney damage and severe kidney disease than patients with an interval of more than 8 days. This risk factor has rarely been reported in previous studies. Thus, HSP is a self-limited disease in most cases, but in a small number of patients it is not self-limited and progresses to renal involvement or severe renal disease. This finding is similar to the view of Davin et al [23]. These results suggest that early diagnosis and early treatment may be beneficial to children with HSP. Recurrence refers to the reappearance of characteristic manifestations of HSP in children diagnosed with HSP at least 1 month after the disappearance of symptoms. Lei et al [24] defined the interval of recurrence as more than 3 months and included a total of 1002 children, of whom 83.6% had one recurrence and 16.4% had more than 2 recurrences; children with recurrence were more likely to have renal damage (P < 0.05).
Studies have shown that infection is the most common cause of HSP, and in about 40% to 70% of children the trigger is mainly respiratory tract infection [25][26]. Ma et al [17] showed that an increase in WBC was one of the independent risk factors for HSPN (P < 0.05). Chang et al [27][28] suggested that the mechanism may be tissue damage caused by inflammatory mediators secreted by neutrophils, resulting in swelling and necrosis of the renal vascular endothelium, while activated substances such as oxygen free radicals attract more WBCs by chemotaxis, aggravate vascular injury and form a vicious circle. EBV belongs to the γ subfamily of Herpesviridae and is a linear double-stranded DNA virus; humans are its only natural host. It has been reported that viral infection is the etiology of various renal diseases. EBV infection can directly activate cellular and humoral immunity, leading to EBV infection-related renal injury, and can also promote the formation of circulating antigen-antibody complexes that deposit on the renal vascular wall and damage renal function [29].
This study has the following limitations: (1) it is a single-center retrospective study with a limited sample size and no external validation, so the results may be biased, and further multicenter, large-sample prospective studies are needed for verification; (2) the examination items differed between children, some features were omitted because they were not recorded, and some predictive variables may therefore have been left out.
In summary, this study uses machine learning algorithms on clinical data to predict HSPN in children, aiming to support intervention on possible clinical risk factors, assist early clinical diagnosis, improve the prognosis of children, and reduce the harm caused by invasive examination. Prospective intervention studies can be carried out later to try to establish an in-hospital early warning system for renal damage in children with HSP, so as to provide individualized treatment and prevention. The combination of machine learning models and medical big data may provide new ways to predict the risk of renal damage in children with Henoch-Schönlein Purpura.

Declarations
Ethics approval and consent to participate: This is an observational study. The study was approved by the Medical Ethics Committee of Affiliated Hospital of Shandong University of Traditional Chinese Medicine.
All methods were carried out in accordance with relevant guidelines and regulations. We confirmed that informed consent was obtained from all patients or their parents.
Consent for publication: Not applicable.
Availability of data and materials: All data generated or analysed during this study are included in this published article and they are available from the corresponding author on reasonable request.