Association of Clinicopathological Features with IgA Nephropathy: A Principal Component Analysis

Background: IgA nephropathy (IgAN) is the most common form of glomerular disease worldwide. Its pathogenesis remains unclear, and it can currently be diagnosed only by renal biopsy; non-invasive diagnostic methods are lacking. This study aims to analyze the association between clinicopathological characteristics and IgAN by principal component analysis (PCA). Methods: Indicators associated with IgAN were screened with a z-test and combined in a PCA model, and the fit of this model was evaluated with logistic regression analysis. Results: Of 1395 patients who underwent kidney biopsy between May 2008 and April 2013, 847 with biopsy-proven IgAN were analyzed. The mean age was 33.156 ± 12.308 years, and males accounted for 43%. The z-test selected 27 clinical and pathological indicators related to IgAN, and the principal component prediction model was built on these 27 indicators. The logistic regression model achieved a recall of 91.93% for IgAN and an overall accuracy of 71.29%, which shows that the PCA model has high reliability. Conclusions: The model suggests that a higher global sclerosis rate, higher serum creatinine and blood uric acid, and a lower eGFR might promote the occurrence of IgAN, which provides additional information for the non-invasive diagnosis of IgAN.


Background
IgAN is a common primary chronic glomerular disease associated with end-stage renal disease [1]. Its pathogenic mechanism is not clear. IgAN is reported to progress to end-stage renal disease in approximately 30% of patients within 20 years of diagnosis [2]. However, the requirement for a diagnostic kidney biopsy has prevented a full description of the consequences of this disease [3]. Non-invasive methods that facilitate early diagnosis and risk prediction in patients with IgAN would therefore be valuable.
A number of risk factors affecting the prognosis of IgAN have been identified, including hypertension, higher proteinuria, decreased glomerular filtration rate (GFR), hyperuricemia, sex, and severe pathological score; these have been used to build various scoring systems to predict the prognosis of IgAN [2,4].
However, these scoring systems are limited by the following factors: small sample size, multiple pathological scoring criteria, relatively few variables included, and poor clinical applicability [4,5,6].
Compared with conventional statistical methods, machine learning can better identify variables related to kidney outcomes, offers better predictive performance, and can learn from multiple modules of data, and accurate prediction models built with machine learning have begun to be applied in medicine [6,7]. Recently, a model combining survival analysis and machine learning algorithms was developed to help predict the risk of kidney disease progression in IgAN patients [8]. Although this model shows better predictive power than the absolute renal risk and may be a more beneficial tool for individualized treatment and management of IgAN patients, it has limitations, such as the significant effects of different therapeutic interventions on prognosis.
To alleviate such difficulties, we propose a new method for risk factor prediction and diagnostic decision-making based on the principal component analysis (PCA) model [9], which builds on the results of multivariate analysis. In most cases, predictive factors of disease occurrence (such as hypertension and proteinuria) are interrelated, and it is not easy to assess their individual contributions to overall risk from a statistical perspective. The PCA model addresses this problem by reducing dimensions [10], yielding correlation coefficients for each factor that reflect their respective contributions. Logistic regression analysis is used to test the fitness of the PCA model. With these methods, we explore the predictive value of clinicopathological parameters as risk factors for IgAN, so as to lay the foundation for a more suitable prediction and diagnostic model for IgAN patients.

Study Population.
Data from 1395 patients who underwent kidney biopsy from May 2008 to April 2013 were first screened at the Department of Nephrology of the Second Xiangya Hospital. Cases with secondary causes of mesangial IgA deposits, missing pathological results, or computer errors were excluded. Data from 1042 patients were finally included in the analysis: 847 cases were confirmed as IgAN, and the remaining 195 cases, with other types of kidney disease, formed the non-IgAN group. All studies were conducted in accordance with the guiding principles of the Declaration of Helsinki and were approved by the ethics committee of Xiangya Medical College of Central South University.

Evaluation of clinicopathological parameters
Clinical data were obtained during the hospitalization of patients undergoing kidney biopsy. Urine samples were collected for 24 hours to measure urine protein excretion. Pathological indicators of the glomeruli, tubulointerstitium, and renal vessels were scored semi-quantitatively according to reference standards [11]: mesangial proliferation (0-4 points), interstitial fibrosis (0-3 points), and renal tubular atrophy (0-3 points). Renal insufficiency, gross hematuria, and interstitial inflammatory lesions were scored by presence (1 point) or absence (0 points). Tonsil enlargement was graded I to III and scored as follows: no enlargement, 0 points; grade I, 1 point; unilateral grade II, 2 points; bilateral grade II, 3 points; grade III, 4 points.

Principal Component Analysis
PCA is a linear dimensionality reduction technique that reduces a set of potentially correlated variables (p variables, say) to fewer variables that still contain most of the original information; these new variables, called principal components (PCs), are linearly uncorrelated. The principles and steps of PCA have been described previously [9]. In short, PCA looks for the linear combination of variables that extracts the maximum variance. It then removes this variance and seeks a second linear combination that accounts for the largest proportion of the remaining variance, and so on for a third, a fourth, ..., up to a p-th linear combination. Computationally, this involves calculating the eigenvalues and eigenvectors of the covariance matrix, sorting the eigenvectors in descending order of eigenvalue, and finally projecting the data onto the directions of the eigenvectors. In effect, the original p indicators are linearly combined into new comprehensive indicators. The classic criterion is the variance of PC1 (the first linear combination selected, that is, the first comprehensive indicator): the larger Var(PC1), the more information PC1 contains. PC1 is therefore chosen to have the largest variance among all linear combinations, and is called the first principal component. If the first principal component is not sufficient to represent the information in the original p indicators, a second linear combination, PC2, is selected. To reflect the original information efficiently, information already in PC1 should not appear again in PC2; mathematically, this requires Cov(PC1, PC2) = 0. PC2 is then the second principal component, and the third, fourth, ..., p-th principal components are constructed in the same way.
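The eigendecomposition steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the study: it relies on NumPy, the function name `pca` is hypothetical, the variables are only centered (the text does not say whether they were also standardized), and the data are synthetic.

```python
import numpy as np

def pca(X, n_components):
    """PCA by eigendecomposition of the covariance matrix (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = np.cov(Xc, rowvar=False)           # p x p sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort in descending order of eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]   # loadings of the leading components
    scores = Xc @ components                 # project the data onto the eigenvectors
    ratio = eigvals / eigvals.sum()          # variance contribution of every component
    return scores, components, ratio

# Synthetic data (an assumption): three correlated columns plus one independent column.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([latent + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)]
              + [rng.normal(size=(200, 1))])

scores, components, ratio = pca(X, n_components=2)
print("variance contribution:", np.round(ratio, 4))
print("Cov(PC1, PC2):", np.cov(scores, rowvar=False)[0, 1])
```

Because the eigenvectors of a symmetric matrix are orthogonal, the sample covariance between the PC1 and PC2 scores is zero up to floating-point error, which is the condition Cov(PC1, PC2) = 0 stated above.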

Statistical Analysis.
Data were expressed as mean ± standard deviation (SD) and analyzed using Python version 3.6.3. A z-test was first used to assess the difference in each biochemical indicator between the IgAN group (including all subtypes) and the control group. Logistic regression analysis was used to test the fitness of the PCA model. A P value < 0.05 was considered statistically significant.
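As an illustration of the screening step, a two-sample z-test for a difference in means can be written with the Python standard library alone. The text does not name the z-test function that was used, so the helper `two_sample_z` below is a hypothetical stand-in, and the two value lists are made-up numbers rather than study data.

```python
import math
from statistics import mean, variance

def two_sample_z(a, b):
    """Two-sided z-test for a difference in means (large-sample approximation)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))  # standard error of the difference
    z = (mean(a) - mean(b)) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under the standard normal
    return z, p

# Hypothetical indicator values for two groups (not study data).
group_a = [110.0, 95.0, 130.0, 120.0, 105.0, 140.0]
group_b = [80.0, 85.0, 90.0, 75.0, 95.0, 88.0]

z, p = two_sample_z(group_a, group_b)
print(f"z = {z:.2f}, p = {p:.2g}, significant at 0.05: {p < 0.05}")
```

With the thresholds used in this study, |z| > 1.96 corresponds to P < 0.05 and |z| > 2.58 to P < 0.01. The tiny samples here only keep the example short; the normal approximation is meant for large groups such as those analyzed in this study.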

Results
Clinical and pathological characteristics of the population
A total of 847 patients with IgAN were enrolled in this study (Table 1). The average age was 33 years (range, 13 to 73 years), and males accounted for 43%. The average estimated glomerular filtration rate (eGFR) was 95.59 ml/min. In the analysis of related data, the eGFR of healthy individuals is about 125 ml/min, so the average eGFR is in the normal range, but the variance is 95.83, indicating large fluctuations in eGFR among patients with kidney disease.
Specifically, 96 patients had an eGFR below 50 ml/min, and 33 patients had an eGFR above 150 ml/min. The average urinary protein was 1.58 g/24 h, with a very wide range from 0.0028 g/24 h to 16.331 g/24 h. Tonsil abnormalities were divided into five grades (0, 1, 2, 3, 4), with an average score of 0.5. The average serum albumin content was 35.47 g/L (normal value 40-55 g/L), and the average blood urea nitrogen content was 9.20 mmol/L.
Normal values of serum creatinine differ between hospitals. Generally, the normal range of serum creatinine is 44-133 µmol/L. A serum creatinine above 133 µmol/L means that kidney damage has occurred, indicating renal insufficiency and renal failure.
In our data, the average serum creatinine was 101.11 µmol/L, with a minimum of 3.35 µmol/L and a maximum of 1333.0 µmol/L. Thus, the average renal function of the patients was normal, but some patients had conditions such as uremia. The means and variances of the other factors are shown in Table 1.

Performance of the PCA model
We extracted the factors listed in Table 2 that are clearly related to IgAN and removed all rows containing computer errors, which reduced the original 1395 records to 1042; these were divided into IgAN and non-IgAN groups according to the results of kidney biopsy. To better diagnose IgAN, we used principal component analysis to extract the n most important principal components for the next step of constructing a classification model. To determine the number of dimensions to retain, we plotted the variance of all components against the number of dimensions (Fig. 1). When the number of reduced dimensions is 20, the sum of the variances of all components is 90%, that is, about 10% of the information is lost. When the dimension is equal to 3, the sum of the variances of all components is close to 100%, that is, about 2% of the information is lost, and this loss is within our acceptable range.
The variance contribution rates of the three principal components are 59.25%, 20.44%, and 18.14%, for a cumulative contribution rate of 97.83% (Fig. 2); these 3 principal components therefore represent 97.83% of the information for judging IgAN from biochemical indicators. By taking the indicator with the largest absolute coefficient in each principal component, 3 representative indicators can be selected in place of the 27 indicators. From the absolute values in Table 3, the biochemical parameters (also called variables) that determine the size of PC1 are serum creatinine, blood uric acid, and eGFR. Judging from the signs of the coefficients, higher levels of serum creatinine and blood uric acid and a lower level of eGFR could promote the occurrence of IgAN. The biochemical parameters that determine the size of the remaining two principal components are as follows: PC2, blood uric acid; PC3, eGFR. The data indicate that abnormalities in serum creatinine, blood uric acid, and eGFR typically reflect an increased risk of IgAN.
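The cumulative contribution rate quoted above is simply the running sum of the per-component rates, which a short check makes explicit. The helper `components_needed` and the 95% threshold are assumptions introduced only for illustration; the text does not state a numeric cutoff.

```python
from itertools import accumulate

# Variance contribution rates (percent) of PC1 to PC3, as reported in the text.
contribution = [59.25, 20.44, 18.14]
cumulative = list(accumulate(contribution))

def components_needed(contribution_pct, threshold_pct):
    """Smallest number of leading components whose cumulative rate reaches the threshold."""
    for k, total in enumerate(accumulate(contribution_pct), start=1):
        if total >= threshold_pct:
            return k
    return None  # threshold not reached with the components given

print("cumulative (%):", [round(c, 2) for c in cumulative])
print("components needed for 95%:", components_needed(contribution, 95.0))
```

The first two components together account for 79.69% of the variance, so a third component is needed before the cumulative rate reaches 97.83%.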

Logistic regression analysis
Starting from the sample data for the 27 indicators mentioned above, principal component analysis (PCA) was used to reduce the dimension of the data to 3. The dimension-reduced data were randomly divided into a training set (80% of the data) and a test set (20% of the data). A logistic regression algorithm was used to construct a diagnostic model of kidney disease, and recall and accuracy were used to evaluate the model. The results are shown in Fig.

Discussion
IgAN is the most common primary chronic glomerular disease, with a worldwide incidence exceeding 1.5 per 100,000 persons per year [12]. However, validated tools for predicting disease risk remain limited [13]. A good and widely accepted risk prediction and diagnosis model can inform prevention and diagnosis for patients [14,15] and help clinicians make decisions about precise treatment and follow-up. In addition, including predictive factors in risk prediction models can raise awareness of the importance of these factors during medical examinations.
Starting from 27 pathological and biochemical indicators clearly associated with IgAN, 3 indicators (serum creatinine, blood uric acid, and eGFR) were selected as the final evaluation factors by principal component analysis. The results show that a higher global sclerosis rate, higher serum creatinine, and lower eGFR might promote the occurrence of IgAN, which is helpful for both the non-invasive diagnosis and the risk prediction of IgAN. In the regression analysis, the recall for IgAN was 91.93% and the overall accuracy was 71.29%, which shows that the PCA prediction model has high reliability. Proteinuria is recognized as one of the most important risk factors for IgAN [2,16,17], and our dataset contains two measures that directly detect it: urine protein quantification (g/24 h) and qualitative urine protein. However, our qualitative protein data are seriously incomplete, which may affect the model results. In our experience, the onset of gross hematuria is directly related to the number of urinary red blood cells, urine microscopy, and urine occult blood. Our dataset records these four items, but the insufficient number of routine urine microscopy and occult blood records may also affect the model results. These are limitations of this study.
In summary, we explored the predictive value of the clinicopathological parameters of IgAN using the PCA model, which provides evidence for the prevention and non-invasive diagnosis of IgAN.

Table 1
The demographic, clinical, and laboratory data. Abbreviations: MAP: mean arterial pressure; eGFR: estimated glomerular filtration rate; BUN: blood urea nitrogen.

In order to study the correlation between biochemical characteristics and IgAN, this study used a z-test to perform a bivariate analysis comparing the IgAN group with the healthy group. According to the statistical values in Table 2, the following 27 indicators differ significantly between the sick and healthy groups: serum IgA, serum C3, urinary protein quantification, the number of glomeruli under light microscope, the number of sclerotic glomeruli, global sclerosis rate, number of segmental sclerosis, total number of crescents, number of cellular crescents, number of fibrocellular crescents, number of fibrous crescents, number of glomerular adhesions, mesangial cell and mesangial matrix hyperplasia, basement membrane condition, capillary lumen opening, tubular atrophy, interstitial fibrosis, renal interstitial infiltration of inflammatory cells, serum creatinine, tonsil abnormality, gross hematuria history, systolic blood pressure, age, renal insufficiency, eGFR, blood urea nitrogen, and blood uric acid. That is, the above factors are related to IgAN. In contrast, serum IgM, serum C4, segmental sclerosis rate, capillary endothelial hyperplasia, total cholesterol, body weight, gender, blood triglycerides, and diastolic blood pressure were not significantly different between the IgAN group and the healthy group.

Table 2
The correlation between biochemical characteristics and IgAN. P value < 0.05 or |z| > 1.96 was considered significant; P value < 0.01 or |z| > 2.58 was considered very significant. Abbreviations: eGFR: estimated glomerular filtration rate; BUN: blood urea nitrogen.

Table 3
Component matrix. The larger the absolute value of a variable's coefficient, the greater its contribution to the principal component. Abbreviations: eGFR: estimated glomerular filtration rate; BUN: blood urea nitrogen.

A total of 209 cases were selected as the test set, including 124 cases from IgAN patients and 85 cases from the control group. Of the 124 IgAN cases, 114 were correctly judged as IgAN, a recall rate of 91.93%. However, only 35 of the 85 controls were correctly judged as non-IgAN, so the overall accuracy of the model is 71.29%. These data show that the PCA predictive model has a good fit.
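The recall and accuracy reported for the test set follow directly from the four counts given in the text, as the short calculation below shows. The function name is hypothetical, and specificity is not reported in the text; it is derived here from the same counts as an added illustration.

```python
def recall_and_accuracy(tp, fn, tn, fp):
    """Recall (sensitivity) and overall accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return recall, accuracy

# Test-set counts reported in the text.
n_igan, n_control = 124, 85   # 209 test cases in total
tp, tn = 114, 35              # correctly judged IgAN / correctly judged non-IgAN
fn, fp = n_igan - tp, n_control - tn

recall, accuracy = recall_and_accuracy(tp, fn, tn, fp)
specificity = tn / (tn + fp)  # derived quantity, not reported in the text
print(f"recall={recall:.4f} accuracy={accuracy:.4f} specificity={specificity:.4f}")
```

The recall is 114/124 ≈ 0.9194, which matches the reported 91.93% when truncated rather than rounded, and the accuracy is 149/209 ≈ 0.7129. The derived specificity of 35/85 ≈ 0.41 shows that most of the errors are control cases judged as IgAN, which is why the overall accuracy is much lower than the recall.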
The risk factors for IgAN may include both clinical and pathological aspects. Studies in recent decades have identified some clinical indicators related to renal outcome. In this study, we constructed and analyzed a risk-prediction model for IgAN. We first compared the clinical and pathological indicators of the IgAN group and the control group, then selected the indicators with significant differences to build a PCA model, and finally tested the fitness of this model using logistic regression analysis. The new model is superior to the original model used to predict IgAN because it eliminates the correlation between the evaluation indicators and reflects contribution rates through objective and reasonable coefficients for each principal component. The regression analysis further supports the importance of these factors.
Abbreviations: IgAN: IgA nephropathy; GFR: glomerular filtration rate; PCA: principal component analysis; PC: principal component; SD: standard deviation; eGFR: estimated glomerular filtration rate; BUN: blood urea nitrogen.