Developing and validating a nomogram for predicting stroke in rheumatoid arthritis patients using electronic medical record data in northern China

Background: To develop and validate a nomogram-based model of serum lipids and inflammatory markers for predicting stroke risk in rheumatoid arthritis patients. Methods: This study included 313 rheumatoid arthritis patients with stroke and 1827 rheumatoid arthritis patients from the First Affiliated Hospital of China Medical University between January 2011 and December 2018, divided into develop and validation cohorts. After comparison with other machine learning algorithms, logistic regression analysis was used to build a nomogram for predicting stroke risk in rheumatoid arthritis patients. The performance of the nomogram was evaluated by discrimination, calibration and decision curve analysis, and was compared with that of the Framingham Risk Score for predicting stroke in rheumatoid arthritis patients. Results: The nomogram was built with the logistic regression algorithm. Its predictors included the stratifications of sex, age, systolic blood pressure, C-reactive protein, erythrocyte sedimentation rate, total cholesterol and low density lipoprotein cholesterol, and the presence of hypotensive medication (hy-med), diabetes, atrial fibrillation and coronary heart disease history. The nomogram showed good fit and good agreement between prediction and observation. Analysis of the area under the curve, the net reclassification index, the integrated discrimination improvement and clinical use suggested that this is an easy-to-use nomogram compared with the Framingham Risk Score. Conclusion: This study presents a risk nomogram that incorporates traditional risk factors, serum lipids and inflammatory markers and can be used to predict stroke in rheumatoid arthritis patients.


Introduction
Rheumatoid arthritis (RA), a globally distributed disease, is one of the leading causes of work disability, with morbidity rates of 0.32%~0.36% in China, the highest being 0.5% in northeast China 1. RA can be classified into two major types of pathological lesions, namely synovitis and pannus 2. Synovitis is the basic damage to a joint, while pannus involves pathological lesions in extra-articular joints, which can worsen the prognosis of RA and increase the risk of cardiovascular disease (CVD) and other diseases, such as stroke.
Elevated inflammatory levels and lipid abnormalities in RA patients are independent risk factors for atherosclerosis, stroke and other cardiovascular diseases [3][4][5][6]. We postulate that the risk of stroke among RA patients may be closely related to elevated levels of erythrocyte sedimentation rate (ESR), high density lipoprotein cholesterol (HDL) 7, total cholesterol (TC), triglyceride (TG), anti-cyclic citrullinated peptide antibody (anti-CCP) 8, low density lipoprotein cholesterol (LDL) and C-reactive protein (CRP) [8][9][10][11]. In clinical practice, these objective and quantitative descriptors are routinely measured in RA patients and are available in the electronic medical records (EMR). However, these biomarkers are so common that clinicians often overlook their clinical importance and may not know how to combine them for clinical decisions. Since most chronic diseases are caused by several weak risk factors acting together, statistically combining their effects may produce a more robust prediction of risk than considering these factors independently. Risk prediction tools are increasingly common in clinical medicine [12][13][14], and well-known models, such as the Framingham Risk Score for predicting cardiovascular events 15,16, are used to determine clinical guidelines. These tools are therefore sometimes powerful enough to change clinical management and decision-making.
Our previous study suggested that elevated ESR and LDL levels, and markedly elevated CRP levels (≥230 mg/L), were independent risk factors for RA patients developing stroke in our study population 17.
In this study, we aimed to develop and validate a risk nomogram, based on the Framingham Risk Score, that incorporates serum lipids and inflammatory markers for the individual prediction of stroke in RA patients.
The diagnostic criteria used in this study were the American College of Rheumatology 1987/2010 criteria 18 for RA and the CVD criteria adopted at the Fourth Academic Conference of the Chinese Neuroscience Society in 1995 19 for stroke. Patients in the RA with stroke cohort had to meet the following inclusion criteria: (1) the patient conformed to the above stroke and RA diagnostic criteria; (2) based on the first recorded times of RA and stroke in the EMR, the record of stroke was later than that of RA, in which case we considered that the patient had developed stroke after being diagnosed with RA; (3) the patient had undergone laboratory assessments at least once during the first hospitalization (i.e., serum inflammatory, antibody, complement and lipid assays); (4) the patient was over 18 years old. Patients in the RA cohort (without stroke) had to meet the following inclusion criteria: (1) the patient conformed to the above RA diagnostic criteria; (2) the patient was over 18 years old. Patients in either cohort were excluded if they met the following criteria: (1) patients who also suffered from other connective tissue diseases, including systemic lupus erythematosus, scleroderma, Sjögren's syndrome (dry syndrome) and vasculitis; (2) RA patients with coexisting ankylosing spondylitis and gouty arthritis. Finally, we randomly selected 70% of the RA with stroke patients and RA patients as the develop cohort, with the rest comprising the validation cohort 20.
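The random 70%/30% allocation described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the patient identifiers, the seeds and the choice to split each group separately are assumptions, since the text does not say how the randomization was carried out.

```python
import random


def split_cohort(patient_ids, develop_fraction=0.7, seed=0):
    """Randomly split patient IDs into develop and validation lists."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_develop = round(len(ids) * develop_fraction)
    return ids[:n_develop], ids[n_develop:]


# Hypothetical IDs; only the group sizes (313 and 1827) come from the text.
stroke_ids = [f"S{i}" for i in range(313)]
ra_ids = [f"R{i}" for i in range(1827)]

# Assumption: each group is split on its own so both cohorts keep a
# similar proportion of stroke patients.
stroke_dev, stroke_val = split_cohort(stroke_ids, seed=1)
ra_dev, ra_val = split_cohort(ra_ids, seed=2)
```

Splitting within each group is one reasonable design choice; a single shuffle over all patients would also be random but could leave the two cohorts with different stroke proportions.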

Data collection
All data were screened from the EMR, mainly including personal information, such as age, gender, height and weight; metabolic indices, including serum TC, TG, LDL, HDL and fasting blood glucose (FBG); serologic profiles, including CRP, ESR, rheumatoid factor (RF), complement 3 (C3), complement 4 (C4) and anti-cyclic citrullinated peptide (anti-CCP) antibodies; and history records of coronary heart disease (CHD), atrial fibrillation (AF), left ventricular hypertrophy (LVH) and cardiovascular disease (CVD). The recorded medication history was also included, that is, hypotensive medicine (hy-med) and biologic disease-modifying anti-rheumatic drugs (Bio-med). All laboratory tests were carried out using overnight fasting venous blood samples and conducted according to clinical standard operating procedures for inspection items. In addition, when the results of multiple laboratory tests at different time points were found during the initial data filtering, we selected the first laboratory test results from the first admission due to stroke for RA patients with stroke (RA with stroke cohort) and from the first admission for RA patients without stroke (RA cohort).
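The rule of keeping only the first laboratory result per patient can be illustrated with a short sketch. The record layout (`patient_id`, `test_date`, `crp`) and the sample values are hypothetical and are not taken from the study's EMR export.

```python
from datetime import date


def first_records(records):
    """Return, for each patient, the record with the earliest test date."""
    earliest = {}
    for rec in records:
        pid = rec["patient_id"]
        if pid not in earliest or rec["test_date"] < earliest[pid]["test_date"]:
            earliest[pid] = rec
    return earliest


# Hypothetical records: patient P1 has two CRP results on different dates.
records = [
    {"patient_id": "P1", "test_date": date(2015, 3, 2), "crp": 12.0},
    {"patient_id": "P1", "test_date": date(2013, 7, 9), "crp": 30.5},
    {"patient_id": "P2", "test_date": date(2016, 1, 20), "crp": 4.1},
]
baseline = first_records(records)
```

In this sketch the earlier of the two P1 results is retained. Restricting the records to the relevant admission (first admission due to stroke, or first admission for RA) would be a separate filtering step applied before this one.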

Statistical analysis.
All statistical tests were two-sided, with the significance level set at 0.05. Categorical data were expressed as percentages by cohort, and some continuous predictors (i.e., age, SBP and CRP) were categorized according to consensus approaches, guidelines and previously published articles.
In addition, the absence of some features in clinical medical records was inevitable (missing values accounted for less than 20% in this study). To account for missing data, we used multiple imputation in SPSS, based on 5 replications and a chained equation approach (predictive mean matching, PMM, for quantitative data and linear regression for categorical data). Machine learning models were built with scikit-learn, an open source machine learning library, using Bayesian optimization for algorithm tuning and 5-fold cross-validation for algorithm evaluation during the optimization process. We ran 6 machine learning algorithms for three 30-minute sessions, namely LR, Support Vector Machine (SVM), Random Forest (RF), xgboost (XGB), gradient boosting decision tree (GBDT) and k-Nearest Neighbors (KNN), to compare the algorithms and to evaluate the simple model and the complex model in the develop and validation cohorts respectively. The models were compared using the following evaluation metrics: accuracy, precision, recall, f1-score and balance error (ber).
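For reference, the five evaluation metrics listed above can be computed from a binary confusion matrix as in the sketch below. It assumes that "balance error" means the usual balanced error rate, i.e. the average of the error rates in the positive and negative classes; the text does not define it. The labels and predictions are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and balanced error rate (BER)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # Assumed definition: mean of the two class-wise error rates.
    ber = 1.0 - (recall + specificity) / 2.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "ber": ber}


# Illustrative labels only (1 = stroke, 0 = no stroke).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

Because the stroke group is much smaller than the non-stroke group in this study, a class-balanced metric such as BER is less dominated by the majority class than plain accuracy.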

Comparing the performance of the developed models with the Framingham Risk model and validating them in the validation cohort
The performance of the nomogram was assessed by discrimination and calibration. Calibration curves, which reflect the accuracy of point estimates of the LR function, were accompanied by the Hosmer-Lemeshow test to assess whether the model was well calibrated. The discrimination of the nomograms was evaluated using Harrell's concordance index (C-index), which measures the predictive accuracy for individual outcomes (discriminating ability) and is equivalent to the area under the curve (AUC), and was compared between the Framingham Risk Score for predicting stroke and our prediction models. In addition, the calibration was calculated via a bootstrap method with 1000 resamples in the develop cohort. Internal validation was performed using the validation cohort: the LR formula derived in the develop cohort was applied to all patients of the validation cohort. The net reclassification index (NRI) indicates the proportion of patients correctly reclassified by a new model compared with an existing or standard model, while the integrated discrimination improvement (IDI) indicates the change, between a new and an existing model, in the difference in average predicted probabilities between those who developed stroke and those who did not 22. Furthermore, NRI and IDI between the Framingham Risk Score for predicting stroke and our prediction models were assessed based on three categories: low risk (0~20%), medium risk (20%~59%) and high risk (60%~100%).
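The NRI and IDI definitions above can be made concrete with a small sketch. It uses the three risk categories given in the text, with the assumption that boundaries are handled as low < 20%, medium 20% to < 60% and high ≥ 60%. The outcome labels and predicted probabilities are invented for illustration.

```python
def risk_category(p):
    """Map a predicted probability to 0 (low), 1 (medium) or 2 (high)."""
    if p < 0.20:
        return 0
    if p < 0.60:
        return 1
    return 2


def categorical_nri(y, p_old, p_new):
    """Net reclassification index of the new model versus the old model."""
    up_event = down_event = up_non = down_non = 0
    for yi, po, pn in zip(y, p_old, p_new):
        move = risk_category(pn) - risk_category(po)
        if yi == 1:
            up_event += move > 0      # events moved to a higher category
            down_event += move < 0
        else:
            up_non += move > 0
            down_non += move < 0      # non-events moved to a lower category
    n_event = sum(y)
    n_non = len(y) - n_event
    return (up_event - down_event) / n_event + (down_non - up_non) / n_non


def idi(y, p_old, p_new):
    """Integrated discrimination improvement of new versus old model."""
    def discrimination_slope(p):
        events = [pi for yi, pi in zip(y, p) if yi == 1]
        non_events = [pi for yi, pi in zip(y, p) if yi == 0]
        return sum(events) / len(events) - sum(non_events) / len(non_events)
    return discrimination_slope(p_new) - discrimination_slope(p_old)


# Invented example: two patients with stroke, two without.
y = [1, 1, 0, 0]
p_old = [0.15, 0.50, 0.30, 0.10]
p_new = [0.30, 0.70, 0.10, 0.10]
nri_value = categorical_nri(y, p_old, p_new)
idi_value = idi(y, p_old, p_new)
```

Confidence intervals for either index would typically come from resampling, which this sketch leaves out.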

Clinical Use
In the develop cohort, decision curve analysis (DCA) was conducted to evaluate the clinical usefulness of the nomogram by quantifying the net benefits at different threshold probabilities, and was used to identify the predictive models with the best discriminative abilities 23. Net benefit was defined as the proportion of true positives minus the proportion of false positives, weighted by the relative harm of false-positive and false-negative results 24. The simple and complex models were used to predict the risk stratification of 1000 people with a bootstrap resample by the clinical impact curve.
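The net benefit definition can be written out as a short sketch. It assumes the standard decision-curve weighting, in which false positives are weighted by the odds of the threshold probability, pt / (1 - pt); the outcomes and predicted risks are illustrative only.

```python
def net_benefit(y, p, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    weight = threshold / (1.0 - threshold)  # relative harm of a false positive
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(y, threshold):
    """Net benefit of the treat-all-patients scheme at the same threshold."""
    prevalence = sum(y) / len(y)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Illustrative data: 1 = stroke, 0 = no stroke.
y = [1, 1, 0, 0, 0]
p = [0.80, 0.30, 0.40, 0.10, 0.05]
nb_model = net_benefit(y, p, threshold=0.20)
nb_all = net_benefit_treat_all(y, threshold=0.20)
nb_none = 0.0  # the treat-none scheme has zero net benefit by definition
```

A decision curve repeats this calculation over a range of thresholds and plots the model's net benefit against the treat-all and treat-none schemes.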

Baseline demographics and clinical characteristics.
The results indicated that the LR model used to build the nomogram performed at the same level as, or even better than, the other machine learning models, and that the LR algorithm was effective and feasible for prediction with the current data. They also indicated that the complex model was better than the simple model. Finally, the nomogram was built on the complex model incorporating the above independent predictors, as shown in Figure 3, which presents the score for each level of the influencing factors, the personal total cumulative score, and the predicted risk of the individual outcome event for RA patients.
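The way a nomogram turns an LR model into per-factor scores, a total score and a predicted risk can be sketched as follows. The intercept, coefficients and predictor names here are hypothetical and are not the fitted values from this study; the sketch also assumes binary predictors with positive coefficients and a reference level of zero points.

```python
import math

# Hypothetical LR parameters, NOT the values estimated in the study.
INTERCEPT = -3.0
COEFS = {"age_ge_65": 1.2, "sbp_ge_140": 0.8, "af_history": 0.6}


def nomogram_points(coefs):
    """Scale coefficients so the strongest predictor is worth 100 points."""
    beta_max = max(coefs.values())
    return {name: 100.0 * beta / beta_max for name, beta in coefs.items()}


def predicted_risk(patient, coefs=COEFS, intercept=INTERCEPT):
    """Return (total points, predicted probability) for one patient."""
    points = nomogram_points(coefs)
    total = sum(points[name] for name, present in patient.items() if present)
    beta_max = max(coefs.values())
    # Convert total points back to the linear predictor, then to a risk.
    linear_predictor = intercept + (total / 100.0) * beta_max
    return total, 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical patient: older, hypertensive, no atrial fibrillation.
patient = {"age_ge_65": 1, "sbp_ge_140": 1, "af_history": 0}
total_points, risk = predicted_risk(patient)
```

The points scale is only a rescaling of the coefficients, so reading a risk off the nomogram's total-points axis gives the same probability as evaluating the LR formula directly.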

Assessing the Performance and Internal Validation of the Stroke Nomogram
Figure 4, depicting the flexible calibration curve, indicated good agreement between prediction and observation in the develop and validation cohorts (slope = 1 and intercept = 0 for both the simple and complex models). Furthermore, we comprehensively assessed and compared the performance of the developed models with the Framingham Risk model, as shown in Table 3.
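Agreement between prediction and observation can be checked in a coarse way by grouping patients by predicted risk and comparing the mean prediction with the observed event rate in each group. The sketch below is a simplified, binned stand-in for a flexible calibration curve, not the method used to draw Figure 4; the bin count and the data are assumptions made for illustration.

```python
def calibration_table(y, p, n_bins=5):
    """Per-bin (mean predicted risk, observed event rate, size), equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y, p):
        index = min(int(pi * n_bins), n_bins - 1)  # p == 1.0 goes in last bin
        bins[index].append((pi, yi))
    table = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_pred = sum(pi for pi, _ in members) / len(members)
        observed = sum(yi for _, yi in members) / len(members)
        table.append((mean_pred, observed, len(members)))
    return table


# Illustrative data only.
y = [0, 0, 1, 0, 1, 1]
p = [0.05, 0.15, 0.25, 0.35, 0.75, 0.85]
table = calibration_table(y, p, n_bins=2)
```

For a well calibrated model, the mean predicted risk and the observed rate are close in every bin, which corresponds to points lying near the diagonal of a calibration plot.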

Clinical Use
The DCA for the nomogram in the develop cohort, presented in Figure S2 (see Additional file 1), indicated that if the threshold probability of a patient or doctor was about 15%, using the risk nomogram to predict stroke could add benefit compared with either the treat-all-patients scheme or the treat-none scheme. When the threshold probability was 15%-55%, the net benefit based on the risk nomogram suggested that the complex model (blue line) provided a higher benefit than the simple model (red line) in predicting the risk of stroke in RA patients.

Discussion
To our knowledge, this is the first prediction model for the risk of RA patients developing stroke. Using EMR data from hospitalized patients in northern China, we developed and validated a nomogram for the individualized prediction of stroke in RA patients that incorporated several factors, namely sex, age, SBP, CRP, ESR, TC, LDL and the presence of hy-med, AF and CHD history, and we compared our prediction models with the Framingham Risk Score for predicting stroke.
Prediction models use multiple predictors to estimate the absolute probability or risk that a certain outcome is present or will occur within a specific time period in an individual 25,26. Recently, some studies have taken advantage of prediction models to create multi-marker tools for clinical decisions, such as the radiomics nomogram of Yan-qi Huang et al., which incorporates the radiomics signature, CT-reported lymph node status and clinical risk factors to preoperatively predict lymph node metastasis in patients with colorectal cancer 13. The Framingham Risk Score for predicting stroke events was built with data followed for 10 years using a Cox proportional hazards regression model 27, whereas our study aimed to explore and use real-world data. Our study evaluated six machine learning models; the results suggested that the LR algorithm performed well in the evaluation and confirmed that it generalized better. Multivariable analyses that incorporate individual markers into marker panels have been embraced in recent studies. Similarly, in our study, the model incorporating serum lipids and inflammatory markers and connecting multiple individual features demonstrated adequate discrimination in the develop cohort, which then improved in the validation cohort. Our prediction model was also well calibrated. However, it remains unclear which of the two models is preferable 23. By the Hosmer-Lemeshow test, the developed models showed good fit. We did not find a statistically significant difference between the complex model and the Framingham Risk model in the AUC analysis; however, the complex model correctly reclassified 20.30% [12.54, 28.05] of patients relative to the Framingham Risk model, with an IDI of 5.65% [3.41, 7.88]. DCA, in theory, can inform on model effectiveness, or on which of several alternative models should be used 28. To this end, we used DCA to address the heterogeneity across different institutions in clinical data collection and, further, to select the best model. When the threshold probability of a patient or doctor was >15%, the net benefits of the complex and simple models were higher than those of either the treat-all-patients scheme or the treat-none scheme, and the complex model performed best when the threshold probability was 15%-55%. Therefore, the discrimination, calibration, NRI, IDI and clinical use measures suggested that this easy-to-use nomogram can effectively predict the risk of stroke among RA patients. That is, our predictive model has strong clinical value for clinical decisions for RA patients.
Some previous studies identified a number of demographic and clinical characteristics that affect the risk of RA patients developing stroke, mainly increased lipid metabolism levels, high inflammatory levels and other traditional CVD risk factors [29][30][31]. However, these studies were conducted in Caucasians and lacked data from Asians. Moreover, their results were not consistent. Zhang J et al. 29 supported the hypothesis that RA-related systemic inflammation plays a role in determining cardiovascular risk and described a complex relationship between LDL and cardiovascular risk, in line with studies that suggested a "lipid paradox" of LDL 7,32, while other studies 9,33 suggested that TC, LDL and TG levels were useful but more limited for predicting stroke in RA patients than in the general population. Our results emphasized the effects of serum TC and LDL levels in predicting the development of stroke among RA patients, and retained the traditional risk factors (hy-med, AF and CHD history) confirmed in previous studies of the Framingham Risk Score 16,27,34. For this particular population of RA patients, our findings underscored the important contribution of systemic inflammation to the development of stroke, finding the inflammatory factors CRP and ESR to be independent risk factors for stroke in RA patients, as confirmed in published papers 4,35. Notably, since the above factors play critical roles in RA patients, combining the effects of several weak risk factors may produce a more reliable prediction of risk than considering a single risk factor. Thus, to capture the comprehensive effect of the above clinical factors, we developed a nomogram incorporating several independent predictors to inform individual patient care and help prevent stroke in RA patients.
Our study still had several limitations. First, the data used in the clinical prediction model were derived from a single center, and the design would ideally have been a randomized controlled prospective trial, owing to a gap in the classification of stroke among the RA with stroke patients, in whom stroke may have been present before the onset of RA. Second, the model should be externally validated at additional sites, even though we computed the C-index for the prediction nomogram via bootstrapping validation and assessed NRI in a bootstrap resample of 1000 people for multiple validation. Third, although there is increasing evidence of an association between RA and stroke, it is unclear whether a model built simply on traditional risk factors, serum lipids and inflammatory markers predicts outcomes ideally. This disagrees with the theory that scientific inferences should be based on evidence from as many sources and individuals as possible, an accepted principle that is often used in intervention studies. Well-designed randomized clinical trials or cohort follow-up investigations are therefore necessary to further compare the predictive power of our model with that of the FRS for stroke in patients with RA.
Clinical prediction models are increasingly used to complement clinical reasoning and decision-making in modern medicine. The adoption of such models by professionals can guide their decision-making and improve patient outcomes and the cost effectiveness of care. Prediction models are not developed to replace doctors, but to provide objective estimates of health outcome risks for both individuals (patients) and healthcare providers to assist their subjective interpretations, intuitions and guidelines 36,37.
In conclusion, our study presented an effective nomogram that incorporates traditional risk factors, serum lipids and inflammatory markers and can be used to predict stroke in patients with RA. Our study can provide a theoretical basis for improving the prognosis of RA patients and preventing the onset of stroke.

Tables

Selection of the subjects
A total of 8389 RA patients, with a 9.04% prevalence of stroke (758 RA with stroke patients), were screened from the inpatient department of rheumatology and immunology of the First Affiliated Hospital of China Medical University between January 2011 and December 2018. According to the inclusion and exclusion criteria, 313 RA with stroke patients and 1827 RA patients, all aged 18 and older, were included in the study (shown in Figure S1, see Additional file 1), from the EMR of a tertiary hospital in Liaoning province. EMRs were classified and coded using the International Classification of Diseases, Tenth Revision, Beijing clinical version (RA: M05.x~M06.x; stroke (ischemic and hemorrhagic): I60, I60.1-I60.0, I61, I61.0-I61.9, I69.0, I69.1, I63, I63.0-I63.9, I69.3). The study conformed to the principles outlined in the Declaration of Helsinki and was conducted under the guidelines of the Institutional Review Board, with approval from the ethics committee of the medical science research institute of the First Affiliated Hospital, China Medical University (approval number: AF-SOP-07-1.0-01). All subjects gave written informed consent for the use of their data.

*: P=0.0013, comparison between the Framingham risk model and the simple model; #: P=0.0016, comparison between the complex model and the simple model; &: P=0.0631, comparison between the complex model and the Framingham risk model. Abbreviations: AUC, the area under the receiver operating characteristic curve; NRI, the net reclassification index; IDI, the integrated discrimination improvement; NA, not available; ref., the reference level.

Figures

Figure 2 Model

Figure 3 A
There were 218 RA with stroke patients and 1136 RA patients included in the develop cohort, and 95 RA with stroke patients and 486 RA patients included in the validation cohort. The clinicopathologic characteristics of the patients are listed in Table 1. The baseline clinicopathologic data were similar between the develop and validation cohorts. As shown in Table 2, univariate LR analysis of RA patients developing stroke indicated that the stratifications of sex, age, SBP, CRP, ESR, TC and LDL, and the presence of hy-med, diabetes, AF, CVD and CHD history, were significantly different between the RA with stroke group and the RA group (P < 0.05) in the develop cohort.
Data are represented as number and proportion; statistics were calculated by the Chi-Square test; *: statistics were calculated by Fisher's exact test. Abbreviations: RA, rheumatoid arthritis; SBP, systolic blood pressure; CHD, coronary heart disease; AF, atrial fibrillation; LVH, left ventricular hypertrophy.

Table 3
Apparent Performance and Internal Validation of the Stroke Nomogram