Prevalence And Risk Factors of Electrocardiogram Abnormalities In Patients With Rheumatoid Arthritis: A Machine Learning Study With Follow-Up Data

OBJECTIVES: Electrocardiogram (ECG) abnormalities can predict some subsequent cardiovascular events, and cardiac involvement is a major extra-articular manifestation of rheumatoid arthritis (RA). We aimed to determine the prevalence of three major ECG abnormalities in RA patients, to discover the factors associated with these abnormalities using machine learning (ML) approaches, and then to examine these preselected factors in patients with follow-up data using traditional Cox regression.
METHODS: Consecutive RA patients' records were retrieved from the hospital database; about one-third of patients had follow-up data. Abnormal ECGs with clinical significance were grouped into non-specific ST-segment/T-wave changes, QT interval prolongation, and QRS-T angle increase. Machine learning approaches assessed the factors associated with these abnormalities. The most important factors selected by the best-performing ML model were then used to construct Cox regression models.
RESULTS: Two hundred twenty-six patients were enrolled in the first-step cross-sectional study. Non-specific ST-T changes (27%) were the most prevalent abnormalities among patients with abnormal ECGs. Random forest models had the best performance in discovering associated factors for all three outcomes. Cox regression validated that rheumatoid factor and low-density lipoprotein (LDL) were common risk factors across the three abnormalities. Hypertension, ESR, and serum immunoglobulin G were influential factors for non-specific ST-T changes, prolonged QT interval, and increased QRS-T angle specifically.
CONCLUSION: Non-specific ST-T changes were the most common abnormalities seen in the ECGs of RA patients. Our findings suggest that rheumatoid factor, LDL, hypertension, and inflammatory indicators are important risk factors for these ECG abnormalities.


Introduction
Rheumatoid arthritis (RA) is a chronic, inflammatory, autoimmune disease with synovial inflammation and joint destruction [1], affecting nearly 1% of the world's population [2]. Cardiovascular disease (CVD) is a major extra-articular manifestation and comorbidity in RA, increasing morbidity and premature CVD-related mortality [3,4]. In addition to traditional cardiovascular risk factors such as diabetes, hypertension, and hyperlipidemia, the severity of inflammation in RA patients also increases the risk of CVD [5][6][7][8].
Electrocardiography (ECG) is a routine examination for detecting and diagnosing cardiac abnormalities. Resting-ECG alterations can help physicians intervene in a timely manner in patients at high CVD risk. For example, tented T-wave and/or ST-segment changes have been shown to increase all-cause mortality and to be independent risk factors for early CVD events and incident coronary heart disease (CHD) in middle-aged and older individuals [9][10][11][12]. Longer heart-rate corrected QT (QTc) intervals have been found in RA patients than in healthy controls [13]; QTc prolongation was even associated with all-cause death in RA patients [14,15]. Moreover, the QRS-T angle is a predictor of sudden death in the middle-aged general population [16] and can assess the dispersion of myocardial repolarization, a critical factor in arrhythmogenicity [17]. Although the QRS-T angle has been established as a repolarization marker [18], studies of this change in RA patients remain limited. Therefore, as an affordable and non-invasive examination independent of the operator and the patient's condition, ECG can provide preliminary but stable evidence for subsequent tests or treatments.
Multivariate logistic regression (LR) and Cox proportional hazard (CoxPH) regression are traditional methods for analyzing associated factors [19,20]. However, certain assumptions, such as no multicollinearity among variables, must be satisfied; otherwise, the results lose robustness. Furthermore, neither LR nor CoxPH regression can be applied directly to sparse, high-dimensional data. Machine learning (ML) has been widely used for nonlinear prediction tasks in recent years and is more efficient in predictive modeling without the above assumptions [21].
In the present study, we applied multiple ML methods and conventional LR to screen variables related to three types of ECG abnormalities in a cross-sectional RA population, and then validated our findings in the follow-up cohort with conventional CoxPH models.

Methods
Patient selection
RA patients' records were retrieved from the database of the Division of Rheumatology of the Third Affiliated Hospital of Sun Yat-sen University from January 2015 to September 2020. Eligible patients 1) were aged >=18, 2) met the 2018 American College of Rheumatology and European League Against Rheumatism (ACR/EULAR) classification criteria [22], and 3) had available ECG records at the time of diagnosis or at the follow-up visit. Patients were excluded if they 1) had a CVD event or severe valvular disease, 2) had hepatic and/or renal failure and/or abnormal serum electrolytes, 3) were pregnant, 4) had malignant tumors, or 5) had thyroid disorders (e.g., Graves' disease). All patients provided informed consent for the collection and use of clinical and laboratory data. The procedure complied with the Declaration of Helsinki and was approved by the ethics committee of the Third Affiliated Hospital of Sun Yat-sen University.

Sample size
The sample size was estimated with PASS 15 software (https://www.ncss.com), with the statistical power (1-β) set to 0.90, the type I error (α) set to 0.05, and assuming that the prevalence of non-specific ST-T changes was about 18% among RA patients [23]. The software calculated that a total sample size of at least 174 would suffice. Finally, we recruited 226 patients for the present study.

Data collection
We used a standard protocol to collect data (characteristics are listed in Table 1), including demographics, disease-related conditions, medication history, laboratory tests, complications, lifestyles, and digitally recorded standard 12-lead resting ECG reports. Two experienced physicians rechecked the computer-assisted readings of the ECG reports.

Missing Data
The proportion of missing values was less than 5% for every variable. Multiple imputation was implemented using the 'Multivariate Imputation by Chained Equations' algorithm in the R package 'mice' to account for missing data and to minimize bias and loss of precision.

Outcome definition
ECGs were categorized into a normal group and an abnormal group. Abnormal ECGs were further classified as a) non-specific ST-T changes; b) prolonged QTc, defined as a QTc interval >=430 ms after heart-rate correction by Bazett's formula (QTc = QT/√RR), since there is no consensus on the normal QTc range and proposed upper limits of normal extend from 430 to 470 ms [24]; and c) increased spatial QRS-T angle, defined as an angle >=90° [25,26]. When more than one abnormality was observed in the same participant, each was recorded separately.
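The outcome definitions above can be sketched in a few lines of Python. This is only an illustration, not the authors' code (the study used R): the function and field names are hypothetical, and the RR interval is assumed to be in seconds and derived from heart rate, which is the usual convention for Bazett's formula but is not stated in the text.

```python
import math

QTC_PROLONGED_MS = 430.0     # QTc threshold quoted in the text
QRS_T_INCREASED_DEG = 90.0   # spatial QRS-T angle threshold quoted in the text


def bazett_qtc(qt_ms: float, rr_s: float) -> float:
    """Heart-rate corrected QT in ms: QTc = QT / sqrt(RR), RR assumed in seconds."""
    if rr_s <= 0:
        raise ValueError("RR interval must be positive")
    return qt_ms / math.sqrt(rr_s)


def classify_ecg(qt_ms, heart_rate_bpm, qrs_t_angle_deg, nonspecific_st_t):
    """Flag each abnormality separately, as in the outcome definition."""
    rr_s = 60.0 / heart_rate_bpm  # assumption: RR derived from heart rate
    qtc = bazett_qtc(qt_ms, rr_s)
    return {
        "qtc_ms": qtc,
        "nonspecific_st_t": bool(nonspecific_st_t),  # physician-read finding
        "prolonged_qtc": qtc >= QTC_PROLONGED_MS,
        "increased_qrs_t_angle": qrs_t_angle_deg >= QRS_T_INCREASED_DEG,
    }
```

For example, `classify_ecg(400, 90, 95, True)` would flag all three abnormalities, because a QT of 400 ms at 90 beats per minute corrects to a QTc of roughly 490 ms.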

ML modeling process
To assess the influential factors for the three ECG abnormalities, we applied five ML methods to each outcome: random forest (RF), stochastic boosting ('ada'), the least absolute shrinkage and selection operator (LASSO), extreme gradient boosting trees ('xgbtree'), and a neural network ('nnet'). We then used a stacking model (with logistic regression as the core algorithm) to ensemble the models mentioned above. In total, we created six models.
To begin with, we randomly divided the samples into a training set and a validation set (ratio 70:30) with the same proportion of positive outcomes, and applied the synthetic minority oversampling technique (SMOTE) to reduce the negative effect of class imbalance on the constructed models [27]. The training set was used for modeling with 5-fold cross-validation repeated three times.
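The resampling steps described above can be illustrated with a short, dependency-free Python sketch. It is not the authors' implementation (the study used R packages); all function names are hypothetical, and the choice to oversample only the minority rows of the training set is an assumption, since the text does not say at which stage SMOTE was applied.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, valid_idx) with similar class proportions in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, valid = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train.extend(idx[:cut])
        valid.extend(idx[cut:])
    return sorted(train), sorted(valid)


def repeated_kfold(indices, k=5, repeats=3, seed=0):
    """Yield (fit_idx, holdout_idx) pairs for k-fold CV repeated several times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for j in range(k):
            fit = [i for m, fold in enumerate(folds) if m != j for i in fold]
            yield fit, folds[j]


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating toward a near neighbour."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

With 5 folds and 3 repeats, `repeated_kfold` yields 15 fit/holdout pairs, so each candidate model is fitted 15 times on the training set before it is scored once on the held-out validation set.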

ML model performance evaluation
Discrimination indicators, including the area under the receiver operating characteristic curve (ROC-AUC), precision, accuracy, recall, and F1-measure, together with the Brier score as an indicator of calibration, were evaluated for the six models in the validation set.
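As a reminder of how these indicators are computed, the following Python sketch evaluates them from true labels and predicted probabilities. It is illustrative only: the function names and the 0.5 classification threshold are assumptions, and the study itself computed its metrics in R.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Precision, recall, F1, accuracy and Brier score for binary outcomes."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    n = len(y_true)
    # Brier score: mean squared difference between probability and outcome.
    brier = sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / n
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": (tp + tn) / n,
        "brier": brier,
    }


def roc_auc(y_true, y_prob):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

A higher ROC-AUC indicates better discrimination, whereas a lower Brier score indicates better calibration, which is why the text looks for the highest AUC together with the lowest Brier score when choosing among models.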
Once the optimal models for the three outcomes were selected, we screened the most important factors for subsequent survival analyses in the follow-up cohort, to validate them and to calculate hazard ratios (HRs) by CoxPH regression. All analyses were conducted in R 3.6.1 (R Core Team, Vienna, Austria). The packages 'caret' and 'DMwR' and the function 'coxph' were used for data analysis. The detailed study flow diagram is illustrated in Fig. 1.

Statistical analysis
Statistics are presented as mean ± standard deviation (SD) for continuous variables with a normal distribution, median [interquartile range, IQR] for those without a normal distribution, and percentage for categorical variables. A 2-tailed p-value <0.05 was considered to indicate statistical significance.

Results
Prevalence of ECG abnormalities
The prevalence of the three types of ECG abnormalities is shown in Fig. 2. Seven of 226 patients at baseline and four of 95 follow-up patients had more than two ECG abnormalities. Both follow-up patients with increased QRS-T angle had already been diagnosed at baseline; in contrast, the other two abnormalities seen in follow-up patients were all new-onset during follow-up.
Models' performance and algorithm selection
RF generated the highest AUCs in predicting non-specific ST-T changes (0.974), prolonged QT interval (0.987), and increased QRS-T angle (0.915), with the lowest Brier scores in all three cases. The detailed model performance is shown in Table 2. We therefore selected the RF algorithm to construct the models.

RF models and variables importance
As shown in Supplementary Fig. 1, we included all variables in the RF models to ensure the lowest error rate. After tuning the parameters, we constructed three optimal RF models; the model parameters and the performance of the RF models can be seen in Supplementary Table 4. Fig. 3 shows the top-6 important variables of the three RF models. Age and indices of inflammation (CRP, ESR) ranked near the top in all three models. Supplementary Fig. 2 shows the top-15 important variables of the RF models.

Cox regression for validating factors found associated with ECG abnormalities in RF models
We selected 'stable' factors whose relative importance was >= 5 to perform univariate and multivariate Cox regression. The variables and crude hazard ratios (HRs) are shown in Supplementary Table 5; the adjusted ones are compiled in Table 3.
Rheumatoid factor was significant for all three kinds of abnormalities; however, its effect was opposite in

Discussion
In the present retrospective cohort study, we aimed to report the prevalence and associated factors of abnormal ventricular repolarization, comprising non-specific ST-T changes, prolonged QTc interval, and increased QRS-T angle. The prevalence of newly diagnosed non-specific ST-T changes at baseline and follow-up was higher than in previously reported studies (27% and 20% vs. 17% [23] to 18% [28]). Several factors may explain this difference; for example, our study population had more patients with moderate or high disease activity than the previous study cohort [23] (89.9% vs. 62%, evaluated with DAS28-CRP). For new-onset prolonged QTc interval at baseline and follow-up, the incidence in our cohort was markedly lower than in the previous literature (6.2% and 7.4% vs. 30% [28] and a cumulative 47.5% [28]), probably because the cohort of Chauhan K. et al. had nearly double the incidence of hypertension of ours (43% vs. 22.5%), and hypertension plays a role in the mechanism of QT interval prolongation [29]. No published study has yet reported the occurrence of increased QRS-T angle in RA patients. In our cohort, 5.3% of patients had an increased QRS-T angle.
Non-specific ST-T changes are common findings even in the general population [30]. Previous studies have indicated that these changes correlate significantly with cardiovascular morbidity and mortality [11,12,31,32]. Although the clinical significance of non-specific ST-T changes in patients with RA without CVD is still uncertain, it is tempting to speculate that they represent subclinical CVD. QT/QTc interval prolongation is also an independent cardiovascular risk factor [15,[33][34][35] and is mainly related to cardiac arrest, especially in the RA population. Emerging data have demonstrated a strong relationship between QTc and pro- and anti-inflammatory cytokines [36,37]. In addition, parasympathetic dysfunction, one of the autonomic dysfunctions seen in RA patients, could influence the QTc interval [38].
RF models could help with the preliminary discovery of associated factors. Inflammatory markers, including ESR and CRP, were important for all three kinds of ECG abnormalities. When validated in follow-up patients with multivariate CoxPH regression, an increased concentration of rheumatoid factor was found to be a risk factor for all three abnormalities, except when the concentration was lower than 900 IU/ml for ST-T changes. CRP and ESR were significantly associated with QRS-T angle increase and QTc prolongation, respectively; however, due to the limited sample size, we did not find a significant concentration-effect relationship. ESR and CRP are part of the extended DAS-28, but their importance varies across the different ECG abnormalities.
Age and disease duration were not critical for a higher risk of non-specific ST-T changes, consistent with the previous study [23]. However, another factor, the duration of glucocorticoid (GC) use, which could partly reflect age, disease duration, and disease activity, had different effects on ST-T changes and QTc prolongation. Regular, adequate GC therapy might be more important.
A population-based study has shown that LDL levels are independently associated with subclinical atherosclerosis [39], which has been linked to non-specific ST-T abnormalities [12]. LDL also significantly influenced non-specific ST-T changes in our study, in which even a near-optimal controlled level of LDL was protective. For QRS-T angle increase, however, LDL should be strictly controlled at or under the optimal level.
Hypertension is a well-documented risk factor for ST-T changes [40]. In our cohort, patients with grade 2 hypertension or in the low-risk group had a higher risk, whereas hypertensive patients in the high-risk group appeared to be protected against ST-T changes. This difference is probably because these patients were more cautious and active in controlling their blood pressure.
Female sex has been reported to be positively associated with QTc prolongation in RA [23]. These sex differences appear to correlate with age-dependent changes in serum levels of sex hormones. Androgens accelerate cardiac repolarization and shorten action potential duration through the effect of testosterone on calcium currents [41]. Moreover, low BMI was an independent predictor of QTc interval prolongation in our cohort, similar to what a cross-sectional study in women with eating disorders demonstrated [42]. Therefore, appropriate nutritional support for underweight RA patients is recommended to lower the risk of cardiac conduction abnormalities. Serum uric acid (sUA) level has a positive correlation with prolonged QTc interval, especially in men [43]. Our results also showed that even in those with slightly increased sUA, the risk of QTc prolongation was doubled. QTc intervals were found to be shortened in female subjects with sickle cell anemia [44]. Those with QTc prolongation in our cohort were mostly women (6 of 7); therefore, mild anemia might have a protective effect on the QTc interval.
The impact of immunoglobulin G (IgG) on QTc prolongation in RA patients is a novel finding of our study. Abnormal levels of serum IgG, especially increased ones, are among the early markers of autoimmune diseases [45]. Aberrantly glycosylated and autoimmunity-triggered IgG [46] could be responsible for inflammation-associated atherosclerotic cardiovascular symptoms [47].
In conclusion, machine-learning analysis can be used when multicollinearity occurs (for instance, disease duration may depend on age) or when working with high-dimensional data, and it can preselect the essential variables for the subsequent Cox regression. Another strength of the present study is the comparison between several machine learning approaches, since each has its own pros and cons. As a relatively affordable cardiovascular examination, ECG may be recommended for all RA patients at their first visit and at follow-up visits, because such a systemic and chronic disease needs interdisciplinary cooperation to assess the condition holistically and longitudinally. Assisted by machine-learning methods and validated by traditional CoxPH regression, some objective information might be acquired before requesting cardiologist consultations and further expensive or invasive examinations.
Our study has several limitations that can affect its generalizability to other populations and the interpretation of its clinical significance. First, the sample size was insufficient, especially the number of follow-up subjects. Second, we only used resting 12-lead ECGs rather than 24-hour ECG monitoring (Holter), which can measure diurnal variations of ECG intervals; this may have caused a higher 'false-negative rate' when we read the ECGs. Third, we cannot exclude the possibility of patient selection bias because only a single tertiary referral center participated in this study. Therefore, the prevalence of ECG abnormalities in this study cannot represent the real rate in China. Moreover, although ECG is an affordable, stable, quick, and harmless test, the information it can offer is limited. Other cardiac imaging examinations, for example, echocardiography, cardiac magnetic resonance imaging, or radionuclide perfusion imaging, could provide more details about cardiac lesions. Further longitudinal, prospective studies assessing the role of potential risk factors will help clarify the mechanism of ECG abnormalities among RA patients.

Conclusion
Our data reveal that non-specific ST-T changes were the most common abnormalities seen in RA patients' ECGs, followed by QTc prolongation and QRS-T angle increase. Inflammatory factors and rheumatoid factor, along with LDL levels, are more important than disease activity for these ECG abnormalities. Moreover, relatively strict management of LDL and uric acid may benefit RA patients. Efforts to monitor the ECGs of these key populations need to be instituted.

Declaration
The authors did not receive financial support or benefits from commercial sources for the work reported in the manuscript. The authors declare no conflict of interest.

*: After entering variables into the Cox regression equations, the proportional hazards (PH) assumption was checked using statistical tests based on the scaled Schoenfeld residuals (R function 'cox.zph'). P<0.05 was deemed to violate the PH assumption; in that case, we created a time-by-variable interaction term (time-dependent variable) and replaced the original variable with it.

Figure 1 The study flow diagram
Figure 2 The prevalence of ECG abnormalities