This paper describes a retrospective longitudinal study conducted at a single center. The dataset comprised the reviewed electronic medical record (EMR) data of 1,280 patients treated between January 2001 and December 2018. All patients had previously been diagnosed with ankylosing spondylitis (AS) according to the modified New York criteria. The Hanyang University Seoul Hospital institutional review board approved this study (HYUH 2020-03-012-003). Informed consent was waived because this study retrospectively reviewed the EMRs.
Clinical characteristics, including age, sex, disease duration from the first to the last follow-up, HLA-B27 positivity, eye involvement with uveitis, and peripheral joint involvement with arthritis outside the axial joints, were investigated. Baseline laboratory results consisted of hemoglobin (Hb), hematocrit (Hct), blood urea nitrogen (BUN), creatinine, aspartate transaminase (AST), alanine transaminase (ALT), alkaline phosphatase (ALP), albumin, cholesterol, protein, creatine phosphokinase (CPK), gamma-glutamyl transferase (GGT), lactate dehydrogenase (LDH), erythrocyte sedimentation rate (ESR), and C-reactive protein (CRP) levels. The prescribed drugs were classified as follows: non-steroidal anti-inflammatory drugs (NSAIDs), methotrexate, steroids, sulfasalazine, and biological disease-modifying antirheumatic drugs (bDMARDs). The mean values of the laboratory tests, the cumulative counts of prescribed medications, and the clinical characteristics were used as machine learning features.
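This per-patient aggregation (mean lab values, summed prescription counts) can be sketched as follows; the column names and values are hypothetical and serve only to illustrate the construction:

```python
import pandas as pd

# Hypothetical per-visit EMR records; column names are illustrative only.
visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "crp":        [0.8, 1.2, 3.5, 2.5],  # lab value at each visit
    "esr":        [10, 14, 30, 26],
    "nsaid":      [1, 1, 0, 1],          # 1 if NSAID prescribed at that visit
})

# Mean of laboratory tests and sum of prescriptions per patient,
# mirroring the feature construction described above.
features = visits.groupby("patient_id").agg(
    crp_mean=("crp", "mean"),
    esr_mean=("esr", "mean"),
    nsaid_sum=("nsaid", "sum"),
).reset_index()
```

Each row of `features` then corresponds to one patient's machine learning feature vector, alongside the categorical clinical characteristics.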
Assessment of radiographic progression
Two radiologists (SL and KBJ) independently assessed the images and scored them according to the mSASSS (0–72). Intraobserver reliability (consistency for a single reader) was excellent (intraclass correlation coefficient (ICC) = 0.978, 95% CI 0.976 to 0.979), and interobserver reliability (agreement between the two readers) was also excellent (ICC = 0.946, 95% CI 0.941 to 0.950) [16, 17].
The first, second, and third visits (T1, T2, and T3, respectively) were defined as the time points at which the first, second, and third radiographs were taken. The radiographic progression at each time point was calculated using the following formula: P_{n+1} = (mSASSS_{n+1} − mSASSS_n) / (T_{n+1} − T_n). A radiographic progressor was defined by worsening of the mean mSASSS by more than 1 unit over 1 year, and AS patients were categorized into two groups: progressors and non-progressors.
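The progression rate and the progressor definition above can be expressed as a short sketch (the 1 unit/year threshold is taken from the text; the numeric example is hypothetical):

```python
def progression_rate(msasss_next, msasss_prev, t_next, t_prev):
    """P_{n+1} = (mSASSS_{n+1} - mSASSS_n) / (T_{n+1} - T_n), in units per year."""
    return (msasss_next - msasss_prev) / (t_next - t_prev)

def is_progressor(rate, threshold=1.0):
    """Progressor: mean mSASSS worsens by more than 1 unit over 1 year."""
    return rate > threshold

# e.g. mSASSS rising from 10 to 14 over 2 years gives 2 units/year -> progressor
rate = progression_rate(14, 10, t_next=2.0, t_prev=0.0)
```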
We composed three clinical datasets for the prediction of radiographic progression: a baseline dataset at the first visit (T1) with the radiographic progression at the second visit (P2), a two-point dataset at the first and second visits (T1 + T2) with the radiographic progression at the third visit (P3), and a three-point dataset at the first, second, and third visits (T1 + T2 + T3) with the radiographic progression at the fourth visit (P4). Using these three clinical dataset matrices, we trained three prediction models for the progressor and non-progressor groups (Figure 1). Three different machine learning classifiers were applied: logistic regression with the least absolute shrinkage and selection operator (LASSO) using Python with the Scikit-learn package (https://github.com/scikit-learn/scikit-learn), random forest (RF) using the Scikit-learn package, and extreme gradient boosting (XGBoost) using the Xgboost package (https://github.com/dmlc/xgboost). These algorithms were selected based on their superior performance and readiness for application. All continuous clinical features were centered and scaled to a mean of zero and a standard deviation of one (z-score transformation was performed before feature selection). The three models were run and compared to determine the best combination for classifying progressors and non-progressors in the three clinical datasets. All possible combinations of model hyperparameters were investigated by grid search using the GridSearchCV library in the Scikit-learn package.
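A minimal sketch of this setup is shown below, using synthetic data in place of the clinical matrices; the hyperparameter grids are illustrative, not the paper's actual search space, and XGBoost's `XGBClassifier` (omitted here) exposes the same fit/predict interface and drops into the identical `GridSearchCV` call:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the baseline (T1) matrix: 29 features, imbalanced classes.
X, y = make_classification(n_samples=200, n_features=29,
                           weights=[0.7, 0.3], random_state=0)

# LASSO-penalized logistic regression with z-scoring inside the pipeline,
# so the scaler is fit on the training folds only.
lasso = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
])
grid = GridSearchCV(lasso, {"clf__C": [0.01, 0.1, 1.0]},
                    cv=StratifiedKFold(n_splits=3), scoring="roc_auc")
grid.fit(X, y)

# Random forest searched over an (illustrative) grid in the same way.
rf = GridSearchCV(RandomForestClassifier(random_state=0),
                  {"n_estimators": [100, 200]},
                  cv=StratifiedKFold(n_splits=3), scoring="roc_auc").fit(X, y)
```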
With a linear combination of the selected features weighted by their respective coefficients, a LASSO regression model was used for prediction. RF, one of the representative ensemble methods, is widely used because it is powerful and computationally lighter than other ensemble methods. RF constructs several tree-type base models and combines them into an ensemble through a technique called bootstrap aggregating, or bagging. XGBoost is a gradient-boosted decision tree algorithm designed for large datasets. The detailed hyperparameters of the three models in the three datasets are described in the supplementary information (Supplementary eDocument 1).
The prediction models were evaluated in three rounds of three-fold cross-validation. The procedures, including z-normalization and machine learning classification, were executed separately on the training data within each cross-validation fold. Because the progressor and non-progressor groups are not equally distributed in the dataset, we used stratified cross-validation to divide the data. In each round, the entire dataset was randomly divided into three equal parts with stratification. Two of these parts were used as the training dataset, and the remaining part was used as the test dataset. The process was repeated three times for each of the three datasets and three models. The one-point dataset for predicting radiographic progression at the second visit (T1 for P2) had 29 features, the two-point dataset for radiographic progression at the third visit (T1 + T2 for P3) had 53 features, and the three-point dataset for radiographic progression at the fourth visit (T1 + T2 + T3 for P4) had 77 features. The average performance of each model over the three-fold cross-validation in the one-point dataset became the estimated performance of that model; the same applied to the other two datasets. The predictive power of each predictor was assessed through receiver operating characteristic (ROC) curve analysis.
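The evaluation scheme above (three rounds of stratified three-fold cross-validation, with normalization refit on each training split) can be sketched as follows; the data are synthetic and the classifier is a stand-in for any of the three models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in for the 29-feature one-point dataset.
X, y = make_classification(n_samples=150, n_features=29,
                           weights=[0.7, 0.3], random_state=0)

# Three rounds of stratified three-fold cross-validation (9 held-out folds).
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)

# z-normalization sits inside the pipeline, so it is refit on each
# training split and never sees the test fold.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty="l1", solver="liblinear"))

aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
mean_auc = aucs.mean()  # the average over folds is the estimated performance
```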
Features with larger contributions to the LASSO regression model were selected. We performed feature importance analyses using RF and XGBoost to verify the robustness of the results. Variable importance was evaluated using model-based variable importance scores, and the important variables (particularly those informative for radiographic progression) were captured when the models were fitted to the training dataset [23, 24].
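Both routes to variable importance described above can be sketched on synthetic data: coefficient magnitudes from the fitted LASSO model, and the impurity-based `feature_importances_` exposed by RF (XGBoost's `feature_importances_` attribute is analogous):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)
X = StandardScaler().fit_transform(X)  # z-score so coefficients are comparable

# LASSO: features with larger-magnitude (non-zero) coefficients are selected.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
lasso_rank = np.argsort(-np.abs(lasso.coef_[0]))  # indices, most important first

# RF: model-based importance scores captured after fitting on training data.
rf = RandomForestClassifier(random_state=0).fit(X, y)
rf_rank = np.argsort(-rf.feature_importances_)
```

Comparing the two rankings is one simple way to check that the selected features are robust across model families.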
For continuously distributed data, the results are shown as means with standard deviations (SD); between-group comparisons were performed using Student's t-test. Categorical or dichotomous variables were expressed as frequencies and percentages and were compared using the chi-squared test. Areas under the ROC curve (AUCs) were used to determine the diagnostic performance, with the optimal thresholds of the clinical parameters determined by maximizing the Youden index (sensitivity + specificity − 1). Machine learning model training and statistical analysis were performed using Python (Python Software Foundation, version 3.5.2).
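The AUC computation and Youden-index threshold selection can be sketched as follows; the labels and scores are hypothetical and chosen only to make the optimal cut-off obvious:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical binary outcomes and predicted scores for illustration.
y_true  = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.8, 0.9, 0.2, 0.7])

auc = roc_auc_score(y_true, y_score)

# Youden's J = sensitivity + specificity - 1 = TPR - FPR; the optimal
# threshold is the point on the ROC curve that maximizes J.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
best = np.argmax(tpr - fpr)
optimal_threshold = thresholds[best]
```

Here the positives and negatives separate perfectly, so the AUC is 1.0 and the Youden-optimal threshold falls at the lowest positive-class score.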