This study followed the 2015 TRIPOD recommendations(12).
Data Source
We designed and conducted a single-centre, retrospective longitudinal cohort study. We calculated survival for each patient in this cohort using the SFHF model and compared it with the observed outcomes to determine whether the model is a good tool for predicting 1-year survival after fracture in these individuals. Where necessary, the model was updated to better fit this population. The study included all patients with hip fracture who were consecutively admitted to the Orthopaedics Department of Shanxi Bethune Hospital in China from January 1, 2016, to October 31, 2021. Follow-up of these patients was completed via telephone interviews from March 14 to 29, 2022. The preoperative characteristics of each patient were obtained from the patient's medical records. This study was approved by the Medical Ethics Committee of Shanxi Bethune Hospital, which also waived the requirement for informed consent. All methods followed the Declaration of Helsinki.
Participants
Shanxi Bethune Hospital is located in Taiyuan, the capital city of Shanxi Province in Central China. It is a comprehensive teaching hospital and a tertiary regional referral hospital. As in the developmental study, we identified eligible participants using the following criteria: 1. Patients ≥ 50 years old; 2. Patients with low-energy fractures (fractures caused by a fall from standing height or lower), whether primary or secondary; patients with pathological fractures, high-energy fractures, periprosthetic fractures after previous hip replacement surgery, or reoperation due to failure of internal fixation for hip fracture were excluded; and 3. Patients with hip fractures, including femoral neck, intertrochanteric and subtrochanteric fractures.
Unlike in the developmental study, eligible patients could have other low-energy fractures sustained in the same fall that caused the hip fracture, such as wrist fractures, proximal humerus fractures, or vertebral compression fractures. For patients who were hospitalized twice for hip fractures on different sides during the study period, data from the most recent hospitalization were used.
All patients were considered for surgical treatment after admission, and routine preoperative preparations were completed as soon as possible. If the patient suffered from obvious medical diseases that affected the surgical treatment, the surgery was postponed, and the relevant departments were requested to assist with diagnosis and treatment. Surgical treatment was given when the patient's overall condition was stable. In the developmental study, participants underwent osteodistraction immediately after hospitalization, unless it was determined that surgery would be performed soon. In this study population, osteodistraction was not performed unless it was clear that the patient could not undergo surgery in the short term or that surgery was not considered.
As in the developmental study, different surgical approaches were implemented depending on the fracture site. For femoral neck fractures, cannulated screw fixation and total hip or hemi-hip replacement were selected according to different age ranges. Intramedullary nails were used for intertrochanteric and subtrochanteric fractures.
All eligible patients were included in this study. For individuals with missing data, we imputed the missing values (see Missing Data) rather than deleting incomplete records, to avoid the serious bias that simple deletion can cause.
Outcome
As in the developmental study, our event of interest was all-cause mortality within 1 year (365 days) after fracture. We first collected demographic characteristics and preoperative clinical indicators for each patient, followed by telephone interviews.
As in the developmental study, we used an interview strategy to increase the response rate and reduce loss to follow-up(1).
Predictors
As in the developmental study, we collected 24 individual patient indicators. Except for whether surgery was performed, the remaining 23 variables were all preoperative characteristics. They included demographic characteristics such as age, sex and medical insurance; fracture-related characteristics such as fracture site, type of fracture, days from fracture to hospitalization, days from hospitalization to surgery, and length of stay (LOS); medical history information such as ability to live independently (ALI), lung disease (LD), cardiovascular and cerebrovascular disease (CCD), kidney disease (KD), malignancy (MAL), hypertension (HYP), diabetes, and mean arterial pressure (MAP); laboratory-related factors such as partial pressure of oxygen (PaO2), haemoglobin (Hb), serum creatinine (SC), fasting blood sugar (BS), albumin (ALB), and total protein (TP); and treatment-related factors such as osteodistraction and surgery (SUR).
The definition and measurement time of each indicator can be found in the developmental study (1). Here, we describe only the variables that differ from those in the developmental study. In this study, we redefined CCD. In the developmental study, CCD positivity was defined as a previous diagnosis of myocardial infarction, cerebral infarction, cerebral haemorrhage, or extremity thrombosis, or an infarct or thrombosis that had not been diagnosed previously but was identified during the admission examination. In this study, to better capture patients at high risk of cardiovascular and cerebrovascular diseases, we also classified patients with coronary heart disease who had not been diagnosed with myocardial infarction as CCD positive. Regarding LD, positive patients also included those with previously diagnosed chronic bronchitis or chronic obstructive pulmonary disease at baseline.
Because prefracture ALI was also asked about in the telephone interview, when the ALI recorded in the medical history differed from the telephone interview result, the telephone interview result was used.
To ensure the reliability of data collection, we rechecked the original data when necessary to correct human error or to find a reasonable explanation for implausible data. For example, during data cleaning we encountered extreme values (such as extremely high or low blood sugar values) and records that should have agreed but did not (such as a mismatch between the days from hospitalization to surgery and the number of people undergoing surgery).
Sample Size
There are few studies on the required sample size for validation studies, let alone for the validation of models for survival data. Some empirical studies have shown that external validation of a prognostic model requires a minimum effective sample size of 100, and ideally 200 or more (13–15). Under this criterion, for this study to achieve an unbiased and accurate estimate of the performance of the prognostic model, the study population needed to include at least 100 deaths. Given the relatively fixed number of hip fracture patients admitted each year, the required sample size could only be reached by expanding the time frame of the study. However, going back too far makes follow-up difficult, which reduces the accuracy of the outcome information. Therefore, we had to strike a balance between ensuring the credibility of the follow-up results and expanding the study time frame. On this basis, and given the 1-year mortality of approximately 10% in the developmental study sample, we collected a dataset of approximately 1000 patients to obtain 100 deaths.
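The arithmetic behind this target can be written out as a minimal sketch. The function name and arguments are illustrative only and are not taken from the original analysis; the inputs are the event targets (100 and 200) and the approximate 10% mortality quoted above.

```python
import math


def required_sample_size(target_events: int, event_rate: float) -> int:
    """Smallest cohort size expected to yield at least `target_events` events.

    Illustrative helper (not from the original study): it simply divides the
    target number of events by the anticipated event rate and rounds up.
    """
    if not 0 < event_rate <= 1:
        raise ValueError("event_rate must be in (0, 1]")
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(target_events / event_rate, 9))


# Minimum (100 deaths) and ideal (200 deaths) targets at ~10% 1-year mortality
n_minimum = required_sample_size(100, 0.10)
n_ideal = required_sample_size(200, 0.10)
print(n_minimum, n_ideal)
```

With a 10% event rate, the minimum criterion of 100 deaths corresponds to roughly 1000 patients, which matches the dataset size targeted here.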
Missing Data
We imputed missing values using a multivariable model-based imputation technique. Although multiple imputation (MI) has been shown to produce values close to the true values, it is complex and sometimes unnecessary(16, 17). Some studies have shown that when the loss to follow-up rate is less than 10%, there is no significant difference between MI and simpler estimation methods(18). Empirical studies also show that single imputation (SI), which is more accessible to nonstatisticians, gives results in prediction model studies that do not differ significantly from those of MI (19). Therefore, we chose SI to handle missing data (using the mice package in R). All 14 variables entered into the model were included in the imputation.
Statistical Analysis Methods
In the model developmental study(1), we reported the mathematical equation of the SFHF model and the baseline hazard at 1 year after low-energy hip fracture. We could therefore readily obtain the prognostic index (PI) of the model and use it to calculate the 1-year survival rate of each individual in the validation study. Although our developmental study provided a nomogram based on a simple model, we did not use the nomogram to calculate the survival of patients in the validation cohort because doing so would have been inefficient and prone to error.
In the model developmental study, we provided the full model and the reduced model; therefore, we evaluated both models.
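The step from prognostic index to predicted survival can be sketched as follows, assuming the model has a Cox-type proportional hazards form in which the 1-year survival is the baseline survival raised to the power exp(PI). The coefficients, variable names and baseline survival below are hypothetical placeholders, not the published SFHF values, which are given in the developmental study(1).

```python
import math

# Hypothetical placeholders for illustration only; NOT the SFHF coefficients.
HYPOTHETICAL_COEFS = {"age_centred": 0.05, "malignancy": 0.80}
HYPOTHETICAL_S0_1Y = 0.95  # assumed baseline survival at 1 year


def prognostic_index(coefs, patient):
    """Linear predictor: sum of coefficient * covariate value."""
    return sum(coefs[name] * patient[name] for name in coefs)


def one_year_survival(pi, baseline_survival_1y):
    """Predicted survival under a proportional hazards model: S0(t) ** exp(PI)."""
    return baseline_survival_1y ** math.exp(pi)


patient = {"age_centred": 10.0, "malignancy": 1.0}
pi = prognostic_index(HYPOTHETICAL_COEFS, patient)
surv = one_year_survival(pi, HYPOTHETICAL_S0_1Y)
print(f"PI = {pi:.2f}, predicted 1-year survival = {surv:.3f}")
```

Applying the same two functions with the full-model and reduced-model coefficients would give the two sets of predictions evaluated in this study.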
Traditional Measures
We evaluated the performance of the SFHF model by examining its discrimination and calibration on an external validation dataset (20).
Specifically, for this study, discrimination refers to the ability of the model to distinguish patients with shorter survival times after hip fracture (predicted to be at high risk) from those with longer survival times (predicted to be at low risk)(21, 22). We report Harrell's c-index and Uno's c-value, the latter being more suitable for censored survival data(5, 21–23). The value of c ranges from 0.5 to 1, with 0.5 indicating that the model has no discriminative ability and 1 indicating perfect discrimination. Generally, a c value greater than 0.7 indicates that the discrimination of the model is acceptable.
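For readers unfamiliar with the statistic, the following is a minimal pure-Python sketch of Harrell's c-index. It is a simplified illustration (pairs with tied survival times are skipped, and Uno's inverse-probability-weighted version is not shown), not the implementation used in this study; the function name and toy data are assumptions.

```python
def harrell_c(times, events, risk):
    """Harrell's concordance index for right-censored data (O(n^2) sketch).

    A pair (i, j) is comparable when subject i died and did so before time j.
    The pair is concordant when subject i also has the higher predicted risk;
    ties in predicted risk count as half. Pairs with tied times are skipped.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier failure
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: days to death or censoring, death indicator, predicted risk
times = [60, 150, 365, 365]
events = [1, 1, 0, 0]
risk = [0.40, 0.25, 0.10, 0.05]
print(harrell_c(times, events, risk))
```

In the toy data the two patients who died have the highest predicted risks and are ordered correctly, so every comparable pair is concordant.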
Calibration refers to the degree of agreement between the 1-year survival rate predicted by the model and the actual 1-year survival rate of the patients, that is, the accuracy of the prediction(24). Some studies have compared the pros and cons of mean, weak, moderate and strong calibration and concluded that model development and validation research should focus on moderate calibration because it ensures that clinical decisions based on the model will not lead to harm(15). Therefore, we used a calibration plot to assess the calibration of the model(24). Specifically, we calculated the 1-year survival rate of each individual in the validation data using the SFHF model and divided the cohort into ten equal groups according to predicted survival. For each group, we obtained the average predicted survival rate (x-axis) and the actual survival rate (y-axis); a bootstrapping procedure with 1000 repetitions was used to obtain corrected values, which were added to the calibration plot for visual comparison. If the predictions for every group were perfect, the points would fall on the 45-degree line in the graph.
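The grouping step of the calibration plot can be sketched as below. This is an illustration under stated assumptions: the observed survival in each group is taken as the Kaplan-Meier estimate at 365 days, the bootstrap correction is omitted, and all function names are hypothetical rather than taken from the original analysis.

```python
def km_survival_at(times, events, horizon):
    """Kaplan-Meier estimate of survival at `horizon` (times in days)."""
    surv = 1.0
    death_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    for t in death_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1.0 - deaths / at_risk
    return surv


def calibration_points(pred_surv, times, events, horizon=365, n_groups=10):
    """Return (mean predicted, observed) survival per group of predicted survival."""
    order = sorted(range(len(pred_surv)), key=lambda i: pred_surv[i])
    n = len(order)
    points = []
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not idx:
            continue  # fewer patients than groups
        mean_pred = sum(pred_surv[i] for i in idx) / len(idx)
        observed = km_survival_at(
            [times[i] for i in idx], [events[i] for i in idx], horizon
        )
        points.append((mean_pred, observed))
    return points


# Toy check of the Kaplan-Meier helper: one death among four patients at risk
print(km_survival_at([100, 200, 400, 400], [1, 0, 0, 0], 365))
```

Each returned pair is one point on the calibration plot; points close to the diagonal indicate agreement between predicted and observed survival in that group.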
We also grouped the validation cohort by prognostic index size and plotted Kaplan‒Meier curves superimposed with Kaplan‒Meier curves of the developmental cohort. The discrimination and calibration performance of the model can also be obtained intuitively from this figure. If the survival curves for the risk groups in the validation cohort are separated to the same extent as the developmental data, then the discriminative power of the model is preserved in validation; if the degree of separation between the sets of curves decreases, it indicates that the model is less discriminative and vice versa. The model has good calibration if the sets of curves from the two studies are intertwined and follow the same trend. Otherwise, the calibration is poor.
We also calculated an overall performance measure, the Nagelkerke R2, to assess the model's ability to explain the variation in outcomes in the validation dataset(25). Its value ranges from 0 to 1. The larger the value, the greater the ability of the model to explain the variation and the higher the prediction accuracy of the model(26).
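As a sketch of how this measure is obtained, the Nagelkerke R2 rescales the likelihood-ratio-based (Cox-Snell) R2 by its maximum attainable value. The function below assumes the log-likelihoods of the null and fitted models are already available (for a Cox model these would be partial log-likelihoods); the function name and the choice of `n` are assumptions for illustration.

```python
import math


def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke R2 from the null and fitted (partial) log-likelihoods.

    `n` is the number of observations used in the likelihood; which count to
    use for survival models is a convention, so it is left to the caller.
    """
    lr = 2.0 * (loglik_model - loglik_null)        # likelihood ratio statistic
    cox_snell = 1.0 - math.exp(-lr / n)            # Cox-Snell R2
    max_r2 = 1.0 - math.exp(2.0 * loglik_null / n)  # its upper bound
    return cox_snell / max_r2


# A model no better than the null gives 0; a log-likelihood of 0 gives 1
print(nagelkerke_r2(-100.0, -100.0, 50), nagelkerke_r2(-100.0, 0.0, 50))
```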
Utility Measures
Judging the performance of a model by metrics such as discrimination and calibration reflects a statistical perspective and does not show whether the model has clinical value(27–29). The original intention of developing and validating prognostic models was to provide references for clinical decision-making. In this study, the original intention of the SFHF model was to identify patients at high risk of dying within 1 year after fragility fracture and to treat such patients conservatively in an attempt to reduce their mortality rate. The traditional performance measures mentioned earlier cannot tell us whether applying the model achieves this clinical goal. Decision curve analysis (DCA), which has emerged in recent years, links the model with clinical consequences and answers the most basic and most important question of whether using the model benefits clinical practice(27, 28, 30). Therefore, we performed DCA to judge whether the SFHF model had clinical value in the validation population.
Decision curve analysis shows the range of threshold probabilities within which a model is valuable and gives the magnitude of the net benefit (NB)(28, 31). The threshold probability is the level of risk at which the expected benefit of choosing an intervention equals the expected benefit of declining it(27, 32). A model is considered clinically valuable if it achieves a higher net benefit than the default strategies (treat all or treat none) within a reasonable threshold range. Whether a threshold range is reasonable depends on how much risk an individual is willing to accept for a given intervention.
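The net benefit at a single threshold can be sketched as follows, using the standard definition (true positives minus false positives weighted by the odds of the threshold, both divided by the sample size). For simplicity this sketch treats 1-year death as a fully observed binary outcome and ignores censoring, which is an assumption; the function names and toy data are illustrative.

```python
def net_benefit(risks, died, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be in (0, 1)")
    n = len(risks)
    tp = sum(1 for r, d in zip(risks, died) if r >= threshold and d)
    fp = sum(1 for r, d in zip(risks, died) if r >= threshold and not d)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(died, threshold):
    """Net benefit of the default strategy of treating every patient."""
    prevalence = sum(1 for d in died if d) / len(died)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data: predicted 1-year mortality risk and observed death (1 = died)
risks = [0.9, 0.8, 0.2, 0.1]
died = [1, 1, 0, 0]
nb_model = net_benefit(risks, died, 0.5)
nb_all = net_benefit_treat_all(died, 0.5)
print(nb_model, nb_all)  # treat-none always has net benefit 0
```

Repeating the calculation over a grid of thresholds and plotting the model against the treat-all and treat-none lines gives the decision curve.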
It is important to stress that we revised the model only if the validation results showed poor calibration, in which case a simple revision was made to make the model better suited to the new setting(33, 34). Otherwise, we did not update the model.
Risk Groups
Another purpose of establishing the SFHF model was to group patients with hip fractures according to the prognostic index to achieve stratified clinical management of such patients. There is no consensus on grouping criteria, and patients are usually divided into 3 or 4 groups based on clinical needs(4, 35). In the developmental study, we divided patients into low-, intermediate- and high-risk groups of equal size. In the validation study, however, we regrouped the developmental data in an attempt to maximize between-group differences and minimize within-group differences(5, 35). Patients were ranked by prognostic index, and the cut-off points were set at the 15th, 50th, and 90th percentiles. This resulted in a worst prognosis group comprising 10% of the total sample, close to the 1-year mortality rate (9.25%) observed in the developmental study. We divided the validation cohort into 4 groups in the same way, based on the prognostic index obtained by applying the SFHF model to the validation population.
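The grouping rule can be sketched as below. The percentile definition (linear interpolation between order statistics) and the handling of values that fall exactly on a cut-off are assumptions made for this illustration, and the function names are hypothetical; which cohort's prognostic index values supply the cut-offs is left to the caller.

```python
import bisect


def percentile_cutpoints(pi_values, probs=(0.15, 0.5, 0.9)):
    """Cut-off values of the prognostic index at the given quantiles."""
    s = sorted(pi_values)
    n = len(s)
    cuts = []
    for p in probs:
        pos = p * (n - 1)          # fractional position in the sorted list
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        cuts.append(s[lo] + (pos - lo) * (s[hi] - s[lo]))
    return cuts


def risk_group(pi, cuts):
    """Group 1 = lowest prognostic index (best prognosis), 4 = highest."""
    return bisect.bisect_right(cuts, pi) + 1


# Toy prognostic index values 0, 1, ..., 100
cuts = percentile_cutpoints(list(range(101)))
print(cuts, [risk_group(v, cuts) for v in (10, 30, 70, 95)])
```

With these cut-offs the four groups contain roughly 15%, 35%, 40% and 10% of patients, the last being the worst prognosis group.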
Development Versus Validation
To help readers understand the differences in case mix between the two studies more clearly, we provided special explanations for the differences in setting, patient inclusion criteria, predictors, and outcome definitions and measurements between the validation and developmental studies.
The dataset for the developmental study came from Fenyang Hospital, a comprehensive secondary-care teaching hospital in Luliang city, Shanxi Province, China (located in western central Shanxi). The majority of the patients admitted (approximately 80%) were from rural areas. The dataset for the validation study came from Shanxi Bethune Hospital in China, a tertiary regional referral hospital located in Taiyuan, the provincial capital, and a comprehensive teaching hospital of Shanxi Medical University. The rural population accounted for approximately 60% of the patients treated. Therefore, the two hospitals differ not only geographically but also substantially in their level of care.
Among the predictors, we redefined CCD and LD (see above for details), expanding the range of diseases counted as positive so that the definitions more accurately reflect individual patient characteristics. All variables were coded in the same way as in the developmental study. Extreme SC values were also winsorized (creatinine values above the 99th percentile were set to the 99th percentile value). We also made slight adjustments to the patient inclusion criteria. The developmental study excluded patients with hip fractures coexisting with other fractures. In the validation study, eligible patients could have concurrent fragility fractures at other sites caused by the same trauma that caused the hip fracture, regardless of whether those fractures had been operated on. These concurrent fractures occurred at the sites most prone to fragility fracture, including distal radius, proximal humerus, and vertebral compression fractures. We made this adjustment because this situation is not uncommon in clinical practice.
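The winsorization of serum creatinine described above can be sketched as follows. Only the upper tail is capped, as stated in the text; the percentile definition (linear interpolation between order statistics) and the function name are assumptions made for this illustration, and the values are toy numbers rather than study data.

```python
def winsorize_upper(values, prob=0.99):
    """Cap values above the `prob` quantile at that quantile.

    Returns the capped list (original order preserved) and the cap itself.
    """
    s = sorted(values)
    n = len(s)
    pos = prob * (n - 1)           # fractional position of the quantile
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    cap = s[lo] + (pos - lo) * (s[hi] - s[lo])
    return [min(v, cap) for v in values], cap


# Toy creatinine-like values with one extreme measurement
toy_values = list(range(1, 101)) + [1000]
capped, cap = winsorize_upper(toy_values)
print(cap, max(capped))
```

Values at or below the cap are left unchanged, so only the extreme measurement in the toy data is altered.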
There was no difference between the two studies in the outcome of interest, which was all-cause mortality within 1 year of fracture.
In general, compared with the developmental study, this validation study differed considerably in the research setting, differed slightly in the inclusion criteria and the definition of predictors, and did not differ in the outcome of interest.