Overview
The hospital ethics committee approved this study. We retrieved patient data from the medical records database for the period from January 1, 2014, to December 31, 2021, covering a total of 443 patients who were diagnosed with "intertrochanteric fracture of the femur," "fracture of the greater trochanter," or "subtrochanteric fracture" and underwent internal fixation surgery. We selected 94 sets of anteroposterior and lateral X-ray films of the proximal femur. Six orthopedic surgeons independently judged, based on subjective impression, whether the fracture in each set had healed. The same observers then scored the same imaging data using the Radiographic Union Score for Hip (RUSH). After four weeks, the study designer selected 47 of the 94 imaging datasets and presented them in random order to the same observers for reevaluation by subjective impression and RUSH scoring. Subsequently, the study designer and observers held a consensus meeting at which the follow-up radiograph timing and the series of follow-up X-ray films for each case were revealed, and a consensus was reached on whether each set of X-ray films depicted a healed fracture. Figure 1 summarizes the study method.
Observers
Our review panel comprised three orthopedic residents who had completed residency training (with an average orthopedic clinical practice experience of 3.3 years, ranging from 3 to 4 years) and three orthopedic attending physicians with expertise in the treatment of hip fractures (with an average experience of 10.0 years, ranging from 9 to 12 years). We recruited observers from two distinct seniority groups to investigate potential variations in scoring based on work experience and to evaluate the reliability of the RUSH scoring system.
Selection of Imaging Data
A follow-up imaging dataset consists of an anteroposterior X-ray and a lateral X-ray of the proximal femur on the affected side of a patient. The inclusion criteria were as follows: ① subtrochanteric fracture treated with an intramedullary nail; ② X-ray imaging performed at least three weeks after surgery; ③ an interval of at least 30 days between X-ray films of the same case; and ④ a known clinical outcome of fracture healing or failure. The exclusion criteria were as follows: ① other types of hip fractures; ② fixation using a plate or dynamic hip screw; and ③ incomplete preoperative or postoperative follow-up imaging data. Ultimately, the study included 38 cases and 94 follow-up imaging datasets that met the above criteria.
The selected imaging data represent various stages of recovery, with 18.0% of the images taken between 3 and 6 weeks after the initial surgery, 22.3% taken from 7 to 12 weeks, 22.3% taken from 13 to 24 weeks, 17.0% taken from 25 to 52 weeks, and 20.2% taken after 52 weeks. Personal information such as names, gender, hospital admission numbers, and time stamps was removed from the images. The reviewing physicians were unaware of the follow-up times, and the images contained no potential markers of elapsed time, such as suture pins. The reviewers did not participate in case selection or imaging data processing.
Description of the RUSH Score
Bhandari et al. developed the Radiographic Union Score for Hip (RUSH) using a methodology similar to that of the Radiographic Union Score for Tibial fractures (RUST)(10). Its purpose is to standardize the assessment of fracture healing based on plain radiographs. This scoring system was developed by considering various definitions and criteria of fracture healing from the literature, such as bone bridging and disappearance of the fracture line(5). The RUSH score incorporates four component scores: cortical bridging, cortical fracture line disappearance, cancellous bone calcification, and cancellous bone fracture line disappearance. The total RUSH score ranges from a minimum of 10 to a maximum of 30. Chiavaras et al. described the indicators and scoring criteria of the RUSH scoring system in detail (11).
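As an illustration of how the four component scores combine into the 10 to 30 total, the following minimal Python sketch may help. It assumes the layout commonly described for RUSH, which is not spelled out in this section: each cortical item is rated on four cortices from 1 to 3, and each cancellous item receives a single rating from 1 to 3. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption: four cortices, each rated 1 to 3 per cortical item.
Cortices = Tuple[int, int, int, int]


@dataclass(frozen=True)
class RushScore:
    """Hypothetical container for one reviewer's RUSH checklist."""
    cortical_bridging: Cortices
    cortical_line_disappearance: Cortices
    cancellous_consolidation: int
    cancellous_line_disappearance: int

    def __post_init__(self):
        # Reject anything other than four ratings per cortical item.
        for item in (self.cortical_bridging, self.cortical_line_disappearance):
            if len(item) != 4:
                raise ValueError("each cortical item needs four ratings")
        ratings = (
            *self.cortical_bridging,
            *self.cortical_line_disappearance,
            self.cancellous_consolidation,
            self.cancellous_line_disappearance,
        )
        # Every individual rating must be 1, 2, or 3.
        if any(r not in (1, 2, 3) for r in ratings):
            raise ValueError("each rating must be 1, 2, or 3")

    @property
    def cortical_total(self) -> int:
        # Sum of the two cortical items (8 to 24 under the assumption above).
        return sum(self.cortical_bridging) + sum(self.cortical_line_disappearance)

    @property
    def cancellous_total(self) -> int:
        # Sum of the two cancellous items (2 to 6 under the assumption above).
        return self.cancellous_consolidation + self.cancellous_line_disappearance

    @property
    def total(self) -> int:
        # Overall RUSH score (10 to 30).
        return self.cortical_total + self.cancellous_total
```

Under these assumptions, a checklist with every rating at 1 yields the minimum of 10 and one with every rating at 3 yields the maximum of 30, which matches the range stated above.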
Evaluation of Fracture Healing and RUSH Score
The observers initially evaluated the healing status of the fracture based on their subjective impression of the set of imaging data, selecting between the options of "healed" or "not healed." Fracture healing is not a binary, all-or-nothing outcome, and radiographs taken during the healing process show gradual changes. However, only fractures that the orthopedic surgeons deemed fully healed were classified as "healed." Subsequently, each reviewer filled out the RUSH checklist for the same radiographs, which included specific inquiries regarding cortical bridging and trabecular consolidation across the fracture. Figure 2 displays X-ray images taken during two follow-up periods of the same patient and their corresponding RUSH scores.
Procedure for the Adjudication of Fracture Healing
Patient care is more consistent when different surgeons evaluating the same patient arrive at similar conclusions about fracture healing. We therefore made interobserver reliability our main evaluation focus while also investigating intraobserver variability. The study designer uploaded the 94 digital radiograph sets in random order to a secure, password-protected electronic adjudication platform for online display. All reviewers received training in RUSH scoring and in using the system to access and view the X-ray images. Observers had 14 days to review these imaging data on the electronic adjudication platform and complete a fracture healing assessment for each case. Their judgments were independent, and the reviewers were blinded to their colleagues' assessments.
After four weeks, the study designer (ZTJ) selected 47 of the original 94 image files, rerandomized them, and uploaded them. We gave the reviewers an additional two weeks to reevaluate the imaging data. The questionnaires from the two assessments were gathered and summarized in a table. The study designer then arranged a consensus meeting involving all reviewing doctors and disclosed the timing of follow-up and the serial follow-up radiographs of the same case. Ultimately, the participants reached a consensus on whether the fractures had healed based on the provided imaging data for each set.
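The retest step can be sketched in Python as follows. This is only an illustration: it assumes the 47 sets were drawn at random without replacement, which the text does not state explicitly, and the function name, the identifier scheme, and the seed are hypothetical.

```python
import random


def select_for_retest(set_ids, n_retest, seed=None):
    """Draw n_retest imaging sets without replacement, in random order.

    random.Random.sample returns the chosen items in random order, so the
    result can be used directly as the rerandomized presentation order.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(list(set_ids), n_retest)


# Hypothetical identifiers 1..94 for the original imaging sets.
retest_order = select_for_retest(range(1, 95), 47, seed=2021)
```

Recording the seed alongside the resulting order would allow the same presentation sequence to be regenerated later for auditing.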
Data Analysis
The formula 2 × (number of reviewers)² is commonly used for sample size calculation in consistency studies(12). With six observers, the sample size requirement was 72 cases (2 × 6² = 72). The 94 sets of image data therefore exceed this requirement and are sufficient to achieve good statistical precision. We used the Fleiss kappa (κ) coefficient to evaluate the consistency of multiple reviewers' subjective judgments on fracture healing. As the RUSH score was treated as a continuous variable ranging from 10 to 30, the intraclass correlation coefficient (ICC) was employed to assess the consistency of RUSH scores among observers. Both kappa (κ) and ICC values range from −1.00 (indicating absolute disagreement) to 1.00 (indicating total agreement). According to the guidelines proposed by Landis and Koch, kappa values below 0.00 are categorized as poor agreement, 0.00 to 0.20 as slight agreement, 0.21 to 0.40 as fair agreement, 0.41 to 0.60 as moderate agreement, 0.61 to 0.80 as substantial agreement, and 0.81 to 1.00 as almost perfect agreement(13). Finally, we conducted regression analysis and ROC analysis to explore the possibility of optimizing the RUSH scoring criteria, specifically the overall score, the overall score for cortical bone items, and the overall score for cancellous bone items.
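The sample-size rule and the Fleiss kappa statistic described above can be sketched in plain Python as follows. This is a minimal illustration, not the software used in the study; the function names are hypothetical and the input table is toy data, not study results.

```python
def required_sample_size(n_reviewers):
    """Rule of thumb quoted in the text: 2 x (number of reviewers)^2."""
    return 2 * n_reviewers ** 2


def fleiss_kappa(counts):
    """Fleiss' kappa for a cases-by-categories table of rating counts.

    counts[i][j] is the number of reviewers who placed case i in
    category j; every case must have the same number of ratings.
    """
    n_cases = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every case needs the same number of ratings")
    n_categories = len(counts[0])

    # Proportion of all ratings that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(n_categories)
    ]
    # Observed pairwise agreement for each case, averaged over cases.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_cases
    # Agreement expected by chance from the category proportions.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_obs - p_exp) / (1 - p_exp)


# Toy table: columns are ("healed", "not healed"), six reviewers per case.
toy_counts = [[6, 0], [5, 1], [1, 5], [0, 6]]
toy_kappa = fleiss_kappa(toy_counts)
```

For six reviewers, `required_sample_size(6)` returns 72, matching the requirement stated above. Established statistical packages offer equivalent and more thoroughly tested routines, which would normally be preferred for a formal analysis.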