Data source
The data in this study were obtained from the National Health and Nutrition Examination Survey (NHANES) in the United States. NHANES uses a complex, multistage, stratified probability sampling design to characterize the health and nutritional status of the non-institutionalized US population. Because the data are nationally representative, they are a valuable resource for large-scale epidemiological studies and for developing clinical prediction models. All NHANES survey protocols were approved by the Research Ethics Review Committee of the National Center for Health Statistics, and all participants provided signed informed consent before taking part in the survey. All NHANES data used in this study are publicly available at https://www.cdc.gov/nchs/nhanes.
Participant selection
We used data from three NHANES cycles: 2011–2012, 2013–2014, and 2015–2016. Across these cycles, a total of 29,902 participants completed demographic surveys, laboratory examinations, and health status questionnaires. We then applied the following exclusions in sequence. First, because the study focused on osteoarthritis in adults, participants under 20 years of age were excluded (n = 12,854), leaving 17,048 adults. Next, individuals with missing osteoarthritis-related data were excluded (n = 1,256). Finally, to ensure data completeness, we excluded individuals lacking essential demographic information (n = 1,456), those with missing questionnaire or dietary data (n = 2,462), and those with missing laboratory data (n = 508). A total of 11,366 participants were ultimately included in the analysis, as shown in Fig. 1.
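As a consistency check on the participant flow, the sequential exclusions can be tallied in a few lines. The sketch below uses only the counts reported in the paragraph above; the function and variable names are illustrative and are not part of the original analysis.

```python
# Participant flow, using the counts reported in the text.
total_participants = 29_902
exclusions = [
    ("age < 20 years", 12_854),
    ("missing osteoarthritis data", 1_256),
    ("missing demographic information", 1_456),
    ("missing questionnaire or dietary data", 2_462),
    ("missing laboratory data", 508),
]


def apply_exclusions(start, steps):
    """Return (reason, n_excluded, n_remaining) after each exclusion step."""
    remaining = start
    flow = []
    for reason, n_excluded in steps:
        remaining -= n_excluded
        flow.append((reason, n_excluded, remaining))
    return flow


flow = apply_exclusions(total_participants, exclusions)
for reason, n_excluded, remaining in flow:
    print(f"Excluded {n_excluded:>6,} ({reason}); remaining {remaining:,}")
```

The first step leaves 17,048 adults and the last step leaves 11,366 participants, matching the reported analytic sample.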
Definition of osteoarthritis
Case definitions in epidemiological studies often rely on self-reported osteoarthritis (OA)[17]. March et al. reported 81% agreement between self-reported OA and clinically confirmed OA[18], suggesting that OA can be reliably self-reported. All participants were asked whether they had ever been diagnosed with arthritis: "Has a doctor or other health professional ever told you that you have arthritis?" Participants who answered "yes" were then asked, "What type of arthritis do you have?" Based on the answers to these two questions, participants were categorized as having OA, other types of arthritis, or no arthritis[19].
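The two-question classification can be expressed as a small decision function. This is a minimal sketch: the answer strings ("yes", "no", "osteoarthritis") are hypothetical labels chosen for illustration and are not the coded values used in the NHANES files. Returning None for unusable answers is also an assumption, reflecting the exclusion of participants with missing OA-related data.

```python
def classify_arthritis(ever_told_arthritis, arthritis_type=None):
    """Map the two questionnaire answers to an arthritis category.

    ever_told_arthritis: "yes" or "no" (hypothetical labels).
    arthritis_type: reported type, e.g. "osteoarthritis" (hypothetical label),
        or None when no type was given.
    Returns "OA", "other arthritis", "no arthritis", or None when the
    answers are missing or unusable.
    """
    if ever_told_arthritis == "no":
        return "no arthritis"
    if ever_told_arthritis == "yes":
        if arthritis_type == "osteoarthritis":
            return "OA"
        if arthritis_type is not None:
            # Assumption: any other named type counts as "other arthritis".
            return "other arthritis"
    return None


print(classify_arthritis("yes", "osteoarthritis"))
print(classify_arthritis("yes", "rheumatoid arthritis"))
print(classify_arthritis("no"))
```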
Demographics, laboratory factors, anthropometrics, and lifestyles
In accordance with previous research, we considered the following as potential influencing factors: age, gender, race, educational level, poverty income ratio (PIR), marital status, body mass index (BMI), alcohol consumption, smoking status, recreational physical activity, self-reported health status, dietary intake factors, renal function, and the systemic immune-inflammation index (SII)[20, 21].
Age (years) and PIR were used as continuous variables. Gender was classified as male or female. Race was classified as Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, and Other or multiracial. Education was divided into five categories: less than 9th grade, 9th–11th grade, high school graduate/GED, some college or AA degree, and college graduate or above. Alcohol consumption was defined by the response to the question “Have you had at least 12 drinks of any type of alcoholic beverage in any one year?” and was divided into two groups (yes or no). Smoking status was classified as current, former, or never smoking according to the responses to the questions “Have you smoked at least 100 cigarettes in your entire life?” and “Do you currently smoke cigarettes?” Marital status was classified into six categories: married, widowed, divorced, separated, never married, and cohabiting with a partner. Based on BMI, individuals were categorized as underweight (< 18 kg/m²), normal weight (18–25 kg/m²), overweight (25–30 kg/m²), or obese (≥ 30 kg/m²). Leisure-time physical activity was categorized into two groups: individuals reporting moderate or vigorous leisure-time physical activity in a typical week were classified as active, and those reporting none were classified as inactive. Hypertension was defined as a self-reported diagnosis by a doctor, the use of anti-hypertensive medications, or blood pressure ≥ 140/90 mmHg. Diabetes mellitus (DM) was defined as any of the following: a self-reported diagnosis by a doctor, HbA1c level ≥ 6.5%, fasting plasma glucose (FPG) level ≥ 7.0 mmol/L, random blood glucose level ≥ 11.1 mmol/L, two-hour glucose tolerance test blood glucose level ≥ 11.1 mmol/L, or the use of diabetes medications or insulin.
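Several of the derived covariates above reduce to simple rules, sketched below for illustration. The function and argument names are hypothetical. Three points are assumptions rather than statements from the text: BMI intervals are treated as lower-inclusive and upper-exclusive (so a BMI of exactly 25 is "overweight"), "blood pressure ≥ 140/90 mmHg" is read as systolic ≥ 140 or diastolic ≥ 90, and a participant who has smoked at least 100 cigarettes but does not currently smoke is treated as a former smoker.

```python
def smoking_status(smoked_100_cigarettes, currently_smokes):
    """Classify smoking status from the two questionnaire items."""
    if not smoked_100_cigarettes:
        return "never"
    return "current" if currently_smokes else "former"


def bmi_category(bmi):
    """Categorize BMI (kg/m^2); interval boundaries are an assumption."""
    if bmi < 18:
        return "underweight"
    if bmi < 25:
        return "normal weight"
    if bmi < 30:
        return "overweight"
    return "obesity"


def has_hypertension(self_reported, on_medication, systolic, diastolic):
    """True if any criterion holds (systolic/diastolic rule is an assumption)."""
    return bool(self_reported or on_medication
                or systolic >= 140 or diastolic >= 90)


def has_diabetes(self_reported, on_medication_or_insulin,
                 hba1c=None, fpg=None, random_glucose=None, ogtt_2h=None):
    """True if any criterion holds; None means the value was not measured.

    Units: HbA1c in %, glucose values in mmol/L.
    """
    def at_least(value, cutoff):
        return value is not None and value >= cutoff

    return bool(self_reported or on_medication_or_insulin
                or at_least(hba1c, 6.5) or at_least(fpg, 7.0)
                or at_least(random_glucose, 11.1) or at_least(ogtt_2h, 11.1))


print(smoking_status(True, False))
print(bmi_category(27.4))
print(has_hypertension(False, False, systolic=128, diastolic=92))
print(has_diabetes(False, False, hba1c=5.6, fpg=7.2))
```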
Dietary supplement information was obtained from questionnaires designed to collect detailed data on dietary supplement usage. In each NHANES cycle, participants provided detailed dietary intake information for two 24-hour periods, which was used to estimate intake of total energy, caffeine, and fiber. The first dietary recall was collected in person during the NHANES visit, and the second was collected by telephone 3 to 10 days later. Intake was estimated as the average of the two recalls, or as the first-day value when only one day of data was available[22]. Data on urinary creatinine and albumin were obtained from the NHANES laboratory examinations. Blood biomarkers included vitamin D level, neutrophil count (NC), lymphocyte count (LC), and platelet count (PC). As previously described, the systemic immune-inflammation index (SII) was calculated as PC × (NC / LC). Because the distribution of SII was right-skewed, we applied a log2 transformation to SII[23, 24].
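The intake averaging rule and the SII formula can be illustrated with a short sketch. The function names and the example values are hypothetical and chosen only to show the arithmetic; they are not NHANES records.

```python
import math


def mean_intake(day1, day2=None):
    """Average of two 24-hour recalls, or day 1 alone if day 2 is missing."""
    if day2 is None:
        return day1
    return (day1 + day2) / 2


def sii(platelet_count, neutrophil_count, lymphocyte_count):
    """Systemic immune-inflammation index: PC x (NC / LC)."""
    return platelet_count * (neutrophil_count / lymphocyte_count)


def log2_sii(platelet_count, neutrophil_count, lymphocyte_count):
    """log2-transformed SII, used to reduce right skew."""
    return math.log2(sii(platelet_count, neutrophil_count, lymphocyte_count))


# Illustrative values only (all three counts in the same units).
example_sii = sii(250, 4.0, 2.0)
example_log2 = log2_sii(250, 4.0, 2.0)
print(example_sii)             # 500.0
print(round(example_log2, 3))  # 8.966
print(mean_intake(2100, 1900)) # 2000.0
print(mean_intake(2100))       # 2100
```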
Statistical analysis
NHANES uses a complex, multistage survey design. Producing nationally representative estimates therefore requires applying sample weights that account for this design[25]. In this study, however, we used the raw, unweighted NHANES data to construct the machine learning models. Weighted data are typically used to estimate nationwide incidence or prevalence, which was not our aim; we needed only the relationship between OA and individual characteristics in order to train the models[11].
Data were analyzed using R software (version 4.3.0). Continuous variables are presented as mean ± standard deviation (SD) and were compared between groups using t-tests. Categorical variables are presented as frequencies and percentages and were compared using chi-square tests. All statistical tests were two-sided, and a P-value < 0.05 was considered statistically significant.
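To make the categorical comparison concrete, the sketch below computes a Pearson chi-square statistic and its degrees of freedom for a contingency table. It is an illustration in Python rather than the R code used in the study, it applies no continuity correction, and the table counts are hypothetical toy values, not study data.

```python
def pearson_chi_square(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of counts.

    No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical 2 x 2 table: rows are two groups, columns are two categories.
toy_table = [[10, 20],
             [20, 10]]
chi2_stat, chi2_dof = pearson_chi_square(toy_table)
print(round(chi2_stat, 4), chi2_dof)  # 6.6667 1
```

The statistic is then compared against the chi-square distribution with the given degrees of freedom to obtain the P-value.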
For model development, we randomly divided the 11,366 participants into two cohorts in a 7:3 ratio (7,958 individuals for training and 3,408 for validation). The training cohort was used for model development, and the validation cohort was used for internal validation. Least absolute shrinkage and selection operator (LASSO) regression, the XGBoost algorithm, and the random forest (RF) algorithm were each applied with 10-fold cross-validation to assess feature importance. We then developed a clinical risk prediction nomogram by integrating the results of the three algorithms according to the importance of the feature variables.
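The partitioning scheme can be sketched as a random 7:3 split followed by 10-fold cross-validation within the training cohort. This is an illustrative Python sketch, not the R procedure used in the study; the function names, the seed, and the integer stand-ins for participant identifiers are assumptions. A plain random split of 11,366 records yields 7,956 and 3,410, so the exact cohort sizes need not match the reported 7,958 and 3,408, which depend on the partitioning routine actually used.

```python
import random


def train_validation_split(ids, train_fraction=0.7, seed=0):
    """Shuffle the ids and split them into training and validation lists."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) index lists for k near-equal folds.

    Folds are contiguous blocks, so the input should already be shuffled.
    """
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size


participant_ids = list(range(11_366))  # stand-ins for participant identifiers
train_ids, validation_ids = train_validation_split(participant_ids)
folds = list(k_fold_indices(len(train_ids), k=10))

print(len(train_ids), len(validation_ids))
print(len(folds), [len(test) for _, test in folds])
```

Each fold serves once as the held-out set while the remaining nine are used for fitting, so every training participant is held out exactly once.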
To evaluate model discrimination, we plotted receiver operating characteristic (ROC) curves and calculated the area under the curve (AUC). To evaluate the clinical utility of the model, we additionally performed decision curve analysis (DCA).
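Both evaluation measures have simple definitions that can be sketched directly. The AUC equals the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case, and decision curve analysis plots the net benefit, TP/n − (FP/n) × pt/(1 − pt), across threshold probabilities pt. The code below is an illustrative Python sketch with hypothetical toy labels and scores, not the study's R implementation; the pairwise AUC loop is quadratic and intended only for small examples.

```python
def roc_auc(y_true, y_score):
    """AUC via pairwise comparison of cases and non-cases (ties count 0.5)."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute AUC.")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def net_benefit(y_true, y_score, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold."""
    n = len(y_true)
    true_pos = sum(1 for y, s in zip(y_true, y_score)
                   if s >= threshold and y == 1)
    false_pos = sum(1 for y, s in zip(y_true, y_score)
                    if s >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1 - threshold)


# Hypothetical toy data: 1 = case, 0 = non-case.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(toy_labels, toy_scores))           # 0.75
print(net_benefit(toy_labels, toy_scores, 0.5))  # 0.25
```

A decision curve is obtained by evaluating net_benefit over a grid of thresholds and comparing the result with the "treat all" and "treat none" strategies.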