Overview of the UK Biobank population
The UK Biobank project is a large-scale, prospective, and ongoing study initially established to allow extensive investigation on genetic factors and lifestyle determinants of a diverse range of common diseases of middle-aged and older adults1. To recruit the intended sample size of approximately 500,000 participants, over 9 million invitations were sent to individuals registered in the UK National Health Service (NHS) with an inclusion age range of 40–69 years old and based on their living within a reasonable distance from an assessment center. Baseline recruitment and data collection from 503,317 participants who consented to join the study took place between 2006 and 2010 in 22 assessment centers throughout Scotland, England, and Wales2. All participants gave written, informed consent, and the study was approved by the Research Ethics Committee (REC number 11/NW/0382). Further information on the consent procedure can be found elsewhere. Subsets of baseline participants were invited later for follow-up visits and/or were asked to provide data on certain online questionnaires at certain timepoints. The following datasets from different timepoints are used to address different aims of our study.
Biological Modalities
Broad biological modalities indexing four diverse domains of physiological health were selected for analyses. Biological measures included blood immunoassays, polygenic risk scores, bone structure estimates derived from Dual-energy X-ray absorptiometry (DXA) scans, and brain imaging phenotypes. For certain analyses, we segmented blood and brain modalities into subcategories. Specifically, blood was divided into three distinct assay types: those assessing inflammatory or immune functions, metabolic functions, and hematological functions. Brain imaging was divided into: T1-weighted MRI for anatomical insights, diffusion MRI (dMRI) for white matter architecture, and resting-state functional MRI (fMRI) for brain connectivity.
The availability across study visits, derivation, and processing of each modality are detailed below:
Blood immunoassays (52 features, N = 476,787)
The UK Biobank measured 31 haematological (https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/haematology.pdf) and 30 biochemical (https://biobank.ndph.ox.ac.uk/showcase/ukb/docs/serum_biochemistry.pdf) blood assays using high fidelity haematology, immunoassay, and clinical chemistry analyzers. For the purposes of this study, 52 distinct assays were selected measuring a range of physiological functions including liver health (Gamma-Glutamyl transferase), kidney function (Cystatin C), systemic inflammation (C-Reactive protein), and immune activation (Leukocyte count), among others. Blood assays were collected twice: initially during the baseline evaluation of 502,494 participants and subsequently in a subset of 20,193 participants who returned approximately 4 years later (range: 1–6 years), hereafter referred to as the 4-year follow-up. At both time points, participants missing data for 22 or more assays (constituting over half of the selected assays) were excluded from further analysis. For the remaining participants with any missing data (up to 13% of the total sample, or a maximum of 62,216 individuals), we imputed missing values using the median of the available data. This process yielded a final analytic cohort of 476,787 individuals at baseline and 19,360 at the 4-year follow-up. To address distributional biases in raw assay measures, we applied logarithmic transformation to all immunoassay data prior to model training.
Polygenic Risk Scores (20 features, N = 408,597)
Blood samples collected at the baseline visit allowed different types of assays to be performed, including genetic data. A total of 488,118 underwent genotyping and have available phenotype information. Participants of similar genetic ancestry (Field: 22018), participants of non-European ancestry (Field: 22006), as well as participants who failed quality control (Field: 22010) were excluded from genetic analysis, resulting in 408,597 participants. This population was split into a training and a testing set, with the latter comprising individuals who attended the 9-year follow-up (n = 40,898), as conducted in Tanguay-Sabourin et al. 20236. Consequently, polygenic risk scores were estimated within the training set of 367,699 participants and evaluated on the testing set of 40,898 participants as described below:
Genome Wide Association Studies (GWAS): For each training set across each pain phenotype and pain-associated diagnosis, we conducted genome-wide association studies (GWAS) using regenie26. Regenie was selected for its ability to address cryptic relatedness among UK Biobank (UKB) participants and manage case/control imbalances. Covariates included sex, age, age squared, genotyping array, the first 40 genetic principal components, and dummy-coded recruitment sites. Sex was determined based on genetically ascertained sex (XY = man, XX = woman; field 22001). In regenie's step 1, 93K SNPs identified by the UKB for kinship estimation were analyzed. For step 2 in regenie, approximately one million SNPs were evaluated, selected from the European-specific subset of the pan-UK Biobank project. These SNPs are high-quality HapMap 3 variants that meet five criteria: located in autosomes, outside the MHC region, bi-allelic, with INFO > 0.9, and MAF > 1% in both UKB and gnomAD datasets. Individuals retained for analysis were of European ancestry (field 22006), excluding any with failed genotyping quality, unusual heterozygosity, sex chromosome anomalies, or those who had withdrawn from the study by December 2021 (according to the sample-QC file from resource 531). Lastly, narrow-sense heritability estimates for each pain phenotype and diagnosis were derived using LDSC from the GWAS summary statistics27.
Polygenic Risk Scores (PRS)
Polygenic Risk Scores (PRS) were constructed ensuring that included SNPs were not in linkage disequilibrium, starting from the genotyping data in PLINK format provided by the UK Biobank. Utilizing PLINK,28,29 we removed all non-rsID SNPs as previously described, and excluded SNPs with MAF < 0.001, genotyping error rates > 1%, and Hardy-Weinberg equilibrium P-value < 10− 12. Linkage disequilibrium analysis among SNPs was performed within a 100 kb window, 100 variant steps, and an R2 threshold of 0.2, resulting in 261,217 SNPs selected for PRS construction. These SNPs’ GWAS effect sizes were aggregated to compute the PRS using PRSice-2 for each individual and pain phenotype, incorporating only SNP effect sizes from summary GWAS data of the training sets.30 The PRS included all GWAS covariates and applied 20 P-value thresholds for SNP inclusion, ranging from P = 5x10− 8 to P = 1. Additionally, pathway-based PRS were analyzed using PRSet with Neural/Immune Gene Ontology (NIGO) pathways, focusing on those containing 10 to 100 genes for enhanced specificity.31,32,33
Brain imaging: A sub-sample of baseline participants were invited to attend a follow-up roughly 9 years later (range: 3–13 years), hereafter referred to as the 9-year follow-up. At the time of this study, data were available for 49,001 participants who had completed this follow up, which included a Magnetic Resonance Imaging (MRI) scan of the brain and the same questionnaires and assessments as baseline3. Structural and functional brain features were derived from three neuroimaging modalities, including T1-weighted MRI, diffusion MRI (dMRI) and resting-state functional MRI (fMRI). The image processing pipeline, artifact removal, cross-modality and cross-individual image alignment, quality control and phenotype calculation are described in detail in the central UK Biobank brain imaging documentation (https://biobank.ctsu.ox.ac.uk/showcase/showcase/docs/brain_mri.pdf) and by Alfaro-Almagro and colleagues4. Additionally, we excluded participants with excessive head motion or suboptimal signal-to-noise ratio, as well as those whose T1 brain images required excessive warping for non-linear alignment to standard space, defined quantitatively as deviations greater than three standard deviations from the group mean (exceeding the 99.75th percentile).
T1-weighted MRI
(1041 features, N = 41,979) T1-weighted MRI features include regional and subcortical gray matter volume, cortical thickness, surface area, and volume, regional gray/white matter intensity contrast as derived from T1-weighted MRI. Participants missing data for more than 75% of the features (exceeding 780 features) were excluded from further analysis, the remaining missing data was imputed using the median.
Diffusion-weighted MRI
(614 features, N = 36,779) Diffusion-weighted MRI features include microstructural measures of white matter tracts including mean fractional anisotropy and mean diffusivity as well as derivatives of diffusion tensor imaging including orbital diffusivity, MO, L1, L2, ICVF, and ISOVF. Participants missing more than 75% of the features (exceeding 461 features) were excluded from further analysis. This resulted in the final sample of 36,779 individuals with complete diffusion features, as such no imputation was required.
Resting state functional MRI: (38,781 features, N = 37,414) Resting-state functional Magnetic Resonance Imaging data is based on the minimally preprocessed pipeline designed and carried out by the FMRIB group, Oxford University, United Kingdom. The minimally preprocessed resting-state fMRI data from the UK Biobank were analyzed using the following preprocessing steps: motion correction with MCFLIRT4, grand-mean intensity normalization, high pass temporal filter, fieldmap unwarping, and gradient distortion correction. Noise terms were identified and removed using FSL ICA-FIX. Full information on the UK Biobank preprocessing is published4. Additional preprocessing included warping the image in native space to the 3mm MNI template (FSL), despiking using 3DDespike (AFNI from Nipype), 6-mm kernel smoothing (Nilearn), and resampling to 3-mm (for storage purposes). A modified Brainnetome atlas5 including additional midbrain, brainstem, and cerebellar regions was used to parcel the brain into 279 distinct regions. Dynamic connectivity was estimated between each region to derive a functional connectome using Dynamic Conditional Correlation (DCC). DCC is based on generalized autoregressive condition heteroscedastic (GARCH) and exponential weighted moving average (EWMA) models (implemented by https://cocoanlab.github.io/tops/).
Bone structure: (68 features, N = 40,815) At the time of this study, measures from Dual-energy X-ray Absorptiometry (DXA) bone scans were available for 40,815 participants at the 9-year follow-up (https://biobank.ndph.ox.ac.uk/showcase/ukb/docs/DXA_explan_doc.pdf). Radiographers analyzed the scans at the whole body, spine, and hip levels either during or shortly after the image acquisition. This analysis facilitated the generation of comprehensive numerical measures, including bone area, bone mineral content (BMC), and bone mineral density (BMD). A total 91 features across seven bone systems—spine, femur, head, legs, pelvis, arms, and ribs were extracted by the UK Biobank. From this pool, we excluded features with low coverage (fewer than 10,000 participants or 25% of the sample), narrowing the feature space to 68. Among the 40,815 participants assessed, we imputed any remaining missing data, which affected 4.9% or 2,015 values, using the median.
Psychosocial Phenotypes (81 features, N = 493,211)
We used a truncated set of psychosocial features originally defined in our recent paper, see Tanguay-Sabourin, C. et al for details6. As the goal of the present study is to compare biological predictors to psychosocial predictors, we excluded biological features included in the original model to generate a strictly psychosocial modality, including BMI, age, sex, ethnicity, grip strength, and blood pressure. Thus, a total of 81 features collected at the baseline and follow-up visits were selected a-priori based on their relevance to chronic pain. The selection was based on the Prognosis Research Strategy (PROGRESS) group who recently provided a framework for the development of a prognostic model to determine risk profile7. Variables were organized through an iterative approach along a hierarchical framework from 81 features into 8 categories forming three distinct domains (i.e., mental health, physical health, and sociodemographic). The three domains are as follows:
Mental health: The mental health domain includes 3 categories: i) neuroticism – all individual items and their total sum-score – based on 12 neurotic behaviors closely linked to negative affect, ii) traumas – illness, injury, bereavement, or stress in the last 2 years – including 6 events, and iii) mood – reported frequency of certain moods in the past 2 weeks and visits to a general practitioner or psychiatrist for nerves, anxiety, tension, or depression.
Physical health: The physical health domain includes 3 categories: i) physical activity – Metabolic Equivalent Task (MET) scores computed using the International Physical Activity Questionnaire (IPAQ) guidelines8, ii) sleep – questions regarding duration, napping, snoring, and sleeplessness, iii) substance use – smoking and alcohol use.
Sociodemographic: The sociodemographic domain includes 2 categories: i) socioeconomic status – education completed, income, employment, etc., ii) occupational measures – individuals present within household, social entourage, and job type (e.g., manual or physical job).
For a detailed view of the biological and psychosocial features included in this study, see Supplementary Tables 1–6.
Pain phenotypes in the UK Biobank
One-month pain
Participants reported if they experienced pain impacting their usual activities at any major anatomical body sites (head, face, neck or shoulder, back, stomach or abdominal, hip, knee, or pain all over (PAO)) in the last month. Participants answering PAO were not allowed to report additional pain locations. This category consists of both chronic and acute pain.
Acute and chronic pain sites
Participants reporting pain at a given site in the last month were then asked if the pain at that site had persisted for more than 3 months. This question was used to distinguish between a chronic pain site, one present for more than 3 months, and an acute pain site, one present for 3 months or less, according to the classification from the International Association for the Study of Pain1.
Deriving 14 target pain phenotypes
From these data, we derived 14 pain phenotypes, including a general chronic pain phenotype representing pain at any of the body sites, and a general acute pain phenotype. Additionally, we identified seven chronic pain site phenotypes, each representing pain experienced at one of the specified body sites, and four phenotypes that quantified the number of chronic pain sites reported, categorizing the extent of chronic pain spread ranging from 1 to 4 or more distinct sites.
Pain-Associated Diagnoses
Participants' diagnoses were sourced from both self-reported interviews conducted at UK Biobank assessment centers (Field IDs 20001 and 20002) and healthcare records provided by the UK National Health Services (NHS). In the interview process, trained nurses validated and, if necessary, refined the medical conditions that participants initially reported through a touchscreen questionnaire. In cases of uncertainty, participants described their condition to a nurse, who then assigned a suitable code or logged it as a free-text description. This free text was later matched to a specific entry by a physician.
For health outcomes resulting in hospital admissions, hospital inpatient records utilized the International Classification of Diseases and Related Health Problems (ICD-10) coded primary or secondary diagnoses (Field IDs 41270 and 41271). Meanwhile, primary care data, available for 45% of the study cohort (n = ~ 230,000), were obtained from Read Codes v2, as coded by general practitioners (Field ID 42040). The UK Biobank additionally provided curated data fields indicating the first occurrence of a set of diagnostic codes (Category: 1712) for a wide range of health outcomes across self-report, primary care, hospital inpatient data and death data, mapped to a 3-digit ICD code. For broader diagnoses that were effectively captured with a three-digit ICD code (e.g., Rheumatoid arthritis; ICD code: M05 & M06), we extracted information from this first occurrences database. In contrast, for conditions defined by more granular ICD coding (e.g., ankylosing spondylitis; ICD code: M081), data were manually curated from self-report and hospital inpatient records.
For our analysis, we selected 35 pain-associated diagnoses based on their significant pain prevalence and ample sample size. Criteria included having over 45% pain prevalence (acute or chronic) and an occurrence exceeding 100 participants at the 9-year follow-up. These diagnoses were then aligned with the first occurrences database, ICD-10 codes and primary care data, collating all participants with a record for each specific diagnosis. For instance, the sciatica group amalgamated both self-reported and healthcare recorded (ICD codes: M543 & M544) instances of sciatica. A detailed list of codes for the 35 diagnoses is available in Supplementary Table 7. To ascertain that an illness's onset or diagnosis predated a given study visit, we contrasted the earliest recorded illness date with the participant's respective visit date at each visit (e.g. baseline, 4-year, and 9-year follow-up).
To enable comparisons across diagnoses, we defined the healthy control group as participants with no self-reported diagnoses and no self-reported or healthcare recorded instances of the 35 pain-associated diagnoses. This resulted in a healthy control sample size of 103,034 (20.5% of full sample size) participants at the baseline visit and 5,237 (11.6% of full sample size) participants at the 9-year follow-up. As such, the control population may not be representative of the full study population.
Demographic information including sex, ethnicity, and age distributions for each diagnosis within the UK Biobank, is detailed in Extended Data Fig. 4.
Validation cohorts
To validate the biomarkers identified in this study, data were sourced from two distinct validation cohorts: the All of Us Research Program (AoU) and four datasets from the OpenPain repository (Extended Data Fig. 8).
All of Us Research Program
All of Us Research Program aims to enroll over a million U.S. participants, with a focus on individuals from underrepresented groups in research. The program collaborates with healthcare organizations to collect and share electronic health records (EHR) from consenting participants, which includes any healthcare diagnosed medical conditions and blood assays the participants have had collected. These records are standardized to the Observational Medical Outcomes Partnership (OMOP) Common Data Model by data specialists at each organization, enabling uniformity across different EHR systems9.
OpenPain repository datasets
OpenPain is a repository featuring data from brain imaging studies on chronic pain. We utilized four studies from OpenPain that provided whole-brain resting-state fMRI imaging for chronic pain subjects and pain-free controls. The first dataset involved UK participants with chronic back pain (24 patients, 30 controls), the second dataset comprised Japanese chronic back pain patients (17 patients, 17 controls), the third and fourth datasets featured US patients with chronic back pain (74 and 66 patients, respectively, with 22 controls in the fourth dataset). These datasets were amalgamated to form a unified validation cohort totaling 181 patients and 69 controls. Consistent brain imaging processing protocols were applied across all datasets, as described below
Preprocessing was conducted using fmriprep version 1.4.1, incorporating the following preprocessing steps: motion correction with MCFLIRT, susceptibility distortion correction to correct for field inhomogeneities, registration to the T1w image using FLIRT, followed by warping from native space to the 3mm MNI template (FSL), and removal of physiological and motion-related noise terms, including raw signals, squared signals, and their first derivatives for white matter, cerebrospinal fluid, translations, and rotations along x, y, and z planes from timeseries using Nilearn’s signal.clean function. Aligning with the preprocessing of the UK Biobank fMRI data, we included connectivity despiking using 3DDespike (AFNI), 6-mm kernel smoothing (Nilearn), and resampling to 3-mm, followed by parcellating the brain into 279 distinct regions using the Brainnetome atlas and estimating parcel-wise dynamic functional connectivity using Dynamic Conditional Correlation (DCC). Lastly, neurocombat harmonization was employed to mitigate site-specific effects in the data.
Statistical analysis
Machine learning Models
Predictive machine learning models were constructed to classify the 14 pain phenotypes and 35 pain-associated diagnoses from pain-free or diagnosis-free controls, respectively. Machine learning models implemented nested cross-validated logistic regression (implemented using SnapML) with a randomized hyperparameter search (implemented using scikit-learn) to optimize the ridge regression regularization hyperparameter (‘l2’) for each individual model. To address cases of class imbalance, the 'class_weight' parameter in the models was set to 'balanced'. This adjustment modifies the loss function penalty to ensure that the majority class does not disproportionately influence the model. Both inner and outer layers of the nested Cross-Validation (CV) utilized a 5-fold strategy (see Extended Data Fig. 1 for a schematic of the modeling pipeline). To prevent data leakage and minimize model bias, feature preprocessing steps including standardization and residualization were fit to the training folds and then applied to the validation fold within each nested CV. Alternative algorithms such as support vector machines and gradient boosting trees were evaluated but were not chosen as the primary machine learning methodology based on their performance metrics (Extended Data Fig. 2).
Machine learning models were trained separately on features comprising the 5 distinct modalities (blood, genetics, brain, bone, and psychosocial) to classify the 14 pain phenotypes and 35 pain-diagnoses. This approach yielded 70 distinct pain phenotype models (14 pain phenotypes multiplied by 5 modalities) and 175 distinct pain-diagnosis models (35 diagnoses multiplied by 5 modalities). Additional models were trained on the modality subcategories (inflammatory/immune, metabolic, hematological, T1 imaging, diffusion imaging, and resting state functional connectivity), for each pain phenotype and diagnosis.
To bolster the reliability of our findings and account for potential variability, each model underwent 5 iterations with unique random states (e.g., 5-times repeated 5-fold nested cross-validation). This procedure randomized the distribution of subjects across the training and test sets, from which we calculated confidence intervals for model performance metrics.
Mitigating confounding variables
The influence of confounding variables on machine learning predictions, especially in the context of biological data, is well-documented11,12,13. To mitigate potential biases introduced by confounders, we integrated regression-based deconfounding (i.e., residualization) within each of our modeling pipelines.
For each cross-validation fold:
-
A linear regression model was fit within the training data on features from a given modality to predict confounding variables.
-
This trained model was then applied to the features of both the training data and the left-out validation data to extract residual values, with the linear effects of the confounding variables removed. The residual values were then used as the feature space for model training (on training folds) and evaluation (on validation fold) of pain endpoints.
For a comprehensive list of the confounders addressed and their specific handling within each modality, refer to the detailed modality pipeline descriptions below:
Blood biochemistry pipeline
A significant portion of participants within the diagnosis groups reported taking medications known to alter blood biochemistry. These include, but are not limited to, hypertensive drugs, anti-rheumatic medications, and immunosuppressants. Recognizing the potential biases these medications could introduce in the biochemical features under study, we incorporated propensity score matching to control for them. This approach aimed to align subjects from the disease groups with their counterparts in the control group based on self-reported medication intake.
While unable to eliminate all confounding effects of medications, our method effectively mitigated a considerable extent of the confounding impact, as evident from Extended Data Fig. 6. To provide a brief overview:
-
Propensity Score Estimation: Propensity scores, which represent the likelihood of an individual being treated considering specific confounds, were generated via a logistic regression model. We extracted confounds by implementing the top 5 components from a mixed factor analysis on the predominant 5 medications, including statins, associated with each disease group. This, combined with age and sex, served as the foundation for our matching process between case and control groups.
-
Matching Process: Each individual from the case group was paired with a counterpart in the control group using the Hungarian matching algorithm14. This algorithm determines an optimal match by minimizing the overall distance or "cost" between paired subjects. This distance is calculated based on the absolute difference in their estimated propensity scores. A conservative caliper, defined as 0.211, ensures that this difference does not surpass a predetermined threshold.
-
Timing and Application: It's important to note that propensity score matching was executed prior to the cross-validation data split. This ensured consistency in the data across every modeling iteration.
Polygenic risk score pipeline
In the polygenic risk score models, we applied a thresholding procedure to the outputs of each diagnosis GWAS, yielding 20 levels of statistical significance thresholds for each single nucleotide polymorphism, ranging from P = 5x10− 8 to P = 1. Models were subsequently trained on these thresholds to determine the optimal P-value threshold to distinguish between the pain outcome group and the healthy control group, without adjusting for any confounding variables.
Brain imaging pipeline: Brain imaging models were trained on multi-modal brain imaging features including white matter features from diffusion-weighted imaging, structural features from T1-weighted imaging, and functional connectivity features from resting state imaging. Several confounding variables were adjusted for in brain imaging pipelines: average absolute head motion, volumetric scaling (to account for variations in brain size) from the T1 image to a standardized space, age, and sex.
Bone structure pipeline
Bone structure models were trained on DXA derived features including bone area, bone mineral content (BMC), and bone mineral density (BMD), and were adjusted using age and sex as confounding variables.
Psychosocial pipeline
To align with the methodologies employed in other modalities we adjusted for age and sex as confounding variables in our models that were trained on psychosocial features.
Classification of pain phenotypes
The area under the receiver operating characteristic curve (ROC-AUC) scores were calculated from models trained to differentiate between pain phenotypes and pain-free controls (Fig. 4B, C, and D). ROC-AUC scores were obtained from the testing folds of five iterations of machine learning models, each employing a 5-fold cross-validation (CV) strategy, thereby yielding a total of 25 testing metrics per pain endpoint. The reported ROC-AUC scores estimate the effectiveness of the models in distinguishing between participants with a given pain phenotype (i.e. general chronic pain, general acute pain, chronic pain site, or number of chronic pain sites) and those without. Confidence intervals at the 95% level were calculated using 1,000 bootstrap resamples of the 25 AUC metrics. Additionally, heritability estimates were derived from GWAS of the general chronic pain, acute pain, and chronic pain site models.
Phenotyping and classification of pain-associated diagnoses
Pain-associated diagnoses were phenotyped based on their self-reported chronic pain prevalence and segmented according to the number of reported pain sites, ranging from 1 to 4 or more. Furthermore, the localization of chronic pain for each diagnosis was determined by calculating the prevalence of pain at the 7 recorded body sites for each condition. These prevalence rates were then normalized (z-scored) across conditions for each specified site, providing a standardized measure of pain localization or distribution for each diagnosis (Fig. 1B).
The classification performance for classifying pain-associated diagnoses from diagnosis-free participants was assessed using the same methodology applied for the pain phenotype models. This approach resulted in 25 testing metrics for each diagnosis across each modality-specific machine learning model. To identify the most effective biological modality for classifying each diagnosis, we selected the modality with the highest average AUC score across modeling iterations for each diagnosis. We then computed 95% confidence intervals for these selected scores using 1,000 bootstrap resamples, with the results depicted in Fig. 1C. Adjacently, AUC scores derived from the psychosocial classification models are presented, enabling a side-by-side evaluation of the biological and psychosocial modalities' performance across diagnoses. Comprehensive AUC scores for diagnoses across all modalities are provided in Extended Data Fig. 5. To interpret the distinct contributions of each modality to diagnosis classification, we trained additional models on modality subcategories. Consequently, AUC scores were derived from models trained on the blood modality subcategories, including inflammatory/immune, metabolic, and hematological markers. For the brain modality, models were trained on T1 structural features, diffusion-weighted imaging for white matter integrity, and resting-state functional connectivity. Within the psychosocial modality, we investigated eight subcategories: mood, neuroticism, life stressors, sleep quality, substance abuse, physical activity levels, socioeconomic status, and occupational factors. However, the bone structure and genetic modalities remained undivided and were analyzed as whole. Within diagnosis averaged AUC scores obtained from the subcategories—or from entire modalities in the cases of bone structure and genetics—were subsequently standardized using z-scoring relative to other diagnoses within the same modality or subcategory. By z-scoring the AUC values, we aimed to delineate whether certain diagnoses are associated with unique biological and/or psychosocial alterations or if they reflect broader deviations across multiple domains of biological and/or psychosocial functioning. This approach allowed a comparison of how pain-associated diagnoses are characterized by distinct or overlapping biological and psychosocial profiles.
The deviance explained by biological factors, psychosocial factors, and their unions in classifying two distinct chronic pain conditions - rheumatoid arthritis and fibromyalgia - was assessed using the optimal biological modalities for each condition, determined by the highest average AUC scores: blood for rheumatoid arthritis and brain for fibromyalgia. These calculations were then compared with the deviance explained in models of self-reported chronic pain, employing the same biological modalities for comparability.
The deviance explained (\({D}^{2}\)) by the biological (B), psychosocial (P), and combined (B + P) predictors was calculated following a modified approach based on the methodology outlined by Dinga et al. 2020, adapted to our context of comparing different types of predictive information rather than controlling for confounds12. First, predicted probabilities were extracted from models trained on: biological predictions (B) from the blood or brain models, psychosocial predictions from the models trained on psychosocial data (P), and combined data (B + P) integrating both biological and psychosocial predictions.
Logistic regression models were then fit on these predicted probabilities to predict each specified outcome (either the diagnosis or self-reported chronic pain), generating:
-
A model using only predicted probabilities from biological data (B).
-
A model using only predicted probabilities from psychosocial data (P).
-
A model using combined predicted probabilities from both data sources (B + P).
The deviance for each predictive model was then calculated using the following formula:
$$\text{D}\text{e}\text{v}\text{i}\text{a}\text{n}\text{c}\text{e}=2 \times \left({\text{l}\text{o}\text{g}\text{l}\text{i}\text{k}\text{e}\text{l}\text{i}\text{h}\text{o}\text{o}\text{d}}_{\text{s}\text{a}\text{t}\text{u}\text{r}\text{a}\text{t}\text{e}\text{d}}- {\text{l}\text{o}\text{g}\text{l}\text{i}\text{k}\text{e}\text{l}\text{i}\text{h}\text{o}\text{o}\text{d}}_{\text{m}\text{o}\text{d}\text{e}\text{l}}\right)$$
Where the saturated model represents a model with perfect fit to the data and the maximum possible log-likelihood. Subsequently, the fraction of deviance explained (\({D}^{2}\)) for each model was derived by normalizing the model's deviance against that of a null model, which includes only an intercept, according to:
$${D}^{2}= 1- \frac{{\text{D}\text{e}\text{v}\text{i}\text{a}\text{n}\text{c}\text{e}}_{\text{m}\text{o}\text{d}\text{e}\text{l}}}{{\text{D}\text{e}\text{v}\text{i}\text{a}\text{n}\text{c}\text{e}}_{\text{n}\text{u}\text{l}\text{l}}}$$
To dissect the deviance explained into unique and shared contributions, the following calculations were performed:
-
\(\varDelta {D}_{\text{P}}^{2}= {D}_{\text{B}\text{P}}^{2}- {D}_{\text{B}}^{2}\) : The unique contribution of psychosocial predictions beyond what is explained by biological predictions.
-
\(\varDelta {D}_{\text{B}}^{2}= {D}_{\text{B}\text{P}}^{2}- {D}_{\text{P}}^{2}\) : The unique contribution of biological predictions beyond what is explained by psychosocial predictions.
-
\(\varDelta {D}_{{\text{B}}_{\cup }\text{P}}^{2}= {D}_{\text{P}+\text{B}}^{2}- \varDelta {D}_{\text{P}}^{2}- \varDelta {D}_{\text{B}}^{2}\) : The shared or overlapping contribution of biological and psychosocial predictions to the explained deviance.
Where \({D}_{\text{B}\text{P}}^{2}\), \({D}_{\text{B}}^{2}\), \({D}_{\text{P}}^{2}\) are deviance explained of models containing biological and psychosocial predictions, biological predictions, and psychosocial predictions, respectively. These results quantify the extent to which biological and psychosocial factors individually and jointly explain the variance in the binary disease or self-reported pain outcome and are displayed in Fig. 1D.
Nociplastic Functional Signature (NFS)
We developed a predictive resting-state functional signature for nociplastic pain conditions within the UK Biobank cohort at the 9-year follow-up. Based on established literature, we identified diagnoses categorically defined as nociplastic conditions, which included fibromyalgia, chronic fatigue syndrome, and widespread chronic pain15,16,17. Nociplastic conditions are believed to be mechanistically linked to central sensitization or alterations in central nervous system processing of pain leading to symptoms such as diffuse pain, fatigue, sleep disturbances, and cognitive impairments16,18,19.
To develop the Nociplastic Functional Signature (NFS), we combined fibromyalgia, chronic widespread pain, and chronic fatigue syndrome into a single nociplastic condition (n = 535) and entered them into our machine learning pipeline to differentiate from diagnosis-free participants based on resting-state functional connectivity data. Structure coefficients were then derived from the predictive model to create the signature. These coefficients highlight the connectivity patterns most predictive of the nociplastic condition, and are described below:
Structure coefficients
Given the potential instability and bias in feature weights derived from multivariate predictive models, we utilized structure coefficients, or model encoding maps, to identify features individually linked to the outputs of our predictive machine learning models20. Structure coefficients, recognized in brain imaging and predictive modeling research21,22, provide a means to extract reliable features associated with a predicted outcome. Specifically, these coefficients reveal the individual association of features—in this case the Dynamic Conditional Connectivity (DCC) between two brain regions of interest (ROIs)—with the model's output—in this case the predicted probability of a nociplastic pain condition—thereby mapping individual features to the overall multivariate model prediction.
For each participant, structure coefficient maps were generated by computing the covariance between the predicted probabilities from the logistic regression model and the parcel-wise DCC values. This process created an encoding map delineating the directionality of the relationship between each feature and the model prediction. In our analysis, this method maps which brain voxels were positively or negatively associated with the predicted nociplastic condition. For all predictive models, structure coefficient maps were calculated for each of the five iterations of the machine learning pipeline and subsequently averaged.
Measuring functional dysconnectivity in nociplastic conditions: Structure coefficient maps, derived from models individually trained to predict fibromyalgia, widespread chronic pain, chronic fatigue syndrome, were correlated using two-sided Pearson's coefficients, with significance levels adjusted for multiple comparisons using the Bonferroni method. To highlight key cortical regions involved in each condition, these maps were node-averaged and thresholded to identify the top 25% of absolute coefficient values and visualized on cortical surface renderings. It is important to note that the correlation analyses between condition coefficients utilized the complete, unthresholded coefficient vectors, which included both subcortical and brainstem areas. However, the visual representations focused solely on the top 25% of coefficients within cortical regions of interest (ROIs) as shown in Fig. 3B. The same approach was taken to visualize the NFS in Fig. 3C. Surface rendering was performed using the Surf Ice software, available at https://www.nitrc.org/projects/surfice/.
Validation of the NFS
The NFS was validated in four external datasets available in the OpenPain repository. These datasets were merged to form a validation cohort of 250 participants, allowing for an evaluation of the NFS's generalizability in classifying chronic pain. This performance was then benchmarked against the Tonic Pain Signature (ToPS), a previously validated marker for sustained experimental pain23. ToPS weights were applied to the participant-level DCC data (brainnetome atlas parcellation) in the OpenPain dataset across various thresholds (100%, 50%, 25%, 10%, 5%, 1%, 0.5%) to test for generalizability. The same thresholding procedure was employed with the NFS to establish its performance within the OpenPain dataset. For each threshold, we calculated the AUC to differentiate between participants with chronic pain and pain-free individuals. A null AUC distribution for each threshold was generated by randomizing the outcome variable across 1,000 permutations and recalculating the AUC (Fig. 3E).
Composite blood assay signature
A composite blood assay signature capturing the common alterations in blood assay features across well-predicted diseases was created. Included were 13 diseases for which blood assay models predicted at a ‘good’ or better performance level (AUC > = .70). This signature was formed by calculating the average of the structure coefficients from the predictive models for each of the 13 diseases. The resulting map represents the direction and magnitude of the 52 blood assay features that are consistently altered across these conditions (Fig. 2C). The signature could then be transformed into a subject-level risk score by taking the dot product of the signature and a subjects standardized and de-confounded blood assay feature profile.
Assessment of composite blood assay signature: The blood assay signature was assessed by testing its ability to predict cross-sectional disease prevalence at baseline and longitudinal disease incidence at roughly 4 and 9 years in the UKB. First, subject-level signatures were derived by taking the dot product of the signature coefficients and each subjects standardized and de-confounded (residualization using age and sex as confounds) blood assay feature profile at baseline. For each of the 13 diseases included in the risk score, subjects were divided into 3 case groups: 1) disease present at baseline 2) disease onset at 4 year visit 3) disease onset at 9 year visit. Similarly, 3 control groups were created 1) diagnosis-free at baseline 2) diagnosis free at 4 year visit 3) diagnosis free at 9 year visit. Each case group was matched with its respective control pairing and effect size estimates were calculated between the risk scores for the pairs using Cohens d (95% confidence interval derived from 1,000 bootstrap iterations) and AUC scores for discrimination performance. Importantly, the control group within each timepoint analysis remained the same across diseases (Fig. 2B).
The signature was also evaluated on a pooled diagnoses phenotype, where subjects with any of the 13 diagnoses were aggregated into a single disease group at each timepoint. The pooled diagnosis group was then compared to its timepoint-matched control group to calculate effect sizes, using the same methodology as described previously for the individual diagnosis (Fig. 2D).
Disease progression analysis
We analyzed repeated blood assay data from the 4-year follow-up (n = 19,360) to measure signature evolution over time. The composite signature coefficients were applied to the repeat blood assay data, yielding a follow-up signature. This analysis included the three diagnosis case groups (e.g., disease present at baseline, disease onset at 4 year, disease onset at 9 year) to examine signature dynamics across varying disease progression stages relative to control subjects with no diagnoses at follow-up. We employed a linear mixed effects model to analyze changes in the signature from baseline to follow-up, incorporating fixed effects for time, disease group, and their interaction, and accounted for individual variability by considering participants as a random effect. We used Bonferroni correction to assess the significance of group-time interaction terms for signature changes between each disease group and healthy controls (Fig. 2E).
Validation of composite blood assay signature
Validation in the All of Us Research Program (AoU)
In the All of Us Research Program (AoU), a simplified version of the composite blood signature was developed to fit the program's available assays, resulting in a simplified signature with 10 retained assays detailed in Supplementary Table 1. This validation involved defining pain-diagnosis and control groups through curated ICD-10-CM codes within the AoU's Dataset and Cohort Builders, selecting individuals based on their recorded diagnoses of the 13 conditions relevant to the original signature. Healthy controls were defined as participants without inflammatory, musculoskeletal, or cardiovascular diagnoses. We applied this sparse signature to AoU participant blood assay profiles, adjusting for age and sex, to create participant-level sparse signatures. The effectiveness of this simplified signature was quantified using Cohen’s d effect sizes and AUC scores, derived from comparisons across individual diagnosis groups and controls, incorporating 95% confidence intervals from 1,000 bootstrap iterations (Fig. 2E & F).
Biological and psychosocial risk scores
For the blood, brain, and bone modalities, we identified diagnoses with sufficient classification accuracy, defined by AUC values greater than approximately 0.70. Only one disease was predicted 0.70 AUC or greater using the polygenic risk score modality, so it was excluded from this analysis. This approach yielded 13 diagnoses for the blood modality, 5 for the brain, and 4 for the bone. Within each modality, we generated biological and psychosocial risk scores for both diagnosed participants and diagnosis-free controls. These scores were calculated by averaging the log-transformed predicted probabilities from the respective disease prediction models. For instance, in the brain modality, for participants diagnosed with any of multiple sclerosis, stroke, fibromyalgia, chronic fatigue syndrome, or cervical spondylosis, we averaged the predicted probabilities from each brain model to establish a common biological risk score. Similarly, we computed a common psychosocial risk score by averaging the predicted probabilities from the psychosocial models for these conditions. In each modality group, participants were sorted into quintiles based on their biological and psychosocial risk scores, categorizing them into five risk levels: Low, Reduced, Neutral, Elevated, or High (Fig. 5A).
Biopsychosocial synergy in disease
In the blood modality analysis, we computed odds ratios for each of the 13 diagnoses independently against diagnosis-free controls. We determined the odds ratio using the unconditional maximum likelihood estimate, comparing the odds for participants within a specific quintile (e.g., ‘Low’) to those of all other participants in the cohort. This approach aimed to quantify the likelihood of having a specific condition (e.g., gout) based on a participant's placement in each of the biological and psychosocial risk quintiles. Next, we calculated odds ratios for each diagnosis based on combinations of biological and psychosocial quintiles, assessing the likelihood of a diagnosis for participants categorized within both 'Low' biological and psychosocial quintiles (LL), 'Reduced' (RR), and so forth, up to the 'High' (HH) combination (Fig. 5A).
Logarithmic odds ratios were computed for pooled (combined) diagnoses within each modality, covering all 25 possible combinations of biological and psychosocial risk quintiles. For each modality, Cohen's d effect sizes were also calculated, comparing common risk scores for each diagnosis against scores from diagnosis-free individuals (Fig. 5B, C, D).
Longitudinal analysis of biopsychosocial synergy
Within the blood modality, we analyzed the onset of pooled diagnoses at the 4-year follow-up based on baseline biopsychosocial risk quintiles. We calculated log-odds ratios for each of the 25 combinations of baseline biological and psychosocial risk quintiles, comparing those with new onset diagnoses (n = 1,006) at follow-up to those who remained diagnosis-free (n = 1,282), as presented in Fig. 5F.
The incidence of a diagnosis within a 15-year period following baseline assessment was analyzed across four groups categorized by their baseline biological and psychosocial risk quintiles: Low–Low, High–Low, Low–High, and High–High. Kaplan–Meier curves were employed to visualize the time to diagnosis across these groups. Additionally, Cox proportional hazards models provided hazard ratios and p-values, while log-rank tests were applied to identify significant differences in the hazard rates, particularly comparing the High–High and Low–High groups (Fig. 5G). Survival analyses we’re conducted using the lifelines package in Python.
Biopsychosocial chronic pain pathway
A structural equation model (SEM) was employed to investigate the relationships between biological and psychosocial markers and their effects on the development of rheumatoid arthritis and subsequent pain outcomes. Logarithmic predicted probabilities derived from the blood and psychosocial models of rheumatoid arthritis served as baseline biological and psychosocial risk indicators. An interaction term for the biopsychosocial interface was derived by multiplying these biological and psychosocial risks.
Pain outcomes were measured using the UK Biobank’s online pain questionnaire, which assessed pain across three dimensions:
-
Pain Impact: Assessed with the Brief Pain Inventory (BPI-39), the functional impact of pain was evaluated across seven areas: general activity, mood, walking ability, work, interpersonal relations, sleep, and life enjoyment, each on a scale from 0 (no interference) to 10 (complete interference). This measure was specifically targeted at participants reporting chronic pain at particular body sites.24
-
Pain Spread: The extent of pain was quantified by asking participants if they experienced pain or discomfort persistently or intermittently over more than 3 months, followed by specifying the body sites affected in the last three months. A summative phenotype representing the spread of pain was created based on the number of reported pain sites.
-
Pain Intensity: Chronic pain sufferers were prompted to rate their most bothersome pain at its worst in the past 24 hours on a scale from 0 (no pain) to 10 (pain as severe as imaginable).
The model specification was delineated as follows: rheumatoid arthritis development was modeled as a function of blood risk, psychosocial risk, and their interaction, while pain outcomes were modeled as a function of rheumatoid arthritis development, blood, and psychosocial risk. Model fitting involved estimating parameters that best reflected the covariances among the observed variables. The fit of the model was assessed using standard indices, including the Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA), and the Standardized Root Mean Square Residual (SRMR), to evaluate how well the model represented the data structure (Fig. 6). SEM were constructed using semopy.
Extended data
Biological amplification in pain spreading
We derived biological signatures from models predicting the number of self-reported pain sites, ranging from 1 to 4 or more, utilizing structure coefficients from predictive models. In blood modality analyses, we derived structure coefficients for each model to reveal blood assay signature similarity as pain increases in spread. For brain modality studies, we node-averaged the structure coefficients from resting-state connectivity data and visualized the top 25% of absolute values on cortical surfaces. In the bone modality, we averaged coefficients related to bone density, mineral content, and area across different skeletal regions—including the spine, femur, head, legs, pelvis, arms, and ribs. In genetic analyses, we identified the top 5% FDR-corrected Neuro-Immune Gene Ontology (NIGO) pathways from each PRS model specific to the number of pain sites and categorized these pathways into biological processes using REVIGO (Extended Data Fig. 3)25.
Cross-prediction models
Cross-prediction models were employed wherein models initially trained to classify specific diagnoses were tested on alternative diagnoses they were not originally trained to identify. We extracted ROC-AUC scores, sensitivity, and specificity from these cross-prediction tasks. Sensitivity and specificity scores were averaged for each original diagnosis across predicted alternative diagnoses within both biological and psychosocial modalities (Extended Data Fig. 6).
Impact of medication use on the composite blood assay signature
To assess medication impact on the blood assay signature, we considered 11 medication families linked to the 13 diagnoses comprising the signature (Extended Data Fig. 7). A network model mapped chi-squared associations between diagnoses and medication families, with edge betweenness clustering identifying diagnosis-medication clusters (Extended Data Fig. 7).
We then recalculated effect sizes and discrimination metrics for the signature, excluding patients taking individual medication families. For instance, statistics were re-evaluated after excluding patients on antiepileptic drugs (ATC code: N03A). This exclusion was systematically applied to each medication class to assess the signature's predictive accuracy devoid of medication influences (Extended Data Fig. 7).
Statistical analysis
Data pre-processing and statistical analyses were performed using Python v.3.8 (including Numpy (v.1.22.0), Pandas (v.1.4.3), Scipy (v.1.10.1), Sklearn (v.1.3.2), Nilearn (v.0.10.0), Lifelines (v.0.26.4), Semopy (v.2.3.9), and SnapML (v.1.9.1)). Nested five-fold cross-validation was used to obtain unbiased model performance results. Permutation tests (with 1,000 iterations) were used to test whether the associations by Pearson’s r correlation were significantly higher than a null association. We used bootstrap resampling with 1,000 iterations to indicate the estimated error in the Cohen’s d and ROC-AUC effect sizes. In all analyses, significance was based on P < 0.05 for single testing and Bonferonni < 0.05 for multiple testing. Further details of the statistical methods are specified in each relevant section above.