Beyond traditional metrics: A novel method for measuring mood instability in bipolar disorder

Background: Clinical care for bipolar disorder (BD) focuses narrowly on prevention and remission of episodes, with pre/post-treatment reductions in symptom severity serving as the ‘gold standard’ for outcomes in clinical trials and measurement-based care strategies. The study aim was to provide a novel method for measuring outcomes in BD that has clinical utility and can stratify individuals with BD based on mood instability. Methods: Participants were 603 individuals with BD (n=385), other or non-affective disorders (n=71), or no psychiatric history (n=147) who had been enrolled for at least 10 years in an intensive longitudinal cohort that collects patient-reported outcome measures (PROMs) assessing depression, (hypo)mania, anxiety, and functioning every two months. Mood instability was calculated as the within-person variance of PROMs and stratified into low, moderate, and high thresholds. Outcomes: Individuals with BD had significantly higher mood instability indices for depression, (hypo)mania, and anxiety compared to psychiatric comparisons (moderate effects, p’s<.001) and healthy controls (large effects, p’s<.001). A significantly greater proportion of individuals with BD fell into the moderate (depression: 52·8%; anxiety: 51·4%; (hypo)mania: 48·3%) and high instability thresholds (depression: 11·5%; anxiety: 9·1%; (hypo)mania: 10·8%) compared to psychiatric comparisons (moderate: 25·5%–26·6%; high: 0%–4·7%) and healthy controls (moderate: 2·9%–17·1%; high: 0%–1·4%). Being in the high or moderate instability threshold predicted worse health functioning (p’s < .00, small to large effects). Interpretation: Mood instability, as measured in commonly used PROMs, characterized the course of illness over time, correlated with functional outcomes, and significantly differentiated those with BD from healthy controls and psychiatric comparisons. Results suggest a paradigm shift in monitoring outcomes in BD, by measuring mood instability as a primary outcome index.


Introduction
Bipolar disorders (BD) are among the leading causes of disability worldwide due to early onset, high chronicity, and high comorbidity rates (1). Premature mortality associated with BD equals or surpasses that of several common risk conditions, including smoking and cardiovascular disease (2). Despite this, accurate diagnosis is often delayed and, even when BD is diagnosed properly, efficient treatment options remain stagnant (3). Traditional nosology and classification describe BD as relapsing disorders in which distinct episodes of mania, hypomania, and depression alternate with remitted periods. These remitted periods have long distinguished BD from primary psychotic disorders or borderline personality disorder, and are based on a return to normal mood or "euthymia" in between episodes (4). However, the recent emphasis on intensive longitudinal designs in BD research has begun to paint a sobering picture of these patterns, challenging traditional nosology and definitions of euthymia (5,6).
Efforts to model the course of BD across both micro (hourly) and macro (monthly to yearly) timescales have proliferated over the past decade. Research using ecological momentary assessment (EMA), intensive longitudinal cohort designs, time series analysis, and mathematical modeling finds that individuals with BD, as well as those at risk for BD, experience significant instability (fluctuations and deviations away from one's average state) in emotions and mood even outside the context of mood episodes (7-12) [1]. Mood instability is also associated with clinically important risks and outcomes. In one study, day-to-day mood instability predicted the development of bipolar but not unipolar disorders three years later (13). Furthermore, several studies have established that mood instability, outside the context of mood episodes, is associated with worse outcomes and poorer functioning (14-16).
Despite converging evidence that mood instability is a core phenotypic feature of BD, clinical care continues to have a narrow focus on prevention and remission of episodes. Symptom severity reduction is the 'gold standard' for outcomes in clinical trials of novel interventions; the 'measurement-based' care strategy is likewise oriented towards achieving a specific integer threshold on gold standard measures of mood. Success is typically defined by simple linear decreases in scores across two (pre-treatment, post-treatment) or three time points (pre-treatment, mid-treatment, post-treatment). We, along with others (5,17), argue that substantial improvements in diagnosis, care, and treatment development require alternative ways to measure change in BD based on mood instability.
In this study, a unique cohort of individuals from the Prechter Longitudinal Study of Bipolar Disorder (PLS-BD) (18, 19) was leveraged. The PLS-BD includes individuals with BD, a psychiatric comparison (PC) group with non-BD diagnoses, and healthy controls (HC) with no psychiatric history, all of whom completed mood measures every two months for a minimum of 10 years. Using these intensive longitudinal data, mood instability was characterized across diagnostic categories to determine whether thresholds could be identified that stratify individuals into low, moderate, and high levels of instability.
Furthermore, the hypothesis that being in a higher instability threshold would be associated with lower levels of functioning was tested. Instability thresholds are based on clinically relevant measures that are widely administered across medical systems and are part of standard care, such as the Patient Health Questionnaire (PHQ-9), Altman Self-Rating Mania Scale (ASRM), and Generalized Anxiety Disorder Scale (GAD-7). The GAD-7 was included because anxiety disorders are highly comorbid with BD, anxiety tends to fluctuate with depression in BD (20), and it provides generalizability beyond mood episode symptoms (mania, depression).

[1] Note that much of the literature refers to this construct as emotion or mood "instability." While we agree conceptually, statistically the majority of these studies focus on variability (the intraindividual standard deviation or variance in responses) rather than the mean square of successive differences (MSSD), the statistical measure of instability. See Sperry & Kwapil 2022 for a longer discussion of this issue.

affective diagnoses (n = 28). The HC group included individuals (n = 147) who had no psychiatric history and no first-degree relatives with BD. Diagnosis was assessed using the Diagnostic Interview for Genetic Studies, version 4 (21). A team of at least two doctoral-level psychologists or psychiatrists confirmed the diagnosis using criteria from the DSM-IV-TR and all available medical history from electronic health records. Demographic and descriptive information regarding each group is presented in Supplemental Tables 1-4. Written informed consent was obtained from participants and all study procedures were approved by the Institutional Review Board at Michigan Medicine.

2•2 Choice of primary measure
The full longitudinal protocol and procedures of the PLS-BD are outlined elsewhere (19). We focus on the measures relevant to the current investigation, namely the mood and functioning measures that are administered every two months for the duration of the study. Demographics including sex at birth, race, ethnicity, and age of onset are collected annually on the anniversary of each participant's enrollment, while age was calculated as of November 1, 2023, the date on which data were extracted for analysis.
Self-reported depression symptoms over the past two weeks were measured using the gold-standard nine-item Patient Health Questionnaire (PHQ-9) (22). Items are answered on a Likert scale from 0 (Not at all) to 3 (Nearly every day), with a scale score range from 0 to 27. Scores of 5-9 indicate mild depression, 10-14 moderate depression, 15-19 moderately severe depression, and 20-27 severe depression. Self-reported manic symptoms over the past two weeks were measured using the five-item Altman Self-Rating Mania Scale (ASRM) (23). Items are answered on a scale from 1 to 5, with scale scores ranging from 5 to 25. A score of 6 or higher indicates a concern for manic or hypomanic presentations. Self-reported anxiety symptoms were measured using the gold-standard seven-item Generalized Anxiety Disorder Scale (GAD-7) (24). Items are answered on a scale from 0 (Not at all) to 3 (Nearly every day). Scores range from 0 to 21 (0-4: minimal, 5-9: mild, 10-14: moderate, ≥15: severe anxiety). General health and functioning were measured using the short form of the SF-36, the SF-12 (25). The SF-12 is widely used to assess quality of life in the health sciences. It yields two normed T scores (mean = 50, SD = 10), the mental component summary (MCS) and the physical component summary (PCS), with higher scores indicating better-than-average functioning. Internal consistency for all measures was good to excellent in the current sample (Cronbach's α = 0•82 to 0•92).
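The PHQ-9 and GAD-7 severity bands listed above can be expressed as a small lookup. This is a minimal sketch, not part of the study's code: the function and constant names are hypothetical, and the "minimal" label for PHQ-9 totals of 0-4 is an assumption, since the text only names the bands from 5 upward.

```python
# Lower bound of each severity band, taken from the cut-offs in the text.
# ASSUMPTION: PHQ-9 totals of 0-4 are labelled "minimal" (not stated above).
PHQ9_BANDS = [(0, "minimal"), (5, "mild"), (10, "moderate"),
              (15, "moderately severe"), (20, "severe")]
GAD7_BANDS = [(0, "minimal"), (5, "mild"), (10, "moderate"), (15, "severe")]


def total_score(responses, n_items):
    """Sum item responses after checking the count and the 0-3 coding."""
    if len(responses) != n_items:
        raise ValueError(f"expected {n_items} responses, got {len(responses)}")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("items must be coded 0 (Not at all) to 3 (Nearly every day)")
    return sum(responses)


def severity(total, bands):
    """Return the label of the highest band whose lower bound is <= total."""
    label = bands[0][1]
    for lower_bound, name in bands:
        if total >= lower_bound:
            label = name
    return label


# Hypothetical responses, for illustration only.
phq9_total = total_score([1, 1, 1, 1, 1, 1, 1, 1, 1], n_items=9)
gad7_total = total_score([3, 3, 3, 2, 2, 1, 1], n_items=7)
print(phq9_total, severity(phq9_total, PHQ9_BANDS))
print(gad7_total, severity(gad7_total, GAD7_BANDS))
```

The ASRM is left out of the sketch because the text gives only a single cut-off for it rather than graded bands.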

2•3 Analytical Strategy
Rolling variance of the PHQ-9, ASRM, and GAD-7 was calculated for each participant using three window widths (3, 6, and 12), corresponding to 3, 6, and 12 independent measures captured over 6 months, 1 year, and 2 years, respectively. Visual inspection with a LOESS (locally weighted scatterplot smoothing) line was completed, along with calculation of the mean squared error (MSE) and standard deviation (SD), to examine the responsiveness of each window width. The one-year rolling variance was selected because it had the lowest MSE but the highest SD across 20 randomly selected participants representing a 12:4:4 ratio of BD:PC:HC. Raw rolling variances were z-scaled so that the distributions were similar and comparable across participants, yielding a continuous measure of instability. To identify categorical thresholds of low, moderate, and high instability based on rolling variances, independent of diagnostic group, ranked percentiles were created (n = 100). Thresholds were set so that values at or below the 60th percentile were considered "low", values from the 61st to the 94th percentile "moderate", and values at or above the 95th percentile "high" instability for each measured scale. As such, each participant had a continuous index and a categorical threshold of instability for each mood measure.
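The pipeline described above (one-year rolling variance, z-scaling, percentile ranking, and the 60th/95th percentile cut-offs) can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: all function names are hypothetical, the sample variance is used, z-scaling is applied to the pooled rolling variances, a percentile rank is taken as the share of observations at or below a value rounded up to a whole percentile, and the two score series are made-up numbers.

```python
from bisect import bisect_right
from statistics import mean, pstdev, variance

WINDOW = 6  # six bimonthly assessments cover one year


def rolling_variance(scores, window=WINDOW):
    """Sample variance of every run of `window` consecutive scores."""
    return [variance(scores[i:i + window])
            for i in range(len(scores) - window + 1)]


def z_scale(values):
    """Centre values on their mean and divide by their standard deviation."""
    centre, spread = mean(values), pstdev(values)
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - centre) / spread for v in values]


def percentile_ranks(values):
    """Assign each value a percentile rank from 1 to 100."""
    ordered = sorted(values)
    n = len(ordered)
    # Share of observations at or below v, rounded up to a whole percentile.
    return [(100 * bisect_right(ordered, v) + n - 1) // n for v in values]


def instability_threshold(percentile):
    """Map a percentile rank to the low / moderate / high categories."""
    if percentile <= 60:
        return "low"
    if percentile <= 94:
        return "moderate"
    return "high"


# Hypothetical PHQ-9 totals for two participants (not study data).
series = {
    "A": [4, 5, 4, 6, 5, 4, 5, 5],
    "B": [3, 14, 6, 19, 2, 12, 5, 16],
}
# Pool the rolling variances across participants before scaling and ranking,
# so that thresholds are independent of any grouping (an assumption here).
pooled = [(pid, v) for pid, s in series.items() for v in rolling_variance(s)]
z_scores = z_scale([v for _, v in pooled])
percentiles = percentile_ranks(z_scores)
labels = [(pid, instability_threshold(p))
          for (pid, _), p in zip(pooled, percentiles)]
print(labels)
```

Because z-scaling is a monotonic transformation, the percentile ranks, and therefore the categories, are the same whether they are computed on raw or z-scaled variances; the z-scores serve as the continuous index.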
To account for multiple comparisons and reduce the chance of type I error given the large number of within-person observations, we adjusted the alpha level to .001 across all models. To examine whether diagnostic groups differed on the continuous measures of instability, linear mixed-effects models were fitted using the lmer function from the lme4 package in R. To examine pair-wise group differences, we completed post-hoc contrasts with Tukey correction for multiple comparisons. To test whether the proportions of participants in the low, moderate, and high thresholds differed by diagnosis, we ran pair-wise chi-square tests. To test whether thresholds predicted longitudinal functioning outcomes (mental and physical health functioning from the SF-12) over and above other important covariates (diagnosis, sex, race, age), we ran linear mixed-effects models robust to missingness. In these models, SF12_MCS denotes mental health functioning; the reference categories were HC for diagnostic group (0 = HC), male for sex (0 = male), and White for race (0 = White); and a random intercept was included for each participant.
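The functioning model described above can be written out as a sketch. The notation is an assumption and not the authors' published formula: the β coefficients, the dummy coding of the three-level threshold (low as reference) and of diagnostic group (HC as reference), the random intercept u_i, and the residual ε_ij are all introduced here for illustration, with i indexing participants and j indexing assessments.

```latex
\begin{aligned}
\mathrm{SF12\_MCS}_{ij} ={}& \beta_0
  + \beta_1\,\mathrm{Moderate}_{ij} + \beta_2\,\mathrm{High}_{ij}
  + \beta_3\,\mathrm{BD}_i + \beta_4\,\mathrm{PC}_i \\
 &+ \beta_5\,\mathrm{Sex}_i + \beta_6\,\mathrm{Race}_i + \beta_7\,\mathrm{Age}_i
  + u_i + \varepsilon_{ij}, \\
 & u_i \sim \mathcal{N}(0, \sigma_u^2), \qquad
   \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{aligned}
```

Under this coding, β_1 and β_2 are the expected differences in T score for the moderate and high thresholds relative to the low threshold, holding the other covariates constant. The same structure would apply with the physical component summary as the outcome.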

Results
Calculating instability based on the one-year rolling variance for each mood measure resulted in 20,854 observations for depression, 21,199 for mania, and 14,105 for anxiety [2]. Summary statistics for all variables are provided in Supplemental Tables 1-4, and zero-order Spearman's rho within- and between-person correlations are presented in Supplemental Figure 1.

3•1 ASRM Instability
Individuals with BD experienced significantly higher instability in ASRM scores than HC (= 0•55, p<.001, large effect) and PC (= 0•41, p<.001, large effect). PC did not differ from HC (= 0•14, p=0•21). Using the 60th and 95th percentiles to stratify low, moderate, and high instability, a significantly greater proportion of individuals with BD were above the moderate and high instability thresholds for the ASRM compared to HC and PC, and a significantly lower proportion of individuals with BD fell below the low instability threshold (Table 1; Figure 1). These thresholds significantly predicted mental health functioning; those above the high instability threshold for the ASRM had an average T score 1•09 lower than those in the low group, holding all other variables constant (Table 2; Figure 2). Interestingly, linear mixed-effects models revealed that physical health functioning improved with higher ASRM instability; those above the moderate instability threshold had a T score 0•76 higher than those in the low group, and those above the high threshold had a T score 1•81 higher than those in the low group, holding all other variables constant (Table 3).

3•2 GAD-7 Instability
Instability in GAD-7 scores was significantly greater in the BD group than in HC (= 0•83, p<.001, large effect) and PC (= 0•37, p<.001, moderate effect). Those in the PC group also had higher instability than HC (= 0•46, p<.001, moderate effect). A significantly greater proportion of individuals with BD were above the moderate and high thresholds compared to HC and PC (Table 1; Figure 1). These thresholds significantly predicted mental health functioning; those above the moderate threshold had an average T score 3•11 lower than those in the low group, and those above the high instability threshold had an average T score 3•7 lower than those in the low group, holding all other variables constant (Table 2; Figure 2). GAD-7 instability thresholds were not significantly associated with physical health functioning (Table 3).

3•3 PHQ-9 Instability
On average, individuals with BD had significantly higher instability in PHQ-9 scores than HC (= 0•79, p<.001, large effect) and PC (= 0•48, p<.001, moderate effect). Those in the PC group did not differ from those in the HC group (= 0•31, p = .03). A significantly greater proportion of individuals with BD fell above the moderate and high thresholds compared to HC and PC, and a significantly lower proportion fell below the low threshold (Table 1; Figure 1). These thresholds significantly predicted mental health functioning; those above the moderate threshold had an average T score 3•61 lower than those in the low group, and those above the high threshold had an average T score 5•34 lower than those in the low group (Table 2; Figure 2). Instability thresholds for the PHQ-9 were not significantly associated with physical health functioning.
[2] There were considerably fewer within-person observations for the GAD-7 because it was not administered until year 5 of the study.

Discussion
Mood instability is recognized as a core phenotype of BD, yet there are no established methods to measure and index this instability that can be easily and efficiently adapted to clinical trials and/or the delivery of routine clinical care, such as with PROMs. In a unique longitudinal cohort with deep phenotyping, the PLS-BD, instability was characterized based on the variance of PROMs including the PHQ-9, ASRM, and GAD-7 over rolling 12-month windows. We then identified thresholds that stratified individuals based on low, moderate, and high instability in each mood measure. Across all measures, those with BD had significantly higher instability compared to HC and PC, with moderate to large effect sizes. The level of instability differed significantly from that in the PC group, suggesting that it is not simply a function of psychopathology in general but rather reflective of the trajectory of BD. Only three individuals in the PC group were above the high instability threshold for the ASRM, two were above the high instability threshold for the GAD-7, and none were above the high instability threshold for the PHQ-9. This is suggestive of high specificity of this instability index to the BD group.

Elevated instability in depression was associated with adverse outcomes for those with BD. Moderate and high instability in PHQ-9 scores was significantly associated with both lower mental (large effect) and physical health functioning (small effect). This is consistent with findings from the Global Bipolar Cohort (n = 5,882 individuals with BD), which reported that subsyndromal symptoms of depression are among the strongest predictors of poor functioning in BD (6). In contrast, instability in (hypo)manic symptoms was associated with lower mental health functioning (small effect) but better physical health functioning (small effect). However, these small effects are likely clinically insignificant, so further interpretation is cautioned.
Clinically, these results provide guidelines for practical clinical monitoring in the daily patient care setting and offer a novel strategy for outcomes assessment in clinical treatment trials for BD: calculating the variance of PROMs as an index of mood instability. The PHQ-9 and GAD-7 are among the most validated and most widely used PROMs across both primary care and psychiatric settings (22,26). Given their short length (9 and 7 items, respectively), accessibility, and translations into numerous languages, they are easy to integrate into standard operating procedures in primary care and psychiatric settings, research studies, and clinical trials. The ASRM is less widely administered and is used specifically for BD monitoring. Its reliability (like that of most self-report measures of mania) is generally lower than that of the PHQ-9 and GAD-7, but it remains useful in the longitudinal setting (27). Traditional methods of scoring the PHQ-9, GAD-7, and ASRM focus on a sequential decrease in scores (e.g., a PHQ-9 score falling from 15 to 8 over 12 months) as evidence of a treatment effect. For example, the reliable change index (RCI) in clinical trials calculates the decrease in a score that should be expected based on one's starting value and regression to the mean (e.g., 28). However, in the study of BD it is arguably of greater clinical relevance to monitor mood instability as measured by the variance in common PROMs, such as PHQ-9 scores, over a 12-month period. Future investigations must consider that non-linear change (a reduction in instability or a shift to a lower instability threshold) may be critical for understanding the nature, trajectory, and treatment response of BD (8, 9). Once norms for mood instability indices are fully established, future research should investigate the impact that reducing mood instability has on primary outcomes of interest (e.g., physical and mental health functioning, wellbeing, cognition, interpersonal relationships, occupational outcomes).
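The contrast drawn above between a pre/post decrease and variance over a 12-month period can be made concrete with a toy example. This is a sketch with made-up scores, not study data: both hypothetical trajectories fall from a PHQ-9 score of 15 to 8 over six bimonthly assessments, yet they differ sharply in variance and in the mean square of successive differences (MSSD) mentioned in footnote [1]. The function names are hypothetical.

```python
from statistics import variance


def pre_post_change(scores):
    """Last score minus first score (negative values mean improvement)."""
    return scores[-1] - scores[0]


def mssd(scores):
    """Mean square of successive differences between adjacent assessments."""
    squared_steps = [(b - a) ** 2 for a, b in zip(scores, scores[1:])]
    return sum(squared_steps) / len(squared_steps)


# Hypothetical PHQ-9 totals, six bimonthly assessments each (12 months).
steady = [15, 14, 12, 11, 9, 8]
unstable = [15, 4, 19, 3, 17, 8]

for name, scores in (("steady", steady), ("unstable", unstable)):
    print(name, pre_post_change(scores), variance(scores), mssd(scores))
```

A pre/post comparison treats the two trajectories as identical, whereas either dispersion measure separates them, which is the property the instability index relies on.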
Although these results provide preliminary thresholds and guidelines for stratifying individuals into low, moderate, and high instability on common PROMs, next steps include identifying larger samples from electronic health record data or large global research collaboratives, such as the Global Bipolar Cohort (GBC; 6, 29) and the National Network of Depression Centers (30), to establish norms based on larger and more diverse samples.

4•1 Limitations
A significant limitation of the PLS-BD is its relatively small size and limited ethnic and racial diversity, having been ascertained in a small geographical area. Effect sizes were large for diagnosis; those with BD and PC had significantly lower mental and physical health functioning than HC. While instability is likely intrinsic to the diagnosis, it is difficult to tease apart the effect of the BD illness from that of the instability. A further limitation is the frequency of PROM assessments; participants completed PROMs every two months. This was done to minimize participant burden over an extended period of time; however, it is possible that the fluctuation of the PROMs is greater than what is captured at a bimonthly cadence.

2•1 Study design and participants
Participants were drawn from the PLS-BD, an ongoing cohort study of BD that has been continuously gathering phenotypic and biological data over the naturalistic course of BD beginning in 2006. Participants are recruited via advertisements, psychiatric clinics, mental health centers, and community outreach events in Michigan. Participants are excluded if diagnosed with neurological disease or alcohol or substance use that would interfere with the ability to complete research (e.g., attending interviews intoxicated). The present study included 603 participants (Mean Age = 39, SD Age = 14, 65•7% female, 78•6% White, 21•4% non-Caucasian, Mean enrollment = 13 years, Range enrollment = 10-17) who had completed at least 10 years of follow-up in the study. Participants in the BD group had a DSM-IV-TR diagnosis of BD I (n = 258), BD II (n = 80), BD Not Otherwise Specified (n = 30), or Schizoaffective BD (n = 17). Participants in the PC group had a DSM-IV-TR diagnosis of major depressive disorder (n = 20), non-affective diagnoses (n = 23), or other

4•2 Conclusions
This study outlines a paradigm shift in monitoring outcomes in BD, by measuring mood instability as a primary outcome index. Mood instability, as measured in commonly used PROMs, characterized the course of illness over time, correlated with functional outcomes, and significantly differentiated those with BD from HC and PC. With the growing datasets emerging worldwide, there will be sufficient information from common clinical outcome measures to assess instability indices of BD globally.

Declarations

Author Statement
SHS was involved in conceptualization, funding acquisition, methodology, project administration, validation, and writing of the original draft. AKY was involved in conceptualization, data curation, formal analysis, methodology, visualization, and writing of the original draft. MGM was involved in conceptualization, funding acquisition, investigation, and writing (review and editing). All data and analyses were independently accessed and verified by SHS and AKY.

Data Sharing
Longitudinal and outcomes data used in the present study, along with data dictionaries, are available subject to review of the proposed analyses and acceptance of a Data Use Agreement. All PLS-BD data and samples are available through the Heinz C. Prechter Genetic Repository, distributed by the University of Michigan Central Biorepository (CBR). Enquiries can be addressed at http://www.prechterprogram.org/data.

Declaration of Interests
Melvin G. McInnis has received consulting and research support from Janssen Pharmaceuticals and has two US patents to the University of Michigan (US Patent #9,685,174; US Patent #11,545,173). AKY and SHS have no disclosures to report.

Figures