Measuring algorithmic bias to analyze the reliability of AI tools that predict depression risk using smartphone sensed-behavioral data

Abstract

AI tools intend to transform mental healthcare by providing remote estimates of depression risk using behavioral data collected by sensors embedded in smartphones. While these tools accurately predict elevated symptoms in small, homogenous populations, recent studies show that these tools are less accurate in larger, more diverse populations. In this work, we show that accuracy is reduced because sensed-behaviors are unreliable predictors of depression across individuals; specifically, the sensed-behaviors that predict depression risk are inconsistent across demographic and socioeconomic subgroups. We first identified subgroups where a developed AI tool underperformed by measuring algorithmic bias, where subgroups with depression were incorrectly predicted to be at lower risk than healthier subgroups. We then found inconsistencies between sensed-behaviors predictive of depression across these subgroups. Our findings suggest that researchers developing AI tools predicting mental health from behavior should think critically about the generalizability of these tools, and consider tailored solutions for targeted populations.


Introduction
Mental healthcare systems are simultaneously facing a shortage of mental health specialty care providers and a large number of patients whose treatment needs remain unmet 1,2. This service gap is driving research into AI-driven mental health monitoring tools, where sensed-behavioral data, defined as inferred behavioral data gathered by sensors and software embedded in everyday devices (e.g. smartphones, wearables), are repurposed to remotely monitor depression symptoms [3][4][5][6][7]. Sensed-behavioral data has also been referred to as personal, behavioral, or passive sensing data in other work 7.
AI tools that leverage sensed-behavioral data intend to near-continuously identify individuals experiencing elevated symptoms in-between clinical encounters and consequently deliver preventive care 8. These tools can also be integrated into digital therapeutics to automate precision interventions 9. Initial work showed that depression risk could be predicted from sensed-behavioral data at a similar accuracy to general practitioners 10 in small populations 5,11. More recent work shows that these AI tools predict depression risk at an accuracy only slightly better than a coin flip in larger, more diverse samples 4,6,12,13. This prior work has not specifically explored why accuracy is reduced in larger samples, and it is not clear how to improve AI tools for clinical use.
In this work, we hypothesized that accuracy is reduced in larger, more diverse populations because sensed-behaviors are unreliable predictors of depression risk: sensed-behaviors that predict depression are inconsistent across demographic and socioeconomic (SES) subgroups 14. We intentionally use the term reliability due to its importance in both a psychometric and AI context. In a psychometric context, reliability refers to the consistency of a tool, typically a symptom assessment, across different contexts (e.g. raters, time) 14,15. In AI, reliability is related to generalizability: whether an AI tool is consistently accurate in different contexts (e.g. different populations, over time, etc.) 12. Given these definitions, researchers in AI fairness have argued that aspects of psychometric reliability are important in an AI context: similar inputs (e.g. sensed-behaviors) to an AI model should yield similar outputs (e.g. estimated depression risk) 16.
In this paper, we adapt these ideas to study a specific aspect of reliability important for mental health AI tools deployed in large populations, i.e. whether similar sensed-behaviors are consistently related to depression risk across different groups of individuals. We hypothesize that if the sensed-behaviors predictive of depression risk are inconsistent across groups, AI models that use sensed-behaviors to predict depression risk will be inaccurate because similar sensed-behavioral patterns will indicate different levels of depression risk for different subgroups. For example, imagine that mobility positively correlates with depression risk in subgroup A, and negatively correlates with depression risk in subgroup B. An AI model trained across subgroups using exclusively mobility data, blind to subgroup information as is typically the case in this literature [3][4][5]17, will receive unreliable information (high mobility can simultaneously indicate both low and high depression risk) and will make incorrect predictions for one of the subgroups. We note upfront that in this manuscript we do not consider temporal aspects of reliability, though we acknowledge that this is important in discussions of psychometric reliability, specifically whether the AI tool is consistently accurate for the same individual, with predictions made under similar conditions 18.
We tested this hypothesis by identifying population subgroups where a depression risk prediction tool underperformed, and then analyzed sensed-behavioral differences across these subgroups. We identified subgroups where the tool underperformed by measuring algorithmic ranking bias (hereafter referred to as "bias"), where individuals experiencing depression from one subgroup (e.g. older individuals) were incorrectly ranked by the tool to be at lower risk than healthier individuals from other subgroups (e.g. younger individuals) [19][20][21][22]. Reliability was analyzed by measuring ranking bias because if individuals in large populations have inconsistent relationships between sensed-behavior and mental health, behaviors that represent high depression risk for one subgroup may represent lower risk for another subgroup. For example, imagine an AI tool predicting that higher phone use increases depression risk. Studies 23,24 show that younger individuals have higher phone use than older individuals. Thus, the AI tool may incorrectly rank older individuals with depression to be at lower risk than healthier younger individuals, decreasing model accuracy (Fig. 1a).
Against this backdrop, we developed an AI tool that estimated depression symptom risk using behavioral data collected from individuals' smartphones, using similar sensed-behaviors and outcome measures from recent work [3][4][5]13,25 (Fig. 1b). The data used to develop and analyze the AI tool was collected during a U.S.-based National Institute of Mental Health (NIMH)-funded study 3,[25][26][27][28][29], one of the largest, most geographically diverse studies of its kind. We then measured bias across attributes including age, sex at birth, race, household income, health insurance and employment to identify subgroups where the tool underperformed. We studied these specific attributes because of known behavioral differences across demographic and SES subgroups 23,24,[30][31][32] that could impact the reliability of the developed AI tool. Finally, we interpreted why the tool underperformed by identifying inconsistencies between the AI tool and sensed-behaviors predicting depression across subgroups. A summary of this analysis can be found in Fig. 1.
Smartphone sensed-behavioral data on GPS location, phone usage (timestamp of screen unlock), and sleep were near-continuously collected from participants across the United States for 16 weeks. The PHQ-8, a self-reported measure of two-week depression symptoms 33,34 frequently used in mental health research 3,5,25,27, was collected every three weeks beginning on the first week of the study (e.g. on weeks 1, 4, 7, …, known as weekly reporting periods). Sensed-behaviors were summarized over two weeks to align with collected PHQ-8 depression symptoms for prediction (see Table 1). For example, sensed-behaviors collected during weeks 3 and 4 were summarized to predict PHQ-8 values collected during week 4. Table 2 summarizes the data used for analysis. 3,900 samples were analyzed from 650 individuals, a large cohort and sample size compared to most studies to date analyzing associations between sensed-behaviors and mental health 4,5,25,35,36. A sample was a set of sensed-behaviors, summarized over two weeks, with a corresponding PHQ-8 self-report. 46% of self-reported PHQ-8 values were ≥ 10, indicating clinically-significant depression (CSD) 33. The majority of participants were relatively young to middle aged (75% 25 to 54 years old), female (74%), white (82%), middle to high income (61% annual household income ≥ $40,000), insured (93%) and employed (62%). We focused our results on groups with at least 15 participants 37. The sensed-behavior distributions across the population for each subgroup can be found in the supplementary materials.

Measuring Bias to Identify Subgroups where AI Models Underperform

The PHQ-8 asked participants to self-report depression symptoms experienced over 14 days, and PHQ-8s were delivered multiple times throughout each weekly reporting period. We trained AI models using 14 days of smartphone sensed-behavioral data to predict if the average PHQ-8 value across each weekly reporting period (days 7 through 14, see Fig. 1c) indicated clinically-significant depression (CSD, PHQ-8 score ≥ 10 33) symptoms. Surveys were delivered multiple times each reporting week, and individual surveys may only reflect briefly elevated symptoms (e.g. work stress on the day the survey was administered). For this reason, PHQ-8 values were averaged over each reporting week to predict a more stable estimate of self-reported symptoms. In addition, by predicting average PHQ-8 values instead of individual self-reports, we could make use of all available data while ensuring that samples did not overlap temporally. Model performance was assessed by performing 5-fold cross-validation, partitioning on subjects, and predictions across folds were concatenated to calculate model performance. To analyze performance variability due to specific cross-validation splits, we performed 100 cross-validation trials, shuffling participants into different folds during each trial.
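As an illustrative sketch (not the study's actual code), subject-partitioned cross-validation with concatenated out-of-fold predictions can be implemented with scikit-learn's GroupKFold. The synthetic data below is an assumption for demonstration; the random forest settings mirror the configuration reported below (100 trees, max depth 10, balanced class weights):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_subjects, per_subject, n_features = 50, 6, 8
subject = np.repeat(np.arange(n_subjects), per_subject)      # 6 samples per participant
X = rng.normal(size=(n_subjects * per_subject, n_features))  # summarized sensed-behaviors
y = rng.integers(0, 2, size=n_subjects * per_subject)        # 1 = CSD, 0 = not CSD

# Partition folds by subject so no participant appears in both train and test
risk = np.empty(len(y))
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=subject):
    model = RandomForestClassifier(n_estimators=100, max_depth=10,
                                   class_weight="balanced", random_state=0)
    model.fit(X[train_idx], y[train_idx])
    risk[test_idx] = model.predict_proba(X[test_idx])[:, 1]

# Concatenated out-of-fold predictions yield one AUC for the whole sample
auc = roc_auc_score(y, risk)
```

Repeating this loop with reshuffled subject-to-fold assignments gives the 100-trial variability analysis described above.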
AI models output a predicted risk score from 0 to 1 of experiencing CSD. We used the predicted risk to calculate common ranking bias metrics [20][21][22] (Fig. 2) across the subgroups in Table 2. These metrics were based upon the area under the receiver operating characteristic curve (AUC), which measured the probability that models correctly ranked CSD samples higher (in the predicted risk) than relatively healthy (RH) samples. We first calculated the AUC within each subgroup (the "Subgroup AUC"). Note that equal Subgroup AUCs do not guarantee high AUC across an entire sample. For example, Fig. 2a shows simulated data where an algorithm correctly predicted CSD risk within subgroups, but younger individuals, compared to older individuals, had higher predicted risk overall. Thus, healthy younger individuals were incorrectly predicted to be at higher risk than older individuals experiencing CSD. Two additional performance metrics assessed such errors. Specifically, the background-negative-subgroup-positive, or BNSP AUC (Fig. 2b), measured the probability that individuals experiencing CSD (the "positive" label) from a subgroup were correctly predicted to have higher risk than RH (the "negative" label) individuals from other subgroups ("the background"), and the background-positive-subgroup-negative, or BPSN AUC (Fig. 2c), measured the probability that RH individuals from a subgroup were correctly predicted to have lower risk than background individuals experiencing CSD.
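A minimal sketch of how these three ranking-bias metrics can be computed (an illustration of the metric definitions, not the study's code), assuming labels of 1 for CSD and 0 for RH:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ranking_bias_aucs(y_true, y_score, in_group):
    """Subgroup, BNSP, and BPSN AUCs for one subgroup.

    y_true: 1 = clinically-significant depression (CSD), 0 = relatively healthy (RH)
    y_score: model-predicted risk of CSD
    in_group: boolean mask, True for samples from the subgroup of interest
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    in_group = np.asarray(in_group, dtype=bool)
    background = ~in_group

    # Subgroup AUC: ranking quality using only the subgroup's own samples
    subgroup_auc = roc_auc_score(y_true[in_group], y_score[in_group])

    # BNSP AUC: subgroup CSD samples vs. background RH samples
    sel = (in_group & (y_true == 1)) | (background & (y_true == 0))
    bnsp_auc = roc_auc_score(y_true[sel], y_score[sel])

    # BPSN AUC: subgroup RH samples vs. background CSD samples
    sel = (in_group & (y_true == 0)) | (background & (y_true == 1))
    bpsn_auc = roc_auc_score(y_true[sel], y_score[sel])

    return subgroup_auc, bnsp_auc, bpsn_auc
```

For instance, if every subgroup member receives a higher score than every background member while within-group ordering is correct, the Subgroup and BNSP AUCs are perfect but the BPSN AUC collapses to zero, mirroring the systematic over-prediction illustrated in Fig. 2a.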
The highest performing AI model (a random forest, 100 trees, max depth of 10, balanced class weights, see methods) achieved a median (95% confidence interval, CI) AUC of 0.55 (0.54 to 0.57) across trials. Note that this low AUC was expected: it is comparable to the cross-validation performance of similar depression symptom prediction tools developed in larger, more diverse populations 4,6,13, and motivates the objective of this work to study the reliability of these tools in larger populations.
Figure 3 shows the model results by each metric across subgroups.

Isolating the Effects of Subgroup Membership on Model Underperformance
We wished to account for intersectional identities (e.g. female and employed) and isolate the effect of subgroup membership on model underperformance. For an ideal classifier, the predicted risk would be low for RH subgroups, and high for CSD subgroups. In addition, we would expect groups with higher base rates (% of samples with PHQ-8 ≥ 10) to have a higher average predicted risk. We thus modeled expected differences from groups with either the lowest (for RH) or highest (for CSD) average risk across trials.
Generalized estimating equations (GEE, exchangeable correlation structure) 38, a type of linear regression, were used to estimate the average effect of subgroup membership on the predicted risk after controlling across all other attributes. GEE was used instead of ordinary linear regression to correct for the non-independence of repeated samples across trials 38.
Interpreting Sensed-Behaviors across Subgroups where Models Underperformed

We hypothesized that models underperformed because sensed-behaviors predictive of CSD were inconsistent across subgroups. We thus conducted an analysis to understand differences between how AI tools predicted CSD risk and the different relationships between sensed-behaviors and CSD across subgroups. First, we retrained the AI model on the entire dataset, and used Shapley additive explanations (SHAP) 39 to interpret how the AI tool predicted CSD risk from sensed-behaviors. We then compared SHAP values with coefficients from explanatory logistic regression models estimating how subgroup membership affected the relationship between each sensed-behavior and depression.
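The study used the SHAP package 39. To make the underlying idea concrete, the following is a minimal, self-contained sketch (not the package itself) that computes exact Shapley values for a toy model by averaging each feature's marginal contribution over all coalitions, with absent features replaced by background values:

```python
from itertools import combinations
from math import factorial

import numpy as np

def shapley_values(f, x, background):
    """Exact Shapley values for one instance x of model f.

    Features absent from a coalition are replaced by background values
    (e.g. dataset means); this is the quantity SHAP approximates efficiently.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                z = np.asarray(background, dtype=float).copy()
                z[list(S)] = x[list(S)]   # coalition S takes the instance's values
                without_i = f(z)
                z[i] = x[i]               # add feature i to the coalition
                with_i = f(z)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += weight * (with_i - without_i)
    return phi
```

Shapley values are additive: they sum to f(x) minus the background prediction, which is what allows a model's predicted CSD risk to be attributed to individual sensed-behaviors as in Fig. 5a.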
We found different relationships between the SHAP values (Fig. 5a) and sensed-behaviors associated with CSD across subgroups (Fig. 5b; comparisons across each attribute and feature can be found in the supplementary materials). For example, the AI tool predicted that higher morning phone usage (6AM to 12PM) was generally associated with lower predicted depression risk. Higher morning phone usage decreased depression risk for 18 to 25 year olds (mean, 95% CI effect on depression, standardized units: −0.77, −1.07 to −0.47), but increased risk for 65 to 74 year olds (0.60, 0.07 to 1.12). Younger individuals, overall, also had higher morning phone use (standardized median, 95% CI, 18 to 25 year olds: 0.32, −2.27 to 1.60) compared to older individuals (65 to 74 year olds: −0.62, −1.96 to 0.76).

Discussion
In this study, we hypothesized that sensed-behaviors are unreliable measures of depression in larger populations, reducing the accuracy of AI tools that use sensed-behaviors to predict depression risk. To test this hypothesis, we developed an AI tool that predicted clinically-significant depression (CSD) from sensed-behaviors and measured algorithmic bias to identify specific age, race, sex at birth, and socioeconomic subgroups where the tool underperformed. We then found differences between SHAP values estimating how the AI tool predicted CSD from sensed-behaviors, and explanatory logistic regression models estimating the associations between sensed-behaviors and CSD across subgroups. In this discussion, we show how differences in sensed-behaviors across subgroups may explain the identified bias and AI underperformance in larger, more diverse populations.
Measuring bias showed that models predicted that older, female, Black/African American, low income, unemployed, and on-disability individuals were at higher risk of experiencing CSD (high BNSP, low BPSN AUC), and younger, male, White, high income, insured, and employed individuals were at lower risk of experiencing CSD (high BPSN, low BNSP AUC), independent of outcomes. Comparing SHAP values to explanatory logistic regression coefficients suggests why AI models incorrectly predicted depression risk. For example, our findings show that younger individuals had higher daytime phone usage than older individuals. Models predicted that higher daytime phone usage was associated with lower CSD risk (Fig. 5a), potentially explaining why younger individuals, overall, had lower predicted risk, and older adults had higher predicted risk (Fig. 3). Differences could be attributed to younger individuals using phones for entertainment and social activities that support well-being, while older individuals may prefer to use their phones for necessary communication or information gathering 23.
In another example, the model predicted that mobility, measured through circadian movement, location entropy, and GPS samples in transition, was associated with lower CSD risk (see Fig. 5). Prior work has identified a negative association between these same mobility features and CSD 5,25, suggesting that mobility decreases depression risk. While we found the expected negative associations across majority, higher SES ($60,000 to $99,999 household income, insured, and employed) subgroups, we found the opposite, positive association across less-represented, lower SES (<$20,000 household income, on disability, uninsured) subgroups, potentially explaining the reduced model performance (lower Subgroup AUC) in these groups. There are many possible explanations for the identified differences in behavior.
First, underlying reasons to be mobile (e.g. navigating bureaucracy to receive government payments) may increase stress for individuals who are lower income and/or on disability 31, increasing depression risk. Second, the analyzed data was collected during the early-to-mid stages of the COVID-19 pandemic, when mobility for low SES essential workers may indicate work travel and increased COVID-19 risk, contributing to stress 32 and depression. These findings suggest that sensed-behaviors approximating phone use and mobility used to predict depression in prior work [3][4][5][6]25 do not reliably predict depression in larger populations because of subgroup differences.
While existing work developing similar AI tools has strived to achieve generalizability 4,40, our findings question this goal. Instead, it may be more practical to improve reliability by developing models for specific, targeted populations 41,42. In addition, it may be helpful to train AI models using both sensed-behaviors and demographic information. In prior work and this study, AI models were trained using exclusively sensed-behavioral data [3][4][5]17. However, prior work suggests that models may not be more predictive even with added demographic information 43. This shows that additional methods are needed to clearly define subgroups, beyond demographics, with more homogenous relationships between sensed-behaviors and depression symptoms.
Another method to improve reliability is to develop personalized models, trained on participants' data over time 6,44. While personalization seems appealing, researchers should ensure that personalized predictions are meaningful. For example, we experimented with personalized models using a procedure suggested by prior work 44. The model AUC improved (0.68) compared to the presented results (0.55), but we achieved a better AUC (> 0.80) by developing a naive model re-predicting participants' first self-reported PHQ-8 value for all future outcomes. Given that at least one participant self-report is often needed for personalization, models should show greater accuracy than these naive benchmarks.
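Such a naive benchmark can be sketched as follows (the function name and the use of accuracy as the metric are illustrative assumptions): for each participant, the first self-reported PHQ-8 is thresholded and re-used as the prediction for every future reporting week.

```python
import numpy as np

def naive_benchmark_accuracy(first_phq8, future_phq8, threshold=10):
    """Accuracy of re-predicting each participant's first PHQ-8 label for all weeks.

    first_phq8: one baseline PHQ-8 score per participant
    future_phq8: (participants x weeks) array of later PHQ-8 scores
    """
    pred = np.asarray(first_phq8)[:, None] >= threshold   # repeated for every week
    truth = np.asarray(future_phq8) >= threshold
    return float((pred == truth).mean())
```

Personalized models that cannot beat this kind of constant-per-participant predictor add little clinical value beyond the baseline self-report itself.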
Even if accuracy improves, models can still be biased 19,37, and it is important to consider the clinical and public health implications of using biased risk scores for depression screening. For example, more frequent exposure to stress 45 contributes to higher rates of depression in lower SES populations 46, but overestimating depression risk for healthy low SES individuals allocates mental health resources away from other individuals who need care. Similar issues persist for underestimating depression risk. For example, models predicted lower risk for males experiencing depression compared to healthier females (see Fig. 3). Males are less likely to seek treatment for their mental health than females 47, and AI tools underestimating male depression risk may further reduce the likelihood that males seek care. Uncovering these biases is important before algorithmic tools are used in clinical settings.
To reduce these harms, researchers can use methods described in this and other work 37 to identify subgroups where AI tools underperform by measuring bias. Resources could then be directed to develop new or retrain existing models for these subgroups. Simultaneously, clinical personnel using these tools can be trained to identify algorithmic bias and mitigate its effects 48. In addition, depositing de-identified sensed-behavior and mental health outcomes data in research repositories could increase available data to analyze the reliability of AI tools 12. Finally, our findings show the importance of developing AI tools using data from populations that have similar behavioral patterns to the populations where these tools will be deployed. More thorough reporting of model training data 49, and monitoring AI tools in "silent mode", in which predictions are made but not used for decision making 50, could prevent AI tools developed in dissimilar populations from causing harm.
Finally, it is important to consider how the choice to classify depression symptom severity influenced our results, specifically choosing to predict binarized PHQ-8 values instead of raw PHQ-8 scores. Predicting binarized symptom scores is a fairly common practice in both the depression prediction literature [3][4][5]17, as well as in the clinical AI literature broadly 51,52. This practice is motivated by an interest in using AI tools for near-continuous symptom monitoring, in which an action (e.g. follow-up by a care provider) is triggered at a specific elevated symptom threshold. This motivation may be difficult to realize if the field continues to use depression symptom scales as outcomes; as recent work shows, these symptom scales do not produce categorical response distributions with a clear decision boundary distinguishing individuals experiencing versus not experiencing symptoms. Instead, responses tend to exist along a continuum 14. It is also important to consider if subgroup differences affect the interpretation and self-reporting of depression symptom scales. Despite this consideration, prior work provides evidence that the PHQ-8 exhibits measurement invariance across demographic and socioeconomic groups 53,54. Thus, it may be unlikely that the bias identified in this work was due to group differences in self-reporting symptoms, but our findings could be partially attributed to the mistreatment of depression symptom scales as categorical in nature.
This work had limitations. First, we analyzed data from a single study, though the studied cohort was larger in size, geographic representation, and timespan compared to prior work. In addition, the study cohort was majority White, employed, and female, though we did not find that sample size was associated with classification accuracy. Only inter-individual variability was considered, not intra-individual variability, and thus these findings do not extend to longitudinal monitoring contexts, where changes in sensed-behaviors may indicate changes in depression risk. In addition, data was only analyzed from participants who provided complete outcomes data (participants who reported at least one PHQ-8 value during each of the 6 weekly reporting periods). Data was exclusively collected from individuals who owned Android devices, and only specific data types (GPS and phone usage) were analyzed.
Only smartphone sensed-behaviors were analyzed, and data collected from other types of devices (e.g. wearables) were not analyzed. Finally, data collection took place from 2019 to 2021, when COVID-19 restrictions varied across the United States, which may influence our findings. Future work can examine if these results replicate over larger, more diverse cohorts, in both demographic and socioeconomic attributes, as well as the devices used for data collection. In addition, future work can explore if sensed-behaviors are reliable predictors of depression in longitudinal monitoring contexts, though recent work suggests that sensed-behaviors have low predictive power, even when used for longitudinal monitoring 25.
In conclusion, we present one method to assess the reliability of AI tools that use sensed-behaviors to predict depression risk. Specifically, we measured ranking bias in a developed AI tool to identify subgroups where the tool underperformed, and then we interpreted why models underperformed by comparing the AI tool to sensed-behaviors predictive of depression across subgroups. Researchers and practitioners developing AI-driven mental health monitoring tools using behavioral data should think critically about whether these tools are likely to generalize, and consider developing tailored solutions that are well-validated in specific, targeted populations.

Cohort
In this work, we performed a secondary analysis of data collected during a U.S.-based National Institute of Mental Health (NIMH) funded study. The motivation for this study was to identify smartphone sensed-behavioral patterns predictive of major depressive disorder (MDD) 3,[25][26][27][28][29]. Participants were recruited from across the United States using digital registries and online advertisements, intentionally oversampling for individuals experiencing depression. Eligible participants lived in the United States, could read/write English, and owned an Android smartphone and data plan. In addition, eligible participants with at least moderate depression symptom severity based upon the Patient Health Questionnaire-8 (PHQ-8 ≥ 10) were oversampled to create a sample with elevated symptoms. Individuals were excluded from the study if they self-reported a diagnosis of bipolar disorder or any psychotic disorder, shared a smartphone with another individual, or were unwilling to share data. Eligible participants were asked to provide electronic informed consent after receiving a complete description of the study. Eligible participants had the option to not provide consent, and could withdraw from the study at any point.
Consented participants downloaded a study smartphone application 55 and completed a baseline assessment to self-report demographic and lifestyle information. The study application passively collected GPS location, sampled every 5 minutes, and smartphone interactions (screen unlock and time of unlock) for 16 weeks. Individuals completed depression symptom assessments (the PHQ-8 33,34) every 3 weeks within the smartphone application. Data collection took place from 2019 to 2021, and all study procedures were approved by the Northwestern University Institutional Review Board (study #STU00205316).

Sensed-Behavioral Features
We calculated sensed-behavioral features from the collected smartphone data to predict depression risk.
Following established methods from prior work 3,5,25, we calculated GPS mobility features including the location variance (variability in GPS), number of unique locations, location entropy (variability in unique locations), normalized location entropy (entropy normalized by the number of unique locations), duration of time spent at home, percentage of collected samples in-transition (participant moving at > 1 km/hour), and circadian movement (24-hour regularity in movement) 5. We also calculated phone usage features from the screen unlock data 40, including the duration of phone use and the number of screen unlocks each day and within four 6-hour periods (12 to 6AM, 6AM to 12PM, 12 to 6PM, 6PM to 12AM). Finally, we used a standard algorithm 40,56 to approximate daily sleep onset and duration from screen unlock data.
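To make a few of these features concrete, the following are simplified sketches, assuming GPS samples are already clustered into unique locations and per-sample speeds are precomputed (the study's implementations follow the cited prior work and may differ in detail):

```python
import numpy as np

def location_entropy(cluster_ids):
    """Entropy of time spent across unique location clusters."""
    _, counts = np.unique(np.asarray(cluster_ids), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def normalized_location_entropy(cluster_ids):
    """Location entropy divided by its maximum, log(number of unique locations)."""
    n_unique = len(np.unique(np.asarray(cluster_ids)))
    if n_unique < 2:
        return 0.0
    return location_entropy(cluster_ids) / np.log(n_unique)

def pct_samples_in_transition(speeds_kmh, threshold_kmh=1.0):
    """Percentage of GPS samples where the participant moved faster than 1 km/hour."""
    speeds = np.asarray(speeds_kmh, dtype=float)
    return float(100.0 * (speeds > threshold_kmh).mean())
```

High location entropy indicates time spread evenly across many places, while low entropy indicates time concentrated in a few locations (e.g. mostly at home).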

Classifying Depression Symptoms
The PHQ-8 asks participants to self-report depression symptoms that occurred during the past two weeks. Symptoms are reported from 0 (not experiencing the symptom) to 3 (frequently experiencing the symptom). Scores are summed and thresholded to classify severity, where summed scores of 10 or greater indicate a higher likelihood of experiencing clinically-significant depression 33. We thus followed prior work 5,25 to calculate sensed-behavioral features in the two week period up to and including each weekly PHQ-8 reporting period. Behavioral features were input into machine learning models to predict clinically-significant symptoms (PHQ-8 ≥ 10).
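The scoring rule above can be sketched directly (a restatement of the text, not study code; the function name is illustrative):

```python
def score_phq8(item_scores, threshold=10):
    """Sum the eight 0-3 PHQ-8 item scores and flag clinically-significant depression."""
    if len(item_scores) != 8 or any(s < 0 or s > 3 for s in item_scores):
        raise ValueError("PHQ-8 requires eight item scores, each from 0 to 3")
    total = sum(item_scores)
    return total, total >= threshold  # (summed score 0-24, CSD label)
```

The binary label returned here is the prediction target used throughout this analysis.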

Data Preprocessing
Screen unlock and sleep features were summarized to align with the PHQ-8 40. The average and standard deviation of each daily and 6-hour epoch feature were calculated across the two week prediction period, and the number of days with daily phone use and use within each 6-hour epoch were summed. GPS features were directly calculated over the two weeks. As recommended by Saeb et al. 5, skewed features were log-transformed. Missing data was filled using multivariate imputation 57, and features were then standardized (mean = 0, standard deviation = 1) based upon each training dataset prior to being input into predictive models.
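A sketch of this preprocessing chain using scikit-learn (the specific imputer and its settings are assumptions; the study cites a multivariate imputation method 57 without specifying an implementation):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fit imputation and standardization on training data only, so no test-set
# statistics leak into the preprocessing step
preprocess = Pipeline([
    ("impute", IterativeImputer(random_state=0)),  # multivariate imputation
    ("scale", StandardScaler()),                   # mean = 0, standard deviation = 1
])

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_train[rng.random(X_train.shape) < 0.1] = np.nan  # simulate missing sensor data
Z_train = preprocess.fit_transform(X_train)
# At prediction time: preprocess.transform(X_test), reusing the training-set fit
```

Wrapping both steps in one Pipeline ensures the cross-validation loop refits them on each training fold.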

AI Model Training and Validation
We trained machine learning models commonly used to predict mental health status from smartphone behavioral data, including regularized (L2-norm) logistic regression (LR) 3,5, support vector machines (SVM) 4,58, and tree-based ensemble models including random forest (RF) and gradient boosting trees (GBT) 3,40. We varied the strength of the LR and SVM regularization parameter (0.01, 0.1, 1.0), used a radial basis function SVM kernel, varied class balancing weights in the RF and SVM (unbalanced/balanced), the number of ensemble tree estimators (10, 100), the tree depth (3, 10, or until pure), and the GBT learning rate (0.01, 0.1, 1.0) and loss (deviance and exponential). Non-logistic prediction models were calibrated using Platt scaling to approximately match the predicted risk to the proportion of individuals experiencing clinically-significant symptoms at each risk level 59. Logistic regression models, as shown in prior work 59, output calibrated probabilities. Models were implemented using the scikit-learn Python library 60.
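Platt scaling of a non-logistic model can be sketched with scikit-learn's CalibratedClassifierCV (synthetic data; the exact calibration setup used in the study is not specified here):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# An RBF-kernel SVM outputs decision scores, not probabilities; Platt scaling
# (method="sigmoid") fits a logistic curve mapping those scores to calibrated risks
svm = SVC(kernel="rbf", C=1.0)
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=3)
calibrated.fit(X, y)
risk = calibrated.predict_proba(X)[:, 1]  # predicted risk of the positive class
```

Calibration matters here because the downstream bias analysis compares predicted risk levels across subgroups, not just rankings.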
Multiple PHQ-8 surveys were administered each weekly reporting period (e.g. weeks 1, 4, 7, etc.). Survey scores in each reporting week were averaged to remove overlap between sensor and outcomes data. Data was analyzed from study participants who self-reported at least one PHQ-8 during each reporting week, resulting in 6 predictions per participant. Data from all other participants were removed (288 participants removed, 31% of total) to focus this analysis towards algorithmic bias due to group differences rather than bias due to missing outcomes data.
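A sketch of this outcome-preparation step with pandas (toy data; the column names and the set of expected weeks are illustrative assumptions):

```python
import pandas as pd

# Toy survey log: repeated PHQ-8s within reporting weeks
surveys = pd.DataFrame({
    "participant": [1, 1, 1, 2, 2],
    "week":        [1, 1, 4, 1, 4],
    "phq8":        [12, 8, 9, 4, 6],
})

# Average repeated PHQ-8 scores within each reporting week
weekly = surveys.groupby(["participant", "week"], as_index=False)["phq8"].mean()

# Keep only participants who reported in every expected week (here, weeks 1 and 4)
expected_weeks = {1, 4}
complete = weekly.groupby("participant")["week"].apply(
    lambda weeks: set(weeks) == expected_weeks)
weekly = weekly[weekly["participant"].map(complete)]
```

Averaging within reporting weeks yields one outcome per participant per period, so the prediction samples never overlap temporally.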

Declarations
Competing Interests

Figures

Table 1
Sensed-behaviors. An overview of the sensed-behavioral data used in this analysis. The same set of sensed-behaviors were collected from all participants, and were summarized over two week periods to align with self-reported PHQ-8 symptoms. Please see the methods for more details.

Table 2
Study cohort. Data collected within an NIMH-funded study to understand the relationships between digitally-collected behavioral data and depression symptoms 3,[25][26][27][28][29]. Participants contributed six total samples (summarized behavioral data and depression outcome measures) throughout the course of the study. A sample was a set of sensed-behaviors, summarized over two weeks, with a corresponding PHQ-8 self-report.
Circadian movement decreased CSD risk for employed individuals (−0.16, −0.24 to −0.07), but increased CSD risk for individuals who were on disability (0.44, 0.21 to 0.66). Circadian movement and location entropy also decreased depression risk for individuals from middle income ($60,000 to $99,999) households (circadian movement: −0.21, −0.35 to −0.07; location entropy: −0.34). Figure 5a also shows that specific mobility features, including the circadian movement (regularity in 24-hour movement), location entropy (regularity in travel to unique locations), and the percentage of collected GPS samples in transition (approximated speed > 1 km/hour), were often associated with lower predicted CSD risk.
D.A. and T.C. have submitted patent applications related to this work. T.C. is a co-founder and equity holder of HealthRhythms, Inc. and has received grants from Click Therapeutics related to digital therapeutics. D.C.M. has accepted honoraria and consulting fees from Boehringer-Ingelheim, Otsuka Pharmaceuticals, Optum Behavioral Health, Centerstone Research Institute, and the One Mind Foundation, royalties from Oxford Press, and has an ownership interest in Adaptive Health, Inc. J.M. has accepted consulting fees from Boehringer Ingelheim. G.J.A. holds equity in HealthRhythms, Inc. and Lyra Health, Inc., and has accepted consulting fees and honoraria from BetterUp and Quantum Health.