Interpretation
We investigated models that predict depression (using an EPDS score ≥ 12 as a proxy) at five time periods during and after pregnancy, using survey responses collected during weeks 4–10 of pregnancy. In the first two trimesters, approximately 20% of the women surveyed had an EPDS ≥ 12; this increased to 25% in the final trimester and then decreased to 16–17% following delivery.
We developed models using three predictor sets: i) 17 questions from the EPDS and GAD scales, ii) 21 questions from the EPDS, GAD and PRES scales and iii) 164 questions including additional psychiatric scales, demographics, lifestyle, medical history and partner mental health questions. Performance was similar across all three predictor sets, with models tending to overfit when all 164 variables were used, given the small sample size. Including the PRES scale questions tended to improve the prediction of depression after delivery, although the performance was not significantly better than that of the models using EPDS/GAD alone.
The SHAP results show that crying early in pregnancy is a key predictor of high EPDS scores during pregnancy. In general, showing signs of depression/anxiety at weeks 4–10 was predictive of a high EPDS throughout pregnancy. Baseline predictors of a high EPDS after delivery were anxiety (worrying, nervousness and anxiety), difficulty sleeping and feeling afraid. Having somebody who makes you feel appreciated appears to be associated with lower EPDS scores after pregnancy, although causality was not investigated in this study.
Focusing on the models developed using the 21 questions from the EPDS, GAD and PRES scales, the models' AUROC performance across the time periods ranged from the low to mid 70s, with trimester 1 being the easiest to predict. This is expected, as trimester 1 was closest in time to the baseline survey. The calibration plots indicate reasonable calibration, although the models appear to slightly over-estimate risk for the highest-risk groups. When predicting an EPDS ≥ 12 at weeks 4 and 12 after delivery, the calibration plots show a group of women who are assigned a risk of around 10% when approximately 25% of them had an EPDS ≥ 12. This may be because the models use variables from early in pregnancy, which may be insufficient to identify these women as high risk after delivery.
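The kind of binned calibration check described above can be sketched as follows. This is an illustrative re-implementation on toy data, not the study's code or data:

```python
import numpy as np

def binned_calibration(y_true, y_prob, n_bins=10):
    """Split patients into equal-width predicted-risk bins and compare the
    mean predicted risk with the observed outcome rate in each bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (y_prob >= lo) & (y_prob < hi if hi < 1.0 else y_prob <= hi)
        if in_bin.any():
            bins.append((y_prob[in_bin].mean(), y_true[in_bin].mean()))
    return bins

# A bin whose observed rate (e.g. 0.25) far exceeds its mean predicted risk
# (e.g. 0.10) corresponds to the post-delivery miscalibration noted above.
for predicted, observed in binned_calibration([0, 0, 1, 0], [0.05, 0.08, 0.15, 0.12]):
    print(f"predicted={predicted:.2f}  observed={observed:.2f}")
```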
To use the models, a patient would need to answer only 21 questions at weeks 4 to 10 of their pregnancy. These 21 items could easily be administered via online survey, phone or tablet to determine risk at the different time points. If using the models for decision making, we provide the positive predictive values (PPV) and sensitivities for ten different thresholds (see Table 5). The desirable threshold will depend on how the models are used. For example, if the model is used simply to screen for patients who may benefit from additional education, then high sensitivity may be preferred at the cost of a higher false positive rate (lower PPV). Alternatively, if the models are used to identify patients who may benefit from a restricted intervention, then a high PPV may be more desirable.
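As a sketch of how the trade-off behind Table 5 arises, the following hypothetical helper computes PPV and sensitivity at a chosen risk threshold (the data and threshold grid here are invented, not the study's):

```python
import numpy as np

def ppv_sensitivity_at_threshold(y_true, y_prob, threshold):
    """PPV and sensitivity when flagging patients with predicted risk >= threshold."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(y_prob) >= threshold
    tp = np.sum(flagged & (y_true == 1))
    fp = np.sum(flagged & (y_true == 0))
    fn = np.sum(~flagged & (y_true == 1))
    ppv = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return ppv, sensitivity

# Sweeping the threshold shows the trade-off: lower thresholds raise
# sensitivity (fewer missed cases) but lower PPV (more false positives).
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.3, 0.8, 0.5, 0.05]
for t in np.linspace(0.1, 1.0, 10):
    ppv, sens = ppv_sensitivity_at_threshold(y_true, y_prob, t)
    print(f"threshold={t:.1f}  PPV={ppv:.2f}  sensitivity={sens:.2f}")
```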
Limitations
The main limitation of this study is that a high percentage of women dropped out after the baseline survey. This may impact the generalizability of the models to the general population, as dropout may be associated with having or developing depression. We were unable to find any strong predictor of dropout using baseline variables. This suggests that the models may be generalizable; however, it is important for these models to be externally validated to confirm this. Based on the GRASP guidelines, the models need to be externally validated and prospectively tested in any clinical setting in which they may be applied before their true performance is known (Khalifa 2019). Another limitation is that the EPDS score was used as a proxy for depression, and an EPDS score is not a clinical diagnosis. In future work it would be useful to validate the models on data with a clinical diagnosis of perinatal depression as the outcome.
As we used gradient boosting machines, the models are difficult to interpret directly. We used SHAP to produce variable importance plots showing which variables had the most impact on the risk predictions. SHAP can also be applied within the online calculator to show what contributed to an individual patient's high risk.
A key strength of this study is its prospective cohort design, although this resulted in a smaller dataset of around 500–600 patients, with outcome counts ranging from 77 to 140. The low outcome count limited the complexity of the models, so more discriminative models might be learnable with more data. It also reduces confidence in the model performance estimates, leading to wider confidence intervals.
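To illustrate why a low outcome count widens the confidence intervals, here is a bootstrap sketch on simulated data of roughly the study's scale; all numbers are invented and only the mechanism, not the study's results, is shown:

```python
import numpy as np

def auroc(y_true, y_score):
    """Rank-based AUROC (Mann-Whitney U); assumes continuous scores, no ties."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score), dtype=float)
    ranks[order] = np.arange(1, len(y_score) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(1)
n, n_events = 550, 100                       # roughly the study's scale
y = np.zeros(n, dtype=int)
y[:n_events] = 1
scores = rng.normal(loc=0.8 * y, scale=1.0)  # a model with modest signal

boots = []
for _ in range(1000):
    idx = rng.integers(0, n, n)
    if 0 < y[idx].sum() < n:                 # need both classes in the resample
        boots.append(auroc(y[idx], scores[idx]))
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"AUROC={auroc(y, scores):.2f}, 95% CI width={hi - lo:.2f}")
```

With only ~100 events the bootstrap interval is noticeably wide; the same code with several times as many events produces a much tighter interval.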