The design of the HNR Study has been described in detail elsewhere [18, 19]. Briefly, for the baseline examination, 4,814 women and men (49.8% men) aged 45 to 75 years were recruited between 2000 and 2003 from the mandatory citizen registries of three large cities (Bochum, Essen, and Mülheim an der Ruhr, Germany). For follow-up, participants were invited to the study centre in Essen two further times at five-year intervals. In addition, a yearly questionnaire-based postal follow-up was conducted between these examinations. To compare the CES-D and the PHQ-9 longitudinally, we used the 8th (hereafter t8) and 9th (t9) follow-up years, in which both instruments were applied concurrently.
For our analysis, we included 3,084 participants (1,580 women, 1,504 men) who completed both the PHQ-9 and CES-D questionnaires at both time points.
The HNR Study was approved by the local ethics committees, and all participants gave written informed consent before participation.
Assessment of depressive symptoms
We assessed depressive symptoms using two validated tools: the “Center for Epidemiologic Studies Depression Scale” (CES-D) and the PHQ-9, a sub-module of the “Patient Health Questionnaire” (PHQ). Both instruments are structured, self-administered scales in which participants choose from predetermined answer options. The CES-D was always administered immediately before the PHQ-9. Participants were also asked whether they were currently in therapy for depression or taking medication for it. These variables were used only for the sensitivity analysis.
Center for Epidemiologic Studies Depression Scale (CES-D)
The CES-D is often used in epidemiological studies in the general population. It asks about the presence and frequency of symptoms and emotional states in the week before the interview, including depressive mood, feelings of guilt or worthlessness, sleep disorders and self-doubt [20, 21]. The CES-D is considered an indicator of depression and is highly correlated with a clinical diagnosis of depression.
In the HNR Study, a short version of the CES-D with 15 questions was applied. Answers are given on a 4-point Likert scale ranging from “less than one day” (0 points) to “5–7 days” (3 points). We calculated a sum score ranging from 0 to 45 points, with a higher score indicating more and/or more frequent depressive symptoms. Positively formulated items were reverse-coded and an average value was calculated over all 15 items. If up to three answers were missing, each missing item value was replaced by the mean value of the answered questions. In the HNR Study, a cutoff point of ≥ 17 was defined as depression [23, 24].
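The scoring rules above can be illustrated with a minimal Python sketch. It is not the study's SAS code. The positions of the positively formulated items (POSITIVE_ITEMS) are hypothetical, since the text does not say which items are reverse-coded, and the function names are chosen for illustration only.

```python
from typing import Optional, Sequence

# ASSUMPTION: hypothetical 0-based positions of the positively formulated
# items; the text does not state which of the 15 items are reverse-coded.
POSITIVE_ITEMS = frozenset({3, 7})
N_ITEMS = 15       # items in the short form
MAX_ITEM = 3       # highest answer value on the 4-point scale
MAX_MISSING = 3    # missing answers tolerated before the score is dropped


def cesd_sum_score(answers: Sequence[Optional[int]]) -> Optional[float]:
    """Return the CES-D sum score (0-45), or None if more than three items are missing."""
    if len(answers) != N_ITEMS:
        raise ValueError("expected 15 item responses")
    # Reverse-code positive items so that a higher value always means more symptoms.
    coded = [
        None if a is None else (MAX_ITEM - a if i in POSITIVE_ITEMS else a)
        for i, a in enumerate(answers)
    ]
    answered = [c for c in coded if c is not None]
    if N_ITEMS - len(answered) > MAX_MISSING:
        return None
    item_mean = sum(answered) / len(answered)
    # Replace each missing item by the mean of the answered items, then sum.
    return sum(item_mean if c is None else c for c in coded)


def cesd_depression(score: float, cutoff: float = 17) -> bool:
    """Apply the dichotomous cutoff (score >= 17) described in the text."""
    return score >= cutoff
```

Because imputed values are means, a score with missing items can be fractional; whether such scores were rounded before applying the cutoff is not stated in the text, so the sketch leaves them unrounded.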
Patient Health Questionnaire (PHQ)
The Patient Health Questionnaire developed by Spitzer, Kroenke, and Williams is used to screen for major depressive disorder (MDD), with items corresponding to the symptoms identified in the Diagnostic and Statistical Manual of Mental Disorders (DSM). In the HNR Study, we used the nine-item subscale PHQ-9, which consists of the nine criteria of the DSM-5 diagnosis for depressive disorders. The PHQ-9 is widely applied in medical settings. Participants were asked how often each of nine different problems, corresponding to the depression criteria, had occurred over the last two weeks.
There are four possible answers: not at all (0 points), several days (1 point), more than half the days (2 points), and nearly every day (3 points). Total PHQ-9 scores range from 0 to 27 and are categorized by depression severity as “none or minimum” (0–4), “mild” (5–9), “moderate” (10–14), “moderately severe” (15–19), and “severe” (20–27). If up to two answers were missing, each missing item value was replaced by the mean value of the answered questions. We defined a PHQ-9 score ≥ 10 as depression.
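As an illustration of the severity bands and the cutoff, the following Python sketch maps a total PHQ-9 score to its category. The function names are hypothetical. The handling of fractional scores, which can arise from mean imputation, is an assumption: a score between two bands (for example 4.5) is assigned to the lower band.

```python
from bisect import bisect_right

# Lower bounds of the bands "mild", "moderate", "moderately severe", "severe".
_LOWER_BOUNDS = [5, 10, 15, 20]
_LABELS = ["none or minimum", "mild", "moderate", "moderately severe", "severe"]


def phq9_severity(score: float) -> str:
    """Map a total PHQ-9 score (0-27) to the severity category given in the text."""
    if not 0 <= score <= 27:
        raise ValueError("PHQ-9 total score must lie between 0 and 27")
    # bisect_right counts how many lower bounds are <= score.
    # ASSUMPTION: fractional scores fall into the band below the next bound.
    return _LABELS[bisect_right(_LOWER_BOUNDS, score)]


def phq9_depression(score: float, cutoff: float = 10) -> bool:
    """Apply the dichotomous cutoff (score >= 10) described in the text."""
    return score >= cutoff
```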
Basic similarities and differences
The CES-D and PHQ-9 are used to detect depression and depressive symptoms. Both scales are designed as self-administered questionnaires that are brief, easy to use, and available in the public domain. Nevertheless, there are differences. The PHQ-9 indicates major depression based on the DSM-5 diagnostic criteria and was developed for clinical use. The CES-D, on the other hand, was developed for large epidemiological studies and measures depressive symptoms with emphasis on the affective component and depressed mood. The CES-D version used here consists of 15 items, whereas the PHQ-9 contains only nine items. Although both scales ask about symptom frequency, the retrospective period differs: the CES-D refers to the last seven days, the PHQ-9 to the last 14 days. The response options are quite similar for both instruments, each offering a four-step scale with increasing frequencies.
Socio-economic status was assessed in a standardized computer-assisted personal interview (CAPI) carried out by trained personnel at the baseline examination. Education was classified according to the International Standard Classification of Education (ISCED-97) as total years of formal education, combining school and vocational training, and was categorized into four groups (≤ 10 y, 11–13 y, 14–17 y, ≥ 18 y). Economic activity was categorized into four groups [employed, inactive (e.g. homemaker, but not unemployed), pensioner, and unemployed]. We also recorded whether participants were cohabiting with a partner. For the classification by subgroups, age was divided into < 67 and ≥ 67 years, as this corresponds to the mean age of the participants.
Cronbach’s alpha was calculated for the CES-D and the PHQ-9 to evaluate internal consistency, based on the correlations between the individual items of each scale. It describes the extent to which all the items in a test measure the same concept or construct. Convergent validity of the CES-D and PHQ-9 was assessed using Pearson’s correlation coefficient to explore the magnitude of the association between the scales.
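Both statistics can be sketched in a few lines of Python using their standard textbook formulas. This is an illustrative sketch, not the SAS procedure used in the study; the function names and the data layout (one row of item answers per participant) are assumptions.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for complete data; rows[i][j] is participant i's answer to item j."""
    k = len(rows[0])
    # Sample variance of each item across participants.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the participants' total scores.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```

In this layout, cronbach_alpha would be called once per instrument on the item-level answers, and pearson_r once per time point on the two lists of total scores.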
The agreement probability was calculated as the number of participants classified identically by both scales divided by the total number of participants, and the disagreement probability correspondingly from the number of participants classified differently. The response agreement between PHQ-9 and CES-D with dichotomous cutoffs was evaluated with Cohen’s kappa. Cohen’s kappa is used to assess the reliability of different measurement methods by quantifying their consistency in placing individuals or items in two or more mutually exclusive categories. We calculated kappa values with 95% confidence intervals (CI) by subtracting from, and adding to, kappa the product of the critical value for the 95% level (1.96) and the standard error of kappa. We interpreted the strength of the agreement as slight (0–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), and almost perfect (0.81–1).
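The agreement measures can be illustrated for a 2×2 cross-classification of the two dichotomized scales. The Python sketch below is not the study's SAS code. The text does not say which standard error formula was used, so the simple large-sample approximation applied here is an assumption, and the cell names a, b, c, d are hypothetical.

```python
from math import sqrt
from typing import Tuple


def cohen_kappa_2x2(a: int, b: int, c: int, d: int,
                    z: float = 1.96) -> Tuple[float, float, float, float]:
    """Return (kappa, ci_lower, ci_upper, observed_agreement) for a 2x2 table.

    a: both scales positive      b: CES-D positive, PHQ-9 negative
    c: CES-D negative, PHQ-9 positive      d: both scales negative
    """
    n = a + b + c + d
    p_obs = (a + d) / n  # observed agreement probability
    # Agreement expected by chance from the marginal totals.
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    # ASSUMPTION: simple large-sample standard error of kappa.
    se = sqrt(p_obs * (1 - p_obs) / (n * (1 - p_exp) ** 2))
    return kappa, kappa - z * se, kappa + z * se, p_obs
```

The disagreement probability is one minus the returned observed agreement.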
McNemar's test of marginal homogeneity was conducted separately for the CES-D and the PHQ-9 to test the hypothesis that the proportion of participants with depressive symptoms above the corresponding cutoff was the same at both time points. Temporal changes in PHQ-9 and CES-D are represented by the percentage difference from t8 to t9. A difference of 10% or more on the given scale was rated as a change in the depression score. Temporal changes were classified as increase, decrease, or no change. To assess whether the two scales showed the same temporal trend, the proportions of concordantly and discordantly identified changes were compared.
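McNemar's test uses only the discordant pairs, that is, participants above the cutoff at one time point but not at the other. The Python sketch below uses the asymptotic chi-square version without continuity correction. That choice is an assumption, since the text does not specify the variant, and the argument names are hypothetical.

```python
from math import erfc, sqrt
from typing import Tuple


def mcnemar_test(pos_then_neg: int, neg_then_pos: int) -> Tuple[float, float]:
    """Asymptotic McNemar test; returns (chi-square statistic, two-sided p-value).

    pos_then_neg: above the cutoff at t8, below at t9
    neg_then_pos: below the cutoff at t8, above at t9
    """
    discordant = pos_then_neg + neg_then_pos
    if discordant == 0:
        # No discordant pairs: the data give no evidence against homogeneity.
        return 0.0, 1.0
    stat = (pos_then_neg - neg_then_pos) ** 2 / discordant
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value
```

The test would be run once per instrument, each time on that instrument's t8 by t9 cross-classification.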
For the sensitivity analysis, various CES-D cutoffs (score ≥ 16 to ≥ 22) were selected to demonstrate how the agreement and kappa values change. Participants were also stratified according to whether they were currently in therapy for depression. Based on this stratification, the observed agreement and the kappa value were calculated.
Descriptive results are expressed as mean ± SD, percentage (%) or number (n), as appropriate. Results are presented separately by sex. All analyses were performed using SAS 9.4.