Sex-specific Longitudinal Comparison of CES-D and PHQ-9 Depression Scales. A Concordance Analysis using data from the population-based Heinz Nixdorf Recall Study

Background: To measure depressive symptoms (DS) in population-based studies, there are two well-established questionnaires: the Center for Epidemiologic Studies Depression Scale (CES-D) and the Patient Health Questionnaire-9 (PHQ-9). So far, comparisons between the two instruments have only been performed using cross-sectional data and in specific patient groups. Furthermore, comparisons in population-based studies are missing, as are sex-specific analyses. The aim is to evaluate the psychometric properties and the concordance of the longitudinal results of the CES-D and PHQ-9 in the German population-based Heinz Nixdorf Recall (HNR) study. Methods: We used data of n=3,084 participants (48.8% men, mean age: 66.8 years). CES-D and PHQ-9 were assessed via questionnaires in the 8th (t8) and 9th (t9) annual postal follow-up within two years. DS were defined as CES-D score ≥ 17 and PHQ-9 score ≥ 10. Internal consistency reliability, convergent validity, and agreement between PHQ-9 and CES-D were assessed using Cronbach’s alpha, Pearson’s correlation, and Cohen’s kappa, respectively. To analyse DS differences between t8 and t9, we used McNemar’s test. Sex-stratified results are presented. Results: The prevalence of DS at t8 was higher for CES-D (7.8%) than for PHQ-9 (4.4%). At t9, the prevalence had increased slightly for both CES-D (8.1%) and PHQ-9 (4.5%). Internal consistency of the PHQ-9 and CES-D was good at both time points (Cronbach’s alpha: CES-D t8 & t9: 0.89, PHQ-9 t8: 0.84; t9: 0.85). Cohen’s kappa of agreement between CES-D and PHQ-9 was moderate at both time points (t8: k=0.57; 95% CI: 0.51, 0.63; t9: k=0.58; 95% CI: 0.52,


Introduction
Depression is one of the most common mental illnesses, affecting more than 300 million people worldwide (1,2). Symptoms include feelings of sadness and hopelessness; loss of pleasure, appetite, weight, self-esteem or self-confidence; and sleep and concentration disorders that exceed a certain duration, persistence, and intensity. Depression is often associated with comorbidity. Due to the high prevalence of depression and its comorbidity, the assessment of depressive symptoms is part of the standard program of many epidemiological studies (3,4).
For an individual medical diagnosis of depressive disorders, a structured and standardized diagnostic interview such as the internationally recognized Structured Clinical Interview for the Diagnostic and Statistical Manual of Mental Disorders, 5th ed. (DSM-V) is usually used (5). These interviews are characterized by a comprehensive and time-consuming question structure and are therefore generally unfeasible in large population-based studies (6). In epidemiological studies, a large number of self-administered instruments in the form of psychometric personality tests are used to detect Major Depressive Disorder (MDD). MDD is determined by the presence of multiple relevant depressive symptoms within a defined time. Self-administered questionnaires are easy to use and cost-effective. Of the many instruments available, the "Center for Epidemiologic Studies Depression Scale (CES-D Scale)" (7) and the "Patient Health Questionnaire (PHQ)" are the most widely used in epidemiological studies (8).
It has been shown that these questionnaires perform well as screening instruments in comparison with the reference standard, a clinical interview (9). However, to make an evidence-based decision for one of the instruments for measuring depressive symptoms in epidemiological studies, it is necessary to analyse precisely the differences and similarities between the instruments and to evaluate their psychometric strengths and weaknesses. To compare results between studies, it is also important to know whether these instruments are interchangeable or to what extent their results differ systematically.
The psychometric performance of the CES-D and the PHQ-9 has already been compared in specific patient groups such as people with type 2 diabetes, multiple sclerosis or systemic sclerosis (10)(11)(12)(13). Taken together, the literature shows no relevant differences between the two instruments in terms of reliability and validity, nor in specific subpopulations. It is only noted that the PHQ-9 is more specific in indicating depression (14), while the CES-D detects various aspects of depression and some of the other emotions associated with serious illness (15).
However, to the best of our knowledge, there are no comparisons between these instruments within the general population. In addition, there are no comparisons with longitudinal data measuring the development of depressive symptoms. Moreover, as it is known that there are large differences in the prevalence and development of depressive symptoms between women and men (16)(17)(18)(19), it is useful to examine the psychometric characteristics separately by sex.
We have chosen the CES-D and the PHQ-9 because they are commonly used instruments to assess depression symptomatology in epidemiological studies and they measure different aspects of depression. The PHQ-9 corresponds to major depression and was designed for clinical use. The CES-D has been developed for large epidemiological studies and investigates a variety of aspects of depression.
This study aims to compare the psychometric properties and the concordance of the CES-D and the PHQ-9 in an elderly population using the data set of the German longitudinal, population-based Heinz Nixdorf Recall (HNR) Study.

Study population
The design of the HNR Study has been described in detail elsewhere (20,21). Briefly, for the baseline examination, 4,814 women and men (49.8% men) aged 45 to 75 years were recruited between 2000 and 2003 from mandatory citizen registries of three large cities (Bochum, Essen, and Mülheim an der Ruhr, Germany). For follow-up, participants were invited to the study centre in Essen two further times, at five-year intervals. In addition, a yearly questionnaire-based postal follow-up was conducted between these examinations.
To perform a longitudinal comparison between the CES-D and the PHQ-9, we used the 8th (hereafter t8) and 9th (t9) follow-up years, in which both instruments were applied concurrently.
For our analysis, we included 3,084 participants (1,580 women, 1,504 men) who completed both the PHQ-9 and CES-D questionnaires at both time points.
The HNR Study was approved by the local ethics committees, and all participants gave written informed consent before participation.

Assessment of depressive symptoms
We assessed depressive symptoms using two validated tools: the "Center for Epidemiologic Studies Depression Scale" (CES-D) (7) and the PHQ-9, a sub-module of the "Patient Health Questionnaire (PHQ)". Both instruments are structured self-administered scales in which the participants choose from predetermined answer options. The CES-D was always administered immediately before the PHQ-9. The participants were also asked whether they were currently in therapy for depression or were taking medication for it. These variables were used only for the sensitivity analysis.

Center for Epidemiologic Studies Depression Scale (CES-D)
The CES-D is often used in epidemiological studies in the general population. It asks about the presence and frequency of symptoms and emotional states in the week before the interview, including depressive mood, feelings of guilt or worthlessness, sleep disorders and self-doubt (22,23). The CES-D is considered an indicator of depression and is highly correlated with a clinical diagnosis of depression (24).
In the HNR Study, a short version of the CES-D with 15 questions was applied. Answers are given on a 4-point Likert scale ranging from "less than one day" (0 points) to "5-7 days" (3 points). We calculated a sum score ranging from 0 to 45 points, with a higher score indicating more and/or more frequent depressive symptoms. Positively formulated items were reverse-coded, and an average value was calculated over all 15 items. For up to three missing answers, the missing item value was replaced by the mean value of the answered questions. In the HNR Study, a cutoff point of ≥17 was defined as depression (25,26 (5). In the HNR Study, we used the nine-item subscale PHQ-9, which consists of the actual nine criteria of the DSM-V diagnosis for depressive disorders.
The PHQ-9 is widely applied in medical settings (8). Participants were asked about the frequency of the emergence of nine different problems or depression criteria over the last two weeks.
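The CES-D scoring rules described above (reverse coding of positively formulated items, mean imputation for up to three missing answers, cutoff ≥17) can be illustrated with a minimal Python sketch. It is not the code used in the HNR Study. The function names and the `positive_items` argument are hypothetical, since the text does not list which items are positively formulated, and the sketch returns the sum score although the text mentions both a sum and an average.

```python
from typing import Optional, Sequence

N_ITEMS = 15        # short CES-D version described in the text
ITEM_MAX = 3        # each item is scored 0..3
MAX_MISSING = 3     # up to three missing answers are imputed
CESD_CUTOFF = 17    # score >= 17 is classified as depressive symptoms


def score_cesd(answers: Sequence[Optional[int]],
               positive_items: Sequence[int] = ()) -> Optional[float]:
    """Return the CES-D sum score (0-45), or None if more than three answers are missing.

    answers: 15 item values in 0..3, with None marking a missing answer.
    positive_items: zero-based indices of positively formulated items
        (hypothetical argument; the item list is not given in the text).
    """
    if len(answers) != N_ITEMS:
        raise ValueError("expected 15 item answers")
    # Reverse-code positively formulated items, keep missing answers as None.
    recoded = [
        None if a is None else (ITEM_MAX - a if i in positive_items else a)
        for i, a in enumerate(answers)
    ]
    answered = [a for a in recoded if a is not None]
    if len(recoded) - len(answered) > MAX_MISSING:
        return None
    # Replace each missing item by the mean of the answered items, then sum.
    item_mean = sum(answered) / len(answered)
    return sum(item_mean if a is None else a for a in recoded)


def has_depressive_symptoms(score: float, cutoff: int = CESD_CUTOFF) -> bool:
    """Dichotomize a CES-D score at the given cutoff."""
    return score >= cutoff
```

For example, a participant who answers 2 on fourteen items and skips one receives an imputed value of 2 for the skipped item and therefore a sum score of 30, which lies above the cutoff.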

Basic similarities and differences
The CES-D and PHQ-9 are used to detect depression and depressive symptoms. Both scales are designed as self-administered questionnaires, brief and easy to assess as well

Demographic Variables
Socio-economic status was assessed in a standardized computer-assisted personal interview (CAPI) carried out by trained personnel at the baseline examination. Education was classified according to the International Standard Classification of Education (ISCED-97) as total years of formal education, combining school and vocational training, and was categorized into four groups (≤10 y, 11-13 y, 14-17 y, ≥18 y). Economic activity was categorized into four groups [employed, inactive (e.g. homemaker, but not unemployed), pensioner, and unemployed]. We also recorded whether participants were cohabiting with a partner. For the classification by subgroups, age was divided into <67 and ≥67 years, as this corresponds to the mean age of the participants.

Statistical Analyses
Cronbach's alpha was calculated for the CES-D and PHQ-9 to evaluate internal consistency based on the correlations between the items of each scale. It describes the extent to which all the items in a test measure the same concept or construct. Convergent validity of the CES-D and PHQ-9 was assessed using Pearson's correlation coefficient to explore the magnitude of the association between the scales. The agreement probability was calculated from the number of participants classified identically by both scales, and the disagreement probability from the number classified differently. The agreement between PHQ-9 and CES-D with dichotomous cutoffs was evaluated with Cohen's kappa. Cohen's kappa is used to assess the reliability of different measurement methods by quantifying their consistency in placing individuals or items in two or more mutually exclusive categories. We calculated kappa values including 95% confidence intervals (CI) by subtracting from and adding to kappa the standard error of kappa multiplied by 1.96 (29). McNemar's test of marginal homogeneity was conducted for the CES-D and PHQ-9 separately to test the hypothesis that the proportion of participants with depressive symptoms above the corresponding cutoff is the same at both time points. Temporal changes in PHQ-9 and CES-D are represented by the percentage difference from t8 to t9. A difference of 10% or more on the given scale was rated as a change in the depression score. Temporal changes were classified as increase, decrease, or no change. To assess the conformity of temporal trends, the proportions of concordantly and discordantly identified changes were compared.
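As an illustration of the measures described above, the following Python sketch computes Cronbach's alpha from an item matrix, and the observed agreement and Cohen's kappa with a 95% CI from two dichotomous classifications. It is a sketch under assumptions, not the SAS code used in the study. The function names are hypothetical, and the standard error uses a simple large-sample approximation that may differ from the formula in reference (29).

```python
import math
from statistics import variance
from typing import Sequence, Tuple


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; items[r][i] is respondent r's answer to item i (complete cases)."""
    k = len(items[0])
    item_vars = [variance([row[i] for row in items]) for i in range(k)]
    total_var = variance([sum(row) for row in items])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def kappa_with_ci(cesd_pos: Sequence[bool], phq_pos: Sequence[bool],
                  z: float = 1.96) -> Tuple[float, float, float, float]:
    """Return (observed agreement, kappa, lower CI bound, upper CI bound).

    cesd_pos / phq_pos: per-participant flags for a score at or above the cutoff.
    """
    n = len(cesd_pos)
    pairs = list(zip(cesd_pos, phq_pos))
    a = sum(1 for x, y in pairs if x and y)          # both scales positive
    b = sum(1 for x, y in pairs if x and not y)      # CES-D positive only
    c = sum(1 for x, y in pairs if y and not x)      # PHQ-9 positive only
    d = n - a - b - c                                # both scales negative
    p_obs = (a + d) / n
    # Agreement expected by chance from the marginal totals.
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    # Assumed approximation of the standard error of kappa.
    se = math.sqrt(p_obs * (1 - p_obs) / (n * (1 - p_exp) ** 2))
    return p_obs, kappa, kappa - z * se, kappa + z * se
```

With 50 hypothetical participants of whom 20 are positive on both scales, 5 on the CES-D only, 10 on the PHQ-9 only, and 15 on neither, the observed agreement is 0.70, the chance-expected agreement is 0.50, and kappa is 0.40.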
For a sensitivity analysis, various cutoffs (CES-D score ≥16 to 22) were selected for the CES-D to demonstrate how the agreement and kappa values change. Participants were also stratified according to whether they were currently in therapy for depression. Based on this stratification, the observed agreement and the kappa value were calculated.
Descriptive results are expressed as mean ± SD, percentage (%) or number (n), as appropriate. Results are presented separately by sex. All analyses were performed using SAS 9.4.

Results
The sex-stratified characteristics of the analysed population at t8 (n = 3,084) are shown in Table 1, for the entire population as well as for women (n = 1,580) and men (n = 1,504) separately. The average age was 66.8 years at t8. Of the participants, 85.8% lived with a partner and 34.9% had completed post-secondary education (≥14 years of education); 45.5% were employed.
The mean PHQ-9 and CES-D scores were 3.5 and 7.1, respectively. Women showed a higher score on both scales than men. Figure 1 presents the distribution of CES-D and PHQ-9 at t8 and t9. Both scores were strongly skewed to the right. The mode of the PHQ-9 was 0.
For men, the distribution of the PHQ-9 scale was even more right-skewed than for women.
No differences in the distributions were observable between the two time points. Table 2 shows the prevalence rates. The prevalence of depression at t8 was higher for CES-D (7.8%) than for PHQ-9 (4.4%). At t9, the prevalence had increased slightly for both CES-D (8.1%) and PHQ-9 (4.5%). The sex-specific analyses confirm the well-known observation of a higher prevalence in women than in men. However, they also reveal (i) an even greater difference between the sexes when measured by CES-D (men 6.0%, women 9.6%) than by PHQ-9 (men 3.6%, women 5.3%), and (ii) a decreasing prevalence in women when measured by PHQ-9 (5.3% to 4.9%), whereas it increased when measured by CES-D (9.6% to 10.4%).
Cronbach's alpha showed high reliability for the CES-D (α = 0.89) at both time points (Table 3). Inter-item correlations ranged from 0.16 to 0.67. Similar results were found for the PHQ-9 (t8: α = 0.85, t9: α = 0.84), with inter-item correlations ranging from 0.20 to 0.54. No sex-specific differences could be identified. Table 4 depicts a high correlation between CES-D and PHQ-9 (t8: r = 0.84, t9: r = 0.85). The correlation between the t8 and t9 scores was r = 0.71 for both CES-D and PHQ-9.
Table 5 shows the agreement of the two scales at t8 as well as at t9. The agreement of the two scales, considered separately by sex, at t8 did not differ from that at t9. Figure 2 illustrates the agreement of the two scales, showing the kappa coefficient and the 95% confidence interval for different subgroups at both time points. The kappa coefficients for the subgroups ranged between 0.51 and 0.68. The agreement between the CES-D and PHQ-9 can therefore be considered moderate to substantial within the different subgroups. At both time points, the kappa coefficient was greater for men than for women, suggesting that there are more concordant cases among men. The agreement between the two scales was also greater among the younger participants (< 67 years) than among the older participants. Similarly, the agreement among participants with post-secondary education was lower than among participants with less than 14 years of education.
To answer the question of whether a higher proportion of participants were above the cutoff for depression at t9 than at t8, McNemar's test was performed for each scale. The results indicate that the proportion of participants classified as depressed did not differ between t8 and t9 on either scale. As described, a relative change of more than 10% between the two time points on one of the scales was regarded as an increase or decrease in the depression score. Table 7 shows the results of analysing these development trends of both scales over time. In total, both scales identified the same trend (increase, no change or decrease) from t8 to t9 for 75.8% of the participants. The corresponding proportions were 77.1% among men and 74.5% among women. In contrast, the two scales showed different development trends for 24.2% of all participants. However, opposite developments were identified by the two scales for only 0.5% of the participants.
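The two longitudinal checks reported above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the study's SAS code. McNemar's test is shown in its uncorrected asymptotic form, which is an assumption about the variant used. The 10% change is expressed here as a percentage of the scale maximum (45 for the 15-item CES-D; 27 is the standard PHQ-9 maximum and is not stated in the text), which is only one reading of "relative change"; a change relative to the t8 score would need a rule for t8 scores of zero. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

CESD_MAX = 45   # 15 items scored 0..3 (from the text)
PHQ9_MAX = 27   # 9 items scored 0..3 (assumption: standard PHQ-9 range)


def mcnemar(pos_t8: Sequence[bool], pos_t9: Sequence[bool]) -> Tuple[float, float]:
    """Uncorrected McNemar chi-square statistic (1 df) and its p-value."""
    b = sum(1 for x, y in zip(pos_t8, pos_t9) if x and not y)  # above cutoff at t8 only
    c = sum(1 for x, y in zip(pos_t8, pos_t9) if y and not x)  # above cutoff at t9 only
    if b + c == 0:
        return 0.0, 1.0
    stat = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with one degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2))


def trend(score_t8: float, score_t9: float, scale_max: float,
          threshold: float = 0.10) -> str:
    """Classify the change from t8 to t9 as increase, decrease or no change."""
    change = (score_t9 - score_t8) / scale_max
    if change >= threshold:
        return "increase"
    if change <= -threshold:
        return "decrease"
    return "no change"


def trend_concordance(cesd: Sequence[Tuple[float, float]],
                      phq: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Return the proportions of identical and of opposite trends.

    cesd / phq: per-participant (t8, t9) score pairs in the same order.
    """
    labels = [(trend(c8, c9, CESD_MAX), trend(p8, p9, PHQ9_MAX))
              for (c8, c9), (p8, p9) in zip(cesd, phq)]
    n = len(labels)
    same = sum(1 for x, y in labels if x == y) / n
    opposite = sum(1 for x, y in labels if {x, y} == {"increase", "decrease"}) / n
    return same, opposite
```

Only the discordant pairs enter McNemar's statistic, so participants who stay on the same side of the cutoff at both time points do not affect the result.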

Sensitivity analysis
When the cutoff for the CES-D was increased, the proportion of participants classified as depressed decreased accordingly (Table 8). The agreement between the CES-D and PHQ-9 improved with a higher cutoff, as did the kappa coefficient. Table 9 shows the results stratified by current therapy for depression. The proportion of participants classified as depressed was distinctly higher among those currently in therapy (t8: CES-D 45.2%, PHQ-9 33.3%; t9: CES-D 39.8%, PHQ-9 28.5%). Nevertheless, even among those not currently in therapy, 3 to 6% were depressed according to CES-D or PHQ-9 (t8: CES-D 5.7%, PHQ-9 2.8%; t9: CES-D 6.1%, PHQ-9 2.9%). The agreement between the scales was higher among those who were not in therapy. Despite this, the kappa coefficient at both time points was lower in this group than among those who were in therapy.

Discussion
This is the first study to compare systematically the psychometric properties of the CES-D and PHQ-9 in the general population in a longitudinal and sex-specific approach. The analyses have shown that both scales achieved concordant results in detecting depression, but also differed from each other in some aspects. In this study, 4.4% of the participants were identified as depressed with the PHQ-9, slightly lower than the latest WHO prevalence data for depressive disorders in Germany (5.2%) (1). The CES-D, in contrast, classified 8% as depressed in our sample. Women were more likely than men to suffer from depression. This difference was more noticeable in the CES-D than in the PHQ-9. Both scales performed well and proved to be highly reliable instruments, with high Cronbach's alpha values for the total score ranging from 0.84 to 0.89. Concerning internal consistency and convergent validity, our findings match those observed in other studies (10,11). However, a proportion of the participants were classified differently by the two scales. It has been shown that the CES-D is more likely than the PHQ-9 to assess a participant as depressed. This result is also consistent with other studies in which a higher sensitivity of the CES-D compared with the PHQ-9 was found (31). Nevertheless, in our study the agreement between the two scales was more consistent in some subgroups, for example, in men compared with women or in participants younger than 67 years compared with older participants. If the cutoff of the CES-D is raised, as the sensitivity analysis shows, the agreement between CES-D and PHQ-9 improves and the CES-D classifies fewer participants as depressed.
In addition, our study reveals that both scales detected similar tendencies with regard to changes in depression over time. For only 0.5% of participants did the two scales identify completely opposite developments of depressive symptoms.
According to McNemar's test, the proportion of participants classified as depressed did not differ between the two time points on either scale.
There were no sex-specific differences in psychometric characteristics. Cronbach's alpha and Pearson's correlation coefficients were similarly high for men and women. Only the agreement between the CES-D and the PHQ-9 and the corresponding kappa values showed differences between men and women. Men had a higher agreement than women. This may be due to the slightly different domains (e.g. mood, cognition, somatic symptoms) that the two scales measure. The CES-D measures depressed mood and also positive affect, whereas the PHQ-9 measures the criteria of major depression. Furthermore, the CES-D asks about the frequency of depressive symptoms in the last week, while the PHQ-9 asks about the frequency in the last two weeks. The shorter time interval captured by the CES-D may detect short-term symptoms, including acute problems and stressors that may not indicate major depression. It can be assumed that women are more likely to be in a depressed mood without yet meeting the criteria for major depression. They would therefore have an elevated score on the CES-D, but not on the PHQ-9. This may explain the discrepancy between the prevalence determined by the CES-D and the PHQ-9 and, ultimately, the poorer agreement.

Strengths
There are several characteristics that are strengths of this study. First, the HNR Study is based on a large representative sample of the general middle-aged and older population that is followed annually. Second, multiple measurements of depressive symptoms were available with the widely utilized and well-established scales CES-D and PHQ-9, which made the longitudinal analysis possible. The sample size allowed the psychometric characteristics of CES-D and PHQ-9 to be properly investigated in the general population. We also stratified by sex to observe sex-specific differences in the scales.

Limitations
The study also has a few limitations. The basis for concordance analysis is the dichotomization of the initial continuous scores. This results in a loss of information as it is not possible to determine how close the actual score lies to the cutoff. Another limitation is the lack of clinical confirmation of depression, so we were not able to evaluate our results with a valid diagnosis of depression. This would have allowed testing of CES-D and PHQ-9 sensitivity and specificity.

Consent for publication
Not applicable.

Availability of data and materials
The corresponding author has full access to all data in the study and final responsibility for the submission of the article for publication. Due to data security reasons (i.e., data contain potentially participant identifying information), the HNR Study does not allow sharing data as a public use file. Data requests can also be addressed to recall@ukessen.de.

Competing interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figure 2: Kappa coefficient with 95% confidence interval (CI) by subgroups