The EMHL test is a maximum performance test (criterion-referenced test, CRT) based on the thematic contents of the EspaiJove.net program. The item pool was generated from the contents of the program's thematic modules: 1) Concepts of mental health and mental disorders; 2) Mental health multidisciplinary team network and use of health services; 3) Healthy and risk behaviours in mental health; 4) Social skills and antisocial behaviour, bullying and cyber-bullying; 5) Anxiety; 6) Depression; 7) Self-harm and suicidal behaviours; 8) Eating disorders; 9) Alcohol and substance use; and 10) Psychotic disorders.
The EMHL test development process involved two phases, as shown in Figure 1:
[ Figure 1]
The study adheres to CONSORT guidelines (see additional file 1 for checklist). Trial registration NCT03215654.
Phase 1: Questionnaire content development.
The development of the EMHL test was based on: 1) a literature review of MHL measures for adolescents; and 2) content analysis of the discourse emerging from 6 focus groups held with trained child and adolescent mental health professionals (psychiatrists, psychologists and nurses) and high school students aged 14 to 15 years.
The questionnaire content development phase included 5 steps:
Step 1: Item pool generation. The first version of the EMHL test was developed from the thematic content of each module of the EspaiJove.net program, resulting in 60 items (1st version) (1st part: 15 items; 2nd part: 45 items). The experimental versions of the EMHL test consisted of two parts with two response formats: (i) The 1st part uses a binary choice format (yes/no) for the recognition of mental disorders from a list of 15 different diseases (15 items). The mental disorders are based on the Diagnostic and Statistical Manual of Mental Disorders (5th ed.) (DSM–5). (ii) The 2nd part consists of multiple choice questions with four answer options, of which only one is correct. Incorrect answers served as distractors and were based on stereotypes, prejudices and erroneous statements about mental health. The distractors were selected from focus groups held with adolescents (n = 39), then developed by EspaiJove.net researchers and reviewed by mental health professionals.
Step 2: We then conducted 6 focus groups with a total of 29 mental health professionals (expert panel) from four public child and juvenile mental health centers to explore: (i) clinical relevance; (ii) mistakes in wording (question and answers of each item); and (iii) comprehensiveness and offensiveness. Between four and seven participants took part in each group. Semi-structured cognitive interviews were used to guide the discussions, which were recorded. As a result, an initial selection of 45 items (2nd version) (1st part: 15 items; 2nd part: 30 items) was derived from the preliminary 1st version by excluding the items of each module that the mental health professionals considered least clinically relevant.
Step 3: The EMHL test 45-item 2nd version (1st part: 15 items; 2nd part: 30 items) was administered in five high school classrooms (n = 141): two from 4th grade (n = 69), two from 3rd grade (n = 50), and one from 2nd grade (n = 22), in three different public schools. The objective was to examine (i) item relevance as judged by adolescents, (ii) comprehensiveness and level of difficulty, (iii) offensiveness, and (iv) feasibility of the test.
Step 4: Finally, the EMHL test was administered to 3rd grade high school students (n = 50) and a focus group (n = 5) with the purpose of improving comprehensiveness and vocabulary. The resulting 35-item test (4th version) (1st part: 15 items; 2nd part: 20 items) was selected for the pilot study.
Step 5: Pilot testing study. The final EMHL test 35-item version (1st part: 15 items; 2nd part: 20 items) was piloted with 3rd grade high school students (n = 23) to assess comprehensiveness and vocabulary. The validation process was performed with this version of the EMHL test (5th version).
From the analyses of the results, we finally selected 35 items (3rd version) (1st part: 15 items; 2nd part: 20 items), deleting each question that met two of the following three criteria: (i) psychometric properties (item-total score correlation <0.20); (ii) knowledge level (>50% of positive answers); and (iii) relevance (items rated as less relevant by mental health professionals).
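The two-of-three deletion rule can be sketched in a few lines of Python. This is only an illustration, not the authors' procedure: the function and parameter names are hypothetical, and "two of three" is read here as "at least two", which the text does not state explicitly.

```python
def should_delete_item(item_total_r, pct_positive, low_relevance, min_criteria=2):
    """Return True when an item meets enough deletion criteria.

    item_total_r  -- item-total score correlation of the item
    pct_positive  -- percentage of positive answers (0 to 100)
    low_relevance -- True if professionals rated the item as less relevant
    min_criteria  -- assumption: "two of three" is read as "at least two"
    """
    criteria_met = [
        item_total_r < 0.20,   # (i) psychometric properties
        pct_positive > 50.0,   # (ii) knowledge level
        bool(low_relevance),   # (iii) relevance
    ]
    return sum(criteria_met) >= min_criteria


# Hypothetical items, for illustration only.
print(should_delete_item(0.15, 60.0, False))  # meets (i) and (ii)
print(should_delete_item(0.35, 60.0, False))  # meets only (ii)
```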
To obtain the total score for each part of the EMHL test, the formula (A − E)/(n − 1) is used, where A is the number of correct answers, E is the number of errors (including missing values), and n is the number of options for each item. Thus, the formula is (A − E)/(2 − 1) for the 1st part of the EMHL test and (A − E)/(4 − 1) for the 2nd part. In the uncorrected total score, each correct answer adds one point and each incorrect answer adds zero points. To facilitate the interpretation of results, the scores of both parts were converted to deciles from 0 to 10 (transformed scores). Higher scores mean higher mental health knowledge.
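As a minimal sketch of this scoring, the Python below implements the formula exactly as it is written in the text, with the whole difference A − E divided by n − 1. The function names and the use of None for a missing answer are assumptions made for illustration. Note that the classical correction for guessing is usually written A − E/(n − 1), so the grouping should be checked against the original scoring procedure before reuse.

```python
def uncorrected_score(answers, key):
    """Count correct answers; wrong or missing (None) answers add zero points."""
    return sum(1 for given, correct in zip(answers, key) if given == correct)


def corrected_score(answers, key, n_options):
    """Apply (A - E) / (n - 1) as written in the text.

    A = number of correct answers, E = number of errors including missing
    values, n = number of answer options per item.
    """
    a = uncorrected_score(answers, key)
    e = len(key) - a  # every item that is not correct counts as an error
    return (a - e) / (n_options - 1)


# Hypothetical answer keys, for illustration only.
part1_key = ["yes", "no", "yes"]
part1_answers = ["yes", "no", "no"]
print(corrected_score(part1_answers, part1_key, n_options=2))

part2_key = ["a", "b", "c", "d"]
part2_answers = ["a", "b", "d", None]  # one wrong answer, one missing
print(corrected_score(part2_answers, part2_key, n_options=4))
```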
Phase 2: Validation of the psychometric properties of the EMHL test
The validation process was performed through the administration and analysis of the final version of the EMHL test (5th version) in a nonrandomized convenience sample of high school students aged 14 to 15 years (N = 355) from 6 schools in Barcelona, Spain. Informed consent was signed by both adolescents and parents. Exclusion criteria were: (1) attending a special education school; (2) having special educational needs and/or cognitive problems; and (3) not understanding Spanish or Catalan. Nurses and psychologists, members of the EspaiJove.net team, informed the participants about the contents of the study and administered the EMHL test.
Main validity measures
We hypothesized that specific variables would be associated with the level of mental health literacy, with varying degrees of strength.
Stigma. Stigma was measured with two questionnaires: (1) The Community Attitudes toward the Mentally Ill (CAMI) scale, Spanish version, is an instrument for the systematic description of community attitudes towards people with mental illness. It consists of 40 items divided into four dimensions (Authoritarianism; Benevolence; Community mental health ideology; and Social restrictiveness). We administered only the Authoritarianism dimension (10 items). For the Social restrictiveness dimension, the four future-behaviour items of the RIBS were used instead, since both address the same concepts. We chose these two dimensions because they contain items related to treating and caring for people with mental illness. The score for each subscale is the sum of the positive items and the reversed negative items. All items of the Authoritarianism dimension are scored on an ordinal scale from 5 to 1, so the total ranges from 10 to 50. Higher scores mean greater agreement with the stated attitude. (2) The Reported and Intended Behaviour Scale (RIBS) is used to assess reported and intended behavioural discrimination among the general public against people with mental health problems. The RIBS consists of 8 items: the first four items assess the prevalence (past and current) of behaviour in each of four contexts (1. living with; 2. working with; 3. living nearby; and 4. being in a relationship with someone with a mental health problem), while items 5–8 ask about intended (future) behaviour within the same contexts. We selected items 5 to 8 (future behaviour). The scale uses an ordinal Likert format with five response options: “totally agree”, “somewhat agree”, “neither agree nor disagree”, “somewhat disagree” and “strongly disagree”, scored from 5 to 1 point, respectively. The total score for future behaviours is the sum of the four answers and ranges from 4 to 20. Higher scores indicate greater agreement with engaging in the stated behaviour.
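The two stigma scores can be illustrated with a short Python sketch. The response labels and point values are taken from the text; the function names are hypothetical, and because the text does not say which Authoritarianism items are negatively worded, the reversed item positions are left as a parameter.

```python
# Points for the RIBS response options, as listed in the text.
RIBS_POINTS = {
    "totally agree": 5,
    "somewhat agree": 4,
    "neither agree nor disagree": 3,
    "somewhat disagree": 2,
    "strongly disagree": 1,
}


def ribs_future_score(responses):
    """Sum the four future-behaviour items (RIBS items 5 to 8); range 4 to 20."""
    if len(responses) != 4:
        raise ValueError("expected exactly four responses (items 5 to 8)")
    return sum(RIBS_POINTS[r] for r in responses)


def likert_subscale_score(points, reversed_items=()):
    """Sum 1-to-5 item points, reversing negatively worded items (6 - points).

    reversed_items holds zero-based positions of negative items; which items
    are negative is an assumption to be filled in from the instrument itself.
    """
    return sum(6 - p if i in reversed_items else p for i, p in enumerate(points))


print(ribs_future_score(["totally agree", "somewhat agree",
                         "neither agree nor disagree", "strongly disagree"]))
print(likert_subscale_score([5] * 10))  # upper bound of a 10-item subscale
```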
We hypothesized that higher mental health knowledge would be associated with lower stigma.
Mental health. The Strengths and Difficulties Questionnaire (SDQ) was used. The SDQ consists of 25 items that generate scores along five dimensions: emotional symptoms, conduct problems, hyperactivity-inattention, peer problems, and prosocial behaviour (positive mental health). We hypothesized that adolescents with more emotional symptoms, peer problems and conduct problems would have lower mental health knowledge, and that those with more prosocial behaviours would have higher mental health knowledge. No a priori relationship was expected for hyperactivity-inattention because no item in the EMHL test addresses this construct. Each item is rated 0, 1 or 2 points according to whether the statement is “not true”, “somewhat true” or “absolutely true”. The score is inverted for items whose presence indicates positive features. Each dimension score ranges from 0 to 10, and the total difficulties score, the sum of the four problem dimensions, ranges from 0 to 40. The SDQ has been validated for use with adolescents aged 11–16.
Health-related quality of life (HRQoL). The 5-level EQ–5D is a brief, multi-attribute, generic, preference-based health status measure [26,27]. The EQ–5D covers five dimensions of health (mobility, self-care, usual activities, pain or discomfort, and anxiety or depression), with five levels of severity in each dimension (EQ–5D–5L). We used the Spanish version of the EQ–5D–5L and time trade-off preference values from the Catalan general population. EQ–5D–5L scores range from negative values to 1, with higher scores indicating better health status and 0 being equivalent to death. The single-item EQ–5D visual analogue scale (EQ–5D-VAS) (range 0–100) was also used. We hypothesized that adolescents reporting worse HRQoL on the anxiety/depression dimension would have lower mental health knowledge, with no such association for the other dimensions.
Bullying and cyberbullying. We developed a 4-item scale specifically for this study to assess bullying victimization and perpetration. Two items assess whether an adolescent has been a victim of bullying or cyberbullying, and two items assess whether the adolescent has engaged in bullying or cyberbullying behaviours. The answer options were “Yes”, which scores 1, and “No”, which scores 0. The total score for each dimension was the sum of its items. We hypothesized that adolescents who had been bullied or who had bullied others would have lower mental health knowledge.
Known-groups validity assessment
We recruited high school teachers (n = 43); nursing and psychology university students (n = 57); primary care physicians and nurses (n = 61); and mental health professionals (psychiatrists, psychologists and nurses) (n = 52). We hypothesized that some groups, in particular health professionals and teachers, would have significantly higher mental health knowledge than high school students.
Missing values were assessed. The distribution of the item responses from complete responders was analyzed in order to detect highly skewed distributions and floor or ceiling effects of correct answers. Internal consistency was estimated with the phi (lambda) coefficient [29,30]. This coefficient is specific to CRTs and is interpreted like Cronbach's alpha coefficient, taking values between 0 and 1. One-month test–retest reliability was assessed with the Intraclass Correlation Coefficient (ICC) from a two-way random model, testing for absolute agreement between the first and second administration of the scale. Values below 0.40 were considered poor, between 0.40 and 0.59 fair, between 0.60 and 0.74 good, and 0.75 or higher excellent.
The ability of the EMHL test uncorrected total score to distinguish among different groups was assessed. Differences across known groups were assessed with ANOVA. For categorical variables, the magnitude of the association was estimated with the effect size (ES) of the difference in mean MHL between subgroups. The ES was interpreted as low (0.20 ≤ |ES| ≤ 0.50), moderate (0.50 < |ES| ≤ 0.80), or high (|ES| > 0.80) [33,34]. For continuous measures, the magnitude of the association was assessed with cut-offs for Pearson correlation coefficients: very weak (<0.20), weak (≥0.20 to <0.40), moderate (≥0.40 to <0.60), strong (≥0.60 to <0.80), and very strong (≥0.80) [35]. Significance tests were all evaluated at the 0.05 level.
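For readers who want to reproduce the test–retest index outside SPSS, the following is a minimal pure-Python sketch of a two-way random, absolute-agreement ICC together with the interpretation bands given above. It assumes the single-measurement form, often written ICC(2,1); the text does not say whether single or average measures were reported, and the function names are hypothetical.

```python
def icc_absolute_agreement(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    scores -- list of rows, one per participant, each holding k measurements
              (here k = 2: first and second administration).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-participants mean square
    msc = ss_cols / (k - 1)               # between-administrations mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def interpret_icc(icc):
    """Bands from the text; values from 0.74 up to 0.75 are treated as good."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Hypothetical test and retest scores for three participants.
retest = [[1, 2], [2, 3], [3, 4]]
value = icc_absolute_agreement(retest)
print(round(value, 3), interpret_icc(value))
```

In the toy data every participant scores one point higher at retest, so the ranking is preserved but absolute agreement is penalized, which is the behaviour an absolute-agreement ICC is meant to capture.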
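The interpretation cut-offs above can be written as two small lookup functions. This is a sketch under two assumptions: the garbled inequality signs in the ES bands are read as 0.20 to 0.50 (low), above 0.50 to 0.80 (moderate) and above 0.80 (high), and the correlation cut-offs are applied to the absolute value of r. The label returned for an ES below 0.20 is not from the text.

```python
def classify_effect_size(es):
    """Label an effect size using the bands described in the text."""
    size = abs(es)
    if size > 0.80:
        return "high"
    if size > 0.50:
        return "moderate"
    if size >= 0.20:
        return "low"
    return "below the low threshold"  # label is an assumption, not in the text


def classify_correlation(r):
    """Label a Pearson correlation; applied to |r| (an assumption)."""
    size = abs(r)
    if size >= 0.80:
        return "very strong"
    if size >= 0.60:
        return "strong"
    if size >= 0.40:
        return "moderate"
    if size >= 0.20:
        return "weak"
    return "very weak"


print(classify_effect_size(-0.65))
print(classify_correlation(0.45))
```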
The item discrimination index (D) was also assessed; it evaluates how well an individual question separates participants who have mastered the material from those who have not. It is based on comparing the performance of the extreme groups (low and high) in the test scores: the number of participants who answered the item correctly in the high group is compared with the number in the low proficiency group. We selected 36% of the sample. The discrimination capacity of each item was assessed with these cut-offs: items that must be deleted (≤0.0), inadequate (>0.0 to <0.20), low (≥0.20 to <0.30), acceptable (≥0.30 to <0.40), and strong (≥0.40) discrimination.
Statistical analyses were conducted using the Statistical Package for the Social Sciences (SPSS) version 22.0 and Excel (Microsoft Office).
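A common way to compute such an index is the difference between the proportions of correct answers in the high and low groups; the sketch below follows that convention, which the text does not spell out. It also assumes that the 36% figure is the share of the sample placed in each extreme group, which is one possible reading of the text. All names are hypothetical.

```python
def discrimination_index(item_correct, total_scores, fraction=0.36):
    """Proportion correct in the high group minus proportion in the low group.

    item_correct -- 1/0 per participant for the item being evaluated
    total_scores -- total test score per participant, same order
    fraction     -- assumed share of the sample in EACH extreme group
    """
    n = len(total_scores)
    group_size = max(1, int(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    low, high = order[:group_size], order[-group_size:]
    p_high = sum(item_correct[i] for i in high) / group_size
    p_low = sum(item_correct[i] for i in low) / group_size
    return p_high - p_low


def classify_discrimination(d):
    """Apply the cut-offs listed in the text."""
    if d <= 0.0:
        return "delete"
    if d < 0.20:
        return "inadequate"
    if d < 0.30:
        return "low"
    if d < 0.40:
        return "acceptable"
    return "strong"


# Hypothetical data: ten participants, one item.
totals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
correct = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
d = discrimination_index(correct, totals)
print(d, classify_discrimination(d))
```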