Development and validation of the curriculum viability inhibitor questionnaires comprised two main phases, as shown in Fig. 1. In the first phase, the questionnaires were developed and refined based on qualitative feedback from experts. In the second phase, the content validity, response process validity, construct validity, and reliability of the questionnaires were established.
This study was approved by the Institutional Review Committee at Riphah International University (Appl. # Riphah/IRC/18/0394). Written informed consent was obtained from all participants.
The study duration was from October 2019 to July 2020.
Method-Phase 1
In this phase, answering our first research question, we developed the first version of the teacher and student questionnaires based on a literature review and refined them after receiving qualitative feedback from expert medical educationalists.
Development and Qualitative Content Validation of Teacher and Student Questionnaires (Research Question 1)
Participants. Out of 27 experts who were invited based on their qualifications and experience in medical education, 21 (77%) responded and provided feedback on the first version of the questionnaire, with comments on the constructs and related items.
Materials. The first version of the teacher questionnaire had 62 items measuring 12 constructs, whereas the student questionnaire had 28 items measuring 7 constructs.
Procedure. The first author (RAK) developed the items for measuring each inhibitor based on a scoping review [2] and a consensus-building Delphi study amongst a group of experts [3]. The co-authors (AS, UM, MAE, and JJM) then refined the questionnaire before sharing it with medical education experts via e-mail. The experts were asked to provide qualitative feedback on the questionnaire items to improve their clarity and their relevance to the inhibitor, and to comment on the deletion or addition of items.
Data Analysis. The feedback was initially analysed by the first author, who organized the comments on the items. Changes suggested by the experts were made according to the following criteria: (1) the item is easy to understand, (2) the item is relevant to the construct, (3) duplication and near-identical meanings are avoided, (4) grammatical and formatting errors are minimized, and (5) double-barreled statements are avoided. The questionnaire was then shared with the co-authors for their feedback and consensus on the modifications.
Based on the expert feedback, items were reworded to improve clarity and correct grammatical inaccuracies, or deleted if they were not relevant to the construct or had a meaning very similar to another item. Some items were moved to another construct when they did not suit their original one. When multiple suggestions were given for a single item, the most commonly suggested modification was adopted and finalized through discussion and agreement among the authors.
Method-Phase 2
In this phase, the content validity, response process validity, and construct validity, along with the reliability, of the questionnaires were established, answering our second, third, and fourth research questions, respectively.
Establishing the Content Validity of Teacher and Student Questionnaires (Research Question 2)
Participants. To rate the items for content relevance and clarity, 19 of the 21 (90.5%) medical education experts from Phase 1 participated in Phase 2.
Materials. The revised questionnaire (version 2) for teachers had 60 items measuring 12 constructs (see Appendix A); for students, it had 28 items measuring 7 constructs (see Appendix B). For both questionnaires, Likert scales were used to measure the relevance and clarity of the items. For relevance, we used: 4 = very relevant, 3 = quite relevant, 2 = somewhat relevant, and 1 = not relevant. For clarity, we used: 3 = very clear, 2 = item needs revision, and 1 = not at all clear.
Procedure. Version 2 of the questionnaire was sent via e-mail to the 21 experts who had provided feedback in Phase 1, with a request to respond within 3 weeks. They were asked to score the items on the Likert scales and to provide feedback to improve the items further. Of the 21 experts, 19 responded. Five of the returned forms were incomplete, and these participants were asked to complete them; only two complied, so a total of 16 complete forms were included in the study.
Data Analysis. To establish content validity, quantitative and qualitative data were analysed. For the quantitative component, the content validity index (CVI) of the individual items (I-CVI) and of the scale (S-CVI) was calculated [9], based on the scores given by the experts.
The I-CVI was calculated as the number of experts in agreement divided by the total number of experts, and the S-CVI was determined by averaging the I-CVI scores across all items. To calculate the I-CVI, relevance ratings of 3 or 4 were recoded as 1 and ratings of 1 or 2 were recoded as 0; for each item, the 1s were summed and divided by the total number of experts.
For clarity, rated on the 3-point Likert scale, the content clarity average was calculated. The average clarity of an individual item was calculated by summing all clarity ratings given to the item and dividing by the number of experts. An average clarity above 2.4 (80% of the maximum rating of 3) was considered very clear [10].
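To make these calculations concrete, the sketch below (in Python, using hypothetical ratings rather than study data) computes the I-CVI for each item, the S-CVI as the average of the I-CVIs, and the per-item clarity averages:

```python
# A minimal sketch of the content validity calculations described above.
# The rating matrices are hypothetical; in the study, they came from the
# 16 complete expert forms.
import numpy as np

# One row per expert, one column per item.
relevance = np.array([[4, 3, 2, 4],    # 4-point relevance scale
                      [3, 4, 1, 3],
                      [4, 4, 2, 4]])
clarity = np.array([[3, 3, 2, 3],      # 3-point clarity scale
                    [3, 2, 2, 3],
                    [2, 3, 1, 3]])

# Recode relevance: ratings of 3 or 4 become 1 (relevant), 1 or 2 become 0.
agreement = (relevance >= 3).astype(int)

# I-CVI: for each item, sum the 1s and divide by the number of experts.
i_cvi = agreement.mean(axis=0)

# S-CVI: average of the I-CVI scores across all items.
s_cvi = i_cvi.mean()

# Clarity average per item; above 2.4 (80% of 3) counts as very clear.
clarity_avg = clarity.mean(axis=0)

print("I-CVI per item:", np.round(i_cvi, 2))        # [1. 1. 0. 1.]
print("S-CVI:", round(s_cvi, 2))                    # 0.75
print("Clarity averages:", np.round(clarity_avg, 2))
print("Very clear items:", clarity_avg > 2.4)
```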
The comments provided by the experts were categorized into general comments for the questionnaire and specific comments for the items. Based on these comments, the items were modified.
Establishing Response Process Validity through Cognitive Interviews (Research Question 3)
Cognitive interviewing, a technique for verifying that respondents understand questionnaire items as intended, was used to answer the third research question.
Participants. Interviews were held with 6 teachers and 3 students.
Materials. In version 3, the teacher questionnaire had 53 items measuring 12 constructs, and the student questionnaire had 23 items measuring 7 constructs. We used a combination of ‘think aloud’ and ‘verbal probing’ techniques [9]. Participants were asked to read each item silently and then think aloud about what came to mind after reading it [11]. In verbal probing, we asked scripted and spontaneous questions after the participant had read an item [12]. We combined the two techniques because the think-aloud acts as a cue for respondents and thus yields additional information on the quality of the items, as explained in the Procedure section below.
Procedure.
Test interviews were conducted with 1 co-author, 1 teacher, and 1 student using Zoom (zoom.us) to identify possible issues related to combining think-alouds and verbal probing. The time participants needed to answer the items was also determined. The average cognitive interview lasted approximately 60 minutes for 27 items of the teacher questionnaire and 50 minutes for the 23 items of the student questionnaire. We also piloted cued retrospective probing [13], in which the primary researcher replayed the recorded think-aloud to the participant and explored the items with scripted and spontaneous probes. Compared with the combined technique, however, it provided no additional cueing benefit and required more time.
The protocols for the cognitive interviews were planned based on the pilot interviews, as such interviews demand sustained concentration from the participants [14]. For the teacher questionnaire, we therefore divided the 53 items between 2 participants, whereas the student questionnaire, with only 23 items, did not require division. To increase the credibility of the interview technique and reduce bias, another researcher (UM) was also present during each interview.
Data Analysis. Analytic memos were created based on the think-alouds and verbal probing. These memos were coded into the following categories: (1) items with no problems in understanding, (2) items with minor problems in understanding, and (3) items with major problems in understanding [15]. The categories were assigned independently by RAK and UM. Items that required more clarity were reworded and further refined through review by the remaining co-authors (AS, MAE, and JJM). For reproducibility, details of the response process validation are provided in Appendix C.
Establishing Reliability and Construct Validity (Research Question 4)
Participants. Based on the adequate sample size reported in the literature (a minimum of 10 participants per item), our target sample was 520 teachers (52 items × 10) and 230 final-year medical students (23 items × 10) [16, 17]. A total of 575 teachers from 77 medical colleges and 247 final-year students from 12 medical colleges filled out the questionnaire. We selected teachers who were currently teaching and had been involved in implementing or developing the curriculum; curriculum involvement was defined as developing a module or course and teaching, assessing, and managing it. Final-year medical students were recruited because they have the most experience of the curriculum. The teachers' designation, academic qualification, teaching experience, experience in medical education, and type of curriculum practiced are shown in Table 2. Of the 575 teachers, 526 provided complete responses, as did 245 of the 247 students.
Table 2
Participant demographics for confirmatory factor analysis of the teacher questionnaire (N = 526)

| Designation | Qualification in Medical Education | Experience as a Teacher | Experience in Medical Education | Type of Curricula Practiced in the Institution |
|---|---|---|---|---|
| Professor (22%) | PhD (3%) | > 20 years (7%) | > 20 years (2%) | Discipline-based (29%) |
| Associate Professor (18%) | Master’s (44%) | 16–20 years (10%) | 16–20 years (1%) | Integrated (35%) |
| Assistant Professor (30%) | Fellowship (22%) | 11–15 years (21%) | 11–15 years (7%) | Problem-based (4%) |
| Senior Lecturer (13%) | Diploma (4%) | 5–10 years (30%) | 5–10 years (18%) | Theme-based (3%) |
| Lecturer (17%) | Certificate (17%) | < 5 years (32%) | < 5 years (72%) | Hybrid (mix of discipline-based and integrated) (29%) |
| | Workshops only (10%) | | | |

Note. Each column is an independent percentage distribution summing to 100%.
Materials. The fourth version of the teacher questionnaire had 52 items measuring 12 constructs, and the student questionnaire had 23 items measuring 7 constructs. Items were scored on a 5-point Likert scale: 1 = strongly disagree, 2 = somewhat disagree, 3 = neither agree nor disagree, 4 = somewhat agree, and 5 = strongly agree. The items were shuffled so that they were not grouped by the hypothesized constructs, and the answer options of a few items were shuffled as well, with respondents informed of this. We did this so that respondents would read and answer each question carefully, encouraging response optimizing and preventing satisficing [18–20].
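As an illustration, the following minimal sketch (with hypothetical construct and item labels, not the actual questionnaire items) shows such item-order randomization:

```python
# A minimal sketch of item-order randomization; the construct and item
# labels are hypothetical placeholders, not the actual questionnaire items.
import random

# Items initially grouped by their hypothesized construct.
items = [f"{construct}_item{i}"
         for construct in ("leadership", "resources", "training")
         for i in range(1, 5)]

random.seed(7)         # fixed seed so every respondent sees the same order
random.shuffle(items)  # the order no longer reveals the construct grouping
print(items)
```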
Procedure. A pilot study of the questionnaire was conducted with 20 teachers and 15 medical students to ensure that the Qualtrics link (www.qualtrics.com) worked smoothly and to resolve any difficulty in browsing through the questionnaire. No issues were reported by the participants. To maximize the response rate, we shared the questionnaire link through several channels: it was e-mailed to the Deans and Directors of medical education of the colleges and shared with master's in health professions students through their WhatsApp groups. The invitation message stressed the formative purpose and use of the evaluations and the confidential and voluntary character of participation. To encourage participation, e-mail reminders were sent on days 5 and 10, along with WhatsApp reminders to the Directors of medical education departments.
Data Analysis. To ascertain the internal structure of the questionnaires, internal consistency was calculated using Cronbach's alpha. We then conducted confirmatory factor analysis (CFA), as we had specific expectations regarding (a) the number of factors (constructs/subscales), (b) which variables (items) reflect given factors, and (c) whether the factors were correlated [21].
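For illustration, these three expectations can be encoded in lavaan-style syntax, here as a hedged sketch using the open-source semopy package (the study itself used AMOS); the factor names, item names, and data file are hypothetical:

```python
# A sketch of a CFA specification with the third-party semopy package;
# the study used AMOS, and all names below are hypothetical.
import pandas as pd
import semopy

# (a) the number of factors, (b) which items reflect which factor,
# and (c) a covariance term letting the two factors correlate.
model_desc = """
Leadership =~ item1 + item2 + item3
Resources =~ item4 + item5 + item6
Leadership ~~ Resources
"""

data = pd.read_csv("responses.csv")  # hypothetical respondents-by-items file
model = semopy.Model(model_desc)
model.fit(data)
print(semopy.calc_stats(model))  # chi-square, RMSEA, CFI, TLI, GFI, AGFI, ...
```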
The questionnaires were evaluated using SPSS version 26 and AMOS version 26. Regarding internal consistency, a Cronbach's alpha between .50 and .70 was considered satisfactory for the scale and subscales [22–24]. The corrected item-total correlation (CITC) was calculated for items of subscales with low internal consistency; a CITC in the range of .2 to .4 was considered acceptable for retaining an item [25, 26].
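For illustration, both statistics can be computed as in the sketch below (simulated ratings, not study data):

```python
# A minimal sketch of Cronbach's alpha for a subscale and the corrected
# item-total correlation (CITC) of its items; the data are simulated.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: respondents x items matrix of Likert ratings."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def corrected_item_total(scores: np.ndarray) -> np.ndarray:
    """Correlation of each item with the sum of the remaining items."""
    citc = []
    for j in range(scores.shape[1]):
        rest = np.delete(scores, j, axis=1).sum(axis=1)
        citc.append(np.corrcoef(scores[:, j], rest)[0, 1])
    return np.array(citc)

rng = np.random.default_rng(0)
subscale = rng.integers(1, 6, size=(100, 5))  # 100 respondents, 5 items, 1-5

print(f"alpha = {cronbach_alpha(subscale):.2f}")          # .50-.70 satisfactory here
print("CITC:", np.round(corrected_item_total(subscale), 2))  # .2-.4 acceptable here
```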
Construct validity was established via CFA. For the goodness-of-fit of the measurement model, we examined absolute, incremental, and parsimonious fit indices. Absolute fit indices assess the overall theoretical model against the observed data, incremental (comparative) fit indices compare the hypothesised model with a baseline or minimal model, and the parsimonious fit index assesses the complexity of the model [27, 28]. For absolute fit, we used the root mean square error of approximation (RMSEA), with < .05 indicating a close fit and < .08 an acceptable fit [29], and the goodness-of-fit index (GFI) > .90 indicating a good fit [30]. For incremental fit, acceptable values are comparative fit index (CFI) > .90, adjusted goodness-of-fit index (AGFI) > .90, Tucker–Lewis index (TLI) > .90 [31], and normed fit index (NFI) > .90 [32]. For parsimonious fit, a normed chi-square (χ²/df) < 5.0 is considered acceptable [4, 33].
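A minimal sketch (with hypothetical index values, not study results) of screening a fitted model against these criteria:

```python
# A minimal sketch of checking fit indices against the thresholds named
# above; the index values are hypothetical, not results from the study.
fit = {"RMSEA": 0.046, "GFI": 0.92, "CFI": 0.93, "AGFI": 0.91,
       "TLI": 0.91, "NFI": 0.91, "chi2/df": 2.1}

criteria = {
    "RMSEA": lambda v: v < 0.08,     # < .05 close fit, < .08 acceptable
    "GFI": lambda v: v > 0.90,
    "CFI": lambda v: v > 0.90,
    "AGFI": lambda v: v > 0.90,
    "TLI": lambda v: v > 0.90,
    "NFI": lambda v: v > 0.90,
    "chi2/df": lambda v: v < 5.0,    # normed chi-square
}

for index, value in fit.items():
    verdict = "acceptable" if criteria[index](value) else "poor"
    print(f"{index}: {value} -> {verdict}")
```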