Multiple choice questions (MCQs) are ubiquitous in the medical profession. The Medical College Admission Test (MCAT), the United States Medical Licensing Examination (USMLE) Step exams, specialty-specific board certification exams, and innumerable other exams related to the basic and clinical sciences are built from MCQs. We may take them for granted in the present day, but in previous decades the utility of MCQs in medical education was disputed (Dudley, 1973; Joorabchi and Chawhan, 1975; Joorabchi, 1981). Data emerged showing that MCQs can be reliable, valid, and efficient (Norcini et al., 1984, 1985), but questions persisted about their ability to evaluate cognitive levels above rote recall (Ferland et al., 1987). Over the years, a number of studies have attempted to classify MCQs and link them to the cognitive levels described in Bloom’s Taxonomy, although the effectiveness of this strategy has not been demonstrated (Huxham and Naeraa, 1980; Ferland et al., 1987; Zaidi et al., 2018).
Given the widespread use of MCQs, the question of whether they can assess different levels of cognitive function has broad implications. The systematic design of instruction typically begins with objectives and ends with an assessment that can determine if the objectives (learning outcomes) have been achieved. The promotion of higher order thinking is desirable given the complexity of clinical medicine. An ideal MCQ-based exam would therefore use questions that provide a meaningful assessment of higher cognitive functions, testing understanding and application of knowledge beyond rote memorization.
This idea also has possible implications for education research. For example, the posttest-only control group design is a true experimental design commonly found in such research (Tuckman and Harper, 2012). Briefly, an intervention is administered to an experimental group but not to a control group, and a subsequent comparative assessment of the two groups (which may take the form of MCQs) tests the effect of the intervention. Since MCQs can be written at different levels of difficulty, and may assess different levels of cognitive function, they introduce an additional variable into the experimental design, as well as a potential source of measurement bias. The selection of MCQs for comparative assessments may therefore itself influence whether an effect is observed, how large it is, and how it evolves over time.
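To illustrate this concern, consider a minimal simulation sketch of the posttest-only design in which the posttest mixes recall and inference items. All parameters here (baseline item difficulties, effect sizes, group sizes) are invented for illustration and are not drawn from any study; if an intervention improves conceptual understanding more than factual recall, the measured group difference depends on how the posttest is assembled:

```python
import numpy as np

rng = np.random.default_rng(0)

def observed_effect(boost_recall, boost_inference, prop_inference,
                    n_items=40, n_students=200):
    """Posttest-only control group design: return the mean score
    difference (treated - control) on a mixed MCQ posttest.
    All parameters are hypothetical, for illustration only."""
    n_inf = round(n_items * prop_inference)
    n_rec = n_items - n_inf
    p_rec, p_inf = 0.75, 0.60  # assumed baseline P(correct) per item type

    def group(rec_boost, inf_boost):
        # Each student's item score is binomial; the intervention shifts
        # the per-item probability of a correct answer for each item type.
        rec = rng.binomial(n_rec, min(p_rec + rec_boost, 1.0), n_students)
        inf = rng.binomial(n_inf, min(p_inf + inf_boost, 1.0), n_students)
        return (rec + inf) / n_items  # per-student proportion correct

    return group(boost_recall, boost_inference).mean() - group(0.0, 0.0).mean()

# Same intervention, different posttest composition.
for prop in (0.1, 0.5, 0.9):
    d = observed_effect(boost_recall=0.02, boost_inference=0.15,
                        prop_inference=prop)
    print(f"inference share {prop:.0%}: observed effect {d:+.3f}")
```

Under these made-up parameters, the same intervention appears nearly ineffective on a recall-heavy posttest and substantial on an inference-heavy one, which is precisely the measurement bias described above.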
In fact, Karpicke and Blunt (2011) report such a finding: in certain experimental groups, performance was higher on posttest questions classified as “inference” than on those classified as “verbatim.” While not the focus of that particular study, the observation is nonetheless intriguing. A related observation was made by Zaidi et al. (2016), who used a dichotomized Bloom’s Taxonomy to classify MCQs as “lower order” or “higher order” based on cognitive level. They found a modest but statistically significant difference in performance on the two question types, with lower order questions answered correctly more often than higher order questions. Both studies suggest that MCQs can be differentiated by function: some are designed to test memorization, while others are designed to test deeper conceptual understanding.
The present study used Karpicke and Blunt’s work as a starting point to develop a classification scheme for medically related MCQs, categorizing them as recall/verbatim or concept/inference. The hallmark of recall/verbatim questions is that they are based primarily on facts, and therefore assess whether a learner has absorbed relevant factual information. In contrast, concept/inference questions are based primarily on relationships, and therefore assess whether a learner knows how facts are connected, which in turn provides greater context and enables predictions. A third category, mixed/ambiguous, was added for MCQs that could potentially be answered by either recall or inference.
We formed two related hypotheses using this classification system. First, that after preparing for an exam, medical students will answer recall/verbatim questions correctly more often than concept/inference questions. This hypothesis presumes that concept/inference questions are generally more challenging because of the more sophisticated processes needed to answer them. Second, that the erosion of conceptual knowledge, as assessed by concept/inference questions, will be slower than the erosion of purely factual knowledge, as assessed by recall/verbatim questions. This hypothesis presumes that conceptual knowledge is relatively difficult to acquire, but once encoded, it is durable.
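One loose way to visualize the pair of hypotheses is with an exponential forgetting model in the spirit of Ebbinghaus; the functional form is an illustrative assumption on our part, not a claim of the study design:

```latex
% Illustrative forgetting curves; exponential decay is an assumed form.
R_{\mathrm{recall}}(t)  = R^{0}_{\mathrm{recall}}  \, e^{-\lambda_{\mathrm{recall}} t},
\qquad
R_{\mathrm{concept}}(t) = R^{0}_{\mathrm{concept}} \, e^{-\lambda_{\mathrm{concept}} t}
```

In this notation, the first hypothesis says the initial retention level R0 is higher for recall/verbatim questions, while the second says the decay rate λ is smaller for concept/inference questions; taken together, they imply the two retention curves could cross at some later time.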