MCQs are a useful assessment tool for measuring factual recall and, if carefully constructed, can test the higher-order thinking skills that are essential for a medical graduate. The method of assessment should be regularly evaluated, and developing an appropriate assessment strategy is a key part of curriculum development. It is important to evaluate MCQ items to see how effective they are in assessing the knowledge of students (Gajjar et al. 2014). The MCQ is the most widely used test format in the health sciences today, and its efficiency as an evaluation tool depends mainly on the quality of the items, which is best assessed by item and test analysis (Poulomi et al. 2015, p. 47–52).
Item analysis is especially valuable for improving items that will be used again in later tests, but it can also be used to eliminate ambiguous or misleading items in a single test administration. In addition, item analysis is valuable for increasing instructors' skill in test construction and for identifying specific areas of course content that need greater emphasis or clarity; separate item analyses can be requested for each raw score (Pande et al. 2013, p. 45–50). Such statistics must always be interpreted in the context of the type of test given and the individuals being tested (Lord 1952, p. 181–194).
According to the results of this study, many aspects of our MCQ exams in the department of pediatrics require further in-depth study and post-test re-evaluation before the items are reused in the same test or in future tests. Optimizing the key item analysis indices is an essential aim for the constructors of medical exams. A study by Poulomi et al. concluded that items with average difficulty, high discrimination and functional distractors are the best to incorporate in exams (Poulomi et al. 2015, p. 47–52), and a very similar conclusion was reached in the study by Durgesh et al. (Durgesh et al. 2017, p. 5351–5355). The item analysis indices can, to a certain extent, differ according to the cognitive levels of Bloom's taxonomy; Serpil et al. found in their study that difficulty and discrimination were more associated with the remembering and understanding levels than with the applying level (Serpil 2016, p. 16–24). In any case, all indices should be considered together before making decisions or revisions (Suskie 2017).
Comparable results to those of this study were also reported by other authors who evaluated the MCQ tests in their courses. Tarik Al Shaibani et al. concluded that the mean DIF I, DI and DE were in the acceptable ranges, that a high percentage of items was easy, and that a high percentage of distractors were NFDs; these distractors need to be revised to improve the DIF I, DI and DE parameters, while the reliability of the exams was acceptable (Al Shaibani et al. 2021, p. 471–476). In the study by Vrunda Kolter et al., the difficulty index of the analysed MCQs ranged from 6.25% to 90.6% and the discrimination index ranged from 0.0 to 0.63. In total, 65% of items were in the acceptable range of difficulty ('p' value 30–70%) and 10% of items were very difficult and were later discussed with the students. The discrimination index of 60% of items was excellent (d value > 0.35), and no item had negative discriminative power. About 47.5% of items had 100% distractor efficiency (DE), whereas 7.5% of items had 0% DE (Vrunda 2015, p. 320–326). The index figures of these two studies may resemble or differ from ours, which indicates the need to individualize the analysis of every institutional exam and to look for the factors underlying the results. Many other similar studies in medical schools in the Gulf and other parts of the region have obtained useful results and indicators for improving their MCQ exams, for example the studies by Rao C et al., Prashant E. et al. and Deena Kheyami et al. (Rao et al. 2016; Prashant et al. 2016; Deena et al. 2018). Gajjar et al., in a detailed study with objectives close to those of this study, reported that out of 50 items, 24 had "good to excellent" DIF I (31–60%) and 15 had "good to excellent" DI (> 0.25). Mean DE was 88.6%, considered ideal/acceptable, and non-functional distractors (NFDs) made up only 11.4%. Mean DI was 0.14; poor DI (< 0.15), with negative DI in 10 items, indicates poor preparedness of students and some issues with the framing of at least some of the MCQs. An increased proportion of NFDs (incorrect alternatives selected by < 5% of students) in an item decreases its DE and makes it easier. There were 15 items with 17 NFDs, while the remaining items had no NFD and a mean DE of 100%. Compared with their results, the present study found no negative DI, which is a positive finding in our setting. They proposed that negative DI in their sample could be caused by a wrong key, ambiguous framing of the question, or generalized poor preparation of the students, as was the case in their study, where the overall scores were very poor (0–33/100) and none of the students secured passing marks. Items with negative DI are not only useless but actually serve to decrease the validity of the test (Gajjar et al. 2014).
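For clarity, the distractor measures compared above can be expressed computationally. The following Python sketch applies the < 5% rule for non-functional distractors quoted from Gajjar et al.; the function name, option labels and counts are hypothetical illustrations, not values taken from any of the cited studies.

```python
# Sketch of distractor efficiency (DE) and non-functional distractor (NFD)
# counting for one four-option item, following the <5% rule quoted above.
# The example counts below are illustrative only.

def distractor_efficiency(option_counts, correct_option):
    """Return (number of NFDs, DE in percent) for a single MCQ item.

    option_counts : dict mapping option label -> number of students choosing it
    correct_option: label of the keyed answer
    """
    total = sum(option_counts.values())
    distractors = {k: v for k, v in option_counts.items() if k != correct_option}
    # A distractor chosen by fewer than 5% of examinees is non-functional (NFD).
    nfd = sum(1 for v in distractors.values() if v / total < 0.05)
    functional = len(distractors) - nfd
    de = 100.0 * functional / len(distractors)
    return nfd, de

# Example: 100 students, key is "B"; options C and D each attract <5% of
# responses, giving 2 NFDs and DE = 33.3%.
print(distractor_efficiency({"A": 20, "B": 72, "C": 4, "D": 4}, "B"))
```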
Grades and scores of the student groups in the present study: the curve of grade distribution is acceptable, as most of the students lie in the central part of the curve, giving an approximately normal distribution; however, the F grades tend to be double the A grades, which is a source of some concern. The difference between the scores of the first-term and second-term students may be attributable to the different selection level of students at initial acceptance to the college, as the students with higher entrance levels were those of the first term.
Difficulty evaluation of the exams: to calculate the index of difficulty we need three parameters: a) the number of higher achievers (HA), b) the number of lower achievers (LA) and c) the total number of respondents (N) (M. Senthil et al. 2020, p. 80–85). At the end of the item analysis report, test items are listed according to their degrees of difficulty (easy, medium, and hard) and discrimination (good, fair, and poor). These distributions provide a quick overview of the test and can be used to identify items which are not performing well and which can perhaps be improved or discarded. With the high percentage (61%) of very easy items in our midterm exams, measures are needed to improve the difficulty levels of the items (although the difficulty is lower than average in some of the groups). Part of the problem may be attributed to the repetition of certain questions in one way or another, and part may be due to students possessing large question banks that have consumed many of the questions used by the department. This necessitates renewal of the question bank by the departmental exam committee and more effort from faculty members to re-edit or replace the questions. The ideal difficulty level for four-response multiple-choice items, in terms of discrimination potential, is approximately 74%, and working towards it is highly recommended to improve the levels of the exams (Lord 1952, p. 181–194). When the difficulty index rises above about 60%, i.e. when items are too easy, the result is inflated scores and a decline in motivation. Items that are too easy should be placed at the start of the test or, better, removed altogether; similarly, difficult items should be reviewed for possibly confusing language, areas of controversy, or even an incorrect key (Gajjar et al. 2014). A study by Kishore K. et al. concluded that even a single non-functioning or poor distractor seriously affects the psychometric properties and reliability of a test paper in a three-option format, and their data clearly provided evidence that items with three non-functioning distractors have serious psychometric inadequacy (Deepak et al. 2015, p. 428–435). Ideally, if the distractors are properly designed, the LA group should select the incorrect options more often than the HA group (M. Senthil et al. 2020, p. 80–85). In their review of functioning and non-functioning distractors in 514 four-option MCQ assessments, Tarrant et al. found that only 13.8% of all items had three functioning distractors and just over 70% had only one or two functioning distractors (Tarrant M et al. 2009).
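As a minimal sketch of how the three parameters named above (HA, LA and N) yield the difficulty and discrimination indices, the Python snippet below applies the conventional formulas DIF I = (HA + LA) / N x 100 and DI = 2(HA - LA) / N. The function names and example numbers are illustrative assumptions, not the calculation script used in this study.

```python
# Illustrative computation of the difficulty index (DIF I) and discrimination
# index (DI) from the number of correct responses in the higher-achiever group
# (HA), the lower-achiever group (LA), and the total respondents in the two
# groups (N), using the conventional item-analysis formulas.

def difficulty_index(ha_correct, la_correct, n):
    """DIF I (%) = (HA + LA) / N * 100; higher values mean an easier item."""
    return 100.0 * (ha_correct + la_correct) / n

def discrimination_index(ha_correct, la_correct, n):
    """DI = 2 * (HA - LA) / N; values above roughly 0.25 are usually considered good."""
    return 2.0 * (ha_correct - la_correct) / n

# Example: 30 students in each extreme group (N = 60); 27 of the top group and
# 15 of the bottom group answer the item correctly.
n = 60
print(difficulty_index(27, 15, n))      # 70.0 -> fairly easy item
print(discrimination_index(27, 15, n))  # 0.4  -> good discriminator
```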
A reliability coefficient between 70% and 80% is good for a classroom test and is in the range of most; there are probably a few items which could be improved. A coefficient between 60% and 70% is somewhat low; such a test needs to be supplemented by other measures (e.g., more tests) to determine grades, and there are probably some items which could be improved. Between 50% and 60% there is a need to revise the whole test (Scorepak 2015). Multiple factors affect the reliability of a test, including the length of the test (reliability improves with a larger number of questions), the proportion of correct and incorrect responses, item difficulty (very easy and very difficult items do not discriminate properly), and the number and individual characteristics of the examinees (Suskie 2017). Thus, through item analysis, instructors can improve their skill in constructing valid MCQs in the future. It also directs curriculum administrators to specific areas of the course content that need revision or further clarification, as evidenced by poor mastery of the subject (M. Senthil et al. 2020, p. 80–85). In a study by Shikha S. et al., after analysis of their MCQ test and consideration of DIF I, DI and DE together, 23.33% of items were validated for the MCQ bank, 56.67% of items were to be re-validated after revision and modification, and 20% of items were discarded (Shikha et al. 2016, p. 263–269).
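Reliability figures of the kind quoted above are typically obtained from an internal-consistency coefficient such as KR-20 for dichotomously scored MCQs. Since the section does not state which coefficient was used, the following Python sketch of KR-20 is offered only as an assumed illustration of how such a value is computed.

```python
# Generic sketch of the Kuder-Richardson 20 (KR-20) internal-consistency
# coefficient, one common way of computing reliability for dichotomously
# scored MCQ tests. Assumed for illustration; the exact coefficient used for
# the exams discussed here is not specified in the text.

def kr20(score_matrix):
    """score_matrix: list of examinee rows, each a list of 0/1 item scores."""
    n_items = len(score_matrix[0])
    n_examinees = len(score_matrix)

    # Proportion answering each item correctly (p); sum of p*(1-p) over items.
    p = [sum(row[i] for row in score_matrix) / n_examinees for i in range(n_items)]
    pq_sum = sum(pi * (1 - pi) for pi in p)

    # Population variance of the total scores.
    totals = [sum(row) for row in score_matrix]
    mean_total = sum(totals) / n_examinees
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_examinees

    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)

# Tiny worked example: 4 examinees, 3 items -> coefficient of 0.75.
print(kr20([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]))
```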