Student performance assessment is an important part of educational programs. Since learning objectives for university students are set through performance assessment, it receives serious attention from higher-education instructors and planners [1, 2]. All educational systems seek to raise learners' performance toward predefined objectives [3]. Setting passing/failing grades and/or an acceptable performance level (or minimum pass level) is the natural and common outcome of tests, and it matters not only for the learners but also at higher levels, including the school, city, state, and country [4]. Nevertheless, setting the cut-off point or passing criterion is rarely treated as a pillar of assessment [2].
The passing standard, passing criterion, or minimum pass level is a hypothetical boundary within the score range of a test that distinguishes individuals who have achieved mastery from those who have not [5-7]. Standard-setting methods are used to determine this cut-score or minimum pass level [8, 9]. Typically, a fixed score, such as 10 or 12, is taken as the minimum pass level in university examinations and employment tests [10]. Using a fixed pass level under all conditions is unfair, given the effect of factors such as differences in item difficulty, test administration, the ability level of the examinees, and the purpose of the test. Educational justice is therefore served when the minimum pass level for each test is set according to the conditions of that test [11].
In general, standard-setting methods are either item-centered or person-centered [12]. In item-centered methods (e.g., the Angoff method), test content is reviewed by a group of experts and judges, whereas in person-centered methods (e.g., borderline groups), the judges' decisions are based on the actual performance of the examinees [13]. According to the literature, Angoff [14] and Bookmark [15] are the most common item-centered methods. Considerable evidence shows the Angoff method to be the most common and best-known standard-setting method [17-19]. In the Angoff method, before the test is administered, a group of experts and judges is asked to review the content of each test question and predict the probability that a minimally qualified candidate will answer each item correctly. The resulting values are then discussed until all judges reach consensus on all questions. Finally, the mean of the scores the judges assign across all questions is set as the passing standard, or cut-score [19]. Nevertheless, this method has difficulties, such as a lengthy procedure and the need for an expert group [17, 20]. In addition, the ambiguity of the concept of a minimally qualified student is among its limitations [17, 21]. In an attempt to overcome these shortcomings, researchers proposed a new method which, besides being suitable for both multiple-choice and constructed-response questions, reduces the experts' workload, facilitates their decision-making, combines the experts' judgments with measurement models in determining the cut-score, and considers test content together with performance level [22]. This method, named Bookmark, was introduced by Mitzel, Lewis, Patz, and Green [15] and quickly gained acceptance.
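The arithmetic behind the Angoff cut-score can be sketched briefly. The following is an illustrative example, not a description of any particular study's implementation; the judge ratings are invented, and `angoff_cut_score` is a hypothetical helper name.

```python
# Sketch of the Angoff cut-score calculation: each judge estimates, for
# every item, the probability that a minimally qualified candidate answers
# it correctly; a judge's sum over items is that judge's expected raw score
# for the borderline candidate, and the cut-score is the mean across judges.

def angoff_cut_score(ratings):
    """ratings[j][i] = judge j's probability (0..1) for item i."""
    judge_sums = [sum(judge) for judge in ratings]
    return sum(judge_sums) / len(judge_sums)

# Three judges rating a four-item test (illustrative values):
ratings = [
    [0.8, 0.6, 0.5, 0.7],  # judge 1: expected raw score 2.6
    [0.7, 0.5, 0.4, 0.6],  # judge 2: expected raw score 2.2
    [0.9, 0.6, 0.5, 0.8],  # judge 3: expected raw score 2.8
]
cut = angoff_cut_score(ratings)  # mean of the three expected raw scores
```

In practice the panel discusses the ratings between rounds before this mean is finalized, as described above.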
In this method, the location of each item on the ability scale is first determined according to its difficulty index, estimated through item response theory (IRT), and the items are then arranged on separate sheets in order from the easiest to the most difficult. The expert panel is asked to place a bookmark between the questions at the point where, in each judge's view, the probability of a correct answer by a minimally qualified examinee is 50% or 67%. After the bookmarked items are identified, the corresponding difficulty value for each judge is extracted and the mean is calculated. The cut-score is then determined by converting this mean ability score into a raw score. In a second round, data such as the passing and failing rates implied by the obtained cut-score are given to the experts, who may then move their bookmarks. If any bookmark changes, the cut-score is recalculated and returned to the panel. This process continues until general consensus on the cut-score is reached [23-25].
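The conversion steps above can be sketched in simplified form. This is a minimal illustration assuming a Rasch (1PL) model and a 67% response-probability criterion; the item difficulties, bookmark placements, and function names are all hypothetical, and real Bookmark implementations involve panel discussion and more elaborate IRT machinery.

```python
import math

RP = 0.67  # response-probability criterion used by the panel

def rasch_p(theta, b):
    """Probability of a correct answer at ability theta, item difficulty b (1PL)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def bookmark_cut_score(difficulties, bookmarks, rp=RP):
    """difficulties: item b-parameters sorted from easiest to hardest.
    bookmarks: for each judge, the index of the bookmarked item."""
    # Ability at which the bookmarked item is answered correctly with
    # probability rp: solve rp = 1/(1+exp(-(theta-b))) for theta.
    thetas = [difficulties[k] + math.log(rp / (1 - rp)) for k in bookmarks]
    theta_cut = sum(thetas) / len(thetas)  # panel mean ability
    # Convert the ability cut into an expected raw score via the
    # test characteristic curve (sum of item probabilities).
    return sum(rasch_p(theta_cut, b) for b in difficulties)

diffs = sorted([-1.2, -0.5, 0.0, 0.4, 1.1, 1.8])   # ordered item booklet
cut = bookmark_cut_score(diffs, bookmarks=[3, 3, 4])  # raw-score cut
```

The second-round feedback loop described above simply reruns this calculation after judges move their bookmarks.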
Several studies have compared the Angoff and Bookmark approaches. Hsieh (2013) used both methods to assess students' language proficiency levels [26]. The mean scores obtained from 32 experts differed between the two methods for the three final cut-scores, and the study addressed the strengths and weaknesses of each method. Buckendahl, Smith, Impara, and Plake [27] compared the Angoff and Bookmark methods in standard setting for a 69-item mathematics test with a panel of 23 experts. Both methods produced similar cut-scores, but the standard deviation across experts was lower for the Bookmark method. Reckase compared the modified Angoff and Bookmark methods using simulated data under ideal conditions (without judgment or parameter-estimation errors) [28]. The Bookmark method showed a negative bias, meaning that the estimated cut-score was always lower than the cut-score hypothesized by the experts, whereas the modified Angoff method showed little or no bias. Olsen and Smith (2008) compared the modified Angoff and Bookmark methods for a home-inspection certification and found fairly similar results [29]; the two methods were also similar in terms of the judges' standard error and the initial cut-score. Similar results were obtained in Schultz's (2006) study as well [30].
Since implementing every standard-setting method and then selecting the best cut-score is not practically possible, choosing an appropriate method is an important part of test construction. Hambleton and Pitoniak reviewed various comparative studies but did not find any effective, generalizable result [31], and they highlighted the need for further comparative studies. Plake emphasized that most studies in this area have favored a specific method and that many factors, such as validity, still need to be examined [32]. Cizek and Bunch noted the methodological problems currently facing standard-setting methods [24]. According to Cizek, investigating and comparing the validity of different methods can identify the best one [33]. Kane (1994) emphasized collecting evidence to evaluate three types of validity (namely process, internal, and external) when appraising standard-setting methods. Process evidence pertains to the soundness of the execution procedure and the trustworthiness of the passing and performance standards, internal evidence relates to the degree of agreement among judges, and external evidence refers to the consistency of the obtained cut-score with an external criterion [34].
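Two of Kane's evidence types lend themselves to simple numeric indices. The sketch below is only illustrative, not taken from any cited study: internal evidence is summarized here as the standard error of the panel cut-score (lower means stronger inter-judge agreement), and external evidence as the gap between the panel cut-score and an externally established criterion; all values and function names are invented.

```python
import math
import statistics

def internal_evidence(judge_cuts):
    """Standard error of the panel cut-score across judges;
    smaller values indicate stronger inter-judge agreement."""
    return statistics.stdev(judge_cuts) / math.sqrt(len(judge_cuts))

def external_evidence(panel_cut, criterion_cut):
    """Absolute distance from an external criterion cut-score."""
    return abs(panel_cut - criterion_cut)

judge_cuts = [11.5, 12.0, 12.5, 11.0, 13.0]  # illustrative per-judge cut-scores
se = internal_evidence(judge_cuts)
gap = external_evidence(statistics.mean(judge_cuts), criterion_cut=12.4)
```

Process evidence, by contrast, is qualitative (how faithfully the procedure was executed) and does not reduce to a single index.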
Given the importance of standard-setting methods in establishing passing grades, especially in performance tests, the present study compared two common standard-setting methods, Angoff and Bookmark, through their validity indices. Previous comparisons of these and other standard-setting methods have been theoretical or based merely on passing rates and standard errors, whereas this study compared the process, internal, and external validity of the two methods to reveal the advantages of each on the basis of validity indices.