Using the frequency of correct responses, the expected frequency for each distracter under equal plausibility is obtained by dividing the frequency of incorrect responses, N − c, by the number of distracters d; the M statistic is built from these expected values. Sample computations of the expected values are presented in Table 1. For example, with d = 3 distracters, N = 100 examinees, and c = 70 correct responses, the expected frequency per distracter is (100 − 70)/3 = 10.0. The distance of an observed frequency from its expected value can then be computed and may be negative or positive. Squaring this difference and dividing by the expected value yields a chi-square component with 1 degree of freedom. Summing these components across the d distracters gives M, which is tested as a chi-square statistic with d degrees of freedom.
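In symbols, the computation described in this paragraph can be sketched as follows. The notation O_j (observed frequency for distracter j) and E (expected frequency per distracter) is introduced here only for illustration; the degrees of freedom are those stated in the text.

```latex
E = \frac{N - c}{d}, \qquad
M = \sum_{j=1}^{d} \frac{(O_j - E)^2}{E}, \qquad
M \sim \chi^2_{d} \quad \text{under equal plausibility (df as stated in the text)}
% Example from the text: d = 3, N = 100, c = 70 gives E = (100 - 70)/3 = 10.0
```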
Table 2 illustrates how to test whether the distracters of a hypothetical item are equally plausible based on their frequencies. For instance, with c = 25 correct responses, the number of incorrect responses is N − c = 100 − 25 = 75, so the expected value for each of the three distracters is 75/3 = 25. If Distracters 1, 2, and 3 receive 35, 15, and 25 responses, respectively, their chi-square components (squared standardized differences) are 4, 4, and 0, which sum to M = 8. The corresponding p value for this statistic with df = 3 is 0.046, indicating that the observed frequencies differ significantly from the expected frequencies at the 0.05 level. Therefore, the distracters are not equally plausible.
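The worked example can be checked with a short script. This is a minimal sketch using only the Python standard library; the function names `chi2_sf` and `m_statistic` are hypothetical helpers, not part of the source, and the degrees of freedom are set to the number of distracters as the text states.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df (closed forms)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
        return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))
    # Odd df = 2m + 1: erfc(sqrt(x/2)) + exp(-x/2) * sum_{j=1}^{m} (x/2)^(j - 1/2) / Gamma(j + 1/2)
    tail = sum(half ** (j - 0.5) / math.gamma(j + 0.5) for j in range(1, df // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail


def m_statistic(n_examinees, n_correct, distracter_counts):
    """Return expected frequency, per-distracter components, M, and its p value."""
    d = len(distracter_counts)
    expected = (n_examinees - n_correct) / d
    components = [(o - expected) ** 2 / expected for o in distracter_counts]
    m = sum(components)
    # df = d follows the text; this choice is taken from the source, not derived here.
    return expected, components, m, chi2_sf(m, d)


# Hypothetical item from the text: N = 100, c = 25, distracter counts 35, 15, 25.
expected, components, m, p = m_statistic(100, 25, [35, 15, 25])
print(expected, components, m, round(p, 3))
```

With these inputs the expected frequency is 25, the components are 4, 4, and 0, M is 8, and the p value rounds to 0.046.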
Application to a Dataset
We analyzed a dataset consisting of test responses from a graduate-level statistics course with 198 participants. For six items, we recorded the frequency of responses for each option, marking the frequency of correct responses with an asterisk (*). For the M statistic, we computed the expected value per distracter, which is displayed in Table 3. For instance, in Item 1, 104 of the 198 examinees selected the correct answer, while 94 chose incorrect answers (distracters). Based on these data, the expected value per distracter was 94/3 ≈ 31.33.
Table 3
Frequencies (percentages) of response to all options for the 6 test items and the corresponding expected frequency per distracter
Item  A  B  C  D  N − c  Expected value

1  104*(53%)  37(19%)  41(21%)  16(8%)  94  31.33 
2  37(19%)  103*(52%)  19(10%)  39(20%)  95  31.67 
3  20(10%)  16(8%)  142*(72%)  20(10%)  56  18.67 
4  35(18%)  40(20%)  20(10%)  103*(52%)  95  31.67 
5  11(6%)  11(6%)  135*(68%)  41(21%)  63  21.00 
6  11(6%)  6(3%)  36(18%)  145*(73%)  53  17.67 
* frequency of correct responses (c) 
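The last two columns of Table 3 follow directly from the option frequencies. The sketch below recomputes them; the dictionary layout and the helper name `expected_per_distracter` are illustrative assumptions, while the counts and keyed options are copied from the table.

```python
# Option frequencies per item and the keyed (correct) option, copied from Table 3.
TABLE3 = {
    1: ({"A": 104, "B": 37, "C": 41, "D": 16}, "A"),
    2: ({"A": 37, "B": 103, "C": 19, "D": 39}, "B"),
    3: ({"A": 20, "B": 16, "C": 142, "D": 20}, "C"),
    4: ({"A": 35, "B": 40, "C": 20, "D": 103}, "D"),
    5: ({"A": 11, "B": 11, "C": 135, "D": 41}, "C"),
    6: ({"A": 11, "B": 6, "C": 36, "D": 145}, "D"),
}


def expected_per_distracter(freqs, key):
    """Return (N - c, expected frequency per distracter) for one item."""
    n_total = sum(freqs.values())      # N, the number of examinees
    n_incorrect = n_total - freqs[key]  # N - c
    n_distracters = len(freqs) - 1      # d
    return n_incorrect, n_incorrect / n_distracters


expected_table = {item: expected_per_distracter(f, k) for item, (f, k) in TABLE3.items()}
for item, (n_incorrect, expected) in expected_table.items():
    print(f"Item {item}: N - c = {n_incorrect}, expected = {expected:.2f}")
```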
The DISC and DIFF parameters were estimated using the dichotomous Rasch model to evaluate the psychometric properties of the items. As shown in Table 4, two items (Items 2 and 5) had negative discrimination values, indicating poor quality. There were three easy items (Items 3, 5, and 6, with negative logits) and three difficult items (Items 1, 2, and 4, with positive logits). Further analysis using M revealed that three items had equally plausible distracters (Items 2, 3, and 4), while three items had distracters of unequal plausibility (Items 1, 5, and 6). Item 5 was flagged as poor quality because of both its negative discrimination and its unequally plausible distracters. Overall, only Items 3 and 4 demonstrated good quality based on the assessed properties.
In this new approach, the detection of implausible distracters differs from the traditional approach popularized by Haladyna and Downing (1993) [8]. Although every distracter frequency except that of Distracter B of Item 6 (3%) exceeded the 5% criterion for functional distracters, Items 1, 5, and 6 were still flagged because their distracters were collectively ineffective in attracting less-able test takers. Hence, the new method can complement existing methodologies in distracter analysis by identifying items with dysfunctional distracters for further investigation.
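For comparison, the traditional frequency screen can be applied to the same counts. This is a minimal sketch that assumes a distracter is treated as nonfunctional when strictly fewer than 5% of all examinees choose it; the exact form of the cut-off is an assumption made here for illustration. The distracter counts are taken from Table 3.

```python
N = 198  # number of examinees

# Distracter (incorrect option) frequencies per item, from Table 3.
DISTRACTERS = {
    1: {"B": 37, "C": 41, "D": 16},
    2: {"A": 37, "C": 19, "D": 39},
    3: {"A": 20, "B": 16, "D": 20},
    4: {"A": 35, "B": 40, "C": 20},
    5: {"A": 11, "B": 11, "D": 41},
    6: {"A": 11, "B": 6, "C": 36},
}

THRESHOLD = 0.05  # assumed cut-off for a functional distracter

# Collect (item, option, percentage) for every distracter below the cut-off.
nonfunctional = [
    (item, option, round(100 * count / N, 1))
    for item, options in DISTRACTERS.items()
    for option, count in options.items()
    if count / N < THRESHOLD
]
print(nonfunctional)
```

Only Distracter B of Item 6 falls below the cut-off, so a frequency screen alone would not flag Items 1 and 5.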
Table 4
Rasch-based item DISC and DIFF estimates and results of equality of distracters’ plausibility tests
Item  DISC  DIFF  M  p value  Equally plausible?

1  0.244  0.519  11.51064  0.009  No
2  −0.062  0.546  7.663158  0.053  Yes
3  0.364  −0.577  0.571429  0.903  Yes
4  0.150  0.546  6.842105  0.077  Yes
5  −0.103  −0.358  28.57143  <0.001  No
6  0.475  −0.676  29.24528  <0.001  No
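The M values and decisions in Table 4 can be recomputed from the distracter counts in Table 3. The sketch below assumes three distracters per item, so it uses the closed-form upper tail of a chi-square distribution with 3 degrees of freedom (df = d, as in the text) and a 0.05 significance level; `chi2_sf_df3` is a hypothetical helper name.

```python
import math


def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 3 df (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Distracter frequencies per item, from Table 3 (correct option excluded).
DISTRACTER_COUNTS = {
    1: [37, 41, 16],
    2: [37, 19, 39],
    3: [20, 16, 20],
    4: [35, 40, 20],
    5: [11, 11, 41],
    6: [11, 6, 36],
}
ALPHA = 0.05  # assumed significance level

results = {}
for item, counts in DISTRACTER_COUNTS.items():
    expected = sum(counts) / len(counts)  # (N - c) / d
    m = sum((o - expected) ** 2 / expected for o in counts)
    p = chi2_sf_df3(m)
    results[item] = (m, p, p >= ALPHA)  # True means "equally plausible"
    print(f"Item {item}: M = {m:.5f}, p = {p:.4f}, equally plausible: {p >= ALPHA}")
```

The recomputed M values agree with Table 4, and the p values agree to within rounding in the third decimal place.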
The flagged items considered for revision are shown in Table 5, which gives the content of the three problematic items together with the recommended revisions at the distracter level. Item 1 was found to have unequally plausible distracters (Table 4); hence, Distracter D, which had the lowest frequency, was revised from “Wilcoxon signed-rank test” to “Welch t test”. The original option D was probably unattractive because of its association with paired data analysis. The Welch t test may be a more effective distracter because it is used as an alternative when the assumption of homogeneity of variances is violated.
Table 5
Item content and options of the three items flagged for unequal plausibility and recommended revisions with justifications
Item content and options (with recommended revisions) and reason(s) for revision

Item 1. The following are the characteristics of data: (1) dependent variable is measured at interval/ratio level; (2) There are two independent categories for the nominal independent variable; (3) The distribution in each group is normal; (4) The variances of the two groups are equal; (5) There are no observed outliers. What statistical tool is the most appropriate to compare the two groups?
A. Independent groups t test*
B. Mann‒Whitney U test
C. Paired t test
D. Wilcoxon signed-rank test (Replace with Welch t test)
Reason for revision: Unequal plausibility of distracters. Distracter D had the lowest frequency. The replacement is assumed to distract more effectively because the tool is used as an alternative when t test Assumption (4) is not met.

Item 5. The following are the characteristics of data: (1) Variables X and Y are measured at interval/ratio level; (2) X and Y are paired; (3) The distribution of the paired data are bivariate normal; (4) There are no observed outliers; (5) There is a linear relationship between X and Y. What statistical tool is the most appropriate to test the hypothesis that there is no linear correlation between X and Y?
A. Analysis of variance (Replace with Chi-square test of independence)
B. Paired t test (Replace with Point-biserial correlation)
C. Pearson product-moment correlation*
D. Spearman rank correlation
Reason for revision: Unequal plausibility of distracters. Distracters A and B had the lowest frequencies, possibly because both are tests of comparison. The replacements are assumed to distract more effectively because the tools are alternatives for testing relationships between variables.

Item 6. The following are the characteristics of data: (1) Variables X and Y are measured at interval/ratio level; (2) X and Y are paired; (3) The distribution of the paired data are not normal; (4) There are observed outliers; (5) There is a monotonic relationship between X and Y. What statistical tool is the most appropriate to test the hypothesis that there is no correlation between X and Y?
A. Analysis of variance (Replace with Chi-square test of independence)
B. Paired t test (Replace with Point-biserial correlation)
C. Pearson product-moment correlation
D. Spearman rank correlation*
Reason for revision: Unequal plausibility of distracters. Distracters A and B had the lowest frequencies, possibly because both are tests of comparison. The replacements are assumed to distract more effectively because the tools are alternatives for testing relationships between variables.

Items 5 and 6, which involve the assumptions of tests of relationships between variables, had unequally plausible distracters. In each item, two of the three distracters were comparison tests (ANOVA and the paired t test) and were therefore collectively less appealing as distracters. These tests are probably less relevant or plausible in the context of the question (the assumptions of a correlation test), making them less likely to be chosen by examinees who do not know the correct answer. Consequently, these distracters fail to challenge test takers and do little to divert them from the correct answer. Replacing them with other tests of relationships (e.g., the chi-square test of independence and the point-biserial correlation) may address this unequal plausibility.
Correlation of M with DIFF and DISC
To investigate potential relationships between DIFF and M, as well as between item DISC and M, we conducted correlation and regression analyses. The results indicated a moderate negative linear correlation between M and DIFF (r = −0.458), although this relationship was not statistically significant (p > 0.05). The direction suggests that M tends to be lower (that is, distracters tend to be more equally plausible) for more difficult items, but the relationship is not strong enough to be conclusive. No significant correlation was found between M and DISC (r = −0.037, p > 0.05), indicating that the discriminative power of an item is not linearly related to the plausibility of its distracters. Finally, there was a nonsignificant negative correlation between DIFF and DISC (r = −0.465, p > 0.05), implying that although more difficult items may tend to be less discriminating, this trend is not statistically significant. Overall, the analyses show no significant linear association among these metrics.
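The three correlations can be recomputed from the six rows of Table 4. In this sketch the negative signs on DISC (Items 2 and 5) and DIFF (Items 3, 5, and 6) are entered according to the description in the text, which is an assumption about the tabled values; `pearson_r` and `t_for_r` are hypothetical helpers, and 2.776 is the two-sided 5% critical value of t with 4 degrees of freedom.

```python
import math

# Values from Table 4; negative signs follow the description in the text.
M = [11.51064, 7.663158, 0.571429, 6.842105, 28.57143, 29.24528]
DIFF = [0.519, 0.546, -0.577, 0.546, -0.358, -0.676]
DISC = [0.244, -0.062, 0.364, 0.150, -0.103, 0.475]


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_for_r(r, n):
    """t statistic for testing a zero correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


T_CRITICAL = 2.776  # two-sided 5% critical value for 4 degrees of freedom

correlations = {
    "M vs DIFF": pearson_r(M, DIFF),
    "M vs DISC": pearson_r(M, DISC),
    "DIFF vs DISC": pearson_r(DIFF, DISC),
}
for label, r in correlations.items():
    t = t_for_r(r, len(M))
    print(f"{label}: r = {r:.3f}, t = {t:.3f}, significant: {abs(t) > T_CRITICAL}")
```

With only six items, a correlation would need an absolute value above roughly 0.81 to reach significance at the 0.05 level, which none of the three coefficients approaches.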
Despite the lack of significant linear correlations, we noted a polynomial trend in the relationship between M and DIFF. The scatterplot in Fig. 1 shows a curvilinear trend: as M increases, DIFF follows an inverted-U pattern with a maximum value near M = 10. Therefore, we conducted a polynomial regression to determine whether a polynomial function fits the empirical data.
Table 6
Results of polynomial regression showing significant coefficients for linear and quadratic trends
Coefficient  Estimate  Std. error  t value  p value

(Intercept)  −0.610253  0.187745  −3.250  0.04747 *
M  0.189268  0.033788  5.602  0.01124 *
I(M^2)  −0.006447  0.001007  −6.400  0.00773 **
Residual standard error: 0.1791 on 3 degrees of freedom
Multiple R-squared: 0.9461, Adjusted R-squared: 0.9101
F-statistic: 26.31 on 2 and 3 DF, p value: 0.01253
In Table 6, the polynomial regression showed a multiple R-squared of 0.9461, indicating that approximately 94.61% of the variance in DIFF is explained by the model. The adjusted R-squared, which corrects for the number of predictors in the model, was 0.9101, still indicating a strong fit. The overall significance test indicated that the model significantly predicted DIFF [F(2, 3) = 26.31, p = 0.01253].
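The fit summarized in Table 6 can be approximated with ordinary least squares on the six (M, DIFF) pairs from Table 4. This is a minimal standard-library sketch, not the authors' procedure: it solves the normal equations directly, the helper `solve` is hypothetical, and the negative DIFF values for Items 3, 5, and 6 are entered according to the text's description of negative logits.

```python
import math

# M and DIFF from Table 4; negative DIFF signs follow the text (Items 3, 5, 6).
M = [11.51064, 7.663158, 0.571429, 6.842105, 28.57143, 29.24528]
DIFF = [0.519, 0.546, -0.577, 0.546, -0.358, -0.676]


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x


# Design matrix with columns 1, M, M^2, then the normal equations (X'X) b = X'y.
X = [[1.0, m, m * m] for m in M]
xtx = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
xty = [sum(row[i] * y for row, y in zip(X, DIFF)) for i in range(3)]
b0, b1, b2 = solve(xtx, xty)

n, k = len(M), 2  # observations and predictors (M and M^2)
fitted = [b0 + b1 * m + b2 * m * m for m in M]
ss_res = sum((y - f) ** 2 for y, f in zip(DIFF, fitted))
mean_y = sum(DIFF) / n
ss_tot = sum((y - mean_y) ** 2 for y in DIFF)

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
rse = math.sqrt(ss_res / (n - k - 1))
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, b2 = {b2:.6f}")
print(f"R2 = {r2:.4f}, adj R2 = {adj_r2:.4f}, F = {f_stat:.2f}, RSE = {rse:.4f}")
```

The estimates should agree closely with Table 6, with small differences possible because the tabled DIFF values are rounded to three decimals.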
The polynomial regression results reveal a significant quadratic relationship between DIFF and M. The significant negative coefficient for M² indicates a downward-opening parabola: DIFF initially increases with M but decreases as M continues to increase. The high R-squared and adjusted R-squared values indicate that the model explains a large portion of the variability in DIFF. Given the significant p values for all the coefficients, we can infer that the equal plausibility metric (M) has a meaningful, nonlinear relationship with item difficulty (DIFF). This model can be useful for understanding how DIFF varies with M and can inform the design and evaluation of test items to achieve desired levels of difficulty.
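The turning point implied by the fitted quadratic can be worked out from the Table 6 estimates. The symbols b_0, b_1, b_2, and M* are introduced here for illustration, and the signs (positive linear term, negative quadratic term) are taken from the description in the text.

```latex
\widehat{\mathrm{DIFF}} = b_0 + b_1 M + b_2 M^2, \qquad
\frac{d\,\widehat{\mathrm{DIFF}}}{dM} = b_1 + 2 b_2 M = 0
\;\Longrightarrow\;
M^{*} = -\frac{b_1}{2 b_2} = \frac{0.189268}{2 \times 0.006447} \approx 14.68
```

Under these assumptions the fitted curve peaks at about M = 14.7, which can be compared with the maximum near M = 10 read visually from Fig. 1.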