**Definitions**

Any supervised binary classification problem requires two predefined classification groups. When it came to defining these two groups for our classification problem, we considered the propensity to guess to be the most identifiable indicator for this purpose, given the two sources of information at our disposal: the Likert scales provided by participants and the response time data collected by the test platform. We therefore divided responses into non-guessed answers, self-reported guesses (obtained from the Likert scales) and rapid guesses (extracted from response time data). Self-reported guesses and rapid guesses may overlap, but neither group includes any non-guessed response; it therefore makes sense to treat non-guessed and guessed answers (the latter comprising both self-reported and rapid guesses) as opposing response categories. It is then straightforward to follow a majority rule and allocate response patterns to one group or the other depending on whether more than 50% of their individual responses were guessed. This reasoning leads to the following instrumental definitions (which are meant to be valid only within the context of this paper):

**Self-reported guess:** We define “self-reported guess” as any answer marked as guessed by the test participants themselves.

**Rapid guess:** We define “rapid guess” as any response submitted less than *x* seconds after the previous one (or after starting the test in the case of responses to the first question), where *x* is defined by applying the normative threshold method with a 10 percent threshold (NT10) described by Wise and Ma in [12]. We have also set a lower threshold at 4 seconds in view of the very low share of correct answers among those submitted in 3 seconds or less (cf. Table 1).
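Under the assumption that NT10 sets each item's threshold at 10 percent of its mean response time, capped (following Wise and Ma) at 10 seconds, the labelling rule can be sketched as follows; the function names and the cap are assumptions for illustration, while the 10 percent fraction and the 4-second floor come from our setup:

```python
def rapid_guess_threshold(response_times, fraction=0.10, cap=10.0, floor=4.0):
    """Per-item rapid-guess threshold (sketch of the NT10 idea).

    Assumptions: NT10 takes `fraction` (10%) of the mean item response
    time, capped at 10 seconds; the 4-second floor is the lower threshold
    we set in view of the low share of correct very fast answers.
    """
    mean_rt = sum(response_times) / len(response_times)
    return max(floor, min(cap, fraction * mean_rt))

def is_rapid_guess(rt, threshold):
    """A response is a rapid guess if submitted in under `threshold` seconds."""
    return rt < threshold
```

A response time is then compared against the threshold of its own item, not against a single global value.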

**Guessed answer:** We define “guessed answer” as any answer labelled as either “self-reported guess” or “rapid guess”.

**Guessing test taker:** Participants are identified as “guessing test takers” when more than 50% of their responses were classified as guessed answers.

**Guessing pattern:** We define “guessing pattern” as a sequence of answers pertaining to a guessing test taker.
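The majority rule behind these definitions can be expressed in a few lines; the field names below are hypothetical, and a response counts as guessed if it is either a self-reported or a rapid guess:

```python
def is_guessing_test_taker(answers, threshold=0.5):
    """Apply the majority rule: a participant is a guessing test taker
    when strictly more than `threshold` (50%) of their responses are
    guessed answers. `answers` is a list of dicts with the hypothetical
    boolean fields 'self_reported' and 'rapid'."""
    guessed = sum(1 for a in answers if a["self_reported"] or a["rapid"])
    return guessed / len(answers) > threshold
```

Note that exactly 50% guessed answers does not suffice, since the rule requires *more than* half.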

The idea that there might be a difference between self-reported guesses and rapid guesses is supported by the breakdown of test answers by confidence levels, which shows that the percentage of correct answers among self-reported guesses was 36.18% for the test administered in the winter term 2020 [22]. This share is significantly higher than the expected percentage of correct answers for rapid guesses or random answer patterns; the probability of randomly choosing the right answer for a question also picked at random amounts to just 0.2353 for the test administered in the winter term 2020 [22]. This corroborates the hypothesis, also postulated by Kämmer et al. [24], that guessed answers were often not chosen entirely at random, but rather given by test takers who chose the option they considered most likely to be correct according to what they knew, even if they did not yet possess enough knowledge to provide a more confident response.

**Features**

We decided to construct a baseline feature set with only two variables (self-monitoring accuracy and share of answered questions) and evaluated to what extent adding the time spent on the exam to the input features improves the model results (if at all). Self-monitoring accuracy can be measured as the share of correct answers among those submitted (that is, excluding omitted responses) [24]. It can be considered almost tautological that guessing participants will usually have lower levels of self-monitoring accuracy than non-guessing ones, but the implications of this obvious fact have not been explored much further; the use of self-monitoring accuracy as a criterion to identify guessing patterns is rarely mentioned in the literature. Karay et al. [25] do acknowledge the existence of a relation between self-monitoring accuracy and guessing propensity in the context of formative tests with a “don't know” option, going on to assume that all incorrect answers in the analysed test (the PTM as of 2011) were guessed.
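As a sketch, the two baseline features could be computed per participation as follows; the tuple encoding of responses is an assumption for illustration:

```python
def baseline_features(responses):
    """Compute (self-monitoring accuracy, share of answered questions)
    for one participation. `responses` is a list of hypothetical
    (answered: bool, correct: bool) tuples, one per question.
    Accuracy is the share of correct answers among submitted ones,
    i.e. omitted responses are excluded from the denominator."""
    answered = [r for r in responses if r[0]]
    share_answered = len(answered) / len(responses)
    if not answered:  # no submitted answers: accuracy undefined, use 0.0
        return 0.0, share_answered
    accuracy = sum(1 for _, correct in answered if correct) / len(answered)
    return accuracy, share_answered
```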

The importance given to rapid responses in the literature on the detection of careless test takers prompted us to determine the convenience of including the time spent on the exam as a model feature, under the assumption that guessing test takers would typically devote less time to the test. In any case, it is possible that the baseline feature set already succeeds in capturing test-taking effort, thereby making it unnecessary to consider the time spent on the exam as well; this is why we want to compare how our logistic regression models fare with and without this feature.

Subsequently, we investigated if considering each question as a feature offers any improvement with respect to the baseline model. Our reason to do so is that appropriateness measurement methods are based on assessing patterns extracted from the sequence of answers given by each participant; therefore, every single response (or even the absence of it) is relevant to the result. We wanted to explore whether the performance of a logistic regression model is any better when every response is taken as input.

**Logistic regression**

We chose logistic regression as our binary classification method because it delivers results that are ultimately easier to explain to other interested parties (e.g. students) thanks to the relative simplicity of its logistic function; the inner workings of the other algorithms would be much harder to grasp for non-experts. This advantage in transparency makes logistic regression the foremost candidate if we are required to communicate our results to the public on a regular basis. Logistic regression has been described as “a parametric method used for examining the relationship between a binary response variable (one that is categorical having only two categories) and a set of independent predictor variables that can be either continuous or categorical” [27]. It can be conceived as a linear regression model where the dependent variable is the natural logarithm of the odds [28], according to the expression $\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$ [29]. In the context of this study, the variables $x_1, \dots, x_n$ stand for the possible features (number of answers, self-monitoring accuracy, time spent on the test); $p$ and $1-p$ represent the respective probabilities of each binary outcome. This makes it possible to express the threshold between guessing and non-guessing patterns as a simple linear equation, for example $\beta_0 + \beta_1 a + \beta_2 s + \beta_3 t = 0$, where *a* would stand for the number of answers, *s* for the self-monitoring accuracy, and *t* for the time spent on the test.
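To make the role of the logistic function concrete, the following sketch evaluates such a model for one participant; the coefficients passed in are purely illustrative, not fitted values from our study:

```python
import math

def guessing_probability(a, s, t, beta):
    """Evaluate a fitted logistic model for one participant.

    a = number of answers, s = self-monitoring accuracy, t = time spent
    on the test; beta = (b0, b1, b2, b3) are illustrative coefficients.
    The linear predictor is passed through the logistic (sigmoid)
    function to obtain a probability between 0 and 1."""
    z = beta[0] + beta[1] * a + beta[2] * s + beta[3] * t
    return 1.0 / (1.0 + math.exp(-z))
```

With a negative coefficient on self-monitoring accuracy, a lower accuracy yields a higher predicted probability of a guessing pattern, in line with the reasoning above.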

**Self-reporting on confidence**

Within our framework, the labelling of answers as “guessed” depends mostly on the self-reporting of participants rather than on response time, to such an extent that 94.84% of all answers marked as guessed in our dataset were self-reported guesses. In order to build reliable training and test sets for our models, we needed to remove data entries that are either inaccurate or inconsistent, or whose credibility cannot be verified due to lack of relevant information. In particular, we had to check whether self-reporting by participants matched a consensual definition of the confidence levels they were shown (“sure”, “likely” and “guessed”).

Our idea was to assign a fixed meaning to the labels “sure”, “likely” and “guessed”, that is, to define them in such a way that they all match numerically quantifiable degrees of certainty based on the average share of correct answers associated with each level. Thus, “sure” would imply a probability of 0.812 that an answer is correct; this probability would be 0.5847 for “likely” and 0.38 for “guessed”. For each participation, we postulated the null hypothesis that the underlying probability distribution for the number of correct answers among those marked as “sure” is the binomial distribution *B(N, P)*, where *N* is the total number of answers labelled as “sure”, and P = 0.812 is given by the proportion of correct answers among those labelled as “sure” in the whole dataset (excluding rapid responses). This null hypothesis was verified via a two-tailed test with a p-value of 0.0027, related to the three-sigma interval [0.00135; 0.99865] of a normal probability distribution; our intention here was to detect only the most extreme discrepancies from the norm given by P(correct | sure) = 0.812, so that we had the possibility of discarding participations that are truly incompatible with this value. We proceeded analogously for answers labelled as “likely” (P(correct | likely) = 0.5847) and “guessed” (P(correct | guessed) = 0.38). In short, we carried out three hypothesis tests per participation, one for each level of our Likert scale for confidence.

Before conducting these hypothesis tests, one must ensure that there is enough information available to do so, that is, that there are enough answers for at least one event *E* with P(*E*) < *p*/2, where *p* is the p-value, to be possible. For example, this is not the case if we have 3 guessed answers with P(correct) = 0.38: the least likely event would be to answer all three questions correctly, and the probability of this event would be P(3 correct answers) = 0.38³ = 0.054872 > 0.00135 = *p*/2. Since this is the least likely event, no possible event could refute the null hypothesis, so we would need a larger sample of answers in order to test it. But we are working with real data collected within the framework of an established formative test, so we cannot obtain any more answers from the participants who did not provide them in the first place.
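One such per-level test can be sketched with `scipy.stats.binomtest` (an assumption; any exact two-sided binomial test would do). The feasibility check exploits the fact that the least likely single outcome of a binomial distribution lies at one of the extremes, i.e. all answers correct or none correct:

```python
from scipy.stats import binomtest

ALPHA = 0.0027  # two-tailed p-value threshold, three-sigma interval

def confidence_level_check(n_correct, n_total, p):
    """Sketch of one hypothesis test for one confidence level of one
    participation. Returns 'inconclusive', 'rejected' or 'not rejected'.

    The test is only conclusive if some outcome has probability < ALPHA/2;
    since the binomial pmf is unimodal, its least likely outcome is
    either 0 or n_total correct answers."""
    if n_total == 0 or min(p ** n_total, (1 - p) ** n_total) >= ALPHA / 2:
        return "inconclusive"
    pval = binomtest(n_correct, n_total, p, alternative="two-sided").pvalue
    return "rejected" if pval < ALPHA else "not rejected"
```

With the paper's example of 3 guessed answers and P(correct) = 0.38, the function returns "inconclusive", matching the reasoning above.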

In summary, these hypothesis tests could return three possible outcomes: non-rejection of the null hypothesis, rejection of the null hypothesis, or inconclusiveness due to lack of information. We removed all entries where none of the three hypothesis tests could be carried out (that is, all tests were inconclusive due to lack of information) or where the null hypothesis was rejected in at least one of the tests, the latter suggesting that the meaning the participant assigned to at least one of the labels “sure”, “likely” and “guessed” differs significantly from its consensual definition based on the average values mentioned above.

We tested all 24,084 PTM participations registered through the ePT platform from October 2020 to June 2022 in order to examine their suitability for our study; 15,347 participations (63.72%) passed the hypothesis tests, while 6,258 (25.98%) failed at least one of them and were thus not included in our dataset. Finally, there were 2,479 participations (10.29%) for which all tests were inconclusive due to lack of information; 1,152 of these cases (4.78%) concerned participants who did not answer any question.

**Case-specific accuracy**

We do not make public allegations of guessing for all possible or likely cases, but only for the most flagrant instances of this behaviour. Therefore, we are also interested in determining whether our method can identify guessing test takers with a high degree of certainty. To this end, we ranked all test set instances according to their algorithm-assigned probabilities of a positive identification and focused on the results in the highest percentiles, so as to establish a threshold above which the algorithm's decision can be regarded as fully dependable.

**Comparison with person-fit indices**

In order to answer our research question about the performance of logistic regression vis-à-vis person-fit indices and methods based on rapid responses, we tested 11 dichotomous non-parametric person-fit indices included in the R package PerFit [30]. Since the PTM is not an IRT-based test, parametric person-fit indices were not assessed; furthermore, we decided not to analyse the NCI statistic on the grounds that it is linearly related to another person-fit index also included in the PerFit package (the GNormed statistic, also known as “normed Guttman errors”). The procedure employed for this comparison is roughly similar to the one applied by Nazari et al. in [21]: for the test set of every PTM test included in our study, we computed the ROC-AUC scores of the prediction vectors of each method (i.e., the vectors containing the values with which predictions are made) against the vectors containing the actual response pattern labels (1 for guessing patterns, 0 for other patterns).
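The comparison step reduces to computing a ROC-AUC score for each method's prediction vector against the label vector, here illustrated with `sklearn.metrics.roc_auc_score` (our actual computation used the R package pROC; the vectors below are invented toy data):

```python
from sklearn.metrics import roc_auc_score

# Toy data (assumption): 1 marks a guessing pattern, 0 any other pattern.
labels = [1, 1, 0, 0, 1, 0, 0, 0]

# Hypothetical prediction vectors, higher value = stronger evidence of
# guessing: e.g. logistic-regression probabilities vs. a sign-adjusted
# person-fit statistic.
scores_lr = [0.9, 0.8, 0.3, 0.2, 0.7, 0.4, 0.1, 0.35]
scores_pfi = [0.6, 0.9, 0.5, 0.1, 0.4, 0.3, 0.2, 0.7]

auc_lr = roc_auc_score(labels, scores_lr)    # perfect ranking here -> 1.0
auc_pfi = roc_auc_score(labels, scores_pfi)  # some misordered pairs -> 0.8
```

Because ROC-AUC only depends on the ranking induced by the prediction vector, person-fit statistics and model probabilities can be compared on the same footing without choosing a decision cutoff first.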

While we have chosen to set the guessing test taker threshold at a basis value of 50%, we have also explored how our method compares to person-fit indices when this threshold is set at a different value. To avoid large imbalances, we have only considered hypothetical thresholds where neither of the two classification groups accounts for more than 80% of the data entries; under this rule, the range of possible threshold values goes from *T* = 19% to *T* = 64%. We therefore repeated the procedure described in the previous paragraph for the 46 integer values in this range (*T* ∈ {19%, 20%, 21%, …, 64%}), with guessing test takers defined as participants whose share of guessed answers exceeds *T*. This helped us determine whether the performance of our method against person-fit indices is independent of the chosen cutoff value.
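The sweep over cutoff values can be sketched as follows; the inputs are illustrative, and in the actual study the logistic regression models are retrained for each relabelling, whereas person-fit statistics do not depend on the labels:

```python
from sklearn.metrics import roc_auc_score

def auc_by_cutoff(guessed_shares, scores, t_range=range(19, 65)):
    """Relabel participants for each integer cutoff T (in percent) and
    recompute ROC-AUC. `guessed_shares` holds each participant's share of
    guessed answers; `scores` is a fixed prediction vector (illustrative).
    A participant is a guessing test taker when their share exceeds T."""
    results = {}
    for t in t_range:
        labels = [1 if share > t / 100 else 0 for share in guessed_shares]
        if len(set(labels)) == 2:  # ROC-AUC is undefined for a single class
            results[t] = roc_auc_score(labels, scores)
    return results
```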

**Pipeline**

**Dataset**

We have used an anonymized dataset containing data from the four PTM tests administered from October 2020 to June 2022, known as PT43, PT44, PT45 and PT46 [23], which we found to be reliable according to the considerations explained in the “Self-reporting on confidence” section of this paper; this dataset contains records of 15,347 participations from ten medical schools in Germany, Austria and Switzerland. We have also considered the partial datasets corresponding to each PTM test from PT43 to PT46; these five datasets were combined with five different sets of features (see Table 2) to produce 22 logistic regression models. Combinations involving the global dataset and the sets of individual answers to each question were not considered, since participants in different tests were not asked the same questions; the inclusion of features based on specific questions therefore does not make sense for datasets combining data from multiple tests.

As a necessary preprocessing step, we decided to discard all data entries belonging to the following categories:

**1. Participations by students enrolled in their “practical year” (junior residency).** Students in their “practical year” are not considered for this study because their participation in the PTM test is voluntary; moreover, not all universities in the PTM consortium give these students the possibility to take part in the test [22]. Hence, we discarded 380 participations associated with such students.

**2. Participations lacking reliable data about the amount of time spent on the test.** 70 further participations were discarded because data about the amount of time spent on the test was unreliable or missing.

Our final dataset includes 14,897 participations, of which 5,116 were submitted by guessing test takers according to our definition.

**Model selection and hyperparameter setting**

Logistic regression was implemented using the sklearn.linear_model.LogisticRegression module of the Python library scikit-learn [31] [32]. All datasets used in this study were split into a training set with 80% of the data and a test set with 20% of the data. Parameter optimization was carried out with scikit-learn's RandomizedSearchCV, which implements a randomized search over parameters, where each setting is sampled from a distribution over possible hyperparameter values [33]. We ran RandomizedSearchCV 1000 times, keeping the hyperparameter sets that provided the best ROC-AUC score for each of the 22 combinations of data and features examined in this study.
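As an illustration of this setup, the following sketch runs a randomized search over logistic regression hyperparameters on synthetic stand-in data; the search space, the number of iterations and the data are assumptions for illustration only and do not reproduce our actual configuration:

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the real dataset (assumption): two features,
# binary label loosely driven by both features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# 80/20 train/test split, as in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative search space; the paper does not list the distributions used.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={
        "C": loguniform(1e-3, 1e3),           # inverse regularization strength
        "solver": ["lbfgs", "liblinear"],
    },
    n_iter=50,
    scoring="roc_auc",  # the metric optimized in the study
    random_state=0,
)
search.fit(X_train, y_train)
```

`search.best_params_` and `search.best_score_` then hold the best hyperparameter set and its cross-validated ROC-AUC score for this data-feature combination.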

**Metrics**

The accuracy levels shown in tables 3, 4 and 5 under “Accuracy (cv)” correspond to the highest accuracies reached with each algorithm-input combination as determined by 10-fold cross validation performed using the scikit-learn function cross_val_score(). 10-fold cross validation is a procedure whereby the training set is divided into 10 smaller sets (“folds”); then, each fold is used in turn as a test set while the other nine function as training sets. The value returned by 10-fold cross validation is the mean of the values computed for every iteration of the procedure [34].
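The cross-validation step can be sketched with `cross_val_score` on toy data (the data and model settings below are assumptions, not our actual inputs):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-in data (assumption): the label depends only on feature 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

# 10-fold cross-validation: ten accuracy values, one per held-out fold;
# the reported figure is their mean.
scores = cross_val_score(LogisticRegression(), X, y, cv=10, scoring="accuracy")
cv_accuracy = scores.mean()
```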

The results shown in tables 6, 7 and 8 refer to the final evaluation of the test set after the classification task. Thresholds to identify the most likely cases of guessing patterns in a real setting were derived from the precision values for the subsets comprising the 5%, 10%, 15% and 20% of test set items with the highest algorithm-assigned probability of corresponding to guessing test takers. These values, together with their associated confidence intervals, are shown in tables 6, 7 and 8 under “Precision (95th percentile)”, “Precision (90th percentile)”, “Precision (85th percentile)” and “Precision (80th percentile)”, respectively.
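The percentile-based precision values can be computed as in this sketch; the function name and the data encoding are assumptions for illustration:

```python
def precision_at_top(labels, probs, share):
    """Precision among the `share` (e.g. 0.05 for the 95th percentile) of
    test set instances with the highest predicted probability of being a
    guessing pattern. `labels` holds the true classes (1 = guessing)."""
    ranked = sorted(zip(probs, labels), reverse=True)  # highest prob first
    k = max(1, int(round(share * len(ranked))))
    top = ranked[:k]
    return sum(label for _, label in top) / k
```

A high precision in the top percentiles then justifies treating the corresponding probability cutoff as dependable for public identification.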

Since many person-fit indices rely strongly on the identification of uncommon answer sequences, we have based our comparison between logistic regression and person-fit indices on the four partial datasets, in order to avoid comparing sequences which do not refer to the same questions. All person-fit index computations were carried out with the R package PerFit; the ROC-AUC scores were computed with the R package pROC [35].