The importance of low-stakes tests is well established in education systems worldwide (43). They offer feedback to students and faculties on current knowledge and curricula, respectively. However, a common challenge of low-stakes tests is potentially reduced test-taking effort, which not only compromises the validity of the test data (44) but also directly affects the feedback. Combined with our wish to include a PTM-wide comparison in our analysis, we therefore explored patterns in the answer behavior of PTM participants in order to find groupings of participants beyond test-taking effort, semester peers, and possible curricular differences.
To this end, we used a clustering algorithm to detect underlying patterns, followed by a boosting classifier. Finally, we passed the resulting model to an explainer that calculates the relevance of each feature for each class.
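The following sketch illustrates this three-step pipeline. It is a minimal illustration only: the library choices (scikit-learn's KMeans, a gradient-boosting classifier, the SHAP library as the explainer) and the input file are assumptions, not the confirmed implementation.

```python
# Minimal sketch of the three-step pipeline (cluster -> classify -> explain).
# Library choices (scikit-learn, shap) and the input file are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
import shap

# X: one row per participation, one numerically encoded answer per question.
X = np.load("answer_matrix.npy")  # hypothetical input

# Step 1: detect underlying answer patterns (five clusters, as found below).
kmeans = KMeans(n_clusters=5, random_state=0).fit(X)
labels = kmeans.labels_

# Step 2: train a boosting classifier to reproduce the cluster assignment.
clf = GradientBoostingClassifier(random_state=0).fit(X, labels)

# Step 3: pass the model to an explainer that yields the relevance of each
# feature (question) for each class (cluster).
explainer = shap.Explainer(clf.predict_proba, X)
relevance = explainer(X)  # shape: (participations, questions, classes)
```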
Parameters and Performance Measurements
We mainly used default parameters and did not tune the hyperparameters. Instead, we took the best result from 200 (cluster) and 100 (classifier) runs for further analysis. We chose this approach with the goal of integrating the pipeline into the standard analysis of the test, where tuning the hyperparameters for each test is not feasible. The performance measurements for the multiple runs of the clustering and classifier algorithms gave values in close and appropriate ranges. The CHS for the k-means runs differed by only 2.5, with a median of 215.04. The classifier accuracy ranged from 85.5% to 89.9% and the F1-score from 85.6% to 89.9%, with 89.9% for both measures in the final model. An accuracy of about 90% suggests that the classifier learns the assignment of answer patterns to the corresponding clusters well. We therefore concluded that the analysis pipeline gives meaningful and interpretable results without hyperparameter tuning.
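As an illustration of this best-of-N selection, the sketch below repeats the clustering 200 times and the classification 100 times, keeping the run with the best CHS or weighted F1-score, respectively. The scoring functions follow the text; the seeding and split strategy are assumptions.

```python
# Sketch of the best-of-N selection: CHS for k-means, F1 for the classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import calinski_harabasz_score, f1_score
from sklearn.model_selection import train_test_split

def best_kmeans(X, n_clusters=5, n_runs=200):
    best, best_chs = None, -np.inf
    for seed in range(n_runs):
        km = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed).fit(X)
        chs = calinski_harabasz_score(X, km.labels_)
        if chs > best_chs:
            best, best_chs = km, chs
    return best, best_chs

def best_classifier(X, labels, n_runs=100):
    # Held-out split for scoring each run (an assumption, not stated above).
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, stratify=labels, random_state=0)
    best, best_f1 = None, -1.0
    for seed in range(n_runs):
        clf = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
        score = f1_score(y_te, clf.predict(X_te), average="weighted")
        if score > best_f1:
            best, best_f1 = clf, score
    return best, best_f1
```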
Clusters
Our clustering yielded three ‘performance’ clusters and two ‘leaver’ clusters; the performance clusters are loosely structured around groups of study semesters, each corresponding to a certain segment of medical studies. These groups offer an alternative to traditional cohort-based feedback, in which participants are compared exclusively with peers of the same faculty and semester. Specifically, cluster 0 mainly contains participations from the end of the studies (semesters 9 and 10), while cluster 1 centers on semesters 5 to 7 and cluster 3 on semesters 3 to 5. All relevant questions of the ‘performance’ clusters had a discrimination index above 0.35 (excellence cut-off), and almost all exceeded the test average (0.42).
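For reference, one classical form of the discrimination index compares the proportion of correct answers between the top- and bottom-scoring groups. The sketch below uses the conventional upper/lower 27% split; this particular variant is an assumption, as the exact definition used for the PTM is not stated here.

```python
# Sketch of a classical discrimination index (upper-lower 27% convention).
# The exact variant used in the PTM analysis is not specified here.
import numpy as np

def discrimination_index(correct, total_scores, frac=0.27):
    """correct: 0/1 array for one question; total_scores: overall test scores."""
    order = np.argsort(total_scores)
    k = max(1, int(len(total_scores) * frac))
    lower, upper = order[:k], order[-k:]
    return correct[upper].mean() - correct[lower].mean()
```

Under this definition, a value above 0.35 means the question is answered correctly markedly more often by high scorers than by low scorers, which is why such questions separate groups well.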
The main difference between the relevant questions of clusters 0 and 1 was question difficulty. Relevant questions for cluster 0 were slightly more difficult than the average of the entire test (cluster 0: 33.69% versus PTM: 34.64%), while the relevant questions for cluster 1 were slightly easier than the overall average. Participants in both clusters predominantly answered the respective questions correctly and with high confidence.
However, it is important to note that ‘relevant questions’ should always be considered together with the corresponding answer patterns. For example, the relevant questions in cluster 3 showed a slightly higher average difficulty than those in cluster 1; yet Figure 6 shows that participants of cluster 3 answered the easier questions correctly with high confidence and guessed the difficult ones wrong.
In summary, the combination of a high discrimination index, question difficulty, and the respective answer patterns supports our claim that these questions serve as good group separators and further substantiates our ‘performance’ cluster groupings. Although the boundaries between clusters may be somewhat fluid, these indicators describe the respective main group of each cluster well.
Cluster 2 consists primarily of participants in the later stages of their studies. From their partial answer patterns, we can infer that most of them would probably have achieved scores comparable to those of the two upper performance clusters had they completed all 200 questions. The graphs in Figure 7 support this claim, showing similar self-assessment accuracy per confidence level for participants in clusters 0, 1, and 2 on the questions they actually answered; in other words, for their answered questions, the participants in cluster 2 showed answer characteristics similar to those of clusters 0 and 1.
The answer patterns of the most relevant questions of cluster 2, namely 12 mostly correctly answered questions within the first 35 and 7 unanswered questions in the second half of this PTM, could presumably explain why the majority of relevant questions for clusters 0, 1, and 3 lie beyond the first quarter of the test. Questions may be answered differently when they appear early in the test rather than later. The positions of the relevant questions and the associated answer patterns therefore lead us to believe that a question’s position should be treated as a feature in future analyses.
Cluster 4 must ultimately be considered in a differentiated way. Its mean score of -18.09 is the lowest of all clusters, and its relevant questions have the unique characteristic of being either unanswered or wrongly guessed. However, labeling the whole cluster as ‘leavers’ would be unfair to a large number of the allocated participations: besides all but six of the non-serious test takers (N=590 of 596 overall), it contains 61.09% of the first-semester students, who mainly guessed. This cluster therefore partially serves as a lower extension of cluster 3. For first-semester participants, the respective progress test can be considered a ‘baseline’, as it takes place at the beginning of the semester and thus of their studies. They do respond to many questions, but cautiously; hence, the rather low scores in this cluster are not surprising. In the line plots in Figure 7, the self-assessment accuracy of cluster 4 stands out: the mean value is below average for each confidence level, but the variance is much higher than for the other clusters, again suggesting that this cluster contains two different groups of participants. Separating cluster 4 into two clusters, one with non-serious participants and one with ‘beginners’, would most likely require a different scoring. However, our goal was not to find non-serious test takers, but to gain a new view of our participations and to group similarly performing participants in the PTM-wide picture. A more in-depth look at this cluster might be of interest for further research on the detection of non-serious test takers, by gaining insight into the differences between highly motivated but medically inexperienced early-semester students and poorly motivated higher-semester students.
Inclusion in PTM Feedback
Integrating the pipeline and the resulting clusters into the PTM analysis can complement the current feedback with a global estimate for both students and faculties. Students ask for additional detailed comparisons that include all participating faculties, beyond their own comparison cohort (their own semester at their own university) (14). Placing students in clusters based on their answer patterns is a first step in this direction. Great care must be taken to include all the information necessary to explain the results comprehensibly. With the help of the heat maps (Figure 4), participants can visually compare their answer patterns with those of all clusters and thus get a better overview of their current level of knowledge. For example, participants in cluster 2 can estimate their performance on the first half of the test better than with standard numerical feedback and averaged scores alone. Additionally, by classifying all participants into clusters, faculties receive feedback on participants’ performance that is independent of the semester-dependent curriculum. Grouping similarly performing participants can help to construct a profile focusing on their strengths and weaknesses.
Limitations
The influence of the imbalance in participation numbers across faculties and semesters needs further analysis. The largest participating institution administers the test every semester and admits new students twice a year, as opposed to other faculties, where enrollment takes place only at the beginning of the winter semester. Smaller faculties show a much more uneven distribution of test takers across semesters, and not all faculties offer the PTM every semester. It would not have been possible to address the semester-related and faculty-related imbalance while preserving the answer structure of the respective PTM. Since our goal was to group similar participations within one PTM run, we believe it was necessary to keep the original structure of the dataset in this respect.
In the original data, each participant’s answer is represented by an identifier composed of the confidence of the given answer and its correctness. For clustering, we translated these categories into numerical values, knowing that this translation may have an impact on our clustering results. Our scoring assignment was, however, based on a mathematical rationale, and as argued above, we find the resulting clusters reasonable and interpretable for our aim.
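To make this translation concrete, a mapping of this kind might look as follows; the confidence levels and numerical values shown are purely illustrative, not the scoring actually used for the PTM.

```python
# Illustrative translation of (confidence, correctness) identifiers into
# numbers; levels and values here are hypothetical, not the PTM scoring.
SCORE = {
    ("sure", True): 3,    ("sure", False): -3,
    ("unsure", True): 1,  ("unsure", False): -1,
    ("skipped", None): 0,
}

def encode(answer):
    """answer: (confidence_label, is_correct) identifier from the raw data."""
    return SCORE[answer]
```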
Future Research
In this study, we included only the participants’ scores in our analysis. A future research goal would be to include further variables to adjust for certain confounding factors; one such feature could be, for example, the time a participant spends on a question. These additions could help to find new cluster indicators and possibly provide even deeper knowledge of the differences between certain participant groups.
Longitudinal analysis of both the participants and the relevant features is also of high interest. How long participants remain in one cluster and when they transition to another during their studies could provide information on their knowledge growth and reveal developmental patterns. One usage example would be the identification of participants in need of support, possibly via early indicators.
Analyzing the relevant features of past PTMs could offer further insights into specific characteristics and identify relevant question groups.