GRE model
Although scatterplots are typically used for superficial exploratory analyses, we can expand their utility by creating generative causal models that produce similar distributions. These models, in turn, may be queried and reasoned upon for interventions or policy formulation. Here, we first propose several candidate explanatory models and then suggest how the generated insights may be useful for test designers.
There is good reason to believe an association exists between Time to Graduation (years) and GRE Scores (Fig. 2; Tables 1 and 2). But how could the observed fuzziness and uncertainty in the distributions arise? Models, despite their simplifying assumptions, may help us better understand and explain observed phenomena. Here, by specifying various parameters that are potentially important, we can generate a causal explanation for the observed data.
For this purpose, we use standard exam performance as a proxy for evaluating the true quality of a cohort of students, comprising high-performing As, average Bs and Cs, and below-average Ds. This is directly analogous to the GRE studies, where, in an ideal scenario, we would also want GRE scores to be a proxy for eventual Time-to-Graduation in graduate students. To simplify the model, let us also assume that only factors affecting exam performance matter (setting aside other issues such as changes in life priorities, the quality of the mentor, lab culture, etc.). Then, our model scenario is as follows:
To mimic the lack of correlation found in GRE studies, we specify a model that can explain how no “apparent” correlation can occur in a group of students with varying levels of quality (A to D grade students). Suppose a student has to learn 10 topics for a course. Suppose really good students (the real As) master 9 topics, the above-average students (the real Bs) master 7 topics, the average students (the real Cs) master 5 topics, the below-average students (the real Ds) master 3 topics, and the worst ones (the real Fs) master 1 topic. Suppose an exam randomly covers 5 of the 10 topics, with each topic worth 20 marks out of 100. Then, depending on how many of a student's unmastered topics are included in the exam:

A students will get 80–100 marks,

B students will get 40–100 marks,

C students will get 0–100 marks,

D students will get 0–60 marks,

F students will get 0–20 marks.
From this very simple model, we can see that variance is largest for students in the middle range of understanding and tightens at the two extreme ends.
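These ranges follow mechanically from the assumptions. As a check, a minimal Python sketch can enumerate every possible 5-topic exam for each grade of student (by symmetry, it does not matter which specific topics are mastered):

```python
from itertools import combinations

TOPICS, EXAM, MARK = 10, 5, 20
mastered_counts = {"A": 9, "B": 7, "C": 5, "D": 3, "F": 1}

spread = {}
for grade, m in mastered_counts.items():
    mastered = set(range(m))  # which topics are mastered is arbitrary, by symmetry
    scores = [MARK * len(mastered & set(exam))
              for exam in combinations(range(TOPICS), EXAM)]
    spread[grade] = (min(scores), max(scores))
    print(grade, spread[grade])  # A (80, 100), B (40, 100), C (0, 100), D (0, 60), F (0, 20)
```

The printed minimum–maximum pairs reproduce the score ranges listed above.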
Based on these assumptions, when a student gets ≥ 80 marks in the exam, she could be an A, B, or C student, but not a D or F student. The probabilities of A, B, and C students getting ≥ 80 marks differ, and can be derived using the hypergeometric distribution:
\(P(\ge 80 \mid A)\) is read as the probability of obtaining a grade of 80 and above given that this is a real A student. Since \(P\left(\ge 80 \mid A\right)=100\%\), this can be taken to mean that all real A students will score 80 and above. While real A students easily obtain grades above 80, real B students also do quite well, at 92%.
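The hypergeometric calculation can be sketched directly in Python. This is a minimal sketch assuming each grade masters exactly the number of topics stated earlier; note that under this exact-mastery reading the intermediate grades come out lower than the figures quoted in the text, which appear to derive from the more variable versions of the model introduced below:

```python
from math import comb

def p_at_least_80(mastered, total=10, drawn=5):
    """P(score >= 80): at least 4 of the 5 examined topics are mastered."""
    return sum(comb(mastered, k) * comb(total - mastered, drawn - k)
               for k in range(4, drawn + 1)) / comb(total, drawn)

for grade, m in {"A": 9, "B": 7, "C": 5, "D": 3, "F": 1}.items():
    print(grade, round(p_at_least_80(m), 3))
```

As stated in the text, the A students always clear 80 (probability 1), while D and F students never do.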
Plotting this model should still show a clear negative correlation between examination scores and true student quality. To introduce more variability, we can specify ranges for correct preparation. We now say that really good students (the real As) master 7–10 topics, the above-average students (the real Bs) master 6–9 topics, the average students (the real Cs) master 5–8 topics, and the below-average students (the real Ds) master 4–7 topics. We set aside real Fs for now. Once again, suppose an exam randomly covers 5 of the 10 topics, with each topic worth 20 marks out of 100. We also assume there are equal numbers of A, B, C and D grade students (100 each). Now, we have the following probabilities:
These assumptions give both very good students and average students a shot at obtaining grades of 80 and above. Since there are equal numbers of A, B, C and D grade students, each category has an equal 25% prior probability. Suppose we also know empirically that the probability of getting a grade of ≥ 80 is \(P\left(\ge 80\right)=0.525\). Given all this information, we can express the probability of a student belonging to a particular category, given that a grade of ≥ 80 is obtained, as:
$$P\left(Category \mid \ge 80\right)=\frac{P\left(\ge 80 \mid Category\right)\times P\left(Category\right)}{P(\ge 80)}$$
where Category includes A to D. We can now work out the following probabilities:
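Under one reading of these assumptions (each student masters a uniformly chosen number of topics within their category's range), the Bayesian update can be sketched as follows; the computed evidence term comes out close to the empirical 0.525 quoted above:

```python
from math import comb

def p_pass(mastered, total=10, drawn=5):
    # P(>= 80 | mastered topics): at least 4 of the 5 examined topics are mastered
    return sum(comb(mastered, k) * comb(total - mastered, drawn - k)
               for k in range(4, drawn + 1)) / comb(total, drawn)

# mastery ranges from the text: A 7-10, B 6-9, C 5-8, D 4-7 topics
ranges = {"A": range(7, 11), "B": range(6, 10), "C": range(5, 9), "D": range(4, 8)}
prior = 0.25  # equal numbers of students in each category

# likelihood P(>= 80 | category): average over a uniform draw within the range
likelihood = {c: sum(p_pass(m) for m in r) / len(r) for c, r in ranges.items()}
evidence = sum(p * prior for p in likelihood.values())  # P(>= 80), about 0.52
posterior = {c: likelihood[c] * prior / evidence for c in ranges}
for c, p in posterior.items():
    print(c, round(p, 3))
```

The posteriors decrease monotonically from A to D, consistent with the observation below that high scorers are still mostly A category students.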
From these results, students who get ≥ 80 are still mostly A category (Fig. 7a). We also obtain a moderate negative correlation (corr = −0.44). Visually, we may still pick out the association.
We further modify and expand the model by incorporating more real world beliefs or observations. We increase the dominance of the centre part of the distribution by assuming that student categories are not equally distributed. Since students are often graded on a bell curve due to the belief that most people are average, it may be useful to incorporate normal assumptions on student categories.
We now let the extreme A and D category students be relatively rarer (50 each) and the moderate B and C category students be more common (150 each). This change alone is very effective, weakening the correlation from −0.44 to −0.23 (Fig. 7b). But a correlation still exists nonetheless, although it is now harder to pick out visually.
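This effect can be reproduced with a small Monte Carlo sketch (seed and the rank coding 1 = A through 4 = D are our assumptions, not the paper's): sampling a bell-shaped cohort under the range-based model and correlating category rank against exam score yields a modest negative correlation.

```python
import random

random.seed(0)
TOPICS, EXAM, MARK = 10, 5, 20
# mastery ranges from the text; categories coded by rank: 1 = A (best) ... 4 = D
ranges = {1: range(7, 11), 2: range(6, 10), 3: range(5, 9), 4: range(4, 8)}
counts = {1: 50, 2: 150, 3: 150, 4: 50}  # bell-shaped cohort

ranks, scores = [], []
for rank, n in counts.items():
    for _ in range(n):
        mastered = set(random.sample(range(TOPICS), random.choice(ranges[rank])))
        exam = random.sample(range(TOPICS), EXAM)
        ranks.append(rank)
        scores.append(MARK * sum(t in mastered for t in exam))

# Pearson correlation between category rank (1-4) and exam score
n = len(ranks)
mr, ms = sum(ranks) / n, sum(scores) / n
cov = sum((r - mr) * (s - ms) for r, s in zip(ranks, scores))
sd = (sum((r - mr) ** 2 for r in ranks) * sum((s - ms) ** 2 for s in scores)) ** 0.5
corr = cov / sd
print(round(corr, 2))  # negative: higher rank number (worse student) -> lower score
```

The exact value varies with the random seed, but the sign and rough magnitude match the trend described above.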
Thus far, we have assumed students study randomly, but in truth, students often “spot” questions based on previous years' papers or on personal beliefs. We call the students who are able to spot questions correctly “muggers”. We can add a condition to our model to account for them, by allowing a random 25% of students to be muggers who correctly spot questions. The model may also not be variable enough due to the limited number of topics being considered (10, of which 5 are examined), so we increase variability by scaling the number of topics from 10 to 100. Although introducing these assumptions increases “realism”, the impact is not significant: the mugger assumption does not produce a great enough upward pull for D category students, still resulting in a negative trend line (corr = −0.23). However, removing the D students creates a weakly correlated distribution, somewhat similar to the observed GRE plots (Fig. 7c). Interestingly, our GRE plot (Fig. 2) is also a truncated one, as students rejected from the PhD programs in the two studies are not recorded. It is reasonable to presume that many of them have below-cutoff GRE scores for admission to those programs.
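One way to encode these two changes in simulation is sketched below. The details here are our assumptions: mastery ranges are scaled proportionally from 10 to 100 topics, and a mugger is implemented as a student whose prepared topics are drawn from the examined topics first. The truncation effect can then be inspected by recomputing the correlation with D students removed.

```python
import random

random.seed(1)
TOPICS, EXAM = 100, 50  # scaled up: exam covers 50 of 100 topics, 2 marks each
ranges = {1: (70, 100), 2: (60, 90), 3: (50, 80), 4: (40, 70)}  # assumed scaled mastery
counts = {1: 50, 2: 150, 3: 150, 4: 50}  # bell-shaped cohort, rank 1 = A ... 4 = D

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sd = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / sd

ranks, scores = [], []
for rank, n in counts.items():
    for _ in range(n):
        exam = random.sample(range(TOPICS), EXAM)
        k = random.randint(*ranges[rank])  # number of topics this student masters
        if random.random() < 0.25:  # a "mugger": prepares the examined topics first
            extra = random.sample([t for t in range(TOPICS) if t not in exam],
                                  max(0, k - EXAM))
            mastered = set(exam[:k]) | set(extra)
        else:
            mastered = set(random.sample(range(TOPICS), k))
        ranks.append(rank)
        scores.append(2 * sum(t in mastered for t in exam))

print(round(pearson(ranks, scores), 2))  # full cohort: still a negative trend
r2 = [r for r in ranks if r != 4]
s2 = [s for r, s in zip(ranks, scores) if r != 4]
print(round(pearson(r2, s2), 2))  # truncated cohort (no Ds): weaker association
```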
It is unlikely that this model accounts for all the fuzziness seen in real-world GRE scatterplots. However, we show that incorporating even a limited set of variables can generate high fuzziness. We certainly should not expect the GRE alone to be an exact predictor.
There are a few more useful intuitions that we may draw from the model above. As educators, we know that when two students have “similar” scores (e.g., 90 vs. 85 marks), their understanding of a course is probably at the same level (i.e., given another test, their scores might well be 85 vs. 90); and when two students have a big gap in their test scores, their understanding of the course is almost surely at completely different levels. So, the level of correlation of marks to understanding depends on the “error bar” of the marks. If the error bar is big, there will not be a clear correlation. Yet, when we look at the two ends (e.g., comparing As with Cs, omitting Bs), we will see a clearer association (Fig. 7c). Furthermore, there is also the question of how many exams are enough to distinguish A and B students, given the conditional probability of obtaining ≥ 80 marks for each grade of students, as calculated before:
Prob(80+ | A) = 100% ,

Prob(80+ | B) = 92% ,

Prob(80+ | C) = 12% ,

Prob(80+ | D) = 0% ,

Prob(80+ | F) = 0% .
Eight exams are needed to reduce the chance of a B student scoring 80+ in every exam to about 50%, because it is too easy for B students to get 80+ on any single exam. If the exam covers more topics (e.g., 60% instead of 50% of the possible topics), then fewer exams are needed, as the wider coverage makes it harder for B students to get 80+. In other words, ensure that the exam has sufficient coverage. Such considerations may help test designers develop more robust assessments and more reliably gauge a candidate's aptitude.
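The arithmetic behind the “about 50%” figure is simply the product rule: assuming exams are independent, a B student clears 80+ on all \(n\) exams with probability \(0.92^n\).

```python
p_b = 0.92  # per-exam chance of a B student scoring 80+, from the model above
for n in range(1, 9):
    print(n, round(p_b ** n, 3))  # at n = 8 the product is about 0.51
```

Raising the per-exam difficulty (e.g., wider topic coverage) lowers \(p_b\) and shrinks the number of exams needed accordingly.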