To study the IV task scores, we first investigated the human judges' inter-rater reliability using Cronbach's alpha and McDonald's omega. These were evaluated for IV1 (α = .80; ω = .82) and IV2 (α = .70; ω = .72) respectively, indicating satisfactory inter-rater reliability and supporting the use of the CAT for score interpretation. Examples of stories are presented in Supplementary Information Table 1.
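For transparency, the following is a minimal sketch of how Cronbach's alpha can be computed from a stories-by-judges rating matrix. McDonald's omega requires fitting a factor model (e.g., with a dedicated package), so only alpha is shown here; the ratings below are illustrative, not our data.

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for a (stories x judges) matrix of ratings."""
    k = ratings.shape[1]                          # number of judges
    item_vars = ratings.var(axis=0, ddof=1)       # variance of each judge's scores
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of per-story summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative ratings: 5 stories scored by 3 judges on a 1-7 scale.
ratings = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [6, 6, 5],
    [3, 3, 4],
    [5, 4, 5],
])
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```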
In terms of descriptive statistics, we found slightly better apparent performance for GPT4 than for GPT3.5 (Table 1). Based on an ANOVA, some of these differences are significant (Table 2). There is a significant difference for the AUT, F(1, 68.82) = 22.06, p < .001, and Levene's test is significant at p < .001, rejecting the hypothesis of equality of variances. We therefore performed a Games-Howell post-hoc test, which confirmed the difference between GPT3.5 and GPT4 (MDiff = -4.74; t(68.82) = -4.70, p < .001). These results indicate that GPT4 can provide more ideas than GPT3.5. For the divergent tasks, the first indicator assessed, Fluency, showed similar significant differences: DV1 Fluency, F(1, 96.67) = 67.96, p < .001, and DV2 Fluency, F(1, 97.85) = 68.93, p < .001, confirming the superiority of GPT4 over GPT3.5 in the ability to generate a large number of ideas. In contrast, the second indicator, Elaboration, showed no significant differences between GPT3.5 and GPT4: DV1 Elaboration, F(1, 96.23) = 1.66, p = .20, and DV2 Elaboration, F(1, 97.45) = 1.85, p = .18. This lack of significance can be explained by the standardised methodology we used: a single instruction followed by one "relaunch" for further ideas. Moreover, ChatGPT cannot exceed a certain number of characters per response (2,048), which seems to explain the absence of differences between the two models. More surprisingly, there were no significant differences for IV1, F(1, 93.01) = 1.68, p = .20, whereas there were significant differences for IV2, F(1, 96.27) = 12.59, p < .001. Given that this is the first in-depth evaluation of the creative ideas provided by ChatGPT, we took a closer look at these results.
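The fractional degrees of freedom reported above (e.g., F(1, 68.82)) are consistent with a Welch-type ANOVA. A minimal sketch of this analysis pipeline using the pingouin library, assuming a long-format dataframe with hypothetical "model" and "fluency" columns (the data are simulated for illustration only):

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
# Illustrative long-format data: fluency scores for 50 "individuals" per model.
df = pd.DataFrame({
    "model": ["GPT3.5"] * 50 + ["GPT4"] * 50,
    "fluency": np.concatenate([rng.normal(10, 2, 50), rng.normal(15, 4, 50)]),
})

# Levene's test: a significant result rejects equality of variances.
print(pg.homoscedasticity(data=df, dv="fluency", group="model", method="levene"))

# Welch's ANOVA does not assume equal variances (hence the fractional df).
print(pg.welch_anova(data=df, dv="fluency", between="model"))

# Games-Howell post-hoc comparison, robust to unequal variances.
print(pg.pairwise_gameshowell(data=df, dv="fluency", between="model"))
```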
A detailed qualitative study of the stories written by GPT3.5 and GPT4 showed that several aspects of ChatGPT's creative story production need to be nuanced. From a descriptive point of view, some texts are noticeably plagiarized from well-known stories, as can be seen in the following example:
“Once upon a time, there was a curious little girl named Alice. (…) she meets a white rabbit who tells her she must find a key to return to the real world. Alice begins her quest to find the magic key. She encounters a smiling cat, a smoking caterpillar and a wicked Queen of Hearts. (…)”
In this example, produced by GPT3.5 for IV1, the similarity to Lewis Carroll's "Alice in Wonderland" is particularly striking. From the character's name to the magic key, the smoking caterpillar and the wicked Queen of Hearts, numerous elements have been placed one after the other, in a statistical fashion, recreating a mock retelling of Alice in Wonderland. Other strongly inspired stories can be found in Task IV1, drawing on C.S. Lewis's "The Chronicles of Narnia" saga or H. P. Lovecraft's "The Silver Key". For IV2, other stories appear, such as the Russian legend of the "Firebird" or the Grimm brothers' "Golden Bird". When such stories were detected by the human judges, scores of 2 or 3 were assigned depending on how much the stories varied from the originals. On the other hand, these few examples illustrate the nature of the creativity that GPT3.5 and GPT4 provide: an aura of creativity is present, but once you get down to the details of the content, you realise that LLMs, which generate one word after another according to statistical probability, are likely to yield similar stories in terms of content and/or form.
The qualitative study of the IV task results also revealed the recurrence of certain first names for the characters in the stories. Indeed, while generating quite similar stories, ChatGPT quite often uses identical names. Table 3 shows the number of different names given by ChatGPT, with a minimum of one name per story and a maximum of three (corresponding to the IV2 character instructions). The "Total after cleaning data" column was processed so that similar names were grouped together (e.g., Max & Maxime, Thomas & Tom, Maia & Maya). It seems important to note that, depending on the IV task, between 18% (IV2 GPT3.5: Max or Lucas) and 30% (IV1 GPT3.5: Lila) of stories had a character with the same name. Across all stories, 8.5% had at least one character with the same name (Elara and/or Rosaline and/or Lisa). This repetition of first names also seems characteristic of an LLM, where names are statistical responses to a given input, which was standardised here.
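A minimal sketch of the kind of cleaning step behind the "Total after cleaning data" column, assuming a hypothetical alias map and illustrative per-story name lists:

```python
from collections import Counter

# Hypothetical alias map grouping variant spellings under one canonical name,
# mirroring the "Total after cleaning data" column (e.g., Max & Maxime).
ALIASES = {"Maxime": "Max", "Tom": "Thomas", "Maya": "Maia"}

def canonical(name: str) -> str:
    return ALIASES.get(name, name)

# Illustrative character-name lists extracted from each story (1-3 per story).
stories_names = [["Lila"], ["Max", "Elara"], ["Maxime"], ["Lila", "Tom"]]

counts = Counter(canonical(n) for names in stories_names for n in names)
n_stories = len(stories_names)
for name, count in counts.most_common():
    share = 100 * sum(name in map(canonical, s) for s in stories_names) / n_stories
    print(f"{name}: {count} occurrences, in {share:.0f}% of stories")
```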
As stated before, we used the "Code Interpreter" function of ChatGPT, a data analysis module released in July 2023, to assess the creativity of ChatGPT stories. To parallel the human judges, we asked three different ChatGPT "judges" (each in a new conversation, so that they had no memory of previous scoring) to provide creativity scores using the EPoC system. Convinced that it could do the job, "Code Interpreter" scored the stories from "1 = Not at all creative" to "7 = Quite creative", but ChatGPT was inconsistent, with each judge being relatively uncorrelated with the other ChatGPT judges. Indeed, its scores (IV1: M = 3.18, s.d. = 1.01; IV2: M = 3.32, s.d. = 0.96) did not correlate well with those of the experimenters, and the correlations were not significantly different from zero (r = -.01 to .14; n.s.; see Table 4). ChatGPT's inter-judge reliability was unacceptably low (IV1: α = .21, ω = .49; IV2: α = .11, ω = .45).
A detailed look at the correlation matrix reveals that only a few correlations are significant. First, AUT fluency correlates moderately and positively with DV1 and DV2 Fluency (r = .31 to .33; p < .01), showing that when the AI generates stories, it displays an associated generative capacity. In the DV tasks, Fluency correlated rather strongly and positively with Elaboration (r = .45 to .59; p < .001), meaning that whenever one of the GPT3.5 or GPT4 "individuals" provides many ideas, it will also elaborate on them. Interestingly, the two Fluency tasks correlate rather strongly and positively with each other, r = .59, p < .001: when one of the GPTs provides many ideas on the first task, it also tends to provide many ideas on the second. The moderate, positive correlation between DV1 Fluency and IV2 Human Scoring (r = .31, p < .01) suggests that the more ideas the AI generated on this divergent task, the higher the creativity scores awarded on the convergent task. This element, although explaining 9.61% of the shared variance, nevertheless seems to have little to do with the other results in the correlation matrix and should not be interpreted further until additional studies are conducted. Overall, the positive correlations may be more related to "time of day" and server availability or unavailability. A server that is little used (for example, in Europe in the morning, when it is the middle of the night in the USA) will be much more available to generate ideas; conversely, at other times, it may be saturated and provide fewer ideas.
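For reference, a correlation matrix of this kind can be reproduced with pandas; the column names and data below are hypothetical stand-ins for our indicators:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100
# Illustrative scores per "individual"; column names are hypothetical.
dv1_fluency = rng.normal(10, 2, n)
df = pd.DataFrame({
    "AUT_fluency": dv1_fluency * 0.3 + rng.normal(0, 2, n),
    "DV1_fluency": dv1_fluency,
    "DV1_elaboration": dv1_fluency * 0.5 + rng.normal(0, 2, n),
    "IV2_human_score": rng.normal(3.5, 1, n),
})

# Pearson correlation matrix; r**2 gives the shared variance
# (e.g., r = .31 corresponds to about 9.6%, as noted in the text).
print(df.corr(method="pearson").round(2))
```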
To look further into how the IV stories were generated, and to learn more about the AI's creativity, we used Python code via the Code Interpreter module to perform hierarchical clustering analyses on all the stories. We then let a human decide on the optimal number of clusters based on indices such as the Silhouette Index and the Davies-Bouldin Index (DBI). The number of clusters generated per task and AI model is shown in Table 5 below.
As the Silhouette Index ranges from 0.46 to 0.55 and the DBI from 0.58 to 0.70, the number of clusters is deemed acceptable in each of these conditions. This "objective" indicator shows that three story types are generally present in most conditions. Each story type is then repeated a large number of times with variations, corresponding to the probabilities with which the LLMs produce words one after another. This helps explain why, in task IV2, there was a significantly different (human) creativity score between GPT3.5 and GPT4. Indeed, the "fantastic" criterion is part of the EPoC manual's scoring grid for creative ideas, and fanciful ideas are much more frequent for GPT4, which is also the only condition with a fourth cluster, showing somewhat more variety in the stories. Thus, even if the increase is descriptively not large (M = 3.33 for GPT3.5 vs. M = 3.88 for GPT4), we can still argue that GPT4's higher score in IV2 is due to a greater propensity to generate variation between the stories and the characters.
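A minimal sketch of this clustering procedure, assuming the stories are represented as TF-IDF vectors (the exact representation produced by Code Interpreter may differ) and that a human compares candidate cluster counts using the two indices:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Illustrative corpus; in the study, each item would be one generated story.
stories = [
    "Once upon a time a curious girl followed a white rabbit.",
    "A young girl found a magic key and met a smiling cat.",
    "A knight set off to capture the legendary firebird.",
    "The firebird's golden feather led the prince astray.",
    "A scientist built a machine that painted dreams.",
    "An inventor's dream machine escaped the laboratory.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(stories).toarray()

# Compare candidate cluster counts; higher silhouette and lower DBI are better.
for k in range(2, 5):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    sil = silhouette_score(X, labels)
    dbi = davies_bouldin_score(X, labels)
    print(f"k={k}: silhouette={sil:.2f}, DBI={dbi:.2f}")
```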
Finally, the multifactorial approach to creativity, assessed with EPoC's norms for the French population, allows us to give an objective score in comparison with humans. As the EPoC is a test designed for children and teenagers, we chose to compare the scores at the maximum age possible, that of a teenager in "ninth grade" (the end of French middle school). The results are presented in Table 6. The EPoC quotients can be interpreted like IQ scores (M = 100, s.d. = 15 in the calibration population). For the ninth grade, the maximum score available in the norms for these quotients is 138. The EPoC can also be used to identify individuals who would be "High Potential" Verbal Creatives, if they score at least one standard deviation above the mean on the two quotients previously mentioned.
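As an illustration of this quotient logic, the sketch below approximates a quotient from a raw score with a linear transform (the actual EPoC norms rely on lookup tables, and the norm values here are hypothetical), applying the 138 ceiling and the "High Potential" criterion:

```python
def epoc_quotient(raw: float, norm_mean: float, norm_sd: float,
                  ceiling: int = 138) -> float:
    """Approximate an EPoC quotient (M = 100, s.d. = 15) from a raw score.

    The real norms use lookup tables; this linear transform is illustrative.
    """
    q = 100 + 15 * (raw - norm_mean) / norm_sd
    return min(q, ceiling)

def high_potential(dvq: float, ivq: float) -> bool:
    """At least one s.d. above the mean (>= 115) on both quotients."""
    return dvq >= 115 and ivq >= 115

# Hypothetical ninth-grade fluency norms (mean = 12, s.d. = 4): a very high
# raw score hits the 138 ceiling, as GPT4 did on the divergent scale.
print(epoc_quotient(raw=30, norm_mean=12, norm_sd=4))  # -> 138
print(high_potential(dvq=138, ivq=110))                # -> False
```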
The results show that GPT4 is better overall on the Divergent Verbal Quotient (DVQ) than GPT3.5 (t(98) = 2.74, p < .01). However, for verbal divergent thinking scores, the scoring system may need to be reviewed: it is based on fluency, and ChatGPT showed a ceiling effect when scored against human fluency norms. GPT4 always scored 138 on this scale (s.d. = 0), indicating a lack of variability in scores. As mentioned above, for tasks requiring content generation, it is normal for GAIs to outperform humans. It is more interesting to study the Integrative Verbal Quotient (IVQ). Although above the average for a ninth-grader (population mean = 100), the scores are below the first standard deviation, indicating that the perceived creative quality of the stories provided by GPT3.5 and GPT4 is not particularly high. It is important to note, however, that GPT4 has a statistically higher IVQ than GPT3.5 (t(98) = 3.60; p < .001), indicating that GPT4 performs better on creative integrative tasks.
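The df = 98 in these comparisons is consistent with an independent-samples t-test over two groups of 50 "individuals". A minimal sketch with scipy, on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Illustrative IVQ scores for 50 GPT3.5 and 50 GPT4 "individuals".
ivq_gpt35 = rng.normal(108, 6, 50)
ivq_gpt4 = rng.normal(112, 6, 50)

# Independent-samples t-test (df = 50 + 50 - 2 = 98, matching t(98) above).
t, p = stats.ttest_ind(ivq_gpt35, ivq_gpt4)
print(f"t(98) = {t:.2f}, p = {p:.3f}")
```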