We report all conditions, measures, and data exclusions, and provide copies of all study materials on our Open Science Framework page (see Data and Code Availability). All research protocols were approved by the *name anonymized for review* University Human Research Protection Office and Institutional Review Board and were performed in accordance with relevant guidelines and regulations (including the Declaration of Helsinki), with informed consent obtained from all participants. Notably, our first two experiments were run in 2017 and 2018, before the introduction of AI-art innovations like Midjourney and DALL-E 2. In contrast, our last experiments were run in 2023, just as Midjourney Version 4 and DALL-E 2 were beginning to reach a national audience. We share these dates because we believe it is important to note, historically, that our data were collected both before and after recent (and sensationalized) coverage of AI art in prestigious media outlets like the New York Times [5] and Washington Post [6], documenting evaluations of art just as the implications of AI were beginning to be realized.
Sample size determination and randomization
All sample sizes were determined before data collection, and no additional data were collected after analysis began. For Experiment 1, we hypothesized a small to medium effect size (Cohen’s d = .33), which we used to determine a sample size providing roughly 95% power to detect an effect of that size. We then used effect sizes from earlier experiments to make sample size determinations for later experiments. Data quality was ensured in several ways. For instance, we removed duplicate responses (e.g., repeated IP addresses or research IDs), removed participants who failed attention checks, and collected entirely new samples for every study.
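For reference, the sample size implied by this power analysis can be reproduced with a standard a priori calculation. The sketch below is a minimal illustration using the pwr package and a paired-samples test (reflecting the within-subjects comparison in Experiment 1); both the package and the test type are assumptions, as the tool used for the original calculation is not specified.

```r
# Minimal a priori power calculation for Experiment 1 (illustrative only).
# Assumes the 'pwr' package and a paired-samples t-test; the tool actually
# used for the original calculation is not specified.
library(pwr)

pwr.t.test(d = 0.33,          # hypothesized small-to-medium effect size
           sig.level = 0.05,  # two-sided alpha
           power = 0.95,      # target power
           type = "paired")   # within-subjects comparison of label conditions
# Yields a required n of roughly 120 participants.
```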
Data analysis and reporting
All data analysis was conducted in R (v.4.2.2). Effect sizes were calculated as Cohen's d using the ‘effectsize’ package [24]. Whenever a t-test did not meet the assumption of equal variances, a Welch's t-test with corrected degrees of freedom was used instead. All reported p values are two-sided. Finally, all regression models include dummy-coded conditions that compare each experimental condition against the control condition designated for that experimental design.
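A minimal sketch of this analysis approach is shown below; the data frame and variable names (dat, rating, condition) are hypothetical placeholders rather than the actual study variables.

```r
# Illustrative analysis pipeline; 'dat', 'rating', and 'condition' are
# hypothetical placeholder names.
library(effectsize)

# Welch's t-test with corrected degrees of freedom (two-sided by default),
# comparing two label conditions
t.test(rating ~ condition, data = dat, var.equal = FALSE)

# Cohen's d for the same two-group comparison via the 'effectsize' package
cohens_d(rating ~ condition, data = dat)

# Regression with dummy-coded conditions, each compared against the
# designated control condition
dat$condition <- relevel(factor(dat$condition), ref = "control")
summary(lm(rating ~ condition, data = dat))
```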
Experimental samples and procedures
Pilot Study. We pre-tested the 28 images used in Experiments 1, 2, and 5 in a pilot study (n = 105) to ensure our stimuli captured a range of styles and quality. Half of these images were lesser-known paintings from respected artists (e.g., William Gear, Andy Warhol, and Paul Gauguin), while the other half were AI-generated images rendered in the same styles as those artists. These stimuli were chosen and tested to ensure (a) that participants could not tell the difference between human-made and AI-made art and (b) that participants in Experiments 1 and 2 could be presented with style-matched pairs randomly labeled as “human-made” or “AI-made”. Results confirmed that the images represented a range of quality (m = 4.31, sd = 1.46) and that participants generally could not tell the difference between images that were or were not AI-made. For example, when asked to guess the origin of an image (1 = definitely human-made, 7 = definitely AI-made), responses were generally below the midpoint regardless of each image’s actual origin, indicating participants thought images looked more “human-made” on average (human-made m = 2.79, sd = 1.58 vs. AI-made m = 2.89, sd = 1.56). To ensure our data accurately represented lay evaluations of stimuli participants were unfamiliar with (i.e., experimental fidelity), we asked, “Before taking this survey, had you seen any of these paintings before?” Participants who responded yes to this question in any experiment were removed before analysis, though supplementary analysis including their responses did not change the direction or significance of the effects reported in this article.
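As an illustration, the indistinguishability of the two image sets could be checked along the lines sketched below; the object names (pilot, guess, origin) are hypothetical, and the specific tests are offered only as one way to formalize the comparison described above.

```r
# Illustrative checks on the pilot origin-guess ratings (1 = definitely
# human-made, 7 = definitely AI-made); 'pilot', 'guess', and 'origin' are
# hypothetical placeholder names.

# Do guesses differ between images that were actually human-made vs. AI-made?
t.test(guess ~ origin, data = pilot, var.equal = FALSE)

# Are guesses below the scale midpoint of 4 (i.e., skewed toward "human-made")?
t.test(pilot$guess, mu = 4, alternative = "less")
```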
Experiment 1. We recruited 143 English-speaking US residents from MTurk. Participants were excluded for failing to pass attention checks or reporting they recognized stimuli used in the study, yielding a final sample of n = 119 participants (men = 52%, \({m}_{age}=34\)). Participants were paid $2 to complete the survey.
After rating three buffer images to acclimate participants to the task, all participants rated 14 images labeled as human-made and 14 images labeled as AI-made, all presented in a random order. To ensure that differences in artistic style did not confound any results, labels were randomly assigned within style-matched pairs (see our pretest) such that one image in each pair was always labeled as AI-made and the other as human-made. This allowed us to make comparisons between images labeled as human-made or AI-made while holding style constant. Participants then rated each painting on a battery of dimensions: how much they liked it, how skillfully it was painted, how colorful it was, whether they found it inspiring, how bright it was, how complex it was, how emotionally evocative it was, and whether they thought it was expensive (1 = Not at all, 7 = A great deal; α = .88). For exploratory purposes we also asked participants about their general affinity for art (e.g., “Some people seem to need art in their lives more than others; I consider myself that kind of person.”) and their feelings about technological innovations (e.g., “I tend to dislike new technologies.”). Supplementary analysis revealed these had no impact on our main findings.
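The random assignment of labels within style-matched pairs can be sketched as follows; the data structure and column names are hypothetical, offered only to make the counterbalancing logic concrete.

```r
# Illustrative counterbalancing: within each style-matched pair, one image is
# randomly labeled "human-made" and the other "AI-made". All names are
# hypothetical placeholders.
assign_labels <- function(pairs) {
  do.call(rbind, lapply(split(pairs, pairs$pair_id), function(p) {
    p$label <- sample(c("human-made", "AI-made"))  # random order within the pair
    p
  }))
}

# Example: 14 style-matched pairs (28 images total)
pairs   <- data.frame(pair_id = rep(1:14, each = 2), image_id = 1:28)
stimuli <- assign_labels(pairs)
```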
Experiment 2. We recruited 555 English-speaking US residents from MTurk. Participants were excluded for failing to pass attention checks, comprehension checks, or reporting they recognized any stimuli used in the experiment, yielding a final sample of n = 415 participants (men = 51%, \({m}_{age}=36\)). Participants were paid $2 to complete the survey.
After rating three buffer images to acclimate participants to the task, all participants rated 28 images in random order. Participants were randomly assigned to one of three conditions. In a control condition they were told we were simply interested in “how people perceive each painting on a number of dimensions.” In experimental conditions participants were either told “each painting was made by an artificial intelligence” or that “some of these paintings were made by a human and others were made by an artificial intelligence,” but not which ones. Participants then rated each painting on a battery of dimensions: how much they liked it, how skillfully it was painted, how colorful it was, whether they found it inspiring, how bright it was, how complex it was, how emotionally evocative it was, whether they thought it was expensive, and how much they would be willing to pay for it (1 = Not at all, 7 = A great deal; α = .94). For exploratory purposes we also asked participants about their mood during the study (e.g., “Overall, my mood is:”; -10 = Very unpleasant, 10 = Very pleasant) and about their personal tastes in art (e.g., “I feel I have good taste in art.”). Supplementary analysis revealed that these did not differ by condition and had no impact on our main findings.
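The internal consistency of such a rating battery (α = .94 above) can be computed as in the sketch below, which assumes the psych package and a hypothetical data frame named ratings with one column per dimension; the tool used for the original computation is not specified.

```r
# Illustrative reliability and composite scoring for the rating battery;
# 'ratings' is a hypothetical data frame with one numeric column per dimension
# (liking, skill, colorfulness, etc.). The 'psych' package is assumed.
library(psych)

psych::alpha(ratings)                        # Cronbach's alpha across dimensions
composite <- rowMeans(ratings, na.rm = TRUE) # composite evaluation per rating
```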
Experiment 3. We recruited 541 English-speaking US residents from MTurk. Participants were excluded for failing to pass attention checks, comprehension checks, or reporting they recognized stimuli used in the study, yielding a final sample of n = 405 participants (male = 53%, \({m}_{age}=38\)). Participants were paid $1 to complete the survey.
To increase the external validity of our findings, participants were given a cover story presenting the images as real paintings for sale at a private gallery:
On the next page, you'll be shown two images of paintings currently for sale at the Lenham Private Gallery. We are curious about consumer impressions of these paintings and the blurbs attached to them. Please review the painting and information provided by the gallery and answer all questions honestly.
Participants were then randomly assigned to one of four conditions where they rated two images. In a control condition, both images were labeled as human-made. In one experimental condition, both images were labeled as AI-made. In another experimental condition, the first image was labeled human-made and the second AI-made. And in a final experimental condition, the first image was labeled AI-made and the second human-made. Image order was always held constant. Human-made and AI-made labels read as follows: “The following painting was created by Jamie Kendricks, in January of 2019.” or “The following painting was created by an artificial intelligence program, which imagines and paints images entirely of its own accord, in January of 2019.” Building upon our cover story, all paintings were presented with unique ID numbers and fabricated gallery information (e.g., “Lenham Private Gallery ID: #A2461; Untitled, 2019; Oil on canvas; 24 in x 36 in”). Participants rated each image on a battery of dimensions: how much they liked each painting, how skillfully it was painted, how colorful it was, whether they found it inspiring, how bright it was, how complex it was, how emotionally evocative it was, whether the creator was talented, and whether they were impressed by the execution (1 = Not at all, 7 = A great deal; \({\alpha }_{image 1}=.86, {\alpha }_{image 2}=.90\)).
In addition, direct estimates of monetary value were obtained on a separate page immediately after participants evaluated each painting on the dimensions listed above. On this page, participants were informed about pricing with the prompt: “The average painting in the Lenham Gallery sells for somewhere between $50 to $220, with most pieces retailing at $150.” They were then asked, “How much do you personally think the Lenham gallery should sell this painting for?” and “Assuming that you wanted this painting and given the gallery's prices, how much would you pay to acquire it?” For exploratory purposes, we asked participants about their taste in art (e.g., “Compared to other people, I generally have a better eye for art.” and “I like artwork that depicts ‘real things’ more than I like artwork that is abstract.”). Supplementary analysis revealed artistic taste had no impact on our main findings.
Experiment 4. We recruited 792 English-speaking US residents from Prolific. Participants were excluded for failing to pass attention checks, comprehension checks, or reporting they recognized any stimuli used in the study, yielding a final sample of n = 789 participants (male = 49%, \({m}_{age}=38\)). Participants were paid $1 to complete the survey. Our pre-registration can be found here: https://aspredicted.org/DJV_MN7.
Participants were randomly assigned to one of three conditions where they were asked to evaluate two images. In a control condition, both images were labeled as human-made. In one experimental condition, the first image was labeled human-made and the second AI-made. In another experimental condition, the first image was labeled AI-made and the second human-made. We used the same labels and gallery information provided in Experiment 3. The perceived creativity of each image was measured by asking participants to rate how much they thought each painting was creative, novel, appropriate to be sold in a gallery, and likable (\({\alpha }_{image 1}= .78, {\alpha }_{image 2}= .82\)). As in Experiment 3, participants were then given information about pricing on a separate page and asked to estimate the monetary value of each painting. Participants were additionally asked to estimate labor with the item: “How many hours of active painting time do you think it took to create the painting above?” Finally, to make sure effects were not confounded by expertise in the domains of art and technology, participants responded to five items about artistic experience (e.g., “I used to [or currently] work in a job that primarily deals with the visual arts [e.g., designer, gallery manager, art dealer].”; \(\alpha = .79\)) and five items about technological experience (e.g., “I used to [or currently] work in a job that primarily deals with computer programming, data science, or engineering.”; \(\alpha = .72\)). Expertise did not differ by condition and supplementary analysis revealed it had no impact on our main findings.
Experiment 5. We recruited 731 English-speaking US residents, using Prolific to collect a representative sample of the U.S. population. Participants were excluded for failing to pass attention checks, comprehension checks, or reporting they recognized stimuli used in the study, yielding a final sample of n = 710 participants (male = 48%, \({m}_{age}=45\)). Participants were paid $1 to complete the survey. Our pre-registration can be found here: https://aspredicted.org/GJ4_VS4.
Participants responded to the same survey used in Experiment 4 with two differences. First, images were randomly selected and ordered from the larger pool of 28 pretested images used in Experiments 1 and 2. Second, participants were asked at the end of the survey to indicate their own specific attitudes toward AI (e.g., “I think Artificial Intelligence programs are a threat to human artists.” and “I think Artificial Intelligence programs are an exciting new tool for human artists.”; \(\alpha = .77\)). Notably, though participants who felt anxious about AI technology rated AI-labeled artwork less favorably overall, attitudes toward AI did not differ by condition, and supplementary analysis using attitudes toward AI as a control variable had no impact on our main findings.
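The supplementary covariate analysis described above might take a form like the following sketch; dat, rating, condition, and ai_attitudes are hypothetical placeholder names rather than the actual study variables.

```r
# Illustrative covariate-adjusted model: label-condition effects on
# evaluations, controlling for attitudes toward AI. All object names are
# hypothetical placeholders.
dat$condition <- relevel(factor(dat$condition), ref = "control")
fit_adjusted  <- lm(rating ~ condition + ai_attitudes, data = dat)
summary(fit_adjusted)
```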
Experiment 6. We recruited 698 English-speaking US residents, using Prolific to collect a representative sample of the U.S. population. Participants were excluded for failing to pass attention and comprehension checks, yielding a final sample of n = 527 participants (male = 45%, \({m}_{age}=49\)). Participants were paid $1 to complete the survey. Our pre-registration can be found here: https://aspredicted.org/DXP_K4M.
Participants were given the same cover story and prompt used at the beginning of Experiments 4 and 5 (i.e., rating “paintings currently for sale at the Lenham Private Gallery”) before being presented with two images in random order. These images were not drawn from our previous studies but were instead both generated by the AI program Midjourney and pretested on a separate sample (see our supplementary materials online) to ensure they were comparable in terms of participant evaluations of creativity, estimated monetary value, and estimated production time.
Participants were randomly assigned to one of two conditions. In a control condition, the first image was labeled with the tag: “The following painting was created by Jamie Kendricks, in January of 2019.” In our experimental condition, the first image was labeled with the tag: “The following painting was created by an artificial intelligence program, which imagines and paints images entirely of its own accord, in January of 2019.” For participants in both conditions, the second image was always labeled with the tag: “The following painting was created by the artist Avery Taylor, collaborating with an artificial intelligence program capable of imagining and painting images entirely of its own accord, in January of 2019.” Participants rated both images using the same items from Experiment 5 as well as an additional question, specific to the second image, that asked “how much work do you think was done by the AI vs the human?” using a sliding scale (0 = All AI Effort, 100 = All Human Effort).