Art or Artifact: Evaluating the Accuracy, Appeal, and Educational Value of AI-Generated Imagery in DALL·E 3 for Illustrating Congenital Heart Diseases

Artificial Intelligence (AI), particularly AI-generated imagery, has the potential to transform medical and patient education. This research explores the use of AI-generated text-to-image illustrations in medical education, focusing on congenital heart diseases (CHD). Using ChatGPT's DALL·E 3, the research aims to assess the accuracy and educational value of AI-created images for 20 common CHDs. A total of 110 images of the normal human heart and 20 common CHDs were generated through DALL·E 3. Then, 33 healthcare professionals (HCPs) of varying levels systematically assessed these AI-generated images using a purpose-developed framework, rating each image individually for anatomical accuracy, usefulness of in-picture text, appeal to medical professionals, and potential for use in medical presentations. Each item was rated on a three-point Likert scale. The assessments produced a total of 3630 image assessments. Most AI-generated cardiac images were rated poorly: 80.8% of images were rated as anatomically incorrect or fabricated, 85.2% as having incorrect text labels, and 78.1% as not usable for medical education. Nurses and medical interns had a more positive perception of the AI-generated cardiac images than faculty members, pediatricians, and cardiology experts. Complex congenital anomalies were significantly more prone to anatomical fabrication than simple cardiac anomalies. Significant challenges were identified in image generation. These findings suggest adopting a cautious approach to integrating AI imagery in medical education, emphasizing the need for rigorous validation and interdisciplinary collaboration. The study advocates for future AI models to be fine-tuned with accurate medical data, enhancing their reliability and educational utility.


Introduction
Illustrations and images are powerful methods to convey rich information and are widely used in medical practice [1]. The saying "a picture is worth a thousand words" appropriately highlights the value of medical illustrations in effectively conveying information to healthcare professionals and patients. This principle emphasizes the role of visual aids in simplifying complex medical concepts, making them more understandable and impactful. In instructional design, it is established that images enhance learning, a concept supported by the literature [2][3][4]. This enhancement is supported by the mental model theory, which holds that text and pictures facilitate the creation of both verbal (propositional) and visual mental models [5][6][7]. These models are then integrated into the learner's working memory as an aid to understanding and smooth future retrieval [6]. Images are generally considered less cognitively demanding than text. Text needs to be interpreted into concepts and then into a mental model, whereas images directly assist in creating a mental model due to their visual nature [8].
AI-powered text-to-image generators (AI-TIG) hold promise for medical illustrations, supporting self-learning principles such as self-determination theory, adult learning theory, and the experiential learning cycle [9,10]. These tools cater to learners' motivation and autonomy, aligning with adult learning's self-directed nature and experiential learning's emphasis on a four-stage cycle, namely concrete experience, reflective observation, abstract conceptualization, and active experimentation, which can be applied to AI-TIG medical images and training scenarios [11,12]. AI-TIG can also create realistic and interactive simulations of medical situations, such as surgeries, emergencies, or clinical scenarios, that can help students and practitioners learn and practice their skills and knowledge [13].
OpenAI announced DALL•E, a deep learning model, on January 5, 2021 [14]. It is a transformer-based model trained to generate images from text prompts. In 2023, AI-TIG applications such as DALL•E 3 and Midjourney advanced significantly, creating more detailed images [15,16]. DALL•E 3, more detailed than its predecessor DALL•E 2, translates words into vibrant images and integrates with ChatGPT [17].
In medicine, previous AI-TIG models, like DALL•E 2, have shown potential, such as in the field of radiology [17]. These tools generated "realistic" x-ray images from text prompts and were seen as promising for image augmentation and manipulation in healthcare. However, their capabilities in generating specific images, such as CT, MRI, or ultrasound, or in generating images with pathological abnormalities, like fractures or tumors, remained limited [17]. There is a growing interest in exploring how these tools can be fine-tuned and adapted for medical applications [18,19].
Previous studies have investigated deep learning, specifically neural networks, to model cardiac anatomies representing the various types of Congenital Heart Diseases (CHD) and heart shape variations in cardiac disease. However, none of them evaluated in depth the educational value of the widely available deep learning AI-TIG DALL•E 3 [20][21][22][23][24][25][26]. We aimed to investigate the effectiveness and accuracy of DALL•E 3 in producing illustrations for medical education, with a focus on CHD.
The study evaluated the accuracy and educational value of AI-TIG images for 20 common CHDs. Additionally, we explored medical professionals' and students' perceptions of the utility and visual appeal of these AI-generated images in an educational context.

Study design:
Our model evaluation study investigated the tendency of DALL•E 3 to generate scientifically accurate versus fictional images of common heart lesions. We conducted text-to-image generative experiments with prompts designed to resemble the way medical students or general healthcare providers might plausibly use DALL•E 3 in clinical and medical education applications, taking CHDs as the example (Appendix-1).

Selection of CHDs:
In the first phase of our study, we identified the most relevant CHDs for educational purposes. This was achieved through the expertise of two pediatric cardiology experts (Drs. AAH and MAG). They compiled a list of the top 20 CHDs that they frequently discussed in their educational sessions. This list (Appendix-1) served as the foundation for the subsequent AI-TIG process.

Prompt Optimization and Selection Strategy:
This phase focused on choosing the most effective prompts for generating illustrative images of CHDs, to ensure the reproducibility and educational relevance of the AI-generated images. It involved:
1. Pilot Testing of Various Prompts: We experimented with different prompt structures to ensure these prompts produced similar images.
2. Reverse Engineering: A unique 'reverse engineering' approach was also employed. Here, we uploaded actual CHD illustrations into DALL•E 3, allowing the AI-TIG to describe them. The same text was then used to generate new images of the same CHD. This method helped enhance the prompt strategy by optimizing its text to match DALL•E 3's expectations and algorithm as closely as possible.
3. Expert Panel Evaluation: A panel of medical experts reviewed the images from these various prompts.

Consistency Analysis:
We assessed visual similarities of images produced from different prompts.

Final Prompt Selection:
The final prompt template (described below) was selected by the expert panel as the one most likely to be used by medical students, healthcare providers, or laypersons seeking illustrations of CHD from an AI-TIG (DALL•E 3).

Generation of Illustrative Images:
The illustrative images were created using ChatGPT-4 integrated with DALL•E 3, under the supervision of the principal investigator, Dr. MHT. Over the course of three consecutive days, from November 29 to December 1, 2023, a series of prompts were issued to generate "accurate and educationally useful" illustrations based on the methodology described above. MHT used the prompts in ChatGPT-4 as follows: "Draw an accurate illustration of [CHD] to simplify it for medical students, with text in the image to clarify the illustration" (Appendix-1). The aim was to produce a range of visual representations for each CHD, with five repetitions for each. Ten images of a normal heart were also generated to establish a baseline for comparison, with the following prompt: "Draw an accurate illustration of a normal human heart to simplify it for medical students, with text supported image to clarify the illustration."
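The generation plan above (20 CHDs with five repetitions each, plus ten normal-heart images) can be sketched as a small script. This is only an illustration of the counting and templating: the two prompt strings are quoted from the text, while the lesion names are hypothetical placeholders (the actual list is in Appendix-1) and `build_prompt_plan` is a name introduced here, not part of the study's tooling.

```python
# Prompt templates quoted from the study text.
CHD_TEMPLATE = (
    "Draw an accurate illustration of {lesion} to simplify it for medical "
    "students, with text in the image to clarify the illustration"
)
NORMAL_PROMPT = (
    "Draw an accurate illustration of a normal human heart to simplify it for "
    "medical students, with text supported image to clarify the illustration."
)

# Hypothetical placeholders: the real list of 20 CHDs is given in Appendix-1.
CHD_NAMES = [f"CHD lesion {i}" for i in range(1, 21)]
REPEATS_PER_CHD = 5        # five repetitions per CHD, as stated in the text
NORMAL_HEART_IMAGES = 10   # ten baseline images of a normal heart


def build_prompt_plan(chd_names, repeats, normal_count):
    """Return a list of (label, prompt) pairs, one per image to generate."""
    plan = [("normal heart", NORMAL_PROMPT) for _ in range(normal_count)]
    for name in chd_names:
        prompt = CHD_TEMPLATE.format(lesion=name)
        plan.extend((name, prompt) for _ in range(repeats))
    return plan


plan = build_prompt_plan(CHD_NAMES, REPEATS_PER_CHD, NORMAL_HEART_IMAGES)
print(len(plan))  # 10 normal-heart prompts + 20 * 5 CHD prompts = 110
```

Each entry in `plan` corresponds to one prompt issued in ChatGPT-4; the sketch does not call any image-generation API.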

Development of the Image Assessment Framework:
A key component of our study was the development of a robust, systematic framework for assessing the generated images. To accomplish this, an interdisciplinary expert panel was assembled, including two pediatric cardiologists (AAH, MAG), a cardiac surgeon (RN), an anatomist (MB), a medical educator (MAM), and two pediatricians (MHT, AAE). The panel developed a concise yet comprehensive evaluation tool, focusing on four key parameters: anatomical accuracy, value of integrated image-text, visual appeal to medical professionals, and usefulness for educational purposes. Each image was assessed against the following criteria: Image Accuracy (accurate (score 3), midway (score 2), fabricated (score 1)) compared to predefined criteria for each CHD and a "gold standard image", described below.
Validation of the Assessment Tool: Prior to its application, the assessment tool described above underwent a thorough review process involving all co-authors of the study. This was essential to ensure the clarity and face validity of the tool for all team members.

Images Review and Assessments:
For the review and assessment phase, an online interface was set up on SurveyMonkey (Appendix 1). This platform hosted the collection of 110 colored images (10 normal heart and 100 CHDs). The assessment criteria (Appendix 1) were also embedded in the data collection tool [28,29]. Alongside each image, the assessment scale was provided. The assessors were granted one-time access to this data-assessment portal, where they employed the agreed-upon assessment tool to evaluate each image. This method facilitated efficient and systematic data collection.

Ethical Considerations:
The Institutional Review Board (IRB) granted the approval of the proposal (Ref.No. 23/0155/IRB), and informed consent was obtained from the evaluators before their voluntary participation.

Statistical Analysis:
The mean and standard deviation were used to describe continuous variables, and frequencies and percentages were used for categorical variables. The image ratings were transformed from long data into wide data to account for image sequence in the analysis; the resulting data matrix comprised 3630 image rating lines (110 images × 33 raters). Cronbach's alpha was applied to assess the internal consistency of the four measured cardiac image ratings or perceptions. The chi-squared test of association was used to assess associations between categorical variables, and Spearman's rho correlation test was used to assess correlations between ordinal variables and between metric variables. A total relevance score for the AI-generated images was computed by summing the four indicators that characterized image quality: anatomical accuracy, text usefulness, attractiveness, and usability for medical purposes.
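As a rough illustration of two of the steps described above, the sketch below computes a total relevance score as the sum of the four ratings and Cronbach's alpha from its standard formula. The ratings are invented for illustration only and are not study data; the function names are assumptions introduced here, and the study itself used SPSS rather than this code.

```python
from statistics import variance

# Hypothetical ratings, one row per image rating, in the order:
# (anatomical accuracy, text usefulness, attractiveness, usability), each 1-3.
ratings = [
    (1, 1, 1, 1),
    (1, 1, 2, 1),
    (2, 1, 2, 2),
    (3, 2, 3, 3),
    (1, 1, 1, 1),
    (2, 2, 3, 2),
]


def total_relevance(row):
    """Total relevance score: the sum of the four item ratings (range 4-12)."""
    return sum(row)


def cronbach_alpha(rows):
    """Cronbach's alpha = k/(k-1) * (1 - sum(item variances) / variance(totals))."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


scores = [total_relevance(row) for row in ratings]
alpha = cronbach_alpha(ratings)
print(scores)
print(round(alpha, 3))
```

Values of alpha closer to 1 indicate that the four items move together, which is the property that justifies summing them into a single score.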
Generalized Linear Mixed Modelling with Gamma regression and a log link was applied to the evaluators' mean overall perfection score for the AI-generated cardiac anomaly images, regressing it against the raters' demographic and professional characteristics and the CHD complexity classifications. The association between each predictor variable and the dependent outcome variable in the mixed model was expressed as a multivariate adjusted Risk Rate (exponentiated beta coefficient) with its associated 95% confidence interval. IBM SPSS statistical software version 28 was used for the statistical data analysis, and the alpha significance level was set at 0.050.
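To make the exponentiated coefficient easier to read, the sketch below shows how a log-link coefficient maps to a multiplicative ratio and a percent difference. It is a deliberate simplification and not the study's mixed model: with a log link and a single categorical predictor (no random effects, no covariate adjustment), the fitted group means equal the sample group means, so exp(beta) reduces to a ratio of means. The scores and function names are hypothetical.

```python
import math
from statistics import mean

# Hypothetical mean overall scores per rater; not study data.
scores = {
    "cardiology experts": [4.0, 4.5, 5.0, 4.5],  # reference group
    "nurses": [6.0, 6.5, 5.5, 6.0],
}


def log_link_coefficient(group_scores, reference_scores):
    """Beta on the log scale: log of the ratio of group mean to reference mean."""
    return math.log(mean(group_scores) / mean(reference_scores))


def percent_difference(beta):
    """Convert a log-link coefficient to a percent difference vs. the reference."""
    return (math.exp(beta) - 1.0) * 100.0


beta = log_link_coefficient(scores["nurses"], scores["cardiology experts"])
ratio = math.exp(beta)  # exponentiated beta coefficient
print(round(ratio, 3), round(percent_difference(beta), 1))
```

Under this reading, an exponentiated coefficient of 1.341 would correspond to a score 34.1% higher than the reference group, and a value below 1 to a lower score.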

Results
In the study, 33 HCPs evaluated 110 cardiac images produced by DALL•E 3. The group consisted of diverse medical experts: eight (24.2%) cardiology experts, including a cardiac surgeon, three pediatric cardiology consultants, three fellows, and an anatomy consultant. Others included seven pediatricians, four non-pediatric faculty members, ten trainees (three medical students, four interns, three pediatric residents), and four pediatric nurses. Using an online data collection tool, this varied cohort completed 3630 individual image assessments, providing a comprehensive analysis of the AI-generated imagery. The evaluators also rated each cardiac anomaly as either simple or complex (Figure 1).

Evaluators' Overall Rating of AI-TIG CHD Images:
The evaluators' overall ratings for the AI-TIG cardiac images (N = 3630 ratings) are shown in Figure 2. Very few of the images (2.5%) were considered anatomically accurate, 16.7% were rated as midway, and the majority (80.8%) were assessed as fabricated. In the evaluation of the images' text labels, 85.2% were rated as useless, only 1.2% were considered useful, and 13.6% fell into a mid-range of usefulness.
Regarding the images' attractiveness, evaluators rated 18.7% of images as attractive and 18.2% as midway attractive, but most images (63.1%) were considered "not attractive at all". When considering usefulness for medical education, 78.1% were rated as "non usable", 21.6% as usable after modifications, and only 0.4% as usable without modification.

Variation of rating of AI-TIG Cardiac Images among various evaluator groups:
The ratings of images across the four domains (anatomical accuracy, text usefulness, attractiveness, usefulness for medical education) were compared among different groups of evaluators using the chi-squared test (Table 1). The medical students/interns/residents were significantly more likely than the rest of the evaluators to perceive the images as anatomically accurate, attractive, and usable for medical educational purposes, and the illustrative text as useful (p-value < 0.001).
Likewise, nurses were significantly more likely than others to perceive the images as attractive and useful for medical education, and the illustrative text as useful (p-value < 0.001). Conversely, the cardiology experts were significantly more inclined than the other evaluators to perceive the images as inaccurate, not attractive, and not usable for medical education, and the illustrative text as not useful.

Rating AI-TIG cardiac images of normal hearts, simple and complex CHD lesions:
The AI-TIG images of normal hearts (Figure 3) were rated poorly for anatomical accuracy (47.9% fabricated, 40.3% midway, and only 11.8% accurate). An example of the "most fabricated images" is shown in Figure 4a, and of the "least fabricated" in Figure 4b. Moreover, 83.9% of the normal heart images were rated as having inaccurate and useless text labels. In addition, 64.2% of them were rated as not usable for medical education, 34.5% as usable after modification, and only 1% as usable without modification.
This pattern extends to the individual ratings of the AI-TIG images of the various CHDs studied. Most AI-TIG images were rated poorly for anatomical accuracy, illustrative text usefulness, and usability for medical education, with only 1-3% rated favorably on these criteria. However, the images were generally perceived as attractive in 15-22% of ratings.
The chi-squared test (Table S1) showed that CHD complexity was significantly associated with the evaluators' perceived anatomical accuracy of the images. Complex CHD images were rated as fabricated significantly more often than normal heart or simple CHD images (p-value < 0.001), while the other three evaluation criteria (text usefulness, attractiveness, and usefulness for medical education) were not significantly associated with CHD complexity.
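For readers less familiar with the chi-squared test of association used here, the following sketch computes the statistic for a complexity-by-accuracy contingency table. The counts are invented for illustration and are not the study's data; the function names are assumptions, and the p-value helper uses a closed form that holds only for an even number of degrees of freedom (a 3 × 3 table has four).

```python
import math

# Hypothetical counts, not study data.
# Rows: normal heart, simple CHD, complex CHD.
# Columns: fabricated, midway, accurate.
table = [
    [50, 40, 10],
    [70, 25, 5],
    [90, 9, 1],
]


def chi_squared(observed):
    """Return (statistic, degrees of freedom) for a test of association."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof


def p_value_even_dof(stat, dof):
    """Upper-tail chi-squared probability; closed form valid for even dof only."""
    if dof % 2 != 0:
        raise ValueError("closed form implemented for even dof only")
    half = stat / 2.0
    terms = sum(half ** k / math.factorial(k) for k in range(dof // 2))
    return math.exp(-half) * terms


stat, dof = chi_squared(table)
p = p_value_even_dof(stat, dof)
print(round(stat, 2), dof, p < 0.001)
```

A small p-value indicates that the distribution of accuracy ratings differs across complexity groups more than chance alone would suggest; it does not by itself say which group drives the difference.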

Correlations between evaluators' perceptions of the four criteria of AI-TIG Cardiac Anomalies Images:
Table 2 shows the bivariate correlations between the four criteria used to assess image quality. We found significant positive correlations (P = 0.01) between all of them (r ranged between 0.337 and 0.566).
The best correlation was between image usefulness for medical education and its attractiveness.Furthermore, the lowest correlation was between image attractiveness and its anatomic accuracy.
Usefulness for medical education overall had the best correlation with the other three criteria (r ranged between 0.441 and 0.566).

Multivariable Analysis of evaluators' perceived overall perfection score of AI-TIG Cardiac Images:
We ran a multivariable generalized linear regression of the overall mean perfection score of the AI-TIG cardiac anomaly images, with the cardiology experts' mean perfection score as the reference. Nurses had the highest perfection score relative to cardiology experts (34.1% higher, p < 0.001), followed by medical students/interns/residents (26.6% higher, p < 0.001), faculty staff/academicians (15.5% higher, p < 0.001), and pediatric consultants/specialists (14.5% higher, p < 0.001).
Taking cardiac anomaly complexity into consideration (Table 3), complex anomalies were rated significantly less perfect overall than simple ones by all evaluators (6% lower, p < 0.001). For example, certain anomalies, such as coarctation of the aorta, interruption of the aortic arch, and aorto-left ventricular tunnel, were perceived as significantly less perfect by all evaluators (4.4%-11% less perfect) compared with other CHD images.

Discussion
DALL•E 3 is an AI-TIG model that generates images from text descriptions through transformer-based language models like GPT-3 [16,30]. It can produce a variety of images, from realistic to abstract art, and can creatively combine elements from different ideas to create novel visuals. Despite its potential in areas like education and art, DALL•E has faced challenges, such as generating coherent images from complex texts, maintaining image quality, addressing biases from training data, and managing computational demands [31].
AI-TIG models have shown proficiency in generating images with correct style and content for some medical applications, such as histopathology and scientific illustrations [32]. Some potential benefits of these technologies include educational applications without copyright limitations, tailored educational experience, data anonymization, and discovery of new morphological associations. Conversely, their potential limitations lie in their current inability to accurately generate complex medical images.
While offering innovative visual learning AI-tools, AI-TIG's integration in education requires careful balance and validation for accuracy and reliability, similar to Large Language Models (LLMs) [18,33,34].We demonstrated that AI-TIG, like LLMs, is liable to generate inaccuracies ('hallucinations' or 'confabulation'), posing risks in medical contexts.Consequently, one recommended approach for use is the 'sandwich technique': experts input text, AI-TIG generates the image, and then the expert evaluates and edits it for accuracy, ensuring safer application in the educational process [35].
Our study explored the current state of DALL•E 3 in the field of medical illustrations, particularly CHDs. We discovered that while this technology opens novel avenues for visualization, it also poses significant challenges. Like the "hallucinations" in LLMs, the tendency of DALL•E 3 to introduce inaccuracies and 'artifacts' in images was significant, raising concerns about its current suitability for medical illustrations [36]. These insights emphasize the need for rigorous validation before employing AI-TIG imagery in complex areas like medical education, patient education, or decision-making.
Our study found that the majority of 3630 evaluations rated DALL•E 3's AI-generated cardiac images as anatomically inaccurate and educationally limited.These shortcomings may stem from the model's training and its 'Zero-Shot' ability, which inconsistently adapts to untrained text prompts [14,37].
However, other research on AI has shown promise in enhancing medical imaging quality and interpretability in cardiology [1].Despite DALL•E 3's current limitations, ongoing research and developments may improve AI-TIG medical images' accuracy.
Another concern in our study was the erroneous text labels in the AI-generated images, which were mostly misspelled or misplaced, rendering them "useless". For enhanced medical illustrations, future AI-TIG models should be developed to produce accurate labeling of medical images [1]. Specialized or fine-tuned GPT models could be trained to more accurately recognize medical structures and enhance their labeling [38]. As these AI-TIGs undergo more medically-oriented training, their accuracy may improve, providing a better and more personalized learning tool for healthcare professionals, patients, and educators [39].
Interestingly, 18% of images in our sample were considered to have an "attractive appearance" for medical professionals, as was also noted by other studies describing DALL•E 3 images as more realistic [15]. Nurses and junior trainees in our group had a more positive perception of the AI-TIG cardiac images: they perceived more images as anatomically "accurate," found the illustrative text more useful, considered more images usable for medical educational purposes, and saw more images as attractive than the other evaluators did. While this could be a positive signal for future adoption of more accurate AI models in the medical curriculum, these findings may indicate a risk that non-expert medical professionals or laypersons are influenced by the vibrant artistic appearance of such images.
In our study, AI-TIG cardiac images, including those of normal hearts and simple lesions, were frequently rated poorly in terms of anatomical accuracy. This issue may be attributed to inherent challenges in DALL•E's capabilities, including difficulties in image coherence, quality, and biases in training datasets [31]. Moreover, while complex congenital anomalies were more prone to anatomical fabrication, the complexity of cardiac disease did not significantly impact the perceived educational value of these images. Notably, there was a positive correlation between the perceived anatomical accuracy and educational usefulness of the images, emphasizing the importance of accuracy for medical education purposes.
The expert panel also observed additional inaccuracies in the AI-generated images, such as the depiction of non-existent blood vessels in the heart images and a notable lack of cardiac valves. In addition, the AI model did not seem to identify the various structures of the heart (e.g. aorta, pulmonary valve, atrial or ventricular septum); therefore, it could neither draw the abnormalities of these structures nor link these structures to the correct text labels. This is similar to several errors reported in illustrations of the heart by three AI-TIGs: Microsoft Bing/DALL•E, Stable Diffusion, and Craiyon [40]. The investigator used a prompt to draw a "detailed and accurate anatomy illustration of the human heart" on the three platforms on May 30, 2023, and found that they failed to show accurate coronary artery origins and the branching of the aorta and pulmonary trunk.
The inaccuracy issues may stem from DALL•E 3 possibly being trained on unrepresentative data, leading to a risk of overfitting to inaccurate disease images from automation bias [41]. Sharing such flawed images and illustrations with non-cardiac experts, like medical students, nurses, or laypersons, could unintentionally generate or intensify misinformation, a concern exacerbated by automation biases. This highlights the need for caution in using AI tools for didactic purposes, particularly in sensitive fields like healthcare education [42][43][44][45][46][47].
To mitigate some risks of AI-TIG medical imagery, it is important to educate HCPs and patients on the proper use of AI tools, such as appropriate prompts that are more specific and written at higher levels of medical literacy to produce higher-quality images [48]. Also, careful interpretation of the medical images still requires expert oversight to ensure images are not misinforming users [35,42]. One capability of AI models is their ability to acquire knowledge and improve performance through increased exposure to data; therefore, IT experts could enhance the training of current and future AI models, emphasizing a variety of accurate medical image datasets and improving algorithms to enhance the reliability and usefulness of generated images in medical education [49,50].
Medical digital twins, serving as virtual representations of medical conditions, could improve the merging of the physical and virtual medical realms [51]. Recently, digital twin technology, especially in cardiac modeling, has made substantial progress [52]. However, challenges persist regarding the variability of human heart parameters and their implications for patient response to treatments, and personalized digital twins that mimic specific heart pathologies demand significant computational resources [51]. Therefore, AI-TIG may offer new opportunities, provided these models are accurate, widely accessible, and easily editable, thus improving personalized healthcare provision and the medical education experience in various medical fields [53].

Study Limitations and Future Potentials:
Our study focused on one category of anatomical lesions (CHDs) at a specific time on one AI-TIG (DALL•E 3). Therefore, future research on AI-TIG images for other health-related conditions or with other AI-TIG models may produce different outcomes. Our research is among the first to explore the potential of DALL•E 3 AI-TIG images in CHD, and it may pave the way for more medical-specific AI training for future models.
Future research on AI-TIG may address other shortcomings, such as the 'black-box' nature of the models, the requirement for extensive medical-data training effects, better transparency of image standardization, or improved filtering of inaccuracies during training [55]. The optimal use of AI-TIG images in medical education or individualized healthcare with digital twin models requires further collaboration between healthcare professionals and computer scientists. This includes defining clear objectives, choosing the optimal deep learning algorithms and datasets, and interpreting image results with a balanced, human-supervised perspective.

Conclusion
This study explored the integration of AI-TIG technology in medical illustrations, particularly for visualizing CHDs, highlighting a novel approach. Despite experts identifying errors and questioning the medical utility of AI-generated images, non-experts like medical students and nurses viewed them more positively. These results point out the need for caution among AI-TIG users and healthcare professionals, emphasizing vigilance in their application. Additionally, there is an opportunity for computer scientists and AI stakeholders to refine AI-TIG models with more realistic medical images. Importantly, text in the AI-TIG generated images should clearly indicate potential inaccuracies in both visuals and descriptions. Further research into other healthcare imaging techniques using generative AI is warranted.
Abbreviations
AI: Artificial Intelligence
AI-TIG: AI-powered text-to-image generator
ASD: Atrial Septal Defect
CCTGA: Congenitally Corrected Transposition of the Great Arteries

Figures

Figure 1
1. Pilot Testing of Various Prompts: We experimented with different prompt structures to ensure these prompts produced similar images. Examples of the prompts we tried:

Table 1
Evaluators' ratings of the AI-generated cardiac images (anatomical accuracy, text usefulness, attractiveness, usefulness for medical education).N = 3630 image ratings.

Table 2
Bivariate Spearman's correlations between evaluators' perceptions of the AI-generated cardiac anomaly images.

Table 3
Multivariable Generalized Linear Regression (GLM) analysis of evaluators' perceived Overall Relevance score of AI-generated cardiac images.