DALL·E 3 is an AI-TIG model that generates images from text descriptions using transformer-based language models such as GPT-3 [16, 30]. It can produce a variety of images, from realistic renderings to abstract art, and can creatively combine elements from different ideas into novel visuals. Despite its potential in areas such as education and art, DALL·E has faced challenges, including generating coherent images from complex texts, maintaining image quality, addressing biases in training data, and managing computational demands [31].
AI-TIG models have shown proficiency in generating images with correct style and content for some medical applications, such as histopathology and scientific illustrations [32]. Potential benefits of these technologies include educational applications free of copyright limitations, tailored educational experiences, data anonymization, and the discovery of new morphological associations. Conversely, their main limitation lies in their current inability to generate complex medical images accurately.
While AI-TIG offers innovative visual learning tools, its integration into education requires careful balance and validation for accuracy and reliability, similar to Large Language Models (LLMs) [18, 33, 34]. We demonstrated that AI-TIG, like LLMs, is liable to generate inaccuracies ('hallucinations' or 'confabulations'), posing risks in medical contexts. Consequently, one recommended approach for AI-TIG use is the 'sandwich technique': an expert inputs the text, the AI-TIG generates the image, and the expert then evaluates and edits it for accuracy, ensuring safer application in the educational process [35].
Our study explored the current state of DALL·E 3 in the field of medical illustration, particularly CHDs. We found that while this technology opens novel avenues for visualization, it also poses significant challenges. Analogous to 'hallucinations' in LLMs, the tendency of DALL·E 3 to introduce inaccuracies and 'artifacts' into images was significant, raising concerns about its current suitability for medical illustrations [36]. These insights emphasize the need for rigorous validation before employing AI-TIG imagery in complex areas such as medical education, patient education, or decision-making.
Our study found that the majority of 3630 evaluations rated DALL·E 3's AI-generated cardiac images as anatomically inaccurate and educationally limited. These shortcomings may stem from the model's training and its 'zero-shot' generalization, which adapts inconsistently to text prompts outside its training data [14, 37]. However, other research on AI has shown promise in enhancing medical imaging quality and interpretability in cardiology [1]. Despite DALL·E 3's current limitations, ongoing research and development may improve the accuracy of AI-TIG medical images.
Another concern in our study was the erroneous text labels in the AI-generated images, which were mostly misspelled or misplaced, rendering them "useless". For enhanced medical illustrations, future AI-TIG models should be developed to produce accurate labeling of medical images [1]. Specialized or fine-tuned GPT models could be trained to recognize medical structures more accurately and enhance their labeling [38]. As these AI-TIGs undergo more medically oriented training, their accuracy may improve, providing a better learning and personalized medical tool for healthcare professionals, patients, and educators [39].
Interestingly, 18% of the images in our sample were perceived as having an "attractive appearance" for medical professionals, as also noted by other studies describing DALL·E 3 images as more realistic [15]. Nurses and junior trainees in our group had a more positive perception of AI-TIG cardiac images than the other evaluators: they rated more images as anatomically "accurate," found the illustrative text more useful and usable for medical educational purposes, and judged more images attractive. While these could be positive signals for future adaptation of more accurate AI models into medical curricula, these findings may also indicate a risk that non-expert medical professionals or laypersons could be unduly influenced by the vibrant artistic appearance of such images.
In our study, AI-TIG cardiac images, including those of normal hearts and simple lesions, were frequently rated poorly in terms of anatomical accuracy. This issue may be attributed to inherent challenges in DALL·E's capabilities, including difficulties in image coherence, quality, and biases in training datasets [31]. Moreover, while complex congenital anomalies were more prone to anatomical fabrication, the complexity of cardiac disease did not significantly impact the perceived educational value of these images. Notably, there was a positive correlation between the perceived anatomical accuracy and educational usefulness of the images, emphasizing the importance of accuracy for medical education purposes.
The expert panel also observed additional inaccuracies in the AI-generated images, such as the depiction of non-existent blood vessels in the heart images and a notable lack of cardiac valves. In addition, the AI model apparently could not identify the various structures of the heart (e.g. aorta, pulmonary valve, atrial or ventricular septum); therefore, it could neither depict the abnormalities of these structures nor link them to correct text labels. This mirrors several errors reported in heart illustrations produced by three AI-TIGs: Microsoft Bing/DALL·E, Stable Diffusion and Craiyon [40]. The investigator prompted the three platforms to draw a "detailed and accurate anatomy illustration of the human heart" on May 30, 2023, and found that they failed to show accurate coronary artery origins and the branching of the aorta and pulmonary trunk.
The inaccuracy issues may stem from DALL·E 3 possibly being trained on unrepresentative data, creating a risk of overfitting to inaccurate disease images [41]. Sharing such flawed images and illustrations with non-cardiac experts, such as medical students, nurses, or laypersons, could unintentionally generate or intensify misinformation, a concern exacerbated by automation bias. This highlights the need for caution in using AI tools for didactic purposes, particularly in sensitive fields like healthcare education [42–47].
To mitigate some risks of AI-TIG medical imagery, it is important to educate HCPs and patients on the proper use of AI tools, such as writing prompts that are more specific and reflect higher levels of medical literacy, to produce higher-quality images [48]. Careful interpretation of the generated medical images also still requires expert oversight, to ensure images do not misinform users [35, 42]. Because AI models can acquire knowledge and improve performance through increased exposure to data, IT experts could enhance current and future models' training by emphasizing varied, accurate medical image datasets and improving algorithms, thereby enhancing the reliability and usefulness of generated images in medical education [49, 50].
Medical digital twins, serving as virtual representations of medical conditions, could help merge the physical and virtual medical realms [51]. Recently, digital twin technology, especially in cardiac modeling, has witnessed substantial progress [52]. However, challenges persist regarding the variability of human heart parameters and its implications for patient response to treatment, and personalized digital twins that mimic specific heart pathologies demand significant computational resources [51]. Therefore, AI-TIG may offer new opportunities, provided these models are accurate, widely accessible, and easily editable, thus improving personalized healthcare provision and the medical education experience across various medical fields [53, 54].
Study Limitations and Future Potentials:
Our study focused on one category of anatomical lesions (CHDs), at a specific point in time, using one AI-TIG (DALL·E 3). Therefore, future research on AI-TIG images for other health-related conditions or with other AI-TIG models may produce different outcomes. Our research is among the first to explore the potential of DALL·E 3-generated images in CHD, and it may pave the way for more medically specific AI training in future models.
Future research on AI-TIG may address other shortcomings, such as the 'black-box' nature of the models, the need for extensive training on medical data, greater transparency and standardization of image generation, or improved filtering of inaccuracies during training [55]. The optimal use of AI-TIG images in medical education or in individualized healthcare with digital twin models requires further collaboration between healthcare professionals and computer scientists. This includes defining clear objectives, choosing optimal deep learning algorithms and datasets, and interpreting image results with a balanced, human-supervised perspective.