In this study, a pilot external validation of an ML model that identifies 44 skin diseases representing very frequent reasons for PC consultation was performed in a PC setting. This is a feasibility study in routine clinical practice, and it will help us design further studies with larger samples that may contribute to improving the ML model used in PC. The results show that the 100 cases included in the study were predominantly phototype III and, to a lesser extent, phototype II. According to the new Medical Device Regulation [31], it is imperative to perform proper evaluations of ML models for dermatology imaging applications [33], including across all skin phototypes. More studies are therefore needed to ensure that such models are trained in an inclusive and balanced way and perform with the same accuracy on every skin phototype, avoiding the possibility of disadvantaging certain groups of people.
The overall diagnostic accuracy of the model in this study is lower than that of both the GPs and the TD assessment, as well as that obtained in the theoretical evaluation in the model's proof of concept [40]. However, the average diagnostic sensitivity improves substantially when the analysis is restricted to the 82 cases in which the gold standard is among the 44 diagnoses for which the model is trained. The observed results thus highlight the importance of identifying the diagnoses not yet included in order to retrain the model and adapt it to routine clinical practice. These results differ from most theoretical and retrospective studies, in which AI accuracy is usually equal to or higher than that of clinicians [23,26,27,38], and are consistent with the few existing prospective, real-world studies [50]. In addition, it is noteworthy that the specificity of AI applied to dermatologic imaging was very close to 1, which suggests that it is a useful tool for routine clinical practice as a CDST.
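For readers less familiar with these metrics, a minimal sketch of how per-diagnosis sensitivity and specificity are computed one-vs-rest from gold-standard and predicted labels is shown below. This is purely illustrative (the function name and the toy labels are hypothetical, not the study's analysis code):

```python
def sensitivity_specificity(gold, predicted, label):
    """One-vs-rest sensitivity and specificity for a single diagnosis label.

    gold, predicted: equal-length lists of diagnosis strings.
    Returns (sensitivity, specificity); None where a value is undefined
    (no positive or no negative cases for that label).
    """
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    tn = sum(1 for g, p in zip(gold, predicted) if g != label and p != label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    sens = tp / (tp + fn) if (tp + fn) else None
    spec = tn / (tn + fp) if (tn + fp) else None
    return sens, spec

# Toy example: one BCC misclassified as nevus lowers nevus specificity.
gold = ["nevus", "BCC", "nevus", "psoriasis"]
pred = ["nevus", "nevus", "nevus", "psoriasis"]
print(sensitivity_specificity(gold, pred, "nevus"))  # (1.0, 0.5)
```

Because specificity counts all non-cases of a given diagnosis, a model that rarely over-calls any single one of 44 diagnoses will show specificity close to 1, as observed here.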
Moreover, the fact that the diagnostic accuracy metrics increase with the Top-3 and Top-5 assessments is consistent with the model's usefulness in differential diagnosis, a point already made by Muñoz-López et al in their study [50]. Recent algorithms tend to produce a ranked list of diagnoses. Supporting a differential diagnosis rather than a single diagnosis is particularly important in dermatology, where the differential diagnosis drives diagnostic-therapeutic decision-making. Furthermore, taking all ranked diagnoses into account can improve diagnostic accuracy, which is relevant in PC, where what usually matters most is knowing whether or not we are dealing with a potentially malignant lesion in order to assess the need for referral and/or prioritisation.
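The Top-k metric used throughout this discussion can be sketched as follows: a case counts as a hit if the gold-standard diagnosis appears anywhere in the model's k highest-ranked predictions. The sketch and its toy cases are illustrative assumptions, not the study's data:

```python
def top_k_hit(gold_dx, ranked_dx, k):
    """True if the gold-standard diagnosis appears in the model's top-k ranked list."""
    return gold_dx in ranked_dx[:k]

def top_k_sensitivity(cases, k):
    """Fraction of cases whose gold diagnosis is within the top-k predictions.

    cases: list of (gold_dx, ranked_dx) pairs, ranked_dx ordered by model confidence.
    """
    hits = sum(top_k_hit(g, r, k) for g, r in cases)
    return hits / len(cases)

# Toy example: 2 of 3 gold diagnoses appear within the top 3 predictions.
cases = [
    ("nevus", ["seborrheic keratosis", "nevus", "BCC"]),
    ("psoriasis", ["eczema", "psoriasis", "tinea"]),
    ("BCC", ["nevus", "seborrheic keratosis", "melanoma"]),
]
print(round(top_k_sensitivity(cases, 3), 2))  # 0.67
```

By construction, Top-5 sensitivity can never be lower than Top-3 or Top-1 for the same cases, which is why the metrics rise monotonically with k.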
The fact that TD has been established for years in the PC environment of Central Catalonia as a screening method for in-person dermatology consultations could influence different variables, such as the high quality of the images collected, the consultation time and the degree of participation and acceptance among citizens [9]. Regarding possible interferences in image quality, it should be noted that the dermatoscopes used in the PC setting are not digital or adapted for smartphones, which could lower the quality of the dermoscopic images and bias their analysis both by the dermatologists and by the ML model.
The results suggest that a diagnostic aid for GPs in resolving dermatologic consultations would save significant time. GPs could better orient the consultation as it occurs, without having to wait for the response time of the TD consultation (24-48 hours); for dermatology specialists, in turn, it would mean being able to focus their expertise on cases that are difficult to manage in PC.
It is not possible to draw conclusions on individual diagnostic sensitivity by disease; results were therefore reported by groups. However, the small number of cases in the pilot study allowed a more exhaustive analysis of the different diseases. About 50% of the cases fell within the single category of benign tumours, for which the ML model had an advantage over the clinicians, with a Top-5 diagnostic sensitivity of 96%. Analysing the 3 benign-tumour cases the model failed to diagnose, in 2 of them the model included the correct diagnosis in the Top-5 when the dermoscopic image of the nevus was analysed; as far as case resolution in routine clinical practice is concerned, these would therefore have been correctly oriented. In the third case, the gold standard was intradermal nevus; the ML model's Top-5 included the diagnosis of nevus but not intradermal nevus, so it was counted as erroneous in the overall analysis, despite the fact that differentiating between the two categories (nevus and intradermal nevus) has no importance in clinical practice. In future versions of the ML model, these diagnoses should be merged into a single diagnosis (nevus) given the lack of clinical relevance. One could therefore infer that the ML model's Top-5 diagnostic sensitivity for benign tumours in routine clinical practice is 100%.
For malignant tumours, at a theoretical level the use of the ML model would not imply a diagnostic improvement. However, the results are not statistically significant, since the number of cases analysed was very small (n=7) and the professionals' average diagnostic sensitivity in the Top-3 was very high.
In the Top-5, an average model sensitivity of 83% was observed. The ML model did not include the correct diagnosis in 2 of the 7 cases of malignant tumours. These cases were one BCC and one cSCC, and the pathology report of each lesion was used as the gold standard. The cSCC also generated diagnostic doubt among the PC clinicians, who classified it as melanoma, as did the ML model. We also believe it is important to highlight that, in all cases, the Top-5 of the image evaluation included diagnoses in the malignant tumour category, thus capturing the malignant potential of the lesion, which is relevant to the diagnostic and referral approach of GPs.
For infectious diseases, the model's Top-5 sensitivity was 75%, failing in 3 of the 9 cases included. The detailed analysis shows that two of the failed cases were verruca vulgaris. For one, located on the face, the ML model's diagnoses from the clinical image were benign tumours (nevus, intradermal nevus and seborrheic keratosis), epidermal cyst and herpes simplex, but when the dermoscopic image was included, verruca vulgaris became the Top-1 diagnosis, showing another case that would have been resolved by following the clinical practice of the GP, who used a dermatoscope to aid the diagnosis. In the second case, the ML model probably failed because the image taken by the GP showed several lesions, which may have confused both the AI and the TD assessors. The third case was a tinea corporis of the scalp with diagnostic agreement between the 3 clinicians who assessed the image; the model's Top-5 were seborrheic dermatitis, folliculitis, neurodermatitis, vitiligo and psoriasis. Photographing the scalp is always challenging, as cameras usually focus on the hair rather than the scalp, where most dermatologic diseases actually reside. It is therefore possible that the images used for training the ML model suffered from the same problem, decreasing its diagnostic accuracy [51].
For inflammatory diseases, the model's Top-5 sensitivity was 93%, failing in 1 of the 11 cases. That case was acne vulgaris, in which various erythematous papular lesions could be seen, some with superficial crusting in the beard area. The 5 diagnoses issued by the model were rosacea, impetigo, folliculitis, BCC and perioral dermatitis, most of them falling into the inflammatory or infectious disease categories.
For genital diseases, only 2 cases were included, one of balanitis and one of condyloma; in both, the model placed the correct diagnosis in the Top-1. Despite the small number of cases in this category, the high diagnostic sensitivity for genital diseases could be explained by the fact that genital disease photographs made up 30% of the dataset on which the model was trained.
Although it is difficult to assess how the inclusion or exclusion of diagnoses would optimise the model for routine clinical practice, there are diseases documented as absent, such as dyshidrotic eczema, granuloma annulare, scabies, fibroma and hidradenitis. Based on the authors' clinical experience, we suggest including these diseases in future versions of the model to improve its performance.
A review of the terminology used by Autoderm® was performed, as some of the terms are obsolete or inaccurate in clinical practice. For example, the term "unspecified dermatitis" has never been used among dermatologists, as it is very unspecific. As for vascular malformations, the model only accounts for haemangiomas, which would be paediatric vascular malformations, yet a case assessed in adulthood was also recorded. We also suggest unifying the terms "Borrelia" and "erythema migrans" to avoid confusion. A proposal was also made to improve the subclassification of acquired nevi into: junctional nevus (flat mole), compound nevus (flat mole with central raised area), intradermal nevus (raised mole) and nevus with atypical clinical features (since the diagnosis of atypia is histological).
The gold standard in this study was defined as a diagnostic consensus between two or three dermatologists, which may pose, in isolated cases of high diagnostic complexity, a greater difficulty compared with studies in which all lesions are compared against histopathological analysis. These isolated cases were resolved correctly through careful deliberation among experts, reinforcing our aim of operating in routine clinical practice without having to perform biopsies that would entail unnecessary morbidity.
On the technical side of the ML model, one of its main advantages is that it can continue to learn patterns indefinitely as more images are obtained. This contrasts with the training of a GP, which takes several years and during which some of the information and experience gained over a working life is eventually lost, whereas a neural network can learn and work indefinitely. Everything suggests that the ML model's constant learning could also have a positive impact on the continued training of the professionals who use it as a CDST.
On the other hand, the explainability aspect is important to mention. Many automatic diagnostic algorithms have no mechanism for communicating why a prediction is made, leaving the observer with only a percentage probability, which is insufficient to assess whether the decision was made correctly.
Limitations
The most relevant limitation of the study is the number of images used (n=100) for the performance evaluation of the ML model. Since Autoderm® evaluates 44 skin conditions, and considering that a significant number of these conditions have prevalences below 1-5%, the sample per class may be unbalanced and some conditions may not be evaluated at all, leading to an insufficient confidence level and less conclusive results for those conditions.
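To illustrate why small per-class samples yield inconclusive results, the width of a 95% Wilson score interval for an observed sensitivity can be sketched as below. The numbers are illustrative only (the 6-of-7 example is a hypothetical figure, not a result reported in this study):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a proportion; very wide when n is small."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (centre - half, centre + half)

# Hypothetical example: 6 of 7 cases detected. The interval spans roughly
# 0.49-0.97, far too wide to support per-disease conclusions.
lo, hi = wilson_interval(6, 7)
print(round(lo, 2), round(hi, 2))  # 0.49 0.97
```

With 44 conditions spread over 100 images, most classes have single-digit counts, so their intervals span most of the unit range.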
Secondly, due to the sample size and the consecutive collection of cases, no representative results were obtained for less frequent diseases. However, we included most of the spectrum of skin lesions that are a common reason for PC consultation, as well as banal lesions, to avoid selection bias.
Thirdly, it should be taken into account that the GPs who agreed to participate voluntarily in the study have an interest in dermatology. Not all of them have advanced academic training in the subject, but this interest could partly explain why their diagnostic accuracy was higher than that reported in the literature [6,7]. In this context, the ML model would be at a disadvantage in the comparison of overall diagnostic accuracy and sensitivity, as well as in the analysis by disease subgroups.
Fourth, a diagnosis made from a single image has inherent limitations compared with diagnoses made in a clinical setting. The result of the ML model was based on a single photograph, unlike other ML models that consider more than one photograph.
Finally, the majority of phototypes in the population where the present study was conducted are types II and III, which could be related to a decrease in diagnostic accuracy, as the other two clinical studies with Autoderm were conducted in Sweden (phototypes I and II) and Uganda (phototype VI) [44,45].