Assessment of the change in accuracy of an artificial intelligence algorithm for the detection of skin cancer in camera images following diversification and training

Background: The US FDA recently stated in its Proposed Regulatory Framework for software as a medical device (SaMD) that “One of the greatest benefits of AI/ML in software resides in its ability to learn from real-world use and experience, and its capability to improve its performance.” This study follows two previous publications which addressed the accuracy of a machine learning algorithm for the detection of malignant melanoma. The aim of this study was to quantify the change in accuracy following modifications to the algorithm (DERM) for the detection of non-melanoma skin cancers and potential precursors of skin cancer. A secondary aim was to assess any improvement in accuracy associated with continued training of the algorithm.
Methods: A total of 16,550 images of skin lesions with histopathology-based assessment were available for assessment. The primary indicator of diagnostic accuracy was the area under the ROC curve with 95% confidence intervals. Sensitivity and specificity at the most efficient cut-point were also estimated, together with the numbers of false negative and false positive results.
Results: The inclusion of squamous cell cancer, basal cell cancer and intra-epidermal carcinoma in addition to melanoma results in an improvement in the scope of the algorithm. For the most recent version of the algorithm all skin cancers show an area under the ROC curve greater than 95%. For melanoma, sensitivity = 91% and specificity = 89%; for all non-melanoma skin cancers, sensitivity = 97% and specificity = 94%. Continued training of the algorithm results in a statistically significant (p < 0.01) improvement in accuracy which diminishes as the ROC area approaches 100%.
Conclusions: The results indicate that as the algorithm is used in clinical practice it will become more accurate with continued training, but the rate of improvement will diminish as the ROC area approaches 100%. A smartphone or other camera fitted with a dermoscopic lens and with internet access to the algorithm can provide an accurate additional assessment of a suspected skin cancer lesion or precursor for primary care physicians and dermatologists.


Introduction
The skin is one of the largest of the body organs. It is frequently a target for cancer because of exposure to solar ultraviolet radiation, with an estimate of more than one million non-melanoma skin cancers and nearly 300,000 melanomas worldwide in 2018 (1). Non-melanoma skin cancer data are poorly reported in many areas, which results in an underestimate of the health service costs associated with these relatively low-mortality cancers. Basal cell carcinoma (BCC) is the most common of the non-melanoma skin cancers, with one estimate of 80% BCC and 20% squamous cell carcinoma (SCC) in the United States (2). Variation in the incidence of both skin cancer types between countries is marked and is related to both ethnic differences and climatic factors which influence solar UV radiation, as well as the damage done to the ozone layer by chlorofluorocarbon pollution. For these reasons Australia and New Zealand have the highest skin cancer incidence (age-standardised per 100,000), with melanoma incidence at 33.6 and 33.3 per 100,000 and non-melanoma incidence at 147.5 and 138.4 per 100,000 respectively. For the USA the rates are 12.7 (melanoma) and 55.8 (non-melanoma) per 100,000 (3). There is a marked sex differential for non-melanoma incidence in all three countries, with men more than twice as likely to develop the disease, possibly as a consequence of occupational exposure from working outdoors in sunlight (4).
Since the publication of the US Surgeon General's Call to Action to Prevent Skin Cancer in 2014 (2), a high priority has been assigned to skin cancer prevention in the United States, with an emphasis upon melanoma. The most recent Skin Cancer Prevention Progress Report (5) indicates that there has been some progress in safe sun exposure practices, but there is still a lack of primary prevention, with one in three adults and more than half of high school students reporting sunburn each year (6). Even if there were a rapid improvement in sun-protection behaviour and facilities, there is a latent reservoir of damaged skin from past sun exposure which will generate many skin cancers into the future.
As a consequence, early detection remains a potential secondary prevention intervention, but the current recommendation on skin cancer screening from the U.S. Preventive Services Task Force assessment (2016) concludes "that the current evidence is insufficient and that the balance of benefit and harms of visual skin examination by a clinician to screen for skin cancer in asymptomatic adults cannot be determined." (7) The report indicates that one of the reasons for this conclusion is that detection of melanoma in primary care is not sufficiently accurate to support a population-based screening program.
Another issue that has complicated assessment of the value of screening for melanoma, discussed by Weyers, is overdiagnosis. As he demonstrates, there is disagreement between dermatology clinicians and epidemiologists on what constitutes overdiagnosis. This has resulted in two quite different perspectives on melanoma detection, particularly in the context of small lesions (less than 6 mm) (8).
Skin Analytics Ltd. has been developing an artificial intelligence based algorithm, 'Deep Ensemble for the Recognition of Malignancy' (DERM), for the classification of skin cancer lesions based upon images captured by readily available cameras. The initial phase of the project was devoted to the detection of malignant melanomas, and the results of the evaluation of these studies have been published (9,10). The studies showed that the DERM algorithm was as accurate as specialist dermatologists in detecting melanoma. The aim of the most recent development was to increase the scope of the algorithm to include a much wider range of skin lesions, including non-melanoma skin cancers (SCC, BCC and intra-epidermal carcinomas) and lesions that may be precursors of, or be mistaken for, skin cancers. (The list of the lesions can be seen in Table 1.)
Even though the US Food and Drug Administration states that "One of the greatest benefits of AI/ML in software resides in its ability to learn from real-world use and experience, and its capability to improve its performance", the statement is largely based upon non-medical commercial software apps, and there is relatively little academic literature to describe the improvement that occurs with use and continuing training. This paper presents the results of an updated assessment of the accuracy of the DERM algorithm following diversification. It also assesses the influence of continued retraining of the algorithm with newly acquired images.

Methods
(16), and DermNet skin disease atlas (17).
An additional 434 images of intact healthy skin were also used during the assessment and these were assumed to be negative with respect to histopathology for all lesion types.
DERM was designed and developed using deep learning techniques, specifically convolutional neural networks (CNNs) that can identify and assess features of skin lesions which are associated with each image type. Deep learning identifies features of a lesion directly from the data and contrasts the features that are associated with a positive compared to a negative histopathology assessment. Cross-validation was used to assess the performance of the algorithm; this approach allows every image to be assessed once, while ensuring that the same image does not appear in both the training and test datasets. Cross-validation is performed by splitting the dataset into 10 randomly sampled 'folds' (datasets). The algorithm is tested against each fold, with the remainder used for training. The results for each fold are then averaged so that the overall performance can be analysed. This method also helps to avoid overfitting (18).
As images of lesions with histopathology assessment became available, the algorithm was retrained on three subsequent occasions. This provided an opportunity to assess the improvement of the accuracy with continuing retraining. There were also changes in the algorithm which broadened the scope to include non-melanoma skin cancers and to specify potential precursors and specific types of benign lesions, so that the algorithm cycles through the series of lesions in a pre-determined sequence until it identifies the specific type, as illustrated in Fig. 1.
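The 10-fold cross-validation procedure described above can be sketched as follows. This is a minimal illustration rather than the DERM training code: the function names `k_fold_indices` and `cross_validate`, the fixed random seed, and the `train_and_score` callback (which stands in for training a model on the training images and returning a score on the held-out fold) are all assumptions made for the example.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Split indices 0..n_items-1 into k disjoint, randomly sampled folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Striding through the shuffled list gives folds whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(items, train_and_score, k=10, seed=0):
    """Test on each fold in turn, train on the remainder, and average the fold scores.

    train_and_score(train_items, test_items) is a hypothetical callback that
    fits a model on train_items and returns one numeric score for test_items.
    """
    folds = k_fold_indices(len(items), k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        # No item is ever in the training and test sets of the same fold.
        train = [items[i] for i in range(len(items)) if i not in held_out]
        test = [items[i] for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)
```

Because the folds are disjoint and together cover the whole dataset, every image is scored exactly once by a model that did not see it during training.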
For this investigation we also assessed the accuracy of an algorithm based upon Google's Inception-V4 CNN Architecture (19) which allows a comparison between our trained versions of the algorithm and an existing pre-trained naive approach.
Receiver operating characteristic (ROC) curves with bootstrapped estimation (1,000 repetitions) were used to examine the overall diagnostic accuracy of the algorithm for each cancer type and the precursors (20). Area under the ROC curve (AUROC) was regarded as the most appropriate overall indicator of accuracy. Sensitivity, specificity and other diagnostic indicators were estimated for each lesion type at the most efficient decision threshold, where the threshold was determined as the point on the continuum which provided the closest balance between sensitivity and specificity. It should be noted that this approach assumes that false positives and false negatives have equal 'value'; this is unlikely to be valid in a clinical context, but it is appropriate for the assessment of accuracy in the context of this study.
The analysis was informative, but it does not accommodate the way in which the algorithm might be used in practice as a 'virtual dermatologist'. In this clinical context any assessment that produced a positive result for melanoma would be referred for biopsy and histopathology assessment. If the initial result was negative for melanoma, then the algorithm would assess for SCC and continue sequentially through the severity-ordered lesions shown in Fig. 1. This approach allows each stage of assessment to occur and reflects potential severity for each outcome. We simulated this process to mimic the application of the algorithm more realistically, and used this severity-ordered sequential assessment to estimate the effective sensitivity and specificity for the skin cancers and precursor lesions.
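The accuracy measures described above can be illustrated with a short sketch. It is not the analysis code used for the study: the function names are hypothetical, the AUROC is computed with a simple pairwise comparison that would be slow on the full dataset, and a percentile bootstrap interval is assumed because the text states only that 1,000 bootstrap repetitions were used.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, reps=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUROC (assumed method)."""
    n = len(labels)
    if not 0 < sum(labels) < n:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    stats = []
    while len(stats) < reps:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * reps)], stats[int((1 - alpha / 2) * reps) - 1]


def balanced_threshold(labels, scores):
    """Return (threshold, sensitivity, specificity) where the two are closest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)):
        # A score at or above the threshold is called positive.
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or abs(sens - spec) < abs(best[1] - best[2]):
            best = (t, sens, spec)
    return best
```

Choosing the threshold where sensitivity and specificity are closest treats a false negative and a false positive as equally costly, which is the assumption noted in the text.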

Results
The distribution of the 16,550 lesions is shown in Table 1. The most frequent were benign lesions of various types (57.8%), followed by melanoma (14.5%) and BCC (7.50%). The total number of skin cancer lesions was 5,042 (30.0%).
Table 2 shows the accuracy of the assessment of the algorithm for each of the lesion types. The least accurate assessment was for melanoma (AUROC = 0.952) and the most accurate was for dermatofibroma (AUROC = 0.994). All of the AUROC estimates were greater than 95%.
The severity-ordered sequential assessment is shown in Table 3. It can be seen that all lesions are assessed for melanoma and those that are negative for melanoma are assessed for SCC et seq.; the number of lesions assessed at each step can be seen from the column 'N'. It is clear that the exclusion of the lesions that were positive for melanoma has significantly improved the accuracy of DERM for BCC and intra-epidermal carcinoma but not for SCC.
Table 4 summarizes the overall performance with respect to false negatives and positives from the melanoma assessment. Twenty-seven of the false negative melanoma results were assessed as true SCC (9.82%), 31 (11.3%) were assessed as true BCC and none were assessed as true intra-epidermal carcinoma, so that 21% of the total 275 false negative findings would be referred for excision and biopsy even though their true melanoma was not detected. A further 101 were assessed by the algorithm as actinic keratosis (n = 17, 6.03%) or dysplastic nevus (n = 84, 30.6%). Most of these would be removed using surgical or pharmaceutical treatments; otherwise they would be monitored according to current clinical practice (21,22). Overall, 159 of the 275 histopathology positive melanomas missed by DERM would be managed in a clinically appropriate manner (0.96% of the total images; 3.2% of all cancer images and 57.8% of the histopathology positive melanomas missed by DERM).
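The severity-ordered sequential assessment summarised in Table 3 can be sketched as a simple cascade. This is an illustration under stated assumptions, not the DERM implementation: the step order after melanoma and SCC is inferred from the order in which the lesion types are discussed in the text (Fig. 1 gives the actual sequence), and the per-lesion scores, thresholds and function names are hypothetical.

```python
# Assumed severity order; only "melanoma first, then SCC" is stated explicitly.
SEVERITY_ORDER = [
    "melanoma",
    "SCC",
    "BCC",
    "intra-epidermal carcinoma",
    "actinic keratosis",
    "dysplastic nevus",
]


def sequential_assessment(scores, thresholds, order=SEVERITY_ORDER):
    """Return the first lesion type, in severity order, whose score reaches its threshold."""
    for lesion in order:
        if scores[lesion] >= thresholds[lesion]:
            return lesion
    return None  # negative at every step of the cascade


def lesions_reaching_each_step(all_scores, thresholds, order=SEVERITY_ORDER):
    """Count the lesions still unassigned when each step is applied (the 'N' column)."""
    remaining = list(all_scores)
    counts = {}
    for lesion in order:
        counts[lesion] = len(remaining)
        # Lesions called positive at this step leave the cascade.
        remaining = [s for s in remaining if s[lesion] < thresholds[lesion]]
    return counts
```

A melanoma that is scored negative at the first step stays in the cascade, so it can still be flagged at a later step, which is how some missed melanomas would nonetheless be referred or monitored.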
Table 5 shows the contrast between the Inception V4 algorithm and the initially trained version of the DERM algorithm. A version of the Inception V4 CNN which had been pre-trained to perform large-scale image recognition was retrained using the same data set which was used to train the latest version of DERM. It is clear that the DERM CNN vastly outperforms the Inception V4 CNN at the task of identifying skin lesions from dermoscopic images.
Table 6 shows the AUROC for four development versions of the DERM algorithm in sequential order. Differences in the algorithm between development versions include improvements to the training methodology, changes to the neural network architecture, and the inclusion of additional training data. All versions of the DERM algorithm were assessed using the same data set, although the older versions used less training data. There was a statistically significant improvement in the area under the ROC curve over time for melanoma and for actinic keratosis and dysplastic nevus, but not for BCC, SCC and intra-epidermal carcinoma. The latter three lesions begin with very high levels of accuracy, so this lack of improvement in accuracy with further training may be a consequence of a ceiling effect.

Discussion
The results of this study, which adds to our other evaluations (9,10), indicate that the DERM algorithm is capable of detecting skin cancers and potential precursors from images captured by cameras that are in common use and that require inexpensive modification and little operator training. The level of accuracy is similar to that of a specialist dermatologist, with AUROC ranging from 0.952 for melanoma to 0.987 for BCC. In addition, given the sequential nature of the algorithm assessment going from most serious (melanoma) to least serious (dysplastic nevus), only 3.2% of the cancers and precursors would not be referred for biopsy or clinical follow-up.
While continued development of the algorithm improved the AUROC for all lesion types, the improvement was statistically significant for only three (melanoma, actinic keratosis and dysplastic nevus). SCC and BCC approached statistical significance but, as both started with an AUROC greater than 0.98, only marginal improvement was possible, which we attribute to a ceiling effect given that the upper limit of the AUROC is one.
DERM has improved over four development versions and therefore has the potential to continue to improve with the addition of more clinical data and refinement of the algorithm. This is one of the strengths of artificial intelligence.
An issue for this study is that we are following the convention of assuming that histopathology is the gold standard against which DERM should be judged. As Claassen pointed out in 2005 this is not intended to imply that the gold standard is without error. (23) For melanoma biopsies two studies show that concordance for melanoma between pathologists is about 75%. (24,25) It is therefore possible that some of the errors concerning the DERM assessment are because of errors in the gold standard which this study cannot determine.
A limitation of artificial intelligence applications is that their adoption in clinical practice requires much more than a well-developed, validated algorithm.
1. Access to specialist dermatology diagnosis is not universal. People who live in rural and remote areas in most countries have access to primary care physicians but, as the US Preventive Services Task Force report makes clear, the accuracy of skin cancer detection in primary care is poor, and this is supported by the recent Cochrane Review. Access to DERM might mean that fewer patients from remote areas will be identified for distant specialist review and that they would be more likely to have skin cancer that requires specialist care.
2. Prior assessment of lesions by DERM might allow for more accurate triage of patients referred from primary care to secondary dermatology clinics.
3. A third scenario is that DERM would allow a rapid response second opinion for dermatologists in secondary dermatology clinics.

Conclusions
Our study suggests that the use of a trained AI algorithm can be integrated into both primary and secondary care settings in a way that will improve the accuracy of diagnostic skin cancer assessment and reduce the number of unnecessary biopsies referred for histopathology review.

Declarations
Ethics approval and consent to participate
This study did not require patient participation and so no ethics approval was necessary. All observations were made using freely available digital images of skin lesions or healthy skin which had been obtained from consenting adults. Those images which were derived from clinical trials had received ethics approval.

Consent for publication
Not relevant.

Availability of data and materials
The images used for this study were derived from published databases.

Competing interests
JG is an employee of Skin Analytics Ltd.

Funding
The statistical analysis was funded by the Royal Perth Hospital Research Foundation, Perth, Western Australia (https://www.rphresearchfoundation.org.au/). The Foundation had no role in the conduct of the study or the reporting of the results.

Authors' contributions
JG designed the machine learning algorithm and participated in the drafting of the manuscript. MP conducted the statistical analysis and wrote the initial draft of the manuscript.