Deep Skin Cancer Model Based on Knowledge Distillation Technique for Skin Cancer Classification

Skin cancer is more treatable in the early stages, since it spreads slowly to other parts of the body. Early detection is essential due to the rising number of skin cancer cases, as well as the high fatality rate and high cost of medical treatment. Deep Learning algorithms have lately been used to improve the performance of a variety of biomedical image processing modalities. Their structures have been developed to solve classification challenges that suffer from a lack of training data on skin cancer infections (actinic keratosis, basal cell carcinoma, dermatofibroma, melanoma, nevus, pigmented benign keratosis, seborrheic keratosis, squamous cell carcinoma, and vascular lesions). In this work, we propose a Deep Skin Cancer (DSC) model based on the Knowledge Distillation technique and several optimization algorithms. The experimental results on the benchmark datasets from the International Skin Imaging Collaboration (ISIC) show superior performance in skin cancer diagnosis. According to the reported results, a sensitivity of 99.16% and a specificity of 99.57% were empirically achieved with the Adamax optimizer, despite a lack of labeled input image data. Furthermore, the results assist in diagnosing some COVID-19 cases due to the similarity between skin cancer infection and the black fungus found in some COVID-19 survivors, particularly those with co-morbid conditions similar to skin cancer infection.


Introduction
Given that the skin is the body's largest organ, it is natural that skin cancer is the most prevalent type of cancer in humans. Melanoma Skin Cancer (MSC) affects only 1% of all patients, according to the American Cancer Society (ACS), but it is associated with a higher death rate; its presentation is similar to that of some COVID-19 survivors' cases [1][2][3].
© The Author(s), 2021. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third-party material in this article are included in the article's Creative Commons licence unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
It is a type of cancer that originates in cells called melanocytes. Computer-based technology helps in diagnosing skin cancer infections in a more comfortable, less expensive, and faster manner [4][5]. Multiple noninvasive procedures are available to analyze skin cancer symptoms and determine whether a patient has melanoma. The difficulty of classifying skin cancer from images has been vastly reduced since the deployment of Deep Neural Network (DNN) models [6][7]. In this paper, a Deep Skin Cancer (DSC) model is proposed for detecting the visual characteristics of skin lesions. Knowledge Distillation (KD) is a form of model compression in which a small model is trained to resemble a larger model that has already been trained [8][9]. In this training stage, the large model acts as the teacher and the small model acts as the student; this setup is sometimes referred to as teacher-student training [10][11]. The majority of DL models are computationally too expensive to run on mobile phones or embedded devices [12][13]. Because KD is a type of model compression, the student model is frequently smaller than the teacher model. The student normally learns significantly faster and more reliably as a result of the losses that have a greater effect on the performance of the DSC model, as shown in Figure (1). The DSC model's main contributions can be summarized in the following points: (i) proposing a DNN model that depends on KD techniques to classify skin cancer infections; (ii) utilizing KD techniques in designing the proposed model, which results in an accurate model with high generalization ability; (iii) exploring the effects of using different optimizers, including Adagrad, Adam, Adamax, Nadam, RMSprop, and SGD, in a process fully guided by medical experts to validate the proposed model and the final results [14].
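To make contribution (iii) concrete, the following is a minimal pure-Python sketch comparing two of the listed update rules, SGD and Adam, on a toy one-dimensional quadratic. This is illustrative only and is not the paper's training code; the hyperparameter names (lr, beta1, beta2, eps) follow common conventions rather than values reported here.

```python
# Illustrative only: SGD vs. Adam update rules on f(x) = x**2 (gradient 2*x).

def sgd_minimize(x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x          # plain gradient step
    return x

def adam_minimize(x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * x                                # gradient of x**2
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (v_hat ** 0.5 + eps)   # adaptive step
    return x
```

Both runs drive x from 5.0 toward the minimum at 0; the difference in their trajectories is the kind of behavior the optimizer comparison in this work examines on real training losses.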
(iv) The proposed model was statistically compared with other optimization algorithms, which shows that it is robust against overfitting [15]. This paper is organized as follows: Section 2 reviews related works, Section 3 describes the methods, Section 4 presents the experimental results, and Section 5 concludes the paper.

Related Works
DL has had a lot of success in biomedical engineering recently. It reduces the need for feature engineering by learning and extracting meaningful features from raw data automatically. Many fields, particularly computer vision, have been transformed by DL, as shown in Table 1. For classifying skin lesions, Datta et al. [16] compared the performance of VGG-19, ResNet, InceptionResNetV2, and DenseNet architectures with and without the Soft-Attention technique. On the HAM10000 dataset, the network combined with Soft-Attention outperforms the baseline by 4.7 percent, reaching a precision of 93.7 percent. Mahboda et al. [17] created a baseline classifier as the reference model and then, in both the training and test phases, investigated the classification performance using either manually or automatically constructed segmentation masks in various settings. Hosny et al. [18] proposed a CAD system for skin lesions using the ISIC2019 dataset. This dataset has a number of flaws, including unequal classes. The authors utilized a bootstrap weighted classifier with a multiclass SVM; this classifier modified the weights based on the image class. They gave GoogleNet a new class to train with, each run using a varying quantity of unknown photos obtained from diverse sources. Hameed et al. [19] proposed a skin lesion categorization system based on a multiclass multilevel algorithm, in which traditional machine learning and deep learning approaches were used with the proposed model. Hasan et al. [18] proposed DSNet, a semantic segmentation network for skin lesions. They employed depth-wise separable convolution to minimize the number of parameters, resulting in a lightweight network. In existing reports on skin cancer datasets, transferring features extracted from pre-trained models appears to be preferable.
However, for skin cancer datasets, the trend of model compression, in which a larger pre-trained model is used to allow a smaller model to learn complicated characteristics while minimizing computation and memory costs, has not yet been examined. In particular, a vast and complicated network or ensemble model is first trained and extracts significant feature information from the given data, producing targeted predictions. This more complicated model is then used to train a small network.
The small model can yield equivalent findings or mimic the outcomes of the larger model.

Methods
We propose a DSC model based on models pre-trained on the ImageNet dataset for medical imaging tasks. The model can overcome the challenge of a lack of training data, especially in the medical imaging domain. In the conventional KD model, knowledge is encoded and conveyed in the form of softened class scores [22]. The student model's total training loss is given by (1):

L = ∝ · L_CE(y, σ(z_s)) + (1 − ∝) · T² · L_CE(σ(z_t / T), σ(z_s / T))    (1)

where L_CE represents the cross-entropy loss; y represents the one-hot vector of ground truths; σ is the softmax function; z_s and z_t are the output logits of the student and teacher models, respectively; ∝ is a balancing hyperparameter; and T is the temperature hyperparameter. The student model was trained utilizing the predictions of the teacher model as well as the ground-truth hard labels, as illustrated in the standard KD setup of Fig. 1. However, it is widely accepted that reversing the KD operation will not considerably improve the teacher, because the student model is incapable of learning and transferring relevant knowledge.
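The distillation loss described above, combining the hard-label cross-entropy with a temperature-softened teacher-student term, can be sketched numerically as follows. This is a minimal NumPy illustration under common KD conventions; the function names and default hyperparameter values are ours, not the paper's.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax with a max-shift for numerical stability."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(y_onehot, z_s, z_t, alpha=0.5, T=4.0):
    """Total student loss: alpha * hard CE + (1 - alpha) * T^2 * soft CE,
    where the soft term matches softened student probs to softened teacher
    probs, following the standard KD formulation."""
    eps = 1e-12
    p_s = softmax(z_s)                       # student probabilities at T = 1
    hard = -np.sum(y_onehot * np.log(p_s + eps), axis=-1)
    p_t_T = softmax(z_t, T)                  # softened teacher targets
    p_s_T = softmax(z_s, T)                  # softened student predictions
    soft = -np.sum(p_t_T * np.log(p_s_T + eps), axis=-1)
    return alpha * hard + (1.0 - alpha) * T**2 * soft
```

With alpha = 1 the teacher is ignored and the loss reduces to plain cross-entropy on the hard labels; the T² factor keeps the gradient magnitudes of the soft term comparable to the hard term as T grows.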
In the training phase, the dataset is split into batches. The input images, with a size of (224, 224) pixels, first pass through two standard convolutional layers applied in sequential order; the first convolution has a kernel size of 2×2 and a filter number of 128. The accuracy and cross-entropy of the DSC model are determined as follows: accuracy is the number of correctly identified cases in the overall dataset, i.e., the true predictions the model makes out of all predictions. Utilizing a poorly-trained teacher model that has been trained for only 50 initial epochs may yield poorer results than standard or reversed KD techniques. Finally, compared with all of the preceding approaches, self-training the model may yield better outcomes: when trained from itself with a 90% accuracy requirement, for example, the model learns with a 10% error from its own softened class targets. To address these problems, we used the KD technique (Fig. 1) to conduct all trials related to the five key training strategies, including standard KD (training a teacher model to teach a student model), reversed KD (training a student model to teach a teacher model), and self-training KD (training a model to teach itself). To carry out all KD training procedures, we chose six types of DNN models with identical input sizes, based on ResNet-50, to assess the proposed training methods.
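The accuracy and cross-entropy measures described above can be stated compactly in code. This is a minimal pure-Python sketch; the function names and argument layout are ours, not the paper's.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of correctly identified cases out of all predictions."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def cross_entropy(y_true, probs, eps=1e-12):
    """Mean negative log-probability assigned to the true class.
    y_true: list of class indices; probs: list of probability vectors."""
    total = 0.0
    for t, p in zip(y_true, probs):
        total += -math.log(max(p[t], eps))  # clamp to avoid log(0)
    return total / len(y_true)
```

A perfectly confident, correct prediction contributes zero cross-entropy, while a confident wrong prediction is penalized heavily, which is why cross-entropy rather than accuracy is minimized during training.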

Experimental
The results of our proposed method confirm its effectiveness; its performance was compared with that of each pre-trained model individually.

Dataset
The ISIC dataset is used, which includes 2357 images of malignant and benign oncological diseases. All data were sorted according to the classification achieved with the ISIC dataset, and all subgroups were divided into the same number of images, as shown in Figure 2, with the exception of melanomas and moles, whose images are slightly dominant. The illnesses included in the data collection are actinic keratosis, basal cell carcinoma, dermatofibroma, melanoma, nevus, pigmented benign keratosis, seborrheic keratosis, squamous cell carcinoma, and vascular lesion [23][24].
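Because the class sizes are not perfectly balanced, a per-class (stratified) train/validation split keeps every lesion type represented in both partitions. The following is a minimal pure-Python sketch of such a split; the helper name, parameters, and file-naming scheme are illustrative, not from the paper.

```python
import random
from collections import defaultdict

# The nine lesion classes listed for the ISIC dataset used in this work.
CLASSES = ["actinic keratosis", "basal cell carcinoma", "dermatofibroma",
           "melanoma", "nevus", "pigmented benign keratosis",
           "seborrheic keratosis", "squamous cell carcinoma",
           "vascular lesion"]

def stratified_split(samples, val_frac=0.2, seed=42):
    """samples: list of (image_path, class_name) pairs. Splits each class
    separately so under-represented lesion types appear in both partitions."""
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)
    rng = random.Random(seed)
    train, val = [], []
    for label, paths in by_class.items():
        rng.shuffle(paths)
        k = max(1, int(len(paths) * val_frac))  # at least one per class
        val += [(p, label) for p in paths[:k]]
        train += [(p, label) for p in paths[k:]]
    return train, val
```

Fixing the random seed makes the split reproducible across experiments, which matters when comparing optimizers on the same partitions.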

Evaluation of Results and Metrics
The proposed DSC model was developed and trained on an NVIDIA GeForce GTX 1080Ti. The DSC model is assessed by ten evaluation measures, including accuracy, sensitivity or True Positive Rate (TPR), specificity or True Negative Rate (TNR), and precision. The sensitivity, specificity, precision, negative predictive value, false positive rate, false discovery rate, false negative rate, accuracy, F1-score, and Matthews Correlation Coefficient are evaluated for the proposed model (Figure 1); the error rates among them should be kept to a minimum to maintain a maximum value of fitness (F). These measures are computed from the number of times the positive class is correctly classified (TP), the number of times the negative class is correctly classified (TN), the number of cases incorrectly predicted as positive (FP), and the number of cases incorrectly predicted as negative (FN). The DSC model is based on the ResNet-50 model. It is also advised that some weights in the convolutional layers be allowed to be readjusted in order to adapt to the task, and the model has been trained with several optimization algorithms (Adagrad, Adam, Adamax, Nadam, RMSprop, SGD). For Adagrad, the learning rate is modified such that it automatically decreases, because the sum of the squared past gradients keeps increasing after every time step. Figure 3 shows accuracy and cross-entropy for correctly identified instances, and Table 3 lists the general parameter values for the mentioned optimizer algorithms. The Adam optimizer uses a learning rate of 0.05 and (64 x 64) image batches, with each image batch requiring 2K gradient updates; model evaluation uses accuracy and cross-entropy for correctly identified instances, as shown in Figure (4). The Adamax optimizer achieves the best accuracy and loss metrics, as shown in Figure (5), with lr = 0.01, β1 = 0.9, and β2 = 0.999.
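All ten evaluation measures listed above derive from the four confusion-matrix counts (TP, TN, FP, FN). The following is a minimal sketch of those derivations; the function name and return format are ours, not the paper's.

```python
import math

def metrics(tp, tn, fp, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    sens = tp / (tp + fn)              # sensitivity / TPR / recall
    spec = tn / (tn + fp)              # specificity / TNR
    prec = tp / (tp + fp)              # precision / positive predictive value
    npv  = tn / (tn + fn)              # negative predictive value
    acc  = (tp + tn) / (tp + tn + fp + fn)
    f1   = 2 * prec * sens / (prec + sens)
    mcc  = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"sensitivity": sens, "specificity": spec, "precision": prec,
            "npv": npv, "fpr": 1 - spec, "fdr": 1 - prec, "fnr": 1 - sens,
            "accuracy": acc, "f1": f1, "mcc": mcc}
```

Note that FPR, FDR, and FNR are simply the complements of specificity, precision, and sensitivity, so minimizing the error rates and maximizing the corresponding measures are the same objective.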
Figure (6) shows loss and accuracy for the Nadam optimizer with lr = 0.01, β1 = 0.9, β2 = 0.999, and schedule decay = 0.004. RMSprop reaches a robust evaluation with lr = 0.01, β1 = 0.9, and a rho factor (the discounting factor for the incoming gradient) of 0.9, as shown in Figure (7). The SGD optimizer attains strong metrics, as shown in Figure (8). Table 4 lists all metric measures for all optimization algorithms with KD techniques.
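As an illustration of the rho discounting factor mentioned above, the following is a minimal pure-Python sketch of the RMSprop update rule applied to a toy quadratic. It is not the paper's training code; only the lr and rho values mirror those reported.

```python
# Illustrative only: RMSprop on f(x) = x**2 with lr = 0.01, rho = 0.9.

def rmsprop_minimize(x0, lr=0.01, rho=0.9, eps=1e-8, steps=500):
    x, v = x0, 0.0
    for _ in range(steps):
        g = 2 * x                          # gradient of x**2
        v = rho * v + (1 - rho) * g * g    # discounted moving average of g^2
        x -= lr * g / (v ** 0.5 + eps)     # step scaled by RMS of gradients
    return x
```

Dividing by the running RMS of the gradients keeps the step size roughly uniform across parameters with very different gradient scales, which is the property that makes RMSprop robust in the comparison above.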

Discussion And Future Works
Pre-trained models were shown to be suitable for perceiving skin cancer images as well as performing well in competitive knowledge distillation. To categorize eight skin cancer classes, knowledge is transferred from large, highly regularized models into smaller ones, as well as from a model into itself. First, despite our efforts to evaluate a comprehensive ISIC dataset, simulating the practical clinical challenge of handling over 2357 images, correctly visualizing and discriminating the 8 classes using a deep learning framework proved difficult because the database was unbalanced and poorly supervised. Our extensive experiments confirmed the utility of KD methods in the classification of skin cancer cases. Although we established KD's superior performance in terms of categorization results, the KD model collected instance features as distilled knowledge from specific layers of the teacher models without addressing the instances' relationships to the student models or the inference technique. It is challenging for student models to directly fit all of the teacher's layer outputs. As a result, new KD designs are needed to help reduce intra-class variances while amplifying inter-class differences in the feature space, as well as to prevent major performance drops when teachers and students have distinct architectures. Our experiments demonstrated the feasibility of implementing various KD training strategies, implying that the self-training KD method can improve the targeted models into which the distilled knowledge is transferred when selecting superior teachers is difficult or when computation resources are limited.

Conclusion
This paper proposes a DSC model using the concept of knowledge distillation and a variety of parameters to assess which of the eight types a given set of skin cancer images belongs to. The measurements show that the classifier is consistent. The ResNet-50 architecture consistently performs well, suggesting that it generalizes well to cancer classification in images, achieving a diagnostic accuracy of 0.9936. This architecture has the highest accuracy compared to the others, and despite occasional misclassifications, it performed admirably. In the future, once a larger set of high-resolution images has been obtained, this investigation will be conducted on a series of skin cancer images of patients. The online version contains supplementary material available at https://www.kaggle.com/nodoubttome/skin-cancer9-classesisic.