Dataset
Table 1
Classes and number of tongue lesion samples

Class                            | Number of samples
Class 0 (Cancer)                 | 372
Class 1 (Precancer)              | 141
Class 2 (Benign or Inflammation) | 144
Class 3 (Normal)                 | 1,153
Total                            | 1,810
In this study, 1,810 tongue images were collected from adult patients over age 20 who visited the Department of Oral and Maxillofacial Surgery from January 2006 to December 2020. The study was approved by the Institutional Review Board of the Seoul National University Dental Hospital (ER121036). Images were divided into four categories: malignant tumors, precancerous lesions, benign or inflammatory lesions, and normal (Table 1). Two board-certified surgeons, each with more than five years of clinical experience, classified the tongue images. For the test set, 10% (180 images) of the whole dataset was randomly selected by stratified sampling. The remaining images were split into a training set (70%, 1,141 images) and a validation set (30%, 489 images). We defined the base model and tuned hyperparameters by comparing results on the validation set; the test set was used only to evaluate the final model.
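A minimal sketch of this split, assuming the images and labels are already loaded as NumPy arrays (the scikit-learn splitter, the stratified train/validation split, and all variable names here are illustrative; the paper does not state which tools were used):

```python
# Hypothetical sketch: 10% stratified test split, then a 70/30 split of the
# remainder into training and validation, matching the counts reported above.
from sklearn.model_selection import train_test_split

# X: (1810, 224, 224, 3) image array; y: integer class labels in {0, 1, 2, 3}
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)    # 180 test images
X_train, X_val, y_train, y_val = train_test_split(
    X_trval, y_trval, test_size=0.30, stratify=y_trval,
    random_state=42)                                      # 1,141 train / 489 validation
```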
Models
EfficientNet [7] served as the backbone model for the classification. It achieves accuracy comparable to several conventional CNN models, such as ResNet50, DenseNet201, and Inception-ResNet-v2, while using far fewer parameters and less computation. In addition, pretrained weights learned from a large-scale dataset (e.g., ImageNet) are available, which enables a transfer learning approach.
VGG16 and Inception-ResNet-V2 were used as comparative models; both are representative CNN-based models that have been widely adopted in medical image classification tasks, and, like EfficientNet, both provide pretrained models for transfer learning.
Among the EfficientNet models, the most basic one, EfficientNetB0, was adopted in this study because of the small dataset. VGG16 and Inception-ResNet-V2 have 14,718,788 and 54,349,028 parameters, respectively, whereas EfficientNetB0 has the fewest at 4,059,815. The input shape of all models was fixed at 224 × 224 for a fair backbone comparison (Fig. 1).
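A minimal sketch of how such a backbone could be assembled in Keras with a four-class head at the 224 × 224 input (the pooling and dense head layers are assumptions; the paper does not describe its classifier head):

```python
# Hypothetical sketch: ImageNet-pretrained EfficientNetB0 with a 4-class softmax head.
import tensorflow as tf

base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3))                  # pretrained feature extractor
x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
out = tf.keras.layers.Dense(4, activation="softmax")(x)
model = tf.keras.Model(base.input, out)
```

Swapping in `tf.keras.applications.VGG16` or `InceptionResNetV2` at the same input shape yields the comparative models.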
Transfer Learning and Fine-Tuning
Transfer learning is effective in image classification tasks because many datasets share low-level spatial characteristics, such as lines, points, and textures. These low-level characteristics are learned best by training a network on a large dataset, for example, ImageNet. Transfer learning can therefore provide performance gains for medical deep learning applications, which are prone to struggle with limited datasets [8]. Transfer learning was conducted with the Adam optimizer on a pretrained EfficientNetB0 model; the maximum number of epochs, the mini-batch size, and the initial learning rate were 60, 8, and 1e-2, respectively. After transfer learning, fine-tuning was performed to adapt the model to the current dataset, unfreezing the network from the last block backward. Because the gap between the ImageNet dataset and tongue cancer images turned out to be large, the feature extractor was ultimately retrained as well (Fig. 2).
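A hedged sketch of this two-stage schedule, reusing the hypothetical `model` and `base` from the previous snippet (the stage-2 learning rate is an assumption; the paper only reports the initial settings):

```python
# Stage 1 (transfer learning): freeze the pretrained backbone, train the head
# with the reported settings (Adam, 60 epochs, batch size 8, initial LR 1e-2).
base.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-2),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=60, batch_size=8)

# Stage 2 (fine-tuning): unfreeze the feature extractor and retrain; the lower
# learning rate here is a common choice, not stated in the paper.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=60, batch_size=8)
```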
Data Augmentation
ImageDataGenerator from the Keras framework was used to generate batches of tensor image data with real-time data augmentation. Inputs were rotated by random angles between -30° and 30°, randomly zoomed to 80–100%, and flipped horizontally and vertically. They were also randomly shifted horizontally and vertically by up to 10% of the total width and height, respectively. Points outside the input boundaries were filled with the nearest pixels.
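These settings map directly onto ImageDataGenerator arguments; a sketch using only the values stated above:

```python
# Sketch of the augmentation described above with Keras's ImageDataGenerator.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=30,        # random rotation in [-30°, 30°]
    zoom_range=[0.8, 1.0],    # random zoom to 80-100%
    horizontal_flip=True,
    vertical_flip=True,
    width_shift_range=0.1,    # horizontal shift up to ±10% of width
    height_shift_range=0.1,   # vertical shift up to ±10% of height
    fill_mode="nearest")      # fill out-of-boundary points with nearest pixels

train_flow = datagen.flow(X_train, y_train, batch_size=8)
```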
Weight Balancing
The amount of data per class was imbalanced: classes 0, 1, 2, and 3 comprised 372 (20.55%), 141 (7.79%), 144 (7.95%), and 1,153 (63.7%) images, respectively. When imbalanced data are used for deep learning, classes with few observations are learned poorly because training is dominated by classes with many observations [9]. This study used the weight balancing method to mitigate class imbalance by assigning a different weight to each class: classes with fewer samples receive higher weights, so that misclassifying them incurs a stronger penalty, and vice versa for classes with many samples. The class weights were specified according to the number of samples, as follows: class 0, 0.15; class 1, 0.4; class 2, 0.4; and class 3, 0.05.
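In Keras these weights can be passed directly to `fit`; a sketch using the weights reported above with the hypothetical objects from the earlier snippets:

```python
# Sketch: per-class loss weights from the text, applied during training.
class_weight = {0: 0.15, 1: 0.4, 2: 0.4, 3: 0.05}

model.fit(train_flow, validation_data=(X_val, y_val),
          epochs=60, class_weight=class_weight)
```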
Performance Evaluation
To evaluate the screening performance of the model, accuracy, precision, recall, and the F2 score were calculated. In early diagnosis and screening, missing a tumor is worse than raising a false alarm for a nonexistent one; the F2 score was therefore used, because it lowers the importance of precision and raises the importance of recall. The F2 score was computed from the weighted average precision and weighted average recall. Accuracy, precision, recall, and the F2 score are defined as follows (Table 2).
Table 2
Metrics used for performance evaluation and the corresponding definitions

Metric               | Definition
True Positive (TPi)  | the number of correctly recognized observations for class Ci
True Negative (TNi)  | the number of correctly recognized observations that do not belong to class Ci
False Positive (FPi) | the number of observations incorrectly assigned to class Ci
False Negative (FNi) | the number of observations not recognized as belonging to class Ci
$$\mathrm{accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
$$\mathrm{precision}=\frac{TP}{TP+FP}$$
$$\mathrm{recall}=\frac{TP}{TP+FN}$$
$$F_{2}\ \mathrm{score}=\left(1+2^{2}\right)\cdot\frac{\mathrm{precision}\cdot\mathrm{recall}}{2^{2}\cdot\mathrm{precision}+\mathrm{recall}}$$
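A sketch of how these metrics could be computed with scikit-learn, following the text's definition of F2 from the weighted average precision and recall (the library choice is an assumption):

```python
# Sketch: weighted-average metrics for the 4-class task on the test set.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = np.argmax(model.predict(X_test), axis=1)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="weighted")
recall = recall_score(y_test, y_pred, average="weighted")
# F2 from the weighted-average precision and recall, per the formula above.
f2 = (1 + 2**2) * precision * recall / (2**2 * precision + recall)
```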
Grad-CAM
Grad-CAM (gradient-weighted class activation mapping) is a post-hoc model-explanation tool [10]. It produces a heatmap indicating which part of the input image the model considered important for the classification, making CNN-based models more transparent and explainable. Grad-CAM uses the class-specific gradient information flowing into the last convolutional layer of the CNN to produce a coarse localization map of the regions that drive a particular decision. Fig. 3 shows an input image (left) and the corresponding heatmap (right); the red areas of the heatmap mark the regions most important to the model's classification of the input image.
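A minimal Grad-CAM sketch for the hypothetical Keras model above ("top_conv" is the name of EfficientNetB0's last convolutional layer in Keras; the helper itself is illustrative, not the paper's code):

```python
# Hypothetical minimal Grad-CAM for the model sketched earlier.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name="top_conv", class_index=None):
    """Return a heatmap in [0, 1] over the last conv layer's spatial grid."""
    # Model mapping the input to (last conv activations, predictions).
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)       # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))    # one weight per channel
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()

# e.g., heatmap = grad_cam(model, X_test[0]); upsample to 224 x 224 and overlay
# on the input to produce a Fig. 3-style visualization.
```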