In the past few decades, automated interpretation of the tongue was performed using conventional feature extraction algorithms and statistical methods. Lo et al. and Hsu et al. used conventional image processing techniques to detect tongue features and locate the corresponding regions; however, these studies did not describe their assessment methods and results in detail [10, 11]. In recent years, artificial intelligence (AI) has been actively applied to medical technology, and deep learning has made significant progress in image processing, eliminating the need for image processing experts to extract image features manually [12]. Furthermore, with transfer learning, a deep learning model pretrained on one large data set can often be easily adapted to a different large data set to interpret image categories. In this study, a pretrained model was applied to the classification of tongue features, and the tongue features were marked directly on the tongue images.
Some studies have applied deep learning to the analysis of tongue images, but deep learning has yet to be applied to the clinical interpretation of tongue diagnoses in TCM. For example, Meng et al. designed the CHDNet model, which combined deep learning and support vector machine classifiers to extract and classify tongue features [13]. However, the digital features extracted by this model did not visualize the tongue features described in TCM; consequently, they could not be applied to clinical inspection diagnosis. In addition, the classification results were limited to “gastritis” or “no gastritis”, which correspond to neither a disease name nor a diagnosis in TCM. Hou et al. analyzed tongue color using deep learning and outperformed conventional methods [14]. The present study applied deep learning visualization techniques to tongue diagnosis, using specific tongue features reported by TCM physicians as an example, to determine whether those features existed and to locate the regions where they were distributed. We used the well-known deep learning model ResNet50 to detect and localize tongue fissures [17]. To our knowledge, this is the only study that has applied deep learning visualization techniques to tongue diagnosis in TCM.
Deep learning has been widely used in image classification. Taking tongue fissures as an example, a deep neural network model, once trained, can determine whether a tongue image contains tongue fissures, but it cannot provide a clear basis for users to understand its reasoning. Essentially, the model functions as a black box that does not allow intuitive interpretation. For example, if the model identifies a certain tongue image as having a fissure but the user visually interprets it as having none, the user cannot understand why the model arrived at that interpretation. One explanation might be that the fissures identified by the model are faint and cannot be easily spotted through manual inspection. On the other hand, if the model makes an incorrect judgment, it is very difficult for both the user and the engineer to correct the model. Therefore, many studies have attempted to increase the interpretability of deep neural networks through various types of algorithms, such as visualization. For instance, Zhou et al. proposed Class Activation Mapping (CAM), which can locate class-specific regions in images [15]. However, CAM requires the neural network to satisfy specific structural requirements, i.e., a fully convolutional network followed by a global average pooling layer and then a linear prediction layer; hence, the network model often requires modification. In a later study, Selvaraju et al. proposed Gradient-weighted Class Activation Mapping (Grad-CAM), a generalization of CAM that can be applied to any existing neural network model without modifying or retraining it [16]. Grad-CAM uses the gradient information flowing into the last convolutional layer to weight the importance of each neuron, and produces heat maps showing the degree of correlation between each region and the target class; for example, red represents high relevance, while blue represents low relevance.
This study had some limitations. If tongue segmentation were first performed on the tongue images, so that only the tongue region is retained, fissure-like textures outside the tongue would not interfere with the learning process of the neural network, and the localization of tongue fissures should be more accurate. However, the quality of localization cannot be accurately assessed, because large numbers of tongue fissure images that have been recognized by the TCM academic community, or on which consensus has been reached and annotated, are currently not available as ground truths. As a result, inter-observer agreement is not high, and it is not easy to obtain consistent ground truths [9]. Nevertheless, automatic tongue fissure localization can still be used as a screening tool in medical environments without experienced TCM physicians.