Patient selection and data preprocessing
In this study, the dataset was obtained from 41,056 patients diagnosed in the Department of Otolaryngology of Shenzhen Bao'an District People's Hospital between July 2016 and August 2019. Patients typically had their eardrums and external auditory canal (EAC) photographed with a conchoscope at each visit. The images were acquired with standard endoscopes (Matrix E2, XION GmbH, Berlin, Germany) tethered to Olympus CV-170 digital endovision camera systems (Olympus Corporation, Tokyo, Japan), at a resolution of 586 × 583 pixels. To unify the image data while keeping the original shape, we uniformly cropped and scaled the images to 448 × 448 pixels (1:1 aspect ratio). We selected 20,542 images, about 53.55% of the candidate images; 11,797 (57.43%) were from male patients and 8,745 (42.57%) from female patients, as shown in Table 1. By age, the (0,10] year group was the largest (22.187%), followed by the (30,40] (21.169%) and (20,30] (20.861%) year groups. The images were randomly split into a training set (80%) and a validation set (20%); the two sets do not overlap and were kept identical across all trained models.
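As an illustration, the cropping/scaling and the random 80/20 split could be implemented as in the following Python sketch; the directory layout, file extension, and function names are assumptions for illustration, not the authors' actual pipeline.

```python
import random
from pathlib import Path
from PIL import Image

def preprocess(src_path, dst_path, size=448):
    """Crop the 586x583 frame to a square (1:1 ratio), then scale to 448x448."""
    img = Image.open(src_path).convert("RGB")
    side = min(img.size)                       # square center crop keeps the original shape
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side)).resize((size, size))
    img.save(dst_path)

# Hypothetical layout: one folder per class label under otoendoscopy/.
paths = sorted(Path("otoendoscopy/").glob("*/*.jpg"))
random.seed(0)
random.shuffle(paths)
split = int(0.8 * len(paths))                  # 80% training, 20% validation
train_paths, val_paths = paths[:split], paths[split:]
```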
This study confirms that all methods were implemented in accordance with the relevant guidelines and regulations of the ethics committee of Shenzhen Bao'an District People's Hospital, and all experimental protocols were approved by that committee. Informed consent was obtained from all subjects; for subjects under 18 years old, informed consent was obtained from a parent and/or legal guardian.
Labelling of images
Image samples of the eardrum and EAC were divided into eight categories based on the Colour Atlas of Endo-Otoscopy [13], as shown in Figure 1. All images were classified by six ear specialists, each with more than six years of experience:
(1) Normal eardrum and EAC (NE, Figure 2): includes a completely normal eardrum as well as a normal eardrum with a healed perforation or some tympanosclerosis.
(2) Chronic suppurative otitis media (CSOM): There is a perforation of the pars tensa of the tympanic membrane, varying in size; most are single perforations. The residual tympanic membrane may show calcification, ulceration, and granulation tissue growth around the perforation margin.
(3) Cholesteatoma of the middle ear (CME): A retraction pocket of the pars flaccida can be seen, with white exfoliated epithelium inside the pocket.
(4) External auditory canal bleeding (EACB): Bright red blood is clearly visible in the external auditory canal.
(5) Impacted cerumen (IC): The external auditory canal is blocked by brownish-black or yellowish-brown masses. The cerumen masses vary in texture: some are loose like mud, others hard like stone.
(6) Otomycosis external (OE): The external auditory canal and tympanic membrane are covered with yellow-black or white powdery or villous fungal masses. The short process of the malleus protrudes conspicuously.
(7) Secretory otitis media (SOM): The tympanic membrane is retracted and the handle of the malleus is displaced backward and upward. When the tympanic cavity contains effusion, the tympanic membrane loses its normal luster and appears light yellow, oily orange, or amber; if the fluid does not fill the tympanic cavity, a fluid level can be seen through the tympanic membrane.
(8) Tympanic membrane calcification (TMC): Calcification of the tympanic membrane is deposited as white plaques located in the fibrous layer. The cause is unclear; it may be related to chronic inflammation, such as chronic otitis media, and it can be found in both intact and perforated tympanic membranes.
Training transfer learning network models
To extract features from white-light images of the eardrum and EAC for the automated detection of ear diseases, we used models typically applied to image classification in computer vision [14, 15]. Among deep learning models, ResNet [16] (ResNet50, ResNet101), DenseNet-BC [17, 18] (DenseNet-BC121, DenseNet-BC161, DenseNet-BC169), Inception-V3 [19], Inception-V4 [20], Inception-ResNet-V2 [20], and MobileNet-V2 [21] and MobileNet-V3 [22] were implemented, and their performance was compared under the same pipeline: inputting the image samples, training the network, and optimizing the network model. In each model, we replaced the fully connected layers with global average pooling, producing eight output nodes with a softmax activation function. Training used stochastic gradient descent [23] with a batch size of 100, 15 epochs, an initial learning rate of 0.01, momentum of 0.9, and weight decay of 10⁻⁴. The study was performed with the deep learning framework PyTorch [24] on four graphics processing units (Tesla K80, NVIDIA) in a Dell T640 workstation (Dell Inc., USA). For data augmentation during model training, we applied random horizontal and vertical flips to the input images.
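A minimal PyTorch sketch of this shared setup follows: the classifier head reduced to global average pooling and eight output nodes, the SGD hyperparameters stated above, and the flip augmentations. ResNet50 stands in as a representative backbone; the exact layer names differ per architecture, and ImageNet initialization is an assumption consistent with the transfer learning framing.

```python
from torch import nn, optim
from torchvision import models, transforms

# Random horizontal and vertical flips used for data augmentation.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
])

# Representative backbone; global average pooling feeds an 8-node
# classification layer (softmax is applied inside the loss).
model = models.resnet50(weights="IMAGENET1K_V1")
model.avgpool = nn.AdaptiveAvgPool2d(1)        # global average pooling
model.fc = nn.Linear(model.fc.in_features, 8)  # eight disease categories

criterion = nn.CrossEntropyLoss()              # log-softmax + negative log-likelihood
optimizer = optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=1e-4)
```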
First, we fed the image samples of the eardrum and EAC from the training set to each deep neural network in the PyTorch framework, while simultaneously monitoring the performance of the model being trained on the validation set. Training was stopped once the loss and accuracy had stabilized.
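Continuing the sketch above, one training pass followed by a validation pass might look like the following; train_loader and val_loader are assumed DataLoaders built from the 80/20 split, and model, criterion, and optimizer come from the previous snippet.

```python
import torch

def run_epoch(model, loader, criterion, optimizer=None, device="cuda"):
    """One pass over a DataLoader; weights are updated only when an optimizer is given."""
    training = optimizer is not None
    model.train(training)
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            logits = model(images)
            loss = criterion(logits, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * labels.size(0)
            correct += (logits.argmax(1) == labels).sum().item()
            seen += labels.size(0)
    return total_loss / seen, correct / seen

for epoch in range(15):  # 15 epochs; batch size 100 is set in the DataLoaders
    train_loss, train_acc = run_epoch(model, train_loader, criterion, optimizer)
    val_loss, val_acc = run_epoch(model, val_loader, criterion)
    # Training is stopped once the validation loss and accuracy stabilize.
```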
(1) Model structure adjustment
To make the spatial size of the image features shrink more gradually during the convolution operations of training, we added one dense block ([1×1 conv, 3×3 conv] ×6), identical to the first dense block of DenseNet-BC, to the DenseNet-BC [17, 18] framework. Taking DenseNet-BC161 as an example, one dense block ([1×1 conv, 3×3 conv] ×6) was added as the new first dense block, with an output size of 112 × 112 (tagged DenseNet-BC1615), as shown in Figure 3. Likewise, DenseNet-BC121 → DenseNet-BC1215 and DenseNet-BC169 → DenseNet-BC1695. Including the models described above, a total of 12 models were implemented and their performance compared.
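One way to reconstruct this modification is sketched below, assuming torchvision's DenseNet implementation; _DenseBlock and _Transition are private torchvision helpers, and the channel bookkeeping (restoring 96 feature maps so the original denseblock1 sees its expected input) is our reading of the described change, not the authors' published code.

```python
from torch import nn
from torchvision.models import densenet161
# Private torchvision helpers; relying on them here is an assumption.
from torchvision.models.densenet import _DenseBlock, _Transition

def build_densenet_bc1615(num_classes=8):
    base = densenet161(weights=None)  # DenseNet-BC161: 96 init features, growth rate 48
    # Extra dense block: 6 x [1x1 conv, 3x3 conv], same configuration as denseblock1.
    extra_block = _DenseBlock(num_layers=6, num_input_features=96,
                              bn_size=4, growth_rate=48, drop_rate=0.0)
    # The transition halves the spatial size (112 -> 56) and restores 96 channels,
    # so the original denseblock1 sees the geometry it expects.
    extra_trans = _Transition(num_input_features=96 + 6 * 48, num_output_features=96)
    features = nn.Sequential()
    for name, module in base.features.named_children():
        features.add_module(name, module)
        if name == "pool0":  # insert after the stem: 112x112 for 448x448 inputs
            features.add_module("denseblock0", extra_block)
            features.add_module("transition0", extra_trans)
    base.features = features
    base.classifier = nn.Linear(base.classifier.in_features, num_classes)
    return base
```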
(2) Selection of two appropriate models
The two most appropriate models were selected by evaluating accuracy and computation time across the 12 models. From the total of 20,542 otoendoscopic images, 80% and 20% were set up as the training and validation sets, respectively. The model optimization step, training-validation, was executed twice, on the training and validation sets respectively.
(3) Ensemble classifier
An ensemble classifier was constructed by combining the outputs of the two selected models. Each classifier assigns an input image a probability for each of the eight tags (NE, CME, CSOM, EACB, IC, OE, SOM, TMC), and the tag with the maximal probability is taken as the predicted label. The ensemble classifier combines the 8-term score vectors predicted by the two models, and the class with the maximal combined score is treated as the final predicted label of the image.
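A minimal sketch of this ensemble follows; summing the two softmax score vectors with equal weight is an assumption, since the text does not specify the combination rule.

```python
import torch

CLASSES = ["NE", "CME", "CSOM", "EACB", "IC", "OE", "SOM", "TMC"]

@torch.no_grad()
def ensemble_predict(model_a, model_b, images):
    """Combine the two models' 8-term softmax score vectors and take
    the class with the maximal combined score as the final label."""
    scores = (torch.softmax(model_a(images), dim=1)
              + torch.softmax(model_b(images), dim=1))  # assumed equal weighting
    idx = scores.argmax(dim=1).tolist()
    return [CLASSES[i] for i in idx]
```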
(4) Sensitivity-specificity curve
Sensitivity and specificity are frequently used clinimetric parameters that together define the ability of a measure to detect the presence or absence of a specific condition (i.e., likelihood ratio). On the whole testing set, population-level sensitivity and specificity were calculated as follows:

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

where TP, FN, TN, and FP represent the numbers of true positives, false negatives, true negatives, and false positives, respectively. A sensitivity-specificity curve can be created by varying the threshold value t (an image is called positive when its predicted probability p ≥ t).
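As a sketch of how the curve is traced, the following one-vs-rest computation assumes y_true (binary labels for one class) and p (predicted probabilities) are NumPy arrays; the variable names are illustrative.

```python
import numpy as np

def sensitivity_specificity(y_true, p, t):
    """One-vs-rest: call an image positive when its predicted probability p >= t."""
    pred = p >= t
    tp = np.sum(pred & (y_true == 1))   # true positives
    fn = np.sum(~pred & (y_true == 1))  # false negatives
    tn = np.sum(~pred & (y_true == 0))  # true negatives
    fp = np.sum(pred & (y_true == 0))   # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Sweeping the threshold t traces out the sensitivity-specificity curve.
curve = [sensitivity_specificity(y_true, p, t) for t in np.linspace(0, 1, 101)]
```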
(5) Confusion matrix
We used a confusion matrix to evaluate the quality of the classifier's output. The values on the diagonal represent the numbers of correctly predicted samples, while the off-diagonal values represent the numbers of misclassified samples. High diagonal values indicate that the classifier performs well.
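For concreteness, the matrix can be accumulated as in this short sketch, where y_true and y_pred are assumed integer class indices:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=8):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```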
(6) Overall accuracy
The overall accuracy is the ratio of the number of correctly categorized images to the total number of testing images, as shown below:

Overall accuracy = (number of correctly categorized images) / (total number of testing images)
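Using the confusion matrix cm from the previous sketch, this follows directly:

```python
# Correct predictions lie on the diagonal of the confusion matrix,
# so overall accuracy is its trace divided by the total image count.
accuracy = np.trace(cm) / cm.sum()
```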