To investigate how a deep learning network identifies object types, a dataset containing images of both sides of 250 types of blister packages was collected as training and testing data for a deep learning network. Identification results were computed in terms of precision, recall, and the combined F1-score, where an identification error can be regarded as an error due to look-alike cases.
Data resources
This study collected drugs from the Out-Patient Department (OPD) of a medical center. Of the 272 kinds of drug, this study focused only on recognition of pharmaceutical blister packages. As such, 6 classes of drug packaging (Fig. 1), totaling 32 kinds of drug, were excluded, as follows: clip chain bags, powder bags, foil packaging bags, transparent bags, paper packages, and bottle packaging. The remaining 250 drugs with blister packaging were considered.
We aimed to identify blister packages from their images, photographed with a camera from different angles. In collecting the dataset, 72 images were taken for each side of each type of drug: the camera was positioned at 9 different angles, and the drug was shown in 8 different rotation directions (Fig. 2). Both front-side and back-side images were taken for each drug, resulting in a total of 36,000 images for deep learning. Images of the front sides of packages contained the shapes and colors of the pills or tablets, whereas images of the back sides contained mostly texture patterns of the drugs or logos of pharmaceutical companies. These images were used to train convolutional neural networks (CNNs), the deep learning networks, for object identification.
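As a quick check of the image counts described above, the arithmetic can be sketched as follows. The constant names are illustrative and not taken from the study; only the numbers come from the text.

```python
# Counts stated in the text; the names are illustrative assumptions.
CAMERA_ANGLES = 9      # camera positions per side
ROTATIONS = 8          # rotation directions of the drug per camera position
SIDES = 2              # front and back
DRUG_TYPES = 250       # blister-packaged drugs retained

images_per_side = CAMERA_ANGLES * ROTATIONS        # images per side of one drug
images_per_drug = images_per_side * SIDES          # both sides of one drug
total_images = images_per_drug * DRUG_TYPES        # whole dataset

print(images_per_side, images_per_drug, total_images)  # 72 144 36000
```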
Deep learning architecture
The concept of the Convolutional Neural Network (CNN) was proposed by LeCun and others in 1989. These deep learning networks usually consist of convolutional layers, pooling layers, and fully-connected layers [29]. Because the convolutional and pooling layers capture relationships among adjacent data, which aids pattern recognition, a CNN can be applied to signal types such as images and sounds. Features extracted through multiple layers of convolution and pooling are passed as inputs to one or more fully-connected layers for classification. Unfortunately, a simple CNN is not effective for more complex images. Krizhevsky et al. [30] reconstructed a CNN in 2012, and among CNN-based networks, the deep learning framework for "object detection" has also been continuously improved. R-CNN was the first successful CNN-based object detection method, but its detection speed was very slow [31]. Later, Fast R-CNN and Faster R-CNN were constructed [32] as optimizations of R-CNN, and both speed and accuracy improved significantly.
Software and hardware devices
This study used You Only Look Once (YOLO), proposed by Redmon et al. in 2015, as the solution framework for deep learning [33]. An end-to-end structure was adopted. Compared with the general deep learning method, YOLO addresses both the area prediction part of detection and the category prediction part of classification. YOLO integrates detection and classification into the same neural network model, giving fast and accurate target detection and recognition. These deep learning techniques employ the following features: batch normalization for faster convergence; passthrough to improve feature identification; a hi-res classifier to increase the resolution of the images; direct location prediction to stabilize position prediction; and multi-scale training to improve both speed and accuracy. The SENet and ResNet experiments in this study used the Kubuntu 14.04 system and the Darknet framework in the Caffe structure of Windows 7, running on a host machine dedicated to deep learning. This host had an Intel® I7-6770 Eight-Core Processor (CPU), 16 GB RAM, and an NVIDIA GTX 1080 Graphics Processing Unit (GPU).
Experimental design
For model evaluation, this study partitioned the collected data into separate training and testing sets. The training set was used to train the deep network to generate models, while the testing set was used to evaluate the performance of the constructed models. We randomly chose three-quarters of the 72 pictures of each type of drug for inclusion in the training set, and the remaining quarter were included in the testing set, giving 13,500 images in the training set and 4,500 images in the testing set for each side (front and back). This study trained 100 models for each of the front-side and back-side images using the training set. The best model was defined as the model with the greatest accuracy (highest F1 measure) and the fastest speed (fewest Epochs). This study also standardized the YOLO v2 protocol for both the training and testing datasets in each model. All images were resized to 224 × 224 pixels. Neither data augmentation nor pre-training of the model was performed during training. The batch size was 8, meaning that parameters were updated after every 8 images. The maximum number of training Epochs was 100, where one Epoch means that the deep network has processed all the training pictures once. The parameters were saved after every Epoch was completed (Table 1).
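The random three-quarters split described above can be illustrated with a minimal sketch. The function `split_images`, the image identifiers, and the seeding scheme are hypothetical; the study does not describe its splitting code. The sketch covers one side (front or back) of the 250 drugs.

```python
import random


def split_images(image_ids, train_fraction=0.75, seed=0):
    """Randomly split the image IDs of one drug side into train and test lists.

    Hypothetical helper: the study only states that three-quarters of the
    72 pictures were randomly chosen for training.
    """
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Assumed layout: 250 drug types, 72 images per type for one side.
train, test = [], []
for drug in range(250):
    ids = [(drug, i) for i in range(72)]     # placeholder (drug, image) IDs
    tr, te = split_images(ids, seed=drug)
    train.extend(tr)
    test.extend(te)

print(len(train), len(test))  # 13500 4500
```

Splitting within each drug type, rather than over the pooled images, keeps 54 training and 18 testing images for every class, which matches the totals reported in the text.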
Outcome measurement
Confusion matrices were used to record whether blister packages were identified correctly or not. Correct matches were listed on the diagonal of the matrix, whereas misidentifications were marked by non-zero values off the diagonal. The higher an off-diagonal value, the greater the chance of misidentification between the corresponding blister packages. For example, assume that there is a system for classifying three different drugs (Table 3). Suppose that there are 28 drugs in total: 9 of drug A, 6 of drug B, and 13 of drug C. In this confusion matrix, there are actually nine of drug A, but three of them are misidentified as drug B. For drug B, one is misidentified as drug C, and two are misidentified as drug A. The confusion matrix shows that it is more difficult to distinguish between drug A and drug B, but easier to distinguish drug C from the other drugs.
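The three-drug example can be reproduced with a small sketch that tallies (true, predicted) pairs into a matrix. The counts for drugs A and B follow the text; the row for drug C is not given in the text, so the sketch assumes all 13 are identified correctly. The function name `confusion_matrix` is illustrative.

```python
from collections import Counter

LABELS = ["A", "B", "C"]


def confusion_matrix(pairs, labels=LABELS):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(pairs)
    return [[counts[(t, p)] for p in labels] for t in labels]


pairs = (
    [("A", "A")] * 6 + [("A", "B")] * 3                        # 9 of drug A
    + [("B", "A")] * 2 + [("B", "B")] * 3 + [("B", "C")] * 1   # 6 of drug B
    + [("C", "C")] * 13  # ASSUMPTION: the text does not give drug C's row
)

matrix = confusion_matrix(pairs)
for label, row in zip(LABELS, matrix):
    print(label, row)
# A [6, 3, 0]
# B [2, 3, 1]
# C [0, 0, 13]
```

The off-diagonal entries cluster between A and B, which is the look-alike pattern the example is meant to show.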
The data presented in Table 2 are for the model obtained from 100-Epoch training. The training time, number of training Epochs, precision, recall, and F1 measure were recorded as the evaluation results. The best recognition performance was identified according to the F1 score, and the Epoch number was used to identify the model requiring the fewest training Epochs. The recall, also called the true positive rate or the sensitivity, measures the proportion of actual positives that are correctly identified. Recall = True Positive / (True Positive + False Negative), where True Positive denotes a correct identification, while False Negative denotes a misidentification in which the correct target is taken as something else. The precision, also called the positive predictive value, measures the proportion of true positives among all items identified as positive. Precision = True Positive / (True Positive + False Positive), where False Positive denotes a misidentification in which something else is taken as the correct target [12]. The F1 measure is an evaluation that combines both sensitivity (recall) and precision. The calculation formula of the F1 score is as follows:
See formula 1 in the supplementary files.
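As an illustration, the sketch below computes per-class precision, recall, and F1 from a confusion matrix, using the standard definition F1 = 2 × Precision × Recall / (Precision + Recall), which we assume matches formula 1 in the supplementary files. The matrix is a hypothetical three-class example (its third row is an assumption), and the function name is illustrative. How the study aggregates these values over all classes is not stated, so no averaging is shown.

```python
def precision_recall_f1(matrix, k):
    """Per-class metrics for class index k.

    matrix[t][p] counts items of true class t predicted as class p.
    """
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                      # class k taken as something else
    fp = sum(row[k] for row in matrix) - tp       # something else taken as class k
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical matrix for drugs A, B, C (rows: true, columns: predicted).
MATRIX = [
    [6, 3, 0],
    [2, 3, 1],
    [0, 0, 13],   # ASSUMPTION: illustrative row, not data from the study
]

for name, k in zip("ABC", range(3)):
    p, r, f = precision_recall_f1(MATRIX, k)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

For drug A this gives precision 6/8, recall 6/9, and F1 = 12/17.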
At the same time, we recorded the training time of the model, the number of Epochs in the training, and the classification performance of the model for the testing dataset.
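Given such per-Epoch records, the selection criterion stated in the experimental design (highest F1 measure, then fewest Epochs) could be applied as in the following sketch. The record layout, the function `select_best`, and the F1 values are placeholders for illustration and are not results from the study.

```python
def select_best(records):
    """Return the record with the highest F1; ties go to the fewest Epochs."""
    return min(records, key=lambda r: (-r["f1"], r["epoch"]))


# Placeholder records: one entry per saved Epoch (values are made up).
records = [
    {"epoch": 10, "f1": 0.90},
    {"epoch": 40, "f1": 0.95},
    {"epoch": 70, "f1": 0.95},
]

best = select_best(records)
print(best)  # {'epoch': 40, 'f1': 0.95}
```

Sorting on the pair (negative F1, Epoch) applies both criteria in one pass, so a later Epoch is chosen only if it strictly improves the F1 measure.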