Ethics approval and consent to participate:
The Institutional Review Board and Ethics Committee of the Fourth Affiliated Hospital of Nanchang University approved this study (SFYLL-PJ-2015-001). Written informed consent was provided by all participants. All biological samples were anonymized. All methods were carried out in accordance with relevant guidelines and regulations.
Fecal sample collection
In total, 676 positive samples were collected from the Fourth Affiliated Hospital of Nanchang University. These samples were diluted, stirred, allowed to stand, and finally sent to a flow cell. To obtain clear sample images, an OLYMPUS CX31 microscope served as the basic optical structure, with a 40× objective lens (numerical aperture (NA): 0.65; working distance: 0.6 mm). An EXCCD01400KMA CCD camera with a pixel size of 6.45 µm was used to capture images, and a standard halogen lamp was chosen for illumination.
The collected images were 1600×1200 pixels. Each image was annotated manually as the ground truth: the locations and sizes of RBCs, WBCs, mildews and pyocytes were recorded from the image analysis. Only cells with a standard structure were annotated; defocused objects were left unmarked to reduce false detections caused by impurities. A total of 8,785 images containing formed elements were collected. Training on a small number of images can degrade a model's test performance; therefore, to reduce the effect of overfitting, data augmentation was performed using random vertical and horizontal flipping and random contrast and saturation adjustments.
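To make the augmentation concrete, the following is a minimal sketch of such a pipeline in TensorFlow; the contrast and saturation ranges are illustrative assumptions, not values reported here.

    import tensorflow as tf

    def augment(image):
        # Random vertical and horizontal flips, as described above.
        # For detection, the bounding boxes must be flipped consistently;
        # that bookkeeping is omitted here for brevity.
        image = tf.image.random_flip_left_right(image)
        image = tf.image.random_flip_up_down(image)
        # Random contrast and saturation adjustments; the [0.8, 1.2]
        # ranges are assumed, not taken from the paper.
        image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
        image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
        return image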
Four main elements must be identified during routine fecal examination: red blood cells (RBCs), white blood cells (WBCs), pyocytes (PYOs), and mildews (Mids). Other components, such as calcium oxalate crystals, starch granules, pollen, plant cells, plant fibers and food residues, are classified as impurities with less clinical significance. For details, see Fig. 1(a)-(h).
Faster R-CNN (Ren et al., 2017) consists of three main parts: (1) a feature extraction layer, (2) a region proposal network (RPN), and (3) a classification and regression network; see Fig. 2 for a detailed schematic of the model. The RPN and the classification and regression network share the preceding feature extraction layer, as shown in Fig. 2(a). The feature extraction layer is a series of convolutional neural networks composed of convolutional, pooling, and activation layers. From the feature map generated by the feature extraction layer, the RPN generates anchors of different sizes and aspect ratios, which are then used to generate region proposals. The proposals generated by the RPN are fed into the classification and regression network for type recognition and accurate box regression. Because the feature map regions corresponding to different foreground proposals have inconsistent scales, Fast R-CNN adopts an ROI pooling strategy to unify their dimensions. Although this simplifies the calculation, some features are lost; therefore, we propose PCA-based dimension reduction to normalize the feature dimensions.
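As a concrete illustration of this idea, the following is a minimal, speculative sketch (not the paper's implementation): each ROI feature crop is treated channel-wise, and its spatial responses are projected onto a fixed number of principal components, so that ROIs of different sizes all yield features of the same dimension.

    import numpy as np

    def pca_normalize_roi(roi, k=49):
        # Speculative sketch: roi is an (h, w, c) feature crop. Treating the
        # c channels as samples of h*w-dimensional vectors, the spatial axis
        # is reduced to k principal components, so every ROI returns a fixed
        # (c, k) feature matrix. Assumes h*w >= k.
        c = roi.shape[-1]
        x = roi.reshape(-1, c).T                 # (c, h*w): one row per channel
        x = x - x.mean(axis=0, keepdims=True)    # center each spatial feature
        _, _, vt = np.linalg.svd(x, full_matrices=False)
        return x @ vt[:k].T                      # project onto k components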
The feature extraction layer uses ResNet (He et al., 2016), a 152-layer network composed of four residual blocks; the first three residual blocks are selected as feature extractors (see Fig. 2(b)).
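A minimal sketch of such an extractor, assuming the Keras ResNet-152 naming scheme (the layer name is our choice for illustration, not necessarily the authors'):

    import tensorflow as tf

    # Take the first three residual stages of ResNet-152 as the shared
    # feature extractor; 'conv4_block36_out' closes the third residual
    # stage in the Keras naming scheme and corresponds to the conv4
    # feature map referenced below.
    backbone = tf.keras.applications.ResNet152(include_top=False, weights=None)
    feature_map = backbone.get_layer("conv4_block36_out").output
    extractor = tf.keras.Model(backbone.input, feature_map)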
The RPN was used to generate a batch of proposals, analogous to the selective search used in R-CNN (Girshick et al., 2014) and Fast R-CNN (Girshick, 2015). The network structure is consistent with the RPN used in Faster R-CNN: a 3 × 3 convolutional layer applied after the feature map layer (conv4b_35) produces a 256-channel output that fuses information both spatially and across channels. This fused layer feeds two branches, the SoftMax classification head and the box location regression head; for details, see Fig. 3(a). Unlike the RPN in Faster R-CNN, whose anchor dimensions are hand-selected, the anchors here are generated from the average size of the foreground targets, which allows the regression network to converge smoothly and predict accurate locations; for details, see Fig. 3(b).
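The following is a hypothetical sketch of this data-driven anchor generation: the base anchor is derived from the mean width and height of the annotated foreground boxes, while the scale and ratio grids are illustrative assumptions.

    import numpy as np

    def anchors_from_stats(boxes, scales=(0.5, 1.0, 2.0), ratios=(0.5, 1.0, 2.0)):
        # boxes: (N, 4) ground-truth boxes as [x1, y1, x2, y2].
        # The base anchor is the average foreground width/height, which
        # is then spread over a small grid of scales and aspect ratios.
        w = (boxes[:, 2] - boxes[:, 0]).mean()
        h = (boxes[:, 3] - boxes[:, 1]).mean()
        anchors = [(w * s * np.sqrt(r), h * s / np.sqrt(r))
                   for s in scales for r in ratios]
        return np.array(anchors)  # (9, 2) array of (width, height)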
In the training process, the RPN module is trained jointly, rather than alternately, with the object recognition network. Since Faster R-CNN is end-to-end, both the RPN and the object recognition network provide feedback to the feature extraction layer. During backpropagation, the loss functions of the RPN and the Fast R-CNN head are combined and computed together. Moreover, we introduced the PCA strategy in the classification and regression component of Faster R-CNN, which must be trained separately. We denote the original Faster R-CNN model as M0; the model with the improved RPN (Section 3.1.2) and the PCA-based ROI strategy is denoted as M1. The training process is shown in Fig. 4.
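For clarity, the joint objective can be sketched as a simple sum of the four loss terms; the variable names and the balancing weight lam are illustrative assumptions, not the authors' code.

    # Minimal sketch of the joint training objective: the RPN and
    # Fast R-CNN head losses are summed and backpropagated through
    # the shared feature extractor in a single pass.
    def total_loss(rpn_cls, rpn_reg, rcnn_cls, rcnn_reg, lam=1.0):
        return (rpn_cls + lam * rpn_reg) + (rcnn_cls + lam * rcnn_reg)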
All experiments were conducted using models developed with TensorFlow, which provides libraries for building the main structure of deep learning models. The experiments were executed on a Windows system with an 8-core Intel Core i7-5960X CPU @ 3.0 GHz, an NVIDIA GeForce GTX 1080 Ti GPU, and 32 GB of RAM. During microscopy, 5 images were taken at different focal lengths, and 12 fields of view were recorded by means of a movable platform.