System architecture. Our progressive recognition system consists of an LR model, an HR model and a WSI classification model, as shown in Fig. 1. The LR model is designed to quickly locate suspicious lesion areas at low resolution. The HR model identifies lesion cells at high resolution and recommends the top 10 lesion cells. The WSI classification model uses an RNN to integrate the CNN image features of the top 10 lesion cells and outputs the positive confidence of each WSI. The LR model and HR model are both based on ResNet50 (ref. 30). For the LR model, we modify the fully connected layer of the original ResNet50 and add a semantic segmentation branch that generates a rough location mask (Supplementary Fig. 1). Thus, the LR model can screen WSIs and locate the suspicious lesion areas. The semantic segmentation branch is constructed with residual blocks of dilated convolutions. The LR model accepts an image tile of 512 × 512 pixels (0.486 µm/pixel) as input and outputs a lesion probability and a location heatmap (Supplementary Fig. 2). Afterwards, for the areas to which the LR model assigns a probability higher than 0.5, we apply morphological operations to the corresponding location heatmap to generate the location mask. An image tile of 256 × 256 pixels (0.243 µm/pixel), cropped according to the location mask, is input to the HR model, which outputs a new lesion probability. Finally, all identified lesion cells in a WSI are sorted by lesion probability, and the 10 most typical lesion cells are recommended for review by cytopathologists. Further, the RNN model integrates the CNN image features of the recommended top 10 lesion cells to classify WSIs. For each lesion cell tile, 2048-dimensional features are extracted by the HR model. The resulting 10 × 2048 dimensional features are then input to the RNN model, which outputs the positive probability of the WSI.
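The control flow of this progressive screening can be summarized in a short sketch. It is illustrative only and is not the authors' implementation: the function name `recommend_top_cells` and the callables `lr_model`, `crop_candidates` and `hr_model` are hypothetical stand-ins for the LR network, the mask-based cropping step and the HR network, and the final RNN classification over the 10 × 2048 features is omitted.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical alias: a "tile" is any object the stand-in models accept.
Tile = object


def recommend_top_cells(
    lr_tiles: Sequence[Tile],
    lr_model: Callable[[Tile], float],
    crop_candidates: Callable[[Tile], List[Tile]],
    hr_model: Callable[[Tile], float],
    lr_threshold: float = 0.5,  # LR probability cut-off stated in the text
    top_k: int = 10,            # number of recommended lesion cells
) -> List[Tuple[float, Tile]]:
    """Return the top_k (HR probability, HR tile) pairs, highest first."""
    scored: List[Tuple[float, Tile]] = []
    for tile in lr_tiles:
        # Stage 1: low-resolution screening; keep only tiles above the threshold.
        if lr_model(tile) <= lr_threshold:
            continue
        # Stage 2: crop high-resolution candidates (assumed to come from the
        # location mask) and re-score each one with the HR model.
        for hr_tile in crop_candidates(tile):
            scored.append((hr_model(hr_tile), hr_tile))
    # Stage 3: rank all candidates by HR lesion probability.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In this sketch, the features of the returned tiles would subsequently be passed to the WSI classifier.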
Multi-center WSI datasets. To assess the robustness and clinical applicability of our system, we collected 12 groups of datasets from 5 hospitals and 5 kinds of imaging instruments (see Dataset sources in Methods), referred to as groups A-L (Fig. 2a). These 12 datasets include 1,467 (41.4%) positive WSIs and 2,073 (58.6%) negative WSIs, with 79,218 lesion cells annotated by consensus of three cytopathologists. Each WSI represents one unique patient. The 12 datasets show completely different image styles in terms of staining and imaging characteristics (Fig. 2b), and we quantified the differences in their numerical distributions (Fig. 2c). Groups A-D are used for training our system. Groups E-L are treated as a completely independent test set to evaluate the generalization of our system. Groups A-D are randomly divided into a training set, a validation set and a test set with a slide-wise ratio of 8:1:1 (Fig. 2d). WSIs of all groups are scanned under 20× or 40× magnification microscopes. Because the imaging instruments have different resolutions, we uniformly interpolate all WSIs to 0.243 µm/pixel during data preprocessing.
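The two preprocessing steps described here, resampling to a common resolution and a slide-wise 8:1:1 split, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the helper names `resample_factor` and `split_slides`, the fixed random seed and the integer rounding of subset sizes are all hypothetical choices.

```python
import random
from typing import Dict, List, Sequence

TARGET_UM_PER_PIXEL = 0.243  # common resolution stated in the text


def resample_factor(source_um_per_pixel: float) -> float:
    """Factor by which image width and height are scaled to reach the target.

    A source coarser than the target (larger µm/pixel) gives a factor > 1.
    """
    return source_um_per_pixel / TARGET_UM_PER_PIXEL


def split_slides(slide_ids: Sequence[str], seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle slide IDs and split them 8:1:1 at the slide level.

    Splitting by slide (not by tile) keeps all tiles of one patient
    in a single subset.
    """
    ids = list(slide_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility (assumption)
    n_train = len(ids) * 8 // 10
    n_val = len(ids) // 10
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }
```

For example, an image stored at 0.486 µm/pixel would be upsampled by a factor of 2 in each dimension, whereas one stored at 0.180 µm/pixel would be downsampled by a factor of about 0.74.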
To verify the effect of our recognition system in practical applications, we invited three cytopathologists to evaluate the prediction results on the 1,170 slides in groups E and F. Groups E and F are independent of the training data and exhibit new styles, and are thus suitable for clinical-level experiments. We performed the following assessments: slide-level accuracy, tile-level accuracy and the true positive rate of the recommended top 10 lesion cells.
Assessment at the slide level. To assess the effectiveness of our system at the slide level, we compared the RNN classifier with cytopathologists in classifying WSIs on the 1,170 slides of the independent groups E-F. Figure 3a shows the ROC (receiver operating characteristic) curve of our system for classifying positive and negative slides; the system achieves 93.5% Specificity and 95.1% Sensitivity with 0.979 AUC (the area under the ROC curve). The 121 most confusing of these slides were then classified by both the RNN and the three cytopathologists. Each red dot in Fig. 3b marks the 1-Specificity and Sensitivity of one cytopathologist’s interpretation. On these slides, our system achieves 0.5 Specificity and 0.746 Sensitivity with 0.647 AUC, which is comparable with the average level of the cytopathologists. In addition, our system processes one gigapixel WSI in about 1.5 minutes when deployed on a single GPU card, much faster than manual slide reading.
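For reference, the slide-level metrics reported above can be computed from per-slide scores and labels as in the following sketch. The function names are hypothetical, and treating a slide as positive when its score is strictly above the threshold is an assumption; the AUC is computed with the standard pairwise (Mann-Whitney) formulation, which equals the area under the ROC curve.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    scores: Sequence[float], labels: Sequence[int], threshold: float = 0.5
) -> Tuple[float, float]:
    """Sensitivity and Specificity when score > threshold is called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s > threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s <= threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s <= threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s > threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. The double loop is O(n_pos * n_neg), which is
    adequate for a sketch.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A cytopathologist's binary reading yields a single (1-Specificity, Sensitivity) point, which is why the readers appear as dots beside the system's ROC curve.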
Analysis of false positive and false negative slides. From the frequency histogram of slide classification scores of groups E-F (Fig. 4), our system produces 0.8% false negative slides at a slide score threshold of 0.5, all of which were confirmed as ASCUS (atypical squamous cells of undetermined significance) slides. Because cervical cytology ASCUS slides are easily confused with some hard negative slides, misjudging a small number of ASCUS slides is acceptable. Meanwhile, our system produces 26.3% false positive slides, which is in line with the original intention of cervical cytology computer-aided diagnosis: these false positives will be further reviewed by cytopathologists. Further, our system achieves 49.3% Specificity while retaining 100% Sensitivity, which indicates that 49.3% of negative slides can be excluded.
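The operating point with 100% Sensitivity corresponds to lowering the score threshold until no positive slide is missed. A minimal sketch of that calculation follows; the function name is hypothetical, and the rule that a negative is excluded only if it scores strictly below the lowest-scoring positive is an assumption.

```python
from typing import Sequence


def specificity_at_full_sensitivity(
    scores: Sequence[float], labels: Sequence[int]
) -> float:
    """Fraction of negatives scoring below the lowest-scoring positive.

    Placing the threshold at the minimum positive score keeps every
    positive slide, so Sensitivity stays at 100%.
    """
    min_positive = min(s for s, y in zip(scores, labels) if y == 1)
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    excluded = sum(1 for s in negatives if s < min_positive)
    return excluded / len(negatives)
```

The returned fraction is the share of negative slides that could be set aside without losing any positive slide.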
Assessment at the tile level. To evaluate the difference between our system and the three cytopathologists at the tile level, we randomly selected 1,000 positive tiles and 3,000 negative tiles with a size of 256 × 256 (0.243 µm/pixel) from groups E-F as the test data. As shown in Fig. 3c, the ROC curve describes the performance of our system, and each red dot represents 1-Specificity and Sensitivity of a cytopathologist’s classification result. Our system achieves 95.3% Specificity and 92.8% Sensitivity with 0.979 AUC, better than the average level of cytopathologists.
Assessment on the recommended top 10 lesion cells. The three cytopathologists evaluated the recommended top 10 lesion cells on 447 positive slides in groups E-F. The average true positive rates of the top 10 and top 20 recommended cells are 88.5% and 85.0%, respectively (Fig. 3d). Moreover, our system does not miss any positive slide when the evaluations of the three cytopathologists are combined by voting, i.e., every positive slide has at least one true lesion cell among the recommended top 10 or top 20 cells. Figure 4 shows the recommended cells of slides with different classification scores. For high-risk slides, our system recommended typical lesion cells such as koilocytotic cells or hyperchromatic cells with large nuclei and irregular nuclear membranes. For medium-risk slides, some suspicious cells with slightly enlarged or deeply stained nuclei were recommended. No typical lesion cells were recommended on low-risk slides. The results demonstrate that our system can accurately recommend 10 lesion cells without missing positive slides, greatly reducing the workload of cytopathologists in screening WSIs.
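One way to compute such a top-k true positive rate from reviewer judgements is sketched below. It is an illustration under assumptions, not the authors' evaluation protocol: the text does not specify the voting rule, so a simple majority among the three cytopathologists is assumed, and all function names are hypothetical.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """True when more than half of the reviewers mark the cell as a lesion."""
    return sum(votes) * 2 > len(votes)


def top_k_true_positive_rate(
    cell_votes: Sequence[Sequence[bool]], k: int = 10
) -> float:
    """Share of the first k recommended cells confirmed by majority vote.

    cell_votes holds one entry per recommended cell, in rank order;
    each entry contains the reviewers' lesion / non-lesion votes.
    """
    top = cell_votes[:k]
    confirmed = sum(1 for votes in top if majority_vote(votes))
    return confirmed / len(top)


def slide_has_true_lesion(
    cell_votes: Sequence[Sequence[bool]], k: int = 10
) -> bool:
    """A positive slide is not missed if any top-k cell is confirmed."""
    return any(majority_vote(votes) for votes in cell_votes[:k])
```

Averaging `top_k_true_positive_rate` over slides would give a slide-averaged rate, and checking `slide_has_true_lesion` on every positive slide would test whether any slide is missed.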
Comparison of our system and Hologic ThinPrep Imaging System on recommending top lesion cells. To further verify the effectiveness of our aided screening system, we compared it with the Hologic ThinPrep Imaging System (referred to as TIS). The test data are cervical cytology samples from 58 positive patients at the Maternal and Child Hospital of Hubei Province, which is equipped with TIS. First, 58 glass slides were prepared from the 58 samples, then stained, imaged and identified by TIS in the hospital. TIS recommended 22 suspicious fields of view on each slide. Then, we used another instrument (Shenzhen Shengqiang Technology Ltd., 0.180 µm/pixel under 40× magnification) to scan the 58 glass slides, and we used our system to recommend the top 20 suspicious cell regions (about 60 × 60 µm² each, far smaller than the fields of view of TIS) for each slide. We asked the three cytopathologists to evaluate the results recommended by TIS and by our system at the same time. The statistical results in Fig. 5 show that the true positive rate of our system is higher than that of TIS. Notably, TIS can only work under a closed-loop strategy of preparation, staining, imaging and recognition, while our system is robust to staining and imaging from various sources.
Importance of designed data enhancement, hard sample mining and diverse data learning. We conducted a set of ablation experiments to demonstrate the importance of the designed data enhancement, hard sample mining and diverse data learning (see Methods). We applied the three learning strategies step by step to train a series of control high-resolution models, and report their classification accuracies on the test sets of groups A-F. Notably, the ratio of positive to negative tiles in the test set is 1:1. The model configurations and results of the ablation experiments are provided in Fig. 6a. To evaluate model generalization, we treated groups E-F as the independent test data and show the ROC curves of these control models on groups E-F in Fig. 6b. According to the results, with the designed data enhancement and hard sample mining, the performance of the enhanced and mined models on groups E-F improved considerably, with AUC increases of 0.138 and 0.072, respectively. The results indicate that our designed data enhancement and hard sample mining strategies are effective for improving model generalization and accuracy. Further, as more groups of training datasets were used, the AUC values of the mined, baseline and HR models increased gradually from 0.808 to 0.983. The results indicate that diverse data learning from multiple groups with different styles is important for model generalization.
Generalized and rich feature representations of our models. We used feature visualization to analyze how well the features of the high-resolution models align across different groups of data. The features of the original, enhanced, mined and baseline models on groups A, B, E and F, reduced in dimension by t-SNE (ref. 31), are shown in Fig. 7a. For these models, groups E-F are independent test data. From the original model to the baseline model, the features of positive and negative tiles are gradually separated. Further, the features are gradually aligned between groups A-B and groups E-F. The results indicate that the designed data enhancement, hard sample mining and diverse data learning strategies improve the discrimination and alignment of features on the unseen groups E-F.
We further analyzed the feature representations of the HR model on groups A-L in Fig. 7b. The tiles with high and low lesion probabilities from the 12 groups are clustered and well separated. Tiles corresponding to the far-right points are the typical lesion cells, including koilocytotic cells and hyperchromatic cells with large nuclei and irregular nuclear membranes. These lesion cells, despite their different staining and imaging characteristics, are clustered together and share similar features. Normal cells from different groups are clustered at the points on the left. In the junction regions are the suspicious cells, with lesion probabilities of about 0.5. The suspicious cells generally contain slightly enlarged or deeply stained nuclei, but not to a degree sufficient to qualify as typical lesion cells. In addition, artifacts from staining and imaging may cause cells to appear suspicious. The results indicate that the learned features represent the morphology of cervical lesion cells well and that the features are aligned between datasets with different staining and imaging characteristics. This is the key reason why our system generalizes well to unseen datasets of new styles.