Large-scale specimen collection, slide preparation and image acquisition from geographically diverse centres for retrospective studies
To ensure the robustness of the WSI-level classification model and to account for real-world data variation, we collected a total of 174,748 specimens from 49 medical centres across six distinct regions, as illustrated in Fig. 1A. After applying quality control criteria, 172,665 slides were retained for further analysis, of which 76,614 were used to develop the AI system. The exclusion criteria covered slides with faded or bleached smears that could not be accurately interpreted, slides with insufficient cells for interpretation, and slides that were excessively dirty. Supplementary Sections I-II describe the cell density detection and blur detection methods applied for these exclusion criteria. In this work, each slide corresponds to one unique patient. Our data incorporated three common sedimentation methods (Fig. 1B): natural sedimentation, membrane-based sedimentation and centrifugal sedimentation. We also implemented manual and semi-automatic annotation strategies, as shown in Fig. 1D, to expedite the annotation of extensive datasets and minimize the workload of cytopathologists. This led to a total of 3,435,463 cell-level annotations, as shown in Fig. 1C. All slides were scanned at 20X magnification (0.5 µm/pixel) using 46 high-performance slide scanners, such as the KF-PRO-400 and KF-PRO-005 from Ningbo Konfoong Biotech International Co., Ltd.
Overview of AI-assisted CCA-DIAG system design and its optimization
The proposed AI-assisted CCA-DIAG system consists of four modules, as illustrated in Fig. 1E: (1) a cell detection model that identifies the locations and categories of abnormal cells; (2) a cell classification model that refines the categories of the abnormal cells detected in (1), improving the accuracy of true positives and minimizing false positives; (3) an ASCUS nucleus-cell segmentation model that segments the ASCUS cells returned by the abnormal cell classification model; and (4) XGBoost classifiers (ref. 37) that assign each WSI to a category according to cell-level features extracted from the three models above. The WSI-level categories comprise six squamous intraepithelial lesion categories (NILM, ASCUS, LSIL, ASCH, HSIL and AGC) and two types of microbial infection (Trichomonas and Candida). Specifically, for squamous intraepithelial lesions, ASCUS, LSIL, ASCH, HSIL and AGC were grouped as ASCUS+, while LSIL, ASCH, HSIL and AGC were grouped as LSIL+. The WSI dataset was split in an 8:2 ratio into the cell-level analysis dataset and the WSI classification dataset, based on the clinical diagnostic categories. The numbers of annotations used to develop the different models are illustrated in Fig. 1C.
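The grouping into ASCUS+ and LSIL+ and the 8:2 split can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the split is stratified by diagnostic category, and the function names, the random seed and the toy slide list are hypothetical.

```python
import random
from collections import defaultdict

# Category groupings as described in the text.
ASCUS_PLUS = {"ASCUS", "LSIL", "ASCH", "HSIL", "AGC"}
LSIL_PLUS = {"LSIL", "ASCH", "HSIL", "AGC"}


def group_label(category, positive_set):
    """Return 1 if the WSI category belongs to the grouped positive class, else 0."""
    return int(category in positive_set)


def stratified_split(slides, ratio=0.8, seed=0):
    """Split (slide_id, category) pairs so each category keeps ~ratio in the first part.

    Stratification, shuffling and the seed are assumptions; the text only
    states an 8:2 ratio based on the clinical diagnostic categories.
    """
    by_category = defaultdict(list)
    for slide_id, category in slides:
        by_category[category].append(slide_id)

    rng = random.Random(seed)
    cell_level, wsi_level = [], []
    for category, ids in sorted(by_category.items()):
        rng.shuffle(ids)
        cut = round(len(ids) * ratio)
        cell_level += ids[:cut]
        wsi_level += ids[cut:]
    return cell_level, wsi_level


# Hypothetical toy data: 10 slides in each of four categories.
slides = [(f"{c}-{i}", c) for c in ("NILM", "ASCUS", "LSIL", "HSIL") for i in range(10)]
cell_level, wsi_level = stratified_split(slides)
print(len(cell_level), len(wsi_level))  # 32 8
```

Splitting within each category keeps rare classes represented in both datasets, which a purely random split would not guarantee.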
Semi-automatic annotation: Cytopathologist-AI interaction greatly increases the efficiency of large-scale annotation
The development of a CCa screening system often requires a large amount of high-quality annotated data. To accelerate data curation, we developed an annotation platform for cytology and designed a two-phase annotation procedure, as shown in Fig. 1D. The details of the two-phase annotation strategy are given in Fig. 2.
In phase 1, a preliminary abnormal cell detection model and a preliminary ASCUS nucleus segmentation model were developed based on 37,942 WSIs with manual annotations of 375,055 abnormal cells and 3,271 nuclei of single ASCUS cells. In phase 2, the two preliminary models were applied exhaustively to another 24,339 WSIs using a sliding window approach, generating a large number of pseudo-annotations, including abnormal cells of different categories and nuclei of ASCUS cells. These were reviewed and confirmed by 19 cytopathologists before being added to the training dataset. Additionally, a new category, NILM (including changes in flora suggesting bacterial vaginosis, Actinomyces, Herpes simplex virus and cytomegalovirus), was added and annotated during semi-automatic annotation.
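The exhaustive sliding-window pass in phase 2 can be pictured with a short sketch. The tile size and stride below are hypothetical values chosen for illustration; the text does not report the window parameters used.

```python
def sliding_windows(width, height, tile=1024, stride=768):
    """Yield (x, y, w, h) tiles that cover a width x height image exhaustively.

    tile and stride are hypothetical. The last tile in each row and column is
    shifted back so that it stays inside the image instead of running past it.
    """
    def starts(length):
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        positions.append(length - tile)  # final tile sits flush with the border
        return positions

    for y in starts(height):
        for x in starts(width):
            yield x, y, min(tile, width), min(tile, height)


# Hypothetical 3000 x 2000 pixel region of a slide.
tiles = list(sliding_windows(3000, 2000))
print(len(tiles))  # 12
```

Using a stride smaller than the tile size makes neighbouring windows overlap, so a cell cut by one tile border appears whole in an adjacent tile.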
In total, 1,223,374 abnormal cells of different categories and 97,221 ASCUS nuclei were returned by the preliminary abnormal cell detection model and the ASCUS nucleus segmentation model, respectively. After the cytopathologists’ review, 31,826 ASCUS nuclei were confirmed as correct. Details of the semi-automatic annotation results for abnormal cells are shown in Supplementary Section VIII. Many of the abnormal cells identified by the preliminary abnormal cell detection model were relabelled as false positives (NILM), demonstrating the need for an abnormal cell classification model that further classifies the detected cells to eliminate potential false positives. As for the annotation efficiency of cytopathologists, comparing the total number of annotations and the mean annotation time for fully manual and semi-automatic annotation shows that semi-automatic annotation is approximately 2 times faster, as shown in Fig. 3A. These findings indicate that semi-automatic annotation assisted by the preliminary models can generate a large amount of high-quality annotated data while significantly easing the annotation burden on cytopathologists.
Abnormal cell detection model optimization and performance improvement by the copy-paste method
The preliminary abnormal cell detection models were trained on fully manual annotations in phase 1 and were applied to generate a larger number of annotations for further training in phase 2. For the preliminary abnormal cell detection model, we evaluated the average precision (AP) for each category on the test set and compared three backbone architectures (ResNet-50, ResNet-101 and Swin-transformer-base), as shown in Fig. 3B. The best backbone is Swin-transformer-base, which reaches the highest AP in all categories (AP of ASCUS = 0.527, ASCH = 0.408, LSIL = 0.564, HSIL = 0.439, AGC = 0.610, Trichomonas = 0.481, Candida = 0.663) and a mean average precision (mAP) of 0.503, compared with the other backbones (mAP of ResNet-50 = 0.397, ResNet-101 = 0.408). It was therefore selected as the preliminary abnormal cell detection model for semi-automatic annotation in phase 2.
By applying the copy-paste method to the abnormal cells obtained from semi-automatic annotation, a final abnormal cell detection model was trained on a larger set of abnormal cells using the same Swin-transformer-base architecture. Compared with the preliminary model, the AP scores increased for every category except Candida, which was unchanged (AP of ASCUS = 0.556, ASCH = 0.508, LSIL = 0.571, HSIL = 0.571, AGC = 0.689, Trichomonas = 0.593, Candida = 0.663), and the mAP reached 0.593 after applying the copy-paste method in phase 2 (enhanced Swin-transformer-base, indicated by the red line), as shown in Fig. 3B. We also performed binary detection to evaluate abnormal cells in each class. Figure 3C shows the AUC derived from the ROC curves for abnormal cell detection with the enhanced Swin-transformer-base model (trained with the copy-paste method); overall discriminative performance was excellent (AUC > 0.8).
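As a quick check of how the reported mAP relates to the per-category AP values, the sketch below takes the mAP as the unweighted mean over the seven categories listed for the enhanced model. That averaging convention is an assumption, although it reproduces the reported figure.

```python
# Per-category AP of the enhanced Swin-transformer-base model, as reported in the text.
ap_enhanced = {
    "ASCUS": 0.556,
    "ASCH": 0.508,
    "LSIL": 0.571,
    "HSIL": 0.571,
    "AGC": 0.689,
    "Trichomonas": 0.593,
    "Candida": 0.663,
}


def mean_average_precision(ap_by_class):
    """Unweighted mean of per-class AP (assumed convention)."""
    return sum(ap_by_class.values()) / len(ap_by_class)


map_enhanced = mean_average_precision(ap_enhanced)
print(f"mAP = {map_enhanced:.3f}")  # mAP = 0.593
```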
Abnormal cell classification model can effectively eliminate the false positive cells
The abnormal cell classification model was developed using the annotated abnormal cells obtained from semi-automatic annotation in phase 2. Its main objectives were to further classify the abnormal cells identified by the abnormal cell detection model and to eliminate potential false positives. We compared four backbones (EfficientNet-B2, EfficientNet-B3, ConvNeXt-Tiny and ConvNeXt-Small) using the F1-score, which summarizes precision and recall. As indicated by the F1 values in Fig. 3D, ConvNeXt-Small performs worst, while EfficientNet-B2 performs slightly better than the others; EfficientNet-B2 was therefore selected as the final model. Some representative identified abnormal cells are shown in Fig. 3E. The confusion matrix of abnormal cell classification in Fig. 3F shows that the model identifies positive cells effectively and retains good classification ability for relatively similar abnormal cell types. For example, 60.84% of ASCUS cells were correctly classified, with 13.90% misclassified as LSIL and 20.22% as NILM. In the LSIL group, 64.78% of cells were correctly classified, with 22.04% misclassified as ASCUS. LSIL cells were less often confused with NILM (9.44%) than ASCUS cells were (20.22%). In summary, although cell types such as ASCUS and NILM are difficult for cytologists to tell apart, the model distinguishes them well.
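For readers less familiar with how an F1-score follows from a confusion matrix, here is a minimal sketch. The 3 x 3 matrix is hypothetical toy data, not the matrix of Fig. 3F, and the function name is illustrative.

```python
def per_class_f1(confusion, labels):
    """Compute (precision, recall, F1) per class.

    confusion[i][j] is the number of cells with true class i predicted as class j.
    """
    scores = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # true class k, predicted otherwise
        fp = sum(row[k] for row in confusion) - tp  # predicted k, true class otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical counts, 100 cells per true class.
labels = ["NILM", "ASCUS", "LSIL"]
confusion = [
    [80, 15, 5],
    [20, 60, 20],
    [10, 20, 70],
]
scores = per_class_f1(confusion, labels)
for label, (precision, recall, f1) in scores.items():
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

The row-wise recall in this sketch corresponds to the "successful classification rate" quoted from the confusion matrix in the text.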
ASCUS nucleus-cell segmentation: model performance increases significantly after incorporating a large number of semi-automatic annotations
In phase 1, the preliminary nucleus segmentation model was developed on 3,271 ASCUS nuclei manually annotated by cytopathologists. In phase 2, annotations of 31,826 ASCUS cells and their nuclei were collected through semi-automatic annotation, Otsu segmentation and manual adjustment.
For the preliminary nucleus segmentation model, the mean intersection over union (IoU) and Dice score for ASCUS nuclei were 0.85 and 0.92, respectively. In the final model, the IoU for nuclei increased to 0.94 and the Dice score to 0.95, as shown in Fig. 4A. Comparing the preliminary and final models, performance improved significantly after the large amount of data obtained from semi-automatic annotation was merged into the training set. These results demonstrate the necessity and advantage of using larger datasets to develop deep learning models. The IoU and Dice score for ASCUS cell segmentation were 0.92 and 0.96, respectively. The results indicate that the final model segments ASCUS cells and nuclei accurately. Examples of ASCUS cell and nucleus segmentation results are shown in Fig. 4B. The final nucleus and cell segmentation model also contributed to the copy-paste method used to develop the final abnormal cell detection model, by segmenting the abnormal cells within each bounding box.
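The two segmentation metrics used here can be computed from a pair of binary masks as sketched below. The masks are hypothetical 4 x 4 toy examples, and the convention of returning 1.0 for two empty masks is an assumption.

```python
def iou_and_dice(pred, truth):
    """Return (IoU, Dice) for two equally sized binary masks given as nested lists of 0/1."""
    inter = union = pred_area = truth_area = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += p & t
            union += p | t
            pred_area += p
            truth_area += t
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as a perfect match (assumed convention)
    return inter / union, 2 * inter / (pred_area + truth_area)


# Hypothetical nucleus masks: the prediction covers the true 2 x 2 nucleus plus one extra column.
truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
iou, dice = iou_and_dice(pred, truth)
print(f"IoU = {iou:.3f}, Dice = {dice:.3f}")  # IoU = 0.667, Dice = 0.800
```

For a single mask pair the two metrics are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. Averages taken over many cells need not satisfy this identity exactly.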
WSI-level classification using extracted features
Taken together, the three models provide a comprehensive cell-level analysis of abnormality, which feeds into the WSI-level classification. Using discriminative features extracted at the cell level, classifiers based on different machine learning algorithms (mean AUC: Linear = 0.8, Support Vector Machine = 0.918, Random Forest = 0.857, Logistic Regression = 0.688, XGBoost = 0.945 and CatBoost = 0.929) were trained for three types of WSIs (prepared by the natural sedimentation, membrane-based and centrifugal sedimentation methods). We evaluated and compared the performance of the different models in differentiating ASCUS+ from NILM on the test sets. As demonstrated in Fig. 4C, the XGBoost classifier reaches the highest AUC for all three types of WSIs (mean AUC 0.945). Its AUC is 0.998 for WSIs prepared using the natural sedimentation method, 0.93 for the membrane-based method and 0.908 for the centrifugal sedimentation method, indicating high accuracy in screening ASCUS+ cases. The AUCs for identifying ASCUS+ and LSIL+ using XGBoost are shown in Fig. 4D. Because the identification of ASCUS can be ambiguous in clinical practice, we also excluded ASCUS from the dataset and assessed the performance of LSIL+ vs. NILM only. For the natural sedimentation method, the AUC is 0.996, similar to that for classifying ASCUS+. For WSIs prepared using the membrane-based and centrifugal sedimentation methods, however, the AUC of LSIL+ vs. NILM classification is higher than that of ASCUS+ vs. NILM by 0.035 and 0.057, respectively. This finding suggests that for WSIs prepared by the natural sedimentation method, ASCUS+ vs. NILM classification with the proposed AI system is preferred, whereas for the other two methods, LSIL+ vs. NILM classification is more reliable. As for efficiency, the average processing speed of the complete workflow demonstrated in Fig. 1D is 5.02–6.99 mm²/s, as shown in Fig. 4E.
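The AUC values used to compare the classifiers can be read as the probability that a randomly chosen positive slide receives a higher score than a randomly chosen negative one. The sketch below computes the AUC that way; the labels and scores are hypothetical, and this pairwise formulation is a standard equivalent of the area under the ROC curve rather than the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical WSI-level scores: 1 = ASCUS+, 0 = NILM.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.35, 0.40, 0.80, 0.30, 0.65, 0.90, 0.95]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.2f}")  # AUC = 0.75
```

The pairwise loop is quadratic in the number of slides, which is fine for illustration; a rank-based implementation would be preferable for large test sets.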
Clinical Validation: Independent Test Samples from Xiangya Hospital
In the clinical validation on 2,059 cervical liquid-based cytology images at Xiangya Hospital, we first evaluated model performance using cytology results as the gold standard. The sensitivity of the AI-assisted diagnosis system was 78%, the specificity was 69.81% and the accuracy was 73.68%. For lesion categories at or above ASC-US (ASC-US+), the sensitivity reached 95.08%, the specificity was 69.81% and the accuracy was 77.89%. The AUC of the AI system was 0.916 for classifying LSIL+ and 0.821 for ASCUS+, as shown in Fig. 5A. AI-assisted diagnosis achieved the highest sensitivity, 97.6% compared with 78.5% for cytologists, and slightly increased the specificity from 83.4% to 84.1%, as shown in Fig. 5B. We also compared our results against histopathological results as the gold standard. Of the 2,059 cases, 1,088 remained for analysis after excluding cases with incomplete data. Against the histopathological gold standard, cytologists achieved a sensitivity of 78.45% for LSIL+, a specificity of 83.48%, a false negative rate of 21.55% and an accuracy of 82.17%. The AI alone achieved a sensitivity of 98.23% for LSIL+, a specificity of 56.70%, a false negative rate of 1.76% and an accuracy of 67.46%. AI-assisted doctor diagnosis achieved a sensitivity of 98.23% for LSIL+, a specificity of 84.10% and an accuracy of 87.78%, as illustrated in Fig. 5C. The results show a substantial improvement with AI-assisted assessment, as shown in Fig. 5D. Notably, the sensitivity of diagnosis increased by approximately 19 percentage points after cytopathologists were assisted by the AI system. The efficiency of the different assessment modes was also evaluated at Xiangya Hospital. AI-assisted diagnosis was about 3.69 times faster than manual examination, as demonstrated in Fig. 5E.
These findings indicate that the CCA-DIAG model can not only improve the quality of diagnosis, but also significantly increase the efficiency, which will greatly ease the burden of cytopathologists.
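The sensitivity, specificity, false negative rate and accuracy quoted above all derive from a 2 x 2 table of counts against the gold standard. The sketch below shows the arithmetic. The counts are illustrative assumptions, not values reported in the text: they were chosen to sum to 1,088 cases and to land on the percentages reported for cytologists against histopathology.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard screening metrics from true/false positive and negative counts."""
    total = tp + fn + tn + fp
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_negative_rate": fn / (tp + fn),
        "accuracy": (tp + tn) / total,
    }


# ILLUSTRATIVE counts only (not reported in the paper); they sum to 1,088 cases.
cytologist = diagnostic_metrics(tp=222, fn=61, tn=672, fp=133)
for name, value in cytologist.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these assumed counts the sketch prints a sensitivity of 78.45%, a specificity of 83.48%, a false negative rate of 21.55% and an accuracy of 82.17%. The false negative rate is always the complement of the sensitivity.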
Multi-center clinical validation: a retrospective study and a prospective study
To further validate the performance of the proposed CCA-DIAG AI models, a retrospective clinical validation was conducted using 8,438 samples collected from 10 hospitals in China, including Xiangya Hospital. Model performance was assessed by comparing the WSI-level classification results of AI-assisted diagnosis with the ground truth, as shown in Fig. 6A. In all 10 centres, the sensitivity for LSIL+ was nearly 100%, while the specificity (equivalently, the sensitivity for NILM) was above 60% regardless of the preparation method, as demonstrated in Fig. 6B. These findings suggest that the AI model detected almost all LSIL+ cases while keeping specificity above 60%, indicating its ability to generalize to samples prepared in different hospitals. Additionally, the sensitivity for ASCUS+ was above 0.9 in 5 centres, demonstrating the model's strong performance in detecting ASCUS+.
The prospective clinical validation of the proposed AI-assisted system was conducted in two hospitals over a period of 6 months, during which 106,439 samples were collected, as demonstrated in Fig. 6C. The assessment time of AI-assisted diagnosis was recorded in both hospitals, and no significant difference in assessment time was found between them. This result was comparable to the AI-assisted diagnosis in the prospective study, indicating the consistency of the efficiency of AI-assisted diagnosis, as shown in Fig. 6D. Quantitative analysis indicates a notable gain in efficiency with AI-assisted assessment: a 2.93X increase for MI1 (6 cytologists) and a 3.56X increase for MI2 (4 cytologists) compared with manual assessment under the microscope over the 6-month period. This signifies a reduction in cytologists' workload, with daily throughput surpassing the World Health Organization's recommended limit of 100 assessments per day, as illustrated in Fig. 6E. Furthermore, there is potential for improvement in clinical turnaround time.