From Microscope to AI: Developing an Integrated Diagnostic System for Endometrial Cytology

DOI: https://doi.org/10.21203/rs.3.rs-4205271/v1

Abstract

Objective

To explore the integration of artificial intelligence (AI)-assisted diagnostics into a cytology workflow, focusing on real-time detection of abnormal cell clusters in endometrial cytology without relying on whole-slide imaging (WSI), utilizing a YOLOv5x-based model.

Methods

We employed the YOLOv5x object detection model pretrained on the COCO dataset because of its high-speed and accurate detection capabilities. This study involved real-time direct detection of abnormal cell clusters using a CCD camera attached to a microscope, with the aim of enhancing diagnostic efficiency and accuracy in endometrial cytology. The model was further refined through transfer learning using actual cytology case images, emphasizing the need for a delicate balance between technological advancement and clinical integration.

Results

The integration of our AI model into the diagnostic workflow significantly reduced the time required for diagnosis compared to traditional methods, as demonstrated by the performance metrics that matched or exceeded those of pathologists. This breakthrough underscores the potential of AI to improve diagnostic workflows, particularly in settings where resources or pathology services are limited.

Conclusion

This study presents the first instance of an AI-assisted system for endometrial cytology that operates in real time under a microscope, negating the need for WSI. Our findings highlight the feasibility of embedding AI directly into existing clinical practices, offering significant time savings and potentially matching the diagnostic accuracy of specialists. The successful integration of this technology is a critical step forward in the application of AI in the medical field, paving the way for broader adoption and further research into user-friendly AI applications in pathology diagnostics.

INTRODUCTION

The global shortage of pathologists has emerged as a pressing societal issue. With the rapid advancement of artificial intelligence (AI) technology, there is a growing demand for supportive AI models that can provide preliminary diagnostics in areas with limited pathological medical resources. Recent advances in AI have significantly expanded the possibilities of medical image analysis, and deep convolutional neural networks (CNNs) have demonstrated remarkable success in image recognition tasks, including the interpretation of pathological images. However, the transition of AI models from research to real-world pathology diagnostics, which remain predominantly reliant on traditional microscopy, poses substantial challenges. This study aimed to seamlessly integrate AI assistance into a microscope-based diagnostic workflow, focusing on cytopathology, a field with high sample throughput that is critical for screening efforts such as endometrial cytology.

Endometrial cancer is a critical concern in women's health, ranking second in prevalence worldwide and first in Japan—a troubling status underscored by a 132% rise in global incidence over the last three decades 1. This increase is often attributed to prevalent risk factors, including obesity and aging 2,3. The stakes of early detection are high because it is pivotal for effective treatment and for reducing mortality rates 4. Endometrial cytology is a valuable diagnostic tool that offers a less invasive and less distressing alternative to traditional biopsy 3,5-7. Despite its benefits, accurate diagnosis via endometrial cytology is fraught with challenges for pathologists due to biological factors such as excess blood, mucus, cellular overlap, and hormonal influences, all of which necessitate highly specialized training for proper interpretation 8,9. The development of AI models specifically for endometrial cytology faces several challenges. These include the complexities of digitization, such as focus issues due to cellular overlap, the demanding task of acquiring high-quality annotations in the presence of hormonal cell-type variations, and the shortage of comprehensive, accurately annotated datasets. Consequently, research on AI models for endometrial cytology lags behind that on uterine cervical cytology 10-13.

Against this background, we developed the following AI model. Specimens were digitized using a simple photography device and annotated, and these datasets were used to train an object-detection model. During the inference phase, images from the microscope were fed to the trained model, and the results were output on a monitor in real time. While examining the specimen, the expert could thus obtain the location and probability of any atypical cells present in the field of view.

Our study overcomes three primary challenges: seamless workflow integration using real-time object detection models, simplification of the digitization process, and efficient data annotation strategies for developing a robust model. By employing You Only Look Once version 5x (YOLOv5x)14 for direct abnormal cell cluster detection and leveraging simple digital capture techniques, we developed a system that aligns with practical diagnostic needs, ensuring rapid and effective AI integration into cytology.

This research not only highlights the potential of AI to alleviate the pathologist shortage but also sets a precedent for deploying AI-assisted diagnostics in real-world settings, ensuring both scalability and practicality in clinical practice.

MATERIALS AND METHODS

Overview of the Study Protocol

The study protocol, illustrated in Fig. 1, integrates AI object-detection techniques into the pathological workflow. We employed a smartphone-based system to capture images of endometrial cells and subsequently trained the YOLOv5x model. In malignant cases, we annotated clusters of abnormal cells, whereas benign cases provided essential background data for model training.

Throughout the evaluation phase, our AI models not only processed validation and test datasets comprising digital static images but also performed real-time analysis of microscopy images captured via a CCD camera, with the goal of achieving real-time detection of abnormal cell clusters. When the AI model identified abnormalities exceeding a predefined confidence score threshold, the results were visualized as red bounding boxes around the abnormal cell clusters on a monitor. The total count of abnormal cell clusters was computed at the completion of AI analysis for each slide. If the count surpassed the predefined detection threshold, the AI returned a positive judgment, indicating the need for further biopsy examination; if the count fell below the threshold, the judgment was negative, signifying a normal or benign case. To assess the accuracy of AI-assisted diagnosis, the AI's final judgment was compared with the cytodiagnoses of three pathologists and four medical students in 20 new cases. Furthermore, we investigated whether the accuracy and speed of human diagnosis improved with AI assistance.

Case Selection and Data Preparation

Ethical approval for this study was granted by the Institutional Review Board at the Nippon Medical School (approval number: 23K08900).

From April 2017 to March 2023, at Nippon Medical School Hospital, we selected endometrial cytological slides from 146 cases: 72 cases labeled 'malignant' (endometrial cancer) and 74 cases labeled 'benign' (nonmalignant endometrial lesions such as leiomyoma). All cases were pathologically confirmed using hysterectomy specimens. All endometrial cytology specimens were prepared using the smear technique and stained with Papanicolaou stain. Case selection and data preparation were conducted as follows: 96 cases (49 benign and 47 malignant) were used to develop YOLOv5x. Additionally, 30 cases (15 benign and 15 malignant) were used to assess the accuracy of real-time detection of abnormal cell clusters and to quantify these clusters at the slide level. The remaining 20 cases (10 benign and 10 malignant) were reserved for evaluating the performance metrics of the AI in real-time diagnostic scenarios and for comparative diagnostic evaluations by pathologists and medical students. The purpose of this evaluation was to assess diagnoses made by AI alone, humans alone, and AI-assisted humans, focusing on the accuracy of slide-level analysis for both positive and negative cases. The distribution of these patients is schematically represented in Fig. 2A. Table 1 shows the distribution of patients according to histological category and median age.

Table 1. Distribution and Median Age of Malignant and Benign Patients

|  | Training, validation and test cases (n = 96) | Cases for cell-cluster and slide-level assessment (n = 30) | Cases for diagnostic concordance with/without AI (n = 20) |
| --- | --- | --- | --- |
| Malignant: median age (range) | 57 (31-82) | 58 (38-83) | 54 (28-77) |
| Number of cases | 47 | 10 | 20 |
| Endometrioid carcinoma, Grade 1 | 24 | 10 | 5 |
| Endometrioid carcinoma, Grade 2 | 17 | 3 | 4 |
| Endometrioid carcinoma, Grade 3 | 5 | 0 | 0 |
| Serous carcinoma | 1 | 2 | 1 |
| Benign: median age (range) | 47 (37-73) | 46 (30-57) | 45.5 (38-51) |
| Number of cases | 49 | 10 | 20 |
| Leiomyoma | 49 | 10 | 20 |

Note: The n = 96 cases were used for AI model training; the n = 30 and n = 20 cases were used for real-time object detection under the microscope.

Acquisition of Digital Images Using a Smartphone-Based Diagnostic Imaging Device

Digital images were acquired using a smartphone-based imaging system. Specifically, an iPhone SE (Apple Inc., Cupertino, CA, USA) was mounted on an Olympus BX53 microscope (EVIDENT/Olympus, Tokyo, Japan) using a specialized adapter (i-NTER LENS; MICRONET Co., Kawaguchi, Saitama, Japan), as shown in Fig. 2B. The captured images had a resolution of 4,032 × 3,024 pixels. While examining the cytological slides, the focus was adjusted manually, and images were taken with a 20× objective lens. For malignant cases, images were captured such that abnormal cell clusters identified by a gynecologic pathologist were positioned in the center of the image. For benign cases, randomly selected fields of visible cells, all considered normal, were captured.

Dataset and Annotation

In our study, the dataset comprised 3,151 endometrial cytology images, including both benign and malignant images (Fig. 2A). Malignant images were annotated for abnormal cell clusters using LabelImg (version 1.8.6), a Python-based graphical annotation tool; abnormal cell clusters were labeled 'malignant' regardless of their histological type, and all other material was treated as 'background' without annotation. Everything appearing in benign images was likewise treated as 'background'. In addition to atypical cell clusters, various materials, such as benign cell clusters, mucus, and inflammatory cells, appear in endometrial cytology specimens, and labeling each of these individually would be an extremely complex task. Because the purpose of this model is to detect malignant cell clusters, it is not strictly necessary to detect other substances; we therefore prioritized simplicity of annotation and labeled only the atypical cell clusters. The dataset was refined to 1,579 malignant and 1,572 benign images, with the final training, validation, and testing split set at an 8:1:1 ratio (Fig. 2A).
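As a concrete illustration of the annotation scheme and split described above, the following minimal Python sketch shows the single-class YOLO label format that LabelImg produces and an 8:1:1 random split; the directory path, file extension, and seed are illustrative assumptions, not the study's actual values.

```python
# A minimal sketch of the single-class YOLO label format and the 8:1:1 split.
import random
from pathlib import Path

# LabelImg (YOLO format) writes one line per annotated abnormal cell cluster:
#   <class_id> <x_center> <y_center> <width> <height>   (all normalized to [0, 1])
# With a single 'malignant' class, class_id is always 0, e.g.:
#   0 0.512 0.430 0.180 0.220
# Benign images get an empty (or absent) label file, so everything in them
# is treated as background during training.

def split_dataset(image_dir: str, seed: int = 0):
    """Shuffle images and split them 8:1:1 into train/val/test lists."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * 0.8)
    n_val = int(len(images) * 0.1)
    return (images[:n_train],
            images[n_train:n_train + n_val],
            images[n_train + n_val:])

train, val, test = split_dataset("dataset/images")  # hypothetical directory
```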

Architecture of the Object Detection Model

Our study employed YOLOv5x 14, an object detection model chosen for its proven image recognition performance and its capability for high-speed analysis, which is critical for the real-time processing demands of endometrial cytology image analysis. YOLOv5x operates as a one-stage detector, optimizing efficiency for detecting cellular anomalies with high precision 15. The general architecture of YOLOv5 is illustrated in Fig. 3 and is structured around three main components: the backbone, neck, and head. The backbone incorporates a cross-stage partial network with DarkNet53 to enhance feature map processing efficiency by splitting the input into two paths, conserving computational resources while maintaining information diversity. The neck employs a path aggregation network for improved feature propagation and spatial pyramid pooling fusion to handle inputs of varying sizes, facilitating robust detection across different object scales and contexts. The head adopts the YOLOv3 detection head to finalize the detection process 15.

Pretrained on the extensive Microsoft Common Objects in Context (COCO) 2017 dataset, YOLOv5x leveraged a wide-ranging preexisting knowledge base, allowing it to adapt to the nuanced challenges of cytological imagery. The primary hyperparameters were carefully selected based on extensive preliminary experiments aimed at maximizing training efficiency and model performance. The hyperparameters used, including the number of epochs, batch size, and learning rate, are listed in Table 2. Other parameters were kept at their default values, as provided in the YOLOv5 GitHub repository 14 (e.g., no frozen layers), to fully exploit the learning capacity of the model in recognizing the specific features of endometrial cytology images.

Table 2. Training conditions for model optimization

| Settings | Batch size | Epochs | Optimizer | Learning rate | Weight decay |
| --- | --- | --- | --- | --- | --- |
| YOLOv5x | 4 | 200 | SGD | 0.01 | 0.0005 |
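For reference, a training run matching the Table 2 settings could be launched with the public YOLOv5 repository's train.py roughly as follows. This is a sketch under stated assumptions: the dataset YAML path is a hypothetical placeholder, the --optimizer flag exists in recent releases of the repository, and lr0 = 0.01 with weight_decay = 0.0005 are the repository's defaults in hyp.scratch-low.yaml, so only the batch size, epoch count, and optimizer are passed explicitly.

```python
# A minimal sketch of the transfer-learning run, assuming the
# ultralytics/yolov5 repository is cloned and its requirements installed.
import subprocess

subprocess.run([
    "python", "train.py",
    "--weights", "yolov5x.pt",          # COCO-pretrained YOLOv5x checkpoint
    "--data", "data/endometrial.yaml",  # hypothetical single-class dataset config
    "--img", "640",                     # input size after preprocessing
    "--batch-size", "4",
    "--epochs", "200",
    "--optimizer", "SGD",               # lr0=0.01, weight_decay=0.0005 by default
], check=True)
```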

Image Preprocessing

Following the acquisition and annotation of the dataset, we proceeded with an image preprocessing phase. The original high-resolution images, measuring 4,032 × 3,024 pixels, were resized to 640 × 640 pixels to conform to the YOLOv5x input requirements. The training process used YOLOv5x's built-in data augmentation features, including mosaic, rotation, flipping, and color adjustments, to enhance the robustness of the model against variations.
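A minimal sketch of this resize step is shown below, assuming OpenCV is available; whether the study used a plain resize or YOLOv5's internal letterboxing is not stated, so a plain resize is illustrated.

```python
# Resize a full-resolution capture to the 640 x 640 YOLOv5x input size.
import cv2
import numpy as np

def preprocess(path: str) -> np.ndarray:
    """Load a 4,032 x 3,024 smartphone capture and resize it to 640 x 640."""
    img = cv2.imread(path)
    return cv2.resize(img, (640, 640), interpolation=cv2.INTER_AREA)
```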

Computational Setup and Software Environment

Our computational setup included an Iiyama Sense 15F161 laptop PC powered by an Intel Core i7 CPU and an NVIDIA GeForce RTX3060 GPU. The software environment was established using Anaconda Distribution (version 2022.10), with Python 3.10.9 as the programming language and PyTorch 1.13.1 as the deep learning framework.

Model Performance Evaluation Using Static Images for Object Detection

Following the previously described model training and optimization process, we evaluated the performance of each model using static images from the validation and test datasets. For this evaluation, we employed several metric standards in object detection tasks, including precision, recall, F1 score, and mean average precision (mAP).

Precision quantifies the fraction of accurate positive identifications relative to the total number of positive identifications made by the model and is defined as

$$Precision = \frac{TP}{TP+FP}$$

Here, TP and FP denote the true positives and false positives, respectively. Recall, or sensitivity, measures the fraction of accurate positive identifications relative to the total number of actual positive instances and is defined as

$$Recall = \frac{TP}{TP+FN}$$

Here, FN denotes false negatives. The F1 score is the harmonic mean of precision and recall and is defined as

$$F1\ Score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

With precision on the vertical axis and recall on the horizontal axis, a precision-recall (PR) curve can be drawn. The area under the PR curve is the average precision (AP). AP is a standard metric in object detection evaluation and offers a comprehensive measure by averaging precision across varying recall thresholds. It is defined as follows:

$$AP=\sum_{k=1}^{m}p(k)\,\Delta r(k)$$

where k is the rank in the confidence-sorted list of detected abnormal cell clusters, m is the number of detections, p(k) is the precision at cutoff k, and Δr(k) is the change in recall from item k − 1 to item k. The mAP is the mean of the AP values across classes and is defined as follows:

$$mAP=\frac{1}{n}\sum_{i=1}^{n}AP(i)$$

where n is the number of classes and AP(i) is the AP value for a given class. In this study, because only one class of objects was trained, the AP and mAP values matched.
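To make these formulas concrete, the following sketch computes precision, recall, and AP from a confidence-ranked list of detections. The inputs are illustrative: `matched[k] = 1` marks a detection that overlaps a ground-truth cluster at IoU ≥ 0.5, and the code implements the non-interpolated sum of p(k)Δr(k) given above.

```python
# AP accumulated down a confidence-ranked detection list.
import numpy as np

def average_precision(matched: np.ndarray, num_gt: int) -> float:
    """matched: 1/0 per detection, sorted by descending confidence score."""
    tp = np.cumsum(matched)          # true positives at each cutoff k
    fp = np.cumsum(1 - matched)      # false positives at each cutoff k
    precision = tp / (tp + fp)       # p(k)
    recall = tp / num_gt             # r(k)
    prev_r, ap = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)       # p(k) * delta-r(k)
        prev_r = r
    return ap

# Example: 5 detections, 4 ground-truth abnormal cell clusters.
print(average_precision(np.array([1, 1, 0, 1, 0]), num_gt=4))  # 0.6875
```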

During the model training phase, performance was assessed using the validation set, with the test set applied after training. The model was evaluated using an intersection over union (IoU) threshold of 0.5, which measures the congruence between the ground-truth bounding box and the model's predicted bounding box, providing insight into the precision of the model regarding object location and size. Because benign cases were used as background during training, true negatives could not be measured and a receiver operating characteristic (ROC) curve could not be generated; we therefore evaluated the performance of YOLOv5x on the static image test set using PR curves, which were plotted with Matplotlib.
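The IoU criterion can be sketched as follows, assuming boxes in (x1, y1, x2, y2) pixel coordinates; a predicted box counts as a true positive when this value is at least 0.5.

```python
# Intersection over union of two axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ~ 0.143
```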

Microscope Setup for Real-time Object Detection

For real-time detection, we connected a microscope (ECLIPSE Ci, Nikon Co., Tokyo, Japan) with a C-mount to a charge-coupled device (CCD) camera (JCS-HR5U, CANON Inc., Tokyo, Japan). The trained model was set to inference mode in an integrated development environment (Visual Studio Code, Microsoft Co., WA, USA), with the input image set to the video from the CCD camera to instantly display the bounding box and confidence score for the detection of abnormal cell clusters (Fig. 4). The confidence score threshold for displaying bounding boxes was set to 0.01 for YOLOv5x. We evaluated the speed of model detection using frames per second (FPS), which refers to the number of images that can be processed per second. The higher this value is, the smoother and more natural the image on the monitor. Generally, an FPS greater than 30 is considered to provide real-time detection.
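The inference loop can be sketched roughly as follows. This is a sketch under stated assumptions: the trained weights are saved as 'best.pt', the CCD camera is exposed as an ordinary video device at index 0, and custom weights are loaded through torch.hub as documented in the public Ultralytics repository.

```python
# Per-frame YOLOv5x inference on a camera feed, with an FPS overlay.
import time
import cv2
import numpy as np
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
model.conf = 0.01                    # confidence score threshold used in this study

cap = cv2.VideoCapture(0)            # CCD camera as a video source (assumed index)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    t0 = time.time()
    rgb = np.ascontiguousarray(frame[..., ::-1])   # OpenCV is BGR; model expects RGB
    results = model(rgb)                           # per-frame inference
    rendered = cv2.cvtColor(results.render()[0], cv2.COLOR_RGB2BGR)
    fps = 1.0 / max(time.time() - t0, 1e-6)
    cv2.putText(rendered, f"{fps:.0f} FPS", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
    cv2.imshow("YOLOv5x real-time detection", rendered)
    if cv2.waitKey(1) == 27:                       # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```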

Evaluation of Abnormal Cell Cluster Detection Under a Microscope by a Real-Time Detection Method

In line with the case selection detailed earlier, 20 cases (10 benign and 10 malignant) were earmarked for real-time detection accuracy assessment. A total of 100 points were marked (50 in malignant cases and 50 in benign cases), with five points per slide placed near abnormal or randomly selected benign cell clusters, for comprehensive evaluation by the trained YOLOv5x model. Bounding boxes with confidence scores for the detected cell clusters were recorded in real time under a microscope. To establish the optimal confidence score threshold, ROC curves were generated from the recorded confidence scores, and the area under the curve (AUC) was calculated. This allowed the computation of performance metrics for detecting abnormal cell clusters above the confidence score threshold, providing a measure of the model's detection accuracy.
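A sketch of this threshold selection with scikit-learn is shown below. The score array is a placeholder for the confidence scores recorded at the 100 marked points, and maximizing Youden's J statistic is one common choice (an assumption here) for the point that balances the true and false positive rates.

```python
# ROC-based selection of a confidence score threshold.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1] * 50 + [0] * 50)   # 1 = abnormal marked point, 0 = benign
scores = np.random.rand(100)             # placeholder recorded confidence scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
best = thresholds[np.argmax(tpr - fpr)]  # maximize Youden's J = TPR - FPR
print(f"AUC = {auc:.2f}, confidence score threshold = {best:.3f}")
```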

Evaluation of Slide-Level Real-Time Diagnostic Performance Using YOLOv5x

At our institution, endometrial cytopathology is classified into five classes: class I for normal cells, class II for benign changes, class III for indeterminate or atypical cells that require further assessment, class IV for cells suspicious of malignancy, and class V for cells that are unequivocally malignant. In this study, we translated this system into a binary schema to streamline training and evaluation of the deep learning model. Classes I and II were labeled 'no abnormality', while classes III to V were grouped as 'requires further examination', establishing a binary gold standard (GS) for comparative analysis. To evaluate the diagnostic accuracy of the trained YOLOv5x model at the slide level using real-time detection, each slide was moved at a consistent speed from one end to the other over a duration of 4 min, and the number of detected abnormal cell clusters was counted. A new set of 10 slides (five benign and five malignant) was analyzed for this purpose. ROC curves were plotted based on the number of clusters detected by YOLOv5x against the GS, and the AUC was calculated to determine the optimal threshold for slide-level detection. Slides with bounding box counts exceeding this threshold were designated 'requires further examination', while those below it were considered 'no abnormality'.
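The slide-level rule thus reduces to counting supra-threshold detections accumulated during the 4-minute scan and comparing the count to the ROC-derived cutoff, as in the following sketch; both threshold values are placeholders rather than the study's fitted values.

```python
# Slide-level triage from detections recorded during one scan.
CONF_THRESHOLD = 0.5    # per-cluster confidence score cutoff (placeholder)
COUNT_THRESHOLD = 3     # per-slide detection-count cutoff (placeholder)

def classify_slide(confidences: list[float]) -> str:
    """confidences: scores of all boxes recorded while scanning one slide."""
    count = sum(c >= CONF_THRESHOLD for c in confidences)
    return ("requires further examination"
            if count > COUNT_THRESHOLD else "no abnormality")

print(classify_slide([0.9, 0.8, 0.7, 0.65, 0.2]))  # requires further examination
```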

Diagnostic Performance Evaluation with and without the AI Assist

Human Evaluators

The human evaluators in this study comprised three pathologists with different specialties and four medical students at various stages of their education. Pathol-1 is a pathologist who specializes in gynecology, whereas Pathol-2 and Pathol-3 have expertise in nongynecological areas. Medical students were divided based on their exposure to the AI model and their academic year: Stud-1 (4th year) and Stud-2 (5th year) were directly involved in the AI model's annotation and training process. Conversely, Stud-3 (3rd year) and Stud-4 (1st year) were provided with a 20-minute lecture on endometrial cytology to familiarize them with the subject before their participation in the study. This setup aimed to assess diagnostic performance across a spectrum of experience levels and determine the impact of AI assistance on diagnostic accuracy and speed.

Cohen's kappa coefficient analysis

To establish a comprehensive understanding of the diagnostic process, the baseline performance of the classifiers was first assessed using Cohen's kappa coefficient. This statistical measure accounts for chance agreement and provides a robust method for gauging concordance in binary-classification tasks. By comparing the predictions of the AI model and each human evaluator with the gold standard, we obtained a baseline diagnostic accuracy that reflected the true performance without the influence of AI. The kappa scores were calculated using Scikit-learn, and heatmaps for visualization of agreement levels were generated with Matplotlib, illustrating the alignment between the gold standard and the evaluators' diagnoses.
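A minimal sketch of this computation is shown below, with placeholder binary diagnosis vectors standing in for the 20-case judgments of each evaluator.

```python
# Cohen's kappa against the gold standard, visualized as a heatmap.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import cohen_kappa_score

gs = np.array([1] * 10 + [0] * 10)                     # gold standard labels
evaluators = {"YOLOv5x": np.random.randint(0, 2, 20),  # placeholder predictions
              "Pathol-1": np.random.randint(0, 2, 20)}

kappas = [cohen_kappa_score(gs, pred) for pred in evaluators.values()]

fig, ax = plt.subplots()
im = ax.imshow(np.array(kappas)[None, :], vmin=0, vmax=1, cmap="viridis")
ax.set_xticks(range(len(evaluators)))
ax.set_xticklabels(list(evaluators))
ax.set_yticks([0])
ax.set_yticklabels(["GS"])
fig.colorbar(im, label="Cohen's kappa")
plt.show()
```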

AI-assisted Implementation

Following the baseline assessment, AI assistance was introduced to evaluate its impact on diagnostic performance. The YOLOv5x model was integrated into the diagnostic workflow, providing real-time assistance by presenting predictions and bounding boxes for abnormal cell clusters alongside the cytological slides. This enabled a direct comparison of the evaluators' performance metrics with and without the aid of AI. Accuracy, the ratio of correctly predicted observations to the total number of observations, is defined as

$$Accuracy = \frac{TP +TN}{TP + TN + FP + FN}$$

Precision, recall, and F1 score are as defined in the 'Model Performance Evaluation Using Static Images for Object Detection' section. A paired t test using SciPy was conducted to statistically compare these metrics, and the performance differences were visualized in boxplot format using Matplotlib.
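As an illustration, the paired comparison can be run as follows; the accuracy vectors below are the per-evaluator with/without-AI values reported in Table 5.

```python
# Paired t test and boxplot for per-evaluator accuracy with vs. without AI.
import matplotlib.pyplot as plt
from scipy.stats import ttest_rel

acc_without = [0.80, 0.70, 0.65, 0.80, 0.80, 0.65, 0.75]  # one value per evaluator
acc_with = [0.75, 0.75, 0.80, 0.85, 0.85, 0.65, 0.60]

t, p = ttest_rel(acc_with, acc_without)
print(f"paired t test: t = {t:.2f}, p = {p:.3f}")

plt.boxplot([acc_without, acc_with], labels=["without AI", "with AI"])
plt.ylabel("Accuracy")
plt.show()
```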

Time Measurement for Diagnosis

The efficiency of the diagnostic process, with and without AI assistance, was quantified by measuring the time taken to diagnose a set of patients. This part of the study focused on the practical application of AI in a clinical setting, aiming to identify any significant time savings offered by the integration of AI technology. The total diagnostic time for all patients was recorded in a controlled environment, and a paired t test was applied to determine the significance of the time differences, thereby highlighting the potential of AI to streamline the diagnostic workflow.

RESULTS

Model Performance Evaluation for the Detection of Abnormal Cell Clusters in Static Images

To evaluate the ability of the YOLOv5x model to identify abnormal cell clusters in static images, we utilized hold-out validation and test datasets to assess the performance metrics. These datasets were carefully split to ensure that the validation and test sets remained independent and representative of a broad spectrum of imaging conditions. The precision‒recall curves, which provide a comprehensive visual representation of the detection capabilities of the model, are illustrated in Fig. 5A and B. The evaluation focused on key performance metrics, including precision, recall, F1 score, and mean average precision (mAP). These benchmarks were selected for their relevance in quantifying the detection performance in static image analysis, which is a critical component of AI-assisted diagnostic processes. The YOLOv5x model demonstrated robust performance in this context, achieving an mAP of 0.791 for the validation data and 0.752 for the test data (Fig. 5C). These results highlight the ability of the model to accurately and reliably detect abnormal cell clusters in a controlled static image environment. This performance not only validates the effectiveness of YOLOv5x in standard image-based analyses but also sets a foundational benchmark for further exploration of real-time detection and comprehensive slide analysis, which are crucial for real-world clinical applications.

Performance Outcomes of the Trained YOLOv5x Model

The YOLOv5x model, trained specifically on endometrial cytological images, demonstrated a high processing rate of 60 frames per second (FPS) during real-time object detection tasks. This rate, as shown in the real-time visualizations on the monitor, highlights the rapid and effective analysis capabilities of the model for identifying abnormal cell clusters.

Performance of the Trained YOLOv5x in Real-time Detection of Abnormal Cell Clusters

The real-time detection capability of the trained YOLOv5x model was quantified based on its ability to identify abnormal cell clusters on cytological slides. The performance of the model was evaluated using 100 points marked by a gynecological pathologist across 20 cases (benign and malignant), as depicted in Fig. 6A. These marked points and expert annotations served as the reference for validating the model's identification of cell clusters, with the associated confidence scores recorded. The ROC curve, shown in Fig. 6B, was used to establish an optimal confidence score threshold, determined by selecting the point on the ROC curve with a favorable balance between the true positive rate (TPR) and the false positive rate (FPR). The area under the curve (AUC) was 0.91, reflecting the strong detection performance of the model.

Slide-Level Real-Time Object Detection Performance of YOLOv5x under a Microscope

We evaluated the slide-level diagnostic performance of the trained YOLOv5x model using a new set of 20 slides, each analyzed over a fixed scanning time and evenly divided between 10 malignant and 10 benign cases (Fig. 6C). The number of bounding boxes with confidence scores surpassing the predetermined threshold, reflecting our binary classification standard, was recorded for each slide. The ROC curve, constructed to relate the number of AI-detected abnormal cell clusters to the cytological diagnosis of the slide, was used to determine the optimal threshold for the bounding box count. At this threshold, the YOLOv5x model demonstrated an AUC of 0.92, indicating high diagnostic performance (Fig. 6D).

Performance Metrics of the Final Diagnostic Test for AI and Human Evaluators in a New Set of 20 Patients

The final diagnostic tests for the AI and human evaluators were conducted on a new set of 20 patients, comprising 10 with malignant and 10 with benign lesions. The performance metrics of the trained AI and the human evaluators are summarized in Table 3: the AI model YOLOv5x outperformed all human evaluators in accuracy, precision, recall, and F1 score, achieving an accuracy of 0.85, precision of 0.82, recall of 0.90, and an F1 score of 0.86. Among the human evaluators, 'Pathol-1', a gynecological pathologist, showed strong performance, scoring 0.80 across all four metrics. 'Pathol-2' and 'Pathol-3', the nongynecological pathologists, demonstrated lower scores: 'Pathol-2' achieved 0.70 accuracy, 0.67 precision, 0.80 recall, and 0.73 F1, while 'Pathol-3' scored 0.65, 0.64, 0.70, and 0.67, respectively, for the same metrics. Of the medical students involved in AI modeling, 'Stud-1' and 'Stud-2' both recorded consistent scores of 0.80 across all metrics. In contrast, 'Stud-3' and 'Stud-4', who had only received a twenty-minute lecture on endometrial cytology, had accuracy scores of 0.65 and 0.75, precision scores of 0.64 and 0.73, recall scores of 0.70 and 0.80, and F1 scores of 0.67 and 0.76, respectively. These results highlight the potential of the trained YOLOv5x model to deliver accurate and reliable endometrial cytology diagnoses.

Table 3. Performance Metrics of the Final Diagnostic Test for AI and Human Evaluators

| Evaluator | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| YOLOv5x | 0.85 | 0.82 | 0.90 | 0.86 |
| Pathol-1 | 0.80 | 0.80 | 0.80 | 0.80 |
| Pathol-2 | 0.70 | 0.67 | 0.80 | 0.73 |
| Pathol-3 | 0.65 | 0.64 | 0.70 | 0.67 |
| Stud-1 | 0.80 | 0.80 | 0.80 | 0.80 |
| Stud-2 | 0.80 | 0.80 | 0.80 | 0.80 |
| Stud-3 | 0.65 | 0.64 | 0.70 | 0.67 |
| Stud-4 | 0.75 | 0.73 | 0.80 | 0.76 |

Note: 'Pathol-1' represents a gynecological pathologist; 'Pathol-2' and 'Pathol-3' are nongynecological pathologists; 'Stud-1' and 'Stud-2' denote medical students involved in AI modeling; and 'Stud-3' and 'Stud-4' indicate medical students who received a twenty-minute lecture on endometrial cytology.

In addition, the kappa coefficients of trained AI and human evaluators were calculated and are detailed in Table 4, providing insight into diagnostic concordance across evaluators.

Table 4. Comparison of Cohen's kappa values between the gold standard, AI (YOLOv5x), and human evaluators

|  | YOLOv5x | Pathol-1 | Pathol-2 | Pathol-3 | Stud-1 | Stud-2 | Stud-3 | Stud-4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GS | 0.7 | 0.6 | 0.4 | 0.3 | 0.6 | 0.6 | 0.3 | 0.5 |

Note: 'GS' refers to the gold standard, which in this context is the preoperative cytological diagnosis. 'Pathol-1' represents a gynecological pathologist; 'Pathol-2' and 'Pathol-3' are nongynecological pathologists; 'Stud-1' and 'Stud-2' denote medical students involved in AI modeling; and 'Stud-3' and 'Stud-4' indicate medical students who received a twenty-minute lecture on endometrial cytology.

The AI model exhibited substantial agreement with the gold standard (GS), achieving a kappa coefficient of 0.7. Senior medical students (Stud-1 and Stud-2), who were involved in the model's annotation and training, demonstrated good agreement with the GS (kappa = 0.6), as did the specialist gynecological pathologist (Pathol-1, kappa = 0.6). In contrast, the nongynecological pathologists (Pathol-2 and Pathol-3) showed moderate to fair agreement (kappa = 0.4 and 0.3, respectively), while the junior students (Stud-3 and Stud-4) showed fair to moderate agreement (kappa = 0.3 and 0.5, respectively). This underscores the reliability of the model in identifying abnormal cell clusters and making cytological judgments at a level comparable to that of an expert.

Impact of AI-assist implementation on diagnostic performance

The results of the investigation of the impact of AI assistance on diagnostic evaluations are summarized in Table 5, which shows the differential effects of AI assistance on diagnostic performance metrics and diagnostic time for the various evaluators. The implementation of AI assistance led to improvements in accuracy for most evaluators, with 'Pathol-3' increasing from 0.65 to 0.80 and 'Stud-1' and 'Stud-2' each reaching 0.85. Precision also improved with AI assistance, notably for 'Pathol-3' (from 0.64 to 0.75) and for 'Stud-1' and 'Stud-2' (each from 0.80 to 0.82). In the context of cytological examinations, where sensitivity is crucial so that abnormal cells are not overlooked, AI assistance demonstrated its efficacy by increasing recall for all evaluators except 'Stud-3' and 'Stud-4', who had received only the twenty-minute lecture on endometrial cytology. Although there was a trend toward improvement across the performance metrics (Table 5), the differences were not statistically significant (Fig. 7A).

Table 5. Performance Metrics and Diagnostic Times With and Without AI Assistance

Accuracy
| AI-assist | Pathol-1 | Pathol-2 | Pathol-3 | Stud-1 | Stud-2 | Stud-3 | Stud-4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (-) | 0.80 | 0.70 | 0.65 | 0.80 | 0.80 | 0.65 | 0.75 |
| (+) | 0.75 | 0.75 | 0.80 | 0.85 | 0.85 | 0.65 | 0.60 |
| Improvement | -0.05 | +0.05 | +0.15 | +0.05 | +0.05 | 0.00 | -0.15 |

Precision
| AI-assist | Pathol-1 | Pathol-2 | Pathol-3 | Stud-1 | Stud-2 | Stud-3 | Stud-4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (-) | 0.80 | 0.67 | 0.64 | 0.80 | 0.80 | 0.64 | 0.73 |
| (+) | 0.69 | 0.67 | 0.75 | 0.82 | 0.82 | 0.64 | 0.60 |
| Improvement | -0.11 | 0.00 | +0.11 | +0.02 | +0.02 | 0.00 | -0.13 |

Recall
| AI-assist | Pathol-1 | Pathol-2 | Pathol-3 | Stud-1 | Stud-2 | Stud-3 | Stud-4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (-) | 0.80 | 0.80 | 0.70 | 0.80 | 0.80 | 0.70 | 0.80 |
| (+) | 0.90 | 1.00 | 0.90 | 0.90 | 0.90 | 0.70 | 0.60 |
| Improvement | +0.10 | +0.20 | +0.20 | +0.10 | +0.10 | 0.00 | -0.20 |

F1 score
| AI-assist | Pathol-1 | Pathol-2 | Pathol-3 | Stud-1 | Stud-2 | Stud-3 | Stud-4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (-) | 0.80 | 0.73 | 0.67 | 0.80 | 0.80 | 0.67 | 0.76 |
| (+) | 0.78 | 0.80 | 0.82 | 0.86 | 0.86 | 0.67 | 0.60 |
| Improvement | -0.02 | +0.07 | +0.15 | +0.06 | +0.06 | 0.00 | -0.16 |

Diagnostic time (seconds)
| AI-assist | Pathol-1 | Pathol-2 | Pathol-3 | Stud-1 | Stud-2 | Stud-3 | Stud-4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (-) | 3,014 | 2,544 | 6,600 | 4,458 | 4,681 | 4,593 | 2,940 |
| (+) | 1,800 | 2,040 | 2,700 | 1,860 | 3,354 | 2,764 | 2,460 |
| Improvement | -1,214 | -504 | -3,900 | -2,598 | -1,327 | -1,829 | -480 |

Note: 'Improvement' is the difference between the (+) and (-) rows. 'Pathol-1' represents a gynecological pathologist; 'Pathol-2' and 'Pathol-3' are nongynecological pathologists; 'Stud-1' and 'Stud-2' denote medical students involved in AI modeling; and 'Stud-3' and 'Stud-4' indicate medical students who received a twenty-minute lecture on endometrial cytology.

These results indicate the potential need for acclimatization to the operation of an AI-assisted endometrial cytology system.

Assessment of Diagnostic Time Efficiency with AI Integration

In the assessment of diagnostic time efficiency, the integration of AI significantly reduced the time required to diagnose endometrial cytology slides. Figure 7B shows a box plot comparing the diagnostic times with and without the aid of AI. The median time for slide evaluation without AI was substantially greater, at 4,500 s, compared with 2,750 s when AI was used. The interquartile range (IQR) for manual diagnosis without AI spanned 3,000 to 6,000 s, indicating wide variability in the time required by evaluators. In contrast, the IQR with AI was notably narrower, ranging from 2,000 to 3,500 s, reflecting a more consistent and reduced diagnostic time across all patients. Statistical analysis confirmed that the observed reduction in time was significant, with a p value of 0.010. This finding underscores the potential of AI to streamline the diagnostic process and suggests that AI assistance could alleviate the time burden on pathologists, particularly in resource-limited settings where efficiency is paramount.

DISCUSSION

In this study, we explored the integration of AI-assisted diagnosis into the cytology workflow by conducting real-time direct detection of abnormal cell clusters with YOLOv5x, using a CCD camera attached to a microscope. This approach significantly reduced the time required for diagnosis, by almost half compared to traditional methods without AI assistance, and demonstrated performance metrics that could match or exceed those of pathologists. This not only highlights the effectiveness of AI in improving diagnostic efficiency but also represents the first instance of an AI-assisted system with real-time object detection for endometrial cytology under a microscope, without the use of WSI.

In gynecologic oncology, cervical, endometrial, and ovarian cancers are the three main gynecological malignancies, and cytology is usually performed to screen for cervical and endometrial cancers. Although endometrial cancer is a common gynecological cancer in developed countries and shows an increasing trend, there are far fewer reports of AI research on its cytology than on cervical cancer 10,11,13,16,17. The main reasons cited are the difficulty of acquiring digital images due to overlapping cells and the appearance of a variety of cell types, such as glandular epithelial cells, endometrial stromal cells, and inflammatory cells, which are readily affected by the hormonal environment and make annotation challenging 8,9. AI research on endometrial cytology reported to date includes models that manually extract nuclei from glandular epithelial cells to classify them as benign or malignant 13,16. Recently, high performance metrics have been achieved using a two-stage method that segments cell clusters from WSI patches to reduce annotation costs 11.

These findings indicate that accurate endometrial cytology AI models can be created through innovations that resolve the unique challenges of endometrial cytology. The performance metrics of currently reported endometrial cytology AI models are mainly based on static image datasets. In actual endometrial cytology diagnosis, various classification methods are used worldwide; fundamentally, however, the atypia of individual cell clusters is first detected, and a case is then classified as requiring further examination based on the presence of these atypical cell clusters throughout the slide. Accordingly, recent cytological AI research has attempted a two-step classification, detecting atypical cell clusters and then counting them at the slide level, to bring the process closer to real clinical practice 18,19. In this study, in accordance with the actual cytodiagnostic process, the real-time object detection results were used first to detect abnormal cell clusters and then to count them per slide, yielding a final binary classification into 'requires further examination' and 'no abnormality' cases.

In this study, the existing YOLOv5x, pretrained on the publicly accessible COCO dataset, was employed as the diagnostic support model, with transfer learning conducted on actual case images. YOLOv5x was chosen for its ability to achieve fast and accurate object detection, its capability for immediate detection in video, and its relatively light computational footprint during operation. In recent studies on cervical cytology involving object detection in static images, the YOLO series has shown high efficiency 20,21. In our study, YOLOv5x achieved real-time object detection at 60 FPS, a rate suitable for actual clinical use and likely to be very effective for object detection during microscopic observation.

Regarding real-time object detection in medical images, real-time detection models for colonoscopy and ultrasound examinations, designed to assist diagnosis by detecting abnormal areas observed in clinical settings, have been reported 22,23. In the pathology field, an augmented reality microscope was developed by Chen et al. 24. This groundbreaking system overlaid supporting information, such as AR images, within the microscope's field of view for purposes such as identifying lymph node metastases in breast cancer or grading prostate cancer according to the Gleason score. Inspired by these reports of real-time detection, our study investigated whether immediate support seamlessly integrated into the diagnostic workflow during microscopic observation is preferable. Considering the need for immediate digitization and processing of microscopy images, we chose to capture images using a CCD camera, which was suitable and straightforward and eliminated the need to purchase new equipment.

The process of integrating AI into healthcare is delicate and depends not only on technological advances but also on acceptance by medical professionals. Lambert et al. highlighted several factors that influence the acceptance of AI in medical settings, including perceived efficacy, potential loss of professional autonomy, and the ease of integrating AI into clinical workflows 25. With these considerations in mind, our study sought to provide real-time feedback while improving the efficiency and performance of endometrial cytology in harmony with the existing cytology workflow. Unlike radiology, where diagnosis is already performed on digital images, one challenge in using AI-assist models for pathological diagnosis, including cytology, is the necessity of digitizing glass slides. In recent years, WSI has contributed significantly to the digitization of glass slides, and many AI developments have been based on WSIs. However, WSI scanners are expensive, and their large image sizes make storage and management challenging, so many facilities still diagnose under the microscope. Current AI approaches require conversion to WSI at the time of inference, which necessitates significant changes to the current diagnostic flow. As Lambert et al. noted, such factors constitute a considerable barrier to the initial introduction of AI in pathology. The high implementation cost of WSI, together with the need for AI diagnostic support that can be integrated alongside microscopic diagnosis in regions with limited resources or a shortage of pathologists, underscores the necessity of developing AI models such as ours.

The diagnostic accuracy of the AI model in this study was 0.8 in the final test using new cases, higher than the mean score of the pathologists (0.75), suggesting that AI can supplement diagnostic capabilities in specific specialties and sometimes match them. Analysis using Cohen's kappa coefficient showed consistently high agreement of 0.7 between the AI and the GS (gold standard), indicating that the AI can make diagnoses in accordance with general diagnostic standards. However, no significant difference in the accuracy of cytological diagnosis was observed for human participants of various experience levels working with versus without AI assistance. These findings outline both the potential benefits and the limitations of integrating AI into clinical diagnostic processes. Although the introduction of AI may improve the accuracy of pathological diagnoses, its effectiveness appears to depend on the user's prior knowledge and experience. Future research should further explore the quality and diversity of datasets for AI model training, the role of AI in training medical students and pathologists, and educational strategies to maximize the benefits of AI assistance. Additionally, this study demonstrated that AI assistance significantly reduced the time required for diagnosis.

This study had two main limitations. First, owing to the use of actual clinical cases, the sample size was inherently limited and restricted to specimens provided by a single facility. Second, the model for immediate object detection in this study was intended primarily for support use, not for automatically classifying diagnoses, necessitating human intervention. Although an automatic classification model might be convenient if the transition to digital pathology diagnosis is realized in the future, we consider this study to be the first step toward promoting the integration of AI into current pathology diagnosis. Further research should consider innovations such as user-friendly application development.

CONCLUSION

This study aimed to develop an AI-assisted model for the diagnosis of endometrial cytology using real-time object detection technology. This is the first report utilizing AI to provide real-time assistance under a microscope to detect abnormal cells in the endometrium without the need for WSI, offering a significant advantage by seamlessly integrating into the existing diagnostic workflow. The potential of AI assistance through real-time object detection under a microscope is significant, not only for the rapid diagnosis required during procedures such as rapid on-site evaluation (ROSE) but also for use in resource-limited settings. This approach could be beneficial beyond endometrial cytology, enhancing the rapid diagnostic capabilities in various medical fields.

Declarations

FUNDING

This work was supported by the Japan Society for the Promotion of Science Grant-in-Aid for the Diversity Women Leader Development Grant and Scientific Research (C) (No. 23K08900) to Mika Terasaki.

ACKNOWLEDGEMENT

We thank KIKAGAKU Co., Ltd., for their AI educational program, Ultralytics for developing YOLOv5, and Editage for their English language editing services.

References

  1. Cancer Statistics. Cancer Information Service, National Cancer Center, Japan (National Cancer Registry, Ministry of Health, Labor and Welfare)
  2. Gu, B., Shang, X., Yan, M., Li, X., Wang, W., Wang, Q. et al. Variations in incidence and mortality rates of endometrial cancer at the global, regional, and national levels, 1990–2019. Gynecol. Oncol. 161, 573–580 (2021).
  3. Crosbie, E. J., Kitson, S. J., McAlpine, J. N., Mukhopadhyay, A., Powell, M. E. & Singh, N. Endometrial cancer. Lancet 399, 1412–1428 (2022).
  4. Funston, G., O’Flynn, H., Ryan, N. A. J., Hamilton, W. & Crosbie, E. J. Recognizing Gynecological Cancer in Primary Care: Risk Factors, Red Flags, and Referrals. Adv. Ther. 35, 577–589 (2018).
  5. Fujiwara, H., Takahashi, Y., Takano, M., Miyamoto, M., Nakamura, K., Kaneta, Y. et al. Evaluation of Endometrial Cytology: Cytohistological Correlations in 1,441 Cancer Patients. Oncology 88, 86–94 (2015).
  6. Frias‐Gomez, J., Benavente, Y., Ponce, J., Brunet, J., Ibáñez, R., Peremiquel‐Trillas, P. et al. Sensitivity of cervico‐vaginal cytology in endometrial carcinoma: A systematic review and meta‐analysis. Cancer Cytopathol. 128, 792–802 (2020).
  7. Yang, X., Ma, K., Chen, R., Zhao, J., Wu, C., Zhang, N. et al. Liquid-based endometrial cytology associated with curettage in the investigation of endometrial carcinoma in a population of 1987 women. Arch. Gynecol. Obstet. 296, 99–105 (2017).
  8. Ivanovic, M. Cytopathology in Oncology. Cancer Treat. Res. 160, 1–12 (2013).
  9. Byrne, A. J. Endocyte endometrial smears in the cytodiagnosis of endometrial carcinoma. Acta Cytol. 34, 373–381 (1990).
  10. Zygouris, D., Pouliakis, A., Margari, N., Chrelias, C., Terzakis, E., Koureas, N. et al. Classification of endometrial lesions by nuclear morphometry features extracted from liquid-based cytology samples: a system based on logistic regression model. Anal. Quant. Cytopathol. Histopathol. 36, 189–198 (2014).
  11. Li, Q., Wang, R., Xie, Z., Zhao, L., Wang, Y., Sun, C. et al. Clinically Applicable Pathological Diagnosis System for Cell Clumps in Endometrial Cancer Screening via Deep Convolutional Neural Networks. Cancers 14, 4109 (2022).
  12. Makris, G., Pouliakis, A., Siristatidis, C., Margari, N., Terzakis, E., Koureas, N. et al. Image analysis and multi‐layer perceptron artificial neural networks for the discrimination between benign and malignant endometrial lesions. Diagn. Cytopathol. 45, 202–211 (2017).
  13. Pouliakis, A., Margari, C., Margari, N., Chrelias, C., Zygouris, D., Meristoudis, C. et al. Using classification and regression trees, liquid‐based cytology and nuclear morphometry for the discrimination of endometrial lesions. Diagn. Cytopathol. 42, 582–591 (2014).
  14. Ultralytics. YOLOv5. https://github.com/ultralytics/yolov5
  15. Suhail, K. & Brindha, D. Microscopic urinary particle detection by different YOLOv5 models with evolutionary genetic algorithm based hyperparameter optimization. Comput. Biol. Med. 169, 107895 (2024).
  16. Makris, G., Pouliakis, A., Siristatidis, C., Margari, N., Terzakis, E., Koureas, N. et al. Image analysis and multi‐layer perceptron artificial neural networks for the discrimination between benign and malignant endometrial lesions. Diagn. Cytopathol. 45, 202–211 (2017).
  17. Akazawa, M. & Hashimoto, K. Artificial intelligence in gynecologic cancers: Current status and future challenges – A systematic review. Artif. Intell. Med. 120, 102164 (2021).
  18. Xie, X., Fu, C.-C., Lv, L., Ye, Q., Yu, Y., Fang, Q. et al. Deep convolutional neural network-based classification of cancer cells on cytological pleural effusion images. Mod. Pathol. 35, 609–614 (2022).
  19. Jiang, H., Zhou, Y., Lin, Y., Chan, R. C. K., Liu, J. & Chen, H. Deep learning for computational cytology: A survey. Méd. Image Anal. 84, 102691 (2023).
  20. Xiang, Y., Sun, W., Pan, C., Yan, M., Yin, Z. & Liang, Y. A novel automation-assisted cervical cancer reading method based on convolutional neural network. Biocybern. Biomed. Eng. 40, 611–623 (2020).
  21. Liang, Y., Pan, C., Sun, W., Liu, Q. & Du, Y. Global context-aware cervical cell detection with soft scale anchor matching. Comput. Methods Programs Biomed. 204, 106061 (2021).
  22. Yang, T., Yuan, L., Li, P. & Liu, P. Real-Time Automatic Assisted Detection of Uterine Fibroid in Ultrasound Images Using a Deep Learning Detector. Ultrasound Med. Biol. 49, 1616–1626 (2023).
  23. Pacal, I., Karaman, A., Karaboga, D., Akay, B., Basturk, A., Nalbantoglu, U. et al. An efficient real-time colonic polyp detection with YOLO algorithms trained by using negative samples and large datasets. Comput. Biol. Med. 141, 105031 (2022).
  24. Chen, P.-H. C., Gadepalli, K., MacDonald, R., Liu, Y., Kadowaki, S., Nagpal, K. et al. An augmented reality microscope with real-time artificial intelligence integration for cancer diagnosis. Nat. Med. 25, 1453–1457 (2019).
  25. Lambert, S. I., Madi, M., Sopka, S., Lenes, A., Stange, H., Buszello, C.-P. et al. An integrative review on the acceptance of artificial intelligence among healthcare professionals in hospitals. npj Digit. Med. 6, 111 (2023).