The performance of the algorithms in the included studies is reported in many different ways, most using cross-validation and without reporting confidence intervals or the numbers of false positives, false negatives, true positives, and true negatives (10–12, 14–17). The lack of confidence intervals prevents an objective statistical comparison of the results and precludes meta-analysis. The metrics common to all the included studies except (16) - which provides the ROC curve - are the average accuracy, specificity, and sensitivity. Consequently, we have focused our analysis on the specificities and sensitivities.
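To illustrate why reporting the raw counts matters, the following sketch shows how sensitivity, specificity, and accuracy, together with 95% Wilson score confidence intervals, can be recovered directly from a confusion matrix. The counts used here are hypothetical and do not correspond to any of the included studies.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion (z = 1.96 for 95%)."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - half, centre + half)

# Hypothetical confusion-matrix counts (not taken from any included study)
tp, fn, tn, fp = 45, 5, 90, 10

sensitivity = tp / (tp + fn)                    # 45/50 = 0.90
specificity = tn / (tn + fp)                    # 90/100 = 0.90
accuracy = (tp + tn) / (tp + fn + tn + fp)

sens_ci = wilson_ci(tp, tp + fn)
spec_ci = wilson_ci(tn, tn + fp)
print(f"sensitivity = {sensitivity:.2f}, 95% CI ({sens_ci[0]:.2f}, {sens_ci[1]:.2f})")
print(f"specificity = {specificity:.2f}, 95% CI ({spec_ci[0]:.2f}, {spec_ci[1]:.2f})")
```

Note that the same point estimate of 0.90 yields a noticeably wider interval for the sensitivity (n = 50) than for the specificity (n = 100), which is precisely the information lost when only average metrics are reported.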
The three most recent algorithms achieve the highest specificity (16–18) (Figure 2c). Nevertheless, they include only 537, 300, and 44 patients, respectively. Such small datasets may limit the generalizability of the proposed methods, and further testing on larger datasets is needed to accurately estimate their performance. Similarly, Figure 2g shows that the two algorithms with the highest sensitivity also include a limited number of cases (18) or do not state whether cross-validation was performed (13), and therefore also carry a higher risk of overfitting.
Two out of the three algorithms with the highest specificity use more than one image per patient: in (17), a pair of images taken before and after the application of acetic acid, and, in (18), 120 sequential images after the application of acetic acid (Figure 2e), showing the potential of using multiple sequential images of the cervix. Colposcopists detect precancerous lesions not only from the intensity of aceto-whitening but also from its evolution over time. Furthermore, (27) demonstrated that although most lesions are visible one minute after the application of acetic acid, it is reasonable to perform VIA for 3 minutes. Thus, using multiple sequential images, instead of single images as in (10–16), could enhance automatic screening algorithms.
Two key features that could enable the use of these tools in LMICs have been identified. Firstly, algorithms that use images from portable devices (smartphone and camera) achieve similar or better performance than those using colposcopes (Figure 2e). For LMICs, portable devices therefore seem more appropriate for acquiring and analysing the images than expensive tools such as colposcopes. Secondly, simpler algorithms match, and in some cases outperform, some of the CNN-based algorithms (Figure 2f). For instance, (10–12) are studies conducted by the same authors on the same database: two traditional ML-based algorithms published in 2015 (10, 11) and one CNN-based method published in 2017 (12). The three studies report similar results even though the algorithm in (12) is much more sophisticated than those in (10, 11). Simpler algorithms are easier to integrate into mid-range smartphones, allowing offline use of the tool. By contrast, more sophisticated algorithms, such as those based on complex architectures, might require external servers to perform the classification.
The main limitations of most of the included studies are the limited number of patients or images, the highly selective recruitment of the patients used to train and test the algorithms, and the lack of large-scale tests. The risk of bias due to patient selection was rated high for (10–12, 14–18) when assessed with QUADAS-2. In Costa Rica, study (13) recruited a large number of patients, resulting in a prevalence of 0.3. The other studies relying on this dataset selected a small subset of cases to develop their algorithms, without specifying the selection criteria, yielding prevalences ranging from 0.31 (10–12) to 0.5 (14).
Out of the 9 studies, only (13) and (15) have a high risk of bias due to the gold standard, as they use histopathology only to confirm positive patients. In (13), only patients with abnormal cytology or visual inspection were referred to colposcopy and biopsied. Thus, its dataset consists of positive cases confirmed by histology and negative cases confirmed by normal cytology and normal visual inspection. Similarly, in (15), patients with normal cytology and colposcopy were considered negative, while positive cases were confirmed by pathology. The remaining studies used histology as the gold standard for all patients or, in (16), for a small subset used for testing.
Furthermore, the included studies used different screening approaches to collect the data. During each screening visit in Costa Rica, cytology, HPV testing, and visual inspection with acetic acid were performed (10–14). Studies (15), (16), and (17) do not report their screening or patient selection criteria. In (18), women were referred for colposcopy after positive HPV testing in Cameroon, and after both positive cytology and positive HPV testing in Switzerland.
Finally, only two of the included studies, (13) and (18), present a comparison between experts and their algorithms. In (13), each pair of images taken during VIA was graded by one expert, blinded to the histopathologic diagnosis, as normal, atypical, low-grade lesion, or CIN2+. In (18), three experts, blinded to the histopathologic diagnoses, classified the images of the 44 patients as positive (CIN2+) or negative. Figure 2h shows that the algorithms in both studies achieved higher sensitivity than the experts. In (18), on average, the experts had lower sensitivity and specificity than the algorithm, while in (13), the experts had higher specificity than the automated algorithm in identifying atypical or CIN2+ cases.