The first neural network prediction model in IVF based on clinical data was introduced in 1997 [13]. We reproduced it in this study, but its accuracy and reproducibility were insufficient to predict the likelihood of clinical pregnancy in an IVF protocol (loss 0.6658, accuracy 0.5735). This led us to analyze additional publications and studies that model raw protocol data and its embryological component for outcome prediction [14–17]. The aim was to select the optimal input parameters for predicting embryo implantation. The influence of these parameters on the embryo transfer outcome is difficult to trace through an analysis of linear dependencies alone. For this reason, we employed a DNN in our work instead of other ML algorithms.
We compared the developed model with other ML approaches described in the literature on our data [18]. The DNN exhibited significantly higher accuracy (0.67; AUC 0.63, SD = 0.058; U statistic 230.0, p = 0.002). This points to substantial differences between data from different IVF clinics and underlines the importance of multicenter validation for all ML models before practical implementation. We observed results comparable to those of the van Loendersloot model (AUC 0.64, CI 0.61–0.67) [19], which demonstrated that our DNN was already robust at the pretraining step.
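The U statistic and p-value reported for such comparisons come from a Mann-Whitney U test. The following is a minimal sketch of that test, assuming a two-sided normal approximation with tie and continuity corrections; the function name `mann_whitney_u` and the per-fold accuracy values are hypothetical and are not taken from the study data.

```python
import math

def mann_whitney_u(x, y):
    """Return (U statistic of sample x, two-sided p-value).

    The p-value uses the normal approximation with tie and
    continuity corrections, so it is only approximate for small samples.
    """
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    z = max((abs(u_x - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p_value = math.erfc(z / math.sqrt(2))   # equals 2 * (1 - Phi(z))
    return u_x, min(p_value, 1.0)

# Hypothetical per-fold accuracies of two models (illustration only).
dnn_acc = [0.70, 0.72, 0.68, 0.71, 0.69]
baseline_acc = [0.60, 0.62, 0.61, 0.63, 0.59]
u, p = mann_whitney_u(dnn_acc, baseline_acc)
print(f"U = {u:.1f}, p = {p:.4f}")
```

A rank-based test of this kind makes no normality assumption about the metric values, which is convenient when only a limited number of folds or runs is available.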
After the complete training process, the DNN (AUC 0.68, SD = 0.054) showed an advantage in implantation probability prediction over AI models described in the literature (AUC 0.573 [20]; 0.629 [21]; 0.543 [22]; 0.638 [23]; 0.64 [24]).
After fine-tuning, the DNN achieved an average accuracy of 0.855 (SD = 0.027). We compared our DNN model with three convolutional neural network (CNN) models for pregnancy prediction that were pre-trained on ImageNet and then fine-tuned: VGG16, ResNet50, and DenseNet121. The best performance was obtained with the ResNet50 architecture on reconstructed image data (AUC 0.741, accuracy 0.682, sensitivity 0.7114, specificity 0.669). The other two CNNs showed comparable but slightly lower results [25]. Only sensitivity was higher than in our DNN model. This is because the CNN models rely on actual blastocyst images, so embryos that did not develop to the blastocyst stage were excluded.
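For reference, the metrics compared here can be computed from predicted probabilities and observed outcomes as sketched below. This is an illustrative implementation under the standard definitions, not the evaluation code of any of the cited models; the function names, the 0.5 decision threshold, and the toy data are assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels (1 = pregnancy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

def roc_auc(y_true, scores):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes and predicted probabilities (illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.6, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed threshold of 0.5

print(classification_metrics(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas AUC does not. This is one reason why a model can rank higher on sensitivity alone while remaining lower on the other metrics.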
During validation, our model exhibited an AUC of 0.67–0.75 (U statistic 145.0, p = 0.471), comparable to the AI KIDScore™ (AUC 0.66, CI 0.60–0.75 across different clinics for known implantation data). Additionally, after fitting, the model produced a curve similar to that of the KIDScore™ for all embryos (0.86 vs 0.89). This indicates that the developed model predicts the frequency of pregnancy occurrence at a level equivalent to commercially used TL systems [26–29], including those fitted with additional clinical data (AUC 0.72–0.78) [24, 30].
Possible discrepancies in quality metrics between our DNN model and traditional TL systems can be explained by the different features used as variables for outcome prediction. Our neural network approaches the selection of the features needed to assess the chances of clinical pregnancy from a different perspective, by introducing laboratory KPIs into the analysis. TL models, in contrast, use morphokinetic data on individual embryo development for forecasting. Traditional embryo quality assessment combined with KPIs enables reasonably accurate predictions, similar to those obtained from constant video monitoring in TL. However, using embryo developmental kinetics data involves analyzing a large number of images, which introduces noise into the primary data and its final assessment, unlike calculated KPIs or static images.
With an accuracy of 78.13% and a sensitivity of 62% in clinical pregnancy prediction, the model performs on par with the Irvine Scientific Life Whisperer model [31] (accuracy 76.85%, sensitivity 70.1%, specificity 60.5%, combined accuracy 64.3%) and the FiTTE model [32] (accuracy 65.2%, AUC 0.71). On external validation data, it has a significantly (p < 0.01) better AUC (0.67–0.75) than traditionally used predictive models (AUC 0.6202–0.6367) [33, 34].
Finally, when validated on PGT-A protocols, our model demonstrated higher performance than IDAScore V2 [35] for the same patient category (AUC 0.654) and performance similar to that of the Eeva model (AUC 0.698–0.744) [36]. Its predictive ability is comparable to that of GERI AI (overall accuracy 67.8%, AUC 0.61–0.65) [37] and of the MIRI AI model (AUC 0.69) [38].
At this point, it is difficult to assert confidently whether our algorithm is better or worse than traditionally used ones; additional multicenter studies are required. However, it enables a more comprehensive evaluation of implantation predictors related to intralaboratory conditions and is easier to interpret within the framework of quality control programs than TL assessment.
The DNN performance metrics are comparable to those reported for the Alife Health artificial intelligence model (AUC 0.62–0.64) [39], the Fairtility artificial intelligence model (AUC 0.68–0.70) [40], and the CHLOE-EQ model (AUC 0.64–0.726) [24].
Compared with the available ALIFE IVF success rate online calculator, our DNN model provided lower pregnancy probabilities (0.30 vs 0.43 for the calculator), which are much closer to the actual implantation rate in single embryo transfer protocols (0.33) in the analyzed data set. These results show the importance of laboratory KPIs for precise prediction in individual protocols: this approach gives less optimistic predictions than those based only on clinical data. However, the ALIFE model was trained on a different patient population, and a reliable comparison of the models is possible only after both have been tuned on the same training data.
After logistic regression calibration, we compared our DNN model with the calibrated large-scale simulation model reported at the ESHRE Annual Meeting [41]. The two models give similar average pregnancy rates for the top-ranked embryos (60% vs 59.4%), with a minimum pregnancy probability of 18.4% for low-grade embryos. These results support the effectiveness and reproducibility of the DNN prediction algorithm (mean AUC 0.86) compared with other ML models for single embryo transfer (AUC 0.632) [42], as well as with the ML spatial stream model (AUC 0.76), the temporal stream model (AUC 0.77), and the ensemble model STEM (AUC 0.82) [43].
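Logistic regression calibration is commonly implemented as Platt scaling: a one-variable logistic model is fitted that maps the raw model score to the observed outcome. The sketch below illustrates this idea under stated assumptions and is not necessarily the exact procedure used in this study; the function names, the plain gradient descent optimizer, and the toy scores are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=1.0, n_iter=10000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log-loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y   # derivative of log-loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score, a, b):
    """Map a raw model score to a calibrated pregnancy probability."""
    return sigmoid(a * score + b)

# Hypothetical raw DNN scores and observed outcomes (illustration only).
raw_scores = [0.2, 0.3, 0.35, 0.4, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9]
outcomes = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(raw_scores, outcomes)
calibrated = [calibrate(s, a, b) for s in raw_scores]
print(f"mean calibrated probability: {sum(calibrated) / len(calibrated):.3f}")
print(f"observed pregnancy rate:     {sum(outcomes) / len(outcomes):.3f}")
```

Because the fitted model includes an intercept, the mean calibrated probability matches the observed pregnancy rate on the calibration set, which is what makes group-level rates (for example, for top-ranked embryos) comparable between calibrated models.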
Interpretation of the model with the SHAP, LIME, and ICE methods established the minimum requirements for achieving pregnancy. With regard to the prognosis of the treatment cycle outcome, these requirements accord with the Bologna criteria [44] for poor responders and the POSEIDON (Patient-Oriented Strategies Encompassing Individualized Oocyte Number) criteria [45]. They also fall within the same range of essential KPI levels proposed in the Vienna and Maribor consensuses.
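Of the three interpretation methods, ICE (individual conditional expectation) is the simplest to illustrate: one input is varied over a grid for each protocol while all other inputs are held fixed, and the predicted probability is recorded. The sketch below shows how a minimum requirement could be read off such curves. The `toy_model` function, its coefficients, the feature names `blastulation_rate` and `age`, and the 0.3 cut-off are hypothetical stand-ins, not the trained DNN or its inputs.

```python
import math

def toy_model(features):
    """Hypothetical stand-in for a trained predictor (not the DNN from this study)."""
    z = -2.0 + 0.04 * features["blastulation_rate"] - 0.05 * (features["age"] - 30)
    return 1.0 / (1.0 + math.exp(-z))

def ice_curves(predict, instances, feature, grid):
    """One curve per instance: predictions as `feature` sweeps over `grid`."""
    curves = []
    for inst in instances:
        curve = []
        for value in grid:
            modified = dict(inst)        # copy so the original record is untouched
            modified[feature] = value
            curve.append(predict(modified))
        curves.append(curve)
    return curves

def partial_dependence(curves):
    """Average of the ICE curves at each grid point."""
    return [sum(col) / len(col) for col in zip(*curves)]

# Hypothetical protocols (illustration only).
protocols = [
    {"blastulation_rate": 50, "age": 28},
    {"blastulation_rate": 35, "age": 36},
    {"blastulation_rate": 60, "age": 41},
]
grid = [20, 40, 60, 80]
curves = ice_curves(toy_model, protocols, "blastulation_rate", grid)
pd_curve = partial_dependence(curves)

# Smallest grid value at which the average prediction reaches an assumed cut-off.
threshold = 0.3
minimum = next((g for g, p in zip(grid, pd_curve) if p >= threshold), None)
print(minimum)
```

Plotting the individual curves alongside their average shows whether a single KPI threshold holds across patient groups or shifts with other inputs such as age.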