Advanced KPI Framework for IVF Pregnancy Prediction Models in IVF protocols

doi:10.21203/rs.3.rs-4445375/v1

The utilization of neural networks in assisted reproductive technology is essential due to their capability to process complex and multidimensional data inherent in IVF procedures, offering opportunities for clinical outcome prediction, personalized treatment implementation, and overall advancement in fertility treatment.

The aim of this study was to develop a novel approach to IVF laboratory data analysis, employing deep neural networks to predict the likelihood of clinical pregnancy occurrence within an individual protocol, integrating both key performance indicators and clinical data.

We conducted a retrospective analysis spanning 11 years and encompassing 8732 protocols to extract the features most relevant to our goal and to train the model. Internal validation was performed on 1600 embryo transfers after preimplantation genetic testing for aneuploidy (PGT-A), while external validation was conducted across two independent clinics (over 10,000 cases).

Leveraging recurrent neural networks, our model achieves high accuracy in predicting the likelihood of clinical pregnancy within specific IVF protocols (AUC: 0.68–0.86; test accuracy: 0.78; F1 score: 0.71; sensitivity: 0.62; specificity: 0.86), comparable to time-lapse systems but with a simpler approach. The model supports both retrospective analysis of outcomes and prospective evaluation of clinical pregnancy chances, making it a promising tool for quality management programs and facilitating their implementation in medical centers.

Biological sciences/Computational biology and bioinformatics

Health sciences/Health care

Health sciences/Medical research

Deep neural networks

IVF laboratory data analysis

Clinical pregnancy prediction

Personalized treatment

ART quality management

In recent decades, neural networks and machine learning (ML) models have become key tools in various fields, including assisted reproductive technology (ART). They serve as a foundation for creating algorithms capable of extracting complex dependencies from data and making decisions based on these dependencies. In some cases, neural networks can identify a broader spectrum of associations than other statistical methods, thanks to their ability to recognize highly nonlinear associations among input parameters [1].

Most ML models applied in the field of in vitro fertilization (IVF) are based on regression and logistic regression algorithms to identify relationships between the target variable (outcome metric of clinic success) and input parameters [2]. However, most models used for that evaluation are based on patient data and their previous IVF protocols, often failing to track the patterns in changes of quality laboratory indicators relevant to the final transfer outcome [3].

The importance of the laboratory stage in IVF is explained in detail in consensus resolutions that outline quality control (QC) measures and the evaluation of key performance indicators (KPIs) in accordance with international standards [4]. Nevertheless, laboratory and clinic efforts should be aimed not at achieving outstanding KPI values as such, but at applying and integrating quality monitoring to increase the number of protocols resulting in embryo implantation and the birth of a healthy child.

Modern infertility treatment methods involve a personalized approach to individual patients, their treatment protocols, and specific embryos within them [5]. To facilitate this, many clinics have introduced medical information systems and time-lapse (TL) imaging equipment to monitor the fate of individual fertilized oocytes. By applying artificial intelligence strategies to TL data, a tool for additional embryo ranking that estimates the implantation odds has been developed [6, 7]. However, this approach entails acquiring additional expensive equipment (TL incubators and software) and allocating additional human and time resources. Assessing the feasibility of such an approach is beyond the scope of this study, but in practice it may not always be fully achievable in the laboratory setting. To be extrapolated to patients from different clinics in various regions and countries, and to operate successfully, such algorithms require additional validation in a specific laboratory with its individual patient population characteristics [8]. However, very little open-access data is available for validating such decision-making systems. Another issue is that the majority of studies on the effectiveness of TL imaging are conducted on private databases, which do not allow an independent quality assessment of the models offered as commercial solutions.

To simplify the presentation of the information needed for subsequent prediction, alternative methods have been developed, including principal component analysis for embryo selection models based on previous IVF protocols and embryo morphology before transfer [9]. This method still requires embryo images, which many clinics are unable to obtain. Moreover, most laboratory KPIs used for quality control and for maintaining the level of competence are overlooked by these algorithms [10, 11].

Taking into account the limitations of existing systems, the aim of this study was to develop a different approach to laboratory data analysis with deep neural networks (DNN) to predict the chances of clinical pregnancy occurrence in a specific protocol, based on a combination of IVF laboratory KPIs with the sum of their rankings (KPIScore [12]) and clinical data. The major reason for selecting a DNN was that the factors influencing the outcome of embryo transfer are not always obvious, as confirmed by the abundance of “unexplained infertility” diagnoses. Manipulations with the embryo in laboratory conditions (in vitro) and its individual development cannot be adequately described by conventional linear mathematical models and require more comprehensive approaches to enhance our understanding.

The developed neural network makes predictions based on 19 parameters: 13 of them are recorded in the laboratory database, and 6 are calculated in the script code (Supplementary 1). Using XGBoost, we confirmed the association of the selected parameters with pregnancy occurrence, considering their significance in predicting the embryo transfer outcome.

Model training and performance

A total of 8732 complete protocols were available for primary analysis. After selecting the protocols containing all the information necessary for neural network training, 3858 cases of embryo transfer with known outcomes were used to form the training (80%) and validation (20%) sets, and 4874 protocols without a known transfer outcome were used for semi-supervised pre-training (partial teacher involvement) to form the model's own initial weights. This was the first stage of DNN development. At this stage, the model exhibited a significantly higher accuracy (0.67; AUC 0.63, SD = 0.058; U-statistic 230.0, p-value 0.002) than the Gradient Boosting, Random Forest, and Decision Tree models.

A correlation analysis between model accuracy and patient age group (up to 29; 30–34; 35–39; over 40 years) was conducted, considering different methods of fertilization (IVF and ICSI). No significant (p > 0.05) differences were observed. In cross-validation, the Average Accuracy was 0.68 (SD = 0.05) and the Maximum Accuracy was 0.83.

For the main training and fitting of the model, we added transfer outcomes from cryo protocols with known implantation data to the previous dataset, so that all 8732 protocols were used. We split these data into training (70%), validation (20%) and test (10%) sets. Stratified random sampling was used to ensure that all subsets had the same distribution of the pregnancy “positive” and “negative” classes. On these data, the model showed high stability: Accuracy 0.74; Sensitivity 0.6; Specificity 0.66; PPV 0.46; NPV 0.78; FPR 0.33; FNR 0.4; Overall Accuracy (test) 0.71, AUC 0.68 (SD = 0.054).
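The stratified 70/20/10 split described above can be sketched with the standard library alone. This is a minimal illustration rather than the authors' code: the function name `stratified_split`, the seed and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.2, 0.1), seed=42):
    """Return (train, val, test) index lists that preserve class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder forms the test set
    return train, val, test


# Hypothetical toy data: 30 "positive" and 70 "negative" transfers.
labels = [1] * 30 + [0] * 70
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting each class separately is what keeps the share of positive outcomes the same in all three subsets.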

Validation and fine-tuning

Concordance between actual and predicted pregnancy outcomes was validated for the DNN on 1600 IVF protocols with embryos after PGT-A in the validation dataset. An accuracy of 0.7685, an 86% sensitivity for the pregnancy class (the proportion of actual positive results detected by the model out of all actual positive results), and an F1-score of 0.71 were obtained.

We compared our neural network model to Logistic Regression, Gradient Boosting, Random Forest, and AdaBoost using leave-one-out (LOO) cross-validation for performance evaluation in a single IVF protocol. The neural network performed favorably across all considered metrics, notably standing out with a high MCC (0.412), which indicates its ability to classify precisely and in a balanced manner (Table 1).

Table 1

Leave one out cross-validation for different ML models with performance metrics
Model	AUC	CA	F1	Precision	Recall	MCC
DNN	0.798	0.735	0.732	0.730	0.735	0.412
Logistic Regression	0.787	0.720	0.711	0.712	0.720	0.366
Gradient Boosting	0.779	0.712	0.708	0.706	0.712	0.359
Random Forest	0.764	0.706	0.702	0.700	0.706	0.346
AdaBoost	0.739	0.694	0.692	0.690	0.694	0.325
DNN - Deep learning neural network; AUC - Area Under the ROC Curve; CA - Classification Accuracy; F1: The harmonic mean of precision and recall; MCC - Matthews Correlation Coefficient
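For readers less familiar with the MCC reported in Table 1, the sketch below shows how it is computed from confusion-matrix counts. The counts in the example are hypothetical and are not taken from the study; the function name is likewise chosen for illustration.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """MCC from confusion-matrix counts; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for a binary pregnancy classifier.
mcc_example = matthews_corrcoef(tp=40, fp=15, tn=35, fn=10)
print(round(mcc_example, 3))
```

Because MCC uses all four cells of the confusion matrix, it stays informative when the positive and negative classes are imbalanced, which is the usual situation for pregnancy outcomes.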

After training the DNN, we utilized the probabilities obtained and applied logistic regression for calibration: Precision 0.837 and Recall 0.66. On 5-fold cross validation the DNN model demonstrated high specificity (0.864) and ability to correctly predict clinical pregnancy (Average Accuracy 0.781, SD = 0.04, Maximum Accuracy 0.875) (Table 2).

Table 2

DNN model performance after tuning on 5-fold cross-validation
Fold	AUC	Sensitivity	Specificity	PPV	NPV	FPR	FNR	Accuracy
1	0.849	0.714	0.833	0.769	0.789	0.167	0.286	0.781
2	0.888	0.583	0.850	0.700	0.773	0.150	0.417	0.750
3	0.895	0.400	0.909	0.667	0.769	0.091	0.600	0.750
4	0.823	0.700	0.773	0.583	0.850	0.227	0.300	0.750
5	0.864	0.700	0.955	0.875	0.875	0.045	0.300	0.875
Mean	0.8637	0.6195	0.8639	0.7188	0.8113	0.1361	0.3805	0.7813
SD	0.024	0.109	0.057	0.090	0.039	0.057	0.109	0.044

AUC - Area Under the ROC Curve; PPV – Positive predictive value; NPV – Negative predictive value; FPR – False positive rate; FNR – False negative rate.
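The columns of Table 2 all derive from the four cells of a confusion matrix, as the sketch below shows. The counts used here (TP = 10, FP = 3, TN = 15, FN = 4) are not reported in the paper; they are one set of counts that happens to agree with the fold 1 row to within rounding, and were chosen only for illustration.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Rates used in Table 2, computed from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Assumed counts, consistent with (but not taken from) fold 1 of Table 2.
fold1 = confusion_metrics(tp=10, fp=3, tn=15, fn=4)
for name, value in fold1.items():
    print(name, value)
```

Note that FPR = 1 − specificity and FNR = 1 − sensitivity, so only five of the seven rates carry independent information.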

The model was then fine-tuned on the database from 2018 to 2023 (3500 protocols with known implantation data) for all embryos, with the oocyte source specified (autologous or donor). During this period, a comprehensive quality control system was being implemented in the laboratory, and there was no variability in approaches to oocyte pick-up, embryo culture and transfer. After fine-tuning, the model reached an Average Accuracy of 0.855 (SD = 0.027), a Maximum Accuracy of 0.914, a Minimum Accuracy of 0.786, and an AUC of 0.86 (SD = 0.06).

To validate the accuracy of pregnancy occurrence prediction, a quarterly analysis for 2022–2023 was conducted, comparing the model's predictions with the actual clinical reports and statistics. The average difference between the calculated predictions and the reports was 2.56%, which was not significant (U-statistic = 20.0, p-value = 0.62).

External validation

For external validation, we used a distinct patient population with a significantly different data distribution (KPIScore t-statistic −9.05378, p-value < 0.0001) from 2 independent centers in different countries: clinic #1 in Russia, with a majority of poor-prognosis patients (2013–2023; 6240 protocols), and clinic #2 in Georgia, with good-prognosis patients and a majority of cycles with donor oocytes (2022–2024; 3888 protocols).

From the protocols provided by clinics #1 and #2, we selected those performed by the current team of embryologists (2 embryologists from clinic #1 and 4 from clinic #2). The exclusion criteria were incomplete data, errors in the information provided, and procedures performed by multiple embryologists. Protocols meeting the Vienna consensus criteria were chosen to prepare the dataset for individual staff KPI analysis.

Quarterly analysis (2013–2023 in clinic #1 and 2022–2024 in clinic #2) of the input laboratory parameters showed that the MII oocyte rate was stable (median 87%, Q1–Q3 85.5–89.3%), above the Maribor consensus competency threshold (≥ 74%; target value ≥ 90%), and invariant across years. No statistically significant differences in laboratory KPIs between IVF and ICSI procedures were found for fertilization rate (U statistic 4687.0, p-value 0.2945), blastocyst development rate (U statistic 6096.5, p-value 0.0528), TGBDR (U statistic 4920.5, p-value 0.90306), and availability of oocytes for fertilization (U statistic 5404.0, p-value 0.8479), which allowed these datasets to be used as reference examples for validating the training process of our model in a specific IVF center. A mean AUC of 0.73 was reached during this external validation.

Explanation of the Model results

The importance of explaining and interpreting the results of neural network models used in reproductive medicine is critical for ensuring transparency and trust both for patients and medical professionals. Understanding which factors have influence on clinical pregnancy probability is particularly important to ensure the most effective treatment and to improve outcomes.

We chose linear regression to analyze probabilities predicted by the neural network in order to find the simplest approach to increase its interpretability. The dependent variable was predicted probability of pregnancy, and the independent variables were parameters used in the model. The mean squared error value was 0.00267. This suggests that the model's predictions are close to the actual values of pregnancy probability. The coefficient of determination (R^2) was found to be 0.937, indicating that approximately 93.7% of the variance in pregnancy probability is explained by the variables in the model, confirming the model's high predictive potential.
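The two goodness-of-fit figures quoted above can be computed as in the following sketch. It assumes, as the text states, that the target of the linear surrogate is the probability predicted by the DNN; the probability values and variable names are hypothetical.

```python
def mse_and_r2(y_true, y_pred):
    """Mean squared error and coefficient of determination (R^2)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    return ss_res / n, 1.0 - ss_res / ss_tot


# Hypothetical values: DNN probabilities and their linear-surrogate fit.
dnn_probs = [0.20, 0.35, 0.50, 0.65, 0.80]
surrogate_probs = [0.22, 0.33, 0.52, 0.63, 0.80]
surrogate_mse, surrogate_r2 = mse_and_r2(dnn_probs, surrogate_probs)
print(surrogate_mse, surrogate_r2)
```

A high R^2 in this setting means that the linear surrogate reproduces the network's output well, which is what makes its coefficients usable for interpretation.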

We utilized SHAP, ICE and LIME algorithms to interpret the results of DNN model. Combining these methods, we were able to determine the average minimum conditions of the IVF laboratory stage for pregnancy occurrence in a specific protocol.

For our DNN predictions, the following threshold values were obtained: KPIScore = 15, Number of follicles = 4, OCC = 3, MII = 2, 2pN = 2, Number of D3 embryos = 2, Number of D5 embryos = 1, Good quality blastocysts = 1, Transferred embryos = 1 and Patient Age = 36. With the LIME algorithm, we observed that a positive pregnancy result requires reaching KPIScore = 17, OCC retrieval rate = 0.83, MII rate = 0.78, Fertilization rate = 0.64, Blastocyst development rate = 0.44, TGBDR = 0.35. Remarkably, these values are consistent with the Vienna and Maribor consensus opinions on good IVF practice.
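As an illustration of how the LIME-derived minimums could be used in a quality control check, the sketch below compares the rates of a single protocol with those thresholds. The threshold values come from the text; the denominators chosen for each rate, the function names and the example counts are assumptions and may differ from the definitions used in the study.

```python
# Minimum values reported in the text for a positive pregnancy result (LIME).
LIME_THRESHOLDS = {
    "kpi_score": 17,
    "occ_retrieval_rate": 0.83,
    "mii_rate": 0.78,
    "fertilization_rate": 0.64,
    "blastocyst_rate": 0.44,
    "tgbdr": 0.35,
}


def protocol_rates(follicles, occ, mii, two_pn, blastocysts,
                   good_blastocysts, kpi_score):
    """Per-protocol rates; every denominator here is an assumption."""
    return {
        "kpi_score": kpi_score,
        "occ_retrieval_rate": occ / follicles,
        "mii_rate": mii / occ,
        "fertilization_rate": two_pn / mii,   # ICSI-style denominator
        "blastocyst_rate": blastocysts / two_pn,
        "tgbdr": good_blastocysts / two_pn,
    }


def unmet_thresholds(rates, thresholds=LIME_THRESHOLDS):
    """Names of the indicators that fall below their minimum value."""
    return [name for name, minimum in thresholds.items()
            if rates[name] < minimum]


# Hypothetical protocol.
rates = protocol_rates(follicles=12, occ=10, mii=8, two_pn=6,
                       blastocysts=3, good_blastocysts=2, kpi_score=18)
print(unmet_thresholds(rates))
```

In this example only the TGBDR (2/6 ≈ 0.33) falls below its minimum of 0.35, so it would be the single indicator flagged for review.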

The first neural network prediction model in IVF based on clinical data was introduced in 1997 [13]. We reproduced it in this study, but its accuracy and reproducibility were insufficient to predict the likelihood of clinical pregnancy occurrence in an IVF protocol (loss 0.6658, accuracy 0.5735). This led us to analyze additional publications and studies that modeled raw protocol data and its embryological component for outcome prediction [14–17]. The aim was to select the optimal input parameters for predicting embryo implantation. Linear dependencies alone make it quite challenging to trace the influence of those parameters on the embryo transfer outcome. For this reason, we employed a DNN in our work instead of other ML algorithms.

A comparison between the developed model and other ML approaches described in the literature was conducted on our data [18]: the DNN exhibited a significantly higher accuracy (0.67; AUC 0.63, SD = 0.058; U-statistic 230.0, p-value 0.002). This points to substantial differences between data from different IVF clinics and to the importance of multicenter validation for all ML models before practical implementation. Our results were comparable to the capacity of the van Loendersloot model (AUC 0.64, CI 0.61–0.67) [19], which indicates that our DNN was robust at the pre-training step.

After a complete training process, the DNN (AUC 0.68, SD = 0.054), in comparison with AI models described in the literature, has revealed advantages in terms of implantation probability prediction (AUC 0.573 [20]; 0.629 [21]; 0.543 [22]; 0.638 [23]; 0.64 [24]).

After fine-tuning, we achieved a DNN Average Accuracy of 0.855 (SD = 0.027). We compared our DNN model with three fine-tuned convolutional neural network (CNN) models pre-trained on ImageNet and used for pregnancy prediction: VGG16, ResNet50 and DenseNet121. The best performance was obtained with the ResNet50 architecture on reconstructed image data, with an AUC of 0.741, Accuracy 0.682, Sensitivity 0.7114 and Specificity 0.669. The two other CNNs showed comparable but slightly lower results [25]. Only sensitivity was higher than in our DNN model, because the CNN models rely on actual blastocyst images; consequently, embryos that did not develop to the blastocyst stage were excluded from them.

During validation, our model exhibited an AUC of 0.67–0.75 (U-statistic 145.0, p-value 0.471), comparable to the AI KIDScore™ (for known implantation data, AUC 0.66, CI 0.60–0.75 across different clinics). Additionally, after fitting, the model demonstrated an AUC similar to that of the KIDScore™ (0.86 vs 0.89) for all embryos. This illustrates that the developed model predicts the frequency of pregnancy occurrence at a level equivalent to commercially used TL systems [26–29], including those fitted with additional clinical data (AUC 0.72–0.78) [24, 30].

Possible discrepancies in quality metrics between our DNN model and traditional TL systems can be explained by different features used as the variables for outcome prediction. Our neural network allows looking at the formation path of necessary features for assessing the chances of clinical pregnancy occurrence using ML from a different perspective - introducing laboratory KPI to the analysis. Meanwhile, TL models utilize morphokinetic data of individual embryo development for forecasting. Traditional embryo quality assessment combined with KPI consideration enables reasonably accurate predictions, similar to constant video monitoring in TL. However, embryo developmental kinetics data utilization involves analyzing a large number of images, which introduces noise into the primary data and its final assessment, unlike calculated KPIs or static images.

The accuracy of 78.13% and sensitivity of 62% of the model in clinical pregnancy prediction are comparable to those of the Irvine Scientific Life Whisperer model [31] (accuracy 76.85%, sensitivity 70.1%, specificity 60.5%, combined accuracy 64.3%) and the FiTTE model [32] (accuracy 65.2%, AUC 0.71). On external validation data, the model also has a significantly (p < 0.01) better AUC (0.67–0.75) than traditionally used predictive models (AUC 0.6202–0.6367) [33, 34].

Finally, our model validated on PGT-A protocols demonstrated higher performance than IDAScore V2 [35] for the same patient category (AUC 0.654) and similar performance to the Eeva model (AUC 0.698–0.744) [36]. The predictive ability of the model is comparable to that of GERI AI, with an overall accuracy of 67.8% and AUC 0.61–0.65 [37], and of the MIRI AI model, with AUC 0.69 [38].

At this point, it is difficult to assert confidently whether our algorithm is better or worse than traditionally used ones; additional multicenter studies are required. However, it enables a more comprehensive evaluation of implantation predictors related to intralaboratory conditions and is easier to interpret within the framework of quality control programs than TL assessment.

The DNN performance metrics are comparable to those reported for the Alife Health artificial intelligence model (AUC 0.62–0.64) [39], the Fairtility artificial intelligence model (AUC 0.68–0.70) [40] and the CHLOE-EQ model (AUC 0.64–0.726) [24].

Compared to the available ALIFE IVF success rate online calculator, our DNN model provided lower pregnancy probabilities (0.30 vs 0.43 for ALIFE), which are much closer to the actual implantation rate in single embryo transfer protocols (0.33) in the analyzed data set. These results show the real importance of laboratory KPIs for precise prediction in individual protocols. This approach gives less optimistic predictions than those based only on clinical data. The ALIFE model was trained on a different patient population, and a reliable comparison is possible only after tuning both models on the same training data.

After logistic regression calibration, we compared our DNN model to the calibrated large-scale simulation model reported at the ESHRE Annual Meeting [41]. The two models gave nearly the same average pregnancy rate for the top-ranked embryos (60% vs 59.4%), with a minimum pregnancy probability of 18.4% for low-grade embryos. These results demonstrate the effectiveness and reproducibility of the DNN prediction algorithm (mean AUC 0.86) compared with other ML models (AUC 0.632) for single embryo transfer [42], as well as with the ML spatial stream model (AUC 0.76), the temporal stream model (AUC 0.77), and the ensemble model STEM (AUC 0.82) [43].

The model interpretation with SHAP, LIME and ICE methods has established the minimum requirements for pregnancy achievement. Concerning prognosis of the treatment cycle outcome, they accord with Bologna Criteria [44] for poor responders and POSEIDON (Patient-Oriented Strategies Encompassing Individualized Oocyte Number Criteria) [45]. Also, they demonstrate the same range of essential KPI levels that were proposed in the Vienna and the Maribor consensus.

Thus, our DNN model makes it possible to estimate a theoretically justified probability of pregnancy occurrence in a specific patient group using KPI data from an individual protocol. Above all, the model is valid as a quality control tool for accurately determining competency threshold values for the pregnancy rate in individual clinics with their unique patient subpopulations, which helps to identify specific time frames for audits and areas of increased interest during quality control assessments.

Data collection

We used retrospective data with known outcomes from the “IVF and Genetic Center”, Moscow, Russia, extracted from the SQL database of the medical information system Medwork 3.5 for January 2013 to January 2024, to develop the data set for the DNN model: 8732 protocols in total (3856 with fresh embryo transfer). Additional data from 2018–2024 (3500 protocols) and for PGT-A embryos (1600 protocols) were used for internal validation of the model. For external validation, we used data from 2 independent ART centers: “#1” in Moscow, Russia, with 6240 protocols, and “#2” in Tbilisi, Georgia, with 3888 protocols. All protocols with missing data values were excluded from the study.

Patient informed consent was not required for this study because only retrospective and fully de-identified embryo development data were used, which is fully noninvasive for patients and their embryos (no medical intervention was performed on the subjects, and no biological samples were collected from patients to develop the model). For embryo evaluation, the ESHRE recommendations and the Gardner blastocyst grading system were used, with “good blastocysts” defined as grade Bl3BB and higher.

Our DNN model incorporates KPI data based on the Vienna Consensus that show a correlation with the occurrence of pregnancy. We also use the fine-tuned KPI for total good blastocyst development rate (TGBDR) in our model, following the literature [11]. In our data, TGBDR correlates positively with all intralaboratory KPIs and negatively with patient age. The model also includes the rank sum of laboratory and clinical parameters (KPIScore), adapted to our data (SART embryo grading and antral follicle count on the trigger day), to forecast the chance of pregnancy occurrence [12]. A positive correlation (+0.785) between KPIScore and TGBDR was observed.

Statistical analysis

To describe the data set, statistical analysis of individual KPIs was conducted using the StatTech v. 3.0.6 software. Quantitative indicators with a normal distribution were described using the mean and standard deviation (SD), with 95% confidence intervals. The direction and strength of the correlation between two quantitative variables were evaluated using the Spearman rank correlation coefficient. A p-value < 0.05 was used as the significance threshold in the analysis of clinical features. Groups were compared on quantitative indicators with one-way analysis of variance (ANOVA), with post hoc comparisons using the Kruskal-Wallis test for non-normal distributions.

Analysis was performed with Information Gain, Gini Coefficient, ANOVA, Chi-square (χ²) and ReliefF to evaluate feature importance in classification tasks (Table 3).

Table 3

Evaluating feature importance to identify the most informative variables for DNN training
Rank	Metric	Info. gain	Gain ratio	Gini	ANOVA	χ²	ReliefF
1	KPIScore	0.107	0.054	0.062	1421.400	906.190	0.017
2	Patient age	0.085	0.043	0.051	1045.865	747.026	0.016
3	Transfer day	0.080	0.051	0.051	1036.645	1281.690	0.009
4	Number of oocytes	0.073	0.036	0.045	750.572	721.893	0.009
5	Antral follicle count	0.073	0.036	0.043	648.521	629.716	0.009
6	Day 3 embryos number	0.072	0.036	0.042	614.261	617.259	0.004
7	Total Blastocyst number	0.068	0.036	0.042	629.322	880.509	0.002
8	2pN	0.065	0.033	0.038	557.655	548.234	0.007
9	Blastocyst rate	0.064	0.034	0.040	699.393	731.757	0.004
10	Number of inseminated	0.062	0.031	0.036	505.209	515.078	0.010
11	Good Blastocyst number	0.062	0.037	0.039	475.841	934.627	0.003
12	TGBDR	0.060	0.037	0.038	472.284	800.634	0.009
13	Day 5 embryos number	0.059	0.030	0.036	571.130	643.703	0.001
14	Number of transferred embryos	0.039	0.025	0.025	281.349	254.111	0.002
15	IVF attempt number	0.037	0.020	0.022	313.250	515.722	0.008
16	Number of cryo embryos	0.028	0.015	0.018	236.567	325.789	0.003
17	Fertilization rate	0.026	0.013	0.017	106.370	13.275	0.018
18	Cleavage rate	0.022	0.031	0.013	476.941	49.713	0.001
19	Oocyte retrieval rate	0.018	0.011	0.012	2.625	28.416	0.002

TGBDR - total good quality blastocyst development rate
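To make the first column of Table 3 concrete, the sketch below computes information gain for a discrete feature. It is a generic textbook implementation, not the tool used by the authors; continuous inputs such as patient age would first need to be binned, and both the binning and the toy data here are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(group) / total * entropy(group)
                      for group in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: 1 = clinical pregnancy, 0 = no pregnancy.
outcome = [1, 1, 0, 0, 1, 0, 1, 0]
age_bin = ["<35", "<35", ">=35", ">=35", "<35", ">=35", "<35", ">=35"]
noise = ["a", "b", "a", "b", "a", "b", "b", "a"]
print(information_gain(age_bin, outcome), information_gain(noise, outcome))
```

In this toy example the binned age separates the outcomes perfectly and gains the full 1 bit, whereas the unrelated feature gains nothing; real features, as in Table 3, fall between these extremes.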

Python 3.6 and Scikit-learn 1.4.2 were used to implement machine learning models and statistical modeling. A t-test was used to determine whether there was a significant difference between the means of two groups. The Mann-Whitney U test (Wilcoxon rank-sum test) was used to compare two groups: model-predicted values and the real clinical pregnancy rate across different patient groups, staff members and time periods.
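The U statistic reported throughout the Results can be illustrated with the pairwise definition below. This is a minimal sketch with hypothetical quarterly rates; it computes only the statistic, and a p-value would in practice come from a statistics library such as `scipy.stats.mannwhitneyu`.

```python
def mann_whitney_u(x, y):
    """U statistics for samples x and y (ties count as half a win)."""
    u_x = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u_x += 1.0
            elif xi == yj:
                u_x += 0.5
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Hypothetical quarterly pregnancy rates: model predictions vs clinic reports.
predicted = [0.31, 0.28, 0.35, 0.30]
observed = [0.33, 0.27, 0.36, 0.29]
u_pred, u_obs = mann_whitney_u(predicted, observed)
print(u_pred, u_obs)
```

When the two U values are close to each other, as here, neither sample tends to exceed the other, which corresponds to a non-significant difference between predictions and reports.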

Model Compilation

The neural network model has been developed and executed in the GPU PyCharm 17.0.10 environment using Python 3.6 programming language with the Tensorflow 2.15.0 and Keras library 2.14.0.

The proposed neural network model is a sequential Recurrent Neural Network (RNN) consisting of two recurrent layers with L1 and L2 regularization, which controls overfitting by adding penalties to the model's weights. The first layer is a SimpleRNN with 32 neurons and Rectified Linear Unit (ReLU) activation. The second layer is a SimpleRNN with 16 neurons and ReLU activation, followed by a Dropout layer that prevents overfitting by randomly excluding neurons during training. The final Dense layer has one neuron with Sigmoid activation, designed for binary classification: it outputs the probability of the positive class, that is, a positive outcome of the embryo transfer procedure.

This DNN model was optimized with the Adam algorithm, a gradient descent method, at a learning rate of 0.0001. Binary cross-entropy was used as the loss function. The data were split into training (70%), validation (20%) and test (10%) sets, and the model was trained for 12 epochs (each epoch being one full pass through the training data) with a batch size of 8. Cross-validation was conducted to evaluate the model on 5 different data splits (folds).
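The architecture and optimizer settings described above might be expressed in Keras roughly as follows. This is a sketch under stated assumptions, not the authors' code (which is available in the repository listed under Data Availability): the input shape of one time step per protocol, the regularization strengths, the dropout rate and the choice to regularize the kernel weights are not reported in the text and are placeholders.

```python
HPARAMS = {
    "n_features": 19,         # from the text
    "timesteps": 1,           # assumption: one time step per protocol
    "rnn_units": (32, 16),    # from the text
    "learning_rate": 1e-4,    # from the text
    "epochs": 12,             # from the text
    "batch_size": 8,          # from the text
    "l1": 1e-5, "l2": 1e-4,   # assumption: strengths are not reported
    "dropout": 0.2,           # assumption: rate is not reported
}


def expected_param_count(hp=HPARAMS):
    """Trainable weights: units * (units + inputs + 1) per SimpleRNN layer."""
    u1, u2 = hp["rnn_units"]
    rnn1 = u1 * (u1 + hp["n_features"] + 1)
    rnn2 = u2 * (u2 + u1 + 1)
    dense = u2 + 1
    return rnn1 + rnn2 + dense


def build_model(hp=HPARAMS):
    """Two SimpleRNN layers, Dropout and a sigmoid output, compiled with Adam."""
    # Imported lazily so the helper above works without TensorFlow installed.
    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    reg = regularizers.l1_l2(l1=hp["l1"], l2=hp["l2"])
    model = keras.Sequential([
        keras.Input(shape=(hp["timesteps"], hp["n_features"])),
        layers.SimpleRNN(hp["rnn_units"][0], activation="relu",
                         return_sequences=True, kernel_regularizer=reg),
        layers.SimpleRNN(hp["rnn_units"][1], activation="relu",
                         kernel_regularizer=reg),
        layers.Dropout(hp["dropout"]),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=hp["learning_rate"]),
        loss="binary_crossentropy",
        metrics=["accuracy", keras.metrics.AUC(name="auc")],
    )
    return model
```

Under these assumptions, training would amount to calling `build_model().fit(...)` with inputs shaped `(n_protocols, 1, 19)`, `epochs=12` and `batch_size=8`. The small weight count returned by `expected_param_count()` shows how compact the network is compared with image-based CNN models.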

Calibration of the DNN was performed using CalibratedClassifierCV from Scikit-learn, which applies logistic regression to align probabilities. A comparative analysis of prediction errors was conducted with the area under the receiver operating characteristic curve (AUC), Accuracy, Positive Predictive Value (PPV), Negative Predictive Value (NPV), False Positive Rate (FPR), False Negative Rate (FNR), Specificity, Sensitivity and Matthews Correlation Coefficient (MCC).
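The idea behind logistic-regression (sigmoid) calibration can be shown with a small standard-library sketch: a one-dimensional logistic model is fitted to map raw scores to outcome frequencies. This is not the Scikit-learn implementation and omits refinements such as cross-validated fitting; the scores, outcomes, learning rate and iteration count are hypothetical.

```python
import math


def fit_sigmoid_calibrator(scores, labels, lr=0.5, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log-loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Map a raw model score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Hypothetical held-out scores from the network and observed outcomes.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_sigmoid_calibrator(raw_scores, outcomes)
calibrated = [calibrate(s, a, b) for s in raw_scores]
print(a, b)
```

After fitting, the mean calibrated probability matches the observed pregnancy frequency in the calibration data, which is the property that makes calibrated outputs comparable with clinic-level pregnancy rates.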

Explanation of the Model results

In the context of approaches to interpret the DNN results, we have utilized SHAP (SHapley Additive exPlanations), ICE (Individual Conditional Expectation) and LIME (Local Interpretable Model-agnostic Explanations) methods.
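Of the three methods, ICE is the simplest to illustrate: one input of a single protocol is varied over a grid while the others are held fixed, and the model's prediction is recorded at each value. In the sketch below, `toy_predict` is a hypothetical stand-in for the trained DNN, and its coefficients and the example protocol are invented for the example.

```python
import math


def toy_predict(features):
    """Hypothetical stand-in for the trained DNN: a small logistic model."""
    z = 0.15 * features["kpi_score"] - 0.05 * features["patient_age"]
    return 1.0 / (1.0 + math.exp(-z))


def ice_curve(predict, instance, feature, grid):
    """Predictions for one instance as a single feature sweeps over a grid."""
    curve = []
    for value in grid:
        modified = dict(instance, **{feature: value})  # copy, then override
        curve.append((value, predict(modified)))
    return curve


# Hypothetical protocol with only two of the model inputs.
protocol = {"kpi_score": 15, "patient_age": 36}
curve = ice_curve(toy_predict, protocol, "kpi_score", range(10, 21, 5))
for value, probability in curve:
    print(value, round(probability, 3))
```

Reading off where such a curve crosses a chosen probability level is one way threshold values like those reported in the Results could be located for an individual protocol.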

Competing interests

The authors declare no conflict of interest and no competing interests.

Author Contribution

S.S. and I.D. designed the study. S.S. conducted studies on the establishment of DNN architecture, implemented machine learning for comparison, and analyzed the data. I.D. interpreted the external validation data and drafted the manuscript. All authors reviewed the manuscript.

Data Availability

https://github.com/embryossa/DNN.git

Glatstein, I., Chavez-Badiola, A., & Curchoe, C. L. New frontiers in embryo selection. Journal of assisted reproduction and genetics. 40, 223–234. DOI: https://doi.org/10.1007/s10815-022-02708-5 (2023).
Alizadehsani, R. et al. Handling of uncertainty in medical data using machine learning and probability theory techniques: a review of 30 years (1991–2020). Annals of operations research, 1–42; https://doi.org/10.1007/s10479-021-04006-2 (2021).
Fernandez, E. et al. Artificial intelligence in the IVF laboratory: overview through the application of different types of algorithms for the classification of reproductive data. Journal of assisted reproduction and genetics. 37, 2359–2376. DOI: https://doi.org/10.1007/s10815-020-01881-9 (2020).
ESHRE Special Interest Group of Embryology and Alpha Scientists in Reproductive Medicine. The Vienna consensus: report of an expert meeting on the development of ART laboratory performance indicators. Reproductive biomedicine online. 35, 494–510. DOI: https://doi.org/10.1016/j.rbmo.2017.06.015 (2017).
Medenica, S. et al. The Future Is Coming: Artificial Intelligence in the Treatment of Infertility Could Improve Assisted Reproduction Outcomes-The Value of Regulatory Frameworks. Diagnostics. 12, 2979. DOI: https://doi.org/10.3390/diagnostics12122979 (2022).
Uyar, A., Bener, A., & Ciray, H. N. Predictive Modeling of Implantation Outcome in an In Vitro Fertilization Setting: An Application of Machine Learning Methods. Medical decision making: an international journal of the Society for Medical Decision Making. 35, 714–725. DOI: https://doi.org/10.1177/0272989X14535984 (2015).
Bamford, T. et al. A comparison of morphokinetic models and morphological selection for prioritizing euploid embryos: a multicentre cohort study. Human reproduction. 39, 53–61. DOI: https://doi.org/10.1093/humrep/dead237 (2024).
Blais, I., Koifman, M., Feferkorn, I., Dirnfeld, M., & Lahav-Baratz, S. Improving embryo selection by the development of a laboratory-adapted time-lapse model. F&S science. 2, 176–197. DOI: https://doi.org/10.1016/j.xfss.2021.02.001 (2021).
Xi, Q. et al. Individualized embryo selection strategy developed by stacking machine learning model for better in vitro fertilization outcomes: an application study. Reproductive biology and endocrinology. 19, 53. DOI: https://doi.org/10.1186/s12958-021-00734-z (2021).
Ratna, M. B., Bhattacharya, S., Abdulrahim, B., & McLernon, D. J. A systematic review of the quality of clinical prediction models in in vitro fertilisation. Human reproduction. 35, 100–116. DOI: https://doi.org/10.1093/humrep/dez258 (2020).
Zacà, C. et al. Fine-tuning IVF laboratory key performance indicators of the Vienna consensus according to female age. Journal of assisted reproduction and genetics. 39, 945–952. DOI: https://doi.org/10.1007/s10815-022-02468-2 (2022).
Franco, J. G. et al. Key performance indicators score (KPIs-score) based on clinical and laboratorial parameters can establish benchmarks for internal quality control in an ART program. JBRA assisted reproduction. 21, 61–66. DOI: https://doi.org/10.5935/1518-0557.20170016 (2017).
Kaufmann, S. J., Eastaugh, J. L., Snowden, S., Smye, S. W., & Sharma, V. The application of neural networks in predicting the outcome of in-vitro fertilization. Human reproduction. 12, 1454–1457. DOI: https://doi.org/10.1093/humrep/12.7.1454 (1997).
Liu, H. et al. Development and evaluation of a live birth prediction model for evaluating human blastocysts from a retrospective study. eLife. 12, e83662. DOI: https://doi.org/10.7554/eLife.83662 (2023).
Blank, C. et al. Prediction of implantation after blastocyst transfer in in vitro fertilization: a machine-learning perspective. Fertility and sterility. 111, 318–326. DOI: https://doi.org/10.1016/j.fertnstert.2018.10.030 (2019).
Raef, B., Maleki, M., & Ferdousi, R. Computational prediction of implantation outcome after embryo transfer. Health informatics journal. 26, 1810–1826. DOI: https://doi.org/10.1177/1460458219892138 (2020).
Bormann, C. L. et al. Performance of a deep learning based neural network in the selection of human blastocysts for implantation. eLife. 9, e55301. DOI: https://doi.org/10.7554/eLife.55301 (2020).
Li, L., Cui, X., Yang, J., Wu, X., & Zhao, G. Using feature optimization and LightGBM algorithm to predict the clinical pregnancy outcomes after in vitro fertilization. Frontiers in endocrinology. 14, 1305473. DOI: https://doi.org/10.3389/fendo.2023.1305473 (2023).
Sarais, V. et al. Predicting the success of IVF: external validation of the van Loendersloot's model. Human reproduction. 31, 1245–1252. DOI: https://doi.org/10.1093/humrep/dew069 (2016).
Chamayou, S. et al. The use of morphokinetic parameters to select all embryos with full capacity to implant. Journal of assisted reproduction and genetics. 30, 703–710. DOI: https://doi.org/10.1007/s10815-013-9992-2 (2013).
Basile, N. et al. The use of morphokinetics as a predictor of implantation: a multicentric study to define and validate an algorithm for embryo selection. Human reproduction. 30, 276–283. DOI: https://doi.org/10.1093/humrep/deu331 (2015).
Dal Canto, M. et al. Faster fertilization and cleavage kinetics reflect competence to achieve a live birth after intracytoplasmic sperm injection, but this association fades with maternal age. Fertility and sterility. 115, 665–672. DOI: https://doi.org/10.1016/j.fertnstert.2020.06.023 (2021).
Bori, L. et al. The higher the score, the better the clinical outcome: retrospective evaluation of automatic embryo grading as a support tool for embryo selection in IVF laboratories. Human reproduction. 37, 1148–1160. DOI: https://doi.org/10.1093/humrep/deac066 (2022).
Benchaib, M., Labrune, E., Giscard d'Estaing, S., Salle, B., & Lornage, J. Shallow artificial networks with morphokinetic time-lapse parameters coupled to ART data allow to predict live birth. Reproductive medicine and biology. 21, e12486. DOI: https://doi.org/10.1002/rmb2.12486 (2022).
Kim, H. M. et al. Improved prediction of clinical pregnancy using artificial intelligence with enhanced inner cell mass and trophectoderm images. Scientific reports. 14, 3240. DOI: https://doi.org/10.1038/s41598-024-52241-x (2024).
Fréour, T. et al. External validation of a time-lapse prediction model. Fertility and sterility. 103, 917–922. DOI: https://doi.org/10.1016/j.fertnstert.2014.12.111 (2015).
Tran, D., Cooke, S., Illingworth, P. J., & Gardner, D. K. Deep learning as a predictive tool for fetal heart pregnancy following time-lapse incubation and blastocyst transfer. Human reproduction. 34, 1011–1018. DOI: https://doi.org/10.1093/humrep/dez064 (2019).
Reignier, A. et al. Performance of Day 5 KIDScore™ morphokinetic prediction models of implantation and live birth after single blastocyst transfer. Journal of assisted reproduction and genetics. 36, 2279–2285. DOI: https://doi.org/10.1007/s10815-019-01567-x (2019).
Berntsen, J., Rimestad, J., Lassen, J. T., Tran, D., & Kragh, M. F. Robust and generalizable embryo selection based on artificial intelligence and time-lapse image sequences. PloS one. 17, e0262661. DOI: https://doi.org/10.1371/journal.pone.0262661 (2022).
Lee, C. I. et al. Associations between the artificial intelligence scoring system and live birth outcomes in preimplantation genetic testing for aneuploidy cycles. Reproductive biology and endocrinology. 22, 12. DOI: https://doi.org/10.1186/s12958-024-01185-y (2024).
VerMilyea, M. et al. Development of an artificial intelligence-based assessment model for prediction of embryo viability using static images captured by optical light microscopy during IVF. Human reproduction. 35, 770–784. DOI: https://doi.org/10.1093/humrep/deaa013 (2020).
Enatsu, N. et al. A novel system based on artificial intelligence for predicting blastocyst viability and visualizing the explanation. Reproductive medicine and biology. 21, e12443. DOI: https://doi.org/10.1002/rmb2.12443 (2022).
Nelson, S. M., & Lawlor, D. A. Predicting live birth, preterm delivery, and low birth weight in infants born from in vitro fertilisation: a prospective study of 144,018 treatment cycles. PLoS medicine. 8, e1000386. DOI: https://doi.org/10.1371/journal.pmed.1000386 (2011).
Ratna, M. B., Bhattacharya, S., & McLernon, D. J. External validation of models for predicting cumulative live birth over multiple complete cycles of IVF treatment. Human reproduction. 38, 1998–2010. DOI: https://doi.org/10.1093/humrep/dead165 (2023).
Ueno, S., Berntsen, J., Okimura, T., & Kato, K. Improved pregnancy prediction performance in an updated deep-learning embryo selection model: a retrospective independent validation study. Reproductive biomedicine online. 48, 103308. DOI: https://doi.org/10.1016/j.rbmo.2023.103308 (2024).
Tzukerman, N. et al. Using Unlabeled Information of Embryo Siblings from the Same Cohort Cycle to Enhance In Vitro Fertilization Implantation Prediction. Advanced science. 10, e2207711. DOI: https://doi.org/10.1002/advs.202207711 (2023).
Diakiw, S. M. et al. An artificial intelligence model correlated with morphological and genetic features of blastocyst quality improves ranking of viable embryos. Reproductive biomedicine online. 45, 1105–1117. DOI: https://doi.org/10.1016/j.rbmo.2022.07.018 (2022).
Duval, A. et al. A hybrid artificial intelligence model leverages multi-centric clinical data to improve fetal heart rate pregnancy prediction across time-lapse systems. Human reproduction. 38, 596–608. DOI: https://doi.org/10.1093/humrep/dead023 (2023).
Erlich, I. et al. Pseudo contrastive labeling for predicting IVF embryo developmental potential. Scientific reports. 12, 2488. DOI: https://doi.org/10.1038/s41598-022-06336-y (2022).
Loewke, K. et al. Characterization of an artificial intelligence model for ranking static images of blastocyst stage embryos. Fertility and sterility. 117, 528–535. DOI: https://doi.org/10.1016/j.fertnstert.2021.11.022 (2022).
Cho, J.H. et al. Large-scale simulation of pregnancy rate improvements using an AI model for embryo ranking. 38th Hybrid Annual Meeting of the ESHRE (2022).
Sayed, S. et al. Time-lapse imaging derived morphokinetic variables reveal association with implantation and live birth following in vitro fertilization: A retrospective study using data from transferred human embryos. PloS one. 15, e0242377. DOI: https://doi.org/10.1371/journal.pone.0242377 (2020).
Liao, Q. et al. Development of deep learning algorithms for predicting blastocyst formation and quality by time-lapse monitoring. Communications biology. 4, 415. DOI: https://doi.org/10.1038/s42003-021-01937-1 (2021).
Ferraretti, A. P. et al. ESHRE consensus on the definition of 'poor response' to ovarian stimulation for in vitro fertilization: the Bologna criteria. Human reproduction. 26, 1616–1624. DOI: https://doi.org/10.1093/humrep/der092 (2011).
Esteves, S. C. et al. The POSEIDON Criteria and Its Measure of Success Through the Eyes of Clinicians and Embryologists. Frontiers in endocrinology. 10, 814. DOI: https://doi.org/10.3389/fendo.2019.00814 (2019).

No competing interests reported.

SupplementaryInformationfile.pdf
