Three different methods to measure cumulus expansion were evaluated by three observers. The working principle, user-friendliness (based on equipment and time requirements), and performance (based on level of subjectivity) of these methods are summarized in Table 1.
Table 1. Comparison of methods to measure cumulus expansion. Methods were compared by three observers and evaluated for equipment and time requirements and the level of subjectivity. +, easy or low; ++, moderate; +++, complicated or high. ^A Equipment required: apart from a stereomicroscope, software was needed to calculate cumulus expansion for the area and 3-distance methods.

| Method | Working principle | Equipment^A | Time |
|---|---|---|---|
| Area | Measuring the area by drawing the COC contour | ++ | ++ |
| 3-distance | Measuring 3 distances between zona pellucida and outer cumulus | ++ | +++ |
| Scoring | 5-point Likert scale | + | + |
3.1 Cumulus expansion in numbers
Both the area and the 3-distance method resulted in numerical data, while the results of cumulus expansion derived from the scoring method were ordinal. Therefore, no direct comparisons were made between the scoring method and the other two methods.
Observations of cumulus expansion evaluated by the area method varied from 0.06 to 346.3% with a median growth of 74.28% (interquartile range (IQR): 31.01–154.2%; Fig. 4A). The observations resulting from the 3-distance method ranged from 0.15 to 346.9% and had a median growth of 52.68% (IQR: 15.29–106.2%; Fig. 4A). The scoring method yielded a median of 3.0 (IQR: 1.0–3.0; Fig. 4B).
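As an illustration of how such percentages and their summary statistics can be obtained, the following minimal sketch computes a relative growth and a median with IQR. It assumes that expansion is expressed as the relative increase of a measurement between two time points; the function names and this definition are illustrative assumptions and are not taken from the methods of this study.

```python
from statistics import median, quantiles


def percent_expansion(before, after):
    """Relative growth in percent (assumed definition: (after - before) / before)."""
    if before <= 0:
        raise ValueError("'before' must be positive")
    return (after - before) / before * 100.0


def median_iqr(values):
    """Return (median, Q1, Q3); quartiles use the 'inclusive' interpolation method."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3


# Hypothetical COC areas (arbitrary units) before and after maturation.
areas = [(100.0, 150.0), (80.0, 200.0), (120.0, 126.0)]
growth = [percent_expansion(b, a) for b, a in areas]
print(median_iqr(growth))
```

Other quartile conventions exist, so IQR bounds computed this way may differ slightly from those reported by statistical packages.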
3.2 The area method resulted in the highest inter-observer and intra-observer agreement
3.2.1 Inter-observer agreement
For all three methods, the agreement between the observers was evaluated by calculating the corresponding ICC, as illustrated in Fig. 5a and Table 2. This ICC was calculated in duplicate, as the measurements were performed twice by every observer for every method. The inter-observer ICCs for the area method showed a very good level of agreement, while the 3-distance method resulted in a moderate level of agreement. The inter-observer agreement for the scoring method was fair.
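To make the agreement measure concrete, the sketch below computes one common form of the ICC from a table with one row per COC and one column per observer. The specific form used here, ICC(2,1) (two-way random effects, absolute agreement, single rater), is an assumption chosen for illustration; this section does not state which ICC form was applied, and the function name is hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows: one row per subject (COC), one column per observer.
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of observers
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 rows and >= 2 columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between observers
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Two hypothetical observers; the second reads one unit higher on every COC.
print(icc_2_1([[1, 2], [2, 3], [3, 4]]))
```

In this toy example the two observers rank the COCs identically, yet the constant offset lowers the coefficient to about 0.67, because an absolute-agreement ICC penalizes systematic differences between observers.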
3.2.2 Overall inter-observer agreement
An overall agreement level was calculated for every method by considering both repetitions of the inter-observer agreement. This resulted in a very good overall agreement for the area method, a moderate overall agreement for the 3-distance method, and a poor overall agreement for the scoring method, as shown in Fig. 5b and Table 2.
3.2.3 Intra-observer agreement
The intra-observer agreement was evaluated by ICC calculations for every method and every observer (Fig. 5c; Table 2). Overall, intra-observer agreement for observers 1, 2, and 3 was very good for the area method and moderate to good for the 3-distance method. The results for the scoring method varied per observer, with the level of agreement ranging from poor through moderate to good.
Table 2. Intraclass correlation coefficients for three cumulus expansion measurement methods. Data are reported as intraclass correlation coefficients with their respective 95% confidence intervals.
| Method | Inter-observer agreement, repetition 1 | Inter-observer agreement, repetition 2 | Overall inter-observer agreement | Intra-observer agreement, observer 1 | Intra-observer agreement, observer 2 | Intra-observer agreement, observer 3 |
|---|---|---|---|---|---|---|
| Area | 0.89 (0.88–0.92) | 0.90 (0.85–0.93) | 0.89 (0.85–0.93) | 0.87 (0.84–0.90) | 0.90 (0.87–0.92) | 0.96 (0.95–0.97) |
| 3-distance | 0.56 (0.49–0.63) | 0.51 (0.44–0.59) | 0.54 (0.44–0.63) | 0.61 (0.53–0.69) | 0.59 (0.50–0.67) | 0.64 (0.56–0.71) |
| Scoring | 0.23 (0.12–0.34) | 0.38 (0.3–0.47) | 0.30 (0.12–0.47) | 0.69 (0.63–0.76) | 0.11 (-0.01–0.24) | 0.51 (0.42–0.6) |
3.3 AIxpansion automatically measures cumulus expansion based on the COC area
3.3.1 AIxpansion processor capacity
Because the area method yielded the highest ICC values for manual measurement of cumulus expansion, a DL model, AIxpansion, was created based on this method. The pre-processing stage, in which the region of interest was detected, correctly identified the region of interest in 98% of cases. Detection failed because of a very low signal of the COC and because of an oil droplet in the image, which interfered with the model. These 2 cases were excluded from further analyses, leaving a total of 98 COCs to train the model as explained in section 2.5.
3.3.2 AIxpansion performs similarly to human observers
The average ranking, i.e. the comparison of the different scores among the observers, is reported in Table 3. In 2 out of 3 cases, AIxpansion performed better than the human observers. The performance of AIxpansion in measuring the COC area was similar to that of the observers (p = 0.15).
Bias and variance among observers and between human observers and AIxpansion are reported in Supplementary Tables S1 and S2, respectively. Measuring cumulus expansion using AIxpansion resulted in lower bias and less variance than the human observers in 1 out of 3 cases. In the remaining 2 cases, AIxpansion scored similarly to the human observers for both bias and variance, indicating that AIxpansion reaches human-level performance.
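For readers unfamiliar with these two quantities, the sketch below shows one common way to express them for a pair of raters: bias as the mean of the paired differences and variance as the sample variance of those differences. This definition and the function name are assumptions for illustration only; the exact computation behind the supplementary tables is not described in this section.

```python
from statistics import mean, variance


def bias_and_variance(reference, candidate):
    """Mean and sample variance of the paired differences (candidate - reference)."""
    if len(reference) != len(candidate) or len(reference) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    diffs = [c - r for r, c in zip(reference, candidate)]
    return mean(diffs), variance(diffs)


# Hypothetical area measurements of the same COCs by an observer and by a model.
observer = [10.0, 20.0, 30.0]
model = [12.0, 21.0, 33.0]
print(bias_and_variance(observer, model))
```

A bias near zero indicates no systematic over- or under-estimation relative to the reference, while a small variance indicates that the differences are consistent across COCs.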
Table 3. Average rank calculated in the comparison between the different human observers and between human observers and AIxpansion. This table shows the similarity between the average ranking of three observers (O1-O3) and the deep learning method (AIxpansion). Scores closer to zero indicate that the performance is closer to the column observer.
| | O1 | O2 | O3 |
|---|---|---|---|
| O1 | | 2.04 | 1.93 |
| O2 | 2.04 | | 2.15 |
| O3 | 1.97 | 2.11 | |
| AIxpansion | 1.99 | 1.85 | 1.93 |