In the following, we describe the challenge preparation, organization and evaluation following the guidelines for biomedical image analysis challenge reporting (BIAS guidelines)10. The public challenge training data set was drawn from the University Hospital Tübingen (UKT). The private challenge test set was partly drawn from the same source (UKT) and partly from the University Hospital of the LMU Munich (LMU).
Challenge Participation
A total of 359 teams from all continents registered for the autoPET challenge (Fig. 1), with clear geographic concentrations in Asia (61%, mainly China: 41%), Europe (20%) and North America (16%, mainly USA: 13%). As far as disclosed, most participants were affiliated with academic institutions (75%), followed by a smaller group of company employees (12%).
37 teams submitted at least one algorithm to the preliminary challenge phase, amounting to a total of 253 submissions in this phase. In the final challenge phase, 18 teams contributed a total of 67 algorithms. The best-performing submission by each team was considered for the challenge leaderboard. The seven best-performing teams were identified as challenge winners – their contributions are described in greater detail as part of this work.
As expected, all final contributions were based on deep learning models. The majority of submitted algorithms relied on a 3D U-Net backbone in combination with a Dice loss. A minority of participants deployed transformer-based architectures or combinations of different architectures (2D and 3D), or used less common loss functions11 (e.g., focal loss, TopK Dice loss, Lovasz loss, Tversky loss), mostly in combination with a conventional Dice loss. An overview of the technical details is depicted in Fig. 2.
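To illustrate the conventional Dice loss mentioned above, the following is a minimal NumPy sketch of a soft Dice loss for a binary foreground probability map. The function name and the epsilon smoothing term are our own illustrative choices, not taken from any particular submission.

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for a binary foreground probability map.

    pred:   predicted foreground probabilities, any shape
    target: binary ground-truth mask, same shape
    eps:    small smoothing constant (illustrative choice)
    """
    intersection = np.sum(pred * target)
    denominator = np.sum(pred) + np.sum(target)
    # Dice coefficient is 2*|A∩B| / (|A|+|B|); the loss is 1 - Dice
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)
```

In practice, frameworks such as nnUNet combine this term with a cross-entropy loss; the variants listed above (focal, TopK, Lovasz, Tversky) modify either the overlap term or the voxel weighting.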
Best performing algorithms
In the following, we provide brief descriptions of the seven best-performing contributions in the order of the final leaderboard, followed by individual performance reports. The code for all contributions is publicly available – details can be found in the technical papers published by the participating teams and cited below. Overall, the use of a U-Net backbone was a common feature of the best contributions. The additional implementation of rule-based post-processing of algorithm outputs (e.g., threshold-based removal of small connected components from the output segmentation mask) distinguished the top four contributions from the rest of the field. All top-performing teams used both the PET and CT image volumes as algorithm inputs.
Team Blackbean
The best-performing team chose a deliberately simple approach by using a vanilla U-Net backbone and focusing on ablation studies to identify the best combination of input shape (crop size) and step size during sliding window inference. In addition, a post-processing step was used to minimize the false-positive volume by removing small connected components from the initial algorithm output12.
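The removal of small connected components described above can be sketched as follows. This is a generic illustration built on `scipy.ndimage.label`, with a hypothetical size threshold; it is not the team's exact implementation.

```python
import numpy as np
from scipy import ndimage

def remove_small_components(mask, min_voxels=10):
    """Remove connected components smaller than `min_voxels` from a
    binary segmentation mask (2D or 3D).  Threshold is illustrative."""
    labeled, num = ndimage.label(mask)
    if num == 0:
        return mask.astype(bool)
    # voxel count per component label (index 0 is background)
    sizes = np.bincount(labeled.ravel())
    keep = sizes >= min_voxels
    keep[0] = False  # never keep background
    return keep[labeled]
```

Such a step trades a small risk of deleting tiny true lesions against a reduction of the false-positive volume, which was one of the challenge metrics.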
Team BDAV
Team BDAV used a combination of self-supervised pre-training (via contrastive learning) and a multi-stage U-Net architecture. The multi-stage U-Net architecture utilized a global segmentation module to conduct coarse tumor segmentation, which was then fed into a local refinement module to reduce the false positives. The multi-stage U-Net model was ensembled with a standard nnUNet model to generate the final prediction13.
Team FightTumor
This contribution was based on a slightly modified nnUNet model using DiceTopK loss and enhanced data augmentation. In addition, post-processing of the model output was performed by removing small connected components (< 10 voxels) and segmentations in areas with low CT Hounsfield Units (< -1,000 HU)14.
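The Hounsfield-unit filtering could, for example, be approximated at the voxel level as below. Note that this is a simplified sketch (the team may have filtered whole components rather than individual voxels), and the function name is our own.

```python
import numpy as np

def suppress_air_predictions(mask, ct_hu, hu_threshold=-1000):
    """Drop predicted foreground voxels where the co-registered CT shows
    very low density (< hu_threshold HU, i.e., essentially air).

    mask:  binary segmentation mask
    ct_hu: CT volume in Hounsfield units, same shape as mask
    """
    return np.logical_and(mask.astype(bool), ct_hu >= hu_threshold)
```

The rationale is that metabolically active tumor tissue cannot lie in air-density regions, so predictions there are almost certainly reconstruction or registration artifacts.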
Team UIH-FL
Team UIH-FL trained a combined 2D and 3D nnUNet model. In addition, they performed post-processing of the model output by removing small connected components (< 4 voxels) and all connected components on the three bottom slices of the predicted PET mask15.
Team Heiligerl
This contribution was based on an ensemble of an nnUNet-based model and a Swin UNETR. In addition, a classification model, inspired by the reading procedures of physicians, was trained to identify negative PET/CT scans without metabolically active lesions based on maximum intensity projections (MIP)16.
Team SM
Team SM proposed a cascaded architecture consisting of a stacked ensemble of low-resolution U-Net models and a subsequent refiner U-Net for high-resolution predictions17.
Team Flemings
Team Flemings proposed a cascaded architecture consisting of an initial inpainting model to detect and generate lesion-free images, followed by a U-Net-based segmentation, with the residual inpainting image as additional input18.
nnUNet (baseline model, out of competition)
To provide a baseline model, the widely used and standardized nnUNet framework19 was applied with its default settings, using PET and CT volumes as input. The trained baseline model is publicly available at https://github.com/lab-midas/autoPET.
Ensemble model (out of competition)
Based on the predictions of the above-described best-performing algorithms, including the baseline model, an ensemble model output was computed by pixel-wise majority voting.
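Pixel-wise majority voting over binary masks can be sketched as follows; the function name and the tie-breaking rule (strictly more than half of the votes required) are our own illustrative choices.

```python
import numpy as np

def majority_vote(masks):
    """Pixel-wise majority vote over a list of binary segmentation masks.

    A voxel is foreground in the ensemble output if more than half of
    the input masks mark it as foreground.
    """
    stacked = np.stack([np.asarray(m).astype(bool) for m in masks])
    votes = stacked.sum(axis=0)
    return votes > (len(masks) / 2)
```

With an even number of models, this rule resolves exact ties toward background; other conventions (e.g., ties toward foreground) are equally possible.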
Overall, the performance of the best-performing teams differed only slightly with respect to all three metrics (Fig. 3, A): Dice score (capturing consistency between foreground predictions and manual masks), false negative volume (capturing the total volume of missed lesions), and false positive volume (capturing false-positive segmentations of physiologic tracer uptake). The mean Dice score ranged between 0.74 and 0.79, the mean false negative volume between 0.5 and 1.5 ml and the mean false positive volume between 2.1 and 9.5 ml. When assessing algorithm performance separately for the two data sources (UKT and LMU) of the multicentric test set, we observed that mean Dice scores were overall markedly higher for UKT test data (ranging between 0.8 and 0.88) than for LMU test data (ranging between 0.6 and 0.7). Mean false negative and false positive volumes were slightly higher for LMU data than for UKT data (false negative volumes UKT: 0.3 to 1.7 ml, LMU: 0.9 to 2.3 ml; false positive volumes UKT: 1.5 to 5.4 ml, LMU: 3.2 to 20.3 ml).
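The three evaluation metrics described above can be sketched as follows. This is a simplified voxel-level interpretation: the challenge's exact definitions (in particular whether false positive and false negative volumes are computed per connected component rather than per voxel) may differ, and the voxel volume is a placeholder parameter.

```python
import numpy as np

def evaluate(pred, gt, voxel_volume_ml=0.001):
    """Simplified voxel-level versions of the three challenge metrics.

    pred, gt:        binary segmentation masks of equal shape
    voxel_volume_ml: volume of one voxel in ml (illustrative value)
    Returns (dice, false_negative_volume_ml, false_positive_volume_ml).
    """
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    dice = 2.0 * intersection / max(pred.sum() + gt.sum(), 1)
    fn_volume = np.logical_and(gt, ~pred).sum() * voxel_volume_ml  # missed
    fp_volume = np.logical_and(pred, ~gt).sum() * voxel_volume_ml  # spurious
    return dice, fn_volume, fp_volume
```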
The best-performing team (Blackbean) ranked first regarding the mean Dice score and the mean false negative volume, and second regarding the mean false positive volume. Interestingly, the provided baseline nnUNet model showed good overall performance, ranking – out of competition – second with respect to the mean Dice score and seventh with respect to the mean false positive and false negative volumes (Fig. 3, A).
The ensemble prediction (out of competition) based on the top performing contributions and the baseline model showed a superior performance compared to all participating teams with the highest overall mean Dice score (0.81), the second lowest mean false negative volume (0.71 ml) and the lowest mean false positive volume (1.6 ml) (Fig. 3, A).
Typical qualitative examples of model performance and error cases are given in Fig. 4. In general, false positive segmentations mainly occurred in areas of atypical physiological tracer uptake (e.g., an unusually large urinary bladder or brown adipose tissue), while tumor lesions adjacent to physiological tracer uptake were more often missed.
Impact of training data composition on algorithm performance
To better understand external factors influencing algorithm performance in general, we performed additional ablation studies using the baseline model with varying sizes and compositions of the training data.
As could be expected, we observed an overall increase in segmentation performance with increasing amounts of training data, reflected by increasing Dice scores and decreasing false positive and false negative volumes (Fig. 3, B). Interestingly, in contrast to this overall tendency, Dice scores on LMU test data did not increase and even slightly decreased with larger amounts of training data, probably indicating overfitting to the UKT training data distribution.
We also assessed the impact of input data composition on algorithm performance: in addition to the PET and CT volumes used within the challenge, we added CT-based anatomical organ labels as a potential third input.
Regarding the composition of input data, we observed the highest segmentation performance in terms of Dice scores when using all three inputs (PET, CT and anatomical labels) on both UKT and LMU data (Fig. 3, B). On UKT data, using all three inputs also gave lower false positive and false negative volumes. On LMU data, the results regarding input composition and false positive/negative volumes were inconclusive; however, using only PET data resulted in markedly higher false positive volumes on LMU test data.
Figure 5 provides qualitative examples of test data sets and associated segmentation results for test data drawn from UKT and LMU. In agreement with the quantitative results, we qualitatively observed a larger mismatch between manual and automated tumor lesion segmentation on LMU data (Fig. 5). In general, tumor volumes were locally overestimated on LMU test data, explaining the lower Dice scores compared to UKT test data. Also in line with the quantitative results, we did not observe obvious qualitative differences between LMU and UKT test data with respect to false positive and false negative volumes.