This section details the datasets used for the two segmentation tasks and for the simulations, together with the simulation parameters and software used for failure load assessment. We also describe the pre-processing pipeline applied before training the neural networks, and our approach for comparing simulations based on manual and automatic segmentations.
The study protocol was approved by the French Ethics Committee (CPP SUD-EST 1 France) under registration number ID-RCB: 2019-A01202-55. All procedures were conducted in compliance with national and European regulations. All included patients received clear information, and written informed consent was obtained from all subjects and/or their legal guardian(s).
A. Datasets
We use two sources of data for bone segmentation: publicly available data for the vertebrae and data from the project for the femurs. For the femur segmentation task, the MEKANOS cohort was used (Hospices Civils de Lyon, agreement number N. 21 5467, May 28th, 2021). This cohort consists of eleven in-vivo CT scans of hips in which both femurs are present. The scans were acquired in clinical routine following a specific procedure (constant table height, quality phantom QA Mindways, 120 kV, 270 mAs, pitch 1, field of view 360 mm and 200 mm, reconstruction: standard filter B, 512 × 512 matrix, slice thickness 0.7 mm) on acquisition systems from three manufacturers (General Electric, Philips and Siemens). A few femurs have metastatic osteolytic lesions, which complicates the segmentation task but allows the trained models to segment metastatic bones more effectively, which is important for our study. From this database, eighteen femurs were manually segmented (4 femurs were not available).
For the vertebrae segmentation task, we used two publicly available datasets: VerSe2019 and VerSe2020 [27]. Together, these datasets contain 374 CT scans of various sizes, all with manual segmentation. The number of vertebrae per scan ranges from 3 to 25, and all types of vertebrae are represented. In this study, we retained 363 patients, excluding those in which an additional transitional T13 vertebra was scanned. Figure 1 shows examples of axial, coronal and sagittal slices taken from those datasets. The data used for the simulation tests consist of 12 femurs from 6 patients (6 scans), of which 6 are healthy and 6 are metastatic, as well as 15 vertebrae from 2 patients (2 scans), of which 13 are healthy and 2 are metastatic (1 thoracic and 1 lumbar), all taken from the MEKANOS database.
B. Simulation Pipeline
For both vertebra and femur failure load simulations, a custom finite element simulation pipeline, described in Fig. 2, was used with dedicated parameters to compute the failure load of the considered bones. For femurs, voxel-based hexahedral meshes were used [10]. These meshes were obtained either from a manual CT segmentation performed with the software 3D Slicer, or from an automatic segmentation produced by a neural network described in Section D. The grey-level density values were converted to Young’s modulus using the calibration phantom included in the acquisitions [10]:
$$E\,(\mathrm{MPa}) = 14900\,\rho_{\mathrm{QCT}}^{1.86} \tag{1}$$
An axial compression was applied to the femurs in Ansys software (version 2021 R1) with a nonlinear constitutive law, and the load applied on each element of the mesh was computed. Failure is considered to occur when the maximal load is reached [10].
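As an illustration of the density-to-modulus conversion of Eq. (1), the following minimal sketch maps a calibrated density to a Young's modulus. The function name, the density unit (g/cm³) and the clamping of negative densities are assumptions made for this example; they are not taken from the simulation pipeline itself.

```python
def femur_youngs_modulus(rho_qct: float) -> float:
    """Young's modulus in MPa from Eq. (1): E = 14900 * rho_QCT ** 1.86.

    Assumption: rho_qct is the phantom-calibrated density in g/cm^3.
    """
    # Clamp negative densities (e.g. noisy marrow voxels) to zero so that the
    # fractional power stays real-valued. This clamp is an assumption.
    rho = max(rho_qct, 0.0)
    return 14900.0 * rho ** 1.86


if __name__ == "__main__":
    for rho in (0.1, 0.5, 1.0):
        modulus = femur_youngs_modulus(rho)
        print(f"rho = {rho:.1f} g/cm^3 -> E = {modulus:.0f} MPa")
```

In a voxel-based mesh, such a function would be evaluated once per element from the mean calibrated density of the corresponding voxel.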
For vertebrae simulations, a quadratic tetrahedral mesh (10-node elements) with an element volume of 1 mm³ was used, based on experimental data from [30]. For the numerical model, we used the elasticity-density relationship from [31], together with the calibration phantom:
$$E\,(\mathrm{MPa}) = 3230\,\rho_{\mathrm{QCT}} - 34.7 \tag{2}$$
We used a linear elastic - perfectly plastic constitutive law, with a yield strain of 1.5% [32]. Failure was defined as a reduction of 1.9% of the total vertebral height [33]. All these simulations were also run using Ansys software.
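The vertebra material model can be summarized by the short sketch below: Eq. (2) for the modulus, a one-dimensional linear elastic - perfectly plastic response with a 1.5% yield strain, and a 1.9% height-reduction failure check. This is a simplified uniaxial illustration, not the three-dimensional Ansys model; the function names and the density unit (g/cm³) are assumptions.

```python
YIELD_STRAIN = 0.015      # 1.5 % yield strain [32]
FAILURE_FRACTION = 0.019  # 1.9 % reduction of the vertebral height [33]


def vertebra_youngs_modulus(rho_qct: float) -> float:
    """Eq. (2): E = 3230 * rho_QCT - 34.7 in MPa (rho assumed in g/cm^3).

    Note: the relation becomes negative below about 0.011 g/cm^3; handling
    such voxels is left to the caller in this sketch.
    """
    return 3230.0 * rho_qct - 34.7


def uniaxial_stress(strain: float, modulus: float) -> float:
    """Stress in MPa for a 1D linear elastic - perfectly plastic material."""
    yield_stress = modulus * YIELD_STRAIN
    # Elastic below the yield strain, constant stress plateau beyond it
    # (same behaviour assumed in tension and compression).
    return max(-yield_stress, min(modulus * strain, yield_stress))


def has_failed(initial_height_mm: float, current_height_mm: float) -> bool:
    """True once the height reduction reaches 1.9 % of the initial height."""
    reduction = (initial_height_mm - current_height_mm) / initial_height_mm
    return reduction >= FAILURE_FRACTION
```

For example, `has_failed(25.0, 24.5)` corresponds to a 2% height reduction and therefore returns `True`.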
C. Pre-processing pipeline
For femur segmentation, we propose a fully automated segmentation method with a pre-processing pipeline, illustrated in Fig. 3, to facilitate the deep learning training. Our dataset contains only a few annotated samples, and dedicated pre-processing is important to ensure the robustness of the proposed models.
The pre-processing pipeline consists of several steps. After selection and manual expert annotation of the femurs, all volumes are resampled to the median voxel size (0.78 × 0.78 × 0.67 mm). Volumes in which both femurs are present are then cropped to separate the femurs into two distinct volumes. The left femurs are flipped to obtain a comparable orientation for all volumes. To further increase the homogeneity of the dataset, especially the spatial orientation of the femurs, the flipped left femurs and the right femurs are registered together using affine transforms. Finally, the resulting volumes are all normalized before being used as input to the convolutional neural network.
In addition to ensuring the proper training of the neural network, the pre-processing pipeline increases the size of the dataset by splitting the initial volumes.
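The resampling, splitting, flipping and normalization steps can be sketched with NumPy as follows. This is a minimal illustration under several assumptions: nearest-neighbour resampling stands in for whatever interpolation was actually used, the left-right direction is taken to be the first array axis, the normalization is assumed to be a z-score, and the affine registration step is omitted. All function names are hypothetical.

```python
import numpy as np

TARGET_SPACING = (0.78, 0.78, 0.67)  # median voxel size in mm, from the text


def resample_nearest(volume, spacing, target=TARGET_SPACING):
    """Nearest-neighbour resampling to the target voxel size (a stand-in)."""
    index = []
    for n, s, t in zip(volume.shape, spacing, target):
        new_n = max(1, int(round(n * s / t)))
        # Map every output voxel back to the closest input voxel.
        idx = np.minimum((np.arange(new_n) * t / s).round().astype(int), n - 1)
        index.append(idx)
    return volume[np.ix_(*index)]


def split_and_flip(volume, lr_axis=0):
    """Cut the scan at mid-width and mirror the second half.

    Which half holds the left femur depends on the scan orientation; here the
    second half is assumed to be the one that needs flipping.
    """
    half = volume.shape[lr_axis] // 2
    first, second = np.split(volume, [half], axis=lr_axis)
    return first, np.flip(second, axis=lr_axis)


def normalize(volume):
    """Zero-mean, unit-variance intensity normalization (assumed scheme)."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)


if __name__ == "__main__":
    scan = np.random.default_rng(0).normal(size=(64, 32, 16))
    scan = resample_nearest(scan, spacing=(0.9, 0.9, 0.7))
    right, left_flipped = split_and_flip(scan)
    print(right.shape, left_flipped.shape, normalize(right).std())
```

Splitting each bilateral scan yields two training volumes, which is how the pipeline increases the dataset size.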
D. Neural Networks
We used several convolutional neural networks all based on the U-Net architecture [28]. We implemented a 2D multi-planar U-Net, as well as a 3D U-Net for femur segmentation. We compared the results with nnUNet, the state-of-the-art convolutional neural network for medical image segmentation [29].
Three 2D U-Nets were trained on axial, coronal and sagittal slices, respectively, for 500 epochs. The three resulting segmentations were then fused using majority voting. The 3D U-Net model was trained using random patches of size 64 × 64 × 64 for 300 epochs. Data augmentation, such as random rotations, translations, shearing and scaling, was used to prevent overfitting. All custom U-Nets were trained using the Adam optimizer (β1 = 0.9, β2 = 0.999) and a Dice loss, with a learning rate α = 2 × 10⁻⁴ and a batch size of 16 for the 2D U-Nets, and α = 3 × 10⁻⁵ and a batch size of 4 for the 3D U-Net. We also added morphological post-processing operations based on binary dilation and erosion to remove small unwanted islands and improve the segmentation results.
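The fusion of the three per-plane predictions by majority voting can be illustrated with the following sketch, which assumes that each 2D network's slice-wise predictions have already been stacked into a binary 3D volume of identical shape. The function name is hypothetical.

```python
import numpy as np


def majority_vote(axial, coronal, sagittal):
    """Fuse three binary masks: keep a voxel when at least two views agree."""
    votes = (
        axial.astype(np.uint8)
        + coronal.astype(np.uint8)
        + sagittal.astype(np.uint8)
    )
    return votes >= 2


if __name__ == "__main__":
    a = np.array([1, 1, 0, 0], dtype=bool)
    c = np.array([1, 0, 1, 0], dtype=bool)
    s = np.array([1, 0, 0, 1], dtype=bool)
    print(majority_vote(a, c, s))  # only the first voxel has two or more votes
```

With three voters there can be no tie, so no additional tie-breaking rule is needed.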
The nnUNet configuration used is ‘3d_fullres’, with automatically selected patch sizes (238 × 196 × 208 for femur segmentation and 205 × 205 × 205 for vertebrae segmentation) and default parameters. It was trained for 1000 epochs using stochastic gradient descent with an initial learning rate of 0.01 and a batch size of 2 for both trainings. We retained this architecture only for vertebrae segmentation: for femur segmentation, its results were comparable to those of our custom 3D U-Net, whereas for vertebrae the amount of training data was sufficient to avoid the need for dedicated pre-processing, so the only pre-processing operations were those performed automatically by nnUNet.
The models were trained on an NVIDIA P100 GPU with 16 GB of VRAM. On the femur dataset, the total training time was 12 hours per axis for the 2D U-Net, 16 hours for the 3D U-Net and 48 hours for nnUNet. This substantial difference is also present during inference, where nnUNet takes up to 30 minutes for a prediction whereas the standard models take at most 3 minutes.
E. Segmentations and simulations comparison
To quantify the segmentation results, we used the Sørensen-Dice score (noted DSC) to evaluate the similarity between the ground-truth and the automatic segmentations (in [0, 1], where 1 is the best), as well as the Hausdorff distance (noted HD) to evaluate the maximum error of the automatic segmentations (in mm, the smaller the better). All metrics are computed on 3D volumes. We used a 5-fold cross-validation to quantify the results more accurately. Among the 18 available femurs, 12 were used for training, 4 for validation and 2 for testing. For vertebrae segmentation, 242 scans were used for training, 61 for validation and 60 for testing.
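Both metrics can be illustrated on small masks with the sketch below, where a segmentation is represented as a set of (x, y, z) voxel indices. The brute-force Hausdorff computation over all voxels is only practical for small examples and is an assumption of this illustration, as are the function names; it is not the implementation used to produce the reported values.

```python
from math import dist


def dice_score(pred: set, truth: set) -> float:
    """DSC = 2 |P ∩ T| / (|P| + |T|), in [0, 1]."""
    if not pred and not truth:
        return 1.0  # convention chosen here for two empty masks
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff_distance(pred: set, truth: set, spacing=(1.0, 1.0, 1.0)) -> float:
    """Symmetric Hausdorff distance in mm between two non-empty voxel sets."""

    def to_mm(point):
        return tuple(c * s for c, s in zip(point, spacing))

    def directed(a, b):
        b_mm = [to_mm(q) for q in b]
        # Largest distance from a point of a to its nearest point in b.
        return max(min(dist(to_mm(p), q) for q in b_mm) for p in a)

    return max(directed(pred, truth), directed(truth, pred))


if __name__ == "__main__":
    truth = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
    pred = {(1, 0, 0), (2, 0, 0), (3, 0, 0)}
    print(dice_score(pred, truth))
    print(hausdorff_distance(pred, truth, spacing=(0.78, 0.78, 0.67)))
```

The `spacing` argument converts voxel indices to millimetres, so that HD is expressed in the same unit as in the text.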
To assess the influence of the segmentation on the failure load simulations, we computed the failure load of 12 femurs using both the automatic segmentations and the expert manual annotations. We also compared results obtained with automatic and manual segmentations on 15 thoracic and lumbar vertebrae. In both cases, we additionally applied simple morphological operations (dilation/erosion), with either one or two iterations, to the automatic segmentation in order to investigate the effect of slight segmentation variations on the resulting failure load.
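The one- and two-iteration dilations and erosions can be sketched as follows on a mask stored as a set of voxel indices. The 6-connected structuring element is an assumption, since the text does not specify one, and the function names are hypothetical.

```python
# 6-connected neighbourhood (face neighbours only); an assumed choice.
NEIGHBOURS = [
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
]


def dilate(mask: set, iterations: int = 1) -> set:
    """Grow the mask by one voxel layer per iteration."""
    for _ in range(iterations):
        grown = set(mask)
        for x, y, z in mask:
            for dx, dy, dz in NEIGHBOURS:
                grown.add((x + dx, y + dy, z + dz))
        mask = grown
    return mask


def erode(mask: set, iterations: int = 1) -> set:
    """Shrink the mask by one voxel layer per iteration."""
    for _ in range(iterations):
        # Keep only voxels whose six face neighbours all belong to the mask.
        mask = {
            (x, y, z)
            for x, y, z in mask
            if all((x + dx, y + dy, z + dz) in mask for dx, dy, dz in NEIGHBOURS)
        }
    return mask


if __name__ == "__main__":
    seed = {(0, 0, 0)}
    print(len(dilate(seed, 1)), len(dilate(seed, 2)))
```

Each perturbed mask would then be meshed and simulated in the same way as the unmodified automatic segmentation.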
F. Statistical Analysis
Statistical tests were performed using SPSS software (SAS Institute, Cary, NC). Differences among groups were evaluated using a non-parametric test (Friedman test). When a significant overall F value (P < 0.05) was present, differences between individual group means were tested using Dunn’s multiple comparison post-hoc test. Only comparisons with the manual segmentation are presented. For all tests, P < 0.05 was considered statistically significant. Data are presented as mean ± standard error.
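For readers unfamiliar with the Friedman test, the sketch below computes its test statistic for n bones (rows) measured under k segmentation variants (columns). It is an independent illustration and not the SPSS procedure used in the study: it assumes no tied values within a row (no tie correction is applied), the function name is hypothetical, and the numbers in the usage example are made-up placeholders, not study results.

```python
def friedman_statistic(rows):
    """Friedman chi-square statistic for n subjects x k related conditions.

    Assumes no tied values within a row; no tie correction is applied.
    """
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        # Rank the k conditions within each subject (1 = smallest value).
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


if __name__ == "__main__":
    # Placeholder values only: one row per bone, one column per variant.
    example = [
        [1.0, 2.0, 3.0],
        [1.5, 2.5, 3.5],
        [0.8, 1.9, 2.7],
        [1.2, 2.2, 3.1],
    ]
    print(friedman_statistic(example))
```

The statistic is compared with a chi-square distribution with k − 1 degrees of freedom to obtain the P value, after which post-hoc tests such as Dunn's are applied when the overall test is significant.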