We enrolled patients who visited our institution between April 2016 and November 2018 and in whom liver tumors were detected on US examination. Patients for whom information on age, sex, AST, ALT, platelet count, and albumin was available were selected. Simple cysts were not included in the study. In addition, we excluded tumor images that included measurement lines only, and those without information on whether the tumor was benign or malignant. Finally, we extracted the data for a total of 1080 patients with US-detected liver tumors in whom a clinical diagnosis was made (548 malignant nodules and 532 benign nodules).
Contrast-enhanced CT, MRI, and US were used to obtain a definitive diagnosis of liver nodules using validated imaging criteria.24-29 We also used tumor markers (e.g., alpha-fetoprotein [AFP] or des-gamma-carboxy prothrombin [DCP] for HCC, and carcinoembryonic antigen or carbohydrate antigen 19-9 for metastatic liver tumors) as diagnostic aids. When a definite diagnosis could not be made using these modalities, a US-guided tumor biopsy was performed. Patients clinically diagnosed with benign tumors who showed no evidence of clinical progression without any treatment were also included.30,31
To evaluate the predictive ability of the ML models, we randomly split the 1080 lesions into three groups: (ⅰ) the training set (80%; 864 lesions), which was used to build the model; (ⅱ) the development set (10%; 108 lesions), which was used for tuning the model parameters; and (ⅲ) the test set (10%; 108 lesions), which was used to evaluate the performance of each classifier. We then assessed the predictive accuracy of the developed model.
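The random 80/10/10 split described above can be illustrated with a minimal sketch. The function name, the fixed seed, and the use of integer lesion identifiers are assumptions made for illustration; the text does not state whether the split was stratified by diagnosis, so this sketch uses a plain shuffle.

```python
import random


def split_lesions(lesion_ids, seed=0, train_frac=0.8, dev_frac=0.1):
    """Randomly partition lesion IDs into training, development, and test sets.

    The seed is an assumption (fixed only so the sketch is reproducible).
    No stratification by benign/malignant label is applied here.
    """
    ids = list(lesion_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)

    n_train = round(len(ids) * train_frac)
    n_dev = round(len(ids) * dev_frac)

    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]  # whatever remains after train and dev
    return train, dev, test


# 1080 lesions, identified here by hypothetical integer IDs 0..1079.
train, dev, test = split_lesions(range(1080))
print(len(train), len(dev), len(test))  # 864 108 108
```

Splitting on lesion identifiers rather than on images keeps every test lesion out of the training and development sets.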
The current study was performed in accordance with the ethical guidelines of the Declaration of Helsinki. This research project was approved by the ethics committee of our university hospital (approval number, 11941). Informed consent was obtained in the opt-out format, on the institution’s website. Patients who opted out of participating in our study were excluded. The study design was also included in a comprehensive protocol for retrospective studies, and was approved by the ethics committee of our institution (approval number, 2058).
The current study consisted of two stages: a training stage and a validation stage. In the training stage, we developed a DL model with 972 samples (864 samples for the training set and 108 samples for the development set). MobileNet version 2 (MobileNet v2, Salt Lake City, UT, USA) was used for model development.32 In the validation stage, we assessed the diagnostic accuracy of the developed model using the test set that consisted of 108 samples that were completely independent from the training samples.
All ultrasound examinations were performed using a Toshiba Aplio 300 or Aplio 500 instrument (Canon Medical Systems Co., Tokyo, Japan) fitted with 3.5-5 MHz transducers. We used still images of tumors captured and stored in routine clinical practice. These images were manually annotated for this study by an expert hepatologist (M.S.) and an expert sonographer (Y.S.); the original B-mode US images were annotated with rectangular bounding boxes that covered the whole tumor nodule while keeping the area outside the tumor lesion as small as possible (Figure 1).
Development of the algorithm (training stage)
First, we applied supervised learning with a CNN to the B-mode images of the 972 liver nodules in the training and development sets (479 benign and 493 malignant nodules) to develop the image-only model (model 1) using MobileNet v2. In addition to the CNN image processing network, we applied multimodal representation DL to integrate information on patient background or blood biomarkers with the B-mode images. In a stepwise manner, we integrated information on patient background, namely age and sex (model 2), liver inflammation (AST and ALT) (model 3), liver fibrosis (platelet count) (model 4), and albumin (model 5) (Supplementary Table 1).
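The stepwise construction of models 1 to 5 can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: it assumes that each model adds its variables to those of the preceding model (Supplementary Table 1 gives the actual definitions), that fusion is done by concatenating an image embedding with the tabular values, and that sex is encoded as 0/1. All names and the example values are hypothetical.

```python
# Variable groups named in the text; the key names are hypothetical labels.
FEATURE_STEPS = [
    ("model 1", []),                  # B-mode image only
    ("model 2", ["age", "sex"]),      # patient background
    ("model 3", ["ast", "alt"]),      # liver inflammation
    ("model 4", ["platelet_count"]),  # liver fibrosis
    ("model 5", ["albumin"]),
]


def cumulative_feature_sets(steps):
    """Map each model name to the variables it uses, assuming each step
    adds its variables to those of the previous model."""
    result = {}
    features = []
    for name, added in steps:
        features = features + added
        result[name] = list(features)
    return result


def fuse(image_embedding, patient, feature_names):
    """Concatenate an image embedding with the selected tabular values
    (simple concatenation is an assumption, not the paper's stated method)."""
    return list(image_embedding) + [float(patient[n]) for n in feature_names]


feature_sets = cumulative_feature_sets(FEATURE_STEPS)

# Made-up patient record and a made-up 2-dimensional image embedding.
patient = {"age": 67, "sex": 1, "ast": 35, "alt": 28,
           "platelet_count": 15.2, "albumin": 4.1}
vector = fuse([0.12, -0.53], patient, feature_sets["model 3"])
print(vector)  # [0.12, -0.53, 67.0, 1.0, 35.0, 28.0]
```

In practice the tabular values would normally be scaled before fusion, and the fused vector would feed a classification layer that outputs the probability of malignancy.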
Validation of the algorithm (validation stage)
In the validation stage, we examined the accuracy of the trained models in the test set, which comprised 108 nodules (55 malignant and 53 benign), using the images segmented from the original B-mode US images together with patient background or blood biomarkers. Each nodule was evaluated using the ML models developed in the training stage, and each trained model output the probability of malignancy.
Continuous variables were expressed as medians and interquartile ranges, while categorical variables were expressed as frequencies (%). Categorical data and continuous data were analyzed using the chi-square test and the Mann-Whitney U test, respectively.
In the validation stage, we investigated the performance of each model in an independent test set by calculating its accuracy, sensitivity, and specificity from the confusion matrices generated by the CNN and the multimodal representation models. We also used receiver operating characteristic (ROC) curve analysis to assess the predictive accuracy. The area under the curve (AUC) was used as a measure of the ability to discriminate malignant nodules; AUC values were compared using the DeLong test.33 We used the following grading scale for the interpretation of the AUC results: AUC 0.5-0.6, fail; AUC 0.6-0.7, poor performance; AUC 0.7-0.8, fair performance; AUC 0.8-0.9, good performance; AUC 0.9-1, excellent performance.34,35 The intraclass correlation coefficient (ICC) was used to calculate the intra- and inter-observer variance of continuous variables. Statistical analyses were performed using R version 3.4.3 (https://cran.r-project.org/).
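The performance measures above can be illustrated with a short Python sketch (the study itself used R). The 0.5 probability threshold, the function names, and the toy labels and probabilities are assumptions made for illustration. The AUC is computed as the proportion of malignant/benign pairs in which the malignant nodule receives the higher predicted probability, with ties counted as one half. The text lists overlapping grade boundaries, so assigning a boundary value to the higher grade is also an assumption.

```python
def confusion_counts(labels, probs, threshold=0.5):
    """Count TP, FP, TN, FN with malignant = 1 and benign = 0.
    The 0.5 threshold is an assumption, not stated in the text."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc(labels, probs):
    """AUC as the fraction of (malignant, benign) pairs ranked correctly."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def grade_auc(value):
    """Grading scale from the text; boundary values go to the higher grade.
    Values below 0.5 are not covered by the scale and are graded 'fail' here."""
    if value >= 0.9:
        return "excellent"
    if value >= 0.8:
        return "good"
    if value >= 0.7:
        return "fair"
    if value >= 0.6:
        return "poor"
    return "fail"


# Toy data for illustration only; not study results.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

metrics = summary_metrics(*confusion_counts(labels, probs))
toy_auc = auc(labels, probs)
print(metrics, round(toy_auc, 3), grade_auc(toy_auc))
```

On the toy data, 8 of the 9 malignant/benign pairs are ranked correctly, giving an AUC of about 0.889, which falls in the "good" band of the scale.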