Development of novel deep multimodal representation learning-based model for the differentiation of liver tumors on B-mode ultrasound images

Background and Aim: Recently, multimodal representation learning for images and other information, such as numbers or language, has gained much attention. The aim of the current study was to analyze the diagnostic performance of deep multimodal representation model-based integration of tumor images, patient background, and blood biomarkers for the differentiation of liver tumors observed using B-mode ultrasonography (US). Method: First, we applied supervised learning with a convolutional neural network (CNN) to 972 liver nodules in the training and development sets to develop a predictive model using segmented B-mode tumor images. We then applied a deep multimodal representation model to integrate information on patient background and blood biomarkers with the B-mode images. Finally, we investigated the performance of the models in an independent test set of 108 liver nodules. Results: Using only the segmented B-mode images, the diagnostic accuracy and area under the curve (AUC) were 68.52% and 0.721, respectively. As information on patient background and blood biomarkers was integrated, the diagnostic performance increased in a stepwise manner. The diagnostic accuracy and AUC of the multimodal DL model (which integrated the B-mode tumor image with patient age, sex, aspartate aminotransferase, alanine aminotransferase, platelet count, and albumin data) reached 96.30% and 0.994, respectively. Conclusion: The integration of patient background and blood biomarkers with US images using multimodal representation learning outperformed the CNN model using US images alone. We expect that the deep multimodal representation model could be a feasible and acceptable tool for the definitive diagnosis of liver tumors using B-mode US.


Introduction
Ultrasonography (US) is widely used for hepatocellular carcinoma (HCC) surveillance to screen high-risk populations, because of its cost-effectiveness and non-invasiveness. However, a definitive diagnosis of liver tumors observed using B-mode sonography can be difficult because of the low specificity of this modality. 1 Currently, B-mode sonography is usually used in combination with other contrast imaging modalities such as computed tomography (CT) or magnetic resonance imaging (MRI), to obtain a definitive diagnosis. However, because B-mode US provides structural information that may reflect the histological characteristics of the tumor, 2 a precise and objective recognition of B-mode images has the potential to become a powerful tool for the qualitative diagnosis of liver tumors.
Machine learning (ML) is a multidisciplinary field combining computer science and mathematics, which focuses on implementing computer algorithms capable of maximizing predictive accuracy from static or dynamic data sources using analytic or probabilistic models. 3 Deep learning (DL) architectures have become a hot topic in the ML field and have been used successfully for image classification. 4 The ImageNet Large Scale Visual Recognition Challenge is an annual computer vision competition; in the 2017 competition, DL technology based on deep convolutional neural networks (CNNs) achieved a misclassification rate of less than 5%, indicating that CNNs can classify images more precisely than humans. 5 Recently, multimodal representation learning for images and other information, such as numbers or language, has gained much attention because of the possibility of combining latent features using a single distribution. 6 In addition to the information in B-mode images of liver tumors, patient background and biomarkers of liver inflammation (aspartate aminotransferase [AST] and alanine aminotransferase [ALT]) or fibrosis (platelet count) 7 are commonly collected in daily clinical practice. In addition, serum albumin levels, which have been shown to be decreased in cancer patients, 8 are widely available. These biomarkers alter the pretest probability for the diagnosis of liver tumors using B-mode US and are thus useful for the definitive diagnosis of liver tumors detected by US.
Although multimodal representation learning-based integration of B-mode images, patient background, and blood biomarkers is likely to become a promising means of making a definitive diagnosis, the clinical utility of multimodal representation learning by a DL model for the classification of liver tumors has not yet been elucidated. The current study was designed to assess the diagnostic significance of adding patient background or blood data to B-mode US images and to analyze the diagnostic performance of a deep multimodal representation model for the differentiation of liver tumors observed using B-mode US.

Materials and methods
Clinical feature selection for the development of a deep multimodal representation model. For the development of the multimodal DL model, we applied multimodal representation DL to integrate information on patient background and blood biomarkers with B-mode images. To keep the model simple and easy to interpret, we selected the minimum number of features based on the following criteria: (i) blood markers or information commonly collected in daily clinical practice; and (ii) factors widely used to assess liver damage (inflammation or fibrosis), liver function, or general clinical condition, which would alter the pretest probability for the diagnosis of liver tumors on B-mode US. Finally, we selected patient age and sex as the patient background, serum AST and ALT levels as markers of liver inflammation, platelet count as a marker of liver fibrosis, and the serum albumin level as a marker of liver function or general clinical condition. Because the AST/ALT ratio was shown to be a significant predictor of the development of HCC, 9 we included both the AST and ALT values in the current DL model so that it could reflect the association between AST and ALT. DL is less susceptible to multicollinearity and effectively generalizes multicollinear, high-dimensional data. 10,11

Patients. We enrolled patients who visited our institution between April 2016 and November 2018 and underwent US examination that resulted in the detection of liver tumors. Simple cysts were not included in the study. We excluded tumor images that contained only measurement lines and those without information on the benign or malignant nature of the tumor. In addition, because the conventional DL model could not handle missing data, 12 patients lacking information on age, sex, AST, ALT, platelet count, or albumin were excluded.
Finally, we extracted the data for a total of 1080 patients with US-detected liver tumors in whom a clinical diagnosis was made (548 malignant nodules and 532 benign nodules).
Contrast-enhanced CT, MRI, and US were used to obtain a definitive diagnosis of liver nodules using validated imaging criteria. 13-18 We also used tumor markers (e.g. alpha-fetoprotein or des-gamma-carboxy prothrombin for HCC, and carcinoembryonic antigen or carbohydrate antigen 19-9 for metastatic liver tumors) as diagnostic aids. When a definite diagnosis could not be made using these modalities, a US-guided tumor biopsy was performed. Patients clinically diagnosed with benign tumors, with no evidence of clinical progression and without any treatment, were also included. 19,20 To evaluate the predictive ability of the ML models, we randomly split the 1080 lesions into three groups: (i) the training set (80%; 864 lesions), used to build the model; (ii) the development set (108 lesions), used for tuning the model parameters; and (iii) the test set (108 lesions), used to evaluate the performance of each classifier. We then assessed the predictive accuracy of the developed model.
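The three-way split described above can be sketched as follows (a minimal illustration using scikit-learn, not the authors' code; the stratification by label and the random seed are assumptions):

```python
from sklearn.model_selection import train_test_split

# 1080 lesion IDs with binary labels (1 = malignant, 0 = benign),
# matching the 548 malignant and 532 benign nodules in the study.
lesions = list(range(1080))
labels = [1] * 548 + [0] * 532

# First carve out the 108-lesion test set ...
train_dev, test, y_train_dev, y_test = train_test_split(
    lesions, labels, test_size=108, random_state=42, stratify=labels)

# ... then split the remaining 972 lesions into 864 training
# and 108 development lesions.
train, dev, y_train, y_dev = train_test_split(
    train_dev, y_train_dev, test_size=108, random_state=42,
    stratify=y_train_dev)

print(len(train), len(dev), len(test))  # 864 108 108
```

Stratifying on the benign/malignant label keeps the class balance similar across the three sets, so the development and test metrics are not biased by chance label imbalance.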
The current study was performed in accordance with the ethical guidelines of the Declaration of Helsinki. This research project was approved by the ethics committee of our university hospital (approval number, 11941). Informed consent was obtained in the opt-out format, on the institution's website. Patients who opted out of participating in our study were excluded. The study design was also included in a comprehensive protocol for retrospective studies and was approved by the ethics committee of our institution (approval number, 2058).
Study outline. The current study consisted of two stages: a training stage and a validation stage. In the training stage, we developed a DL model with 972 samples (864 samples in the training set and 108 samples in the development set); the MobileNet v2 architecture was used for model development. 21 In the validation stage, we assessed the diagnostic accuracy of the developed model using the test set, which consisted of 108 samples that were completely independent of the training samples.
Image processing. All ultrasound examinations were performed using a Toshiba Aplio 300 or Aplio 500 instrument (Canon Medical Systems Co., Tokyo, Japan) fitted with 3.5- to 5-MHz transducers. We used still images of tumors captured and stored in routine clinical practice. These images were manually annotated for this study by an expert hepatologist (M.S.) and an expert sonographer (Y.S.); the original B-mode US images were annotated with rectangular bounding boxes drawn to cover the whole tumor nodule while keeping the area outside the tumor lesion as small as possible (Fig. 1).
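The bounding-box annotation amounts to cropping a rectangular region of interest from each still frame. A minimal sketch (the function name, box coordinates, and frame size are hypothetical, not taken from the study):

```python
import numpy as np

def crop_nodule(image: np.ndarray, box: tuple) -> np.ndarray:
    """Crop a rectangular bounding box (x0, y0, x1, y1) from a B-mode frame.

    The box is assumed to cover the whole nodule while keeping the
    surrounding parenchyma as small as possible, mirroring the manual
    annotation described above.
    """
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]

# Hypothetical example: a 512x512 grayscale frame with a nodule
# annotated at (100, 120)-(260, 300).
frame = np.zeros((512, 512), dtype=np.uint8)
roi = crop_nodule(frame, (100, 120, 260, 300))
print(roi.shape)  # (180, 160)
```

In practice each crop would then be resized to the fixed input resolution expected by the CNN backbone before training.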

Development of the algorithm (training stage).
First, we applied supervised learning with a CNN to a total of 972 B-mode liver nodules in the training and development sets (479 benign and 493 malignant nodules) to develop the image-only model (Model 1) using MobileNet v2. We also applied multimodal representation DL to integrate information on patient background and blood biomarkers with the B-mode images. In a stepwise manner, we integrated the information on patient background, such as age and sex (Model 2), liver inflammation (AST and ALT; Model 3), liver fibrosis (platelet count; Model 4), and albumin (Model 5) (Table S1).
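One common way to realize such integration is a late-fusion network in which image features are concatenated with the tabular clinical features before the classification head. The sketch below is an assumption about the general architecture, not the authors' implementation; for brevity, a tiny CNN stands in for the MobileNet v2 backbone:

```python
import torch
import torch.nn as nn

class MultimodalNet(nn.Module):
    """Late-fusion model: image features are concatenated with tabular
    clinical features (age, sex, AST, ALT, platelet count, albumin)
    before the classification head."""

    def __init__(self, n_tabular: int = 6):
        super().__init__()
        # Tiny CNN branch standing in for the MobileNet v2 backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())   # -> 16 image features
        self.head = nn.Sequential(
            nn.Linear(16 + n_tabular, 32), nn.ReLU(),
            nn.Linear(32, 1))                        # logit for malignancy

    def forward(self, image, tabular):
        feats = torch.cat([self.cnn(image), tabular], dim=1)
        return torch.sigmoid(self.head(feats))       # probability of malignancy

model = MultimodalNet()
batch_img = torch.randn(4, 1, 224, 224)  # grayscale B-mode crops
batch_tab = torch.randn(4, 6)            # standardized clinical features
prob = model(batch_img, batch_tab)       # shape (4, 1), values in [0, 1]
```

Dropping columns from the tabular input (e.g. using only age and sex) reproduces the stepwise Model 2 to Model 5 configurations; tabular features would normally be standardized so that their scale matches the learned image features.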

Validation of the algorithm (validation stage).
In the validation stage, we examined the accuracy of the trained models using the 108 segmented images (55 malignant and 53 benign nodules) from the original B-mode US images, together with patient background and blood biomarkers, in the test set. Each nodule was evaluated by the ML models developed in the training stage, and each trained model output the probability of malignancy.
Statistical analysis. Continuous variables were expressed as medians and interquartile ranges, while categorical variables were expressed as frequencies (%). Categorical and continuous data were analyzed using the χ² test and the Mann-Whitney U-test, respectively.
In the validation stage, we investigated the performance of each model in the independent test set by calculating its accuracy, sensitivity, and specificity using a confusion matrix generated from the predictions of the CNN and multimodal representation models. We also used receiver-operating characteristic curve analysis to assess predictive accuracy. The area under the curve (AUC) was used to evaluate the ability to discriminate malignant nodules; AUC values were compared using the DeLong test. 22 We used the following grading scale for the interpretation of AUC results: 0.5-0.6, fail; 0.6-0.7, poor; 0.7-0.8, fair; 0.8-0.9, good; and 0.9-1, excellent performance. 23,24 The intraclass correlation coefficient (ICC) was used to quantify intraobserver and interobserver variance of continuous variables. Statistical analyses were performed using R 3.4.3 (https://cran.r-project.org/).
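The confusion-matrix-derived metrics and the AUC can be computed as follows (a minimal sketch with hypothetical predictions, not the study data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical test-set labels and predicted malignancy probabilities.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])          # 1 = malignant
p_malig = np.array([0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.7, 0.3])
y_pred = (p_malig >= 0.5).astype(int)                # 0.5 decision threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall for malignant nodules
specificity = tn / (tn + fp)
auc = roc_auc_score(y_true, p_malig)   # threshold-independent discrimination

print(accuracy, sensitivity, specificity, auc)
```

Note that accuracy, sensitivity, and specificity depend on the chosen probability threshold, whereas the AUC summarizes discrimination across all thresholds, which is why both are reported.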

Results
Patient and tumor characteristics. The details of the liver tumors included in the current study are shown in Table 1. The majority of benign and malignant tumors were hemangiomas and HCCs.
Patient characteristics are shown in Table 2. The proportion of male patients was significantly higher among patients with malignant nodules than among those with benign nodules. Compared with patients with benign nodules, patients with malignant liver tumors were significantly older and had higher serum levels of AST, ALT, gamma-glutamyl transpeptidase, and alkaline phosphatase, whereas their white blood cell count, hemoglobin level, platelet count, and serum albumin level were lower.
Predictive accuracy of convolutional neural network models for discriminating malignant and benign nodules. Table 3 shows the diagnostic accuracy, sensitivity, and specificity of each DL model in the test set. Using only the segmented B-mode images (DL model 1), the diagnostic accuracy, sensitivity, and specificity were 68.52%, 67.27%, and 69.81%, respectively. The diagnostic performance increased in a stepwise manner with the integration of patient background information, such as age and sex, and blood biomarkers. The diagnostic accuracy of Model 5 (the model integrating the B-mode tumor image with patient age, sex, AST, ALT, platelet count, and albumin data) reached 96.30%, with a sensitivity of 100.0% and a specificity of 92.45%.
Receiver-operating characteristic curve analysis of deep learning models. The receiver-operating characteristic curves for the prediction of malignant tumors were plotted for each DL model (Fig. 2). The AUCs for the prediction of malignant tumors for DL models 1, 2, 3, 4, and 5 were 0.721, 0.803, 0.955, 0.982, and 0.994, respectively. The predictive AUC values of DL models 3 to 5 were significantly higher than that of DL model 1 (Table S2).
Variability in lesion segmentation. To assess the intraobserver and interobserver reliability of the manual segmentation, we selected 20 cases and repeated the segmentation; we also assessed interobserver reliability between the two observers. We analyzed the correlation of lesion size based on the number of pixels in the segmented images. Figure S1a shows the intraobserver correlation of image size (number of pixels) between the original segmentation (A) and the repeated segmentation (B) performed by M.S., and Figure S1b shows the interobserver correlation between the two observers (M.S. and Y.S.). Adequate agreement on lesion size was found in both the intraobserver (ICC 0.990; 95% confidence interval, 0.976-0.996) and interobserver (ICC 0.959; 95% confidence interval, 0.895-0.984) assessments.
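An ICC for absolute agreement between two segmentations can be computed from a two-way ANOVA decomposition. The sketch below implements ICC(2,1) in the Shrout and Fleiss convention on hypothetical pixel counts; whether the authors used this exact ICC form is an assumption:

```python
import numpy as np

def icc_2_1(data: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `data` is an (n_subjects, k_raters) matrix of measurements,
    here pixel counts of the segmented lesion area.
    """
    n, k = data.shape
    grand = data.mean()
    row_means = data.mean(axis=1)                    # per-lesion means
    col_means = data.mean(axis=0)                    # per-observer means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)  # between subjects
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)  # between raters
    resid = data - row_means[:, None] - col_means[None, :] + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))        # residual error
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical pixel counts for five lesions segmented by two observers.
pixels = np.array([[1200, 1180], [800, 805], [1500, 1525],
                   [950, 940], [2000, 1990]], dtype=float)
print(round(icc_2_1(pixels), 3))
```

Values near 1 indicate that almost all of the variance comes from true differences between lesions rather than from observer disagreement, which is the pattern reported above.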

Discussion
Deep learning has gained increasing attention as an artificial intelligence (AI) strategy. 4 Image recognition technology has also improved dramatically, and its use in the medical field is increasing rapidly. 25-29 As B-mode US itself provides structural information, objective recognition of B-mode images using the ML approach has the potential to become a powerful tool for the qualitative diagnosis of liver tumors. 30 In some fields, computer technology performs better than humans because of its ability to manage large amounts of information and to repeat the same routines exactly, time after time. 31 A previous study by Brehar et al. investigated the performance of CNNs in differentiating HCC from cirrhotic parenchyma using B-mode US and reported a higher performance of the CNN approach than of conventional ML methods.

Recently, multimodal representation learning for images and other information has gained much attention because of the possibility of combining latent features using a single distribution. 6 Additionally, a number of previous studies have reported that multimodal representation achieved remarkable results, with superior performance compared with unimodal approaches in various applications. 6,33,34 In multimedia applications, multimodal learning is becoming increasingly necessary and important because different modalities typically carry different information. 34 In the present study, we applied deep multimodal representation learning for the definitive diagnosis of liver tumors on B-mode images. Stepwise integration of information on patient background and blood biomarkers improved the predictive performance of the original image-processing model (DL model 1). The AUC value of the proposed multimodal network using information on patient age, sex, AST, ALT, platelet count, and albumin in addition to the B-mode image (DL model 5) reached 0.994 and significantly outperformed the original model.
To the best of our knowledge, this study is the first to investigate the clinical utility of deep multimodal representation model-based integration for the differentiation of liver tumors observed using B-mode US. Lately, DL with multimodal representation has been applied in various clinical fields. 35-37 In the HCC field, a study from China built a multimodal, multitask ML model to predict the prognosis of patients with HCC after transarterial chemoembolization. 38 Using evidence-based clinical scores, such as the American Joint Committee on Cancer stage and the Response Evaluation Criteria in Solid Tumors, in addition to HCC images from dual-phase contrast-enhanced CT, the AUCs for predicting the 3-, 5-, and 10-year survival rates were reported to be 0.85, 0.91, and 0.89, respectively.
Although DL is a powerful technique, 39,40 and a large number of image samples will become available for training DL models in the future, diagnostic imaging is not always conclusive. Physicians use clinical information, such as patient background or laboratory data, in addition to imaging modalities to make a clinical diagnosis. Therefore, if the goal of medical AI is to provide an objective reproduction of the clinical decision process of physicians, the integration of multiple modalities through multimodal AI will make this possible.
Our study had several limitations. First, we used B-mode images obtained with a limited range of US devices (Aplio 300 and Aplio 500) at a single institution to construct the CNN model. Further multicenter clinical trials are needed to fully establish the clinical utility of the CNN model for B-mode image recognition. Second, histological proof of the liver tumor was lacking in the majority of cases. However, the histological diagnosis of HCC is now rarely required, as non-invasive methods are preferred: HCC can be diagnosed with triphasic CT, contrast-enhanced (gadolinium, Primovist) MRI, or contrast-enhanced (Sonazoid) US. 41 These non-invasive modalities are widely available and have largely replaced biopsy for the diagnosis of HCC. 42 Third, because information about tumor markers, such as alpha-fetoprotein or des-gamma-carboxy prothrombin, and hepatitis viruses, such as hepatitis C virus or hepatitis B virus, was lacking for a considerable number of patients with benign tumors, it was not possible to investigate the performance of a multimodal representation model using tumor markers. A huge volume of data will be stored on cloud storage platforms in the future, and we expect that the performance of the DL model will be further improved with a larger volume of training data, including information on tumor markers and hepatitis virus infection status. In addition, DL-based frameworks could be used to develop more complicated models or systems to aid clinical decision-making in the future.
In conclusion, with the integration of patient background information and blood biomarkers in addition to US images, multimodal representation learning outperformed the CNN model that used US images alone. We expect that the deep multimodal representation model could be a feasible and acceptable tool that can effectively support the definitive diagnosis of liver tumors using B-mode US in daily clinical practice.