An Artificial Intelligence Pipeline for Diagnosing Hepatocellular Carcinoma Patients with Bile Duct Tumor Thrombus

Background and purpose: Preoperative diagnosis of bile duct tumor thrombus (BDTT) is clinically important because the surgical prognosis of hepatocellular carcinoma (HCC) patients with BDTT is significantly different from that of patients without BDTT. The preoperative diagnosis of BDTT is usually based on the identification of dilated bile ducts (DBDs) on medical images (e.g., CT and MRI images). However, doctors can easily overlook DBDs when reporting imaging results, leading to a high misdiagnosis rate in practice. The aim of the present study was to develop an artificial intelligence (AI) pipeline for diagnosing HCC patients with BDTT using medical images. Methods: The proposed AI pipeline included two stages. First, the object detection neural network Faster R-CNN was adopted to identify DBDs; then, an HCC patient was diagnosed as having BDTT if the proportion of images with at least one identified DBD exceeded a threshold value. Four-fold cross validation was used to evaluate the performance of the proposed AI pipeline. Results: The proposed AI pipeline was applied to a real dataset consisting of CT images collected from 34 HCC patients (16 with BDTT and 18 without BDTT). The average true positive rate for identifying DBDs per patient was 0.92, while the patient-level true positive rate for diagnosing BDTT was 0.94. The AUC of the proposed pipeline was 0.92 (95% CI: 0.83, 1.00), compared with 0.71 (95% CI: 0.52, 0.89) by random forest. Conclusions: This study is the first to propose an AI pipeline to identify DBDs and diagnose BDTT, and the high accuracies demonstrate that it is successful in the diagnosis of BDTT.
Abbreviations: I-PP: image-level positive rate; ROC: receiver operating characteristic; AUC: area under ROC curve; CI: confidence interval; DAM: data augmentation module; RPN: region proposal network; R-CNN: region CNN; FPN: feature pyramid network.


Introduction
In 1947, Mallory et al. reported the first hepatocellular carcinoma (HCC) patient with bile duct tumor thrombus (BDTT) who had a typical symptom of obstructive jaundice [1]. The clinical incidence of BDTT was shown to be about 0.5%-12.9% among HCC patients [2,4] (0.5%-9.0% as reported in [3]; 1.2%-12.9% as reported in [1]). HCC patients with BDTT were shown to have a worse prognosis than HCC patients without BDTT after liver resection or liver transplantation [1,2,3,4,5,6,7,8]. Specifically, HCC patients with BDTT were shown to have a higher recurrence rate [4,5,6] (e.g., the one-year recurrence rates were estimated to be 70.3% and 34.8% for HCC patients with and without BDTT, respectively, as reported in [6]) and a lower overall survival rate (Table 1) [1,2,3,4,6,7,8]. Early diagnosis is important for improving the prognosis of such HCC patients, as it can help doctors make better preoperative decisions to remove as much of the residual tumor in the bile duct as possible [9]. Currently, the diagnosis and evaluation of HCC with BDTT before surgery mainly depend on clinical symptom judgment and image examination [2,5]. However, most HCC patients with BDTT have no specific clinical symptoms at early stages [10]. Furthermore, obstructive jaundice caused by BDTT's invasion into the common hepatic duct or common bile duct can be easily misdiagnosed as a symptom of biliary stones or cholangiocarcinoma [1,10,11,12].
With the development of medical imaging technology, CT and MRI scans have been widely accepted as safe and valuable methods for diagnosing BDTT. Although the imaging features of BDTT and their correlations with the corresponding histopathologic manifestations have been reported in the literature [10,12], it is still hard for doctors to identify BDTT on medical images with the naked eye. Many clinical studies showed that the general clinical feature of HCC with BDTT is the appearance of intrahepatic dilated bile ducts (DBDs) [10,13]. Based on this fact, doctors can diagnose BDTT by identifying DBDs near intrahepatic tumors. DBDs caused by BDTT usually present as dark linear structures on CT images (Figure 1(a)). DBDs occupy only a small part of the CT image, and these inconspicuous structures tend to be overlooked if doctors lack sufficient awareness of their appearance, resulting in a high misdiagnosis rate in practice. To the best of our knowledge, no automatic method has been developed to identify DBDs, despite their importance in the accurate preoperative diagnosis of BDTT. The advent of the era of big data and the improvement of computer hardware have promoted the development of artificial intelligence (AI), which enables mining disease information in medical images such as CT and MRI images. Many AI algorithms have been developed to effectively utilize the disease information contained in medical images. Among these AI methods, neural network methods have shown their competitiveness in many fields. As one of the most representative deep learning methods, convolutional neural networks (CNNs) have been widely applied in image recognition, providing state-of-the-art results in image classification, semantic segmentation, object detection, and other fields [14].
Based on CNNs, more advanced neural networks have been proposed, such as the fully convolutional network (FCN) [15], generative adversarial network (GAN) [16], and variational auto-encoder (VAE) [17]. Many researchers have applied these neural networks in medical fields and achieved promising results in identifying bone fractures [18] and tumor regions [19], screening high-risk subjects of a specific disease [20], locating biomarkers [21], and segmenting tissues and organs [22,23,24,25].
Many neural network methods have been developed specifically for object detection. Compared with traditional classification methods, object detection methods can not only identify objects but also locate them. Well-developed object detection neural networks include Faster R-CNN [26], YOLO [27], and SSD [28]. These object detection methods have been shown to have broad application prospects in clinical diagnosis. For example, Thian et al. [29] applied Faster R-CNN to locate the fracture area on wrist radiographs, and achieved sensitivities as high as 91.2% and 96.3% based on forward images and lateral images, respectively. Their object detection method was shown to outperform traditional CNN methods such as the VGG network used by Olczak et al. [30] and the Inception V3 network used by Kim and MacKinnon [31]. Boot and Irshad [32] combined Faster R-CNN with Unet [22] to locate and segment breast tumors, and achieved a micro average F1 score of 0.805 and a macro average F1 score of 0.843.
Fig. 1 (a) CT image of an HCC patient with BDTT (DBDs caused by BDTT were marked in capsules, tumor area was marked in an ellipse); (b) CT image with four labeled bounding boxes for DBDs; (c1)-(c3) Three consecutive CT images of an HCC patient with BDTT. In all images, tumors were marked in ellipses and DBDs caused by BDTT were marked in capsules
The current study aimed to develop an AI pipeline for preoperative diagnosis of BDTT through identifying DBDs on CT images. To the best of our knowledge, this is the first computer-aided method developed for diagnosing BDTT. Based on a total of 34 HCC patients, the developed pipeline was shown to have an area under the ROC curve as high as 0.92, making it a powerful tool for diagnosing BDTT. The proposed AI pipeline can also locate the DBDs caused by BDTT.

Materials
CT images of 16 HCC patients with BDTT were retrospectively collected from four hospitals in China that are well known for the diagnosis and treatment of hepatobiliary diseases (Fujian Provincial Hospital, n = 3; West China Hospital, Sichuan University, n = 8; Mengchao Hepatobiliary Hospital, Fujian Medical University, n = 1; The First Affiliated Hospital of Fujian Medical University, n = 4). As controls, 18 HCC patients without BDTT were randomly collected from Fujian Provincial Hospital. The present study was approved by the institutional review boards of all relevant institutions, and informed consent was obtained from the patients or their guardians for their data to be used for research purposes. Throughout this paper, the 16 HCC patients with BDTT are denoted as the case group and the 18 HCC patients without BDTT are denoted as the control group.
All patients underwent preoperative serological examinations, preoperative imaging examination, collection of demographic information (e.g., sex and age), testing for hepatitis B virus infection, assessment of liver (reserve) function, surgical resection, and postoperative histopathological and immunohistochemical examinations. The diagnosis of HCC with BDTT was based on postoperative histopathological examination. Table 2 summarizes the demographic, clinical, and pathological variables of the patients included in this study. The demographic variables included age and sex. Sex was balanced between the two groups (P = 0.61). On the other hand, the control group was suggestively older than the case group (mean age: 52.75 in the case group vs. 59.50 in the control group, P = 0.10). This suggests that potential confounding effects due to both sex and age were well controlled (the age effect is considered controlled if the control group is as old as or older than the case group). As far as the clinical variables are concerned, the difference between the case group and the control group was significant for the total bilirubin level (P = 0.01), albumin level (P = 0.02), ALT level (P = 0.02), and AST level (P < 0.01). Specifically, compared with the control group, the patients in the case group had significantly higher serum levels of total bilirubin, ALT, and AST, while their serum albumin levels were significantly lower. The differences in the above clinical variables demonstrate worse liver function in the HCC patients with BDTT. The pathological variables included microvascular invasion, Edmondson-Steiner grade, and the AJCC 8th Staging System. The difference between the case group and the control group was significant for the Edmondson-Steiner grade (P < 0.01), which is a sign of poor differentiation of tumor cells.
The patients in the case group had suggestively more advanced AJCC cancer stages (stage III/IV) than the control group (68.8% versus 38.9%, P = 0.08).

Results
After filtering and pre-processing, 1160 and 1451 CT images were retained for the case group and the control group, respectively. After labeling the CT images of the case group, a total of 756 bounding boxes were labeled on 553 CT images, of which 366 (66.2%) contained one bounding box, 172 (31.1%) contained two bounding boxes, 14 (2.5%) contained three bounding boxes, and only 1 (0.2%) contained four bounding boxes (refer to Figure 1 (b)). The largest annotated bounding box accounted for less than 6.0% of the whole image, suggesting that DBDs were inconspicuous and could be easily overlooked by the naked eye. Table 3 shows the DBD-level true positive rates (D-TPRs) for the 16 patients in the case group. The average D-TPR value was 0.92 and the D-TPR ranged from 0.80 to 1.00, indicating a high success rate of detecting DBDs. Figure 2 shows several examples of successfully detected DBDs. In these examples, for most DBDs, Faster R-CNN output multiple bounding boxes that overlapped the DBDs with a high IoU value (refer to Appendix A of the Supplementary Materials for the definition of IoU).
Table 4 shows the image-level positive rates (I-PPs) for both the case group and the control group. The average I-PP values were 0.78 and 0.50 for the case group and the control group, respectively. Most I-PPs for the case group were equal to or above 0.63 (93.8%), while the majority of I-PPs for the control group were below 0.63 (77.8%). The I-PPs of the patients are graphically displayed in Figure 3 (a) via box plots, which intuitively reflect the difference between the case group and the control group. Among the HCC patients with BDTT, only one patient had a relatively low I-PP of 0.55. The poor identification result for this patient could be due to his/her markedly worse image quality compared with the other patients.
ROC curves for the proposed pipeline and the random forest algorithm are displayed in Figure 3 (b). For the proposed pipeline, the AUC value was 0.92 (95% CI: 0.83, 1.00) and the optimal I-PP threshold value was 0.63, with the corresponding sensitivity and specificity being 0.94 and 0.78, respectively. For random forest, the AUC value was 0.71 (95% CI: 0.52, 0.89) and the optimal threshold value was 0.59, with the corresponding sensitivity and specificity being 0.63 and 0.83, respectively. The AUC of the proposed pipeline was significantly larger than that of the random forest algorithm (P = 0.02).

Discussion
In this study, an AI pipeline was developed for accurate preoperative diagnosis of HCC patients with BDTT through automatically identifying DBDs on CT images by the object detection neural network Faster R-CNN. CT images of only 16 HCC patients with BDTT were used in this study due to the rarity of BDTT and the difficulty in obtaining complete imaging data of HCC patients with BDTT. The current study could be the first attempt to apply an object detection method to identify DBDs.
Analysis results of the multicenter data demonstrate that the trained Faster R-CNN had a promising performance in identifying DBDs. As shown in Table 3, for each patient in the case group, most of the DBDs were successfully detected, and for nearly half (43.8%) of the patients all DBDs were identified. In addition, the bounding boxes output by Faster R-CNN were quite precise (refer to Figure 2). In the DBD identification results, false negatives (undetected DBDs) arose when the DBDs were too small (Figure 4 (a1)) or too inconspicuous (Figure 4 (a2)), while the false positives were mainly due to confusing structures such as DBD-like tumor regions (Figure 4 (b1)), gaps at the junction between the liver and other tissues (Figure 4 (b2)), and irrelevant structures outside the liver region (Figure 4 (b3)). Nevertheless, most of the false positive findings, such as those shown in Figure 4 (b2)-(b3), were easy for doctors to distinguish.
A significant difference in the identified DBD proportions (I-PPs) was observed between the case group and the control group (Table 4 and Figure 3 (a)), in accordance with the conclusion drawn in the literature [13]. This motivated a pipeline for diagnosing HCC patients with BDTT (Figure 5). The proposed pipeline was shown to significantly outperform the traditional machine learning method random forest, which did not utilize image data (AUCs: 0.92 vs. 0.71), demonstrating the superiority of the proposed pipeline.
The attractive performance of the proposed pipeline could be attributed to two aspects. On one hand, the two-stage anchor-based object detection neural network Faster R-CNN was applied to identify DBDs, which first output object proposals then extracted regional features based on these proposals for detecting DBDs. Such design philosophy of looking and thinking twice is consistent with human vision system and it can result in a higher detection accuracy than one-stage networks [36]. On the other hand, the CT images were first manually filtered and processed then labeled by an experienced doctor. The heterogeneity of CT images from different centers was eliminated as much as possible, so that the CT image difference between the case group and the control group was mainly due to the presence or absence of DBDs. Furthermore, the DBD positions were accurately located, providing a prerequisite for training Faster R-CNN.
Although postoperative histopathological review of surgical specimens remains the gold standard for diagnosing and evaluating BDTT, it is of great significance to determine the existence of BDTT (and its location, if any) before surgical resection. This motivated us to use an object detection neural network instead of conventional classification neural networks. The advantage of using the object detection method is clear: it is difficult for classification neural networks to focus on inconspicuous structures such as DBDs, while the object detection method not only identifies which images have DBDs but also frames the DBD positions. Since the DBD positions are informative for typing and positioning BDTT, the proposed pipeline also provides useful information for doctors to evaluate the condition of BDTT.
The current AI pipeline developed in this study can potentially be improved in several respects. First, since the scope of object detection is the whole image, irrelevant structures outside the liver region can become a source of false positive findings. Accordingly, the false positive rate could be reduced by segmenting the liver region from CT images in advance. However, training an organ segmentation neural network such as Unet [22] requires pixel-level labels, which are much more difficult to obtain than bounding box labels. Second, individual images in the portal venous phase were treated as the research objects in this study, which means that the correlation between the CT images was not utilized. Therefore, the DBD identification accuracy could be improved by mining the correlation among the sequential images in the portal venous phase. Third, the proposed pipeline only utilizes CT images of the HCC patients; it could be combined with other clinical evidence to potentially improve the diagnosis of BDTT. Finally, low-quality images may lower the detection accuracy, as demonstrated in the previous section. Consequently, the proposed pipeline can potentially be improved by using higher quality images.

Conclusion
In summary, the present study demonstrates the feasibility of AI-assisted preoperative diagnosis of BDTT through the identification of DBDs on CT images. The proposed AI pipeline achieves an average true positive rate for identifying DBDs per patient of 0.92, while achieving a patient-level true positive rate for diagnosing BDTT of 0.94. It is noteworthy that the AUC value of patient-level diagnosis of BDTT achieved by the proposed method was 0.92 (95% CI: 0.83, 1.00), compared with 0.71 (95% CI: 0.52, 0.89) achieved by random forest. Furthermore, the accurate locations of DBDs located by trained Faster R-CNN can help doctors evaluate the condition of BDTT.
The high accuracies make our AI pipeline a powerful tool for diagnosing BDTT and locating DBDs caused by BDTT. The pipeline proposed in this study can potentially be extended to other medical applications, such as the diagnosis of lymph node enlargement.

Descriptive statistical analysis
Continuous variables were expressed as mean ± standard deviation and the categorical variables were expressed as their numbers. The Wilcoxon-Mann-Whitney test and the Fisher exact test were adopted to compare continuous variables and categorical variables, respectively, between the case group and the control group. The DeLong test implemented in the R package pROC [33] was adopted to compare the AUC values between two ROC curves. All the statistical analysis was performed using R version 4.0.3 [34]. P < 0.05 was considered statistically significant.

The proposed BDTT diagnosis pipeline
The proposed AI pipeline for diagnosing BDTT is displayed in Figure 5. It consists of three steps: 1) image filtering, pre-processing, and label making (Section 6.3), 2) DBD identification (Section 6.4), and 3) patient-level diagnosis of BDTT (Section 6.6).
Fig. 5 The proposed AI pipeline for diagnosing BDTT. First, CT images in the portal venous phase with tumors were selected, then the selected images were center-cropped and resized to a unified size. After filtering and pre-processing, the trained Faster R-CNN was applied to identify DBDs on the resulting images. Finally, a patient was diagnosed as HCC with BDTT if the image-level positive proportion (I-PP) was greater than the optimal threshold

6.3. Image filtering, pre-processing, and label making
Some raw CT images were in grayscale JPEG (Joint Photographic Experts Group) format while others were in DICOM (Digital Imaging and Communications in Medicine) format. To remove potential confounding effects due to the formats, it was necessary to convert the raw CT images so that they had the same format. First, all the DICOM files were converted to lossless grayscale JPEG images. Then, the resulting grayscale images were converted to pseudo-RGB images by tripling their channels with each channel having the same pixel value, so that they could serve as inputs of the object detection neural network.
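The channel-tripling step can be illustrated with a minimal sketch in plain Python. It assumes a grayscale image stored as a list of rows of integer pixel values; the function name `to_pseudo_rgb` and the toy image are hypothetical and are not taken from the authors' code.

```python
def to_pseudo_rgb(gray):
    """Replicate each grayscale pixel value across three channels.

    `gray` is assumed to be a list of rows, each row a list of pixel values.
    Returns a height x width x 3 nested list whose channels are identical.
    """
    return [[[v, v, v] for v in row] for row in gray]


# Toy 2 x 2 image used only for illustration (not real CT data).
gray_image = [[0, 128], [255, 64]]
rgb_image = to_pseudo_rgb(gray_image)
```

In practice an array library would likely be used for this step; the nested-list version only makes the operation explicit.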
With CT scanning, each patient's images can be divided into four phases: equilibrium phase, arterial phase, portal venous phase, and delayed phase. The brightness contrast between DBDs and normal tissues is most obvious in the portal venous phase, as shown in Figure 6. Note that images in the portal venous phase comprise a series of consecutive abdominal scan slices, and the liver region generally occupies the middle part of those slices. Furthermore, intrahepatic DBDs usually appear near tumor regions. Therefore, after format conversion, only those CT images in the portal venous phase with tumors present were retained. Such filtering could balance data between the case group and the control group. The image filtering procedure was reviewed by an experienced doctor. Since the CT images of the case group were provided by four hospitals, their sizes varied to some extent due to different scanning equipment. Therefore, all images were resized to a unified size and noninformative parts were discarded.
After filtering and pre-processing, the heterogeneity of CT images from different centers was largely eliminated and the main difference between images of the case group and the control group is whether DBDs are present or not. Several examples of preprocessed CT images with tumors and DBDs are presented in Figure 1, which shows that most DBDs can appear on several consecutive images in the portal venous phase since CT images are usually generated by continuous cross-section scanning of human body (Figure 1 (c1)-(c3)). Finally, DBDs were labeled by an experienced doctor using the software LabelImg [35], which extracted the location information of DBDs, i.e. the coordinates of the upper left and lower right corner of the bounding boxes as XML files. The resulting CT images combined with the label information were used to train the object detection neural network.
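As a minimal sketch of how such labels might be read back, the snippet below parses bounding boxes from an XML annotation. It assumes the Pascal VOC-style layout that LabelImg commonly writes (`object` elements containing a `bndbox` with `xmin`, `ymin`, `xmax`, `ymax`); the sample annotation, the file name, the class name `DBD`, and the helper `read_boxes` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation with two labeled DBDs (coordinates are made up).
SAMPLE_XML = """<annotation>
  <filename>slice_001.jpg</filename>
  <object>
    <name>DBD</name>
    <bndbox><xmin>120</xmin><ymin>85</ymin><xmax>160</xmax><ymax>110</ymax></bndbox>
  </object>
  <object>
    <name>DBD</name>
    <bndbox><xmin>200</xmin><ymin>140</ymin><xmax>230</xmax><ymax>155</ymax></bndbox>
  </object>
</annotation>"""


def read_boxes(xml_text):
    """Return a list of (xmin, ymin, xmax, ymax) tuples, one per labeled object."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        # Upper-left corner is (xmin, ymin); lower-right corner is (xmax, ymax).
        boxes.append(tuple(int(bndbox.findtext(tag))
                           for tag in ("xmin", "ymin", "xmax", "ymax")))
    return boxes


boxes = read_boxes(SAMPLE_XML)
```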

Object detection
Faster R-CNN [26], a popular two-stage neural network designed for object detection, was adopted to identify DBDs. Faster R-CNN consists of four components, namely the Data Augmentation Module (DAM), CNN backbone, Region Proposal Network (RPN), and Region CNN (R-CNN). Compared with one-stage object detection methods such as YOLO [27] and SSD [28], Faster R-CNN tends to be more accurate, albeit at a higher computational cost [36]. The flowchart of Faster R-CNN for identifying DBDs is shown in Figure 7. To enrich the data, multiple data augmentation techniques (e.g., random shift and random rotation) were implemented in the DAM component during the training process. Meanwhile, transfer learning was adopted in this study, with ResNet-50 [37], pretrained on the ImageNet dataset [38], serving as the CNN backbone for extracting feature maps in combination with a feature pyramid network (FPN) [39]. By combining FPN with the backbone, multi-scale feature maps were introduced and strong semantic features were propagated through a top-down pathway, so that the localization ability could be greatly enhanced. FPN was shown to be very effective in detecting small objects [40,41]. After feature extraction, RPN and R-CNN were applied sequentially to identify DBDs. First, RPN was used to identify regions likely to contain DBDs as proposals and to correct their coordinates. Then, R-CNN was used to classify the proposals and further refine their coordinates. In R-CNN, RoI Align was used to extract uniform features from the proposals and fully connected layers (FCL) were used for classification and further refinement. Model configurations and training details of Faster R-CNN are included in the Supplementary Materials. We made the code for training Faster R-CNN available at https://github.com/Daniel-1997/AI-pipeline-for-diagnosing-BDTT.

Evaluation metrics for DBD identification
Due to limited data, four-fold cross validation was used to evaluate the model's performance. Specifically, the 16 case patients were evenly and randomly divided into four subsets. Each of the four subsets was used in turn for validation while the remaining three subsets were used for training, so the training process was repeated four times.
DBD-level true positive rate (D-TPR) was used as an evaluation metric to assess the model's performance in identifying DBDs. Specifically, each DBD was considered to be successfully detected if at least one bounding box output by Faster R-CNN overlapped with the DBD. Then, for each patient, D-TPR was defined as the ratio of the number of successfully detected DBDs to the number of true DBDs.
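The patient-level split can be sketched as follows. This is an illustrative sketch only: the integer patient identifiers, the fixed random seed, and the function name `four_fold_splits` are assumptions, and the authors' actual assignment of patients to subsets is not given in the text.

```python
import random


def four_fold_splits(patient_ids, n_folds=4, seed=0):
    """Randomly divide patients into equal folds and return (train, validation) pairs."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seed is an assumption, for reproducibility
    folds = [ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        validation = folds[k]
        train = [p for j, fold in enumerate(folds) if j != k for p in fold]
        splits.append((train, validation))
    return splits


# 16 hypothetical case patients numbered 1..16.
splits = four_fold_splits(range(1, 17))
```

Splitting by patient rather than by image keeps all slices of one patient on the same side of the split, which matches the per-patient subsets described above.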
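The D-TPR definition can be made concrete with a short sketch. It assumes boxes given as `(xmin, ymin, xmax, ymax)` tuples and treats a DBD as detected when some predicted box on the same image has IoU greater than `min_iou`; the default of 0.0 (any overlap) is an assumption, since the text does not state an overlap threshold. All box coordinates below are made up.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def d_tpr(images, min_iou=0.0):
    """D-TPR for one patient.

    `images` is a list of (true_boxes, predicted_boxes) pairs, one per CT image.
    `min_iou` is an assumed overlap criterion, not a value from the paper.
    """
    total = sum(len(true_boxes) for true_boxes, _ in images)
    detected = sum(
        1
        for true_boxes, pred_boxes in images
        for t in true_boxes
        if any(iou(t, p) > min_iou for p in pred_boxes)
    )
    return detected / total


# Hypothetical patient: three labeled DBDs on two images, two of them overlapped.
patient_images = [
    ([(10, 10, 20, 20)], [(12, 12, 22, 22)]),
    ([(30, 30, 40, 40), (50, 50, 60, 60)], [(31, 29, 41, 39)]),
]
patient_d_tpr = d_tpr(patient_images)
```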
In each of the four training processes, image-level positive rate (I-PP) was defined as the proportion of images with at least one detected DBD for each patient in both case validation set and control group. I-PPs for each patient in the control group were averaged over the four training processes.
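A minimal sketch of the I-PP computation is given below. It assumes that the detector output for one patient is available as a list holding, for each image, the list of detected boxes; the function names and the toy values are hypothetical.

```python
def image_level_positive_proportion(detections_per_image):
    """Proportion of a patient's images with at least one detected DBD."""
    n_positive = sum(1 for boxes in detections_per_image if len(boxes) > 0)
    return n_positive / len(detections_per_image)


def average_over_runs(ipp_per_run):
    """Average the I-PP of one control patient over the training runs."""
    return sum(ipp_per_run) / len(ipp_per_run)


# Hypothetical patient with four images, three of which contain a detection.
detections = [[(12, 12, 22, 22)], [], [(31, 29, 41, 39)], [(5, 5, 9, 9), (40, 40, 48, 46)]]
ipp = image_level_positive_proportion(detections)

# Hypothetical control patient scored by the four trained models.
control_ipp = average_over_runs([0.5, 0.4, 0.6, 0.5])
```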

Patient-level diagnosis of BDTT
After identifying DBDs, an HCC patient was diagnosed as having BDTT if his/her I-PP was larger than a given threshold. Sensitivity, specificity, the receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC) were used as evaluation metrics for the diagnosis method. For each given I-PP threshold, sensitivity was defined as the proportion of HCC patients with BDTT who were diagnosed as positive, and specificity was defined as the proportion of HCC patients without BDTT who were diagnosed as negative. The ROC curve was then drawn based on sensitivity vs. specificity pairs across various I-PP thresholds. The AUC value was defined as the area under the ROC curve. All evaluation metrics were calculated using the R package pROC [33].
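The thresholding rule and the derived metrics can be sketched in plain Python. This is not the pROC implementation used in the study: the AUC is computed here as the proportion of case-control pairs ranked correctly (ties counted as one half), which equals the area under the empirical ROC curve, and the "optimal" threshold is chosen by Youden's J statistic over midpoints between observed I-PP values, which is an assumption because the text does not name the criterion. The I-PP values are made up.

```python
def sens_spec(case_ipps, control_ipps, threshold):
    """Sensitivity and specificity when I-PP > threshold is diagnosed as BDTT."""
    sens = sum(1 for x in case_ipps if x > threshold) / len(case_ipps)
    spec = sum(1 for x in control_ipps if x <= threshold) / len(control_ipps)
    return sens, spec


def auc(case_ipps, control_ipps):
    """Area under the empirical ROC curve via pairwise comparison."""
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in case_ipps
        for k in control_ipps
    )
    return wins / (len(case_ipps) * len(control_ipps))


def best_threshold(case_ipps, control_ipps):
    """Threshold maximizing Youden's J = sensitivity + specificity - 1 (assumed criterion)."""
    values = sorted(set(case_ipps) | set(control_ipps))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: sum(sens_spec(case_ipps, control_ipps, t)) - 1)


# Hypothetical I-PP values, not the study data.
cases = [0.9, 0.8, 0.7, 0.45]
controls = [0.6, 0.5, 0.4, 0.3]
threshold = best_threshold(cases, controls)
sensitivity, specificity = sens_spec(cases, controls, threshold)
area = auc(cases, controls)
```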

Random forest as an alternative method
Random forest [42] was used as an alternative diagnosis method, which has become one of the most commonly used machine learning algorithms in classification and regression due to its high accuracy and fast training speed. In the task of classification, random forest ensembles multiple decision trees, and its output category is determined by the mode of categories output by individual trees. Those preoperative clinical variables significantly different between the case group and the control group were selected as inputs of random forest.
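The voting rule described above, in which the forest outputs the mode of the categories predicted by the individual trees, can be illustrated with a small sketch. The tree outputs and the function name `forest_vote` are hypothetical; fitting the trees themselves is outside the scope of this sketch.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the most frequent class label among the individual tree outputs."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical outputs of five decision trees for one patient.
votes = ["BDTT", "no BDTT", "BDTT", "BDTT", "no BDTT"]
diagnosis = forest_vote(votes)
```

With an even number of trees a tie is possible; in this sketch a tie resolves to the label that was encountered first.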