Multimodal Deep Learning Model on Interim 18F-FDG PET/CT for Predicting Primary Treatment Failure of Diffuse Large B-cell Lymphoma


Purpose: Predicting primary treatment failure (PTF) is necessary for patients with diffuse large B-cell lymphoma (DLBCL), since it is a prominent means of improving front-line outcomes. Using interim 18F-fluorodeoxyglucose (FDG) positron emission tomography and computed tomography (PET/CT) image data, we aimed to construct multimodal deep learning (MDL) models to predict possible PTF in low-risk DLBCL, which could enable individualized treatment decision-making in clinical practice.
Methods: From June 2016 to November 2020, 205 DLBCL patients undergoing interim 18F-FDG PET/CT scans and the front-line standard of care were enrolled. We also collected another 44 patients for external validation. We built a powerful backbone by redesigning the well-known visual recognition network Conv-LSTM in terms of network architecture and learning strategy. On top of this improved backbone, multiple MDL models using different feature fusion strategies were developed and compared, including the pixel intermixing model, separate channel model, separate branch model, quantitative weighting model, and hybrid learning model. Moreover, we proposed using a contrastive training objective in the best of these models to enhance the modal correlation of semantic embeddings and further improve prediction performance. For visualization, the region of interest was indicated using an activation map.
Results: The MDL model using the hybrid learning strategy provided the best performance in predicting possible PTF, with an accuracy of 89.76% (95% confidence interval [CI]: 84.85%–93.20%) in the test cohort. After further optimization with contrastive objective training, the accuracy improved to 91.22% (95% CI: 86.55%–94.37%). The AUCs of the contrastive hybrid learning model reached 0.926 and 0.925 in the test cohort and external validation cohort, respectively.
Conclusion: Our model showed outstanding performance in predicting PTF of low-risk DLBCL and holds promise for improving individualized clinical treatment strategies.


Introduction
Diffuse large B-cell lymphoma (DLBCL) is the most frequently observed histologic subtype of lymphoma and is particularly prevalent in Asia [1]. Patients typically present with a progressive mass involving lymph nodes and extranodal sites. Currently, over 60% of patients with DLBCL are cured after front-line standard-of-care R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone) chemotherapy [2]. Nevertheless, as many as 15% of patients undergoing this chemotherapy experience primary treatment failure (PTF), which limits their median survival time to one year at most [3]. To address this problem, novel emerging therapies, such as chimeric antigen receptor T-cell therapy, have been proposed. These therapies show high response rates in relapsed/refractory DLBCL [4] and benefit patients at high risk of PTF under R-CHOP. Ideally, these high-risk patients should be identified before they are assigned to individualized therapies.
Although the revised international prognostic index (R-IPI) and the presence of TP53 mutation are effective in predicting long-term survival among DLBCL patients, they cannot identify those patients likely to experience PTF [5,6]. With the development of medical imaging, 18F-fluorodeoxyglucose (FDG) positron emission tomography and computed tomography (PET/CT) has emerged as an effective tool that assists in diagnosis, staging, prognosis, and prediction of treatment response in oncology [7-9]. Quantitative analysis of initial and interim diagnostic imaging has recently been proposed as a prognostic marker in DLBCL. Kahle et al. [10] conducted a semi-quantitative analysis of 18FDG-PET scans to investigate differences in the presence of necrosis between DLBCL cases with and without a MYC gene rearrangement. Similarly, Schöder et al. [11] used 18FDG-PET scans at baseline, interim, and end of treatment (EoT) to identify biomarkers of response that are predictive of remission and survival. Recently, Santiago et al. [12] built a CT-based radiomics approach that uses random forest (RF) machine learning to predict refractory DLBCL. Senjo et al. [13] measured metabolic heterogeneity using 18FDG-PET/CT to predict a worse prognosis. These studies indicated the potential value of texture analysis of both PET and CT scans, which correlates well with patient survival. Studies combining the metabolic information of PET with the anatomic features of CT in lymphoma have so far yielded sparse results, although these results may add some value in predicting outcomes [14,15].
However, diagnostic imaging is routinely used for staging purposes, and few conventional radiological findings have been correlated with PTF in DLBCL [16].
Radiomics has been widely used to correlate feature information from medical images with disease outcomes, including overall survival and tumor metastasis [17]. Traditional radiomics studies generally involve three steps: (1) manual delineation of regions of interest (ROIs), (2) quantitative extraction of hand-crafted radiomic features (i.e., shape, intensity, and texture) from the ROIs, and (3) multivariate statistical analysis based on a support vector machine (SVM) or RF to determine the correlation [18]. Many studies have presented effective prediction approaches and reached important conclusions regarding outcomes in oncology. Aerts et al. [19] found a multitude of radiomic features with prognostic power in independent datasets of patients with lung and head-and-neck cancers. Seidler et al. [14] used machine learning-assisted texture analysis of dual-energy CT to distinguish metastatic head and neck squamous cell carcinoma lymph nodes from lymphoma, inflammatory, or normal lymph nodes. However, these radiomics-based methods were usually time-consuming and labor-intensive, with a semiautomatic workflow that depends on hand-crafted features. Therefore, an automatic approach is needed that learns adequate information from medical image data in a way that exceeds human capabilities.
Despite the progress made, two challenges remain in PET/CT-based multimodal data analysis for predicting PTF in lymphoma. (1) DLBCL lesions are characterized by obvious heterogeneity, and thus designing an appropriate model for complementary feature learning may further enhance prediction accuracy at EoT. To describe the high-level semantic features of PET/CT, a multimodal deep learning (MDL) model is needed; such an approach could improve both the investigation of the complementary characteristics of PET and CT and the interpretation of treatment outcomes. Unfortunately, prior deep learning-based studies on multimodal medical image analysis mostly adopted simple input-level concatenation [20,21] or output-level averaging [22] to learn complementary features, and some subtle structural information may be lost in these approaches. (2) In addition, an image volume has a special 3D structure that is regarded as a sequence of consecutive 2D slices in many medical analyses [23-25]. Inter-slice context, as distinct from intra-slice semantic information, is memorized and propagated along the z axis. In contrast to models with an isotropic fully connected layer after convolution, Donahue et al. [26] combined long-range temporal recursion with convolutional layers and introduced a novel end-to-end model named Conv-LSTM for optimized feature mapping and better visual description. This study brought new inspiration to medical image volume processing, but the above limitation is often ignored when constructing deep learning models specialized in diagnosis, staging, and outcome prediction.
A reliable model to predict PTF in low-risk DLBCL would guide treatment optimization, thereby improving efficacy and long-term survival. Therefore, we built a powerful backbone by redesigning the Conv-LSTM and further developed and validated multiple MDL models using different feature fusion strategies based on 18F-FDG PET/CT data, including the pixel intermixing model, separate channel model, separate branch model, quantitative weighting model, and hybrid learning model. The model with the best performance was ultimately trained with a contrastive training objective to attain the best accuracy. To our knowledge, this investigation of PTF prediction in DLBCL patients is the first to use PET/CT-based MDL approaches. Our results indicate that the present hybrid learning model with contrastive objective training achieves the best prediction accuracy to date. Our work thus offers a noninvasive and accurate method that indicates possible PTF before EoT and supports individualized treatment strategies for DLBCL.

Patients and dataset
All patients were recruited from Ruijin Hospital (Shanghai, China) and were part of a consecutive observational DLBCL cohort from June 2016 to November 2020, in accordance with the Declaration of Helsinki. For this retrospective study, we first analyzed 18F-FDG PET/CT data of 205 patients with de novo, histologically confirmed DLBCL according to the World Health Organization 2016 classification and with no more than one risk factor according to the IPI. The complete flow of data collection is shown in Fig. 1. We excluded patients who (a) had undergone surgical resection of all tumor lesions before immunochemotherapy, and included all remaining patients who (b) had interim 18F-FDG PET/CT examination images available after (c) receiving the R-CHOP regimen, and (d) had a definite treatment outcome at EoT. Prior to analysis, the patients were divided into two groups for comparison: PTF and non-PTF DLBCL. A total of 20 refractory patients were assigned to the PTF group, defined by progression of disease during R-CHOP or failure to achieve a complete response (CR) after at least 4 cycles. In the non-PTF group, 185 patients achieved a complete metabolic response at EoT without relapse within 6 months of therapy. Treatment response was evaluated according to standardized criteria for non-Hodgkin lymphoma [27]. For model development, the patients were randomly divided into three subsets for training, validation, and test at a ratio of 3:1:1. Following the inclusion criteria of this study, we also added 44 patients from January 2021 to July 2021 for external validation. In addition, detailed clinical characteristics of all patients were collected, including age (median with interquartile range [IQR]), age group (≤ 60 years versus > 60 years), gender, IPI (0 versus 1), stage (I-II versus III-IV), Eastern Cooperative Oncology Group (ECOG) performance status (0 versus 1), serum lactate dehydrogenase (LDH) level (normal versus elevated), extra-lymphatic involvement, and B-symptoms.
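The random 3:1:1 division described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `split_3_1_1`, the fixed seed, and the use of integer patient identifiers are assumptions, and the split is not stratified by PTF status because the text does not say whether stratification was used.

```python
import random


def split_3_1_1(patient_ids, seed=0):
    """Randomly split patient IDs into training, validation, and test
    subsets at a 3:1:1 ratio (hypothetical helper, not from the paper)."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    n = len(ids)
    n_train = n * 3 // 5
    n_val = n // 5
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # the remainder goes to the test subset
    return train, val, test


train, val, test = split_3_1_1(range(205))
print(len(train), len(val), len(test))
```

For 205 patients this integer arithmetic gives subsets of 123, 41, and 41 patients.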

Image acquisition and preprocessing
Image data were acquired from a PET/CT scanner (GE Healthcare, Waukesha, Wisconsin, USA) with ordered subset expectation maximization as the reconstruction method. Each sample contained one CT volume with a resolution of 512 × 512 pixels at 0.98 mm × 0.98 mm and one PET volume with a resolution of 128 × 128 pixels at 5.47 mm × 5.47 mm. Both volumes were reconstructed with the same number of slices, and the inter-slice distance was 3.27 mm. The first step of dual-modal image preprocessing was a standard rigid-body registration to eliminate the misalignment in coordinate spaces between the PET and CT volumes [28,29]. Next, the aligned image data were rescaled to the same resolution of 64 × 64 × 32 voxels using bicubic interpolation to reduce the computational burden and facilitate model training. Furthermore, PET data were normalized by a transformation to the standardized uptake value (SUV); this process was based on the total radionuclide dose of FDG and the weight of each patient [30].
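The SUV normalization can be illustrated with the commonly used body-weight formula, SUV = tissue activity concentration / (injected dose / body weight). The sketch below is an assumption about the form of the transformation: the text states only that it was based on the total FDG dose and the patient's weight, and decay correction of the injected dose is omitted here. The function and parameter names are hypothetical.

```python
def suv_body_weight(activity_bq_per_ml, injected_dose_bq, weight_kg):
    """Body-weight SUV, assuming 1 g of tissue occupies about 1 mL.
    Decay correction of the dose is deliberately left out of this sketch."""
    if injected_dose_bq <= 0 or weight_kg <= 0:
        raise ValueError("dose and weight must be positive")
    weight_g = weight_kg * 1000.0
    return activity_bq_per_ml / (injected_dose_bq / weight_g)


def normalize_slice(slice_bq_per_ml, injected_dose_bq, weight_kg):
    """Apply the SUV transformation to every pixel of a 2D PET slice."""
    return [[suv_body_weight(v, injected_dose_bq, weight_kg) for v in row]
            for row in slice_bq_per_ml]


# Illustrative values only: 370 MBq injected into a 70 kg patient.
print(round(suv_body_weight(5000.0, 370e6, 70.0), 3))
```

Expressing uptake relative to dose and weight makes PET intensities comparable across patients, which is the purpose of the normalization step.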

MDL model development
An overview of our framework for PTF-DLBCL prediction is shown in Fig. 2. The starting point of our model is Conv-LSTM [26], a classic deep learning architecture for natural image recognition and description that has recently been applied to medical image analysis, such as emphysema pattern classification in CT scans, where it achieved superior performance over traditional radiomics approaches [25]. The network backbone for our model (Fig. 2a) was built by redesigning the Conv-LSTM architecture with two identical encoders, one for PET data and one for CT data. To extract hidden image features from the input data, four blocks of convolution and pooling operations were applied. A recursive learning framework was then introduced through a structure called long short-term memory (LSTM) [31,32], which uses simple learned gating functions to allow learned parameters to be updated or reset. The extracted features were concatenated into a sequence, which the LSTM then transformed into a composite feature vector for the sample. As a result, the complex and heterogeneous information in the input data was distilled into high-level semantic features reflecting intra-slice spatial structures and inter-slice contextual correlations. The output of the model was a set of two continuous variables representing the prediction probability (on a scale of 0.0 to 1.0) for each category and was treated as a discrete probability distribution. The final prediction was calculated as the probability-weighted average of the categories, rounded to the nearest integer.
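The final decision rule described above, a probability-weighted average of the categories rounded to the nearest integer, can be sketched in a few lines. The coding of categories as 0 (non-PTF) and 1 (PTF) and the function name are assumptions made for illustration.

```python
import math


def predict_category(probabilities):
    """Return the probability-weighted average of category indices,
    rounded to the nearest integer (0 = non-PTF, 1 = PTF in this sketch)."""
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    expected = sum(k * p for k, p in enumerate(probabilities)) / total
    # floor(x + 0.5) rounds halves up; Python's round() rounds halves to even.
    return math.floor(expected + 0.5)


print(predict_category([0.3, 0.7]))
print(predict_category([0.8, 0.2]))
```

With only two categories, this rule is equivalent to thresholding the PTF probability at 0.5.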
On top of the improved backbone described above, multiple MDL models using different feature fusion strategies were developed and compared (Fig. 2b), including the pixel intermixing model (I), separate channel model (II), separate branch model (III), quantitative weighting model (IV), and hybrid learning model (V). The first model (I) is the only input-level approach, in contrast to the other, feature-level fusion approaches. Here, a PET slice and its corresponding CT slice were integrated into one input image via pixel intermixing for single-branch encoding [33]. In the second model (II), PET and CT data were read into one encoder through separate channels and were simply concatenated after the first group of convolution and pooling operations [34]. In the third model (III), the output feature maps from separate PET and CT encoding branches were concatenated before being fed into the subsequent LSTM predictor [35]. In the fourth model (IV), the spatial contribution of the feature maps from the PET and CT encoders was learned by a quantitative weighting strategy, which used the convolutional result as a weighting matrix [36]. In the last model (V), PET and CT features extracted by two identical encoders were combined by the hybrid learning approach, a modal fusion method we published previously [37], which generated spatial fusion maps and quantified the contribution of complementary information. These fusion maps were then concatenated with the modality-specific (i.e., PET and CT) feature maps to obtain the final fused feature maps at different scales.
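Two of the feature-level ideas above, channel concatenation and element-wise weighting, can be illustrated on toy feature maps. This is a deliberately simplified sketch and not the published fusion units: in the models described here the weights are produced by convolution, whereas below they are supplied by hand, and the convex combination in `weighted_fusion` is an assumption chosen for clarity. All names are hypothetical.

```python
def concat_channels(pet_maps, ct_maps):
    """Feature-level fusion by channel concatenation, in the spirit of the
    separate branch model: stack the PET channels and the CT channels."""
    return list(pet_maps) + list(ct_maps)


def weighted_fusion(pet_map, ct_map, weight_map):
    """Element-wise weighting of two same-sized 2D maps. Each weight in
    [0, 1] is the PET share at that location; 1 - weight is the CT share.
    This convex combination is a simplification chosen for illustration."""
    return [[w * p + (1.0 - w) * c for p, c, w in zip(p_row, c_row, w_row)]
            for p_row, c_row, w_row in zip(pet_map, ct_map, weight_map)]


pet = [[1.0, 2.0], [3.0, 4.0]]      # toy PET feature map
ct = [[10.0, 20.0], [30.0, 40.0]]   # toy CT feature map
weights = [[1.0, 0.5], [0.0, 0.5]]  # hand-picked weights, not learned

print(len(concat_channels([pet], [ct])))
print(weighted_fusion(pet, ct, weights))
```

Concatenation keeps both modalities intact and leaves their combination to later layers, whereas weighting decides the mix at each spatial location before any further processing.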
To achieve better performance, we further aimed to promote the intra-class cohesion and inter-class separation of the semantic embeddings of PTF and non-PTF cases. We therefore adopted contrastive learning [38] in the hybrid learning model. Specifically, the cross-entropy between the prediction and the ground truth (Fig. 2c) was integrated with a contrastive training objective (Fig. 2d), derived from the similarity between a pair of samples, to form the overall loss function. Trained in this way, the contrastive hybrid learning model (VI) is enhanced because embeddings of the same class lie close to each other, regardless of the modal heterogeneity of the data source domain, and far from those of different classes. Details of the construction of the overall training objective are described in Sec. 1 of the supplementary materials.
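The idea of combining a classification loss with a pairwise contrastive term can be sketched as below. The exact objective is given in the supplementary materials and is not reproduced here; this sketch substitutes a generic margin-based pairwise contrastive loss, and the margin, the weighting factor, and all function names are assumptions.

```python
import math


def cross_entropy(probabilities, label):
    """Cross-entropy of one prediction against an integer class label."""
    return -math.log(max(probabilities[label], 1e-12))


def pairwise_contrastive(emb_a, emb_b, same_class, margin=1.0):
    """Margin-based contrastive term for one pair of embeddings: pull
    same-class pairs together and push different-class pairs at least
    `margin` apart (generic form, assumed for illustration)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    if same_class:
        return dist ** 2
    return max(0.0, margin - dist) ** 2


def overall_loss(probabilities, label, emb_a, emb_b, same_class, weight=0.5):
    """Cross-entropy plus a weighted contrastive term (weight is assumed)."""
    return (cross_entropy(probabilities, label)
            + weight * pairwise_contrastive(emb_a, emb_b, same_class))


# A same-class pair with distant embeddings is penalized by the second term.
print(overall_loss([0.2, 0.8], 1, [0.0, 0.0], [3.0, 4.0], same_class=True))
```

The contrastive term is zero for identical same-class embeddings and for different-class embeddings that are already farther apart than the margin, so it only acts on pairs that violate the desired geometry.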

Model implementation and visualization
We implemented MDL models using TensorFlow 1.14 [39]

Baseline clinical characteristics
From June 2016 to November 2020, 205 low-risk DLBCL patients (median age: 55.00 years, 95 females, 110 males) were collected for model development and assigned to the primary dataset. In addition, the data of 44 patients (median age: 54.50 years, 23 females, 21 males) were included for prospective validation and assigned to the external dataset. Table 1 displays the baseline characteristics of patients in the primary and external datasets. The PTF rates were 9.76% (20/205) and 9.10% (4/44), respectively, showing that the two datasets were similar. The percentages of PTF patients and non-PTF patients with one IPI risk factor were 95.00% and 54.05% (100/185), respectively, a significant difference between the two groups in the primary dataset (p < 0.001).

Performance comparison of MDL models
The prediction performance of the MDL models in the test cohort and external validation cohort is listed in Table 2. All four MDL models using feature-level fusion strategies (separate channel model, separate branch model, quantitative weighting model, and hybrid learning model) provided better accuracy in the test and external validation cohorts than the pixel intermixing model, the only one using an input-level fusion strategy. Owing to the class imbalance between PTF and non-PTF patients, the PPVs of all MDL models were relatively low, although we still considered the results meaningful. Overall, the hybrid learning model achieved the best performance across all evaluation metrics (sensitivity = 65.00% [95% CI lower bound: 43.29%]; the full metrics are listed in Table 2).
Compared with the quantitative weighting method, which has shown good performance in computer-aided diagnosis [36], the hybrid learning model improved predictive accuracy, sensitivity, specificity, and predictive values. In addition, the normalized confusion matrices of all MDL models in distinguishing PTF from non-PTF in the test cohort are shown in Fig. 4. Notably, the overall sensitivity for the PTF group improved continuously up to the contrastive hybrid learning model, as indicated from subfigure 4a to 4f.
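The evaluation metrics and confidence intervals discussed above can be computed from confusion-matrix counts as sketched below. The interval method is not stated in the text; the Wilson score interval is used here as an assumption. The counts are also assumptions: 184 correct predictions out of 205 is chosen because it gives 89.76% accuracy, and the confusion-matrix counts are illustrative values consistent with 65.00% sensitivity among 20 PTF patients, not figures taken from Table 2.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (PTF = positive)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Assumed counts: 184 correct predictions out of 205 gives 89.76% accuracy.
low, high = wilson_interval(184, 205)
print(f"{184 / 205:.2%} ({low:.2%} to {high:.2%})")

# Illustrative counts only: 13 of 20 PTF cases detected, 14 false alarms.
print(binary_metrics(tp=13, fp=14, tn=171, fn=7))
```

Under these assumptions the computed interval is approximately 84.85% to 93.20%. The second call shows how a small positive class depresses the PPV: with these illustrative counts, 13 true positives against 14 false positives give a PPV below 50% even though overall accuracy is close to 90%.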

Interpretability of MDL models
For each sample, model attention can be visualized for clinical comprehension and validation. Here, we aimed to understand which areas of the input images and what kinds of features contributed to the prediction. Fig. 5 shows the input PET/CT images and the heat maps of the corresponding locations for three patients randomly chosen from the test cohort, revealing a common pattern that is consistently shared among all samples. The contrastive hybrid learning model paid close attention (highlighted in red) to the structure of the lesion, distinguishing it from the interfering physiological uptake of the heart and bones. This suggested that the model actively sought the areas of tumor lesion distribution to classify PTF and non-PTF. Notably, anatomic features derived from CT data were meaningful, even though radiologists mostly refer to PET information in clinical diagnosis. In addition, the compared models attended to different areas adjacent to the lesion for the same patient, a discrepancy that helps explain why these models differed in prediction performance. Specifically, the heat map of the contrastive hybrid learning model contained more gradient attention on specific regions related to the tumor itself, excluding necrosis and peripheral inflammation [42].

Discussion
The prediction of PTF for DLBCL patients has long been a prominent challenge for clinicians. In this work, we developed and validated a deep learning model that learned complementary high-level semantic features from interim 18F-FDG PET/CT images and achieved individualized and noninvasive prediction of PTF in patients with low-risk DLBCL at EoT. The major findings of our experiments were as follows: (1) To our knowledge, this work is the first to apply the MDL approach to interim PET/CT images acquired from DLBCL patients for predicting PTF. (2) The present MDL model, which incorporates the hybrid learning feature fusion strategy and the enhancement of a contrastive training objective, significantly outperformed (almost all AUCs, p < 0.05) other models using different feature fusion strategies in ablation comparisons. (3) Our work provides evidence that the contrastive hybrid learning approach on PET/CT images is an effective method for predicting PTF and stratifying risk in patients with DLBCL.
From the clinical point of view, the tumor involvement changes calculated from pathological FDG uptake in PET imaging can indicate possible PTF, but their use remains limited because they depend on data from both before and after treatment [43]. Moreover, the identification of CT parameters in the tumor region has been demonstrated to be an effective discrimination technique for predicting the PTF of different carcinomas [14,15]. However, existing models for extracting and combining invisible imaging features of PET and CT are still not adequate for predicting PTF of DLBCL. In the current work, we proposed several MDL models based on 18F-FDG PET/CT to predict PTF, and we investigated ways of fully utilizing the PET/CT data to achieve the best performance. Compared with the input-level fusion strategy for identifying PTF of DLBCL, the MDL models with various feature-level fusion strategies indicated more accurate tumor involvement areas in PET/CT images. The hybrid learning model demonstrated particularly outstanding predictive performance for PTF, as it effectively integrated PET metabolic features with the corresponding CT anatomic features.
By optimizing the hybrid learning model with the contrastive training objective, the proposed contrastive hybrid learning model significantly outperformed all the other MDL prediction models in AUC (almost all, p < 0.05). By contrast, the quantitative weighting model used the convolutional result from a fusion unit as a weighting matrix that was then multiplied by the PET and CT feature maps [36]. Element-wise multiplication was used to encode the level of importance given to information from each modality, although it considerably weakened the natural characteristics of each modality. The separate branch strategy has been widely used in medical data analysis, especially with information of different classes [35]. As shown in Table 2, the separate branch model was second to the quantitative weighting model in terms of sensitivity, specificity, and AUC. It included a layer-level fusion strategy based on simple concatenation and then applied prediction layers. Thus, some useful information associated with complementary features may be lost. In addition, the separate channel model combined the PET and CT images after the first convolutional layer to derive fused feature maps [34], which led to lopsided attention to the modality with dominant pixel strength. The pixel intermixing strategy was used to construct a type of early fusion model [33]. Here, the PET and CT images were initially fused via pixel intermixing and the intermixed images were used as model inputs. This approach shared a similar weakness with the separate channel model, and this limitation reduced the prediction accuracy.
To interpret the MDL model, we visualized the ROIs of the network by generating coarse localization heat maps. According to several MDL models, the activated areas in the heat maps were primarily located in the tumor lesion and its surrounding areas. All these areas were consistent with the predictive regions observed by experienced radiologists. Notably, the contrastive hybrid learning model paid more precise attention while avoiding interference from physiological uptake. These common patterns served as a clue to the working principle of MDL models for analyzing PET/CT data.
Although our results are promising, the present work has a few limitations that leave room for future improvements. First, this is a retrospective study based on a relatively small sample size, especially for positive samples. Although large numbers of PET/CT scans are difficult to obtain, the addition of in-house data would be of great importance to the current work, and we are actively working on this task. Notably, after calibration and targeted optimization, deep learning models would have the natural advantages of stability, repeatability, and ease of migration. As more PET/CT images from other institutions are added for model development, our MDL model is likely to become applicable to data from different institutions and to achieve better generalization. Second, to indicate PTF accurately for planning personalized treatments, the sensitivity and accuracy should be very high. Thus, the MDL model currently provides a reference result rather than a direct decision for clinical practice, given the need for a sufficiently high PPV and prospective validation. Finally, we did not consider other possible prognostic factors. We suggest that integrating biologic markers, including blood biomarkers and pathological and genetic features, may improve the accuracy and robustness of our model.

Conclusions
Our work developed several MDL models, and the best model, which integrated the hybrid learning strategy with contrastive training objective enhancement, exhibited satisfactory performance in predicting PTF for patients with DLBCL. It enabled complementary information generation and feature adaptation in multimodal learning. Its performance in ablation experiments demonstrated the effectiveness and superiority of the contrastive hybrid learning model. In addition, it is end-to-end trainable and avoids the need for radiomics experience and time-consuming manual delineations. Therefore, it provides a proof of concept for multimodal data analysis and can further help clinicians with individualized decision-making in DLBCL clinical practice.

Note that data are numbers of patients; data in parentheses are percentages.

The best metrics are shown in bold. Numbers in parentheses are 95% confidence intervals.

Figure 1
The complete flow of data collection.

Figure 2
The workflow of MDL models based on PET/CT for predicting primary treatment failure (PTF) of patients suffering from diffuse large B-cell lymphoma (DLBCL). Legend is shown in the bottom right corner.

Figure 3
Receiver operating characteristic (ROC) curves of compared MDL models for predicting PTF of low-risk DLBCL. a Test cohort. b External validation cohort.

Figure 4
Confusion matrices of compared MDL models for predicting PTF of low-risk DLBCL in the test cohort.