Multimodal deep learning model on interim [18F]FDG PET/CT for predicting primary treatment failure in diffuse large B-cell lymphoma

The prediction of primary treatment failure (PTF) is necessary for patients with diffuse large B-cell lymphoma (DLBCL), since it serves as a prominent means of improving front-line outcomes. Using interim 18F-fluoro-2-deoxyglucose ([18F]FDG) positron emission tomography/computed tomography (PET/CT) imaging data, we aimed to construct multimodal deep learning (MDL) models to predict possible PTF in low-risk DLBCL. Initially, 205 DLBCL patients who underwent interim [18F]FDG PET/CT scans and the front-line standard of care were included in the primary dataset for model development. An additional 44 patients were included in the external dataset for generalization evaluation. Based on the powerful backbone of the Conv-LSTM network, we incorporated five multimodal fusion strategies (pixel intermixing, separate channel, separate branch, quantitative weighting, and hybrid learning) to make full use of PET/CT features and built five corresponding MDL models. We then identified the best-performing model, the hybrid learning model, and optimized it by integrating a contrastive training objective to further improve its prediction performance. The final model with contrastive objective optimization, named the contrastive hybrid learning model, performed best, with an accuracy of 91.22% and an area under the receiver operating characteristic curve (AUC) of 0.926 in the primary dataset. In the external dataset, its accuracy and AUC remained at 88.64% and 0.925, respectively, indicating good generalization ability. The proposed model achieved good performance, validated the predictive value of interim PET/CT, and holds promise for directing individualized clinical treatment.

• The proposed multimodal models achieved accurate prediction of primary treatment failure in DLBCL patients.
• An appropriate feature-level fusion strategy can draw samples of the same class close to each other regardless of the modal heterogeneity of the data source domain, positively impacting prediction performance.
• Deep learning validated the predictive value of interim PET/CT in a way that exceeded human capabilities.


Introduction
Diffuse large B-cell lymphoma (DLBCL) is the most frequently observed histologic subtype of lymphoma and is particularly prevalent in Asia [1]. In clinical DLBCL treatment, R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone) chemotherapy is the front-line standard of care [2]. However, as many as 15% of patients undergoing this chemotherapy experience primary treatment failure (PTF), which limits their median survival time to at most 1 year [3]. In addition to high-dose chemotherapy and autologous stem-cell transplantation, novel therapies, including chimeric antigen receptor T-cell therapy, have shown promising efficacy and provide increasing alternatives for patients at high risk of PTF under R-CHOP [4][5][6]. Thus, it is crucial to identify patients at high risk of PTF so they can promptly receive a more potent therapy to improve their survival time and quality of life.
Recently, some studies have identified refractory DLBCL patients and predicted their survival after treatment with R-CHOP chemotherapy [7,8]. The revised international prognostic index (R-IPI) redistributed the IPI factors and predicts the outcome of DLBCL patients based on the number of negative prognostic factors present at diagnosis [9]. Many genetic changes, such as TP53 and KMT2D mutations, have been found to be independent markers of poor prognosis in DLBCL patients [10]. Nevertheless, the R-IPI has limited utility in predicting PTF, especially in low-risk DLBCL, as most such patients by definition have a favourable risk profile [7]. Confirmation of TP53 and KMT2D mutations relies on large-scale high-throughput gene sequencing, which requires extra time and intensive labour [8]. With the development of medical imaging, 18F-fluoro-2-deoxyglucose ([18F]FDG) positron emission tomography/computed tomography (PET/CT) serves as an effective tool for the diagnosis and staging of DLBCL [11][12][13]. Initial studies based on Kaplan-Meier survival analyses suggested that PET/CT scans performed early during treatment (that is, interim PET/CT), after 4 cycles of R-CHOP chemotherapy, could identify patients likely to relapse [14,15]. However, no research has determined the predictive value of interim PET/CT in identifying PTF in DLBCL patients, and this remains an unmet need.
With the rapid development of artificial intelligence, deep learning approaches have become popular mathematical models for analysing PET/CT data [16]. In the field of lymphoma, they have been applied to tasks such as lymph node detection [17,18], identification of normal physiological FDG uptake [19], lesion segmentation [20,21], and, only recently, survival prediction [22]. Capobianco et al [23] estimated total metabolic tumour volume (TMTV) from [18F]FDG PET and showed that TMTV obtained with a deep learning-based method had significant prognostic value for overall survival in DLBCL patients. In addition, the identification of CT features has proven to be an effective discrimination technique for PTF prediction in different carcinomas [24][25][26]. Taking full advantage of the complementary information of multimodal imaging, such as PET and CT, therefore yields rich image features for PTF prediction, but this requires appropriate multimodality fusion strategies [27]. Depending on the stage of the deep learning pipeline at which fusion occurs, multimodality fusion strategies mainly fall into three types: input-level concatenation [28,29], feature-level fusion [30], and output-level averaging. Accordingly, we utilized a powerful backbone from the well-known Conv-LSTM visual recognition network [31] for high-level feature extraction and developed and validated multiple multimodal deep learning (MDL) models using different feature fusion strategies: the pixel intermixing model, separate channel model, separate branch model, quantitative weighting model, and hybrid learning model. Moreover, the model with the best performance was selected and finally trained with a contrastive learning objective [32] in addition to the cross-entropy loss to establish the most accurate model and evaluate the predictive value of interim PET/CT in identifying PTF in DLBCL patients.
To the best of our knowledge, our work on PTF prediction in DLBCL patients is the first such investigation to use PET/CT-based MDL approaches. Our results indicate that the proposed contrastive hybrid learning model achieves the best prediction accuracy to date. Our work validates the predictive value of interim PET/CT in identifying PTF in DLBCL. Moreover, it provides a noninvasive and accurate method that indicates possible PTF early during treatment, so that more potent therapies may be selected.

Patients and dataset
All patients were recruited from Ruijin Hospital and comprised two subsets of a consecutively collected observational DLBCL cohort; the study was conducted in accordance with the Declaration of Helsinki. A complete flowchart of the data collection process is shown in Fig. 1. Prior to analysis, the patients were divided into two groups for comparison: PTF and non-PTF DLBCL. The specific rules for data inclusion, exclusion, and grouping are described in Sec. 1 of the supplementary materials. In addition, detailed clinical characteristics were collected for all patients, including age (median with interquartile range [IQR]), age range (≤ 60 years versus > 60 years), sex, IPI (0 versus 1), stage (I-II versus III-IV), Eastern Cooperative Oncology Group (ECOG) performance status (0 versus 1), serum lactate dehydrogenase (LDH) level (normal versus elevated), extralymphatic involvement, and B symptoms.

Image acquisition and preprocessing
Image data were acquired from a PET/CT scanner (GE Healthcare) with ordered subset expectation maximization reconstruction. Each sample contained one CT volume with a resolution of 512 × 512 pixels at 0.98 mm × 0.98 mm and one PET volume with a resolution of 128 × 128 pixels at 5.47 mm × 5.47 mm. Both volumes were reconstructed with the same number of slices, with an inter-slice distance of 3.27 mm. The standard data preprocessing routine is shown in Sec. 2 of the supplementary materials [33,34]. Of note, the number of positive samples was markedly smaller than the number of negative samples, a discrepancy due to the favourable prognosis of low-risk DLBCL in the clinic. To reduce the overfitting impact of this data imbalance, we employed data augmentation only during network training in each fold of the cross-validation experiments, applying random horizontal and vertical flipping of the input images. In contrast, we kept the original positive ratio of the test cohort in the primary dataset.
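The flip augmentation applied to the training folds can be sketched as follows. The key point is that the PET and CT volumes of a sample must receive the same random transform so the two modalities stay spatially aligned; this is a minimal illustration, not the authors' exact pipeline:

```python
import numpy as np

def augment_pair(pet, ct, rng):
    """Apply identical random horizontal/vertical flips to a paired
    PET and CT slice stack (shape: slices x height x width) so the two
    modalities remain spatially aligned after augmentation."""
    if rng.random() < 0.5:                    # horizontal (left-right) flip
        pet, ct = pet[..., ::-1], ct[..., ::-1]
    if rng.random() < 0.5:                    # vertical (up-down) flip
        pet, ct = pet[:, ::-1, :], ct[:, ::-1, :]
    return pet, ct
```

Because only the training folds are augmented, the test cohort retains the clinic's natural positive ratio, as stated above.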

MDL model development
An overview of our prediction framework is shown in Fig. 2. The starting point of our model is Conv-LSTM [31], a classic deep learning architecture for natural image recognition that has recently been applied to medical image analysis with satisfactory performance [35]. Our network backbone (Fig. 2(a)) was built on Conv-LSTM with a simple adjustment: constructing two identical encoders for the PET and CT data. To extract the hidden image features of the input data, four blocks of convolution and pooling operations were conducted. Then, a recursive learning framework was introduced, with a structure called "long short-term memory" (LSTM) [36,37], which performs simple learned gating functions that allow its internal state to be updated or reset. The extracted features were concatenated into a sequence, which the LSTM transformed into a composite feature vector for the sample. Thus, the heterogeneous information of the input data was distilled into high-level semantic features reflecting intraslice spatial structures and interslice contextual correlations. The output of the model was a set of two continuous variables representing the prediction probability (on a scale of 0.0 to 1.0) for each category, treated as a discrete probability distribution. The final prediction was calculated as the probability-weighted average of the class indices, rounded to the nearest integer.
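A stripped-down numpy sketch of this slice-sequence stage: per-slice feature vectors are folded through an LSTM cell into one composite vector, which a linear layer plus softmax maps to the two-class distribution. All shapes and parameter layouts here are hypothetical; the actual architecture details are given in Sec. 4 of the supplementary materials:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM gating step: z stacks the input, forget, cell, and
    output gate pre-activations (hypothetical parameter layout)."""
    z = W @ x + U @ h + b
    d = h.size
    i, f, g, o = z[:d], z[d:2 * d], z[2 * d:3 * d], z[3 * d:]
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # update / reset cell state
    h = sigmoid(o) * np.tanh(c)
    return h, c

def predict(slice_features, W, U, b, W_out):
    """Fold the per-slice feature sequence into a composite vector, then
    output a two-class probability distribution. The final label is the
    probability-weighted average of the class indices {0, 1}, rounded to
    the nearest integer (for two classes this is equivalent to
    thresholding p[1] at 0.5)."""
    d = U.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    for x in slice_features:                       # inter-slice context
        h, c = lstm_step(x, h, c, W, U, b)
    logits = W_out @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()                                   # softmax
    return p, int(round(p[0] * 0 + p[1] * 1))
```

The decision rule in the last line mirrors the paper's "probability-weighted average of the categories rounded to the nearest integer".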
In addition to the above backbone, multiple MDL models using different feature fusion strategies were developed and compared (Fig. 2(b)), including the pixel intermixing model (I), separate channel model (II), separate branch model (III), quantitative weighting model (IV), and hybrid learning model (V). Their main differences lie in feature encoding. The first model (I) is based on input-level fusion, distinguishing it from the other, feature-level fusion approaches. Here, a PET slice and its corresponding CT slice were integrated into one input image via pixel intermixing for single-branch encoding. Second (II), the PET and CT data were read into one encoder through separate channels and were simply concatenated after the first group of convolution and pooling operations. For the third model (III), the output feature maps from separate PET and CT encoding branches were concatenated before being fed into the following LSTM predictor. Fourth (IV), the model learned the spatial contribution of the feature maps from the PET and CT encoders by a quantitative weighting strategy that treated a convolutional result as a weighting matrix. In the last model (V), the PET and CT features extracted from two identical encoders were combined by the hybrid learning approach, a modal fusion method we published previously, which generated spatial fusion maps and quantified the contribution of the complementary information. These fusion maps were then concatenated with the modality-specific (i.e., PET and CT) feature maps to obtain the final fused feature representation at different scales.
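The feature-level variants (II-V) differ mainly in how the tensors from the two encoders are merged. A schematic numpy sketch of the merge operations follows; the real models fuse learned feature maps inside the network, and the exact hybrid fusion operator follows the authors' earlier publication, so this only illustrates the tensor bookkeeping:

```python
import numpy as np

def separate_channel(pet_img, ct_img):
    """(II) Stack the modalities as input channels of one encoder."""
    return np.stack([pet_img, ct_img], axis=0)            # (2, H, W)

def separate_branch(f_pet, f_ct):
    """(III) Concatenate the feature maps of two encoders channel-wise."""
    return np.concatenate([f_pet, f_ct], axis=0)          # (2C, H, W)

def quantitative_weighting(f_pet, f_ct, w):
    """(IV) Blend the modalities with a learned spatial weight map w
    whose entries lie in [0, 1] (here a placeholder for the
    convolutional weighting matrix)."""
    return w * f_pet + (1.0 - w) * f_ct                   # (C, H, W)

def hybrid_learning(f_pet, f_ct, w):
    """(V) Concatenate a spatial fusion map with the modality-specific
    maps so complementary and modality-specific features both survive."""
    fused = quantitative_weighting(f_pet, f_ct, w)
    return np.concatenate([fused, f_pet, f_ct], axis=0)   # (3C, H, W)
```

Note how (V) grows the channel dimension to keep both the fused and the unfused representations, whereas (IV) collapses them into one.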
To achieve better performance, we further aimed to promote intraclass cohesion and interclass separation of the semantic embeddings of PTF and non-PTF cases. Thus, we adopted contrastive learning [32] in the hybrid learning model. Traditionally, we trained models (I-V) with a cross-entropy loss between the prediction and the ground truth (Fig. 2(c)). Here, the cross-entropy loss was combined with a contrastive training objective (Fig. 2(d)) derived from the similarity between pairs of samples to form the overall loss function. Trained in this way, the contrastive hybrid learning model (VI) gained an enhanced prediction ability, because samples of the same class lie close to each other regardless of the modal heterogeneity of the data source domain and far from samples of different classes. Details of the overall training objective are described in Sec. 3 of the supplementary materials.
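A minimal sketch of such a combined objective: the exact formulation is in Sec. 3 of the supplementary materials, and both the margin-based contrastive term and the weighting factor `lam` here are illustrative assumptions, not the authors' published loss:

```python
import numpy as np

def cross_entropy(p, y):
    """Standard cross-entropy for a predicted distribution p and label y."""
    return -np.log(p[y] + 1e-12)

def contrastive(z1, z2, same_class, margin=1.0):
    """Pull same-class embeddings together; push different-class
    embeddings at least `margin` apart (margin form is an assumption)."""
    dist = np.linalg.norm(z1 - z2)
    return dist ** 2 if same_class else max(0.0, margin - dist) ** 2

def overall_loss(p, y, z1, z2, same_class, lam=0.5):
    """Cross-entropy plus the weighted pairwise contrastive term."""
    return cross_entropy(p, y) + lam * contrastive(z1, z2, same_class)
```

Minimizing the second term over sample pairs is what drives the intraclass cohesion and interclass separation described above.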

Model implementation and visualization
We implemented the MDL models using TensorFlow 1.14 [38] on a machine running Windows 10 with CUDA 10.0 and cuDNN 7.6 [39]. Model training was performed on an 11 GB NVIDIA GeForce RTX 2080 Ti. For the training parameters, we used values of 0.1, 0.01, and 4 for the regularization factor, the learning rate, and the batch size, respectively. Sec. 4 of the supplementary materials presents the detailed architecture parameters used in the MDL models.
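With the stated hyperparameter values, one regularized gradient step looks roughly like the following; this assumes L2 regularization and a plain gradient-descent update, since the exact optimizer is not specified in this excerpt:

```python
import numpy as np

# Hyperparameter values as stated in the text
LEARNING_RATE = 0.01
REG_FACTOR = 0.1        # regularization factor (assumed here to be L2)
BATCH_SIZE = 4

def sgd_step(w, grad, lr=LEARNING_RATE, reg=REG_FACTOR):
    """One gradient step on weights w with L2 weight decay:
    w <- w - lr * (grad + reg * w)."""
    return w - lr * (grad + reg * w)
```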
For each sample, model attention can be visualized for physician comprehension and validation. Here, the focused regions of hidden-layer feature maps were rendered as a rough location heatmap [40], which highlighted the areas driving the prediction and revealed what kinds of features contributed to the outputs. In this way, complex features that had passed through deep convolutional and pooling layers were projected back onto the original input image.
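The heatmap generation can be sketched in the spirit of class activation mapping, as a simplified stand-in for the method of ref. [40]; the class-importance weights and the nearest-neighbour upsampling here are assumptions:

```python
import numpy as np

def location_heatmap(feature_maps, class_weights, input_shape):
    """Weight each hidden feature map by its importance for the
    predicted class, sum, keep positive evidence, normalize to [0, 1],
    and upsample to the input resolution for overlay on the image.
    feature_maps: (n_maps, h, w); class_weights: (n_maps,)."""
    cam = np.tensordot(class_weights, feature_maps, axes=1)   # (h, w)
    cam = np.maximum(cam, 0.0)                                # positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()
    rep_h = input_shape[0] // cam.shape[0]
    rep_w = input_shape[1] // cam.shape[1]
    return np.kron(cam, np.ones((rep_h, rep_w)))              # crude upsample
```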

Statistical analysis
To reveal differences between the clinical characteristics of the PTF group and the non-PTF group, we used the statistical package SPSS version 22.0 for univariate analyses, including the Mann-Whitney U test for numerical variables and Pearson's chi-square test or, where necessary, Fisher's exact test for categorical variables. p values < 0.05 were considered statistically significant. The prediction results were drawn from fivefold cross-validation, in which samples of the primary dataset were divided into training, validation, and test cohorts at a ratio of 3:1:1. Samples of the external dataset were used only for generalization testing. The main metrics used to evaluate model performance were accuracy, sensitivity, specificity, and F1 score. Sec. 5 of the supplementary materials lists the specific formulas for these evaluation metrics.

Patient characteristics

Table 1 displays the baseline characteristics of the patients in the primary dataset (median age: 55.00 years; 95 females, 110 males) and the external dataset (median age: 54.50 years; 23 females, 21 males). The PTF rates were 9.76% (20/205) and 9.10% (4/44), respectively, showing the similarity between the two datasets. The percentages of PTF and non-PTF patients possessing one IPI risk factor were 95.00% (19/20) and 54.05% (100/185), respectively, demonstrating a significant difference between the two groups in the primary dataset (p < 0.001).
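The evaluation metrics referred to above reduce to standard confusion-matrix formulas. A minimal sketch, with PTF coded as the positive class (the exact definitions used are in Sec. 5 of the supplementary materials):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and F1 score from binary
    labels, with PTF coded as the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall for PTF
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return accuracy, sensitivity, specificity, f1
```

Because F1 combines precision and sensitivity on the rare positive class, heavy class imbalance depresses it even when accuracy is high, which is the behaviour noted for Table 2.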

Performance comparison of MDL models
The prediction performance metrics of the MDL models in the primary (that is, the test cohort) and external datasets are listed in Table 2. Due to the imbalance between the PTF and non-PTF groups, the F1 scores of all MDL models were relatively low, which we still considered a meaningful result. Overall, the hybrid learning model achieved the best performance across all evaluation metrics in the test cohort of the primary dataset for predicting PTF. This indicates that the hybrid learning feature fusion strategy increased both the ratio of true positives to false positives and that of true negatives to false negatives. Figure 3(a) shows that its areas under the receiver operating characteristic curves (AUCs) were 0.837 in the test cohort of the primary dataset and 0.869 in the external dataset. Although the quantitative weighting model achieved a better AUC of 0.844 in the primary dataset, we regarded the AUC in the external dataset as the more important indicator because it reflects generalization ability. Therefore, we further optimized the hybrid learning model by integrating a contrastive training objective, finally establishing the contrastive hybrid learning model.
The contrastive hybrid learning model achieved AUCs of 0.926 and 0.925 in the primary (that is, the test cohort) and external datasets, respectively. In addition, the DeLong test demonstrated that the AUC of this model was significantly better than those of the other models listed in Table 2. The normalized confusion matrices of all MDL models for distinguishing PTF from non-PTF in the test cohort of the primary dataset are shown in Fig. 3(b). Notably, the sensitivity for PTF patients improved steadily from model I to model VI, with the contrastive hybrid learning model performing best.

Interpretability of MDL models
For each sample, model attention can be visualized for clinical comprehension and validation. Here, we aimed to understand which areas of the input images and what kinds of features contributed to the prediction. Figure 4 shows the input PET/CT images and the corresponding location heatmaps for three patients randomly chosen from the test cohort, which demonstrate a common pattern consistently shared among all samples. In the PET images, physiological tracer uptake and lesions are delineated in green and red contours, respectively. Despite the interference of physiological uptake, the contrastive hybrid learning model paid close attention to the structure of the lesions. We believe its learning mechanism could help rectify spillover by iteratively recognizing the differences between the lesions and the regions with physiological tracer uptake, thereby reducing the spill-in effect. This suggests that the model actively sought tumour lesion distribution areas to classify PTF and non-PTF. In addition, the compared models attended to different lesion-adjacent areas for the same patient, a discrepancy that helps explain why these models differed in prediction performance. Specifically, the heatmap of the contrastive hybrid learning model contained more graded attention on regions related to the tumour itself, excluding necrosis and peripheral inflammation [41].

Discussion
The prediction of PTF in DLBCL patients has long been a prominent challenge for clinicians. In this work, we developed and validated a group of deep learning-based multimodal models that learned complementary high-level semantic features from interim [18F]FDG PET/CT images and achieved individualized and noninvasive prediction of PTF in patients with low-risk DLBCL at the end of treatment. The major findings of our experiments cover the following issues: (1) To the best of our knowledge, our work is the first to apply an MDL approach to interim PET/CT images acquired from DLBCL patients to validate their predictive value for PTF. (2) The prediction performance of the present MDL model, based on both the incorporation of a hybrid learning feature fusion strategy and the enhancement of a contrastive training objective, significantly outperformed (almost all AUCs, p < 0.05) models using other feature fusion strategies in ablation comparisons. (3) Our work provides solid evidence that the contrastive hybrid learning approach applied to PET/CT images offers an effective method for PTF prediction and risk stratification in patients with DLBCL.

From a clinical point of view, the change in tumour involvement calculated from pathological FDG uptake in PET imaging can indicate possible PTF, but its use remains limited because it depends on data acquired both prior to and after treatment. Moreover, the identification of CT parameters in the tumour region has been demonstrated to be an effective discrimination technique for predicting PTF in different carcinomas [24,25]. However, existing models for extracting and combining invisible imaging features of PET and CT are still not adequate for predicting PTF in DLBCL. In the current work, we therefore proposed several MDL models based on [18F]FDG PET/CT to predict PTF and investigated ways of fully utilizing PET/CT data to achieve the best performance.
The proposed contrastive hybrid learning model demonstrated particularly outstanding predictive performance for PTF, as it effectively integrated PET metabolic features with the corresponding CT anatomic features [42]. In contrast, the quantitative weighting model implemented element-wise multiplication to encode the importance assigned to information from each modality, although doing so considerably weakened the natural characteristics of each modality [30]. As shown in Table 2, the separate branch model included a layer-level fusion strategy based on simple concatenation followed by prediction layers [43]; thus, some useful information associated with complementary features may have been lost. In addition, the separate channel model combined the PET and CT images after the first convolutional layer to derive fused feature maps [44], which led to lopsided attention towards the modality with dominant pixel intensity. The pixel intermixing strategy constructed a type of early fusion model [45], which shared a similar weakness with the separate channel model, reducing the prediction accuracy.

Fig. 4 Visualization of three PTF-DLBCL examples. For a clearer comparison, the corresponding PET and CT images are shown at the top of each column. Physiological tracer uptake and lesions are delineated in green and red contours, respectively. The activated regions are presented in red with larger weights, as decoded by the colour legend on the right. Yellow arrows indicate the obvious differences among the different models.
To interpret the MDL models, we visualized the regions of interest of the network by generating rough location heatmaps. Across the MDL models, the activated areas in the heatmaps were located primarily in the tumour lesion and its surrounding areas. These areas were consistent with the predictive regions observed by experienced radiologists. Notably, the contrastive hybrid learning model paid more precise attention while avoiding interference from physiological uptake. These common patterns offer a clue to the working principle of MDL models when analysing PET/CT data.
Although our results are promising, our present work has a few limitations, leaving room for several future improvements. First, this was a retrospective study based on a relatively small sample size, especially in terms of the number of positive samples. Although large numbers of PET/CT scans from different institutions are difficult to obtain, the continued addition of in-house data would be of paramount importance to the current work, and we are actively working on this task. Second, to indicate PTF accurately enough to plan personalized treatments, the sensitivity and accuracy must be very high. Thus, the MDL model provides a reference result rather than a direct decision for clinical practice, given the need for prospective validation. In the future, we aim to facilitate clinical decision-making by encapsulating the well-pretrained model as a ready-to-use tool, allowing physicians to feed an original data sample into the model and directly obtain its PTF probability through simple operations. Finally, we did not consider other possible prognostic factors. We suggest that integrating biologic markers, including blood biomarkers and pathological and genetic features, may improve the accuracy and robustness of our model.
In conclusion, we proposed multiple MDL models and developed the best one, named the contrastive hybrid learning model. It utilizes the hybrid learning strategy and is optimized with a contrastive training objective to generate complementary features for multimodal learning. Adequate ablation experiments proved the superiority of the proposed model. In addition, it is end-to-end trainable and avoids time-consuming manual delineations. Therefore, it not only quantitatively validates the predictive value of interim PET/CT but also provides a proof of concept for multimodal data analysis in clinical decision-making.
Code availability The code of our study is publicly accessible at https://github.com/cyuan-sjtu/MDL-model.