Joint Imbalance Adaptation for Radiology Report Generation

Abstract

Radiology report generation, predicting text descriptions for radiological images, faces critical challenges due to data imbalance: medical tokens appear less frequently than regular tokens, and the number of normal image labels may not equal that of abnormal ones. However, existing studies mainly consider label imbalance without mitigating other factors, such as token imbalance. In this study, we jointly consider two imbalance factors, label and token, which determine the distributions of radiology images and language, the two fundamental modalities of the generation task. We propose a Joint Imbalance Adaptation (JIMA) model to promote task robustness by leveraging token and label imbalance. Experiments on two standard evaluation datasets (IU X-ray (Demner-Fushman et al., 2015) and MIMIC-CXR (Johnson et al., 2019)) with automatic and human evaluations demonstrate significant improvements over current state-of-the-art models. We conduct extensive ablation and case analyses to examine and present dual imbalance effects on the robustness of radiology report generation. While data imbalance remains challenging, our approach opens new task directions and shows promising results.


Introduction
Radiology report generation is a multimodal, medical image-to-text task that generates text descriptions for radiographs (e.g., X-ray or CT scans), which may reduce the workloads of radiologists (Jing et al., 2018, 2019). This domain-specific task has unique characteristics compared to general image-to-text tasks (e.g., image captioning), such as lengthy documents, medical annotations, and clinical terminology. As demonstrated in Figure 1, data imbalance can significantly impact model robustness and prevent model deployment in practice: models can easily overfit to frequent patterns.
However, addressing data imbalance to augment the robustness of radiology report generation is still in its infancy. MIMIC-CXR has a longer average report length than IU X-ray, and the lengthier documents may pose a unique multimodal generation challenge in the medical field. To conduct our analysis, we define low and high frequency using the top 12.5% most frequent tokens. Our findings in Appendix A suggest a joint relation between label and token imbalance, with higher ratios of low-frequency tokens in abnormal reports. This observation motivates us to investigate how the imbalance impacts model robustness and reliability.
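The top-12.5% frequency split can be sketched in a few lines. The following is an illustrative reconstruction; the function and variable names are ours, not from the paper's code:

```python
from collections import Counter

def split_token_frequency(reports, top_ratio=0.125):
    """Split the vocabulary into high- and low-frequency token sets.

    The top `top_ratio` fraction of the vocabulary, ranked by corpus
    count, is treated as high-frequency; the rest is low-frequency.
    `reports` is a list of tokenized reports (lists of strings).
    """
    counts = Counter(tok for report in reports for tok in report)
    ranked = [tok for tok, _ in counts.most_common()]
    cutoff = max(1, int(len(ranked) * top_ratio))
    return set(ranked[:cutoff]), set(ranked[cutoff:])
```

Note that the ratio is over vocabulary size, not over token occurrences: a small fraction of the vocabulary can still cover most of the corpus.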

Imbalance Effects
We examine the potential impact of label and token imbalance on model performance. To ensure consistency, we keep the top 12.5% threshold to split low- and high-frequency tokens for evaluation purposes.
The analysis includes three state-of-the-art models: R2Gen (Chen et al., 2020), WCL (Yan et al., 2021), and CMN (Chen et al., 2021). We use their released source code and leave implementation details to Appendix D.2. We use BLEU-4 (Papineni et al., 2002) and F1 scores to measure performance across both token (low vs. high frequency) and label (normal vs. abnormal) imbalance. We visualize performance variations in Figure 1.
The results suggest that the models have significant difficulty coping with label and token imbalance. Models consistently perform worse on abnormal reports, which are lengthier and contain more infrequent tokens than normal reports. For example, the top 12.5% most frequent tokens account for more than 80% of token occurrences in both datasets, and low-frequency tokens show much worse performance than frequent tokens, as infrequent tokens are harder to optimize (Yu et al., 2022). However, infrequent tokens contain higher ratios of medical terms (e.g., silhouettes and pulmonary) describing health states. The significantly varying performance highlights the unique challenge of adapting to token and label imbalance. While existing work (Nishino et al., 2020) has considered label imbalance, it did not examine the performance effects of label or token imbalance. These findings inspire us to propose our model, Joint Imbalance Adaptation (JIMA), to model token and label imbalance jointly.

Joint Imbalance Adaptation
In this section, we present our approach, Joint Imbalance Adaptation (JIMA), which uses curriculum learning.

The difficulty measurer evaluates sample difficulties. To diversify learning aspects and jointly incorporate the imbalance factors, we deploy three measurement tasks: 1) Task 1 (Label F1) promotes generating clinically correct reports; 2) Task 2 (Token F1) adjusts the balance between infrequent and frequent tokens; and 3) Task 3 (BLEU-4) promotes generating coherent long reports. We start with a pre-trained model (e.g., a Transformer (Vaswani et al., 2017)), which can perform well on easy samples (e.g., normal samples and frequent tokens). The difficulty measurer then evaluates sample difficulties by the three metrics: label F1, token F1, and BLEU-4. The training scheduler selects more hard samples when performance decreases, and vice versa. Given decreasing performance as an example, (p_t - p_{t-1}) / p_{t-1} will be negative.
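Since the exact scheduler (e.q. 1) is not reproduced in this excerpt, the following sketch only illustrates the intended behavior under an assumed update rule: a negative relative performance change enlarges the fraction of hard samples selected for training.

```python
def update_schedule(c_prev, p_t, p_prev, rate=0.1, c_min=0.2, c_max=1.0):
    """Curriculum training scheduler sketch (assumed form, not the
    paper's e.q. 1).

    p_t and p_prev are the measured performances at steps t and t-1.
    When performance drops, (p_t - p_prev) / p_prev is negative, so the
    selected fraction c_t grows; when performance improves, it shrinks.
    """
    rel_change = (p_t - p_prev) / p_prev
    c_t = c_prev - rate * rel_change  # negative change => larger c_t
    return min(c_max, max(c_min, c_t))
```

The clamping to [c_min, c_max] is our addition to keep the fraction a valid sampling ratio.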
Figure 2: JIMA has three tasks: P (e.q. 5) as token distribution prediction, Q (e.q. 3) as label prediction from generated reports, and K (e.q. 8) as regular report generation. We assign one color per task and use solid arrows for workflows. The dotted arrow yields new models (f). Frames with double solid lines freeze model parameters. f_R, f_H, f_T, and f_M refer to the visual extractor (e.q. 2), token distribution predictor (e.q. 5), transformer (e.q. 8), and memory-driven model (e.q. 7), respectively.
where V is the vocabulary. We use a feed-forward network as our token distribution predictor, since our experiments suggest that a more complex network architecture does not improve performance. Samples containing infrequent tokens are prone to lower F1 scores, so such samples will be repeatedly prioritized in the training data. This allows the model to devote more attention to learning from samples containing infrequent tokens, particularly when it struggles to capture the underlying patterns in those tokens. Since infrequent tokens have much higher ratios of medical terms, leveraging token imbalance is beneficial.
Task 2 predicts the occurrence probability of each word in a report, which is a multi-label classification task. Therefore, we optimize the model with a multi-label classification loss as follows, where σ(·) is the sigmoid function, y ∈ R^{|V|} is the ground truth, and y_i ∈ y is defined accordingly. We set the threshold to 0.5 to predict whether a token occurs in a report and choose the F1 score as our difficulty evaluator.
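The thresholded occurrence prediction, its binary cross-entropy loss, and the F1-based difficulty score can be sketched as follows (an illustrative reconstruction; the function names are ours):

```python
import math

def token_occurrence_f1(probs, target, threshold=0.5):
    """Difficulty score for Task 2: F1 between thresholded occurrence
    predictions (post-sigmoid probabilities over the vocabulary) and
    the ground-truth occurrence vector y."""
    pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def multilabel_bce(probs, target, eps=1e-12):
    """Multi-label binary cross-entropy averaged over the vocabulary."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probs, target)) / len(target)
```

Samples with lower F1 under this measure are treated as harder and prioritized by the scheduler.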

CL-Task 3
Task 3 implements an image-to-text generation pipeline with the objective of enhancing the fluency of generated reports. In text generation training, the model typically predicts the i-th token based on the 1st to (i-1)-th ground-truth tokens. To enable the adjustment, our memory-driven module takes two contextual inputs: the token occurrence probability prediction P from Task 2 and the sequence token probability distribution Q from Task 1. We utilize a Gated Recurrent Unit (GRU) (Cho et al., 2014) as our memory-driven encoder to learn a conditional token occurrence prediction h ∈ R^{l×|V|}, where l is the sequence length of a report.2 The memory-driven model captures the implicit relationship between the conditional token occurrence prediction h and the sequence token prediction probability Q_i as follows, where h_i ∈ R^{1×|V|}. We initialize h_0 = P and obtain h by stacking all h_i. Then, we obtain our final probability prediction K ∈ R^{l×|V|} as follows. This task optimizes the model by e.q. 4. Finally, we obtain our generation G from K by beam search. To maximize report fluency on the foundation of correct clinical descriptions, we choose BLEU-4 between G and the ground truth as our difficulty evaluator, to augment generation ability on lengthier documents.
2 We experimented with models more complex than the GRU, such as a Transformer, but found the GRU to be the best option.
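A toy sketch of the memory-driven rollout follows. One assumption is made explicit: the paper's GRU cell (e.q. 7) is replaced by a simple convex-combination gate, since the exact equations are not reproduced in this excerpt; only the h_0 = P initialization and the per-step stacking into K follow the text.

```python
def memory_driven_rollout(P, Q, gate=0.5):
    """Toy stand-in for the memory-driven module.

    P: Task 2's global token occurrence prediction (vocab-sized list).
    Q: per-step sequence predictions from Task 1 (l rows, vocab-sized).
    The real model uses a GRU cell per step; here each step simply
    mixes the previous memory state with Q_i, and K stacks the per-step
    outputs, matching the h_0 = P / stacked-h description.
    """
    h = list(P)                      # h_0 initialized from P
    K = []
    for Q_i in Q:
        h = [gate * h_j + (1 - gate) * q_j for h_j, q_j in zip(h, Q_i)]
        K.append(list(h))
    return K
```

In the paper the combination is learned rather than fixed; the sketch only shows the data flow from (P, Q) to K.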

We propose a joint optimization approach to integrate the three tasks, and the approach works with curriculum learning to tailor joint imbalance learning. We include more implementation details and hyperparameter settings in Appendix D.2.

Baselines
To examine the validity of our method, we include six state-of-the-art baselines under the same experimental settings.

JIMA learns repeatedly from samples with lower BLEU-4 scores, resulting in better performance than the baseline models. For example, JIMA shows an average improvement of 6.84% on IU X-ray and 7.10% on MIMIC-CXR. We infer that Task 3 improves the fluency of generated sentences, leading to the improvements on the BLEU-(1-4) and ROUGE-L metrics.
Second, our model achieves the best performance on the clinical F1 metric. The results clearly indicate the effectiveness of Task 1 (Section 3.1), which enables the model to pay more attention to difficult samples with lower F1 scores. Additionally, our method promotes clinical token prediction, as performance on infrequent tokens and medical terms has improved. For example, our generation significantly outperforms the baselines on F1 score by 21.69% on IU X-ray and 17.73% on MIMIC-CXR on average. CMM + RL performs better than the other baselines on IU X-ray but not on MIMIC-CXR. In contrast, JIMA maintains stable performance on both IU X-ray and MIMIC-CXR.
We infer that our joint imbalance adaptation yields larger improvements than label imbalance adaptation alone, which is consistent with our ablation analysis (Section 5.4). We personalize the following settings in the baselines.
In WCL, we use the basic contrastive learning loss without assigning a hardness weight to different samples on the IU X-ray dataset, because the file measuring the similarity among samples is inaccessible. We set the contrastive embedding size to 256 and the weight of the contrastive loss to 0.2. In CMM + RL, the reinforcement learning reward is based on evaluation metrics, and we select BLEU-4 in this case.
The BLEU score measures the precision of a prediction with a penalty based on the reference-to-prediction length ratio. METEOR computes the harmonic mean of unigram precision and recall; unlike BLEU, which considers only single words, METEOR incorporates a penalty to account for word order. ROUGE-L naturally takes sentence-level structural similarity into account and automatically identifies the longest co-occurring n-gram sequences. Clinical metrics are domain-specific evaluation methods that measure the factual completeness and consistency of generated reports. We use CheXbert (Smit et al., 2020) for the clinical metrics.

JIMA aims to augment model robustness under label and token imbalance. As optimizing under data imbalance has been demonstrated to be difficult, deploying such a learning strategy strengthens model robustness and reliability. Our proposed approach deploys curriculum learning (CL) (Wang et al., 2022), which automatically adjusts the optimization process by gradually selecting training entries according to learning difficulty, learning from hard to easy samples as our optimization strategy (Zhou et al., 2020). To achieve this goal, we propose two major CL modules, a difficulty measurer and a training scheduler, shown in Figure 2.
However, these ground-truth tokens and their context are not accessible at test time; models generate the token at the current position from previous predictions, which causes error accumulation for long documents and decreases generation fluency. To narrow the discrepancy between training and testing, we calculate the BLEU-4 score on generations from beam search to measure the model's performance in test mode. BLEU-4 matches four consecutive tokens between generated and reference reports, which efficiently evaluates report fluency. Thus, we can improve the model's generation fluency by feeding samples with lower BLEU-4 scores into the model's learning. We also propose a Memory-Driven module that self-adjusts the current token probability distribution based on previous predictions instead of the ground truth.
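BLEU-4 as a difficulty measure can be illustrated with a minimal implementation (a simplified sketch of the geometric mean of 1- to 4-gram modified precisions with a brevity penalty; in practice a standard library implementation would be used):

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    """Minimal single-reference BLEU-4: geometric mean of modified
    1-4-gram precisions, multiplied by the brevity penalty."""
    precisions = []
    for n in range(1, 5):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(1, sum(cand.values())))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(1, len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)
```

Generations scoring low under this measure are treated as hard samples and fed back into training.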

Algorithm 3.4 summarizes the overall optimization process of our approach. We set the learning rate of Task 2 as α, and β refers to the learning rate of Tasks 1 and 3. In each training step, we sample different data for different tasks, and each task focuses on optimizing its own module of the model. For example, we update the visual extractor (f_R) and token distribution predictor (f_H) parameters in Task 2. Then we fix the visual extractor parameters (f_R) and update the transformer parameters (f_T) in Task 1. Finally, we combine the global token distribution P from Task 2 and the generation Q from Task 1 to optimize the memory-driven model (f_M) in Task 3.

Optimization Process of JIMA.
Require: learning rates α, β
for each epoch do
1. Rank entries by the three difficulty measurers (token F1, label F1, and BLEU-4);
2. Calculate the three c(p_t) training schedulers by e.q. 1;
3. Select the top c(p_t) samples from the ranked datasets of step 1 as training sets;
4. Sample a batch from D_1 and update Task
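The per-epoch procedure above can be sketched as follows; the container names and callables are illustrative stand-ins, not the paper's API:

```python
def jima_epoch(datasets, measurers, schedulers, updaters):
    """One JIMA training epoch, following the algorithm sketch:
    rank entries per difficulty measurer, select the scheduled top
    fraction of hard samples, and let each task update only its own
    module (f_R/f_H for token F1, f_T for label F1, f_M for BLEU-4)."""
    for task in ("token_f1", "label_f1", "bleu4"):
        # 1. Rank entries from hard (low score) to easy (high score).
        ranked = sorted(datasets[task], key=measurers[task])
        # 2-3. Select the top c(p_t) fraction as this task's training set.
        k = max(1, int(len(ranked) * schedulers[task]))
        selected = ranked[:k]
        # 4. Each task optimizes its own module on the selected batch.
        updaters[task](selected)
```

Here `schedulers[task]` plays the role of c(p_t) from e.q. 1, and `updaters[task]` wraps the gradient step for that task's module.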

Curriculum learning empowers the model to concentrate on optimizing hard samples while mitigating the risk of overfitting to easier samples. The joint optimization scheme lets each task manage the optimization of different module parameters and learn transferable knowledge from simpler to more complex tasks. As a result, all modules collaborate to reduce errors from previous tasks.

The automatic evaluation includes NLG-oriented and clinical-correctness metrics. NLG-oriented metrics measure the similarity between generated and reference reports. Clinical correctness and human evaluation are factually-oriented, domain-specific evaluation methods. To be consistent with our baselines (Chen et al., 2020; Delbrouck et al., 2022; Wu et al., 2023), we utilize F1 CheXbert (Smit et al., 2020) as the clinical-correctness metric. The experiments compare our proposed approach (JIMA) with the state-of-the-art baselines. Two of our six baselines (CMM + RL and RRG) are designed to address label imbalance by improving abnormal findings generation. We conduct ablation and case analyses to fully understand the capabilities of our proposed approach.
We compare against R2Gen (Chen et al., 2020), CMN (Chen et al., 2021), WCL (Yan et al., 2021), CMM + RL (Qin and Song, 2022), RRG (Delbrouck et al., 2022), and TIMER (Wu et al., 2023), obtained from their open-source code repositories. Detailed baseline implementations are in Appendix D.2.

Imbalance Setting

We evaluate model performance under token and label imbalance settings. For token imbalance, we compare F1 scores of frequent and infrequent tokens separately. We introduce three different scales to define frequent token sets: 1/4, 1/6, and 1/8. The splits define the top 1/4, 1/6, and 1/8 of the vocabulary as frequent tokens and the rest of the vocabulary as infrequent tokens. This setting demonstrates the effectiveness of our approach in adapting to token imbalance. For label imbalance, we divide our samples into a binary category, normal and abnormal. We reuse labels from the data section and NLG metrics for evaluation.
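Restricting token-level F1 to one frequency bucket can be sketched as follows (our illustrative helper, not from the released code):

```python
from collections import Counter

def per_bucket_token_f1(pred_tokens, ref_tokens, bucket):
    """Token-level F1 restricted to one frequency bucket (a set of
    tokens), as used to compare frequent vs. infrequent performance.
    Counts are clipped so repeated tokens are not over-credited."""
    pred = Counter(t for t in pred_tokens if t in bucket)
    ref = Counter(t for t in ref_tokens if t in bucket)
    tp = sum(min(c, ref[t]) for t, c in pred.items())
    if tp == 0:
        return 0.0
    precision = tp / sum(pred.values())
    recall = tp / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Evaluating the same generations against the frequent and infrequent buckets separately yields the two columns compared in the imbalance tables.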
Recent studies balance performance between disease and normal samples via reinforcement learning (Nishino et al., 2020; Yu and Zhang, 2022). However, those methods ignore a fundamental challenge of the generation task: token imbalance, a long-tail distribution. Token imbalance can be even more critical in the clinical domain, as medical tokens appear less frequently than regular tokens in radiology reports. Our study makes a unique contribution to radiology report generation by jointly incorporating token and label imbalance via curriculum learning.

R2Gen (Chen et al., 2020) is a transformer-based model with ResNet101 (He et al., 2016) as the visual extractor. To capture patterns in medical reports, R2Gen proposes a relational memory to enhance the transformer so that the model can learn the patterns' characteristics. Furthermore, R2Gen deploys memory-driven conditional layer normalization in the transformer decoder to incorporate the previous step's generation into the current step. CMN (Chen et al., 2021) is a novel extension of the transformer architecture that facilitates the alignment of textual and visual modalities. The cross-modal memory network records the shared information of visual and textual features. The alignment is carried out via memory querying and responding: the model maps the visual and textual features into the same representation space in memory querying and learns a weighted representation. CMM + RL (Qin and Song, 2022) is a cross-modal memory-based model optimized with reinforcement learning. CMM + RL designs a cross-modal memory model to align visual and textual features and deploys reinforcement learning to capture the label imbalance between abnormality and normality. The authors use BLEU-4 as a reward to guide the model to generate the next word from the image and previous words. RRG (Delbrouck et al., 2022, 2023) aims to generate clinically correct reports by weakly-supervised learning of the entities and relations
from reports. RRG is a BERT-based model with DenseNet-121 (Huang et al., 2017) as the visual extractor. RRG leverages RadGraph (Jain et al., 2021) to extract the entities and relation labels in a report and utilizes reinforcement learning to optimize the model. The reward assesses the consistency and completeness of the entity and relation sets between generated and reference radiology reports. RRG addresses label imbalance by maximizing the reward for predicting the more complicated entities and relations in abnormal samples. TIMER (Wu et al., 2023) aims to decrease overfitting to frequent tokens by introducing an unlikelihood loss to penalize errors on these tokens.

For the model architecture, we set the transformer structure with 3 layers, 8 attention heads, and 512-dimensional hidden states. The memory-driven model is a single-layer GRU network with a hidden size equal to the vocabulary size. We set the α learning rate to 4e-4 and the β learning rate to 1e-5 and decay them by a rate of 0.8 per epoch for all datasets. The pre-training epoch count is 30 for IU X-ray and 10 for MIMIC-CXR. We then adopt curriculum learning to optimize the pre-trained model. The maximum training epoch is 70 for IU X-ray and 50 for MIMIC-CXR. We keep the learning rate the same as in the pre-training stage. For all baselines, we set the maximum training epochs to 100 and 60 for IU X-ray and MIMIC-CXR, respectively. We also use the same preprocessing, optimizer, batch size, maximum training-sequence length, sampling method, and machine learning framework in all experiments. Specifically, we optimize models by ADAM (Kingma and Ba, 2015) with batch size 16. The maximum length of training data is 60. At test time, we generate tokens by beam search (Sutskever et al., 2014) with beam size 3 for all experiments. All implementations are in PyTorch (Paszke et al., 2019). In implementing baselines, we keep
all the model architectures and optimization parameters the same as in their papers. In R2Gen, CMN, and RRG, we generate reports using the code and pre-trained models published by the authors. For the other baselines (WCL, CMM + RL, and TIMER), we use the released code to train and generate reports.

Table 1 :
Data statistics summary. Variations exist in labels (Normal and Abnormal %) and average report length (L).
MIMIC-CXR contains reports for 65,379 patients. Each report is a text document associated with one or more frontal and lateral X-ray images. Table 1 summarizes the statistics of data imbalance; we include preprocessing details and imbalance visualizations in Appendix A. Table 1 presents imbalance patterns in tokens and labels. Abnormal entries are predominant in both datasets, and MIMIC-CXR displays a more skewed label distribution, as more abnormal samples were collected during diagnosis phases than for screening purposes.

Table 2 :
Overall performance. ∆ values are averaged percentage improvements over the baselines.
We design our experiments to evaluate performance in both regular and imbalanced settings via automatic and human evaluations.

Table 2
JIMA outperforms the baseline models (both imbalance-oriented and regular methods) on BLEU scores by a large margin, confirming the validity of selecting training samples with our curriculum learning method. The approach enables the model to learn multiple times from samples with lower BLEU-4 scores.

Table 3 :
Label imbalance evaluation with binary types, normal and abnormal.

Table 4 :
Results on high- and low-frequency tokens with three different ratio splits.
Radiology reports involve large vocabulary sizes and extreme sparsity. In terms of radiology report generation, reports may have disease-related labels. Recent studies have augmented model robustness by balancing performance between normal and abnormal samples.

Table 6 :
Human evaluation. "Same" means the clinician judged the two generated reports to have the same quality.

Figure 4: Qualitative comparison between JIMA and CMM + RL. Correct predictions of pathological and anatomical entities are highlighted in blue. The figure's example reports (normal and pacemaker cases, with reference and generated versions) are not reproduced here.
Compared to normal samples, abnormal samples have longer descriptions and contain more complex entities. These entities are usually rare in the corpus and suffer from model under-fitting; therefore, models underperform on abnormal samples. However, JIMA can capture most of the entities in all kinds of samples and achieves similar performance on both normal and abnormal samples, which demonstrates our model's effectiveness in improving the factual completeness and correctness of generated radiology reports.