Improving Chest X-ray Report Generation by Leveraging Text of Similar Images

Automatic medical report generation is the production of grammatically correct and coherent reports from radiology images. The encoder-decoder is the most common architecture for report generation, but it has not yet achieved satisfactory performance because of the complexity of this task. This paper presents an approach to improve the performance of report generation that can easily be added to any encoder-decoder architecture. In this approach, in addition to the features extracted from the image, the text related to the most similar image in the training data set is also provided as input to the decoder. The decoder thus acquires additional knowledge for text production, which helps improve performance and produce better reports. To demonstrate the efficiency of the proposed method, this technique was added to several different models for producing text from chest images. The evaluation results demonstrated that the performance of all models improved. Different approaches for word embedding, including BioBERT and GloVe, were also evaluated. Our results showed that BioBERT, a transformer-based language model, is the better approach for this task.


Introduction
Medical imaging refers to various technologies used to observe the human body to diagnose, monitor, or treat medical conditions. Each type of technology provides different information about the area of the body being studied or treated, for possible disease, injury, or the effectiveness of medical treatment [1].
There are different types of medical imaging technologies [2]:
Radiography: An image is recorded for subsequent evaluation.
Mammography: A type of radiography used to image the internal structure of the breasts.
Fluoroscopy: A continuous X-ray image is displayed on a monitor, allowing real-time monitoring.
CT: Many X-ray images are captured by a detector moving around the patient's body; a computer reconstructs the single images into slices of internal organs and tissues.
By exploring these images, doctors and specialists write a text (Figure 1) describing the abnormalities and essential points of a person's illness. This is error-prone for inexperienced specialists and time-consuming for experienced doctors. Automated support for this task can ease clinical workflows and improve the quality and standardization of care [3].

A typical radiology report includes the following information [4]:
Type of exam
History or reason for exam
Comparison: If any previous exams are available, the radiologist compares them in this section.
Techniques: This section explains how the exam is done and whether contrast is injected into the patient's vein.
Findings: This section lists what the radiologist observed in each part of the body. The radiologist determines whether each area is normal, abnormal, or potentially abnormal.
Impression: In this section, the radiologist summarizes and reports the most critical findings and their possible causes. It provides the most essential information for decision-making. Sometimes the report does not answer the clinical question, and more exams may be needed.
Automatic medical report generation is the production of reports from radiology images that are grammatically correct and coherent. The report must include accurate information for diagnosing and treating the patient and tracking their progress. Figure 2 shows the parts of a standard automated report generation system (encoder-decoder architecture [5]). In this system, first, the crucial areas of the image and the relationships between them must be identified using image processing techniques. Next, based on these findings, a text is produced as a report, which should be syntactically, semantically, and medically correct. Each of these parts is described in more detail below.
Image Processing: In this part, a convolutional neural network is usually used to diagnose medical abnormalities. Images are used as input, and the output is the set of features extracted from the image. These features move beyond raw pixel values and are represented in a more meaningful format. Pretrained models like ResNeXt-101 [6] and ResNet50 [7] are good choices for this part.
Text Generation: Once the visual features of the image have been obtained, a decoder is needed to produce sentences describing the content of the image. For this part, RNNs, LSTMs, and transformers like GPT-2 are appropriate. Also, for some decoders, a word embedding approach like GloVe should be used to convert words to vectors.
Although much research has been done in the field of report generation from medical images, significant results have not been achieved because of the complexity of this task. This paper presents a general way to improve the performance of these models that can easily be added to any encoder-decoder model used to generate a report. In this method, in addition to the features extracted from the image, the text related to the most similar image in the training data set is also given as input to the decoder. To find the most similar image, each image is encoded into a vector of features using a pretrained network (ResNet50 or ResNeXt); the distance between these vectors is then measured by Euclidean distance. In this way, the decoder acquires additional knowledge for text production, which helps improve performance and produce better reports. To demonstrate the efficiency of the proposed method, this technique was added to several different models for generating text from chest images. The results of training and evaluation on the Indiana University Chest X-ray Collection [8] showed that the performance of all models improved. Another factor affecting the performance of report generation systems is the method used to convert words to vectors in the decoder. We used BioBERT [9], a pre-trained language model for biomedical text mining, for converting words to equivalent vectors. Evaluation results showed that it achieves higher performance for report generation than GloVe [10].
The contributions of our work are summarized as follows:
We adopted a transformer-based language model, BioBERT, to convert each word into a vector fed to the model as input.
We proposed an approach that uses the report of the image most similar to the input image as an additional input, providing more information to the decoder.
We used a pretrained network (ResNet50 or ResNeXt) to extract relevant features from each image and compared these features to find the most similar image.

Related Works
Many researchers treat automatic report generation as an image captioning task [11][12][13][14]. Recently, several studies have tried to use novel deep learning techniques to generate high-quality medical reports. The convolutional-recurrent architecture (CNN-RNN) is the typical approach for automatic report generation [15][16]. This architecture is composed of two main parts: an encoder based on convolutional neural networks (CNNs) and a decoder based on recurrent neural networks (RNNs), which are widely used in natural language processing and capture the temporal information of textual sequences [17].
[18] proposed a model called MDNet, which uses attention over the medical image to improve performance and can provide justification of the network's diagnosis process. [19] applied a conditional GPT-2 for report generation. This model consists of an encoder that extracts visual and semantic features from the image and a decoder that generates words. The encoder, based on the CheXNet model, was fine-tuned to predict multiple tags from the image; the predicted score of each tag is then multiplied by the corresponding pre-trained word2vec embedding. [20] used a co-attention mechanism to detect regions containing abnormalities and a hierarchical LSTM model to generate long paragraphs. [21] proposed conditional visual-semantic embeddings to identify abnormal findings in the reports. [22] used an encoder-decoder model for report generation, enriching the encoder with multi-view visual features and the decoder with descriptive semantics. [23] presented a domain-aware report generation system using reinforcement learning that first predicts topics and then conditionally generates sentences corresponding to those topics. [24] used a memory-driven transformer for generating radiology reports, in which a memory is designed to save critical information and is used during decoding. Their model outperforms previous models and is able to generate long reports. [25] proposed a contrastive attention model that compares the current input image with normal images to detect contrastive information, helping the report generation model better attend to abnormal regions. [26] developed a cross-modal memory network in which a shared memory is used to store the alignment between images and text.
The proposed method in this article can be added to all these systems to improve the performance of report generation.

Methods
As shown in Fig. 3, the model architecture consists of three major parts: the visual model (encoder) for extracting features from medical images, the function to find the most similar image to the input image, and the decoder for generating a report. Each part is explained below.

3-1) Feature extraction
Each medical image is fed into a CNN to extract features. In this paper, ResNeXt-101 and ResNet50, which are both based on the ResNet design, are used; each outputs a feature vector of size 2048. ResNeXt repeats a block that aggregates a group of transformations with the same structure (Fig. 4). It follows the VGG/ResNet approach of repeating layers. It is also similar to the Inception [27] module, but Inception uses a different filter and size for every single block, while ResNeXt shares hyper-parameters among blocks. The size of the set of transformations is called the cardinality. Increasing the cardinality is a more efficient way to gain accuracy than going deeper or wider.
Another model is ResNet, proposed in 2015 by researchers at Microsoft Research. Before its introduction, the use of neural networks with many layers was problematic: as the number of layers increased, the network would suffer from the vanishing gradient problem. ResNet largely solved this problem, which is why the network can have up to 152 layers. The technique used in ResNet is called skip connections (Fig. 5), which connect the activations of one layer to later layers, bypassing some layers in between and thus forming residual blocks. The result of this approach is that if a layer decreases the performance of the architecture, the network can skip it through regularization. An example of adding skip connections to a network architecture is depicted in Fig. 6.
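The skip-connection idea can be illustrated with a minimal NumPy sketch (the weight matrices `w1` and `w2` are illustrative stand-ins, not the actual ResNet block parameters):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # Skip connection: the block's input is added back to its transformed
    # output before the final activation, so the block can fall back to
    # (near) identity when its weights contribute nothing.
    h = relu(x @ w1)
    return relu(x + h @ w2)

x = np.array([1.0, 2.0, 3.0])
zero = np.zeros((3, 3))
y = residual_block(x, zero, zero)  # with zero weights, the block passes x through
```

With zero weights the block reduces to the identity for non-negative inputs, which is the fallback behavior that keeps very deep stacks trainable.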

3-2) Finding the Most Similar Image
For every input image, a mechanism is used to find the most similar image in the training set (Fig. 7). This mechanism uses the Euclidean distance (equation 1) between the input image's feature vector, obtained from the pre-trained model, and the feature vectors of the other images in the training set. The pretrained model encodes the essential features of the images, and two images are similar when they have similar features. The training image whose feature vector has the smallest Euclidean distance to the input image's vector is chosen; its image and impression text are then fetched and used as the second input.

d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}    (1)

where p and q are the feature vectors and i = 1, …, n indexes the elements of each vector.
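This retrieval step amounts to a nearest-neighbour search over the training features; a minimal sketch (with illustrative 2-d vectors and toy impressions in place of the 2048-d encoder features):

```python
import numpy as np

def most_similar_index(query_vec, train_vecs):
    """Return the index of the training feature vector closest to the
    query under Euclidean distance (equation 1)."""
    dists = np.linalg.norm(train_vecs - query_vec, axis=1)
    return int(np.argmin(dists))

# train_vecs: one feature vector per training image; impressions: their reports
train_vecs = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
impressions = ["no acute disease", "mild cardiomegaly", "clear lungs"]

idx = most_similar_index(np.array([0.9, 1.2]), train_vecs)
# impressions[idx] would be fed to the decoder as the second input
```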

3-3) Word Embedding
To convert each sentence in the impressions into an integer sequence, a tokenizer is first adopted. Then, the integer sequences are padded to a fixed length. The weight matrix of the embedding layer maps each word to a high-dimensional space using a pre-trained language model or a word embedding algorithm.
Word embedding is used to convert both the impression of the most similar image and the impression of the input image. Two different word embedding approaches are used in this paper:
BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical text mining) is the first domain-specific language model pre-trained on large-scale biomedical corpora: it is initialized from BERT (trained on English Wikipedia and BooksCorpus) and further pre-trained on PubMed abstracts and PMC full-text articles. It has the same architecture as BERT and outperforms BERT and previous state-of-the-art models on several biomedical text mining tasks. Like BERT, BioBERT converts each token into a 768-dimensional vector. To the best of our knowledge, BioBERT has not previously been used for medical report generation.
GloVe is a model for word representation that captures global corpus statistics. It is an unsupervised learning algorithm for converting words to vectors that combines global matrix factorization and local context window methods. The main insight of the model is that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. In this project, the GloVe model embeds each token in a 300-dimensional vector.
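How the embedding layer's weight matrix is assembled from pretrained vectors can be sketched as follows (the toy 4-d vectors and three-word vocabulary are illustrative; the paper uses 768-d BioBERT or 300-d GloVe vectors):

```python
import numpy as np

# Toy pretrained embeddings standing in for BioBERT/GloVe vectors.
pretrained = {
    "lungs": np.array([0.1, 0.2, 0.0, 0.3]),
    "clear": np.array([0.4, 0.0, 0.1, 0.2]),
}
dim = 4
vocab = {"<pad>": 0, "lungs": 1, "clear": 2}

# Weight matrix of the embedding layer: row i holds the vector of word id i;
# words without a pretrained vector (here only the padding token) stay zero.
weights = np.zeros((len(vocab), dim))
for word, idx in vocab.items():
    if word in pretrained:
        weights[idx] = pretrained[word]

def encode(sentence, max_len=5):
    # Tokenize, map tokens to ids, and pad to a fixed length.
    ids = [vocab.get(t, 0) for t in sentence.lower().split()]
    return np.array((ids + [0] * max_len)[:max_len])

ids = encode("lungs clear")
vectors = weights[ids]  # shape: (5, 4) — one embedding per position
```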

4-1) Dataset
The Indiana University Chest X-ray Collection (IU X-Ray) is a set of chest X-ray images and their diagnostic reports. The dataset includes 7,470 pairs of images and reports. Each report consists of the following sections: impression, findings, comparison, and indication. In this paper, we treat the content of the impression as the target caption to be generated (Figure 1 provides an example).
We preprocessed the data by converting all tokens to lowercase and removing all non-alpha tokens (like '%', '$', '#', etc.) and erroneous tokens ("XXXX", "X-XXX"). On average, each image is associated with 2.2 tags and 5.7 sentences, and each sentence contains 6.5 words. All images were resized to 299×299. Data augmentation was also used to increase the number of images, so 22,410 images were used. Finally, we randomly selected 21,754 images for training, 250 for validation, and 250 for testing.
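The token-cleaning step described above might look like the following sketch (the example sentence is illustrative):

```python
import re

def preprocess_report(text):
    # Lowercase, strip non-alphabetic characters, and remove the
    # anonymization placeholders ("xxxx") present in IU X-Ray reports.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [t for t in text.split() if "xxxx" not in t]
    return " ".join(tokens)

cleaned = preprocess_report("Heart size XXXX normal, 100% clear lungs.")
# cleaned == "heart size normal clear lungs"
```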

4-2) Implementation Details
We used the Adam optimizer (Kingma and Ba, 2014) [28] for parameter learning. Early stopping was used to prevent overfitting. The model was implemented in Python 3. For training, an NVIDIA Tesla K80 GPU provided by Google Colab, 13 GB of RAM, and 68 GB of disk space were used. The models were trained for 30 epochs with a batch size of 256.

4-3) Visual Model
We used the pre-trained ResNeXt and ResNet models to extract the features of each image; the output is a feature vector of size 2048. Deciding which layer to extract from requires some care: early layers in the model usually learn low-level features, while higher layers learn more abstract features specific to the training data. Because the last layer of these models is a dense layer for classifying the input image, this layer is ignored and the features of conv5_block3, located before the last layer, were used.

4-4) Decoder
The method provides three inputs to the decoder. The first input is the set of features extracted from the image by the ResNeXt pre-trained network. The other two inputs are the report of the corresponding image and the report of the image most similar to the current image. First, the word |startseq| is given to the model, and the next word is predicted. In the next step, the two generated words are given along with the previous inputs to generate the third word. This process continues until the model generates the word |endseq|.
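The word-by-word generation loop can be sketched as follows, with a stub `predict_next` standing in for the trained decoder (which in the paper also conditions on the image features and the similar image's report):

```python
# Stub "next-word" model: returns a canned sequence for illustration only;
# the real decoder predicts from image features and both report inputs.
def predict_next(tokens):
    canned = ["no", "acute", "disease", "|endseq|"]
    return canned[min(len(tokens) - 1, len(canned) - 1)]

def generate_report(max_len=20):
    # Start from |startseq| and append one predicted word at a time,
    # feeding all words generated so far back into the model.
    tokens = ["|startseq|"]
    while len(tokens) < max_len:
        word = predict_next(tokens)
        tokens.append(word)
        if word == "|endseq|":
            break
    return " ".join(tokens[1:-1])  # drop the start/end markers

report = generate_report()
```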

4-5) Results
Accurate reports are those that include most of the essential information and contain no false information. We use the BLEU metric to compare the word embedding models and to evaluate the effect of our proposed model. The Bilingual Evaluation Understudy (BLEU) score is a metric for assessing a generated sentence against a reference sentence [29]. The score is calculated by counting matching n-grams in the generated text against n-grams in the reference text, where a unigram compares each token and a bigram compares each word pair; the comparison is made regardless of word order. We evaluated the models with BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores. A perfect match scores 1, while a complete mismatch scores 0. As shown in Table 1, for automatic medical report generation, the models that used the report of the most similar image outperformed the other approaches. We compared three methods and found that in all of them, using the report of the most similar image as an input produced better outcomes. These results show that the proposed method is effective in improving report generation: because it introduces more information into the decoder, it helps improve performance. Our results also demonstrate that BioBERT is a better embedding approach than GloVe. BioBERT is based on transformers, which use an attention mechanism and have shown their superiority in text processing. As can be seen from the table, the models using the BioBERT language model encoded contextual information better. This is because BioBERT learns the embedding of a given word from the words both before and after it in a text, whereas GloVe assigns each word a single vector that does not depend on the surrounding context.
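A simplified sentence-level BLEU can be computed as below (library implementations such as NLTK's additionally apply smoothing for short sentences; this sketch uses uniform weights over the n-gram orders):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity
    penalty — a simplified sentence-level BLEU without smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(count, r[g]) for g, count in c.items())
        precisions.append(overlap / max(sum(c.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = bleu("no acute disease seen", "no acute disease seen")  # perfect match
```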

Conclusions
In this paper, an attempt has been made to introduce a general method for improving the performance of the CNN-RNN architecture for automatic medical report generation. In this method, in addition to the features extracted from the image, the text related to the most similar image in the training data set is also given as input to the decoder. In this way, the decoder acquires additional knowledge for text production, which helps it produce better reports.

Figures
Figure 1. A sample of a medical image and its report [19]
Figure 2. The architecture of an encoder-decoder model for report generation from medical images [5]
Figure 3. The architecture of the proposed model for report generation
Figure 4. A ResNeXt block
Figure 5. Skip connections in ResNet
Figure 6. An example of adding skip connections to a network architecture

Table 1. Comparison of different methods for the medical report generation task