OCR-MRD: Performance Analysis of Different Optical Character Recognition Engines for Medical Report Digitization

In the modern era, the necessity of digitization is increasing rapidly day by day. The healthcare industry is working towards operating in a paperless environment. Digitizing medical lab records helps patients manage their medical data hassle-free. It may also prove beneficial for insurance companies in designing medical insurance policies that are patient-centric rather than generalized. Optical Character Recognition (OCR) technology has demonstrated its usefulness for such cases; thus, to identify the best possible solution for digitizing medical lab records, there is a need for an extensive comparative study of the different OCR techniques available for this purpose. It is observed that current research focuses mainly on image pre-processing techniques for OCR development; however, their effects on OCR performance, especially for medical report digitization, have not yet been studied. In this work, three OCR Engines, viz. Tesseract, EasyOCR and DocTR, and six pre-processing techniques, viz. image binarization, brightness transformations, gamma correction, sigmoid stretching, bilateral filtering and image sharpening, are surveyed in detail. In addition, an extensive comparative study of the performance of the OCR Engines under different combinations of the image pre-processing techniques, and their effect on OCR accuracy, is presented.


Introduction
The age of digitization has made it mandatory to have digital records to make the data easily available.
Philip et al. [1] observed that hospitals look after the digitization of medical records in order to reduce costs and improve accessibility to records. Medical records are extremely important from both the patient's and the doctor's perspective, as they serve as a basis for early diagnosis and for tracking the history of an ailment. Hospitals equipped with digitized records show better performance in terms of less time spent searching for records and more time for patient care, compared to hospitals with paper records [1]. In such a scenario, there is a need for hospitals to move towards a paperless environment, wherein a tool is required to convert the existing paper records into their digital form in a simple and efficient manner.
Usually, printed documents are first scanned and then saved in the form of images in order to store the information present in them. However, utilizing this information by only reading the text within images is quite a difficult task. Optical Character Recognition (OCR) is an active research area involving a number of attempts to develop automated systems that extract and process text from image files. The objective of OCR is to provide an editable digital format by converting any form of text from document images, such as handwritten and printed scanned text, for deeper and further processing [2]. OCR mainly focuses on the recognition of two types of characters, i.e. machine-printed and handwritten; their varied characteristics make them a challenging area of research [3]. OCR faces a number of challenges while performing digitization, such as variations in font size, font colour, background colour, image resolution, etc., which make it vulnerable to inaccuracies [2]. This makes the pre-processing of images a crucial step, wherein the characteristics of the images are enhanced. Pre-processing therefore needs to be carried out taking into consideration the characteristics of the OCR, as every OCR Engine responds differently to image characteristics, e.g. resolution, colour correction, contrast, etc.
Even though a number of pre-processing techniques have been developed in the past, these techniques depend upon the category of images and on which features should be extracted from them. Most of the available comparative studies focus mainly on a specific characteristic of the image from which data needs to be extracted; however, a study which compares various sets of tools and techniques based on their utilization is needed. Research in the digitization of medical lab reports is still one of the unexplored areas. Studies relating to the aftermath or advantages of medical data digitization are found quite abundantly, but investigations regarding optimal approaches to digitize data relating to specific medical domains are quite insufficient. In the proposed work, we focus on digitizing medical lab reports, which are a major resource in understanding patients' health charts and how their medical condition changes over a period of time in response to the treatment they are receiving.
The paper is organized as follows: Section 2 discusses OCR engines and various image pre-processing techniques. A literature review based on medical report digitization is presented in Section 3. Section 4 highlights the proposed methodology for medical report digitization. Results and discussion are found in Section 5; finally, the conclusion and future scope are presented in Section 6.

OCR Engines
Here, three open-source OCR Engines are described and their technical details are presented as follows.

Tesseract OCR
An open-source text recognition (OCR) Engine, available freely under the Apache 2.0 license. It can be run both from the command line and through a GUI interface (using third-party compatibility) [18]. It provides a fully-featured API and can be compiled for a variety of targets, including Android and the iPhone. It is also available for the Linux, Windows and Mac OS X platforms. The development of Tesseract is focused on line finding, feature/classification methods, and the adaptive classifier for achieving the best accuracy [9].

EasyOCR
It is a ready-to-use, open-source OCR with support for 80+ languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc. EasyOCR is adaptable to any state-of-the-art model plug-in [5]. The flow of the development of EasyOCR is shown in Fig. 1 [19].

Pre-processing

Image Binarization
Image binarization is the process of converting a document image from a 3D pixel-array format into a 2D format, i.e. into a bi-level document image. The image pixels are separated into a dual collection of pixels, i.e. black and white. The main goal of image binarization is the segmentation of the document into foreground text and background.
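As an illustrative sketch (not the exact pipeline used in this work), global binarization with a fixed threshold can be expressed in a few lines of NumPy; the threshold value 128 below is an arbitrary choice:

```python
import numpy as np

def binarize(gray, threshold=128):
    """Convert a 2D grayscale image into a bi-level (black/white) image.

    Pixels at or above the threshold become 255 (background),
    the rest become 0 (foreground text).
    """
    gray = np.asarray(gray)
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)

# A small grayscale patch: dark "text" pixels vs. light background.
patch = np.array([[10, 200, 30],
                  [250, 40, 180]])
print(binarize(patch))
```

In practice, adaptive methods (e.g. Otsu's threshold) are preferred over a fixed global threshold for scanned documents with uneven illumination.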

Brightness transformation
Here, the brightness transformation is given by the following equation:

g(x, y) = α · f(x, y) + β

where the parameters α (gain) and β (bias) control the contrast and brightness, respectively. The image brightness and contrast vary as α and β are changed.
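A minimal NumPy sketch of this linear transform, with clipping to the valid 8-bit range (the gain and bias values in the example are illustrative, not taken from the study):

```python
import numpy as np

def adjust_brightness_contrast(img, alpha=1.0, beta=0):
    """Apply g(x, y) = alpha * f(x, y) + beta, clipped to [0, 255].

    alpha (gain) scales contrast; beta (bias) shifts brightness.
    """
    out = alpha * np.asarray(img, dtype=np.float64) + beta
    return np.clip(out, 0, 255).astype(np.uint8)

patch = np.array([[0, 100, 200]])
# Values are scaled by 1.5, shifted by 10 and clipped at 255.
print(adjust_brightness_contrast(patch, alpha=1.5, beta=10))
```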

Gamma correction or Power Law Transform
Gamma correction carries out a non-linear operation on the source image pixels and can alter the saturation of the image. Gamma correction is defined by the following power-law expression:

V_out = A · V_in^γ

where the non-negative real input value V_in is raised to the power γ and multiplied by the constant A to obtain the output value V_out. In the common case of A = 1, inputs and outputs typically range from 0 to 1.
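A sketch of the power-law transform for 8-bit images, normalizing to [0, 1] before applying the exponent (a common convention; the paper does not specify its normalization):

```python
import numpy as np

def gamma_correct(img, gamma=1.0, A=1.0):
    """Power-law transform V_out = A * V_in ** gamma on 8-bit images.

    The image is normalized to [0, 1] before the transform and
    rescaled back to [0, 255] afterwards.
    """
    v_in = np.asarray(img, dtype=np.float64) / 255.0
    v_out = A * np.power(v_in, gamma)
    return np.clip(v_out * 255.0, 0, 255).astype(np.uint8)

# gamma < 1 brightens mid-tones, gamma > 1 darkens them.
patch = np.array([[0, 64, 255]])
print(gamma_correct(patch, gamma=0.5))
```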

Sigmoid Stretching
The sigmoid function is a continuous nonlinear activation function, which statisticians call the logistic function. Applied to contrast stretching, the enhanced pixel value is given by:

g(x, y) = 1 / (1 + exp(c · (th − fs(x, y))))

where g(x, y) is the enhanced pixel value, c is the contrast factor, th is the threshold value and fs(x, y) is the original image. The amount of lightening and darkening can be varied to control the overall contrast enhancement by adjusting the contrast factor c and the threshold value th [20].
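One common form of sigmoid stretching can be sketched as follows (the normalization to [0, 1] and the default values of c and th are assumptions for illustration, not parameters reported in this study):

```python
import numpy as np

def sigmoid_stretch(img, c=10.0, th=0.5):
    """Sigmoid contrast stretching on an 8-bit grayscale image.

    g = 1 / (1 + exp(c * (th - f))) with f normalized to [0, 1];
    pixels above the threshold th are pushed towards white,
    pixels below it towards black.
    """
    f = np.asarray(img, dtype=np.float64) / 255.0
    g = 1.0 / (1.0 + np.exp(c * (th - f)))
    return np.clip(g * 255.0, 0, 255).astype(np.uint8)

patch = np.array([[30, 128, 220]])
# Dark pixels get darker, bright pixels brighter, mid-tones stay put.
print(sigmoid_stretch(patch))
```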

Image Smoothing (Bilateral Filtering)
Image blurring is achieved by convolving the image with a low-pass filter kernel, which helps in the removal of noise. Here, a bilateral filter is a Gaussian filter in the spatial domain combined with a Gaussian weighting of intensity differences, and it is highly effective in noise removal while keeping edges sharp.
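A naive pure-NumPy sketch of the idea: each output pixel is a weighted average whose weights fall off both with spatial distance and with intensity difference, so smooth regions are blurred while strong edges survive. (Production code would use an optimized implementation such as OpenCV's `cv2.bilateralFilter`; the sigma values below are illustrative.)

```python
import numpy as np

def bilateral_filter(img, radius=2, sigma_space=2.0, sigma_range=30.0):
    """Naive bilateral filter: Gaussian in space AND in intensity."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    pad = np.pad(img, radius, mode="edge")
    out = np.empty_like(img)
    # Precompute the spatial Gaussian kernel once.
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    spatial = np.exp(-(xx**2 + yy**2) / (2 * sigma_space**2))
    for i in range(h):
        for j in range(w):
            window = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # Range weight: penalize neighbours with different intensity.
            rng = np.exp(-((window - img[i, j]) ** 2) / (2 * sigma_range**2))
            weights = spatial * rng
            out[i, j] = np.sum(weights * window) / np.sum(weights)
    return out
```

On a sharp black-to-white step, pixels on either side of the edge keep their original values because neighbours across the edge receive near-zero range weight.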

Image Sharpening (Kernel Method)
A high-pass filter is used to emphasize the fine details in the image to enhance its sharpness. It is the opposite of a low-pass filter and uses a different convolution kernel.
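A sketch of kernel-based sharpening with a commonly used 3×3 kernel (the kernel choice is illustrative; the paper does not specify one). The centre weight of 5 minus four neighbour weights of 1 adds the high-pass component back onto the original image:

```python
import numpy as np

# Centre-minus-neighbours high-pass component added to the original.
SHARPEN_KERNEL = np.array([[ 0, -1,  0],
                           [-1,  5, -1],
                           [ 0, -1,  0]], dtype=np.float64)

def sharpen(img, kernel=SHARPEN_KERNEL):
    """Convolve a 2D grayscale image with a sharpening kernel."""
    img = np.asarray(img, dtype=np.float64)
    r = kernel.shape[0] // 2
    pad = np.pad(img, r, mode="edge")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            window = pad[i:i + 2 * r + 1, j:j + 2 * r + 1]
            out[i, j] = np.sum(window * kernel)
    return np.clip(out, 0, 255).astype(np.uint8)
```

Because the kernel weights sum to 1, flat regions pass through unchanged while intensity steps are exaggerated on both sides of an edge.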

Literature Review
It is noted from the literature review that research in the digitization of medical lab reports is still one of the unexplored areas. Most of the available comparative studies focus on a specific area (such as image smoothing) in the process of digitizing printed or handwritten data; however, a collective study for a complete process model is needed. Studies relating to the advantages of medical data digitization are quite abundant, but studies regarding optimal approaches to digitize data relating to specific medical domains are quite insufficient. Scott et al. [1] discussed how hospitals using digitized records show a decrease in the time to search for records and an increase in the time for direct patient care, compared to hospitals with paper records. Suter et al. [2] projected the importance of big data analytics in the digitization of medical data to enable value-based healthcare. To further understand the nuances of OCR techniques, [4] and [7] are helpful in understanding all phases of OCR and the problems faced during text recognition [3]. Mello et al. [8] detailed the best parameter values for digitization (resolution, brightness, contrast, number of colors, etc.), offering an overview of the impact of changing parameters and methods on OCR accuracy.
A novel pre-processing method is presented in [11] for scanned documents, wherein skew is detected and corrected. The proposed method minimizes the area of interest, is independent of content and has no limitation on the skew angle. Subsequently, an attentive generative network is employed to solve the problem of image denoising by embedding visual attention [12]. Due to visual attention, the noisy region is attended to, forming a balance between noise removal and texture preservation. In [15], a detailed study on improving the output quality of OCR systems is proposed. Moreover, it is very vital to digitize medical information [16] to offer suitable techniques to process medical data under different disease conditions. Table 1 presents the detailed findings from the literature review.

Method And Materials
It is seen that every OCR technique responds differently to the pre-processing techniques applied to the input image. Herein, we devise an approach that performs a combinatorial and permutation analysis over the set of pre-processing techniques and outputs the most valid sets with the best accuracy. The approach is divided into the following sub-parts.
Lab Reports (PNG)
The input consists of lab reports in PNG image format. A collection of lab reports is considered for performing this comparative study.

Image Pre-processing Techniques
These are the standalone pre-processing techniques on which the combinatorial and permutation study is performed. They are then fed to the optimizer for further calculations.

Optimizer
The optimizer receives the pre-processing techniques as input and uses the three OCR Engines. It then calculates the accuracy of text extraction after applying each pre-processing technique separately on all OCR engines. Techniques whose accuracy is greater than the threshold accuracy (the average of all accuracies) are supplied to the 'Combination Generator' for further analysis; techniques performing below the threshold are discarded.
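As a sketch of this filtering rule (the per-technique accuracy values below are made up for illustration and are not results from the study):

```python
def select_techniques(accuracies):
    """Keep only pre-processing techniques whose accuracy exceeds
    the threshold, defined as the average of all accuracies."""
    threshold = sum(accuracies.values()) / len(accuracies)
    return {name: acc for name, acc in accuracies.items() if acc > threshold}

# Hypothetical per-technique accuracies for one OCR engine.
scores = {"binarization": 0.91, "gamma": 0.88, "sigmoid": 0.79,
          "bilateral": 0.85, "sharpen": 0.72}
print(select_techniques(scores))
```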

Combination Generator
The Combination Generator receives its input from the Optimizer. Here, 2^n − 1 non-empty combinations are generated from the input set, where n is the number of pre-processing techniques. For each combination, k! permutations are generated, where k is the length of the combination. For each permutation, the pre-processing techniques are applied to the image in the order specified by the permutation. The image is then fed to each OCR and the accuracy of the text extraction is calculated. A dictionary is maintained with the permutation as key and the accuracy as value, which is then utilized to produce the final results. Figure 2 shows an example of how the combination generator receives its input and generates the various combinations.
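For n techniques there are 2^n − 1 non-empty combinations, and each combination of length k has k! orderings. A sketch of the enumeration using `itertools` (technique names are placeholders):

```python
from itertools import combinations, permutations

def ordered_pipelines(techniques):
    """Yield every ordering of every non-empty subset of techniques.

    2**n - 1 combinations; each combination of length k yields
    k! permutations, i.e. candidate pre-processing pipelines.
    """
    for k in range(1, len(techniques) + 1):
        for combo in combinations(techniques, k):
            for perm in permutations(combo):
                yield perm

pipelines = list(ordered_pipelines(["binarize", "gamma", "sharpen"]))
# 3 singletons + 3 pairs * 2 orderings + 6 orderings of the full set = 15.
print(len(pipelines))
```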

Accuracy Calculator
To calculate accuracy, the ground-truth text is compared with the text extracted from the lab reports. The scoring is based on the proximity of each extracted word to the original word, computed using the Levenshtein distance. The scoring scheme is explained in Table 2.

Optimized Results
Using the dictionaries obtained from the 'Combination Generator', the results are computed based on the following parameters:
1. Accuracy without pre-processing vs. accuracy of the best combination of pre-processing.
2. Most suitable pre-processing techniques for each OCR.
3. Comparison of the execution time of the OCR Engines.
4. Top five combinations of pre-processing for the best-performing OCR.
Here, Fig. 3 shows how the aforementioned sub-parts form the flow of the proposed methodology.
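The Levenshtein-based word scoring used by the Accuracy Calculator can be sketched as follows; the exact scoring bands are those of Table 2, so the length-normalized proximity score here is an illustrative assumption:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def word_score(extracted, truth):
    """Proximity of an extracted word to the ground-truth word in [0, 1]."""
    if not truth and not extracted:
        return 1.0
    return 1.0 - levenshtein(extracted, truth) / max(len(extracted), len(truth))

print(levenshtein("glucose", "glucos3"))  # 1
print(word_score("glucos3", "glucose"))   # 1 - 1/7, roughly 0.857
```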

Results And Discussion
For the proposed methodology, we conducted a detailed comparative analysis of the OCR engines and the combinations of pre-processing techniques on a dataset of 39 lab-report images. The system configuration for the experimentation was an Intel i5 8th-generation CPU with 16 GB DDR4 RAM and a 4 GB Nvidia 1050 GPU. Fig. 4 presents a comparison of the text extraction accuracy before any pre-processing and after applying the best-identified combination of pre-processing techniques. Based on this comparison, the average text extraction accuracy improves by 2.88%, reaching 91.9% after applying the pre-processing techniques. Table 3 shows the most favorable pre-processing techniques for each OCR. In Fig. 5, the top 5 combinations of pre-processing techniques for Tesseract OCR are presented, and Table 4 lists the corresponding techniques. Table 5 depicts the overall comparison between the three OCRs based on average accuracy without pre-processing, average accuracy with pre-processing, and execution time (per image). Over the whole dataset, it is found that 'Tesseract-OCR' is 87% faster than 'Easy-OCR' and 89% faster than 'docTR-OCR'; 'Easy-OCR' is the second-fastest engine, being 12% faster than 'docTR-OCR'. Table 6 presents the compatibility of each pre-processing method with each OCR.

Conclusion And Future Work
In this paper, we presented a detailed study indicating a noticeable increase in text extraction accuracy for all the OCR Engines when the right set of pre-processing techniques is applied before performing the text extraction. We also identified the favorable pre-processing techniques for each OCR engine and the most optimal combinations amongst them to achieve the best text extraction accuracy. This work provides useful insights for documents with a structure similar to that of a medical lab report, where an appropriate pre-processing technique can be chosen for optimal results, and it can support further research in the field of text extraction, the design of improved pre-processing techniques, and the optimization of OCR engines for lab report text recognition.
Due to the limited availability of processing power, experimentation was not performed on a larger dataset. Experiments on better computing infrastructure with a larger and more diverse dataset may yield better results and allow a broader comparison of accuracies across combinations and OCR engines. Currently, our dataset consists of lab reports in PNG format, which can be extended to other formats. Moreover, beyond pre-processing combinations, future research could focus on identifying which factors within a combination affect the accuracy of a given OCR engine, provided detailed insights are available.

Declarations

Data availability
The data that support the findings of this study are available from the corresponding author, [Jitendra Tembhurne], upon reasonable request.

Figure: Proposed system for identification of medical data.
Figure: Accuracy before pre-processing vs. accuracy after pre-processing.

docTR
An open-source OCR engine available under the Apache 2.0 license. docTR is powered by TensorFlow 2 and PyTorch. Text detection and text recognition are performed using the following techniques [6].
Text Detection:
- DBNet: real-time scene text detection with differentiable binarization.
- LinkNet: exploiting encoder representations for efficient semantic segmentation.
Text Recognition:
- CRNN: a combination of two of the most prominent neural networks, a CNN (Convolutional Neural Network) followed by an RNN (Recurrent Neural Network), for image-based sequence recognition applied to scene text recognition.
- SAR (Show, Attend and Read): a simple and strong baseline for irregular text recognition.
- MASTER: a multi-aspect non-local network for scene text recognition.
