2.1 OCR Engines
Three open-source OCR engines are described below, together with their technical details:
Tesseract OCR
Tesseract is an open-source optical character recognition (OCR) engine, freely available under the Apache 2.0 license. It can be run from the command line or through a graphical interface provided by third-party front ends [18]. It offers a fully featured API and can be compiled for a variety of targets, including Android and the iPhone. It is also available for Linux, Windows and Mac OS X. The development of Tesseract has focused on line finding, features/classification methods, and the adaptive classifier in order to achieve the best accuracy [9].
EasyOCR
EasyOCR is a ready-to-use, open-source OCR engine with support for 80+ languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari and Cyrillic. It is designed so that any state-of-the-art model can be plugged in [5]. Fig. 1 shows the development flow of EasyOCR [19].
docTR
docTR is an open-source OCR engine available under the Apache 2.0 license. It is powered by TensorFlow 2 and PyTorch. Text detection and text recognition are performed using the following techniques [6].
Text Detection:
Text Recognition:
- CRNN: A Convolutional Neural Network (CNN) followed by a Recurrent Neural Network (RNN), combining two of the most prominent network types into a single model for image-based sequence recognition, applied here to scene text recognition.
- SAR: Show, Attend and Read, a simple and strong baseline for irregular text recognition.
- MASTER: Multi-Aspect Non-local Network for Scene Text Recognition.
2.2 Pre-processing
Image Binarization
Image binarization is the process of converting a document image from a three-dimensional pixel array (height, width and colour channels) into a two-dimensional, bi-level image. Each pixel is assigned to one of two classes, black or white. The main goal of image binarization is to segment the document into foreground text and background.
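As a minimal, dependency-free sketch of this step, the snippet below collapses a small RGB image to grayscale and then applies a single global threshold. The function names, the luminance weights and the threshold of 128 are assumptions made for illustration; the text above does not specify which thresholding method is used.

```python
def to_grayscale(rgb_image):
    """Collapse an H x W x 3 image (nested lists, values 0-255) to H x W."""
    # Assumed luminance weights (ITU-R BT.601); not taken from the text above.
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def binarize(gray_image, threshold=128):
    """Map every pixel to 0 (black) or 255 (white) using a global threshold."""
    return [[255 if p >= threshold else 0 for p in row] for row in gray_image]


# Hypothetical 2 x 2 colour image: light background, dark "ink".
rgb = [
    [(250, 250, 250), (20, 20, 20)],
    [(30, 30, 30), (240, 240, 240)],
]
binary = binarize(to_grayscale(rgb))
```

The result is a two-dimensional array containing only the values 0 and 255, which is the bi-level form described above.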
Brightness transformation
Brightness and contrast are adjusted with the following linear transformation of the pixel values:
$$g\left(x\right)=\alpha f\left(x\right)+\beta ,\quad \text{where } \alpha >0$$
Here f(x) is the source pixel value and g(x) is the output pixel value. The parameters \(\alpha\) (gain) and β (bias) control contrast and brightness, respectively, so the image contrast and brightness vary as \(\alpha\) and β are changed.
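The transformation can be sketched in a few lines of plain Python. The function name is hypothetical, and clipping the result to the 8-bit range 0 to 255 is an assumption that is not stated in the text.

```python
def adjust_brightness_contrast(image, alpha=1.0, beta=0.0):
    """Apply g(x) = alpha * f(x) + beta to every pixel of a grayscale image."""
    if alpha <= 0:
        raise ValueError("alpha (gain) must be positive")
    # Assumption: 8-bit output, so results are rounded and clipped to 0-255.
    return [
        [min(255, max(0, round(alpha * p + beta))) for p in row]
        for row in image
    ]


image = [[0, 100], [200, 255]]
result = adjust_brightness_contrast(image, alpha=1.5, beta=10)
# result == [[10, 160], [255, 255]]
```

With a gain above 1 the differences between pixel values are stretched, while the bias shifts every pixel by the same amount; values that would exceed 255 saturate.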
Gamma correction or Power Law Transform
Gamma correction applies a non-linear operation to the source image pixels and can alter the saturation of the image. It is defined by the following power-law expression:
$${V}_{out}=A{V}_{in}^{\gamma }$$
where the non-negative real input value Vin is raised to the power γ and multiplied by the constant A to give the output value Vout. In the common case of A = 1, inputs and outputs typically range from 0 to 1.
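A small sketch of the power-law transform for 8-bit grayscale values follows. Normalising to the range 0 to 1 before applying the power and rescaling afterwards is an assumption consistent with the A = 1 case above; the function name is hypothetical.

```python
def gamma_correct(image, gamma, A=1.0):
    """Apply V_out = A * V_in ** gamma to pixels normalised to [0, 1]."""
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    out = []
    for row in image:
        new_row = []
        for p in row:
            v_in = p / 255                     # normalise to [0, 1]
            v_out = A * v_in ** gamma          # power-law transform
            new_row.append(min(255, max(0, round(255 * v_out))))
        out.append(new_row)
    return out


image = [[0, 64], [128, 255]]
brighter = gamma_correct(image, gamma=0.5)   # gamma < 1 lifts mid-tones
darker = gamma_correct(image, gamma=2.0)     # gamma > 1 lowers mid-tones
```

With A = 1, black and white are left unchanged for any positive γ, and only the intermediate values move.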
Sigmoid Stretching
The sigmoid function is a continuous non-linear function that is widely used as an activation function; statisticians call it the logistic function.
$$g\left(x,y\right)=\frac{1}{1+{e}^{c\left(th-{f}_{s}\left(x,y\right)\right)}}$$
Here, g(x, y) is the enhanced pixel value, c is the contrast factor, th is the threshold value and fs(x, y) is the original image.
Adjusting the contrast factor c and the threshold value th varies the amount of lightening and darkening, and thereby controls the overall contrast enhancement [20].
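The following sketch applies the sigmoid formula pixel by pixel. It assumes intensities are normalised to the range 0 to 1 so that th can be given on the same scale; the function name and the example values c = 10 and th = 0.5 are illustrative choices, not values from the text.

```python
import math


def sigmoid_stretch(image, c, th):
    """Apply g = 1 / (1 + exp(c * (th - f))) to an 8-bit grayscale image."""
    out = []
    for row in image:
        new_row = []
        for p in row:
            f = p / 255                                   # assumed normalisation
            g = 1.0 / (1.0 + math.exp(c * (th - f)))      # sigmoid stretching
            new_row.append(round(255 * g))
        out.append(new_row)
    return out


image = [[0, 64], [191, 255]]
stretched = sigmoid_stretch(image, c=10, th=0.5)
```

Pixels below the threshold are pushed towards black and pixels above it towards white; a larger c makes the transition around th steeper.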
Image Smoothing (Bilateral Filtering)
Image blurring is achieved by convolving the image with a low-pass filter kernel, which helps to remove noise. A bilateral filter combines a Gaussian filter in space with a second Gaussian weight on the intensity difference between pixels, which makes it highly effective at removing noise while keeping edges sharp.
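To illustrate why a bilateral filter preserves edges, here is a naive, unoptimised sketch in plain Python. Each neighbour is weighted by a spatial Gaussian and by a Gaussian on its intensity difference from the centre pixel. The function name, parameter names and default values are hypothetical, and border pixels outside the image are simply ignored, which is one of several possible border policies.

```python
import math


def bilateral_filter(image, radius=1, sigma_space=1.0, sigma_range=25.0):
    """Naive bilateral filter for a grayscale image stored as nested lists."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            centre = image[y][x]
            weight_sum = 0.0
            value_sum = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < h and 0 <= nx < w):
                        continue  # skip neighbours outside the image
                    neighbour = image[ny][nx]
                    # Gaussian weight on spatial distance.
                    spatial = math.exp(-(dx * dx + dy * dy) / (2 * sigma_space ** 2))
                    # Gaussian weight on intensity difference.
                    tonal = math.exp(-((neighbour - centre) ** 2) / (2 * sigma_range ** 2))
                    weight_sum += spatial * tonal
                    value_sum += spatial * tonal * neighbour
            out[y][x] = round(value_sum / weight_sum)
    return out


# A vertical edge (10 | 200) with one noisy pixel (30) on the dark side.
image = [
    [10, 10, 200, 200],
    [10, 30, 200, 200],
    [10, 10, 200, 200],
]
smoothed = bilateral_filter(image)
```

The noisy pixel is pulled towards its dark neighbours, while pixels on the bright side of the edge are left essentially untouched because neighbours across the edge receive a near-zero intensity weight.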
Image Sharpening (Kernel Method)
A high-pass filter emphasizes the fine details in an image and thereby enhances its sharpness. It has the opposite effect of a low-pass filter and uses a different convolution kernel:
$$Kernel=\left[\begin{array}{ccc}0& -1& 0\\ -1& 5& -1\\ 0& -1& 0\end{array}\right]$$
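As a sketch of how this kernel acts on an image, the snippet below slides it over a small grayscale image. The helper name is hypothetical; replicating border pixels and clipping the output to 0 to 255 are assumptions. Because the kernel is symmetric, correlation and convolution give the same result here.

```python
SHARPEN_KERNEL = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]


def convolve3x3(image, kernel):
    """Apply a 3 x 3 kernel to a grayscale image stored as nested lists."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    # Assumed border policy: replicate the nearest edge pixel.
                    sy = min(h - 1, max(0, y + ky - 1))
                    sx = min(w - 1, max(0, x + kx - 1))
                    acc += kernel[ky][kx] * image[sy][sx]
            out[y][x] = min(255, max(0, acc))  # clip to the 8-bit range
    return out


# A soft vertical edge between 100 and 150.
image = [[100, 100, 150, 150] for _ in range(3)]
sharpened = convolve3x3(image, SHARPEN_KERNEL)
# Each row becomes [100, 50, 200, 150]: the step at the edge is exaggerated.
```

The kernel coefficients sum to 1, so flat regions keep their value, while the negative neighbours amplify local differences on either side of an edge.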