In image processing, Tesseract OCR is a widely used open-source Optical Character Recognition engine that has gained popularity owing to its high accuracy and support for over 120 languages, including right-to-left scripts such as Hebrew and Arabic [5]. Although its accuracy falls short of more advanced AI-based OCR solutions, Tesseract remains a popular choice for text-recognition tasks because it is cost-effective and open source. It does have practical limitations: it may produce errors when the foreground and background of the image are not well separated, developing a custom solution on top of it can require significant time and resources, its file-format support is limited, and it cannot recognise handwriting. For skew handling, Tesseract offers two key functions: deskew and orientation prediction. Both rely on the Hough Line Transform to identify a line within the document. The pseudocode for this approach is as follows; an illustrative implementation follows the list.
- Convert the input image to grayscale.
- Create an accumulator array with dimensions based on the image size and the range of angles to consider.
- Perform the Hough Line Transform to find the lines in the image.
- Accumulate the votes for each candidate line in the accumulator array.
- Find the accumulator cell with the most votes.
- Derive the rotation angle from that cell.
- Rotate the input image by the calculated angle to obtain the rotated image.
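A minimal realisation of this pseudocode with OpenCV is sketched below. The function name deskew and the Canny and Hough threshold values are our own illustrative choices, not part of Tesseract's implementation; the sign convention assumes OpenCV's default image coordinate system.

```python
import cv2
import numpy as np

def deskew(image: np.ndarray) -> np.ndarray:
    """Estimate the dominant line angle with the Hough transform and rotate to undo it."""
    # Convert to grayscale and extract edges so that only strong contours vote.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)

    # Standard Hough Line Transform: each returned entry is (rho, theta).
    lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=200)
    if lines is None:
        return image  # no line found; leave the image untouched

    # OpenCV returns lines sorted by vote count, so take the strongest.
    theta = lines[0][0][1]
    # theta is the angle of the line's normal; a horizontal line has
    # theta = 90 degrees, so the skew is the deviation from that.
    angle = np.degrees(theta) - 90

    # Rotate the image about its centre by the estimated skew.
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(image, matrix, (w, h),
                          flags=cv2.INTER_LINEAR,
                          borderMode=cv2.BORDER_REPLICATE)
```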
The Hough Line Detection algorithm [6,7,8,9,10], used in the procedure above, involves the following steps:
1. Initialise the Hough accumulator array with a grid of cells of appropriate size, where each cell corresponds to a particular line in the image space.
2. For each point (x, y) in the input image that lies on an edge, do the following:
3. For each angle theta in the range over which lines are sought (e.g., 0 to 180 degrees), calculate the distance r from the origin to the line through (x, y) whose normal makes angle theta with the x-axis, using r = x·cos(theta) + y·sin(theta).
4. Increase the accumulator cell at position (r, theta) by 1.
5. After all edge points have been processed, the accumulator array contains peaks at the cells corresponding to the detected lines; these peaks are identified as cells whose values exceed a chosen threshold.
6. From the peaks, extract the (r, theta) values and convert them back into the (x, y) coordinates of the corresponding lines in image space (see the sketch after this list).
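The voting procedure in steps 1 to 6 can be written directly in a few lines of NumPy. The following is an illustrative sketch, assuming a binary edge map as input; the function name hough_lines, the one-degree angular resolution, and the default threshold are our own choices.

```python
import numpy as np

def hough_lines(edges: np.ndarray, threshold: int = 100):
    """Detect lines in a binary edge map; returns a list of (r, theta) pairs."""
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))      # maximum possible |r|
    thetas = np.deg2rad(np.arange(0, 180))   # one bin per degree
    # Step 1: accumulator indexed by (r + diag, theta bin), r in [-diag, diag].
    accumulator = np.zeros((2 * diag + 1, len(thetas)), dtype=np.int64)

    # Steps 2-4: every edge pixel votes for all lines passing through it.
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        # r = x*cos(theta) + y*sin(theta), one value per theta bin
        rs = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        accumulator[rs + diag, np.arange(len(thetas))] += 1

    # Steps 5-6: cells above the threshold are the detected lines.
    peaks = np.argwhere(accumulator > threshold)
    return [(r - diag, thetas[t]) for r, t in peaks]
```

Converting a peak back to image coordinates uses the same relation: the points of the detected line are exactly those (x, y) satisfying r = x·cos(theta) + y·sin(theta).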
Numerous research papers, including [11,12,13,14], have previously proposed and discussed this method. However, this widely accepted and commonly practised approach has a significant limitation: the input image must be skewed within a particular angle range; otherwise, it is deskewed in the opposite direction, producing an upside-down output. This issue has also been identified by Riaz et al. in their recent publication, "Efficient skew detection and correction in scanned document images through clustering of probabilistic Hough transforms" [10,13].
A recent study by Yang et al. [15] has proposed a novel deep-learning-based approach to the deskewing problem. The method recasts the task as a binary classification with two classes, Horizontal and Vertical: images whose skew angle lies between -180° and -135°, -45° and 45°, or 135° and 180° are labelled Horizontal, and all remaining images are labelled Vertical; the model then predicts the class of the input image. However, this solution is computationally expensive and challenging to implement: it requires a large training set (the authors used 80,000 images), and the region of interest must be masked in every image before training. Moreover, it does not resolve the upside-down problem: even when the model correctly outputs Horizontal, an upside-down input remains upside-down.
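To make the class assignment concrete, the labelling rule described above can be expressed as a short angle test. The function below is our own illustration of that rule, not code from [15]:

```python
def orientation_class(angle_deg: float) -> str:
    """Map a skew angle in degrees (within [-180, 180]) to the two classes
    used in the classification formulation of Yang et al."""
    horizontal_ranges = [(-180, -135), (-45, 45), (135, 180)]
    if any(lo <= angle_deg <= hi for lo, hi in horizontal_ranges):
        return "Horizontal"
    return "Vertical"
```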
Another common practice among engineers is to use Optical Character Recognition (OCR) to detect the text in a document and rotate the image according to the orientation of the detected text. However, this method is ineffective for several reasons. Firstly, many documents contain little machine-printed text, which makes text detection a challenging task for traditional solutions. Additionally, real-life documents are often scans or photocopies that classic OCR techniques struggle to process accurately. Our work involved processing photocopies of cheques and ID cards that occupied a random portion of an A4-sized page; in this scenario, applying OCR-based techniques would require either manually extracting the regions of interest or training a dedicated model for that task, both of which are expensive and cumbersome.
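In practice, this OCR-based approach is often implemented with Tesseract's orientation and script detection (OSD). A minimal sketch using the pytesseract bindings is shown below; the function name rotate_by_osd and the rotation sign convention are our own assumptions. As noted above, it fails when the page contains too little recognisable text for a confident estimate.

```python
import re
import pytesseract
from PIL import Image

def rotate_by_osd(path: str) -> Image.Image:
    """Rotate an image according to Tesseract's orientation and script detection."""
    image = Image.open(path)
    # image_to_osd reports, among other fields, the rotation in degrees
    # needed to bring the page upright; it raises an error when the page
    # contains too little text for a confident estimate.
    osd = pytesseract.image_to_osd(image)
    rotation = int(re.search(r"Rotate: (\d+)", osd).group(1))
    # PIL rotates counter-clockwise for positive angles, while Tesseract
    # reports a clockwise correction, hence the negation.
    return image.rotate(-rotation, expand=True) if rotation else image
```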
The image below shows an example of a document for which most existing methods would fail to correct the skew (Figure 2).