Integrated document segmentation and region identification: textual, equation and graphical

With the advance of digitization, storing information as scanned copies and images has become the new normal. This creates the need for systems that can accurately extract information from scanned documents or images with respect to every component they may contain, such as textual and graphical components. The first step in extracting document information is to segment the document layout: divide the document into textual and non-textual regions of interest. Various studies have addressed document layout segmentation, and this study observes that the majority of them face one common challenge, i.e., accurate segmentation of graphical components with sparsely clustered pixels, such as flowcharts, block diagrams, etc. This study addresses the challenge with a two-tier feedback-based framework. The first tier segments and classifies the textual and mathematical equation components, while the second tier segments and classifies the graphical regions using feedback information from the first tier. The feedback provided by the first tier is the regional information of the equation and textual components, which is used to derive a modified copy of the original input document image in which most of the foreground pixels belong to graphical regions. The proposed framework outperforms various existing studies when evaluated against multiple data sets.


Introduction
With the increasing availability of digital image documents on digital platforms, digital image document analysis is becoming an important research problem for applications such as information retrieval, document classification, OCR, etc. Document segmentation and region classification [1,2] are some of the core issues for a document image analysis task. While digital image document analysis needs segmentation of both textual and non-textual regions, the majority of the studies focus on identifying individual regions of interest. For instance, the studies [3][4][5][6] focus on identifying only textual regions, whereas the studies [7,8] focus only on identifying non-textual regions. On the other hand, a few studies [6,9,10] have attempted to identify both the textual and non-textual regions jointly with a single model. From the brief review and empirical studies, the following two points are noted: (i) most of the integrated models adopt a single model to segment different class regions (textual or non-textual) of interest and (ii) most of the integrated models fail to identify non-textual regions effectively when the majority of the pixels in the regions are sparsely clustered. A pixel can be said to be sparsely clustered if it is distributed in such a way that its neighboring pixel can be found only through 4-neighbor connectivity; it can be said to be tightly clustered if its neighboring pixel can be found through 4-neighbor connectivity, diagonal connectivity, or 8-neighbor connectivity. Figure 1 compares documents that have non-textual regions with sparsely clustered pixels and with tightly clustered pixels. The performances of three existing methods (Tran et al. [9], Umer et al. [6], and Wu et al. [10]) over document samples that have non-textual regions with sparsely clustered pixels are shown in Fig. 2. The figure shows that all of the methods correctly segment and identify textual regions but fail to correctly segment non-textual regions. It may also be noted that many of the non-textual regions in scientific documents are made up of sparsely clustered pixels. Motivated by the above observations, this study proposes a two-tier feedback-based end-to-end integrated framework which jointly segments and identifies regions of interest accurately. The regions of interest are classified into three major classes: textual, equation, and graphical. The regions that are considered "graphics" may vary among studies, but this study considers all regions other than equation and textual regions as "graphics" (such as natural scene images, data visualizations, flowcharts, etc.). The first tier focuses on segmenting and identifying textual and equation regions, while the second tier segments and identifies the graphical regions.
The use of two tiers primarily helps the model provide better segmentation performance, as it employs different segmentation models for different regions of interest. However, to address the issue of inaccurate segmentation of non-textual (graphical, in our study) regions with sparsely clustered pixels, this study makes the two tiers asynchronous. The asynchrony is designed in such a way that the second tier depends on the first tier: the first tier provides regional information (textual and equation) to the second tier so that the input document images can be masked to have the majority of their foreground pixels belonging to graphical regions. This approach of having two asynchronous, feedback-based tiers helps the segmentation models learn the characteristics of the different regions of interest independently.
From the various experiments over five data sets (four publicly available and one locally generated), it is observed that the proposed framework outperforms various existing methods. This study is further extended to four different use cases: meta-file generation, document recreation, structure-preserving document transliteration, and structure-preserving document translation. The major contributions of this study may be noted as follows:

• Integrated document segmentation model: a two-tier feedback-based end-to-end integrated model is proposed herein, which can effectively segment and identify non-textual components with sparsely clustered pixels.
• Case study: extension of the proposed framework to four use cases, namely, meta-file generation, document recreation, structure-preserving document transliteration, and structure-preserving document translation.
• Comparative evaluation of the proposed framework with three recently proposed existing integrated methods.

Related studies
In the early stages of document analysis systems, text/non-text separation plays a vital role. Since the early 1980s, a significant number of published articles have dealt with the challenge of developing effective methods that are robust to varying document styles [6,11-13]. Based on the classification methods used by existing studies, the journey of document analysis systems can be divided into two phases, viz. the pre-deep neural network phase (1980s-2013) and the deep neural network phase (since 2014).
During the pre-deep neural network phase, most of the studies used two methods for text/non-text separation: region classification-based and connected component (CC) classification-based. A region classification-based method decides whether a segmented region is text or not, while a CC classification-based method does the same with CCs. Region segmentation is generally performed using a top-down [12-14], bottom-up [15,16], or hybrid [17,18] approach. In top-down approaches, segmentation proceeds from a coarse level to a finer level, where large homogeneous regions are first extracted and then refined. Conversely, in bottom-up approaches, processing starts with merging pixels into characters, characters into words, words into text lines, and finally text lines into regions, using local information. In CC classification-based methods, CCs are extracted from the input document image, and each CC is classified as text or non-text. In the classification process of both methods, various features are exploited, including entropy [19-21], homogeneity [19], energy [19,21], mathematical morphology [19], and white tile-based texture [22,23]. Study [24] is a CC-based method that introduces the concept of probabilistic local text homogeneity to effectively classify CCs as text or non-text. On the other hand, Studies [25] and [26] propose region-based methods. In Study [25], a novel hybrid approach combines the minimum homogeneity algorithm with connected component analysis and a multilevel homogeneity structure. This method classifies text and non-text elements, refines regions, and detects noise. Study [26] introduces a robust system that utilizes a multilevel-homogeneity structure within a hybrid methodology. The proposed system includes an efficient algorithm for table region detection, making it suitable for a variety of document languages. Similarly, Study [27] introduces an efficient region-based method for classifying text and non-text components using a combination of whitespace analysis and multi-layer homogeneous regions. Study [28] proposes a region-based method utilizing a Rotation Invariant Local Binary Pattern (RILBP)-based texture feature and an Artificial Neural Network classifier (multi-layer perceptron) to separate text objects from non-text objects in handwritten document images.
During the deep learning era, most of the existing studies have exploited various CNN-based models. Using CNN-based models, various studies focus not only on the segmentation of the document but also on the classification of the document as a whole [29] and on word clustering [30]. Study [29] learns hierarchical features directly from normalized image pixels and classifies document images into ten classes such as ads, news, reports, emails, application forms, etc. Based on the theory that low-dimensional features contain more detailed information while high-dimensional features contain more semantic information, Study [10] proposed an end-to-end united network named the Dynamic Residual Fusion Network. Study [6] proposed a deep CNN model which divides the input document into patches of multiple sizes to handle various font sizes. It follows three phases: preprocessing, text/non-text classification, and post-processing. Some studies proposed deep learning approaches that focus only on the detection of specific graphical components, such as tables [31,32], formulas [33], etc. In most of these studies, document segmentation is carried out by a single encoder-decoder segmentation model. Study [9] proposed a model for text segmentation and non-text classification in document images using a deep learning UNet segmentation model. It employs one segmentation model each for textual and non-textual components, which perform segmentation synchronously in a parallel fashion.
The proposed two-tier feedback-based framework can be considered a hybrid of two methods, namely, deep learning and rule-based. Both tiers perform document segmentation and classification using CNN-based models, followed by rule-based post-processing methods. The proposed framework differs from most of the existing methods in two aspects: (i) the employment of two asynchronous tiers to segment and classify different regions of interest (textual and equation in the first tier and graphical in the second tier) and (ii) the masking of the input document images (in the second tier) using the feedback information provided by the first tier. With these two aspects, the proposed framework addresses the issue of inaccurate segmentation of graphical regions reported in various existing studies.

Proposed framework
Considering the need to segment regions of two contrasting natures, namely, textual (or equation) and non-textual (graphical), which may be composed of regions with varying degrees of pixel distribution, our proposed method uses a two-tier approach to handle them separately. The first tier focuses on segmenting textual and equation regions, whereas the second tier focuses on segmenting graphical regions. The two tiers are connected by regional information (co-ordinates of the textual and equation regions), which flows from the first tier to the second tier to properly identify graphical regions of interest that may have sparsely clustered pixel distributions. The proposed model is explained in detail in this section, divided into two parts: (i) textual region segmentation and identification and (ii) graphical region segmentation and identification. A high-level sketch of the overall pipeline is given below.
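As a structural illustration (not the authors' code), the following Python sketch shows how the two tiers compose; the model and helper routines (etsm, etnc, gsm, gc, get_coordinates, pruning, image_masking, detailed in the following subsections) are assumed to be available as callables, and all names are hypothetical.

```python
def two_tier_segmentation(doc_img, etsm, etnc, gsm, gc,
                          get_coordinates, pruning, image_masking):
    """Illustrative composition of the two-tier feedback pipeline.

    All model/helper arguments are assumed callables; this is a sketch
    of the data flow, not the authors' implementation.
    """
    h, w = doc_img.shape[:2]
    # Tier 1: segment, classify, and prune textual/equation regions.
    mask = etsm(doc_img)
    coords = get_coordinates(mask, doc_img.shape)
    classes = [etnc(doc_img[y1:y2, x1:x2]) for (x1, y1, x2, y2) in coords]
    text_cd, eqn_cd = pruning(classes, coords, h, w)
    # Feedback: erase tier-1 regions so the remaining foreground is graphical.
    graph_img = image_masking(doc_img, text_cd, eqn_cd)
    # Tier 2: segment and classify graphical regions on the masked image.
    g_coords = get_coordinates(gsm(graph_img), doc_img.shape)
    graph_cd = [c for c in g_coords
                if gc(graph_img[c[1]:c[3], c[0]:c[2]]) == "Graphical"]
    return text_cd, eqn_cd, graph_cd
```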

Textual region segmentation and identification
Given an image, the goal of this module (the first tier) is to segment and identify both textual and mathematical equation regions. Figure 3 shows the schematic diagram of this module, consisting of three sub-steps: (i) region segmentation, (ii) region classification, and (iii) pruning of false textual regions.

Region segmentation
Given an image, the first task is to develop a textual region segmentation model, which we call the Equation and Textual Segmentation Model (ETSM). We consider UNet [34], a well-known image segmentation model (with VGG-16 [35] as the backbone of the encoder), as the model to segment textual and equation regions. The ground-truth masked images are defined by considering only textual and equation regions as the regions of interest. This can be illustrated as

Y′ = ETSM(X), trained on the pairs (X, Y),   (1)

where X is the set of training document images, Y is the set of corresponding ground-truth masked images, and Y′ is the set of output masked images. (More detail about the data set is provided in Sect. 4.1.) The cross-entropy loss function and the Adam optimization method are used for training. The trained ETSM produces a masked image In_mask for each given document image. To get the positions of the predicted regions of interest in In_mask, this study finds the co-ordinates of the bounding boxes of the predicted regions. This is done in the function GetCoordinate (as shown in Fig. 3), which consists of three operations (a minimal code sketch of the whole procedure follows this list):

• Contour detection in the mask: A contour can be explained simply as a curve joining all the continuous points (along the boundary) having the same color or intensity. As a result, the number of contours in In_mask equals the number of segmented regions. Contours are determined as

C = getContour(In_mask),   (2)

where getContour() is the algorithm described in Study [36] for obtaining contours, and C = [c_1, c_2, ..., c_k] is the list of contours present in In_mask. The contour c_i consists of all the boundary points of the i-th segmented region and is denoted as c_i = [p_i1, p_i2, ..., p_im], where each p_ij = (x_ij, y_ij) is a boundary point.

• Contour-to-coordinate conversion: The co-ordinates of a bounding box are denoted by a tuple (x_1, y_1, x_2, y_2), where (x_1, y_1) and (x_2, y_2) are the top-left and bottom-right co-ordinates of the box. In this study, the co-ordinates of the bounding box b_i for the region r_i are obtained from its contour c_i as

b_i = (min_j x_ij, min_j y_ij, max_j x_ij, max_j y_ij).   (3)

In this manner, the co-ordinates of all the bounding boxes are obtained and stored in a list CD = [c1, c2, ..., ck], where ci = (x_i1, y_i1, x_i2, y_i2).

• Scaling up the co-ordinates: The bounding box co-ordinates in the list CD are calculated with respect to the mask image In_mask. However, we want the bounding boxes with respect to the input document image, which can be of arbitrary size m × n. Let a × b be the size of the generated mask image In_mask; then the scaling up of the co-ordinates for any ci ∈ CD is done as

x′_i1 = x_i1 × m/a,  x′_i2 = x_i2 × m/a,   (4)
y′_i1 = y_i1 × n/b,  y′_i2 = y_i2 × n/b,   (5)

where (x_i1 : x_i2) indicates the start and end points with respect to the rows of the images, while (y_i1 : y_i2) indicates the start and end points with respect to the columns of the images.
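As a concrete sketch of the three GetCoordinate operations, the snippet below uses OpenCV; the use of cv2.findContours as a stand-in for the contour algorithm cited as Study [36], the binary single-channel mask input, and the function name are assumptions. Note that OpenCV reports x as the column offset, whereas the paper's notation uses x for rows.

```python
import cv2

def get_coordinates(in_mask, input_shape):
    # Illustrative GetCoordinate: contours -> bounding boxes -> scaled boxes.
    a, b = in_mask.shape[:2]              # mask size (rows a, cols b)
    m, n = input_shape[:2]                # input document size (rows m, cols n)
    contours, _ = cv2.findContours(in_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    cd = []
    for c in contours:                    # one contour per segmented region
        x, y, w, h = cv2.boundingRect(c)  # x: column offset, y: row offset
        # scale mask coordinates up to input-image coordinates
        cd.append((int(x * n / b), int(y * m / a),
                   int((x + w) * n / b), int((y + h) * m / a)))
    return cd
```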

Region classification
Once the image segments have been extracted from the input image in the previous step, the next task is to classify the type of each segmented image. A CNN-based classifier, as defined in Table 1, is used for identifying the type of the image segments; we name it the Equation Textual and Noise Classifier (ETNC) in Fig. 3. The CNN model is trained with the cross-entropy loss function and the Adam optimizer. Given an image segment wb_i, the class assignment is illustrated as

class_i = ETNC(wb_i), class_i ∈ {Equation, Textual, Noise},   (6)

where Equation denotes a segment with equation components, Textual denotes a segment with textual components, and Noise denotes a segment with neither textual nor equation components.

Pruning of false textual regions
Once the image segments have been classified into three classes in the previous step, the next step is to filter out the segments classified as Noise and to prune the false regions of interest from the segments classified as Textual or Equation. Most of the false regions of interest are embedded textual regions, such as text inside figures (representing labels), text inside tables (representing labels and entries), line numbers, page numbers, etc. The pruning is done considering two parameters of the regions: (i) Aspect ratio: most of the true regions (textual and equation) of interest are elongated horizontally compared to the false regions of interest; therefore, the aspect ratio (width/height of the region) is larger for true regions of interest. (ii) Area: the area of false regions of interest is always small compared to that of true regions (textual and equation) of interest. As shown in Fig. 3, this step is carried out in Pruning, which is described in Algorithm 1. It takes two inputs: (i) a class vector, the class labels of the image segments produced by ETNC, i.e., class_i denotes the class label of the image segment wb_i, and (ii) a co-ordinates vector, i.e., co-ordinates_i denotes the co-ordinates of the image segment wb_i. As shown in Algorithm 1, the working of the function Pruning can be broken down into three steps: (i) remove the co-ordinates of Noise-type segments, (ii) find the normalized areas and aspect ratios of the bounding boxes, and (iii) compare the area and aspect ratio of each bounding box against two thresholds, threshold-of-Area and threshold-of-Aspect-Ratio, which are set to 0.02 and 0.009, respectively. The function returns two lists of co-ordinates, Textual co-ordinates and Equation co-ordinates, which correspond to the co-ordinates of the textual and equation regions in the input image In. These co-ordinate lists are passed on to the next tier to remove textual and equation regions from the input image In and generate the corresponding masking. A hedged sketch of this pruning step is given below.
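Since Algorithm 1 is not reproduced here, the following is a minimal sketch of the pruning logic as described above; the exact comparison rules in the paper's Algorithm 1 may differ, and the function and variable names are illustrative.

```python
def pruning(classes, coords, img_height, img_width,
            th_area=0.02, th_aspect=0.009):
    # Illustrative sketch of the Pruning step (Algorithm 1 in the paper):
    # drop Noise segments, then drop small / non-elongated boxes.
    textual, equation = [], []
    for cls, (x1, y1, x2, y2) in zip(classes, coords):
        if cls == "Noise":
            continue                                 # step (i): remove Noise
        w, h = x2 - x1, y2 - y1
        area = (w * h) / (img_height * img_width)    # step (ii): normalized area
        aspect = w / max(h, 1)                       # width/height ratio
        # step (iii): assumed rule -- keep boxes above both thresholds
        if area < th_area or aspect < th_aspect:
            continue
        (textual if cls == "Textual" else equation).append((x1, y1, x2, y2))
    return textual, equation
```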

Graphical region segmentation and identification
After extracting the co-ordinates of the textual and equation components in the previous module, this tier extracts the co-ordinates of the graphical regions present in the given document image In. Unlike the approaches reported in various existing methods, this study modifies the given input image to convert all the foreground pixels of non-graphical regions into background pixels. The modification of the input image is motivated by the fact that the segmentation model works better when trained on document images in which all of the foreground pixels belong to the regions of interest of one class. Figure 4 shows the schematic diagram of this tier, consisting of four sub-steps: (i) input image masking, (ii) region segmentation, (iii) region classification, and (iv) pruning of false graphical regions.

Input image masking
The goal of this step is to mask the given input image and produce a new input image. The masking is done by converting all the pixels of non-graphical regions to background pixels. To perform the masking, we need two pieces of information: (i) the co-ordinates of the non-graphical regions, which are provided by the previous tier as the two sets of co-ordinates, Textual co-ordinates and Equation co-ordinates, and (ii) the background pixel value of the given input image. As most documents have a common structure in which their contents (text, equation, or graphical) start appearing after some blank regions (background pixels) on all four boundaries, this study estimates the intensity of the background pixels from these four boundaries. The masking process is given by

Ig = Image_Masking(In, Textual co-ordinates, Equation co-ordinates),   (7)

where In is the original input image and Ig is the modified input image. The working of Image_Masking() is demonstrated in Algorithm 2 and can be expressed as two operations:

1. Determine background pixel intensity: The key objective is to identify the most frequently occurring color, which represents the dominant background color. By iteratively analyzing each pixel in the RGB image and constructing a histogram of color frequencies, we determine the color value with the highest frequency, signifying the background pixel color. It is important to note that this method relies on the assumption that the background color is the most prominent color in the image. In instances where the document contains complex backgrounds with multiple dominant colors, more advanced techniques such as color clustering or background modeling may be required for accurate background pixel estimation. However, in practice, most of the document images available online, and also those in publicly available data sets, have white (255, 255, 255) as the background pixel value. The output of this operation is shown in Fig. 5.

2. Pixel value modification: The pixel values of all the pixels inside the bounding boxes obtained with the co-ordinates h_i ∈ (Textual co-ordinates ∪ Equation co-ordinates) are changed to the background pixel value to obtain a new input image Ig.

A hedged sketch of these two operations is given below.
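The following Python sketch illustrates the two Image_Masking operations, estimating the dominant boundary color with a frequency count and overwriting the tier-1 regions; the boundary strip width and all names are assumptions, not the paper's Algorithm 2.

```python
import numpy as np

def image_masking(in_img, textual_coords, equation_coords, border=10):
    # Illustrative Image_Masking; the strip width `border` is an assumed value.
    # 1. Background intensity: most frequent color along the four boundaries.
    strips = np.concatenate([
        in_img[:border].reshape(-1, 3),    in_img[-border:].reshape(-1, 3),
        in_img[:, :border].reshape(-1, 3), in_img[:, -border:].reshape(-1, 3)])
    colors, counts = np.unique(strips, axis=0, return_counts=True)
    bg = colors[counts.argmax()]           # usually (255, 255, 255) in practice
    # 2. Pixel value modification: overwrite textual/equation boxes with bg.
    ig = in_img.copy()
    for (x1, y1, x2, y2) in list(textual_coords) + list(equation_coords):
        ig[y1:y2, x1:x2] = bg
    return ig
```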

Region segmentation
Given the modified input image Ig, and in the same way as in the previous tier, the first task is to develop a graphical region segmentation model, which we call the Graphical-region Segmentation Model (GSM). It is developed using the same UNet architecture as ETSM and is trained on document images which have only graphical regions as foreground regions (modified input images). GSM is able to capture the characteristics of the graphical regions accurately; even when a graphical region is made up of sparsely clustered pixels, as in flowcharts, line charts, etc., it segments the continuous foreground pixels. However, two types of false segmentation are observed:

• Due to the inclusion of blank regions during the training of GSM, it segments some blank regions as regions of interest.
• It marks some regions, such as page numbers and equation numbers, as graphical regions of interest; these were filtered out as false regions of interest in the first tier and hence could not be converted to background regions during input image masking.

Region classification
The goal of this step is to address the false segmentation caused by the blank regions. A CNN-based model, which we name the Graphical-region Classifier (GC), with the same architecture as ETNC (discussed in Sect. 3.1.2), is trained to identify the type of each image segment g_i ∈ G as one of two classes:

class_i = GC(g_i), class_i ∈ {Graphical, Noise},   (8)

where Graphical denotes a segment with graphical components (any region with foreground pixels), and Noise denotes a segment without any foreground pixels, i.e., a blank region.

Pruning of false graphical regions
The goal of this step is to address the false segmentation caused by regions that have foreground pixels but are not part of graphical regions. In our case, this error is created by page numbers, line numbers, equation numbers, etc., which are filtered out in the first tier because of their relatively small area. As shown in Fig. 4, the pruning process is carried out in Pruning-G, whose working is the same as that of Pruning (discussed in Sect. 3.1.3). The only difference is that Pruning-G considers only one parameter, the area of the regions, with a user-defined threshold, threshold-of-Area-graphical, to prune falsely segmented regions. This study sets it to 0.03. After pruning, we finally get a new set of co-ordinates, Graphical co-ordinates, which contains the co-ordinates of the graphical regions for the given input image In.
Experimental setup

Data set
This study uses multiple data sets for executing and evaluating the proposed framework.

In-House (IH)
This study curates a data set consisting of 4840 document images written in Latin (3840) and Bengali (1000) scripts, all of which contain graphical components/regions. 4040 samples are considered for training, and the remaining 800 samples are used for developing a testing data set. Based on the type of parallel mask images, this data set has two variants: (a) In-House-Textual (IH-T): here, the regions of interest are the textual and equation regions, which means the mask images are annotated in such a way that only textual and equation regions are considered foreground regions and the rest as background regions. This variant is used as the training data set for ETSM (the segmentation model in the first tier) and therefore has 4040 samples. (b) In-House-All (IH-A): in this variant, the masks are annotated to contain the positional and label information of all the regions of interest (equation, textual, and graphical). This variant is used as the testing data set of the proposed framework and therefore has 800 samples.

In-House-Graphical (IH-Graphical)
This data set is the same as IH, but the images are modified manually to have only graphical regions as foreground regions. Therefore, the document images in this data set have no textual or equation regions. The corresponding mask images are generated in such a way that all the foreground pixels (which are the graphical regions) belong to the regions of interest. This data set is used as the training data set for GSM (the segmentation model in the second tier) and therefore has 4040 samples.

Text-Equation-Noise data set (TEND)
This data set consists of 6000 samples which can be grouped into three classes of 2000 samples each: Text (images that contain only texts), Equation (images that contain only mathematical equations), and Noise (images that contain neither texts nor equations, such as diagrams, parts of pictures, etc.).

Segmentation and classification models
The proposed framework consists of two segmentation models (ETSM and GSM) and two classification models (ETNC and GC). As stated earlier, the segmentation models are UNet models with VGG-16 as the backbone of the encoder.
Along with the vanilla UNet, this study considers three variants of UNet: UNet-CBAM [41], UNet-SE [41], and UNet-(CBAM+SE) [41], obtained with the three attention mechanisms CBAM [42], SE [43], and CBAM+SE [41], respectively. The experimental setups of the segmentation and classification models for the two tiers are as follows (a minimal training sketch is given after this list):

• In the first tier of the proposed framework, the segmentation model ETSM is trained using the training data set of IH-T, while the classification model ETNC is trained using the training data set of TEND.
• In the second tier of the proposed framework, the segmentation model GSM is trained on the training data set of IH-Graphical, while the classification model GC is trained on the training data set of GOBD.
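For concreteness, a UNet with a VGG-16 encoder can be instantiated as sketched below using the segmentation_models_pytorch library; the library choice, the learning rate, and the binary-mask formulation are assumptions, since the paper does not specify its implementation stack.

```python
import torch
import segmentation_models_pytorch as smp

# Illustrative setup for ETSM/GSM: UNet with a VGG-16 encoder backbone.
model = smp.Unet(encoder_name="vgg16", encoder_weights="imagenet",
                 in_channels=3, classes=1)      # one region-of-interest mask
loss_fn = torch.nn.BCEWithLogitsLoss()          # cross-entropy on binary masks
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed rate

def train_step(images, masks):
    # images: (B, 3, H, W) float tensor; masks: (B, 1, H, W) in {0, 1}
    optimizer.zero_grad()
    loss = loss_fn(model(images), masks)
    loss.backward()
    optimizer.step()
    return loss.item()
```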

Evaluation metric
To evaluate the performance of the proposed framework, this study considers three well-known evaluation measures used in various studies [44-47]: Intersection over Union (IoU) is used for measuring the segmentation performance, while mean average precision (mAP) and mean average F1-score (mAF1) are used for object/region classification.
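For reference, the following shows the standard box-level IoU computation assumed throughout the evaluation, where a detected region counts as correct when its IoU with the ground-truth box exceeds the chosen threshold; this is a generic sketch, not the paper's evaluation code.

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2): (x1, y1) top-left, (x2, y2) bottom-right.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# Example: a detection overlapping most of the ground truth.
print(iou((0, 0, 100, 100), (20, 10, 110, 105)))  # ~0.63, a hit at threshold 0.5
```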

Experimental results
The proposed framework is evaluated from three different perspectives: (i) evaluation with an optimal IoU threshold value, (ii) regionwise evaluation, in which the proposed framework is evaluated tierwise and stepwise, and (iii) evaluation with varying IoU thresholds.
1. Evaluation with an optimal IoU threshold: In line with various existing studies [6,9,10,44-47], the evaluation of segmentation performance in this study considers an IoU threshold of 0.5 for region detection and identification. Table 2 presents the performance comparison of the proposed framework against three existing methods using an IoU threshold of 0.5 over five data sets: IH-A, DocBank, DSSE, CS, and Pub. The results demonstrate that the proposed framework, employing the (SE+CBAM)-based UNet as the segmentation model, achieves superior detection and identification accuracy across most data sets. For instance, on the IH-A data set, the proposed framework achieves an mAP of 97.85%, surpassing all the baseline methods. Similarly, on the DocBank and DSSE data sets, the proposed framework consistently achieves higher mAP and mAF1 values compared to the existing methods. However, it is important to note that the existing study [6] marginally outperforms our proposed framework in terms of mAP on the CS data set; despite this, the proposed framework still maintains competitive performance there, with an mAP of 98.01%. In summary, the proposed framework demonstrates superior performance compared to the existing methods in terms of mAP and mAF1 across multiple data sets, and although one existing method achieves a slightly higher mAP on a specific data set, the overall performance of the proposed framework remains consistently high.

2. Regionwise and stepwise evaluation: Based on the evaluation conducted above, the proposed framework selects the (SE+CBAM)-based UNet as the ETSM and GSM models, which consistently deliver high performance; henceforth, we refer to this selected configuration as our proposed model. The evaluation aims to assess the performance of the proposed model in comparison to three existing methods, focusing on the segmentation and identification of each class/region (textual, equation, and graphical) using an IoU threshold of 0.5. The evaluation follows a two-step procedure: segmentation using the ETSM/GSM models, and the pruning process that combines region classification with the filtering of false regions of interest based on area and aspect ratio. Initially, the framework is evaluated based solely on the segmentation results achieved through the ETSM/GSM models, which provides insight into its performance based on segmentation alone. Subsequently, a comprehensive evaluation is conducted by incorporating the pruning process: regions classified as noise are ignored, only textual and equation regions are retained, and falsely segmented regions are eliminated based on the area and aspect ratio of the regions. The results of this evaluation are reported in Table 3.
The values inside the bracket accompanied by an up arrow ( ↑ ) in the rows of the proposed method indicate the increase or improvement in the mean average precision (mAP) achieved by incorporating the pruning step compared to the first step (segmentation) alone.
For the Text region, the mAP of the proposed method in the first step (segmentation only) is 86.01% on IH-A. When the pruning step is incorporated, the mAP improves significantly to 98.17% for IH-A, 98.09% for DocBank, 97.33% for DSSE, 97.56% for CS, and 98.13% for Pub. These results demonstrate the effectiveness of the pruning process in enhancing the segmentation performance, as indicated by the increases of 12.16, 10.64, 8.3, 8.44, and 8.04% for the respective data sets.
Similarly, for the Eqn (Equation) region, the mAP of the proposed method in the first step is 81.83% on IH-A. Upon incorporating the pruning step, the mAP improves significantly to 98.72% for IH-A, 99.51% for DocBank, 99.12% for DSSE, 99.18% for CS, and 99.57% for Pub.
The corresponding increases of 16.89, 9.38, 9.5, 8.81, and 9.07% highlight the substantial improvement achieved through the pruning process. Regarding the Graph (Graphical) region, the mAP of the proposed method in the first step is 89.21% on IH-A. After incorporating the pruning step, the mAP improves to 97.07% for IH-A, 97.07% for DocBank, 98.10% for DSSE, 98.63% for CS, and 98.43% for Pub. The increases of 7.86, 9.51, 15.79, 10.18, and 15.87% demonstrate the positive impact of the pruning process on the segmentation performance for graphical components. Overall, these results emphasize the significance of the pruning step in improving the segmentation accuracy for all classes of components and in obtaining more accurate positional information for textual, equation, and graphical elements. Furthermore, the table shows that the proposed model achieves high mAP for the textual and equation classes while also delivering competitive performance for graphical components. Considering the consistent and robust performance of the proposed model across multiple data sets at an IoU threshold of 0.5, it can be concluded that the proposed framework is reliable and effective in segmenting and identifying the different classes of components in various document analysis and understanding tasks.

3. Evaluation with varying IoU threshold: In this study, we conducted an in-depth analysis of the detection performance of the proposed model by varying the IoU threshold. As in the earlier evaluations, the analysis was extended to include three existing methods for a comprehensive comparison. From Table 4, it is evident that the best mAP and mAF1 scores for all methods, including the proposed model, are achieved when the IoU threshold is set to 0.5. Of particular interest is the performance of the existing method presented in Study [6], which outperforms the proposed framework on the CS data set when the IoU threshold is set to 0.5 (as shown in Table 2); however, its performance deteriorates significantly as the IoU threshold increases. This observation highlights the sensitivity of the existing methods to variations in the IoU threshold. To further illustrate the performance differences, Fig. 6 displays output samples from the four methods when the IoU threshold is set to 0.6. It is evident that, apart from the proposed model, the methods struggle to accurately segment most graphical regions. Based on these experimental observations, it can be concluded that the proposed model outperforms the existing methods and is robust with respect to varying IoU threshold values, indicating consistent performance across different scenarios and data sets, particularly in accurately segmenting graphical regions.

Applications
This study is extended to four use cases, which can be considered applications of the proposed framework: (i) document recreation, in which the given input document images are recreated in PDF form while preserving the structural information; (ii) structure-preserving document transliteration, in which the given document images are transliterated into another target script/language and recreated in PDF form while preserving the structural information of the documents; (iii) structure-preserving document translation, in which the given document images are translated into another target language and recreated in PDF form while preserving the structural information; and (iv) meta-file generation, in which a JSON file, called a meta-file, is generated for any given document image; it contains descriptive information (with respect to structure, content positions, classes, etc.) about the given document.
All four applications discussed in this section make use of multiple tools, namely, OCR, machine transliteration, machine translation, and a graphical component classification model. In addition, this study uses the open-source tool TeXStudio for recreating input document images as PDFs. The recreation process can be seen as a three-step procedure: (i) blank page creation: a blank page of size m × n, the size of the input document image, is created using the geometry package; (ii) insertion of textual components: the textual components are inserted into the blank page at their specific positions using the tikz package and the minipage environment; and (iii) insertion of graphical and equation components: the graphical and equation components are inserted as images using the picture environment. In recreating the document, the fonts of all the textual components are kept the same, and some textual attributes such as boldness, italics, and underlining are ignored. In addition, the graphical or equation components might shrink because of inaccurate segmentation, and textual content such as words or paragraphs may be lost because of errors introduced by the translation or transliteration model.
• Document recreation: The given input document image is recreated as a PDF preserving the structural information of the original image. The graphical and equation components are inserted as images, but the textual components are inserted in their editable form, so this case uses OCR to convert textual components into editable text. This study considers OCR for two languages: English and Manipuri (written in Bengali script). We adopt the OCR system provided by Tesseract. The OCR system for English is readily available, but for the Manipuri language we developed an OCR system following the fine-tuning method (one of the OCR development methods from existing OCRs for similar scripts provided by Tesseract-development). A recreated output sample for this case is shown in Fig. 7a.
• Structure-preserving document transliteration: This is the same as the previous case, with the difference that we perform transliteration of the textual components. This case therefore requires OCR and a machine transliteration model, and it can be performed for any script/language for which these two components are available. This study considers transliteration of Manipuri documents written in Bengali script into Meetei script. We use the Manipuri OCR system from the previous case and developed a simple rule-based Manipuri Bengali-to-Meetei script transliteration system influenced by Study [35]. A transliterated output sample for this case is shown in Fig. 7b.
• Structure-preserving document translation: This is also the same as the previous case, with the difference that we perform translation instead of transliteration. This case therefore requires OCR and a machine translation model, and it can be performed for any script/language for which these two components are available. This study considers translation of documents written in English into Hindi. A translated output sample for this case is shown in Fig. 7c.
• Meta-file generation: A meta-file containing the descriptive information of the given document image is generated. The graphical components can be classified into any classes, depending on the classification model employed for the task. This study considers the graphical components to be classified into chart types (data visualizations) such as area, bar, column, etc., while non-chart graphical components are classified as "Other". For this study, we consider the chart-type classifier model developed in Study [50], which classifies a given chart image into one of 25 chart types. This case therefore requires two components: OCR (to obtain content descriptions of the textual components) and a chart classifier. The generated meta-file is saved as a JSON file, and a sample is provided in Fig. 8 (a minimal illustration of such a file follows below).
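As an illustration of what such a meta-file might look like, the sketch below writes a JSON description of one page; every field name and value here is hypothetical, since the paper's exact schema is shown only in Fig. 8.

```python
import json

# Hypothetical meta-file layout; field names are assumptions, not the paper's schema.
meta = {
    "document": "sample_page.png",
    "size": {"width": 1654, "height": 2339},
    "regions": [
        {"class": "Textual", "bbox": [120, 210, 1530, 480],
         "content": "OCR-extracted paragraph text"},
        {"class": "Equation", "bbox": [400, 620, 1250, 700]},
        {"class": "Graphical", "bbox": [300, 900, 1400, 1600],
         "chart_type": "bar"},          # 25 chart types + "Other" per Study [50]
    ],
}
with open("meta.json", "w") as f:
    json.dump(meta, f, indent=2)
```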

Discussion
Despite behaving well compared to various existing methods, our proposed system has some limitations, which are noted in the following points:

• Textual components: The proposed framework ignores some details, such as page numbers and line numbers, which are not considered textual components. In addition, it works only on documents in which the text is written horizontally.
• Equation components: The proposed framework is able to accurately segment equations that are written on a single line. However, it leaves out parts of some equation components if they are written in multiple lines (e.g., if-else cases) with text lines as conditions. This situation is shown in Fig. 9a.
• Graphical components: The proposed framework fails to segment the complete region of a graphical component if it has long text lines as parts, such as long text-line entries inside tables, labels of plots, etc. This situation is shown in Fig. 9b.
• Background: This study assumes that the intensities of all the background pixels are the same; this assumption, however, does not affect the performance of the proposed framework in practice.
• Document layout: The proposed framework provides better results for single- and double-column documents. If a document has more than two columns, the post-processing method in the first tier of the proposed framework filters out multiple regular textual regions, considering them noisy regions.

Conclusion and future work
In this study, we have proposed a two-tier feedback-based end-to-end integrated framework for document segmentation and region classification. The first tier considers textual and equation regions as the only regions of interest, while the second tier considers only the graphical regions as the regions of interest. From the various experimental evaluations against multiple existing methods and data sets, the advantage of using two tiers and feedback information is observed: the proposed framework outperforms various existing methods. This study further extends the proposed framework to four use cases. In the future, we plan to address the shortcomings discussed in the Discussion section.

Fig. 1
Fig. 1 Samples of document images: the document sample in a has non-textual components in which the pixels are tightly clustered, while the samples in b and c have non-textual components in which the pixels are sparsely clustered

Fig. 2
Fig. 2 Output samples of three document segmentation methods: the samples in a, b, and c are the outputs of the methods of Tran et al. [9], Umer et al. [6], and Wu et al. [10], respectively

Fig. 3
Fig. 3 Equation and textual region segmentation and identification

Fig. 4
Fig. 4 Graphical region segmentation and identification


Fig. 5
Fig. 5 Background pixel intensity determination examples. The first row indicates the input document images and the plots in the second row indicate their respective selected background intensity values for all three channels

Fig. 6
Fig. 6 Output samples of the proposed model and three existing methods with IoU threshold 0.6

Fig. 7
Fig. 7 Output samples of three use cases (presented in the second row for the corresponding input images shown in the first row): a document recreation, b structural preserving document transliteration, and c structural preserving document translation

Fig. 8
Fig. 8 Sample of generated meta-file: a input document image, b a generated meta-file (JSON file) for the document image in a

Fig. 9
Fig. 9 Some challenging samples in region segmentation: a equation with multiple options as long text-lines, b graphical component, a table with various text-lines as the entries

Table 1
Architecture detail of two classifiers: ETNC and GC (GC to be discussed in Sect. 3.2.3)

Table 2
Performance comparison of the proposed framework against three existing methods using an IoU threshold of 0.5 over five data sets

Table 4
Performance of the proposed model and three existing methods with varying IoU thresholds