4.1 Evaluation Metrics
The evaluation metrics used in the ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction [20] are employed to evaluate the experiment performances in this paper: Precision, Recall and F1 Score. The extracted text is compared to the ground truth for each test image. If both the content and the category of the extracted text match the ground truth, it is marked as correct; otherwise, it is marked as incorrect. Furthermore, the algorithms are also evaluated using the classification report and the confusion matrix.
Precision, recall, F1 score, and support are the four most important values to examine in a classification report. Precision refers to the ability of a classifier not to label a negative instance as positive [21]. Recall is the ability of a classifier to find all positive instances. The F1 score is the weighted harmonic mean of precision and recall. Support is the number of actual occurrences of a class in the specified dataset.
The precision value is calculated using Eq. (1). Precision is useful when False Positives (FP) pose a greater threat than False Negatives (FN). In our research, a large number of false positives would harm the key-value pair linking and the final key-value pair results.
$$Precision = \frac{TP}{TP+FP} \quad (1)$$
where TP denotes True Positive.
The recall value is computed using Eq. (2). It is a valuable metric when false negatives matter more than false positives.
$$Recall = \frac{TP}{TP+FN} \quad (2)$$
The F1 score ranges between 0 and 1 and is calculated using Eq. (3); the closer it is to 1, the better the model performs.
$$F1\ Score = 2 \times \frac{Precision \times Recall}{Precision+Recall} \quad (3)$$
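As an illustration of how these metrics, together with the classification report and confusion matrix, can be computed in practice, a minimal sketch is given below. It assumes the ground-truth and predicted labels are available as Python lists; the label names and values are purely illustrative and are not taken from the experiments.

# Minimal sketch: per-class precision, recall, F1 score and support (Eqs. (1)-(3)),
# plus the confusion matrix used for further evaluation. Labels are illustrative.
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["Question", "Answer", "Answer", "Other", "Question", "Other"]  # ground truth
y_pred = ["Question", "Answer", "Other", "Other", "Question", "Answer"]  # predictions

print(classification_report(y_true, y_pred, digits=2))
print(confusion_matrix(y_true, y_pred, labels=["Question", "Answer", "Other"]))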
4.2 Label Classification Comparisons
Prior to classification, the dataset is first split into 80% training and 20% testing. The four classification models experimented with are BERT, ULMFit, YOLOv4 combined with BERT, and LayoutLMv2, each of which classifies the text into "Question", "Answer" and "Other". The dataset used in the experiments in this sub-section is the Cargo Invoice dataset constructed in this research.
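A minimal sketch of the 80%/20% split is shown below, assuming the dataset is held as a list of (text, label) samples; the sample values and the use of scikit-learn's train_test_split are assumptions for illustration.

# Minimal sketch: split the labelled samples into 80% training and 20% testing.
from sklearn.model_selection import train_test_split

samples = [("Gross Weight", "Question"), ("88,500 KG", "Answer"), ("Page 1 of 2", "Other"),
           ("Net Weight", "Question"), ("270 lb", "Answer")]  # illustrative samples
texts, labels = zip(*samples)

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42)  # 20% held out for testing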
1) Results of the ULMFit Model
With the Cargo Invoice dataset, the ULMFit model is utilized for key-value pair classification. The precision, recall, F1 score and support of the ULMFit model are shown in TABLE II. The ULMFit model achieves 93% accuracy. Overall, the performance of this model is good. However, the model does not perform as well for the "Other" class as for the other two classes. This might be because there are fewer "Other" labels compared to "Question" and "Answer".
Table II ULMFit Results with Cargo Invoice Dataset
Class | Precision | Recall | F1 Score | Support
Question | 99% | 98% | 98% | 882
Answer | 88% | 96% | 92% | 882
Other | 93% | 80% | 86% | 568
Accuracy | | | 93% | 2332
2) Results of the BERT Model
The experiment results of the BERT classification model with the Cargo Invoice dataset are shown in TABLE III. The BERT model achieves 88% accuracy. However, the "Answer" label scores lower in precision, recall and F1 score. The model performs best on the "Other" label, whose precision, recall and F1 score are higher than those of the other two labels.
Table III BERT Model Classification Results
Class | Precision | Recall | F1 Score | Support
Question | 85% | 91% | 88% | 2128
Answer | 83% | 76% | 79% | 1387
Other | 96% | 95% | 95% | 1493
Accuracy | | | 88% | 5008
3) Results of YOLOv4 Combined with BERT Model
The classification results of the YOLOv4 combined with BERT model are shown in TABLE IV. The overall performance is worse than that of the other experimented classification models. The "Other" label performs the worst of the three labels, possibly due to the imbalanced dataset, in which "Other" has the least amount of data. The results of this experiment show that combining the YOLOv4 and BERT algorithms does not enhance the performance, as the BERT model itself achieves 88% on the cargo invoices, as seen in TABLE III.
Table IV Results of YOLOv4 Combined with BERT Model
Class | Precision | Recall | F1 Score | Support
Answer | 52% | 42% | 46% | 882
Other | 46% | 39% | 40% | 568
Question | 49% | 40% | 44% | 882
Accuracy | | | 40.27% | 2332
4) Results of the LayoutLMv2 Model
For the LayoutLMv2 model, the labelling method is slightly different from the other three models. The labels of the LayoutLMv2 model follow BIOES tagging, where B means the beginning; I means inside (the middle); E means the ending; O means others; and S means a single word representing a full sequence [22]. For example, "Product No." is split into "Product", labelled B-QUESTION, and "No.", labelled E-QUESTION.
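A minimal sketch of this tagging scheme is given below; the helper function is an illustration of BIOES tagging, not the exact labelling code used for the dataset.

# Minimal sketch: assign BIOES tags to the words of one entity.
def bioes_tags(words, entity_type):
    if len(words) == 1:
        return ["S-" + entity_type]                      # single word covers the whole entity
    tags = ["B-" + entity_type]                          # beginning
    tags += ["I-" + entity_type] * (len(words) - 2)      # inside (middle) words
    tags.append("E-" + entity_type)                      # ending
    return tags

print(bioes_tags(["Product", "No."], "QUESTION"))        # ['B-QUESTION', 'E-QUESTION']
print(bioes_tags(["Total"], "QUESTION"))                 # ['S-QUESTION']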
The experiment results of the LayoutLMv2 classification model are shown in TABLE V. The LayoutLMv2 model achieves an accuracy of 96%, the highest among the four classification models. All the labels achieve 90% and above for precision, recall and F1 score, which shows that the LayoutLMv2 model predicts well and distinguishes each label properly. Even though the data are somewhat imbalanced, the model is still able to predict all the labels well. A further analysis of the confusion matrix will be conducted to confirm whether this model is the best.
Table V LayoutLMv2 Model Classification Results
Class | Precision | Recall | F1 Score | Support
B-ANSWER | 93% | 94% | 94% | 331
B-QUESTION | 98% | 98% | 98% | 507
E-ANSWER | 90% | 96% | 93% | 331
E-QUESTION | 98% | 98% | 98% | 507
I-ANSWER | 96% | 98% | 97% | 915
I-QUESTION | 96% | 97% | 97% | 104
O | 97% | 93% | 94% | 1387
S-ANSWER | 95% | 97% | 96% | 502
S-QUESTION | 96% | 98% | 97% | 375
Accuracy | | | 96% | 4959
5) Experiment Results with the FUNSD Dataset
Besides the classification experiments on the constructed Cargo Invoice dataset, experiments are also performed on the FUNSD dataset to compare the performance of the four classification models. The overall performance on the FUNSD dataset is shown in TABLE VI.
Table VI Classification Results with the FUNSD dataset
Model | Precision | Recall | F1 Score
ULMFit | 77% | 76% | 76%
BERT | 55% | 67% | 60%
YOLO combined with BERT | 49% | 40% | 44%
LayoutLMv2 | 80% | 85% | 83%
The results show that the LayoutLMv2 model performs the best of the four models. This demonstrates that integrating the spatial-aware self-attention mechanism into the Transformer architecture allows the LayoutLMv2 model to fully understand the relative positional relationships among different text blocks. On the other hand, YOLOv4 combined with the BERT model does not perform well. Even though it is trained with the text along with the bounding boxes of the text, it is not able to capture the relative positional relationships as accurately as the LayoutLMv2 model. Thus, the LayoutLMv2 model is chosen by the proposed pipeline for the key-value pair classification task.
4.3 Linking of Key-value Pair Comparisons
The two algorithms evaluated for linking key-value pairs are the regular expression algorithm and pairing by the nearest bounding box algorithm.
1) Regular Expression Algorithm
Before applying regular expressions, some analysis is done on the dataset to understand the patterns of the keys and values, as well as how they are paired, in order to improve the key-value pair extraction. The keys appear in various formats, such as short forms, so a list of the different forms of each key is created. Some examples of these formats can be seen in TABLE VII.
Table VII Some examples of different formats of keys
Word
|
Different Formats of Keys
|
Gross Weight
|
G/W
G/Weight
Gross wt
Gross wght
Gross wght(kg)
|
Net Weight
|
Net WT
Net wght
Net weight(kg)
Net wt(kg)
Net wght(kg)
N/W
|
Dimension
|
Dim
Dims
Dim(mm)
DIM (CM)
Dimensions(cm)
|
After identifying the different formats of the keys in the Cargo Invoice dataset, a word cloud is created to better understand the keys in the dataset, as shown in Fig. 2. In the word cloud, words such as "gross weight", "net weight", "PO", "country of origin", and "dimension" stand out the most. Hence, starting from these common keys, the next step is to understand the patterns of their values. Some examples of the value patterns are shown in TABLE VIII.
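A minimal sketch of how such a word cloud can be generated is given below; it assumes the extracted keys are available as a list of strings and uses the wordcloud package, which is one possible tool rather than necessarily the one used for Fig. 2.

# Minimal sketch: build a word cloud image from the extracted keys.
from wordcloud import WordCloud

keys = ["gross weight", "net weight", "PO", "country of origin", "dimension",
        "gross weight", "net weight", "dimension"]       # illustrative key list
cloud = WordCloud(width=800, height=400, background_color="white").generate(" ".join(keys))
cloud.to_file("key_wordcloud.png")                       # frequent keys appear larger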
Table VIII Pattern of Values
Key
|
Values
|
Patterns
|
Gross Weight
|
G/W 2134
23Kg/g
88,500 KG
|
For gross weight the pattern is always an integer/float followed by the metrics (KG/lb/g). Sometimes G/W would be infront of the integer/float.
|
Net Weight
|
270 lb
88,500 KG
3.840 KG
|
For Net weight the pattern is similar to Gross weight. The pattern is always an integer/float followed by the metrics (KG/lb/g).
|
PO
|
PO-IMA-21007
PO2000004328
|
For PO the pattern is PO followed by letters then integers or just integers.
|
Dimension
|
228.00 X 148.00 X 165.00 CM
89.76 X 58.26 X 64.96 INCH
3815*2150*2550mm
|
For Dimension, the pattern is integer/float X integer/float X integer/float
|
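A minimal sketch of how these value patterns might be written as regular expressions is given below; the expressions and the sample line are illustrative assumptions, not the exact patterns used in the pipeline.

# Minimal sketch: illustrative regular expressions for the value patterns in TABLE VIII.
import re

VALUE_PATTERNS = {
    # integer/float (with optional thousands separator) followed by a weight unit;
    # "G/W" may appear in front of the number
    "Gross Weight": re.compile(r"(?:G/W\s*)?\d{1,3}(?:,\d{3})*(?:\.\d+)?\s*(?:KG|kg|Kg|lb|g)\b"),
    "Net Weight":   re.compile(r"\d{1,3}(?:,\d{3})*(?:\.\d+)?\s*(?:KG|kg|Kg|lb|g)\b"),
    # "PO" followed by letters/digits (with optional hyphens), or a long run of digits
    "PO":           re.compile(r"PO[-A-Z0-9]+|\b\d{6,}\b"),
    # integer/float X integer/float X integer/float, with "X" or "*" as separator
    "Dimension":    re.compile(r"\d+(?:\.\d+)?\s*[Xx*]\s*\d+(?:\.\d+)?\s*[Xx*]\s*\d+(?:\.\d+)?"),
}

line = "PO-IMA-21007  DIM (CM): 228.00 X 148.00 X 165.00 CM  G/W 88,500 KG"
for key, pattern in VALUE_PATTERNS.items():
    match = pattern.search(line)
    if match:
        print(key, "->", match.group())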
After analysing these patterns, the key-value pairs are extracted. The workflow of the key-value pair extraction is as follows (a sketch is given after the list):
- Find all candidate key-value pairs.
- Calculate the Levenshtein distance between each candidate key and the given identifiers to determine the most likely identifier.
- Return the key-value pair with the lowest normalized Levenshtein distance.
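The sketch below illustrates this workflow under assumed data structures: candidate (key, value) pairs are matched against the known identifiers by normalized Levenshtein distance, and the best-scoring pair is returned. The helper names are illustrative, not the exact implementation.

# Minimal sketch: rank candidate keys against known identifiers by normalized
# Levenshtein distance and return the best match.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))                       # classic DP edit distance
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca.lower() != cb.lower())))  # substitution
        prev = curr
    return prev[-1]

def normalized_distance(a, b):
    return levenshtein(a, b) / max(len(a), len(b), 1)

def best_pair(candidates, identifiers):
    best = None                                          # (distance, identifier, value)
    for key, value in candidates:
        for ident in identifiers:
            d = normalized_distance(key, ident)
            if best is None or d < best[0]:
                best = (d, ident, value)
    return best

print(best_pair([("Gross wght(kg)", "88,500 KG"), ("N/W", "270 lb")],
                ["Gross Weight", "Net Weight"]))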
2) Pairing Via Nearest Bounding Box
The input for this part is the output of the LayoutLMv2 classification model, where the text is at word level and the labels follow BIOES tagging [22].
Hence, the word-level text needs to be combined into sentence level to make sense of the full questions and answers. For this combination, the bounding box positions and the labels are critical.
Two merging rules are used to obtain the full questions and answers from the current word-level form. These two merging rules are described next.
The flowchart of merging rule 1 is shown in Fig. 3. First, it obtains the coordinates, labels and text from the LayoutLMv2 predicted results and images. It then traverses the whole result set and compares whether the two labels are the same. If they are not, None is assigned to the variable "neighbours". If they are the same, it checks whether the vertical y coordinates are the same, which determines whether the two nodes are on the same line. If they are not on the same line, None is assigned. If they are on the same line, it proceeds to check whether the right side of one bounding box coincides with the left side of the other bounding box. If it does, the bounding box ID is assigned to each of the nodes in the "neighbours" variable; otherwise, None is assigned.
After that, the second part of merging rule 1, shown in the blue box in Fig. 3, checks whether multiple bounding boxes are connected to one another. If so, they are grouped into one group; a bounding box that is not connected to any other bounding box is put into its own group.
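A minimal sketch of these two steps of merging rule 1 is given below, under assumed data structures: each node carries its text, label and bounding box as (left, top, right, bottom). The field names and the grouping helper are illustrative rather than the exact implementation.

# Minimal sketch of merging rule 1: link boxes that share a label, lie on the
# same line and touch horizontally, then group connected boxes together.
def find_right_neighbours(nodes):
    for node in nodes:
        node["right_neighbour"] = None
    for i, a in enumerate(nodes):
        for j, b in enumerate(nodes):
            if i == j or a["label"] != b["label"]:
                continue                                 # different labels: no neighbour
            if a["box"][1] != b["box"][1]:
                continue                                 # different y: not on the same line
            if a["box"][2] == b["box"][0]:               # a's right side meets b's left side
                a["right_neighbour"] = j
    return nodes

def group_connected(nodes):
    groups, seen = [], set()
    for i in range(len(nodes)):
        if i in seen:
            continue
        group, j = [], i
        while j is not None and j not in seen:
            seen.add(j)
            group.append(j)
            j = nodes[j]["right_neighbour"]              # follow the neighbour chain
        groups.append(group)                             # isolated boxes form their own group
    return groups

nodes = [
    {"text": "Gross",  "label": "QUESTION", "box": (10, 50, 60, 70)},
    {"text": "Weight", "label": "QUESTION", "box": (60, 50, 120, 70)},
    {"text": "88,500", "label": "ANSWER",   "box": (200, 50, 260, 70)},
]
print(group_connected(find_right_neighbours(nodes)))     # [[0, 1], [2]]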
Lastly, in the purple box in Fig. 3, the merged results are derived and updated into a dictionary of the form:
{'text': pred[link_list[0]]['text'],
 'labels': pred[link_list[0]]['labels'],
 'left_neighbour': None,
 'right_neighbour': None,
 'width': pred[link_list[0]]['width'],
 'height': pred[link_list[0]]['height']}
The second merging rule is shown in Fig. 4. It uses the data derived from merging rule 1 to further enhance the method of obtaining the full questions and answers.
First, merging rule 2 obtains the center of each bounding box from the mean of its x and y coordinates. Next, it finds the nearby bounding boxes and checks whether both labels are the same. If they are not, None is assigned to the variable "neighbours". If both labels are the same, it checks whether the center of bounding box A lies within the y range of bounding box B. If not, the variable "neighbours" is assigned None. Otherwise, it checks whether the center of bounding box B lies within the y range of bounding box A. If so, it checks whether the two bounding boxes overlap in the x direction, or whether their gap in the x direction is less than 10% of the width of the image. If this condition is true, the bbox_id of each of these two bounding boxes (bboxes) is assigned as a neighbour to the other's node; if it is false, None is assigned to the variable "neighbours".
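A minimal sketch of the geometric checks in merging rule 2 is given below, under assumed bounding boxes of the form (left, top, right, bottom); it covers only the geometry (the label check is the same as in merging rule 1) and the names are illustrative.

# Minimal sketch: same-line and closeness test of merging rule 2.
def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def are_neighbours(box_a, box_b, image_width):
    cy_a, cy_b = center(box_a)[1], center(box_b)[1]
    # Each box's vertical center must fall within the other box's y range.
    if not (box_b[1] <= cy_a <= box_b[3] and box_a[1] <= cy_b <= box_a[3]):
        return False
    # The boxes must overlap in x, or their gap must be < 10% of the image width.
    overlap_x = box_a[0] <= box_b[2] and box_b[0] <= box_a[2]
    gap_x = max(box_b[0] - box_a[2], box_a[0] - box_b[2], 0)
    return overlap_x or gap_x < 0.10 * image_width

# Two word boxes on the same line, separated by a small horizontal gap.
print(are_neighbours((10, 50, 60, 70), (80, 52, 140, 72), image_width=1000))  # True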
The second part of merging rule 2, shown in the blue box, checks whether multiple bounding boxes are connected to one another. If so, these bounding boxes are grouped into one group; a bounding box that is not connected to any other is put into an individual group. These results are merged and passed to the last process of merging rule 2, shown in the purple box. Lastly, the results are merged and updated into the dictionary in the same format as that of merging rule 1.
After merging the words into the full keys and values, the keys and values need to be paired together. There are two pairing rules for this step.
The flowchart in Fig. 5 shows how key-value pairing rule 1 works. It takes in the results from the merging of the full questions and answers. It then checks whether the "Question" and "Answer" have right neighbours, followed by checking whether the y coordinates of the "Question" and the "Answer" are identical. Lastly, pairing rule 1 checks whether the coordinates of the right side of the "Question" bounding box are identical to those of the left side of the "Answer" bounding box.
After applying key-value pairing rule 1, the output is split into successfully paired and unsuccessfully paired results. The unsuccessfully paired results are fed into key-value pairing rule 2 to pair the remaining questions and answers. Key-value pairing rule 2 is presented in Fig. 6. Firstly, it checks whether the center of each bounding box lies within the y range of the other bounding box. Next, it checks whether the two bounding boxes overlap in the x direction, or whether their gap in the x direction is less than 50% of the width of the narrower bounding box.
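A minimal sketch combining the two pairing rules is given below, under assumed merged entries that carry a bounding box as (left, top, right, bottom); the field names and helper functions are illustrative, not the exact implementation.

# Minimal sketch: pairing rule 1 (strict) with pairing rule 2 as the fallback.
def pair_rule_1(question, answer):
    q, a = question["box"], answer["box"]
    return q[1] == a[1] and q[2] == a[0]                 # same line and touching edges

def pair_rule_2(question, answer):
    q, a = question["box"], answer["box"]
    cq_y, ca_y = (q[1] + q[3]) / 2.0, (a[1] + a[3]) / 2.0
    if not (a[1] <= cq_y <= a[3] and q[1] <= ca_y <= q[3]):
        return False                                     # not on the same text line
    overlap_x = q[0] <= a[2] and a[0] <= q[2]
    gap_x = max(a[0] - q[2], q[0] - a[2], 0)
    return overlap_x or gap_x < 0.5 * min(q[2] - q[0], a[2] - a[0])

question = {"label": "QUESTION", "box": (10, 50, 120, 70)}
answer   = {"label": "ANSWER",   "box": (140, 52, 260, 72)}
# Rule 1 fails because the boxes do not touch, so rule 2 is applied as the fallback.
print(pair_rule_1(question, answer), pair_rule_2(question, answer))  # False True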
The final output consists of the label, the bounding boxes, the text, the left_neighbour, the right_neighbour, the width, the height, the pair_with variable, and lastly the center attribute.
3) Results Comparison
After performing the key-value pairing, the results are evaluated by comparing the derived final key-value pairs with the ground truth. The experiment result comparisons are shown in TABLE IX. The metrics used for the comparison are precision, recall and F1 score.
Table IX Results Comparison for Key Value Pairing
Algorithm | Precision | Recall | F1 Score
Regular Expression | 63% | 60% | 66%
Pairing by Finding Nearby Bounding Box | 73% | 72% | 70%
It is observed from the results that the second algorithm, i.e., pairing by finding the nearby bounding box, performs better, with a precision of 73%, recall of 72% and F1 score of 70%. In comparison, the first algorithm, i.e., the regular expression, achieves a precision of 63%, recall of 60% and F1 score of 66%. The reason the regular expression algorithm does not perform well might be that only a limited number of patterns are introduced into the system. Even though pairing by finding a nearby bounding box performs better, it can still be further improved, as on some occasions the questions and answers are very far apart in the horizontal direction. Furthermore, a few mistakes in the OCR results cause LayoutLMv2 to misclassify some labels, which disrupts the key-value pairs.