4.1 Evaluation Metrics
The evaluation metrics used in the ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction [20] are employed to evaluate the experiment performances in this paper: Precision, Recall and F1 Score. The extracted text is compared to the ground truth for each test image. If both the content and the category of the extracted text match the ground truth, it is marked as correct; otherwise, it is marked as incorrect. Furthermore, the algorithms are also evaluated using the classification report and the confusion matrix.
Precision, recall, F1 score, and support are the four most important values to examine in a classification report. Precision refers to the ability of a classifier not to label a negative instance as positive [21]. Recall is the ability of a classifier to find all positive instances. The F1 score is the weighted harmonic mean of precision and recall. Support is the number of actual occurrences of a class in the specified dataset.
The precision value is calculated using Eq. (1). Precision is useful when False Positives (FP) pose a greater threat than False Negatives (FN). In our research, a large number of false positives would harm the key-value pair linking and the final key-value pair results.
$$Precision = \frac{TP}{TP+FP} \quad (1)$$
where TP denotes True Positive.
The recall value is computed using Eq. (2). It is a valuable metric when false negatives matter more than false positives.
$$Recall = \frac{TP}{TP+FN} \quad (2)$$
The F1 score ranges between 0 and 1 and is calculated using Eq. (3); the closer it is to 1, the better the model performs.
$$F1\ Score = 2 \times \frac{Precision \times Recall}{Precision+Recall} \quad (3)$$
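As an illustration of how these metrics, together with the classification report and confusion matrix, can be computed in practice, a minimal sketch is given below. It assumes the ground-truth and predicted labels are available as Python lists; the label names and values are purely illustrative and are not taken from the experiments.

# Minimal sketch: per-class precision, recall, F1 score and support (Eqs. (1)-(3)),
# plus the confusion matrix used for further evaluation. Labels are illustrative.
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["Question", "Answer", "Answer", "Other", "Question", "Other"]  # ground truth
y_pred = ["Question", "Answer", "Other", "Other", "Question", "Answer"]  # predictions

print(classification_report(y_true, y_pred, digits=2))
print(confusion_matrix(y_true, y_pred, labels=["Question", "Answer", "Other"]))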
4.2 Label Classification Comparisons
Prior to classification, the dataset is first split into 80% training and 20% testing. The four classification models experimented with are BERT, ULMFit, YOLOv4 combined with BERT, and LayoutLMv2, each of which classifies the text into "Question", "Answer" and "Other". The dataset used in the experiments in this sub-section is the Cargo Invoice dataset constructed in this research.
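A minimal sketch of the 80%/20% split is shown below, assuming the dataset is held as a list of (text, label) samples; the sample values and the use of scikit-learn's train_test_split are assumptions for illustration.

# Minimal sketch: split the labelled samples into 80% training and 20% testing.
from sklearn.model_selection import train_test_split

samples = [("Gross Weight", "Question"), ("88,500 KG", "Answer"), ("Page 1 of 2", "Other"),
           ("Net Weight", "Question"), ("270 lb", "Answer")]  # illustrative samples
texts, labels = zip(*samples)

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42)  # 20% held out for testing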
1) Results of the ULMFit Model
With the Cargo Invoice dataset, the ULMFit model is utilized for key-value pair classification. The precision, recall, F1 score and support of the ULMFit model are shown in TABLE II. The ULMFit model achieves 93% accuracy. Overall, the performance of this model is good. However, the model does not perform as well for the "Other" class as for the other two classes. This might be because there are fewer "Other" labels compared to "Question" and "Answer".
Table II ULMFit Results with Cargo Invoice Dataset
Class | Precision | Recall | F1 Score | Support
Question | 99% | 98% | 98% | 882
Answer | 88% | 96% | 92% | 882
Other | 93% | 80% | 86% | 568
Accuracy | | | 93% | 2332
2) Results of the BERT Model
The experiment results of the BERT classification model with the Cargo Invoice dataset are shown in TABLE III. The BERT model achieves 88% accuracy. However, the "Answer" label scores lower in precision, recall and F1 score. The model performs best on the "Other" label, whose precision, recall and F1 score are higher than those of the other two labels.
Table III BERT Model Classification Results
Class | Precision | Recall | F1 Score | Support
Question | 85% | 91% | 88% | 2128
Answer | 83% | 76% | 79% | 1387
Other | 96% | 95% | 95% | 1493
Accuracy | | | 88% | 5008
3) Results of YOLOv4 Combined with BERT Model
The classification results of the YOLOv4 combined with BERT model are shown in TABLE IV. The overall performance is worse than that of the other experimented classification models. The "Other" label performs the worst of the three labels, possibly due to the imbalanced dataset, in which "Other" has the least amount of data. The results of this experiment show that combining the YOLOv4 and BERT algorithms does not enhance the performance, as the BERT model itself achieves 88% on the cargo invoices, as seen in TABLE III.
Table IV Results of YOLOv4 Combined with BERT Model
Class | Precision | Recall | F1 Score | Support
Answer | 52% | 42% | 46% | 882
Other | 46% | 39% | 40% | 568
Question | 49% | 40% | 44% | 882
Accuracy | | | 40.27% | 2332
4) Results of the LayoutLMv2 Model
For the LayoutLMv2 model, the labelling method is slightly different from the other three models. The labels of the LayoutLMv2 model follow BIOES tagging, where B means the beginning; I means inside (the middle); E means the ending; O means others; and S means a single word representing a full sequence [22]. For example, "Product No." is split into "Product", labelled B-QUESTION, and "No.", labelled E-QUESTION.
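A minimal sketch of this tagging scheme is given below; the helper function is an illustration of BIOES tagging, not the exact labelling code used for the dataset.

# Minimal sketch: assign BIOES tags to the words of one entity.
def bioes_tags(words, entity_type):
    if len(words) == 1:
        return ["S-" + entity_type]                      # single word covers the whole entity
    tags = ["B-" + entity_type]                          # beginning
    tags += ["I-" + entity_type] * (len(words) - 2)      # inside (middle) words
    tags.append("E-" + entity_type)                      # ending
    return tags

print(bioes_tags(["Product", "No."], "QUESTION"))        # ['B-QUESTION', 'E-QUESTION']
print(bioes_tags(["Total"], "QUESTION"))                 # ['S-QUESTION']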
The experiment results of the LayoutLMv2 classification model are shown in TABLE V. The LayoutLMv2 model achieves an accuracy of 96%, the highest among the four classification models. All the labels achieve 90% and above for precision, recall and F1 score, which shows that the LayoutLMv2 model predicts well and distinguishes each label properly. Even though the data are somewhat imbalanced, the model is still able to predict all the labels well. A further analysis of the confusion matrix will be conducted to confirm whether this model is the best.
Table V LayoutLMv2 Model Classification Results
Class | Precision | Recall | F1 Score | Support
B-ANSWER | 93% | 94% | 94% | 331
B-QUESTION | 98% | 98% | 98% | 507
E-ANSWER | 90% | 96% | 93% | 331
E-QUESTION | 98% | 98% | 98% | 507
I-ANSWER | 96% | 98% | 97% | 915
I-QUESTION | 96% | 97% | 97% | 104
O | 97% | 93% | 94% | 1387
S-ANSWER | 95% | 97% | 96% | 502
S-QUESTION | 96% | 98% | 97% | 375
Accuracy | | | 96% | 4959
5) Experiment Results with the FUNSD Dataset
Besides the classification experiments on the constructed Cargo Invoice dataset, experiments are also performed on the FUNSD dataset to compare the performance of the four classification models. The overall performance on the FUNSD dataset is shown in TABLE VI.
Table VI Classification Results with the FUNSD dataset
Model | Precision | Recall | F1 Score
ULMFit | 77% | 76% | 76%
BERT | 55% | 67% | 60%
YOLO combined with BERT | 49% | 40% | 44%
LayoutLMv2 | 80% | 85% | 83%
The results show that the LayoutLMv2 model performs the best of the four models. This demonstrates that integrating the spatial-aware self-attention mechanism into the Transformer architecture allows the LayoutLMv2 model to fully understand the relative positional relationships among different text blocks. On the other hand, YOLOv4 combined with the BERT model does not perform well. Even though it is trained with the text along with the bounding boxes of the text, it is not able to capture the relative positional relationships as accurately as the LayoutLMv2 model. Thus, the LayoutLMv2 model is chosen by the proposed pipeline for the key-value pair classification task.
4.3 Linking of Key-value Pair Comparisons
The two algorithms evaluated for linking key-value pairs are the regular expression algorithm and pairing by the nearest bounding box algorithm.
1) Regular Expression Algorithm
Before applying regular expressions, some analysis is done on the dataset to understand the patterns of the keys and values, as well as how they are paired, in order to improve the key-value pair extraction. The keys appear in various formats, such as short forms, so a list of the different forms of each key is created. Some examples of these formats can be seen in TABLE VII.
Table VII Some examples of different formats of keys
Word
|
Different Formats of Keys
|
Gross Weight
|
G/W
G/Weight
Gross wt
Gross wght
Gross wght(kg)
|
Net Weight
|
Net WT
Net wght
Net weight(kg)
Net wt(kg)
Net wght(kg)
N/W
|
Dimension
|
Dim
Dims
Dim(mm)
DIM (CM)
Dimensions(cm)
|
After identifying the different formats of the keys in the Cargo Invoice dataset, a word cloud is created to better understand the keys in the dataset, as shown in Fig. 2. In the word cloud, words such as "gross weight", "net weight", "PO", "country of origin", and "dimension" stand out the most. Hence, starting from these common keys, the next step is to understand the patterns of their values. Some examples of the value patterns are shown in TABLE VIII.
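A minimal sketch of how such a word cloud can be generated is given below; it assumes the extracted keys are available as a list of strings and uses the wordcloud package, which is one possible tool rather than necessarily the one used for Fig. 2.

# Minimal sketch: build a word cloud image from the extracted keys.
from wordcloud import WordCloud

keys = ["gross weight", "net weight", "PO", "country of origin", "dimension",
        "gross weight", "net weight", "dimension"]       # illustrative key list
cloud = WordCloud(width=800, height=400, background_color="white").generate(" ".join(keys))
cloud.to_file("key_wordcloud.png")                       # frequent keys appear larger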
Table VIII Pattern of Values
Key
|
Values
|
Patterns
|
Gross Weight
|
G/W 2134
23Kg/g
88,500 KG
|
For gross weight the pattern is always an integer/float followed by the metrics (KG/lb/g). Sometimes G/W would be infront of the integer/float.
|
Net Weight
|
270 lb
88,500 KG
3.840 KG
|
For Net weight the pattern is similar to Gross weight. The pattern is always an integer/float followed by the metrics (KG/lb/g).
|
PO
|
PO-IMA-21007
PO2000004328
|
For PO the pattern is PO followed by letters then integers or just integers.
|
Dimension
|
228.00 X 148.00 X 165.00 CM
89.76 X 58.26 X 64.96 INCH
3815*2150*2550mm
|
For Dimension, the pattern is integer/float X integer/float X integer/float
|
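A minimal sketch of how these value patterns might be written as regular expressions is given below; the expressions and the sample line are illustrative assumptions, not the exact patterns used in the pipeline.

# Minimal sketch: illustrative regular expressions for the value patterns in TABLE VIII.
import re

VALUE_PATTERNS = {
    # integer/float (with optional thousands separator) followed by a weight unit;
    # "G/W" may appear in front of the number
    "Gross Weight": re.compile(r"(?:G/W\s*)?\d{1,3}(?:,\d{3})*(?:\.\d+)?\s*(?:KG|kg|Kg|lb|g)\b"),
    "Net Weight":   re.compile(r"\d{1,3}(?:,\d{3})*(?:\.\d+)?\s*(?:KG|kg|Kg|lb|g)\b"),
    # "PO" followed by letters/digits (with optional hyphens), or a long run of digits
    "PO":           re.compile(r"PO[-A-Z0-9]+|\b\d{6,}\b"),
    # integer/float X integer/float X integer/float, with "X" or "*" as separator
    "Dimension":    re.compile(r"\d+(?:\.\d+)?\s*[Xx*]\s*\d+(?:\.\d+)?\s*[Xx*]\s*\d+(?:\.\d+)?"),
}

line = "PO-IMA-21007  DIM (CM): 228.00 X 148.00 X 165.00 CM  G/W 88,500 KG"
for key, pattern in VALUE_PATTERNS.items():
    match = pattern.search(line)
    if match:
        print(key, "->", match.group())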
After analysing these patterns, the key-value pairs are extracted. The workflow of the key-value pair extraction is as follows (a sketch is given after the list):
- Find all candidate key-value pairs.
- Calculate the Levenshtein distance between each candidate key and the given identifiers to determine the most likely identifier.
- Return the key-value pair with the lowest normalized Levenshtein distance.
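The sketch below illustrates this workflow under assumed data structures: candidate (key, value) pairs are matched against the known identifiers by normalized Levenshtein distance, and the best-scoring pair is returned. The helper names are illustrative, not the exact implementation.

# Minimal sketch: rank candidate keys against known identifiers by normalized
# Levenshtein distance and return the best match.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))                       # classic DP edit distance
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca.lower() != cb.lower())))  # substitution
        prev = curr
    return prev[-1]

def normalized_distance(a, b):
    return levenshtein(a, b) / max(len(a), len(b), 1)

def best_pair(candidates, identifiers):
    best = None                                          # (distance, identifier, value)
    for key, value in candidates:
        for ident in identifiers:
            d = normalized_distance(key, ident)
            if best is None or d < best[0]:
                best = (d, ident, value)
    return best

print(best_pair([("Gross wght(kg)", "88,500 KG"), ("N/W", "270 lb")],
                ["Gross Weight", "Net Weight"]))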
2) Pairing Via Nearest Bounding Box
The input for this part is the output of the LayoutLMv2 classification model, where the text is at word level and the labels follow BIOES tagging [22].
Hence, the word-level text needs to be combined into sentence level to make sense of the full questions and answers. For this combination, the bounding box positions and the labels are critical.
Two merging rules are used to obtain the full questions and answers from the current word-level form. These two merging rules are described next.
The flowchart of merging rule 1 is shown in Fig. 3. First, it obtains the coordinates, labels and text from the LayoutLMv2 predicted results and images. It then traverses the whole result set and compares whether the two labels are the same. If they are not, None is assigned to the variable "neighbours". If they are the same, it checks whether the vertical y coordinates are the same, which determines whether the two nodes are on the same line. If they are not on the same line, None is assigned. If they are on the same line, it proceeds to check whether the right side of one bounding box coincides with the left side of the other bounding box. If it does, the bounding box ID is assigned to each of the nodes in the "neighbours" variable; otherwise, None is assigned.
After that, the second part of merging rule 1, shown in the blue box in Fig. 3, checks whether multiple bounding boxes are connected to one another. If so, they are grouped into one group; a bounding box that is not connected to any other bounding box is put into its own group.
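A minimal sketch of these two steps of merging rule 1 is given below, under assumed data structures: each node carries its text, label and bounding box as (left, top, right, bottom). The field names and the grouping helper are illustrative rather than the exact implementation.

# Minimal sketch of merging rule 1: link boxes that share a label, lie on the
# same line and touch horizontally, then group connected boxes together.
def find_right_neighbours(nodes):
    for node in nodes:
        node["right_neighbour"] = None
    for i, a in enumerate(nodes):
        for j, b in enumerate(nodes):
            if i == j or a["label"] != b["label"]:
                continue                                 # different labels: no neighbour
            if a["box"][1] != b["box"][1]:
                continue                                 # different y: not on the same line
            if a["box"][2] == b["box"][0]:               # a's right side meets b's left side
                a["right_neighbour"] = j
    return nodes

def group_connected(nodes):
    groups, seen = [], set()
    for i in range(len(nodes)):
        if i in seen:
            continue
        group, j = [], i
        while j is not None and j not in seen:
            seen.add(j)
            group.append(j)
            j = nodes[j]["right_neighbour"]              # follow the neighbour chain
        groups.append(group)                             # isolated boxes form their own group
    return groups

nodes = [
    {"text": "Gross",  "label": "QUESTION", "box": (10, 50, 60, 70)},
    {"text": "Weight", "label": "QUESTION", "box": (60, 50, 120, 70)},
    {"text": "88,500", "label": "ANSWER",   "box": (200, 50, 260, 70)},
]
print(group_connected(find_right_neighbours(nodes)))     # [[0, 1], [2]]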
Lastly, in the purple box in Fig. 3, the merged results are derived and updated into a dictionary of the form:
{'text': pred[link_list[0]]['text'],
 'labels': pred[link_list[0]]['labels'],
 'left_neighbour': None,
 'right_neighbour': None,
 'width': pred[link_list[0]]['width'],
 'height': pred[link_list[0]]['height']}
The second merging rule is shown in Fig. 4. It uses the data derived from merging rule 1 to further enhance the method of obtaining the full questions and answers.
First, merging rule 2 obtains the center of each bounding box from the mean of its x and y coordinates. Next, it finds the nearby bounding boxes and checks whether both labels are the same. If they are not, None is assigned to the variable "neighbours". If both labels are the same, it checks whether the center of bounding box A lies within the y range of bounding box B. If not, the variable "neighbours" is assigned None. Otherwise, it checks whether the center of bounding box B lies within the y range of bounding box A. If so, it checks whether the two bounding boxes overlap in the x direction, or whether their gap in the x direction is less than 10% of the width of the image. If this condition is true, the bbox_id of each of these two bounding boxes (bboxes) is assigned as a neighbour to the other's node; if it is false, None is assigned to the variable "neighbours".
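A minimal sketch of the geometric checks in merging rule 2 is given below, under assumed bounding boxes of the form (left, top, right, bottom); it covers only the geometry (the label check is the same as in merging rule 1) and the names are illustrative.

# Minimal sketch: same-line and closeness test of merging rule 2.
def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def are_neighbours(box_a, box_b, image_width):
    cy_a, cy_b = center(box_a)[1], center(box_b)[1]
    # Each box's vertical center must fall within the other box's y range.
    if not (box_b[1] <= cy_a <= box_b[3] and box_a[1] <= cy_b <= box_a[3]):
        return False
    # The boxes must overlap in x, or their gap must be < 10% of the image width.
    overlap_x = box_a[0] <= box_b[2] and box_b[0] <= box_a[2]
    gap_x = max(box_b[0] - box_a[2], box_a[0] - box_b[2], 0)
    return overlap_x or gap_x < 0.10 * image_width

# Two word boxes on the same line, separated by a small horizontal gap.
print(are_neighbours((10, 50, 60, 70), (80, 52, 140, 72), image_width=1000))  # True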
The second part of merging rule 2, shown in the blue box, checks whether multiple bounding boxes are connected to one another. If so, these bounding boxes are grouped into one group; a bounding box that is not connected to any other is put into an individual group. These results are merged and passed to the last process of merging rule 2, shown in the purple box. Lastly, the results are merged and updated into the dictionary in the same format as that of merging rule 1.
After merging the words into the full keys and values, the keys and values need to be paired together. There are two pairing rules for this step.
The flowchart in Fig. 5 shows how key-value pairing rule 1 works. It takes in the results from the merging of the full questions and answers. It then checks whether the "Question" and "Answer" have right neighbours, followed by checking whether the y coordinates of the "Question" and the "Answer" are identical. Lastly, pairing rule 1 checks whether the coordinates of the right side of the "Question" bounding box are identical to those of the left side of the "Answer" bounding box.
After applying key-value pairing rule 1, the output is split into successfully paired and unsuccessfully paired results. The unsuccessfully paired results are fed into key-value pairing rule 2 to pair the remaining questions and answers. Key-value pairing rule 2 is presented in Fig. 6. Firstly, it checks whether the center of each bounding box lies within the y range of the other bounding box. Next, it checks whether the two bounding boxes overlap in the x direction, or whether their gap in the x direction is less than 50% of the width of the narrower bounding box.
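A minimal sketch combining the two pairing rules is given below, under assumed merged entries that carry a bounding box as (left, top, right, bottom); the field names and helper functions are illustrative, not the exact implementation.

# Minimal sketch: pairing rule 1 (strict) with pairing rule 2 as the fallback.
def pair_rule_1(question, answer):
    q, a = question["box"], answer["box"]
    return q[1] == a[1] and q[2] == a[0]                 # same line and touching edges

def pair_rule_2(question, answer):
    q, a = question["box"], answer["box"]
    cq_y, ca_y = (q[1] + q[3]) / 2.0, (a[1] + a[3]) / 2.0
    if not (a[1] <= cq_y <= a[3] and q[1] <= ca_y <= q[3]):
        return False                                     # not on the same text line
    overlap_x = q[0] <= a[2] and a[0] <= q[2]
    gap_x = max(a[0] - q[2], q[0] - a[2], 0)
    return overlap_x or gap_x < 0.5 * min(q[2] - q[0], a[2] - a[0])

question = {"label": "QUESTION", "box": (10, 50, 120, 70)}
answer   = {"label": "ANSWER",   "box": (140, 52, 260, 72)}
# Rule 1 fails because the boxes do not touch, so rule 2 is applied as the fallback.
print(pair_rule_1(question, answer), pair_rule_2(question, answer))  # False True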
The final output consists of the label, the bounding boxes, the text, the left_neighbour, the right_neighbour, the width, the height, the pair_with variable, and lastly the center attribute.
3) Results Comparison
After performing the key-value pairing, the results are evaluated by comparing the derived final key-value pairs with the ground truth. The experiment result comparisons are shown in TABLE IX. The metrics used for the comparison are precision, recall and F1 score.
Table IX Results Comparison for Key Value Pairing
Algorithm | Precision | Recall | F1 Score
Regular Expression | 63% | 60% | 66%
Pairing by Finding Nearby Bounding Box | 73% | 72% | 70%
It is observed from the results that the second algorithm, i.e., pairing by finding the nearby bounding box, performs better, with a precision of 73%, recall of 72% and F1 score of 70%. In comparison, the first algorithm, i.e., the regular expression, achieves a precision of 63%, recall of 60% and F1 score of 66%. The reason the regular expression algorithm does not perform well might be that only a limited number of patterns are introduced into the system. Even though pairing by finding a nearby bounding box performs better, it can still be further improved, as on some occasions the questions and answers are very far apart in the horizontal direction. Furthermore, a few mistakes in the OCR results cause LayoutLMv2 to misclassify some labels, which disrupts the key-value pairs.