Effect of attention and triplet loss on chart classification: a study on noisy charts and confusing chart pairs

Charts are powerful tools for visualizing and comparing data. With the increase in the presence of various chart types in scientific documents in electronic media, the development of an automatic chart classification system is becoming an important task. Existing studies on chart classification fail to address the presence of noise in charts and confusing chart class pairs. Motivated by the above observations, in this paper, we propose an attention and triplet loss based deep CNN framework to address the above issues. From various experimental results over four datasets, it is evident that the proposed framework can effectively handle noise in the charts and confusing chart samples and outperforms its counterparts.


Introduction
With the increase in the presence of various chart types in scientific documents in electronic media, the development of an automatic chart classification system is becoming an important task. Though researchers' attention to chart classification increased after 2001 (Zhou & Tan, 2001), its importance was realized much earlier (Futrelle et al., 1992). Initial studies (in the pre-deep-learning era) on chart classification generally used traditional machine learning methods such as SVM, KNN, and decision trees with handcrafted features (Shao & Futrelle, 2006; Prasad et al., 2007; Jung et al., 2017). However, the majority of recent studies focus on state-of-the-art deep learning models such as VGGs, ResNets, etc. As reported in Thiyam et al. (2021a, 2021b), the majority of existing chart classification models face problems while handling (i) chart noise: most of the publicly available datasets for chart classification contain samples with various types of noise, such as background noise, pattern noise, composite noise, etc., and (ii) confusing chart class pairs: charts with similar characteristics are one of the major reasons for chart misclassification.
To the best of our knowledge, none of the earlier studies on chart classification focused on developing methods that could handle the above two issues. Motivated by this, this paper proposes an attention and triplet loss based model to address the problems of chart noise and confusing chart class pairs. Though attention-based approaches have been extensively used to handle noise in other image classification tasks, none of the earlier studies have investigated the effect of attention mechanisms on handling chart noise in chart classification tasks. Therefore, in this paper, we investigate the effect of two attention mechanisms, namely the Convolutional Block Attention Module (CBAM) (Woo et al., 2018) and the Squeeze-and-Excitation network (SE) (Hu et al., 2018), on handling chart noise. We apply these two attention mechanisms to various CNN models (VGGs, ResNets, Inceptions, MobileNets, DenseNets, Xception). One of the most effective chart classification models, Xception, has not been thoroughly examined with attention mechanisms (except in Zhang et al., 2021). In this paper, we propose attention-based Xception models by incorporating CBAM and SE attention into both the residual and non-residual layers (the study by Zhang et al. (2021) considers attention only with the last seven residual layers of Xception). Furthermore, this study explores the triplet loss function for the first time in the domain of chart classification. As training a model using the triplet loss function is one of the common approaches for fine-grained classification (Wang et al., 2019; Cui et al., 2016a), this paper investigates the effect of the triplet loss function on handling confusing chart class pairs. Triplet loss learning became a popular approach after the proposal of Google's FaceNet (Schroff et al., 2015).
Triplet loss operates on a triplet (anchor, positive, negative) consisting of an anchor image, a positive image (which is similar to the anchor image), and a negative image (which is dissimilar to the anchor image). Focusing on widening the distances between confusing samples, we develop a strategy that forms triplets from confusing samples only.
The rest of the paper is organized as follows. Section 2 presents the background and related studies, where we discuss selected existing studies on chart classification and the two issues, chart noise and confusing chart class pairs, reported in our earlier study (Thiyam et al., 2021b). In Section 3, we present our proposed attention-based Xception models and a framework in which the attention-based models are trained using the triplet loss function. Section 4 presents the detailed experimental setups. In Section 5, we discuss the experimental results, reporting the performance of multiple chart classification models. In Section 6, we present a detailed analysis of the proposed framework with respect to chart noise and confusing chart class pairs. Section 7 concludes our study and highlights future directions.

Related and Background studies
Although studies on chart analysis can be traced back to the early 1990s (Futrelle et al., 1992), the first study on chart classification was reported in 2001 (Zhou & Tan, 2001). A good survey on chart classification can be found in Davila et al. (2020). Based on the classification methods used by the existing studies, the journey of chart classification can be divided into two phases, viz. the pre-deep neural network phase (2001-2013) and the deep neural network phase (since 2014).
In the pre-deep neural network phase, studies exploited model-based approaches and traditional machine learning (ML) approaches such as SVM (Futrelle et al., 2003; Gao et al., 2012), HNN (Zhou & Tan, 2001), KNN (Gao et al., 2012; Karthikeyani & Nagarajan, 2012), etc. In a model-based approach, each chart type has its own unique model, which is based on the chart's inherent characteristics. Among these characteristics are graphical components of the charts, such as axes and colors, and the layout of the chart, e.g., rectangular (for bar charts) or circular (for pie charts) (Yokokura, 1998; Mishchenko & Vassilieva, 2011). SVM, KNN, and Decision Tree are some of the well-known traditional ML models. Studies in this phase consider small, manually collected sample sizes and a small number of chart types. One main shortcoming of this phase is that most approaches do not generalize well: they are not effective when dealing with a large amount of data that may contain significant varieties, as in a chart image dataset (Amara et al., 2017).
LeNet, AlexNet, VGG-16, VGG-19, Inception-V3, and Inception-V4 are some CNN-based models that have been exploited in the current deep neural network (DNN) phase, along with Inception-ResNet-V2, MobileNet-V1, and MobileNet-V2. Xception is another CNN-based model that has been exploited in this phase. In this phase, models take raw images without explicitly extracted features. Although the authors of works such as Tang et al. (2015) have investigated models with and without feature extraction, they have also commented on the relevance of feature selection approaches. The study (Mishra et al., 2021) developed a model for learning the characteristics of both different and comparable regions simultaneously; an enhanced loss function, fused with a structural variation-aware dissimilarity index and regularization parameters, is utilized to train the model and make it more sensitive to dissimilar areas. In this era, dataset size is one of the obstacles: most datasets with real chart images are quite small. For this reason, large-scale synthetically created datasets have been examined in various works (Chagas et al., 2018; Davila et al., 2019). However, a model trained on a synthetic dataset fails to perform well on real chart images, because real chart images often contain noise compared to synthetic images. To address the lack of real chart samples, the study (Bajić & Job, 2021) proposed a chart classification model using a Siamese CNN (Koch et al., 2015). A Siamese CNN is a network architecture built from two or more identical (twin) networks; in their case, MobileNet.

Effect of noisy chart samples and confusing chart class pairs
Apart from image quality and image noise, the performance of a chart classification model depends on other factors such as noisy chart samples and confusing chart pairs. Our earlier paper (Thiyam et al., 2021b) discovered ten types of chart noise and 13 confusing chart class pairs. As stated earlier, the main objective of this study is to develop a model that can handle noisy and confusing chart samples. We briefly discuss the chart noise types and confusing chart class pairs reported in the study (Thiyam et al., 2021b).
Chart noise: Noisy chart samples are defined as samples that are often misclassified because of extraneous components in the charts. Figure 1 shows samples of the ten chart noise types; their brief definitions are given below (items 1-8, Figs. 1a-h).

9. 3D images (3DI): As our study does not consider 3D chart images except for surface plots, 3D images become one noise type. Even an image with a slight degree of the third dimension, as shown in Fig. 1i, becomes noise.
10. Patterned Background (PB): A chart image that has a background with patterns, as shown in Fig. 1j. It is an area chart, but because of the vertical blocks in the background, it is misclassified.

Confusing chart class pairs
In the study (Thiyam et al., 2021b), we observed false classifications throughout the experiments because of similarity between two or more chart types. A chart class pair (X, Y) is considered a confusing chart class pair if t% of the sample population belonging to class X is classified as class Y. The study (Thiyam et al., 2021b) considered t = 4% and reported 13 confusing chart class pairs. They are briefly discussed below.
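The t% rule can be sketched as a simple scan over a confusion matrix. The function name, class labels, and toy counts below are illustrative assumptions, not taken from the paper's data:

```python
import numpy as np

def confusing_pairs(conf_mat, labels, t=4.0):
    """Return (X, Y) pairs where at least t% of class X's samples
    are predicted as class Y (X != Y), per the rule above."""
    pairs = []
    totals = conf_mat.sum(axis=1)           # samples per true class
    for i, total in enumerate(totals):
        if total == 0:
            continue
        for j in range(conf_mat.shape[1]):
            if i != j and 100.0 * conf_mat[i, j] / total >= t:
                pairs.append((labels[i], labels[j]))
    return pairs

# Toy confusion matrix: rows = true class, columns = predicted class.
labels = ["line", "area", "bar"]
cm = np.array([
    [90, 6, 4],    # 6% of "line" goes to "area", 4% to "bar"
    [2, 97, 1],
    [0, 3, 97],
])
print(confusing_pairs(cm, labels))
```

With t = 4, both ("line", "area") and ("line", "bar") cross the threshold; raising t to 5 keeps only the first.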

Proposed framework
Since the attention mechanism is one of the popular approaches for classifying fine-grained categories, before developing our proposed framework, this study exploits various classification models with the attention mechanism. Although several studies have introduced attention mechanisms into computer vision, to the best of our knowledge, this is the first study of its kind on the effect of attention in the chart classification domain. As stated in Section 1, attention-based variants of several DL models have been studied; however, attention-based Xception models remain largely unexplored.
Since Xception is one of the best-performing chart classification models in our earlier work (Thiyam et al., 2021b), we propose multiple attention-based Xception models and investigate their performance on the chart classification task. This section discusses our proposed attention-based Xception models and introduces the proposed framework, which exploits the attention mechanism and the triplet loss function.

Attention on Xception
Xception consists of 14 modules, all with linear residual connections except for the first and last modules, as shown in Fig. 2. In other words, it has three main flows: entry (4 modules, including the initial CNN layers), middle (8 modules), and exit (2 modules). The study (Zhang et al., 2021) proposed an attention-based Xception for the classification of flower types, incorporating the Convolutional Block Attention Module (CBAM) (Woo et al., 2018) in the last residual layers only. In contrast, we propose inserting an attention mechanism in the non-residual layers as well as the residual layers. Based on the places where attention mechanisms can be inserted, this study proposes five variants of attention-based Xception:

1. Xception-Entry (XE): the attention mechanism is inserted only in the entry flow; three attention modules are inserted.
2. Xception-Middle (XM): the attention mechanism is inserted only in the middle flow. One attention module is integrated for each of the eight modules (all of which have a residual connection); hence, eight attention modules are used in this variant.
3. Xception-Exit (XEX): the attention mechanism is inserted only in the exit flow.
4. Xception-Middle-Exit (XMEX): the attention mechanism is inserted in both the middle and exit flows.
5. Xception-All (XA): the attention mechanism is inserted in all the modules.

This study considers two well-known attention mechanisms, CBAM and the Squeeze-and-Excitation network (SE) (Hu et al., 2018). With five variants of Xception and two attention mechanisms, we propose ten attention-based Xception models: CBAM-based XE, XM, XEX, XMEX, XA, and SE-based XE, XM, XEX, XMEX, XA.
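As a rough illustration of what an inserted SE module computes, the numpy sketch below implements the squeeze (global average pooling), excitation (bottleneck MLP with sigmoid gating), and channel rescaling steps of Hu et al. (2018). The feature map and weights are random placeholders, not trained values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(feature_map, w1, w2):
    """Squeeze-and-Excitation on an (H, W, C) feature map.
    w1: (C, C//r) reduction weights, w2: (C//r, C) expansion weights."""
    # Squeeze: global average pooling -> one descriptor per channel.
    z = feature_map.mean(axis=(0, 1))              # shape (C,)
    # Excitation: bottleneck MLP with ReLU, then sigmoid gating.
    s = sigmoid(np.maximum(z @ w1, 0.0) @ w2)      # shape (C,)
    # Rescale: weight each channel of the input by its gate in (0, 1).
    return feature_map * s                          # broadcasts over H, W

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
w1 = rng.standard_normal((16, 4))   # reduction ratio r = 4
w2 = rng.standard_normal((4, 16))
y = se_block(x, w1, w2)
print(y.shape)
```

Because each gate lies in (0, 1), the block can only attenuate channels, which is the re-weighting behavior attention inserts into each Xception module.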

Attention & triplet loss based Framework
The schematic architecture of the proposed framework is shown in Fig. 3. The framework uses multi-stage training, which can be broadly divided into two blocks: triplet loss training and chart type classification.
(1) Triplet loss training: As mentioned earlier, triplet loss was first introduced in the study (Schroff et al., 2015) for face verification and recognition in 2015. Since then, it has been one of the popular loss functions for fine-grained classification, such as bioacoustics (Zhao et al., 2018), species of birds (Zhao et al., 2018), flowers (Zhang et al., 2021; Guo et al., 2021), and identifying models of vehicles (Kumar et al., 2019). The goal of triplet loss is to learn parameters by minimizing the intra-class distance and maximizing the inter-class distance, as opposed to loss functions like cross-entropy or mean squared error, where the goal is to learn parameters by minimizing the distance between observed and ground-truth values. The misclassification caused by a confusing chart class pair (R, S) (mentioned in Section 2.1) exists because of hard-to-distinguish features among the samples of classes R and S. It is observed from the baseline models that cross-entropy loss is not capable of handling confusing chart pairs. However, as triplet loss attempts to capture the discriminating characteristics of inter-class and intra-class samples by taking two samples (anchor and positive) from class R and one sample (negative) from class S, it helps in separating the confusing chart pairs of these two classes. So, we adopt the triplet loss function to address the issue of confusing chart class pairs. Its learning can be visualized as shown in Fig. 4.
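A minimal sketch of the standard triplet loss of Schroff et al. (2015) on embedding vectors; the margin value and toy embeddings are illustrative only:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: push the negative at least `margin`
    farther from the anchor than the positive."""
    d_ap = np.linalg.norm(anchor - positive)   # intra-class distance
    d_an = np.linalg.norm(anchor - negative)   # inter-class distance
    return max(d_ap - d_an + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same class as the anchor
n = np.array([1.0, 0.0])   # sample from the confusing class
print(triplet_loss(a, p, n))   # 0.1 - 1.0 + 0.2 < 0, so the loss is 0.0
```

The loss is zero exactly when the negative is already separated from the anchor by more than the margin, which is the geometry Fig. 4 depicts.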
In fine-grained classification tasks, apart from the triplet loss, another commonly used method with the same goal is the contrastive loss (Hadsell et al., 2006). Our study considers triplet loss because of its advantages over contrastive loss, as reported in the studies (Cui et al., 2016a; Guo et al., 2021; Kumar et al., 2019; Kang et al., 2020). In contrastive loss, even after separate clusters of the two classes (R and S) are obtained, the loss keeps shrinking the distance between the anchor and positive samples (samples from the same class R), as it tries to place them at the same position. As a result, contrastive loss becomes greedy and cannot tolerate intra-class variance. Triplet loss, on the other hand, enables clusters to stretch to incorporate outliers while still maintaining a buffer between samples from different clusters. However, triplet loss is computationally more expensive than loss functions like mean squared error or cross-entropy because of the enormous number of triplets it produces from the dataset; this is a major concern when using triplet loss on a large dataset. The number of triplets in our analysis, however, is much lower than the potential number because we restrict triplet formation to the known confusing chart pairs. This block aims to generate triplet samples from the confusing chart class pairs and train the model using the triplet loss function. For a given confusing chart class pair (R, S), the process of this block is described below.
Triplet generation: As stated above, a triplet consists of an anchor sample (a reference sample, which is a confusing sample in our case), a positive sample (similar to the anchor), and a negative sample (dissimilar to the anchor). Let A be the training set of class R; then the set of anchor samples (A*) and the set of positive samples (A+) are defined as A* = {x ∈ A | p(x) ∈ S} and A+ = {y ∈ A | p(y) ∈ R}, respectively, where p(.) is the operation that returns a class: it performs a manual check of the patterns in a sample (discussed in Section 2.1) and assigns the class (either R or S) to which the sample appears to belong. Finally, the set of negative samples is the training set of class S, denoted as B. The algorithm for triplet generation from these three sets is provided in Algorithm 1.
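The set construction above can be sketched as follows; the dictionary-based sample representation and the `pattern_class` stand-in for the manual check p(.) are hypothetical:

```python
def build_triplet_sets(train_R, train_S, pattern_class):
    """Partition class-R training samples into anchors (those whose visual
    pattern suggests class S) and positives (those that clearly look like R).
    `pattern_class` stands in for the manual check p(.) described above."""
    anchors = [x for x in train_R if pattern_class(x) == "S"]     # A*
    positives = [y for y in train_R if pattern_class(y) == "R"]   # A+
    negatives = list(train_S)                                     # B
    return anchors, positives, negatives

# Toy stand-in: samples carry a manually-assigned "looks_like" tag.
train_R = [{"id": 1, "looks_like": "S"}, {"id": 2, "looks_like": "R"},
           {"id": 3, "looks_like": "R"}]
train_S = [{"id": 4}, {"id": 5}]
A_star, A_plus, B = build_triplet_sets(train_R, train_S,
                                       lambda s: s["looks_like"])
print(len(A_star), len(A_plus), len(B))   # 1 2 2
```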

Algorithm 1 Triplet formation
The algorithm consists of two main steps: finding feature embeddings, and finding hard positive and negative samples. In the first step, we use a pretrained attention-based model (PABM) to obtain the anchor's feature vector (f_x), the set of positive feature vectors (denoted by F_A+) for all the samples in A+, and the set of negative feature vectors (denoted by F_B) for all the samples in B. The second step performs distance calculations and comparisons. The Euclidean distance (denoted by E(.)) between an anchor feature vector f_x and every negative feature vector (in F_B) and every positive feature vector (in F_A+) is calculated. We have three options to select a positive feature vector f_xp ∈ F_A+ and a negative feature vector f_xn ∈ F_B for a given anchor feature vector f_x: easy, hard, and semi-hard. In an easy selection, the negative is much farther from the anchor than the positive, i.e., E(f_x, f_xn) > E(f_x, f_xp) + α, where α is the margin. In a hard selection, the negative feature vector is closer to the anchor than the positive feature vector, i.e., E(f_x, f_xn) < E(f_x, f_xp). In a semi-hard selection, the negative is farther from the anchor than the positive, but within the margin, i.e., E(f_x, f_xp) < E(f_x, f_xn) < E(f_x, f_xp) + α. As stated in the study (Hermans et al., 2017), hard selection yields the best performance, so we adopt the hard selection process. In the hard selection, two types of masks are identified, a positive hard-triplet mask and a negative hard-triplet mask, to select a hard positive vector f_xp and a hard negative vector f_xn, respectively: f_xp is the vector in F_A+ with the largest distance from f_x, and f_xn is the vector in F_B with the smallest distance from f_x. Finally, for a given confusing chart class pair (R, S), we generate the triplet samples (x ∈ A*, y′ ∈ A+, z′ ∈ B) corresponding to the triplet embeddings obtained above.

Training the triplet loss attention-based model (TABM): As shown in Fig. 3, once we have the triplet samples, the next step is to initialize the pretrained weights of the PABM and train it with triplet loss to obtain the TABM. In the triplet loss function, the idea is to use three identical networks (one each for the anchor, positive, and negative) with the same neural net architecture and shared underlying weights. We implement this idea using a single network and a triplet, where the network receives the three input samples separately rather than together. Given a triplet (x, y′, z′), in order to estimate the triplet loss, we feed x, y′, and z′ one after another to obtain f_x, f_y′, and f_z′, respectively. Once these embedded vectors are obtained, as done in (Wang et al., 2019; Cui et al., 2016b), the loss is estimated after L2-normalization: the normalized vector f̄_x of f_x is estimated as f̄_x = f_x / ||f_x||_2, and the normalized vectors f̄_y′ and f̄_z′ are estimated similarly. The distance between the anchor and the positive sample and the distance between the anchor and the negative sample are estimated using softmax as given below:

d+ = exp(||f̄_x − f̄_y′||_2) / ( exp(||f̄_x − f̄_y′||_2) + exp(||f̄_x − f̄_z′||_2) ),
d− = exp(||f̄_x − f̄_z′||_2) / ( exp(||f̄_x − f̄_y′||_2) + exp(||f̄_x − f̄_z′||_2) ).

We optimize the loss using the Adam algorithm, which combines the 'gradient descent with momentum' algorithm and the 'RMSProp' algorithm.
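Putting the hard selection of Algorithm 1 together with the L2-normalization and softmax distances, a numpy sketch (with random placeholder embeddings standing in for PABM features) looks like:

```python
import numpy as np

def l2n(v):
    """L2-normalize a vector."""
    return v / np.linalg.norm(v)

def hard_triplet(f_x, F_pos, F_neg):
    """Hard selection: the farthest positive and the closest negative
    embedding for a given anchor embedding f_x."""
    d_pos = np.linalg.norm(F_pos - f_x, axis=1)
    d_neg = np.linalg.norm(F_neg - f_x, axis=1)
    return F_pos[np.argmax(d_pos)], F_neg[np.argmin(d_neg)]

def softmax_distances(f_x, f_p, f_n):
    """L2-normalize the three embeddings, then turn the two anchor
    distances into softmax weights with d_plus + d_minus = 1."""
    fx, fp, fn = l2n(f_x), l2n(f_p), l2n(f_n)
    ep = np.exp(np.linalg.norm(fx - fp))
    en = np.exp(np.linalg.norm(fx - fn))
    return ep / (ep + en), en / (ep + en)

rng = np.random.default_rng(1)
f_x = rng.standard_normal(8)          # anchor embedding
F_pos = rng.standard_normal((5, 8))   # embeddings of A+
F_neg = rng.standard_normal((6, 8))   # embeddings of B
f_p, f_n = hard_triplet(f_x, F_pos, F_neg)
d_plus, d_minus = softmax_distances(f_x, f_p, f_n)
print(round(d_plus + d_minus, 6))   # 1.0
```

Training then pushes d+ toward 0 and d− toward 1, so the anchor ends up nearer its hardest positive than its hardest negative.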
(2) Chart type classification: In the classification block, the pretrained triplet-loss-learned model (obtained in the previous block) is used as a feature generator for the final task of chart type classification, as shown in Fig. 3. It is followed by three fully connected layers and then a softmax layer. The parameters used in this block are as follows: Stochastic Gradient Descent (SGD) as the optimizer, momentum of 0.9, learning rate of 0.0001, batch size of 40, and 2 steps per epoch.
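For reference, a single SGD-with-momentum parameter update using the quoted settings can be sketched as follows (toy weights and gradient, not values from the experiments):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.0001, momentum=0.9):
    """One update with the settings quoted above
    (momentum 0.9, learning rate 0.0001)."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
g = np.array([0.5, -0.5])
w, v = sgd_momentum_step(w, g, v)
print(w)   # [ 0.99995 -1.99995]
```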

Dataset
We consider the dataset reported in our earlier paper (Thiyam et al., 2021b). It consists of 110,182 samples with 25 chart types. To perform triplet formation (described in Algorithm 1), we develop a sub-dataset with samples from only the confusing chart class pairs. The numbers of anchor, negative, and positive samples for the 13 confusing chart class pairs are shown in Table 1. With 3308 anchors, this study obtained 3308 triplets. To study the responses of our proposed framework to other datasets, we consider three publicly available datasets (Savva et al., 2011; Chagas et al., 2017; Davila et al., 2021). For the rest of the paper, we refer to the datasets provided by Savva et al. (2011), Chagas et al. (2017), Davila et al. (2021), and Thiyam et al. (2021b) as D1, D2, D3, and In-house, respectively. A comparison of these four datasets is presented in Fig. 5.

Attention-triplet loss based:
We train all the attention-based models used in this study with triplet loss, as shown in the proposed framework. Table 2 shows the mean accuracy under five-fold cross-validation of 14 CNN-based models on the In-house dataset, further tested on D1, D2, and D3. The following observations may be noted.

Baseline Models
• All 14 models provide their best results on the In-house dataset, followed by D3, D2, and D1. Xception outperforms all other models.
• Apart from Xception, DenseNets and VGGs also perform comparatively better than the rest.
• Among all the models, ResNet and Inception provide the lowest performance on all the datasets.

Fig. 5 Comparison of four datasets: In-house, D1, D2, and D3. D1 and D2 contribute only ten chart types, D3 contributes 11 chart types, and In-house contributes 25 chart types. The number below each chart indicates the number of samples belonging to that class in the respective dataset.

Table 3 shows the performance of the 19 attention-based models. The following observations may be noted.

Attention-based Models
• Most of the models experience a rise in mean accuracy over their baseline versions.
• There is a fall in accuracy from the baseline version for all the variants of ResNet on the In-house dataset, and for Inception-ResNet on the D3 dataset.
• For all four datasets, among our proposed five variants of Xception, all the models except XA improve their performance with an integrated attention mechanism compared to the baseline Xception. With the integration of the attention mechanism into all the modules of Xception, we retrain all the modules on our dataset, which nullifies the benefit of the pre-trained weights. For highly deep networks such as Xception, the size of the In-house dataset might not be enough to learn efficiently; hence, XA fails to provide promising results with the attention mechanisms.
• Among all 19 models, XMEX provides the highest mean accuracy for all four datasets.
• Between the two attention mechanisms, all the models provide better results with CBAM for all four datasets.

Table 4 shows the performance of the 19 attention-triplet loss based models obtained under our proposed framework. The following observations may be noted.

Attention & Triplet loss based Models
• With our proposed framework, all the models experience a rise in accuracy over their respective attention-based models.
• With the proposed framework, XMEX provides the best performance for all four datasets.
• Between the two attention mechanisms, our proposed framework works best with CBAM for all four datasets.
From the above observations, it is clear that our proposed framework can increase the performance of all the state-of-the-art models. By integrating only the attention mechanism, the models address the issue of noisy charts (discussed in detail in Section 6), yet the challenges posed by confusing chart class pairs remain unsolved. With the combination of attention and triplet loss in our proposed framework, both issues are addressed to a large extent (discussed in detail in Section 6).

Discussion
From Tables 3 and 4, it is observed that for all four datasets, the triplet loss based CBAM-XMEX (TCBAM-XMEX) outperforms all other models in handling noisy samples and confusing chart class pairs, followed by the triplet loss based CBAM-X* (TCBAM-X*). So, this study presents an analysis of these models, their earlier versions before training with triplet loss (CBAM-XMEX and CBAM-X*), and the baseline Xception. We used Grad-CAM (Selvaraju et al., 2017) for the analysis. Grad-CAM is a visualization approach that calculates the relevance of spatial positions in convolutional layers using gradients. Grad-CAM's output clearly displays attended regions, since gradients are calculated with respect to a specific class. We examine how each network makes use of features by monitoring the areas the network considers crucial for predicting a class. The visualization results are shown in Fig. 6. For all these challenging input images, as shown in the figure, Xception fails to focus on the regions of interest. With the attention mechanism, CBAM-X* and CBAM-XMEX start to focus on the object regions for some samples and classify them correctly. However, with the combination of triplet loss and the attention mechanism, TCBAM-X* and TCBAM-XMEX resolve most of the challenging samples with high classification confidence. We can see that the TCBAM-XMEX network's Grad-CAM masks cover the target object areas better than the other approaches. It learns to exploit information in the target object regions and aggregate features from them, decreasing intra-class distances and increasing inter-class distances. Note that the target class scores also increase accordingly.
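A minimal numpy sketch of the Grad-CAM computation used for this analysis (random placeholder activations and gradients; the real analysis uses the trained networks' last convolutional layer):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM (Selvaraju et al., 2017): channel weights are the
    spatially-averaged gradients of the class score; the map is the
    ReLU of the weighted sum of activation channels."""
    # activations, gradients: (H, W, C) arrays from the last conv layer.
    weights = gradients.mean(axis=(0, 1))                 # (C,)
    cam = np.maximum((activations * weights).sum(axis=2), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()                             # scale to [0, 1]
    return cam

rng = np.random.default_rng(2)
acts = rng.random((7, 7, 32))
grads = rng.standard_normal((7, 7, 32))
heatmap = grad_cam(acts, grads)
print(heatmap.shape)
```

Upsampling the resulting heatmap to the input size and overlaying it on the chart image gives the attended-region masks shown in Fig. 6.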
This section presents a detailed discussion of the best-performing attention-based Xception model, TCBAM-XMEX, against CBAM-XMEX, CBAM-X* (provided by the study Zhang et al., 2021), TCBAM-X*, and the baseline Xception, with respect to the ten noise types and 13 confusing chart class pairs. The noise contribution of a dataset is measured as the Noise-to-Total Samples percentage, NTS = (Σ_i NS_i / TS) × 100, where NS_i is the number of samples with noise type i and TS is the total number of testing samples. The In-house dataset contributes only two types of noise, viz. Hard Background Grid (HGB) and Patterned Background (PB), with an NTS of only 6%. Except for the noise type Transparent Background (TB), the dataset D1 contributes all noise types, occupying 27.50% of its samples. Leaving out the noise type Composite Chart (CC), the dataset D2 contributes all the other nine noise types; noisy samples from these nine types occupy 18.08% of the dataset. Finally, the dataset D3 has 18.32% of its samples covered by nine noise types (leaving out Improper Image Screenshot (IIS)). Table 6 presents the response of Xception, CBAM-X*, CBAM-XMEX, TCBAM-X*, and TCBAM-XMEX on the four datasets with respect to chart noise. The TNMC (Total Noise MisClassification) and TNMCO (Total Noise MisClassification Overall) columns in the table show the misclassification due to noise among the noisy samples and over the entire dataset, respectively. TNMC is estimated as the macro average percentage of sample misclassification among the noisy samples, i.e., TNMC = (1/k) Σ_i (NM_i / NS_i) × 100, where NM_i is the number of misclassified samples of noise type i and k is the number of noise types, and TNMCO = (Σ_i NM_i / TS) × 100.

1. Xception: Except for In-house, it gives a false result for more than 50% of noisy samples, as given by TNMC. There are some noise types for which it recognizes some instances, such as PB noise. However, in some cases, it provides inconsistent results, classifying some instances of CC noise correctly (in the case of D3) while failing to recognize even a single instance of the same noise type (in the case of D1).
The same characteristic is observed for noise type NC, where it fails to recognize a single instance.

Table 7 presents a summary of the four testing datasets from the view of confusing chart class pairs. The confusing-sample contribution of a dataset is measured as the Confusing-to-total chart Samples percentage, CCS = (Σ_i CS_i / TS) × 100, where CS_i is the number of confusing samples of pair i and TS is the total number of testing samples. It is observed from the table that In-house and D3 contribute comparatively more confusing samples: the In-house dataset contributes all 13 confusing chart class pairs with a CCS of 6.57%, while the lowest CCS of 1.56% comes from D1, which contributes only one confusing chart class pair. Table 8 presents the performance of the five models, Xception, CBAM-X*, CBAM-XMEX, TCBAM-X*, and TCBAM-XMEX, over the four datasets from the perspective of the identified confusing chart class pairs. The TCMC (Total Confusing-pairs MisClassification) and TCMCO (Total Confusing-pairs MisClassification Overall) columns present the error contributions among the confusing samples and over the entire dataset, respectively. TCMC is estimated as the macro average percentage of sample misclassification between the confusing chart class pairs, and TCMCO as the percentage of misclassifications from the confusing pairs over the entire set of testing samples (TS).
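The noise metrics defined above can be sketched as follows (the counts are toy numbers, not the paper's figures):

```python
def nts(ns_per_type, ts):
    """Noise-to-Total Samples: share of the test set that is noisy."""
    return 100.0 * sum(ns_per_type) / ts

def tnmc(nm_per_type, ns_per_type):
    """Total Noise MisClassification: macro average of the per-noise-type
    misclassification rates among noisy samples."""
    rates = [100.0 * nm / ns for nm, ns in zip(nm_per_type, ns_per_type) if ns]
    return sum(rates) / len(rates)

def tnmco(nm_per_type, ts):
    """Misclassification due to noise over the entire test set."""
    return 100.0 * sum(nm_per_type) / ts

# Toy counts for two noise types in a 500-sample test set (illustrative).
ns = [40, 10]   # noisy samples per noise type
nm = [20, 8]    # misclassified noisy samples per noise type
ts = 500        # total testing samples
print(nts(ns, ts), tnmc(nm, ns), tnmco(nm, ts))
```

CCS and TCMC follow the same shape, with confusing-pair counts in place of noise-type counts.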

Conclusion and future work
This research offered a framework for dealing with two major chart classification issues: chart noise and confusing chart class pairs. This is the first study of its kind to tackle these complex and challenging issues in developing chart classification models. For the first time in the domain of chart classification, the proposed framework used two attention mechanisms, CBAM and SE, as well as the triplet loss function. In addition, the developed framework employed an offline strategy for producing triplet samples from the confusing chart pairs. This study conducted comprehensive trials with multiple state-of-the-art models to evaluate its efficacy, confirming that our proposed framework outperforms all baselines on four different datasets. In addition, we visualize how it infers an input image precisely; interestingly, we discovered that our framework focuses appropriately on the target object. In a nutshell, the attention mechanism deals with the majority of chart noise, while the triplet loss function tackles the problem of confusing chart pairs. In the future, we intend to expand the number of chart types and include their 3D images.