DGNet: A handwritten mathematical formula recognition network based on deformable convolution and global context attention

Abstract: The Handwritten Mathematical Expression Recognition (HMER) task aims to generate corresponding LaTeX sequences from images of handwritten mathematical expressions. The encoder-decoder architecture has made significant progress on this task. However, architectures based on the DenseNet encoder fail to adequately consider the unique features of handwritten mathematical expressions (HME) and the similarity between different characters. Additionally, the decoder, with its small receptive field during decoding, fails to effectively capture the spatial positional information of the targets, resulting in a lack of global contextual information. To address these issues, this paper proposes DGNet, a neural network based on deformable convolution and global context attention. Our network takes full account of the sparse nature of handwritten mathematical formulas and exploits the properties of deformable convolution, allowing the convolution kernel to deform based on the content of its neighborhood. This enables our model to better adapt to geometric changes and other deformations in handwritten mathematical expressions. Simultaneously, we introduce GCAttention into the feature optimization stage to fully aggregate global contextual features of both position and channel. In experiments, our model achieved accuracies of 58.51%, 56.32%, and 56.1% on the CROHME 2014, 2016, and 2019 datasets, respectively. This research introduces a more effective deep learning architecture to the field of handwritten mathematical expression recognition, providing a strong foundation for future research and applications.

1. Introduction
The development of handwritten mathematical formula recognition technology reflects the urgent demand in the digital age for efficient and intelligent processing of mathematical information, offering robust support for the digitized future of technology, education, and research [1,2,3,4,49].
Although Optical Character Recognition (OCR) technology has reached high accuracy on conventional text, challenges persist in recognizing handwritten mathematical formulas in areas such as automatic grading, digital library construction, and office automation, where the accuracy of existing OCR algorithms remains relatively low. Unlike conventional text, handwritten mathematical formulas exhibit complex spatial structures and diverse writing styles. This structural complexity primarily arises from elements unique to mathematical formulas, such as fractions, superscripts, subscripts, and square roots.

Figure 1: Schematic diagram of deformable convolution
Currently, although algorithms based on encoder-decoder structures [5,6,7,8,9,10,11,12] can effectively recognize horizontally aligned regular text, and even demonstrate good results on some multi-directional and curved text, they still face challenges in accurately recognizing mathematical formulas with complex spatial structures. During feature map extraction in the encoder, the pooling operations in Convolutional Neural Networks (CNNs) reduce the resolution of the feature map. Given the significant size differences among handwritten mathematical symbols, the fine details in the extracted feature map are crucial for Handwritten Mathematical Expression Recognition (HMER). Zhang et al. proposed DenseWAP [13] to preserve these details, utilizing a multi-scale DenseNet as the encoder to better handle symbols of various scales. DenseNet concatenates feature maps from multiple previous layers along the channel dimension, including both high-resolution and low-resolution feature maps. The low-resolution feature maps provide a larger receptive field, offering more global semantic information, while the high-resolution feature maps retain finer details. Despite the outstanding performance of DenseNet as an encoder for mathematical formula recognition, recent models have not improved upon it: popular architectures such as BTTR, ABM, and CAN [14,15,16] still use DenseNet as the encoder. However, the fixed geometric structure of the modules used in the DenseNet encoder limits its ability to model geometric deformations, overlooking the unique features of handwritten mathematical expressions and the similarity between different characters. Additionally, the standard convolutions in DenseNet, with their fixed receptive fields, cannot effectively capture more global semantic information. These issues hinder further progress in handwritten mathematical formula recognition.
Deformable Convolutional Modules (DefConvs) [17] have been used in object recognition tasks. Deformable convolutions learn offsets during training to change the sampling positions in space, showing strong adaptability to geometric changes and partial deformations, as well as the ability to model transformations in object scale, posture, and viewpoint, as illustrated in Figure 1. Moreover, mathematical formulas often involve complex syntax and semantics, requiring the combination and manipulation of different mathematical concepts. Global contextual information can help the model better understand the syntactic structure and semantic relationships of the entire expression, leading to more accurate parsing and inference. Therefore, we introduce GCAttention [18] into the feature optimization stage to fully aggregate global contextual features of both position and channel. This attention mechanism assigns different weights to different positions or features based on learned weights, aiding the model in better understanding the semantic information in expressions.
Therefore, this paper proposes DGNet, a neural network based on deformable convolution and global context attention. Our network takes full account of the sparse nature of handwritten mathematical formulas and leverages the characteristics of deformable convolution, allowing the convolution kernel to deform based on the content of its neighborhood. This enables our model to better adapt to geometric changes and other deformations in handwritten mathematical expressions. Additionally, the context attention helps the model acquire contextual information. Finally, to validate the superiority of the proposed model, we conducted extensive experiments on the CROHME 2014, 2016, and 2019 datasets [19,20]. Experimental results demonstrate the feasibility of our model.
The contributions of this paper are as follows:
1) To address the challenges posed by the unique features of handwritten mathematical expressions and the similarity between different characters, we utilize learned offsets in deformable convolution kernels to dynamically adjust their size and position based on the content of the image to be recognized. This approach effectively enhances the representational capacity of handwritten character features. Through adaptive receptive fields, deformable convolutions capture the various forms and scales of handwritten characters.
2) In handling complex tasks, decoders may focus only on local regions of the input, failing to capture the overall contextual information. In mathematical formula recognition, long-distance dependencies are crucial for correct predictions, and traditional models may struggle to capture such relationships. Thus, this paper introduces a global context attention module into the feature maps generated by the model. The global context attention mechanism assigns different weights to different positions or features based on learned weights, reflects these attentions in the context vectors, and ultimately integrates contextual information into the original features.
3) Compared to current mainstream algorithms, our algorithm performs better on the open-source CROHME 2014, 2016, and 2019 datasets. Experimental results indicate that deformable convolution and global contextual attention both enhance the recognition of mathematical expressions.

The remainder of this paper is structured as follows: Section 2 provides a detailed overview of relevant research and technologies in handwritten mathematical expression recognition; Section 3 describes the overall architecture and implementation principles of the model; Section 4 analyzes the experimental results; finally, Section 5 concludes the paper.

2. Related work
2.1 HMER
Traditional grammar-based Handwritten Mathematical Expression Recognition (HMER) models consist of three main steps: symbol segmentation, symbol recognition, and structural analysis. However, this approach has not yielded satisfactory results in recognizing handwritten mathematical formulas because these formulas possess complex two-dimensional structures. With the rise of deep learning, encoder-decoder models have shown outstanding performance in various tasks, including scene text recognition and image captioning. In the HMER field, introducing such models has led to significant performance improvements.
Deng et al. [21] first applied attention-based encoder-decoder models to HMER, inspired by their success in image captioning tasks. Zhang et al. proposed a similar model called WAP, which uses a Fully Convolutional Network (FCN) as the encoder and employs coverage attention to address the problem of insufficient coverage. Wu et al. [23,24,25] focused on pairwise adversarial learning strategies to enhance recognition accuracy. Subsequently, Zhang et al. [26] designed a tree-based decoder for formula parsing, generating parent-child node pairs at each step, where the relationship between parent and child nodes reflects the structural type. For data augmentation, Li et al. [27] introduced scale augmentation, randomly scaling images while maintaining the aspect ratio to improve generalization across multi-scale images. PAL-v2 [28] used printed mathematical expressions as additional data to assist model training. In terms of training strategies, Truong et al. [29] proposed WS-WAP, which introduces weakly supervised information into the encoder. Zhao et al. [14] designed a bidirectionally trained Transformer framework called BTTR. However, BTTR lacked explicit supervisory information during opposite-direction learning, and its two decoders did not learn from each other, limiting its bidirectional learning capability. Bian et al. [15] introduced a bidirectional mutual learning network and further demonstrated that bidirectional learning significantly improves HMER performance. Li et al. [16] introduced a counting module, greatly enhancing model performance. Additionally, other strategies, such as the Transformer-based decoders used in [30,31,32], have also achieved encouraging performance. Tree-decoder methods treat a Mathematical Expression (ME) as a tree, a natural representation of its structure in the HMER context [36].

2.2 Deformable Convolution
In deep learning, deformable convolution has gained significant attention in recent years. Despite the remarkable success of Convolutional Neural Networks (CNNs) in image processing and computer vision, they have limitations when dealing with deformations, distortions, or non-rigid transformations in images. Traditional convolution slides fixed kernels over the input image, making it inflexible in adapting to changes in the image. To address this limitation, deformable convolution introduces learnable spatial transformation parameters, allowing the network to dynamically adjust the shape of its kernels to better capture local features of the input. As a key technique in CNNs, deformable convolution aims to improve the modeling of deformed or irregularly structured targets. In the early stages of research in this field, Dai et al. [17] proposed Deformable Convolutional Networks (DCN) in 2017. By introducing deformable convolution kernels, DCN enables the network to adapt to target deformations, leading to significant performance improvements in tasks such as object detection and image classification. Subsequent research has focused on enhancing the structure and performance of deformable convolution. Zhu et al. introduced Deformable DETR [37], applying deformable attention to object detection and achieving better detection performance, demonstrating the applicability and effectiveness of deformable operators in detection tasks. Deformable convolution has also found success in semantic segmentation: Deformable Convolutional Networks V2 (DCNv2) [38] effectively enhanced segmentation of complex scenes. To improve the flexibility and generalization of deformable convolution, other work has focused on fine-grained modeling; for example, Wang et al. introduced Dynamic Deformable Convolution (DDC) [39], which allows the network to adjust kernel shapes based on the content of input images, improving adaptability to different scenes.
In recent years, deformable convolution techniques have gradually been extended to attention-based models and have found widespread application in areas such as image semantic segmentation, human pose estimation, and medical image processing. This evolution reflects a deep understanding of model adaptability and non-local structural modeling, providing robust support for research and applications in computer vision. Cojocaru et al. [40] introduced deformable convolution into handwritten text recognition, recognizing that standard convolution operators do not explicitly account for the significant variability in the shape, proportion, and orientation of handwritten characters. To overcome these limitations, the authors studied deformable convolution and demonstrated its effectiveness for handwriting recognition.

3. Method
3.1 Deformable Convolution Module
One of the primary challenges in handwritten mathematical formula recognition lies in the diverse shapes, variable orientations, and fine linear strokes of mathematical symbols. Even with a large amount of data, traditional convolutional networks struggle to fully learn the diverse features of mathematical symbols. This is because the geometric structure of the modules used to build convolutional neural networks is fixed, limiting their ability to model geometric deformations. As a result, conventional convolution can only extract symbol features at fixed positions in the input feature map, severely limiting its representational capacity. To address this issue, this paper introduces deformable convolution, which possesses spatial geometric deformation capabilities. Deformable convolution adaptively captures the various shapes and scales of mathematical characters through deformable receptive fields. The fundamental idea is to convolve at offset-sampled locations instead of the original fixed positions, with the added offsets becoming part of the network structure. These offsets are computed by a parallel standard convolution unit and learned end-to-end through gradient backpropagation. With the learned offsets, the size and position of the deformable convolution kernel can adjust dynamically to the current image content; the sampling positions of kernels at different locations adapt to the image content, thus better accommodating the geometric deformations of mathematical symbols.
For a standard convolution, the output feature at position p_0 is computed over a regular sampling grid G, as in equation (1):

y(p_0) = \sum_{p_n \in G} w(p_n) \cdot x(p_0 + p_n)    (1)

In the formula, p_n enumerates the points in G, and w(p_n) represents the weight of the sampled point. Deformable convolution adapts the shape of the convolution kernel to the object's shape by adding a 2D offset to the position of each sampled point in the kernel. This offset is learned from the input feature map using a convolutional layer, and the offset and the input feature map are jointly fed into the subsequent convolutional layer, as illustrated in Figure 2. Deformable convolution augments the regular grid G with offsets {Δp_n | n = 1, 2, ⋯, N}, where N = |G|. The sampled positions on the feature map become p_n + Δp_n, transforming equation (1) into:

y(p_0) = \sum_{p_n \in G} w(p_n) \cdot x(p_0 + p_n + Δp_n)    (2)

As shown in Figure 2 and equation (2), deformable convolution obtains the offsets by applying a convolutional layer to the same input feature map. This offset-predicting layer has the same spatial resolution as the current convolutional layer, and its output offset field has the same spatial resolution as the input feature map, with a channel dimension of 2N encoding N 2D offset vectors. Since Δp_n is usually fractional, bilinear interpolation is employed to determine the feature values at the offset sampling points, so equation (2) is evaluated via equation (3):

x(p) = \sum_{q} G(q, p) \cdot x(q)    (3)

In equation (3), p represents an arbitrary position (p_0 + p_n + Δp_n), q enumerates all integral spatial positions in the feature map x, and G(q, p) is the bilinear interpolation kernel.
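To make equations (1)-(3) concrete, the following is a minimal PyTorch sketch of a deformable convolution layer, assuming torchvision is available: a parallel standard convolution predicts the 2N offset channels, and torchvision.ops.deform_conv2d performs the bilinear-interpolated sampling. The class and parameter names are illustrative, not taken from the paper's released code.

```python
# A minimal sketch of deformable convolution with a learned offset branch.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.stride, self.padding = stride, padding
        # Regular convolution weights w(p_n) over the grid G (equation 1).
        self.weight = nn.Parameter(
            torch.empty(out_ch, in_ch, kernel_size, kernel_size))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # Parallel conv predicting the 2D offsets Δp_n: two values (Δy, Δx)
        # per sampling point of the k×k kernel (equation 2). Zero-initialized
        # so training starts from a regular convolution.
        self.offset_conv = nn.Conv2d(
            in_ch, 2 * kernel_size * kernel_size,
            kernel_size, stride=stride, padding=padding)
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)

    def forward(self, x):
        offset = self.offset_conv(x)          # (B, 2*k*k, H', W')
        # deform_conv2d performs the bilinear interpolation of equation (3).
        return deform_conv2d(x, offset, self.weight, self.bias,
                             stride=self.stride, padding=self.padding)

# Usage: drop-in replacement for a 3x3 convolution on a feature map.
feats = torch.randn(1, 64, 32, 128)
out = DeformableConv2d(64, 64)(feats)         # -> (1, 64, 32, 128)
```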

3.2 Global Context Attention Module
Capturing long-range dependencies aims at achieving a global understanding of visual scenes, which has proven effective for many computer vision tasks such as image classification, video classification, object detection, and semantic segmentation. For handwritten mathematical formula recognition, obtaining global context and capturing long-range dependencies is even more crucial, because mathematical formulas often exhibit hierarchical structures, including nested parentheses, fractions, exponents, and so on. Moreover, the syntax and semantics of mathematical formulas are often complex, involving combinations and operations of different mathematical concepts. Global context information can help the model better understand the syntactic structure and semantic relationships of the entire expression, leading to more accurate parsing and inference. Two types of methods have emerged to capture long-range dependencies. The first employs self-attention mechanisms to model query-specific relationships. For example, NLNet [41] uses a self-attention mechanism to model pixel-pair relationships; however, it learns a separate attention map for each position, resulting in significant computational waste. Zhao et al. [14] designed a bidirectionally trained Transformer framework to address the difficulty, for RNN-based models, of modeling relationships between symbols that are far apart; however, Transformer-based frameworks introduce substantial computational overhead. The second type focuses on query-independent global context modeling. For instance, SENet recalibrates the weights of different channels using global context to adjust channel dependencies; however, feature fusion based on weight recalibration may not fully exploit the global context. We aim to achieve comprehensive fusion of context information with a lightweight module that does not increase computational overhead. Hence, we draw on the work of Cao et al. [18], who proposed a lightweight global context attention module that effectively utilizes global context information without adding a significant computational burden, as illustrated in Figure 3.
To capture pixel-level pairwise dependencies, the non-local block aggregates, for each query position, information from all other positions, as in equation (4):

z_i = x_i + W_z \sum_{j=1}^{N_p} \frac{f(x_i, x_j)}{C(x)} (W_v \cdot x_j)    (4)

where i is the index of the query position, j enumerates all possible positions, f(x_i, x_j) represents the relationship between positions i and j, and C(x) is the normalization factor. W_z and W_v represent linear transformation matrices (e.g., 1×1 convolutions). For simplicity, define ω_ij = f(x_i, x_j)/C(x) as the normalized relationship between positions i and j. In this paper, f is expressed in the Embedded Gaussian form:

ω_ij = \frac{\exp(⟨W_q x_i, W_k x_j⟩)}{\sum_m \exp(⟨W_q x_i, W_k x_m⟩)}

For different positions, the resulting attention maps are nearly identical. This was demonstrated in the GCNet paper by analyzing the distances between the global contexts computed at different positions. In other words, although the non-local block aims to compute a position-specific global context, after training the global context becomes independent of position. Based on this observation, the non-local block can be simplified by computing a single global attention map and sharing it across all positions. Ignoring W_z, the simplified non-local block is defined as equation (5):

z_i = x_i + \sum_{j=1}^{N_p} \frac{\exp(W_k x_j)}{\sum_{m=1}^{N_p} \exp(W_k x_m)} (W_v \cdot x_j)    (5)

where W_k and W_v are linear transformation matrices. To further reduce the computational cost of the simplified non-local block, W_v is moved outside the attention pooling, giving equation (6):

z_i = x_i + W_v \sum_{j=1}^{N_p} \frac{\exp(W_k x_j)}{\sum_{m=1}^{N_p} \exp(W_k x_m)} x_j    (6)

With this modification, the FLOPs of the 1×1 convolution W_v are reduced from O(HWC²) to O(C²). Unlike the original non-local block, the second term of the simplified version is position-independent and shared by all positions. Therefore, in this paper we directly model the global context as a weighted average of the features at all positions, and then aggregate the global context features onto the features at each position.
In the simplified non-local block, the transform module W_v still contains a large number of parameters. To gain the lightweight advantage of the SE block, the 1×1 convolution W_v is replaced with a bottleneck transform, which significantly reduces the number of parameters from C² to 2C²/r (where r is the bottleneck reduction ratio). Because the two-layer bottleneck transform increases the optimization difficulty, a layer normalization layer is added before the ReLU (easing optimization and improving generalization, acting as a regularizer). The global context (GC) block is thus represented as equation (7):

z_i = x_i + W_{v2} \, \mathrm{ReLU}\big( \mathrm{LN}\big( W_{v1} \sum_{j=1}^{N_p} α_j x_j \big) \big)    (7)

where α_j = \frac{\exp(W_k x_j)}{\sum_{m=1}^{N_p} \exp(W_k x_m)} is the weight for global attention pooling, and δ(·) = W_{v2} ReLU(LN(W_{v1}(·))) denotes the bottleneck transform. The GC block thus consists of three steps: (a) global attention pooling for context modeling; (b) a bottleneck transform to capture inter-channel dependencies; and (c) broadcast element-wise addition for feature fusion.

Convolution can likewise be decomposed into three steps: modeling spatial dependence, modeling channel dependence, and feature fusion. Non-local and SENet are effective mainly because of their context modeling: convolution can only model context over local regions, giving a restricted receptive field, whereas Non-local and SENet model context over the entire input feature, so the receptive field covers the whole input, supplementing the network with useful semantic information. In addition, a network that extracts features only by stacking convolutions effectively fits the input with functions of the same form, so the extracted features lack diversity; Non-local and SENet increase the diversity of the extracted features, compensating for this deficiency. GCNet combines the strong global context modeling ability of Non-local with the computation-saving advantage of SENet, achieving better results on various computer vision tasks. The global context attention mechanism assigns different attention to different positions or features through learned weights and reflects this attention in the context vectors, effectively addressing the problems of missing key characters and misunderstanding semantic information that traditional models face when recognizing long and complex mathematical formulas.
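The following is a sketch of the GC block of equation (7), following the three-step structure of GCNet [18]; the channel count and reduction ratio r are illustrative defaults, not values fixed by this paper.

```python
# A sketch of the GC block: (a) global attention pooling, (b) bottleneck
# transform with LayerNorm before ReLU, (c) broadcast element-wise addition.
import torch
import torch.nn as nn

class GCBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        # W_k: produces one attention logit per spatial position.
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)
        mid = max(channels // r, 1)
        # Bottleneck transform delta(.) = W_v2 ReLU(LN(W_v1(.))).
        self.transform = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.LayerNorm([mid, 1, 1]),   # LN eases optimization (Sec. 3.2)
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        # (a) Global attention pooling: alpha_j = softmax_j(W_k x_j).
        logits = self.attn(x).view(b, 1, h * w)           # (B, 1, HW)
        alpha = logits.softmax(dim=-1)                    # weights over positions
        context = torch.bmm(
            x.view(b, c, h * w), alpha.transpose(1, 2))   # (B, C, 1)
        context = context.view(b, c, 1, 1)
        # (b) + (c): transform the context, broadcast-add to every position.
        return x + self.transform(context)

# Usage on an encoder feature map (output has the same shape as the input):
y = GCBlock(256)(torch.randn(2, 256, 16, 64))
```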

3.3 Overall framework of the model
The overall framework of the proposed model is illustrated in Figure 4 and consists of two main parts. In the encoder, we utilize DenseNet as the backbone for feature extraction. Considering the limited geometric deformation modeling capability of DenseNet, due to its fixed geometric structure and receptive field, we replace all convolution modules in the network with deformable convolutions, as sketched below. DenseNet is primarily composed of Dense Blocks, Transition Layers, and a global average pooling layer. Through experiments, we verified that replacing all convolutions with deformable convolutions yields better results than changing only a subset of layers. After preprocessing, the input handwritten mathematical formula images pass through our densely connected feature extraction network to obtain feature maps. These feature maps then go through a pooling layer that reduces dimensionality while retaining the crucial feature information, alleviating the model's computational burden. They then proceed to a fully connected layer, which further combines and compresses the extracted features into a comprehensive representation of the mathematical formula. In the decoding part, we introduce GCNet, a mechanism capable of capturing information globally. By integrating contextual information, GCNet enhances the model's ability to model long-range dependencies within mathematical formulas.
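The sketch below shows one way the encoder's dense connectivity can be combined with deformable convolution: each layer's 3×3 convolution is the DeformableConv2d class from the earlier sketch, and layer outputs are concatenated along the channel dimension in DenseNet style. The growth rate, depth, and stem convolution are illustrative; the paper's exact DenseNet configuration is not reproduced here.

```python
# A sketch of a dense block whose convolutions are deformable.
import torch
import torch.nn as nn

class DeformableDenseBlock(nn.Module):
    def __init__(self, in_ch, growth=24, layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                DeformableConv2d(ch, growth)))  # class from the earlier sketch
            ch += growth

    def forward(self, x):
        for layer in self.layers:
            # DenseNet-style reuse: concatenate every preceding feature map.
            x = torch.cat([x, layer(x)], dim=1)
        return x

# 1-channel formula image -> stem conv -> deformable dense block.
img = torch.randn(1, 1, 128, 512)
feats = nn.Conv2d(1, 48, 7, stride=2, padding=3)(img)   # (1, 48, 64, 256)
out = DeformableDenseBlock(48)(feats)                    # (1, 144, 64, 256)
```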

Figure 4: Schematic diagram of the overall framework of the proposed model
To enhance the model's spatial positional awareness, we also introduce positional encoding from the Transformer framework in the decoding part [42]. Given the encoding dimension d, position p, and feature dimension index i, the positional encoding vector of a character is given by equations (8) and (9):

PE(p, 2i) = \sin\big(p / 10000^{2i/d}\big)    (8)

PE(p, 2i+1) = \cos\big(p / 10000^{2i/d}\big)    (9)

By introducing positional encoding, the model can encode the different positions within the input sequence, allowing it to perceive the relative positions of its elements. For handwritten mathematical formulas this is crucial for recognizing spatial relationships and structures between symbols, as their arrangement often encodes subscripts, superscripts, fraction lines, and brackets. Additionally, positional encoding assigns unique encoding values to different positions, providing information about the relative order of elements in the input sequence. This helps the model better understand the sequential structure of handwritten mathematical formulas, ensuring it maintains appropriate contextual relationships when processing inputs.
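A minimal sketch of equations (8) and (9), as used in the Transformer [42], is given below; the function and variable names are illustrative.

```python
# Sinusoidal positional encoding: one d-dimensional vector per position.
import torch

def positional_encoding(max_len: int, d: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # p
    i = torch.arange(0, d, 2, dtype=torch.float32)                  # 2i
    div = torch.pow(10000.0, i / d)                                 # 10000^(2i/d)
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(pos / div)   # equation (8): even dimensions
    pe[:, 1::2] = torch.cos(pos / div)   # equation (9): odd dimensions
    return pe                            # (max_len, d)

# Added to the decoder's symbol embeddings so each step knows its position:
emb = torch.randn(1, 50, 256)
emb = emb + positional_encoding(50, 256).unsqueeze(0)
```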
In this paper, we first utilize deformable convolutions to adaptively capture various shapes and scales of handwritten characters through deformable receptive fields. We propose a feature extraction module based on deformable convolutions, effectively addressing the unique features of handwritten mathematical expressions and the similarity between different characters. Furthermore, the paper introduces a context attention module, which, by learning weights, assigns different attention levels to different positions or features. These attention levels are then reflected in the context vector, ultimately integrating contextual information into the original features.

4. Experiments
To assess the performance of our model in handwritten mathematical expression recognition, we conducted extensive experiments. This section introduces the CROHME 2014, 2016, and 2019 datasets, along with details of the implementation process. Subsequently, we discuss the experimental results of our proposed model quantitatively and qualitatively to demonstrate the effectiveness of our mechanisms.

4.1 Datasets
We conducted our experiments on the dataset of the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME) [19,20], currently the largest open dataset for the HMER task, as illustrated in Figure 5. The training set comprises 8,836 samples, while the CROHME 2014/2016/2019 test sets contain 986/1147/1199 samples, respectively. The CROHME 2014 test set [19] is used as a validation set to select the best-performing model during training. We utilized the evaluation tools provided by the organizers of CROHME 2019 to convert the predicted LaTeX sequences to the symLG format, and then reported the metrics using the LgEval library. We chose "ExpRate," "≤ 1 error," "≤ 2 errors," and "≤ 3 errors" as the metrics for assessing our proposed model. In the CROHME dataset, each handwritten mathematical expression is stored in the InkML format, which records the trajectory coordinates of the handwritten strokes. We converted the stroke trajectory information in the InkML files into images for training and testing.
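A hedged sketch of this InkML-to-image conversion is shown below: it parses the stroke trajectories from an InkML file and rasterizes them. The trace format ("x y, x y, ...") follows the CROHME distribution; the target height, padding, and line width are arbitrary choices, not the paper's preprocessing settings.

```python
# Parse CROHME InkML stroke trajectories and render them as an image.
import xml.etree.ElementTree as ET
from PIL import Image, ImageDraw

def inkml_to_image(path: str, height: int = 128, pad: int = 8) -> Image.Image:
    ns = {"ink": "http://www.w3.org/2003/InkML"}
    root = ET.parse(path).getroot()
    strokes = []
    for trace in root.findall("ink:trace", ns):
        # Each point is "x y" (a time stamp, if present, is ignored).
        pts = [tuple(map(float, p.split()[:2]))
               for p in trace.text.strip().split(",")]
        strokes.append(pts)
    # Normalize coordinates to the target image height, keeping aspect ratio.
    xs = [x for s in strokes for x, _ in s]
    ys = [y for s in strokes for _, y in s]
    scale = (height - 2 * pad) / max(max(ys) - min(ys), 1e-6)
    width = int((max(xs) - min(xs)) * scale) + 2 * pad
    img = Image.new("L", (max(width, height), height), 255)
    draw = ImageDraw.Draw(img)
    for s in strokes:
        pix = [((x - min(xs)) * scale + pad, (y - min(ys)) * scale + pad)
               for x, y in s]
        if len(pix) > 1:
            draw.line(pix, fill=0, width=2)
        else:
            draw.point(pix, fill=0)
    return img
```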

4.2 Implementation Details
We implemented the proposed DGNet model in PyTorch and trained it on an Nvidia 3090ti GPU with 24 GB of memory, with the batch size set to 4. The hidden state size of the two GRUs was set to 256, as were the dimensions of the word embedding and relation embedding. The Adadelta optimizer [43] was employed during training. The learning rate started from 0, increased monotonically to 1 by the end of the first epoch, and then decayed to 0 following a cosine schedule [44]. For the CROHME dataset, the total number of training epochs was set to 240. Unlike most previous works, no data augmentation was applied during training, for a fair comparison.
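A sketch of this learning-rate schedule is given below: linear warm-up from 0 to 1 during the first epoch, then cosine decay to 0 over the remaining epochs [44], driven per batch. The steps-per-epoch value is derived here from the 8,836 training samples and batch size 4; the scheduler wiring is illustrative.

```python
# Warm-up + cosine decay schedule, stepped once per training batch.
import math
import torch

def lr_at(step, steps_per_epoch, total_epochs=240, peak_lr=1.0):
    warmup = steps_per_epoch                      # first epoch: linear warm-up
    if step < warmup:
        return peak_lr * step / warmup
    total = total_epochs * steps_per_epoch
    t = (step - warmup) / max(total - warmup, 1)  # progress in [0, 1]
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))

model = torch.nn.Linear(256, 111)                 # stand-in for DGNet
opt = torch.optim.Adadelta(model.parameters(), lr=1.0)
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: lr_at(step, steps_per_epoch=2209))  # ~8836 samples / 4
# Training loop: opt.step(); sched.step() after every batch.
```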

4.3 Comparison with State-of-the-art Approaches
To validate the effectiveness of our proposed method in handwritten mathematical expression recognition, we conducted comparative experiments against current advanced algorithms on the CROHME 2014, 2016, and 2019 datasets. The experimental results are presented in Table 1. Our DGNet method exhibits promising performance on the CROHME dataset: compared to the baseline CAN method, our approach improves the ExpRate by 1.51%, 0.26%, and 1.83% on CROHME 2014, 2016, and 2019, respectively.
In this section, we evaluated our proposed method on the CROHME 2014, CROHME 2016, and CROHME 2019 datasets, comparing its performance with other state-of-the-art methods. As most previous methods did not employ data augmentation, we focus mainly on results without data augmentation. Table 1 presents the quantitative results of different methods on the CROHME 2014, 2016, and 2019 datasets; we compared our model with ten existing methods [25,17,11,45,14,15,46,16,47,48]. The metrics are "ExpRate," "≤ 1 error," "≤ 2 errors," and "≤ 3 errors." On the CROHME 2014 dataset, CAN achieves an ExpRate of 57.00, while our method reaches 58.51, an improvement of 1.51 over CAN, the strongest of the other methods. On the CROHME 2016 dataset, CAN's ExpRate is 56.06 and ours is 56.32; although the margin is smaller, we still achieve a noticeable increase of 0.26 over the best competing method. On the CROHME 2019 dataset, CAN's ExpRate is 54.88, while ours is 56.71, a significant improvement of 1.83 over the best competing method on this dataset.

The substantial improvement on the 2014 and 2019 datasets may be related to the abundance of similar characters in the 2014 dataset, as shown in the model comparison chart in Figure 6. Traditional models struggle to recognize similar characters correctly, often misidentifying them as other characters that can be hard to distinguish even for the human eye. In contrast, our model identifies the correct results by learning the strokes of characters and comparing similar strokes, for example distinguishing between the square root of 15 and the square root of 45. The 2019 dataset contains more complex and more numerous mathematical expressions, on which traditional models often suffer from missing characters and inaccurate attention. With the introduction of the global context attention module, we observe a significant improvement on these issues. For instance, in the fourth example in Figure 6, the character "v" is easily mistaken for "g": due to differing handwriting styles, some writers draw "v" with a shape resembling "g." Through global context attention, our model compares the characters appearing later in the expression, finds similar characters, and determines that the subsequent characters better match the shape of "v." By assigning different attention to different positions or features through learned weights, the model propagates this weight information and ultimately recognizes the preceding character correctly.
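For reference, the sketch below computes ExpRate and the "≤ n error" metrics at the token level: ExpRate is the fraction of expressions whose predicted sequence exactly matches the ground truth, and "≤ n error" counts predictions within token-level edit distance n. Note that the official CROHME evaluation uses the LgEval tools on symLG graphs, so this token-based version is only an approximation.

```python
# Token-level approximation of ExpRate and "<= n error" metrics.
def edit_distance(a: list, b: list) -> int:
    # Single-row dynamic-programming Levenshtein distance over tokens.
    d = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, tb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (ta != tb))
    return d[len(b)]

def exprate(preds, truths, max_errors=0):
    # Percentage of predictions within max_errors token edits of the truth.
    ok = sum(edit_distance(p, t) <= max_errors
             for p, t in zip(preds, truths))
    return 100.0 * ok / len(truths)

# exprate(preds, truths, 0) -> ExpRate; max_errors=1 -> "<= 1 error", etc.
```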
All the experimental results above demonstrate that our model achieves state-of-the-art performance on the CROHME 2014, 2016, and 2019 datasets, highlighting the superiority and robustness of our approach.

We also conducted extensive ablation experiments on the CROHME 2014, 2016, and 2019 datasets to investigate the effectiveness of the deformable convolution module and the global context attention module; the results are presented in Table 2. Additionally, we show that the modules introduced in DGNet not only increase accuracy but also reduce training time, enhancing the practicality and feasibility of mathematical expression recognition. Taking the CROHME 2014 dataset as an example, the baseline CAN method achieves an ExpRate of 52.33. When we add the GCA (Global Context Attention) module, the ExpRate increases to 53.75, an improvement of 1.41. This suggests that the global context attention mechanism effectively enhances recognition accuracy by assigning varying attention to different positions or features through learned weights, reflecting these attentions in the context vector, and ultimately integrating context information into the original features. Next, when we add the DC (Deformable Convolution) module alone, the ExpRate rises substantially to 56.29, an improvement of 3.96. We attribute this to the numerous similar characters in the CROHME 2014, 2016, and 2019 datasets: the size and position of the deformable convolution kernels adjust dynamically to the current image content, focusing on the most relevant parts of the image and leading to a significant increase in accuracy. Finally, with both the GCA and DC modules added, our model achieves the highest ExpRate of 58.51. This is because the two modules together comprehensively account for the unique features of handwritten mathematical expressions and the similarity between different characters, while effectively capturing the spatial position information of the targets, further boosting the overall accuracy.

Table 3 reports the training times. The original plain-convolution model requires 5779 minutes of training, a substantial computational cost. With the introduction of deformable convolution, the training time drops to 5415 minutes. Even after adding the Global Context Attention (GCA) module, the total training time of the entire model is only 5591 minutes, still less than that of the plain-convolution model. Deformable convolution allows the kernel to deform based on the content of its neighborhood, adapting to geometric changes in the image without the repeated adjustment of fixed-shape kernels during training, which significantly accelerates convergence, as illustrated in Figure 7.

4.4 Comparison of Visual Results
In this section, we select a typical case and visualize the model's attention weights over different parts of the feature maps. As shown in Figure 8, when predicting the character "2," the CAN model incorrectly outputs "z" because of differences in handwriting styles. In comparison, our proposed DGNet predicts the formula correctly. The heatmap shows that almost all symbols are accurately localized. These observations indicate that, by introducing deformable convolution, the model gains a deeper understanding of each symbol, particularly its positional information. Therefore, during decoding, our model exhibits more accurate attention (observable in the attention maps) and is less prone to missing symbols or predicting redundant ones.

5. Conclusion
In this research, we propose an innovative neural network architecture named DGNet, based on deformable convolutions and a global context attention mechanism. Aimed at overcoming the limitations of traditional encoder-decoder architectures in handwritten mathematical expression recognition, our model takes full account of the sparse nature of handwritten mathematical formulas. It exploits the characteristics of deformable convolutions to better handle geometric variations and other deformations in handwritten mathematical expressions. Notably, we introduce a global context attention module that effectively aggregates global context features of both position and channel. This design significantly improves the model's understanding of overall context, enhancing its accuracy and generalization. Experimental results demonstrate that DGNet achieves remarkable accuracy on the CROHME 2014, 2016, and 2019 datasets, with rates of 58.51%, 56.32%, and 56.1%, respectively. Compared to methods based on DenseNet encoders in traditional architectures, DGNet excels at handling handwritten mathematical expressions. By introducing deformable convolutions and global context attention, we address the traditional architecture's lack of adaptation to the unique features of handwritten mathematical expressions, its neglect of similarities among different characters, and its loss of crucial global context information. This progress positions our model as a significant advance in handwritten mathematical expression recognition, providing robust support for future research and applications.

Figure 3: Overall flow of the Global Context Attention module. The input features are passed through the module to obtain global context information.


Figure 6: Comparison of recognition results on the mathematical formula dataset. The upper row shows the recognition results of CAN over 4 batches, and the lower row shows the results of our method.

Figure 7: Comparison of model training curves. The left side shows the training curve of the original model, and the right side shows the convergence curve of our DGNet model.

Figure 8: Heat map of the DGNet recognition process.


Table 2: Ablation experiments for mathematical expression recognition on the CROHME 2014, 2016, and 2019 datasets.

Table 3: Training time for the entire experiment on the CROHME training dataset.