Refocus attention span networks for handwriting line recognition

Recurrent neural networks have achieved outstanding recognition performance for handwriting despite the enormous variety observed across handwriting styles and poor-quality scanned documents. We first propose a BiLSTM baseline model, a sequential architecture well suited to modeling text lines thanks to its ability to learn probability distributions over character or word sequences. However, such recurrent paradigms prevent parallelization and suffer from vanishing gradients on long sequences during training. To alleviate these limitations, we make four significant contributions in this work. First, we devise an end-to-end model composed of a split-attention CNN backbone that serves as the feature extractor and a self-attention Transformer encoder-decoder that serves as the transcriber to recognize handwritten manuscripts. The multi-head self-attention layers of the Transformer encoder-decoder enhance the model's ability to tackle handwriting recognition and to learn the linguistic dependencies of character sequences. Second, we conduct various studies on transfer learning (TL) from large datasets to a small database, determining which model layers require fine-tuning. Third, we attain an efficient paradigm by combining different TL strategies with data augmentation (DA). Finally, since the proposed model is lexicon-free and can recognize sentences not presented during the training phase, it is trained on only a few labeled examples, with no extra cost of generating and training on synthetic datasets. On four benchmark datasets, we record Character and Word Error Rates (CER/WER) that are comparable to, and in some cases outperform, the most recent state-of-the-art (SOTA) models.


Introduction
Handwriting Text Recognition (HTR) systems allow computers to read and understand human handwriting. HTR is useful for digitizing the textual contents of old document images in historical records and contemporary administrative material such as cheques, legal letters, forms, and other documents. While HTR research has been ongoing since the early 1960s [34], it remains a challenging and unsolved research problem. The fundamental difficulty is the wide range of variation and ambiguity introduced by different writers when crafting words. Because the words to be deciphered usually adhere to well-defined grammar rules, it is possible to eliminate gibberish hypotheses and enhance recognition accuracy by modeling linguistic practice. HTR is therefore usually approached with a blend of computer vision and natural language processing (NLP).
By nature, handwritten text is a signal that follows a particular sequence. Texts in Latin languages are written left to right, whereas some non-Latin scripts are written right to left; in both cases, an ordered sequence of letters creates the words. As a result, HTR approaches have used temporal pattern recognition techniques to preserve the sequence order. The field progressed from early techniques based on Hidden Markov Models (HMM) to Deep Learning, with the Recurrent Neural Network (RNN) and its variant, the Bidirectional Long Short-Term Memory (BiLSTM) network, becoming the mainstream alternative. Sequence-to-Sequence (Seq2Seq) techniques, served by encoder-decoder networks guided by attention mechanisms, have recently been applied to HTR, inspired by their success in applications such as speech-to-text and automatic translation. With these methods, it is not only feasible to decode images sequentially; it is also possible to learn which characters are more likely to follow each other during decoding. In traditional pipelines, by contrast, language modeling can only increase recognition performance when applied as a post-processing step.
One crucial flaw persists despite the success of attention-based encoder-decoder architectures in HTR. These attention techniques are built on RNN variants, either LSTMs or GRUs, on top of a standard Convolutional Neural Network (CNN) feature extractor. Due to memory constraints, this sequential approach discourages parallelization during training and substantially reduces processing speed on longer sequences.
To alleviate these flaws, we were inspired by the End-to-End Object Detection with Transformers (DETR) architecture [8]. Although DETR was designed to detect objects, we exploit its architecture to solve the HTR task. We explore the impact of slant removal and illumination compensation preprocessing techniques. We also draw on the innovative work of [54] on split attention in convolutional layers (ResNeSt) and on Transformer encoder-decoder architectures. Transformers and ResNeSt are based on attention mechanisms, with no recurrent components. Motivated by this benefit, we propose addressing the HTR problem with an architecture that applies attention mechanisms both to feature extraction and to encoding-decoding of feature representations. We aim to address both the suitable recognition of characters from images and the learning of the corresponding language-model dependencies of the character sequences. The encoder-decoder Transformer's multi-head self-attention modules decode the textual ground truth at both stages of the visual feature representations.
Transformers have demonstrated superior performance to recurrent networks in various language and vision applications while being more parallelizable than GRUs and BiLSTMs, requiring less training time. In contrast to classic speech recognition and translation models, our proposed split-attention transformer network operates at the character level rather than the word level, so no fixed vocabulary is required. The proposed model can therefore recognize words never seen during training, known as out-of-vocabulary (OOV) terms. We obtain CER and WER performance competitive with current state-of-the-art (SOTA) methods on four publicly available datasets, three in English script (IAM, Bentham, and Washington) and one in French script (RIMES), with no synthetic training data. Our contributions in this paper are:
• We experiment with different backbone feature extraction methods. Table 4 reports the results of the standard CNN-BiLSTM model and a self-attention CNN-BiLSTM, where the attention mechanism improves performance. Table 6 shows the impact of network depth, comparing ResNet50 and ResNet101 coupled with a self-attention BiLSTM: the deeper the network, the better the performance. We then compare standard and attention-based CNN feature extraction, as shown in Table 7. Since the split-attention ResNeSt101 outperforms the standard ResNet101, we choose ResNeSt101 as the feature extraction method of the proposed model.
• We investigate the influence of different preprocessing approaches: illumination adjustment, removal of the cursive slant, and their combination. Table 5 shows that removing the cursive slant alone improves performance, while the illumination compensation technique degrades performance relative to the raw dataset. We therefore apply only the deslanting technique to remove slanted and sloped text from our datasets before feeding them to the CNN for feature extraction.
• To the best of our knowledge, the proposed approach is the first study to examine the implications of attention mechanisms in both a ResNeSt split-attention convolutional feature extractor and a Transformer multi-head attention encoder-decoder for the HTR problem, without using any recurrent architecture. Specifically, we extract, encode, and decode robust representations from document images and model language with a unified architecture that detects character sequences while providing context to differentiate between letters or words that may appear similar. The architecture operates character by character, avoiding the need for predetermined lexicons; it is a lexicon-free model.
• Using pre-trained ImageNet weights as a starting point benefits our model's rapid convergence and learning. Pre-training on the Bentham dataset also allows competitive outcomes with a minimal amount of annotated training data.
• On the benchmark IAM, RIMES, Washington, and Bentham datasets, our proposed HTR model improves SOTA performance with few labeled data. In Sect. 5.2, we conduct thorough ablation and comparison studies to demonstrate the efficiency of our model. Finally, we showcase the influence of another popular pre-trained model (CLIP) on the IAM dataset.
The remainder of this paper is organized as follows. Section 2 reviews the methods most closely related to the tackled problem and our modeling choices. Section 3 details the proposed method. Experiments are reported in Sect. 4. Section 5 discusses our findings and compares them to the state of the art. Section 6 concludes, outlining how the system could be improved and the challenge of generalizing it to complete documents.

Related work
Handwritten text has traditionally been recognized within a sequential pattern recognition framework. Text-line images are handled by learning models that use their internal states to analyze incoming signals in variable-length lines. HTR applications follow this paradigm using either Hidden Markov Models (HMM) [5,21,41] or Deep Neural Network (DNN) architectures such as Bidirectional Long Short-Term Memory (BiLSTM), multidimensional LSTM (MDLSTM), and Gated Recurrent Units (GRU), accompanied by Connectionist Temporal Classification (CTC), which is used to label unsegmented sequences of handwritten images with RNN-LSTM [23,24,43], encoder-decoder (seq2seq) networks [35], and non-recurrent transformer networks [30]. Recently, attention mechanisms [3,13,25,51] have emerged as an essential component of any model that must account for global dependencies. In particular, self-attention [15,38] calculates the response, or weight, at a particular sequence position by attending to the entire sequence's contributions at that position. Machine translation models have demonstrated in the literature that cutting-edge results can be achieved using self-attention alone.

Recurrent network: CTC encoder-decoder
Recently, the HTR task has been treated as a sequence-to-sequence (Seq2Seq) problem [1,35,47,55]. A Seq2Seq model transforms a sequence of convolutional and recurrent text-image segments into transcribed text. Networks that follow this strategy can be trained in one of two ways: by maximizing the categorical cross-entropy loss alone, or by combining it with the recurrent CTC loss. [22] employed LSTM and CTC for text-line recognition with no prior word-level segmentation; together, LSTM and CTC enable segmentation-free recognition of handwritten documents. Many studies [24,29] showed that the LSTM model predicts longer sequences more effectively than the standard RNN: the cell gates in the LSTM architecture allow it to remember important information from inputs that have already passed through, which distinguishes it from the RNN. Further improvement in recent years has been well investigated with the BiLSTM model [16,46]. Since information from both the forward and backward passes feeds the final output layer, BiLSTM provides additional training capability that improves prediction accuracy; it can therefore detect, extract, and resolve more temporal dependencies with greater precision. While sliding over the input image, convolutional kernels are specified on fixed grids and attend to all input pixels simultaneously, disregarding the challenges of handwritten text such as inter-/intra-class variations, scale, orientation, and the importance of ink pixels. The authors of [9] proposed a convolutional neural network with deformable convolutions; this CNN variant can deform based on its input image and adapt better geometrically to variations in the textual content. A Separable Multidimensional Long Short-Term Memory (SepMDLSTM) was applied by Chen et al.
[14] to encode input text-line images for script identification and multi-script text recognition via convolutional feature extraction. Pham et al. [39] used multidirectional LSTM layers with CTC, computing the negative log-likelihood of sequences. They investigated how dropout works with recurrent and convolutional layers in deep network architectures, mainly for word handwriting recognition, and further examined line recognition with language modeling and lexicon constraints. Wigington et al. [49] used random perturbations on a regular grid as an augmentation technique and a novel profile-normalization technique on word- and line-level handwritten text; these techniques improved the performance of their proposed CNN-LSTM model. Bluche et al. [6] proposed a generic model based on a convolutional encoder of the input images and a bidirectional LSTM decoder predicting character sequences; they also proposed a convolutional gate in the decoder to control the propagation of the next layer's feature representation. Aradillas et al. [2] comprehensively investigated various combinations of TL and DA techniques on a CNN feature extractor accompanied by 2D-LSTM layers for classification. Puigcerver et al. [43] reported efficient, high-accuracy text-line recognition based on convolutional and 1D-LSTM layers rather than 2D-LSTM, which improves the speed of their model. In our recurrent baseline model, we employ BiLSTM on top of different backbone feature extraction methods, including CNNs with and without an attention mechanism. Unlike existing studies, we answer different hypothesis questions regarding recurrent modeling performance and architecture.

Non-recurrent network: transformer encoder-decoder
The Transformer [4] is one of the most successful models for processing long sequences. While transformers first targeted natural language tasks such as BERT [19] and GPT-3 [7], they have spread widely in the computer vision community to tackle various tasks in the domain. Transformer encoder-decoder networks are composed of three main components: positional encoding, multi-head self-attention modules, and feed-forward layers. The main advantage of the Transformer over the RNN is that the former processes long sequences in parallel. [40] exploits the Transformer for an action recognition model, a problem close to handwriting recognition since both datasets suffer from inter-class and intra-class variations. SOTA handwritten text recognition frameworks using BiLSTM have achieved acceptable recognition results, but their training is computationally intensive. Furthermore, even though they are supposed to capture language-specific dependencies [19,20], they often fall short and require additional post-processing steps. For the first time, we propose avoiding any recurrent architecture by using split attention in the CNN and a multi-head transformer network for the HTR task. This makes it possible to detect long character sequences in images and to model language at the character level, avoiding fixed lexicons. Kang et al. [30] utilized a deep CNN as the feature extractor and transformers as the transcriber for handwriting recognition; they tested their method on the IAM dataset with several configurations, including language models built from IAM and WikiText. Flor et al. [17] proposed a novel, efficient HTR architecture built on a Gated-CNN and integrated with a two-step language model at the character and word levels.
The authors of [9] used a deformable convolutional-recurrent network, inspired by CRNN and 1D-LSTM, that adapts to geometric transformations instead of a standard CNN backbone for their HTR model. Building on the most efficient architectures found in the state of the art, we transform them to solve the HTR problem efficiently and effectively in a straightforward manner.

Data augmentation and transfer learning
Research studies in deep learning [31,50] show the impact of DA through transformations of the input image that preserve the original class for the newly perturbed images, such as flipping, resizing, cropping, rotation, and scaling. DA creates different copies of images for a unique ground truth (label). The authors of [42] used affine transformations to generate 36 additional images per input image, utilizing both rotation and shearing: rotation around the image center with angles of (−5, −3, −1, +1, +3, +5) degrees and shear factors of (−0.5, −0.3, −0.1, +0.1, +0.3, +0.5). Wigington et al. [49] proposed two DA techniques that help reduce CER and WER for handwriting text recognition; they applied a profile normalization technique to make neural networks more tolerant, much like a human reading text with many variations. [11] proposed two DA methods on the training set: (1) crafted multi-scale data, which boosts training performance for HTR with fewer labeled data, and (2) a model-based normalization scheme that addresses handwriting variability at the recognition phase. The DA techniques of these studies are applied to relatively large, well-known datasets. However, the regularization effect of DA is small when only a small single-writer database is available. Therefore, we must combine TL and DA to reduce the final error rate, especially for a small dataset.
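The rotation-and-shear scheme of [42] described above can be sketched as a parameter grid; pairing every rotation angle with every shear factor is an assumption on our part that reproduces the reported count of 36 extra images per line (6 × 6 = 36), and the function name is illustrative.

```python
from itertools import product

# Rotation angles (degrees) and shear factors reported in [42].
ROTATIONS = (-5, -3, -1, +1, +3, +5)
SHEARS = (-0.5, -0.3, -0.1, +0.1, +0.3, +0.5)

def affine_param_grid():
    """Return the (rotation, shear) pairs used to synthesize variants.

    Pairing every rotation with every shear yields 36 augmented
    copies of each input line image.
    """
    return list(product(ROTATIONS, SHEARS))

params = affine_param_grid()
print(len(params))  # 36 augmented variants per input image
```

Each pair would then be applied to the line image with any affine warp routine (e.g. an image library's rotate/shear transform) while keeping the original label.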

Proposed method
This section presents the baseline convolutional neural network model and the preprocessing methods used in our experiments. We then propose the split-attention convolutional-transformer architecture that we employ in the experiments.

Baseline recurrent model
We adopt a baseline model inspired by Puigcerver [43]; the primary purpose of this model is to validate our hypothesis on the attention mechanism and the impact of preprocessing techniques. We hypothesized that using an attention mechanism in the convolutional layers helps retrieve more robust image-line representations than a non-attention CNN. The model then has four main parts: features extracted from a given input image by a ConvNet; a self-attention module enhancing the extracted feature representations; the visual representations treated as sequences whose corresponding character probability distributions are output by a BiLSTM; and, finally, the textual transcription obtained from the decoding block using the CTC loss/decoding method. We train the baseline model by making the CTC probability of the output sequence as high as possible. So, in addition to the characters that make up the text, the RNN is given a special blank character that means "no other character." We adopted the self-attention module from [53]. After experimenting with the baseline model, we found that adding a self-attention module at the end of the CNN yields the most robust feature representations and improves model performance. The baseline handwriting recognition model is trained using the CTC loss and is further detailed in the experimental setup (Sect. 4.3.1). CTC tackles the alignment issue: no alignment information is provided between the input data and the output transcription, since handwriting differs from person to person. The influence of the attention mechanism on the CNN feature extractor is displayed in Table 4; injecting the attention module into the proposed baseline model reduced the error by 0.65% on the Bentham dataset.
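The CTC decoding step mentioned above can be illustrated with a minimal best-path (greedy) decoder: collapse repeated per-frame predictions, then drop the blank symbol. This is a generic sketch of standard CTC greedy decoding, not the paper's exact decoder; the blank index and the toy alphabet are assumptions for illustration.

```python
BLANK = 0  # index reserved for the CTC "blank" ("no other character")

def ctc_greedy_decode(frame_argmax, id_to_char):
    """Collapse a per-frame best path into a transcription.

    CTC best-path decoding: merge consecutive repeats, then remove
    blanks. `frame_argmax` holds the argmax class index per time frame.
    """
    out, prev = [], None
    for idx in frame_argmax:
        if idx != prev and idx != BLANK:
            out.append(id_to_char[idx])
        prev = idx
    return "".join(out)

# Hypothetical 4-symbol alphabet, for illustration only.
alphabet = {1: "c", 2: "a", 3: "t"}
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 0, 3], alphabet))  # cat
```

Note how the blank between two identical symbols (e.g. frames `[3, 0, 3]`) is what allows a doubled letter such as "tt" to survive the collapse.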

Preprocessing
The datasets used contain a wide range of image sizes, whereas the input dimension of the models must be fixed for efficient training and pooling. Therefore, images in this paper are resized to 128 pixels in height and 1024 pixels in width while maintaining the aspect ratio. Since we have black-and-white text-line images, pixels are either 0 or 255: the characters are represented as 0 and the background as 255. We add white padding to the image to maintain the aspect ratio after resizing. More specifically, padding is added where the image falls short of the desired resolution; otherwise, we first resize along the shorter dimension so the height is 128 pixels and then crop along the width so the width is 1024 pixels. Maintaining a consistent aspect ratio allows our Convolutional Neural Network to learn more discriminative and compatible features.
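The pad-or-crop step described above can be sketched as follows. This is a minimal illustration assuming the line image has already been rescaled to a height of 128 while preserving its aspect ratio; the function name and the use of NumPy are our own choices, not the paper's code.

```python
import numpy as np

TARGET_H, TARGET_W, WHITE = 128, 1024, 255

def pad_or_crop_line(img):
    """Bring a height-normalized line image to a fixed 128 x 1024 canvas.

    Assumes `img` (uint8, black ink = 0 on white background = 255) was
    already scaled so its height is 128; here we only pad with white on
    the right, or crop, to reach the fixed width.
    """
    h, w = img.shape
    if w < TARGET_W:  # pad short lines with the background colour
        pad = np.full((h, TARGET_W - w), WHITE, dtype=img.dtype)
        return np.concatenate([img, pad], axis=1)
    return img[:, :TARGET_W]  # crop overly long lines

line = np.zeros((128, 700), dtype=np.uint8)  # synthetic short line
canvas = pad_or_crop_line(line)
print(canvas.shape)  # (128, 1024)
```

Padding with the background value (rather than zeros) keeps the padded region visually indistinguishable from empty paper, so the CNN does not learn spurious edges at the pad boundary.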
Offline handwritten text images scanned by various devices are often affected by noise, and different writers' handwriting styles vary extremely widely. Although our proposed model is an end-to-end HTR system, input images are cleaned using the deslanting method [48] to remove the cursive writing style, and using the illumination compensation technique [12] to remove shadows and to normalize brightness and contrast, before being fed to our CNN for feature extraction. In the present system, the following preprocessing operations have been applied: (i) illumination compensation; (ii) removal of slant and slope from cursive handwritten text; (iii) normalization of line height to a fixed value (128 pixels), keeping the aspect ratio unchanged; and (iv) the combination of (i) and (ii).
The illumination compensation technique consists of five steps that balance uneven light distribution; its primary goal is to produce content with a high degree of legibility. To eliminate slanted and skewed text, we extend the most popular word-level normalization approach to cursive text by removing the slope and the slant. The results of these preprocessing experiments, conducted on the Bentham dataset, are presented in Sect. 5.1.2.

Transformer-based model
In this subsection, we explain the proposed HTR convolutional-transformer architecture in detail, starting with the feature extraction backbone and then the transformer components.

Feature Extraction using Convolutional Neural Network
We investigate different CNN architectures, specifically the most recent SOTA models, including ResNet [27] and ResNeSt [54] with an attention mechanism for better feature extraction. Tables 6 and 7 report results on the Bentham dataset: the deeper the network and the more attention it uses, the better the feature extraction and hence the accuracy. Despite recent advancements in image classification models [31], most downstream applications such as image recognition [27], object detection [45], and semantic segmentation [32] continue to use ResNet variants as the backbone network due to their simplicity and modularity. In contrast, we employ a simple, modular Split-Attention block that distributes attention across feature-map groups. Stacking these Split-Attention blocks in ResNet style forms a new variant called ResNeSt. ResNeSt is chosen because it preserves the overall ResNet structure in downstream tasks without incurring extra processing cost.
ResNeSt beats other SOTA neural networks of comparable model complexity and improves downstream HTR and OCR tasks.

Encoder-Decoder Transformer Network
In this subsection, we describe the main components of the proposed transcriber model architecture. Encoding feature representation: High-level feature representations are extracted from the handwritten line image (x_i ∈ X) by the CNN feature extraction sub-network (Sect. 3.3.1), where x_i represents a sample input image as in Fig. 2. Both the visual representation and the sequential order information are encoded. To process the handwritten line images, the input image x_i is first processed by the CNN, which can handle images of any size, yielding an intermediate visual feature representation F_v of size (f = 2048). The ResNeSt101 convolutional architecture serves as the backbone of our model. Visual feature representations from a standard CNN cannot compactly capture the global representation of the input image; the attention layers in ResNeSt101 help extract this meaningful information.
Positional encoding (PE): In Latin scripts, text images are processed sequentially from left to right. Since the transformer contains no recurrence, a positional encoding stage before the encoder is meant to leverage and encode this crucial order information. It also helps the model locate the next word and character in the text line. For the model to exploit the sequence order, we employ positional encoding of the tokens within the sequence. Accordingly, we add the PE to the input embeddings at the bottom of the decoder, as shown in Fig. 2. The encoded feature vector from the ResNeSt is projected to the same hidden dimension (256) as the transformer encoder before the PE is added. The PE of Eqs. (1, 2) is useful in the proposed architecture since it teaches our model where in the sequence attention is focused.
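The sinusoidal PE of Eqs. (1, 2) follows the original Transformer formulation and can be sketched directly; the NumPy implementation below is a generic illustration, with the hidden dimension of 256 taken from the text.

```python
import numpy as np

def positional_encoding(length, d_model=256):
    """Sinusoidal positional encoding as in the original Transformer.

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    Returns an array of shape (length, d_model) that is added to the
    token (or visual feature) embeddings.
    """
    pos = np.arange(length)[:, None]        # (length, 1)
    i = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions
    pe[:, 1::2] = np.cos(angles)            # odd dimensions
    return pe

pe = positional_encoding(8, d_model=256)
print(pe.shape)  # (8, 256)
```

Because each position receives a unique pattern of phases, the model can infer relative and absolute positions without any recurrence.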
The encoder self-attention over the visual features is computed as scaled dot-product attention:

Attention(Q_v, K_v, V_v) = softmax(Q_v K_v^T / sqrt(d)) V_v    (3)

where q_i^v ∈ Q_v is an input query with i ranging from 0 to w − 1, and K_v and V_v are the input keys and values, respectively. Finally, we obtain the high-level visual representation. In Fig. 3, we visualize the role of self-attention on the target line image. Due to the short length of that line sequence, we also show the visualization of the padding: padding appears at the end of the presented image up to the maximum input width of the datasets, which is pre-set to 2048.
Transcriber text decoding: As depicted in Fig. 2, this component handles the visual features and the language-specific information gathered through the textual representations. It outputs the decoded characters and predicts the likelihood of the next character given the previously decoded sequence. To process the line string effectively, we need special symbols that carry no textual information, in addition to the characters of the vocabulary of size V. The characters <BOL> and <EOL> indicate the beginning and end of the line sequence, respectively, while <pad> indicates padding, as depicted in Fig. 3. Except for the initial character, the transcriptions (y_i ∈ Y) are padded to a maximum predicted length of N characters. Character-level embedding uses a dense layer to convert every character of the input string into an f-dimensional (f_d) vector. Eq. (4) uses the same positional encoding as Eqs. (1, 2), where the PE encodes each time step uniquely. In Fig. 3, we visualize the role of self-attention on the target line image.
where y_i is the ground-truth text transcription and PE is the positional encoding.
(Fig. 3: The impact of self-attention in the transformer-decoder network, demonstrating the padding for this input line image.)
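The <BOL>/<EOL>/<pad> scheme above can be sketched as a character-level tokenizer. The token ids, alphabet, and function names below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical character-level tokenizer mirroring the <BOL>/<EOL>/<pad>
# scheme described in the text; ids and alphabet are illustrative.
PAD, BOL, EOL = 0, 1, 2
SPECIALS = {"<pad>": PAD, "<BOL>": BOL, "<EOL>": EOL}

def build_vocab(alphabet):
    """Map special tokens first, then each character, to integer ids."""
    vocab = dict(SPECIALS)
    for ch in alphabet:
        vocab[ch] = len(vocab)
    return vocab

def encode_line(text, vocab, max_len):
    """Wrap a line in <BOL>/<EOL> and right-pad with <pad> to max_len."""
    ids = [BOL] + [vocab[c] for c in text] + [EOL]
    ids += [PAD] * (max_len - len(ids))
    return ids

vocab = build_vocab("abcdefghijklmnopqrstuvwxyz ")
print(encode_line("cat", vocab, max_len=8))
```

The decoder is then trained to predict, at each step, the next id in this sequence, stopping when it emits <EOL>.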

Datasets
This study examines the HTR system across four datasets: the IAM [33], RIMES [26], Washington [28], and Bentham [10] databases. Table 1 describes the standard partitions used by other researchers on the four datasets to ensure a fair comparison with the current work. Precisely, we can describe the used datasets as follows.
The IAM database contains 13,353 modern English text lines created by 657 distinct authors. The RIMES database is a French handwriting corpus collected from 1,300 writers. The George Washington dataset comprises 565 text lines from George Washington's letters, authored by two 18th-century writers. The Bentham dataset consists of English scripts in black-and-white images with distorted writing and dark backgrounds; the collection contains about 11,500 text lines. A sample of Bentham images is shown in Sect. 5.1.2 along with the corresponding preprocessing methods. Each dataset uses 100 characters, including capital and lowercase letters, numerals, punctuation, special symbols, and white space.

Performance metrics
The Character Error Rate (CER) and Word Error Rate (WER) are used to evaluate the proposed HTR model. Both metrics (Eqs. 5 and 6) count the total number of insertion, substitution, and deletion operations required to transform one sequence into another, based on the Levenshtein edit distance [52]. CER is defined as

CER = LD(ŷ, y) / |N|    (5)

where |N| is the number of ground-truth characters in the partition and LD(ŷ, y) is the character-level Levenshtein distance between the prediction ŷ and the target label y of the i-th sample. WER is defined analogously at the word level:

WER = (S + D + I) / |N_w|    (6)

where transforming one word sequence into the other requires S substitutions, D deletions, and I insertions, normalized by the number of ground-truth words |N_w| in the partition.
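The two metrics can be computed directly from the Levenshtein distance; the following is a standard, self-contained sketch (not the paper's evaluation script).

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def cer(pred, target):
    """Character Error Rate: character edits / ground-truth characters."""
    return levenshtein(pred, target) / len(target)

def wer(pred, target):
    """Word Error Rate: word edits / ground-truth words."""
    return levenshtein(pred.split(), target.split()) / len(target.split())

print(cer("helo", "hello"))                 # 1 edit over 5 chars -> 0.2
print(wer("the cat sat", "the cat sits"))   # 1 of 3 words -> 0.333...
```

Because the same edit-distance routine works on any sequence, WER simply reuses it on lists of words instead of characters.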

Implementation details
This section describes the implementation of the recurrent baseline model and the proposed transformer-based model.

Baseline recurrent model
This subsection briefly describes the implementation details of the baseline recurrent model introduced in Sect. 3.1. We stack five blocks each of convolutional and recurrent layers, where the convolutional layers use a 3 × 3 kernel. We use zero-padding with stride 1 so that each layer's output has the same spatial dimensions as its input. A batch normalization layer helps the model converge faster and more stably, and a LeakyReLU activation function provides nonlinearity. A brief description of the model is given in Table 2.

Non-recurrent transformer-based model
The proposed model has a feature size of 2048. The network architecture (Table 3) contains a CNN block for feature extraction, 4 encoder layers, and 4 decoder layers with a hidden dimension of 256. We used a batch size of 16, the ADAM optimizer, and a learning rate of 0.0006. We adopted the label smoothing technique [37] with the Kullback-Leibler divergence loss. Every line is padded on the right to a maximum length of 128 characters using a unique character, as illustrated in Fig. 3. The vocabulary size is 100, which includes lowercase/capital letters, punctuation marks, numbers, and other special characters.

Kullback-Leibler divergence loss
We compute this loss regularly during the training optimization process to penalize erroneous model predictions.
Choosing a cost function to estimate model performance allows the weights to be updated so as to minimize the loss in the next epoch, and the loss function must fit the problem framing. The softmax activation function generates a probability distribution over 100 classes

Softmax layer ×1
(multi-class classification problem) in our task. In contrast to the Softmax situation, where the categorical cross-entropy loss function is frequently used, the argmax of the predictions generated compares the predicted distribution with the ground truth distribution. Using Kullback-Leibler Divergence (KLD), [36] is an adaptation of the entropy metric standard in information theory. The predictions generated by the final feedforward layer effectively form a probability distribution and can thus be compared to the actual distribution for the sample x in the corresponding training dataset. KLD computes the gain and loss of the probability distribution between the predicted distribution of the model P(t) and the distribution of the ground-truth G(t). Backpropagation will continue until the model P(t) produces textual transcription equal to or very similar to the ground truth G(t) distribution probability. The model weights and biases will be adjusted in Eq 7 using the ADAM optimizer to achieve the ideal distribution of the prediction probabilities output.
where i indexes a sample consisting of the decoded ground truth of X together with its corresponding encoded image feature representation.

Label smoothing
Over-fitting and overconfidence are common issues when training deep learning models. Some regularization techniques, such as early stopping, address the over-fitting issue. Label smoothing instead softens the ground-truth distribution: y_ls = (1 − α) · y_hot + α/K, where K = 100 is the total number of multi-class categories and y_hot represents the one-hot embedded ground-truth labels.
In this way, the smoothed label distribution is equivalent to adding noise to the true distribution, preventing the model from becoming too confident about the correct label. The gap between the output values of predicted positive and negative samples is therefore less pronounced, which avoids overfitting and improves the model's generalization ability. We experimented with α = 0.4. Table 3 presents the attention Transformer model architecture.
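The smoothing step and the KLD loss together can be sketched as follows. This is a simplified version: masking of padded positions and any per-sequence weighting are omitted, and the function names are ours:

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels, n_classes=100, alpha=0.4):
    """Replace one-hot ground truth with the smoothed distribution
    y_ls = (1 - alpha) * y_hot + alpha / K described above."""
    y_hot = F.one_hot(labels, n_classes).float()
    return (1.0 - alpha) * y_hot + alpha / n_classes

def kld_loss(logits, labels, alpha=0.4):
    """KL divergence between predicted log-probabilities and smoothed targets."""
    log_p = F.log_softmax(logits, dim=-1)
    target = smoothed_targets(labels, logits.size(-1), alpha)
    return F.kl_div(log_p, target, reduction="batchmean")
```

With α = 0.4 and K = 100, the correct class receives probability 0.6 + 0.4/100 = 0.604 rather than 1.0, so the model is never rewarded for driving its confidence to the extreme.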

Results and discussions
As mentioned in Sect. 3.3.1, the proposed architecture is based on a ResNeSt backbone for feature extraction. All models were initialized with ImageNet [18] pre-trained weights. We investigate different SOTA neural network models in two respects: (1) feature extraction and representation with and without split-attention CNNs, including the ImageNet-pretrained ResNeSt model; and (2) the recurrent and non-recurrent encoder-decoder task, using both the BiLSTM baseline model (Sect. 3.1) and the proposed Transformer encoder-decoder network (Sect. 3.3.2).

Preliminary results
Table 4 shows the impact of the attention mechanism on the CNN feature extraction method. The CER/WER performance is slightly improved using the baseline-attention model on the Bentham dataset. This led us to choose the ResNeSt split-attention backbone in our proposed model, which improved performance by 2% over the standard ResNet backbone feature extraction method. The quantitative and qualitative results reported in Table 7 and Fig. 4 (Sect. 5.1.3), respectively, further validate the importance of the attention mechanism on the CNN. Based on this, we confirm that the CNN-attention-based model captures helpful information and hence becomes robust to the HTR challenges of handwriting variation noted earlier. The qualitative results provide visual evidence, with red font denoting mispredicted characters relative to the ground truth, and support the quantitative findings showing that the suggested technique is more resilient to inter/intra-class handwriting variance.
Fig. 4 The impact of the attention mechanism on the prediction of different input images from the Bentham dataset. From top to bottom: the input image, the corresponding ground truth, the output prediction with the split-attention mechanism backed by the ResNeSt101 CNN, and the output prediction without the attention mechanism backed by the ResNet101 CNN.

The impact of preprocessing methods
To elaborate on our choice of the preprocessing techniques discussed in Sect. 3.2, Table 5 reports the CER for four training cases. We chose Bentham line text images to investigate our first hypothesis: whether heavy or light preprocessing performs better. We found that applying only the removal of cursive slant (the slant-removal method) leads to the lowest CER. Consequently, we employ only the slant-removal method for further experiments and discard the illumination-compensation techniques on the proposed datasets. In Fig. 5, we show a sample from the Bentham dataset after reducing the noise, enhancing the lighting, recovering damaged strokes, and correcting geometric distortions (correcting the text skew).
As presented in Table 5, we obtain the lowest error rate when only removing the cursive slant. As discussed in Sect. 3.2, this preprocessing step employs the de-slanting algorithm to correct slanted and skewed handwritten text, and we therefore applied it to the images in all further experiments. Bold numbers in the table show the best accuracy of the attention (ResNeSt) backbone over the standard convolutional (ResNet) backbone network, as discussed in Sect. 5.1.3.
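For illustration, a simplified moment-based slant removal can be sketched in pure NumPy: the dominant shear of the ink pixels is estimated from second-order moments and undone row by row. The paper's de-slanting algorithm (Sect. 3.2) may differ in its slant estimate and interpolation; this sketch uses nearest-pixel shifts and assumes dark ink on a light background:

```python
import numpy as np

def deslant(img, background=255):
    """Estimate slant via second-order central moments of the ink pixels
    and undo it by shifting each row horizontally (a nearest-pixel shear)."""
    ys, xs = np.nonzero(img < 128)        # ink pixel coordinates
    if len(xs) == 0:
        return img
    xc, yc = xs.mean(), ys.mean()
    mu11 = ((xs - xc) * (ys - yc)).sum()
    mu02 = ((ys - yc) ** 2).sum()
    if mu02 == 0:
        return img
    slant = mu11 / mu02                   # horizontal shear per vertical pixel
    out = np.full_like(img, background)
    h, w = img.shape
    for y in range(h):
        shift = int(round(slant * (y - yc)))
        src = np.arange(w)
        dst = src - shift
        valid = (dst >= 0) & (dst < w)
        out[y, dst[valid]] = img[y, src[valid]]
    return out
```

Applied to a slanted stroke, the shear correction pulls the ink back toward a vertical column; an already upright stroke passes through unchanged because its mixed moment mu11 is zero.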

Results on Transformer-based model
Following the finding that the attention mechanism helps feature extraction, we also experimented with deeper networks. We first trained the Transformer model with ResNet CNNs; Table 6 showcases the results of this experiment. Since deeper layers yielded better results, we compared ResNet101 with ResNeSt101 as the backbone feature extractor on the Bentham dataset; as anticipated, the latter provides a more robust feature representation of handwriting images, as shown in Table 7. We therefore adopted ResNeSt as the backbone in our architecture for further experiments, as shown in Fig. 2 and Table 3. The ResNet and ResNeSt models perform competitively on the Bentham dataset under the standard-CNN and split-attention-CNN approaches, respectively. Compared to the standard CNN-baseline model, the proposed model further decreases both the CER and the WER, which can also be observed in the qualitative results reported in Fig. 4, where the reddish color indicates mispredicted characters relative to the ground truth. The qualitative results show that the standard CNN backbone yields twice the CER of the split-attention ResNeSt backbone. This suggests that, by capturing a richer contextual feature representation, the split-attention CNN makes fewer character-level and word-level errors, by more than 2%. The results on the historical Bentham dataset demonstrate the improvement in model performance when using the attention mechanism.
Further to the importance of the attention mechanism in the feature extraction block, Table 6 presents two experiments on the Bentham dataset. We discovered that the deeper the network with attention mechanism gets, the better the feature extraction, which ultimately leads to improved accuracy.

Ablation study
In the previous section, we reported that adding the attention mechanism to the network improves the extracted feature representations, given the best preprocessing choice identified by our experiments. We built upon those findings by further improving the SOTA deep learning method for feature extraction and the encoder-decoder text transcriber, as described in Sect. 3.3.
In ablation research, components of a model are typically removed to measure their contribution and to assess whether the model can withstand the removal of specific techniques. This section details the impact of transfer learning (TL), data augmentation (DA), and their possible combinations.

An impact of transfer learning
Training a CNN from scratch is challenging for several reasons, particularly because CNNs are data-hungry and time-consuming to train. TL and DA can therefore help solve the HTR task when labeled datasets are scarce. The Bentham database is excluded from the DA-versus-TL comparison since it plays the role of the source database in the TL approach, with IAM, RIMES, and Washington as the target datasets. The CER/WER performance was computed under two scenarios:
• Ensemble approach, where we average the weights of two models trained for 100 and 200 epochs, respectively.
• Best loss, where we take the model that records the best validation loss.
Table 8 demonstrates the effect of transfer learning with Bentham as the source dataset. We first trained our model to predict handwritten text on the Bentham dataset; we then took the model with the best validation loss and transferred it to the target datasets, namely IAM, RIMES, and Washington. We also investigated which blocks in ResNeSt should be frozen for the best performance and found that it is best to train all the blocks. The lowest CER/WER error rates achieved are highlighted in the same table. To ensure a reliable comparison with current methods, we used the exact training and testing partitions of each public dataset. The set sizes used in the training, validation, and testing procedures for each dataset are given in Table 1.
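A sketch of this fine-tuning setup is shown below, with a toy four-stage backbone standing in for ResNeSt's residual stages and a hypothetical checkpoint name for the Bentham-trained weights:

```python
import torch.nn as nn

# Toy stand-in for ResNeSt's four residual stages (layer1..layer4).
backbone = nn.Sequential(
    nn.Sequential(nn.Conv2d(1, 8, 3)),
    nn.Sequential(nn.Conv2d(8, 16, 3)),
    nn.Sequential(nn.Conv2d(16, 32, 3)),
    nn.Sequential(nn.Conv2d(32, 64, 3)),
)

def prepare_for_finetuning(stages, freeze_up_to=0):
    """Freeze the first `freeze_up_to` stages before fine-tuning on a target
    dataset; freeze_up_to=0 trains everything, the setting found best here."""
    for stage in list(stages)[:freeze_up_to]:
        for p in stage.parameters():
            p.requires_grad = False
    return stages

# Source -> target: load Bentham-trained weights, then fine-tune on IAM/RIMES/Washington.
# backbone.load_state_dict(torch.load("bentham_best.pt"))  # hypothetical checkpoint name
prepare_for_finetuning(backbone, freeze_up_to=0)
```

Sweeping `freeze_up_to` from 0 to 4 reproduces the kind of block-freezing study described above; the experiments reported here favor leaving all stages trainable.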

An impact of both data augmentation and transfer learning
We further analyze the CER/WER performance when both TL and DA techniques are applied to the handwriting system, following the same approach as [2]. As proposed there, the combination of the two techniques (TL and DA) admits several possible structures. In the first approach, shown in Table 9, we use DA to learn from the source database (Bentham) and retrain this pre-trained model on the target datasets (IAM, RIMES, Washington):
• The model is trained from scratch on the source dataset with data augmentation applied.
• The model is then retrained on the target datasets, again with data augmentation; this is referred to as the DA-TL-DA paradigm.
In the second proposal, on the other hand, we used the DA technique on the source but did not apply it to the target datasets:
• The model is trained from scratch on the source Bentham dataset using the DA technique.
• The model is then retrained without DA on the target datasets (the DA-TL paradigm).
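The two paradigms can be summarized with a toy training stub. The `train` helper here is hypothetical: it only records which dataset/augmentation combination each run saw, standing in for the real CNN-Transformer training loop:

```python
import copy

def train(model, dataset, augment):
    """Toy stand-in for a training run: records what was done.
    In the real system this would run the CNN-Transformer training loop."""
    model["history"].append((dataset, "DA" if augment else "no-DA"))
    return model

source_model = {"history": []}
train(source_model, "Bentham", augment=True)   # step 1: source training with DA

runs = {}
for target in ("IAM", "RIMES", "Washington"):
    # DA-TL-DA: transfer the Bentham model and keep augmenting on the target.
    runs[target + "/DA-TL-DA"] = train(copy.deepcopy(source_model), target, augment=True)
    # DA-TL: transfer, but fine-tune on the target without augmentation.
    runs[target + "/DA-TL"] = train(copy.deepcopy(source_model), target, augment=False)
```

Deep-copying the source model before each fine-tuning run mirrors the experimental protocol: every target dataset starts from the same Bentham-trained weights.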
For thoroughness, Table 9 reports the findings for the proposed architecture with all block layers unfrozen. The baseline model is the CNN-Transformer model with no techniques applied. The DA-TL and DA-TL-DA techniques use the Bentham database to train the model from scratch; the model is then fine-tuned on the IAM, RIMES, and Washington databases, with or without augmentation on the target datasets. In line with [2], we conclude that applying DA on the target datasets after TL, i.e., the DA-TL-DA paradigm, does improve CER and WER performance. In the case of the RIMES dataset, the DA-TL-DA approach did not perform well; one possible reason is that the French language of the target dataset differs from the English of the source Bentham dataset. We found that applying DA on target datasets following TL is advantageous; the DA-TL-DA paradigm is particularly beneficial when the target source has only a few labeled text lines, as with the Washington database. Considering the arguments above and Tables 8 and 9, we conclude that the DA-TL-DA technique is reliable. Fine-tuning a ResNeSt trained on a similar task, with the Bentham dataset as an extensive database of HTR samples, provides a relatively good starting point. We demonstrate that the model generalizes well under DA (as on the RIMES dataset) and DA-TL-DA (as on the remaining target datasets). The proposed model is then trained on target databases containing only a few input samples, as in the Washington database, which represents only a tiny portion of the training set.

Comparison with the state-of-the-art on benchmark datasets
In Table 10, we present a comprehensive performance comparison with the state of the art. The compared approaches differ in whether they require a pre-segmentation step (segmentation-based versus segmentation-free) and whether they require a language model (lexicon-based versus lexicon-free). Some techniques are based on recurrent neural networks, typically LSTMs with a Connectionist Temporal Classification (CTC) loss function; others use Transformer encoder-decoder sequence-to-sequence architectures. Our results are compared against nine related studies to demonstrate the improvement and evaluate the accuracy of the proposed model. These approaches use deep learning methods based on CNN with LSTM-CTC or Transformer encoder-decoders, where some variant of the LSTM is used; some studies additionally use DA, synthetic datasets for the target datasets, or a language model (LM). We track the percentages of incorrectly identified characters and words as CER and WER, respectively. The proposed model is tested on four benchmark datasets using their standard partitions, as described in Sect. 4.1, to ensure a fair comparison with the state-of-the-art methods in Table 10. Additional experimental protocols used to compare our work with existing studies are: (1) recurrent or non-recurrent approaches; (2) whether DA-TL is employed; (3) lexicon-based or lexicon-free models; and (4) whether a synthetic dataset is used. Although Chen et al. [14] provide a multi-HTR system based on a recurrent network, their model performs poorly on IAM and RIMES compared to our findings. Although Pham et al. [39] strengthened their model with Dropout and a constrained, lexicon-based language model, its performance remains behind ours. Like our proposed model, Wigington et al. [49] applied DA in their studies, and we obtained similar CER/WER on the IAM and RIMES datasets.
However, on the Washington dataset, we outperform their approach by more than 12%. Similarly, Aradillas et al. [2] outperform our DA-TL-DA paradigm by a slight margin of less than 1% CER on IAM and around 2% CER on the RIMES dataset; however, we obtain better WER performance on those datasets as well as on the Washington dataset. Bluche et al. [6] and Kang et al. [30] outperform the proposed model on IAM and RIMES, but with additional constraints on their models, such as generating synthetic datasets and applying a language model. The improvement from using the transformer as the decoder is significant; unlike our work, however, Bluche and Kang boost their performance with the help of a generated synthetic dataset and a lexicon-based model. Finally, our proposed model marginally outperforms the work of Flor et al. [17], Puigcerver et al. [43], and Cascianelli et al. [9], although Puigcerver used DA and synthetic datasets to strengthen their model, while Cascianelli employed deformable convolutions, which resulted in competitive CER/WER on the IAM and RIMES datasets.
During the evaluation process, we also experimented with CLIP pre-training [44] on the IAM dataset, aiming to discover the effect of pre-training the backbone CNN on the HTR CER, given that CLIP is exceptionally well known for its capabilities. We applied CLIP pre-training to the ResNeSt model, following the approach described in the CLIP paper. The ground-truth labels are fed to the text sub-encoder, which outputs their corresponding textual embeddings in a shared embedding space. The equivalent input images are fed to the image sub-encoder, which outputs feature representations that serve as input to the Transformer encoder.
After pre-training the CNN model on the IAM dataset, we used the resulting weights to train the CNN-Transformer model on the HTR task. Table 11 shows the impact of CLIP pre-training on the CER for the IAM dataset; DA is applied in both experiments. We conclude that pre-training the CNN model helps improve the model's performance.
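For reference, the symmetric contrastive objective used in CLIP-style pre-training can be sketched as follows. The temperature value and the assumption of a shared, L2-normalized embedding space follow the original CLIP formulation rather than details given here:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for CLIP-style pre-training: matched
    image/transcription pairs are pulled together in the shared embedding
    space, mismatched pairs within the batch are pushed apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(image_emb.size(0))          # i-th image matches i-th text
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

In our setting, `image_emb` would come from the ResNeSt image sub-encoder and `text_emb` from the text sub-encoder applied to the ground-truth line transcriptions; once pre-trained this way, the image branch provides the initial weights for the HTR backbone.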

Conclusion
This paper presents a transformer-based, lexicon-free approach for HTR. To the best of our knowledge, this is the first approach to the HTR task that utilizes a joint attention mechanism in both the feature extraction (CNN) and encoder-decoder text transcriber (transformer) networks. We examined different preprocessing methods before feeding data to the proposed model and found that removing only the slanted text improves performance. We conducted a rigorous analysis and evaluation of several experiments with TL and DA paradigms, demonstrating that the DA-TL-DA paradigm is the optimal design for the HTR task. Our reported results show that our method achieves state-of-the-art results with neither synthetic data nor a lexicon model. Moreover, our proposed model can deal with small training datasets such as the Washington dataset, extending its relevance to real-world use cases. Transformers are excellent at integrating visual and language-specific knowledge since they are character-based rather than vocabulary-based.
More specifically, the HTR problem is a generic text recognition task. We therefore provided an end-to-end neural network model trained on variable-sized line images with their corresponding line-level transcriptions. Our contributions are summarized as follows:
• We studied the impact of the attention mechanism on the CNN-backbone feature extraction.
• We examined the influence of different preprocessing techniques applied beforehand.
• We devised a unified deep learning framework from SOTA dual-attention-mechanism networks.
• We comprehensively investigated the effects of TL and DA.
• Our lexicon-free model, trained without synthetic data, outperforms the cited SOTA models under the same constraints.
We completed thorough evaluations on four public benchmark datasets for handwriting text recognition and achieved SOTA performance using an identical architecture with minimal hyper-parameter changes. This makes our model more robust, universal, and straightforward to apply to any new text recognition challenge. Both the source code and all pre-trained models will be made available on our GitHub. In the future, we aim to improve the architectural design by incorporating TL and DA with the CLIP model to decrease the CER/WER further. In addition, synthesizing sufficiently realistic handwritten text lines would let the model train on a substantial amount of data, improving performance and addressing overfitting. A closed-vocabulary language model (LM) could also be incorporated alongside the DA-TL-DA and CLIP techniques to further improve CER/WER.