Conv-transformer architecture for unconstrained off-line Urdu handwriting recognition

Unconstrained off-line handwritten text recognition in general, and for Arabic-like scripts in particular, is a challenging task and is still an active research area. Transformer-based models for English handwriting recognition have recently shown promising results. In this paper, we explore the use of the transformer architecture for Urdu handwriting recognition. The use of a convolutional neural network before a vanilla full transformer, and the use of Urdu printed text lines alongside handwritten text lines during training, are the highlights of the proposed work. The convolutional layers reduce the spatial resolution and compensate for the n² complexity of the transformer's multi-head attention layers. Moreover, the printed text images in the training phase help the model learn a greater number of ligatures (a prominent feature of Arabic-like scripts) and a better language model. Our model achieved state-of-the-art accuracy (CER of 5.31%) on the publicly available NUST-UHWR dataset (Zia et al. in Neural Comput Appl 34:1–14, 2021).


Introduction
Communication through written words differentiates humans from other species, and it has remained an effective means of communication to date. Despite all technological advancements in speech-to-text systems and word processors, handwriting is still the most convenient way of jotting down thoughts, filling in forms, and writing addresses.
Automatic text recognition is the process of converting text in images into corresponding editable text. Document digitization has several important real-world applications: for instance, it allows us to preserve our cultural heritage and the knowledge of our ancestors for future generations.
Apart from the preservation of history and heritage, digitization plays an equally important role in automating several processes in our daily lives. Postal automation can significantly reduce mail delivery times. Information extraction from documents such as forms or medical records helps in developing digital databases and decision support systems. Therefore, text recognition in general has been an active area of research for the past several decades.
Urdu is the national language and one of the two official languages of Pakistan. It is the 21st most widely spoken first language in the world, with around 61.9 million native speakers. It is usually written in the Nastaleeq script, which is derived from the Arabic script. The majority of Pakistanis speak and understand it as their second language [1].
Printed text recognition is considered a solved problem, and practical systems exist that reliably digitize scanned documents in several languages (the Google Vision API supports over 100 languages).
Despite being natural to human beings and in widespread use, handwriting has remained a challenging task for computers to digitize. Humans are highly creative when it comes to handwriting, resulting in a vast diversity of writing styles, character formations, etc. Every person has their own writing style, and training a model that can recognize an unseen handwriting style is a challenging task. Therefore, we must consider different aspects of writing, such as writing style, the type of paper used, stroke width, human error, and several other factors, when addressing handwriting recognition.
Similar to Arabic, the letters in Urdu scripts (Nastaleeq or Naskh) are joined together to form ligatures (see https://www.w3.org/TR/alreq/). This makes Urdu text recognition highly context sensitive. Moreover, due to this joining, some ligatures or characters overlap each other vertically. Some characters are very similar and can thus easily be confused with one another (please refer to Fig. 1). Urdu has over 24,000 unique ligatures [2] with different joining rules. These challenges render Urdu text recognition a highly complex task.
There are two major approaches to off-line handwritten text recognition. The first is the segmentation-based approach, which isolates each letter, ligature, or word and recognizes it individually [3]. However, this technique does not work well, especially for Urdu, as its text is highly context sensitive [4]. The second approach is segmentation-free: text recognition is modeled as a sequence-to-sequence task, inspired by [5] for the task of neural machine translation (NMT) [2,6].
Text-recognition systems exist for symbolic and alphabet-based languages, but no comparable systems exist for languages written in Arabic-derived handwritten scripts, including Urdu [7]. Attempts have been made to digitize information by manual transcription; however, manually transcribing such a large volume of data is difficult, time-consuming, and costly [2].
Most research in the field of Urdu text recognition focuses on printed text [8][9][10], whereas handwriting recognition is wide open for new ideas. The majority of research in Urdu handwriting recognition revolves around stroke-based online handwriting recognition [11], in which touch-sensitive devices such as mobile phones take advantage of touch input to recognize handwritten text. We focus on off-line Urdu handwriting recognition, which involves the use of handwritten Urdu text images.
The major contribution of this paper is a CNN + Transformer (Conv-Transformer) architecture for the task of Urdu handwriting recognition. A convolutional neural network (CNN) extracts the visual information from the image, which is then fed to a full transformer [5] with three encoder and three decoder layers stacked on top of each other. The encoded sequence is passed to the transformer decoder, which digitizes the handwritten text. Because of the transformer decoder, the model works in an autoregressive fashion. For testing, we propose a beam-search-inspired technique to select the most probable outcome among the top-k probable output sequences.
The paper is organized as follows. Section 2 summarizes related work in off-line Urdu handwriting recognition. Section 3 presents our proposed technique. Section 4 describes the experimental setup, including the preprocessing steps, the data augmentations used, and implementation details. Section 5 provides the findings and their interpretation. Lastly, Sect. 6 concludes the study and suggests future research directions.

Related works
Traditionally, Urdu recognition techniques are broadly categorized into holistic and analytical methodologies [6]. Holistic approaches refer to word-level recognition in Roman scripts, whereas in Arabic and Urdu they refer to partial words or ligatures. Analytical procedures, on the contrary, refer to recognition at the character level. Both printed and handwritten text are usually categorized in this manner. While recognition of printed Urdu text has improved over the years, research on handwritten text recognition is still limited.
Sagheer et al. [1] used a Support Vector Machine (SVM) for Urdu text recognition. As preprocessing steps, the images were converted to greyscale and binarized, and a median filter was applied to eliminate salt-and-pepper noise. Structural and gradient features were extracted as feature maps. For classification, an SVM with an RBF kernel was used. The dataset, named the 'CENPARMI Urdu word dataset', contained 14,407 training samples and 3770 test samples. A recognition performance of 97% was reported in the paper.
The techniques used in [2,3,6,7] demonstrate how convolutional-recurrent architectures can be used for the effective recognition of cursive text. Hassan et al. [6] proposed an analytical approach in which character segmentation is done implicitly, using a convolutional neural network as a feature extractor and a Bi-LSTM network for classification. The input image is first binarized, and the stroke sequences are then mapped to the transcription. Feature maps are extracted by the convolutional layers, transformed into feature sequences, and fed into the LSTM layers. The network architecture consists of seven convolutional layers, with pooling, batch normalization, and dropout layers in between, followed by two Bi-LSTM layers. An average character recognition rate of over 83% was obtained in experiments on a sample of 6000 distinct text lines from the UNHD dataset [12]. The authors further proposed extending the work to recognize main ligatures separately in order to decrease the number of character classes.
In another similar study, Zia et al. [2] proposed a handwriting recognition model based on a CNN-RNN architecture with n-gram language modeling. The input height is increased from 100 to 128 pixels to address the issue of limited resolution. In addition, the features are concatenated before being fed into the LSTM layer, instead of using a max-pooling layer to eliminate the excess dimensions. Moreover, a random distortion layer is added just before the input layer to distort the images randomly. The paper uses an interpolated n-gram model that combines the strengths of lower-order and higher-order grams, with Kneser-Ney smoothing applied to prevent zero frequencies for unknown words. The proposed model gave a minimum Character Error Rate (CER) of 5.28% on a newly created dataset called 'NUST-UHWR'. The architecture proposed by Zia et al. uses a CTC loss at the end, which makes the task a sequence labeling problem and, in turn, requires a separate n-gram language model to capture the probability of the next character given the previous n characters. One drawback of this approach is that, as is evident from models such as LSTMs, BERT [13], and GPT-3 [14], language modeling is better captured by deep learning models than by statistical approaches such as n-grams. This motivates us to model handwriting recognition as a seq2seq task, like neural machine translation. In [3], Naz et al. used the same convolutional-recurrent technique, which is state of the art in printed text recognition. They used a five-layer CNN to extract generic- and abstract-level features, which are then fed into a multi-dimensional long short-term memory (LSTM) network for contextual features and classification. The proposed technique achieved 98.12% accuracy on the UPTI dataset. The authors proposed extending this work to the Persian and Arabic languages.
Similarly, Husnain et al. [7] also used CNNs to recognize handwritten Urdu characters. For feature extraction, each handwritten Urdu character was analyzed to extract structural and geometrical information. These characteristics were then combined with the image's pixel-based data to produce reliable classification results, and passed through a four-layer convolutional network. Finally, a fully connected layer performs the classification. The paper reported an accuracy of 96.05% for character-level recognition of Urdu handwritten text. The dataset contained 800 images covering 80 Urdu characters and 10 numerals.
To improve on the previously mentioned techniques, an attention mechanism was used in [15,16]. The attention mechanism provides a global reference for each word/pixel-level prediction. In [15], Michael et al. redesigned an attention-based seq2seq model for the task, inspired by the model proposed in [17] for neural machine translation. The encoder combines a CNN as a feature extractor with an RNN (three Bi-LSTM layers) to encode the temporal context and visual information of the input image. For the decoder, a separate LSTM (with 256 hidden units) decodes the actual character sequence. Attention is applied between the extracted features and the hidden state of the decoder. Positional embeddings are also injected into the input sequence, as they provide the relative positions of tokens. The proposed architecture gave a minimum CER of 4.87% on the IAM dataset and 4.66% on the BOZEN dataset. In the future, the authors aim to improve the encoder part of the architecture by using pre-trained models. In [16], the authors proposed an end-to-end transformer-based OCR (TrOCR) model. The input image is divided into patches, which are flattened and concatenated to obtain an embedding matrix that can be fed into a transformer encoder. This architecture uses a pre-trained image transformer as the encoder and a pre-trained text transformer as the decoder. TrOCR treats handwriting recognition as a seq2seq problem, where the encoder is initialized with weights pre-trained on ImageNet and the decoder with weights pre-trained on wiki-text. The TrOCR model gave a minimum CER of 2.89% on the Synthetic and IAM datasets. The authors further proposed testing the model on multi-lingual text recognition problems. The architecture is computationally complex due to the presence of a pre-trained transformer encoder such as DeiT [18] and a transformer decoder such as RoBERTa [19]. The model relies heavily on this pre-training and gives good results on the IAM dataset after fine-tuning. Since the decoder is pre-trained on wiki-text, which is in English, TrOCR becomes infeasible for, and gives poor results on, the task of Urdu handwriting recognition.

Methodology
In this study, handwriting recognition is treated as a Seq2Seq modeling task, inspired by the models proposed in [5,17]. In [17], the authors proposed a full-transformer model for the task of neural machine translation. The encoder-decoder architecture with attention throughout not only captures inter-language dependencies and alignments at the embedding level (the attention operation itself has no trainable weights), but also simultaneously learns a language model for the target language. Taking inspiration from this, we model handwriting recognition as a Seq2Seq problem in which the image is treated as a sequence and an output sequence of digitized text is generated. It is pertinent to note that the transformer has an n² factor in its computational complexity due to the multi-head attention layers, where n is the sequence length [20]. This makes transformers computationally very slow, or infeasible, for very long sequences, as is the case with handwritten text images. To compensate for this, we propose a Conv-Transformer architecture. The convolutional layers at the start reduce the spatial resolution of the image and extract important features. The feature maps are then fed into a vanilla full transformer, which digitizes the input image. This architecture not only learns to digitize the input image but, owing to the inherent working of transformers, also learns a language model for the task. Unlike [2], it does not require a separate n-gram language model. During training, we use teacher forcing on the output sequence for faster convergence (as shown in Fig. 2). Due to the transformer architecture, the proposed model works in an auto-regressive fashion during the testing or inference phase; the right-shifted labels are not used at inference time.
Given a test image, we pass it through the convolutional block and then the transformer encoder, while a BOS (Beginning Of Sentence) token is initially passed to the decoder. The next character or token is predicted given the BOS token and the feature maps extracted from the image. The predicted character is appended to the BOS token, and the process repeats until an EOS (End Of Sentence) token is encountered. This procedure decodes the output sequence. We use beam search decoding, which decodes the output sequence based on the best probability of the sequence as a whole (as shown in Fig. 3), rather than the best probability of the next character as in greedy decoding; beam decoding gives superior results to greedy decoding. The individual components of the proposed architecture are explained in the following sections. Section 3.1 details the convolutional block, Sect. 3.2 discusses the significance of the transformer for Urdu handwriting recognition, and Sect. 3.3 discusses the beam search.
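The autoregressive loop described above can be sketched as follows. This is a minimal illustration, shown with greedy selection for brevity (the paper uses beam search, Sect. 3.3); `toy_model` and its probability table are purely hypothetical stand-ins for the Conv-Transformer.

```python
# Sketch of the autoregressive decoding loop: at each step the model is given
# the image features and the tokens decoded so far, and returns a distribution
# over the next token. `toy_model` is a hypothetical stand-in for the real model.
BOS, EOS = "<bos>", "<eos>"

def toy_model(image_features, context):
    # Hypothetical next-token distributions keyed by the last context token.
    table = {
        BOS: {"a": 0.7, "b": 0.3},
        "a": {"b": 0.6, EOS: 0.4},
        "b": {EOS: 0.9, "a": 0.1},
    }
    return table[context[-1]]

def greedy_decode(image_features, max_len=10):
    context = [BOS]                              # decoding starts from BOS
    while len(context) < max_len:
        probs = toy_model(image_features, context)
        next_tok = max(probs, key=probs.get)     # most probable next token
        context.append(next_tok)
        if next_tok == EOS:                      # stop once EOS is emitted
            break
    return context[1:-1]                         # strip BOS/EOS

print(greedy_decode(None))   # -> ['a', 'b']
```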

Convolutional neural network (CNN)
We use stacked CNN layers to extract visual features from the image, as CNNs have a strong ability to learn task-specific features [21]. Given a grayscale input image of handwritten text of dimension W × H, the CNN reduces it to (S × 1 × d), where S is the width of the feature map and d its depth after the convolutional layers. This is then reshaped to (S × d) and fed to the transformer encoder, where d is a hyperparameter treated as the input embedding dimension of the encoder and S is the sequence length. The configuration of the convolutional blocks of our custom CNN is given in Table 1 and Fig. 4.
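The shape bookkeeping above can be made concrete with a short calculation. The exact layer configuration (Table 1) is not reproduced here, so the pooling counts and depth below are illustrative assumptions: a 64-px input height, six 2× reductions on the height axis (collapsing it to 1), three on the width axis, and a final depth d = 512.

```python
# Compute the (S x 1 x d) output shape of the CNN front end described above.
# h_pools, w_pools and d are hypothetical values, not the paper's Table 1.
def cnn_output_shape(W, H=64, d=512, h_pools=6, w_pools=3):
    h = H // (2 ** h_pools)   # 64 -> 1: height must collapse fully
    S = W // (2 ** w_pools)   # downsampled width becomes the sequence length
    assert h == 1, "height must collapse to 1 before the transformer"
    return (S, h, d)          # (S x 1 x d), later reshaped to (S x d)

S, h, d = cnn_output_shape(W=1600)
print(S, h, d)   # sequence length S fed to the encoder
```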

Transformer
Transformers were first introduced in [5]. The architecture, shown in Fig. 2, uses an attention mechanism to capture long- and short-term dependencies. It largely replaced RNNs and LSTMs, which struggled to capture long-term dependencies due to vanishing gradients [17]. The transformer is an encoder-decoder architecture that uses self-attention on the encoder side and causal attention on the decoder side. In the self-attention mechanism, every position in the input embedding attends to every other position, whereas in the decoder, causal attention restricts tokens to attending only to previous tokens. Attention also takes place between the encoder and the decoder, known as multi-head encoder-decoder attention. All in all, the attention mechanism helps the handwriting task because it allows the model to know which pixels in the image to attend to while generating a particular character.
The attention mechanism used by the transformer takes three input matrices: Query (Q), Key (K), and Value (V). These are different representations of the input embedding after passing through dense (linear) layers. Attention scores are evaluated by the dot product of the hidden states of the encoder and decoder (in encoder-decoder attention), as shown in Eq. 1:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V    (1)
The dot product is scaled by the square root of the embedding depth. The scaled scores are then converted into probabilities, or attention weights, using a Softmax. Multiplying the attention weights with the V matrix concentrates on the positions that should be focused on while generating a particular character. V, K, and Q are split into multiple heads instead of a single attention head, since this enables the model to jointly attend to information at various locations from different representational spaces.
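The scaled dot-product attention just described can be sketched in NumPy. This is a minimal single-head illustration on random matrices, not the paper's implementation.

```python
import numpy as np

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, as in [5].
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q x n_k) raw scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V, weights                      # weighted sum over values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 query positions, depth d_k = 8
K = rng.normal(size=(6, 8))    # 6 key/value positions
V = rng.normal(size=(6, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)      # (4, 8) (4, 6)
```

Each row of `w` is a probability distribution over the six key positions, which is exactly the "which pixels to attend to" weighting described above.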
In our proposed architecture, the encoder part of the transformer receives the embeddings extracted by the CNN module (the feature maps). The embeddings are then injected with positional information: because the transformer encoder has no recurrence, unlike recurrent neural networks, information about token positions must be added to the input embeddings. The positional encoding suggested in [5] is employed in the proposed work. We then have three encoder layers, stacked on top of each other, followed by three decoder layers; the number of layers was chosen empirically. The right-shifted output tokens, passed through an embedding layer, are fed to the decoder during the training phase. To predict the final output tokens, a linear layer followed by a Softmax projects the decoder embedding from the model dimension to the vocabulary-size dimension.
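The sinusoidal positional encoding of [5], used above, can be computed as follows. The sequence length and depth here are illustrative values only.

```python
import numpy as np

# Sinusoidal positional encoding from [5]: even dimensions use sin, odd use cos,
# with wavelengths forming a geometric progression. Added to the CNN embeddings.
def positional_encoding(seq_len, d):
    pos = np.arange(seq_len)[:, None]     # (S, 1) positions
    i = np.arange(d)[None, :]             # (1, d) dimension indices
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return pe                             # (S, d), same shape as the embeddings

pe = positional_encoding(seq_len=200, d=512)
print(pe.shape)   # (200, 512)
```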

Beam search
At inference, auto-regressive models must generate their own context to predict the next tokens, unlike training, where the right-shifted tokens are supplied as context from the output label. An intuitive first solution is greedy decoding: the most probable token is selected at each index and used as context for further predictions. This approach, however, has a significant drawback: the sentence produced is not necessarily the most probable sequence as a whole given the input. Producing the most likely sentence would involve generating all possible sentences and keeping the most probable one, which is computationally intractable. Not only does the number of candidates grow exponentially with each index, but the vocabulary itself may contain tens of thousands to several million words. For a vocabulary of size V, at index n there are Vⁿ possible partial sequences, each with its own probability. The model must then be run on each of these partial sequences to generate partial sequences of length n + 1, so the computational resources and time required for each subsequent iteration grow exponentially. This exhaustive search is, however, the only algorithm that guarantees the highest-probability sequence is output at the end.
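The exponential blow-up described above can be made concrete with a short calculation; the vocabulary size used here is an illustrative assumption, not the paper's.

```python
# With a vocabulary of size V, there are V**n partial sequences at index n.
# V = 60 is a hypothetical character-level vocabulary size for illustration.
V = 60
for n in (1, 2, 4, 8):
    print(f"index {n}: {V ** n} partial sequences")
```

Even for a small character-level vocabulary, the candidate count reaches the hundreds of trillions within eight steps, which is why exhaustive decoding is impractical.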
Beam search is a heuristic approach that makes decoding tractable while capturing the essence of the exhaustive search. The algorithm introduces a hyper-parameter k, the number of beams. At the start of inference, the model sorts the V possible choices for the leading character after the beginning-of-string token (BOS) and keeps the k most probable tokens. At each subsequent iteration, the model is run on each of these k possibilities and produces a set of k tokens for each of the previous k sequences, for a total of k² candidate sequences. These sequences are then ranked according to the scoring formula given in Eq. 2.
score(y) = (1 / nᵅ) · Σₜ₌₁ⁿ log P(yₜ | y₍<t₎, x)    (2)

where n is the length of the candidate sequence y, x is the input image, and α is set to 0.7. Log probabilities are used to prevent numerical underflow, and the soft length normalization by nᵅ prevents the model from strictly preferring shorter sequences. Note that sequence probabilities are only accumulated up to the end-of-string token (EOS); while the maximum string length is bounded, different inputs may produce outputs of different lengths. The search keeps iterating past an EOS token, so a sequence that already ended with EOS at some previous index may be overwritten by a candidate from the current iteration if that candidate scores better than the complete sequence. However, the final score that the replacing, incomplete string achieves once it eventually predicts an EOS may turn out lower than the score of the completed string it displaced. For this reason, after each iteration all k sequences are scanned for an EOS token; any finished sequence is stored, together with its score, in a separate cache, and the search continues. The cache ensures that only the best k finished sequences over the run of the entire search are kept until the end. Once the search reaches the final index, the highest-scoring entry in the cache is returned as the final output. Despite being fallible, this heuristic outperforms greedy search by a significant margin and captures the essence of the exhaustive algorithm while remaining computationally realizable.
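The length-normalized scoring described above can be sketched as follows, using the stated α = 0.7; the per-token probabilities are illustrative values only.

```python
import math

# Length-normalized beam score: summed token log-probabilities divided by
# n**alpha, so longer candidates are not strictly penalized (alpha = 0.7).
def beam_score(token_probs, alpha=0.7):
    n = len(token_probs)
    log_p = sum(math.log(p) for p in token_probs)
    return log_p / (n ** alpha)

short = [0.5, 0.5]              # raw probability 0.25
long_ = [0.7, 0.7, 0.7, 0.7]    # raw probability ~0.24, slightly lower
print(beam_score(short), beam_score(long_))
```

Without normalization the shorter candidate wins on raw log-probability; with the nᵅ normalization the longer, per-token-stronger candidate scores higher, which is exactly the behavior the soft normalization is meant to produce.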
At each iteration the model is run only k times, and the breadth of the search tree stays limited to k for the entire run. This paper uses a custom implementation of beam search based on Eq. 2. Traditional implementations generally compute k · V probabilities at each index and search for the best k, whereas the approach here computes only k · k probabilities. Furthermore, given the architecture of the model presented in this paper, beam search is performed at the character level instead of the word level. The vocabulary for character-level recognition is generally much smaller, since it comprises the unique characters (and some ligatures) of a language rather than its entire dictionary. This makes beam search perform even better, and closer to the exhaustive algorithm, since even moderate beam counts are then comparable to the actual vocabulary size, rather than being a tiny fraction of it.
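The k × k character-level variant described above, including the cache of finished sequences, can be sketched end to end. The next-character table is a toy stand-in for the Conv-Transformer's output distribution, and the example strings are purely illustrative.

```python
import math

BOS, EOS = "<bos>", "<eos>"

# Toy character-level model: next-character distribution given the last character.
TABLE = {
    BOS: {"a": 0.4, "b": 0.6},
    "a": {"a": 0.1, "b": 0.3, EOS: 0.6},
    "b": {"a": 0.5, "b": 0.1, EOS: 0.4},
}

def beam_search(k=2, max_len=5, alpha=0.7):
    beams = [([BOS], 0.0)]      # (sequence, summed log-probability)
    finished = []               # cache of sequences that reached EOS
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            if seq[-1] == EOS:  # finished beams are not expanded further
                continue
            probs = TABLE[seq[-1]]
            # expand each beam with its own top-k tokens (the k*k variant)
            for tok, p in sorted(probs.items(), key=lambda x: -x[1])[:k]:
                candidates.append((seq + [tok], lp + math.log(p)))
        if not candidates:
            break
        # rank by the length-normalized score of Eq. 2 and keep the best k
        candidates.sort(key=lambda c: c[1] / (len(c[0]) - 1) ** alpha, reverse=True)
        beams = candidates[:k]
        # cache any beam that just finished, so later overwrites cannot lose it
        finished += [c for c in beams if c[0][-1] == EOS]
    best = max(finished, key=lambda c: c[1] / (len(c[0]) - 1) ** alpha)
    return "".join(t for t in best[0] if t not in (BOS, EOS))

print(beam_search())   # -> 'ba'
```

Note how the cache matters here: a sequence that finishes early is dropped from the active beams by a better-scoring partial candidate, yet remains available for the final selection.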

Dataset
For our experimental setup, we used the NUST-UHWR dataset [2]. The dataset contains images, each with a single line of handwritten Urdu text, along with the corresponding text labels. The images are unique and contain different text in different styles. The UHWR dataset was divided into training, validation, and testing splits as described in Table 2. The dataset consists of approximately 10,000 text lines, which is insufficient to train a reasonably good handwritten text recognition engine. The standard method of increasing the number of training samples is data augmentation; however, we argue that data augmentation alone is not sufficient for training a text recognition system. In [22], we showed that ligature coverage has a positive impact on the accuracy of a text recognition system. This is especially true for Arabic-like scripts, where the number of ligatures is very large. Two printed text datasets were therefore employed to augment the proposed handwriting recognition system during the training phase: the recently proposed Urdu Ticker Text dataset [23] and the UPTI 2.0 dataset [2]. The statistics of these two datasets are given in Table 3.
Before feeding the images into the model, a few preprocessing steps were performed, as discussed in the following sub-section. To add diversity to the dataset and increase the number of samples, data augmentation techniques were also used, as described in Sect. 4.1.1.

Data preprocessing and augmentation
The first step is to standardize the dataset; in a nutshell, we translate the data into a format that is simple to use, which helps reduce the CPU bottleneck during model training. We convert all images to greyscale and resize them to a fixed height of 64 px while maintaining the aspect ratio. For batching, we pad the image width with zeros up to the max-length hyper-parameter (set to 1600).
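The resize-and-pad step above can be sketched as follows. This is a minimal NumPy illustration using nearest-neighbour sampling; a real pipeline would use a proper interpolation routine (e.g. from PIL or OpenCV).

```python
import numpy as np

# Scale a grayscale line image to height 64, preserving the aspect ratio,
# then zero-pad the width to the max-length hyper-parameter (1600).
def preprocess(img, target_h=64, max_w=1600):
    h, w = img.shape
    new_w = max(1, round(w * target_h / h))      # width that keeps the aspect ratio
    # nearest-neighbour resize via index sampling (illustration only)
    rows = (np.arange(target_h) * h / target_h).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    resized = img[rows][:, cols]
    out = np.zeros((target_h, max_w), dtype=img.dtype)
    out[:, :new_w] = resized                     # zero-pad the remaining width
    return out

img = np.full((128, 800), 255, dtype=np.uint8)   # dummy 128x800 line image
out = preprocess(img)
print(out.shape)   # (64, 1600)
```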
During model training, we employ data augmentation to introduce diversity into the training datasets; data augmentation can help improve the performance of any machine or deep learning model. A total of 10 augmentation functions were used: four are color-based, while the remaining six are shape-based, transforming the image geometrically. The augmentation functions include color inversion, padding, color correction, soft noise, slight blurring, squeezing, a degradation effect, rotation, compression artifacts, and re-scaling chunks of the image. A few examples of the data augmentations are shown in Fig. 5.

Implementation details and hyper-parameters
The proposed architecture was implemented in PyTorch. The CNN uses the leaky-ReLU activation function, as is standard practice, and batch normalization for faster convergence, as shown in Table 1. The spatial resolution of the feature maps is reduced using max-pooling layers only; the convolutional layers retain the spatial resolution using padding. For the transformer, we used 3 encoder and 3 decoder layers, as this setting gave us the best results in the least computational time; settings with more encoder and decoder layers were also tested but did not yield any performance improvement. After the transformer, a linear layer transforms the output to the shape (B × Sq × V), where B is the batch size, Sq is the output sequence length, and V is the vocabulary size. In our case, owing to the size limitations of our dataset, the vocabulary comprises the characters encountered in the training data plus the special tokens PAD (padding), BOS (Beginning Of Sentence), EOS (End Of Sentence), and UNK (unknown character), as we perform character-level handwriting recognition. A Softmax followed by a cross-entropy loss was used for training and validation. Training was carried out on a single Nvidia RTX 3080 GPU with a batch size of 16, where the right-shifted output sequences were padded with the PAD token up to the maximum sequence length in the batch. We used the Adam optimizer with a learning rate of 0.0003, betas (0.9, 0.98), and epsilon 1e−9. Other learning-rate settings hampered training, with the loss diverging for higher learning rates and convergence slowing for lower ones.

Experiments performed
The experiments mix different datasets and augmentation techniques in order to analyze the effect on the CER (Character Error Rate) of the UHWR validation and test splits. Printed and handwritten Urdu text datasets were mixed to expose the model to more diversity of the Urdu language, so that the proposed architecture could learn a better language model; the results in Sect. 5 verify that printed text helps the model capture this diversity. First, our model was trained on the UHWR train split alone. Then we added the complete Ticker printed text dataset, and finally the complete UPTI 2.0 data as well. The three training setups are therefore: (1) the UHWR train split only; (2) the UHWR train split plus the Urdu Ticker Text dataset; and (3) the UHWR train split plus the Urdu Ticker Text and UPTI 2.0 datasets. The addition of more data from the different distribution of printed text led to a drop in CER on the UHWR validation and test splits, which shows that the transformer indeed learns a language model besides digitizing the input image.

Results
A Character Error Rate (CER) of 6.0% and 6.4% was achieved on the UHWR validation and test splits, respectively, after training our architecture on the UHWR train split alone. These results improved further when the printed Urdu text datasets were added to the UHWR train split for training (see Table 4 for details). The combination of printed and handwritten text datasets adds more diversity of the Urdu language to the data; this diversity is captured by our architecture, and a better language model is learnt as part of the transformer.

Comparison with conv-recurrent architecture
Our proposed architecture is compared thoroughly with the convolutional-recurrent architecture proposed by Zia et al. [2], which is the state of the art in Urdu handwriting recognition. The results given in Table 4 show that we beat the state of the art by a clear margin.
The architecture proposed in Zia et al. [2] combines a separate n-gram word-level language model with a character-level convolutional recursive deep learning model. A proper account of how the two models jointly produce predictions is missing, so it is possible that the n-gram word-level language model overrides the predictions of the character-level deep learning model, reducing the CER on the test set from 7.42% without the LM to 5.49% with it. Overriding the results with an n-gram language model will naturally yield better results on the test dataset, since its text uses proper Urdu words.
Given handwritten text with random letters, the model proposed by Zia et al. [2] with a separate n-gram would fail. Our proposed architecture uses a convolutional transformer that models the problem as the probability of the next character given the previous characters and the feature map extracted from the image, i.e., P(n_c | p_c, c), where n_c is the next character, p_c is the previous character and c is the feature map extracted from the input image through convolutional layers. It thus performs two learning tasks simultaneously: digitizing the input image and learning a character-level language model. The comparison of CER between the two models in Table 4 shows that our proposed architecture outperforms the current state of the art without needing a separate language model. Moreover, as more data is added to the UHWR dataset, even data belonging to a different distribution of printed text, the model performs significantly better than the state of the art, since more data enables the transformer to learn a better language model.
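At inference time, a model of P(n_c | p_c, c) is typically decoded autoregressively: start from the BOS token and repeatedly pick the most probable next character until EOS. A greedy-decoding sketch (the `model(img, tgt)` signature returning (B, Sq, V) logits is an assumption matching the architecture description, not the authors' released code):

```python
import torch

def greedy_decode(model, img, bos=1, eos=2, max_len=100):
    """Autoregressive character decoding: each step conditions on all
    previously emitted characters and the image feature map."""
    tgt = torch.tensor([[bos]])
    for _ in range(max_len):
        logits = model(img, tgt)             # (1, len(tgt), V)
        nxt = logits[0, -1].argmax().item()  # most probable next character
        tgt = torch.cat([tgt, torch.tensor([[nxt]])], dim=1)
        if nxt == eos:
            break
    return tgt[0, 1:].tolist()               # emitted ids, BOS dropped
```

Beam search could replace the `argmax` step, but greedy decoding already exercises the learnt character-level language model.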

Comparison with Google's vision API
Google has recently provided an experimental vision API for Urdu handwriting recognition. We tested this API on the UHWR validation and test splits, and the results were worse than the state of the art: the Google vision API gave CERs of 26.5% and 27.8% on the UHWR validation and test splits, respectively. The Google vision API model may not have been trained on the UHWR dataset, so we also tested it on some random images and compared the results with Zia et al. [2] and our proposed architecture. The details are given in Sect. 5.4.

Smoke testing on random images
Testing on random Urdu handwriting images was performed to check the generalization capabilities of Zia et al. [2], the Google vision API and our proposed architecture. The images were collected by asking a few individuals to write Urdu text on blank sheets of white paper; the scanned images of their handwriting were used for smoke testing. It is evident from the results in Fig. 6 that our proposed architecture gave the best CER on these images.

Ablation studies
An ablation study was performed on the architecture to test the contribution of the transformer decoder in learning a language model that reduces the CER on the UHWR test set. We also performed an ablation study to test the contribution of the Conv layers to the performance of our architecture.

Ablation study: encoder-only with CTC
We completely removed the decoder layers from our architecture and trained only the Conv layers plus the transformer encoder on the UHWR dataset, using the CTC loss as in Zia et al. [2]. This model is similar to that of Zia et al. [2], with a transformer encoder in place of recurrent neural networks such as GRUs, and is shown in Fig. 5.5.1. With this setting, the validation and testing results made it evident that the model performed similarly to Zia et al. [2] without n-gram language modeling: the Conv layers and transformer encoder gave a CER of 7.28% on the UHWR validation split and 7.4% on the test split. We used the same hyper-parameter settings as for our full conv-transformer architecture, changing only the loss function to CTC loss.
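The encoder-only ablation can be sketched as follows: conv feature columns go through a transformer encoder, a linear head produces per-timestep character distributions, and CTC aligns them to the label sequence. Layer sizes are assumptions; the blank token is index 0, PyTorch's default:

```python
import torch
import torch.nn as nn

VOCAB = 128  # assumed character vocabulary size (index 0 = CTC blank)

# Encoder-only variant: no decoder, so no autoregressive language model.
enc_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
head = nn.Linear(256, VOCAB)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(2, 40, 256)             # (B, T, C) conv feature columns
log_probs = head(encoder(feats)).log_softmax(-1).transpose(0, 1)  # (T, B, V)
targets = torch.randint(1, VOCAB, (2, 12))  # label sequences (no blanks)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 40),
           target_lengths=torch.full((2,), 12))
```

Because CTC predicts each timestep independently given the encoder output, this variant cannot exploit inter-character dependencies the way the decoder does, consistent with its higher CER.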

Ablation study: encoder-decoder
We also tested our architecture after removing the Conv layers, in order to assess the impact of the convolutional network as a whole in front of a full transformer. The image was fed directly to the transformer encoder after positional embeddings were added. This setting gave better results than Zia et al. [2], although it was harder for the model to converge during training. We obtained CERs of 6.97% and 7.1% on the UHWR validation and test splits, respectively. Given sufficient data, a full transformer without Conv layers could be used for training and testing; with our limited data, however, the convolution layers play a major role in achieving state-of-the-art results. Again, the same hyper-parameter settings were used to train this variant of our architecture.
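Feeding the image directly to the encoder can be sketched by treating each pixel column of the text line as one token, projecting it to the model width and adding learned positional embeddings. The image dimensions and layer sizes below are assumptions for illustration:

```python
import torch
import torch.nn as nn

H, W, D = 32, 128, 256                 # assumed line height, width, model width

proj = nn.Linear(H, D)                 # pixel column -> model dimension
pos = nn.Embedding(W, D)               # learned positional embeddings
enc_layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=3)

img = torch.randn(2, 1, H, W)          # grayscale text-line images
cols = img.squeeze(1).transpose(1, 2)  # (B, W, H): one token per pixel column
tokens = proj(cols) + pos(torch.arange(W))
memory = encoder(tokens)               # (B, W, D), passed to the decoder
```

Without the conv front-end, the encoder attends over all W columns of raw pixels, so the n^2 attention cost grows with the full image width rather than the pooled feature-map width.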

Analysis of failure cases
Some failure cases of our model are shown in Fig. 7. The characters predicted for each input image show that prediction failures occurred where the image was distorted (Fig. 7b) or where the writing of a character closely resembled another character (Fig. 7a). These errors could be reduced by pre-training our transformer decoder on an Urdu language-modeling task, so that when a character in the input image is ambiguous, the architecture can still predict it from the probability of the next character given the previous characters.

Conclusions and future directions
We modeled the task of Urdu handwriting recognition as a Seq2Seq learning problem inspired by [5] and proposed a Conv-Transformer architecture that eliminates the need for a separate language model. Moreover, the convolution layers at the start of the full transformer act to reduce the spatial resolution of the Urdu handwritten text images and extract important features. The feature maps, with a spatial resolution lower than that of the input image, compensate for the n^2 complexity of the multi-head attention layers of the transformer, leading to reduced training and inference times.

Fig. 7 Examples of inputs with their ground-truth and predicted text given below. a An example of label noise in the dataset: the provided ground truth contains extra words/characters that are not present in the input image, inflating the CER for this example; the model correctly predicts the characters that are present in the image. b A distorted input image: the model is unable to predict the true label, as the image contains characters that are difficult to recognize due to the writer's style. c An input image with a complex, calligraphic writing style that makes a correct prediction difficult. d Another example of label noise: the provided ground truth is incorrect, but the predicted output is correct for the input image, showing that the model produces the right result; the high CER is due to the mismatched ground truth. e, f Cases where the model perfectly recognized the handwritten text, giving a very low CER
To the best of our knowledge, we are the first to propose a deep learning architecture that trains simultaneously on Urdu printed and handwritten datasets to yield state-of-the-art results on the task of Urdu handwriting recognition.
Future directions include pre-training our architecture on a large dataset before fine-tuning on a specific task. The Conv-Transformer encoder can be pre-trained on a vision task such as ImageNet classification, and the transformer decoder can be pre-trained on an Urdu language-modeling task. Such pre-training should greatly enhance accuracy on task-specific datasets after fine-tuning, as both the convolutional and transformer architectures have good generalization capabilities.

Conflict of interest
The authors declare no conflict of interest.