The proposed system, Cultural Emotion Analyzer for Multimodal and Multilingual Sentiment Analysis (CEA-MMSA), leverages deep learning algorithms to interpret and analyze the sentiments encoded in Tamil and Sanskrit Siddha palm leaf manuscripts. The CEA-MMSA framework integrates Vision Transformers (ViTs) for visual sentiment analysis and Gated Recurrent Units (GRUs) with attention mechanisms for textual sentiment analysis. Initially, images of palm leaf manuscripts are collected and pre-processed to remove noise, enhance text clarity, and prepare them for sentiment extraction. The textual data undergoes sentiment analysis using GRUs, where the attention mechanism improves the model’s ability to focus on significant textual elements. Simultaneously, visual data is processed through ViTs, which capture the emotional tone embedded in the images. The multimodal fusion model then combines the textual and visual sentiment data to enhance the depth and accuracy of the analysis. This approach allows for a nuanced understanding of the emotional and cultural narratives within the manuscripts, achieving high performance metrics and contributing to the preservation and interpretation of this invaluable cultural heritage.
3.1 Structure of CEA-MMSA
The objective characteristics are documented by collecting images of Tamil and Sanskrit Siddha palm leaf manuscripts and storing them in a database. Multiple images of manuscripts written in each script are compiled. The collected Tamil and Sanskrit Siddha palm leaf manuscripts used for sentiment analysis are shown in Fig. 1. The following section presents the various ways in which the data collection is enriched and experimented with. Data preparation, the datasets, and all methodologies used to develop the final model of the Cultural Emotion Analyzer for Multimodal and Multilingual Sentiment Analysis in Tamil and Sanskrit Siddha Palm Leaf Manuscripts are detailed in this section.
In palm-leaf manuscripts, the multimodal data appears in the foreground against a dark, leaf-coloured backdrop. Blurring or other morphological techniques can reduce noise in digital camera scans or photographs of palm-leaf manuscripts. In the proposed model, the multimodal data includes the text, emotions, and visual sentiments of Tamil and Sanskrit Siddha medicine palm scripts. The textual and visual sentiment data merge raw data from the two languages before the next step, and text carrying emotion is a powerful vehicle for conveying information, ideas, and feelings. Efficient noise removal is essential for extracting useful data from textual images. So that the characters stand out and are straightforward to manipulate, pre-processing turns the background black and the foreground text white, narrowing the colour space from 0–255 to just 0 and 1. Images are prepared for text-line and visual extraction by performing morphological operations and removing the background. The morphological operations apply a structuring element to the input leaf images so that the final images are all the same size.
Consider a palm leaf manuscript with a black backdrop and white text, where 0 represents the background colour and 1 represents the text in the foreground. Here, T denotes the global threshold, set at the 52-point level. After the background of the input image is eliminated, the processed output is taken as the pre-processed image. Eq. 1 gives the pre-processing rule:
$$\:p\left(x,y\right)=\begin{cases}1 & if\:f\left(x,y\right)\ge\:T\\ 0 & otherwise\end{cases}$$
1
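As a concrete illustration of this pre-processing stage, the sketch below is a minimal OpenCV example, not the authors' exact pipeline: the filename, structuring-element size, and output resolution are illustrative assumptions, while the threshold T = 52 follows the global threshold quoted above. It applies morphological noise suppression and the global thresholding of Eq. 1.

import cv2
import numpy as np

# Illustrative sketch of the pre-processing in Eq. 1 (not the authors' exact pipeline).
# 'manuscript.png' is a placeholder filename; T = 52 follows the global threshold quoted above.
T = 52
image = cv2.imread("manuscript.png", cv2.IMREAD_GRAYSCALE)

# Morphological opening with a small structuring element to suppress speckle noise.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
denoised = cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)

# Global thresholding: p(x, y) = 1 if f(x, y) >= T, else 0,
# giving a black background (0) and white foreground text (1).
binary = (denoised >= T).astype(np.uint8)

# Resize so that all pre-processed images share the same dimensions (size chosen arbitrarily here).
binary = cv2.resize(binary, (1024, 256), interpolation=cv2.INTER_NEAREST)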
In Eq. 1, x and y denote the pixel coordinates of the image inputs from the multiple modalities, while p denotes the corresponding output image. After pre-processing, the data is passed to the GRU for text sentiment analysis, which thoroughly comprehends the emotional information, and to the ViTs for visual sentiment analysis. Two popular models used for text sentiment analysis and classification tasks are Recurrent Neural Networks (RNNs) and GRUs, both of which have many applications to palm leaf manuscripts. An RNN is better at capturing contextual knowledge and is better suited to analysing sequential data, whereas a CNN is better at capturing local features when processing text and at extracting multiple levels of semantic information through multilayer convolution. Architectures like the GRU are commonly used to address vanishing and exploding gradients during RNN training.
Recently, text sentiment analysis in palm leaf manuscripts has made extensive use of the attention mechanism, among other models, to learn about traditional Siddha procedures and classification problems. An attention mechanism allows the model to better comprehend blurred texts from multimodal data in many languages and makes it more adaptable to focusing on information at different points in the input sequence. This work introduces a novel approach to text sentiment analysis and classification using a bidirectional GRU model with an attention mechanism, demonstrating the superior performance of deep learning algorithms on this problem. Meanwhile, the image data used by the ViT often conveys emotions better than words alone. Results from task-based image classification using ViTs have been shown to be more robust than those obtained with CNNs, and the attention process is critical to this performance stability. Evidence suggests that ViT outperforms CNNs in image-classification learning thanks to its multi-head self-attention layers, which allow it to incorporate global feature information across the whole image. This study used the transformer architecture to train three models, one per image type, to obtain task-specific, high-quality features. The multimodal visual sentiment classifier is then modelled using these features.
Additionally, the information is passed to the multimodal fusion model for data fusion. Fusion can occur at various points in the model learning process when intermediate fusion is used, since it combines information at varying depths. When data from multiple sources, such as Tamil and Sanskrit Siddha medicine manuscripts, are combined, a process known as multimodal data fusion is employed; the result is information that is easier to interpret and use. With deep learning, multimodal data fusion has significantly enhanced learning performance. Intermediate fusion takes the higher-level representations learned by the deep learning layers as input. Through intermediate fusion in multimodal deep learning, the machine can learn a representation from each separate model while simultaneously merging the model representations into a hidden layer. Next, a deep learning-based classification layer sorts the multilingual and multimodal data according to the text's emotional tone and the clarity with which each letter in the Siddha palm manuscript can be identified in the visual image.
3.2 GRU with attention mechanism for text sentiment analysis
GRUs can handle time series, text, and speech, representing sequential data. By including an attention mechanism, the model can enhance its ability to express features and capture additional layers of information in phrases drawn from various positions. This study takes a deep learning approach, drawing on the emotional model of the literature, and proposes an attention model that integrates a GRU to address textual sentiment evaluation. This model, shown in Fig. 2, builds a sentiment evaluation for each emotion category by combining the attention mechanism with text sentiment information from vector neurons. The improved feature representation strengthens both the generalizability and robustness of the model. This approach outperforms alternatives that require linguistic and emotional resources in both classification accuracy and brevity.
The core principle of the GRU is that at each time step, the network's hidden state can be selectively updated via gating mechanisms, which control the flow of information into and out of the network. One popular model in neural network research for processing sequence and time-series data is the gated recurrent unit, which is incorporated into the proposed model. Traditional machine learning methods rely on a restricted prefix lexicon as the dependent elements of the semantic model; an RNN, by contrast, can include the entire preceding vocabulary in its linguistic knowledge, resulting in superior performance. However, vanishing or exploding gradients are a problem with traditional RNNs. Attention and the GRU circumvent these problems by allowing the input to selectively alter the model's state at any given point through a structure of specific gates. The GRU uses an update gate instead of separate input and forget gates, making it a variant of the LSTM. Equations (2) through (5) give the GRU structure and the applicable calculation formulas.
$$\:{G}_{t}=\sigma\:({w}^{G}{x}_{t}+{u}^{G}{h}_{t-1})$$
2
$$\:{R}_{t}=\sigma\:({w}^{R}{x}_{t}+{u}^{R}{h}_{t-1})$$
3
$$\:{h}_{t}=\left(1-{G}_{t}\right)\times\:\widehat{{h}_{t}}+{G}_{t}\times\:{h}_{t-1}$$
4
$$\:\widehat{{h}_{t}}=\text{t}\text{a}\text{n}\text{h}({w}^{h}{x}_{t}+{u}^{h}({R}_{t}\times\:{h}_{t-1}))$$
5
In this case, \(\:w\) and \(\:u\) denote the GRU weight matrices, \(\:\sigma\:\) stands for the logistic sigmoid function, and \(\:\times\:\) indicates element-wise multiplication. The update gate \(\:{G}_{t}\), computed from the present input and the prior hidden layer, controls how much the activation of the GRU unit is updated. The hidden layer \(\:{h}_{t}\) and the candidate hidden layer \(\:\widehat{{h}_{t}}\) together determine the layer's state, while the reset gate \(\:{R}_{t}\) blends the newly supplied information with the prior understanding.
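A minimal sketch of a single GRU cell that mirrors Eqs. (2) through (5) is given below (PyTorch; the input and hidden sizes are illustrative assumptions, and in practice a library implementation such as torch.nn.GRU would normally be used rather than a hand-written cell).

import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    """Hand-written GRU cell following Eqs. (2)-(5); layer sizes are illustrative."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.w_g = nn.Linear(input_size, hidden_size, bias=False)   # W^G
        self.u_g = nn.Linear(hidden_size, hidden_size, bias=False)  # U^G
        self.w_r = nn.Linear(input_size, hidden_size, bias=False)   # W^R
        self.u_r = nn.Linear(hidden_size, hidden_size, bias=False)  # U^R
        self.w_h = nn.Linear(input_size, hidden_size, bias=False)   # W^h
        self.u_h = nn.Linear(hidden_size, hidden_size, bias=False)  # U^h

    def forward(self, x_t, h_prev):
        g_t = torch.sigmoid(self.w_g(x_t) + self.u_g(h_prev))        # update gate, Eq. (2)
        r_t = torch.sigmoid(self.w_r(x_t) + self.u_r(h_prev))        # reset gate, Eq. (3)
        h_hat = torch.tanh(self.w_h(x_t) + self.u_h(r_t * h_prev))   # candidate state, Eq. (5)
        h_t = (1 - g_t) * h_hat + g_t * h_prev                       # new hidden state, Eq. (4)
        return h_t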
By simplifying the model and reducing the number of parameters, the GRU helps keep training costs to a minimum. State propagation in a traditional recurrent neural network goes in only one direction, from front to back. Problems arise, however, when the present output depends on both the prior and the subsequent states. The bidirectional recurrent neural network emerged to address the challenge of missing-word prediction, which involves both the preceding context and the message of the current word. Two RNNs are fed the input in opposite directions simultaneously, and their outputs are jointly computed, yielding a more precise outcome. Constructing a bidirectional GRU follows the same pattern as constructing the RNNs in a bidirectional recurrent neural network. In this research, the bidirectional GRU network supplies the representation from which the multi-head attention output matrix Y is learned. During training, the network produces \(\:{h}_{t}\) using two GRUs that represent the sentiment along the forward and backward directions of the text sequence at the same time. Eq. 6 illustrates the precise calculation of the GRU unit's hidden vector.
$$\:{h}_{t}=GRU({x}_{t},{h}_{t-1})$$
6
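In practice, the forward and backward passes can be obtained directly from a bidirectional GRU layer. The sketch below (PyTorch; the vocabulary size, embedding size, hidden size, and batch shape are illustrative assumptions) produces, for each time step, the concatenation of the forward and backward hidden states used in the subsequent attention stage.

import torch
import torch.nn as nn

# Minimal bidirectional GRU sketch; vocabulary, embedding, and hidden sizes are assumptions.
embedding = nn.Embedding(num_embeddings=8000, embedding_dim=128)
bi_gru = nn.GRU(input_size=128, hidden_size=64, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, 8000, (4, 50))   # batch of 4 sequences, 50 tokens each
x = embedding(token_ids)                       # (4, 50, 128)
h_seq, _ = bi_gru(x)                           # (4, 50, 128): forward and backward states concatenated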
With the help of the attention mechanism, the GRU can spotlight the text's most crucial details. In this article, the significant information in the text sequence is captured across several subspaces using multi-head attention. Eq. 7 gives the matrix of d-dimensional word vectors for the supplied text line \(L=(w_1,w_2,\dots,w_n)\) of length n, where \(w_i\) is the i-th word in sentence L.
$$\:L\in\:{R}^{n\times\:d}$$
7
First, the word-vector matrix L is linearly projected and split into three matrices for each subspace: the query \(\:{q}_{i}\), the key \(\:{k}_{i}\), and the value \(\:{V}_{i}\). The attention value of every subspace is then computed in parallel, as given in Eq. 8.
$$\:{head}_{i}=softmax\left(\frac{{q}_{i}{k}_{i}^{T}}{\sqrt{a}}\right){V}_{i}$$
8
Here, \(\:{head}_{i}\) represents the attention value in the i-th subspace, and the scaling factor \(\:\sqrt{a}\) keeps the gradient well behaved during backpropagation by normalising the attention matrix to a standard distribution of values. Next, considering the line spacing, an obstruction in the white space between lines of text can be classified in two ways: touching and overlapping. An obstruction plays a significant role in determining a character's identity in Tamil writing. "Touching text lines" refers to an obstruction that stretches down and approaches the following text line, so that characters in the first and second lines touch. Tamil and Sanskrit text-line splitting is intricate: ignoring an obstruction causes the character "யு" to be read as "య," while cutting an obstruction at a fixed length causes the second-line symbol "ఞ/tha/" to become "ఞి/thi/." As with touching text lines, the same inaccurate-prediction problem occurs when existing text-line separation methods are applied to overlapping text lines. The proposed GRU ensures accurate character predictions by repositioning the obstruction's separating line, eliminating touching and overlapping lines of content. Segmenting the text lines is an essential first step before character segmentation. Line separation becomes more complicated and less effective as lines of writing touch and overlap, and more complex still when lines of text intersect. An obstruction that invades the text zone of the following line interferes with that character's strokes, which can lead to an incorrect or unexpected character interpretation.
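As a concrete illustration of the per-subspace attention in Eq. 8, the following minimal sketch (PyTorch; the head dimension, projection matrices, and tensor shapes are illustrative assumptions) computes one attention head over the BiGRU outputs; a multi-head layer computes several such heads in parallel and concatenates them.

import math
import torch
import torch.nn.functional as F

def attention_head(q_i, k_i, v_i):
    """Scaled dot-product attention for one subspace, as in Eq. 8."""
    a = q_i.size(-1)                                   # head dimension, the 'a' under the square root
    scores = q_i @ k_i.transpose(-2, -1) / math.sqrt(a)
    return F.softmax(scores, dim=-1) @ v_i             # head_i

# Illustrative shapes: one head of dimension 32 over a 50-step BiGRU output sequence.
h_seq = torch.randn(4, 50, 128)                        # stand-in for the BiGRU outputs above
w_q = torch.randn(128, 32); w_k = torch.randn(128, 32); w_v = torch.randn(128, 32)
head = attention_head(h_seq @ w_q, h_seq @ w_k, h_seq @ w_v)   # (4, 50, 32)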
3.3 Vision Transformers (ViTs) for visual sentiment analysis
The pre-processing phase is critical to the learning procedure: it entails cleaning, denoising, and preparing the data before it is fed into the ViT for sentiment analysis, which is trained on multilingual palm leaf manuscripts written in Tamil and Sanskrit. The work employs the Vision Transformer (ViT), introduced in 2020, as its deep learning framework for sentiment analysis of the visual modality. The Vision Transformer is a visual classification approach that applies a Transformer-like architecture to patches of an image. It incorporates architectural elements of the Transformer design commonly employed in natural language processing, such as multi-head attention and scaled dot-product attention. ViT is an alternative to CNNs for image-recognition tasks and outperforms state-of-the-art CNNs in both computational efficiency and accuracy, particularly in large-data regimes. Because ViT can learn its own biases, the inductive biases built into CNNs become superfluous at that scale. The ViT model can learn where and how to attend without such priors because even its lower layers can process both local and global information in an image. ViTs are valuable for interpreting the intricate patterns found in palm leaf manuscripts because they can successfully capture relationships among distant regions of an image. ViTs also handle images of varying sizes more easily than CNNs, which typically require fixed-size inputs. Scaling ViTs up with larger datasets and greater computing power can further improve their efficiency on challenging tasks.
Figure 3 provides a high-level picture of the ViT design. The Tamil and Sanskrit palm leaf manuscripts are divided into patches, which are then flattened. The resulting sequence of flattened two-dimensional patches is fed into the Transformer model. The input image is divided into smaller patches according to its height (h), width (w), and number of channels, formatting the input in a way similar to the natural language processing domain, where each word is represented by its position in the sequence. The number of patches n, for a patch size of p pixels, is given in Eq. 9:
$$\:n=\frac{hw}{{p}^{2}}$$
9
The flattened patches undergo linear projection and positional encoding before being fed into the multilayer Transformer encoders. The building blocks of an encoder are a multilayer perceptron (MLP) module and a multi-head attention module. The classification head is implemented by an MLP with one hidden layer during pre-training and by a single linear layer during fine-tuning. A sequence of embedded patches enriched with positional information is fed into the encoder as input. A classification head attached to the encoder's output takes the value of the learnable class embedding as input and produces the categorisation output. During fine-tuning, a single feedforward layer whose output size equals the number of classes relevant to the task is used in place of the MLP. Throughout training, the input images are consistently cut into a single patch size; however, higher-resolution images are used for fine-tuning. The outcome is a longer input sequence for fine-tuning than for pre-training, and a longer input sequence requires more positional embeddings than were used in pre-training. The MLP head, functioning as the classification module, produces class predictions from the encoder's output.
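The patch-embedding and encoding steps described above can be sketched as follows (PyTorch; the image size, patch size, embedding dimension, and encoder depth are illustrative values in the spirit of ViT-Base rather than the exact published configuration).

import torch
import torch.nn as nn

# Illustrative values: 224x224 three-channel manuscript images, 16x16 patches, 768-dim embeddings.
h, w, c, p, d = 224, 224, 3, 16, 768
n = (h * w) // (p ** 2)                                    # number of patches, Eq. 9

patch_embed = nn.Conv2d(c, d, kernel_size=p, stride=p)     # linear projection of flattened patches
cls_token = nn.Parameter(torch.zeros(1, 1, d))             # learnable class embedding
pos_embed = nn.Parameter(torch.zeros(1, n + 1, d))         # positional embeddings

images = torch.randn(4, c, h, w)
patches = patch_embed(images).flatten(2).transpose(1, 2)   # (4, n, d)
tokens = torch.cat([cls_token.expand(4, -1, -1), patches], dim=1) + pos_embed

encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)
encoded = encoder(tokens)                                   # class token encoded[:, 0] feeds the MLP head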
3.4 Multimodal fusion model
The intermediate fusion architecture combines the outputs of the GRU-with-attention and ViT models to construct a multimodal, multilingual representation of the Siddha palm leaf text for in-depth analysis. The construction of a dynamic text and image classifier that works across modalities is depicted in Fig. 4. Because the different perceptions complement each other, multimodal fusion improves emotion-recognition ability. The benefits of deep neural networks are better exploited in model-level fusion than in decision-level or feature-level fusion. Here, the Transformer model combines modalities at the model level for the different languages. After the text and visual modalities of the Siddha palm leaf manuscript are encoded, multi-head attention uses a common semantic feature space to generate multimodal emotional intermediate representations. Data fusion can occur at various points in the model learning process under intermediate fusion, since it combines information at varying depths, and with deep learning it has significantly enhanced learning performance. Intermediate fusion takes the features, or higher-level representations, learned by the various deep learning layers as input and merges the multiple model representations into a hidden layer, allowing the algorithm to learn a representation from each of the models; the fusion layer is what actually performs the fusion. This research recommends a ViT-based fusion for the multimodal, aesthetic, and affective analysis of Siddha medicines found in palm manuscripts. Three models are trained, one for single-modality sentiment, one for facial emotion, and one for textual sentiment, using Transformer backbones: ViT for visual content and a GRU with an attention mechanism for textual information. The single combined feature extracted by each model is then passed to an MLP classification head. The outcome is an improved analysis of Siddha medicine from palm scripts, which can also be helpful for letter identification.
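The intermediate fusion just described can be sketched as follows (PyTorch; the feature dimensions, the number of attention heads, the number of sentiment classes, and the simple two-token fusion scheme are illustrative assumptions rather than the exact published architecture). The pooled text feature would come from the attention-weighted BiGRU outputs and the visual feature from the ViT class token.

import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Sketch of intermediate fusion: project text and visual features into a shared
    semantic space, mix them with multi-head attention, and classify with an MLP head."""
    def __init__(self, text_dim=128, vis_dim=768, shared_dim=256, num_classes=3):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.vis_proj = nn.Linear(vis_dim, shared_dim)
        self.attn = nn.MultiheadAttention(shared_dim, num_heads=4, batch_first=True)
        self.mlp_head = nn.Sequential(
            nn.Linear(2 * shared_dim, shared_dim), nn.ReLU(),
            nn.Linear(shared_dim, num_classes))

    def forward(self, text_feat, vis_feat):
        # Stack the two modality tokens and let multi-head attention mix them.
        tokens = torch.stack([self.text_proj(text_feat), self.vis_proj(vis_feat)], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)    # (batch, 2, shared_dim)
        return self.mlp_head(fused.flatten(1))          # sentiment logits

# Example with random stand-ins for the pooled text feature and the ViT class-token feature.
logits = FusionHead()(torch.randn(4, 128), torch.randn(4, 768))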
Vision Transformers offer a practical and versatile method for the visual sentiment analysis of multilingual palm leaf manuscripts. Despite the inherent complexity and peculiarities of these materials, ViTs can decipher their emotional and cultural significance using the capabilities of transformer architectures. Both sentiment analysis and the study and preservation of historical artefacts benefit from this application.
3.5 CEA-MMSA Multimodal Fusion Algorithm
// Algorithm 1: CEA-MMSA
// Step 1: Data Collection and Storage
1.1 Collect palm leaf manuscript images in Tamil and Sanskrit.
1.2 Store collected images in a database.
// Step 2: Preprocessing
FOR each image I IN database:
2.1 Convert I to grayscale.
2.2 Apply morphological techniques to reduce noise.
2.3 Binarize with inverted colors (background 0, foreground text 1):
FOR each pixel (x, y) IN I:
IF pixel value f(x, y) >= threshold T:
SET pixel value p(x, y) = 1
ELSE:
SET pixel value p(x, y) = 0
2.4 Ensure final images are the same size.
// Step 3: Text Sentiment Analysis Using GRU with Attention Mechanism
INITIALIZE GRU model parameters.
FOR each textual input sequence S:
3.1 Encode S using embedding layers.
3.2 Pass encoded text through GRU layers.
3.3 Apply attention mechanism to focus on important parts of S.
// GRU Equations
COMPUTE update gate G_t:
G_t = sigmoid(W^G * x_t + U^G * h_(t-1))
COMPUTE reset gate R_t:
R_t = sigmoid(W^R * x_t + U^R * h_(t-1))
COMPUTE candidate hidden state h_hat_t:
h_hat_t = tanh(W^h * x_t + U^h * (R_t ⊙ h_(t-1)))
UPDATE hidden state h_t:
h_t = (1 - G_t) ⊙ h_hat_t + G_t ⊙ h_(t-1)
// Step 4: Visual Sentiment Analysis Using Vision Transformers (ViTs)
INITIALIZE ViT model parameters.
FOR each image I:
4.1 Divide I into patches.
4.2 Flatten and linearly project patches.
4.3 Add positional embeddings.
4.4 Pass patch sequence through transformer encoder layers.
4.5 Use multi-head attention to capture global features.
// ViT Patch Processing
COMPUTE number of patches n:
n = (h * w) / (p^2)
APPLY linear projection and positional encoding to each patch.
// Step 5: Multimodal Fusion and Classification
5.1 Extract features from both GRU and ViT models.
5.2 Fuse features at intermediate layers using a common semantic space.
5.3 Combine textual and visual features using multi-head attention.
5.4 Pass combined features through MLP classification head.
// Step 6: Classification
FOR each combined feature vector:
6.1 Apply MLP classification layers.
6.2 Output sentiment classification.
// Step 7: Evaluation
7.1 Calculate accuracy, precision, recall, and F1 score on the validation set.
7.2 Compare against baseline models.
END
CEA-MMSA involves collecting and storing images of Tamil and Sanskrit palm leaf manuscripts, preprocessing them to reduce noise and enhance readability, and then using GRU with an attention mechanism for text sentiment analysis and Vision Transformers (ViTs) for visual sentiment analysis. The model fuses textual and visual features through intermediate layers, combining them with multi-head attention, and classifies the sentiment using a multi-layer perceptron (MLP). The system is evaluated for accuracy, precision, recall, and F1 score, demonstrating its effectiveness in analyzing and classifying sentiments from multimodal and multilingual data.
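The evaluation step can be reproduced with standard library calls; a minimal sketch (scikit-learn, with placeholder label arrays standing in for the validation split) is shown below.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder labels; in the actual evaluation these come from the validation split.
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 2, 0, 0, 2]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print(f"Accuracy {accuracy:.3f}  Precision {precision:.3f}  Recall {recall:.3f}  F1 {f1:.3f}")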