Examining emotions in English and translated Chinese children’s literature: a bilingual emotion detection model based on LLMs

doi:10.21203/rs.3.rs-4350089/v1

Download PDF

Research Article

Examining emotions in English and translated Chinese children’s literature: a bilingual emotion detection model based on LLMs

https://doi.org/10.21203/rs.3.rs-4350089/v1

This work is licensed under a CC BY 4.0 License

Version 1

posted

You are reading this latest preprint version

The study of emotions within the language sciences has been an area of scholarly interest since the 1880s. Emotion analysis in this field primarily examines the expression of emotions in various texts, encompassing a broad spectrum from online commentary to classical literature. Recent years have seen an increased emphasis on the detection and analysis of emotions within children's literature. This burgeoning interest is motivated by the recognition that a deeper understanding of the emotional layers embedded in children's stories can greatly enhance the insights of educators and caregivers into the emotional development and experiences conveyed through these narratives. While the majority of research in this field has concentrated on the analysis of emotion in monolingual datasets, efforts to explore emotion within bilingual contexts, such as in translated children’s literature, are relatively rare. To address this gap, this paper firstly compiles a bilingual Chinese-English dataset of emotions from a parallel Chinese-English classical children’s literature corpus. Then, the dataset is fine-tuned and evaluated on different Large Language Models (LLMs). The fine-tuning results indicate that the GPT-3.5-turbo model surpasses other language models, reaching its best performance with an F1 of 0.869. This performance denotes not only the feasibility of Chinese-English bilingual emotion detection, but also the applicability of this modelled dataset for future Chinese-English emotion detection tasks.

Emotion detection

Bilingual emotion analysis

Large Language Models

fine-tuning

Children’s literature

The study of emotion analysis, referred to as the exploration of emotions within language sciences (Majid, 2012), has begun since the 1880s (James, 1884; Johnson-Laird & Oatley, 1989; Love, 2007; Wilce, 2009). It aims to determine the emotional tone or feelings conveyed by written content, such as text from social media posts (Brynielsson et al., 2014; Kohout et al., 2023), reviews (Tang et al., 2009), news articles (Edwards, 1999; Lin et al., 2008), or any other forms of written communication (Rimé, 2009). Recent decades, natural language processing (NLP), which predominantly focusing on language-based emotion classification datasets (Demszky et al., 2020; Sosea & Caragea, 2020), has greatly advanced emotion analysis. These datasets, which encompass diverse sources, including tweets (Mohammad & Bravo-Marquez, 2017), news articles (Staiano & Guerini, 2014), and literary texts (Haider et al., 2020).

Among all these language-based emotion classification datasets, the domain of children’s literature is currently gaining prominence as a significant focal point for emotion analysis research (Adukia et al., 2022; Moruzi et al., 2017; Kaya et al., 2017; Oberländer et al., 2018). This is perhaps due to children’s literature’s unique capacity to encapsulate a wide spectrum of emotions (Nikolajeva, 2014), intricately woven into narratives designed to engage and resonate with young readers. Children’s literature often serves as a rich repository of emotional expression, offering “detailed information both about the characters’ physical appearance and about their emotions and thoughts” (Nodelman, 2008, p. 13). As a result, analyzing emotions within children’s literature offers a valuable opportunity to delve into the emotional landscapes that shape young minds and contribute to their cognitive and emotional development (Wang et al., 2015). Through emotion analysis, we can not only enhance our understanding of literary narratives but also enrich our insights into the emotional dynamics that influence children’s perceptions, experiences, and growth.

However, up to date, few attempts have been made to study emotion in translated children’s literature (Schwieter & Ferreira, 2017). Translated texts pose unique challenges as they not only inherit the intricate cultural and linguistic complexities of the original work but also represent the norms and expectations of the target culture (Toury, 1995). Investigating emotions in bilingual contexts assumes significance, as it allows for in-depth linguistic exploration of the intricate cultural and linguistic intricacies in emotions that arise from language differences. Through the provision of a bilingual dataset, researchers can delve into the cross-linguistic and cross-cultural variations in the expression of emotions, thereby facilitating a deeper comprehension of emotion recognition and sentiment analysis within bilingual contexts. Moreover, The establishment of a unified system for detecting emotion categories across both English and Chinese languages holds immense value, serving as a valuable tool not only for future studies detecting emotions in Chinese-English language pairs but also for examining the nuances of translations and conducting comprehensive cross-cultural analyses.

To address the research gap, this paper attempts to compile a bilingual Chinese-English dataset of emotions. The dataset is composed of 1,116 Chinese and English sentences annotated with emotions taken out of a parallel Chinese-English classical children’s literature dataset. The emotion taxonomy in this paper, adapted from Parrott’s (2001) five basic emotions (Joy, Sadness, Anger, Fear, and Love), is designed considering the psychological and practical implication in the dataset. Subsequent steps involve deploying supervised machine learning techniques (Support Vector Machines, Naïve Bayes, Logistic Regression), unsupervised deep learning with Neural Networks, and advanced Transformer models (XLM-RoBERTa, GPT family) for fine-tuning on labelled dataset. The fine-tuning results indicate that the GPT-3.5-turbo model from GPT family consistently surpasses other models, reaching its best performance with an F1 Score of 0.869 and Accuracy of 0.958—a substantial advancement compared to other models, such as GoEmotions (F1 score 0.46).

The contributions of this paper include: (1) it introduces BilingualChildEmo, a novel children-related bilingual dataset for emotion detection composed of two languages; (2) it evaluates the bilingual fine-grained emotion detection task and establishes strong baselines based on GPT and variants; (3) it exams different supervised and unsupervised pre-training techniques, shedding light on the significance of selecting the appropriate pre-training domain.

Research into emotion detection has gained substantial traction in the field of computational linguistics over the past two decades (Picard, 1995/2000; Chuang & Wu 2002, Mihalcea & Liu 2006, Ahmad 2008, Strapparava & Mihalcea 2008, Chen et al. 2009, Lee et al. 2009, Lee et al. 2010). Current investigations in text-based emotion detection encompass various domains. One prominent avenue involves the examination of emotions within the context of online social media platforms. This encompasses a wide range of data sources, from theme-based book reviews and movie comments on platforms like Goodreads (Dimitrov et al., 2015) to the unfiltered expression of thoughts and sentiments on platforms such as Twitter and Reddit (Demszky et al., 2020). Another noteworthy direction centers around the analysis of emotions in literary classical works, including fairy tales (Mohammad, 2012), among others. However, it is essential to acknowledge that a substantial portion of research in this domain predominantly relies on monolingual datasets, which poses limitations in understanding the complexities of emotions in bilingual contexts. While some scholars have recognized the importance of incorporating bilingual or multilingual datasets, their approach often involves annotating emotions in a single language and subsequently relying on rudimentary translation methods, such as Google Translate, to render the annotated dataset to other languages (Mohammad & Turney, 2010). This methodology, unavoidably, introduces potential errors and inaccuracies in the resulting translated dataset. Therefore, a more dedicated exploration of the bilingual perspective is crucial to enhance emotion detection in Chinese-English language pairs, analyze emotional nuances in translations and cross-cultural studies, and improve cross-linguistic and cross-cultural analyses of emotions.

Previous research has predominantly employed diverse emotional taxonomies, including Ekman’s (1992) six basic emotions (Joy, Anger, Fear, Sadness, Disgust, and Surprise), Plutchik’s (2003) eight basic emotions (Joy, Sadness, Anger, Fear, Trust, Disgust, Surprise, and Anticipation), Parrott’s (2001) five basic emotions (Joy, Sadness, Anger, Fear, and Love), and the extensive GoEmotions taxonomy with over 27 emotions (Demszky et al., 2020). The emotion taxonomy in this dataset is founded on established research, acknowledging four primary emotions are Happiness, Sadness, Anger, and Fear (Lee, 2015). This study adopts Parrott’s (2001) classification, which includes Love except the four primary ones, aligning with the focus on children’s literature. This choice is underpinned by two primary considerations: firstly, Love serves as a cornerstone in the emotional development of children, aiding in the cultivation of empathy and social cohesion (Shaver et al., 1996); secondly, Love consistently emerges as a prevalent theme within children’s literature, enriching the analysis of emotions and contributing to the fostering of positive emotional attitudes among young readers.

For the trends and application models in the identification and analyses of emotions in languages, there are mainly rule-based approach, machine learning-based approach and deep learning approach. In the rule-based emotion analysis, the use of emotion-bearing words and their combinations to assess phrasal units for emotions has been a primary focus of emotion analysis research for a long time (Aman & Szpakowicz, 2007, Chen et al, 2009, Lee et al., 2013). Popular emotion lexicons includes NRC Lexicon (Mohammad & Turney, 2010; Mohammad & Turney, 2013), ANEW (Bradley & Lang, 1999; Nielsen, 2011) and the Valence Arousal Dominance Lexicon (Mohammad, 2018). The machine learning-based approach to emotion analysis entails converting text emotion analysis into a classification task. This involves the application of established algorithms like Support Vector Machines (SVMs), Naive Bayes, Logistic Regression, and other machine learning methods (Aman & Szpakowicz, 2007; Danisman & Alpkocak, 2008; Deshpande & Rao, 2017). Although bag-of-words models have demonstrated promise in the domains of speech emotion recognition (Jain et al., 2020; Kwon et al., 2003) and facial emotion detection (Michel & El Kaliouby, 2003; Susskind et al., 2007), there exists considerable potential for refinement within the context of text-based emotion analysis in terms of sparse data features. In recent years, deep learning approach, such as Convolutional Neural Networks (CNNs), Long Short-Term Memory Networks (LSTMs), and Recurrent Neural Networks (RNNs), have gained significant prominence in the field of text emotion analysis (Zhou & Long, 2018). These approaches have garnered attention due to their demonstrated ability to address the inherent limitations associated with traditional machine learning techniques. More recent advancements in transformer-based models, such as BERT and the GPT family, which incorporate language model pre-training, have demonstrated significantly improved performance (Guu et al., 2020).

Scholarly interest in the analysis of emotions in children’s literature emerged as early as the late twentieth century, as evidenced by Stevenson’s work in 1997. Previous research in this field has primarily focused on several key areas, including the detection and recognition of emotions within children’s literary texts, as explored by Alm et al. (2005) and more recently by Zad et al. (2021). Additionally, scholars have examined the socio-cultural implications of emotions depicted in children’s literature, as demonstrated by Adukia et al.’s research in 2022, and have conducted psychological examinations of hypotheses related to emotions, exemplified by Jacobs et al.’s work in 2020.

The methodologies employed in the analysis of emotions in children’s literature share commonalities with the broader field of emotion analysis. Researchers have employed both rule-based and machine learning-based approaches to investigate emotions in literary works. For instance, Saif Mohammad’s studies in 2010 and 2013 utilized these approaches to scrutinize emotions in novels and fairy tales. His research highlighted the potential of sentiment analysis when combined with effective visualization techniques, enabling the quantification and monitoring of emotions within individual books and extensive literary collections. Similarly, Alm et al. (2005) adopted a supervised machine learning approach, utilizing the SNoW learning architecture, to explore text-based sentiment prediction. Their investigation provided valuable empirical insights into this facet of sentiment analysis. Furthermore, Jacobs et al. (2020) directed their research toward the Pollyanna Effect, employing the model-based unsupervised vector space sentiment analysis tool known as SentiArt. This approach allowed for a nuanced exploration of sentiment dynamics within literary texts, making a significant contribution to the broader discourse on emotion analysis in literature.

However, it is worth noting that recent advancements in transformer-based models, such as BERT and the GPT family, have yet to be applied to emotion analysis in children’s literature. To address this gap in the literature, the current study aims to leverage BERT and GPT family models in conjunction with high-quality data annotated with emotions. This approach seeks to enhance our understanding of emotional content within children’s literature, drawing upon the latest developments in natural language processing techniques.

3.1 BilingualChildEmo

This paper constructed a dataset named BilingualChildEmo. This bilingual Chinese-English dataset of five basic emotions comprises 1,116 bilingual Chinese and English sentences, each annotated with emotional labels (Joy, Sadness, Anger, Fear, and Love). The sentences were selected from a parallel Chinese-English classical children’s literature corpus using a systematic random sampling method. To mitigate potential translation biases during annotation tasks, 1,116 sentences were independently chosen from both the Chinese and English datasets within the corpus, ensuring that the dataset isn’t merely a parallel translation. For reference, Table 1 provides a representative selection of sample texts from the BilingualChildEmo dataset.

Table 1

Examples in the BilingualChildEmo dataset
Sample Text	Label
So overjoyed were they at their deliverance that they laughed aloud, and the Earth seemed to them like a flower of silver, and the Moon like a flower of gold.	Joy
巨人欣喜若狂地跑下楼梯, 岀了房子冲进花园。	Joy
And in the morning he rose up, and plucked some bitter berries from the trees and ate them, and took his way through the great wood, weeping sorely.	Sadness
可怜的人儿, 失去了他们唯一的儿子!	Sadness
织工气愤地看着他, 说: 你看我干什么༟	Anger
"Upon my word," said the Miller with anger, "you are very lazy."	Anger
But his face was strangely pale, and as he fell upon the deck the blood gushed from his ears and nostrils.	Fear
据说那个墓穴里还躺着一个人, 死者是一个异常英俊美丽的青年, 他的双手用绳子反绑着, 胸部被捅了很多刀, 衣服都被血染红了。	Fear
I am his best friend, and I will always watch over him, and see that he is not led into any temptations.	Love
比如说, 新娘和新郎这么年轻就彼此相爱了。	Love

The design of the emotion taxonomy in this dataset is grounded in an established framework, recognizing Happiness, Sadness, Anger, and Fear as primary emotions widely acknowledged in prior research (Lee, 2015). Building upon this foundation, this study adopts Parrott’s (2001) classification of five basic emotions, which notably includes Love, aligning with the genre of current study, that is, children’s literature. This adoption stems from two key considerations. Firstly, Love holds a pivotal role in children’s emotional development. It serves as a cornerstone for nurturing empathy, fostering healthy attachment, and facilitating social growth (Haslip et al., 2019). Through experiences of love and affection, children cultivate a sense of security, belonging, and emotional well-being, all essential elements for their overall resilience and development. Secondly, Love emerges as a consistent and significant theme in children’s literature. Within these narratives, emotional themes are carefully crafted to resonate with young readers, aiming to evoke empathy and foster emotional engagement. By exploring the theme of Love within children’s literature, we not only enrich the emotional landscape of our analysis but also contribute to reinforcing positive attitudes among children, thereby promoting the development of their emotional competence.

Furthermore, it is essential to elucidate the rationale behind adopting a sentence-level approach for emotion detection within this study. The decision to focus on sentences as the fundamental unit of analysis is underpinned by the belief that sentences represent a more contextually appropriate and semantically meaningful unit for the expression of emotions within the literary genre (Yang & Cardie, 2014). While this approach acknowledges, to some extent, the potential for complex linguistic devices such as irony and metaphor, which can be challenging to discern when examining emotions solely at the word level, it is worth noting that detecting irony and metaphors remains challenging even at the sentence level. This suggests that future studies may benefit from providing additional context to accurately identify figurative expressions.

3.2 Annotation

The three annotators recruited for this dataset annotation were English-Chinese bilingual language professionals with proficiency in both Chinese and English. Annotators were instructed to identify the emotions they experienced while reading the provided sentences. Prior to commencing the annotation task, they were equipped with predefined emotion definitions and illustrative examples, as outlined in Appendix.

Annotators were instructed to select a singular emotion descriptor for each sentence, opting for the emotion that they felt most confident in attributing. They were encouraged to consider both explicit expressions and the implicit contextual inference to identify the emotion conveyed in the sentence. Emotion labels were to be assigned based on the predominant emotion conveyed by the sentence, with annotators prioritizing the most dominant emotion in cases of multiple emotions. Annotations were expected to be consistent with the definitions provided above, taking into account cultural nuances and linguistic expressions.

In instances where annotators encountered difficulties in annotating a particular sentence, such as when emotions were not readily discernible or when overly intricate emotions were present, they were guided to employ the label ‘Other’ for annotation purposes. For instance, in Example 1, emotions were not readily discernible, while in Example 2, emotions were deemed rather complicated and overwhelming. Sentences designated as ‘Other’ were subsequently excluded from the dataset during the compilation process.

Example 1

小燕子開始想事, 很快就入睡了。(Back translation: Little Swallow began to ponder and soon fell asleep.) (Emotion: Other)

Example 2

The child of the old King’s only daughter by a secret marriage with one much beneath her in station—a stranger, some said, who, by the wonderful magic of his lute-playing, had made the young Princess love him; while others spoke of an artist from Rimini, to whom the Princess had shown much, perhaps too much honour, and who had suddenly disappeared from the city, leaving his work in the Cathedral unfinished—he had been, when but a week old, stolen away from his mother’s side, as she slept, and given into the charge of a common peasant and his wife, who were without children of their own, and lived in a remote part of the forest, more than a day’s ride from the town. (Emotion: Other)

To gauge the level of agreement among the annotators, agreement scores were computed both for overall agreement and individual emotions, utilizing Fleiss’ Multirater Kappa statistic, as detailed in Table 2. The computed average Fleiss’ kappa value across all emotion categories was 0.886, with individual values ranging from 0.826 to 0.921. Following the interpretative framework elucidated by Landis and Koch (1977), kappa values exceeding 0 signify varying degrees of agreement that surpass mere chance among two or more raters, with a maximum attainable value of + 1 denoting perfect agreement, indicating complete consensus among the raters on all items. The final emotion assignment for each sentence was determined through the application of the majority vote rule.

Table 2 Results of Fleiss Multirater Kappa

3.3 Data analysis

3.3.1 Emotion distribution

Figure 1 visually represents a scatter plot, offering insights into the distribution of the five primary emotions within the BilingualChildEmo dataset. This dataset encompasses a total of 1,116 instances of emotions, with Joy emerging as the most prevalent emotion, occurring 332 times. Overall, the patterns are in line with findings in other or general domains. In close proximity, Sadness follows closely with 309 occurrences. Existing scholarship, exemplified by Nikolajeva (2013), has posited that children’s literature frequently incorporates themes centered around Joy and Sadness. These emotionally vivid and contrasting themes are believed to aid children in comprehending and managing their own “emotion literacy” (p. 249). The substantial prevalence of Joy and Sadness within the corpus may be indicative of a deliberate emphasis on these emotional themes within the children’s literature contained in this dataset.

On the other hand, Anger and Fear are observed with lower frequencies, appearing 187 and 158 times, respectively. This reduced occurrence of Anger and Fear could potentially be attributed to the inherent nature of children’s literature, as discussed by Logan (1998). Such literature often tends to steer clear of explicit depictions of violence or frightening content, a precaution taken to safeguard the emotional well-being of young readers. Lastly, Love emerges as the least frequently depicted emotion, appearing in only 130 instances. Given the complexity of the emotion of Love, it is plausible that it is less frequently portrayed in children’s literature, possibly owing to the challenges young readers may face in fully grasping its intricacies and nuances. Additionally, the lowest agreement score among annotators in the Love category (0.826) suggests potential challenges in identifying Love, particularly in distinguishing it from Joy within the annotated texts.

3.3.2 A comparison of emotions in Chinese and English sentences

The BilingualChildEmo dataset presents a distinctive bilingual character, necessitating a comprehensive comparative analysis between Chinese and English texts to foster a foundational comprehension of the emotion detection task. The examination of emotion distribution within this parallel corpus of Chinese and English children’s literature serves as an illuminating lens through which we can discern the prevalence of diverse emotional states across both linguistic domains. This comparative inquiry affords valuable insights into the cultural and linguistic distinctions that shape emotional expression in the two languages.

Upon examination of the data presented in Fig. 2, it becomes evident that Sadness ranks first in Chinese and second English, making it one of the most frequently encountered emotions in both Chinese and English sentences. Nevertheless, it is noteworthy that Sadness exhibits a more pronounced presence in Chinese sentences (194) when contrasted with English sentences (115). This discernible discrepancy may be attributed to the cultural disparities that permeate the two linguistic contexts. Chinese culture is notably characterized by its emphasis on the open expression of emotions, particularly Sadness, which finds resonance within the literature (Becker & Kleinman, 2013). The heightened occurrence of Sadness within Chinese literature aligns with these cultural proclivities. Similarly, Fear manifests with greater frequency in Chinese sentences (92) in comparison to English sentences (66). This observation further underscores the influence of culture on emotional themes within literature. For instance, in Chinese culture, sentences associated with supernatural elements or monsters are often categorized as inducing Fear (See Example 3). However, in English, such expressions are typically considered neutral and lack the connotations of horror (See Example 4) commonly found in translated Chinese literature. Interestingly, the occurrence of Love remains identical in both the Chinese and English corpora, hinting at the possibility of Love as a universally transcendent emotion that transcends cultural and linguistic boundaries.

Example 3

他跪在耶穌像前, 神龕旁的大蠟燭燃燒得很亮, 香燒起的縷縷青煙在穹蓋形成薄薄的霧環。(Back translation: He knelt in front of the statue of Jesus. The big candle next to the shrine burned brightly, and the wisps of green smoke from the burning incense formed a thin ring of mist on the dome.) (Emotion: Fear)

Example 4

He knelt before the image of Christ, and the great candles burned brightly by the jewelled shrine, and the smoke of the incense curled in thin blue wreaths through the dome. (Emotion: Joy)

In the realm of Joy and Anger, Chinese sentences exhibit a marginally higher frequency of Joy (177) than their English counterparts (155). Moreover, English sentences showcase a slightly decreased prevalence of Anger (90) in comparison to Chinese sentences (97). These nuanced distinctions can be ascribed to inherent linguistic disparities between the two languages. For instance, it is conceivable that the Chinese language presents a more expansive array of emotional expressions to convey the sentiment of Joy, whereas the English language may encompass a more comprehensive lexicon to articulate the intricate nuances associated with the expression of Anger, as suggested by Ekman (1973).

Overall speaking, there are more emotion expressions in the Chinese sentences than that of English sentences. This emphasizes the potential for translations to evoke changes in emotions. Additionally, it illuminates the unique characteristics of translated literature (exemplified by Chinese sentences) and the possibilities of emotions undergoing shifts in bilingual contexts during the translation process. For instance, in Example 5, the translated Chinese sentence “這絕唱 (This final song)” conveys the emotion of Sadness, while its source text “It” remains emotionally neutral. This exemplifies how translations may imbue target language sentences with additional emotional nuances not present in the source text, consequently resulting in a higher frequency of emotional expressions in Chinese sentences compared to English sentences.

Example 5

這絕唱隨著河流的浪頭飄去, 把它的餘音一直傳向大海。 (Back translation: This final song floats away with the waves of the river, continuously transmitting its lingering sound towards the sea.) (Emotion: Sadness)

Source text: It floated through the reeds of the river, and they carried its message to the sea. (Note: this sentence is just the source text and is not included in the BilingualChildEmo dataset so it’s not labelled with emotions)

In summation, the scrutiny of emotion distribution within a parallel corpus of Chinese and English children’s literature underscores the cultural and linguistic distinctions intrinsic to these languages. It is imperative to acknowledge and incorporate these distinctions when interpreting and comprehending emotional expressions across diverse linguistic and cultural contexts. Such recognition fosters a more profound understanding of the intricate interplay between language, culture, and emotion within the realm of children’s literature.

In this study, the modelling of Parrot’s fundamental set of five emotions within the BilingualChildEmo dataset is undertaken through an array of machine learning and deep learning methodologies. These encompass supervised learning techniques, including SVMs, Naïve Bayes, and Logistic Regression, unsupervised learning via Neural Networks, and self-supervised learning through the fine-tuning of transformer models such as XLM-RoBERTa and the GPT-3.5-turbo-1106 model (an integral component of the GPT family of language models).

Initially, during the data preparation phase, frequency calculations are executed for both the English source text, employing AntConc, and the translated Chinese target texts, utilizing Sketch Engine. Tokenized data, conducive to emotion analysis, is prepared through the application of Stanford NLP Group’s Stanza. Subsequent phases entail the deployment of diverse techniques, commencing with supervised machine learning approaches (namely SVMs, Naïve Bayes, and Logistic Regression), followed by unsupervised deep learning employing Neural Networks, and culminating in the utilization of advanced Transformer models (specifically XLM-RoBERTa, GPT-curie, GPT-davinci, and GPT-3.5-turbo-1106) for fine-tuning on annotated datasets.

In this study, the exploration revolves around the viability of predicting emotions within the framework of five distinct categories. This endeavor is facilitated through the employment of multiclass classification models, which encompass SVMs, Naïve Bayes, Logistic Regression, and Neural Networks. To facilitate this analysis, the dataset undergoes a partitioning process, where it is divided into training and testing sets, constituting 80% and 20% of the data, respectively. Both supervised and unsupervised learning models are scrutinized, in conjunction with Multilingual Sentence-BERT (SBERT) word embeddings (Reimers & Gurevych, 2019), as the chosen methodology. The evaluation of model performance is conducted via an assessment based on a suite of metrics, including the F1 score, precision, recall, and accuracy, which collectively reveal promising outcomes in the context of emotion prediction. Notably, this research is executed within the framework of the Orange platform, and the schematic representation of the workflow is depicted in Fig. 3, providing a visual synopsis of the entire process.

Additionally, the study investigates transformer-based language models, particularly XLM-RoBERTa and GPT, renowned for their excellence in natural language processing tasks (Lauriola et al., 2022; Worsham & Kalita, 2020). The primary objective is to further elevate the accuracy of emotion classification. To attain this goal, the study initiates with a comprehensive examination of the structural intricacies and algorithmic foundations characterizing transformer-based language models. Subsequently, the training phase unfolds, employing the comprehensive BilingualChildEmo dataset, which encapsulates a diverse array of emotional expressions within the context of children’s literature, spanning a rich tapestry of linguistic styles and contextual nuances. Following the rigorous training regimen, the model is subjected to a meticulous evaluation, wherein a comprehensive array of performance metrics, including the F1 score, precision, recall, and accuracy, is invoked to methodically assess its efficacy and robustness.

4.1 Supervised machine learning: SVMs, Naïve Bayes and Logistic Regression

This study initially focuses on three prominent supervised machine learning algorithms—SVMs, Naïve Bayes, and Logistic Regression—for the task of emotion classification. SVMs is recognized for its efficacy in constructing optimal hyperplanes to delineate data points into distinct classes, thereby maximizing inter-class separation and ensuring robust classification performance (Raschka & Mirjalili, 2019). Within the scope of this research, SVMs is employed to discern various emotional categories by learning discernible patterns within the provided training data. Naïve Bayes, a probabilistic classifier rooted in Bayes’ theorem and based on the assumption of feature independence, exhibits efficiency and effectiveness, particularly in handling high-dimensional data (Sen et al., 2020). In the realm of emotion classification, Naïve Bayes is adept at predicting the most likely emotion category based on observed features, such as word frequencies or linguistic patterns. Logistic Regression, a linear model well-suited for binary or multiclass classification tasks, estimates the probability of data points belonging to specific classes by fitting a logistic function to the input features (Sen et al., 2020). In the context of this study, Logistic Regression is harnessed to forecast the probability of an instance being associated with a particular emotion, grounded in its acquired knowledge of the relationships between features and emotional categories.

These supervised machine learning algorithms have a robust track record in diverse classification tasks, including text classification and sentiment analysis (Dawei et al., 2021). Within the scope of this study, our objective is to assess their performance in the domain of emotion classification, leveraging suitable feature representations and evaluation metrics such as the F1 score and accuracy. Through comparative analysis of their outcomes, we endeavor to identify the most effective approach for emotion classification while illuminating the inherent strengths and limitations of each method.

4.2 Unsupervised deep learning: Neural Network

Differing from supervised machine learning, unsupervised deep learning is directed at uncovering latent patterns and structures within unlabeled data (Raschka & Mirjalili, 2019). Neural networks, a category of deep learning models, exhibit remarkable versatility and can be applied to unsupervised learning tasks, including the domain of emotion classification. These computational models draw inspiration from the human brain and consist of interconnected layers of artificial neurons (Batool et al., 2013). They possess the capacity to acquire intricate patterns and representations from raw data, rendering them apt for a multitude of applications, spanning image and text classification. In unsupervised learning contexts, neural networks prove instrumental in the identification of inherent data structures, such as clusters or low-dimensional representations, which subsequently find utility in emotion classification endeavors.

This study undertakes an exploration of the potential of unsupervised neural networks for the task of emotion classification. Through a comparative analysis of their performance vis-à-vis supervised machine learning algorithms, the study aspires to elucidate the advantages and constraints associated with each approach, ultimately discerning the most efficacious method for the classification of emotions within textual data. The outcomes of this research endeavor hold the promise of yielding valuable insights pertinent to the refinement of emotion recognition systems, contributing to their enhanced accuracy and robustness.

4.3 Transformers with LLMs: XLM-RoBERTa and GPT

In recent years, self-supervised learning methodologies have garnered considerable attention due to their aptitude for harnessing substantial volumes of unannotated data to pretrain models, subsequently amenable to fine-tuning for task-specific objectives (Atito et al., 2021). This study directs its attention towards transformer models, specifically emphasizing (1) XLM-RoBERTa and (2) members of the GPT family, encompassing curie, davinci variants and GPT-3.5 turbo, as instruments for the task of emotion classification.

XLM-RoBERTa, an advanced transformer-based language model developed by Facebook AI, extends upon the achievements of the RoBERTa model. It is pretrained on a vast multilingual corpus spanning over 100 languages, enabling it to capture the intricacies of diverse linguistic structures. Several factors contribute to the selection of XLM-RoBERTa for this study. Firstly, its pretraining includes text sources resembling the genre of children’s literature, such as book corpora, stories, open web text, and CC-News, aligning well with the current study’s focus. Secondly, its multilingual capabilities broaden its applicability, particularly in the context of language diversity. Lastly, the model employs Sentence-BERT (SBERT) (Reimers & Gurevych, 2019) for sentence-level embeddings, aligning with the study’s objectives and providing a robust foundation for research. Consequently, this study will undertake fine-tuning procedures on the XLM-RoBERTa model.

The GPT-3 family of language models, including Curie and Davinci, utilizes transformer architecture and is well-known for its outstanding performance across various tasks (refer to Fig. 4). Table 3 provides details on model names, indicative accuracy, training cost, and inference cost. Recently, OpenAI has made GPT-3.5 accessible for public use, allowing customization for fine-tuning tasks. This version, leveraging transformer architecture, demonstrates remarkable proficiency across a wide range of tasks. Particularly noteworthy is the GPT-3.5 turbo variant, an improvement over its predecessor, GPT-3.5, with enhanced capabilities in understanding and generating natural language or code. For instance, the GPT-3.5-turbo-1106 model introduces significant enhancements, such as improved instruction following. Consequently, this study will initially focus on Curie and Davinci models, followed by an examination of the latest iteration, GPT-3.5-turbo-1106. Additionally, as more advanced models become available in the future, the study will adapt by incorporating them into the current fine-tuning process.

Table 3 Models in GPT family

(Note: This information is retrieved from OpenAI on 15 Sep 2023)

The primary aim of this study’s fine-tuning endeavor on transformer models for emotion classification is to leverage their formidable capabilities in enhancing overall model performance. The anticipated outcome of this undertaking is the provision of heightened accuracy and dependability in the identification of emotions within textual data. Additionally, the fine-tuning process promises valuable insights into the efficacy of these cutting-edge models, particularly within the domain of children’s literature and bilingual contexts encompassing both English and Chinese. Such insights will serve to inform subsequent research and development initiatives within the natural language processing field, ultimately contributing to the advancement of more refined emotion recognition systems.

5.1 Supervised and unsupervised classification results

The outcomes presented in Table 4 provide an assessment of the efficacy of four supervised and unsupervised classification models—Logistic Regression, Naïve Bayes, SVMs, and Neural Networks—in the context of predicting emotions encompassing five distinct categories: Joy, Sadness, Anger, Fear, and Love. These models underwent evaluation based on their F1 scores and classification accuracy (CA), serving as indicators of their performance in emotion classification tasks. The results illustrated in Table 4 underscore the divergence in performance exhibited by each model across the various emotion categories. SVMs consistently manifest superior classification accuracy in most emotion categories, closely followed by Neural Networks. Conversely, Logistic Regression and Naïve Bayes models exhibit mixed performance, with Naïve Bayes achieving a notably high F1 score for Joy but comparatively lower scores for other emotions. Figure 5 provides a visual representation of multiclass classification F1 and accuracy results.

Table 4

The supervised and unsupervised classification experiment results by emotion type
Models	Joy		Sadness		Anger		Fear		Love
Models	F1	CA	F1	CA	F1	CA	F1	CA	F1	CA
Logistic Regression	0.877	0.752	0.667	0.752	0.792	0.752	0.625	0.752	0.727	0.752
Naïve Bayes	0.904	0.704	0.571	0.704	0.769	0.704	0.488	0.704	0.619	0.704
SVMs	0.900	0.824	0.787	0.824	0.833	0.824	0.722	0.824	0.800	0.824
Neutral Network	0.880	0.816	0.772	0.816	0.863	0.816	0.757	0.816	0.733	0.816

(Note: The best performing model F1s are underlined.)

These findings accentuate the consequential influence of the choice of classification model on the overall proficiency of emotion classification tasks. Consequently, it becomes imperative to weigh the distinctive attributes of each model against the specific dataset and task at hand. The ensuing discussion delves into the performance nuances of each model and their attendant implications: (1) Logistic Regression, characterized by its relatively uniform performance across the emotion categories, attains the highest F1 score for Joy (0.877) and the lowest for Fear (0.625). This model’s consistency may be attributed to its straightforward nature and capacity for establishing linear decision boundaries. However, Logistic Regression’s limitations become evident when confronted with intricate data patterns, wherein it may falter in capturing complex feature relationships. (2) The Naïve Bayes model achieves the most elevated F1 score for Joy (0.904) but displays suboptimal performance in classifying other emotions, notably Fear (0.488). This performance incongruity may be ascribed to the model’s foundational assumption of feature independence, which may not always align with the intricate dynamics inherent in real-world datasets. Within the realm of emotion classification, this assumption can result in misclassifications when intricate word-emotion relationships are at play. (3) Support Vector Machines consistently outperform other models across the majority of emotion categories, affirming their robustness and adaptability for handling high-dimensional data. The employment of kernel functions in SVMs endows them with the capability to capture intricate, non-linear feature relationships, rendering them an ideal choice for emotion classification tasks (Kächele et al., 2016). Nonetheless, it is worth noting that SVMs may entail computational intensity, particularly in scenarios involving extensive datasets. (4) The Neural Network model demonstrates robust overall performance, boasting the highest F1 score for Anger (0.863) and competitive scores for other emotions. Neural Networks, renowned for their capacity to “encode complex, hierarchically organized information” (Elman, 1993, p. 4), align seamlessly with the demands of emotion classification tasks. Nevertheless, they necessitate substantial training data and computational resources, presenting potential constraints in specific contexts.

According to the average supervised and unsupervised classification outcomes presented in Table 5, the SVMs model emerges as the top performer across a spectrum of evaluation metrics. Notably, it exhibits the highest F1 score (0.824), classification accuracy (0.824), precision (0.843), and recall (0.824) in comparison to its model counterparts. These results substantiate the SVMs model’s proficiency in discerning and delineating distinct classes within the dataset. Furthermore, the equilibrium maintained between precision and recall underscores the model’s prowess in delivering both accuracy and sensitivity in its predictive capabilities.

Table 5

The average supervised and unsupervised classification experiment results
Models	Average
Models	F1	CA	Precession	Recall
Logistic Regression	0.756	0.752	0.786	0.752
Naïve Bayes	0.704	0.704	0.733	0.704
SVMs	0.824	0.824	0.843	0.824
Neutral Network	0.816	0.816	0.823	0.816

One noteworthy facet to consider when interpreting the findings of this study pertains to the influence of feature extraction techniques on model efficacy. In this experiment, all models were tested with SBERT word embeddings, which capture semantic information at the sentence level. This approach allows the models to incorporate valuable context when making predictions (Reimers & Gurevych, 2019), which is crucial for understanding and predicting emotions in text. However, it is worth noting that different feature extraction techniques might yield different results, and future studies could explore alternative methods, such as bag-of-words, term frequency-inverse document frequency (TF-IDF), or word2vec, to compare their effectiveness in the context of emotion classification.

A salient insight gleaned from the supervised and unsupervised classification experiment results underscores the delicate balance between model complexity, computational resources, and classification performance. While more intricate models, including Neural Networks, may furnish heightened accuracy, they invariably necessitate substantial computational resources. This consideration can assume paramount significance in certain research contexts or when handling extensive datasets. Consequently, the judicious selection of a model commensurate with the task’s specific requirements emerges as a pivotal decision.

5.2 Transformer-based classification results

Additionally, this study delves into the prospective capabilities of transformer-based language models, including XLM-RoBERTa, GPT-curie, GPT-davinci, and GPT-3.5-turbo-1106, to augment the accuracy of emotion classification via multiclass categorization.

5.2.1 Fine tuning with XLM-RoBERTa

Table 6 displays the results of fine-tuning the XLM-RoBERTa model for a classification task, spanning eight epochs. Each epoch reports key performance metrics, encompassing accuracy, F1 score, precision, and recall. An analysis of these findings reveals that the third epoch demonstrates the most favourable performance, exhibiting the following metrics: an accuracy of 0.746, an F1 score of 0.747, a precision of 0.763, and a recall of 0.746. This particular epoch surpasses the others with superior accuracy, F1 score, and precision. Moreover, its notably elevated recall value suggests the model’s ability to maintain a commendable equilibrium between precision and recall throughout the third epoch.

Table 6

The XLM-RoBERTa classification results
Epoch	CA	F1	Precision	Recall
1	0.731	0.734	0.749	0.731
2	0.746	0.742	0.752	0.746
3	0.746	0.747	0.763	0.746
4	0.716	0.715	0.719	0.716
5	0.724	0.723	0.732	0.724
6	0.724	0.727	0.740	0.724
7	0.731	0.732	0.735	0.731
8	0.746	0.746	0.749	0.746

To mitigate the risk of overfitting during the fine-tuning of a pre-trained model, it is essential to continuously monitor performance metrics on a validation dataset, as advocated by Faber and Rajko (2007). A conspicuous decline in performance on the validation set typically indicates overfitting to the training data, warranting careful consideration. In this specific case, the performance metrics do not exhibit a significant decline beyond the third epoch and the third epoch consistently yields the most favourable results.

In comparison to the previously discussed models (Logistic Regression, Naïve Bayes, SVMs, and Neural Network), the XLM-RoBERTa model demonstrates competitive performance, considering it has only undergone a limited number of fine-tuning epochs. Nonetheless, it is important to acknowledge that the SVMs model maintains a slight performance advantage. It is plausible that further enhancements in the XLM-RoBERTa model’s performance could be achieved through additional fine-tuning iterations, exploration of diverse learning rates, or an extension of the training process.

5.2.2 Fine tuning with GPT models

Table 7 provides an overview of the multiclass classification results obtained through the fine-tuning of three GPT models, namely curie, davinci, and GPT-3.5-turbo-1106. The performance metrics, encompassing the F1 score (F1) and classification accuracy (CA), are meticulously documented for each epoch of the training process.

Table 7

GPT models classification results
Models	Average
Models	F1	CA
curie	0.803	0.807
davinci	0.846	0.848
gpt-3.5-turbo-1106	0.869	0.958

Upon examination of the provided results, it is evident that the GPT-3.5-turbo-1106 model consistently outperforms its counterparts, the curie and davinci model, in both F1 score and classification accuracy. The F1 score of the GPT-3.5-turbo-1106 model reached 0.869, with a classification accuracy of 0.958. This observed superiority aligns with the expectations, as the GPT-3.5-turbo-1106 model, being larger and more potent, possesses a heightened capacity to capture intricate patterns and nuances within the training data. This enhanced capability generally translates into superior performance across a spectrum of natural language processing tasks, including classification.

Comparing the average supervised and unsupervised classification results with the transformer-based classification results reveals intriguing insights into the performance of various models in different scenarios. In the average supervised and unsupervised classification experiments, SVMs emerged as the top performer, achieving the highest F1 score of 0.824 and CA of 0.824. On the other hand, when transitioning to transformer-based classification tasks, fine-tuned models such as XLM-RoBERTa and GPT-3.5-turbo-1106 exhibited competitive performance. For instance, GPT-3.5-turbo-1106 achieved an impressive F1 score of 0.869 and CA of 0.958. This contrast highlights the adaptability of different models to specific classification tasks. While SVMs excel in supervised and unsupervised classification, more complex models like GPT-3.5-turbo-1106 show promise in handling transformer-based categorization.

This study explores emotion analysis in bilingual children’s literature, introducing the BilingualChildEmo dataset, a bilingual Chinese-English dataset for five basic emotions. It outlines the project’s challenges and objectives in creating this dataset and details its formation and annotation processes. The study examines emotion distribution and cross-lingual emotional expression. Various emotion analysis approaches, including supervised methods like SVMs, Naïve Bayes, Logistic Regression, unsupervised Neural Networks, and advanced transformer models (XLM-RoBERTa, GPT-3.5-turbo) are explored. The results show SVMs excelling with a top F1 score of 0.824 and CA of 0.824, while fine-tuned models like GPT-3.5-turbo-1106 perform competitively with an F1 score of 0.869 and CA of 0.958, illustrating model adaptability to specific tasks. This research connects emotion analysis, children’s literature, and bilingual studies, shedding light on emotions, language, and culture interplay.

Nevertheless, it is important to acknowledge some limitations in the current study. This study employs natural language processing techniques to scrutinize texts, providing insights into emotion dynamics within children’s literature. Despite the model’s reliance on human annotation, achieving an F1 score of 0.869 is commendable, given the inherent subjectivity in individuals’ emotional perceptions (Petrides, 2010). To further elevate the performance of emotion analysis in literary studies, future research endeavors could contemplate dataset expansion, involving additional annotators, and harnessing more potent large language models, such as GPT-4. Moreover, researchers might consider broadening the scope to encompass the classification of explicit versus implicit emotions, considering additional contexts and linguistic cues (Lee, 2015). This approach would facilitate a more comprehensive understanding of the dynamics of emotions in bilingual literary texts.

Furthermore, the dataset’s notable predominance of Joy and Sadness, alongside the relatively infrequent occurrences of Anger, Fear, and Love, serves as an initial indicator of prevalent emotional trends in children’s literature. It is essential to recognize that this interpretation necessitates further examination and comprehensive contextual exploration. Therefore, future studies could contemplate expanding the dataset and undertaking more extensive and meticulous analyses of the emotional themes inherent in children’s literary works. This approach will establish a more resilient and nuanced foundation for drawing definitive conclusions in this field.

In conclusion, future research directions encompass expanding the dataset through additional annotators and the integration of advanced language models like GPT-4, aiming to elevate emotion analysis in bilingual children’s literature. Further exploration into explicit versus implicit emotions, facilitated by emotion dictionaries, promises a deeper understanding of emotional dynamics. Given the dataset’s notable prevalence of Joy and Sadness, comprehensive investigations into emotional themes within children’s literature are on the horizon, poised to yield nuanced insights into this domain. These prospective endeavors hold the potential to enhance accuracy, reduce subjectivity, and enrich our comprehension of emotions, language, and culture in this unique context.

Definition of emotions and examples given to the annotators

Joy

A feeling of great pleasure and happiness.

Example 1

“好哇!好哇!”整个宫廷呼喊着,娇小的公主笑得十分开心。

Example 2

“What a delightful time I shall have in my garden,” he said, and he went to work at once.

Sadness

The condition or quality of being sad.

Example 1

那天下午孩子们跑进来时, 发现巨人躺在那棵树下死了, 身上盖满了白色的鲜花。

Example 2

Poor people, to lose their only son!

Anger

A strong feeling of annoyance, displeasure, or hostility.

Examples 1:织工气愤地看着他, 说: 你看我干什么༟

Examples 2: And the weaver looked at him angrily, and said, ‘Why art thou watching me?

Fear

An unpleasant emotion caused by the belief that someone or something is dangerous, likely to cause pain, or a threat.

Examples 1:他顿时感到一阵巨大的恐惧, 他跟织工说: “你在织什么样的长袍༟”

Examples 2: It is a very dangerous thing to know one’s friends.

Love

A strong feeling or constant affection for a person.

Examples 1:他是很愛他, 因為他親過他的嘴。

Examples 2: Here at last is a true lover, said the Nightingale.

The authors have no relevant financial or non-financial interests to disclose. The authors have no competing interests to declare that are relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript. The authors have no financial or proprietary interests in any material discussed in this article.

Author Contribution

Y.L., S.Y.M.L., and D.L. contributed to the conception and design of the study. Y.L. compiled the bilingual Chinese-English dataset and conducted the initial analyses. S.Y.M.L. contributed to the methodology and was involved in refining the emotion detection techniques along with Y.L. D.L. supervised the project, contributed to the interpretation of data, and provided critical revisions that were important for the intellectual content. Y.L. and S.Y.M.L. drafted the manuscript. All authors reviewed and approved the final manuscript. D.L. (corresponding author) also handled communication between the team during the manuscript preparation and submission process.

Data Availability

The data that support the findings of this study have been deposited in: https://huggingface.co/datasets/nanaaaa/emotion_chinese_english. The DOI for the dataset is 10.57967/hf/1019.

Adukia, A., Christ, C., Das, A., & Raj, A. (2022). Portrayals of race and gender: Sentiment in 100 years of children’s literature. ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies (COMPASS).
Ahmad, K. (Ed.). (2008). In Proceedings of the Workshop on Sentiment Analysis: Emotion, Metaphor, Ontology and Terminology (EMOT-08), In Association with LREC-08, Marrakech, Morocco, May 27, 2008.
Alm, C. O., Roth, D., & Sproat, R. (2005). Emotions from text: machine learning for text-based emotion prediction. Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing.
Alves, F., & Jakobsen, A. L. (2021). The Routledge handbook of translation and cognition. Routledge Abingdon and New York.
Aman, S., & Szpakowicz, S. (2007). Identifying expressions of emotion in text. International Conference on Text, Speech and Dialogue.
Atito, S., Awais, M., & Kittler, J. (2021). Sit: Self-supervised vision transformer. arXiv preprint arXiv:2104.03602.
Batool, R., Khattak, A. M., Maqbool, J., & Lee, S. (2013). Precise tweet classification and sentiment analysis. 2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS).
Becker, J., & Kleinman, A. (2013). Psychosocial aspects of depression. Routledge.
Bradley, M. M., & Lang, P. J. (1999). Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical Report C-1, The Center for Research in Psychophysiology, University of Florida.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., & Askell, A. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Brynielsson, J., Johansson, F., Jonsson, C., & Westling, A. (2014). Emotion classification of social media posts for estimating people’s reactions to communicated alert messages during crises. Security Informatics, 3, 1–11.
Chen, Y., Lee, S. Y., & Huang, C. R. (2009). A cognitive-based annotation system for emotion computing. Proceedings of the Third Linguistic Annotation Workshop (LAW III).
Chuang, Z. J., & Wu, C. H. (2002). Emotion recognition from textual input using an emotional semantic network. 7th International Conference on Spoken Language Processing, ICSLP 2002.
Danisman, T., & Alpkocak, A. (2008). Feeler: Emotion classification of text using vector space model. AISB 2008 Convention Communication. Interaction and Social Intelligence.
Dawei, W., Alfred, R., Obit, J. H., & On, C. K. (2021). A literature review on text classification and sentiment analysis approaches. Computational Science and Technology: 7th ICCST 2020, Pattaya, Thailand, 29–30 August, 2020, 305–323.
Demszky, D., Movshovitz-Attias, D., Ko, J., Cowen, A., Nemade, G., & Ravi, S. (2020). GoEmotions: A dataset of fine-grained emotions. arXiv preprint arXiv:2005.00547.
Deshpande, M., & Rao, V. (2017). Depression detection using emotion artificial intelligence. 2017 International Conference on Intelligent Sustainable Systems (iciss).
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dimitrov, S., Zamal, F., Piper, A., & Ruths, D. (2015). Goodreads versus Amazon: the effect of decoupling book reviewing and book selling. Proceedings of the International AAAI Conference on Web and Social Media.
Edwards, D. (1999). Emotion discourse. Culture & Psychology, 5(3), 271–291.
Ekman, P. (1992). Are there basic emotions? Psychological Review, 99(3), 550–553.
Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48(1), 71–99.
Faber, N., & Rajko, R. (2007). How to avoid over-fitting in multivariate calibration—The conventional validation approach and an alternative. Analytica Chimica Acta, 595(1–2), 98–106.
Feldman, R. (2013). Techniques and applications for sentiment analysis. Communications of the ACM, 56(4), 82–89.
Guu, K., Lee, K., Tung, Z., Pasupat, P., & Chang, M. (2020). Retrieval augmented language model pre-training. International Conference on Machine Learning.
Haider, T., Eger, S., Kim, E., Klinger, R., & Menninghaus, W. (2020). PO-EMO: Conceptualization, annotation, and modeling of aesthetic emotions in German and English poetry. arXiv preprint arXiv:2003.07723.
Haslip, M. J., Allen-Handy, A., & Donaldson, L. (2019). How do children and teachers demonstrate love, kindness and forgiveness? Findings from an early childhood strength-spotting intervention. Early Childhood Education Journal, 47, 531–547.
Jacobs, A. M., Herrmann, B., Lauer, G., Lüdtke, J., & Schroeder, S. (2020). Sentiment analysis of children and youth literature: is there a pollyanna effect? Frontiers in psychology, 11, 574746.
Jain, M., Narayan, S., Balaji, P., Bhowmick, A., & Muthu, R. K. (2020). Speech emotion recognition using support vector machine. arXiv preprint arXiv:2002.07590.
James, W. (1884). What is an emotion? Mind, 9(34), 188–205.
Johnson-Laird, P. N., & Oatley, K. (1989). The language of emotions: An analysis of a semantic field. Cognition and emotion, 3(2), 81–123.
Kächele, M., Schels, M., Meudt, S., Palm, G., & Schwenker, F. (2016). Revisiting the EmotiW challenge: how wild is it really? Classification of human emotions in movie snippets based on multiple features. Journal on Multimodal User Interfaces, 10, 151–162.
Kaya, H., Salah, A. A., Karpov, A., Frolova, O., Grigorev, A., & Lyakso, E. (2017). Emotion, age, and gender classification in children’s speech by humans and machines. Computer Speech & Language, 46, 268–283.
Kohout, S., Kruikemeier, S., & Bakker, B. N. (2023). May I have your Attention, please? An eye tracking study on emotional social media comments. Computers in Human Behavior, 139, 107495.
Kwon, O. W., Chan, K., Hao, J., & Lee, T. W. (2003). Emotion recognition by speech signals. Eighth European Conference on Speech Communication and Technology.
Landis, J. R., & Koch, G. G. (1977). An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics, 363–374.
Lathey, G. (2015). Translating children’s literature. Routledge.
Lauriola, I., Lavelli, A., & Aiolli, F. (2022). An introduction to deep learning in natural language processing: Models, techniques, and tools. Neurocomputing, 470, 443–456.
Lee, S. Y. M., Chen, Y., & Huang, C. R. (2009). Cause event representations for happiness and surprise. Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation.
Lee, S. Y. M., Chen, Y., & Huang, C. R. (2010). A text-driven rule-based system for emotion cause detection. Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text.
Lee, S. Y. M., Chen, Y., Huang, C. R., & Li, S. (2013). Detecting emotion causes with a linguistic rule-based approach. Computational Intelligence, 29(3), 390–416.
Lee, S. Y. M. (2015). A linguistic analysis of implicit emotions. Chinese Lexical Semantics: 16th Workshop, CLSW 2015.
Lin, K. H. Y., Yang, C., & Chen, H. H. (2008). Emotion classification of online news articles from the reader’s perspective. 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.
Logan, K. L. (1998). The song of the nightingale: Form and function in Oscar Wilde's fairy tales. The Florida State University.
Love, N. (2007). Are languages digital codes? Language sciences, 29(5), 690–709.
Majid, A. (2012). Current emotion research in the language sciences. Emotion Review, 4(4), 432–443.
Medhat, W., Hassan, A., & Korashy, H. (2014). Sentiment analysis algorithms and applications: A survey. Ain Shams engineering journal, 5(4), 1093–1113.
Michel, P., & Kaliouby, E. (2003). R. Real time facial expression recognition in video using support vector machines. Proceedings of the 5th International Conference on Multimodal Interfaces.
Mihalcea, R., & Liu, H. (2006). A corpus-based approach to finding happiness. AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.
Mohammad, S., & Turney, P. (2010). Emotions evoked by common words and phrases: Using mechanical turk to create an emotion lexicon. Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text.
Mohammad, S. M. (2012). From once upon a time to happily ever after: Tracking emotions in mail and books. Decision Support Systems, 53(4), 730–741.
Mohammad, S. M., & Turney, P. D. (2013). Nrc emotion lexicon. National Research Council Canada, 2, 234.
Mohammad, S. M., & Bravo-Marquez, F. (2017). Emotion Intensities in Tweets. arXiv preprint arXiv:1708.03696.
Mohammad, S. (2018). Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (volume 1: Long Papers).
Moruzi, K., Smith, M. J., & Bullen, E. (2017). Affect, emotion, and children’s literature: Representation and socialisation in texts for children and young adults. Routledge.
Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903.
Nikolajeva, M. (2013). Picturebooks and emotional literacy. The reading teacher, 67(4), 249–254.
Nikolajeva, M. (2014). Reading for Learning: Cognitive approaches to children’s literature (Vol. 3). John Benjamins Publishing Company.
Nodelman, P. (2008). The hidden adult: Defining children’s literature. JHU.
Oberländer, L. A. M., & Klinger, R. (2018). An analysis of annotated corpora for emotion classification in text. Proceedings of the 27th International Conference on Computational Linguistics.
Parrott, W. G. (2001). The nature of emotion. Blackwell handbook of social psychology: Intraindividual processes, 375–390.
Petrides, K. V. (2010). Trait emotional intelligence theory. Industrial and organizational psychology, 3(2), 136–139.
Picard, R. W. (1995). 2000). Affective computing. MIT press.
Plutchik, R. (2003). Emotions and life: Perspectives from psychology, biology, and evolution. American Psychological Association.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Raschka, S., & Mirjalili, V. (2019). Python machine learning: Machine learning and deep learning with Python, scikit-learn, and TensorFlow 2. Packt Publishing Ltd.
Reimers, N., & Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.
Rimé, B. (2009). Emotion elicits the social sharing of emotion: Theory and empirical review. Emotion Review, 1(1), 60–85.
Rojo, A. (2017). The role of emotions. The handbook of translation and cognition, 369–385.
Schwieter, J. W., & Ferreira, A. (2017). The handbook of translation and cognition. Wiley.
Sen, P. C., Hajra, M., & Ghosh, M. (2020). Supervised classification algorithms in machine learning: A survey and review. Emerging Technology in Modelling and Graphics: Proceedings of IEM Graph 2018.
Shaver, P. R., Morgan, H. J., & Wu, S. (1996). Is love a basic emotion? Personal Relationships, 3(1), 81–96.
Sosea, T., & Caragea, C. (2020). Canceremo: A dataset for fine-grained emotion detection. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Staiano, J., & Guerini, M. (2014). Depechemood: a lexicon for emotion analysis from crowd-annotated news. arXiv preprint arXiv:1405.1605.
Stevenson, D. (1997). Sentiment and Significance: The Impossibility of Recovery in the Children’s Literature Canon, or The Drowning of The Water Babies. The Lion and the Unicorn, 21(1), 112–130.
Strapparava, C., & Mihalcea, R. (2008). Learning to identify emotions in text. Proceedings of the 2008 ACM symposium on Applied computing.
Susskind, J., Littlewort, G., Bartlett, M., Movellan, J., & Anderson, A. (2007). Human and computer recognition of facial expressions of emotion. Neuropsychologia, 45(1), 152–162.
Tang, H., Tan, S., & Cheng, X. (2009). A survey on sentiment detection of reviews. Expert Systems with Applications, 36(7), 10760–10773.
Toury, G. (1995). Descriptive translation studies and beyond (Vol. 4). J. Benjamins Amsterdam.
Turc, I., Chang, M. W., Lee, K., & Toutanova, K. (2019). Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962.
Wang, C., Couch, L., Rodriguez, G. R., & Lee, C. (2015). The Bullying Literature Project: using children’s literature to promote prosocial behavior and social-emotional outcomes among elementary school students. Contemporary school psychology, 19, 320–329.
Wilce, J. M. (2009). Language and emotion. Cambridge University Press.
Worsham, J., & Kalita, J. (2020). Multi-task learning for natural language processing in the 2020s: where are we going? Pattern Recognition Letters, 136, 120–126.
Yang, B., & Cardie, C. (2014). Context-aware learning for sentence-level sentiment analysis with posterior regularization. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Zad, S., Heidari, M., Jr, J., H., & Uzuner, O. (2021). Emotion detection of textual data: An interdisciplinary survey. 2021 IEEE World AI IoT Congress (AIIoT).
Zhang, L., Wang, S., & Liu, B. (2018). Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), e1253.
Zhou, K., & Long, F. (2018). Sentiment analysis of text based on CNN and bi-directional LSTM model. 2018 24th International Conference on Automation and Computing (ICAC).

The finetunded model based on XLM-RoBERTa is available on HuggingFace: https://doi.org/10.57967/hf/1912.
The finetuned model names in the study are as follows: “ft-5mrXNWkJyam16R3PXleb7pfF” for the Curie model, “ft-RyhLLRiSFR3FulbXeKNJKo6B” for the Davinci model, and “ft:gpt-3.5-turbo-1106:personal:bilingualchildemo:935N4ypy” for the GPT-3.5-turbo-1106 model.
The data is from OpenAI group [PUBLIC] Best practices for fine-tuning GPT-3 to classify text: https://docs.google.com/document/d/1rqj7dkuvl7Byd5KQPUJRxc19BJt8wo0yHNwK84KfU3Q/edit

No competing interests reported.

Download PDF

Version 1

posted

You are reading this latest preprint version

Examining emotions in English and translated Chinese children’s literature: a bilingual emotion detection model based on LLMs

Status:

Version 1

Abstract

Figures

1 Introduction

2 Literature review

3 Dataset

3.1 BilingualChildEmo

3.2 Annotation

3.3 Data analysis

3.3.1 Emotion distribution

3.3.2 A comparison of emotions in Chinese and English sentences

4 Modelling methods

4.1 Supervised machine learning: SVMs, Naïve Bayes and Logistic Regression

4.2 Unsupervised deep learning: Neural Network

4.3 Transformers with LLMs: XLM-RoBERTa and GPT

5 Experiment results

5.1 Supervised and unsupervised classification results

5.2 Transformer-based classification results

5.2.1 Fine tuning with XLM-RoBERTa

5.2.2 Fine tuning with GPT models

6 Conclusion

Appendix

Declarations

Author Contribution

Data Availability

References

Footnotes

Additional Declarations

Status:

Version 1