Through RSA, we showed that the emotions represented in three different modalities (visual, linguistic, and visio-linguistic) share commonalities in relational structure while exhibiting modality-specific variations (Experiment 1). We also observed that in individual brain regions, emotion representations derived from BOLD signal changes associated with emotional experiences showed different degrees of similarity to the emotion representations from the visual, linguistic, and visio-linguistic modalities (Experiment 2). Furthermore, we explored the extent to which emotion representations based on facial expressions correspond to those of other modalities through prediction analysis using a rigid transformation technique and through evaluation of transfer performance using an ANN trained on facial expression discrimination (Experiment 3). The results of the present study reveal three key points: 1) The representational relationships between emotions calculated from the same modality but using different methods are similar, although this similarity diminishes to some extent across different modalities. 2) The representational relationships of visual emotion and linguistic emotion show relatively strong correlations with neural responses in posterior and anterior brain regions, respectively, whereas the representational relationships of visio-linguistic emotion are the most similar to neural responses across the whole brain. 3) The representational relationships between emotions across different modalities have similar structures, to the extent that they can be linearly mapped onto each other.
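To make the comparison procedure concrete, the following is a minimal sketch of the RSA step, assuming that the emotion vectors of each modality are stored as category-by-feature matrices; the variable names, the number of categories, and the use of Pearson-correlation RSMs compared by Spearman correlation are illustrative assumptions rather than the exact pipeline of this study.

```python
import numpy as np
from scipy.stats import spearmanr

def emotion_rsm(vectors):
    """Representational similarity matrix: pairwise Pearson correlations
    between emotion-category vectors (rows = categories, cols = features)."""
    return np.corrcoef(vectors)

def compare_rsms(rsm_a, rsm_b):
    """Spearman correlation between the upper-triangle (off-diagonal)
    elements of two RSMs defined over the same emotion categories."""
    iu = np.triu_indices_from(rsm_a, k=1)
    return spearmanr(rsm_a[iu], rsm_b[iu])

# Illustrative use with synthetic vectors for 27 emotion categories.
rng = np.random.default_rng(0)
visual_vecs = rng.normal(size=(27, 128))      # e.g., image-model features
linguistic_vecs = rng.normal(size=(27, 300))  # e.g., word-embedding features
rho, p = compare_rsms(emotion_rsm(visual_vecs), emotion_rsm(linguistic_vecs))
print(f"RSM similarity (Spearman rho) = {rho:.3f}, p = {p:.3g}")
```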
Although different modalities exhibited distinctive features in emotion representation, these representations showed a degree of similarity that enabled linear mapping between them. Our results indicate that the topology of emotion representation is largely preserved across different modalities, suggesting a certain consistency between emotions expressed through facial expressions and those conveyed through natural language. These findings are consistent with our everyday experience, in which there is little discrepancy between emotions conveyed through different modalities.
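As a minimal sketch of such a linear mapping, the snippet below fits an orthogonal Procrustes rotation between two modality-specific emotion spaces that share the same categories and a common dimensionality; the matrices and dimensionalities are synthetic placeholders, and the rigid-transformation procedure of Experiment 3 may differ in detail.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

# Placeholder emotion embeddings: rows = shared emotion categories,
# columns = features, both reduced to a common dimensionality (e.g., by PCA).
rng = np.random.default_rng(1)
source = rng.normal(size=(27, 50))   # e.g., visual emotion vectors
target = rng.normal(size=(27, 50))   # e.g., linguistic emotion vectors

# Center each space, then find the rotation R minimizing ||source @ R - target||_F.
src_c = source - source.mean(axis=0)
tgt_c = target - target.mean(axis=0)
R, scale = orthogonal_procrustes(src_c, tgt_c)
mapped = src_c @ R

# Goodness of the rigid mapping: correlation between mapped and target vectors.
fit = np.corrcoef(mapped.ravel(), tgt_c.ravel())[0, 1]
print(f"correlation after rigid mapping: {fit:.3f}")
```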
Previous research has reported that various brain regions process distinct emotion categories, and that the processing of a particular emotion category involves a distributed network of brain regions rather than a localized one-to-one correspondence between specific brain areas and emotion categories [18, 27, 28]. We used RSA to test whether different brain regions represent emotional experiences in distinct ways corresponding to different modalities, by comparing RSMs based on brain activity with those calculated from the various datasets. The most important finding of this study is that individual brain areas not only respond differently to different emotion categories but also do so with varying degrees of similarity to the representations of the different modalities.
The overall similarity between the emotion representations in brain regions and those of the visio-linguistic modality, which combines visual and linguistic information in a multi-modal representation, was higher than the similarity to the single-modality representations of either visual or linguistic information. Numerous previous studies have investigated brain regions involved in multi-modal emotion representations across various modalities [29–31], and these studies have consistently highlighted the contributions of areas such as the TPJ, PC, STS, medial prefrontal cortex (MPFC), and OFC to modality-independent emotion representations. In our study, several regions, including the TPJ, PC, STS, and MPFC, exhibited a high correlation between the brain emotion RSM and the visio-linguistic emotion RSM. This finding, which is in line with the previous literature, supports the notion that these areas are involved in modality-independent processing of emotional expressions.
In contrast, our results showed that the OFC, which has previously been reported to be involved in multi-modal emotion representations [29], did not exhibit a high correlation with the emotion RSM of any modality. This discrepancy could be attributed to differences in analytical approach. Chikazoe et al. [29] analyzed OFC activity on the basis of emotions described using affective dimensions such as positive or negative valence, whereas the data we used from Horikawa et al. [18] were analyzed on the basis of emotion categories. Psychophysical studies [3, 13] and a neuroscientific study [18] have shown that human emotions are better explained by emotion categories than by affective dimensions. However, OFC activity patterns are reported to correlate less with evaluations based on emotion categories, suggesting that the OFC may process emotion attributes on the basis of coarse affective reactions, such as pleasant and unpleasant, rather than fine-grained emotion categories (see Figure S1A in Horikawa et al. [18]).
In our study, several regions associated with visual information processing, such as the posterior areas and the IPL, tended to show high representational similarity with the visual modality, whereas brain regions other than the posterior areas tended to show high representational similarity with the linguistic modality. Previous research reported that the IPL processes emotional expressions specific to facial expressions [32]. Therefore, our observation that the IPL correlated most strongly with the visual modality (compared with the other modalities) may be contingent on the emotional impressions conveyed by the facial images in the video stimuli. To explore this possibility, we conducted an additional analysis similar to Experiment 2, dividing the video stimuli into two sets according to whether the video included a human face (with-face and without-face conditions). However, there was no significant difference between the with-face and without-face conditions (Figure S4). This supplementary result suggests that although the emotion representation in IPL activity is highly correlated with the representation of emotion through facial expressions, it may depend on factors beyond reading emotions from facial expressions themselves. It is also important to note that the absolute correlation coefficients between the IPL's brain activity RSM and the emotion RSMs of all modalities are relatively low. Further refined experiments are necessary to determine conclusively whether the processing of emotions in the IPL is solely attributable to facial expressions or whether it involves additional factors.
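Schematically, this supplementary split analysis can be sketched as below, reusing the emotion_rsm and compare_rsms helpers from the earlier sketch; all data, the number of stimuli and categories, and the averaging of ROI responses within categories are illustrative placeholders rather than the actual Experiment 2 procedure.

```python
import numpy as np

# Synthetic placeholders: IPL responses for 540 video stimuli, an emotion
# category label and a face/no-face flag per stimulus, and a visual emotion RSM.
rng = np.random.default_rng(2)
n_categories = 27
labels = np.tile(np.arange(n_categories), 20)               # 540 stimuli
roi_responses = rng.normal(size=(labels.size, 500))         # stimulus x voxel
has_face = rng.random(labels.size) < 0.5                    # face present?
visual_rsm = emotion_rsm(rng.normal(size=(n_categories, 128)))

def category_means(responses, labels, n_categories):
    """Average stimulus-wise responses within each emotion category."""
    return np.stack([responses[labels == c].mean(axis=0)
                     for c in range(n_categories)])

for name, mask in [("with face", has_face), ("without face", ~has_face)]:
    brain_rsm = emotion_rsm(category_means(roi_responses[mask], labels[mask],
                                           n_categories))
    rho, _ = compare_rsms(brain_rsm, visual_rsm)
    print(f"IPL vs visual emotion RSM ({name}): rho = {rho:.3f}")
```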
This study accords with psychological and neuroscientific reports asserting that human emotions are represented as multi-dimensional vectors organized into distinct categorical dimensions [3, 4, 13, 14, 16–18, 31]. Our study is valuable in that it demonstrates, through a machine learning approach, that emotion representations are linearly transferable across different modalities. This indicates a certain degree of homogeneity in the relationships between emotion representations across different modalities, a property essential for establishing semantic correspondences between emotions even when they are expressed through different modalities, and crucial for facilitating smooth communication in daily life.
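To illustrate what linear transferability means operationally, the sketch below evaluates a leave-one-category-out mapping: a rigid map is fitted on all but one emotion category and the held-out category is identified by nearest neighbour in the target space. The data are synthetic and the procedure is a simplified stand-in for the prediction and ANN-based transfer analyses of Experiment 3.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def loo_identification(source, target):
    """Leave-one-category-out: fit a rigid map on the remaining categories,
    map the held-out source vector, and check whether its nearest neighbour
    in the target space is the correct category."""
    n = source.shape[0]
    hits = 0
    for i in range(n):
        train = np.delete(np.arange(n), i)
        R, _ = orthogonal_procrustes(source[train], target[train])
        mapped = source[i] @ R
        dists = np.linalg.norm(target - mapped, axis=1)
        hits += int(np.argmin(dists) == i)
    return hits / n

# Placeholder emotion embeddings sharing 27 categories (chance level = 1/27).
rng = np.random.default_rng(3)
shared = rng.normal(size=(27, 50))
visual = shared + 0.3 * rng.normal(size=(27, 50))      # noisy visual space
linguistic = shared + 0.3 * rng.normal(size=(27, 50))  # noisy linguistic space
print(f"identification accuracy: {loo_identification(visual, linguistic):.2f}")
```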
The high correlation between the visio-linguistic emotion RSM and the brain emotion RSM can be attributed to at least two factors other than the use of a multi-modal representation. First, transformer models with attention mechanisms, which form the backbone of the pretrained CLIP model used to calculate the visio-linguistic representation, may be well suited to extracting semantic representations from training data in general. Second, since the CLIP model was trained on 400 million image-text pairs, the use of large-scale training data might be critical for capturing a diverse range of precise emotion expressions. To examine the first factor, we performed an analysis similar to that in Experiment 2, using a transformer model trained only on text data (GoEmotions [33]) to extract linguistic emotion vectors (see Figure S5 in the Supplemental information). Figure S5 indicates that the correlation map calculated using GoEmotions is similar to those of ConceptNet and Word2Vec, and that the overall average correlation coefficients across diverse brain regions are comparable with those of the other linguistic modality conditions. This suggests that the high correlation between the visio-linguistic emotion RSM and the brain emotion RSM cannot be attributed solely to the use of a transformer model for calculating the emotion RSM. Instead, what matters is conducting multi-modal learning with large-scale data to acquire an emotion representation that is highly similar to that in the human brain. Our findings suggest that emotional experiences are represented differently in each brain region, with varying degrees of similarity across different modalities, and that they may be conveyable cross-modally through different modalities.
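For reference, the sketch below shows one way visio-linguistic emotion scores of this kind can be extracted with a pretrained CLIP model through the Hugging Face transformers library; the checkpoint name, the prompt wording, and the scoring of a frame against category prompts are illustrative assumptions and not necessarily the procedure used in this study.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the CLIP variant used in this study may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

emotion_categories = ["admiration", "amusement", "anger"]  # illustrative subset
prompts = [f"a photo of a person feeling {c}" for c in emotion_categories]
frame = Image.new("RGB", (224, 224), (128, 128, 128))      # stand-in video frame

with torch.no_grad():
    text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
    text_feats = model.get_text_features(**text_inputs)        # (3, 512)
    image_inputs = processor(images=frame, return_tensors="pt")
    image_feats = model.get_image_features(**image_inputs)     # (1, 512)

# One possible visio-linguistic emotion vector for the frame: cosine similarity
# between the frame embedding and each emotion-category prompt embedding.
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
emotion_vector = (image_feats @ text_feats.T).squeeze(0)
print(dict(zip(emotion_categories, emotion_vector.tolist())))
```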