Comparison of Attention-based Deep Learning Models for EEG Classification

Objective: To evaluate the impact of different kinds of attention mechanisms in Deep Learning (DL) models on Electroencephalography (EEG) classification. Methods: We compared three attention-enhanced DL models: the brand-new InstaGATs, an LSTM with attention, and a CNN with attention. We used these models to classify normal and abnormal (i.e., artifactual or pathological) EEG patterns. Results: We achieved the state of the art in all classification problems, regardless of the large variability of the datasets and the simple architecture of the attention-enhanced models. We also showed that, depending on how the attention mechanism is applied and where the attention layer is located in the model, we can alternatively leverage the information contained in the time, frequency or space domain of the dataset. Conclusions: With this work, we shed light on the role of different attention mechanisms in the classification of normal and abnormal EEG patterns. Moreover, we discussed how they can exploit the intrinsic relationships in the temporal, frequency and spatial domains of our brain activity. Significance: Attention represents a promising strategy to evaluate the quality of the EEG information, and its relevance, in different real-world scenarios. Moreover, it can make it easier to parallelize the computation and, thus, to speed up the analysis of big electrophysiological (e.g., EEG) datasets.


I. INTRODUCTION
EEG is an electrophysiological technique that allows the electrical activity of the brain to be acquired with a very high temporal resolution, in a non-invasive way and at a relatively low cost. For these reasons, EEG has become very popular in many fields, ranging from medical diagnostics, pervasive healthcare and smart fitness to gaming and Brain-Computer Interfaces (BCIs) [1]-[4]. Across different applications, the detection of pathological EEG patterns is still a challenging problem. Early approaches using Machine Learning (ML) relied on intense pre-processing and the extraction of well-established engineered features [5], [6]. However, in the last decade, several DL models have also been successfully proposed (see [7] for a recent survey). As a result, the challenge has moved from the development of relevant engineered features to the need for massive data collection to effectively train DL models. With this increasing amount of data, identifying the most relevant information has become an important challenge. Attention [8], one of the most recent and influential ideas in DL, makes it possible to embed external knowledge into a DL model and to learn which portions of the data are relevant to the final output. This mechanism is expected to improve the explainability [9], [10], i.e., interpretability, of a DL network, and to make it easier to introduce parallel computing. In [11], the authors showed how simple attention mechanisms can greatly enhance the performance of a DL model over a standard Long Short-Term Memory (LSTM). In fact, the former could effectively model the correlations between different measurements in a lengthy clinical dataset through the so-called inter-attention. On the other hand, LSTM is known to struggle with long-term dependencies.
Therefore, in the last few years, a number of different attention strategies have been applied to EEG-based recognition. Phan et al. [12] proposed deep bidirectional Recurrent Neural Networks (RNNs) with an attention mechanism for single-channel automatic sleep stage classification. Zhu et al. [13] described how to perform automatic sleep staging using attention-enhanced Convolutional Neural Networks (CNNs). Sha et al. [14] developed a Gated Recurrent Unit (GRU)-based RNN with hierarchical attention for mortality prediction. A novel multi-view attention network (MuVAN) [15] has been shown to learn fine-grained attentional representations from multivariate time series. MuVAN provides two-dimensional attention scores to estimate the quality of information of each view within different time-stamps. Veličković et al. [16] and Zoppis et al. [17] proposed two different stacked architectures based on an attention-enhanced Graph Neural Network (GNN) to detect epileptic seizures in a multi-channel EEG dataset. In [18], ChannelAtt was designed to jointly learn both multi-view data representations from a multi-channel EEG dataset and their contribution scores, in order to dynamically identify irrelevant channels in a seizure detection problem. FusionAtt [19], a deep fusional attention network, can learn channel-aware representations of multi-channel biosignals and dynamically quantify the importance of each channel. Cho et al. [20] and Ma et al. [21] introduced AttnSense, a framework that combines attention mechanisms with CNNs and GRUs in order to capture the dependencies of the sensed signals in both the spatial and the time domains. In [11], the authors designed SAnD (i.e., Simply Attend and Diagnose), an architecture that employs a masked self-attention mechanism, together with positional encoding and dense interpolation strategies for incorporating temporal order. Self-attention is also employed in [22] as alternative dynamic convolutions.
Finally, in [23], CNNs encode frequency band-related information and find the global temporal context, taking advantage of the multi-head self-attention of a transformer model. While in the above-mentioned literature the architectures with attention mechanisms have been compared to baselines, i.e., networks devoid of attention, in our work we investigated the possibility of exploiting the same attention strategy over different datasets. Moreover, we studied how attention is applied to different DL models. To this aim, we re-implemented some of them. We selected the most commonly used DL models for EEG recognition: CNNs, LSTM, and GNN. Each of them was enhanced by the introduction of attention, and we evaluated its impact on the final classification outcome. The relevance of the attention mechanisms used in the different DL models is discussed in the context of the classification of normal versus abnormal (either artifactual or pathological) EEG segments across different datasets.
The paper consists of the following sections: Sec. II is devoted to EEG signal pre-processing and feature extraction; in Sec. III we introduce the three attention-enhanced models we considered for our investigations. Sec. IV describes the research methodologies and Sec. V provides the results of the evaluation of our models over three public EEG datasets. Finally, in Sec. VI we discuss the results and the impact of attention in each model. Sec. VII concludes our work and identifies the most relevant future perspectives.

II. PRE-PROCESSING AND FEATURE EXTRACTION
In this work, we focused on EEG data including normal and abnormal frames (i.e., segments), either pathological or artifactual. Eleven well-established time-domain and frequency-domain features were computed from each EEG frame of each individual EEG channel. In particular, in the time domain, we extracted the mean, the variance, the zero-crossings, the area under the curve, the skewness, the kurtosis, and the peak-to-peak distance (as listed in [24]). Then, in the frequency domain, we computed the spectral power in well-known and clinically relevant frequency bands, i.e., the delta (δ) band corresponding to (0.5, 4) Hz, the theta (θ) band to (4, 8) Hz, the alpha (α) band to (8, 12) Hz, and the beta (β) band to (12, 30) Hz. In the following, we refer to the set of time- and frequency-domain, channel-wise features as the vector:
$$x_c(t) = [f_{c,1}(t), f_{c,2}(t), \dots, f_{c,F}(t)] \qquad (1)$$
with F = 11 and t = 1, 2, ..., T, where T represents the number of available ordered time frames. Additionally, in each time frame t, we computed the Spearman's correlation coefficient between each pair of EEG channels, obtaining a correlation matrix r for each available time frame t:
$$r(t) = [r_{ij}(t)]_{i,j = 1, \dots, C} \qquad (2)$$
with t = 1, 2, ..., T, where T is the number of available frames and $r_{ij}(t)$ is the correlation coefficient in frame t between channel i and channel j. At each time frame, the input to the deep learning models used in this work includes both the feature vector $x_c$ for all channels c = 1, 2, ..., C, and the correlation matrix r. Unless explicitly specified, the dependence on the time frame t will be omitted to simplify the notation. Fig. 1 reports an example of EEG recordings and all pre-processing steps to obtain the input to each deep learning model.
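The channel-wise time-domain features and the Spearman correlation matrix described above can be illustrated with a minimal, dependency-free sketch. The function names, the use of the population variance, and the choice of integrating the absolute signal for the area under the curve are assumptions made for illustration; the four spectral band powers are omitted, and the original feature definitions follow [24].

```python
import math
from statistics import fmean

def time_domain_features(frame, fs=250.0):
    """Seven time-domain features of one EEG frame from a single channel."""
    n = len(frame)
    mean = fmean(frame)
    var = sum((v - mean) ** 2 for v in frame) / n  # population variance (assumption)
    std = math.sqrt(var)
    # Zero-crossings: sign changes between consecutive samples
    zero_crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    # Area under the curve: trapezoidal rule on |x| with spacing 1/fs (assumption)
    auc = sum((abs(a) + abs(b)) / 2 for a, b in zip(frame, frame[1:])) / fs
    skew = sum((v - mean) ** 3 for v in frame) / (n * std ** 3) if std else 0.0
    kurt = sum((v - mean) ** 4 for v in frame) / (n * var ** 2) if var else 0.0
    peak_to_peak = max(frame) - min(frame)
    return [mean, var, zero_crossings, auc, skew, kurt, peak_to_peak]

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's coefficient: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = fmean(rx), fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_matrix(channels):
    """C x C matrix r for one frame; channels is a list of C sample lists."""
    C = len(channels)
    return [[spearman(channels[i], channels[j]) for j in range(C)] for i in range(C)]
```

For a recording with C channels, calling `time_domain_features` on every channel of a frame and `correlation_matrix` on the same frame yields the two ingredients of the model input for that frame.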

III. ATTENTION MODELS
The three attention-enhanced DL models considered in this work share similar architectures that differ in the first layer. InstaGATs (see Section III-A) performs graph convolution on the input information; the LSTM with attention model (see Section III-B) exploits an LSTM at the first layer to process the time-related information in the input data; finally, the CNN with attention model (see Section III-C) performs a one-dimensional convolution operation on the input. The subsequent layers have an identical architecture in each model: an LSTM layer, a dense layer, and a classification layer. The attention mechanism of each model fits the specific processing performed in the model's first layer. The attention is placed between the model's first layer and the LSTM layer, except in the LSTM network with attention, where it is located after the second LSTM layer.
We used categorical cross-entropy as the loss function for parameter optimization in all models, defined as follows:
$$L = -\sum_{i=1}^{n} \sum_{j=1}^{m} y_{i,j} \log(p_{i,j})$$
where $y_{i,j}$ is the correct class label for frame i, $p_{i,j}$ is the corresponding predicted output, n is the number of samples in the dataset, and m is the number of classes. In our investigations, we considered only two classes and the output was one-hot encoded. The final layer of every model implemented the softmax function. To update the parameters, the mini-batch gradient descent method was applied, which updated the model parameters using batches of B samples. The batch size B was set empirically.
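As a small numerical illustration of the loss and of the softmax output layer, the following sketch evaluates categorical cross-entropy on one-hot labels. It is not the training code of the models: the function names are hypothetical, and the loss is averaged over the n samples, as Keras-style implementations typically do (dropping the division gives the plain double sum).

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vector of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average over samples of -sum_j y_ij * log(p_ij); eps guards log(0)."""
    n = len(y_true)
    total = 0.0
    for y_row, p_row in zip(y_true, y_pred):
        total -= sum(y * math.log(max(p, eps)) for y, p in zip(y_row, p_row))
    return total / n

# Two frames, two classes, one-hot labels
labels = [[1, 0], [0, 1]]
probs = [softmax([2.0, 0.0]), softmax([0.0, 2.0])]
loss = categorical_cross_entropy(labels, probs)
```

With only two one-hot encoded classes, each sample contributes the negative log-probability assigned to its true class.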

A. InstaGATs
Based on the seminal paper [16], as implemented in [17], a stacked architecture is proposed. The model is visualised in Fig. 2. It was obtained by placing an LSTM layer on top of the Graph Attention Network (GAT) layer. The latter implemented attention and performed graph convolution. A dense layer with dropout and a softmax layer completed the model architecture.
GAT [16] operates on graphs; thus, it was necessary to represent the input data as a graph G at each t-th frame, t ∈ {1, 2, ..., T}. To this aim, at each t, we derived the adjacency matrix of graph G(t) from the correlation matrix r(t), as in (2). Thus, the GAT's input matrix is formed by C rows, each one defined by:
$$s_i(t) = [r_{i,1}(t), \dots, r_{i,C}(t) \,\|\, f_{i,1}(t), \dots, f_{i,F}(t)] \qquad (3)$$
where r refers to the correlation coefficients in r(t), as in (2), f to the time-domain and frequency-domain features, as in (1), and || is the symbol of concatenation. Each G(t) was formed by C nodes (each corresponding to one EEG channel) and $C^2$ edges. Each node i ∈ {1, 2, ..., C} was described, at time frame t, by the feature vector $x_i(t)$, as in (1). Given G(t), t ∈ {1, 2, ..., T}, InstaGATs performed graph convolution with attention on each node and its neighborhood (using their feature vectors) and provided the so-called embeddings, $h_t$, which were then delivered to the LSTM layer (Fig. 2).
The attention mechanism was introduced by adapting the approach proposed in [16]. With reference to GAT, attention defines the relevance of every particular feature in the node's feature vector during graph convolution. The latter is expressed by the attention coefficients. Formally, given G(t) with nodes' feature vectors $x_i$, with i ∈ {1, 2, ..., C}, attention is obtained by evaluating a function a on pairs of nodes v and u, based on their feature vectors $x_v$ and $x_u$, respectively. $W_G$ is an F × F matrix of weights between the nodes' features. Thus, the attention coefficients $i_{v,u}$ quantify the significance of each feature for a node u ∈ N, where N is the first-order neighborhood of node v. It is worth noting that all the model's parameters, including the attention coefficients, were optimized end-to-end using the loss function described in Section III. By means of InstaGATs, we could capture both the relevant topology of the corresponding graph and the temporal dependency across different EEG channels, while discarding irrelevant data from the LSTM's memory.
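To make the role of the attention coefficients concrete, the sketch below follows the standard GAT formulation of [16]: each node pair is scored with a shared weight matrix and an attention vector, the scores are normalized with a softmax over the neighborhood, and the neighbors' transformed features are aggregated with those weights. This is an illustrative assumption, not the InstaGATs code: the adaptation used in this work may differ, all names are hypothetical, and the output non-linearity is omitted.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z

def gat_attention(x, neighbors, W, a):
    """x: node feature vectors; neighbors: dict v -> list of u (self-loops allowed);
    W: shared weight matrix; a: attention vector of length 2 * len(W)."""
    h = [matvec(W, xi) for xi in x]  # shared linear transform of every node
    alpha = {}
    for v, nbrs in neighbors.items():
        # Unnormalized score of neighbor u for node v: LeakyReLU(a . [h_v || h_u])
        scores = [leaky_relu(sum(ai * zi for ai, zi in zip(a, h[v] + h[u])))
                  for u in nbrs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha[v] = {u: e / total for u, e in zip(nbrs, exps)}  # softmax over N(v)
    return h, alpha

def aggregate(h, alpha):
    """Embedding of v: attention-weighted sum of its neighbors' transformed features."""
    out = {}
    for v, weights in alpha.items():
        dim = len(h[v])
        out[v] = [sum(w * h[u][k] for u, w in weights.items()) for k in range(dim)]
    return out
```

In a fully connected EEG graph, `neighbors` would map every channel to all C channels, so each node attends over the whole montage.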

B. LSTM with Attention
As the next model, we referred to the network presented in [24], and we implemented the Long Short-Term Memory with Attention (LSTM+Att). Compared to its original version with a 3-layer LSTM architecture, we focused on a 2-layer LSTM, to be consistent with the other models considered in this study. On top of the input LSTM layer with T cells, we added another LSTM layer with T LSTM cells, as in InstaGATs. Then, a dense layer delivers its information to the classification layer, consisting of 2 neurons. The network was trained using the loss function described in Section III. For each EEG frame t, the input vector was created by concatenating the Spearman's correlation coefficients from r(t) and the vector of time-domain and frequency-domain features, $x_c(t)$. The concatenated vector for one frame is expressed by:
$$s(t) = [s_1(t) \,\|\, s_2(t) \,\|\, \dots \,\|\, s_C(t)] \qquad (4)$$
where each $s_i$ is described by (3). As depicted in Fig. 3, the attention layer was stacked on top of the second LSTM layer, assigning an appropriate weight $α_i$ to each i-th cell output, $h_i$, of the LSTM layer. Each vector $h_i$ was multiplied by its weight $α_i$ and, after concatenation of the T vectors, we obtained one vector, which was delivered to the dense layer without dropout. The EEG classification was performed by the last layer, implementing softmax. Each cell of an LSTM layer built its own representation of the input frame. We can say that, in this model, the attention mechanism focused on the time steps (frames) with the most discriminative information, assigning them higher coefficients $α_i$. More formally, if $h_i$ is the output from the i-th cell of the second LSTM layer, the attention coefficients $u_i$, with i ∈ {1, 2, ..., T}, were calculated from $h_i$ using a weight matrix $W_s$. Next, they were normalized and the attention weights $α_i = softmax(u_i)$ were computed. Unlike the original paper, we trained the model using the categorical cross-entropy loss function, as in Section III.
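The weighting of the LSTM cell outputs can be sketched as follows. Since the exact scoring function is not recoverable from the text, the sketch assumes a common choice, a scalar score $u_i = \tanh(w_s \cdot h_i)$ with a weight vector `w_s` standing in for $W_s$; the function name is hypothetical and this is not the implementation of [24].

```python
import math

def temporal_attention(hidden_states, w_s):
    """hidden_states: T vectors h_i from the second LSTM layer;
    w_s: weight vector used to score each time step (assumed form)."""
    # Unnormalized attention coefficients u_i (assumed tanh scoring)
    scores = [math.tanh(sum(w * h for w, h in zip(w_s, h_i)))
              for h_i in hidden_states]
    # Attention weights alpha_i = softmax(u_i) over the T time steps
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Scale each h_i by alpha_i, then concatenate the T vectors into one
    weighted = [[a * h for h in h_i] for a, h_i in zip(alphas, hidden_states)]
    flat = [v for row in weighted for v in row]
    return alphas, flat
```

The flattened vector plays the role of the single vector delivered to the dense layer; frames with larger scores contribute with larger magnitude.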

C. CNNs with Attention
Lastly, we implemented a model based on Convolutional Neural Networks with Attention (CNNs+Att). This model was adapted from [25], where the authors introduced the Convolutional Block Attention Module (CBAM), an attention mechanism specifically designed for convolutional networks. The latter consists of two kinds of attention mechanisms: a channel attention sub-module and a spatial attention sub-module, which work in a complementary way. The first one is responsible for defining what is meaningful in the input, while the second focuses on where the relevant information is located. Relevance is determined by the attention coefficient matrices $A_c$ for the channel attention (computed via a shared Multi-Layer Perceptron (MLP)) and $A_s$ for the spatial attention (obtained as the result of a convolution operation). Following the authors' suggestion, we applied them sequentially, i.e., first the channel sub-module and then the spatial one. As depicted in Fig. 4, the overall CNNs+Att architecture is composed of a CNNs layer combined with the CBAM module, an LSTM layer, and a dense layer with dropout. The model performs a 1D convolution operation on each input vector (defined as in (4)), representing a time frame t, t ∈ {1, ..., T}. A first refinement of the input feature map was obtained by multiplying it with the matrix $A_c$ (the channel attention map); then, a second one was obtained via multiplication with $A_s$ (the spatial attention map). The outputs from the CNNs layer were concatenated and fed to the LSTM layer.
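The sequential refinement (channel attention first, spatial attention second) can be sketched for a 1D feature map as follows. The structure mirrors CBAM as described in [25] (average- and max-pooled descriptors, a shared MLP, a convolution over pooled maps, sigmoid gating), but the function names, the weight layout, and the zero-padded convolution are assumptions made for illustration, not the configuration used in our model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp(v, W1, W2):
    """Shared two-layer perceptron with a ReLU hidden layer."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

def channel_attention(fmap, W1, W2):
    """A_c: one gate per feature channel; fmap is a list of K channels of length L."""
    avg = [sum(ch) / len(ch) for ch in fmap]  # average pooling along the length
    mx = [max(ch) for ch in fmap]             # max pooling along the length
    return [sigmoid(a + b) for a, b in zip(mlp(avg, W1, W2), mlp(mx, W1, W2))]

def spatial_attention(fmap, kernel):
    """A_s: one gate per position; kernel is an odd-length list of (w_avg, w_max)."""
    L = len(fmap[0])
    avg = [sum(ch[p] for ch in fmap) / len(fmap) for p in range(L)]
    mx = [max(ch[p] for ch in fmap) for p in range(L)]
    half = len(kernel) // 2
    gates = []
    for p in range(L):
        z = 0.0
        for k, (w_avg, w_max) in enumerate(kernel):
            q = p + k - half
            if 0 <= q < L:  # zero padding outside the map
                z += w_avg * avg[q] + w_max * mx[q]
        gates.append(sigmoid(z))
    return gates

def cbam(fmap, W1, W2, kernel):
    """Apply the channel gates first, then the spatial gates."""
    a_c = channel_attention(fmap, W1, W2)
    refined = [[a * v for v in ch] for a, ch in zip(a_c, fmap)]
    a_s = spatial_attention(refined, kernel)
    return [[s * v for s, v in zip(a_s, ch)] for ch in refined]
```

With all weights set to zero, every gate equals 0.5, so the output is the input scaled by 0.25; trained weights would instead emphasize informative channels and positions.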

D. Baseline models without attention and implementation
As a comparison for the models with attention previously described, we considered the following three models: a GNN, an LSTM and a CNNs. They share the same architectures as their respective attention-enhanced models, but with the attention layer removed. All models were implemented in Python using the TensorFlow 2 framework [26]. For InstaGATs and the GNN, the Spektral library [27] was used. We optimized all hyper-parameters by selecting the values for which the F1-score, averaged across all datasets, was the highest. The Adam stochastic optimizer [28] was used to optimize the models' parameters. Tab. II reports the optimized values for all models.

IV. RESEARCH METHODOLOGIES AND EVALUATION
In this section, we describe the public datasets we analyzed, the strategy for optimizing the hyperparameters of the models, and we define the validation methodologies and metrics used to evaluate the different models.

A. Datasets
Three different EEG datasets provided by the Temple University Hospital of Philadelphia (Pennsylvania) were used in this work. All of them are publicly available, either on PhysioNet [29] or on the Temple University Hospital repository website [30]. These datasets are slightly imbalanced, with a ratio of 1:3 in the worst case (see Tab. I). Therefore, we did not apply any balancing technique to the original datasets.
The first dataset is called TUH EEG Abnormal [31]. Recording annotations are available, including patients' clinical history, medications, comments from expert visual inspection. EEGs were recorded using a sampling frequency of 250 Hz and 16 bit resolution. Two montages, i.e., the averaged reference (AR) and the linked ears reference (LE), were used, depending on the file. From this dataset we obtained 1472 positive samples (with any kind of abnormality, for further details see [31]) and 1521 negative samples (normal EEG samples).
The second dataset, TUH EEG Artifact [30], contains both normal EEG signals and EEG signals affected by 5 different types of artifacts: 161 chewing events (chew), 606 eye movements (eyem), 254 muscular artifacts (musc), 60 shivering events (shiv) and 178 instrumental artifacts (elpp, e.g., electrode pop, electrostatic artifacts or lead artifacts). Therefore, from this dataset we obtained 1259 positive samples (with any kind of artifact, regardless of their origin) and 1940 negative samples (clean EEG samples). EEGs were recorded using a sampling frequency of 250 Hz and 16 bit resolution. Note that, in the positive class, we decided to include all kinds of artifacts despite their different origin (e.g., physiological or instrumental) in order to train the models to distinguish a good EEG pattern from an abnormal one. This is, in fact, in line with traditional EEG signal processing, where the identification and exclusion of artifactual EEG frames is the first (and, typically, time-consuming) step of pre-processing.
The third dataset, named TUH EEG Seizure [32], contains EEG recordings affected by different types of seizures. Recordings include 19 EEG channels and annotations by highly trained medical doctors. EEGs were recorded using a sampling frequency of 250 Hz and 16 bit resolution with a bipolar channel configuration. From this dataset, we extracted 4240 samples with focal non-specific seizures and 1717 samples with generalized non-specific seizures. Focal non-specific seizure samples were included in the negative class, while the generalized non-specific seizure samples were included in the positive class. This choice reflects the relative severity of one type of seizure compared to the other: indeed, focal seizures affect a limited portion of brain activity compared to the generalized ones. The latter, thus, may lead to more severe brain damage.
The datasets significantly differ by size, study aim as well as number of events and even EEG acquisition setup. However, this variety allowed us to widely assess the performance of different DL models, evaluating their ability to classify EEG in several specific, but different, classification problems.

B. Data preparation
Despite the heterogeneity of the datasets, a few common steps of pre-processing were applied: the EEG recordings from every channel (i.e., sensor location) were filtered in the frequency band (0.1, 47) Hz, normalized using a min-max centered normalization technique [33] and segmented into 2 s frames using a sliding window approach, in line with the most common literature on EEG (e.g., see [34], [35]). The sampling frequency and the number of channels could differ from dataset to dataset, and even from file to file. Therefore, as a first step, we downsampled all of them to the lowest available sampling frequency, i.e., 250 Hz. Then, the largest common set of C channels for all files was retained. The pre-processing phase and the data preparation phase were carried out using PyEEGLab [36]. After band-pass filtering and segmentation (see Section II), we obtained a roughly balanced number of frames per class in each dataset, as reported in Table I.
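The normalization and segmentation steps can be sketched in a few lines. This is not the PyEEGLab pipeline [36]: the function names are hypothetical, the band-pass filter is omitted, "min-max centered" is assumed here to mean rescaling to the interval [-1, 1], and the step between consecutive windows is a free parameter because the overlap is not stated in the text.

```python
def min_max_centered(signal):
    """Rescale one channel to [-1, 1] (assumed meaning of 'min-max centered')."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return [0.0] * len(signal)  # flat signal: no range to rescale
    return [2 * (v - lo) / (hi - lo) - 1 for v in signal]

def segment(signal, fs=250, frame_seconds=2.0, step_seconds=2.0):
    """Cut one channel into fixed-length frames with a sliding window.
    step_seconds equal to frame_seconds gives non-overlapping frames."""
    size = int(frame_seconds * fs)
    step = int(step_seconds * fs)
    return [signal[s:s + size] for s in range(0, len(signal) - size + 1, step)]

# 5 s of a toy channel sampled at 250 Hz -> two complete 2 s frames
channel = [float(i % 50) for i in range(1250)]
frames = segment(min_max_centered(channel), fs=250)
```

Samples at the end of the recording that do not fill a complete 2 s window are dropped in this sketch.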

C. Evaluation and performance metrics
To increase the reliability of the results and to obtain a more robust error estimation, a stratified 10-fold cross-validation was applied. In each fold, every model was trained for 50 epochs with a batch size of 32. In order to evaluate the models' performance, we used well-established classification metrics: accuracy, recall, precision and F1-score. Recall and precision are the most relevant metrics to investigate the effectiveness of a classifier when searching for rare, but pathological, samples. Given N as the total number of samples to classify, TP (True Positive) the number of samples correctly classified as positive, TN (True Negative) the number of samples correctly classified as negative, FP (False Positive) the number of samples incorrectly classified as positive, and FN (False Negative) the number of samples incorrectly classified as negative, the accuracy, recall, precision, and F1-score are calculated as:
$$\text{Accuracy} = \frac{TP + TN}{N}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
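For reference, the four metrics can be computed directly from the confusion counts, as in the short sketch below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision and F1-score from the confusion counts."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Toy fold: 8 true positives, 9 true negatives, 2 false positives, 1 false negative
scores = classification_metrics(tp=8, tn=9, fp=2, fn=1)
```

In a cross-validation setting, the function would be applied to the counts of each fold and the resulting scores averaged across folds.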

V. RESULTS
In this research, we evaluated the performance of 3 models with attention, as described in Section III, in 3 very different classification problems over EEG, with the aim of evaluating to what extent similar architectures with attention can classify different kinds of EEG patterns. Tab. III reports all results in terms of accuracy, recall, precision and F1-score, for every model and every dataset. First, we used the models to identify those EEG segments which included any abnormal spike, K-complex or any other pathological waveform in the TUH EEG Abnormal dataset. Here, the best accuracy was scored by the baseline LSTM model with 79.18%, while the best F1-score was achieved by the LSTM+Att model with 78.88%. Most models provided accuracy and F1-score values above 75%, except for the CNNs model, which reached lower values (around 68%). In the TUH EEG Artifact dataset, we aimed to detect EEG segments where any kind of physiological or instrumental artifact was present. Again, LSTM and LSTM+Att obtained the best results. However, from Tab. III, we noticed that InstaGATs and CNNs+Att achieved significantly better results than their baseline models without attention. Finally, in TUH EEG Seizure, we reached the best accuracy (96.98%) and F1-score (94.71%) with the CNNs+Att. However, here, all models obtained comparably high accuracies (above 95%) and F1-scores (above 92%). Note that, when comparing with the state of the art, we typically focus on accuracy, as the most common classification metric used in the literature. However, since the datasets are not fully balanced and we need to take into consideration pathological events (i.e., relevant positive samples), the F1-score has a key role in accounting for the two different kinds of error that can occur: type I, or false positives, and type II, or false negatives. Both of them are significant when dealing with pathological EEG.
However, the desirable output from any classifier is to have high rates of correct detection of positive samples (i.e., high recall), in order to identify the most abnormal EEG segments, and low rates of false detection of positive samples (i.e., high precision), in order not to mistakenly intervene on healthy regions of the brain. In Fig. 5, we also provide a representative boxplot to show the high variability (i.e., standard deviation) of the F1-scores across datasets. It reports the scores for InstaGATs (a similar tendency was observed for the other models, not included for space limitations).

VI. DISCUSSION
The application of DL models to real-world data, such as EEG recordings, still represents a challenging task. Multiple concurrent factors, like technical equipment, recording noise, and physical and emotional states, are combined together in these datasets to form complex scenarios, in which inter- and intra-subject differences [...] classification problems, we reached the state of the art with most of the models (both with and without attention). In the TUH EEG Abnormal dataset, the LSTM model achieved an accuracy of 79%, which is comparable with the models proposed in [37] (whose authors collected the dataset themselves), which provided a variable error detection level between 21% and 41%. Also, the TUH EEG Seizure dataset has been analyzed in other papers, [38] among others, with accuracy values between 79% and 92% and F1-scores between 75% and 94% (we slightly outperformed the best literature results with all our models). Finally, the TUH EEG Artifact dataset has been considered by [39]-[41]. They achieved a peak accuracy of 74.99% with a deep CNN model, 71.43% with a standard Linear Discriminant Analysis (LDA) model, and 71.1% with a CNN-based attention-enhanced architecture, respectively. Moreover, we noticed that it seems to be easier to classify different kinds of seizures, i.e., focal versus generalized, than abnormal or other pathological events. This might be due to the well-defined difference between the two classes. In fact, focal (non-specific) seizures are typically defined as short-lasting, space-limited seizure events, while generalized (non-specific) seizures spread across several brain regions (i.e., EEG sensors) at the same time and last longer. On the other hand, the positive classes in the other two datasets, i.e., TUH EEG Abnormal and TUH EEG Artifact, contain a large variety of abnormal or artifactual events that are, possibly, harder to detect accurately.
This might have made it easier for all models to classify seizures in the TUH EEG Seizure dataset with better performance metrics than the events in the remaining datasets. It might also explain why the results for TUH EEG Seizure showed acceptable (small) standard deviations of the F1-scores in cross-validation, while in the other datasets the variability is larger. In order to evaluate the impact of the attention mechanisms in each proposed model, the models were purposely simplified. Their architectures were composed of one attention layer, which encoded a model-specific attention mechanism, an LSTM layer, and a dense layer that provided the output.
This simple, but effective, common design enabled us to compare the different attention-enhanced models. In fact, each attention mechanism was designed to exploit the input features with a specific representation, i.e., in the time, frequency and spatial domains. For example, the GAT model applied attention to a spatial representation derived from the first-order neighborhood's hierarchical structure. The mechanism leveraged the local spatial embeddings obtained from the input correlation matrix and the features of the nodes in the neighborhood. The LSTM+Att model filtered out irrelevant information by applying attention in the temporal dimension. Finally, the CBAM module of CNNs+Att exploited attention for each EEG channel separately. Following this approach, those models that primarily use spatial features showed a performance increase when attention was applied (e.g., InstaGATs over GNN, and CNNs+Att over CNNs). In contrast, those architectures that rely on temporal features showed no advantage when attention was added. Interestingly, the proposed attention-enhanced models could take advantage of different EEG descriptions which can alternatively, or jointly, consider time, frequency and space (i.e., sensor locations). In fact, such domains usually provide key pieces of information to interpret or analyze brain activity corresponding to different individuals' behaviours or task performance. These considerations could prospectively provide important indications for setting up appropriate experimental protocols and data processing pipelines, depending on which individual behaviour or task has to be investigated. For instance, if interactions between specific brain regions are expected, as in many cognitive tasks [42], an architecture such as InstaGATs could identify clusters of EEG sensors that mostly account for the synchronization between those regions during the accomplishment of the cognitive task [42].
In this case, it could be possible to leverage the spatially dependent embeddings of InstaGATs to identify the most involved brain areas. Also, in the case of reaction tasks [43] (when the individual is expected to promptly react to an external stimulus), time-dependent features could be filtered out more accurately by an architecture such as LSTM+Att. We can also highlight that, despite the simple architectures, the attention mechanism allowed the models to achieve high accuracy levels across different real-world scenarios with minimal pre-processing. The latter is very often guided by experts or based on field knowledge. However, this might lead to non-reproducible results over the same dataset, depending on the analyst who performs the data analysis. Thus, reducing pre-processing could represent an important advantage compared to standard ML or to other DL methods. This work still lacks a proper investigation of larger datasets affected by artifacts, where pre-processing might be highly needed. However, it might open the way to further investigations with the aim of reducing empirically based pre-processing in big EEG datasets.

VII. CONCLUSIONS
In this paper, we aimed to evaluate the impact of different kinds of attention mechanisms when added to well-established DL models. We compared 3 architectures: the brand-new InstaGATs, the LSTM+Att [24] and a CNNs+Att [25]. We used these models to classify different kinds of normal and abnormal EEG patterns, with a further distinction between artifactual and pathological abnormalities. The EEG datasets were available online and included a large variety of spikes, seizures, instrumental artifacts and many other relevant events. Our results showed that we achieved the state of the art in all classification problems, regardless of the large variability of the datasets and the simple architecture of the proposed attention-enhanced models. We also observed that each kind of attention mechanism can enhance the identification of a specific EEG domain characterization: e.g., in TUH EEG Artifact, attention might have particularly helped InstaGATs and CNNs+Att, compared to their corresponding models without attention, to identify artifacts, which typically spread across different brain areas. More generally, we showed that, depending on how the attention mechanism is applied and where the attention layer is located in the DL model, we can alternatively leverage the information contained in the data in one of the above-mentioned domains. Interestingly, this can pave the way to further investigations on big EEG datasets and on a more robust selection of the information provided by large amounts of data, in order to relate brain activity to the individuals' behaviour or task performance. Therefore, attention represents a promising strategy to evaluate the quality of the EEG information, and its relevance, in different real-world scenarios. Moreover, attention can make it easier to parallelize the computation and, thus, to speed up the analysis of big electrophysiological (e.g., EEG) datasets.