Procedure
The study consisted of two parts. In the first part, a detailed visual analog scale (VAS) assessment was carried out. The VAS provides a simple measure of pain intensity in clinical practice, with a score of 0 indicating no pain and a score of 10 representing the most severe, intolerable pain. During scoring, amputees reported the intensity of PLP over the past month; in addition, the character of PLP was recorded as throbbing, stabbing, pressure, or electric shock pain. In the second part, EEG was recorded during the presentation of visual stimuli in a motor imagery (MI) paradigm (see below).
The MI paradigm was controlled by E-Prime 2.0 experimental software (Psychology Software Tools, Inc., Sharpsburg, PA, USA). Subjects were seated in front of a 17-inch screen at a distance of 60 cm. The subjects' hands and forearms were held still, and the index fingers were not in contact with any objects. For the MI task, subjects were asked to imagine stretching their left or right calf, without actual movement, when a left or right arrow was presented on the computer screen. There were 120 trials in total, with 60 trials of each type appearing in random order, corresponding to the left- and right-arrow visual cues. The experimental procedure of the MI task is shown in Fig. 1. At the beginning of the task, the word "Ready" appeared in the center of the screen for 3 sec, prompting subjects to start. Each trial started with a 2 sec fixation cross in the center of the screen; subjects then performed the imagery task while the arrow stimulus was presented continuously for 5 sec. After motor imagery, subjects rested for 3 sec before proceeding to the next trial, giving an inter-trial interval of 10 sec. There was a 5 min break at the end of 60 trials, and subjects were asked to stay motionless and relax during this resting period.
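For illustration, the trial structure described above can be sketched in a few lines of Python. The actual experiment was run in E-Prime 2.0; only the timings and trial counts below are taken from the text, and the function and variable names are our own.

```python
# Minimal sketch of the MI-task trial structure (E-Prime was used in the study;
# this is only an illustration of the timing described in the text).
import random

N_PER_CONDITION = 60                    # 60 left-arrow and 60 right-arrow trials
FIXATION_S, CUE_S, REST_S = 2, 5, 3     # fixation cross, arrow/imagery, rest
ITI_S = 10                              # inter-trial interval

def build_trial_sequence(seed=0):
    """Return a randomized list of 120 trials with their event timings (in sec)."""
    rng = random.Random(seed)
    cues = ["left"] * N_PER_CONDITION + ["right"] * N_PER_CONDITION
    rng.shuffle(cues)
    trials, t = [], 3                   # "Ready" screen shown for 3 s at the start
    for cue in cues:
        trials.append({
            "cue": cue,
            "fixation_onset": t,
            "arrow_onset": t + FIXATION_S,          # motor imagery begins here
            "rest_onset": t + FIXATION_S + CUE_S,
        })
        t += ITI_S                      # next trial starts after the 10-s interval
    return trials

sequence = build_trial_sequence()
print(len(sequence), sequence[0])
```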
ERP analysis
The raw EEG data contained target signals as well as noise, such as blinks, electromyography, and electrocardiography. Because data quality strongly affects the final results, we used MATLAB 2017 (MathWorks, Natick, MA, USA) and EEGLAB (version 14, La Jolla, CA, USA) to preprocess the raw EEG data and remove noise interference. Electrodes on the neck and face were excluded, and the remaining 173 electrodes were kept for further analysis. A 1 Hz high-pass filter and a 30 Hz low-pass filter were applied to preserve the frequency band of interest. EEG data were re-referenced to the whole-brain average. Epochs were extracted from −2 sec to 5 sec relative to arrow onset. Independent component analysis (ICA) was used to identify and remove artifacts such as blinks, lateral eye movements, and voltage drifts. Individual trials with excessive muscle activity (> 100 µV peak-to-peak amplitude) were excluded, and the averaged data were again referenced to the average voltage across all electrodes.
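The analysis was performed in MATLAB/EEGLAB; as a hedged illustration only, the same sequence of steps could be expressed with MNE-Python roughly as follows. The file name, ICA component count, excluded component indices, and event codes are placeholders, not values from the study.

```python
# Illustrative preprocessing pipeline in MNE-Python (the study used EEGLAB 14).
import mne

# Load the raw recording (placeholder file name).
raw = mne.io.read_raw_eeglab("subject01.set", preload=True)

# 1-30 Hz band-pass to keep the frequency band of interest.
raw.filter(l_freq=1.0, h_freq=30.0)

# Whole-brain average reference.
raw.set_eeg_reference("average")

# ICA to identify and remove blinks, lateral eye movements, and voltage drifts.
ica = mne.preprocessing.ICA(n_components=20, random_state=97)
ica.fit(raw)
ica.exclude = [0, 1]        # artifact components, chosen by visual inspection
ica.apply(raw)

# Epoch from -2 s to 5 s around arrow onsets; drop trials exceeding 100 µV peak-to-peak.
events, event_id = mne.events_from_annotations(raw)
epochs = mne.Epochs(raw, events, event_id=event_id,
                    tmin=-2.0, tmax=5.0, baseline=(None, 0),
                    reject=dict(eeg=100e-6), preload=True)
```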
Electrodes C3 and C4 (overlying the sensorimotor area) were selected to analyze the cerebral activation of the ERP. ERPs have two important properties: constant latency and constant waveform. These properties make it possible to superimpose (average) the EEG waveforms time-locked to the same event. Because ERPs evoked by the same stimulus share a constant waveform and latency, averaging across trials enhances the event-related waveform, whereas stimulus-independent potential activity cancels out during the superposition process.
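Continuing the illustrative pipeline above, the superposition (trial averaging) at C3 and C4 amounts to a mean over epochs; channel names are assumed to match the montage.

```python
# Trial averaging ("superposition") at C3/C4, using the `epochs` object from the
# preprocessing sketch above.
import numpy as np

data = epochs.get_data(picks=["C3", "C4"])   # shape: (n_trials, 2, n_times)
erp = data.mean(axis=0)                      # stimulus-locked activity adds up,
                                             # stimulus-independent activity averages out
print(erp.shape)                             # (2 channels, n_times)
```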
Model
In this study, the model employed the binary cross-entropy loss function during training. The binary cross-entropy loss function is widely used in deep-learning classification tasks. It has the advantages of simple computation and fast convergence, and it performs exceptionally well for binary classification problems. The definition of the binary cross-entropy loss function is as follows:
$$Loss=-\frac{1}{N}\sum_{i=1}^{N}\left[{y}_{i}\cdot \log \left(p\left({y}_{i}\right)\right)+\left(1-{y}_{i}\right)\cdot \log \left(1-p\left({y}_{i}\right)\right)\right]$$
Here, \({y}_{i}\) represents the true target value (i.e., the actual class label), and \(p\left({y}_{i}\right)\) denotes the probability predicted by the model. Compared to other loss functions such as the mean squared error, the binary cross-entropy loss is better at limiting overfitting when dealing with imbalanced classes and small-to-medium-sized datasets, making it well suited to the binary classification of EEG signals in this article. Its computation is relatively simple, which accelerates training and conserves computational resources. When combined with gradient descent, the binary cross-entropy loss converges more quickly than some other loss functions. This is particularly important in EEG signal classification, where data volume is often substantial; rapid convergence allows efficient training within limited computational resources. At the same time, relative to the general (multi-class) cross-entropy loss, the binary cross-entropy loss offers greater numerical stability, so model parameter updates remain stable in the presence of fluctuating EEG signals.
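For illustration, a minimal PyTorch computation of this loss is shown below; the tensors are dummy values, not data from the study.

```python
# Binary cross-entropy loss on toy predictions and labels.
import torch
import torch.nn as nn

criterion = nn.BCELoss()                       # expects probabilities in (0, 1)
p = torch.tensor([0.9, 0.2, 0.7])              # predicted probabilities p(y_i)
y = torch.tensor([1.0, 0.0, 1.0])              # true labels y_i
loss = criterion(p, y)                         # -1/N * sum of the two log terms above
print(loss.item())
```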
The EEG signal classification model is mainly composed of a patch and position embedding module, a transformer encoder module, and a classifier module. The transformer's multi-head attention mechanism gives it a strong ability to extract long-range feature dependencies and can effectively capture the periodic, global features of EEG signals. The overall network structure is shown in Fig. 3.
Patch and position embedding: This article considers two different attention mechanisms: attention over the temporal sequence and attention over the spatial sequence. The former concentrates attention on valuable features at specific time points, while the latter emphasizes features in specific channels (electrode space). For the temporal-sequence experiment, the model input is divided along the time dimension into k non-overlapping time slices, each of which is flattened into a one-dimensional vector (with k equal to the number of channels, so that the inputs to the attention module have the same shape in both settings). In contrast, for the spatial-sequence experiment, the model input is split by electrode channel and then linearly mapped to the model dimension. By comparing the results of the two experiments, the study can verify whether EEG signal features predominantly reside within the same electrode channel at different time points or within different electrode channels at the same time point.
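A hedged sketch of the two tokenization schemes is given below. The channel count (173) follows the preprocessing above and the embedding dimension (384) follows Table 2; the number of time points, class names, and layer choices are illustrative assumptions, not the study's exact implementation.

```python
# Sketch of temporal vs. spatial tokenization with learned position embeddings.
import torch
import torch.nn as nn

class PatchPositionEmbedding(nn.Module):
    def __init__(self, n_channels, n_times, d_model=384, mode="temporal"):
        super().__init__()
        self.mode = mode
        if mode == "temporal":
            # k non-overlapping time slices (k = number of channels), each flattened.
            self.k = n_channels
            slice_len = n_times // self.k
            self.proj = nn.Linear(n_channels * slice_len, d_model)
            n_tokens = self.k
        else:
            # one token per electrode channel, linearly mapped to d_model.
            self.proj = nn.Linear(n_times, d_model)
            n_tokens = n_channels
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, d_model))

    def forward(self, x):                       # x: (batch, channels, time)
        if self.mode == "temporal":
            b, c, t = x.shape
            x = x[:, :, : (t // self.k) * self.k]
            x = x.reshape(b, c, self.k, -1)                    # split time into k slices
            x = x.permute(0, 2, 1, 3).reshape(b, self.k, -1)   # flatten each slice
        # spatial mode: tokens are simply the channel rows of x
        return self.proj(x) + self.pos

tokens = PatchPositionEmbedding(173, 1750, mode="spatial")(torch.randn(4, 173, 1750))
print(tokens.shape)                             # (4, 173, 384)
```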
Transformer encoder
The self-attention module within the transformer encoder's architecture employs a query-key-value (QKV) pattern and executes computations as shown in the formula.
$$\begin{array}{c}Q=X{W}_{Q}\\ K=X{W}_{K}\\ V=X{W}_{V}\end{array}$$
The learnable matrices \({W}_{Q}\), \({W}_{K}\), and \({W}_{V}\) are updated toward higher accuracy as the model iterates, so Q, K, and V come to represent different dimensions of the information within the model. The attention module maps a query and a set of key-value pairs to an output, where the output is computed as a weighted sum of the values. The self-attention module uses the scaled dot-product as its attention-scoring function, and the normalized attention weights can be written as:
$$\stackrel{\sim}{A}=Softmax\left(\frac{A}{\sqrt{{d}_{k}}}\right),\quad A=Q{K}^{T}$$
The scaling factor \({d}_{k}\), the dimension of the key vectors, prevents the values fed into the softmax function from becoming too large, which would drive the partial derivatives toward zero. To obtain the final attention output, the normalized weights are then multiplied by the value matrix V, giving the output of the \({i}^{th}\) head, as shown in the formula:
$$Hea{d}_{i}=Attention\left(Q,K,V\right)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{K}}}\right)V=\stackrel{\sim}{A}V$$
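A minimal sketch of this single-head computation is shown below, under assumed shapes (a tokens × d_model input and a projection width of 64); it is an illustration of the formulas, not the study's code.

```python
# Scaled dot-product attention for a single head.
import torch
import torch.nn.functional as F

def single_head_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.size(-1)
    A = Q @ K.transpose(-2, -1)                    # raw scores, A = Q K^T
    A_tilde = F.softmax(A / d_k ** 0.5, dim=-1)    # scaled, normalized weights
    return A_tilde @ V                             # Head_i = softmax(QK^T / sqrt(d_k)) V

X = torch.randn(173, 384)                          # (tokens, d_model), illustrative
W = [torch.randn(384, 64) for _ in range(3)]
out = single_head_attention(X, *W)
print(out.shape)                                   # (173, 64)
```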
The above describes the calculation process of the single-head attention module. However, in this paper, we utilized the Multi-Head self-attention (MHSA) module, which incorporates multiple attention heads. In the MHSA module, the outputs of different heads are combined using the concatenation function. This allows us to obtain combined attention weights that integrate diverse subspace information. Simultaneously, it is crucial to combine the residual module and the Multilayer perceptron (MLP) module to address the challenges of deep network degradation and difficulties in training while integrating the subtle feature information among tokens. The specific formula is as follows:
$${Z}_{i}^{{\prime }}=MHSA\left(LN\left({Z}_{i-1}\right)\right)+{Z}_{i-1},\quad i=1,\dots ,L$$
$${Z}_{i}=MLP\left(LN\left({Z}_{i}^{{\prime }}\right)\right)+{Z}_{i}^{{\prime }},\quad i=1,\dots ,L$$
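A hedged sketch of one encoder layer implementing these two equations (pre-LayerNorm, MHSA and MLP with residual connections) follows. The head count (2) and embedding dimension (384) follow Table 2; the MLP width and the use of GELU are assumptions.

```python
# One transformer encoder layer: Z' = MHSA(LN(Z)) + Z, then Z = MLP(LN(Z')) + Z'.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=384, n_heads=2, mlp_dim=768):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, d_model))

    def forward(self, z):
        h = self.ln1(z)
        z = self.mhsa(h, h, h, need_weights=False)[0] + z   # MHSA + residual
        z = self.mlp(self.ln2(z)) + z                        # MLP + residual
        return z

z = EncoderLayer()(torch.randn(4, 173, 384))
print(z.shape)                                               # (4, 173, 384)
```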
Classifier
Traditionally, the MLP classifier has been widely used and performs well on simpler tasks. However, when classifying more complex data, an MLP may require a more intricate architecture to enhance its performance. Adding hidden layers and applying regularization techniques can improve both the model's generalization ability and its practical performance, and the present study adopts similar approaches.
To begin, a fully connected layer is added to the neural network architecture, accompanied by the introduction of an activation function. This inclusion aims to enhance the network's representational capacity and optimize the information flow within the model. By incorporating multiple hidden layers on top of the linear layers, the model becomes capable of learning and abstracting complex combinations of input features, ultimately leading to improved network performance. This design empowers the model to better capture intricate patterns in the data and make predictions regarding the target categories with greater accuracy, thereby achieving superior outcomes on the test set.
To combat overfitting, we incorporated the Dropout regularization technique into our model. This technique randomly omits some neurons or their connections during training, encouraging the model to rely on a broader range of features rather than leaning excessively on any single attribute. Consequently, the neural network is better able to balance different features, resulting in improved generalization performance and overall robustness of the classifier.
Dropout is highly effective during training: it helps abate overfitting, mitigates bias toward specific features, and augments the model's generalization ability. Within the family of regularization techniques, L1 and L2 are both viable choices. By design, L1 regularization induces sparsity in the solution space, driving many weights to zero, whereas L2 regularization shrinks weights uniformly. For this study, we adopted L1 regularization as our primary method. The final data flow through the classifier is as follows:
$${Y}^{{\prime }}=\text{Softmax}\left(L1\left(\text{Relu}\left(\text{Linear}\left(X\right)\right)\right)\right)$$
$${Y}^{{\prime \prime }}=\text{Softmax}\left(L1\left(\text{Relu}\left(\text{Linear}\left({Y}^{{\prime }}\right)\right)\right)\right)$$
$$Y=\text{Linear}\left({Y}^{{\prime \prime }}\right)$$
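A simplified sketch of a classifier head with this structure (linear layers, ReLU activations, Dropout, and an L1 weight penalty added to the loss) is given below. The layer widths, dropout rate, and L1 coefficient are illustrative assumptions, and the intermediate softmax stages of the equations are folded into the final binary cross-entropy loss here.

```python
# Classifier head sketch: Linear -> ReLU -> Dropout blocks with an L1 penalty.
import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, d_model=384, hidden=128, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1),                 # single logit for binary output
        )

    def forward(self, x):
        return self.net(x)

    def l1_penalty(self, lam=1e-4):
        # L1 regularization term added to the training loss to encourage sparsity.
        return lam * sum(p.abs().sum() for p in self.parameters())

clf = Classifier()
logit = clf(torch.randn(4, 384)).squeeze(-1)
target = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = nn.functional.binary_cross_entropy_with_logits(logit, target) + clf.l1_penalty()
```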
Evaluation
Our model was developed with the open-source deep learning framework PyTorch 1.11.0, and the experimental equipment was as follows: the CPU was an Intel(R) Xeon(R) Platinum 8255C, the GPU a single V100-SXM2-32GB card, the CUDA version 11.3, and the programming language Python 3.8 (Ubuntu 20.04).
We have introduced a new metric called "adjustment time" as an indicator of a model's convergence speed. Adjustment time refers to the number of epochs required during model training for the model's performance to reach 95% of the difference between its initial and final convergence performance. This metric reflects the model's ability in terms of convergence speed and training efficiency. The concept of adjustment time assists in a deeper evaluation and comparison of the speed of the model training process. A shorter adjustment time means the model achieves its potential performance more quickly within a limited number of epochs. Thus, with this metric, researchers can intuitively assess the performance differences between different models. Moreover, adjustment time complements other commonly used evaluation metrics such as accuracy and loss, providing researchers with a more comprehensive view of the model's performance during training. In cases where model performance is similar, it allows for greater savings in computational resources and time.
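As an illustration of how this metric could be computed from a per-epoch performance curve (e.g., validation accuracy), a small helper is sketched below. Interpreting the "initial" and "final convergence" values as the first and last entries of the curve is our assumption.

```python
# Adjustment time: epochs needed to cover 95% of the gap between the initial
# and final (converged) performance values.
def adjustment_time(curve, fraction=0.95):
    initial, final = curve[0], curve[-1]
    threshold = initial + fraction * (final - initial)
    for epoch, value in enumerate(curve, start=1):
        reached = value >= threshold if final >= initial else value <= threshold
        if reached:
            return epoch
    return len(curve)

print(adjustment_time([0.50, 0.62, 0.71, 0.78, 0.80, 0.81, 0.81]))  # -> 5
```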
For the comparison networks, we chose the MLP model and the Long Short-Term Memory (LSTM) model, both of which have proven effective in a range of tasks such as feature extraction and classification of EEG signals [46]. The MLP employs activation functions (such as ReLU) for the non-linear transformation of input data through its neurons. The number of hidden-layer nodes and the number of layers can be increased to enhance representation learning, enabling good performance when confronted with complex functional mappings.
Meanwhile, the LSTM (a specialized RNN architecture) acquires temporal dependency information from sequential data. Unlike the MLP, the LSTM can retain information over extended periods and process data with uneven time intervals. It also overcomes the drawbacks of conventional RNN models by effectively mitigating the vanishing-gradient problem, which allows the model to comprehend long sequences and improve classification accuracy accordingly.
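For illustration, the two baselines could be defined in PyTorch as sketched below, using the hyperparameters listed in Table 2 (MLP: hidden size 64, depth 3; LSTM: hidden size 128, 2 layers). The input shapes and single-logit output heads are assumptions.

```python
# Baseline model sketches matching the hyperparameters in Table 2.
import torch
import torch.nn as nn

class MLPBaseline(nn.Module):
    def __init__(self, n_inputs, hidden=64, depth=3):
        super().__init__()
        layers, width = [], n_inputs
        for _ in range(depth):
            layers += [nn.Linear(width, hidden), nn.ReLU()]
            width = hidden
        layers.append(nn.Linear(hidden, 1))          # single logit for binary output
        self.net = nn.Sequential(*layers)

    def forward(self, x):                            # x: (batch, n_inputs)
        return self.net(x)

class LSTMBaseline(nn.Module):
    def __init__(self, n_channels, hidden=128, depth=2):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, num_layers=depth, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                            # x: (batch, time, channels)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])                 # classify from the last time step
```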
Given the large size of the EEG signal files (120 MB per file), training conventional network models on them is taxing on GPU memory. To address this issue, our models were designed to be moderately lightweight to enable smoother training. Table 2 presents the relevant model parameters and hyperparameters.
Table 2
Parameters and hyperparameter settings for the models. This table presents the various hyperparameters and their selected values used in the RET model and the compared models
| Model | Hyperparameter | Value |
| --- | --- | --- |
| All models | Initial learning rate | 0.05 |
| All models | Epochs | 40 |
| All models | Batch size | 4 |
| RET | Embedding dimension | 384 |
| RET | Number of heads | 2 |
| RET | Number of layers | 2 |
| MLP | Hidden size | 64 |
| MLP | Depth | 3 |
| LSTM | Hidden size | 128 |
| LSTM | Depth | 2 |