Brain-Computer Interface: EEG-Based Imagined Word Prediction Using a Convolutional Neural Network with Visual Stimuli for Speech Disability

Brain-Computer Interface (BCI) is one of the fastest-growing technological trends and finds applications in the healthcare sector. In this work, 16 Electroencephalography (EEG) electrodes placed according to the 10-20 electrode system are used to acquire the EEG signals. A BCI with EEG-based imagined word prediction using a Convolutional Neural Network (CNN) is modeled and trained to recognize imagined words from the EEG brain signal, where the CNN models AlexNet and GoogLeNet recognize up to ten words imagined in response to visual stimuli, including up, down, right, and left. The performance metrics are improved by applying the Morlet continuous wavelet transform at the pre-processing stage, with seven extracted features: mean, standard deviation, skewness, kurtosis, band power, root mean square, and Shannon entropy. Based on the testing, the AlexNet transfer learning model performed better than the GoogLeNet transfer learning model, achieving an accuracy of 90.3% and a recall, precision, and F1 score of 91.4%, 90%, and 90.7% respectively for seven extracted features. However, when the number of extracted features was reduced from seven to four, the performance metrics decreased to 83.8%, 84.4%, 82.9%, and 83.6% respectively. This high accuracy paves the way for future work on cross-participant analysis, testing with a larger number of participants, and enhancing the deep learning neural networks so that the developed system is suitable for EEG-based mobile applications that help identify the words that speech-disabled persons imagine uttering.

Background
Brain-Computer Interface (BCI) can be used as a device to understand the ongoing processes in the brain of patients or disabled persons [1]. A physical or mental disability affects a person's living style; most disabled persons are unable to work and earn a living on their own. Those with physical disabilities are constrained or limited in their movement and require assistance to perform their daily needs.
In recent years, many organizations and societies, such as the World Health Organization, have raised public concern about the problems faced by disabled persons. They expect the public to be more aware of disabled people and demand more facilities from the government to help them in their daily life [2], for example reserved disabled parking, ramps for disabled persons, and road guidance markers for blind people.
An Electroencephalography (EEG) signal can be obtained by placing electrodes on the scalp, which non-invasively capture the electrical activity of the brain, unlike electrodes implanted into the brain and other invasive methods [3]. Since human responses can be linked to cortical activities, the EEG can act as a source for classifying the imagined word. The word imagined by a person can be recognized by analyzing multiple electrodes at the same time: when multiple electrodes receive spikes in the electrical signal, the bundled behavior of the underlying human activity can be modeled accordingly [4].
Researchers [3] present recent developments in channel selection and evaluation algorithms for processing EEG signals in applications such as early seizure detection, motor imagery, sleep state analysis, and emotion and mental activity classification, covering five different techniques for channel selection: the filtering method, the wrapper method, the embedded technique, the hybrid method, and the human-selection method [5]. The advantages and disadvantages of each approach were discussed, and the use of these techniques in the applications mentioned was presented [6]. The study discusses the use of four of the techniques across all the applications. Focusing on channel selection algorithms for motor imagery, the filtering technique is commonly used in many studies because it is able to improve the accuracy of the BCI [7][8][9]. Other techniques, such as the wrapper and embedded techniques, have also yielded positive results. The study provides background knowledge on algorithms that can be deployed to select EEG channels and to process and classify the data received. Further work can be done to determine the channel selection technique that produces the highest accuracy, which can then be used in applications involving visual and auditory evoked memories. The channel selection methods are based on feature extraction of EEG data, and therefore the techniques have been used extensively in motor imagery applications [10][11][12].
Many classification algorithms are being applied in BCI technology. Researchers [13][14][15][16] reviewed the modern classification algorithms for data produced by an EEG device. The algorithms are grouped into four main families, namely adaptive classifiers, matrix and tensor classifiers, transfer learning classifiers, and deep learning classifiers [17][18][19]. This research discusses the major issues faced in classifying EEG signals. The working principle, advantages, and disadvantages of each classifier are explained thoroughly. In addition, the research analyses the properties of each classifier, i.e. whether it is stable or unstable, dynamic or static, and regularized or not [20][21][22][23]. Similar to the researchers' previous work [24][25][26], this research also covers the suitability of each classifier per application. The research concludes with future possibilities for the classifiers discussed.
One of the gaps that still needs to be bridged is increasing the accuracy of EEG feature classifiers [27].
There is a need to investigate the best-performing classifier for each task and how the classifiers can be used collectively. The number of training trials required to achieve accurate results can also be reduced [28][29].
The perks of using EEG are that it is inexpensive compared to other medical devices, non-invasive, and portable. Therefore, EEG is widely used to study neuroplasticity changes in many areas. EEG can also be used for early prediction, and more research needs to be carried out considering the types of lesions, the time taken for rehabilitation, and larger sample sizes of patients [30].

Results
In this section, two trained models are compared and the model with the better performance is selected, where performance is measured in terms of accuracy achieved and training duration. In this experiment, the two trained models compared are AlexNet and GoogLeNet. These two models were chosen for transfer learning because AlexNet requires a shorter training duration but typically achieves only average accuracy, whereas GoogLeNet requires a longer training duration but is more likely to achieve higher accuracy. The tabulated data are shown in Table 1, where the training for GoogLeNet is set with an epoch count of 160, as GoogLeNet requires a longer training duration, while AlexNet is set with only 80 epochs.

Training Model Evaluation Metrics
The designed models' performance needs to be evaluated based on the results obtained. The models were evaluated not only on accuracy but also on the True Positive Rate (Sensitivity or Recall), precision, and F1 score.
Recall is defined as the ratio of true positives to the sum of true positives and false negatives and is given as,

Recall = TruePositive / (TruePositive + FalseNegative)

Precision is defined as the ratio of true positives to the sum of true positives and false positives and is given as,

Precision = TruePositive / (TruePositive + FalsePositive)

F1 score is defined as the harmonic mean of the other two metrics, namely precision and recall, and is given as,

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
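These three metrics can be computed directly from confusion-matrix counts; a minimal sketch (the example counts are illustrative, not taken from the paper's results):

```python
def classification_metrics(tp, fp, fn):
    # Recall = TP / (TP + FN), Precision = TP / (TP + FP),
    # F1 = harmonic mean of precision and recall.
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Illustrative counts: 90 true positives, 10 false positives, 8 false negatives
recall, precision, f1 = classification_metrics(tp=90, fp=10, fn=8)
```

Note that F1 simplifies to 2·TP / (2·TP + FP + FN), which is why it sits between precision and recall.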

Initial Learning Rate Analysis
In this analysis, the Initial Learning Rate (ILR) is varied to select the value that achieves the best accuracy. The ILR is set to three different values to train the CNN models, and a comparison is made to select the best one based on the maximum achievable accuracy and the maximum difference between validation and training loss. It can be inferred from Table 2 that when the ILR is 0.0005 or 0.0003, the maximum achievable accuracy is only up to 66.43%, while when the ILR is 0.0001, the maximum achievable accuracy is greater than 69.68%. This allows an accuracy of more than 69.68% to be achieved during testing. At the same time, the maximum validation-training loss difference is lowest for an ILR of 0.0001. Hence, for the CNN models, an ILR of 0.0001 is selected based on this analysis.
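The selection rule described above can be sketched as follows; the accuracies follow Table 2, while the loss-gap values are illustrative assumptions:

```python
# Hypothetical summary of Table 2: maximum achievable accuracy (%) and
# maximum validation-training loss difference per ILR (loss gaps assumed).
ilr_results = {
    0.0005: {"max_acc": 66.43, "loss_gap": 0.31},
    0.0003: {"max_acc": 66.43, "loss_gap": 0.27},
    0.0001: {"max_acc": 69.68, "loss_gap": 0.12},
}

# Pick the ILR with the highest achievable accuracy, breaking ties by
# the smallest validation-training loss difference.
best_ilr = min(
    ilr_results,
    key=lambda ilr: (-ilr_results[ilr]["max_acc"], ilr_results[ilr]["loss_gap"]),
)
```

With these numbers the rule selects 0.0001, matching the choice made for the CNN training.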

Training Model Duration Test
This test evaluates the time taken for the models to be trained, and the data collected are tabulated in Table 3. From the results in Figure 1 and Figure 2, it is clear that even though GoogLeNet was trained for a longer duration with a higher epoch count, AlexNet still achieved higher accuracy. This is because AlexNet has a larger input size than GoogLeNet, and the structure of AlexNet is also better suited to extracting the interrelated features of the EEG signals and scalograms.

Experimental Test Results of Models using 7 extracted features
In this experiment, 80% of the data collected was used for training and 20% for testing and validation of the model. Each participant was tested 10 times, and the averaged test results for model 1 are presented in Table 4, while those for model 2 are presented in Table 5.
Thus, the results of each trained model for the classification of EEG signals using 7 extracted features are recorded in Tables 4 and 5, which include the Recall, Precision, Accuracy, and F1 score of each trained model for the different epoch counts used.

Experimental Test Results of Models using 4 extracted features
The experiment was then repeated with 4 extracted features using the same 80:20 split, with each participant again tested 10 times; the averaged test results for model 1 are presented in Table 6, while those for model 2 are presented in Table 7.
Thus, the results of each trained model for the classification of EEG signals using 4 extracted features are recorded in Tables 6 and 7, which include the Recall, Precision, Accuracy, and F1 score of each trained model for the different epoch counts used.

Discussion
In this research work, BCI-EEG based word prediction using a CNN is demonstrated. First, the ILR was analyzed at three different values: 0.0001, 0.0003, and 0.0005. Based on the maximum achievable accuracy and the maximum validation-training loss difference, an ILR of 0.0001 was selected, as it allows room to achieve an accuracy greater than 69.68%.
Next, the two training models were trained and tested. Based on the training duration required, AlexNet was selected with 80 epochs, as it needed less training time yet produced higher accuracy.
Further, the experimental test results for 7 and 4 extracted features were analyzed for the two CNN models, and the results are tabulated in Table 8. From Table 8, it can be inferred that model 1, AlexNet, outperforms on all the performance evaluation metrics for both 7 and 4 extracted features. The highest accuracy achieved was 90.3%, by AlexNet with 7 extracted features. It can also be inferred that when the number of extracted features was decreased from 7 to 4, all the performance metrics decreased in both trained models, by 3.3% on average. Table 9 shows the comparison of the developed models with the existing models presented in the literature [31], [32]. It can be observed that the accuracy of the developed model 1 is improved by 3.79%, 6.71%, and 27.90% compared with model 2 (GoogLeNet), [31], and [32] respectively. It can be noted that even though the number of extracted features was reduced, the developed models were still analyzed and their performance metrics evaluated.
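The improvement percentages reported in Table 9 are relative gains over each baseline accuracy; a small sketch (the 87.0% baseline used here is a hypothetical value for illustration, not a figure stated in the paper):

```python
def pct_improvement(new_acc, baseline_acc):
    # Relative accuracy gain over the baseline, expressed in percent.
    return (new_acc - baseline_acc) / baseline_acc * 100

# 90.3% (AlexNet, 7 features) against a hypothetical 87.0% baseline
gain = pct_improvement(90.3, 87.0)
```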

Conclusion
Thus, the AlexNet transfer learning model is selected as the best model compared to GoogLeNet, as it achieved an accuracy of 90.3% with the final training options of 80 epochs, a batch size of 64, the scalogram pre-processing method, an 80:20 training and validation split, and an initial learning rate of 0.0001. This high accuracy paves the way for future work on cross-participant analysis, involving a larger number of participants for testing, and enhancing the deep learning neural networks so that the developed system is suitable for EEG-based mobile applications.

EEG Recording Device
Although a lot of research work has been done in the area of human speech detection and recognition modeling, it is always challenging to use a wireless EEG device to acquire EEG signal data from the human brain and transmit it wirelessly to the computer interface on which a speech recognition model is created. There is always difficulty in acquiring data from wireless devices due to various inferior signal conditions, but wireless devices offer benefits such as easy connection, easy transmission of data, low price, and easy mounting on the head.
In this research work, the Epoc Signal Server is used extensively to pass the recorded raw EEG data to Simulink for mathematical signal processing, feature extraction, and classification ranking. In addition, the Emotiv Control Panel is used to check the electrode connectivity strength before recording and training start. The Emotiv Testbench and Emotiv Brain Activity Map are used extensively for visual analysis, hand in hand with the Simulink-recorded data, to provide a better strategy for analyzing the data efficiently.
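The seven features named in the abstract (mean, standard deviation, skewness, kurtosis, band power, RMS, and Shannon entropy) can be computed per channel as in this numpy sketch; the 8-30 Hz band limits and the 32-bin histogram used for the entropy estimate are illustrative assumptions, since the paper does not specify them:

```python
import numpy as np

def extract_features(x, fs=128, band=(8.0, 30.0)):
    # Seven statistical features for one EEG channel.
    mu, sigma = np.mean(x), np.std(x)
    z = (x - mu) / sigma
    skewness = np.mean(z ** 3)          # third standardized moment
    kurtosis = np.mean(z ** 4)          # fourth standardized moment
    # Band power: sum of the periodogram within the chosen frequency band
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2 / (fs * len(x))
    bandpower = psd[(freqs >= band[0]) & (freqs <= band[1])].sum()
    rms = np.sqrt(np.mean(x ** 2))
    # Shannon entropy of a normalised amplitude histogram
    counts, _ = np.histogram(x, bins=32)
    p = counts[counts > 0] / counts.sum()
    entropy = -np.sum(p * np.log2(p))
    return mu, sigma, skewness, kurtosis, bandpower, rms, entropy

rng = np.random.default_rng(0)
features = extract_features(rng.standard_normal(256))
```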
The EEG collects the data emitted by the cerebral cortex of the brain. The EEG device has 16 electrodes, which are placed according to the 10-20 system. The device communicates wirelessly with the laptop, and therefore additional components are required to support its functions. The EEG device used is the Emotiv EPOC+ [12], as shown in Figure 3.
Data Acquisition
The whole system starts with the EEG device. There are 16 sensors on the EEG device, and the sensor locations are fixed using the 10-20 system. Two of the 16 sensors are used as reference points; these two sensors are placed behind the ears on the mastoid bones. The 16 sensors are located at Fp1, Fp2, F3, F4, F7, F8, C3, C4, T3, T4, P3, P4, T5, T6, O1, and O2. The data collected by each of these sensors is treated as a separate channel. The sensors measure the potential difference of the electrical signals fired by the neurons in the brain, in microvolts.
Data are collected by the mobile EEG device at a sampling rate of 128 Hz. This sampling rate produces fewer samples for each recording, so the computational power and time required to train and test the classifiers are reduced drastically. The recording of each activity contains 256 samples. Some variables and arrays were initialized, such as the 14 selected EEG channels. The 14 channels were selected based on the construction of the Emotiv EPOC EEG device used in this research work to read data from these channels. The number of participants was then checked and looped over to open all the raw EEG files and analyze the EEG signals channel by channel.
The computing device receives data from the EEG device via a Bluetooth connection. The data received has 25 channels, i.e. nine additional channels. These nine channels contain other data such as the timestamp, counter, marker signal, synchronization signal, and gyroscope values. Data from the EEG device are collected every 0.0078125 seconds, based on the sampling rate of 128 Hz. At the end of a two-second recording, 256 samples are collected and tabulated in matrix form; the dimensions of the matrix are 256 × 25. A high-pass filter is then used to reduce the effects of DC offset and to filter out low-frequency noise that may exist in the signal. The high-pass filter has a roll-off frequency of 5 Hz.
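The acquisition arithmetic and the high-pass step can be sketched as follows; the first-order RC filter is a simple stand-in for the 5 Hz roll-off filter described above (the filter order is an assumption, as the paper does not state it):

```python
import numpy as np

fs = 128                    # sampling rate in Hz
dt = 1 / fs                 # 0.0078125 s between samples
n_samples = 2 * fs          # two-second recording -> 256 samples
n_channels = 25             # 16 EEG channels + 9 auxiliary channels
recording = np.zeros((n_samples, n_channels))   # 256 x 25 matrix

def highpass(x, fc=5.0, fs=128):
    # First-order RC high-pass filter with cutoff fc; attenuates DC offset
    # and low-frequency drift while passing the EEG band.
    rc = 1.0 / (2 * np.pi * fc)
    alpha = rc / (rc + 1.0 / fs)
    y = np.empty_like(x)
    y[0] = x[0]
    for i in range(1, len(x)):
        y[i] = alpha * (y[i - 1] + x[i] - x[i - 1])
    return y

filtered = highpass(np.ones(n_samples))  # a pure DC signal decays away
```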
Data are saved on the computing device in matrix format, allowing the files to be accessed in MATLAB during classifier training.

Data Acquisition Protocol
First, the participants were briefed on the experiment to be carried out and informed that the acquired data would be used purely for this research work, following the code of ethics of the Anna University guidelines. The data acquisition protocol was developed and explained to the participants, who all agreed to the instructions for the recording. A total of 10 participants were involved, each tested separately with the two different CNN models. The participants were asked to imagine 10 words in sequential order, and the signal for each imagined word was recorded with a 5-second gap in between; a stopwatch placed in front of them was used for timing. The signal for each imagined word was recorded for a duration of 3 seconds, followed by a silent gap of 5 seconds. Thus, the 10 imagined words were recorded over a period of 75 seconds for each participant.
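The 75-second session length follows from the protocol, since a gap follows every word except the last:

```python
n_words = 10
record_s = 3    # seconds of recording per imagined word
gap_s = 5       # silent gap after each word

# 10 recordings of 3 s plus the 9 gaps of 5 s between them
total_s = n_words * record_s + (n_words - 1) * gap_s
```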

Experiment
A Convolutional Neural Network (CNN) is used to predict the imagined word, as the Continuous Wavelet Transform (CWT) converts the EEG signals into scalogram images, and a CNN is able to capture the temporal and spatial dependencies of an image when the relevant filter is applied. The method is shown in Figure 4 and explained in sequence as follows.

Pre-processing using Morlet Continuous Wavelet Transform
To perform the pre-processing of the dataset and the normalization of the label values, the dataset size and format were first checked by loading the dataset. The CWT has a window function that handles the mother wavelet, where the window is shifted and scaled during the conversion. This allows windowing over a longer time interval at low frequencies and over a short time interval at high frequencies. Moreover, with the capability of splitting the window into various sizes, it provides a highly effective analysis of the low- and high-frequency information of the non-stationary EEG signal. The spectral analysis was done using the Morlet wavelet CWT (MCWT), as it is well suited to non-stationary EEG signals. The MCWT produced one scalogram per channel; as each trial has 14 channels, the 14 scalograms were combined into one image, which is used as the input to the CNN.
For example, the location [1, 1, 1:8064] represents the data of the first electrode in the first of the 40 trials, where the 8064 entries are the samples recorded: with a sampling frequency of 128 Hz and a data length of 63 seconds, 128 × 63 = 8064 samples. After obtaining the data from one electrode, the MCWT is applied with a sampling frequency of 128 Hz to convert the data into a scalogram.
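The Morlet CWT step can be sketched in numpy as a convolution of the signal with scaled complex Morlet wavelets; the centre-frequency parameter w0 = 6 and the 1-40 Hz analysis grid are illustrative assumptions, not values stated in the paper:

```python
import numpy as np

def morlet_cwt(x, fs=128, freqs=np.arange(1, 41), w0=6.0):
    # One row of |coefficients| per analysis frequency -> the scalogram.
    scalogram = np.empty((len(freqs), len(x)))
    for i, f in enumerate(freqs):
        scale = w0 * fs / (2 * np.pi * f)    # scale whose centre frequency is f
        m = int(min(10 * scale, len(x)))     # wavelet support in samples
        t = (np.arange(m) - m // 2) / scale
        wavelet = np.exp(1j * w0 * t) * np.exp(-t ** 2 / 2) / np.sqrt(scale)
        scalogram[i] = np.abs(np.convolve(x, wavelet, mode="same"))
    return scalogram

fs = 128
t = np.arange(2 * fs) / fs                   # two seconds of signal
x = np.sin(2 * np.pi * 10 * t)               # 10 Hz test tone
S = morlet_cwt(x, fs)
```

For the 10 Hz test tone, the row with the most energy sits at (or very near) the 10 Hz analysis frequency, which is the behaviour a scalogram visualizes.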
The scalogram produced is shown in Figure 5, but the generated image has a label and a white bar covering part of the image, which increases the training duration and reduces the accuracy of the CNN.
Finally, the 14 scalograms representing the 14 electrodes recording the EEG signal at the same instance are combined into one image, as shown in Figure 6, to ease label matching and to allow the CNN to learn the direct relationships and differences between the scalograms of the same instance when changes appear. The combined image is then saved into its designated folder.

Normalisation
The values of the dataset labels were normalized to 1 (High) and 0 (Low), where 1 and 0 indicate that the value is in the range 1-5 or 6-9 respectively. This normalization is required so that the accuracy of the system can be increased by reducing the wide range of parameter values. The CNN is then trained using the pre-processed dataset and the normalized dataset labels. The CNN can be divided into two parts: the first is the feature learning layer, which extracts the features from the input signal; the second is the classification layer, where the extracted signal features are flattened into a column vector for the feed-forward neural network to train on.
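The label normalization described above reduces to a simple thresholding rule:

```python
def normalise_label(value):
    # 1 (High) for raw labels in 1-5, 0 (Low) for raw labels in 6-9
    return 1 if 1 <= value <= 5 else 0

labels = [normalise_label(v) for v in (1, 3, 5, 6, 9)]
```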

ReLU activation function
The Rectified Linear Unit (ReLU) was used due to its high computational efficiency, as it does not restrict the upper range of the activation, which extends from 0 to infinity. Overfitting issues and long training durations can also be mitigated. The ReLU function can be represented mathematically as f(x) = max(0, x).
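The element-wise form of the activation is one line in numpy:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise; negative inputs are zeroed,
    # positive inputs pass through unbounded.
    return np.maximum(0.0, x)

out = relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0]))
```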

Alexnet
AlexNet is an 8-layer deep convolutional neural network capable of classifying images into 1000 object categories, as the network was trained on more than 1 million images. The fully connected layers are used to classify the inputs, and the number of classification outputs of both the final fully connected layer and the output layer is 1000. However, the classification output required in this work is 10 categories.
The backpropagation method was used to adjust the weights and biases at every iteration over a series of epochs, until the fully connected layer was able to perform the classification.
Finally, the output of the fully connected layer is sent to the soft-max classification layer for the final classification into a label, and the trained CNN model is used for testing. The number of outputs of the soft-max layer is equal to the desired number of classes, which in this case is 10 (left, right, up, down, front, back, stop, pick, red, blue).

Figure 3. Emotiv EPOC+ EEG device.
Figure 6. All 14 channels of scalograms combined into an image of 1988 × 447 pixels.
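The final soft-max stage maps the 10 class scores to a probability distribution over the word list; a minimal sketch (the logit values are illustrative, not network outputs from the paper):

```python
import numpy as np

WORDS = ["left", "right", "up", "down", "front", "back", "stop", "pick", "red", "blue"]

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

logits = np.array([0.1, 0.3, 2.4, 0.2, 0.1, 0.0, 0.1, 0.2, 0.1, 0.1])
probs = softmax(logits)                       # sums to 1 over the 10 classes
predicted = WORDS[int(np.argmax(probs))]      # label with the highest probability
```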