Identification of attention deficit hyperactivity disorder with deep learning model

This article explores the detection of Attention Deficit Hyperactivity Disorder (ADHD), a neurobehavioral disorder, from electroencephalography (EEG) signals. Because EEG signals behave unstably owing to complex neuronal activity in the brain, frequency analysis methods are required to extract their hidden patterns. In this study, features were extracted with the Multitaper and Multivariate Variational Mode Decomposition methods. These features were then analyzed with neighborhood component analysis, and the features that contribute most to the classification were selected. A deep learning model comprising convolution, pooling, bidirectional long short-term memory and fully connected layers was trained with the selected features. Subjects with ADHD were classified with the trained deep learning model and, for comparison, with support vector machines and linear discriminant analysis. The experiments were validated with an open access ADHD dataset (https://doi.org/10.21227/rzfh-zn36). In validation, the deep learning model classified 1210 test samples (600 samples from the control group labeled Normal and 610 samples from the ADHD group labeled ADHD) in 0.1 s with an accuracy of 95.54%. This accuracy is considerably higher than that of Linear Discriminant Analysis (76.38%) and Support Vector Machines (81.69%). The experimental results show that the proposed approach can effectively distinguish ADHD subjects from the control group.


Introduction
Classification of Electroencephalography (EEG) signals is an important step in the design of a Brain-Computer Interface (BCI) [1]. One BCI application is the detection of Attention Deficit Hyperactivity Disorder (ADHD). ADHD, a neurodevelopmental disorder, is characterized by deficits in executive function and attention. It affects approximately 5% of adults and 10% of children worldwide [2]. The prevalence varies by population and can reach 20% [3]. For the diagnosis of ADHD, experts use neuropsychological assessments and heterogeneous cognitive profiles. However, wide cognitive profiles complicate the diagnosis [4]. One of the methods used to support the diagnosis is the evaluation of EEG signals. ADHD can be diagnosed more reliably by examining how the patients' signals change in response to different stimuli. A clear diagnosis of ADHD is important in solving individuals' social and psychiatric problems [5].
Various EEG-based methods have been proposed for detecting ADHD, such as the Event-Related Potential (ERP) method [1], statistical analysis of the signal [6], observation of the Power Spectral Density (PSD) of the signal [7], and application of photic stimuli [8]. These studies used the wavelet transform, frequency-domain transformation and the Welch power spectrum [6][7][8][9]. One of them reported significant changes in the alpha band of the PSD of EEG signals of individuals with ADHD [9]. Artificial intelligence algorithms have therefore played an active role in ADHD classification by exploiting changes in PSDs [10]. When a frequency transformation is applied to extract spectral information from a signal, the power coefficients and relative phase obtained versus frequency are assumed to represent the signal reliably. However, this assumption is not always valid. Averaging the signal is used to address this problem, but averaging weakens the signal components and is unreliable for small data sets [11]. The Multitaper method motivates this study as an alternative to such averaging: it estimates the PSD and reduces the estimation bias by obtaining more than one independent estimate from the same sample. In addition, each signal was decomposed into 3 components at different power levels with the Multivariate Variational Mode Decomposition (MVMD) method, and features were obtained from these components. These features allow the signal to be evaluated in 3 different bands.
Different methods have been used in the literature for the diagnosis of ADHD, usually based on artificial intelligence algorithms. Feature extraction methods provide these algorithms with valuable information about stimulus-induced changes in the PSD of EEG signals. ADHD subjects have been classified from the resulting feature vectors with machine learning algorithms, neural networks and deep learning models. Yang et al. processed data from 128 channels with the Principal Component Analysis (PCA) algorithm for ADHD classification. The processed data were given as a feature set to K-Nearest Neighbors (KNN) and Support Vector Machines (SVM) classifiers, and the experiments were validated with cross-validation. The highest accuracy, 83.33%, was obtained with the KNN algorithm [12]. Khoshnoud et al. used 19-channel EEG signals recorded at rest with eyes closed. Approximate Entropy, Lyapunov Exponent and Multifractal Singularity Spectrum were used for feature extraction. These features were classified with a radial basis function network (RBFN) and SVM. Classification based on frequency band power was also evaluated with the same classifiers. An accuracy of 83.33% was achieved with SVM under four-fold cross-validation, and nonlinear features were observed to separate ADHD from control better than band power features [13]. Chen et al. obtained features from the power spectrum for ADHD detection from EEG signals. The features applied to SVM were divided into 4 groups: relative spectral power, spectral power ratio, complexity and dual phase. An accuracy of 84.59% was obtained with SVM using these features [14]. Jahanshahloo et al. built the feature vector of their study from the fractal dimension, band power, wavelet coefficients and Autoregressive (AR) coefficients, and classified ADHD with SVM. The combination of fractal dimension and wavelet transform features was observed to discriminate well, and 88.77% accuracy was achieved with SVM under tenfold cross-validation [15]. Mueller et al. studied two age-matched groups of adults, to which two visual stimuli were applied. The ERP responses in the EEG recordings were decomposed with Independent Component Analysis (ICA), and ADHD classification was performed with SVM. A classification accuracy of 91% was obtained with tenfold cross-validation [16]. Dea [20].
Studies involving deep learning methods have also been proposed in the literature. Chen et al. proposed a Convolutional Neural Network (CNN)-based method for detecting ADHD from EEG signals. Feature extraction was performed by rearranging the order of the EEG channels. Together with 13 additional computed features, the resulting feature matrix yielded an accuracy of 94.67% [21]. Dubreuil et al. detected ADHD with a CNN model trained on stacked multi-channel EEG time-frequency decompositions of ERPs. The CNN, trained with 2800 feature vectors, achieved a higher accuracy (88%) than a Recurrent Neural Network (RNN) [2]. Marcano et al. used 5 EEG channels selected for the ratio of theta and beta power measured during an attention task and designed a probability ratio detector. The Area under Curve (AuC) at rest and under one excitation was 73%, with a false positive rate (FPR) of 0.32 [7].
In this study, the Multitaper method and MVMD are combined in a novel way with Neighborhood Component Analysis (NCA) and a Deep Learning Model (DLM) containing a Bidirectional Long Short Term Memory (BLSTM) unit. Frequency-power values were computed from each EEG channel with the Multitaper method, and MVMD was used to obtain the components of the EEG signal. With a 2% tolerance, 407 features were selected by the NCA algorithm. Support Vector Machines (SVM), Linear Discriminant Analysis (LDA) and DLM classifiers were trained with 2/3 of the feature vectors. In holdout validation on the remaining 1/3, the highest accuracy, 95.54%, was obtained by using NCA and DLM together. Because NCA lets the classifier work with fewer features, the processing time is also shortened.
The main contributions of this study are as follows: 1. MVMD contributes noise immunity and mode alignment, as well as the removal of negativity in the EEG signal. The Multitaper method estimates the power spectral density (PSD) by averaging more than one independent estimate obtained from the same sample, which reduces the estimation bias and prevents data loss. 2. The proposed method was tested on a holdout validation set comprising 33.3% of the data. In the experiments, a classification accuracy of 95.54% was achieved with the DLM, which is much higher than with LDA and SVM. When effective features are used in classification, high ADHD classification accuracy is obtained and the processing time decreases substantially.

ADHD dataset
The EEG data used in the study were obtained from the potential differences between electrodes placed according to the international 10-20 system. Data from 19 channels (Fz, Cz, Pz, C3, T3, C4, T4, Fp1, Fp2, F3, F4, F7, F8, P3, P4, T5, T6, O1, O2) sampled at 128 Hz are included. The A1 and A2 electrodes on the earlobes served as references. The EEG recording protocol is based on visual attention tasks: under continuous stimulation, the subjects were asked to count certain visuals, and the EEG recordings corresponding to these stimuli were obtained [22].
Records in the data set belong to 121 subjects, aged 7-12 years: 61 subjects in the ADHD group and 60 in the control group. There were no reports of psychiatric disorders, epilepsy, or any high-risk behaviors in the control group. The data used in the study were derived from the EEG signals of these 121 subjects. Thirty segments of 10 s (1280 samples each) were taken from each signal at random positions in time, without overlapping each other. Experiments were conducted with a total of 3630 feature vectors. During the EEG recordings, pictures of a series of cartoon characters were shown to the children for stimulation, and the children were asked to count these characters. The number of characters in each image was randomly selected between 5 and 16. Each image was shown immediately after the child's response and without interruption [22].
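The segmentation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the random-gap strategy and the seed are assumptions, and only the sampling rate (128 Hz), the segment length (1280 samples) and the segment count (30) come from the text.

```python
import numpy as np

FS = 128            # sampling frequency in Hz (from the dataset description)
SEG_LEN = 10 * FS   # one 10-s segment = 1280 samples
N_SEG = 30          # segments taken from each recording

def random_nonoverlapping_segments(eeg, n_seg=N_SEG, seg_len=SEG_LEN, seed=0):
    """Cut n_seg non-overlapping windows at random offsets (hypothetical helper).

    eeg: array of shape (n_channels, n_samples).
    Returns (segments, starts); segments has shape (n_seg, n_channels, seg_len).
    """
    n_samples = eeg.shape[1]
    slack = n_samples - n_seg * seg_len
    if slack < 0:
        raise ValueError("recording too short for the requested segments")
    rng = np.random.default_rng(seed)
    # Sorted random offsets guarantee that consecutive starts are at least
    # seg_len apart, so the windows never overlap.
    offsets = np.sort(rng.integers(0, slack + 1, size=n_seg))
    starts = offsets + np.arange(n_seg) * seg_len
    segments = np.stack([eeg[:, s:s + seg_len] for s in starts])
    return segments, starts
```

With 121 recordings and 30 segments each, this yields the 3630 segments mentioned in the text. The helper raises an error when a recording is shorter than 300 s, since the requested number of disjoint windows would not fit.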

Architecture of the proposed ADHD identification model
A method is proposed to detect individuals with ADHD using EEG signals. The flow of the study is presented in Fig. 1. The proposed method uses Multitaper, MVMD, NCA and DLM together. The dataset of the study contains 3630 feature vectors, of which 1830 belong to ADHD subjects and 1800 to the control group. Of the 19 channels recorded as potential differences between electrodes of the international 10-20 system, the raw EEG records of 8 channels were used: C3, C4, P3, P4, T5, T6, O1, and O2. These channels were preferred because they lie in regions where eye-blink artifacts have little effect. In the feature extraction stage, the power density values at frequencies of 1-49 Hz were obtained by applying the Multitaper method. In addition, the EEG signal of each of the 8 channels was separately decomposed into 3 components. A total of 1942 features were thus obtained in feature extraction. The NCA feature selection algorithm then selected the 407 concatenated features that achieve the best classification. Finally, the DLM labeled subjects with ADHD as the ADHD class and subjects of the control group as the Normal class.

Feature extraction with multivariate variational mode decomposition
MVMD extends one-dimensional Variational Mode Decomposition to multi-dimensional signals, so EEG recordings consisting of multi-channel signals can be processed with it. In addition, the method keeps the component frequencies consistent across channels [23]. MVMD extends the decomposition to multivariate data instead of decomposing a single-channel signal. The Hilbert transform is used to obtain the one-sided spectrum of each mode u(t). Each mode is then shifted to baseband by multiplying it with an exponential term tuned to its center frequency ω. In MVMD, K multivariate modulated oscillations u_k(t) are extracted from the signal. The optimization function in this case is given by Eq. 1 [24].
In Eq. 1, the term u_{k,c}(t) is a complex-valued signal with a single frequency component ω_k in each channel, where c is the channel index and k is the mode index of the analytic modulated signal. To solve Eq. 1, the constrained optimization problem is first transformed into an unconstrained one. With this transformation, the problem takes the form of an augmented Lagrangian function with two added penalty terms. The Lagrangian function is expressed by Eq. 2.
In Eq. 2, α denotes the balance parameter used to enforce the data fidelity constraint and λ is the Lagrange multiplier. The resulting unconstrained optimization problem is solved with the alternating direction method of multipliers (ADMM), which yields the components of the different channels and frequency bands.
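For orientation, the MVMD problem is usually written in the literature in the form below. This is the commonly cited formulation, given here as an assumption; the notation of Eqs. 1 and 2 in the source may differ. Here x_c(t) is the input signal of channel c, u_+^{k,c}(t) is the analytic representation of mode k in channel c, C is the number of channels and K is the number of modes.

```latex
% Constrained problem: minimise the total bandwidth of all modes
\min_{\{u_{k,c}\},\{\omega_k\}} \;
\sum_{k=1}^{K}\sum_{c=1}^{C}
\left\| \partial_t \left[ u_+^{k,c}(t)\, e^{-j\omega_k t} \right] \right\|_2^2
\quad \text{s.t.} \quad
\sum_{k=1}^{K} u_{k,c}(t) = x_c(t), \qquad c = 1,\dots,C

% Augmented Lagrangian with a quadratic penalty and a multiplier term
\mathcal{L}\left(\{u_{k,c}\},\{\omega_k\},\{\lambda_c\}\right) =
\alpha \sum_{k=1}^{K}\sum_{c=1}^{C}
\left\| \partial_t \left[ u_+^{k,c}(t)\, e^{-j\omega_k t} \right] \right\|_2^2
+ \sum_{c=1}^{C} \left\| x_c(t) - \sum_{k=1}^{K} u_{k,c}(t) \right\|_2^2
+ \sum_{c=1}^{C} \left\langle \lambda_c(t),\; x_c(t) - \sum_{k=1}^{K} u_{k,c}(t) \right\rangle
```

The shared center frequency ω_k across all channels is what produces the mode alignment mentioned in the text.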

Feature extraction using Multitaper
The Multitaper method is used to obtain the power spectral density by transferring the information contained in a signal to the frequency domain. The power spectrum describes how the average power of a signal is distributed over its frequency components [25]. The average power of the discrete-time signal x[n] over the range n_1 to n_2 is obtained first. The total energy of the signal over a finite duration N is also finite, as shown by Eq. 3. A tapering window function w, indexed by k, is defined, and the periodogram of the sub-sequence x_j[n], corresponding to column j of a single-taper spectrogram, is calculated by Eq. 5. In Eq. 5, T is the Fourier transform of the function w. The estimator S is obtained by multiplying this transform by the sampling frequency. Although the resulting S approximates the long-term power spectrum, the variance problem of the periodogram remains to be solved. To solve it, more than one tapering window function is used, which reduces the bias and variance of the periodogram. The windows are denoted by W = {w_1^L, w_2^L, ..., w_K^L}, that is, a set of K tapering windows of length L. The multitaper spectral estimator using the windows in W is obtained by Eq. 6.
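A minimal sketch of a multitaper PSD estimate is given below, assuming discrete prolate spheroidal (Slepian) tapers from `scipy.signal.windows.dpss`. The source does not name the taper family or the number of tapers, so `n_tapers=2` and the one-sided scaling are assumptions. The value `nw=1.25` is the parameter reported in the evaluation section.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(x, fs=128.0, nw=1.25, n_tapers=2):
    """One-sided multitaper PSD estimate of a 1-D signal (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # K orthogonal tapers of length n; with Kmax given, each has unit energy.
    tapers = dpss(n, nw, Kmax=n_tapers)            # shape (n_tapers, n)
    spectra = np.fft.rfft(tapers * x, axis=1)      # one tapered spectrum per taper
    # Average the K single-taper periodograms and scale to power per Hz.
    psd = (np.abs(spectra) ** 2).mean(axis=0) / fs
    # Fold negative frequencies into the one-sided estimate.
    if n % 2 == 0:
        psd[1:-1] *= 2.0   # DC and Nyquist bins are not doubled
    else:
        psd[1:] *= 2.0     # only the DC bin is not doubled
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd

# Usage on one 10-s channel segment: keep the 1-49 Hz band used in the text.
t = np.arange(1280) / 128.0
freqs, psd = multitaper_psd(np.sin(2 * np.pi * 10.0 * t))
band = (freqs >= 1.0) & (freqs <= 49.0)
band_power = psd[band]
```

Averaging over several orthogonal tapers is what lowers the variance of the estimate compared with a single periodogram, at the cost of a frequency resolution of roughly ±nw/T Hz for a segment of T seconds.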

Feature selection applying dimensionality reduction neighborhood component analysis algorithm
NCA is a nonparametric feature selection method that produces non-negative weights for all features. By contrast, Relief can assign negative weights, which indicate redundant features; such negatively weighted features are pruned. The positively weighted, most distinctive features are then selected. Before the weights are produced, the features are normalized with min-max normalization [26].
In Eq. 7, W denotes the weight vector of NCA, fr is the feature vector and t_v is the target vector. The weights are obtained by matching the min-max normalized features with the target vector.
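The pre-processing and selection steps around NCA can be sketched as below. The NCA weight optimization itself is not reproduced. The selection rule, keeping features whose weight exceeds 2% of the largest weight, is one possible reading of the "2% tolerance" mentioned in the introduction and is an assumption, as are both function names.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) of X to [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Constant columns would divide by zero; map them to 0 instead.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

def select_by_weight(weights, tol=0.02):
    """Indices of features whose weight exceeds tol * max(weights).

    Assumed interpretation of the 2% tolerance; not confirmed by the source.
    """
    weights = np.asarray(weights, dtype=float)
    return np.flatnonzero(weights > tol * weights.max())

# Usage with made-up weights for four features.
selected = select_by_weight([0.0, 1.0, 0.5, 0.01])   # keeps features 1 and 2
```

In the study the weight vector would have 1942 entries and the selection would retain 407 of them.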

Linear discriminant analysis
The LDA method separates two classes with a linear boundary between features. The discriminant is expressed as a linear function of the features, and its value determines the class label [27]. First, models of the probability density functions are obtained for the data generated from each class. A new data point is then assigned to the class whose probability density function gives the largest value. The discriminant function of the LDA classifier is a linear combination of the components of X, as expressed in Eq. 8 [27].
In Eq. 8, w is the weight vector and m_0 is the bias value. The class decision is defined by the value of D. X is a p × N_k matrix of N_k samples, which are p-dimensional data from class k. μ_k denotes the prior probabilities of each class and δ is the covariance matrix. Each x is assigned to a class with argmax as in Eq. 9. The resulting LDA decision boundaries are linear across data classes.
Consequently, the discriminant is predominantly a linear combination of the predictors. In general, predictors with large differences between class means receive larger weights, whereas the weights are small when the class means are similar.
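The linear decision rule can be illustrated with a small two-class LDA written in NumPy. This is a textbook sketch under the usual assumption of a covariance matrix shared by both classes, not the authors' implementation. The names `fit_lda` and `predict_lda` are hypothetical, and the data matrix is stored as samples × features rather than p × N_k.

```python
import numpy as np

def fit_lda(X, y):
    """Two-class LDA with pooled covariance; returns weight vector w and bias m0."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance estimate.
    S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (len(X) - 2)
    # Larger mean differences (relative to the covariance) give larger weights.
    w = np.linalg.solve(S, mu1 - mu0)
    prior0, prior1 = len(X0) / len(X), len(X1) / len(X)
    m0 = -0.5 * (mu1 + mu0) @ w + np.log(prior1 / prior0)
    return w, m0

def predict_lda(X, w, m0):
    """Assign class 1 where the linear discriminant w^T x + m0 is positive."""
    return (X @ w + m0 > 0).astype(int)

# Usage on synthetic, well separated data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (200, 3)), rng.normal(2.0, 1.0, (200, 3))])
y = np.repeat([0, 1], 200)
w, m0 = fit_lda(X, y)
accuracy = (predict_lda(X, w, m0) == y).mean()
```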

Support vector machines
The SVM method maps the input data vectors into a higher-dimensional space by passing them through a kernel function. The data in the transformed space are classified by modeling complex decision boundaries with a hyperplane. In the classification process, the distance between the hyperplane and the nearest data point is maximized [28]. In general, SVM can be formulated as in Eq. 10.
In Eq. 10, w is the weight vector and b is the bias value. (x_i, y_i) is the i-th input-output pair, where x_i is the input and y_i is the output. The estimated output for the i-th sample is x_i^T · w + b. N is the number of samples and y(w) is the regularization term. α is a non-negative parameter that balances the data fitting loss term against the regularization term.
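One common regularized form that matches the symbols described above is the linear soft-margin objective below. It is given as an assumption about the shape of Eq. 10, with the hinge loss as the data fitting term and R(w) standing in for the regularization term named in the text.

```latex
\min_{w,\,b} \; E(w, b) =
\frac{1}{N} \sum_{i=1}^{N} \max\!\left(0,\; 1 - y_i \left( x_i^{T} w + b \right)\right)
+ \alpha\, R(w),
\qquad
R(w) = \tfrac{1}{2}\, \lVert w \rVert_2^2,
\qquad y_i \in \{-1, +1\}
```

A larger α favors a wider margin at the cost of more training errors, while a smaller α fits the training data more closely.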

DLM with bidirectional long short term memory cell
BLSTM units have an important place in the DLM used in the study. They are used in solving sequential classification problems. Their gates determine whether the current memory is stored or updated. The BLSTM unit is therefore capable of modeling long-range dynamic dependencies, which helps avoid the vanishing gradient problem that arises during training [29].
A BLSTM unit has an input gate, a forget gate and an output gate. A single BLSTM unit is defined by Eq. 11.
In Eq. 11, W is the weight matrix and b is the bias. i_t^j is the input gate of the j-th LSTM unit at time t, and σ is the sigmoid function. The input data at time t is x_t and the output of the previous BLSTM unit is h_{t-1}.
In Eq. 12, f_t describes the forget gate. In the forget gate, the importance of information is calculated and unnecessary information is discarded.
In Eq. 13, c̃_t^j represents the new memory gate unit and the memory content of the previous unit is expressed as b.
The new memory content is weighted by the forget gate unit, which gives the updated memory content.
The update in the BLSTM block is performed and c_t^j is obtained by Eq. 14. o_t^j is the output unit that controls the final output state. The output of the BLSTM cell, h_t^j, is calculated by Eq. 15 at the last BLSTM output unit enabled at time t.
The BLSTM model can access content in both the forward and backward directions. The BLSTM model is illustrated in Fig. 2.
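The gate computations and the bidirectional pass can be sketched in NumPy as follows. This uses the standard LSTM cell equations, which are assumed to correspond to Eqs. 11-15. The stacked parameter layout (gate order i, f, o, candidate) and the function names are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. Shapes: W (4H, D), U (4H, H), b (4H,)."""
    H = h_prev.size
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate memory content
    c_t = f * c_prev + i * g       # updated memory cell
    h_t = o * np.tanh(c_t)         # cell output
    return h_t, c_t

def run_lstm(xs, W, U, b):
    """Run the cell over a sequence xs of shape (T, D); returns (T, H)."""
    H = U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, U, b)
        outputs.append(h)
    return np.stack(outputs)

def run_blstm(xs, fwd_params, bwd_params):
    """Concatenate a forward pass and a time-reversed backward pass."""
    h_fwd = run_lstm(xs, *fwd_params)
    h_bwd = run_lstm(xs[::-1], *bwd_params)[::-1]   # re-align to forward time
    return np.concatenate([h_fwd, h_bwd], axis=1)   # shape (T, 2H)
```

The backward pass uses its own parameters, so each time step sees a summary of both the preceding and the following inputs.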
In addition to the BLSTM unit, the DLM contains a one-dimensional convolution layer (Conv1D), batch normalization and dense layers. The 407 features are reconstructed with Conv1D. Batch normalization is then applied to the convolutional features in each batch. The output of the BLSTM unit is classified into 2 classes by the dense layers. The best performance was obtained with the following DLM hyper-parameters: an intermediate layer size of 100, an initial learning rate of 0.05, a gradient threshold of 1, and a mini-batch size of 384. The layers of the DLM architecture created in this study are listed in Table 1.
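A rough PyTorch sketch of such a model is shown below. Table 1 is not reproduced here, so the number of convolution filters, the kernel size, the pooling size, the choice of SGD and the use of norm-based gradient clipping are all assumptions. Only the 407 inputs, the 100 hidden units, the learning rate of 0.05, the gradient threshold of 1 and the 2 output classes come from the text.

```python
import torch
from torch import nn

class DLMSketch(nn.Module):
    """Conv1D -> batch norm -> pooling -> BLSTM -> dense (illustrative only)."""

    def __init__(self, conv_channels=16, kernel_size=3, hidden=100, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(1, conv_channels, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(conv_channels)
        self.pool = nn.MaxPool1d(2)
        self.blstm = nn.LSTM(conv_channels, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                       # x: (batch, n_features)
        z = x.unsqueeze(1)                      # (batch, 1, n_features)
        z = self.pool(torch.relu(self.bn(self.conv(z))))
        z = z.transpose(1, 2)                   # (batch, steps, conv_channels)
        _, (h_n, _) = self.blstm(z)             # h_n: (2, batch, hidden)
        feat = torch.cat([h_n[0], h_n[1]], dim=1)   # final forward + backward states
        return self.fc(feat)                    # class scores (logits)

def train_step(model, optimizer, x, y, grad_threshold=1.0):
    """One mini-batch update with gradient clipping at the stated threshold."""
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=grad_threshold)
    optimizer.step()
    return loss.item()
```

A training loop would create `torch.optim.SGD(model.parameters(), lr=0.05)` and feed mini-batches of 384 selected feature vectors to `train_step`.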

Experimental results
This section covers the application and evaluation of the combined use of the Multitaper, MVMD, NCA and DLM modules proposed in this study. First, the experimental setup, performance criteria and dataset are described. Then, the results of the experiments performed on the data set of the study are presented to validate the approach. Finally, the proposed approach is compared with the ADHD classification methods suggested in the literature.

Experimental setup
The proposed method was implemented in the Python programming environment. The experiments were carried out on an Intel i7 9900 processor running at 2.40 GHz, with 32 GB of RAM and an NVIDIA 940M GPU.

Metrics for proposed model evaluation
The accuracy, precision, recall and f1 score metrics are used to measure the performance of the proposed approach. The true positive (TP), true negative (TN), false positive (FP) and false negative (FN) counts are used in the calculation of these metrics and are derived from the confusion matrix. TP refers to ADHD subjects who are correctly classified. FP refers to subjects in the control group who are classified as ADHD. FN refers to subjects with ADHD who are assigned to the control group. TN represents correct classification in the control group. The metrics are computed from these counts by Eqs. 16, 17, 18 and 19, respectively.
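For reference, the standard definitions of the four metrics are given below, with ADHD taken as the positive class. They are assumed to correspond to Eqs. 16-19. Prn and Rcl denote precision and recall.

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Prn} = \frac{TP}{TP + FP}, \qquad
\mathrm{Rcl} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Prn} \cdot \mathrm{Rcl}}{\mathrm{Prn} + \mathrm{Rcl}}
```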

Evaluation of the proposed approach
EEG signal segments of 10 s were obtained from the raw data. The power distribution at frequencies between 1 and 49 Hz and the features of the 8 channels divided into 3 components were obtained by applying the Multitaper transformation and MVMD to these segments. The time-bandwidth parameter nw was chosen as 1.25. This parameter mostly reflects the power change marked in red in the graphs. As can be seen in the graphs, more power changes occur in subjects with ADHD. An example of a signal separated into its components by MVMD is shown in Fig. 3. The concatenated MVMD and PSD features number 1942. The NCA algorithm selected among these 1942 features and reduced them to 407. The feature vector selected by NCA over the entire dataset was then applied to LDA, SVM and DLM, respectively.
Of the 3630 data, 2420 were used to train the 3 separate classifiers. The trained classifiers were then evaluated by holdout validation on the remaining 1210 data, and their performance was compared. An accuracy of 76.38% was obtained with the LDA classifier and 81.69% with the SVM classifier. The DLM designed in this study achieved an accuracy of 95.54%.
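The 2/3 to 1/3 holdout split can be sketched as below. Stratifying by class is an assumption, although it reproduces the reported test composition exactly: one third of the 1830 ADHD vectors is 610 and one third of the 1800 control vectors is 600. The function name and the seed are illustrative.

```python
import numpy as np

def stratified_holdout(y, test_fraction=1 / 3, seed=0):
    """Split indices so that each class contributes test_fraction to the test set."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(len(idx) * test_fraction))
        test_idx.append(idx[:n_test])
        train_idx.append(idx[n_test:])
    return np.concatenate(train_idx), np.concatenate(test_idx)

# Labels of the 3630 feature vectors: 1 = ADHD (1830), 0 = control (1800).
y = np.concatenate([np.ones(1830, dtype=int), np.zeros(1800, dtype=int)])
train_idx, test_idx = stratified_holdout(y)
# len(train_idx) is 2420 and len(test_idx) is 1210 (610 ADHD + 600 control).
```

This split operates on individual segments. The source does not state whether segments of the same subject were kept together, so grouping by subject would need a different index construction.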
The NCA feature selection algorithm was applied to the data set of the study to increase the effectiveness of the method. Selecting the best 407 features with NCA both increased the success of the DLM and produced a system that works faster. Finally, the performance criteria of the proposed approach are compared with those of different studies in the literature.
The confusion matrix of the best model is presented in Table 2. Of the 600 subjects in the control group, 568 were correctly classified. Among the group with ADHD, 588 out of 610 subjects were classified correctly, so only 22 ADHD subjects (3.6%) were misclassified. Precision, recall and f1 score were calculated as 0.95, 0.96 and 0.95, respectively. The ROC curve of the method is presented in Fig. 4. The area under the curve was 0.96.
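The reported figures can be checked directly from the confusion matrix counts. The sketch below assumes ADHD is the positive class, so TP = 588, FN = 22, TN = 568 and FP = 600 - 568 = 32. The function name is illustrative.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prn, rcl, f1 = classification_metrics(tp=588, fn=22, fp=32, tn=568)
# acc ~ 0.9554, prn ~ 0.948, rcl ~ 0.964, f1 ~ 0.956
```

These values agree with the 95.54% accuracy and with the precision and recall of 0.95 and 0.96 given in the text. The F1 score of about 0.956 matches the reported 0.95 if the value is truncated rather than rounded.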

Discussion
The performance of the study was analyzed by comparing it with other known classification methods. The experimental results on ADHD identification times are shown in Table 3. These metrics were obtained in validation with 1210 test samples. The detection accuracy of the proposed model is compared in Table 4 with the accuracies reported by related studies in the literature and with those of other classification algorithms. Different methods have been used for ADHD detection. Feature extraction has been performed with methods such as the multifractal singularity spectrum, approximate entropy, PSD, largest Lyapunov exponent, wavelet coefficients, chaotic time series analysis and the spectrogram. The feature vector is given to the classifier either directly or after processing with feature reduction algorithms such as PCA and DISR. Classification is then performed with machine learning, neural network or deep learning models.
Among the studies using SVM, Dea et al. achieved 94.1% accuracy by using PSD, PCA and SVM together [17]. Khaleghi et al. obtained their highest accuracy, 91.83%, by using DISR and MLP together [19]. Dubreuil et al. achieved an accuracy of 88% with the combination of the spectrogram and a CNN deep learning method [2]. Fouladvand et al. reported detecting ADHD with LSTM with 84% accuracy [30].
In experiments conducted with the same data set, entropy measurements, especially of the recordings in the C3 channel, were shown to be effective in detecting ADHD. In that study, subjects with ADHD were classified with an accuracy of 93.65% in holdout validation on a 30% data slice [31]. In the present study, when the trained model was tested with 1/3 of the whole dataset, a test accuracy of 95.54% was achieved. These results show that the proposed method is more successful and effective than the methods suggested in the literature. The Multitaper method (PSD) used in the proposed method averages more than one independent estimate obtained from the same sample, which reduces the estimation bias and prevents data loss. Abbas et al. found distinctive features in the Beta power band in their experiments with the same data set. In their experimental results, the AuC for ADHD detection in this band was 0.7585 [32].

Conclusion
In this study, a method is proposed that diagnoses ADHD from EEG signals by using Multitaper, MVMD, NCA and DLM together in a novel way. The dataset was derived from the EEG data of 121 subjects. In addition, the results of previous studies are compared with the performance metrics obtained. PSD values from the Multitaper method and 3 signal components for each of 8 channels from MVMD were obtained and concatenated. These features are important in discovering the powers that are least affected by blink artifacts and that increase significantly under stimulation in ADHD subjects. Feature vector reduction was implemented with NCA to improve the performance metrics. Many DLM variants were also checked for false positives to achieve the best generalization. The selection of the best 407 features and the hyper-parameters presented in the study gave the best results. The 1210 data could be classified with a holdout validation accuracy of 95.54%. Furthermore, the classification performance obtained with deep learning was higher than that of the SVM and LDA classifiers. The experiments show that the proposed method produces fewer false positives than other ADHD classification methods. The faster training and higher success level of the DLM used with NCA provided an advantage over other deep learning methods. Training takes about 117 s, and ADHD is checked in 1210 data in about 0.1 s.
Funding
This work was not funded by any organization.

Conflict of interest
The authors declare that they have no conflict of interest.

Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.