Transfer Learning and Self-Distillation for automated detection of schizophrenia using single-channel EEG and scalogram images

Schizophrenia (SZ) has been acknowledged as a highly intricate mental disorder for a long time. In fact, individuals with SZ experience a blurred line between fantasy and reality, leading to a lack of awareness about their condition, which can pose significant challenges during the treatment process. Due to the importance of the issue, timely diagnosis of this illness can not only assist patients and their families in managing the condition but also enable early intervention, which may help prevent its advancement. EEG is a widely utilized technique for investigating mental disorders like SZ due to its non-invasive nature, affordability, and wide accessibility. In this study, our main goal is to develop an optimized system that can achieve automatic diagnosis of SZ with minimal input information. To optimize the system, we adopted a strategy of using single-channel EEG signals and integrated knowledge distillation and transfer learning techniques into the model. This approach was designed to improve the performance and efficiency of our proposed method for SZ diagnosis. Additionally, to leverage the pre-trained models effectively, we converted the EEG signals into images using Continuous Wavelet Transform (CWT). This transformation allowed us to harness the capabilities of pre-trained models in the image domain, enabling automatic SZ detection with enhanced efficiency. The accuracy achieved from the 5-second records of the EEG signal, along with the combination of self-distillation and VGG16 for the P4 channel, is 97.83% ± 1.3. This indicates a high level of accuracy in diagnosing SZ using the proposed method.


Introduction
Based on data published by the World Health Organization (WHO), approximately 1% of the global population, amounting to around 21 million individuals, is estimated to be affected by Schizophrenia (SZ) [1]. This profound and severe brain disorder significantly impacts various cognitive functions, including thinking, memory, comprehension, speech, and behavioral traits [2,3]. The enduring nature of this psychiatric condition adversely affects employment, marriage, and lifestyle, resulting in a compromised quality of life and an impaired ability to function effectively in work settings. It is alarming to note that statistics indicate that between 20% and 40% of individuals with SZ have made at least one suicide attempt [4]. SZ is characterized by atypical behavior, socio-psychological challenges, and the presence of anxiety and depression. Hallucinations, where individuals perceive sensory experiences that do not correspond to reality, are also prevalent among those with SZ. While the exact causes of SZ remain uncertain, the prevailing belief is that the interplay of hereditary, biochemical, and environmental factors collectively contributes to the onset of this condition [5]. It is important to note that SZ is a chronic disease requiring long-term therapy. Unfortunately, a significant portion of SZ patients, particularly those from low- or middle-income countries, do not have access to the care they need [6].
Developing a low-cost and accurate diagnosis for SZ presents a significant challenge. The current approach often relies on the expertise of a trained psychiatrist who observes and evaluates the individual's behavior and symptoms. However, clinical methods can be less reliable due to the overlap of various attributes between SZ and other brain disorders [7]. Beyond clinical assessment, positron emission tomography (PET) and magnetic resonance imaging (MRI) are commonly used scans that can be employed as diagnostic tools for SZ. These imaging techniques offer valuable insights into the brain's structure and functioning, aiding in the identification and assessment of SZ. While scans like PET and MRI are effective in diagnosing SZ, they can be costly due to the requirement for high-end instruments. An alternative approach that addresses this issue is the electroencephalogram (EEG), which offers a more affordable option for capturing brain activity. EEG can provide valuable information about brain function and has the potential to serve as a cost-effective alternative in diagnosing SZ [8].
At present, there is a lack of a widely recognized clinical test for SZ, and its diagnosis heavily relies on the evaluation of behavioral symptoms, including hallucinations, functional decline, and disorganized speech, as observed by experts. However, these assessments are subjective and may not always yield highly accurate results. To overcome these limitations, there is a need for an automatic, reliable, and reproducible approach that utilizes advanced machine learning methods to analyze brain imaging modalities. Such an approach would offer a promising solution to enhance the accuracy and consistency of SZ diagnosis, providing a more objective and efficient means of assessing the disorder.
Utilizing signal processing and machine learning algorithms, EEG signals can undergo processing and analysis to differentiate between SZ patients and non-SZ individuals. By combining the power of machine learning with the insights from brain imaging techniques, a more objective and efficient means of diagnosing SZ can be achieved. This would enhance the accuracy and consistency of SZ diagnosis, providing a significant step forward in improving the overall management and treatment of the disorder. Nowadays, there is growing interest in the use of deep learning methods as an innovative alternative to traditional feature-based approaches [9,10]. Deep learning algorithms can automatically extract meaningful features and directly classify them from the data, mimicking the data processing and decision-making patterns observed in the human brain. Several researchers have devised approaches for the automated evaluation of SZ cases by utilizing EEG signals.
A comprehensive overview of various machine learning methods employed for distinguishing between SZ and non-SZ classifications can be found in a referenced publication [11]. In a study conducted by Aristizabal et al. [12], the utilization of automated techniques and deep learning algorithms on EEG data was explored to identify individuals between the ages of 9 and 12 who are at a higher risk of developing schizophrenia. Aslan et al. [13] introduce an automated approach for diagnosing SZ patients based on EEG recordings. The proposed method involves transforming the raw EEG data into 2D time-frequency features using the Short-Time Fourier Transform (STFT) and employing a Convolutional Neural Network (CNN) architecture, specifically the VGG16 model. The study highlights a correlation between frequency components observed in EEG recordings and SZ, with mid-level frequencies demonstrating a crucial role in discerning SZ patients from healthy individuals. Sun et al. [14] enhance the classification accuracy between SZ patients and healthy controls by utilizing EEG signals. They introduce a hybrid deep neural network (DNN) that combines CNN and long short-term memory (LSTM) models. The proposed method involves converting fuzzy entropy (FuzzyEn) and fast Fourier transform (FFT) features into RGB images, which are then utilized as input for the network. This approach aims to enhance the accuracy of classifying individuals with SZ by utilizing the transformed EEG features in the form of RGB images. Chandran et al. [15] focus on detecting schizophrenia by utilizing LSTM, a deep learning technique, and extracting features from EEG signals. Nonlinear features such as the Katz fractal dimension (KFD), approximate entropy (ApEn), and time-domain variance values are calculated in their study.
The aim is to leverage these extracted features to effectively identify patterns indicative of schizophrenia within the EEG signals. Shoeibi et al. [16] present intelligent deep learning (DL) approaches for automating the diagnosis of SZ using EEG signals. The study compares DL models, such as LSTMs, one-dimensional convolutional networks (1D-CNNs), and 1D-CNN-LSTMs, with conventional machine learning methods. The objective is to assess the performance and effectiveness of DL-based techniques in accurately diagnosing SZ by analyzing EEG data. Gosala et al. [17] explore the utilization of the Wavelet Scattering Transform (WST) for classifying neuro-disorders, specifically focusing on SZ. The researchers conduct a comparative analysis with Continuous Wavelet Transform (CWT) and Discrete Wavelet Transform (DWT) methods. The study's results suggest that ensemble modeling techniques perform better when using CWT and DWT features, while traditional machine learning methods outperform ensembling methods when utilizing WST features. Siuly et al. [18] developed a deep ResNet-based DL framework to identify schizophrenia. Within the deep ResNet architecture, the feature representation of a deeper unit is defined as the feature representation of a shallower unit, augmented by the accumulated residual responses of the preceding units. The dataset they employed comprises EEG data collected from a total of 81 participants, encompassing 49 individuals diagnosed with schizophrenia and 32 healthy control subjects.
The main novelty of this research is that a 5-second recording of a single-channel EEG is used to achieve a more efficient and accurate model for the automated detection of schizophrenia. In addition, for the first time, we combine knowledge distillation and transfer learning based on CNN networks to enhance the performance of our model. Besides, we utilize the scalogram of the EEG signals, which employs the CWT to convert the EEG signal into a 2D image. This approach allows us to capture the temporal and frequency characteristics of the EEG data in a visual representation. Furthermore, we conducted an analysis to determine the optimal channel for achieving higher accuracy.

Dataset
In this research we have used a publicly available dataset [19]. The study protocol obtained approval from the Ethics Committee of the Institute of Psychiatry and Neurology in Warsaw. The dataset comprises a total of 28 participants, including 14 SZ patients (7 males with an average age of 27.9 ± 3.3 years and 7 females with an average age of 28.3 ± 4.1 years) and 14 healthy individuals (7 males with an average age of 26.8 ± 2.9 years and 7 females with an average age of 28.7 ± 3.4 years). A duration of 15 minutes of EEG data was recorded from all subjects while they were in an eyes-closed (EC) resting-state condition. The data was recorded using the standard 10-20 EEG electrode positions for 19 channels, as shown in Figure 2, with a sampling frequency of 250 Hz. The following EEG channels were recorded: Fp2, F8, T4, T6, O2, Fp1, F7, T3, T5, O1, F4, C4, P4, F3, C3, P3, Fz, Cz, and Pz. Also, PCz was employed as the reference electrode during the recording.

Preprocessing
As mentioned in the previous section, each EEG data recording lasted for 15 minutes, and the sampling frequency was set to 250 Hz. Consequently, we have 225,000 samples per EEG record. In the first stage of preprocessing, we apply a Butterworth band-pass filter to our signal, with a frequency range of 1 to 45 Hz. This filter can be created by combining a low-pass Butterworth filter and a high-pass Butterworth filter. Subsequently, each signal record was divided into smaller 5-second segments [20]. Considering the sampling frequency of 250 Hz, each new record contains 1250 samples. The entire preprocessing pipeline is illustrated in Figure 3. In the end, we obtain 180 records for each subject (the 900-second recording divided into 5-second segments), all labeled identically.
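The filtering and segmentation steps above can be sketched as follows. This is a minimal sketch assuming SciPy; the zero-phase `filtfilt` application and the filter order are illustrative choices, not settings stated in the paper:

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 250          # sampling frequency (Hz)
SEGMENT_SEC = 5   # segment length (s)

def bandpass_filter(signal, low=1.0, high=45.0, fs=FS, order=4):
    """Butterworth band-pass filter (1-45 Hz), applied zero-phase."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, signal)

def segment(signal, fs=FS, seg_sec=SEGMENT_SEC):
    """Split a 1-D signal into non-overlapping 5-second segments."""
    seg_len = fs * seg_sec                  # 1250 samples per segment
    n_segments = len(signal) // seg_len     # 180 for a 15-minute record
    return signal[: n_segments * seg_len].reshape(n_segments, seg_len)

# 15 minutes of synthetic single-channel EEG stands in for one record
raw = np.random.randn(15 * 60 * FS)
segments = segment(bandpass_filter(raw))
print(segments.shape)  # (180, 1250)
```

Each row of `segments` then becomes one labeled 5-second record for the scalogram stage.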

Scalogram using CWT
The wavelet transform is a powerful tool that enables the creation of a two-dimensional time-frequency representation of an EEG signal in the form of an image. To capture the dynamics of EEG signals and extract the power in specific frequency bands, a commonly used approach is the CWT. The CWT is employed to break down a signal into its constituent wavelets. These wavelets are characterized as rapidly changing patterns highly localized in time. Unlike the Fourier Transform (FT), which breaks down a signal into an infinite series of sines and cosines with varying frequencies, the CWT employs a different approach. By utilizing rescaled and repositioned iterations of a mother wavelet, the CWT achieves exceptional precision in both time and frequency localization [21]. The CWT of a signal $x(t)$ is denoted by Equation 1 [22]:

$$W(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt \qquad (1)$$

The wavelets are produced from a wavelet function $\psi(t)$ known as the mother wavelet, whose scaled and translated versions are described by Equation 2:

$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\, \psi\!\left(\frac{t-b}{a}\right) \qquad (2)$$

By combining (1) and (2), we obtain the form of Equation 3:

$$W(a,b) = \int_{-\infty}^{\infty} x(t)\, \psi_{a,b}^{*}(t)\, dt \qquad (3)$$

In Equation 3, $a$ represents the scaling parameter, $b$ represents the translation (time-shift) parameter, and $\psi^{*}$ represents the complex conjugate of $\psi$. The scaling parameter introduces a trade-off between time and frequency information: by adjusting the scaling factor, it is possible to achieve enhanced time resolution at the expense of frequency resolution [22].
In this study, we utilize the Complex Morlet wavelet (cmorB-C) as the mother wavelet, given by Equation 4:

$$\psi(t) = \frac{1}{\sqrt{\pi B}}\, e^{-t^{2}/B}\, e^{j 2\pi C t} \qquad (4)$$

In cmorB-C, B represents the bandwidth, and C represents the center frequency.
Figure 4 illustrates the transformation of signal records into images using the CWT for both a healthy subject and a subject with schizophrenia.
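The signal-to-scalogram conversion can be sketched with PyWavelets. The number of scales and the linear frequency grid below are illustrative assumptions, not the paper's exact settings; only the wavelet name (`cmor1.5-1.0`), sampling rate, and 1-45 Hz band come from the text:

```python
import numpy as np
import pywt

FS = 250
WAVELET = "cmor1.5-1.0"  # Complex Morlet: bandwidth B = 1.5, center frequency C = 1

def scalogram(segment, fs=FS, fmin=1.0, fmax=45.0, n_scales=64):
    """Return the |CWT| coefficients of one EEG segment as a 2-D array."""
    freqs = np.linspace(fmin, fmax, n_scales)
    # For the complex Morlet wavelet, scale = C * fs / f
    scales = pywt.central_frequency(WAVELET) * fs / freqs
    coefs, _ = pywt.cwt(segment, scales, WAVELET, sampling_period=1 / fs)
    return np.abs(coefs)

seg = np.random.randn(5 * FS)  # one 1250-sample, 5-second segment
image = scalogram(seg)
print(image.shape)  # (64, 1250)
```

The resulting array can then be rendered and saved as an RGB image (e.g. with Matplotlib's `imshow`) before being resized to 224×224 for the network.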

Convolutional neural networks (CNN)
The CNN is a type of neural network that operates in a feedforward manner. In contrast to traditional methods of feature extraction, the CNN excels at extracting features from data using convolution structures [23]. Convolutional neural networks eliminate the need for manual feature extraction. The design of the CNN is based on the principles of visual perception [24]. It finds extensive applications in various types of data, particularly in the analysis and processing of images [25]. The main components of a CNN architecture include convolutional layers, pooling layers, dropout layers, dense layers, softmax layers, and classification layers. Convolution, pooling, and dropout layers are responsible for extracting deep features, while dense, softmax, and output layers are used for classification purposes [26]. The results obtained from the convolution process are referred to as feature maps. When applying a convolution kernel of a specific size, it is possible to lose information at the edges of the input data. To compensate, padding is utilized to expand the size of the input by adding zero values, which provides the necessary adjustment for the output size. Furthermore, the density of convolution is controlled by the stride: a larger stride leads to a lower density of convolutions. Following convolution, the feature maps contain numerous features, which can contribute to the risk of overfitting. Therefore, pooling techniques, such as max pooling and average pooling, are introduced to eliminate redundancy and reduce the dimensionality of the feature maps [23,27,28]. Nonlinear layers, often implemented with activation functions like ReLU, are utilized to enhance the network's capability to handle nonlinear problems. After each convolutional and fully connected layer, the ReLU is applied to introduce nonlinearity into the neural network [20].

Transfer Learning
Transfer learning involves taking advantage of a pre-trained model that has been trained on a specific task and applying it to solve a new problem or task. Due to its ability to train deep neural networks with limited data, transfer learning has gained significant popularity in the field of deep learning. The knowledge gained from one task (task A) is leveraged to enhance generalization and performance on a different task (task B), which is achieved by transferring the learned weights of a network from task A to task B. Transfer learning offers various benefits, including reduced training time, typically improved performance of neural networks, and the ability to achieve good results even with limited data [29].

Pre-trained models
Several pre-trained machine learning models have gained popularity in the field. Pre-trained CNN models are trained on extensive datasets that contain a vast number of images. Leveraging pre-trained CNN models can expedite and streamline the training process. Rather than starting from scratch, we can employ the pre-trained model's weights and learned features as a base and fine-tune them for our particular task. By adopting this approach, we can make use of the knowledge encoded in the pre-trained model, addressing concerns regarding overfitting and underfitting in the process. VGGNet, ResNet, and Inception are highly recognized and extensively employed pre-trained CNN models in the domain of image processing. These models, which were the best-performing models of the ILSVRC from 2012 to 2015, have been trained on the ImageNet database. ImageNet is a vast image database containing over 14 million labeled images organized into more than 20,000 categories, and it has been instrumental in driving advancements in computer vision and deep learning research. Additionally, a range of lightweight CNN-based networks, such as MobileViT and MobileNetV3, could be explored to achieve low parameter demands and real-time detection capabilities [30].

VGG16 CNN-based model
VGG16 is a CNN-based model that is widely regarded as one of the most advanced and effective models in the field of computer vision; it can accurately classify images into 1000 distinct categories. The architecture of this network is shown in Figure 5. The designation "16" in VGG16 signifies the presence of 16 weight-bearing layers within the model. VGG16 consists of a total of 21 layers, including 13 convolutional layers, 5 max-pooling layers, and 3 dense layers. However, out of these 21 layers, only 16 contain learnable parameters, known as weight layers. VGG16 accepts input tensors of size 224×224 with three RGB channels. A distinctive aspect of VGG16 is its emphasis on 3×3 filters in the convolution layers with a stride of 1, along with consistent padding. Additionally, it consistently incorporates max-pooling layers with 2×2 filters and a stride of 2. This approach minimizes the number of hyperparameters in the model. Max-pooling layers are inserted between the blocks, and following the 5 blocks of convolutional layers, there are three fully connected layers. The last layer is a softmax layer, which generates output probabilities for each class [31].

Knowledge distillation
Knowledge distillation denotes the procedure of transferring knowledge from a sophisticated model to a more straightforward one. It involves training smaller models to achieve similar accuracy to larger models by leveraging the knowledge gained from the larger models. Within the context of knowledge distillation, the term "teacher network" is used to describe the larger model, whereas the "student network" refers to the smaller network. The fundamental concept behind knowledge distillation is to train a smaller and less complex model to imitate the behavior and generalization capabilities of a larger and more complex model. The process involves transferring knowledge from the teacher network to the student network by optimizing a loss function [32]. Figure 6 [33] illustrates a typical framework for knowledge distillation, where a teacher-student relationship is established.

Figure 6. The general framework for knowledge distillation involving a teacher-student relationship
The methodologies for knowledge distillation can be classified into three primary categories based on the synchronization of updates between the teacher and student models: self-distillation, online distillation, and offline distillation [33]. Figure 7 showcases each of these methods individually.
Figure 7. "Pre-trained" refers to networks learned prior to knowledge distillation, and "To be trained" refers to networks learned during knowledge distillation.
In this research, we have used the self-distillation method. Self-distillation involves utilizing identical networks for both the teacher and student models. It can be seen as a form of online distillation where knowledge from the deeper layers of a network is transferred to the shallower layers of the same network. Knowledge acquired by the teacher model during the early epochs can also be transferred to its later epochs to facilitate the training of the student model [33].

Proposed method
Within this section, a thorough description of our proposed methodology will be presented, delving into its various aspects and dimensions.
As mentioned before, one of the notable aspects of this research is the innovative use of single-channel EEG signals for the automated diagnosis of schizophrenia, while still achieving the desired accuracy. By employing single-channel EEG, we are confronted with a relatively smaller dataset in comparison to the multi-channel mode. Furthermore, the utilization of single-channel EEG covers a narrower region of the brain surface, posing certain difficulties for specific tasks. Given these limitations, it becomes vital to identify the optimal brain area and channel that can provide the most valuable data for the diagnosis of schizophrenia. On the flip side, utilizing a single channel brings certain benefits, including enhanced subject comfort and cost reduction. Additionally, the lower volume of data associated with the single-channel mode allows for faster diagnosis speeds.
Self-distillation is a form of knowledge distillation where the teacher and student networks are entirely identical [33]. By employing this approach, we expect to enhance the accuracy of our base model as we progress through the utilization of the student network. For this study, we employed the VGG16 network as both the teacher and the student network. The only adjustment needed to apply this network is to modify the number of neurons in the last layer from 1000 to 2, aligning with the two classes we need to classify. It is important to mention that all layers' weights are set as trainable, enabling continuous updates throughout the training process.
Following the application of a frequency band-pass filter ranging from 1 to 45 Hz and the segmentation of the EEG signal data into 5-second intervals, we employ the Complex Morlet mother wavelet with a bandwidth of 1.5 and a center frequency of 1 (cmor1.5-1). This process generates the corresponding scalogram images for each segment. In the next stage, we resize the generated images to dimensions of 224×224 to ensure compatibility with the VGG16 network. Following that, we normalize the images using mean values of [0.485, 0.456, 0.406] and standard deviation values of [0.229, 0.224, 0.225]. We split our dataset into three main parts, where 70% of the dataset is used for training, 10% is used for validation, and the remaining 20% is reserved for testing purposes. The training process commences with the training of the teacher network, and we save the best model obtained, determined by the lowest loss achieved during training. Following that, we move on to the training of the student network, utilizing the previously saved model of the teacher network to enhance the training process. The distillation of knowledge from the teacher network to the student network is achieved through the combination of loss values. The combination of the loss functions of these two networks is illustrated in Figure 8.

Figure 8. Self-distillation structure
As depicted, the final loss function is a combination of two loss functions, namely Loss 1 ($\mathcal{L}_1$) and Loss 2 ($\mathcal{L}_2$), and is represented by Equation 5, where the weight $\alpha$ balances the two terms:

$$\mathcal{L} = \alpha\, \mathcal{L}_1 + (1-\alpha)\, \mathcal{L}_2 \qquad (5)$$
It is worth noting that the knowledge distillation process incorporates the softmax function along with a temperature parameter (T), which is represented by Equation 6. The temperature hyperparameter is used to adjust the logits, thereby influencing the final probabilities obtained from the softmax function. When T is set to 1, the resulting output distribution is identical to that of a typical softmax output. By increasing the value of T, we can enhance the randomness of the output distribution, introducing more variability [34]. For this study, we assigned a value of 10 to the parameter T.

$$\text{softmax}(y_i; T) = \frac{\exp(y_i / T)}{\sum_{j} \exp(y_j / T)} \qquad (6)$$
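The flattening effect of the temperature can be checked in a few lines; this sketch assumes PyTorch, and the example logits are arbitrary:

```python
import torch
import torch.nn.functional as F

def soft_targets(logits, T=10.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    return F.softmax(logits / T, dim=-1)

z = torch.tensor([2.0, 1.0])
sharp = soft_targets(z, T=1.0)   # ordinary softmax
soft = soft_targets(z, T=10.0)   # T = 10, as used in this study
print(sharp[0].item(), soft[0].item())
```

With T = 10 the two class probabilities move much closer to uniform, which is what makes the teacher's "soft" targets informative for the student.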
The $\mathcal{L}_1$ loss is a discrete form of the KL divergence function. KL divergence is a metric that measures the difference between two probability distributions. Equation 7 provides the formula for calculating the KL divergence:

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \qquad (7)$$
In Equation 5, $\mathcal{L}_2$ represents the cross-entropy loss function, which is defined by Equation 8. Cross-entropy loss measures the difference between the predicted probability distributions and the actual labels in deep learning classification models:

$$\mathcal{L}_2 = -\sum_{i} y_i \log \hat{y}_i \qquad (8)$$
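A minimal sketch of the combined distillation loss in PyTorch follows. The $\alpha$ weighting and the $T^2$ scaling of the KL term follow the standard Hinton-style formulation and are assumptions here, not values stated in the paper:

```python
import torch
import torch.nn.functional as F

T, alpha = 10.0, 0.5  # temperature from the paper; alpha is an assumed weight

def distillation_loss(student_logits, teacher_logits, labels, T=T, alpha=alpha):
    """L1: KL divergence between temperature-softened distributions; L2: cross-entropy."""
    l1 = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    l2 = F.cross_entropy(student_logits, labels)
    return alpha * l1 + (1 - alpha) * l2

s = torch.randn(4, 2)
labels = torch.tensor([0, 1, 0, 1])
loss = distillation_loss(s, s.detach(), labels)  # teacher == student, so the KL term is 0
```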
Algorithm 1 illustrates the proposed training method through the use of pseudo-code, outlining the method in a step-by-step manner.

Algorithm 1: Training Procedure of the Proposed Method
Input: A batch of images (X) and their labels (Y)

Output: Accuracy and loss
Definitions: num_epochs ← 30, T ← 10, min_loss ← inf

This collaborative approach aims to optimize the student network, ultimately resulting in the final model. Table 1 presents the key hyper-parameters employed during the execution of both the teacher and student networks. Figure 9 illustrates the architecture of the proposed model in this research, showcasing all its components and aspects.
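A condensed sketch of the two-stage training procedure follows, with tiny stand-in linear networks in place of the two VGG16 models so it runs quickly; the optimizer, learning rate, and $\alpha$ weighting are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in networks; in the paper both teacher and student are VGG16.
teacher, student = nn.Linear(8, 2), nn.Linear(8, 2)
opt_t = torch.optim.Adam(teacher.parameters(), lr=1e-3)
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
X, Y = torch.randn(32, 8), torch.randint(0, 2, (32,))
T, alpha, num_epochs, min_loss = 10.0, 0.5, 30, float("inf")

# Stage 1: train the teacher, keeping the weights with the lowest loss.
best_state = None
for epoch in range(num_epochs):
    opt_t.zero_grad()
    loss = F.cross_entropy(teacher(X), Y)
    loss.backward()
    opt_t.step()
    if loss.item() < min_loss:
        min_loss = loss.item()
        best_state = {k: v.clone() for k, v in teacher.state_dict().items()}
teacher.load_state_dict(best_state)

# Stage 2: train the student against the saved teacher's softened outputs.
for epoch in range(num_epochs):
    opt_s.zero_grad()
    with torch.no_grad():
        t_logits = teacher(X)
    s_logits = student(X)
    l1 = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * T * T
    l2 = F.cross_entropy(s_logits, Y)
    (alpha * l1 + (1 - alpha) * l2).backward()
    opt_s.step()
```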

Results
In this section, we showcase the implementation results of the proposed architecture and conduct a comparative analysis with similar works.The purpose of this analysis is to demonstrate the effectiveness and performance of our approach in diagnosing schizophrenia based on EEG signals.
Various evaluation metrics can be employed to assess the performance of the proposed method, and the selection of the appropriate metric depends on the type of data and task at hand. For this study, we utilized five widely recognized evaluation metrics, namely accuracy, F1-score, precision, sensitivity, and specificity. These metrics collectively offer a comprehensive evaluation of the performance and effectiveness of the proposed method. Brief descriptions of each metric are provided below. All of these metrics can be computed using just four parameters: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). A TP occurs when the proposed model predicts a positive outcome and the actual outcome is also positive. When the predicted results indicate negative outcomes that are indeed confirmed to be negative, we refer to them as TN. Instances that are predicted as positive but are actually negative are classified as FP. When the model predicts a negative outcome but the actual outcome turns out to be positive, the instance is known as an FN.

Accuracy:
Accuracy is the most frequently used metric for evaluating a model's performance, although it may not provide a definitive measure of effectiveness on its own. It is calculated using the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

F1-score:
The F1 score is a metric that blends precision and recall (sensitivity) using their harmonic mean. It considers the contributions of both metrics, making a higher F1 score indicative of better performance:

F1 = 2 × (Precision × Sensitivity) / (Precision + Sensitivity)
Precision: This metric indicates the proportion of positive instances within the total instances predicted as positive: Precision = TP / (TP + FP).
Sensitivity: This metric signifies the ratio of correctly identified positive instances among the total instances that are indeed positive: Sensitivity = TP / (TP + FN).
Specificity: This metric denotes the proportion of correctly identified negative instances among the total instances that are actually negative: Specificity = TN / (TN + FP).
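The five metrics above can be computed from the four confusion-matrix counts in a few lines; the example counts used here are arbitrary:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(accuracy=accuracy, f1=f1, precision=precision,
                sensitivity=sensitivity, specificity=specificity)

m = classification_metrics(tp=90, tn=85, fp=5, fn=10)
print(m["accuracy"])  # 175 / 190 ≈ 0.9211
```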
The confusion matrix provides a useful visual representation of the key parameters (TP, TN, FP, and FN). It is commonly displayed in the form of a matrix, whose general structure is shown in Figure 10.

Figure 10. Structure of Confusion Matrix
Table 2 presents the evaluation criteria for all channels in this study. As previously mentioned, the number of epochs is set to 30. Figure 11 illustrates the changes in accuracy and loss throughout these epochs for two channels, P4 and P3. It is evident that overfitting did not occur, indicating that the network was effectively trained. Also, Figure 12 provides an example of the confusion matrix associated with the two EEG signal channels, P4 and P3, as an illustration of the model's performance in classifying schizophrenia. For this study, the Python programming language and the PyTorch library were utilized. The experiments were conducted on Google Colab Pro, utilizing a Tesla T4 graphics card and 15 GB of graphics RAM.

Discussion
By simultaneously employing transfer learning and self-distillation for the automatic diagnosis of SZ, we were able to attain an accuracy of 97.83% ± 1.3 using just a 5-second single-channel (P4) EEG signal. This illustrates how the proposed approach is efficient in achieving precise diagnoses using a short input duration. As reiterated earlier, a primary aim of this study is to enhance the model's efficiency, minimize input data duration, and alleviate computational expenses; this objective has been markedly and effectively attained.
The impact of SZ on distinct brain regions varies. Table 3 illustrates the average metrics derived from EEG channels associated with different areas. In general, the parietal region exhibits the highest performance compared to other regions, aligning with the findings of Yildiz et al. [35]. Therefore, selecting a channel from this specific region could be a favorable decision for automated schizophrenia detection. In contrast to the research in [41], we have employed 5-second EEG signal records to reduce the time and cost of schizophrenia diagnosis, whereas they utilized 25-second EEG signal records in their study. By employing images instead of signals as input, we can harness the power of pre-trained models more efficiently. This is due to the fact that pre-trained models are usually trained on image data, which makes them well-suited and compatible for tasks involving images. Moreover, by leveraging pre-trained models and transfer learning, we can optimize the model with fewer iterations, resulting in reduced computational demands. In contrast, when building models from scratch, we must train and update the network weights from the beginning. The application of self-distillation enables us to achieve higher accuracy without increasing the size and complexity of our base model. This not only reduces the training costs but also lowers the computational requirements of the network.
Among the existing limitations, we can highlight the challenge of optimally selecting the parameters associated with the wavelet transformation and the parameters of the pre-trained model. Equally crucial is the selection of the appropriate pre-trained CNN model type. The length of the EEG signal's time window also holds pivotal importance and requires careful consideration: it should not be excessively long while still containing sufficient information for SZ detection.

Conclusion
One approach to utilizing large pre-trained networks for EEG signals is to transform the signal into an image using various methods and then input it into the model. In this research, we employed the Complex Morlet wavelet method for this conversion process. Collecting EEG signals using a multitude of channels can pose challenges and inconvenience, especially for individuals facing physical or mental challenges. To overcome this challenge and streamline the process, it is recommended that studies adopt a single-channel EEG approach. This shift can enhance user comfort and simplify data collection procedures. Moreover, as artificial intelligence continues to advance and find applications in disease diagnosis and prediction, it becomes essential to focus on developing optimized deep learning networks. This approach not only saves time but also reduces costs, making it a promising direction for future research and real-world applications. Given the challenges highlighted, we decided in this research to utilize single-channel EEG for the automatic diagnosis of schizophrenia, and according to the obtained results, the P4 channel demonstrated the highest accuracy for diagnosing schizophrenia, achieving an accuracy of 97.83%. In this study, rather than utilizing a large pre-trained network, we chose to improve the performance of a smaller network through the use of knowledge distillation. This approach offers advantages such as lower hardware costs and faster inference, leading to increased efficiency in obtaining results.
In the future, there are various ways to expand and further advance this work. One potential approach involves integrating multi-modal learning with knowledge distillation. This would involve utilizing both the raw EEG signal and its corresponding scalogram image as inputs for two separate models concurrently, potentially resulting in a more effective and robust system. Also, the integration of transformer networks holds significant potential for future endeavors focused on improving and accelerating the automated diagnosis of schizophrenia.

Statements & Declarations
Funding The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Figure 1
Figure 1 illustrates the structure of the presented methodology. It has 5 main parts: data acquisition, segmentation and frequency filtering, scalogram generation, knowledge distillation, and classification. In the subsequent sections, we will provide a thorough examination of each of these steps, offering clear explanations and in-depth discussions.

Figure 1 .
Figure 1. Block diagram for the presented methodology

Figure 3 .
Figure 3. The entire preprocessing pipeline, which includes frequency filtering and segmentation

Figure 4 .
Figure 4. Transformation of signals into images using CWT for both a subject with schizophrenia (a) and a healthy subject (b)

Figure 9 .
Figure 9. The architecture of the proposed model

Figure 11 .
Figure 11. This figure depicts the learning curves for two channels: (a) P4 and (b) P3.

Table 1 .
Hyper-parameter values of the Teacher and Student Models

Table 2 .
Comparison of results for all EEG channels using various evaluation criteria, including the proposed method.

Table 3 .
Mean accuracy, F1 score, precision, sensitivity, and specificity values for SZ detection across various brain regions.

Table 4 .
Comparison of the performance of the proposed method with existing methods.