*A. Superlet transform*

The superlet transform (SLT) was proposed by Moca et al. [24] as an improvement on the continuous wavelet transform (CWT) that sharpens the time-frequency (T-F) representation of non-stationary signals. The wavelet transform (WT) entails a trade-off between time and frequency resolution in the joint T-F domain. For example, a CWT using a Morlet wavelet with a small number of cycles provides precise temporal information but poor frequency resolution. Conversely, increasing the number of cycles of the Morlet wavelet improves the frequency resolution at the cost of temporal resolution. To overcome this trade-off, the superlet combines shorter wavelets (fewer cycles, high time resolution) with longer wavelets (more cycles, high frequency resolution but poor time resolution). In other words, the superlet transform uses a set of Morlet wavelets with a fixed central frequency (*ω*) and a progressively increasing number of cycles, and hence a progressively narrower bandwidth. Mathematically, a superlet can be expressed as [24]:

$$\mathrm{SLT}_{\omega,k} = \left\{\Psi_{\omega,c} \mid c = c_1, c_2, \ldots, c_k\right\} \tag{1}$$

In the above equation, Ψ is the mother wavelet with centre frequency *ω*, *k* is the order of the SLT, and *c* is the number of cycles. In the SLT, the number of cycles can be selected either additively or multiplicatively, given by \(c_i = c_1 + i - 1\) and \(c_i = i \cdot c_1\), respectively, for *i* = 2, …, *k*. In other words, the SLT is a multi-order wavelet transform that covers multiple bandwidths at a fixed centre frequency. A superlet of order *k* = 1 contains a single wavelet, so the SLT then reduces to the CWT. The SLT is computed in the same way as the CWT, except that superlets (SLs) are used instead of wavelets. The response of the SLT to an arbitrary signal *g*(*t*) is defined as the geometric mean (GM) of the responses of the individual wavelets in the set:

$$R\left[\mathrm{SLT}_{\omega,k}\right] = \sqrt[k]{\prod_{n=1}^{k} R\left[\Psi_{\omega,c_n}\right]} \tag{2}$$

In the above equation, \(R\left[\Psi_{\omega,c_n}\right]\) is the response of the *n*th wavelet to the signal. For complex wavelets such as Morlet or Gabor, complex convolution is used, and each individual response in Eq. (2) is computed as:

$$R\left[\Psi_{\omega,c_n}\right] = \sqrt{2}\; g(t) * \Psi_{\omega,c_n} \tag{3}$$

In Eq. (3), * denotes the complex convolution operation, \(\sqrt{2}\) is a factor used only for analytic wavelets, and *g*(*t*) is an arbitrary signal. The SLT can estimate the oscillation packets present in any non-stationary signal at the central frequency *ω*. The magnitude of the SLT is computed by taking the GM of the magnitudes of the individual wavelet responses. Finally, the magnitude of the SLT is squared to obtain the time-frequency scalogram.
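The steps in Eqs. (1) to (3) can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used in this work: the function names, the Gaussian-windowed form of the Morlet wavelet, the constant `k_sd = 5`, and the ±3 standard deviation support are assumptions made for the example.

```python
import cmath
import math


def cycle_counts(c1, k, multiplicative=False):
    """Cycle counts c_1..c_k of an order-k superlet (additive or multiplicative)."""
    if multiplicative:
        return [i * c1 for i in range(1, k + 1)]
    return [c1 + i - 1 for i in range(1, k + 1)]


def morlet(freq, cycles, fs, k_sd=5.0):
    """Sampled complex Morlet wavelet (assumed form: unit-area Gaussian envelope)."""
    sd = cycles / (k_sd * freq)            # temporal standard deviation in seconds
    half = int(math.ceil(3.0 * sd * fs))   # keep +/- 3 standard deviations
    samples = []
    for n in range(-half, half + 1):
        t = n / fs
        envelope = math.exp(-t * t / (2.0 * sd * sd)) / (sd * math.sqrt(2.0 * math.pi))
        samples.append(envelope * cmath.exp(2j * math.pi * freq * t))
    return samples


def wavelet_response(signal, wavelet, fs):
    """Magnitude of sqrt(2) * (g * Psi), Eq. (3), with the same length as the signal."""
    half = len(wavelet) // 2
    out = []
    for i in range(len(signal)):
        acc = 0j
        for j, w in enumerate(wavelet):
            idx = i - (j - half)           # convolution index g(t - tau)
            if 0 <= idx < len(signal):
                acc += signal[idx] * w
        out.append(math.sqrt(2.0) * abs(acc) / fs)  # 1/fs is the Riemann-sum step
    return out


def superlet_response(signal, freq, c1, k, fs, multiplicative=False):
    """Geometric mean of the k individual wavelet responses, Eq. (2)."""
    responses = [
        wavelet_response(signal, morlet(freq, c, fs), fs)
        for c in cycle_counts(c1, k, multiplicative)
    ]
    return [
        math.prod(r[i] for r in responses) ** (1.0 / k)
        for i in range(len(signal))
    ]
```

For instance, `cycle_counts(3, 4)` returns `[3, 4, 5, 6]` and `cycle_counts(3, 4, multiplicative=True)` returns `[3, 6, 9, 12]`. Squaring the values returned by `superlet_response` for each frequency of interest gives one row of the scalogram.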

*B. Adaptive superlet transform*

The adaptive superlet transform (ASLT) was also proposed in [24]. In the ASLT, the order of the superlet is adjusted to the central frequency to compensate for the decreasing bandwidth as the frequency increases. The ASLT starts with a low order, *k* = 1, to estimate low frequencies. The order *k* is then increased monotonically as a function of the central frequency (*ω*) as follows [24]:

$$\mathrm{ASLT}_{\omega,k} = \mathrm{SLT}_{\omega,k} \mid k = p(\omega) \tag{4}$$

so that improved resolution in both time and frequency is attained over the entire frequency range. In Eq. (4), *p*(*ω*) takes integer values. A common choice is to vary the order linearly with frequency [24]:

$$p(\omega) = k_{\min} + \left[\left(k_{\max} - k_{\min}\right)\frac{\omega - \omega_{\min}}{\omega_{\max} - \omega_{\min}}\right] \tag{5}$$

In Eq. (5), \(k_{\min}\) and \(k_{\max}\) denote the orders corresponding to the smallest and largest central frequencies, \(\omega_{\min}\) and \(\omega_{\max}\), respectively, and [ ] is the nearest-integer operator. The ASLT is recommended when the frequency range of interest is wide, whereas the SLT may be used for narrower frequency bands. Since the frequency content of PQDEs can range from the standard power frequency (50 Hz) to several kHz for high-frequency transients, we have used the ASLT for the analysis of PQDEs to achieve better resolution in the T-F frame.
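As a small worked illustration of Eq. (5), the order can be computed as below. The function name and the numerical values in the usage note (orders 1 to 30 over 50 Hz to 5 kHz) are hypothetical and chosen only for the example; they are not settings reported in this work.

```python
import math


def adaptive_order(freq, f_min, f_max, k_min, k_max):
    """Superlet order p(omega) from Eq. (5): linear in frequency, rounded to nearest."""
    fraction = (freq - f_min) / (f_max - f_min)
    # floor(x + 0.5) rounds halves upward; Python's round() would round halves to even.
    return k_min + int(math.floor((k_max - k_min) * fraction + 0.5))
```

With these illustrative settings, `adaptive_order(50.0, 50.0, 5000.0, 1, 30)` gives 1 at the lowest frequency and `adaptive_order(5000.0, 50.0, 5000.0, 1, 30)` gives 30 at the highest, with the order growing in integer steps in between.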

*C. Convolutional neural network*

A convolutional neural network (CNN) is a mathematical model that maps input image data to a new feature representation. The basic structure of a CNN consists of the following layers [9, 25]:

• **Input layer**:

This is the first layer of the CNN architecture, where the image is given as input. The input image must be resized to the dimensions expected by the network before it is fed to the convolution layer.

• **Convolution layer**:

The convolution layer (CL) is the core block of the CNN architecture and comprises a set of learnable filters known as “kernels”. In this layer, the input image or the output of the previous layer is convolved with the kernels to extract feature maps [15–19]. The convolution operation can be mathematically expressed as [25]:

$$(P * Q)(i,j) = \sum_{a}\sum_{b} P(a,b)\, Q(i+a,\, j+b) \tag{6}$$

The performance of a CL depends on factors such as the size and the number of kernels. During the convolution operation, the kernels are moved by a fixed step known as the “stride”. In addition, zero padding is applied to the input in order to maintain the image size. Consider an input image of dimension \(W_m \times H_m \times K_m\), where \(W_m\) is the width, \(H_m\) is the height, and \(K_m\) is the number of channels. With \(K_0\) kernel filters of size *r* × *r*, the dimensions of the output feature map, \(W_0 \times H_0 \times K_0\), can be written as [25]:

$$W_0 = \frac{W_m - r + 2z}{q} + 1, \qquad H_0 = \frac{H_m - r + 2z}{q} + 1 \tag{7}$$

In Eq. (7), *q* is the stride and *z* is the size of the zero padding [17]. The convolution operation is combined with an activation layer, which introduces non-linearity into the network.

• **Pooling layer**:

After the convolution operation, the dimension of the convolved features is very large, which may cause excessive computational cost if they are used directly. The pooling layer (PL) therefore reduces the dimension of the convolved features while conserving the features with a high degree of spatial structure [11]. The pooling layer also helps to control overfitting. There are two common pooling methods, namely average and max pooling. Average pooling takes the average value of the convolved feature map within a pooling window, whereas max pooling selects the maximum value within the window.

• **Fully connected layer**:

In the fully connected (FC) layer, the output of the CL/PL is converted into a one-dimensional feature vector and the score for each category is calculated. The FC layer is similar to an ordinary neural network: each hidden unit in this layer is connected to all activations in the previous layer [25].

• **Softmax layer**:

The softmax function maps the scores obtained from the FC layer into probabilities [9]. The most probable class is then predicted from these probabilities.
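The layer operations described above can be sketched in plain Python for a single-channel image. This is only an illustrative sketch: the function names are hypothetical, the kernel is taken to be the first argument of Eq. (6), and the integer division in the size formula assumes that the stride divides the padded extent as in Eq. (7).

```python
import math


def conv_output_size(size_in, r, z, q):
    """Eq. (7): output width or height for kernel size r, zero padding z, stride q."""
    return (size_in - r + 2 * z) // q + 1


def conv2d(image, kernel, stride=1, pad=0):
    """Single-channel form of Eq. (6) with zero padding and stride."""
    h, w, r = len(image), len(image[0]), len(kernel)
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for i in range(h):
        for j in range(w):
            padded[i + pad][j + pad] = image[i][j]
    out = []
    for i in range(conv_output_size(h, r, pad, stride)):
        row = []
        for j in range(conv_output_size(w, r, pad, stride)):
            acc = 0.0
            for a in range(r):
                for b in range(r):
                    acc += kernel[a][b] * padded[i * stride + a][j * stride + b]
            row.append(acc)
        out.append(row)
    return out


def relu(feature_map):
    """Element-wise ReLU activation applied after the convolution."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping max or average pooling with a size x size window."""
    out = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out


def softmax(scores):
    """Map FC-layer scores to probabilities (shifted by the maximum for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, `conv_output_size(224, 3, 1, 1)` returns 224, which shows that a 3×3 kernel with unit stride and a zero padding of 1 preserves the image size.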

Combining the above layers, one can build a customized CNN architecture. The number of layers in such an architecture is chosen empirically and depends on the classification task. The performance of a CNN architecture is governed by various factors, such as the number and size of kernels, the number of CLs, the type of activation function, the type of pooling, the number of FC layers, and the number of hidden units in the FC layers. Several benchmark CNN models, such as AlexNet, VGGNet, and ResNet, have been proposed in the literature for image classification. A brief overview of these CNN models is given below.

(1) *AlexNet CNN*

The AlexNet CNN was proposed by Krizhevsky et al. in 2012 [26] and won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The ILSVRC dataset contains 1.2 million images with 1000 different class labels. A detailed description of the AlexNet model is given in [26]. AlexNet consists of 8 learnable layers. The first two convolution layers have 96 and 256 filters of size 11×11×3 and 5×5×48, respectively, and each is followed by local response normalization and a max-pooling layer with a 3×3 window. The next three convolution layers have 384, 384, and 256 filters of size 3×3×256, 3×3×192, and 3×3×192, respectively, and are followed by another max-pooling layer with a 3×3 window. Finally, there are two fully connected layers with 4096 neurons each, both with dropout, and one FC layer at the output. AlexNet is a series-connected CNN model with about 60 million learnable parameters.

(2) *VGGNet CNN*

VGGNet is a widely used CNN architecture originally proposed by the Oxford Visual Geometry Group (VGG) [27]. Like AlexNet, VGGNet was trained on the ILSVRC database. The number of layers in VGGNet models varies from 11 to 19; in this work we have used the 16-layer network known as VGGNet16. VGGNet16 has 13 convolution layers with 3×3 convolution filters. Five max-pooling layers of size 2×2 with a stride of 2 are placed after successive groups of convolution layers to reduce the spatial size of the extracted feature maps. The final max-pooling layer is followed by three FC layers, the first two of which have 4096 neurons each. A softmax classification layer is connected at the output. Like AlexNet, VGGNet16 is a series-connected network, with about 138 million learnable parameters.
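The figure of about 138 million parameters can be checked with simple arithmetic. The sketch below counts weights and biases layer by layer; the channel widths, the 224×224 input, and the 1000-class output layer follow the standard VGG16 configuration and are assumptions here, since they are not stated in the text above. The function names are hypothetical.

```python
# Output channels of the 13 convolution layers in the standard VGG16 configuration.
VGG16_CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]


def conv_params(c_in, c_out, r=3):
    """Weights plus biases of a convolution layer with r x r kernels."""
    return r * r * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def vgg16_param_count(in_channels=3, input_size=224, num_classes=1000):
    total = 0
    c_in = in_channels
    for c_out in VGG16_CONV_CHANNELS:
        total += conv_params(c_in, c_out)
        c_in = c_out
    spatial = input_size // 2 ** 5       # five 2x2 poolings with stride 2
    total += fc_params(c_in * spatial * spatial, 4096)
    total += fc_params(4096, 4096)
    total += fc_params(4096, num_classes)
    return total
```

Under these assumptions, `vgg16_param_count()` returns 138,357,544, of which roughly 124 million belong to the three FC layers.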

(3) *ResNet CNN*

Residual networks, also known as ResNets, are a family of popular CNNs. ResNets, proposed in [28], work on the principle of residual learning, which was introduced to address the vanishing gradient problem often encountered in deeper networks. In a ResNet, an ‘identity mapping’ strategy is incorporated in the hidden layers. Identity mapping mitigates the vanishing gradient problem through skip connections, i.e., shortcut paths that allow gradients to bypass layers. The skip connections also help to reduce overfitting during feature extraction. ResNets of different depths, such as 18, 50, and 101 layers, have been proposed in the literature. In this study, the 50-layer ResNet, i.e., ResNet50, has been used to classify PQDEs. The ResNet50 model consists of 5 stages, each containing a convolution block and an identity mapping block. The convolution block and the identity mapping block each contain 3 convolution layers. The total number of learnable parameters in ResNet50 is about 23 million.
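The effect of a skip connection on the gradient can be illustrated with a toy scalar example. This is not a ResNet: each "layer" is a hypothetical one-parameter function tanh(*w*·*x*) with a small weight, chosen only so that its local derivative is well below 1. Without a shortcut the gradient is the product of the local derivatives; with an identity shortcut each factor becomes 1 plus the local derivative.

```python
import math


def layer(x, w=0.1):
    """Toy scalar layer (hypothetical), standing in for a convolution block."""
    return math.tanh(w * x)


def layer_grad(x, w=0.1):
    """Derivative of the toy layer with respect to its input."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def chain_gradient(x, depth, residual):
    """Gradient of the output with respect to the input after `depth` layers."""
    grad = 1.0
    for _ in range(depth):
        local = layer_grad(x)
        if residual:
            grad *= 1.0 + local      # y = x + F(x), so dy/dx = 1 + F'(x)
            x = x + layer(x)
        else:
            grad *= local            # y = F(x), so dy/dx = F'(x)
            x = layer(x)
    return grad
```

In this toy setting the plain chain of 20 layers yields a gradient many orders of magnitude below 1, whereas the residual chain keeps it at or above 1, which is the behaviour the shortcut paths are meant to provide.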

*D. Proposed CNN model*

Although these pre-trained deep learning models have delivered satisfactory performance in image classification, one major limitation is that none of the aforesaid benchmark models is customizable. In other words, the number of layers in each benchmark CNN is fixed and cannot be tuned to the requirements of the user. Moreover, these CNN models are computationally expensive and have a large number of learnable parameters. For the classification of PQDE T-F images, a lightweight CNN model with fewer learnable parameters may deliver better performance than the existing models at a reduced computational burden. This motivates us to design a lightweight CNN model for the classification of PQDE images. The detailed architecture of the developed CNN model is given in Fig. 2.

The proposed CNN structure consists of 12 layers, including 3 convolution layers with filters of size 3×3. The numbers of filters in the three convolution layers were fixed at 12, 32, and 64, respectively. Each convolution layer is followed, in order, by batch normalization (BN), ReLU activation, and a rank-based average pooling (RAP) layer. The BN layer normalizes the activations at the convolution layer output without discarding information, which reduces the training time. In the proposed CNN model, the ReLU activation function has been used instead of other activation functions such as sigmoid or tanh, since it is far less prone to the vanishing gradient problem. Moreover, ReLU can reduce the learning time and the computational complexity by producing sparse representations during training. The RAP layer has been used because it computes a weighted mean of the feature values, thereby limiting the loss of information. The final RAP layer is followed by a dropout layer with a rate of 0.5 (chosen empirically) and two FC layers in succession. The last FC layer has 22 neurons to classify the 22 PQDEs. Finally, a softmax layer is connected at the output of the last FC layer for classification. The total number of learnable parameters of the proposed CNN model is roughly 1 million.
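The weighted mean computed by the RAP layer can be sketched as follows. The sketch assumes one common formulation of rank-based average pooling, in which the activations in a pooling window are ranked and the top *t* receive equal weight 1/*t* while the rest receive zero weight. The rank threshold `t` and the function name are hypothetical, since the exact weighting used in the proposed model is not specified above.

```python
def rank_based_average_pool(window, t):
    """Mean of the t largest activations in a pooling window (assumed RAP form)."""
    if not 1 <= t <= len(window):
        raise ValueError("t must lie between 1 and the window size")
    ranked = sorted(window, reverse=True)   # rank activations, largest first
    return sum(ranked[:t]) / t              # weight 1/t for the top t, 0 otherwise
```

Under this assumption, `t = 1` reduces to max pooling and `t` equal to the window size reduces to average pooling, so RAP sits between the two pooling methods described earlier.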