Here the first step finds the gamma-corrected version of the input x as in equation (1). The input is a 4D matrix/tensor \({X}_{l}\) with each pixel/feature value \({X}_{n}^{b}\) for the \(b\)th batch and \(n\)th filter in layer \(l\). a and b are constant scaling factors that were set manually. For n filters, we have n values of each learnable parameter (i.e., \(\alpha\) or \(\beta\)), which implies that for all images belonging to the same mini-batch (whether of the same class or not), the value of the exponents (α and β) remains the same within a channel, whereas it differs across channels; hence the same-class images are activated differently in each channel, as shown in the matrix representation in equation (2).
\({Y}_{l}=b.{\left[\begin{array}{ccc}{X}_{1}^{1}& \cdots & {X}_{1}^{b}\\ \vdots & \ddots & \vdots \\ {X}_{n}^{1}& \cdots & {X}_{n}^{b}\end{array}\right]}^{\left[\begin{array}{c}{\beta }_{1}\\ \vdots \\ {\beta }_{n}\end{array}\right]}=b.\left[\begin{array}{ccc}{\left({X}_{1}^{1}\right)}^{{\beta }_{1}}& \cdots & {\left({X}_{1}^{b}\right)}^{{\beta }_{1}}\\ \vdots & \ddots & \vdots \\ {\left({X}_{n}^{1}\right)}^{{\beta }_{n}}& \cdots & {\left({X}_{n}^{b}\right)}^{{\beta }_{n}}\end{array}\right]=\left[\begin{array}{ccc}{Y}_{1}^{1}& \cdots & {Y}_{1}^{b}\\ \vdots & \ddots & \vdots \\ {Y}_{n}^{1}& \cdots & {Y}_{n}^{b}\end{array}\right]\) \(\left(\text{for } {X}_{n}^{b}>0\right)\) (2)
where \({X}_{l}=\left[\begin{array}{ccc}{X}_{1}^{1}& \cdots & {X}_{1}^{b}\\ \vdots & \ddots & \vdots \\ {X}_{n}^{1}& \cdots & {X}_{n}^{b}\end{array}\right]\) is the input to the layer \(l\).
Here a and b are scaling constants selected manually; in our case we set a and b to 0.1 and 1.1 respectively. This makes the function behave slightly as a monotonic function when the exponents are equal to 1 and resemble the Leaky-ReLU function in the first step (please see figure 2(a)). In the second step, after passing through the hyperbolic tangent (with both exponents equal to 1), the output resembles tanh for the positive part and partly resembles the Leaky-ReLU function for the negative part (please see figure 2(b)). However, by changing the exponent values and signs, different activation plots can be generated, as shown in figures 2(c) and 2(d). Note that using only step 1 for activation might cause the activated value to explode in the positive region and can lead to vanishing gradients in the negative region (please see the ‘only-gamma’ plot in figure 2(b)), which makes convergence during training computationally difficult. A thresholding function that is non-linear and symmetric in the positive and negative axes is therefore required, for which we selected the tanh function. The learnable parameters α and β act as positive gamma correctors; hence the weight updates for α and β are calculated from the partial derivatives of equation (1) during backward propagation, as in equations (3) and (4):
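As a minimal sketch of the first (gamma-correction) step, assuming NumPy, the constants a = 0.1 and b = 1.1 from the text, and a simplified 2D (channel × batch) layout in place of the paper's 4D tensor (the function and variable names are illustrative, not from the authors' code):

```python
import numpy as np

def gamma_step(X, alpha, beta, a=0.1, b=1.1):
    """Step 1: channel-wise gamma correction sketch (eqs. (1)/(2)).
    X: array of shape (n_channels, batch); alpha, beta: shape (n_channels,).
    As described in the text, absolute exponent values are used, and only
    the real part is kept when a negative base raised to a fractional
    power yields a complex number."""
    al = np.abs(alpha)[:, None]      # positive exponents, broadcast over batch
    be = np.abs(beta)[:, None]
    Xc = X.astype(complex)           # allow fractional powers of negative inputs
    neg = a * np.real(Xc ** al)      # branch applied where X < 0
    pos = b * np.real(Xc ** be)      # branch applied where X >= 0
    return np.where(X < 0, neg, pos)
```

With both exponents equal to 1 this reduces to the Leaky-ReLU-like shape the text describes: 1.1·x on the positive side and 0.1·x on the negative side.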
$$\frac{dl}{d\alpha }=\sum _{b}\sum _{n}0.1\times real\left({log}_{10}{X}_{n}^{b}\right).real\left({\left({X}_{n}^{b}\right)}^{\alpha }\right).\frac{dl}{dz}\quad \text{for } {X}_{n}^{b}<0$$ (3)
$$\frac{dl}{d\beta }=\sum _{b}\sum _{n}1.1\times real\left({log}_{10}{X}_{n}^{b}\right).real\left({\left({X}_{n}^{b}\right)}^{\beta }\right).\frac{dl}{dz}\quad \text{for } {X}_{n}^{b}>0$$ (4)
Please note that when \({X}_{n}^{b}=X\) is negative and α is a rational decimal number, the resulting \({X}^{\alpha }\) becomes a complex number; in that case we use only the real part of the complex number. The same applies to \({log}_{10}X\) and \({X}^{\beta }\). Also, the absolute values of α and β are used in equations (2), (3) and (4) to obtain positive exponents.
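The parameter gradients of equations (3) and (4), including the real-part handling for negative inputs just described, can be sketched as follows (assuming NumPy and a 2D channel × batch layout; names are illustrative):

```python
import numpy as np

def grad_alpha_beta(X, alpha, beta, dl_dz, a=0.1, b=1.1):
    """Sketch of eqs. (3)-(4): per-channel gradients of the loss w.r.t.
    alpha and beta. The upstream gradient dl_dz is weighted by
    real(log10 X) * real(X**exponent) and summed over the batch axis."""
    Xc = X.astype(complex)
    logX = np.real(np.log10(Xc))          # real part of log10 of (possibly negative) X
    al = np.abs(alpha)[:, None]
    be = np.abs(beta)[:, None]
    g_alpha = np.sum(np.where(X < 0, a * logX * np.real(Xc ** al) * dl_dz, 0.0), axis=1)
    g_beta = np.sum(np.where(X >= 0, b * logX * np.real(Xc ** be) * dl_dz, 0.0), axis=1)
    return g_alpha, g_beta
```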
Step 2: \(z=tanh\left(y\right)\), or in matrix form:
$${Z}_{l}=real\left[\begin{array}{ccc}tanh\left({Y}_{1}^{1}\right)& \cdots & tanh\left({Y}_{1}^{b}\right)\\ \vdots & \ddots & \vdots \\ tanh\left({Y}_{n}^{1}\right)& \cdots & tanh\left({Y}_{n}^{b}\right)\end{array}\right]=\left[\begin{array}{ccc}{Z}_{1}^{1}& \cdots & {Z}_{1}^{b}\\ \vdots & \ddots & \vdots \\ {Z}_{n}^{1}& \cdots & {Z}_{n}^{b}\end{array}\right]$$ (5)
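Putting both steps together, the complete forward pass (eq. (2) followed by eq. (5)) can be sketched as below, again assuming NumPy and an illustrative 2D channel × batch layout:

```python
import numpy as np

def sgt_forward(X, alpha, beta, a=0.1, b=1.1):
    """Both steps of the activation sketch: gamma correction (eq. 2)
    followed by the element-wise tanh squashing (eq. 5)."""
    Xc = X.astype(complex)
    al, be = np.abs(alpha)[:, None], np.abs(beta)[:, None]
    Y = np.where(X < 0, a * np.real(Xc ** al), b * np.real(Xc ** be))
    return np.tanh(Y)   # Y is already real, so Z is real as in eq. (5)
```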
Since all the operations are element-wise matrix operations, the matrix calculated using (2) is passed to the matrix calculation in (5), and the output matrix \({Z}_{l}\) of layer \(l\) is then passed into the pooling layer. For the layer loss \(\frac{dl}{dX}\), the derivative of \({Y}_{l}\) with respect to (w.r.t.) \({X}_{l}\) is first calculated using equation (6), so that the output \({Y}^{\text{'}}\) matches exactly the dimension of the layer input, i.e., \({X}_{l}\).
$${Y}^{\text{'}}=\frac{d{Y}_{l}}{d{X}_{l}}=0.1\times \alpha .real\left({X}^{\alpha -1}\right)\quad \text{for } X<0$$
$$\phantom{{Y}^{\text{'}}}=\frac{d{Y}_{l}}{d{X}_{l}}=1.1\times \beta .real\left({X}^{\beta -1}\right)\quad \text{for } X\ge 0$$ (6)
Then the overall gradient loss \(\frac{dl}{dX}\) is calculated through the output of this layer as the derivative of \({Z}_{l}\) w.r.t. \({Y}^{\text{'}}\), which is backpropagated to the earlier layers using equation (7).
$$\frac{dl}{dX}=\frac{d{Z}_{l}}{d{Y}^{\text{'}}}.\frac{dl}{dZ}=\frac{d\,tanh\left({Y}^{\text{'}}\right)}{d{Y}^{\text{'}}}.\frac{dl}{dZ}={sech}^{2}\left({Y}^{\text{'}}\right).\frac{dl}{dZ}$$ (7)
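The backward pass through the activation, composing the eq. (6) derivative with the eq. (7) factor, can be sketched as (NumPy; illustrative names; sech²(x) written as 1/cosh²(x)):

```python
import numpy as np

def sgt_backward(X, alpha, beta, dl_dZ, a=0.1, b=1.1):
    """Sketch of eqs. (6)-(7): dl/dX = sech^2(Y') * dl/dZ,
    where Y' = dY/dX is the piecewise derivative of the gamma step."""
    Xc = X.astype(complex)
    al, be = np.abs(alpha)[:, None], np.abs(beta)[:, None]
    Yp = np.where(X < 0,
                  a * al * np.real(Xc ** (al - 1)),   # eq. (6), X < 0 branch
                  b * be * np.real(Xc ** (be - 1)))   # eq. (6), X >= 0 branch
    return (1.0 / np.cosh(Yp) ** 2) * dl_dZ           # eq. (7): sech^2(Y') * dl/dZ
```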
Here, \(\frac{dl}{dZ}\) is the loss back-propagated from the deeper layers. Since z = tanh(y) is used as a squashing function, the final output of the layer is non-uniformly scaled before being passed to the next layer, so z is a non-symmetric function centered at zero. This is shown in figures 2(c) and 2(d), where the proposed-SGT plot in 2(d) shows the final output of the first-order derivative of the proposed function.

When the exponents α and β are both 1, the activation layer behaves like tanh in the positive part and like Leaky ReLU in the negative part; its first-order derivative is then piecewise constant, behaving exactly like that of Leaky ReLU with constant outputs 0.3592 and 0.99006 for the positive and negative parts respectively. Such behavior was observed in a few filters with β (positive) > α (negative), as in the 18th filter, which appears to give constant output like two different filters with a non-linearity at 0. However, since both α and β are channel-wise learnable parameters, their values are not the same for all channels (please see figure 5). The final values of α and β were found to lie between -0.2 and 1.3, and were rarely identical. In our experiment, in most filters the values of both α and β were positive rational numbers with decimals, with β greater than α in the majority of cases; this is discussed further in the discussion section.

The case with both β (positive) > α (positive) follows the graph of the 31st filter (please see figure 2(d)), where the gradient value for positive x gradually decreases with increasing x, but at a lower rate than the tanh derivative. This helps prevent gradient values from becoming infinitely small, whereas in the negative part the derivative is almost constant and close to 1 for all cases. The network therefore becomes less prone to vanishing or exploding gradients.
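The two reported derivative constants can be checked directly: with α = β = 1, equation (6) gives constant slopes Y' = 1.1 (x ≥ 0) and Y' = 0.1 (x < 0), so the eq. (7) factor sech²(Y') is also constant:

```python
import numpy as np

# sech^2(x) = 1 / cosh(x)^2; evaluated at the constant slopes from eq. (6)
pos = 1.0 / np.cosh(1.1) ** 2   # positive-side derivative constant, ~0.3592
neg = 1.0 / np.cosh(0.1) ** 2   # negative-side derivative constant, ~0.99006
```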
Note that when the input X, α, or β becomes 0, indeterminate forms arise, since \({log}_{10}\left(0\right)\) is undefined and \({0}^{0}\) (as well as 0 raised to a negative exponent in equation (6)) is indeterminate or divergent; in such cases we simply replace the value of the parameters with 0.001 to continue training. A few α, β values were still recorded as undefined after convergence (please see figure 6); however, they can be ignored.
For training the network and optimizing the parameters we used the Adam35 optimization technique, a first-order gradient-based optimization algorithm that updates parameters until convergence is reached. A learnable parameter \({w}_{t}\) (weights, biases, and defined terms such as α and β) at the \({t}^{th}\) iteration is updated using Adam optimization as follows:
$${w}_{t+1}={w}_{t}-\frac{a\,{m}_{t}}{\sqrt{{v}_{t}}+\varepsilon }$$ (8)
where \(a\) is the learning-rate constant, kept at 0.001 in our case, and \(\varepsilon\) is a very small regularization constant (10−8) used as an offset to keep the denominator non-zero. Element-wise moving averages of the parameter gradients (\({m}_{t}\)) and their squared values (\({v}_{t}\)) are updated as in equations (9) and (10), where \({b}_{1}\) and \({b}_{2}\) are decay rates for \({m}_{t}\) and \({v}_{t}\), kept at 0.9 and 0.990 respectively.
$${m}_{t}={b}_{1}{m}_{t-1}+\left(1-{b}_{1}\right)\nabla E\left({w}_{t}\right)$$ (9)
$${v}_{t}={b}_{2}{v}_{t-1}+\left(1-{b}_{2}\right){\left[\nabla E\left({w}_{t}\right)\right]}^{2}$$ (10)
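One iteration of this update rule can be sketched as follows (illustrative names; note that, following the text, this sketch omits the bias-correction terms of the original Adam formulation):

```python
import numpy as np

def adam_step(w, grad, m, v, a=0.001, b1=0.9, b2=0.990, eps=1e-8):
    """One Adam update as in eqs. (8)-(10), without bias correction."""
    m = b1 * m + (1 - b1) * grad        # eq. (9): moving average of gradients
    v = b2 * v + (1 - b2) * grad ** 2   # eq. (10): moving average of squared gradients
    w = w - a * m / (np.sqrt(v) + eps)  # eq. (8): parameter update
    return w, m, v
```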
Here \(\nabla E\left({w}_{t}\right)\) represents the first-order derivative of the loss (\(E\)) with respect to the parameter \({w}_{t}\), where \(E\) is the cross-entropy loss, i.e.
$$loss\left(E\right)=-\frac{1}{N}\sum _{n=1}^{N}{\sum }_{i=1}^{K}{t}_{ni}\,ln\left({y}_{ni}\right)$$ (11)
where \(N\) is the total number of training samples with \(K\) mutually exclusive labels, \({t}_{ni}\) is the targeted output, and \({y}_{ni}\) is the predicted value, whose natural log (\(ln\)) is calculated for the \(n\)th sample belonging to the \(i\)th class.