Here the first step finds the gamma-corrected version of the input x as in equation (1). The input is a 4D matrix/tensor \({X}_{l}\) with each pixel/feature value \({X}_{n}^{b}\) for the \(b\)th batch and \(n\)th filter in layer \(l\). a and b are constant scaling factors that were set manually. For n filters, we have n values of each learnable parameter (i.e., \(\alpha\) or \(\beta\)), which implies that for all images belonging to the same mini-batch (whether of the same class or not), the value of the exponents (α and β) remains the same within a channel, whereas it differs across channels; hence the same-class images are activated differently in each channel, as shown in the matrix representation in equation (2).
\({Y}_{l}=b.{\left[\begin{array}{ccc}{X}_{1}^{1}& \cdots & {X}_{1}^{b}\\ \vdots & \ddots & \vdots \\ {X}_{n}^{1}& \cdots & {X}_{n}^{b}\end{array}\right]}^{\left[\begin{array}{c}{\beta }_{1}\\ \vdots \\ {\beta }_{n}\end{array}\right]}=b.\left[\begin{array}{ccc}{\left({X}_{1}^{1}\right)}^{{\beta }_{1}}& \cdots & {\left({X}_{1}^{b}\right)}^{{\beta }_{1}}\\ \vdots & \ddots & \vdots \\ {\left({X}_{n}^{1}\right)}^{{\beta }_{n}}& \cdots & {\left({X}_{n}^{b}\right)}^{{\beta }_{n}}\end{array}\right]=\left[\begin{array}{ccc}{Y}_{1}^{1}& \cdots & {Y}_{1}^{b}\\ \vdots & \ddots & \vdots \\ {Y}_{n}^{1}& \cdots & {Y}_{n}^{b}\end{array}\right]\) \(\left(\text{for } {X}_{n}^{b}>0\right)\) (2)
where \({X}_{l}=\left[\begin{array}{ccc}{X}_{1}^{1}& \cdots & {X}_{1}^{b}\\ \vdots & \ddots & \vdots \\ {X}_{n}^{1}& \cdots & {X}_{n}^{b}\end{array}\right]\) is the input to the layer \(l\).
Here a and b are scaling constants selected manually; in our case we set a and b to 0.1 and 1.1 respectively. This makes the function behave slightly as a monotonic function when the exponents are equal to 1 and resemble the Leaky-ReLU function in the first step (please see figure 2(a)). In the second step, after passing through the hyperbolic tangent (with both exponents equal to 1), the output resembles tanh for the positive part and partly resembles the Leaky-ReLU function for the negative part (please see figure 2(b)). However, by changing the exponent values and signs, different activation plots can be generated, as shown in figures 2(c) and 2(d). Note that using only step 1 for activation might cause the activated value to explode in the positive region and can lead to vanishing gradients in the negative region (please see the ‘only-gamma’ plot in figure 2(b)), which makes convergence during training computationally difficult. A thresholding function that is non-linear and symmetric in the positive and negative axes is therefore required, for which we selected the tanh function. The learnable parameters α and β act as positive gamma correctors; hence the weight updates for α and β are calculated from the partial derivatives of equation (1) during backward propagation, as in equations (3) and (4):
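As a minimal sketch of the first (gamma-correction) step, assuming NumPy, the constants a = 0.1 and b = 1.1 from the text, and a simplified 2D (channel × batch) layout in place of the paper's 4D tensor (the function and variable names are illustrative, not from the authors' code):

```python
import numpy as np

def gamma_step(X, alpha, beta, a=0.1, b=1.1):
    """Step 1: channel-wise gamma correction sketch (eqs. (1)/(2)).
    X: array of shape (n_channels, batch); alpha, beta: shape (n_channels,).
    As described in the text, absolute exponent values are used, and only
    the real part is kept when a negative base raised to a fractional
    power yields a complex number."""
    al = np.abs(alpha)[:, None]      # positive exponents, broadcast over batch
    be = np.abs(beta)[:, None]
    Xc = X.astype(complex)           # allow fractional powers of negative inputs
    neg = a * np.real(Xc ** al)      # branch applied where X < 0
    pos = b * np.real(Xc ** be)      # branch applied where X >= 0
    return np.where(X < 0, neg, pos)
```

With both exponents equal to 1 this reduces to the Leaky-ReLU-like shape the text describes: 1.1·x on the positive side and 0.1·x on the negative side.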
$$\frac{dl}{d\alpha }=\sum _{b}\sum _{n}0.1\times real\left({log}_{10}{X}_{n}^{b}\right).real\left({\left({X}_{n}^{b}\right)}^{\alpha }\right).\frac{dl}{dz}\quad \text{for } {X}_{n}^{b}<0$$ (3)
$$\frac{dl}{d\beta }=\sum _{b}\sum _{n}1.1\times real\left({log}_{10}{X}_{n}^{b}\right).real\left({\left({X}_{n}^{b}\right)}^{\beta }\right).\frac{dl}{dz}\quad \text{for } {X}_{n}^{b}>0$$ (4)
Please note that when \({X}_{n}^{b}=X\) is negative and α is a rational decimal number, the resulting \({X}^{\alpha }\) becomes a complex number; in that case we use only the real part of the complex number. The same applies to \({log}_{10}X\) and \({X}^{\beta }\). Also, the absolute values of α and β are used in equations (2), (3) and (4) to obtain positive exponents.
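The parameter gradients of equations (3) and (4), including the real-part handling for negative inputs just described, can be sketched as follows (assuming NumPy and a 2D channel × batch layout; names are illustrative):

```python
import numpy as np

def grad_alpha_beta(X, alpha, beta, dl_dz, a=0.1, b=1.1):
    """Sketch of eqs. (3)-(4): per-channel gradients of the loss w.r.t.
    alpha and beta. The upstream gradient dl_dz is weighted by
    real(log10 X) * real(X**exponent) and summed over the batch axis."""
    Xc = X.astype(complex)
    logX = np.real(np.log10(Xc))          # real part of log10 of (possibly negative) X
    al = np.abs(alpha)[:, None]
    be = np.abs(beta)[:, None]
    g_alpha = np.sum(np.where(X < 0, a * logX * np.real(Xc ** al) * dl_dz, 0.0), axis=1)
    g_beta = np.sum(np.where(X >= 0, b * logX * np.real(Xc ** be) * dl_dz, 0.0), axis=1)
    return g_alpha, g_beta
```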
Step 2: \(z=tanh\left(y\right)\), or in matrix form:
$${Z}_{l}=real\left[\begin{array}{ccc}tanh\left({Y}_{1}^{1}\right)& \cdots & tanh\left({Y}_{1}^{b}\right)\\ \vdots & \ddots & \vdots \\ tanh\left({Y}_{n}^{1}\right)& \cdots & tanh\left({Y}_{n}^{b}\right)\end{array}\right]=\left[\begin{array}{ccc}{Z}_{1}^{1}& \cdots & {Z}_{1}^{b}\\ \vdots & \ddots & \vdots \\ {Z}_{n}^{1}& \cdots & {Z}_{n}^{b}\end{array}\right]$$ (5)
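Putting both steps together, the complete forward pass (eq. (2) followed by eq. (5)) can be sketched as below, again assuming NumPy and an illustrative 2D channel × batch layout:

```python
import numpy as np

def sgt_forward(X, alpha, beta, a=0.1, b=1.1):
    """Both steps of the activation sketch: gamma correction (eq. 2)
    followed by the element-wise tanh squashing (eq. 5)."""
    Xc = X.astype(complex)
    al, be = np.abs(alpha)[:, None], np.abs(beta)[:, None]
    Y = np.where(X < 0, a * np.real(Xc ** al), b * np.real(Xc ** be))
    return np.tanh(Y)   # Y is already real, so Z is real as in eq. (5)
```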
Since all the operations are element-wise matrix operations, the matrix calculated using (2) is passed to the matrix calculation in (5), and the output matrix \({Z}_{l}\) of layer \(l\) is then passed into the pooling layer. For the layer loss \(\frac{dl}{dX}\), the derivative of \({Y}_{l}\) with respect to (w.r.t.) \({X}_{l}\) is first calculated using equation (6), so that the output \({Y}^{\text{'}}\) matches exactly the dimension of the layer input, i.e., \({X}_{l}\).
$${Y}^{\text{'}}=\frac{d{Y}_{l}}{d{X}_{l}}=0.1\times \alpha .real\left({X}^{\alpha -1}\right)\quad \text{for } X<0$$
$$\phantom{{Y}^{\text{'}}}=\frac{d{Y}_{l}}{d{X}_{l}}=1.1\times \beta .real\left({X}^{\beta -1}\right)\quad \text{for } X\ge 0$$ (6)
Then the overall gradient loss \(\frac{dl}{dX}\) is calculated through the output of this layer as the derivative of \({Z}_{l}\) w.r.t. \({Y}^{\text{'}}\), which is backpropagated to the earlier layers using equation (7).
$$\frac{dl}{dX}=\frac{d{Z}_{l}}{d{Y}^{\text{'}}}.\frac{dl}{dZ}=\frac{d\,tanh\left({Y}^{\text{'}}\right)}{d{Y}^{\text{'}}}.\frac{dl}{dZ}={sech}^{2}\left({Y}^{\text{'}}\right).\frac{dl}{dZ}$$ (7)
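The backward pass through the activation, composing the eq. (6) derivative with the eq. (7) factor, can be sketched as (NumPy; illustrative names; sech²(x) written as 1/cosh²(x)):

```python
import numpy as np

def sgt_backward(X, alpha, beta, dl_dZ, a=0.1, b=1.1):
    """Sketch of eqs. (6)-(7): dl/dX = sech^2(Y') * dl/dZ,
    where Y' = dY/dX is the piecewise derivative of the gamma step."""
    Xc = X.astype(complex)
    al, be = np.abs(alpha)[:, None], np.abs(beta)[:, None]
    Yp = np.where(X < 0,
                  a * al * np.real(Xc ** (al - 1)),   # eq. (6), X < 0 branch
                  b * be * np.real(Xc ** (be - 1)))   # eq. (6), X >= 0 branch
    return (1.0 / np.cosh(Yp) ** 2) * dl_dZ           # eq. (7): sech^2(Y') * dl/dZ
```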
Here, \(\frac{dl}{dZ}\) is the loss back-propagated from the deeper layers. Since z = tanh(y) is used as a squashing function, the final output of the layer is non-uniformly scaled before being passed to the next layer, so z is a non-symmetric function centered at zero. This is shown in figures 2(c) and 2(d), where the proposed-SGT plot in 2(d) shows the final output of the first-order derivative of the proposed function.

When the exponents α and β are both 1, the activation layer behaves like tanh in the positive part and like Leaky ReLU in the negative part; its first-order derivative is then piecewise constant, behaving exactly like that of Leaky ReLU with constant outputs 0.3592 and 0.99006 for the positive and negative parts respectively. Such behavior was observed in a few filters with β (positive) > α (negative), as in the 18th filter, which appears to give constant output like two different filters with a non-linearity at 0. However, since both α and β are channel-wise learnable parameters, their values are not the same for all channels (please see figure 5). The final values of α and β were found to lie between -0.2 and 1.3, and were rarely identical. In our experiment, in most filters the values of both α and β were positive rational numbers with decimals, with β greater than α in the majority of cases; this is discussed further in the discussion section.

The case with both β (positive) > α (positive) follows the graph of the 31st filter (please see figure 2(d)), where the gradient value for positive x gradually decreases with increasing x, but at a lower rate than the tanh derivative. This helps prevent gradient values from becoming infinitely small, whereas in the negative part the derivative is almost constant and close to 1 for all cases. The network therefore becomes less prone to vanishing or exploding gradients.
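The two reported derivative constants can be checked directly: with α = β = 1, equation (6) gives constant slopes Y' = 1.1 (x ≥ 0) and Y' = 0.1 (x < 0), so the eq. (7) factor sech²(Y') is also constant:

```python
import numpy as np

# sech^2(x) = 1 / cosh(x)^2; evaluated at the constant slopes from eq. (6)
pos = 1.0 / np.cosh(1.1) ** 2   # positive-side derivative constant, ~0.3592
neg = 1.0 / np.cosh(0.1) ** 2   # negative-side derivative constant, ~0.99006
```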
Note that when the input X, α, or β becomes 0, indeterminate forms arise, since \({log}_{10}\left(0\right)\) is undefined and \({0}^{0}\) (as well as 0 raised to a negative exponent in equation (6)) is indeterminate or divergent; in such cases we simply replace the value of the parameters with 0.001 to continue training. A few α, β values were still recorded as undefined after convergence (please see figure 6); however, they can be ignored.
For training the network and optimizing the parameters we used the Adam35 optimization technique, a first-order gradient-based optimization algorithm that updates parameters until convergence is reached. A learnable parameter \({w}_{t}\) (weights, biases, and defined terms such as α and β) at the \({t}^{th}\) iteration is updated using Adam optimization as follows:
$${w}_{t+1}={w}_{t}-\frac{a\,{m}_{t}}{\sqrt{{v}_{t}}+\varepsilon }$$ (8)
where \(a\) is the learning-rate constant, kept at 0.001 in our case, and \(\varepsilon\) is a very small regularization constant (10−8) used as an offset to keep the denominator non-zero. Element-wise moving averages of the parameter gradients (\({m}_{t}\)) and their squared values (\({v}_{t}\)) are updated as in equations (9) and (10), where \({b}_{1}\) and \({b}_{2}\) are decay rates for \({m}_{t}\) and \({v}_{t}\), kept at 0.9 and 0.990 respectively.
$${m}_{t}={b}_{1}{m}_{t-1}+\left(1-{b}_{1}\right)\nabla E\left({w}_{t}\right)$$ (9)
$${v}_{t}={b}_{2}{v}_{t-1}+\left(1-{b}_{2}\right){\left[\nabla E\left({w}_{t}\right)\right]}^{2}$$ (10)
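One iteration of this update rule can be sketched as follows (illustrative names; note that, following the text, this sketch omits the bias-correction terms of the original Adam formulation):

```python
import numpy as np

def adam_step(w, grad, m, v, a=0.001, b1=0.9, b2=0.990, eps=1e-8):
    """One Adam update as in eqs. (8)-(10), without bias correction."""
    m = b1 * m + (1 - b1) * grad        # eq. (9): moving average of gradients
    v = b2 * v + (1 - b2) * grad ** 2   # eq. (10): moving average of squared gradients
    w = w - a * m / (np.sqrt(v) + eps)  # eq. (8): parameter update
    return w, m, v
```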
Here \(\nabla E\left({w}_{t}\right)\) represents the first-order derivative of the loss (\(E\)) with respect to the parameter \({w}_{t}\), where \(E\) is the cross-entropy loss, i.e.
$$loss\left(E\right)=-\frac{1}{N}\sum _{n=1}^{N}{\sum }_{i=1}^{K}{t}_{ni}\,ln\left({y}_{ni}\right)$$ (11)
where \(N\) is the total number of training samples with \(K\) mutually exclusive labels, \({t}_{ni}\) is the targeted output, and \({y}_{ni}\) is the predicted value, whose natural log (\(ln\)) is calculated for the \(n\)th sample belonging to the \(i\)th class.