In this section, we describe the terminology associated with our evaluation strategy for all the models. For a binary classification task, each prediction is a true positive (TP), true negative (TN), false positive (FP), or false negative (FN). TP indicates a correctly classified positive, i.e., in our case, a correctly classified case of IDC. Similarly, TN indicates a correctly classified negative, FP a falsely classified positive, and FN a falsely classified negative. Based on these four terms, we define precision, sensitivity (or recall), specificity, F1-score, and balanced accuracy, all of which are widely used in the literature for classification tasks. Precision \(P\) is the ratio of TP to all the labels predicted as positive and is given by (1),
\(P= \frac{TP}{(TP+FP)}\)
(1)
\(P\) answers to what extent the model correctly classifies positive cases. Further, sensitivity \({S}_{n}\) (or recall) is the ratio of TP to the number of actual positives, given by (2),
\({S}_{n}= \frac{TP}{(TP+FN)}\)
(2)
\({S}_{n}\) measures how many of the actual positive cases were predicted correctly. Specificity \({S}_{p}\) can be seen as the counterpart of \({S}_{n}\), because it measures the correctly labelled negatives (TN) out of the total population of actual negatives. Mathematically,
\({S}_{p}=\frac{TN}{(TN+FP)}\)
(3)
F1-score \(F\) is the harmonic mean of \(P\) and \({S}_{n}\). It is given by (4) as,
\(F= \frac{2{S}_{n}P}{({S}_{n}+P)}\)
(4)
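As an illustration, the four metrics defined in (1)-(4) can be computed directly from the confusion-matrix counts. The function name and the example counts below are illustrative placeholders, not values from our experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, sensitivity, specificity and F1-score
    from binary confusion-matrix counts, per Eqs. (1)-(4)."""
    precision = tp / (tp + fp)      # Eq. (1): TP over all predicted positives
    sensitivity = tp / (tp + fn)    # Eq. (2): TP over all actual positives (recall)
    specificity = tn / (tn + fp)    # Eq. (3): TN over all actual negatives
    # Eq. (4): harmonic mean of precision and sensitivity
    f1 = 2 * sensitivity * precision / (sensitivity + precision)
    return precision, sensitivity, specificity, f1

# Illustrative counts (not from our experiments)
p, sn, sp, f1 = classification_metrics(tp=80, fp=20, tn=90, fn=10)
```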
In this paper, we use two different types of accuracy metrics: regular accuracy (RAC) and balanced accuracy (BAC). RAC will be used when we describe the test set validation accuracy of different models. However, once a confusion matrix of classifications is generated for all the models, we will calculate a BAC that will better represent model performance. BAC is required when there is a high class imbalance and can be mathematically expressed for binary classification tasks as,
\(\text{BAC}= \frac{1}{2}\left[\frac{TP}{(TP+FN)}+\frac{TN}{(TN+FP)}\right]\)
(5)
RAC can be mathematically expressed by (6) as,
\(\text{RAC}= \frac{(TP+TN)}{(TP+FP+FN+TN)}\)
(6)
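As a sketch, the two accuracy variants can be computed side by side, with BAC taken as the mean of sensitivity and specificity (the standard definition for binary tasks) and RAC as the fraction of all samples classified correctly. The counts below are illustrative and deliberately imbalanced to show why BAC matters.

```python
def balanced_and_regular_accuracy(tp, fp, tn, fn):
    """BAC: mean of sensitivity and specificity (Eq. (5)).
    RAC: fraction of all samples classified correctly (Eq. (6))."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    bac = (sensitivity + specificity) / 2
    rac = (tp + tn) / (tp + fp + fn + tn)
    return bac, rac

# Illustrative 9:1 class imbalance: a model that misses most
# positives can still score a high RAC while its BAC stays low.
bac, rac = balanced_and_regular_accuracy(tp=10, fp=5, tn=895, fn=90)
```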
Finally, we use the Matthews’ Correlation Coefficient (MCC) [99] for an in-depth analysis of each model. MCC (also known as the phi coefficient) lies in the range \([-1, 1]\), where −1 and 1 respectively mean total disagreement between observation and prediction, and perfect prediction. A value of 0 indicates that the model performs no better than a random classifier. Most importantly, it is a balanced metric, meaning that class imbalance does not distort its interpretation. Mathematically,
\(\text{MCC}= \frac{(TP\times TN)-(FP\times FN)}{\sqrt{(TP+FP)(TN+FN)(TN+FP)(TP+FN)}}\)
(7)
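A minimal sketch of (7) follows; the guard for a zero denominator (when any marginal sum is zero, MCC is undefined and commonly reported as 0) is a convention we adopt here, not part of the equation itself.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews' correlation coefficient, per Eq. (7).
    Returns 0.0 when any marginal sum is zero (undefined case)."""
    denom = math.sqrt((tp + fp) * (tn + fn) * (tn + fp) * (tp + fn))
    if denom == 0:
        return 0.0
    return ((tp * tn) - (fp * fn)) / denom

# A perfect classifier scores 1; a fully inverted one scores -1
assert mcc(tp=50, fp=0, tn=50, fn=0) == 1.0
```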
A binary cross-entropy (BCE) loss is calculated for the training of all the models. This BCE loss is taken into account when we calculate an optimization function (described later in this section) and is also used by the neural net itself to adjust weights and biases. BCE is expressed mathematically as,
\(H\left(v\right)= -\frac{1}{n}\sum _{i=1}^{n}\left[{y}_{i}\text{log}\left(p\left({y}_{i}\right)\right)+\left(1-{y}_{i}\right)\text{log}\left(1-p\left({y}_{i}\right)\right)\right]\)
(8)
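Equation (8) can be sketched as follows; the small `eps` clamp on the predicted probabilities is a standard numerical guard against \(\log(0)\), added here for robustness and not part of (8) itself.

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy averaged over n samples, per Eq. (8).
    y_true holds labels in {0, 1}; y_pred holds the predicted
    probabilities p(y_i). eps guards against log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

# Confident, correct predictions give a loss near 0
loss = bce_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])
```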
In (8), the distribution of data labels is given by \(y\), making \(p\left({y}_{i}\right)\) the model’s prediction on data label \(i\). The true data distribution is represented by \(v\), with \(n\) being the total number of samples. Given (8), we can now construct an optimization function used to select the best trained-from-scratch traditional CNN. In our experiments, we train fifteen CNNs by varying parameters such as the number of layers, neurons, regularizations, etc., which we describe further in Sect. 5. As mentioned earlier, to determine how feasible transfer learning is in our application, we must compare it against some baselines, and hence we use vanilla CNNs for this comparison. Selecting the ‘best’ CNN can be tricky because three metrics all play a pivotal role in describing performance, namely, validation accuracy (or RAC), validation BCE loss, and training time. Here, validation refers to the calculation of metrics on the validation or test set (we use the terms validation set and test set interchangeably in this paper, although strictly their meanings are not identical). Ideally, we want to maximize RAC while minimizing BCE loss and training time, as we do in (9). Given a classifier model \({M}_{{\theta }_{i}; {\phi }_{i}}\) with parameters \({\theta }_{i}\) and implementation information \({\phi }_{i}\), we denote a set \(C=\{{M}_{{\theta }_{1}; {\phi }_{1}}, {M}_{{\theta }_{2}; {\phi }_{2}}, \dots , {M}_{{\theta }_{i}; {\phi }_{i}}, \dots , {M}_{{\theta }_{15}; {\phi }_{15}}\}\) that contains all the traditional CNN models used for experimentation. The implementation information \({\phi }_{i}\) can be thought of as an \(m\)-tuple where \(m\) is the number of hyper-parameters (and other architectural information) that we vary over all our experiments. The cardinality and elements of this \(m\)-tuple will be shown clearly in Sect. 5.
Now, denoting \(\text{max}\left(x\right)\) and \(\text{min}\left(x\right)\) by \(\psi \left(x\right)\) and \(\omega \left(x\right)\) respectively, the optimization function \(\mathbb{O}\left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\) is given mathematically by (9),
\(\mathbb{O}\left({M}_{{\theta }_{i}; {\phi }_{i}}\right)= \frac{\psi \left(\alpha \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\right)}{\omega \left(\tau \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\right)+\omega \left({H}_{{M}_{{\theta }_{i}; {\phi }_{i}}}\left(v\right)\right)}, \forall {M}_{{\theta }_{i}; {\phi }_{i}}\in C\)
(9)
In (9), \(\alpha (\cdot )\) denotes the validation RAC, \(\tau (\cdot )\) denotes the training time, and \({H}_{{M}_{{\theta }_{i}; {\phi }_{i}}}\left(v\right)\) denotes the BCE loss for a given model \({M}_{{\theta }_{i}; {\phi }_{i}}\). The objective is to maximize \(\mathbb{O}\left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\), i.e., to evaluate \({\text{argmax}}_{{M}_{{\theta }_{i}; {\phi }_{i}}}\left(\mathbb{O}\left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\right)\). This procedure yields a single model \({M}_{{\theta }_{i}; {\phi }_{i}}\) that we regard as the ‘best’ vanilla CNN to be compared with other SOTA implementations. Hence, maximizing \(\mathbb{O}\left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\) transforms (9) into,
\(\mathbb{O}\left({M}_{{\theta }_{i}; {\phi }_{i}}\right)= \underset{{M}_{{\theta }_{i}; {\phi }_{i}}}{argmax}\left(\frac{\psi \left(\alpha \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\right)}{\omega \left(\tau \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\right)+\omega \left({H}_{{M}_{{\theta }_{i}; {\phi }_{i}}}\left(v\right)\right)}\right), \forall {M}_{{\theta }_{i}; {\phi }_{i}}\in C\)
(10)
It is important to note that we had to normalize the values of \(\tau \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\) because of the large difference in scale between the values it yields and those of \(\alpha \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\) and \({H}_{{M}_{{\theta }_{i}; {\phi }_{i}}}\left(v\right)\), the latter two being restricted to the range \([0, 1]\). Typically, \(\tau \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\) yields values in units of seconds (s) which, due to hardware-related limitations, can never lie in \([0, 1]\). Thus, we apply a normalized \(\tau \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\), denoted \(N\left(\tau \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\right)\), in our final optimization function,
\(\mathbb{O}\left({M}_{{\theta }_{i}; {\phi }_{i}}\right)= \underset{{M}_{{\theta }_{i}; {\phi }_{i}}}{argmax}\left(\frac{\psi \left(\alpha \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\right)}{\omega \left(N\left(\tau \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\right)\right)+\omega \left({H}_{{M}_{{\theta }_{i}; {\phi }_{i}}}\left(v\right)\right)}\right), \forall {M}_{{\theta }_{i}; {\phi }_{i}}\in C\)
(11)
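A minimal sketch of the selection in (11) follows, reading the objective as the scalar ratio of validation RAC over (min-max normalized training time + BCE loss), maximized over the candidate set \(C\). The function name, the model names, and all scores below are illustrative placeholders, not results from our experiments.

```python
def select_best_model(results):
    """Pick the 'best' CNN per Eq. (11): maximize validation RAC
    over (min-max normalized training time + BCE loss).
    `results` maps a model name to a tuple
    (validation_rac, validation_bce, training_time_seconds)."""
    times = [t for _, _, t in results.values()]
    t_min, t_max = min(times), max(times)

    def objective(rac, bce, t):
        # Min-max normalize the training time over C into [0, 1]
        t_norm = (t - t_min) / (t_max - t_min) if t_max > t_min else 0.0
        return rac / (t_norm + bce)

    # argmax over all candidate models in C
    return max(results, key=lambda name: objective(*results[name]))

# Illustrative scores for three hypothetical candidate CNNs
candidates = {
    "cnn_a": (0.90, 0.30, 1200.0),
    "cnn_b": (0.88, 0.25, 600.0),
    "cnn_c": (0.85, 0.40, 300.0),
}
best = select_best_model(candidates)
```

Note how the fastest model can win despite a slightly lower RAC, since its normalized training time contributes nothing to the denominator.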
The normalization function \(N\left(x\right)\) is a min-max normalization over the training times of all models in \(C\), defined by (12) as,

\(N\left(\tau \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\right)= \frac{\tau \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)-\omega \left(\tau \left(C\right)\right)}{\psi \left(\tau \left(C\right)\right)-\omega \left(\tau \left(C\right)\right)}\)
(12)
Using (8) and (12) in (11), we get,
\(\mathbb{O}\left({M}_{{\theta }_{i}; {\phi }_{i}}\right)= \underset{{M}_{{\theta }_{i}; {\phi }_{i}}}{argmax}\left(\frac{\psi \left(\alpha \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\right)}{\omega \left(\frac{\tau \left({M}_{{\theta }_{i}; {\phi }_{i}}\right)-\omega \left(\tau \left(C\right)\right)}{\psi \left(\tau \left(C\right)\right)-\omega \left(\tau \left(C\right)\right)}\right)+\omega \left(-\frac{1}{n}\sum _{i=1}^{n}\left[{y}_{i}\text{log}\left(p\left({y}_{i}\right)\right)+\left(1-{y}_{i}\right)\text{log}\left(1-p\left({y}_{i}\right)\right)\right]\right)}\right)\)
(13)

\(\forall {M}_{{\theta }_{i}; {\phi }_{i}}\in C\)
We remark that the range of \(\mathbb{O}\left({M}_{{\theta }_{i}; {\phi }_{i}}\right)\) is \([0, \infty )\).