Automatic classification of brain magnetic resonance images with hypercolumn deep features and machine learning

Brain tumours are life-threatening, and their early detection is vital for the patient. At present, magnetic resonance imaging is one of the methods used for detecting brain tumours. Expert decision support systems help specialist physicians make more accurate diagnoses in real clinical settings by minimizing the errors arising from their subjective opinions. The model proposed in this study detects important keypoints and then extracts hypercolumn deep features for these keypoints from selected convolutional layers of VGG16. Finally, Random Forest and Logistic Regression classifiers are fed with these feature sets. The Random Forest classifier offered the best performance, with 94.51% accuracy, 91.61% sensitivity, 8.39% false-negative rate, 97.42% specificity, and 97.29% precision under fivefold cross-validation. Consequently, the proposed model could assist field experts when integrated into computer-aided brain magnetic resonance imaging diagnosis systems.


Introduction
Medical imaging analysis plays a vital role in detecting abnormalities. Brain tumour is considered one of the deadliest diseases [1]. Early detection of a brain tumour helps radiologists establish an effective prognosis and increases the likelihood of long-term survival [2]. Detecting whether a brain tumour is present is very important for planning the treatment process. Magnetic resonance imaging (MRI) is used to detect tumours in the brain. Usually, field specialists carry out brain tumour analysis manually. However, MRI screening by a field specialist is time-consuming and open to human error. On the other hand, automatic MRI classification by an expert system decreases the workload of neurologists and helps them make final decisions. Artificial intelligence-based systems that offer efficient solutions to biomedical problems have recently attracted researchers. Coronary artery disease [3], eye diseases [4,5], breast cancer [6][7][8][9][10][11], brain tumour [12][13][14][15][16][17][18][19], and Alzheimer's disease [20][21][22][23][24][25][26][27] are just a few examples.
Researchers have proposed many different approaches [15,[28][29][30] based on image processing and machine learning for brain tumour classification, which has attracted widespread attention. Classical machine learning approaches use handcrafted feature extraction techniques. Focusing only on low-level or high-level features and relying on handcrafted features are the main problems in the feature extraction phase of traditional machine learning [31]. Expert knowledge is needed to handcraft the features required to solve a machine learning problem, and detailed knowledge of the task is also required to obtain these features [32]. In addition, intra-class variations and inter-class similarities make it difficult to obtain high classification accuracy [33]. Furthermore, handcrafted feature extraction is described as a low-level method in the literature [34]. For these reasons, good feature extraction is required for successful classification and modelling.
Deep learning architectures, which are more sophisticated than classical artificial neural networks, have made it possible to build effective, high-performance models. They can be successfully applied to the detection and classification of various diseases in the medical field without a separate feature extraction process. In this context, deep learning-based classification and segmentation studies on brain MRI are quite popular and widespread. For example, Mehrotra et al. examined the performance of five pre-trained deep learning algorithms on benign and malignant brain tumour images; according to the authors, the AlexNet pre-trained model provided higher performance than the others [13]. Deepak and Ameer performed meningioma, glioma, and pituitary tumour classification by feeding the deep features extracted with the GoogleNet pre-trained model to K-Nearest Neighbors and Support Vector Machines classifiers [16].
Mohsen et al. classified brain MRI scans including normal, glioblastoma, sarcoma, and metastatic bronchogenic carcinoma cases using deep neural networks. In the classification step, they used new features obtained by concatenating principal component analysis and discrete wavelet transform features [14]. Kesav and Jibukumar proposed a new architecture for brain tumour classification and tumour-type object detection using region-based convolutional neural networks. In their study, the authors first classified glioma and healthy MRI scans and then identified tumour regions on the glioma scans [35].
Tandel et al. proposed a majority-voting-based ensemble algorithm to optimize the overall classification performance of five convolutional neural network-based models and five traditional classifier-based models [36]. Marghalani and Arif identified areas of interest using the SURF keypoint detection algorithm, extracted features from these areas, and built a visual dictionary, which they then used to classify brain tumour, Alzheimer's disease, and normal brain images [37]. Amin et al. proposed a new method based on fused MRI sequences for brain tumour classification; they used a 23-layer CNN model to classify segmented images obtained by a global thresholding method [38]. Shafi et al. proposed an ensemble learning method using magnetic resonance images to classify brain tumours or neoplasms and multiple sclerosis. The authors first identified the regions of interest around the tumour and lesion and obtained features representing these regions, then conducted majority-voting prediction with the support vector machine as the base learner [39].
Swati et al. classified multiclass brain tumours using transfer learning and block-based fine-tuning with the fivefold cross-validation technique. The authors examined the AlexNet, VGG16, and VGG19 pre-trained models and reported that VGG19 outperformed AlexNet and VGG16 [31]. Paul et al. classified meningioma, glioma, and pituitary tumours from brain images with fully connected and convolutional neural network models [40]. Abiwinanda et al. applied a simple CNN architecture that includes convolution, maximum pooling, and flattening layers followed by a fully connected hidden layer to recognize glioma, meningioma, and pituitary tumours [41]. As can be seen, CNN-based studies, and transfer learning approaches in particular, are widely used in the literature.
Deep learning can perform feature extraction and classification in a self-learning manner, but it generally requires a large training dataset. Applying deep learning and training a CNN from scratch are difficult on small medical datasets [31]. Accordingly, in this study the analysis of magnetic resonance imaging was carried out using machine learning together with deep learning techniques. The approach consists of machine learning experiments on hypercolumn deep features extracted from MRI. First, keypoint detection is conducted on the brain MRI with the Oriented FAST and Rotated BRIEF (ORB) keypoint detector [42]. ORB is a computationally efficient keypoint detection algorithm that is little affected by Gaussian image noise [42,43]. Second, vectors of hypercolumn deep features are obtained from five different layers of the VGG16 model, taking the positions of the detected keypoints as reference. Lastly, the feature vectors are sent as input to random forest (RF) and logistic regression (LR) classifiers, and the performances of the classifiers on the tumour and normal cases are discussed. In this context, the main contributions of this study are as follows:
• Keypoints are detected on the images in the training set.
• Hypercolumn deep features are extracted from different layers, and class labelling is performed for each feature set.
• Hypercolumn deep features are sent to RF and LR classifiers. Because of the small number of images in the dataset used in this study, the VGG16 pre-trained architecture is used.
• Finally, a comparative analysis of the different models is carried out for tumour classification.
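The steps above can be sketched at a high level as follows. This is a minimal control-flow sketch, not the authors' implementation: the helper names `detect_keypoints`, `hypercolumn`, and `classifier` are hypothetical stand-ins for the ORB detector, the VGG16 hypercolumn extraction, and a trained RF or LR model.

```python
# High-level sketch of the proposed pipeline (hypothetical helper names).
from collections import Counter

def classify_image(image, detect_keypoints, hypercolumn, classifier):
    """Per-keypoint hypercolumn features are classified individually and the
    image label is the majority vote over those per-keypoint predictions."""
    votes = [classifier(hypercolumn(image, kp)) for kp in detect_keypoints(image)]
    return Counter(votes).most_common(1)[0][0]

# Toy stand-ins just to show the control flow
label = classify_image(
    image=None,
    detect_keypoints=lambda img: [(10, 12), (40, 7), (80, 93)],
    hypercolumn=lambda img, kp: sum(kp),
    classifier=lambda feat: int(feat > 50),
)
print(label)   # 0 (two of the three keypoint features fall below the threshold)
```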

Material and methodology
The main aim of this study is to propose a new model that consistently and efficiently distinguishes normal and tumour images. The classification experiments comprise three scenarios within the frame of the hold-out and fivefold cross-validation techniques. Scenario A includes the experiments on the VGG16 deep features. Scenarios B and C include the results based on the integration of hypercolumn deep features and traditional classifiers; in these scenarios, important keypoints on the image are detected first, and the hypercolumn deep features are then extracted from these points based on the VGG16 architecture. All experiments were implemented on a 64-bit Windows 10 operating system with an Intel 1.80 GHz CPU and 8 GB RAM, using Python 3.6 with the Keras 2.3.1 framework. The general block diagram of the proposed study is shown in Fig. 1. As seen in this block diagram, class prediction is performed by sending the hypercolumn deep features as input to the models. The RF and LR classifiers predict the target class using the hypercolumn deep features for each keypoint; thus, the performances of different classifiers trained on hypercolumn deep features are discussed. Differently from the studies in the literature, the proposed study performs classification for each hypercolumn deep feature obtained from the keypoints on an image and then votes over them. The workflow steps numbered 1 to 8 in Fig. 1 illustrate this process.

Dataset
In this study, a public brain MRI dataset prepared by Chakrabarty [44] was used. The dataset contains 253 images: 155 tumour and 98 normal samples. Since the number of normal samples is smaller than the number of tumour samples, a data augmentation technique was applied to the images in the normal class, balancing the dataset at 155 tumour and 155 normal cases. Figure 2 presents sample images.

VGG16 deep learning architecture
The VGG16 architecture, a CNN model with approximately 138 million parameters, was developed by Simonyan and Zisserman [45]. VGG16 achieved a top-5 accuracy of 92.7% in the ImageNet Large Scale Visual Recognition Challenge held in 2014, where it was trained on a dataset of 14 million images spanning 1000 categories. The model consists of five convolution blocks followed by fully connected layers [46]. The last layer of each convolution block performs maximum pooling, and all hidden layers use the Rectified Linear Unit [47] activation function. The convolution and maximum pooling operations of VGG16 produce the feature vector representing the image. The five convolution blocks are followed by fully connected layers, where the feature maps are flattened into 1-dimensional arrays. The softmax function in the last layer of the architecture is used to classify binary and multi-class data; it is an activation function used to determine the most likely class [48].
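As a consistency check on the 138-million-parameter figure, the count can be reproduced from the layer configuration alone. The sketch below assumes the standard configuration-D layout (3×3 convolutions throughout, 2×2 pooling after each block, two hidden fully connected layers of 4096 units, and a 1000-way output):

```python
# Reproduce VGG16's parameter count from its layer configuration.
CONV_BLOCKS = [[64, 64], [128, 128], [256, 256, 256],
               [512, 512, 512], [512, 512, 512]]
FC_SIZES = [4096, 4096, 1000]          # two hidden FC layers + softmax output

def vgg16_param_count(in_channels=3, input_size=224):
    total, spatial = 0, input_size
    for block in CONV_BLOCKS:
        for out_channels in block:
            # 3x3 kernel weights plus one bias per output channel
            total += (3 * 3 * in_channels + 1) * out_channels
            in_channels = out_channels
        spatial //= 2                   # 2x2 max pooling after each block
    fan_in = spatial * spatial * in_channels   # flattened map: 7 * 7 * 512
    for fc in FC_SIZES:
        total += (fan_in + 1) * fc
        fan_in = fc
    return total

print(vgg16_param_count())              # 138357544, i.e. ~138 million
```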

Hypercolumn deep features
In a CNN architecture, the features extracted in every convolutional layer play a role in the classification, while the output layer provides the final features used for classification.
Hypercolumn deep features are obtained by combining intermediate features extracted from the layers of a CNN architecture into a single vector [49]. A hypercolumn is the vector of activation outputs obtained from the CNN layers for a given pixel in the image [50]. With the hypercolumn deep feature extraction technique, more accurate prediction is possible because features representing earlier layers are used alongside the final ones. Many studies using this technique, such as Alzheimer's disease classification [51], tumour region detection [52], and brain magnetic resonance imaging classification [50], are available in the literature.
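The idea can be sketched with NumPy as follows. The toy feature maps below are random stand-ins for real VGG16 activations; their channel counts (64 + 128 + 256 + 512 + 512) are an assumption of one layer per convolution block, chosen because they sum to the 1472-dimensional vector mentioned later in the text:

```python
import numpy as np

def hypercolumn_at(feature_maps, y, x, out_size):
    """Hypercolumn vector for pixel (y, x): concatenate, across layers, the
    activations at the corresponding cell of each feature map, with each map
    nearest-neighbour-scaled to the image resolution `out_size`."""
    parts = []
    for fmap in feature_maps:            # each fmap: (h, w, channels)
        h, w, _ = fmap.shape
        fy = min(int(y * h / out_size), h - 1)   # map image pixel to cell
        fx = min(int(x * w / out_size), w - 1)
        parts.append(fmap[fy, fx, :])
    return np.concatenate(parts)

# Toy maps standing in for five VGG16 layers (one per block).
rng = np.random.default_rng(0)
maps = [rng.standard_normal((s, s, c))
        for s, c in [(224, 64), (112, 128), (56, 256), (28, 512), (14, 512)]]
vec = hypercolumn_at(maps, y=100, x=57, out_size=224)
print(vec.shape)                          # (1472,)
```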

Performance metrics
There are two classes in the dataset: tumour (positive) and normal (negative). The confusion matrix obtained in two-class classification comprises the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts. Here, TP refers to the number of correctly classified positive images, while TN refers to the number of correctly classified negative images. FN gives the number of misclassified positive images, and FP gives the number of misclassified negative images. The performances of the models are measured using different metrics: accuracy (Acc), sensitivity (Sen), false-negative rate (FNR), specificity (Spe), and precision (Pre). Sen, also called the true positive rate (TPR), is the ratio of correctly classified positive samples to all positive samples, while Spe is the ratio of correctly classified negative samples to all negative samples. FNR is the ratio of misclassified tumour cases to all tumour cases. Acc indicates the general classification performance as the ratio of correctly classified samples to all samples. Pre gives the ratio of correctly classified positive samples to all samples predicted as positive. These metrics are defined as follows:
Acc = (TP + TN) / (TP + TN + FP + FN)
Sen = TP / (TP + FN)
FNR = FN / (TP + FN) = 1 − Sen
Spe = TN / (TN + FP)
Pre = TP / (TP + FP)
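A minimal sketch computing these metrics from the four counts. The example counts are taken from the Scenario B RF confusion matrix reported in the Results (30 TP, 1 FN, 28 TN, 3 FP), purely for illustration:

```python
def metrics(tp, tn, fp, fn):
    """Classification metrics computed from confusion-matrix counts."""
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Sen": tp / (tp + fn),          # true positive rate
        "FNR": fn / (tp + fn),          # equals 1 - Sen
        "Spe": tn / (tn + fp),
        "Pre": tp / (tp + fp),
    }

m = metrics(tp=30, tn=28, fp=3, fn=1)
print({k: round(v, 4) for k, v in m.items()})   # Acc = 58/62, i.e. 93.55%
```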

Model training
Before model training, data augmentation was first applied to some images in the negative class to eliminate the unbalanced class distribution in the dataset and improve model performance. Rotations of between 5 and 15 degrees were applied to images in the negative class, and 57 images were then randomly selected from the augmented image subset. As a result, each class consists of 155 images.
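A small-angle rotation like this can be sketched with a nearest-neighbour inverse mapping. This is a stand-in illustration under stated assumptions (single-channel image, zero fill for out-of-range pixels), not the authors' augmentation code:

```python
import numpy as np

def rotate_nn(img, degrees):
    """Nearest-neighbour rotation about the image centre; source pixels that
    fall outside the image become 0."""
    h, w = img.shape
    t = np.deg2rad(degrees)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, locate its source pixel.
    sy = np.round(cy + (yy - cy) * np.cos(t) - (xx - cx) * np.sin(t)).astype(int)
    sx = np.round(cx + (yy - cy) * np.sin(t) + (xx - cx) * np.cos(t)).astype(int)
    ok = (sy >= 0) & (sy < h) & (sx >= 0) & (sx < w)
    out = np.zeros_like(img)
    out[yy[ok], xx[ok]] = img[sy[ok], sx[ok]]
    return out

img = np.arange(64, dtype=float).reshape(8, 8)
aug = rotate_nn(img, 10)      # one augmented sample from a source image
print(aug.shape)              # (8, 8)
```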
In this study, machine learning experiments were carried out on the balanced dataset using the hold-out and fivefold cross-validation techniques. The hold-out experiments measure the performances of the traditional classifiers on the deep features extracted from the pre-trained VGG16 model with the transfer learning approach (Scenario A) and on the hypercolumn deep features (Scenario B). As shown in Fig. 3a, the hold-out technique splits the dataset into 80:20 ratios for the training and testing sets, respectively, so 248 images were reserved for training the models and 62 images for testing. Table 1 summarizes the sub-dataset distribution under the hold-out technique.
In the last scenario (Scenario C), which uses the fivefold cross-validation technique (k = 5) shown in Fig. 3b, 80% of the dataset (k − 1 folds) was reserved for model training in each fold, and the remaining 20% for model testing. Training and testing of the models were therefore carried out five times in Scenario C. In Scenarios B and C, the important keypoints required for training the models were detected on the training set using the ORB detector. Figure 4 shows the keypoints detected on some sample images. The hypercolumn deep feature vectors for these keypoints were extracted from the convolutional layers given in Table 2. All layers of the CNN architecture could be used for feature extraction, but this causes system resource problems; the listed layers were selected through trial-and-error experiments with combinations of different layers. The feature maps were upscaled to 224 × 224 when extracting these features. Every hypercolumn feature vector extracted from a tumour or normal image is assigned to the corresponding class.

Dynamically obtaining hypercolumn feature vectors with the ORB keypoint detector poses a limitation for models that expect input of fixed m × n size, where m and n denote the width and height of the image or matrix, respectively. Detecting a different number of keypoints in each image prevents the construction of a fixed m × n matrix. For example, 19 keypoints may be detected in one image and 189 in another (the numbers 19 and 189 are given symbolically). In this case, 19 × 1472 hypercolumn deep features are extracted from the first image and 189 × 1472 from the other, where 1472 is the dimension of the feature vector formed by combining the features extracted from the different VGG16 layers for a single keypoint.
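The per-keypoint labelling that sidesteps this limitation can be sketched as follows. Random toy arrays stand in for the real ORB/VGG16 outputs, and the 19 and 189 keypoint counts are the symbolic numbers from the text:

```python
import numpy as np

FEATURE_DIM = 1472   # hypercolumn length per keypoint, as stated in the text

def stack_keypoint_features(per_image_features, per_image_labels):
    """Build a training set from a variable number of keypoints per image:
    every hypercolumn row inherits the class label of its source image, so no
    fixed m x n matrix per image is ever required."""
    X = np.vstack(per_image_features)                     # (total_kps, 1472)
    y = np.concatenate([np.full(len(f), lab)
                        for f, lab in zip(per_image_features, per_image_labels)])
    return X, y

# Toy example: one image with 19 keypoints (tumour = 1), one with 189 (normal = 0)
rng = np.random.default_rng(1)
feats = [rng.standard_normal((19, FEATURE_DIM)),
         rng.standard_normal((189, FEATURE_DIM))]
X, y = stack_keypoint_features(feats, [1, 0])
print(X.shape, y.shape)    # (208, 1472) (208,)
```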
Therefore, a limitation applies to the feature representation, which is expected to be a fixed-size matrix, and some manipulation of the feature vectors is unavoidable. Since fixed-size input requires an equal number of keypoints for each image, this problem can conventionally be overcome only by:
a) randomly selecting p keypoints from each image that has many keypoints, or
b) filling the missing parts of the m × n matrix with zeros.
Both approaches have disadvantages: the first causes the loss of meaningful information, while the second causes overfitting. In the proposed approach, the hypercolumn deep features extracted at the keypoints detected in each training image were matched with the class of that image. Thus, dynamically detecting a different number of keypoints per image is not a problem: hypercolumn deep features are extracted dynamically, and target class information is assigned to each feature vector individually. The proposed approach therefore avoids the disadvantages mentioned above. After the hypercolumn deep features were extracted from the training images, the traditional classifier-based models were trained on these features. Once a model was built, keypoint detection and hypercolumn deep feature extraction were carried out on the test images. Each hypercolumn deep feature was then classified separately, and the image was assigned to the class with the highest number of labels using the majority voting approach. In the case of an equal number of positive and negative classifications, keypoint detection is dynamically performed again on the image; however, no such ties were encountered in the experiments.
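The voting rule with tie re-detection can be sketched as follows; `redetect` is a hypothetical caller-supplied function standing in for re-running keypoint detection and per-keypoint classification:

```python
from collections import Counter

def vote_image_class(keypoint_predictions, redetect):
    """Majority vote over per-keypoint class predictions. On a tie, the text
    states that keypoints are re-detected and the vote is repeated; `redetect`
    returns a fresh list of per-keypoint predictions."""
    counts = Counter(keypoint_predictions)
    (top, n1), *rest = counts.most_common()
    if rest and rest[0][1] == n1:            # tie between the two classes
        return vote_image_class(redetect(), redetect)
    return top

preds = [1, 1, 0, 1, 1, 0, 1]                # 5 tumour votes vs 2 normal
print(vote_image_class(preds, redetect=lambda: [1, 0, 1]))   # 1
```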
Some parameters of the RF and LR classifiers, which were used with their default settings, are listed in Table 3.

Results
The performances of the models were tested using the three scenarios listed below. In Scenario A, the classification performances of the RF and LR models on the deep features extracted from the VGG16 architecture with the transfer learning approach are summarized with confusion matrices in Fig. 5. As is common in the literature, the confusion matrices presented in this study are colour-coded, shading from light to dark in direct proportion to the number of misclassifications. As can be seen in the confusion matrices, RF misclassified 5 of the 31 normal images as tumour, while LR misclassified 9 of them. In addition, RF correctly classified 26 of the 31 tumour images, while LR correctly classified 23 of them.
Scenario B presents the classification performances of the RF and LR classifiers on the hypercolumn deep features extracted from the images in the test dataset. Figure 6 summarizes the confusion matrices obtained in these experiments. As seen there, RF makes fewer misclassifications than LR: RF misclassified 3 of the 31 normal images, whereas LR misclassified 4. In addition, the RF classifier correctly classified 30 of the 31 tumour images, misclassifying 1 of them, while the LR classifier correctly classified 29, misclassifying 2. Table 4 summarizes the performance of the models on the hypercolumn deep features and the VGG16 deep features under the hold-out technique.
Within the frame of Scenario C, the confusion matrices obtained by the RF and LR classifiers for each fold are presented in Fig. 7. As can be seen in the confusion matrices for Fold 1 through Fold 5, the RF classifier outperformed LR in the proposed approach. Table 5 summarizes the performance of the models on the hypercolumn deep features under the fivefold cross-validation technique; the average Acc, Sen, FNR, Spe, and Pre values of the RF and LR classifiers are presented in bold.
The performances of the classifiers on the hypercolumn deep features in Scenarios B and C are better than those achieved on the VGG16 deep features in Scenario A. Under the hold-out technique, the hypercolumn deep features improved the RF accuracy from 83.87% to 93.55% and the LR accuracy from 72.58% to 90.32% compared to the VGG16 deep features. In addition, fivefold cross-validation on the hypercolumn deep features improved the RF accuracy from 93.55% (hold-out) to 94.51%. The FNRs of the classifiers were also examined in detail. Figure 8 demonstrates the FNR values of the classifiers under the fivefold cross-validation technique; LR misclassified more tumour cases as normal than RF in every fold except fold #1. These experiments show that the proposed approach is feasible and effective for classifying brain tumours. The receiver operating characteristic (ROC) curve shows the relationship between the true positive rate (TPR) and the false positive rate (FPR). TPR, also called sensitivity, gives the ratio of correctly predicted positives to all positives; FPR is the ratio of negatives falsely predicted as positive to all negatives. The area under the curve (AUC) represents the area under the ROC curve: a value close to 1 indicates that the model separates the classes well, and a value close to 0 indicates that it does not [53]. While Fig. 9 presents the ROC curves and AUC values of the RF and LR algorithms on the hypercolumn deep features and the VGG16 deep features, Fig. 10 demonstrates the ROC curves and AUC values obtained under fivefold cross-validation. According to the results, the performances of the classifiers on the hypercolumn deep features are considerably higher than on the VGG16 deep features.
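AUC can also be computed directly as a rank statistic, without drawing the ROC curve: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch (the scores below are illustrative, not values from this study):

```python
def auc_score(labels, scores):
    """AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
    pairs in which the positive sample scores higher (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.7, 0.3, 0.2]
print(auc_score(labels, scores))   # 8/9, about 0.889
```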
Comparing Scenarios A and B, it can be seen that the AUC values of RF are better than those of LR. In addition, the average AUC values of 0.945 for RF and 0.858 for LR confirm the fivefold cross-validation results, indicating that RF has good classification performance. Table 6 summarizes the results of this study together with other studies in the literature. As the 'Method' column of this table shows, most studies were conducted with classical deep learning approaches; for example, Deepak and Ameer obtained 97.1% accuracy using deep features and SVM classification with the fivefold cross-validation technique [16].

Conclusion
Detecting a brain tumour is very important for a fast, well-planned treatment process and for increasing the patient's survival rate. In this study, a novel model that classifies normal and brain tumour cases was proposed. The model extracts hypercolumn deep features from specific layers of the VGG16 deep learning architecture and classifies them with the RF classifier. A majority voting technique reaches the final decision over the classifications of the hypercolumn deep feature vectors dynamically extracted from each image. The performance of the proposed model was validated with the hold-out and fivefold cross-validation techniques. No handcrafted features are used, and minimal pre-processing is required in the proposed study. The proposed approach offered its best performance with 94.51% accuracy, 91.61% sensitivity, 97.42% specificity, and 97.29% precision on the hypercolumn deep features. Given its high classification accuracy, the proposed model could play an important role in the development of expert systems for real clinical environments. Such an expert system, developed as a low-cost system based on a deep learning architecture, would help specialist physicians make more accurate diagnoses by minimizing the errors arising from their subjective opinions. For future work, a model with high generalization capacity on images obtained from different sources is planned, along with the use of other deep architectures that may achieve better performance in computer vision. In addition, studies based on different deep learning architectures applied to images generated by deep-dream and image-blurring techniques are among the planned future work.