The k-Nearest Neighbor Classifier
The optimal value of the tuning parameter k for the kNN classification model was chosen as the value giving the highest accuracy on the training data over a range of k values. Accuracy decreased as k increased and was highest at k = 5 (Table 1).
Table 1
Values of the tuning parameter k and the corresponding accuracy and kappa statistics for the kNN model on the training dataset
k | Accuracy | Kappa |
5 | 0.927 | 0.639 |
7 | 0.924 | 0.615 |
9 | 0.915 | 0.564 |
11 | 0.908 | 0.510 |
13 | 0.904 | 0.483 |
15 | 0.897 | 0.430 |
17 | 0.893 | 0.399 |
19 | 0.889 | 0.367 |
21 | 0.887 | 0.348 |
23 | 0.884 | 0.324 |
The kNN classifier with k = 5 had a predictive accuracy of 0.932 [95% CI: 0.899, 0.957] and a “no-information rate” (NIR) of 0.929, with p-value (accuracy > NIR) = 0.469 (Table 4). There is therefore no evidence that accuracy exceeds the NIR, which suggests that the kNN classifier performs no better on these data than always predicting the most common class. We therefore do not use this model to predict new data.
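The comparison of accuracy against the NIR can be illustrated with a one-sided exact binomial test. The sketch below assumes that this is the test behind the reported p-values, which the text does not state; the function name and the counts are hypothetical and are not taken from the study.

```python
from math import comb


def p_value_accuracy_gt_nir(n_correct, n_total, nir):
    """One-sided exact binomial test of H0: accuracy <= NIR.

    Returns P(X >= n_correct) for X ~ Binomial(n_total, nir).
    """
    return sum(
        comb(n_total, x) * nir**x * (1.0 - nir) ** (n_total - x)
        for x in range(n_correct, n_total + 1)
    )


# Hypothetical counts for illustration only (not the study's test set).
n_total = 200
nir = 0.90  # share of the most common class in the hypothetical test set

# Accuracy 0.91: barely above the NIR, so the p-value is large.
p_barely_above = p_value_accuracy_gt_nir(182, n_total, nir)
# Accuracy 0.97: well above the NIR, so the p-value is small.
p_well_above = p_value_accuracy_gt_nir(194, n_total, nir)
```

An accuracy only marginally above the NIR, as for the kNN model, therefore gives a large p-value even when the accuracy itself looks high.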
The Random Forest (RF) Classifier
The RF hyperparameter mtry was tuned using repeated cross-validation, and mtry = 6 was optimal. This means that 6 randomly selected predictors were considered as candidates at each split of a tree. Accuracy against the number of randomly selected predictors is shown in Figure 1.
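The role of mtry can be sketched as follows: at each split, a random subset of predictors of size mtry is drawn, and only those are evaluated as split candidates. The total of 15 predictors is an assumption based on the 15 inputs reported for the ANN, and the helper name is hypothetical.

```python
import random

N_PREDICTORS = 15  # assumption: same 15 predictors as the ANN input layer
MTRY = 6           # tuned value reported in the text


def candidate_predictors(rng, n_predictors=N_PREDICTORS, mtry=MTRY):
    """Return the indices of the predictors considered at one split."""
    return sorted(rng.sample(range(n_predictors), mtry))


rng = random.Random(42)  # fixed seed so the sketch is reproducible
# Each split draws its own subset, so the candidates differ between splits.
splits = [candidate_predictors(rng) for _ in range(3)]
```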
The RF classifier had an overall accuracy of 0.911 [95% CI: 0.874, 0.939], a kappa statistic of 0.54 and an NIR of 0.929, with p-value (accuracy > NIR) = 0.916. Accuracy was thus not significantly higher than the NIR, indicating a poor model. We therefore do not pursue the confusion matrix.
Support Vector Machine (SVM) Classifier
Three SVM classifier models were implemented: linear kernel SVM, polynomial kernel SVM and radial basis function kernel SVM. Their predictive performance is reported below.
Linear kernel SVM
The linear SVM model attained its highest accuracy at a cost (C) of 5.75. This cost parameter was obtained using repeated cross-validation, the results of which are shown in Figure 2.
The linear kernel SVM classification model (C = 5.75) had an overall accuracy of 0.957 [95% CI: 0.929, 0.976]. The corresponding NIR was 0.886 with p-value (accuracy > NIR) < 0.0001. Accuracy was thus significantly higher than the NIR, which implies that the classifier performed better than one could do by always predicting the most common class. The model had a Kappa of 0.811, signifying substantial agreement between the model’s predictions and the actual class labels after accounting for the agreement expected by chance. Table 2 displays the classifier predictions on the test dataset and the classifier metrics based on the confusion matrix. The predictions show that all samples of B. oleae (Bol) and B. zonata (Bzo) in the test dataset were classified into their observed groups.
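To make the Kappa statistic concrete, the sketch below computes Cohen's kappa from a confusion matrix as (observed agreement - chance agreement) / (1 - chance agreement). The function name and the 2 x 2 matrix are illustrative only and do not come from the study.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix.

    matrix[i][j] is the number of samples observed in class i
    and predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    return (observed - expected) / (1.0 - expected)


# Toy matrix: observed agreement 0.70, chance agreement 0.50.
toy = [[20, 5], [10, 15]]
kappa = cohen_kappa(toy)
```

In this toy case the accuracy is 0.70 but kappa is only about 0.40, which shows why kappa is reported alongside accuracy when classes are imbalanced.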
Table 2
Classification results for the SVM classifiers on test dataset of morphometric measurements of Bactrocera spp., with observed species affiliation in the rows and predicted species allocation in the columns. Correct classification rate appears along the diagonal in bold.
Classifier | Observed | Predicted (%) | Sensitivity | Specificity |
| | Bco | Bcu | Bdo | BI | Bka | Bol | Bzo | | |
SVM - L | Bco | 80.0 | 0 | 0 | 20.0 | 0 | 0 | 0 | 1.000 | 0.997 |
| Bcu | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0.818 | 1.000 |
| Bdo | 0 | 0 | 25.0 | 75.0 | 0 | 0 | 0 | 0.500 | 0.981 |
| BI | 0 | 0.7 | 0.7 | 98.6 | 0 | 0 | 0 | 0.965 | 0.892 |
| Bka | 0 | 0 | 0 | 37.5 | 62.5 | 0 | 0 | 1.000 | 0.991 |
| Bol | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 1.000 | 1.000 |
| Bzo | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 1.000 | 1.000 |
SVM - R | Bco | 80.0 | 0 | 0 | 20.0 | 0 | 0 | 0 | 1.000 | 0.997 |
| Bcu | 0 | 88.9 | 0 | 11.1 | 0 | 0 | 0 | 1.000 | 0.997 |
| Bdo | 0 | 0 | 37.5 | 62.5 | 0 | 0 | 0 | 1.000 | 0.984 |
| BI | 0 | 0 | 0 | 100 | 0 | 0 | 0 | 0.956 | 1.000 |
| Bka | 0 | 0 | 0 | 75.0 | 25.0 | 0 | 0 | 1.000 | 0.981 |
| Bol | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 1.000 | 1.000 |
| Bzo | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 1.000 | 1.000 |
SVM - P | Bco | 80.0 | 0 | 0 | 20.0 | 0 | 0 | 0 | 0.800 | 0.997 |
| Bcu | 0 | 88.9 | 0 | 11.1 | 0 | 0 | 0 | 0.889 | 0.997 |
| Bdo | 0 | 0 | 50.0 | 50.0 | 0 | 0 | 0 | 1.000 | 0.988 |
| BI | 0.35 | 0.35 | 0 | 98.2 | 1.1 | 0 | 0 | 0.962 | 0.865 |
| Bka | 0 | 0 | 0 | 62.5 | 37.5 | 0 | 0 | 0.500 | 0.984 |
| Bol | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 1.000 | 1.000 |
| Bzo | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 1.000 | 1.000 |
Bco - B. correcta, Bcu - B. cucurbitae, Bdo - B. dorsalis, BI - B. invadens, Bka - B. kandiensis, Bol - B. oleae, Bzo - B. zonata; SVM-L: linear kernel SVM, SVM-R: radial kernel SVM, SVM-P: polynomial kernel SVM.
The linear kernel SVM model achieved a sensitivity above 80% for all species except B. dorsalis (Bdo), while specificity ranged from 89% to 100%.
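Per-class sensitivity and specificity follow from a multiclass confusion matrix by treating each class in turn as "positive" and all others as "negative". The sketch below uses these usual one-vs-rest definitions with made-up counts and labels. It is not a reconstruction of Table 2, and the result depends on whether rows hold the observed or the predicted classes.

```python
def per_class_metrics(matrix, labels):
    """One-vs-rest sensitivity and specificity for each class.

    matrix[i][j] is the number of samples observed as labels[i]
    and predicted as labels[j].
    """
    total = sum(sum(row) for row in matrix)
    metrics = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                  # observed k, predicted other
        fp = sum(row[k] for row in matrix) - tp   # observed other, predicted k
        tn = total - tp - fn - fp
        metrics[label] = {
            "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
            "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        }
    return metrics


# Toy three-class example (hypothetical counts and labels).
toy_labels = ["A", "B", "C"]
toy_matrix = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
toy_metrics = per_class_metrics(toy_matrix, toy_labels)
```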
Radial kernel SVM classifier
Selecting the optimal radial kernel SVM model requires determining the optimal values of two tuning parameters, gamma (γ) and cost (C). We tested values of γ from 0.01 to 0.1 in steps of 0.01 and values of C from 0.01 to 10.0 in steps of 0.25, and chose the pair that minimized the classification error under 10-fold cross-validation. The optimal model was obtained with γ = 0.06 and C = 9.51. With these parameters, the radial kernel SVM model had an accuracy of 0.96 [95% CI: 0.933, 0.978], a Kappa statistic of 0.810 and an NIR of 0.91, with p-value (accuracy > NIR) = 0.0002. Accuracy being significantly higher than the NIR suggests that the radial kernel SVM model is superior to always predicting the most common class.
As with the linear kernel SVM model, the sensitivity and specificity for B. oleae (Bol) and B. zonata (Bzo) were 100% (Table 2).
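The tuning grid described above can be sketched as follows. The grid values come from the ranges and steps stated in the text. The function names are hypothetical, and `toy_cv_error` is only a stand-in for the 10-fold cross-validated error: it is constructed so that its minimum falls at the reported optimum, purely for demonstration.

```python
def build_grid():
    """All (gamma, cost) pairs from the ranges and steps given in the text."""
    gammas = [round(0.01 * i, 2) for i in range(1, 11)]     # 0.01, ..., 0.10
    costs = [round(0.01 + 0.25 * i, 2) for i in range(40)]  # 0.01, ..., 9.76
    return [(g, c) for g in gammas for c in costs]


def grid_search(grid, cv_error):
    """Return the (gamma, cost) pair with the smallest cross-validated error."""
    return min(grid, key=lambda pair: cv_error(*pair))


def toy_cv_error(gamma, cost):
    # Hypothetical stand-in for the 10-fold CV classification error.
    # A real run would fit an SVM on each fold and average the errors.
    return (gamma - 0.06) ** 2 + 0.001 * (cost - 9.51) ** 2


grid = build_grid()  # 10 gamma values x 40 cost values
best_gamma, best_cost = grid_search(grid, toy_cv_error)
```

The reported optimum C = 9.51 lies on this grid (0.01 + 38 × 0.25), which is consistent with the stated step size.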
Polynomial SVM classifier model
The polynomial SVM model attained optimal accuracy at a degree of 2, a scale of 2 and a cost of 0.1. On the test dataset, the classifier yielded a predictive accuracy of 0.951 [95% CI: 0.921, 0.972], a Kappa statistic of 0.784 and an NIR of 0.886, with p-value (accuracy > NIR) < 0.0001, suggesting a good model. The sensitivity for B. oleae (Bol), B. zonata (Bzo) and B. dorsalis (Bdo) was 100% in each case, while the model had its lowest sensitivity for B. kandiensis (Bka) (Table 2).
Artificial Neural Network Classifier
The optimal ANN model was selected based on the accuracy obtained by varying the number of nodes of the network. The ANN model was optimal at 17 nodes and a decay of 0.042. We fitted a feedforward (15-17-7) network, that is, a model with 15 input neurons, 17 hidden neurons and 7 output neurons. The predictive accuracy of this model was 0.96 [95% CI: 0.933, 0.979], with a Kappa statistic of 0.833 and an NIR of 0.873, and p-value (accuracy > NIR) < 0.0001. Thus, the accuracy of the neural network was significantly higher than the NIR. The classification results of the ANN classifier on the test dataset and the estimated metrics are presented in Table 3.
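A 15-17-7 feedforward network can be sketched as a single forward pass from 15 inputs through 17 hidden units to 7 class probabilities. The sketch assumes bias terms, logistic hidden units and a softmax output, none of which are stated in the text. The weights and the input vector are random placeholders, not fitted values.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 15, 17, 7


def n_weights(n_in, n_hidden, n_out):
    """Number of weights, including one bias per hidden and output unit."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out


def softmax(scores):
    """Convert raw output scores to probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, w_hidden, w_out):
    """One forward pass; each weight row stores the bias first."""
    hidden = []
    for row in w_hidden:
        z = row[0] + sum(w * xi for w, xi in zip(row[1:], x))
        hidden.append(1.0 / (1.0 + math.exp(-z)))  # logistic activation
    scores = [row[0] + sum(w * h for w, h in zip(row[1:], hidden)) for row in w_out]
    return softmax(scores)


rng = random.Random(0)  # placeholder weights, not the fitted model
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(N_INPUT + 1)] for _ in range(N_HIDDEN)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN + 1)] for _ in range(N_OUTPUT)]
x = [rng.gauss(0.0, 1.0) for _ in range(N_INPUT)]  # one hypothetical specimen
probs = forward(x, w_hidden, w_out)                # one probability per species
```

Under these assumptions the network has (15 + 1) × 17 + (17 + 1) × 7 = 398 weights, and the predicted species is the one with the largest output probability.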
Table 3
Classification results for the ANN classifier on test dataset of morphometric measurements of Bactrocera spp., with observed species affiliation in the rows and predicted species allocation in the columns. Correct classification rate appears along the diagonal in bold.
Observed | Predicted (%) | Sensitivity | Specificity |
| Bco | Bcu | Bdo | BI | Bka | Bol | Bzo | | |
Bco | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 1.000 | 1.000 |
Bcu | 0 | 88.9 | 0 | 11.1 | 0 | 0 | 0 | 0.889 | 0.997 |
Bdo | 0 | 0 | 50.0 | 50.0 | 0 | 0 | 0 | 0.667 | 0.987 |
BI | 0 | 0.35 | 0.35 | 98.2 | 1.1 | 0 | 0 | 0.975 | 0.878 |
Bka | 0 | 0 | 12.5 | 25.0 | 62.5 | 0 | 0 | 0.625 | 0.991 |
Bol | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 1.000 | 1.000 |
Bzo | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 1.000 | 1.000 |
Bco - B. correcta, Bcu - B. cucurbitae, Bdo - B. dorsalis, BI - B. invadens, Bka - B. kandiensis, Bol - B. oleae, Bzo - B. zonata
The metrics for the ANN classifier show that sensitivity was lowest for B. dorsalis (Bdo) and B. kandiensis (Bka), while the sensitivity and specificity for B. correcta (Bco), B. oleae (Bol) and B. zonata (Bzo) were 100% (Table 3).
Finally, a summary of the performance metrics (accuracy, Kappa, no-information rate and associated p-value) of all the ML classifiers under study is presented in Table 4.
Table 4
Summary of performance metrics for all the machine learning classifiers under study
Classifier Model | Accuracy [95% CI] | Kappa | NIR | p-value (Acc > NIR) |
k-Nearest Neighbor | 0.932 [0.899, 0.957] | 0.648 | 0.929 | 0.469 |
Random Forest | 0.912 [0.874, 0.939] | 0.536 | 0.929 | 0.916 |
SVM | | | | |
Linear kernel | 0.957 [0.929, 0.976] | 0.811 | 0.886 | < 0.0001 |
Radial kernel | 0.960 [0.933, 0.979] | 0.810 | 0.908 | 0.0002 |
Polynomial kernel | 0.951 [0.921, 0.972] | 0.784 | 0.886 | < 0.0001 |
ANN | 0.960 [0.933, 0.979] | 0.827 | 0.883 | < 0.0001 |