Comparative study of ensemble models of deep convolutional neural networks for crop pests classification

Pest infestations on wheat, corn, soybean, and other crops can cause substantial yield losses. Classification of crop pests is therefore of considerable importance for accurate and intelligent pest control. Ensemble models can effectively improve the accuracy of crop pest classification, and different ensemble models produce different results. To study the advantages and disadvantages of ensemble models under different agricultural production environments, six basic models are trained on the D0 dataset. Then, the three models with the best classification performance are selected. Finally, two ensemble models, a linear ensemble named SAEnsemble and a nonlinear ensemble named SBPEnsemble, are designed to combine the basic models for crop pest classification. The accuracies of SAEnsemble and SBPEnsemble are 0.85% and 1.49% higher, respectively, than that of the basic model with the highest accuracy. Comparison of the two proposed ensemble models shows that the accuracy of SAEnsemble is lower when the pests are very similar to the background, while SBPEnsemble is more likely to make wrong predictions when there is a large difference between pests and background. Therefore, in practical agricultural production, choosing a linear or nonlinear ensemble method according to the situation can effectively improve the accuracy of crop pest classification.

Pest infestations are a major cause of crop production losses. Once pests develop in fields, only timely diagnosis by farmers can enable effective treatment. Pest prevention methods depend on the species of pest, but sorting pest species manually is cumbersome and inefficient because of the high similarity and complex structure among pests. Traditional pest classification methods include the K-means clustering algorithm, which requires manual extraction of insect image features and is time-consuming when the dataset is large [10]. Because the relevant features are extracted from images manually, the learning system is not automated [4]. In another study, a framework that classified leaf diseases of five diverse plants using image processing and machine learning algorithms achieved an accuracy of 83-94% [1]. Traditional pest classification methods rely on manual operation and are generally used for small datasets with few categories. For large-scale pest classification, these methods are increasingly inefficient and unwieldy.
With the rapid development of deep convolutional networks in recent years, researchers have begun to use CNNs to develop image classification systems for pest classification. A CNN can learn features at all locations in an image; its shallow layers learn simple features while its deep layers learn complex features, so CNNs show better performance in the task of pest classification. Deepening the network is a common way to improve the accuracy of a CNN, but as the network deepens, the model quickly grows large and consumes substantial memory and storage resources. As a simpler and more efficient alternative, ensemble learning [29] can be applied.
Ensemble learning accomplishes learning tasks by combining different models and can be used for classification, regression, feature selection, outlier detection, etc. It is a technical framework that combines models using different methods to achieve better results. It consists of two main steps: first, choose different models; second, choose a combination method and combine these models into a new model.
There are many ensemble learning methods, including hard voting, soft voting, and learning methods. Learning methods fall into two main categories: linear learning and nonlinear learning. Linear learning focuses on learning the weight of each model, representing the relationships between the models, while nonlinear learning takes the outputs of the different models as the input of a new model for training, representing the internal relationships of the models. Therefore, with the basic models unchanged, an ensemble model using the linear learning method and one using the nonlinear learning method will produce different results, and these differences can guide actual agricultural production. The basic models used in this paper are Xception, InceptionV3, and MobileNetV2. The key contributions of the proposed work are as follows: 1. Two ensemble models are proposed: SAEnsemble, a linear ensemble model, and SBPEnsemble, a nonlinear ensemble model. 2. The ensemble models are comprehensively compared, and the advantages and disadvantages of the two ensemble models in agricultural production are summarized. 3. The classification performance of the basic models and the pest species they are suited to are compared.
The rest of this paper is organized as follows: Section 3 introduces the dataset and methods used in the current work. Section 4 presents the experimental results and the advantages and disadvantages of the two proposed ensemble models. Section 5 compares the two proposed ensemble models with other studies. Finally, Section 6 presents conclusions and future prospects.

Literature review
There has been a lot of research dedicated to the classification of insect pests. A GoogLeNet model, which can extract features at different scales, was used to classify ten common crop pests and achieved a 6.22% increase in accuracy compared to the most advanced method [28]. Thenmozhi and Reddy [26] built a CNN manually, compared it with other advanced CNN pest classifiers, and analyzed the influence of transfer learning. Many data enhancement methods have been used to improve CNN performance and were tested on two pest datasets [20]; these methods greatly improve the robustness of the model. The faster region-based CNN (Faster R-CNN) was trained to detect the location of lesions on leaves, achieving a classification accuracy of 89.4% on 7 types of tea tree insect pests [17]. Fast R-CNN resolves two problems of R-CNN: the numerous candidate frames to be computed and the slow training speed. Its training speed is 8 times faster than R-CNN while also improving accuracy. VGG16 and InceptionV3 were used to detect and identify rice pests, and a two-stage model derived from fine-tuning was proposed, which also achieved good results [7]. Fangyuan et al. [11] proposed a cascade pest-classification method based on a two-stage framework, in which a context-aware attention network was constructed to classify the pest images. Both of the above methods combine different models, making reasonable use of the advantages of each. An important experiment trained models using images obtained under both experimental and field conditions [2]. It compared the accuracy of CNN models under the different conditions and placed CNN usage in a natural scene, showing that no matter how accurate a model was, its accuracy dropped sharply, by nearly 50%, when applied to the actual production situation.
Many algorithms in pest classification are based on the following models: Xception, InceptionV3, Vgg16, Vgg19, MobileNetV2, Resnet50, and SqueezeNet [13,19,21,22,24,30,34]. These models usually have different network structures and characteristics for learning different features, and their generalization ability also differs, which leads to differing classification performance for different classes of crop pests. Ensembling CNN models is therefore a good scheme for creating a system with high predictive capacity. An ensemble model based on a genetic algorithm has been proposed to improve the classification accuracy of a basic model for multiple types of insect pests [3].
In recent years, transformer architectures have changed many areas of deep learning. The transformer was first proposed in [27] as an architecture based on attention mechanisms that completely dispenses with convolution operations. Influenced by this, the Vision Transformer (ViT) was proposed, a transformer architecture inspired by natural language processing (NLP) tasks. Experiments found that this method outperformed deep convolution-based networks in terms of training time and structure [9], but the network also had too many parameters, making it difficult to port directly to mobile devices for agricultural production. To solve this problem, a new deformable self-attention module was proposed, in which the locations of key-value pairs can be selected in a data-dependent manner; this method reduced the number of ViT parameters to some extent and achieved good results in the field of pest classification [31]. The merits and demerits of existing works are listed in Table 1.

Dataset
In this paper, the dataset is the publicly available D0 dataset, which consists of 40 kinds of crop pests and 4508 RGB images with a resolution of 200 × 200 [33]. The corresponding number and label of each class are listed in Table 2. The D0 dataset was divided into three groups: training, verification, and testing, with 3151 images used for training, 378 for verification, and 943 for testing. The dataset contains classes that are common, such as Halyomorpha halys and Pieris rapae; classes that have obvious contours, such as Sesamia inferens; classes that have unique features, such as Eurydema domulus; and two very similar classes, Corythucha ciliata and Corythucha marmorata. Samples of 10 classes are shown in Fig. 1.

Methods
The goals are to improve accuracy compared to the basic models and to select a suitable ensemble model for different situations. This paper contains two main processes. First, six basic models are pre-trained using transfer learning on the D0 dataset, and the three models with the best performance are selected. Then, two ensemble models are designed to investigate crop pest classification and are compared with three traditional ensemble models. Figure 2 shows the framework of the comparative study.

Transfer learning
Transfer learning is an emerging family of machine learning techniques and has been actively studied in the machine learning and AI communities in recent years [16]. Its aim is to solve new learning tasks from fewer examples by using information gained from solving related tasks [25]. There are two transfer learning methods used here. In the first, all layers of the pre-trained model are frozen except the layers added by the researchers. In the second, only part of the layers of the pre-trained model is frozen, namely the convolution layers close to the input, so that a large amount of low-level information is retained. In this comparative study, Vgg16, Vgg19, and Resnet50 use the first method; InceptionV3, Xception, and MobileNetV2 use the second method.
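The two freezing strategies above can be sketched in framework-agnostic Python. The `Layer` class here is a minimal stand-in for a real framework layer; in Keras the same idea is expressed by setting the `trainable` attribute of `base_model.layers`:

```python
class Layer:
    """Minimal stand-in for a framework layer with a trainable flag."""
    def __init__(self, name):
        self.name, self.trainable = name, True

def freeze(layers, n_frozen=None):
    """Strategy 1 (n_frozen=None): freeze every pre-trained layer, so only
    newly added layers are trained. Strategy 2: freeze only the first
    n_frozen layers near the input, keeping their low-level features while
    fine-tuning the deeper layers."""
    n = len(layers) if n_frozen is None else n_frozen
    for layer in layers[:n]:
        layer.trainable = False
    return layers
```

For example, the paper's Xception setup would correspond to `freeze(base_layers, n_frozen=120)`, and the MobileNetV2 setup to `freeze(base_layers)` with no unfrozen pre-trained layers.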

Xception
Xception uses depth-wise separable convolution instead of the traditional convolution operation, which helps maintain classification accuracy while reducing the number of parameters and the amount of computation [6]. This paper freezes the first 120 layers of Xception and adds three fully connected layers containing 512 neurons. The activation function is ReLU [14], used to introduce nonlinearity, and the Softmax activation function is used in the final output layer to classify the pests. Figure 3a shows the model structure. It can be seen that Xception has many frozen layers, which retains its ability to extract shallow features. On this basis, a few fully connected layers are added, and the number of neurons is reduced from 512 directly to 40, skipping intermediate sizes such as 216 and 108, which reduces computation.

InceptionV3
InceptionV3 is an implementation of GoogLeNet, notable for its ability to deconstruct features into smaller convolution sections [35]. Its network can be efficiently decomposed into small convolution kernels, which greatly reduces the number of parameters of the model and the chance of overfitting [18]. This paper freezes the first 270 layers of InceptionV3 and adds three fully connected layers containing 512 neurons. The corresponding activation function is ReLU, and the Softmax activation function is used in the final output layer to classify the pests. The specific structure is shown in Fig. 3b.

MobileNetV2
MobileNetV2's fast execution speed makes experimenting and parameter tuning much easier, while its low memory consumption is a desirable quality in the context of an ensemble of networks [5]. This paper freezes all layers of MobileNetV2 and adds two fully connected layers containing 512 neurons. The model structure is given in Fig. 3c. All layers of MobileNetV2 are frozen because MobileNetV2 is a model with very few parameters, which makes it suitable for the pest classification task in this paper.

Ensemble models
In this paper, the ensemble models are based on the voting method [15], which is divided into two kinds. The first is hard voting, which includes one ensemble model: each basic model casts one vote for the class with its highest predicted probability. When the voting results of the three models all differ, the model that performs best on the verification set provides the result. The second is soft voting, which includes three ensemble models. The first takes the average of all models' predicted probabilities for each class and selects the class with the highest averaged probability as the result. Its disadvantage is that the accuracy of each model is not considered, so a model with low accuracy may have a large influence on the result. To address this, a second soft-voting ensemble model takes the accuracy of the three models on the verification set as weights and linearly combines the weighted predictions, so models with higher accuracy are assigned higher weight. But accuracy is not necessarily the best weight. To obtain the best weights, a linear ensemble model based on simulated annealing, SAEnsemble, is proposed as the third soft-voting model.
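The hard and soft voting schemes above can be sketched as follows. This is a minimal NumPy sketch; `val_acc` stands for the validation-set performance used to break ties and to weight models, and the paper's exact tie-breaking rule may differ slightly:

```python
import numpy as np

def hard_vote(probs, val_acc):
    """Hard voting: each model votes for its argmax class, one vote each.
    If all models disagree, defer to the model that performed best on the
    verification set (an assumed tie-breaking rule)."""
    votes = [p.argmax() for p in probs]       # one vote per model
    counts = np.bincount(votes)
    if counts.max() > 1:                      # some class received >= 2 votes
        return counts.argmax()
    best = int(np.argmax(val_acc))            # all three models disagree
    return votes[best]

def soft_vote(probs, weights=None):
    """Soft voting: average the predicted probabilities (optionally weighted,
    e.g. by verification-set accuracy) and take the argmax class."""
    probs = np.asarray(probs, dtype=float)
    if weights is None:                       # plain averaging
        weights = np.ones(len(probs)) / len(probs)
    fused = (np.asarray(weights)[:, None] * probs).sum(axis=0)
    return fused.argmax()
```

With `weights=None` this is the first soft-voting model; passing verification accuracies as `weights` gives the second; SAEnsemble then searches for better weights than the accuracies themselves.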

Linear ensemble
Simulated annealing [32] is a random optimization algorithm based on the hill-climbing algorithm. Its starting point is the similarity between the annealing process of solid material in physics and combinatorial optimization problems. The algorithm starts from a relatively high initial temperature; as the temperature parameter continuously declines, it randomly searches for the global optimum of the objective function in the solution space, using its probabilistic jump characteristic to escape local optima and eventually approach the global optimum. This algorithm is usually applied to optimization tasks.
In this paper, the initial solution space consists of three random values between 0 and 1, and the corresponding initial objective function value y_max is computed. At each iteration, it is checked whether the temperature T is higher than Tmin; if so, the loop continues, otherwise it terminates. Inside the loop, each new solution fluctuates around the old one within a range of 0.05 × 0.025 × T. After a new solution is obtained, it is checked that all its values lie between 0 and 1. If they do, the new objective value yNew is calculated. If yNew is greater than the current value y, the solution space is updated, and yNew is compared with the previous maximum y_max; if it is greater, the maximum objective value and the corresponding solution space are updated. If yNew is smaller than y, Eq. (1) determines probabilistically whether the solution space is updated. This process repeats until the optimal objective value and the corresponding solution space are attained. These steps, shown in Fig. 4, are how SAEnsemble obtains the best weights.
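The weight search described above can be sketched as follows, assuming the objective function is the validation accuracy of the weighted ensemble. The cooling schedule, perturbation range, and Metropolis-style acceptance rule here are illustrative stand-ins, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def val_accuracy(weights, probs, labels):
    """Objective: accuracy of the weighted ensemble on the verification set.
    probs has shape (n_models, n_samples, n_classes)."""
    fused = np.tensordot(weights, probs, axes=1)        # (n_samples, n_classes)
    return (fused.argmax(axis=1) == labels).mean()

def sa_search(probs, labels, T=1.0, T_min=1e-3, alpha=0.95, steps=20):
    """Simulated-annealing search for three model weights in [0, 1]."""
    w = rng.random(3)                                   # random initial solution
    y = y_max = val_accuracy(w, probs, labels)
    w_best = w.copy()
    while T > T_min:
        for _ in range(steps):
            # perturb the old solution; range shrinks with temperature
            w_new = w + rng.uniform(-1.0, 1.0, 3) * 0.05 * T
            if not ((0 <= w_new) & (w_new <= 1)).all():
                continue                                # keep weights inside [0, 1]
            y_new = val_accuracy(w_new, probs, labels)
            # accept improvements always; accept worse solutions with
            # probability exp((y_new - y) / T), the probabilistic jump
            if y_new > y or rng.random() < np.exp((y_new - y) / T):
                w, y = w_new, y_new
                if y > y_max:
                    w_best, y_max = w.copy(), y
        T *= alpha                                      # cool down
    return w_best, y_max
```

The acceptance probability plays the role of Eq. (1): at high temperature, worse weight vectors are still accepted often, letting the search jump out of local optima.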

Nonlinear ensemble
A linear ensemble assigns weights directly to the basic models, which ignores important nonlinear relations within the outputs of the basic models. A nonlinear ensemble instead takes the outputs of the basic models as the input of a new network for training. In this comparative study, each basic model outputs the prediction probabilities of the 40 classes at one time, so the three models output 120 prediction probabilities in total, which are taken as the inputs of a Back Propagation (BP) neural network. This network uses a loss function to measure the distance between the probability distribution predicted by the neural network and the ground-truth probability distribution; the sample loss is reduced by increasing the output probability at the position corresponding to the target value. It iterates repeatedly to obtain the optimal solution, i.e., the 40 best predictions for an image. This ensemble model is called SBPEnsemble. The first layer of SBPEnsemble is an input layer of 512 neurons that takes the 120-dimensional input and randomly drops half of the data. The second layer contains 512 neurons, is regularized by L2, and uses the ReLU activation function; it also randomly drops half of the data. The third layer, i.e., the output layer, contains 40 neurons and uses a linear activation function. Figure 5 shows the specific structure.
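The stacked input and the 120 → 512 → 512 → 40 forward pass can be sketched in NumPy as follows. The weights here are random placeholders and the L2 term is omitted; the actual SBPEnsemble is trained with backpropagation in Keras:

```python
import numpy as np

rng = np.random.default_rng(42)

def stack_inputs(p1, p2, p3):
    """Concatenate the 40-class probability outputs of the three basic
    models into the 120-dimensional input of SBPEnsemble."""
    return np.concatenate([p1, p2, p3], axis=-1)        # shape (..., 120)

def init_params():
    """Random placeholder weights for the 120 -> 512 -> 512 -> 40 network."""
    dims = [(120, 512), (512, 512), (512, 40)]
    return [(rng.normal(0, 0.05, d), np.zeros(d[1])) for d in dims]

def forward(x, params, train=False, drop=0.5):
    """Forward pass of the described network. Dropout (randomly discarding
    half of the activations) is active only during training."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h = np.maximum(0.0, x @ W1 + b1)                    # ReLU
    if train:
        h = h * (rng.random(h.shape) > drop)            # dropout
    h = np.maximum(0.0, h @ W2 + b2)
    if train:
        h = h * (rng.random(h.shape) > drop)
    return h @ W3 + b3                                  # linear output, 40 scores
```

This makes the stacking explicit: the new network never sees images, only the 120 probabilities, and learns nonlinear relations among them that fixed linear weights cannot express.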

Experimental setup and training processes
Before an image is input into a model, it is subjected to data augmentation using image processing methods such as rotation, scaling, and mirroring. The experimental environment is an NVIDIA RTX 3080 Ti GPU running Ubuntu, using Python with the Keras and OpenCV frameworks. InceptionV3 has the most parameters and takes the longest time to train; the training times of the basic models are listed in Table 3. SAEnsemble is based on the simulated annealing algorithm; its computational complexity depends on the size of the input sample m and the number of nodes N, which is a finite constant. SBPEnsemble is based on a BP neural network, and its computational complexity depends on the size of the input samples m and the number of neural nodes n. Comparing the two algorithms, the computational complexity of SAEnsemble grows faster than that of SBPEnsemble, so SAEnsemble requires a longer runtime; indeed, the runtime of SAEnsemble is about three times that of SBPEnsemble. In this comparative study, three basic models were used for transfer learning and two ensemble models were trained, each through 100 epochs. The process of model parameter adjustment is as follows. First, three optimizers, SGD, Adam, and Adagrad, are compared. The accuracy of the three basic models after training is shown in Fig. 6a; the Adam optimizer gives the best result. Adam is the most commonly used optimizer and has the fastest convergence. Then the learning rate is adjusted, testing three values: 0.0005, 0.001, and 0.002. Fig. 6b shows that 0.001 is most suitable for MobileNetV2, while InceptionV3 and Xception both use 0.0005. The final accuracy, precision, and recall of all models are shown in Table 4.
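The rotation/scaling/mirroring augmentation might look like the following NumPy sketch; the specific transforms, probabilities, and crop margin are illustrative assumptions, and a production pipeline would use OpenCV or Keras preprocessing instead:

```python
import numpy as np

def augment(image, rng):
    """Randomly rotate, mirror, and zoom a square H x W x C image.
    The zoom is a simple crop-the-center-and-pad-back placeholder."""
    image = np.rot90(image, k=int(rng.integers(0, 4)))  # rotate by a multiple of 90 degrees
    if rng.random() < 0.5:
        image = np.fliplr(image)                        # horizontal mirror
    if rng.random() < 0.5:                              # mild zoom-in: crop, then pad back
        h, w = image.shape[:2]
        m = int(0.1 * min(h, w))
        image = np.pad(image[m:h - m, m:w - m],
                       ((m, m), (m, m), (0, 0)), mode="edge")
    return image
```

Because the D0 images are square (200 × 200), every transform here preserves the input shape, so augmented images can be fed to the models unchanged.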

Performance of the CNN models in pest classification
Among the classes with the highest accuracy for each basic model, it is found that Xception outperforms the other models on some beetle classes, such as classes 12, 17, 23, 24, and 25; see the second line of Fig. 7. When one image is input into the three basic models, the fusion feature map extracted from the last convolution layer of Xception is precisely concentrated on the target region, while MobileNetV2 also concentrates on the target region but more divergently, so its learning performance is not as good as Xception's; see Fig. 8. However, MobileNetV2 shows better performance on some moth classes, such as classes 1, 9, 20, 26, and 35, as shown in the third line of Fig. 7. When one image is input into the three basic models, only MobileNetV2 learns the correct area; see Fig. 9. InceptionV3 has the highest accuracy on the test set. Different basic models thus have different classification performance across pest classes, which is one of the reasons to combine the three basic models. The process of producing the fusion feature map is shown in Fig. 10. Table 5 presents the precision, accuracy, and F1-score of SBPEnsemble on the test set. The model can fully classify 23 insect pests, with an average classification rate of 96.55%. Table 6 presents the accuracy of SAEnsemble on the test set. The model can fully identify 22 insect pests, with a classification rate of 96.03%. The distributions of the two matrices are roughly the same because both ensembles combine the same three basic models, but there are some differences in performance. SAEnsemble comprehensively considers the features extracted by each model to obtain weights and then performs a linear ensemble; when an insect pest and the background are very similar, the features extracted by the basic models are not obvious, leading to classification errors in SAEnsemble, as shown in Fig. 11.
When the characteristics of two insect pests are very similar, the features extracted by the three basic models are also similar, which leads to classification errors in SAEnsemble. The feature images extracted from two such classes by the three basic models are shown in Fig. 12. The nonlinear ensemble model SBPEnsemble is a BP neural network that learns from all the outputs of the three models. From a mathematical point of view, it avoids, to some extent, the issues SAEnsemble faces with unobvious and similar features. However, it is slightly weaker than SAEnsemble when there is a large difference between pest and background. In such a case, SAEnsemble can accurately extract the characteristics of the insect pests and classify them correctly, while SBPEnsemble may make wrong predictions; see Fig. 13. Therefore, in a real-world production situation, when the pest has a high degree of similarity with the background, SBPEnsemble can be used to combine the basic models; when the pest is distinguishable from the background, SAEnsemble may achieve better results.

Influence of background on SAEnsemble and SBPEnsemble
To further study the impact of the background in an image, two groups of images are randomly selected from the images with low accuracy. In the first group, the pests are very similar to the background; in the second group, the pests are very different from the background. The classification results of SBPEnsemble and SAEnsemble are given below each image, as shown in Fig. 14. In the first group, SBPEnsemble classifies all images correctly, while SAEnsemble correctly classifies only two. In the second group, SBPEnsemble makes two wrong classifications while SAEnsemble classifies all images correctly. The experimental results show that the nonlinear ensemble model SBPEnsemble is better than SAEnsemble when the pests are very similar to the background, while the linear ensemble model SAEnsemble performs better when the pests differ from the background. According to these results, selecting an appropriate ensemble method in agricultural production can effectively improve accuracy. The following sections illustrate the results of the two proposed models, compare them with other studies on the D0 dataset, and compare model training with and without foreground-based enhancement.

Result illustration of two ensemble models
When checking the prediction results of the five models, i.e., the three basic models and the two proposed ensemble models, it is found that even when the predicted values of the three basic models differ, SAEnsemble and SBPEnsemble can obtain the same (correct) result. The results obtained by the ensemble models are based not only on the basic model with the highest accuracy, InceptionV3, but also on the two models with lower accuracy, Xception and MobileNetV2, as shown in the second row of Fig. 15. This proves that although an ensemble model is likely to rely on the model with the highest accuracy, it also learns from the other two models with lower accuracy and thereby obtains better results, indicating that the ensemble model is effective.

Comparison of the proposed model with existing methods
On the D0 dataset, [26] proposed a twelve-layer CNN model that achieved a classification accuracy of 95.97%. Xie et al. [33] created an automated system for crop pest classification. A performance comparison of SAEnsemble, SBPEnsemble, and other methods on the SMALL and IP102 datasets is presented in Table 7. The two proposed ensemble models have the highest accuracy. Compared with other pest classification work, such as GoogLeNet, the two ensemble models proposed in this paper have fewer parameters and higher accuracy. The two-stage model is too complex, so its operability is not as good as that of the ensemble models. Although the transformer has the highest accuracy, its large number of parameters means it cannot be deployed on mobile devices, so its practicability is poor. The two ensemble models proposed in this paper have a simple structure, can effectively improve the accuracy of a basic model, and can be deployed on devices. When facing a complex agricultural environment, the appropriate ensemble method can be selected to improve the accuracy of pest classification.

Limitations of the ensemble model
The two ensemble models cannot achieve high accuracy when the classification accuracy of the basic models is low, as shown in Fig. 16 for classes 5-7, 32, and 35. Classes 5-7 are pests with small targets and no obvious characteristics; classes 32 and 35 are too similar to classes 18 and 0, respectively. When the classification accuracy of the basic models is very low, the ensemble results are poor. Whether the ensemble is linear or nonlinear, the final result depends on the accuracy of the basic models, which is the biggest limitation of the ensemble method.

Conclusion
This paper proposed and compared two ensemble models, SAEnsemble and SBPEnsemble, for crop pest classification. Experimental results on the D0 dataset of 4508 crop pest images show that SAEnsemble and SBPEnsemble achieved high accuracies of 95.54% and 96.18% respectively, which are 0.85% and 1.49% higher than the best basic model. In addition, different basic models have different classification performance across pest classes: Xception is suitable for pests of the beetle class and MobileNetV2 for pests of the moth class. Different ensemble models can be selected according to actual agricultural production conditions: the linear ensemble is suitable when the pest profile is distinguishable from the background, and the nonlinear ensemble when the pest profile is similar to the background. Overall, the performance of the nonlinear ensemble is better than that of the linear ensemble. In the future, this work can serve as a guide to aid decision-making by farmers and help prevent pest transmission.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.