Fast Self-Attention Deep Detection Network Based on Weakly Differentiated Plant Nematodes

Background: High-precision, high-speed detection and classification of weakly differentiated targets has long been a difficult problem in computer vision. This paper takes as an example the detection of the phytopathogenic Bursaphelenchus xylophilus, which is small and shows very weak inter-species differences. Methods: Our work addresses current problems in weakly differentiated target detection: a. To replace the complex network labelling and training process based on expert empirical knowledge, we propose a lightweight Self-Attention network. Experiments show that the key feature identification areas of plant nematodes found by our Self-Attention network are in good agreement with the empirical knowledge of customs experts, and that the feature areas found by our method yield higher detection accuracy than the expert knowledge; b. To reduce the computational cost caused by inputting the entire image, we use low-resolution images to quickly obtain the coordinates of the key feature locations, and then obtain high-resolution feature area information based on those coordinates; c. We adopt a multi-feature joint detection method with adaptive weights based on the brightness of the heatmap to further improve the detection accuracy; d. We constructed a more complete high-resolution training dataset covering 24 species, including Bursaphelenchus xylophilus and other common, easily confused species, with more than 10,000 images in total. Results: The proposed algorithm replaces the tedious and extensive manual labelling in the training process, reduces the average training time of the model by more than 50%, reduces the testing time of a single sample by about 27%, reduces the model storage size by 65%, improves the detection accuracy of the ImageNet pre-trained model by 12.6%, and improves the detection accuracy of the no-ImageNet pre-trained model by more than 48%.
Conclusions: Overall, the method in this paper has achieved better results in terms of model complexity, training annotation, as well as testing time and accuracy.

harmful to vegetation; on the right is Bursaphelenchus mucronatus (B. mucronatus), which is harmless to vegetation. The overall morphology and characteristics of these two nematodes are very similar and difficult to distinguish. As shown in Figure 2, the key parts of different species of plant nematodes show some variability, but the degree of differentiation is very small [5]. Traditional morphological identification methods therefore do not meet the accuracy requirements [6], while manual identification and molecular biology methods such as PCR and DNA barcoding suffer from low identification efficiency and high cost [7,8].
In recent years, with the rapid development of deep learning, convolutional neural networks have become significantly more effective than traditional image detection methods [9]. Attention mechanisms make it easier for a neural network to focus on fine-grained detail, which improves the accuracy of image classification under weak differences [10].
Weakly differentiated image classification often requires two steps: step 1. Manually label a large amount of data using expert empirical knowledge and train a key attention feature model based on Faster R-CNN [11], YOLO [12], or another general-purpose neural network; step 2. Using the fine-grained feature information obtained from the key attention feature model, train and test a classifier to obtain the final result [13]. This procedure has several problems: 1. Industry expert knowledge is difficult to obtain and contains a certain amount of error, which affects the detection accuracy [14]; 2. An attention feature model trained on manual annotations of the original high-resolution images and based on networks such as Faster R-CNN leads to a substantial increase in model storage size, testing time, and training time, which seriously reduces training and testing efficiency [15].
To address the above problems, this paper takes the detection of phytopathogenic B. xylophilus, which is small and shows very weak inter-species differences, as an example [16]. Our work focuses on the weak inter-species differences and on the problems that arise in achieving high-precision, high-speed detection: a. We constructed an open-source dataset of high-resolution microscopic images of plant nematodes taken with a Zeiss professional microscope. The images have a uniform resolution of 1388*1040 and cover 24 species of common plant nematodes, with a total of 11,237 images. To the best of our knowledge, this is the largest and most diverse plant nematode dataset available; b. We constructed a Self-Attention feature network based on low-resolution images, which quickly and adaptively finds the key feature locations of the plant nematodes to be detected without relying on expert empirical knowledge. Experiments verify that the key feature identification areas of B. xylophilus found by this network match the empirical knowledge provided by customs experts extremely well; c. For the high-resolution feature information obtained through the Self-Attention feature network, we adopted a multi-feature joint recognition method based on adaptive weights derived from the heatmap.
Compared with the expert empirical knowledge input method, the final detection accuracy is improved from 90% to 99%, the storage size of the detection model is reduced by 65%, and the testing time is reduced by 27%. Compared with other mainstream convolutional neural networks, the proposed model not only eliminates the complicated input of expert prior knowledge, but also achieves the best model size, testing speed, and accuracy simultaneously.

Dataset
In this paper, we constructed what is, to the best of our knowledge, the largest and most diverse plant nematode dataset available.
The dataset covers 24 species of nematodes that are easily confused with B. xylophilus, and contains 11,237 images in total. All images were taken with a Zeiss Axio Imager Z1 professional microscope at an objective magnification of 40X and a uniform image resolution of 1388*1040 [17]. Figure 3 shows examples of the different kinds of common plant nematodes collected.

Traditional detection methods
The process of the traditional detection method is shown in Figure 4 and has the following problems: 1. When the entire image is input, there is too much non-critical redundant information, which interferes with the extraction and judgment of key features by the neural network and in turn lowers the detection accuracy [18]; 2. The computation is time-consuming. To address these problems, researchers have combined expert empirical knowledge with extensive labelling and training to construct neural networks based on a feature-area attention mechanism [19], which extract key feature areas to improve detection accuracy. The following figure shows the combined architecture of the Faster R-CNN feature extraction network and the VGG19 [20] classification network. This method largely eliminates redundant information in non-critical areas of the image and improves the detection accuracy, but the following problems remain: 1. Because it relies on expert empirical knowledge, a large amount of manual annotation is required in the early stage, and the input of the feature network is still the entire high-resolution image, so both labelling and computation are very time-consuming; 2. Expert experience is not easy to obtain and may itself be mistaken, which partly reduces the detection accuracy.
Therefore, in order to find a better method that can simultaneously solve the problems of manual labelling, long calculation time, large model complexity, and lack of expert knowledge in the network, the following method is proposed in this paper.

Methods
In this paper, a deep neural network based on fast Self-Attention inference is proposed, as shown in Figure 6. Compared with the traditional method, the overall network contains three main improvements. Firstly, we construct a search network based on Self-Attention features to replace the tedious labelling and training based on expert empirical knowledge, as shown in Part A below. Secondly, the computational overhead of finding key features in the whole image is reduced by fast down-sampling of the high-resolution input, as shown in Part B below. Finally, using the coordinates of the key feature areas obtained in Part B, we obtain the high-resolution information of the feature areas; the joint input of multiple feature areas with adaptive weights then achieves high-precision, high-speed classification, as shown in Part C below.

Self-Attention feature network
For any given high-resolution plant nematode image, we need to eliminate the influence of non-critical redundant information as much as possible to help the neural network classify it.
Generally, additional attention feature recognition networks are trained on top of general-purpose neural networks such as Faster R-CNN, with the help of expert empirical knowledge and extensive, time-consuming labelling. Figure 7 (green boxes: correct detections; red boxes: incorrect detections) shows the key features identified by an attention network trained on the labelled data. The detection network makes a certain number of recognition errors. As the complexity of the network model increases, its false detection rate decreases, but this comes at the cost of a larger model, a more complex network, and longer computation time.
The following figure compares the key feature regions obtained from the Self-Attention feature network at different numbers of iterations with the features from the expert knowledge network:
Fig.9 Comparison of output results between the Self-Attention feature network and expert knowledge feature network
According to Figure 9, as the number of iterations increases, the key feature areas output by the Self-Attention feature network become closer and closer to the expert manual output.
When the loss has fully converged, the attention areas obtained by the method in this paper yield higher classification accuracy than the expert knowledge. In a sense, the method finds feature areas that are better than those given by expert experience, and it also significantly reduces network complexity and training time. The results are shown as follows.
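How key feature areas are read off an attention heatmap can be illustrated with a small sketch. The text does not specify the selection rule, so the following assumes a simple one: take the k brightest heatmap cells while skipping cells too close to an already selected peak. The function name `top_k_peaks` and the parameters `k` and `min_dist` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Heatmap = List[List[float]]


def top_k_peaks(heatmap: Heatmap, k: int, min_dist: int) -> List[Tuple[int, int]]:
    """Return up to k (row, col) positions of the brightest heatmap cells.

    Assumption (not from the paper): a cell is skipped when its Chebyshev
    distance to an already selected peak is smaller than min_dist, so the
    selected peaks fall on distinct regions.
    """
    cells = [
        (value, r, c)
        for r, row in enumerate(heatmap)
        for c, value in enumerate(row)
    ]
    # Brightest first; Python's sort is stable, so ties keep row-major order.
    cells.sort(key=lambda cell: cell[0], reverse=True)

    peaks: List[Tuple[int, int]] = []
    for _, r, c in cells:
        if all(max(abs(r - pr), abs(c - pc)) >= min_dist for pr, pc in peaks):
            peaks.append((r, c))
            if len(peaks) == k:
                break
    return peaks


# Toy 4x5 heatmap with two bright regions.
heatmap = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.3, 0.0, 0.1],
    [0.1, 0.3, 0.2, 0.1, 0.6],
    [0.0, 0.0, 0.1, 0.5, 0.8],
]
peaks = top_k_peaks(heatmap, k=2, min_dist=2)
print(peaks)
```

In this toy example the two selected peaks are the brightest cell of each bright region rather than two neighbouring cells of the same region.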

Small-scale fast focus method
With the Self-Attention feature network in Section 3.1, we can find the attention feature areas of interest, but that method uses high-resolution images as input, which carry highly redundant image information and lead to long testing times. In this paper, we instead use low-resolution images obtained by multiple down-sampling steps to find the key feature areas, and map the coordinates of the key feature areas found back onto the high-resolution images to obtain high-resolution key feature areas. This further reduces the computational cost of the network, as shown in the picture below. The rectangular box around a key feature point $(P_x, P_y)$ is given by its four corners:

$$Area = \begin{bmatrix} (P_x - a,\ P_y - b) & (P_x + a,\ P_y - b) \\ (P_x - a,\ P_y + b) & (P_x + a,\ P_y + b) \end{bmatrix}$$

where $Area$ represents the whole rectangular box, and $a$ and $b$ are user-defined variable parameters.
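The coordinate mapping from the low-resolution image back to the high-resolution image can be sketched as follows. This is a minimal illustration under stated assumptions: the down-sampling factor is an integer `scale`, a low-resolution pixel maps to the centre of its corresponding block in the high-resolution image, and the rectangular box is centred on the mapped point with half-width `a` and half-height `b`. The function names and the example values of `scale`, `a`, and `b` are hypothetical; only the 1388*1040 resolution comes from the paper.

```python
from typing import Tuple


def map_to_high_res(point_low: Tuple[int, int], scale: int) -> Tuple[int, int]:
    """Map an (x, y) pixel of the down-sampled image to the centre of the
    matching scale-by-scale block in the high-resolution image."""
    x, y = point_low
    return x * scale + scale // 2, y * scale + scale // 2


def crop_box(
    center: Tuple[int, int], a: int, b: int, width: int, height: int
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the box with corners
    (P_x ± a, P_y ± b), clipped to the image borders."""
    px, py = center
    left = max(px - a, 0)
    top = max(py - b, 0)
    right = min(px + a, width - 1)
    bottom = min(py + b, height - 1)
    return left, top, right, bottom


WIDTH, HEIGHT = 1388, 1040  # image resolution stated in the paper
scale = 8                   # hypothetical: three successive 2x down-samplings

# A key point found at (100, 60) in the low-resolution image.
p_high = map_to_high_res((100, 60), scale)
box = crop_box(p_high, a=112, b=112, width=WIDTH, height=HEIGHT)
print(p_high, box)
```

Clipping to the image borders is a design choice of this sketch: it keeps boxes near the edge valid at the cost of making them smaller than 2a by 2b.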
The experimental results with different area sizes and numbers of down-sampling steps are shown in Figure 11A below, which identifies the setting that gives the best recognition results.
The loss function used for the neural network is the cross-entropy loss function [21]. For a sample $i$, its expression is as follows:

$$L_i = -\sum_{c=1}^{M} y_{ic} \log(p_{ic})$$

where $M$ is the number of categories, $y_{ic}$ is an indicator variable that is 1 if the true category of sample $i$ is $c$ and 0 otherwise, and $p_{ic}$ is the predicted probability that sample $i$ belongs to category $c$.
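A small worked example may make the cross-entropy loss concrete. The sketch below evaluates the per-sample loss with a one-hot indicator and averages it over a toy batch; the probabilities and labels are made up for illustration, and the `eps` guard against log(0) is an implementation detail of this sketch, not something stated in the paper.

```python
import math
from typing import List, Sequence


def cross_entropy(probs: Sequence[float], true_class: int, eps: float = 1e-12) -> float:
    """Per-sample cross-entropy: -sum_c y_c * log(p_c) with one-hot y."""
    one_hot = [1.0 if c == true_class else 0.0 for c in range(len(probs))]
    # eps avoids log(0) for classes with zero predicted probability.
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))


def mean_cross_entropy(batch_probs: List[Sequence[float]], labels: List[int]) -> float:
    """Average the per-sample loss over a batch."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, labels)]
    return sum(losses) / len(losses)


# Toy batch: two samples, three categories (illustrative values only).
batch_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
loss = mean_cross_entropy(batch_probs, labels)
print(round(loss, 4))
```

Because the indicator is one-hot, each per-sample loss reduces to the negative log-probability assigned to the true category, so a confident correct prediction contributes a loss close to zero.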
The following figure compares the detection accuracy and the weight ratios obtained with heatmap-adaptive weights and with equal weights. For every number of jointly input features tested, the adaptive weights based on the average brightness of the heatmap further improve the detection accuracy compared with the equal-weight method.
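The idea of brightness-based adaptive weights can be sketched as below. The exact weighting and fusion rules are not spelled out in the text, so this sketch assumes that each feature area receives a weight proportional to the mean brightness of its heatmap patch, and that the per-area class probabilities are combined as a weighted sum. All names and numbers are hypothetical.

```python
from typing import List

Patch = List[List[float]]


def mean_brightness(patch: Patch) -> float:
    """Average heatmap value over all pixels of a patch."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


def adaptive_weights(heatmap_patches: List[Patch]) -> List[float]:
    """Weights proportional to mean heatmap brightness, normalised to sum to 1.

    Falls back to equal weights if every patch is completely dark.
    """
    brightness = [mean_brightness(p) for p in heatmap_patches]
    total = sum(brightness)
    if total == 0:
        return [1.0 / len(brightness)] * len(brightness)
    return [b / total for b in brightness]


def fuse(region_probs: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted sum of the class-probability vectors of all feature areas."""
    n_classes = len(region_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, region_probs))
        for c in range(n_classes)
    ]


# Three feature areas with decreasing heatmap brightness (toy values).
patches = [
    [[0.6, 0.6], [0.6, 0.6]],
    [[0.3, 0.3], [0.3, 0.3]],
    [[0.1, 0.1], [0.1, 0.1]],
]
# Class probabilities predicted from each area for two categories.
region_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]

weights = adaptive_weights(patches)
fused_adaptive = fuse(region_probs, weights)
fused_equal = fuse(region_probs, [1 / 3] * 3)
print(weights, fused_adaptive, fused_equal)
```

In this toy case the brightest area dominates the fused prediction, whereas equal weights let the dim, less reliable area pull the result toward the other category.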

Experiment environment match
The experimental environment in this paper is an Intel Core i7-8700K 3.70 GHz CPU (6 cores, 12 threads), an NVIDIA RTX 2080 Ti GPU with 11 GB of video memory, and 32 GB of RAM, configured with the TensorFlow open-source deep learning framework on Ubuntu Linux 16.04 [22].
To measure model performance, the accuracy rate was selected as the main evaluation metric in this paper [23]. For the training hyperparameters, we use an adaptive iterative learning rate, with the momentum set to 0.9 and the weight decay set to 0.0005.
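To show how the momentum and weight decay values enter a parameter update, here is a minimal sketch of one step of stochastic gradient descent with momentum and L2 weight decay. The paper does not name the optimizer or the learning-rate schedule, so the update rule, the function name `sgd_momentum_step`, and the learning rate of 0.1 are assumptions for illustration; only the values 0.9 and 0.0005 come from the text.

```python
from typing import List, Tuple


def sgd_momentum_step(
    weights: List[float],
    grads: List[float],
    velocity: List[float],
    lr: float,
    momentum: float = 0.9,        # value stated in the paper
    weight_decay: float = 0.0005, # value stated in the paper
) -> Tuple[List[float], List[float]]:
    """One SGD step with classical momentum and L2 weight decay.

    Assumed update rule:
        g' = g + weight_decay * w
        v  = momentum * v - lr * g'
        w  = w + v
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g_total = g + weight_decay * w      # gradient of the L2 penalty added
        v_next = momentum * v - lr * g_total
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity


# One parameter, hypothetical gradient and learning rate.
w, v = [1.0], [0.0]
w, v = sgd_momentum_step(w, grads=[0.5], velocity=v, lr=0.1)
print(w, v)
```

With an adaptive learning-rate schedule, `lr` would simply change between calls while the momentum and weight decay stay fixed.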

Dataset
The plant nematode dataset used for training and testing in this paper includes 24 species of plant nematodes, with a total of 11,237 images at a uniform resolution of 1388*1040, of which 9,126 are used for training and 2,111 for testing.
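The split sizes can be checked with a short sketch. The paper gives only the counts, not how images were assigned to each subset, so the seeded random shuffle below is an assumption; `split_indices` is a hypothetical helper.

```python
import random
from typing import List, Tuple


def split_indices(n_total: int, n_train: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle image indices with a fixed seed and split them into
    training and testing subsets (assumed procedure, not from the paper)."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


N_TOTAL, N_TRAIN = 11237, 9126  # counts stated in the paper
train_idx, test_idx = split_indices(N_TOTAL, N_TRAIN)

train_fraction = len(train_idx) / N_TOTAL
print(len(train_idx), len(test_idx), round(train_fraction, 3))
```

The counts correspond to roughly 81% of the images for training and 19% for testing. In practice a split stratified by species may be preferable, so that all 24 species appear in both subsets.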

Results of Self-Attention network
The Self-Attention network used in this paper further improves the detection accuracy while effectively

Efficiency and accuracy improvement
The method in this paper avoids a large amount of manual labelling and reduces the time cost of data labelling. Meanwhile, the Self-Attention network is more lightweight; combined with the fast search on low-resolution images and the multi-feature joint detection with adaptive weights, it further reduces the model size, training time, and testing time while preserving the final accuracy of the algorithm.

Table 3 (partially recovered values): 0.995; No-ImageNet: 0.98
From the experimental results in Figure 15 and Table 3 above, it can be concluded that the method in this paper avoids the tedious process of extensive manual labelling. It reduces the average training time of the model by more than 50%, reduces the testing time of a single sample by about 27%, and reduces the model size by 65%. The detection accuracy of the ImageNet pre-trained model is improved by 12.6%, and the detection accuracy of the no-ImageNet pre-trained model is improved by more than 48%.

Conclusion
This paper takes as an example the detection of the phytopathogenic Bursaphelenchus xylophilus, which is small and shows very weak inter-species differences, and addresses current problems in weakly differentiated target detection. We propose a simple, lightweight Self-Attention network to replace the complex network labelling and training process based on expert empirical knowledge, and verify its effectiveness through experiments. We further improve the detection speed and accuracy with a fast focus network based on low-resolution images and with multi-feature joint detection based on heatmap brightness weights. Compared with other mainstream convolutional neural networks, our method improves, to varying degrees, model complexity, training labelling, testing time, and accuracy. In addition, the proposed method has strong applicability and generalization ability for other fields of fine-grained, weakly differentiated classification and recognition based on deep learning.