Beet seedling and weed recognition based on convolutional neural network and multi-modality images

Difficulties in the recognition of beet seedlings and weeds can arise from the complex background of the natural environment and the lack of light at night. In the current study, a novel deep fusion algorithm based on visible and near-infrared imagery was proposed. In particular, visible (RGB) and near-infrared images were superimposed at the pixel level via the deep fusion algorithm and subsequently fused into three-channel multi-modality images in order to characterize the edge details of beets and weeds. Moreover, an improved region-based fully convolutional network (R-FCN) model was applied in order to overcome the geometric modeling restriction of traditional convolutional kernels. More specifically, in the convolutional feature extraction layers, deformable convolution was adopted to replace the traditional convolutional kernel, allowing the network to extract more precise features. In addition, online hard example mining was introduced to excavate the hard negative samples in the detection process so that misidentified samples could be retrained. A total of four models were established via the aforementioned improvements. Results demonstrate that the average precision of the improved optimal model was 84.8% for beets and 93.2% for weeds, while the mean average precision was improved to 89.0%. Compared with the classical R-FCN model, the performance of the optimal model was greatly improved while the number of parameters was not significantly expanded. Our study can provide a theoretical basis for the subsequent development of intelligent weed-control robots under weak-light conditions.


Introduction
The presence of weeds in the field can cause great damage to crop seedlings. More specifically, weeds compete with crops for sunlight and nutrients, thus seriously affecting the photosynthesis of seedlings and increasing the spread of diseases and insect pests.
The main contributions of this study are as follows. (1) A deep fusion algorithm was adopted to fuse the RGB (Red, Green and Blue) and near-infrared (NIR) images of beets and weeds obtained under weak light conditions into three-channel multi-modality images, which were then sent into the CNN for training. (2) In the feature extraction layer, the traditional convolutional kernel was replaced by deformable convolution. (3) The hard negative samples were fully excavated by using online hard example mining (OHEM) in the detection process, and were then sent back into the network for retraining.
The structure of the rest of this paper is as follows. Section 2 introduces the related work. Section 3 introduces the materials and methods of this paper, including the data source and the improved weed and beet detection model. Section 4 introduces the experimental environment. Section 5 presents and discusses the experimental results. Finally, the conclusions of this work are drawn in Section 6.

Related work
At present, most weed detection methods are developed based on machine vision. Zhao et al. [35] proposed a weed classification method based on a back-propagation (BP) neural network; following the fuzzy classification of the features, a genetic algorithm was used to optimize the network for the identification of weeds. Yan et al. [32] designed a machine-vision method to identify weeds during the maize seedling stage: after distortion correction, HSI (Hue, Saturation and Intensity) color space conversion and threshold segmentation, the collected images of maize plants and weeds were identified according to shape and color features. Bakhshipour et al. [6] designed a weed segmentation network based on an artificial neural network; the single-stage wavelet transform was used to extract weed texture features, 14 of which were selected via principal component analysis to optimize the algorithm, and the features were finally sent to the neural network to identify the weeds. Akbarzadeh et al. [2] used a Gaussian support vector machine to classify corn and weeds under laboratory conditions, and compared its accuracy with a traditional data-aggregation method based on the discrete normalized difference vegetation index at two different wavelengths. Abouzahir [1] improved weed detection performance by using the histogram of oriented gradients (HOG): the accuracy of weed detection using a back-propagation neural network reached 71.2%~83.3%, which is 37.6% higher than the traditional HOG algorithm. The aforementioned studies combine shallow feature extraction and pattern recognition to identify weeds. However, the feature extraction of such methods is time consuming and their applicability is weak.
In addition, due to the influence of the complex field background, the weed characteristics extracted by humans can be ambiguous and uncertain, which consequently limits weed identification based on traditional machine vision to low accuracies.
Recently, many networks based on pre-trained CNNs have achieved promising results in weed and crop seedling detection. Jiang et al. [13] proposed a weed identification method based on a deep CNN and hash codes, which was able to effectively compress the high-dimensional features of the weed through a binary hash layer in order to detect weeds. Andrea et al. [4] used a CNN-based classification method that identified maize seedlings and weeds by optimizing the number of convolutional kernels of the original classification network. Huang et al. [12] proposed a fully convolutional network method, applying transfer learning to improve feature extraction and a skip-architecture structure for network optimization to detect weeds. Results from the aforementioned literature demonstrate that CNNs can not only automatically extract the shallow features (texture, color, etc.) of weeds and crops, but can also learn deeper abstract features. Moreover, CNNs reduce the cost of feature extraction and are more robust for weed detection in a complex environment. Therefore, CNNs have the potential to be applied to the detection of beet seedlings and weeds. However, the models in the current literature are not sensitive to the feature information of weeds under the weak light at night, and their use of traditional convolutional kernels results in feature extraction difficulties and low recognition accuracies.

Data source
In order to investigate the detection and identification of beets and weeds in complex backgrounds, images of beets and weeds collected at the University of Bonn, Germany, in 2016 were used. More details about the dataset are available at http://www.ipb.uni-bonn.de/data/sugarbeets2016/. The images were collected via a multi-modality camera (JAI AD-130GE) equipped with two high-sensitivity 1.3-megapixel CCD multispectral sensors. The camera can simultaneously collect visible (400 nm~650 nm) and near-infrared (NIR) (760 nm~1000 nm) images, with an output image size of 1296 × 1296 pixels [18]. The dataset contains a total of 2,093 images of beets and weeds at different growth stages. In the process of data acquisition, beet seedlings and weeds with different levels of maturity and under varying angle transformations were considered. Moreover, the same plant (beet or weed) was imaged multiple times under different degrees of overlap and occlusion between beets and weeds. Some image examples are shown in Fig. 1. Since the original images were collected in a low-light environment and are difficult to visualize, the images shown in this article have been enhanced in exposure and brightness.

Multi-modality image fusion
In general, object detection methods based on deep learning aim to understand the distribution of the underlying data via a large amount of training data and subsequently induce the optimizer to adjust the parameters of the network [26]. At present, RGB images of weeds and beets are commonly used to train deep learning models. However, RGB images are sensitive to variations in light, resulting in the loss of important information on the shape, color and texture of target objects [33]. Therefore, the performance of such models is poor under complex backgrounds at night.
In order to solve this problem, in the current study, the NIR and visible images of beets and weeds were fused into multi-channel images. In particular, multi-modality image fusion spatially registers the data of the same image from different sources, and subsequently combines the information in each image to generate an integrated data set of all the images [24].

Deep fusion algorithm frame diagram
The visible and near-infrared images (Input1 and Input2) are encoded by a denseblock composed of convolutional layers, and then sent to the fusion layer for pixel-level superposition. The fused features are reconstructed through a decoding network, also composed of convolutional layers, to obtain the three-channel multi-modality images. As shown in Fig. 2, the encoding network consists of two sections: C1 (a convolutional layer) and the denseblock. In the fusion layer, the l1-norm fusion strategy is selected to fuse the visible and near-infrared feature maps. Finally, four convolutional layers (3 × 3 convolutional kernels) are used to reconstruct the final fusion image in the decoding network. In this framework, the size of the input and output images is 1296 × 1296 pixels, and the number of feature-mapping channels per convolutional layer is 16. More details of the deep fusion algorithm can be found in [14]. Figure 3 presents the three-channel multi-modality image obtained via the deep fusion algorithm. Following the fusion of the data, the LabelImg software (https://tzutalin.github.io/labelImg/) was used by experts in the agricultural field to label the beet seedlings and weeds following the PASCAL Visual Object Classes Challenge format [10]. A Python script was then used to randomly divide the images and the corresponding label files into training and testing sets at a ratio of 4:1.
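The l1-norm weighting at the heart of the fusion layer can be illustrated with a minimal NumPy sketch (an illustration of the strategy described in [14], not the authors' implementation; the function name and toy feature-map shapes are assumptions):

```python
import numpy as np

def l1_norm_fuse(feat_vis, feat_nir, eps=1e-8):
    """Fuse two encoder feature maps of shape (C, H, W) with an l1-norm
    strategy: the per-pixel activity is the l1-norm across channels, and the
    fused map is the activity-weighted sum of the two inputs."""
    a_vis = np.abs(feat_vis).sum(axis=0)   # (H, W) activity of the visible branch
    a_nir = np.abs(feat_nir).sum(axis=0)   # (H, W) activity of the NIR branch
    w_vis = a_vis / (a_vis + a_nir + eps)  # per-pixel weights in [0, 1]
    w_nir = 1.0 - w_vis
    return w_vis[None] * feat_vis + w_nir[None] * feat_nir
```

A pixel dominated by strong visible-band activations thus keeps mostly visible features, while dark regions lean on the NIR branch, which is why the fused image retains edge detail under weak light.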

Deformable convolution
Recently, the use of CNNs has made significant breakthroughs in many vision applications. However, due to the regular grid sampling and the fixed geometric structure of traditional convolution, it is difficult for networks to deal with geometric deformations. The ability of existing models to handle the geometric deformation of objects comes almost entirely from the diversity of the data itself; there is no mechanism within the model to adapt to geometric deformation. Thus, the ability to model geometric transformations is limited and cannot be adjusted adaptively according to the image content [8]. In order to overcome this limitation, a new module, deformable convolution, was adopted in the current study to improve the transformation modeling ability of the CNN. More specifically, an offset variable is added to the position of each sampling point of the convolutional kernel. The kernel with the offset variable can then sample freely near the current position, and is thus not restricted to the previous regular grid points. Moreover, the offset variables can be learned within the target task without any additional supervision signal, improving on traditional convolution. Figure 4 shows the sampling methods of traditional convolution and deformable convolution with a convolutional kernel of size 3 × 3. Figure 4(a) demonstrates the regular sampling grid (green points) of traditional convolution, while (b) presents the deformed sampling locations (black points) with augmented offsets (blue arrows) in deformable convolution. Panels (c) and (d) are special cases of (b), showing that deformable convolution generalizes scale, aspect-ratio and rotation transformations. Figure 5 presents the internal structure of deformable convolution.
First, the displacements required for deformable convolution are obtained as the output of a small convolutional layer, and these displacements are then applied to the convolutional kernel in order to achieve the deformable effect. This operation adds the offsets to the regular grid sampling locations of the standard convolution, thus enabling free-form deformation of the sampling grid. The offsets are learned from the preceding feature maps via additional convolutional layers. In deformable convolution, the regular grid R is augmented with offsets {Δp_n | n = 1, 2, …, N}, where N = |R|; thus, formula (1) becomes:

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n)    (2)

As the offset Δp_n is typically fractional, formula (2) is implemented via bilinear interpolation [8].
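Formula (2) and its bilinear interpolation can be sketched as follows (a minimal NumPy illustration for a single output location and a 3 × 3 kernel; the function names and the explicit offset list are illustrative, not the authors' code):

```python
import numpy as np

def bilinear(x, py, px):
    """Sample a 2-D feature map x at a fractional location (py, px)."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    y0, x0 = max(y0, 0), max(x0, 0)
    dy, dx = py - np.floor(py), px - np.floor(px)
    return ((1 - dy) * (1 - dx) * x[y0, x0] + (1 - dy) * dx * x[y0, x1]
            + dy * (1 - dx) * x[y1, x0] + dy * dx * x[y1, x1])

def deformable_conv_at(x, weights, p0, offsets):
    """Evaluate formula (2) at one output location p0 for a 3x3 kernel:
    y(p0) = sum_n w(p_n) * x(p0 + p_n + delta_p_n)."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # regular grid R
    y = 0.0
    for (dy, dx), w, (oy, ox) in zip(grid, weights.ravel(), offsets):
        y += w * bilinear(x, p0[0] + dy + oy, p0[1] + dx + ox)   # offset sampling
    return y
```

With all offsets set to zero the computation reduces to an ordinary convolution sample, which is the sense in which deformable convolution generalizes the standard kernel.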

The network structure of the improved model
In the target detection task, it is necessary to both classify and locate targets. The classification task requires translation invariance so that an object can be classified at different positions, while localization requires translation variance so that the target position can be located precisely [7]. In order to balance the two tasks, location information is fused by constructing position-sensitive score maps, and all information is combined by applying an RoI pooling layer to these score maps. The network structure of the improved model is shown in Fig. 6 (key idea of the improved R-FCN for weed and beet detection). Following the input of the image, feature extraction is performed, resulting in a feature map of k² × (C + 1) dimensions, with C denoting the number of categories. The region proposal network (RPN) [23] is then used to extract the regions of interest (RoIs) from the feature map. Each extracted RoI is divided into k × k regions, with k generally equal to 3, corresponding to 9 regions: top-left, top-center, …, and bottom-right. Finally, the score of each region is determined by a pooling operation, and the output feature vector of the RoI is obtained by voting. This vector is subsequently used for the classification and regression of the weeds and beets.
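The position-sensitive pooling and voting step described above can be sketched as follows (a simplified NumPy illustration of the R-FCN idea; the integer bin boundaries and the function signature are assumptions, not the authors' implementation):

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k=3, num_classes=1):
    """Position-sensitive RoI pooling: score_maps has k*k*(C+1) channels;
    bin (i, j) of the RoI average-pools ONLY its own channel group, and the
    final per-class score is the vote (mean) over the k*k bins."""
    c1 = num_classes + 1                        # C object classes + background
    y0, x0, y1, x1 = roi
    bh, bw = (y1 - y0) / k, (x1 - x0) / k       # bin height and width
    scores = np.zeros(c1)
    for i in range(k):
        for j in range(k):
            ys, ye = int(y0 + i * bh), int(np.ceil(y0 + (i + 1) * bh))
            xs, xe = int(x0 + j * bw), int(np.ceil(x0 + (j + 1) * bw))
            group = (i * k + j) * c1            # channel group for this bin
            patch = score_maps[group:group + c1, ys:ye, xs:xe]
            scores += patch.mean(axis=(1, 2))   # average pool per class
    return scores / (k * k)                     # voting over the k*k bins
```

Because each bin reads from its own channel group (e.g. the top-left bin from the "top-left" score maps), the pooled score is sensitive to where parts of the object fall inside the RoI, which restores the translation variance that classification alone would discard.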

Equipment and platform
The Ubuntu 16.04 system was used as the operating platform, and MXNet was adopted as the deep learning framework to train the network. The computer has 32 GB of memory and a 3.6 GHz i7-9700K CPU. Additionally, an 11 GB GeForce GTX 1080 Ti GPU with the Pascal architecture was used.

Model parameter setting
In order to reduce the variance of parameter updates and to stabilize the convergence of the model, mini-batch stochastic gradient descent (SGD) was used to train the network [30]. The parameters were set as follows: the size of each mini-batch was 128, the momentum factor was fixed to 0.9, and the weight-decay factor was set to 0.0005 to avoid over-fitting. A staged learning-rate schedule was applied to all layers of the network: at each stage, the learning rate was reduced to 0.1 times its current value. Additionally, the initial learning rate during the training process was set to 0.005, and the model was iterated for 100 epochs.
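The staged decay can be sketched as a small schedule function (the stage boundaries at epochs 60 and 80 are illustrative assumptions; the text only states the 0.005 initial rate, the 0.1 decay factor, and the 100-epoch budget):

```python
def staged_lr(epoch, base_lr=0.005, decay_epochs=(60, 80), factor=0.1):
    """Staged step decay: start from the initial learning rate and multiply
    by the decay factor at each stage boundary that has been passed.
    The boundaries (60, 80) are assumed, not stated in the paper."""
    lr = base_lr
    for e in decay_epochs:
        if epoch >= e:
            lr *= factor
    return lr
```

In MXNet the same behavior is typically obtained with a built-in step scheduler rather than a hand-written function; the sketch only makes the arithmetic of the schedule explicit.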

Performance evaluation of the model
Precision and recall are widely used in the field of information retrieval. As with all machine learning problems, in order to calculate precision and recall, the following must first be defined: True Positive (TP), the number of positive samples predicted as positive; False Positive (FP), the number of negative samples predicted as positive; True Negative (TN), the number of negative samples predicted as negative; and False Negative (FN), the number of positive samples predicted as negative. Based on this, we define:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

The average precision (AP) is then calculated as the area under the precision-recall curve:

AP = ∫₀¹ p(r) dr

where p represents Precision, r represents Recall, and p is a function of r. The mean average precision (mAP) equals the average of the APs over all categories:

mAP = (1 / N) Σ_{c ∈ classes} AP_c

where classes represents the set of detected object categories, and N is the number of categories.
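The AP and mAP computations can be sketched as follows (a minimal NumPy implementation of the standard all-point interpolation used in PASCAL VOC evaluation; not the authors' evaluation script):

```python
import numpy as np

def average_precision(precisions, recalls):
    """AP as the area under the precision-recall curve: precision is first
    made monotonically decreasing (the VOC envelope), then integrated over
    the points where recall changes."""
    p = np.concatenate(([0.0], precisions, [0.0]))
    r = np.concatenate(([0.0], recalls, [1.0]))
    for i in range(len(p) - 2, -1, -1):        # envelope: p(r) non-increasing
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]         # recall change points
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP = mean of the per-class APs."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, averaging the paper's two per-class APs (0.848 for beets, 0.932 for weeds) with `mean_average_precision` reproduces the reported mAP of 0.890.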

The impact of multi-modality fusion images on mAP
Since the algorithm proposed in the current study only supports three-channel images as input, while the NIR image is composed of a single channel, RGB and multi-modality images (three-channel fusion images) were used in all ablation experiments for verification. As reported in Table 1, both the classical region-based fully convolutional network (R-FCN) model and our proposed model exhibited higher detection accuracies on the fusion data set than on the RGB data set. This is attributed to the high sensitivity of the shallow features of beets and weeds in visible images to the low light levels, as well as to the complex environment. In contrast, the near-infrared image is able to depict the thermal radiation of the target objects: the surface reflectivity of an object is completely different from that of the background, and the NIR image is thus more robust to variations in light [21]. However, as near-infrared images have a low spatial resolution and contain limited texture information, the visible and NIR images were fused into a multi-modality image. These fusion images were sent to the CNN for training. The final detection accuracy of the model greatly improved with the use of the fusion data set, and the detection system was also more robust.
The detection results were visualized to further demonstrate the performance of multi-modality images. As can be seen from Fig. 7, image triplet (a) demonstrates the detection result of the improved R-FCN model on the RGB data set, while image triplet (b) depicts the result of the same model on the multi-modality data set. Because the multi-modality image fuses the near-infrared and visible images, the feature information of beet seedlings and weeds was better characterized under the poor light and complex field backgrounds. Thus, the detection performance on the multi-modality fusion images was better than that on the RGB images. Hence, the subsequent ablation experiments were conducted on the fusion data set.

The impact of deformable convolution on mAP
Models 2 and 3 used deformable convolutions, while models 0 and 1 implemented traditional convolutions (Table 2). Compared with the traditional convolution models, the deformable convolution models improved the mAP for beets and weeds by 3-4 percentage points. This can be attributed to the offset variables of the deformable convolutional kernel, which allow the feature expression of the CNN to automatically adapt to changes in the morphology of the target object. Figure 8 depicts the detection results on the multi-modality data set. In Fig. 8(a), the detection results of the traditional convolution method miss some targets. The deformable convolution results (image triplet (b)) exhibit a higher detection accuracy and a lower missing rate than those of the traditional convolutional model (image triplet (a)). This results from the ability of the deformable convolution model to adaptively fit the irregular geometric edges of seedlings when detecting smaller target objects.

The impact of OHEM on mAP
OHEM retrains the hard samples with large loss values, as such samples may otherwise lead to the misclassification of weeds and beets. Compared with their counterparts without OHEM, the detection accuracies of models 1 and 3, which used the OHEM algorithm, were improved (Table 2). This indicates that the OHEM method can suppress the contribution of easy samples and make full use of the small number of hard samples, making the training process more efficient. In addition, the OHEM algorithm also eliminated several heuristics and hyper-parameters by automatically selecting hard examples, thus simplifying the training process [27].
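The core of OHEM can be sketched as follows (a minimal illustration of hard-example selection by loss ranking; the function name and batch size are illustrative, not the authors' code):

```python
import numpy as np

def select_hard_examples(losses, batch_size=128):
    """OHEM in a nutshell: after a forward pass over all candidate RoIs,
    sort them by their current loss and keep only the top-B hardest ones
    for the backward pass; easy examples contribute no gradient."""
    order = np.argsort(losses)[::-1]   # indices sorted by descending loss
    return order[:batch_size]
```

Because the selection is driven entirely by the per-RoI loss, the usual hand-tuned positive/negative sampling ratios and hardness thresholds become unnecessary, which is the simplification referred to in [27].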

Performance comparison of different algorithms
In order to verify the effectiveness of our model, we compared its performance with classical object detection algorithms (Faster R-CNN [23], RetinaNet [16]) and a recent algorithm (YOLOv5). All comparative experiments used the fusion images. As can be seen from Table 3, the detection performance of Faster R-CNN on weeds was relatively poor: although non-maximum suppression (NMS) was used for post-processing to avoid overlapping candidate boxes when the RPN generated proposals, the varying scales of weeds and their mutual occlusion made accurate detection difficult. The performances of RetinaNet and YOLOv5 were similar, with YOLOv5 achieving a higher detection accuracy for weeds and RetinaNet a higher detection accuracy for beets. The detection results were visualized to further demonstrate the performance of our model. As shown in Fig. 9, Faster R-CNN produced seriously repeated detection boxes in weed detection, and the same weed was identified as multiple targets. There was not much difference between the four algorithms in the detection of beets; due to the existence of very small beets, a certain number of targets were missed, so the overall accuracy for beets was lower than that for weeds. Our improved model showed the optimal performance due to the use of OHEM and deformable convolution.

Detection results of the optimal model
In order to verify the prediction results of the optimal model in an actual field environment, six images of beets and weeds were selected from the test set. As demonstrated in Fig. 10, with the use of deformable convolution and multi-modality fusion images, our improved model was able to maintain a high detection accuracy. The detection results for large beet and weed targets were the best, with classification confidences reaching 1. Small targets could also be detected successfully, with confidences above 0.99. This indicates that the proposed optimal model exhibits strong generalization and robustness for the detection of beet seedlings and weeds under poor light and complex field backgrounds.

Conclusions
In the current study, an improved R-FCN model was proposed to detect and identify beet seedlings and weeds under poor light (at night) and complex field backgrounds. Based on the classic R-FCN network, the visible and near-infrared images of beets and weeds were fused into three-channel multi-modality images at the pixel level using a deep fusion algorithm. The fusion images were then sent to the convolutional neural network for training, so as to improve the mean average precision of beet and weed detection. Furthermore, considering that the traditional convolutional kernel restricts the geometric modeling ability of the model, deformable convolution was adopted in the feature extraction layer. Moreover, online hard example mining was introduced to excavate the hard negative samples in the detection process and retrain the misidentified samples. Through the aforementioned improvements, the average precision for beets and weeds reached 84.8% and 93.2% respectively, and the mean average precision was increased from 82.3% to 89.0%. Compared with the original model, the detection accuracy was improved by approximately 7 percentage points. Our results demonstrate that the performance of the optimized model is greatly improved while the number of model parameters is maintained at a reasonable level. The model can be compressed and deployed to industrial equipment, laying a research foundation for automatic weeding or spraying robots. In the future, such robots could be built to work continuously throughout the day by collecting images under different lighting conditions for model learning.