The results of weed segmentation are shown in Figure 9. The leaf age was determined from the number of complete leaves on the weed, and the accuracy of leaf-age recognition was evaluated by comparing this value with the corresponding label recorded at the time of collection. The plant centre was determined from the intersection area of the top leaves of the weed [7]. Six instance segmentation algorithms were compared, the best-performing algorithm was selected, and its hyperparameters were tuned. The models were trained on two datasets with two different backbone networks (ResNet50 and ResNet101), and the network achieving the best balance between mIOU and mAP was selected. The segmentation performance of the best-performing algorithm was then evaluated under different shooting angles and leaf ages. The results indicate that data enhancement improves model performance. The AP50 and mIOU of the BlendMask model using ResNet101 as the backbone combined with FPN are 0.720 and 0.607, respectively; these values are better than those of the ResNet50 configuration, and thus the former model can be used for weed segmentation. Moreover, weed images captured from the top view exhibited higher detection accuracy than those captured from the other two angles.
3.1 Comparison of instance segmentation models
To verify the effectiveness of the proposed method for weed segmentation, six instance segmentation algorithms, namely Mask R-CNN, SOLO [55], PolarMask [56], CentreMask [57], YOLACT [51], and BlendMask, were compared. Each algorithm was trained on two datasets (with and without data enhancement). To examine the recognition performance of the models in a complex field environment, the test images were not subjected to any data enhancement. The test results are shown in Figure 10.
Figure 10(a) shows the F1 values of the six instance segmentation networks. The two datasets (enhanced and unenhanced) were used to train each instance segmentation model, and the validation set was then used to evaluate weed segmentation performance. According to Figure 10(a), the performance of the six instance segmentation models on the enhanced dataset was generally higher than that on the unenhanced dataset; thus, data enhancement can effectively improve model accuracy. The F1 values of Mask R-CNN, SOLO, PolarMask, CentreMask, YOLACT, and BlendMask after training on the enhanced dataset were 0.9214, 0.9432, 0.8873, 0.9297, 0.9023, and 0.9479, respectively. The F1 value of the BlendMask model was thus higher than those of the other five instance segmentation models in the weed segmentation task; the F1 value of the SOLO model was lower than that of BlendMask by only 0.0047, while PolarMask exhibited the lowest weed segmentation performance among the six models.
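The F1 values above combine precision and recall; a minimal sketch of the computation (a standalone helper for illustration, not the authors' evaluation code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with the ResNet101 totals reported in Section 3.3
# (precision 0.9710, recall 0.9271): the result is close to the
# reported F1 of 0.9479; small gaps can come from averaging choices.
f1_total = f1_score(0.9710, 0.9271)
```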
Figures 10(b) and 10(c) show the AP50 and AP70 values of the six instance segmentation networks, respectively. The AP50 values of Mask R-CNN, SOLO, PolarMask, CentreMask, YOLACT, and BlendMask after training on the enhanced dataset were 0.6932, 0.7131, 0.6542, 0.7085, 0.6721, and 0.7200, and the corresponding AP70 values were 0.5244, 0.5796, 0.4982, 0.5737, 0.5032, and 0.5921, respectively. According to Figure 10(b), data enhancement leads to higher average precision values than the dataset without data enhancement. With data enhancement, the AP50 value of the BlendMask model is 0.7200, the highest among the six instance segmentation models; the AP50 value of the SOLO model is 0.7131, a segmentation performance similar to that of BlendMask; and the AP50 value of the PolarMask model is 0.6542, the lowest of the six. With data enhancement, the AP50 of the six models ranges from 0.654 to 0.720, so these models can satisfy the requirements for weed instance segmentation. According to Figure 10(c), with data enhancement, the AP70 value of the BlendMask model is 0.5921, the highest among the six instance segmentation models, and the AP70 value of the SOLO model is 0.5796, again similar to that of BlendMask.
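AP at a fixed IoU threshold is the area under the precision-recall curve of score-ranked predictions, where a prediction counts as a true positive only if its mask IoU with an unmatched ground-truth instance exceeds the threshold (0.5 for AP50, 0.7 for AP70). A minimal all-points sketch, assuming the TP/FP labels have already been assigned by IoU matching:

```python
def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for score-ranked detections."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap = prev_recall = 0.0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # rectangle under the curve
        prev_recall = recall
    return ap
```

Running the same detections through the matcher at thresholds 0.5 and 0.7 yields AP50 and AP70, respectively.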
Overall, among the six instance segmentation algorithms, the F1, AP50, and AP70 values of the BlendMask model were the highest in the task of weed segmentation in a complex field environment, and its AP50 value was 0.0268, 0.0069, 0.0658, 0.0115, and 0.0479 higher than those of Mask R-CNN, SOLO, PolarMask, CentreMask, and YOLACT, respectively. The AP50 value of the SOLO model was 0.0199, 0.0589, 0.0046, and 0.0410 higher than those of the Mask R-CNN, PolarMask, CentreMask, and YOLACT models, respectively. These results show that the BlendMask and SOLO models exploit the image features better than the other four models, thereby exhibiting higher segmentation performance.
Since BlendMask and SOLO outperformed the other four models in terms of segmentation performance, both models were trained with two different backbone networks (ResNet50 and ResNet101), and the prediction times under the different backbones were compared. The corresponding results are presented in Table 3.
Table 3 Prediction time of different models under different backbone networks
| Network   | ResNet50 (ms) | ResNet101 (ms) |
|-----------|---------------|----------------|
| SOLO      | 102.7         | 128.5          |
| BlendMask | 89.3          | 114.6          |
Table 3 lists the prediction time of each model for a single image. When ResNet50 is used as the backbone network, BlendMask has the shortest prediction time, 13.4 ms lower than that of SOLO. Figure 10 indicated that the segmentation performance of BlendMask was comparable to that of SOLO; however, according to Table 3, under both backbone networks the prediction time of BlendMask for a single image is lower than that of SOLO. Therefore, considering both segmentation performance and prediction time, BlendMask exhibits satisfactory segmentation performance; the model is feasible and can realize prompt and accurate weed segmentation.
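Single-image prediction times such as those in Table 3 can be measured with a simple wall-clock loop; a sketch, where the `model_fn` callable and the warm-up count are illustrative rather than from the paper:

```python
import time

def mean_inference_ms(model_fn, inputs, warmup=2):
    """Average per-input latency in milliseconds, after a short warm-up."""
    for x in inputs[:warmup]:   # warm-up runs are excluded from timing
        model_fn(x)
    start = time.perf_counter()
    for x in inputs:
        model_fn(x)
    return (time.perf_counter() - start) * 1000 / len(inputs)
```

Averaging over many images and discarding warm-up runs reduces the influence of one-off costs such as model loading and cache misses.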
3.2 Comparison of different hyperparameters of BlendMask
We varied the following four hyperparameters of BlendMask:
- R, the resolution of the bottom-level RoI;
- M, the resolution of the top-level prediction;
- K, the number of bases;
- the source of the bottom-module features (backbone network or FPN).
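The search space over these four hyperparameters can be enumerated as a simple grid; the dictionary keys below are illustrative names rather than actual BlendMask configuration fields, and the candidate values are those used in the ablations that follow:

```python
from itertools import product

# Illustrative hyperparameter grid for the ablation study.
grid = {
    "R": [28, 56],                  # bottom-level RoI resolution
    "M": [2, 4, 7, 14],             # top-level prediction resolution
    "K": [1, 2, 4, 8],              # number of bases
    "bottom": ["backbone", "fpn"],  # source of bottom-module features
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
```

In practice the ablations below fix all but one or two of these at a time rather than evaluating the full grid.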
Table 4 compares different resolutions when K = 4 and the bottom module uses the C3 and C5 features. We set the resolution R of the bottom-level RoI to 28 and 56, with R/M ratios ranging from 4 to 14.
Table 4 Comparison of different resolutions
| R  | M  | AP50  | AP70  | Time (ms) |
|----|----|-------|-------|-----------|
| 28 | 2  | 0.652 | 0.505 | 85.7      |
| 28 | 4  | 0.664 | 0.517 | 86.1      |
| 28 | 7  | 0.677 | 0.521 | 88.3      |
| 56 | 4  | 0.685 | 0.532 | 86.3      |
| 56 | 7  | 0.693 | 0.535 | 88.2      |
| 56 | 14 | 0.698 | 0.538 | 91.1      |
Note: the number of bases (K) is set to 4, and the bottom module uses the C3 and C5 features from the ResNet101 backbone. The performance of the model is compared by varying the resolutions of the bottom-level RoI (R) and the top-level prediction (M).
It can be seen from Table 4 that increasing the resolution R of the bottom-level RoI leads to a longer model running time. Compared with R = 28, the performance of the model is generally higher when R = 56. When R is set to 56 and M is set to 14, the AP50 and AP70 values are 0.005 and 0.003 higher, respectively, than when M is set to 7, but the prediction time is 2.9 ms longer. Considering both prediction speed and accuracy, we set R to 56 and M to 7 in the subsequent ablation experiments.
Table 5 Comparison of different bases
| K    | 1     | 2     | 4     | 8     |
|------|-------|-------|-------|-------|
| AP50 | 0.645 | 0.672 | 0.693 | 0.663 |
| AP70 | 0.497 | 0.504 | 0.535 | 0.524 |
Note: We set R to 56 and M to 7.
Table 5 compares different numbers of bases when R is set to 56 and M is set to 7. We varied the number of bases (1, 2, 4, and 8) to find the best-performing configuration. From Table 5, four bases achieve the best performance; accordingly, we set the number of bases to 4 in the subsequent ablation experiments.
Table 6 Comparison of the bottom feature locations from backbone or FPN
| Source   | Feature | M  | Time (ms) | AP50  | AP70  |
|----------|---------|----|-----------|-------|-------|
| Backbone | C3, C5  | 7  | 88.3      | 0.653 | 0.507 |
| Backbone | C3, C5  | 14 | 91.1      | 0.657 | 0.519 |
| FPN      | P3, P5  | 7  | 84.9      | 0.664 | 0.523 |
| FPN      | P3, P5  | 14 | 89.5      | 0.667 | 0.523 |
Note: the resolution of the bottom-level RoI (R) is set to 56 and the number of bases (K) is set to 4.
Table 6 compares the feature extraction performance of different bottom modules when R is set to 56 and K is set to 4. From Table 6, using FPN features as the input of the bottom module improves the performance of the model. In the subsequent experiments, we therefore use the backbone combined with FPN to extract the features of weeds.
3.3 Segmentation results of weeds with different shooting angles and leaf ages
We compared the segmentation results of BlendMask with two different backbone networks (ResNet50 and ResNet101) combined with FPN under different leaf ages and shooting angles. The test set, which included 600 images, was used to verify the generalization ability of the model; accordingly, 600 images without data enhancement were selected for testing. The total test set included 200 images each for the front view, side view, and top view. As shown in Figure 11, when the labels a, b, c, a_leaf, b_leaf, c_leaf, and centre were not recognized, the corresponding instances were counted as background.
Figure 11 shows the confusion matrices of the detection results of the model with data enhancement. The matrix counts the classified instances by comparing the actual labels in the validation data with the predicted classes and indicates whether the model can differentiate among the classes. As shown in Figure 11, ResNet50 and ResNet101 share a common feature: the prediction for the leaves of Solanum nigrum is highly accurate, whereas the predictions for Solanum nigrum, Barnyard grass, and Abutilon theophrasti Medicus are less satisfactory. For ResNet101, the prediction accuracy of the plant centre on the total test set is second only to that of the leaves of Solanum nigrum, while the prediction accuracy of Barnyard grass is the lowest; the number of Barnyard grass images in the training data could therefore be increased appropriately. Moreover, some leaves of Solanum nigrum are predicted as leaves of Abutilon theophrasti Medicus, likely because the two leaves are similar in shape. ResNet101 exhibited slightly higher predictive accuracy than ResNet50 on all test sets; thus, the performance of ResNet101 is higher than that of ResNet50.
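A confusion matrix of this kind can be built by tallying (true label, predicted label) pairs, with unrecognized instances assigned to a background class; a self-contained sketch in which the label names are illustrative:

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows index true labels, columns index predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Illustrative label set; unrecognized instances map to "background".
labels = ["S. nigrum", "Barnyard grass", "A. theophrasti", "background"]
```

Off-diagonal entries, such as Solanum nigrum leaves predicted as Abutilon theophrasti leaves, reveal exactly which classes the model confuses.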
Figure 12 shows the detection results of the BlendMask model under two backbone networks, three angles, and seven types of labels in the case of data enhancement. The precision rate, recall, and F1 value of the ResNet101 network for the total test set are 0.9710, 0.9271, and 0.9479, respectively.
According to Figure 12, with data enhancement, the precision, recall, and F1 values of ResNet50 in the front, side, and top views and the total test set were greater than or equal to 0.7619, 0.7463, and 0.7634, respectively, whereas those of ResNet101 were greater than or equal to 0.8983, 0.8267, and 0.8983, respectively. The precision, recall, and F1 values of ResNet101 were thus considerably higher than those of ResNet50 across all test sets, and ResNet101 exhibited consistently higher performance. The F1 values of ResNet101 in the front, side, top, and total test sets were 0.9445, 0.9371, 0.9643, and 0.9479, respectively. When ResNet101 was used as the backbone network, the recall values of the top, front, and side view test sets were greater than or equal to 0.9149, 0.9051, and 0.8983, respectively, and the top view test set exhibited the highest performance in all classifications. For the total test set, the F1 values were 0.9661 and 0.9725, and the recall rates were 0.9679 and 0.9648, when detecting Solanum nigrum and the leaves of Solanum nigrum, respectively. Moreover, on the front, side, and top test sets, the classification performance for Solanum nigrum was higher than that for the other two kinds of weeds. For the plant centre, the precision values of ResNet101 in the front, side, top, and total test sets were all 1.0000. Since Figures 11 and 12 only indicate the classification performance of the model, the recognition accuracy cannot be determined from them, and the complex field environment is expected to influence weed identification; the model recognition accuracy is therefore critical for evaluating model performance. Table 7 presents the detection results for the weeds under different networks and angles with data enhancement.
Table 7 Detection results for the weeds under different networks and angles with data enhancement.
| Test set       | Metric | ResNet50 | ResNet101 |
|----------------|--------|----------|-----------|
| Front view     | AP50   | 0.573    | 0.732     |
|                | AP70   | 0.485    | 0.602     |
|                | mIOU   | 0.482    | 0.597     |
| Side view      | AP50   | 0.564    | 0.645     |
|                | AP70   | 0.472    | 0.540     |
|                | mIOU   | 0.472    | 0.583     |
| Top view       | AP50   | 0.637    | 0.784     |
|                | AP70   | 0.521    | 0.633     |
|                | mIOU   | 0.553    | 0.642     |
| Total test set | AP50   | 0.591    | 0.720     |
|                | AP70   | 0.493    | 0.592     |
|                | mIOU   | 0.502    | 0.607     |
The mAP is a commonly used index in target detection. Table 7 shows that the mAP of ResNet101 is higher than that of ResNet50, indicating that ResNet101 exhibits higher target detection performance. For the total test set, when ResNet101 is used as the backbone network, the AP50 and AP70 values are 0.720 and 0.592, respectively; thus, at a threshold of 0.5 or greater, ResNet101 exhibits high detection performance. When ResNet101 is used as the backbone network, the AP50 values for the top, front, and side views and the total test set are 0.784, 0.732, 0.645, and 0.720, respectively, with the top view yielding the highest detection performance.
The mIOU is a valuable index for evaluating segmentation results [31] and is commonly used to evaluate the segmentation performance of the BlendMask model. As indicated in Table 7, over the 600 test images of weeds, the mIOU of ResNet50 and ResNet101 on the total test set is 0.502 and 0.607, respectively. Thus, the ResNet101 model exhibits higher network performance and can be applied to the segmentation of small target objects; in other words, it can satisfy the needs of weed instance segmentation. When ResNet101 is used as the backbone network, the mIOU of the top view is 0.642, which is higher than those of the other test sets, so this configuration achieves satisfactory segmentation results. We therefore choose ResNet101 combined with FPN to extract the features of weeds. Figure 13 shows the recognition accuracy for different leaf ages with data enhancement and ResNet101 combined with FPN.
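For reference, mIOU averages the per-instance mask IoU, the intersection over union of the predicted and ground-truth pixel sets; a minimal sketch using sets of pixel indices in place of binary masks:

```python
def miou(pairs):
    """Mean IoU over (predicted, ground-truth) pixel-set pairs."""
    def iou(pred, gt):
        union = len(pred | gt)
        return len(pred & gt) / union if union else 0.0
    return sum(iou(p, g) for p, g in pairs) / len(pairs)
```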
To obtain the recognition accuracy for different leaf ages, we used three test sets: the Solanum nigrum, Barnyard grass, and Abutilon theophrasti Medicus datasets. Each dataset contains 300 weed images, for a total of 900 images, none with data enhancement. In the Solanum nigrum dataset, there are 100 weeds each with 2 leaves, 3 leaves, and 4 leaves, and the other two datasets are organized in the same way. The accuracy of leaf-age identification was determined by comparing the leaf count computed by the model with the leaf count recorded on the label when the data were collected. From Figure 13, the leaf-age identification accuracy for all three weeds was higher than 80%. The recognition accuracy for 2 leaves and 3 leaves of Solanum nigrum is generally higher than that for the other two weeds, and the accuracy for 3 leaves, 0.957, was the highest among all categories. The recognition accuracy of Barnyard grass at 4 leaves was 0.017 and 0.022 lower than that at 2 leaves and 3 leaves, respectively, so the accuracy at 2 leaves was only 0.005 lower than that at 3 leaves, the smallest gap. The recognition accuracy for 2 leaves of Abutilon theophrasti Medicus was the lowest among all categories, at 0.887.
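Since leaf age is defined by the number of complete leaves, the accuracy in Figure 13 reduces to comparing the predicted count of complete-leaf instances per plant with the labeled count; a sketch with illustrative variable names:

```python
def leaf_age_accuracy(predicted_counts, labeled_counts):
    """Fraction of plants whose predicted leaf count matches the label."""
    matches = sum(p == t for p, t in zip(predicted_counts, labeled_counts))
    return matches / len(labeled_counts)
```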