Instance segmentation-based method to obtain the phenotypic information of weeds in complex field environments

Background: Weeds pose a critical threat to crop growth. The leaf age and plant centre, which represent the key phenotypic information of weeds, can help understand the morphological structure of weeds, thereby facilitating precise targeted spraying and a reduction in the herbicide usage. However, determining the weed types, leaf age and plant centre under complex field conditions involving variations in the light and plant appearance along with leaf occlusion is challenging. With the advancement in the application of deep learning with computer vision, such approaches can likely overcome these challenges, as demonstrated in other complex agricultural applications. Results: We developed a weed segmentation method based on BlendMask, which could obtain the weed types, leaf age and plant centre under complex field conditions. Mobile devices were used to capture digital images at different angles (front, side, and top views) of certain weeds (Solanum nigrum, Barnyard grass, and Abutilon theophrasti Medicus) in the field. Subsequently, two datasets (with and without data enhancement) were produced and input to the network. Moreover, two backbone networks, ResNet50 and ResNet101, were compared, along with six instance segmentation algorithms, and the instance segmentation results of the model under different angles were evaluated. The results indicated that data enhancement could enhance the model performance. In the case with data enhancement, the F1 value, AP50 and AP70 scores, and mIOU with ResNet101 as the backbone network were 0.9479, 0.720, 0.592, and 0.607, respectively, corresponding to the highest segmentation performance. Furthermore, the top view images of the weeds corresponded to the highest detection accuracy, compared to that for the other two angles. Conclusion: The proposed method can obtain the weed types, leaf age and plant centre, and images captured from the top view angle can help achieve the highest accuracy. The research results can provide a reference for further precision weeding studies in the field.
The experimental results show that despite the interference of the straw and crop leaves in the background of the weeds in the field, the BlendMask model using ResNet101 as the backbone network can realize accurate segmentation of the weeds with a satisfactory segmentation performance. Future research will be focused on evaluating image datasets that cover a wider range of weeds and crop varieties. Moreover, the identification efficiency of the proposed approach is low; thus, the model efficiency needs to be enhanced, and the trained model must be applied to the mobile platform of the spray system used for weeding. The proposed study combines artificial intelligence technology with agronomic research concepts, and the findings can facilitate the development of intelligent agriculture.

background and light uniformity [23]. Moreover, these studies were aimed at calculating the number of leaves for leaf segmentation; however, an image may have multiple weeds, and it is necessary to identify the leaf age, weed type and plant centre of each weed. Therefore, the segmentation of plant phenotypes in a complex farmland environment is a relatively unexplored research domain. Due to the complex environmental conditions of farmlands, differences among plants, and the mutual occlusion of leaves, segmenting the leaf age and plant centre is challenging and often limits such analyses. In this study, the weed species, growth stage and plant centre were segmented through machine vision in a complex field environment to guide the use of herbicides.
Deep learning is an emerging field of machine learning, aimed at solving big data analysis problems. The DCNN is a deep learning method that is especially suitable for computer vision problems. In this study, an instance segmentation algorithm based on deep learning is proposed to obtain the weed phenotype in a complex field environment. According to an agricultural survey, deep learning technology is more accurate than the traditional image processing technology [24]. Moreover, in the complex field environment, the illumination, weather conditions, and soil background are complex and variable, and the plants may overlap [25, 26]; the DCNN model can address these aspects. Nevertheless, the farmland environment is complex, and a sufficiently large dataset is required to train the deep learning model, to effectively manage the complex field environment aspects and increase the model accuracy [27]. To this end, data enhancement can be performed, which is a common method in the field of image recognition. In this approach, the dataset is expanded by randomly flipping the images, adding noise, and adjusting the brightness. Geetharamani et al. [28] used a nine-layer deep convolutional neural network to identify plant leaf diseases and employed six methods of data enhancement to enhance the model performance, which resulted in a classification accuracy of 96.4%. Piedad et al. [29] used the Mask R-CNN model to realize the non-invasive classification of clustered horticultural crops. Due to the limited dataset, the dataset was expanded to increase the model accuracy. In general, data enhancement is a key method to enrich the training samples and enhance the model performance; moreover, this approach can help enhance the suitability of a dataset for complex farmland environments.
When collecting the dataset, the shooting angle in the field [30] and growth stage of the weeds may affect the dataset accuracy. Quan et al. used deep learning methods to detect maize seedlings under different growth stages, angles and weather conditions in a complex field environment. It was proposed that when the angle between the camera and vertical direction is 0°, the detection accuracy is 0.95% lower than that for the other angles [27]; therefore, the model performance varied when the data were collected from different angles. However, Quan et al. primarily considered different oblique angles, and the information pertaining to different oblique angles is the fusion of that corresponding to different orthogonal angles [27]. Moreover, the position and shape of weeds in the field are complex and changeable, and the shape of the same object is different under different shooting angles, which affects the accuracy of the dataset. Therefore, we collected data from three angles, corresponding to the front, side and top views, which could clarify the comprehensive information of weeds and enable the model to cope with the requirements of operations from different angles. Moreover, an instance segmentation algorithm based on deep learning was developed to obtain the weed phenotype in a complex field environment. According to the existing research, the DCNN exhibits a high performance in solving complex environmental problems in the field. Among the relevant approaches, instance segmentation based on deep learning is a new challenge for computer vision applications [31]. The model used in this study is a state-of-the-art method, aimed at detecting each object in a weed image and classifying each pixel of each instance. The output is the mask and bounding box of the target object [32], and the problem of leaf adhesion and occlusion can be solved in this manner [21].
Yu et al. [33] proposed an exemplar-based recursive instance segmentation framework to segment plant phenotypes and conducted experiments on a public benchmark to demonstrate the effectiveness of the method. Huang et al. [34] proposed a deep learning model for in-row crop detection in rice fields and constructed a field rice detection dataset with a detection accuracy of 93.22%. This method identified a stem-base-centred square region at the plant level, which corresponds to the protected area image of mechanical weeding and is the plant centre. The plant centre is of significance for plant research.
Moreover, the instance segmentation algorithm based on the Mask R-CNN, proposed by He et al. [35], could not only identify the bounding box but also mask the target contour, thereby outperforming the other models [36]. Jia et al. [37] used an improved Mask R-CNN model to segment overlapping apples, with an accuracy rate of 97.31%. In particular, the Mask R-CNN is representative of two-stage segmentation networks. Although many scholars have studied and applied the Mask R-CNN technique and achieved satisfactory results, the BlendMask [38] model proposed by Chen Hao et al. exhibited a higher segmentation performance on the COCO dataset [39] than the Mask R-CNN. BlendMask combines the ideas of the top-down and bottom-up methods. Moreover, BlendMask employs the fully convolutional one-stage object detection (FCOS) [40] framework, which eliminates the calculation of the position-sensitive feature map and mask feature. Thus, the inference time does not increase with the number of predictions, as in the traditional two-stage method.
The aforementioned studies provide a feasible basis and reference for the application of the DCNN in plant segmentation. Moreover, it is noted that the DCNN can overcome the shortcomings of traditional image segmentation methods. The excellent performance of the BlendMask model indicates that it can well address the complex environment in the field. In this study, the following weeds, which are commonly found in fields in Northeast China, were selected: Solanum nigrum, Barnyard grass, and Abutilon theophrasti Medicus. According to the existing studies, a sufficient number of well-defined weed datasets are required to train DCNN models. Moreover, the weed images should be obtained from real scenes in fields to ensure that the images contain the morphological characteristics of the weeds at different growth stages in the complex field environment and cover more variables as the model input. Therefore, we collected the weed images at three different angles (front, side, and top views). In addition, the classic DCNN network can be modified to increase the model accuracy. Therefore, we created two datasets containing 4000 and 6000 weed images, without and with data enhancement, respectively.
In particular, considering the aforementioned problems, this paper proposes a weed phenotype segmentation method based on the BlendMask to obtain the species, leaf age and plant centre of weeds. The main objectives were as follows: (1) To evaluate the feasibility of BlendMask in obtaining the weed species, leaf age and plant centre through weed phenotypic segmentation, considering seven evaluation indicators.
(2) To explore the influence of data collection from different angles (front view, side view, top view) on the phenotypic segmentation of weeds in a complex field environment, and to identify the optimal angle.
(3) To explore whether data enhancement can enhance the model performance.
(4) To explore whether the combination of ResNet101 with the FPN architecture for feature extraction can enhance the model performance.

Overview
Three typical weeds, Solanum nigrum, Barnyard grass, and Abutilon theophrasti Medicus, which are commonly found in Northeast China, were selected. The leaf age and plant centre of the weeds were segmented using the BlendMask model. First, datasets from different angles (front, side, and top views) were collected in the actual field environment. Second, two datasets (with and without data enhancement) were created. Third, the produced datasets were annotated, and the generated file was input to the network to train the network model. The backbone network of the BlendMask initialization model is the residual network combined with the feature pyramid network (FPN). This study employed different backbone networks (ResNet50 and ResNet101) in combination with the FPN architecture. The feature extraction performance based on the weed leaf age and plant centre was evaluated. Figure 1 shows the workflow of the experiment.

Image acquisition
The following three kinds of weeds were selected in this study: Solanum nigrum, Barnyard grass, and Abutilon theophrasti Medicus. Solanum nigrum is an annual dicotyledon, Barnyard grass is an annual herb, and Abutilon theophrasti Medicus is an annual subshrub weed. The three kinds of weeds are commonly found in the fields of Northeast China, as shown in Figure 2. The images were sourced from field weeds. Because greenhouse weeds exhibit a single background, whereas the images of field weeds are more complex, field images allow the ability of the model to recognize weeds in the natural state to be verified. The field data images were acquired in the Xiangfang District from May to June 2019. The Xiangfang District is located in the northeast plain and is the main planting area for maize, soybean and rice. The main weeds in the corn fields of the Xiangfang District are Solanum nigrum, Barnyard grass, and Abutilon theophrasti Medicus, and images of these weeds were collected. Since most of the weeds in the field had a leaf age of two to five, only the images of weeds with a leaf age of less than five were acquired.
Because the weed information obtained from a single shooting angle is not comprehensive, to more clearly illustrate the difference in the weed information obtained from different angles, a Python code was implemented to remove the background of the field weed images, as shown in Figure 3. In contrast, the images used for training the model corresponded to the complex environment of the field, and the background was not removed. Moreover, in general, the shooting weather [41] and acquisition angle [30] considerably influence the dataset and affect the segmentation precision [42]. Data collection was initiated from the two-leaf period after crop planting, from May 20, 2019, to June 29, 2019, and images of weeds with different leaf ages were collected every 2 to 5 d under different weather conditions, angles, and growth stages, to obtain the data of the weeds at each leaf age stage in the growth cycle, as shown in Table 1. The camera of the iPhone 6s Plus device, with a focal length, maximum aperture, and maximum resolution of 4.2 mm, f/2.2 and 4032 × 3024 pixels, respectively, was used to capture the images, and the weed images were stored in the JPEG file format. When collecting the data, we recorded the sample variety, leaf age, collection time, collection angle, weather, and temperature for each sample. All the datasets were collected randomly in the farmland, and the images corresponded to a relatively clean soil background and a complex field background covered by straw and leaves; such disturbances were treated as part of the background.

Dataset construction and annotation
When training the network and conducting network testing, the input image size must match the input size of the network [43]; thus, the images were adjusted to a pixel size of 1024 × 1024 to construct the image dataset of the DCNN. The size of the field images was 4032 × 3024; without disturbing the morphology of the plants in the image, each image was cropped to a size of 3024 × 3024, and these images were resized to 1024 × 1024. As the weeds in the images were required to be annotated, certain images not suitable for annotation were discarded. Finally, 4000 images were selected from 4574 images. Due to the limited size of the dataset, a data enhancement scheme was adopted to further enrich the images to ensure that the images were highly representative and could reflect the real situation of the field data more accurately [27]; moreover, the training precision of the model could be increased [44], the dataset could be expanded, and overfitting could be reduced [45] (Figure 4).
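The crop-then-resize step described above (4032 × 3024 → 3024 × 3024 → 1024 × 1024) can be sketched as follows; this is a minimal illustration using Pillow, with a centre crop assumed (the paper does not state which region of the long edge was kept):

```python
from PIL import Image

def preprocess(image, target=1024):
    """Centre-crop a landscape image to a square, then resize to the target size."""
    w, h = image.size          # e.g. 4032 x 3024
    side = min(w, h)           # 3024: crop the long edge so plant shapes are not distorted
    left = (w - side) // 2
    top = (h - side) // 2
    square = image.crop((left, top, left + side, top + side))   # 3024 x 3024
    return square.resize((target, target), Image.BILINEAR)      # 1024 x 1024

# Example with a dummy image at the camera's native resolution
img = Image.new("RGB", (4032, 3024))
out = preprocess(img)
print(out.size)  # (1024, 1024)
```

Cropping before resizing preserves the aspect ratio of each plant, whereas resizing the 4 : 3 frame directly would stretch the leaves.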
The images were randomly rotated, and noise was added. Since the illumination is a critical aspect in the segmentation process, to enhance the robustness of the DCNN against the illumination variations owing to environmental changes, the datasets were further enhanced by simulating illumination changes [46]: the brightness was increased and decreased by 10%. Moreover, certain blurred, occluded and incomplete images were included in the dataset as negative samples, and 6000 data-enhanced images were obtained. The structure and proportion of the original dataset remained unchanged when data enhancement was implemented. Two datasets were prepared, with and without data enhancement. Both the datasets were randomly divided into training and verification sets, with a ratio of 8:2. The test set was selected from the images without data enhancement. The VGG Image Annotator labelling tool [47] was used for the annotation, as shown in Figure 5; the leaves and centre of the weeds were surrounded by irregular polygons. Because the number of weeds in an image is uncertain under the actual working conditions, and the image may contain multiple weeds, the number of masked leaves in the picture could not be used to calculate the leaf age of a single weed. In this study, we used a rectangular frame to mark the outline of the outermost layer of a single weed and calculated the number of leaf masks in the rectangular frame, which corresponded to the leaf age of the weed. The rectangular frame was not masked. The labels were divided into seven categories (Figure 5).
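The augmentation operations and the 8:2 split described above can be illustrated with a small numpy sketch; the function names, the noise standard deviation, and the restriction of rotations to 90-degree steps are assumptions for illustration, not the authors' exact parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Apply one random augmentation: rotation, additive noise, or brightness shift."""
    choice = rng.integers(3)
    if choice == 0:                                   # random 90-degree rotation
        return np.rot90(image, k=int(rng.integers(1, 4)))
    if choice == 1:                                   # additive Gaussian noise (sigma assumed)
        noisy = image + rng.normal(0, 5, image.shape)
        return np.clip(noisy, 0, 255).astype(image.dtype)
    factor = 1.1 if rng.random() < 0.5 else 0.9       # brightness +/- 10%, as in the text
    return np.clip(image * factor, 0, 255).astype(image.dtype)

def split(paths, ratio=0.8):
    """Randomly split image paths into training and verification sets (8:2)."""
    paths = list(paths)
    rng.shuffle(paths)
    cut = int(len(paths) * ratio)
    return paths[:cut], paths[cut:]

train, val = split([f"img_{i}.jpg" for i in range(6000)])
print(len(train), len(val))  # 4800 1200
```

For the 6000-image enhanced dataset, the 8:2 ratio yields 4800 training and 1200 verification images.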

Two-stage instance segmentation model
Instance segmentation is one of the most challenging tasks in computer vision, because it involves not only the classification at the pixel level of semantic segmentation, but also certain characteristics of target detection. The two-stage Mask R-CNN is a representative algorithm. The Mask R-CNN extends the target detection framework of the Faster R-CNN [48] by adding a masking branch at the end of the model [49]. This process ensures that each output instance segments the proposal box through a fully connected layer, to ensure that the segmentation is parallel to the target detection. To better detect small targets, the ROI pooling is changed to ROIAlign. The process flow of the Mask R-CNN model is shown in Figure 6. The output consists of three branches: bounding boxes, target classifications and segmentation masks. The selected backbone networks for the Mask R-CNN were ResNet50 and ResNet101 combined with FPNs. In this configuration, first, the backbone network extracts the feature map from the input image and outputs the features from the backbone network. The map is sent to the region proposal network (RPN) and ROIAlign to generate the region of interest (ROI). Finally, the ROI predicts the target category and bounding box through the convolutional layer and fully connected layer and segments the target region through the fully convolutional neural network (FCN). The instance segmentation task of the target is thus completed.

One-stage instance segmentation model
BlendMask is a one-stage dense instance segmentation algorithm that combines the instance-level information with lower-level fine-granularity semantic information. BlendMask is composed of a one-stage target detection network, FCOS [40], and a mask branch. Figure 7 shows the model structure of BlendMask. The mask branch has three parts: the bottom module is used to process the bottom features to generate the score maps (bases), the top layer is attached to the box head of the detector to generate the top-level attention corresponding to the bases, and the blender module is used to fuse the bases and attentions.
BlendMask adds the bottom module to extract the low-level detailed features, based on the anchor-free detection model FCOS, and predicts the attention at the instance level. BlendMask draws on the fusion methods of the fully convolutional instance-aware semantic segmentation (FCIS) [50] and YOLACT [51] and incorporates the blender module to better integrate these features. Moreover, BlendMask combines the concepts of the top-down and bottom-up methodologies, thereby combining the rich instance-level information with accurate dense pixel features [38].
The structure of BlendMask is similar to that of the Mask R-CNN; however, in contrast to the RPN used by the Mask R-CNN, BlendMask chooses the one-stage detector FCOS, which eliminates the calculation of the position-sensitive feature map and mask feature. BlendMask uses the attention-guided blender module to calculate the global map representation. Compared with the more complex hard alignment used in the FCN and FCIS, the amount of calculation is considerably reduced under the same resolution. BlendMask is a method of dense pixel prediction, and the output resolution is not limited by the top-level sampling. In the Mask R-CNN framework, to achieve more accurate mask features, the resolution of the RoIPooler must be increased, thereby increasing the calculation time and network depth of the head.
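The fusion performed by the blender module can be illustrated with a toy numpy sketch: per instance, the K attention maps are upsampled to the base resolution, normalized with a softmax over K, and combined with the K cropped bases by an element-wise product and a sum over K. This is a simplified illustration, not the authors' implementation; in particular, nearest-neighbour upsampling stands in for the interpolation used in BlendMask:

```python
import numpy as np

def blend(bases, attentions):
    """Blend K base score maps with K per-instance attention maps.

    bases:      (K, R, R) cropped bottom-level score maps for one instance
    attentions: (K, M, M) top-level attention predicted by the detector head
    returns:    (R, R) instance mask scores
    """
    K, R, _ = bases.shape
    M = attentions.shape[1]
    scale = R // M
    # nearest-neighbour upsample of the attention from M x M to R x R
    up = attentions.repeat(scale, axis=1).repeat(scale, axis=2)
    # softmax over K so the bases compete at each pixel
    e = np.exp(up - up.max(axis=0, keepdims=True))
    weights = e / e.sum(axis=0, keepdims=True)
    # weighted sum over the K bases gives the instance mask
    return (weights * bases).sum(axis=0)

bases = np.random.rand(4, 56, 56)   # K = 4 bases at bottom-level resolution R = 56
attn = np.random.rand(4, 14, 14)    # top-level attention at resolution M = 14
mask = blend(bases, attn)
print(mask.shape)  # (56, 56)
```

The shapes here match the R = 56, M = 14, K = 4 configuration discussed in the hyperparameter comparison later in the paper.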
BlendMask can establish deep neural network models of different depths by implementing different weight layers. The deep learning network models applied at present include AlexNet, ZF, GoogLeNet, VGG, and ResNet. Although a larger number of network layers may lead to a higher accuracy, the deeper network layers may result in degraded model training and detection speeds. Nevertheless, since the residual network does not increase the number of model parameters, the problem of training degradation can be alleviated, and the model convergence can be accelerated [49]. Therefore, in this study, ResNet50 and ResNet101 combined with the FPN were used as the backbone networks to extract the features of the weed images.

BlendMask training model
Before training the BlendMask, we introduced a pretraining model based on the COCO dataset [39] through transfer learning. The COCO dataset has 328,000 images, including 91 categories. The pretraining model extracted the weights after training on the COCO dataset, based on which the established datasets were retrained. Through the transfer learning, the labour and cost of training could be reduced, the training efficiency could be enhanced, and the model parameters could be better adjusted. The BlendMask model was implemented using the AdelaiDet open source toolbox based on detectron2.
The experiment was performed on the Ubuntu 18.04 operating system, with a six-core Intel Core i7-8700K @ 3.70 GHz processor, 32 GB of memory, and an NVIDIA (Santa Clara, CA, USA) GeForce GTX 1080 Ti graphics card. The pretraining network parameters are listed in Table 2.

Training and evaluation
The momentum and initial learning rate for BlendMask were set as 0.9 and 0.01, respectively, and the training batch size was set as 4. After the parameters were set, training was conducted for 12 rounds, with 10,000 iterations being implemented in each round. The basic framework of the BlendMask involved either ResNet50 or ResNet101.
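The solver settings above can be collected in one place. The sketch below uses a plain Python mapping with key names styled after detectron2 solver options; the actual AdelaiDet/BlendMask config keys may differ, so the names are illustrative:

```python
# Solver settings stated in the text, as an illustrative mapping
# (key names styled after detectron2's cfg.SOLVER; not the exact config file).
solver = {
    "MOMENTUM": 0.9,          # SGD momentum
    "BASE_LR": 0.01,          # initial learning rate
    "IMS_PER_BATCH": 4,       # training batch size
    "ROUNDS": 12,             # training rounds
    "ITERS_PER_ROUND": 10000, # iterations per round
}

total_iters = solver["ROUNDS"] * solver["ITERS_PER_ROUND"]
print(total_iters)  # 120000
```

With 12 rounds of 10,000 iterations each, the model sees 120,000 iterations in total.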
The precision and recall rates can be defined as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

where "true positive (TP)" and "false positive (FP)" indicate the number of positive and negative samples detected as positive, respectively, and "false negative (FN)" indicates the number of positive samples detected as negative. The metric function (F1) [52] of the precision and recall rates can be defined as follows:

F1 = 2 × Precision × Recall / (Precision + Recall)

To address the problem of multiclass imbalance, we averaged the indicators over the seven classes [53]. To more extensively evaluate the model algorithm, the IOU was considered to examine the measurements [54]. In general, the IOU measures the overlap between two bounding boxes. Figure 8 illustrates the calculation of the overlap degree between the weed prediction box and the real box on the ground. The value of the IOU can be divided into three regions. When the IOU threshold is set as 0.7, anchors with IOU values less than or equal to 0.3 are considered to be negative anchors, and anchors with IOU values between 0.3 and 0.7 are neutral anchors; these cases are not considered. The anchors with IOU values greater than or equal to 0.7 are positive anchors. The system identifies the positive anchors and bounding boxes and matches these boxes to the ground-truth boxes to optimize the RPN output of the model. The maximum value of the anchor overlap with the ground-truth boxes is retained by the system. When the IOU threshold is set as 0.5, and the IOU values are greater or smaller than 0.5, a positive or negative ROI is observed, respectively. The positive ROI is allocated to the mask and ground-truth by the system. The detection performance of the model is evaluated considering the mean average precision (mAP) [54]. The mAP can clearly reflect the performance when it is related to the target position information and category information of the target in the image. The AP can be calculated for each category separately, and the value for each category can be averaged to calculate the mAP. A larger mAP is desirable.
This value can be calculated considering the AP. The thresholds were set as 0.5 and 0.7 in this study. When the IOU threshold was equal to or greater than 0.5 and 0.7, the mAP values were defined as AP50 and AP70, respectively. The IOU and mAP were defined as follows:

IOU = Area of Overlap / Area of Union = (A ∩ B) / (A ∪ B)
mAP = (1/N) × Σ APi

Note: N represents the number of images.
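The metrics and the anchor-labelling rule described above can be expressed compactly; the helper names below are illustrative, and boxes are represented as (x1, y1, x2, y2) tuples:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from the TP/FP/FN counts defined above."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(iou_value, lo=0.3, hi=0.7):
    """RPN-style anchor labelling with the 0.3/0.7 thresholds described above."""
    if iou_value >= hi:
        return "positive"
    if iou_value <= lo:
        return "negative"
    return "neutral"

value = iou((0, 0, 10, 10), (5, 0, 15, 10))
print(round(value, 3))        # 0.333: half-overlapping equal boxes
print(label_anchor(value))    # neutral
```

For example, two equal boxes that overlap over half of their width share one third of their union, which falls in the neutral region between the 0.3 and 0.7 thresholds.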

Results
The results of weed segmentation are shown in Figure 9. The leaf age was determined based on the number of complete leaves in the weed, and the accuracy of the leaf age recognition was evaluated by comparing this value with the corresponding label at the time of collection. The plant centre was determined based on the intersection area of the top leaves of the weed [7]. Six instance segmentation algorithms were compared, the algorithm with the highest performance was selected, and different hyperparameters were adjusted for this algorithm. Two datasets were trained using two different backbone networks (ResNet50 and ResNet101), and the network that could realize the optimal balance between the mIOU and mAP was selected. Subsequently, the segmentation performance of the best-performing algorithm under different shooting angles and different leaf ages was evaluated. The results indicate that data enhancement can enhance the model performance. The AP50 and mIOU of the BlendMask model using ResNet101 as the backbone combined with the FPN are 0.720 and 0.607, respectively; these values are better than those for the ResNet50 framework, and thus, the former model can be used for weed segmentation. Moreover, the weed images captured at the top view angle exhibited a higher detection accuracy than those of the other two angles.
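The leaf-age counting rule introduced in the annotation section (counting the leaf masks that fall within a plant's rectangular frame) can be sketched as follows; the function names are hypothetical, and the leaf masks are reduced to bounding boxes for simplicity (the real annotations are polygons):

```python
def mask_centre(box):
    """Centre point of a bounding box given as (x1, y1, x2, y2)."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def leaf_age(plant_box, leaf_boxes):
    """Count the leaf masks whose centres fall inside a plant's rectangular frame."""
    x1, y1, x2, y2 = plant_box
    count = 0
    for leaf in leaf_boxes:
        cx, cy = mask_centre(leaf)
        if x1 <= cx <= x2 and y1 <= cy <= y2:
            count += 1
    return count

plant = (100, 100, 400, 400)                       # rectangular frame of one weed
leaves = [(120, 120, 180, 180),                    # inside the frame
          (300, 310, 360, 380),                    # inside the frame
          (500, 500, 560, 560)]                    # belongs to a different plant
print(leaf_age(plant, leaves))  # 2
```

Attributing each leaf to the frame containing its centre keeps leaves of neighbouring weeds, which may partially overlap the frame, from being miscounted.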

Comparison of instance segmentation models
To verify the effectiveness of the proposed method for weed segmentation, six instance segmentation algorithms, including Mask R-CNN, SOLO [55], PolarMask [56], CentreMask [57], YOLACT [51] and BlendMask, were compared. The six algorithms were tested on two datasets (with and without data enhancement). To examine the recognition effect of the model in a complex field environment, the images in the test were those without any data enhancement. The test results are shown in Figure 10. According to Figure 10(b), data enhancement can lead to higher average precision values than those corresponding to the dataset without data enhancement. In the case of data enhancement, the AP50 value of the BlendMask model is 0.7200, which is the highest among the six instance segmentation models. The AP50 value of the SOLO model is 0.7131, and the corresponding instance segmentation performance is similar to that of the BlendMask model. Moreover, the AP50 value of the PolarMask model is 0.6542, corresponding to the lowest segmentation performance. In the case of data enhancement, the AP50 of the five models ranges from 62% to 72%, and these models can thus satisfy the requirements for weed instance segmentation. Further results are shown in Figure 10(c). Table 3 lists the prediction durations of the models for a single picture, and it can be noted that when ResNet50 is used as the backbone network, BlendMask has the smallest prediction duration, which is 13.4 ms lower than that of SOLO. Figure 10 indicates that the segmentation performance of BlendMask is comparable to that of SOLO. However, according to the comparison in Table 3, under both the backbone networks, the prediction time of BlendMask for a single image is lower than that for SOLO. Therefore, considering both the segmentation performance and prediction time, it can be considered that BlendMask exhibits a satisfactory segmentation performance; the model is feasible and can realize prompt and accurate weed segmentation.

Comparison of different hyperparameters of BlendMask
We adjusted the following four hyperparameters of BlendMask: the resolution R of the bottom-level RoI, the resolution M of the top-level prediction, the number of bases K, and the bottom module. Table 4 compares different resolutions when K = 4 and the bottom module uses the C3 and C5 features. We set the resolution R of the bottom-level RoI to 28 and 56, with the R/M ratio ranging from 14 to 4. It can be seen from Table 4 that increasing the resolution R of the bottom-level RoI leads to a longer operation time of the model. Compared with R = 28, the performance of the model is generally higher when R = 56. When R is set to 56 and M is set to 14, the AP50 value is 0.005 higher and the AP70 value is 0.003 higher than when M is set to 7; however, the prediction time is 2.9 ms longer. Considering both the prediction speed and accuracy, we set R to 56 and M to 7 in the subsequent ablation experiments. Table 5 lists the comparison of different numbers of bases when R is set to 56 and M is set to 7. We varied the number of bases from one to eight to identify the configuration with the best model performance. From Table 5, we can see that four bases achieve the best performance; thus, in the subsequent ablation experiments, we set the number of bases to four. Note: the resolution of the bottom-level RoI is set to 56, the resolution of the top-level prediction is set to 7, and the number of bases is set to four. Table 6 lists the feature extraction performance comparison of different bottom modules when R is set to 56, M is set to 7, and K is set to 4. From Table 6, we can see that using the FPN features as the input of the bottom module is effective for the performance of the model. In the subsequent experiments, we used the backbone combined with the FPN to extract the features of the weeds.

Segmentation results of weeds with different shooting angles and leaf ages
We compared the segmentation results of BlendMask with two different backbone networks (ResNet50 and ResNet101) combined with the FPN under different leaf ages and shooting angles. The test set, which included 600 images, was used to verify the generalization ability of the model; therefore, 600 images without data enhancement were selected for testing. The total test set included 200 images each for the front view, side view, and top view. As shown in Figure 11, the labels a, b, c, a_leaf, b_leaf, c_leaf, and centre were not recognized in certain cases, and we considered that these labels were identified as the background. Figure 11 shows the confusion matrices of the detection results of the model in the case of data enhancement. The matrices present statistics pertaining to the number of classified instances, obtained by comparing the actual labels in the validation set data with the predicted types, and indicate whether the model can differentiate among different classes. As shown in Figure 11, both ResNet50 and ResNet101 exhibit intuitive common features: the prediction for the leaves of Solanum nigrum is highly accurate; however, the prediction for the plants of Solanum nigrum, Barnyard grass, and Abutilon theophrasti Medicus is not satisfactory. The centre prediction for the total test set for ResNet101 is second only to that for the leaves of Solanum nigrum; however, the prediction accuracy of Barnyard grass is the lowest. The number of Barnyard grass images in the dataset can be increased appropriately. Moreover, a part of the leaves of Solanum nigrum is predicted as the leaves of Abutilon theophrasti Medicus, likely because of the similar shapes of these two leaves. ResNet101 exhibited a slightly higher predictive accuracy than that of ResNet50 in all the test sets. Thus, the performance of ResNet101 is higher than that of ResNet50.
Figure 12 shows the detection results of the BlendMask model under the two backbone networks, three angles, and seven types of labels in the case of data enhancement. The precision rate, recall, and F1 value of the ResNet101 network for the total test set are 0.9710, 0.9271, and 0.9479, respectively.
According to Figure 12, under data enhancement, the precision, recall, and F1 values of ResNet101 are higher than those of ResNet50. However, Figures 11 and 12 indicate only the classification performance of the model; the recognition accuracy cannot be determined from them, and the actual field environment is complex, which is expected to influence weed identification. Therefore, the model recognition accuracy is critical for evaluating the model performance. Table 7 presents the detection results for the weeds under different networks and angles with data enhancement. The mAP is a commonly used index in target detection. Table 7 indicates that the mAP of ResNet101 is higher than that of ResNet50, indicating that ResNet101 exhibits a higher target detection performance. For the total test set, when ResNet101 is used as the backbone network, the AP50 and AP70 values are 0.720 and 0.592, respectively. Thus, when the IOU threshold is equal to or greater than 0.5, ResNet101 exhibits a high detection performance. When ResNet101 is used as the backbone network, the AP50 values for the top, front, and side views and the total test set are 0.784, 0.732, 0.645, and 0.720, respectively; the top view corresponds to the highest detection performance.
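The F1 values reported in Figure 12 follow the standard harmonic-mean definition of precision and recall. A minimal sketch (the function name is ours) that reproduces the total-test-set value from the precision and recall quoted above:

```python
def f1_score(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# ResNet101 totals for the whole test set: P = 0.9710, R = 0.9271.
f1 = f1_score(0.9710, 0.9271)  # ≈ 0.948, matching the reported 0.9479
```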
The mIOU is a valuable index for evaluating segmentation results [31] and is commonly used to evaluate the segmentation performance of the BlendMask model. As indicated in Table 7, over the 600 weed test images in the total test set, the mIOU values of ResNet50 and ResNet101 are 0.502 and 0.607, respectively. Thus, the ResNet101 model exhibits a higher network performance and can be applied to the segmentation of small target objects. In other words, this model can satisfy the needs of weed instance segmentation. When ResNet101 is used as the backbone network, the mIOU of the top view is 0.642, which is higher than those of the other datasets; therefore, this configuration can achieve satisfactory segmentation results. Accordingly, we chose ResNet101 combined with FPN to extract the features of the weeds. Figure 13 shows the recognition accuracy for different leaf ages in the case of data enhancement with ResNet101 combined with FPN.
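The mIOU in Table 7 averages, over instances, the intersection-over-union of each predicted mask with its ground-truth mask. A minimal numpy sketch, assuming boolean mask arrays (names are illustrative, not the authors' code):

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def mean_iou(pairs):
    """Mean IoU over (predicted, ground-truth) mask pairs."""
    return float(np.mean([iou(p, g) for p, g in pairs]))

# Toy 4x4 example: the prediction overlaps the ground truth on 1 of 3 union
# pixels, so the IoU is 1/3.
gt = np.zeros((4, 4), dtype=bool); gt[0, 0:2] = True
pred = np.zeros((4, 4), dtype=bool); pred[0, 1:3] = True
score = iou(pred, gt)
```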
To obtain the recognition accuracy for different leaf ages, we used three test sets: the Solanum nigrum, Barnyard grass, and Abutilon theophrasti Medicus datasets. Each dataset contained 300 weed images (900 in total), all without data enhancement. The Solanum nigrum dataset contained 100 weeds each with 2 leaves, 3 leaves, and 4 leaves, and the remaining two datasets were organized in the same way. The leaf identification accuracy was determined by comparing the leaf count calculated by the computer with the leaf count recorded on the label when the data were collected. Figure 13 shows that the leaf identification accuracy for all three weeds was higher than 80%. The recognition accuracy for 2 leaves and 3 leaves of Solanum nigrum is generally higher than that for the other two weeds; the recognition accuracy for 3 leaves was 0.957, the highest among all categories. The recognition accuracy of Barnyard grass at 4 leaves was 0.017 and 0.022 lower than that at 2 leaves and 3 leaves, respectively. The recognition accuracy for 2 leaves of Barnyard grass was 0.005 lower than that for 3 leaves, the smallest gap. The recognition accuracy for 2 leaves of Abutilon theophrasti Medicus was the lowest among all categories, with a value of 0.887.
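The leaf-age accuracy described above is a per-image exact-match rate: a prediction counts as correct only when the computed leaf count equals the annotated one. A small sketch of that comparison (function name is ours):

```python
def leaf_count_accuracy(predicted_counts, labelled_counts):
    """Fraction of images whose predicted leaf count equals the leaf-age
    label recorded at data collection, as used for Figure 13."""
    correct = sum(p == t for p, t in zip(predicted_counts, labelled_counts))
    return correct / len(labelled_counts)

# Toy example: 3 of 4 images counted correctly.
acc = leaf_count_accuracy([2, 3, 4, 3], [2, 3, 3, 3])  # 0.75
```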

Discussion
BlendMask combines top-level and bottom-level information. The top level corresponds to a broader receptive field, such as the posture of the whole weed plant. Because the top level provides a rough prediction, the resolution of the top-level ROI is relatively small, with a general maximum of 14. Compared with the fixed output resolution of Mask R-CNN, the output resolution of BlendMask can be higher because its backbone is not limited by FPN. The bottom level carries more detailed information, such as the position and centre of the weed, and can thus retain better position information. In our study, the leaves and plant centre, which belong to such detailed information, must be segmented accurately, so we set a higher bottom-level resolution; this increase did not make the prediction speed too slow. The result is a good balance between prediction speed and precision.
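The blend step that combines the two levels can be sketched in numpy: a per-instance top-level attention map is softmax-normalized across the K bases and used to weight the bottom-level bases. This is a simplified sketch of the published BlendMask blending operation, not the authors' code; interpolation of the attention map from its coarse grid to the base resolution is omitted:

```python
import numpy as np

def blend(bases, attention):
    """Blend K bottom-level bases (K, M, M) with a per-instance top-level
    attention map (K, M, M): softmax over K, elementwise multiply, sum."""
    a = np.exp(attention - attention.max(axis=0, keepdims=True))
    a /= a.sum(axis=0, keepdims=True)   # softmax over the K bases
    return (a * bases).sum(axis=0)      # (M, M) instance mask logits

K, M = 4, 8
mask = blend(np.random.rand(K, M, M), np.random.rand(K, M, M))
```

With a uniform attention map the output reduces to the mean of the bases; a peaked attention map selects the base carrying the relevant detail, which is how the coarse top-level posture cue steers the fine bottom-level position information.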
In the case of data enhancement, in terms of the F1 value, the recognition accuracy for the Solanum nigrum leaf was the highest. According to the confusion matrix, most Solanum nigrum leaves are classified as Solanum nigrum leaves, and only a small proportion are classified as Abutilon theophrasti Medicus leaves, Barnyard grass leaves, or background; thus, the corresponding recall rate is high.
Furthermore, only a small proportion of Abutilon theophrasti Medicus leaves are classified as Solanum nigrum leaves, leading to a high precision rate. In summary, the F1 value is the highest. Solanum nigrum is an annual herb with oval leaves. Moreover, this weed has a large number of leaves, which allows the model to learn more features. Because the model can extract sufficient features from the Solanum nigrum leaves, Solanum nigrum exhibits a high recognition accuracy. In ResNet50, the recognition accuracy of Barnyard grass leaves is higher than that of Abutilon theophrasti Medicus leaves.

The front, side, and top view images were obtained to comprehensively capture the information of the weeds in the dataset and thereby obtain the leaf age and plant centre of the weeds. Acquiring three views is a common practice in plant phenotype studies. The field plant perspectives in the images comprise the front, top, and side views, and the information of any oblique angle can be fused from the information of these orthogonal angles. Among the three orthogonal angles, the top view provides the most comprehensive weed phenotype information; specifically, the plant centre of the weeds can be identified more clearly, whereas the side and front views cannot clearly show the plant centre. Consequently, the detection accuracy for the top view angle is higher than that for the other angles. Nevertheless, when intelligent agricultural equipment is employed in the field, the camera is usually fixed at an angle, although the positions and shapes of the weeds in the field are complex and changeable. When the machine is moving, the imaging angle of the weeds changes, and the imaging angle differs owing to the different positions of the weeds. The information of the side and front views is exposed at certain angles; therefore, images of the side and front views can help the model accurately segment the weeds.
Constructing datasets from different perspectives can enable the model to adapt to the job requirements of different scenarios. The presented findings can provide a reference for future vision acquisition systems of intelligent agricultural robots.
In this study, we identified individual plant images of three weeds in the field; however, weeds are visual objects with complex structures and rich texture features, and even the same species may differ considerably in morphology and colour. Solanum nigrum achieves a higher recognition accuracy at 3 leaves: because the recognition accuracy of Solanum nigrum leaves is higher, the calculated leaf count is also more accurate. In addition, Solanum nigrum leaves mostly grow from the same centre, and fewer leaves are shaded below. The recognition accuracy of Barnyard grass at 4 leaves is lower than that at 2 leaves and 3 leaves because, at the 4-leaf stage, small leaves mostly appear at the bottom of the main stem; these leaves are small and easily concealed by the upper leaves, which greatly complicates leaf counting. The leaves of Abutilon theophrasti Medicus are elliptical and similar to those of Solanum nigrum, causing some leaves to be misidentified.
The recognition accuracy of Abutilon theophrasti Medicus at 4 leaves is higher than that of the other two weeds because there are more leaves at this stage, the petiole of Abutilon theophrasti Medicus is longer, and the lower leaves are not easily blocked.
At present, the treatment of weeds in the field mostly involves weed classification and detection. However, weed classification can determine only the species of the weeds; the specific position coordinates cannot be obtained, so the exact target cannot be sprayed. Weed detection can facilitate the drawing of the bounding box of the weeds; however, weeds exhibit irregular shapes and sizes, which may cause the machine to miss the target, resulting in some herbicide falling to the ground without being absorbed by the weeds. This aspect may lead to environmental pollution and wastage of the herbicide. As a class of deep learning models, instance segmentation can detect the target pixel by pixel, thereby solving the problems of blade adhesion and occlusion. Moreover, the leaf age of the weeds and the position of the plant centre can be obtained accurately. In the Northeast Plain of China, the main economic crops are maize, soybeans, and wheat, which are susceptible to annual and perennial weeds. Controlling the annual and perennial weeds can increase the crop yields and reduce the likelihood of damage caused by the weeds in the second year [58]. Moreover, studying the interaction between the plant phenotype and vision through effective phenotypic analysis can help provide information regarding plant growth and morphological changes.
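For targeted spraying, a segmented plant-centre mask must ultimately be reduced to a single coordinate. One straightforward option, shown here as an illustration rather than as the authors' method, is the centroid of the mask pixels:

```python
import numpy as np

def mask_centroid(mask):
    """Row/column centroid of a boolean instance mask; one way to turn a
    segmented plant-centre mask into a spray-target coordinate."""
    ys, xs = np.nonzero(mask)
    return float(ys.mean()), float(xs.mean())

# Toy example: a 3x3 blob centred at pixel (2, 2) in a 5x5 image.
m = np.zeros((5, 5), dtype=bool)
m[1:4, 1:4] = True
cy, cx = mask_centroid(m)
```

Because the centroid is computed pixel by pixel over the mask, it remains meaningful even for the irregular shapes that defeat bounding-box targeting.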
A limitation of this study is that the recognition speed of the proposed method is low, although TensorRT can be considered for acceleration. The efficiency must be further improved before the approach can be applied in engineering practice. The employed DCNN model was used to segment only three kinds of weeds. If more kinds of weeds are considered, images of field crops are collected and segmented, and the number of datasets is increased, the model can achieve a higher segmentation accuracy. Moreover, the obtained leaf age of economic crops can provide a basis for crop fertilization. For certain plants, the plant centre is the pollination area of the flowers, and segmentation of this part can provide valuable guidance for subsequent studies. Future research will focus on evaluating image datasets covering a wider range of weed and crop varieties. In addition, most of the images used for model testing contained only single-plant weeds, and only a few images contained multiple weeds. The BlendMask framework failed to segment weeds near the edges in a few test images containing multiple weeds. In this case, continuous video input can help eliminate the edge effects when the method is applied in the field.
The results show that combining weed phenotypic information with computer vision can effectively address complex field conditions such as light changes, leaf occlusion, and mixed leaf ages. The proposed models and methods can be applied to the study of different types of plants. The data for this study were obtained from a complex field environment, whereas in previous studies on plant phenotypes, the data were obtained in indoor environments, in which the image background is often pure and the illumination is uniform. Studying the field environment can help make the model more suitable for practical applications, and the shooting angle of the dataset determines the amount of target image information obtained; therefore, it is meaningful to study the segmentation results under different shooting angles. Only a few of the existing studies on plant phenotypes are specific to weed phenotypes. However, weeds of different leaf ages require different doses of herbicides; therefore, obtaining the information of weed leaf ages is significant for reducing the amount of herbicides. To facilitate practical application, in future work we can deploy the trained model on the mobile platform of the spray system used for weeding, which can promote the development of precision agriculture and intelligent agriculture.

Conclusions
This paper proposes a weed phenotype segmentation method based on BlendMask to determine the species, leaf age, and plant centre of weeds. In the field of plant phenotype research, the determination of weed phenotypes under complex field environments is a substantial challenge. According to the research status at home and abroad, the leaf age and plant centre are key phenotypic information of weeds. In this study, we identified the leaf age and plant centre, which are significant for realizing targeted weeding. The model performance could be enhanced through data enhancement. In addition, the weed images obtained from the top view angle corresponded to an enhanced model performance. Weed datasets were constructed through data collection from three angles and data enhancement, and the datasets contained weed information corresponding to different growth stages, angles, and types. The dataset and research results can function as valuable resources for future plant phenotype research.
Because the DCNN can extract features from complex environments, it can effectively address the problems of complex images in the field. The experimental results show that despite the interference of the straw and crop leaves in the background of the weeds in the field, the BlendMask model using ResNet101 as the backbone network can realize accurate segmentation of the weeds with a satisfactory segmentation performance. Future research will focus on evaluating image datasets that cover a wider range of weed and crop varieties. Moreover, the identification efficiency of the proposed approach is low; thus, the model efficiency must be enhanced, and the trained model must be applied to the mobile platform of the spray system used for weeding. The proposed study combines artificial intelligence technology with agronomic research concepts, and the findings can facilitate the development of intelligent agriculture. Given that the data used in this study were self-collected, the dataset is being further improved and is thus unavailable at present.

Figure 1
Process flow of using BlendMask to segment the leaf age and plant centre. Data enhancement. Note: The size of the original images is 4032×3024. After cropping the images to a size of 3024×3024, the images are resized to 1024×1024. The collected images include positive and negative samples. The acquired images are brightened and darkened by 10% and subjected to added noise and random rotation. Finally, the dataset is randomly divided into training and verification sets at a ratio of 8:2.
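The enhancement steps in the note above (square crop, brightness shifts, added noise) can be sketched in numpy; the resize, arbitrary-angle rotation, and 8:2 split are omitted, and all function names are illustrative rather than the authors' pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def centre_crop_square(img):
    """Crop an HxWx3 image to its central square (e.g. 4032x3024 -> 3024x3024)."""
    h, w = img.shape[:2]
    s = min(h, w)
    y0, x0 = (h - s) // 2, (w - s) // 2
    return img[y0:y0 + s, x0:x0 + s]

def adjust_brightness(img, factor):
    """Brighten (factor > 1) or darken (factor < 1), e.g. factor = 1.1 or 0.9."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def add_noise(img, sigma=5.0):
    """Add Gaussian pixel noise; sigma is an assumed strength, not from the paper."""
    noise = rng.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

# Toy 40x30 grey image: crop to 30x30, then brighten by 10%.
out = adjust_brightness(centre_crop_square(np.full((40, 30, 3), 100, np.uint8)), 1.1)
```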

Figure 6
Model structure of Mask R-CNN.

Figure 7
Model structure of BlendMask. Detection results for different instance segmentation models. Note: The backbone network of the six models (YOLACT, PolarMask, BlendMask, CentreMask, SOLO, and Mask R-CNN) is ResNet101. "Without data enhancement" and "Data enhancement" refer to the models trained using the 4000-image unenhanced dataset and the 6000-image enhanced dataset, respectively. When the IOU threshold is greater than or equal to 0.5 and 0.7, the mAP is defined as AP50 and AP70, respectively.

Figure 11
Confusion matrices of the detection results of ResNet50 and ResNet101 in the case of data enhancement. Note: The BlendMask models with the ResNet50 and ResNet101 backbones are denoted as ResNet50 and ResNet101, respectively. a_leaf, b_leaf, and c_leaf represent the leaves of Solanum nigrum, Barnyard grass, and Abutilon theophrasti Medicus, respectively; a, b, and c represent the Solanum nigrum, Barnyard grass, and Abutilon theophrasti Medicus plants, respectively; and centre represents the plant centre of each weed.

Figure 12
Detection results of the BlendMask with pretrained networks in the case of data enhancement. Note: The BlendMask models with the ResNet50 and ResNet101 backbones are denoted as ResNet50 and ResNet101, respectively. Centre represents the plant centre of each weed.