We developed an end-to-end semantic segmentation method that automatically labels each pixel as panicle, leaf or background under natural field conditions, and then calculated the leaf-to-panicle ratio (LPR) by dividing the number of pixels assigned to each class in each field image. Figure 1 shows the overall workflow of this method, which consists of two parts. Part 1 is the offline training workflow, which builds a deep learning network called FPN-Mask to segment panicle and leaf from field RGB images. Part 2 is the GvCrop procedure, in which a software system is developed for calculating LPR.
Experimental setup
In 2018, plots of ongoing field experiments at Danyang (31°54′31″N, 119°28′21″E), Jiangsu Province, China were selected for collecting images for the training dataset. Of note, these experiments were not specially designed for a phenotyping study. In brief, the plant materials of these experiments were highly diverse in genotypic variation, comprising seven main japonica cultivars of Jiangsu and 195 mutants with contrasting agronomical traits as reported by Abacar et al. [25]. Further, the seven cultivars had two sowing dates, resulting in clearly different phenotypes for a given genotype. Thus, the diversity in plant architecture and canopy structure of the tested materials provided a wide range of phenotypes for image analysis.
In 2019, three experiments were conducted to test and apply the proposed FPN-Mask model. (1) Genotypic variations in LPR. A total of 192 mutants were investigated. The plot area was 2.4 m × 1.4 m with a row spacing of 30 cm and a plant spacing of 20 cm. Nitrogen, phosphate (P2O5) and potassium (K2O) fertilizers were applied at rates of 240 kg ha-1, 120 kg ha-1 and 192 kg ha-1, respectively, and were equally split between basal fertilizer (before transplanting) and topdressing (at the 4th leaf age in reverse order). (2) N fertilization effects on LPR. A japonica rice cultivar, Wuyunjing 30, was grown in a field experiment with a randomized complete-block design, three replications and a plot area of 2.4 m × 1.4 m. The total N rate was 240 kg ha-1, and two N fertilization modes with different base/topdressing ratios were applied: (1) N5-5, base/topdressing of 5/5; (2) N10-0, base/topdressing of 10/0. (3) Regulation of LPR by plant growth regulators. Solutions of 100 mM gibberellin, 100 mM uniconazole, 25 mM 24-epibrassinolide and 25 mM brassinazole, as well as the water control, were prepared in distilled water with 0.5% Tween-20. One cultivar from the N experiment, Ningjing 8, was used as the material. Spraying was conducted at a rate of 500 mL m-2 after sunset, three times starting at the booting stage on August 22, with a 2-day interval.
In addition, a dynamic canopy light interception simulating device (DCLISD) was designed to capture images from the sun's position, with the camera installed on a supporting track (Fig. 2). The bottom part consists of four wheeled pillars, and the upper part comprises two arches consolidated by two steel pipes and a movable rail for mounting the RGB camera. The sun's trajectory is simulated by two angles, the elevation angle and the azimuth angle, which are calculated according to the latitude, longitude and growth period at the experimental site.
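For reference, the sketch below shows a textbook-style solar-position calculation of the kind the DCLISD requires. The simplified declination and solar-time model (and the neglect of the equation of time) is our assumption for illustration, not the authors' exact routine.

```python
import math
from datetime import datetime

def solar_angles(lat_deg, lon_deg, when: datetime, tz_meridian_deg=120.0):
    """Approximate solar elevation and azimuth (degrees) from latitude,
    longitude and local clock time. Simplified model: declination from day of
    year, longitude-corrected solar time, equation of time ignored."""
    n = when.timetuple().tm_yday                                    # day of year
    decl = math.radians(23.45 * math.sin(math.radians(360.0 * (284 + n) / 365.0)))
    clock_hours = when.hour + when.minute / 60.0
    solar_time = clock_hours + (lon_deg - tz_meridian_deg) / 15.0   # crude solar time
    hour_angle = math.radians(15.0 * (solar_time - 12.0))
    lat = math.radians(lat_deg)
    sin_elev = (math.sin(lat) * math.sin(decl)
                + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    elev = math.asin(sin_elev)
    cos_az = ((math.sin(decl) - math.sin(lat) * sin_elev)
              / (math.cos(lat) * math.cos(elev)))
    az = math.acos(max(-1.0, min(1.0, cos_az)))                     # from north
    if hour_angle > 0:                                              # afternoon: west of north
        az = 2.0 * math.pi - az
    return math.degrees(elev), math.degrees(az)

# Example: the Danyang site around local noon in late August.
print(solar_angles(31.909, 119.473, datetime(2019, 8, 22, 12, 0)))
```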
Image acquisition
Images of the training dataset were captured in the field experiments in 2018, reflecting large variations in camera shooting angle, solar elevation and azimuth angles, rice genotype, and phenological stage (Fig. 3). Images for validation and application of the proposed model were acquired in 2019. For the three experiments on genotypes, N fertilization, and growth-regulator spraying, a tripod angle of 40° was used. The height of the camera (Canon EOS 750D, 24.2 megapixels) was 167.1 cm, the average height of a Chinese adult, and the distance between the central point of the target area and the vertical projection of the camera on the ground was 90 cm. The camera settings were as follows: focal length, 18 mm; aperture, automatic; ISO, automatic; and exposure time, automatic. In the experiment with the DCLISD, the camera was a SONY DSC-QX100 with the following settings: focal length, 10 mm; aperture, automatic; ISO, automatic; and exposure time, automatic.
Dataset preparation
Training dataset: Taking into consideration camera angle, solar angle, panicle type and growth stage (Fig. 3), we prepared a training dataset of 360 representative images from the 2018 dataset (Table S1). The GG (green panicle with green leaf), YG (yellow panicle with green leaf) and YY (yellow panicle with yellow leaf) growth stages were represented by 113, 104, and 143 images, respectively. Fig. 1(1)-(3) shows the preparation of the training data. Because the original field images are as large as 4864×3648 pixels, they were cropped into patches with sizes between 150×150 and 600×600 pixels using the Paint.NET software. After obtaining these patches, we manually labeled the pixels of each patch as panicle, leaf or background using the Fluid Mask software. Finally, a total of 1896 representative patches were selected as the final training sample set. Among them, 1210 samples were added progressively during the later daily tests of the model. Further, to increase the diversity of the training dataset and avoid overfitting, we applied basic data augmentations to the training set, including random horizontal/vertical flips, rotations by 90 degrees, and histogram equalization. To reduce illumination effects, we also applied random brightness enhancement. All input images were resized to 256×256 pixels and, for faster and more stable training, normalized to [0, 1] [27,28].
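For illustration, a minimal sketch of such an augmentation pipeline using OpenCV and NumPy is given below; the application probabilities and the brightness-jitter range are our assumptions, not values reported here.

```python
import cv2
import numpy as np

def augment(image, mask, rng=np.random.default_rng()):
    """Sketch of the augmentations described above: random flips, 90-degree
    rotations, histogram equalization, random brightness, resize to 256x256
    and normalization to [0, 1]."""
    if rng.random() < 0.5:                                   # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                                   # vertical flip
        image, mask = image[::-1, :], mask[::-1, :]
    k = int(rng.integers(0, 4))                              # rotate by k * 90 degrees
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    image = np.ascontiguousarray(image)                      # OpenCV needs contiguous data
    mask = np.ascontiguousarray(mask)
    if rng.random() < 0.5:                                   # histogram equalization (Y channel)
        ycrcb = cv2.cvtColor(image, cv2.COLOR_BGR2YCrCb)
        ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
        image = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
    if rng.random() < 0.5:                                   # random brightness jitter
        image = np.clip(image.astype(np.float32) * rng.uniform(0.8, 1.2),
                        0, 255).astype(np.uint8)
    image = cv2.resize(image, (256, 256), interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask, (256, 256), interpolation=cv2.INTER_NEAREST)
    return image.astype(np.float32) / 255.0, mask
```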
Testing dataset: We divided all images collected in 2018 into three groups based on the rice growth stage at image acquisition. From each group we randomly selected 30 images, yielding 90 testing images in total (Table S2). Many field images in the testing dataset included extraneous objects, such as tracks, chains, neighboring plots, color charts and sky, which were not required for our approach. Therefore, only a significant region of the plot was selected as the region of interest (ROI), and all selected testing images were cropped manually to exclude the area outside the ROI.
Network structure
In this study, we proposed a deep learning-based method for rice panicle segmentation, called FPN-Mask. The method consists of a backbone network and a task-specific subnetwork. The Feature Pyramid Network (FPN) [29] was selected as the backbone for extracting features over the entire input image. Although originally designed for object detection, the FPN has the advantage of extracting a multi-level feature pyramid from a single-scale input image. The subnetwork is adapted from the Unified Perceptual Parsing network [30] and performs semantic segmentation based on the output of the backbone network (Fig. 4).
Backbone network for feature extraction: The FPN [29] is a standard feature extractor with a top-down pathway and lateral connections. Its bottom-up pathway is based on Residual Networks (ResNet) [31], which consist of four stages; we denote the last feature map of each stage as {C2, C3, C4, C5}. For detailed descriptions of the FPN and ResNet structures, please refer to [29] and [31], respectively. In our backbone network, we removed the max pooling layer before C2 because it discards semantic information. As a result, the down-sampling rates of the stages {C2, C3, C4, C5} were reduced from {4, 8, 16, 32} to {1, 2, 4, 8}, and the down-sampling rates of the feature maps derived by the FPN, {P2, P3, P4, P5}, are likewise {1, 2, 4, 8}; that is, P2 has the same size as the 256×256 input image, P3 is 128×128, P4 is 64×64 and P5 is 32×32. The number of feature maps output at each stage is 32.
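The following PyTorch sketch shows one way to build such a modified ResNet-18/FPN backbone. The use of torchvision's resnet18, the assumption that the stem convolution stride is also reduced to 1 (to obtain the {1, 2, 4, 8} rates), and the interpretation of "32 feature maps" as the FPN output width are ours, not details confirmed by the text.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class FPNBackbone(nn.Module):
    """Sketch of a modified ResNet-18 + FPN backbone."""

    def __init__(self, fpn_channels=32):
        super().__init__()
        net = resnet18(weights=None)
        # Assumption: stem conv stride reduced to 1 and the max pooling layer
        # before C2 removed, giving the stated {1, 2, 4, 8} down-sampling rates.
        net.conv1.stride = (1, 1)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu)    # max pool removed
        self.stages = nn.ModuleList([net.layer1, net.layer2,       # C2, C3
                                     net.layer3, net.layer4])      # C4, C5
        stage_channels = [64, 128, 256, 512]                       # ResNet-18 widths
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, fpn_channels, kernel_size=1) for c in stage_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(fpn_channels, fpn_channels, kernel_size=3, padding=1)
            for _ in stage_channels)

    def forward(self, x):
        feats, out = [], self.stem(x)
        for stage in self.stages:                                  # bottom-up pathway
            out = stage(out)
            feats.append(out)                                      # [C2, C3, C4, C5]
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):                 # top-down pathway
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [sm(p) for sm, p in zip(self.smooth, laterals)]     # [P2, P3, P4, P5]
```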
Subnetwork for semantic segmentation: The subnetwork operates on the multi-level features extracted by the backbone network introduced above. The features from each level are fused into a single input feature map for semantic segmentation, which has been shown to outperform using only the highest-resolution feature map [30, 32]. To up-sample the lower-resolution feature maps {P3, P4, P5} to the same size as the original image, we directly adopted bilinear interpolation layers instead of time-consuming deconvolution layers, and attached a convolution layer after each interpolation layer to refine the interpolated result. After up-sampling, the features from the different levels were concatenated into the final semantic feature. The concatenated multi-level features were then passed through a convolution layer to refine the result and a convolution layer to reduce the channel dimension; each convolution layer was followed by a batch normalization layer and a ReLU layer. Finally, we obtained a 3-channel semantic segmentation output, with the channels representing background, leaf and panicle, respectively.
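A matching, purely illustrative sketch of the segmentation subnetwork is given below: bilinear up-sampling of the pyramid levels, a refinement convolution per level, concatenation, a fusion convolution with batch normalization and ReLU, and a final convolution producing the 3-channel output. The FPNMask wrapper that joins backbone and head is our own naming.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    """Fuses P2..P5 and predicts the 3 classes (background, leaf, panicle)."""

    def __init__(self, fpn_channels=32, num_classes=3):
        super().__init__()
        # One refinement convolution after each bilinear up-sampling step.
        self.refine = nn.ModuleList(
            nn.Conv2d(fpn_channels, fpn_channels, 3, padding=1) for _ in range(4))
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * fpn_channels, fpn_channels, 3, padding=1),  # refine result
            nn.BatchNorm2d(fpn_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(fpn_channels, num_classes, 1))                  # reduce channels

    def forward(self, pyramid):                        # pyramid = [P2, P3, P4, P5]
        size = pyramid[0].shape[-2:]                   # P2 has the input resolution
        ups = [conv(F.interpolate(p, size=size, mode="bilinear", align_corners=False))
               for conv, p in zip(self.refine, pyramid)]
        return self.fuse(torch.cat(ups, dim=1))        # (N, 3, H, W) class scores

class FPNMask(nn.Module):
    """Backbone plus head, mirroring the two-part structure described above."""

    def __init__(self):
        super().__init__()
        self.backbone, self.head = FPNBackbone(), SegmentationHead()

    def forward(self, x):
        return self.head(self.backbone(x))
```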
Loss function for semantic segmentation
The cross-entropy loss is the standard loss function for classification [33]. In practice, however, the numbers of pixels in the different categories are highly unbalanced, so the loss computed by the cross-entropy function is dominated by the majority classes [34]. For this reason, we used the focal loss, which is specifically designed to address this imbalance [34] and focuses training on the harder-to-classify locations by re-weighting the different categories. For a detailed description, refer to [34].
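A minimal pixel-wise multi-class focal loss sketch in PyTorch is shown below; gamma = 2 is the default from Lin et al. [34], and the value actually used in this study is not stated here, so it should be treated as an assumption.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    """Pixel-wise multi-class focal loss (sketch).
    logits: (N, C, H, W) raw class scores; target: (N, H, W) labels in {0..C-1}."""
    log_p = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_p, target, reduction="none")   # per-pixel cross-entropy
    p_t = torch.exp(-ce)                               # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()          # down-weight easy pixels
```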
Training
We experimented with ResNet-18 as the FPN backbone. All convolutional layers were initialized as in He et al. [35]. Batch normalization layers were initialized with weight 1 and bias 0. The mini-batch size was 24, optimization used the Adam method, and training lasted for 7 days with a base learning rate of 0.001. All experiments in this article were conducted on a high-performance computer with an Intel 3.50 GHz processor and 128 GB of memory. Two NVIDIA GeForce 1080 graphics processing units (GPUs) with 12 GB of memory were used to accelerate the training of our model.
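The skeleton below illustrates these settings (Adam, base learning rate 0.001, mini-batch size 24) using the FPNMask and focal_loss sketches above; the number of epochs is a placeholder, since training duration is reported in days rather than epochs.

```python
import torch
from torch.utils.data import DataLoader

def train(model: FPNMask, train_set, num_epochs: int = 50):
    """Training-loop skeleton with the reported optimizer settings."""
    model = model.cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loader = DataLoader(train_set, batch_size=24, shuffle=True)
    for _ in range(num_epochs):
        for images, masks in loader:            # masks: per-pixel labels in {0, 1, 2}
            logits = model(images.cuda())
            loss = focal_loss(logits, masks.cuda())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```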
During training, we tested the model performance on all of the collected images and selected supplementary training samples from the images on which the model performed poorly, to ensure that the training samples covered all cases in the 6 GB of images obtained in 2018 (except the 90 testing images). In total, 60 field images generated 302 patches that were added as supplementary training samples, about 40 samples per day. Performance (good or bad) was judged by visual observation. Training continued until the test results for all images visually met the accuracy requirements and the loss curve was smooth, without fluctuations.
Post-processing
Although a deep network is well suited to semantic segmentation problems, automatic segmentation alone cannot achieve 100% accuracy. Therefore, a tool for manually modifying the segmentation results is necessary. To address this, we developed software called GvCrop, which integrates not only the pixel-wise segmentation method (Fig. 1(6)) but also the ability to modify the segmentation results interactively (Fig. 1(7)). Because relabelling wrong locations at the pixel level is time-consuming, processing image regions with homogeneous characteristics instead of single pixels accelerates manual labelling (Fig. 1(7)). Based on image color and boundary cues, we used the gSLICr algorithm [36] to group pixels into perceptually homogeneous regions. gSLICr is an implementation of Simple Linear Iterative Clustering (SLIC) [37] on the GPU using the NVIDIA CUDA framework, and is 83× faster than the CPU implementation of SLIC. gSLICr has three parameters: S, the superpixel size; C, the compactness coefficient; and N, the number of iterations. In our work, S was set to 15, C to 0.2, and N to 50. After superpixel segmentation, users can modify the automatic segmentation results superpixel by superpixel.
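Because gSLICr itself is a CUDA/C++ library, the sketch below uses scikit-image's CPU SLIC as a stand-in to illustrate the superpixel-assisted correction step; the compactness value (scikit-image's scale differs from gSLICr's C = 0.2) and the click-based interface are assumptions.

```python
import numpy as np
from skimage.segmentation import slic

def correct_with_superpixels(image, pred_mask, click_rc, new_class,
                             sp_size=15, compactness=10.0):
    """Reassign every pixel of the clicked superpixel to `new_class`.
    image: RGB array; pred_mask: per-pixel class labels; click_rc: (row, col)."""
    h, w = pred_mask.shape
    n_segments = (h * w) // (sp_size * sp_size)        # ~15x15-pixel superpixels
    segments = slic(image, n_segments=n_segments,
                    compactness=compactness, start_label=0)
    sp_id = segments[click_rc]                         # superpixel under the click
    corrected = pred_mask.copy()
    corrected[segments == sp_id] = new_class
    return corrected
```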
Accuracy assessment
To quantify the performance of our method, we evaluated the semantic segmentation results by calculating the pixel accuracy (P.A.) (Eq. 1) and the mean intersection-over-union (mIoU) (Eq. 2), which are standard metrics for semantic segmentation tasks [30]. P.A. is the proportion of correctly classified pixels among all pixels, and mIoU is the intersection-over-union (IoU) between the ground truth and the predicted pixels, averaged over all classes:

$$\mathrm{P.A.} = \frac{\sum_{i=1}^{n} p_{ii}}{\sum_{i=1}^{n}\sum_{j=1}^{n} p_{ij}} \tag{1}$$

$$\mathrm{mIoU} = \frac{1}{n}\sum_{i=1}^{n}\frac{p_{ii}}{\sum_{j=1}^{n} p_{ij} + \sum_{j=1}^{n} p_{ji} - p_{ii}} \tag{2}$$

where $n$ is the number of classes and $p_{ij}$ is the number of pixels of class $i$ predicted to belong to class $j$; thus, for class $i$, $p_{ii}$ are true positives, $p_{ij}$ ($j \neq i$) false negatives, $p_{ji}$ false positives, and $p_{jj}$ ($j \neq i$) true negatives.
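For completeness, a short sketch of how Eqs. (1) and (2) can be computed from a confusion matrix built from the ground-truth and predicted label maps:

```python
import numpy as np

def pa_and_miou(pred, truth, n_classes=3):
    """Compute P.A. (Eq. 1) and mIoU (Eq. 2) from per-pixel label maps."""
    # confusion[i, j] = number of pixels of class i predicted as class j (p_ij)
    confusion = np.zeros((n_classes, n_classes), dtype=np.int64)
    for i in range(n_classes):
        for j in range(n_classes):
            confusion[i, j] = np.sum((truth == i) & (pred == j))
    pa = np.trace(confusion) / confusion.sum()
    iou = np.diag(confusion) / (confusion.sum(axis=1) + confusion.sum(axis=0)
                                - np.diag(confusion))
    return pa, iou.mean()
```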
Calculation of leaf-panicle ratio (LPR)
The software GvCrop was developed to calculate the LPR from the numbers of pixels classified as leaf (L) and panicle (P) in an image, i.e., LPR = L / P.
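As a final illustration, the LPR can be computed directly from a segmented label map; the class indices used here (background = 0, leaf = 1, panicle = 2) are an assumption for the sketch.

```python
import numpy as np

LEAF, PANICLE = 1, 2                 # assumed class indices (background = 0)

def leaf_panicle_ratio(mask: np.ndarray) -> float:
    """LPR = number of leaf pixels divided by number of panicle pixels."""
    leaf = int(np.sum(mask == LEAF))
    panicle = int(np.sum(mask == PANICLE))
    return leaf / panicle if panicle else float("nan")
```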