This section describes the methods used to identify stomata on the leaves of living plants. It is divided into three parts: image acquisition and pre-processing, the target detection network, and evaluation; the overall block diagram is shown in Fig. 1. For image acquisition and pre-processing, a Keyence VHX-2000 digital microscope was used to acquire images of the subepidermal stomata of living black poplar leaves. The target detection network is based on YOLO-X, with its parameters and structure optimized for the characteristics of leaf stomata. Evaluation uses the metrics commonly employed in target detection, namely accuracy, recall, and mean average precision.
2.1. Image acquisition and preprocessing
One-year-old black poplar trees were selected for acquisition of live-leaf subepidermal stomatal images using a Keyence VHX-2000 digital microscope. Using the VHX-2000's real-time depth-composition technology, full-field, in-focus images of the stomata of living poplar leaves were acquired at a resolution of 1600×1200 with a 1000× lens. Stomatal images were acquired from multiple locations, including the upper, middle, and lower leaves of different branches, as well as the upper, middle, and basal parts of a single leaf. An example of the collected images is shown in Fig. 2, and the collected stomatal images were labeled [9].
In total, we obtained 3160 stomatal images of living plants. Following the findings of Sharada Prasanna Mohanty, we divided the data into training, validation, and test sets in a ratio of 7:1:2 [10], as shown in Table 1.
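As a rough illustration, a reproducible 7:1:2 split can be sketched as follows (a minimal Python sketch; the file names and seed are placeholders, not the authors' actual pipeline):

```python
import random

def split_dataset(paths, ratios=(7, 1, 2), seed=42):
    """Shuffle image paths and split into train/val/test by integer ratio parts."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)  # fixed seed for a reproducible split
    total = sum(ratios)
    n = len(paths)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

train, val, test = split_dataset([f"img_{i:04d}.png" for i in range(3160)])
print(len(train), len(val), len(test))  # 2212 316 632
```

Integer arithmetic avoids floating-point rounding in the split sizes, and every image lands in exactly one subset.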
Table 1
| Classes | Number of Training Data | Number of Testing Data |
| --- | --- | --- |
| Open Stoma | 1241 | 84 |
| Close Stoma | 1919 | 574 |
| Total | 3160 | 658 |
We manually annotated the dataset using the LabelImg software. To obtain a larger dataset, we augmented the data at a ratio of 1:10, using image processing operations such as vertical flipping, mirror flipping, Gaussian blur, brightening, panning, and scaling, as shown in Fig. 3.
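A few of the listed augmentations can be sketched in NumPy (a simplified illustration only, not the authors' pipeline; note that for detection data the box annotations must be transformed along with the images):

```python
import numpy as np

def augment(img, rng):
    """Yield a few simple augmented variants of an image array (H, W[, C])."""
    yield img[::-1, ...]                 # vertical flip ("upside down")
    yield img[:, ::-1, ...]              # horizontal mirror flip
    bright = np.clip(img.astype(np.float32) * 1.2, 0, 255)  # brighten by 20%
    yield bright.astype(img.dtype)
    dx = int(rng.integers(-10, 11))      # random horizontal pan (wraps at edges)
    yield np.roll(img, dx, axis=1)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(120, 160), dtype=np.uint8)
variants = list(augment(img, rng))
print(len(variants))  # 4
```

Here `np.roll` wraps pixels around the border; a production pipeline would pad instead, and would also apply Gaussian blur and scaling, which are omitted for brevity.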
2.2. Target detection network
YOLO-X is a recent target detection network; we optimized and improved it on the basis of its original structure, and the resulting network is shown in Fig. 4. The improved YOLO-X network is divided into four parts: CSPDarknet, FPN, the YOLO Head, and EX-NMS.
CSPDarknet is the backbone feature-extraction network of YOLO-X. Input images are first passed through CSPDarknet, and the extracted features form a set of feature layers. The last three feature layers are fed into the FPN for feature enhancement, which combines feature information at different scales. The FPN uses the same PANet structure as YOLO-v4: features are first up-sampled and fused, then down-sampled and fused again, finally yielding three enhanced feature layers. The YOLO Head is the classifier and regressor of the YOLO-X network; its structure is shown in Fig. 5. For each feature point it makes three determinations: the prediction box, whether an object is present in the prediction box, and the class of the object in the prediction box. Unlike earlier YOLO versions, which used a coupled head in which classification and regression were implemented within a single 1×1 convolution, the YOLO-X head is decoupled into two branches that perform classification and regression separately, with the results combined for the final prediction.
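The decoupled-head idea can be illustrated with a toy NumPy sketch that treats each 1×1 convolution as a per-pixel linear layer (the real YOLO-X head also uses 3×3 convolutions and runs on all three feature layers; the shapes and random weights here are purely illustrative):

```python
import numpy as np

def conv1x1(x, w, b):
    """1x1 convolution on a feature map x (C_in, H, W): a per-pixel linear layer."""
    return np.einsum('oc,chw->ohw', w, x) + b[:, None, None]

def decoupled_head(feat, params):
    """Separate classification and regression branches over a shared stem."""
    stem = np.maximum(conv1x1(feat, *params['stem']), 0)  # shared stem + ReLU
    cls = conv1x1(stem, *params['cls'])                   # (num_classes, H, W)
    reg = conv1x1(stem, *params['reg'])                   # (4, H, W) box offsets
    obj = conv1x1(stem, *params['obj'])                   # (1, H, W) objectness
    return cls, reg, obj

rng = np.random.default_rng(0)
C, H, W, K = 8, 5, 5, 2  # channels, height, width, classes (open/closed)
params = {
    'stem': (rng.normal(size=(C, C)), np.zeros(C)),
    'cls':  (rng.normal(size=(K, C)), np.zeros(K)),
    'reg':  (rng.normal(size=(4, C)), np.zeros(4)),
    'obj':  (rng.normal(size=(1, C)), np.zeros(1)),
}
feat = rng.normal(size=(C, H, W))
cls, reg, obj = decoupled_head(feat, params)
print(cls.shape, reg.shape, obj.shape)  # (2, 5, 5) (4, 5, 5) (1, 5, 5)
```

The point of the decoupling is that classification and box regression get their own parameters instead of sharing one convolution, which is what the two-branch structure in Fig. 5 depicts.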
Although YOLO-X performs Non-Maximum Suppression (NMS) when filtering and decoding the predicted results, a single stoma can still be assigned two states in the output, as shown in Fig. 6.
Therefore, we make a two-part adjustment to YOLO-X. First, in the decoding stage, we replace IoU with the more effective CIoU when calculating the IoU loss (see the equations below), where \({\rho }^{2}\left(b,{b}^{gt}\right)\) is the squared Euclidean distance between the center points of the predicted box and the ground-truth box, \(c\) is the diagonal length of the smallest rectangle enclosing both boxes, \(\alpha\) is a weighting function, and \(v\) measures the consistency of the aspect ratios. Second, since stomata maintain a certain distance from one another and their prediction boxes should not overlap, we apply an additional NMS step to the YOLO-X output to ensure that each stoma corresponds to only one prediction box.
$$CIOU=IOU-\frac{{\rho }^{2}\left(b,{b}^{gt}\right)}{{c}^{2}}-\alpha v$$
$$\alpha =\frac{v}{1-IOU+v}$$
$$v=\frac{4}{{\pi }^{2}}{\left(\arctan\frac{{w}^{gt}}{{h}^{gt}}-\arctan\frac{w}{h}\right)}^{2}$$
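The two adjustments above can be sketched as follows (a minimal Python illustration of the CIoU formula together with a class-agnostic NMS; `class_agnostic_nms` is a hypothetical simplification of the EX-NMS idea, not the authors' exact implementation):

```python
import math
import numpy as np

def iou(a, b):
    """Plain IoU of axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def ciou(a, b):
    """CIoU = IoU - rho^2/c^2 - alpha*v, following the equations above."""
    i = iou(a, b)
    # squared center distance rho^2 and squared enclosing-box diagonal c^2
    rho2 = ((a[0]+a[2])/2 - (b[0]+b[2])/2)**2 + ((a[1]+a[3])/2 - (b[1]+b[3])/2)**2
    c2 = (max(a[2], b[2]) - min(a[0], b[0]))**2 + (max(a[3], b[3]) - min(a[1], b[1]))**2
    # aspect-ratio consistency term v and its weight alpha
    v = (4 / math.pi**2) * (math.atan((a[2]-a[0]) / (a[3]-a[1]))
                            - math.atan((b[2]-b[0]) / (b[3]-b[1])))**2
    alpha = v / (1 - i + v) if (1 - i + v) > 0 else 0.0
    return i - rho2 / c2 - alpha * v

def class_agnostic_nms(boxes, scores, iou_thr=0.5):
    """Keep the highest-scoring box and drop any box overlapping it,
    regardless of predicted class (open/closed), so one stoma -> one box."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = int(order.pop(0))
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thr]
    return keep

# two overlapping predictions ("open" and "closed") on the same stoma
boxes = [(10, 10, 50, 40), (12, 11, 52, 42), (100, 100, 140, 130)]
scores = [0.9, 0.6, 0.8]
print(class_agnostic_nms(boxes, scores))  # [0, 2]
```

In training, the corresponding loss would be `1 - ciou(pred, gt)`; at inference, running NMS across both classes (rather than per class) removes the duplicate open/closed boxes illustrated in Fig. 6.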
The hyperparameters we set for the YOLO-X network are shown in Table 2.
Table 2
Hyper-parameters of the experiments.
| Hyper-Parameters | Value |
| --- | --- |
| Optimization algorithm | SGD |
| Learning rate | 1.0×10⁻³ |
| Epochs | 300 |
| Batch size | 4 |
2.3. Evaluation
In this section, we use three evaluation metrics: stomatal-count recognition accuracy, stomatal opening/closing recognition accuracy, and mean average precision (mAP). These are computed from four base quantities: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). A TP is a positive sample predicted as positive, an FP is a negative sample predicted as positive, a TN is a negative sample predicted as negative, and an FN is a positive sample predicted as negative.
Accuracy: the proportion of correct predictions among all samples, i.e., the accuracy of stomatal opening/closing recognition.
$$accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
Recall: the fraction of all actual positives that are successfully retrieved; here, the fraction of ground-truth stomata that are detected. Since this experiment only identifies stomata, the stomata-detection accuracy is the recall computed without distinguishing stomatal state.
$$recall=\frac{TP}{TP+FN}$$
Precision: the proportion of correct predictions among all samples predicted as positive.
$$precision=\frac{TP}{TP+FP}$$
Mean Average Precision (mAP): a performance metric for algorithms that predict both target locations and categories. The average precision (AP) of one class is the area under its precision-recall curve, and mAP is the mean of AP over all classes.
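For illustration, the base metrics and a simplified AP (the area under the raw precision-recall curve, without the interpolation envelope used by benchmarks such as PASCAL VOC) can be computed as follows; mAP would then be the mean of AP over the two classes (open and closed stomata):

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(scores, is_tp, num_gt):
    """AP as the area under the precision-recall curve, obtained by
    sweeping a confidence threshold over the score-ranked detections."""
    order = np.argsort(scores)[::-1]              # rank by confidence
    hits = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(hits)
    fp_cum = np.cumsum(1.0 - hits)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):           # sum precision * recall steps
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# toy ranked detections: 1 = matched a ground-truth stoma, 0 = false positive
scores = [0.95, 0.9, 0.8, 0.6, 0.5]
is_tp  = [1,    1,   0,   1,   0]
print(round(average_precision(scores, is_tp, num_gt=4), 4))  # 0.6875
```

The detection/ground-truth matching (typically by an IoU threshold such as 0.5) is assumed to have been done already; `is_tp` encodes its result.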