Segmentation of Tomato Growing Truss Using a Depth Image Conversion Model Based on CycleGAN


Background: The truss on a tomato plant is a group or cluster of smaller stems where flowers and fruit develop, while the growing truss is the most extended part of the stem. Because the state of the growing truss reacts sensitively to the surrounding environment, it is essential to control the growth in the early stages. With the recent development of IT and artificial intelligence technology in agriculture, a previous study developed a real-time acquisition and evaluation method for images using robots. Furthermore, image processing was used to locate the growing truss and flower clusters to extract growth information such as the height of the flower cluster and the stem diameter. Among the different vision algorithms, the CycleGAN algorithm was used to generate and transform unpaired images through generative learning. In this study, we developed a robot-based system for simultaneously acquiring RGB and depth images of the tomato growing truss and flower clusters.

Results: The segmentation performance for approximately 35 samples was compared through the false negative (FN) and false positive (FP) indicators. For the depth camera image, FN was 17.55 ± 3.01% and FP was 17.76 ± 3.55%. Similarly, for CycleGAN, FN was approximately 19.24 ± 1.45% and FP was 18.24 ± 1.54%. When image processing was performed on the depth image, the IoU was 63.56 ± 8.44%, whereas segmentation through CycleGAN yielded an IoU of 69.25 ± 4.42%, indicating that CycleGAN is advantageous for extracting the desired growing truss.

Conclusions: Scannability was confirmed when the image scanning robot drove in a straight line through the plantation in the tomato greenhouse, which confirmed the on-site feasibility of the image extraction technique using CycleGAN. In the future, the proposed approach is expected to be used in vision technology to scan tomato growth indicators in greenhouses using an unmanned robot platform.


Introduction
In crops, the growing tip and the roots, where cell division occurs, are sensitive to the surrounding environment. In particular, the hypertrophy of early reproductive growth of crops can be determined from the state of the growing truss [1], which can also affect the quality of flowers and fruits. Although experts can determine hypertrophy with the naked eye, this makes the collection of accurate numerical data difficult and creates various disadvantages in setting crop management standards. While studies on analyzing crop diseases using digital imaging of tomato crops are being actively conducted, few have measured the indicators related to tomato growth. In the case of the growing truss, it is difficult to derive numerical information from the image and to determine a label value, considering the lack of reference images.
The development of a future-oriented agricultural robot platform is expected to reduce the challenges in acquiring image data comprising growth information. Singh et al. (2020) developed a mechanical robot arm with a high degree of freedom and an intelligent control unit that moves the arm by judging the captured image. Research is also underway on recognizing objects from a diversified view by placing the robot arm in a more advantageous position [2,3]. Chang et al. (2015) reported the use of image processing techniques such as color space transformations, morphological operations, and 3D localization to identify objects and grippers in captured images and estimate their relative positions, as a novel computer vision algorithm for extracting the object before determining the movement of the robot arm. In agriculture, measuring growth using computer vision has been in progress for a relatively long time [5,6]. In particular, robots are used in harvesting, and various image processing techniques have been applied to extract fruit and determine ripeness [7,8]. Zhuang et al. (2019) proposed a computer-vision-based method for locating acceptable picking points of litchi clusters, in which an image processing algorithm tracks the location of the fruit while considering the agronomic characteristics of the picking point.
Although it is necessary to apply image processing techniques suited to the characteristics of the crop, no image segmentation method has been developed to determine the growing truss in tomato plants, and segmentation of tomato stems and leaves in RGB images has not been able to distinguish overlapping surrounding objects. Because the tomato cultivation environment inside a greenhouse is dense, classifying stems or leaves using images is difficult [10,11]. As a related study, Xiang (2018) performed crop segmentation through a simplified pulse-coupled neural network using 385 tomato images measured at night. The best results obtained through the segmentation technique confirmed best and false rates of 59.22% and 13.77%, respectively. However, the method was limited in that it required a specific light at night for light correction, so measuring the growing truss of tomatoes would require more mechanical devices and technical improvements. Zhang and Xu (2018) reported a method for improving the accuracy of image segmentation in the middle and late stages of fruit growth using an unsupervised method. However, their segmentation of tomato stems and leaves did not distinguish overlapping surrounding objects, so it did not show this possibility in RGB images. Many studies have targeted the fruit for identification of tomatoes using RGB images, and many results on the possibility in tomato cultivation have been reported, but successful segmentation studies on tomato stems at growing points have yet to be reported.
One potential solution to this problem is a 3D camera capable of segmentation according to distance, an approach less affected by the solar light conditions inside a greenhouse. 3D depth cameras are widely used in image acquisition platforms for recognizing objects in various industries, including agriculture [14-16]. It has been reported that a technology combining depth and color image information through a stereo camera, one of the 3D camera technologies, can be applied, and that segmentation of objects can be performed on real images recorded with a stereo camera [17,18]. Unlike conventional 2D cameras, 3D depth cameras can be deployed in the field and used to calculate the depth value of each pixel in an image, and research on growth measurement using 3D cameras is underway.
Deep learning image processing technology has advanced in recent years. For instance, in image recognition and classification, studies using convolutional neural networks (CNNs) have been effectively applied to various industrial fields [19-22]. Mask-RCNN, which recognizes objects at high speed and is specialized for segmentation, is particularly promising. As a related study, Afonso et al. (2020) used Mask-RCNN for tomato fruit recognition and confirmed its potential inside a greenhouse. Such CNNs take the form of general supervised learning, requiring annotation of the region of interest (ROI) in all image data, and the accuracy of the model depends to some extent on the quantity and quality of the data obtained. Therefore, it is important not only to develop a robot platform to extract accurate images in an automated greenhouse, but also to apply an algorithm that can learn on its own with an appropriate number of images.
Generative adversarial networks (GANs) have gained particularly wide attention [24,25]. The basic GAN configuration is a deep learning architecture that trains a discriminator and a generator model simultaneously to obtain the target image from the generator, showing great promise in unsupervised learning. The recently devised CycleGAN is trained to convert between two unpaired image domains by cycling two generators and two discriminators [26,27]. A representative application of CycleGAN is a study in which the pattern of a zebra was converted to that of an ordinary horse. Studies have reported [28] that this technology is capable of switching between two image modalities, that is, a photo with depth information and a general image with RGB information. Furthermore, unlike other CNN algorithms, CycleGAN learns while generating images through self-supervision, and the number of labeled image data required is relatively small. This is expected to enable efficient algorithm application using relatively little data in environments where data acquisition is difficult, such as a greenhouse.
Considering these points, current research lacks detection technology for determining the tomato growing point, and systematic study of this problem is needed. For image acquisition using an unmanned robot, extraction of the tomato growing truss must be performed on-site, which requires segmentation using depth image information. Therefore, the specific objectives of this study are: 1) To identify the tomato growing truss image, a robot-based monitoring system that can measure the height of the growing truss was built, and with it, RGB and depth images of the growing truss were secured.
2) Using the obtained images, a conversion model between RGB and depth was created through CycleGAN and combined with image processing techniques to segment, without overlap, the growing truss of tomatoes growing in the greenhouse.

Greenhouse environment and image acquisition device
The experiment was conducted in a greenhouse facility where tomatoes are grown. A 2000 m² multi-span Venlo greenhouse was used, equipped inside with sensors and control systems to manage the carbon dioxide level at constant temperature and humidity. The greenhouse is located at latitude 37.7986 and longitude 128.8575. We used Dafnis tomatoes for the experiment, and images were collected approximately 180 days after planting, during the harvest stage. The tomatoes were grown in a drip-irrigation-based hydroponics system, with the nutrient solution supplied through solar proportional irrigation control. The roots were established in a rock-wool substrate, and the substrate and the gutter supporting it were located at a height of about 1.3 m from the ground. The growing truss was located 1.6-2.5 m above the gutter on inducer lines, its position determined by the line work of the farmer.
To acquire the images, we used a vehicle on a robot platform capable of driving automatically in the greenhouse, and a UR5 robot arm (Universal Robots, Odense, Denmark) was used as a manipulator to fix the photographing unit at the position of the tomato growing truss. The manipulator was adjusted manually in the field, and the position of the camera was kept constant at the center of the line. The image acquisition unit, a RealSense D435i camera (Intel, Santa Clara, CA, USA), acquired RGB and depth images. The maximum resolution of both image streams is 1600 × 800. The measured images were collected on a mini Windows PC (NUC, Intel, Santa Clara, CA, USA) and saved using a Python program. Figure 1 shows a photograph of the robot platform and the measurement module used.

The CycleGAN structure
The GAN is said to be successful when an adversarial loss makes the generated image indistinguishable from the actual photo. This loss is particularly powerful for image-creation tasks considering that most computer graphics aim at achieving such optimization [27]. The objective of the CycleGAN model is to learn the mapping functions between two domains X and Y using the given training samples {x_i} where x_i ∈ X and {y_j} where y_j ∈ Y, with data distributions x ~ p_data(x) and y ~ p_data(y). Zhu et al. introduced two cycle consistency losses [Figure 3(a)], indicating that when an image is converted from one domain to the other and back, the starting point x must be recovered, and vice versa. The forward cycle consistency loss is given as:

x → G(x) → F(G(x)) ≈ x,    L_cyc_fwd(G, F) = E_{x~p_data(x)}[||F(G(x)) − x||_1],

and the backward cycle consistency loss is

y → F(y) → G(F(y)) ≈ y,    L_cyc_bwd(F, G) = E_{y~p_data(y)}[||G(F(y)) − y||_1].
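As a minimal numerical sketch of this objective, the two cycle-consistency terms can be computed as L1 reconstruction errors. The generator functions G and F below are hypothetical stand-ins for illustration, not the trained networks of this study:

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """Forward and backward cycle-consistency terms (L1), as in CycleGAN.
    G maps domain X -> Y (e.g. RGB -> depth); F maps Y -> X."""
    forward = np.mean(np.abs(F(G(x)) - x))   # x -> G(x) -> F(G(x)) ~ x
    backward = np.mean(np.abs(G(F(y)) - y))  # y -> F(y) -> G(F(y)) ~ y
    return forward + backward

# Toy check with identity generators: the loss must be exactly zero.
x = np.random.rand(4, 8, 8, 3)
y = np.random.rand(4, 8, 8, 3)
assert cycle_consistency_loss(x, y, lambda a: a, lambda a: a) == 0.0
```

During training this term is added to the adversarial losses of the two discriminators, penalizing generators that drift away from a reconstructable mapping.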
As seen in Figure 3, the RGB and depth images were obtained from the robot platform and the acquisition unit. As seen in Figure 4 (left), the normal RGB image is similar to an image obtained from an ordinary camera. Figure 4 (right) shows the same scene with the depth technology applied, in which the distance between the camera and each object is displayed using a color map.
Using CycleGAN learning, we constructed a model that converts RGB images to depth images and vice versa, as seen in Figure 4. The model was configured using approximately 356 sample images of the tomato growing truss at the fruit growing stage, acquired from the image acquisition device. Of the 356 sample images, 276 were used to train the CycleGAN model and 80 were used for testing.
Each CycleGAN generator comprises three sections: the encoder, the transformation, and the decoder. Figure 5 shows the components of each generator section. The 1600 × 900 pixel image used in this study was obtained as a raw value and resized to 512 × 512 pixels. First, the resized input image is fed to the encoder, comprising three convolutional layers that increase the number of channels and decrease the representation size. The activated result is passed to the transformation, a series of eight ResNet blocks that efficiently transfer information in the CNN structure and are therefore well suited for the transformation layer. The transformation result is then expanded by the decoder, comprising two transpose convolutions that enlarge the representation size and one output layer that produces the final RGB image. Each layer is followed by an instance normalization and ReLU layer, which are omitted here for simplicity. Furthermore, we built a discriminator that takes images and predicts whether they are real or fake: a real image is an actual RGB or depth image, and a fake image is one generated by CycleGAN.
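The flow of representation sizes through such a generator can be traced with a short calculation. The kernel sizes, strides, and paddings below are assumptions taken from the common CycleGAN reference implementation (the paper does not state them); only the layer counts and the 512 × 512 input follow the text:

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution layer."""
    return (size + 2 * pad - kernel) // stride + 1

def deconv_out(size, kernel, stride, pad, out_pad):
    """Spatial output size of a transpose convolution layer."""
    return (size - 1) * stride - 2 * pad + kernel + out_pad

s = 512                        # resized 512 x 512 input
# Encoder: one stride-1 conv, then two stride-2 downsampling convs
s = conv_out(s, 7, 1, 3)       # 512, channels 3 -> 64
s = conv_out(s, 3, 2, 1)       # 256, channels 64 -> 128
s = conv_out(s, 3, 2, 1)       # 128, channels 128 -> 256
# Transformation: eight ResNet blocks leave the 128 x 128 size unchanged
# Decoder: two stride-2 transpose convs, then a stride-1 output conv
s = deconv_out(s, 3, 2, 1, 1)  # 256, channels 256 -> 128
s = deconv_out(s, 3, 2, 1, 1)  # 512, channels 128 -> 64
s = conv_out(s, 7, 1, 3)       # 512, channels 64 -> 3 (output image)
assert s == 512                # output matches the 512 x 512 input size
```

The symmetric down/up-sampling is what lets the generator emit an image of the same resolution as its input, which matters later when converted tiles are reassembled.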

Image processing and evaluation methods for the extraction of growth points
The obtained depth image was pre-processed to extract the parts corresponding to the crop. First, a general RGB-based picture of the crop showing the tomato growing points was converted into a depth image. In the depth image, distance is encoded as color, so the image can be divided along the color boundaries that mark changes in distance. The growing truss, the closest part of the image and the part we want, has a relatively red color; therefore, we extracted the area through HSV thresholding. Although the extraction performance was better than the RGB-based method, the process was optimized by trial and error. In addition, morphological operations were performed to fill the extracted area and remove the remaining small fragments.
We used the model developed with CycleGAN in this study. The image was pre-processed by applying HSV and Otsu thresholds, as seen in Figure 6. The contour of the crop was determined using the morphologyEx algorithm, which performs advanced morphological transformations using basic erosion and dilation operations and can operate in place; in multichannel images, each channel is processed independently. The edge was detected from the contour obtained, and erosion was performed in one iteration using a 3 × 3 kernel. Finally, a further erosion removed small, independent objects corresponding to noise. Although this process can be applied universally in tomato greenhouses, it is difficult to use in general outdoor areas and in places where the distance between the plantation and the camera keeps changing. The results of the entire image processing are shown in Figure 7.
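A minimal sketch of this thresholding-and-erosion step is given below. It uses plain NumPy rather than OpenCV, and the hue/saturation/value thresholds are illustrative assumptions (the paper tuned its own by trial and error):

```python
import numpy as np

def erode(mask, k=3):
    """Binary erosion with a k x k square kernel (a minimal stand-in
    for the OpenCV erosion used in the pipeline described above)."""
    pad = k // 2
    padded = np.pad(mask, pad)          # zero padding outside the image
    out = np.ones_like(mask, dtype=bool)
    h, w = mask.shape
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + h, dx:dx + w].astype(bool)
    return out

def extract_near_region(hue, sat, val, hue_max=15):
    """Keep the 'red' (near-distance) pixels of the colour-mapped depth
    image via HSV thresholds, then erode once to drop small noise.
    The threshold values here are illustrative assumptions."""
    mask = (hue <= hue_max) & (sat > 50) & (val > 50)
    return erode(mask, 3).astype(np.uint8)
```

In the actual pipeline the equivalent OpenCV calls operate on the CycleGAN-converted depth image, but the structure (color threshold first, morphology second) is the same.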
We compared the accuracy of the image obtained from image processing against the image from which the growing point was manually extracted, using the 80 test samples. For the manual extraction, a person intuitively created polygons delimiting the ROI areas.
The image extracted by hand was taken as the actual region of interest. The growing truss extracted by image processing and the actual region of interest of the same size were overlapped, and the extracted image values at the same coordinates as the actual growing truss were eliminated. The error rate was then calculated based on the number of pixels in the remaining images. Two indicators were calculated for the error rate: the residual ratio after removing the predicted pixels from the actual image was designated the false negative (FN) rate, and the residual ratio after removing the actual image pixels from the predicted pixels was designated the false positive (FP) rate. In addition, as a standard segmentation measure, the intersection over union (IoU) was calculated to evaluate the image segmentation method. Figure 8 shows the specific calculation method for FN and IoU using the resulting image.
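These pixel-wise indicators can be expressed compactly on binary masks. The choice of denominators (ground-truth pixel count for FN, predicted pixel count for FP) is our reading of the description above:

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-wise FN rate, FP rate, and IoU between a predicted binary
    mask and a manually extracted ground-truth mask.
    FN: truth pixels remaining after removing predicted pixels,
        as a fraction of all truth pixels.
    FP: predicted pixels remaining after removing truth pixels,
        as a fraction of all predicted pixels.
    IoU: intersection over union of the two masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    fn = (truth & ~pred).sum() / truth.sum()
    fp = (pred & ~truth).sum() / pred.sum()
    iou = (pred & truth).sum() / (pred | truth).sum()
    return fn, fp, iou
```

For example, two 2 × 2 squares offset by one column share two pixels, giving FN = FP = 0.5 and IoU = 1/3.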

Continuous measurement of robot images for field applicability of CycleGAN
We conducted a field applicability test to examine the possibility of measuring the desired tomato stem section while driving along the greenhouse crop bed. The vehicle was driven in a straight line between the planting rows in the greenhouse and continuously scanned pictures of a particular location. We collected only the RGB images from the RealSense camera, which were then converted using the previously developed CycleGAN, and an image processing technique was applied to extract the area of interest from the image. The RGB images were captured continuously at 2 s intervals while the robot advanced approximately 5 m at a fixed forward speed of 0.5 m/s. Image conversion and extraction of the region of interest on the stem were performed simultaneously. Figure 9 shows the growth measurement experiment inside an actual greenhouse.

Training results of the CycleGAN
Images of the tomato growing truss were collected through the camera attached to the vehicle-based robot arm described above. A total of 350 pairs of images were collected, and CycleGAN training was performed with them; the data are available as supplementary data. Figure 10 shows the collected data, the shape of the tomato growing point, and the greenhouse cultivation environment.
The CycleGAN was trained for approximately 9600 iterations with a batch size of five using approximately 276 training samples. The changes in the losses of the generator and of discriminators X and Y can be seen in Figure 11. The generator loss, despite some early fluctuation, converged to a certain level in the second half of training. The discriminator gradually converged to 0.5 for Dx, while Dy converged to approximately 0.55. For depth-to-RGB generation, an error with respect to the actual samples was confirmed. However, the learning performance mainly used in this study, RGB-to-depth conversion, appeared to have been secured to an extent. Figure 12(a) shows the RGB-to-depth learning process. The generator results obtained at 8800 iterations clearly depicted the appearance of crops compared with the initial iterations. In addition, the RGB color differed based on the size and shape of the crop, and a similar pattern was observed in the depth images. Conversely, in the depth-to-RGB direction only low-quality crop images were obtained, as the depth input does not carry enough information to generate high-quality RGB images, as seen in Figure 12(b). Although the appearance, characteristics, and color of the crops were simulated like real RGB images, it was difficult to grasp specific features with the naked eye.

Accuracy of image extraction
The conversion from an RGB image to a depth image was mainly for the segmentation of target crops, and we verified the accuracy through FN and FP as the evaluation method. From the previously developed CycleGAN models, results were inferred at 8,800 iterations, while the image pre-processing and growing truss extraction methods were kept the same. We obtained the results seen in Figure 13 by comparing outputs on the 80 images that were not used for model training. When the FN value was calculated using the image obtained from the depth camera, we obtained approximately 17.55 ± 3.01%, and FP was 17.76 ± 3.55%. Similarly, when converting the image using CycleGAN, FN was approximately 19.24 ± 1.45% and FP was 18.24 ± 1.54%. Beyond the error rates, the CycleGAN and depth-image results were also compared with the manually extracted region through IoU, as shown in Figure 14. Among the total test samples, the IoU was 63.56 ± 8.44% when using the depth image and 69.25 ± 4.42% when segmentation was performed through CycleGAN. As additional data, analysis result samples for each algorithm are presented in the attached file of IoU samples.

Field application results for continuous detection
Because this study aims to extract the growing truss of tomato crops in a greenhouse using CycleGAN and image processing technology, its feasibility was confirmed by field application experiments. Figure 15(a) shows the result of continuously acquiring and matching images at a height of approximately 3.5 m, in which the growing truss can be confirmed while the image acquisition vehicle advances inside the greenhouse. After advancing for 5 m, approximately six crops were confirmed to be unevenly distributed. Figure 15(b) shows the result of converting the image into a depth image using the developed CycleGAN model. Similar to an actual depth image, the converted image showed the object to be segmented, with closer crops indicated in red and farther crops in blue. Finally, the result of extracting the growing truss, that is, the stem and leaves of the tomato plant, using the image processing technique can be confirmed in Figure 15(c).

Discussion
While past research on applying vision to crops has focused on fruit, in stem plants such as tomatoes the state of the continuously growing point can be an important indicator of future yield. Therefore, we conducted research on image processing techniques to identify the growing truss and used deep learning to obtain highly efficient results. We first devised an image processing technique for segmenting the candidate growing truss regions using a depth image and a simple RGB image converted into a depth image through CycleGAN. Given that CycleGAN is useful in image conversion, it was advantageous in recognizing the object present in both images, namely the growing truss to be extracted. Furthermore, it was possible to convert the color of the growing truss to the corresponding depth color; the depth color of the prepared training set was red, and this was the main learning factor. Owing to the CycleGAN method, both transformation directions can be exploited, as already demonstrated in the previous study on horse and zebra transformation [24,25]. Compared with the purpose and approach of existing segmentation studies on tomato images (Zhang and Xu, 2018), many studies have instead focused on the analysis of tomato fruit. It is very difficult to classify the desired tomato stems or leaves because the growing environment is very dense; in this study, by contrast, the possibility of an approach using depth imaging was confirmed.
Although the identification error rate was lower when using the depth image, as seen in Figure 13, the average error rate was less than 20% for both techniques, indicating that the segmented object was not a different region of interest. This reflects the fact that the error rate could not be reduced further given that the ground truth was determined manually. In terms of standard deviation, however, CycleGAN achieved the smaller value, which may be because the depth camera, when applied in the field, tends to lose focus at close range with approximately 10% probability, as seen in Figure 15(c). Because this is a matter of the field applicability of the camera itself, it was not considered further in this study. Objects that remain unrecognized by the depth camera are termed failure cases, as seen in Figure 16, and can cause problems in future field applications. However, this problem did not occur in the depth images converted using CycleGAN, as such cases were already handled when preparing the training set.
Additionally, artificial intelligence algorithms that recognize objects usually require annotated image samples, prepared through human intellectual contribution, as training data, and a large number of data samples is needed to verify accurate performance. In comparison, the combination of CycleGAN and the image processing method proposed in this study was confirmed not to require the preparation of annotated image samples.
In the future, robots will be required in agriculture to automatically measure plant growth. However, to select the desired growing truss, the robot must accurately recognize it in order to establish a manipulation strategy. In this study, we adopted CycleGAN, an artificial intelligence image conversion technique, as the first step toward the robot recognizing the growing truss. As a result, the robot was able to effectively extract the growing truss using the matched image even in field applications. In the field applicability verification experiment, the moving robot matched several images and finally converted them using CycleGAN, and the result was verified by extracting only the growing truss from the image. However, an irregular connection of images was observed during registration, and because the CycleGAN structure used for depth conversion accepts only 512 × 512 images, a grid pattern appeared inside the converted images. As this applies to all images processed with the deep learning model, the problem should be solved with an algorithm allowing a flexible input layer structure. Nevertheless, the results indicate that the approach is a promising step toward the application of unmanned robots in agriculture.
In future research, we will consider a method of acquiring optimal images by manipulating the robot arm once the growing truss is recognized, so that such automated robots and systems can be deployed in practice. In addition, the result of converting the depth image to an RGB image, although not addressed in this paper, is worthy of discussion as a future study (Figure 12). Virtual reality has a high potential for human contribution [30], and creative results can be achieved when it is fused with artificial intelligence.

Conclusions
In this study, we developed a technique for extracting the growing truss of tomato plants in a greenhouse using image processing techniques based on image information obtained by a robot platform and growing truss images captured by a depth camera. Furthermore, the CycleGAN algorithm was used to convert between the characteristics of the two image types, that is, to convert RGB images into depth images. Discriminators X and Y converged to 0.43 and 0.65, respectively, in the training loss. The image information converted using CycleGAN was then used to compare the performance of growing truss extraction. The false negative value based on the depth camera images was approximately 17.55 ± 3.01%, and the false positive value was 17.76 ± 3.55%. Similarly, using CycleGAN, the false negative was approximately 19.24 ± 1.45% and the false positive was 18.24 ± 1.54%. The IoU was 63.56 ± 8.44% when using the depth image and 69.25 ± 4.42% when segmentation was performed through CycleGAN, although CycleGAN exhibited a slightly higher error rate. Finally, we performed field application tests to determine the growing truss of tomatoes, in which continuously scanned image information was converted into depth images using CycleGAN. In the future, the proposed approach is expected to be used in vision technology to scan tomato growth indicators in greenhouses using an unmanned robot platform.

Declarations
Ethics approval and consent to participate
Not applicable.

Consent for publication
All authors consent to the submission of this paper for publication in Plant Methods.

Availability of data and materials
The raw RGB and depth images of the tomato growing truss used in this study, the CycleGAN code, an executable .py file, and the model generator H5 file are shared in the supplementary files.

Competing interests
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Authors' contributions
DHJ made substantial contributions to the conception and design of the work, the acquisition, analysis, and interpretation of data, and the creation of new software used in the work. CYK and TSL designed the image acquisition devices and analyzed the data. SHP substantively revised the manuscript. All authors agreed to the submission of the manuscript for publication. All authors read and approved the final manuscript.

Relationship between the images generated from the X and Y generators and the image data to be extracted.

Figure 5
The CycleGAN generator architecture.

Figure 7
The entire image processing after CycleGAN conversion.

Figure 12
Results of RGB-to-depth image conversion (a) and realizing the depth in RGB (b) through CycleGAN's 8800-iteration learning.

Figure 13
Comparison between depth image and CycleGAN image with ROI specified by hand through image processing technique, and the FP, FN, and IoU values in pixel units.

Supplementary Files
This is a list of supplementary files associated with this preprint: PMsupplementarydata.zip