Structure from Motion

Structure from motion (SfM), a photogrammetric method that simultaneously estimates the camera parameters and the depth of corresponding points (i.e., a sparse 3D point cloud) from multi-view images, was used to obtain the projection matrix of each camera and the sparse point cloud. In this study, we used Metashape (Agisoft, St. Petersburg, Russia), a commercial photogrammetry software package that implements SfM. The projection matrices, which encode the optical center, focal length, orientation, and position of the cameras, were exported as Extensible Markup Language (XML) files. Markers were placed on the images to optimize image alignment and thereby make it easier to obtain corresponding points.
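The exported quantities combine into the standard pinhole camera model. The following is a minimal illustrative sketch (not Metashape's API; the function names and example values are our own) of composing a projection matrix from the intrinsics and extrinsics and projecting a 3D point:

```python
import numpy as np

def projection_matrix(f, cx, cy, R, t):
    """Compose a 3x4 projection matrix P = K [R | t] from the focal
    length f, optical centre (cx, cy), rotation R, and translation t."""
    K = np.array([[f, 0.0, cx],
                  [0.0, f, cy],
                  [0.0, 0.0, 1.0]])
    return K @ np.hstack([R, t.reshape(3, 1)])

def project(P, X):
    """Project a 3D world point X to pixel coordinates."""
    x = P @ np.append(X, 1.0)  # homogeneous projection
    return x[:2] / x[2]        # perspective divide
```

A camera looking down the z-axis maps the world origin to its optical centre, which gives a quick sanity check of the convention used.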

Instance segmentation of leaves using Mask R-CNN

To extract the 2D edges of leaves, mask images were obtained from the multi-view images using Mask R-CNN, a deep neural network model for instance segmentation [42]. We used Detectron2 [56], a library for detection and segmentation tasks, to run the Mask R-CNN model with an ImageNet-pretrained backbone and model weights pretrained on the COCO dataset. The model was trained on a *training* dataset comprising 80% of the multi-view images of three individual plants; the remaining 20% of the images were used for validation (*validation* dataset). Model performance was evaluated on a *test* dataset consisting of another individual plant. To calculate the accuracy of instance segmentation with Mask R-CNN, we performed four-fold cross-validation, assigning a different individual as the *test* dataset in each fold.
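The leave-one-individual-out protocol above can be sketched in plain Python (a hypothetical helper of our own, not part of Detectron2; `image_index` and its keys are assumed):

```python
def individual_folds(image_index):
    """Four-fold cross-validation that holds out one individual plant
    per fold, matching the evaluation protocol described above.

    image_index: dict mapping individual id -> list of image paths.
    Yields (train_val_images, test_images) for each fold."""
    individuals = sorted(image_index)
    for held_out in individuals:
        test = list(image_index[held_out])
        train_val = [img for ind in individuals if ind != held_out
                     for img in image_index[ind]]
        yield train_val, test
```

Within each fold, the `train_val` images would then be split 80/20 into the *training* and *validation* datasets.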

Leaf edges were extracted from the predicted mask images using the OpenCV library [31]. Because the extracted contours had arbitrary lengths, they were divided following the procedure of [37]: each 2D edge was split into fragments of a certain length \({l}_{\text{fragment}}\) with an overlap of at least \({\tau }_{\text{overlap}}\) points between consecutive fragments. In this study, we used \({\tau }_{\text{overlap}}=15\) and \(40<{l}_{\text{fragment}}<100\) for the simulation data and \({\tau }_{\text{overlap}}=30\) and \(80<{l}_{\text{fragment}}<200\) for the real data.
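The fragmentation step can be sketched as a sliding window over the closed contour. This is a simplified sketch of our own (the condition in [37] is more involved, and the wrap-around overlap between the last and first fragments is not controlled here):

```python
def split_contour(contour, l_fragment, tau_overlap):
    """Split a closed contour (ordered list of points) into fragments
    of l_fragment points, with tau_overlap points shared between
    consecutive fragments. Assumes l_fragment > tau_overlap."""
    n = len(contour)
    step = l_fragment - tau_overlap  # advance between fragment starts
    fragments = []
    for start in range(0, n, step):
        # Wrap indices around the closed curve.
        idx = [(start + k) % n for k in range(l_fragment)]
        fragments.append([contour[i] for i in idx])
    return fragments
```

With the simulation-data settings (\(\tau_{\text{overlap}}=15\), \(l_{\text{fragment}}=40\)), a 100-point contour yields four fragments, each sharing 15 points with its successor.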

Leaf correspondence identification across multi-view images

To process and reconstruct each leaf individually, we determined the correspondence of leaves between images. First, the point cloud was clustered, points on the back side were removed, and the remaining points were projected onto the mask images. Each cluster was then associated with the mask on which most of its points were located. By performing this for all mask images, the correspondence between the leaves across images was obtained via the point cloud. In this study, density-based spatial clustering of applications with noise (DBSCAN) [57], a density-based clustering method, was used on the simulated data, and region-growing segmentation implemented in the Point Cloud Library [58], in which the angle between normals determines whether a region is grown, was used on the real data. Hidden point removal [39], which determines the points of a point cloud visible from a given viewpoint using a sphere and a spherical inversion operator, was used for back-side removal. Figure 6a shows an overview of the method, and Fig. 6b shows its practical application.
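The mask-voting association can be illustrated as follows (a minimal sketch of our own, assuming the cluster has already been projected to integer pixel coordinates; the clustering and hidden point removal steps are omitted):

```python
import numpy as np

def assign_cluster_to_mask(points_2d, masks):
    """Vote each projected cluster point into the instance mask it
    falls on; the cluster is associated with the mask that receives
    the most votes.

    points_2d: iterable of (x, y) pixel coordinates of one cluster.
    masks: list of boolean (H, W) instance masks for one image.
    Returns the index of the winning mask, or None if no point hits."""
    votes = np.zeros(len(masks), dtype=int)
    for k, mask in enumerate(masks):
        h, w = mask.shape
        for x, y in points_2d:
            if 0 <= y < h and 0 <= x < w and mask[y, x]:
                votes[k] += 1
    if votes.max() == 0:
        return None
    return int(votes.argmax())
```

Repeating this vote over all mask images links each cluster, and hence each leaf, across the multi-view set.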

Curve-based 3D reconstruction of leaf edges

A 3D curve sketch [37], which is a curve-based 3D reconstruction, was obtained using the projection matrix obtained by SfM, the multi-view mask image obtained by Mask R-CNN, and the correspondence of leaves among the images. All the subsequent processes were applied to each leaf. Obtaining a 3D curve sketch involves the following steps: (1) camera pair definition; (2) pair hypothesis generation; and (3) 3D curve fragment reconstruction and filtering by reprojection.

**(1) Camera pair definition**

To perform curve-based 3D reconstruction, camera pairs were defined based on the relative positions of the cameras in the scene. The angle \({b}_{ij}\) between cameras *i* and *j*, as seen from the average position of all the cameras, was calculated for all camera combinations, and the camera pairs were defined as the combinations satisfying \({b}_{ij}\le {b}_{\text{max}}\). Since the cameras were assumed to be equally spaced around the plants, *b* corresponds to the baseline in [37]. In this study, \({b}_{\text{max}}\) was set to 30, 40, and 60° for the simulated data of 128, 64, and 32 multi-view images, respectively. For the real data, \({b}_{\text{max}}\) was set to 30°, regardless of the number of images.
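A sketch of the pair-definition step, assuming the angle is measured between the rays from the centroid of all camera centres (function and variable names are our own):

```python
import numpy as np

def camera_pairs(positions, b_max_deg):
    """Return camera index pairs (i, j) whose angular separation,
    seen from the centroid of all camera positions, is at most
    b_max_deg (coincident directions are excluded).

    positions: (n, 3) array of camera centres."""
    centre = positions.mean(axis=0)
    rays = positions - centre
    rays /= np.linalg.norm(rays, axis=1, keepdims=True)
    pairs = []
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            cos_b = np.clip(rays[i] @ rays[j], -1.0, 1.0)
            b_ij = np.degrees(np.arccos(cos_b))
            if 0 < b_ij <= b_max_deg:
                pairs.append((i, j))
    return pairs
```

For cameras arranged on a circle, this selects neighbouring views and discards near-opposite ones, which have poor epipolar geometry for curve matching.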

**(2) Pair hypothesis generation**

Consider the *p*-th 2D curve fragment in the *i*-th image. A pair hypothesis is a pair of 2D curve fragments from two images that potentially correspond to each other. In epipolar geometry, the fundamental matrix **F***ij*, computed from the projection matrices of the *i*-th and *j*-th images, maps a point in the *i*-th image to a line in the *j*-th image. This line is called the epipolar line (or epiline), and any corresponding point lies along it. Extending this concept to a 2D curve fragment, **F***ij* maps a 2D curve fragment in the *i*-th image to a band (a set of epipolar lines) in the *j*-th image. Pair hypotheses were generated from the 2D curve fragments overlapping these bands (Fig. 3a). For robust reconstruction, 2D curve fragments tangential to the epipolar lines were excluded from the process (see [37] for details). The number of pair hypotheses per band was limited to a maximum of 10 to account for the limited computational resources.
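The fundamental matrix and the epilines can be computed directly from the two projection matrices via the textbook relation F = [e2]× P2 P1⁺, where e2 is the epipole in the second image (a standard construction, not code from [37]):

```python
import numpy as np

def fundamental_from_projections(P1, P2):
    """Fundamental matrix F mapping points in image 1 to epipolar
    lines in image 2, from two 3x4 projection matrices:
    F = [e2]_x P2 P1^+, with e2 = P2 C1 and C1 the camera centre
    (right null vector) of P1."""
    _, _, Vt = np.linalg.svd(P1)
    C1 = Vt[-1]                # P1 @ C1 = 0
    e2 = P2 @ C1               # epipole in image 2
    e2_cross = np.array([[0, -e2[2], e2[1]],
                         [e2[2], 0, -e2[0]],
                         [-e2[1], e2[0], 0]])
    return e2_cross @ P2 @ np.linalg.pinv(P1)

def epiline(F, x):
    """Homogeneous epipolar line l = F x in image 2 for pixel x in image 1."""
    return F @ np.append(x, 1.0)
```

A band is then obtained by mapping every point of a 2D curve fragment through `epiline` and testing which fragments in the other image fall within it.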

**(3) 3D curve fragment reconstruction and filtering by reprojection**

Then, the 3D curve fragments corresponding to the pair hypotheses were reconstructed in 3D Euclidean space using the projection matrices. Each reconstructed 3D curve fragment was reprojected onto the multi-view images, excluding the *i*-th and *j*-th images, to evaluate how closely it reproduced the true projection. A reconstructed curve fragment was *supported* by a reprojection if the reprojected fragment was located close to the edges of the target object (i.e., leaf) in the image; specifically, a reprojected curve fragment was supported in an image if at least *τ**v* (%) of the fragment was located within *τ**d* pixels of the edges. Only curve fragments supported in a sufficient number of images (i.e., at least the support threshold *τ**t*) were retained. A *τ**v* of 80% was used in all cases, and *τ**d* was 11 and 39 pixels for the simulated leaves and the actual soybean specimens, respectively.
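The support test can be sketched as follows (a brute-force sketch of our own; a practical implementation would use a distance transform of the edge map, and the per-image bookkeeping is simplified):

```python
import numpy as np

def is_supported(reproj_points, edge_points, tau_v, tau_d):
    """A reprojected curve fragment is supported in one image when at
    least tau_v (as a fraction) of its points lie within tau_d pixels
    of some extracted edge point."""
    edge = np.asarray(edge_points, dtype=float)
    close = 0
    for p in np.asarray(reproj_points, dtype=float):
        d = np.min(np.linalg.norm(edge - p, axis=1))  # nearest edge point
        if d <= tau_d:
            close += 1
    return close / len(reproj_points) >= tau_v

def keep_fragment(support_per_image, tau_t):
    """Retain a 3D fragment only if it is supported in at least
    tau_t of the images it was reprojected into."""
    return sum(support_per_image) >= tau_t
```

In the study's setting, `tau_v` would be 0.8 and `tau_d` 11 or 39 pixels, with `tau_t` chosen as described in the section on optimal support thresholds.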

B-spline closed curve fitting

The 3D curve fragments were integrated into a closed 3D curve by using B-spline fitting. In B-spline fitting, a continuous periodic function is approximated by a piecewise polynomial function, which is a linear combination of the *j*-th order B-spline basis functions \({b}_{i,j}\left(l\right)\) over the *i*-th intervals, as follows:

$$f\left(l\right)=\mathbf{w}\,\mathbf{b}\left(l\right)=\left(\begin{array}{ccc}{w}_{1}& \cdots & {w}_{n-1}\end{array}\right)\left(\begin{array}{c}{b}_{1,j}\left(l\right)\\ \vdots \\ {b}_{n-k-1,j}\left(l\right)\\ {b}_{n-k,j}\left(l\right)+{b}_{1-k,j}\left(l\right)\\ \vdots \\ {b}_{n-1,j}\left(l\right)+{b}_{0,j}\left(l\right)\end{array}\right), \tag{2}$$

where *w**i* denotes the coefficient of the *i*-th B-spline. Based on the coordinate values of the reconstructed 3D curve fragments, the B-spline coefficients were estimated for *x*-, *y*-, and *z*-coordinate values using the "curve_fit" function in SciPy [59].
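Because Eq. (2) is linear in the coefficients, the fit can equivalently be posed as a linear least-squares problem. Below is a numpy-only sketch of our own with a uniform periodic parameterisation and *n* coefficients per coordinate (the indexing differs slightly from Eq. (2), and the study used SciPy's `curve_fit` instead):

```python
import numpy as np

def bspline_basis(t, knots, i, k):
    """Cox-de Boor recursion: i-th B-spline basis of degree k at t."""
    if k == 0:
        return ((knots[i] <= t) & (t < knots[i + 1])).astype(float)
    left = np.zeros_like(t)
    right = np.zeros_like(t)
    d1 = knots[i + k] - knots[i]
    if d1 > 0:
        left = (t - knots[i]) / d1 * bspline_basis(t, knots, i, k - 1)
    d2 = knots[i + k + 1] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + k + 1] - t) / d2 * bspline_basis(t, knots, i + 1, k - 1)
    return left + right

def periodic_design_matrix(t, n, k):
    """Design matrix for n periodic B-spline coefficients of degree k,
    parameter t in [0, 1); the bases that extend past 1 wrap around,
    mirroring the summed basis terms in Eq. (2)."""
    knots = np.arange(-k, n + k + 1) / n  # uniform knots beyond [0, 1]
    B = np.stack([bspline_basis(t, knots, i, k) for i in range(n + k)], axis=1)
    Bp = B[:, :n].copy()
    Bp[:, :k] += B[:, n:n + k]            # wrap the last k bases
    return Bp

def fit_closed_curve(points, n=12, k=3):
    """Least-squares fit of a closed B-spline curve to ordered samples.

    points: (m, 3) ordered samples along a closed 3D curve.
    Returns (coeffs, B) with coeffs of shape (n, 3)."""
    m = len(points)
    t = np.arange(m) / m                  # uniform periodic parameter
    B = periodic_design_matrix(t, n, k)
    coeffs, *_ = np.linalg.lstsq(B, points, rcond=None)
    return coeffs, B
```

Fitting each of the *x*-, *y*-, and *z*-coordinates happens in a single `lstsq` call here, since the same design matrix applies to all three.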

Calculation of optimal support thresholds

To determine *τ**t*, precision-recall (PR) curves between the ground-truth mesh and the reconstructed curve fragments were calculated on the simulation data; the optimal support threshold was defined as the highest *τ**t* that achieved the highest recall while the precision exceeded 0.99. The precision is the percentage of ground-truth points for which a reconstructed curve fragment lies within 30 mm, and the recall is the percentage of reconstructed curve-fragment points for which a ground-truth point lies within 30 mm. We also excluded points supported by more than *τ**p* on a well-supported curve, based on 3D drawings [60], a modified version of the 3D curve sketch.

On the simulation data, precision and recall were tested comprehensively for different support thresholds, defined as ratios of the total number of images (128, 64, and 32 images) divided into 24 equal steps between 1 and 0.125. The precision and recall values were recorded; if the precision never reached one, the minimum threshold was used as the optimal support threshold.
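The precision, recall, and threshold-selection rules above can be sketched as follows (brute-force point-to-point distances, following the definitions given in this section; `optimal_threshold` encodes the "highest recall at precision above 0.99, else minimum threshold" rule):

```python
import numpy as np

def precision_recall(recon, truth, dist_mm=30.0):
    """Precision: fraction of ground-truth points with a reconstructed
    point within dist_mm. Recall: fraction of reconstructed points
    with a ground-truth point within dist_mm (definitions as stated
    in this section)."""
    recon = np.asarray(recon, float)
    truth = np.asarray(truth, float)
    d = np.linalg.norm(recon[:, None, :] - truth[None, :, :], axis=2)
    precision = float(np.mean(d.min(axis=0) <= dist_mm))  # per ground-truth point
    recall = float(np.mean(d.min(axis=1) <= dist_mm))     # per reconstructed point
    return precision, recall

def optimal_threshold(pr_by_tau):
    """Given {tau_t: (precision, recall)}, return the tau_t with the
    highest recall among thresholds whose precision exceeds 0.99,
    breaking ties toward the higher tau_t; if none qualifies, fall
    back to the minimum threshold."""
    ok = {t: pr for t, pr in pr_by_tau.items() if pr[0] > 0.99}
    if not ok:
        return min(pr_by_tau)
    return max(ok, key=lambda t: (ok[t][1], t))
```

In the study, `pr_by_tau` would be populated by rerunning the reconstruction once per candidate threshold.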

Virtual plant models

Virtual plant models (single leaf, multiple leaves, and single leaf with holes) were created using Blender (Blender Foundation, Amsterdam, the Netherlands). Multi-view images of the virtual plants were generated using Unity (Unity Technologies, San Francisco, CA, US) from 128, 64, and 32 cylindrically arranged views. Three individuals were created from the multiple-leaf model; each leaf was translated randomly (horizontally from −33.33 to 33.33% and vertically from −14.28 to 14.29% of the bounding box dimensions) and rotated randomly from −10 to 10°.

Multi-view images of soybean plants

Multi-view images were obtained from four individual soybean plants (*Glycine max*) of four cultivars (Enrei, Zairai 51-2, Aoakimame, and Saga zairai) to train the Mask R-CNN model and evaluate its performance. The individuals were captured at different growth stages: 1) Enrei, 34 days after sowing (DAS); 2) Zairai 51-2, 56 DAS; 3) Aoakimame, 24 DAS; and 4) Saga zairai, 48 DAS. To demonstrate the applicability of the proposed method, multi-view images of another cultivar, Fukuyutaka, were obtained at growth stages of 21, 28, and 42 DAS. Each set of multi-view images comprised 264 images, from which approximately 130 images were subsampled. Multi-view images were acquired with a simple fixed photogrammetry system consisting of digital cameras (EOS Kiss X7; Canon, Tokyo, Japan), a turntable (MT320RL40; ComXim, Shenzhen, China), and a camera control application (CaptureGRID4; Kuvacode, Kerava, Finland) (Additional file 1, Fig. S5).