A total of 507 buckwheat (Fagopyrum esculentum) germplasm accessions were provided by the Rural Development Administration (RDA) GenBank, South Korea. The collection of plant material complied with relevant institutional (Jeju National University), national (Republic of Korea), and international guidelines and legislation.
Camera system setting
A complementary metal-oxide-semiconductor (CMOS) image sensor color camera (Nikon D7500, Nikon Imaging Japan Inc., Tokyo, Japan) with a sensor size of 23.5 × 15.7 mm and a lens (AF-S DX NIKKOR 16-80 mm f/2.8-4E ED VR, Nikon Imaging Japan Inc., Tokyo, Japan) was used to acquire images. A studio box (M80 Studio, China) measuring 800 × 800 × 800 mm was set up. A light-emitting diode (LED) board (ArtLight, Unclepen Co., Bucheon, Korea) measuring 670 × 470 × 20 mm was installed to reduce image error caused by shadows during data production. The shadows of buckwheat seeds were removed using a backlight. Additionally, two light boards sized 600 × 10 mm with a color temperature of 5500 ± 200 K and two LED lighting stands (N-T96 LED, Prodean Co., Seoul, Korea) with a color temperature of 5600 K were installed in the studio box to remove seed shadows during image acquisition. Buckwheat seeds were manually spread on a 255 × 310 mm sheet of blue polypropylene (PP) (color clear PP “L” Holder, Hyunpoong Inc. Co., Pochen, Korea). The blue PP provided a chroma-key effect, making it easy to separate buckwheat seeds from the background. Each image taken with this system contains 95 seeds per germplasm accession on average.
We acquired vertical red, green, and blue (RGB) images of buckwheat seeds from 25 cm above the ground with a camera (Nikon, Japan). To convert measurements to actual size, a 16 mm tag was used as a scale bar. To minimize color-value error caused by lighting conditions, a standard color was selected, and a color tag was added to the blue PP. Example pictures of the camera setup and a seed image can be found in Supplementary Figs. 1 and 2.
Image analysis processing
The buckwheat seed images were processed with ImageJ (National Institutes of Health, USA, rsb.info.nih.gov/ij). The program allowed us to edit, calibrate, measure, analyze, and process image data7. It can be extended with tools such as macros, which were used with ImageJ in this experiment. While editing images, we converted the measurement unit to millimeters in the scale setting: the standard tag was selected, and its length in pixels was set to 16 mm.
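The scale calibration described above amounts to a single ratio between the known tag length and its length in pixels. The following is a minimal Python sketch of that arithmetic, not the ImageJ macro used in the study; the function names and the pixel counts in the example are hypothetical.

```python
def mm_per_pixel(tag_length_px, tag_length_mm=16.0):
    """Scale factor from the 16 mm reference tag (pixel length is measured by the user)."""
    if tag_length_px <= 0:
        raise ValueError("tag length in pixels must be positive")
    return tag_length_mm / tag_length_px


def length_px_to_mm(length_px, scale):
    """Convert a length (e.g. seed width or height) from pixels to millimeters."""
    return length_px * scale


def area_px_to_mm2(area_px, scale):
    """Convert an area from square pixels to square millimeters (scale is squared)."""
    return area_px * scale ** 2


# Hypothetical example: the 16 mm tag spans 320 pixels in the image.
scale = mm_per_pixel(320)
seed_width_mm = length_px_to_mm(100, scale)
seed_area_mm2 = area_px_to_mm2(8000, scale)
```

Lengths scale linearly with the factor, whereas areas scale with its square, which is why the two conversions are kept separate.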
ImageJ was used to split each image into its RGB channels. The channel separation made it easier to separate the seeds from the background because the color of the seeds was simplified. After the channels were separated, binary images were created by thresholding pixel values to complete the separation of seeds and background. To avoid measuring noise particles other than buckwheat seeds, particles 100 times smaller than the size of the seeds were excluded. Each buckwheat seed was segmented as an independent area, without connecting neighboring objects8. Fig. 1 outlines the end-to-end pipeline of the image-based phenotyping.
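The segmentation steps above (channel split, thresholding, removal of small particles, and per-seed labeling) can be illustrated in Python. This is a sketch under stated assumptions, not the ImageJ workflow used in the study: it uses NumPy and SciPy, picks the red channel and a threshold of 128 arbitrarily, and runs on a synthetic image; `segment_seeds` and `expected_seed_area_px` are hypothetical names.

```python
import numpy as np
from scipy import ndimage


def segment_seeds(rgb, threshold=128, expected_seed_area_px=600):
    """Label seeds on a blue background and drop particles < 1/100 of a seed."""
    red = rgb[..., 0]                      # channel split: keep the red channel
    binary = red > threshold               # blue background has low red values
    labels, _ = ndimage.label(binary)      # one integer label per connected object
    areas = np.bincount(labels.ravel())[1:]        # pixel area per label (skip background)
    min_area = expected_seed_area_px / 100.0       # noise cutoff: 100x smaller than a seed
    keep = np.flatnonzero(areas >= min_area) + 1   # labels that survive the size filter
    cleaned = np.where(np.isin(labels, keep), labels, 0)
    return cleaned, areas[keep - 1]


# Synthetic test image: blue background, two brown "seeds", one small noise speck.
image = np.zeros((100, 100, 3), dtype=np.uint8)
image[..., 2] = 255                        # blue background
image[10:30, 10:40] = (150, 100, 50)       # seed 1, 20 x 30 = 600 px
image[50:70, 50:80] = (150, 100, 50)       # seed 2, 20 x 30 = 600 px
image[90:92, 5:7] = (150, 100, 50)         # noise speck, 4 px

label_image, seed_areas = segment_seeds(image)
print(seed_areas)
```

The size filter plays the same role as a minimum particle size in a particle-analysis step: anything far smaller than a seed is treated as noise and never measured.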
Five seed shape characteristics were measured: seed area, width, height, circularity, and roundness. The characteristics were extracted from the images of individual buckwheat seeds (Table 1). All data analyses were performed using Python (version 3.8.5)9. The data consist of 48,047 samples from 507 IT lines. The average, maximum, and minimum numbers of samples per line are 94.7, 238, and 41, respectively. We removed four samples with zero values and one with a height of 99.163, which is implausibly large given the average height of 5.67. The data distribution of each feature per line was tested using the Shapiro-Wilk normality test10. All 507 lines failed the test because at least one feature in each line had a p-value less than 0.05. To further check the data distribution, each feature in selected buckwheat lines was visualized with kernel density estimate (KDE) plots using the Seaborn package (see Fig. 1). The x-axis indicates the range of values of a single feature. The y-axis indicates the probability density, which can be viewed as a smoothed histogram. Although the normality test failed, the data approximately follow a normal distribution, and no outliers exist.
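The per-line normality check can be sketched as follows. This is an illustrative example using `scipy.stats.shapiro`, not the authors' script: the data are synthetic, and the function name `line_fails_normality` and the feature values are assumptions. A line is flagged as failing when any of its features has a p-value below 0.05, matching the criterion stated above.

```python
import numpy as np
from scipy.stats import shapiro


def line_fails_normality(features, alpha=0.05):
    """features: dict mapping a feature name to a 1-D array of values for one line.

    Returns (failed, p_values), where failed is True if at least one feature
    has a Shapiro-Wilk p-value below alpha.
    """
    p_values = {}
    for name, values in features.items():
        _, p = shapiro(values)
        p_values[name] = float(p)
    failed = any(p < alpha for p in p_values.values())
    return failed, p_values


# Synthetic line with 95 seeds: one roughly normal feature, one clearly skewed feature.
rng = np.random.default_rng(0)
line = {
    "area": rng.normal(20.0, 2.0, 95),
    "skewed_feature": rng.exponential(1.0, 95),   # hypothetical non-normal feature
}

failed, p_values = line_fails_normality(line)
print(failed, p_values)
```

Because the criterion is "any feature fails", a line with several features is more likely to be flagged than any single feature on its own, which is one reason a visual check of the distributions is a useful complement.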
The median value of each feature per line was calculated for clustering. K-means clustering was selected because it has shown robust clustering results on various data sets11. For a given k, k-means clustering partitions the samples into k clusters such that each sample belongs to the cluster with the nearest centroid. The algorithm seeks to minimize the distortion, which is the sum of the squared distances between each sample and the centroid of its cluster. Supplementary Fig. 3 shows the calculated distortion value as a function of the number of clusters.
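To make the definition of distortion concrete, the sketch below implements k-means (Lloyd's algorithm) in NumPy and computes the distortion for several values of k. It is an illustration, not the implementation used in the study: the farthest-point initialization is a choice made here to keep the example deterministic, and the three-blob data set is synthetic.

```python
import numpy as np


def squared_distances(X, centroids):
    """Matrix of squared Euclidean distances, shape (n_samples, n_centroids)."""
    return ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)


def farthest_point_init(X, k):
    """Deterministic initialization: start at X[0], then add the farthest sample."""
    centroids = [X[0]]
    for _ in range(1, k):
        d2 = squared_distances(X, np.array(centroids)).min(axis=1)
        centroids.append(X[np.argmax(d2)])
    return np.array(centroids, dtype=float)


def kmeans(X, k, n_iter=100):
    """Return (labels, centroids, distortion) for k clusters."""
    centroids = farthest_point_init(X, k)
    for _ in range(n_iter):
        labels = squared_distances(X, centroids).argmin(axis=1)   # assignment step
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])                                                        # update step
        if np.allclose(updated, centroids):
            break
        centroids = updated
    d2 = squared_distances(X, centroids)
    labels = d2.argmin(axis=1)
    distortion = float(d2[np.arange(len(X)), labels].sum())       # sum of squared distances
    return labels, centroids, distortion


# Synthetic data: three well-separated blobs of 50 samples each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(0.0, 0.5, size=(50, 2)) for c in centers])

distortions = {k: kmeans(X, k)[2] for k in range(1, 7)}
for k, value in distortions.items():
    print(k, round(value, 2))
```

Plotting these distortion values against k gives the kind of curve used to choose the number of clusters: the value drops steeply until k reaches the number of natural groups and flattens afterwards.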
The distortion value decreases sharply as k increases up to 6. For larger k, the decrease is much smaller. The optimal choice of k is therefore 6. Supplementary Fig. 3 visualizes the result of k-means clustering with k = 6 for each pair of features, plotted on the x- and y-axes. The plots on the diagonal show the density distribution of the corresponding feature for each cluster.
The kernel density estimate (KDE) plot is widely used to visualize various data types and makes it easy to see where the data peak within an interval12. The density plot of each feature in four selected accessions was visualized with Plotly (Fig. 3). The x-axis indicates the range of values of a single feature. The y-axis indicates the probability density at the corresponding x value, which can be greater than one13. In addition, the density plot of each feature per accession can be found in Supplementary Fig. 4.
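The remark that a density value can exceed one is easy to verify numerically: a density only has to integrate to one, so a feature confined to a narrow range must have a tall peak. The sketch below uses `scipy.stats.gaussian_kde` on synthetic values; it computes the estimate only and is not the Seaborn or Plotly plotting code used for the figures. The mean and spread of the synthetic feature are assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic feature with a narrow spread (assumed values, for illustration only).
rng = np.random.default_rng(0)
values = rng.normal(0.80, 0.05, 200)

kde = gaussian_kde(values)                 # Gaussian kernel, default bandwidth rule
grid = np.linspace(0.4, 1.2, 801)          # x-axis: range of the feature
density = kde(grid)                        # y-axis: estimated probability density

dx = grid[1] - grid[0]
area_under_curve = float(density.sum() * dx)   # Riemann sum, should be close to 1
peak = float(density.max())                    # peak density, well above 1 here

print(round(area_under_curve, 3), round(peak, 2))
```

The peak exceeds one while the area under the curve stays close to one, so the y-axis should be read as a density rather than as a probability.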
We calculated the correlations between features using Spearman’s method14 (see Fig. 2). This method was selected because the data did not fully follow a normal distribution. Based on the correlation coefficients and p-values (Fig. 3), we can confirm that there are no correlations between features.
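Spearman's method works on ranks, so it does not assume normality and captures any monotonic relationship. A minimal sketch with `scipy.stats.spearmanr` follows; the variables are synthetic and serve only to show how the coefficient and p-value are obtained for a pair of features.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 500)

# A monotonic but nonlinear relationship: the ranks of x and exp(x) are identical.
y_monotonic = np.exp(x)
rho_monotonic, p_monotonic = spearmanr(x, y_monotonic)

# An independent variable: the coefficient should be close to zero.
y_independent = rng.normal(0.0, 1.0, 500)
rho_independent, p_independent = spearmanr(x, y_independent)

print(rho_monotonic, p_monotonic)
print(rho_independent, p_independent)
```

For a full feature table, the same call can be repeated for each pair of features to build the coefficient and p-value matrices that a correlation figure summarizes.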