Development of an SNP marker set for marker-assisted backcrossing using genotyping-by-sequencing in tetraploid perilla

High-quality molecular markers are essential for marker-assisted selection to accelerate breeding progress. Compared with diploid species, recently diverged polyploid crop species tend to have highly similar homeologous subgenomes, which is expected to limit the development of broadly applicable locus-specific single-nucleotide polymorphism (SNP) assays. Furthermore, it is particularly challenging to make genome-wide marker sets for species that lack a reference genome. Here, we report the development of a genome-wide set of kompetitive allele specific PCR (KASP) markers for marker-assisted recurrent selection (MARS) in the tetraploid minor crop perilla. To find locus-specific SNP markers across the perilla genome, we used genotyping-by-sequencing (GBS) to construct linkage maps of two F2 populations. The two resulting high-resolution linkage maps comprised 2326 and 2454 SNP markers that spanned a total genetic distance of 2133 cM across 16 linkage groups and 2169 cM across 21 linkage groups, respectively. We then obtained a final genetic map consisting of 22 linkage groups with 1123 common markers from the two genetic maps. We selected 96 genome-wide markers for MARS and confirmed the accuracy of markers in the two F2 populations using a high-throughput Fluidigm system. We confirmed that 91.8% of the SNP genotyping results from the Fluidigm assay were the same as the results obtained through GBS. These results provide a foundation for marker-assisted backcrossing and the development of new varieties of perilla.


Introduction
Molecular markers enable efficient crop breeding through a variety of marker-assisted selection (MAS) approaches, including marker-assisted backcrossing (MABC), markerassisted gene pyramiding, and marker-assisted recurrent selection (MARS) (Rossetto and Henry 2014).Linkage mapping studies of family-based populations, genome-wide association studies using natural diversity panels, and joint linkage-association mapping of both biparental and natural populations are the most widely used strategies to identify the underlying genetics of crop species.Genotyping-bysequencing (GBS) can be used to create genetic maps and identify trait-associated markers, which, once validated, can be applied in MAS to combine genomic regions of interest (Rasheed et al. 2017).Marker-assisted pyramiding of quantitative trait loci (QTLs) that individually explain a small portion of the variance in a complex trait has been largely ineffective; however, phenotypic recurrent selection can be used to improve complex traits, although the long selection cycles and difficulty in identifying unique genes or alleles restrict its practicability (Ye and Smith 2008).
MARS is a strategy that allows targeted recombination of gene effects through recurrent selection.When MARS is used to introduce a desired target trait from a donor, it is important to accurately check the recovery rate of the background genome as well as the introduction of the target to shorten the time and increase the accuracy of inbred line development.When introducing a target trait from a wild species, it is necessary to minimize the unintentional introduction of additional wild traits by ensuring that only the locus related to the target trait is affected.Furthermore, in the case of polyploidy, it is necessary to confirm the degree of recovery by the subgenome and individual chromosomes after the new genetic material is introduced.Accordingly, a molecular marker set that can check the entire genome can be used to secure an inbred line with a high recovery rate in a relatively short time.
In order to develop MARS markers, genome sequence information must be assembled in a chromosome-level genome.Linkage maps can then be generated for segregated groups using GBS based on the assembled reference genome.These linkage maps can then be applied to select markers capable of identifying the highly similar structures of homeologous subgenomes.Polyploidy generally leads to high sequence conservation between homeologous genes, increased gene and genome dosage, and large genome size, which cause different layers of complexity for gene discovery and ultimately affect the efficient use of markers during breeding (Rasheed et al. 2017).Therefore, when developing markers in polyploid crops, it is essential to consider the redundancy present in the genome.
MARS markers are available in major crops including rice (Ahmadikhah et al. 2015), pepper (Kim et al. 2017), and tomatoes (Park et al. 2018), but not in the tetraploid minor crop perilla.Furthermore, there has been no report on the development of markers for MABC in tetraploid perilla.Perilla is an attractive crop used for oil or as a leafy vegetable because of its unique aroma and composition (Ahmed 2018;Asif 2011;Oh et al. 2018).Perilla has become increasingly popular in recent years as an economical source of omega-3 fatty acids, flavor ingredients, and flavonoids containing anthocyanins (Dhyani et al. 2019;Oh et al. 2018).There is increasing demand for molecular breeding to efficiently enhance these desirable traits in perilla varieties; however, as perilla is cultivated for food only in northeastern Asia, research on its genome and genetics is limited (Asif 2011;Sa et al. 2018).In addition, cultivars of perilla are reported to be polyploid, so it is expected that there will be difficulties in genomic research and molecular marker development (Lee et al. 2018;Nitta et al. 2005).GBS is considered an effective and widely applicable method for genotyping various crops (Hussain et al. 2017;Li et al. 2017;Bielenberg et al. 2015;Verma et al. 2015;Poland et al. 2012).Recently, a large number of single-nucleotide polymorphisms (SNPs) were successfully identified by GBS of interspecies cross-breeding groups of the diploid wild species Perilla citriodora (Kang et al. 2019).
In this report, we identified genome-wide SNPs by GBS of two F2 populations of tetraploid perilla.We then constructed a universal map composed of 22 linkage groups and selected SNP markers for each individual chromosome.Next, we selected 96 markers for MARS, converted them to allelic-specific kompetitive allele specific PCR (KASP) markers, and verified their accuracy in the two F 2 populations using the Fluidigm BioMark HD genotyping system.
We randomly collected 96 fresh young leaves from each F 2 line and extracted genomic DNA using a modified cetyltrimethylammonium bromide-based method (Allen et al. 2006).We then used the extracted genomic DNA to prepare libraries for GBS and whole-genome sequencing.The DNA was quantified using a Thermo Scientific Nanodrop 8000 spectrophotometer (Fisher Scientific; Hampton, NH, USA).To prepare the GBS libraries, we digested the DNA using the restriction enzymes PstI (CTG CAG ) and MspI (CCGG) and ligated the DNA from each F 2 line to a set of 96 barcoded adapters using a modified version of a previously published protocol (Elshire et al. 2011).We then pooled the libraries from each F 2 line and sequenced them in 151-bp reads using TrueSeq Version 3.0 paired-end sequencing on the Illumina platform HiSeq X Ten.Each lane of sequencing contained DNA from all 96 samples of a single F 2 line.

Sequencing and SNP filtration
After sequencing, we removed the reads containing only the common adapter using the Cutadapt software (Martin 2011).We then de-multiplexed the raw reads using a Python script, which filtered out reads containing ambiguous bases in the barcodes and split the raw Illumina FASTQ file into 96 separate FASTQ files based on the barcode sequences (Eun et al. 2016).Next, we trimmed the de-multiplexed reads using the Solexa QA package v.1.13(Cox et al. 2010).Poor-quality reads and reads shorter than 25 bases were discarded.We applied the Burrows-Wheeler Aligner program (BWA; 0.6.1-r104)(Li and Durbin 2009) to align the clean reads to the draft genome sequence of Perilla frutescens (unpublished data) for SNP calling.The high mapping quality ensured reliable mapping of the reads, which is important for variant calling.Next, we extracted the mapped reads from the resulting BAM file for further analyses using SAMtools v.0.1.16(Li et al. 2009).Using the SAM tools varFilter command, SNPs were called only for variable positions with a minimal mapping quality (i.e., -Q) of 30.The minimum and maximum read depths were set to 3 and 100, respectively.We used an in-house script that considered biallelic loci to select the significant sites among the called SNP positions (Kang et al. 2019;Kim et al. 2014Kim et al. , 2018)).A raw SNP matrix was constructed by collecting the raw SNP positions obtained from comparison of each sample with the draft genome.Then, the blank region (non-SNP loci) was filled from the consensus sequence of the sample, and a final SNP matrix was created by filtering out mis-called SNP loci.Based on the ratio of the mapped reads to each SNP locus, we classified the SNPs into three types: Homozygous SNPs, read depth ≥ 90%; Heterozygous SNPs, 40% ≤ read depth ≤ 60%; and Other SNPs, which did not meet the criteria for Homozygous or Heterozygous.We then selected SNPs that had minor allele frequency (MAF) > 5% and missing data in less than 30% of the samples.

Linkage map construction
We converted the filtered SNPs in the F2 populations to a genotype matrix consisting of four characters [maternal type: a; paternal type: b; heterozygous type: h; unknown: n].We then constructed a genetic linkage map of P. frutescens var.frutescens separately using Joinmap 4 ( Van Ooijen 2006).Also, we filtered out genetically identical markers and individuals using the genetic mapping program.The SNP markers were grouped into linkage groups (LGs) with logarithm of the odds threshold value ≥ 4. The genetic distances of the SNP markers, which were based on the recombination rates, were then converted using Kosambi's mapping function (Kosambi 2016).The final genetic linkage map was drawn using MapChart ver2.3 (Voorrips 2002).

SNP selection and validation by fluidigm genotyping assay
SNP markers were selected from the markers constituting the two genetic maps.To design Fluidigm SNP genotyping assays, 60-150 bp sequences flanking the selected SNPs on either side were founded in the perilla draft genome sequence and checked for specificity using BLAST.Finally, the selected SNPs and flanking sequences were uploaded to the D3 Assay Design (https:// d3.fluid igm.com/) website.After confirming the results, the designed assays were ordered.One Fluidigm SNP assay contains allele-specific primer 1 (ASP1), ASP2, locus-specific primer (LSP), and specific target amplification (STA) primer.
The SNP genotyping assays were performed in series using the BioMark™ HD system (Fluidigm, San Francisco, CA, USA) at SNP genetics (Seoul, Korea) according to the manufacturer's instructions using the following thermal cycling conditions: 95 °C for 15 s, 64 °C for 45 s, and 72 °C for 15 s, with a touchdown of − 1 °C per cycle from 64 °C to 61 °C, followed by 34 cycles of 95 °C for 15 s, 60 °C for 45 s, and 72 °C for 15 s.The genotyping results were acquired using the Fluidigm SNP Genotyping Analysis software.All base calls were manually verified, and any errors in homozygous or heterozygous clusters were curated.

Identification of genome-wide SNPs using GBS
The summary statistics of the sequencing and GBS analysis are shown in Table S1.Paired-end sequencing produced approximately 105.4 Gbp (698,076,320 reads) from #973,109 and 109.8Gbp (727,703,068 reads) from #973,116.Following removal of the barcodes and PstI and MspI overhang sequences, 684,566,618 and 715,551,854 short reads from #973,109 and #973,116, respectively, were de-multiplexed.After elimination of adapters and ambiguous nucleotides, 629,718,268 and 660,395,486 high-quality reads with Phred quality score ≥ 20 were obtained for #973,109 and #973,116, respectively.A total of 587,492,927 reads and 622,599,718 reads from #973,109 and #973,116, respectively, were mapped to the P. frutescens var.frutescens draft genome, which accounted for about 93.2% of the trimmed reads from #973,109 and 94.2% of the trimmed reads from #973,116.The average depth of the mapped reads in the GBS data sets was 108 × and 103 × for #973,109 and #973,116, respectively.

SNP filtering by linkage mapping
To secure biparental polymorphic SNPs in each population, we aligned the whole-genome sequencing data of Chajogi #107-1, Pureunchajogi #115-2, and Namcheon to the P. frutescens var.frutescens draft genome and identified genome-wide SNPs.In total, we identified 3,531,724 SNPs between Chajogi #107-1 and Namcheon and 3,659,790 SNPs between Pureunchajogi #115-2 and Namcheon.(Table 1).Subsequent analysis using our in-house GBS analysis pipeline considering read depth, mapping quality, and biallelic loci (Eun et al. 2016;Kim et al. 2014;Lee et al. 2018b) identified a total of 38,156 and 44,219 raw SNPs in #973,109 and #973,116, respectively (Table 1).These SNPs were filtered to identify putative markers using the criteria of 30% missing values across the genotyped individual and MAF ≥ 0.05, which yielded a total of 5,052 and 5,462 SNPs for #973,109 and #973,116, respectively.Next, considering the biparental polymorphic SNPs in each population, we identified 2,840 markers in #973,109 and 3,147 markers in #973,116 (Table 1).After additional filtering using the mapping program with criteria such as chi-square-test and removal of duplicated markers and individuals, 2783 and 3072 markers were converted separately to genotyping matrixes for mapping of each GBS dataset.
The genetic map of #973,109 formed 16 LGs using 2326 out of the 2783 selected SNP markers (Table S2).The number of SNP markers constituting each group ranged from 25 (LG15) to 323 (LG16).The genetic distance (cM) of the LGs ranged from 47.9 cM for LG15 to 220.5 cM for LG08.The total genetic distance was 2,133.9cM, and the average genetic distance between markers was 0.9 cM.The genetic map of #973,116 formed 21 LGs using 2,454 out of the 3,072 selected markers (Table S3).The number of markers constituting each group ranged from 30 (LG10) to 244 (LG16).The genetic distance (cM) of the LGs ranged from 40.8 cM for LG10 to 167.0 cM for LG19.The total genetic distance was 2,169.3cM, and the average genetic distance between markers was 1.0 cM.When we integrated the two genetic maps with the commonly used markers, nine paired LGs were matched one-to-one between the two genetic maps; however, for the remaining 19 LGs, different numbers of LGs were matched between the two maps (Table 2; Fig. 1).Several of the 19 LGs may be the result of the inability to distinguish between sub-genomes.However, by comparing the two genetic maps, the sub-genomes can be more clearly distinguished.

Marker selection for MARS
To select markers for MARS, we compared the 2,454 SNPs in #973,116 to the 2,326 SNPs in #973,109 and identified the SNPs that were common to both linkage maps.A total of 1,123 SNPs were represented in both F 2 populations (Table 2; Fig. 1).In addition, we identified 556 SNP markers by comparing the genetic maps of #973,116 and #973,109 with a genetic map of the recombinant inbred line developed from a cross between Daesil and Ipdeulkkae 01 (unpublished data).We selected target loci from among the candidate SNPs by choosing those with no other SNPs or indels within 300 bp of the target region.In addition, we ensured that the selected SNP markers occurred at regular intervals along the genome so that the whole region of each LG was covered.The largest and smallest LGs had nine and two SNP markers, respectively.Thus, 96 SNP markers were appropriately assigned to individual LGs corresponding to the number of chromosomes.

Discussion
MAS makes crop breeding programs faster and more accurate, especially if the life cycle of the crop is long, or there are not enough research personnel for the target crop.When multiple genes are introduced into different chromosomes, difficulties may arise due to unintentional dragging of nearby genes.Therefore, markers are needed that cover all the chromosomes evenly in order to investigate the recovery of the genetic background after new genes are introduced.This can be done either with a chromosome-level reference genome or with a complete linkage map of the genome.Although new reference genomes are being reported at a rapid rate, there are still many crops, especially minor crops, for which chromosome-level reference genomes are not available.It is always desirable to have a reference genome when designing MARS markers; however, it can be time consuming and costly to create a new high-quality reference genome.As an alternative approach, complete linkage maps can be used to develop MARS marker sets for crops lacking a suitable reference genome.
To create a set of MARS markers for perilla, which is a ploidy orphan crop with sparse background research and no available chromosome-level reference genome, we made a linkage map using F 2 populations.It is particularly challenging to select allelic-specific SNP markers that work well in polyploid crops, because it is not possible to determine how much similarity exists between subgenomes, especially in the absence of a high-quality reference genome.An important and difficult attempt to effectively distinguish each subgenome in polyploids such as bread wheat was made previously (Makhoul et al. 2020).Even with these difficulties, we were able to select appropriate candidate markers through creation of a linkage map.Although the 20 LGs, which equaled the number of perilla chromosomes, were not fully identified, a high success rate was confirmed in the experimental verification process while covering the entire chromosome.
The number of LGs differed depending on the population used to create the genetic map.Sixteen LGs were formed in #973,109, whereas 21 LGs were formed in #973,116 (Table S1 and Table S2).When the corresponding LGs were compared on the genetic maps of the two populations, LG01 in #973,109 was divided into LG05 and LG20 in #973,116.Conversely, LG06 in #973,116 was divided into LG09, LG10, and LG11 in #973,109 (Table 3).Combining two LGs into one LG or dividing one LG into two LGs is expected to fail to accurately separate the LG because of chromosomes     Chromosome number: Cited Tamura K et al., (2023) with high similarity between subgenomes.This problem can be solved through genome assembly using long-read sequencing and comparison with genetic maps.
The chromosome-level genome sequence was recently reported (Tamura et al. 2023).Thus the 22 linkage groups were compared to the recently published chromosome-level genome sequence.As a result of the comparison, in most case, one chromosome was matched to one linkage group.However each of the three chromosomes corresponded by dividing into two linkage groups respectively.That is chromosome 2 corresponded to LG02 and LG03, chromosome 5 to LG05 and LG06, and chromosome 12 to LG17 and LG18.On the other hand, there were cases where two chromosomes 14 and 18 corresponded to one linkage group, LG20 (Table S4).The genome sequence and the markers in the linkage group showed high synteny.
The application of PCR-based genotyping is essential for the rapid utilization of molecular markers in practical MAS breeding programs.We confirmed that the marker conversion from GBS markers to PCR-based KASP markers had a high success rate of over 91%.Thus, by applying the genetic map to perilla without a chromosome-level reference genome, we secured a molecular marker set that covered the entire genome.Our results show how a genetic map can be used to classify subgenomes in tetraploid crops.When we confirmed the converted SNP markers in the F 2 lines, 8.1% of the genotyping results for the markers were different between the GBS and the Fluidigm assay.If we look at the two sets of genotyping results in more detail, we find that certain markers were more likely than others to differ between the two methods.Hence, this type of comparison may be a good way to check the accuracy of markers in the marker selection process.construction, and data analysis; JK constructed the GBS library; MH developed the mapping population; TH provided the draft genome; KL participated in discussions about the experiments; SH and JH wrote and reviewed the manuscript.

Fig. 1 A
Fig. 1 A Perilla genetic linkage map with 1123 markers.A high-density genetic map was constructed by merging the genetic maps of the two F2 lines created from Perilla #107-1 × Namcheon (#973109) and TTT GAG TCG ACG GGA GCC TTG ATT TGA GTC GACA GGT GAC ATC CCG CTC TCT GT CAA CTG CAG AGT CTT TAC TGGG CCT TTT GTA GGCCA CAA TCG ATT CCT TTT GTA GGCCG GCT GCT GTC AAA AAG GAG CCT CGC TTG GCA TTT CTA ATT TGGC GTG GAT CAA GATC TGC TAG CCT TCA AAT CCG CCA AGA AAT GCA CTG TTT CGT TCCA AGA AGA AAA CAT ATC CAC TTG GGA CAA CAA TTG TGC CGG AGA GTG TGA CCA GAT TCT TAG ACT TCT TTA ATT TTC TTT TCT CTG GCT GCA GTT GAA GGA AAG GCA CAG AAT GAT GGC CGT AAG AGT ATG TGC TCCT TCC GTT GAA GTA TGT GCT CCA TGC AGG TGT ACA TTT TAA GTG AGT