Genome-wide Development of Insertion-deletion (InDel) Markers Database for Cannabis and its uses for Genetic Structure Analysis of Chinese Germplasm and Identification of Sex-linked Marker


Background: Cannabis sativa L., a dioecious plant originating in China, has important medicinal properties and economic value worldwide. How cannabis is used usually depends on the sex of the plant. To analyse the genetic structure of Chinese cannabis and identify sex-linked markers, genome-wide insertion-deletion (InDel) markers were designed and applied. Results: In this study, a genome-wide analysis of InDel polymorphisms was performed based on recently published genome sequences. In total, 47,558 InDels were detected between the two varieties "Purple Kush" and "Finola", and the length of the InDels ranged from 4 bp to 87 bp. The most common InDels were tetranucleotides, followed by pentanucleotides. Chromosome 5 had the highest number of InDels among the cannabis chromosomes, while chromosome 10 had the lowest. A total of 47,558 InDel markers were designed, and 84 primer pairs evenly distributed across the cannabis genome were chosen for polymorphism analysis. Of these, 38 primer pairs exhibited polymorphisms among three accessions, and 14 biallelic primer pairs were further used to analyse the genetic structure. A total of 39 fragments were detected, and the PIC value ranged from 0.1209 to 0.6351. According to the InDel markers as well as flowering time, the 115 Chinese germplasm accessions were divided into two subgroups, composed mainly of cultivars from the northernmost and southernmost regions, respectively. Additionally, the marker "I1-10" was found to amplify two bands (398 bp and 251 bp) in male plants but only a single 398 bp band in female plants. Using this marker, feminized and dioecious varieties can also be distinguished. Conclusion: This study will facilitate the genetic improvement and germplasm conservation of cannabis in China, and the sex-linked InDel marker will provide an accurate sex identification strategy for cannabis breeding and production.

Introduction
Cannabis sativa L., a member of the Cannabaceae, is a diploid (2n=20) dicotyledon and one of the oldest cultivated plants. Although it originated in Central Asia, including China, it soon came to be cultivated worldwide for folk medicine, textile fibre, oil, and recreational use [1]. Cannabis is a genus of flowering plants that is commonly divided into two types, hemp and marijuana, based on tetrahydrocannabinol (THC) content [2]. Although cannabis cultivation is restricted in many countries because of its widespread use as a recreational drug, there has been a resurgence of interest in its agronomic potential and especially its medical value.
Cannabis is a dioecious species, with male and female flowers borne on separate plants. The sex of the plant commonly affects economically relevant traits such as fibre quality and cannabidiol (CBD) content. In general, male plants have a higher fibre content and better fibre quality, while the CBD content of female plants is higher than that of male plants. Therefore, the ratio of male to female individuals must be adjusted to the production purpose in order to improve economic efficiency. However, it is difficult to identify the sex of a plant from morphological traits before flowering, and DNA molecular markers, which are unaffected by developmental stage, are considered an accurate and reliable method for sex identification in dioecious plants [3].

Conventional breeding is the main method for developing new varieties in cannabis breeding programs; however, this process is very challenging and often takes several years [4]. Previous studies have indicated that advances in molecular technologies offer several molecular breeding strategies, such as the use of molecular markers, to overcome the limitations of conventional breeding [4,5]. Several types of molecular markers have been identified and applied to the molecular analysis of cannabis. For example, restriction fragment length polymorphism (RFLP), random amplified polymorphic DNA (RAPD), amplified fragment length polymorphism (AFLP), and simple sequence repeats (SSRs) were used to analyse the genetic diversity of cannabis germplasm [6][7][8][9][10], and some RAPD and SSR markers were successfully used to identify sex-linked markers [11][12][13][14]. Although different types of cannabis molecular markers have been identified and used, research still lags behind that in other crops such as rice, wheat, and maize. This is mainly owing to two limitations: first, only a few such markers are currently known; second, detailed information about the physical locations of these markers is not available because of a lack of high-quality chromosome-scale reference genome sequences. As a result, the current body of research on cannabis molecular markers is insufficient for molecular analysis of cannabis toward agronomic applications.
Insertion-deletions (InDels) are a major source of structural genetic variation and are widely distributed across plant genomes. Like SSRs, InDels are length polymorphisms; each originates from a single mutation event and is generally biallelic and single-locus in nature. InDels also share many of the desirable genetic characteristics of both SNP and SSR markers, such as co-dominance, abundance, and random distribution across the genome [15]. In addition, InDel markers are easily detected in silico on a genome-wide scale from freely accessible genomic or transcriptomic sequence resources, at low cost in money, labour, and time. InDels now represent an ideal marker system that is used extensively in population genetics, taxonomic diagnosis, genetic map construction, and association mapping in different crop plants [4][16][17][18].
Although InDel markers have been widely identified in barley [18], oilseed rape [19,20], and other plants [21][22][23], to our knowledge no research on InDels in cannabis has been reported so far. This knowledge gap limits the comprehensive molecular analysis of cannabis.
China is considered one of the putative centres of origin of cannabis, where it has been planted for more than 2000 years for fibre, oil, and other purposes [24]. At present, however, fibre yield, fibre quality, and CBD content are vital factors limiting the development of the cannabis industry in China, making it necessary to undertake genetic improvement of the cannabis cultivated in China. Previous studies have shown that genetic structure analysis of germplasm can facilitate genetic improvement in other crops [25,26]. Nevertheless, such research has not yet been carried out sufficiently on cannabis germplasm grown in China.
Until now, the genetic diversity and population structure of cannabis have been analysed using SSR and ISSR markers [9,27]. Because of non-specific amplification with SSR or ISSR markers, genotyping results can be confusing. Thus, dimorphic molecular markers such as InDels are considered an ideal choice for genetic diversity analysis. Recently, high-quality chromosome-scale reference genomes of the drug-type strain "Purple Kush" and the hemp variety "Finola" were obtained, which enables the genome-wide capture of InDels in cannabis [28]. In this context, we aimed to develop genome-wide InDel markers. On the one hand, the genetic structure of 115 cannabis accessions from China was evaluated using the newly developed InDel markers and a phenotypic marker, flowering time. On the other hand, a new InDel marker was found to be useful for identifying the sex of cannabis plants. Our results provide a useful tool for the molecular analysis of cannabis in the future, and the information on the genetic structure of the cannabis germplasm and on the sex-linked marker will aid the genetic improvement and molecular breeding of cannabis.

Distribution of InDel markers
Whole genomes for "Purple Kush" (PK) and "Finola" (FN) were downloaded from ftp://ftpmips.helmholtzmuenchen.de/plants/barley/public_data/. On a genome-wide basis, 47,558 InDels were identified between PK and FN in the genomic DNA sequence database (Table S1). InDel length varied from 4 bp to 87 bp, and the number of InDels decreased sharply with increasing InDel length. InDels of 4 bp were the most common (11,286), accounting for 23.7% of the total (Fig. 1). The distribution of InDels also differed among the chromosomes of the FN genome. As shown in Fig

Development of InDel markers on whole cannabis genome and polymorphism analysis
In total, 47,558 InDel markers between FN and PK were successfully developed, with a density of 47.1/Mb in the FN genome. The lengths of all primers were between 18 bp and 24 bp, and the product sizes ranged from 80 bp to 400 bp. To evaluate the polymorphism of the primer pairs, 84 InDels distributed along the chromosomes at intervals of about 10 Mb were chosen for polymorphism analysis (Fig. S1). The results showed that 80 primer pairs amplified successfully, and 38 primer pairs exhibited polymorphisms among three varieties ("Yunma 6", "Neimengudali", "Qingdama 1"). Of the polymorphic primer pairs, the 14 that produced two alleles among these three varieties were used for further study.
Genetic diversity analysis and population structure
The 14 InDel primer pairs were used to analyse the genetic relationships of the 115 accessions, and a total of 39 polymorphic bands were amplified. The PIC ranged from 0.1209 to 0.6351, with an average of 0.4109, and the gene diversity varied from 0.1243 to 0.6865, with an average of 0.4664. The average MAF was 0.6484, with a range from 0.4478 to 0.9348 (Table 2). Cluster analysis was then conducted with the unweighted pair-group method with arithmetic means (UPGMA) using NTSYS-pc 2.11 software. As shown in Fig. 4, at a genetic distance of 0.74, the 115 accessions were divided into two groups. Group I included 84 accessions, which consisted mainly of varieties cultivated in north China (up to 90%). Group II included 31 accessions, most of which were from south China (90.3%).
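The per-marker statistics reported in this section can be illustrated with a minimal sketch. It assumes the standard definitions, gene diversity as 1 - Σp_i² and PIC as in Botstein et al. (1980), and it reads "MAF" as the major allele frequency, which is an assumption based on the reported values exceeding 0.5. The function names and the example genotype list are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(alleles):
    """Return allele frequencies (descending) from a list of observed allele labels."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return sorted((n / total for n in counts.values()), reverse=True)


def gene_diversity(freqs):
    """Expected heterozygosity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content: gene diversity minus sum over i<j of 2*p_i^2*p_j^2."""
    cross = sum(2.0 * (p * q) ** 2 for p, q in combinations(freqs, 2))
    return gene_diversity(freqs) - cross


# Hypothetical marker scored in eight accessions (labels are band sizes in bp).
observed = [398, 398, 398, 398, 398, 398, 251, 251]
freqs = allele_frequencies(observed)
print("major allele frequency:", freqs[0])
print("gene diversity:", round(gene_diversity(freqs), 4))
print("PIC:", round(pic(freqs), 4))
```

Applied per marker, such a routine yields one row of a summary like Table 2; the published values may have been computed with a bias-corrected estimator, so small differences would be expected.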
Based on the 39 alleles amplified using the 14 InDels, the population structure of the 115 individuals was further estimated under Hardy-Weinberg equilibrium using STRUCTURE v2.3.3. Delta K was plotted against K, and the best number of clusters was obtained via the Structure Harvester platform (http://taylor0.biology.ucla.edu/structureHarvester/). As shown in Fig. 5, Delta K reached its maximum at K = 2, indicating that the 115 individuals were clearly divided into two groups.
As shown in Table 1, the flowering time of the 115 cannabis genotypes varied from 23 days to 125 days. Cluster analysis of flowering time was conducted in IBM SPSS Statistics 19.0 using the furthest-neighbour (longest distance) method with squared Euclidean distance. As shown in Fig. 6, at an inter-class distance of 25, the 115 genotypes were divided into two groups. Group 1 included 34 cultivars, which mainly originated from South China (30), and group 2 contained 81 cultivars, most of which were from North China (74), including Northwest China (15) and Northeast China (37).
Screen of sex-linked InDel markers and PCR verification of known-sex plants
Because the latest report indicated that chromosome pair 1 is the sex chromosome pair in cannabis [29], 11 primer pairs evenly distributed on chromosome 1 were designed and used to amplify 12 samples (6 females and 6 males) from an F2 population derived from a cross between "Yunma 6" and "H4" (Table S2). As shown in Fig. 7A, only one primer pair (I1-10) amplified two bands in male plants (251 bp and 398 bp) and a single band (398 bp) in female plants.
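The choice of K from Delta K can be sketched as follows. This assumes the Evanno-style statistic that Structure Harvester reports, computed here as the absolute second-order difference of the mean log-likelihood divided by the standard deviation across replicate runs. The log-likelihood values below are invented for illustration and are not the study's STRUCTURE output.

```python
from statistics import mean, stdev


def delta_k(ln_prob):
    """Delta K for every K that has both neighbours K-1 and K+1.

    ln_prob maps K to the list of Ln P(D) values from replicate runs at that K.
    """
    result = {}
    for k in sorted(ln_prob):
        if k - 1 not in ln_prob or k + 1 not in ln_prob:
            continue  # Delta K is undefined at the ends of the tested range
        second_diff = abs(
            mean(ln_prob[k + 1]) - 2 * mean(ln_prob[k]) + mean(ln_prob[k - 1])
        )
        sd = stdev(ln_prob[k])
        result[k] = second_diff / sd if sd > 0 else float("inf")
    return result


# Hypothetical replicate log-likelihoods for K = 1..4 (three runs each).
runs = {
    1: [-1000, -1002, -998],
    2: [-900, -901, -899],
    3: [-890, -895, -885],
    4: [-885, -880, -890],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, "best K:", best_k)
```

With these made-up values the statistic peaks at K = 2, which mirrors the shape of the curve described for Fig. 5 without reproducing its data.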
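The banding pattern described for I1-10 amounts to a simple decision rule, sketched below. The band sizes come from the text; the size tolerance and the "undetermined" category are illustrative assumptions rather than part of the reported protocol.

```python
SHARED_BAND_BP = 398  # band reported in both female and male plants
MALE_BAND_BP = 251    # additional band reported only in male plants


def has_band(observed_sizes, expected_bp, tolerance_bp=5):
    """True if any observed fragment lies within the tolerance of the expected size."""
    return any(abs(size - expected_bp) <= tolerance_bp for size in observed_sizes)


def call_sex(observed_sizes, tolerance_bp=5):
    """Classify one sample from its I1-10 fragment sizes (in bp)."""
    shared = has_band(observed_sizes, SHARED_BAND_BP, tolerance_bp)
    male_specific = has_band(observed_sizes, MALE_BAND_BP, tolerance_bp)
    if shared and male_specific:
        return "male"
    if shared:
        return "female"
    # No shared band: failed PCR or unexpected pattern (hypothetical category).
    return "undetermined"


print(call_sex([398, 251]))  # two bands
print(call_sex([399]))       # single band near 398 bp
print(call_sex([]))          # no amplification
```

Requiring the 398 bp band before making any call means that a failed reaction is flagged instead of being scored as female.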
To further verify the versatility and accuracy of the I1-10 primer pair, 24 known-sex plants from the dioecious variety "H4" and 10 known-sex plants from the feminized variety "ZY1" were used for PCR amplification. The 12 female plants amplified only the 398 bp fragment, while the 12 male plants amplified two bands (398 bp and 251 bp) (Fig. 7B). Consistent with the female plants of "H4", all plants from "ZY1" amplified only the 398 bp fragment with the I1-10 primer pair (Fig. 7C). Moreover, to understand the potential function of the I1-10 sex-linked marker, we searched its sequence against the NCBI nucleotide database using BLASTN, and it mapped onto two regions. However, no gene was found close to the I1-10 marker (Table S3).

[9], which makes it difficult to meet the demand of genetic map construction and QTL mapping. In addition, a genome-wide survey of InDels has not yet been carried out for cannabis. In this study, 47,558 InDels were identified in the cannabis genome, and the average density across the FN genome was 0.053 InDels/kb, much lower than that found in other species such as human, rice, and oilseed rape [20][30][31]. To analyse the population structure of the 115 cannabis germplasm accessions cultivated in China, 84 InDels distributed along the cannabis chromosomes at intervals of approximately 10 Mb were chosen for polymorphism analysis, and 38 InDels were found to be polymorphic among three accessions. The polymorphism rate was 45.2%, similar to that in chickpea (46.6%) [36], lower than that in jute (58%) [23], and higher than that in maize (18.68%) [22], which indicates that the polymorphism rate may depend on the plant species. Of the 38 polymorphic InDels, the 14 that amplified only two fragments were selected for genotyping the 115 accessions.
The PIC values ranged from 0.1209 to 0.6351, with an average of 0.4109, indicating that most of the InDels have a moderate level of genetic diversity, lower than that of SSR markers in cannabis [10]. A possible reason is that most of the InDels used in this study are biallelic (Fig. S2), whereas SSRs are generally multiallelic.
Information on the genetic structure of different genotypes can guide breeding programs for developing varieties with a broad genetic background. The genetic diversity of cannabis germplasm has previously been analysed using two types of markers, SSR and ISSR [9,27]. In the present study, 39 fragments were amplified using the 14 InDels, and Delta K reached its maximum at K = 2, dividing the 115 accessions into two subgroups. Most of the cultivars from North China belonged to Group I, while most cultivars from the south belonged to Group II (Fig. 5). Consistent with the population structure analysis, the 115 accessions were clearly clustered into two major groups by UPGMA clustering (Fig. 4). As cannabis is an annual, photoperiod-sensitive crop in which day length may determine the floral transition and flowering time, we suppose that climate, as influenced by latitude and day length, is an important factor affecting cannabis germplasm diversity. The accessions from North China and South China were classified into groups I and II, respectively (Fig. 4 and Fig. 5), perhaps because of the higher latitude (i.e. longer day length) and the low-latitude plateaus, respectively, which is in agreement with the analyses of Gao et al. (2014) and Zhang et al. (2018) [9,37]. In addition, both group I and group II included cultivars from central China, such as Henan Province, implying that breeders in these areas frequently exchange cannabis germplasm resources with breeders from the northern or southern regions.
Cannabis is a short-day crop that is sensitive to photoperiod. Flowering time is an important agronomic trait that affects cannabidiol (CBD) content and fibre yield. In addition to the InDel-based grouping, the 115 cannabis genotypes were also clustered into two groups according to their flowering time: the cultivars of group 1 mainly originated from South China, and most of the varieties in group 2 were from North China (Fig. 6), consistent with the results of the population structure analysis and UPGMA clustering (Fig. 4 and Fig. 5). In general, when cannabis cultivars originating from the northern regions are introduced to the southern regions, the plants flower early. In this study, although the cultivars '22' and '214' originated from northern China (Liaoning and Heilongjiang provinces, respectively), the plants did not flower early when cultivated in southern China (Hunan Province), which indicates that these cultivars may be insensitive to photoperiod. Thus, these two cultivars would be ideal germplasm for developing cannabis varieties that are widely adaptable to day length.
Because female and male plants differ in economic value, a suitable ratio of female to male individuals is important for enhancing economic efficiency. To overcome the difficulty of accurately identifying sex by morphological methods before flowering, three types of sex-linked molecular markers, SSR, AFLP, and RAPD, have been detected in previous studies [11][12][13][14]. In this study, sex-linked InDel markers have been identified in cannabis for the first time. Interestingly, as with the sex-linked SSR marker CS308 [14], a fragment of the same size was present in both female and male plants amplified with I1-10, indicating that these markers are not specific to the Y chromosome, unlike the markers MADC1 to MADC3 on the Y chromosome [11][12][13].
Sex-linked markers can provide an entry point for identifying sex-linked sequences, which helps to find the genes involved in sex determination and differentiation [38]. Unfortunately, no gene hits were found for the I1-10 sex-linked marker, which maps onto the sex chromosome, in the BLAST search against the NCBI nucleotide database (Table S3). A possible reason is that the Y chromosome-specific fragment amplified using I1-10 is only 251 bp long, which gives limited sequence information. Thus, genome walking will be used in further studies to obtain the unknown DNA regions flanking the I1-10 marker.

Conclusions
In summary, to our knowledge, our study is the first to report the large-scale identification and development of InDel markers in the cannabis genome, to analyse the genetic structure of Chinese cannabis germplasm using InDel markers, and to identify a sex-linked InDel marker in cannabis. These genome-wide InDels and the data on the genetic relationships of Chinese cannabis germplasm should be useful for further molecular analysis of cannabis, and marker I1-10 provides an accurate means of distinguishing female from male individuals, thus improving the economic efficiency of cannabis resources.

Materials And Methods
Plant materials and DNA extraction
A total of 115 cannabis accessions were collected from different regions in China and preserved in our institute. Detailed information on these cultivars is summarised in Table 1. Flowering time was defined as the time from sowing to flowering; it was scored when more than 50% of the plants of a cultivar were in bloom and is listed in Table 1. In addition, six female and six male individuals, selected from an F2 population derived from a cross between a female "H4" plant and a male "Yunma 6" (Y6) plant, were used to screen for sex-linked markers. Furthermore, 24 samples (12 female and 12 male individuals) from the "H4" variety and 10 samples from the feminized cannabis variety "ZY1" were used for further validation of the sex-linked marker.

DNA extraction
Young leaves of each sample were collected at the flowering stage for DNA extraction. A Plant Genomic DNA Kit (Tiangen Biotech, Beijing, China) was used for DNA extraction. DNA quality and quantity were checked using an Eppendorf BioSpectrometer (Eppendorf, Hamburg, Germany), and the DNA was diluted to a 10 ng/μL working solution.
For each InDel, the highest-scoring primer pair in the design results was selected for the experiments; these primer pairs are listed in Table S1.

InDel genotyping
The 84 primer pairs evenly distributed in the FN genome were selected for polymorphism analysis. Polymerase chain reactions (PCRs) were performed in 10 μL reaction mixtures containing 7 μL of PCR mix solution (Qingke, Nanjing, China), 1 μL of the forward primer (10 nmol/L), 1 μL of the reverse primer (10 nmol/L), and 1 μL of the DNA template. PCR was conducted as follows: an initial step at 95 °C for 5 min, followed by 32 cycles of 30 s at 94 °C, 30 s at 55 °C, and 40 s at 72 °C, and a final extension of 10 min at 72 °C. The primers used for genotyping are listed in Table 2 and Table S2.
Genetic diversity assay and population structure
Bands of the same type on the electropherograms of the 115 cannabis cultivars were regarded as amplified from the same InDel marker, and each polymorphic band detected with a given primer pair was treated as one allele. To generate the molecular data matrices, clear bands for each fragment were scored in every accession for each primer pair and recorded as 1 (presence of the fragment), 0 (absence of the fragment), or 9 (no band amplified at all, i.e. missing data).
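The 1/0/9 coding described above can be turned into a pairwise similarity between accessions, as sketched here. The Jaccard coefficient is an assumption made for illustration, since the text does not state which coefficient was supplied to NTSYS-pc, and the two band profiles are invented.

```python
MISSING = 9  # code for "no band amplified" in the scoring scheme


def jaccard_similarity(profile_a, profile_b):
    """Jaccard similarity of two 1/0/9-coded band profiles.

    Positions where either accession is coded 9 are skipped.
    Returns None when no band is present in either profile.
    """
    shared = 0  # bands present in both accessions
    either = 0  # bands present in at least one accession
    for a, b in zip(profile_a, profile_b):
        if a == MISSING or b == MISSING:
            continue
        if a == 1 and b == 1:
            shared += 1
        if a == 1 or b == 1:
            either += 1
    return shared / either if either else None


# Hypothetical band profiles for two accessions across five fragments.
acc_1 = [1, 0, 1, 9, 1]
acc_2 = [1, 1, 0, 1, 1]
similarity = jaccard_similarity(acc_1, acc_2)
print("similarity:", similarity, "distance:", 1 - similarity)
```

A matrix of such pairwise distances is the kind of input that UPGMA clustering operates on; a simple matching or Dice coefficient could be substituted by changing only the counting rule.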

Consent for publication
Not applicable