Whole-Genome Sequencing of a Year-Round Fruiting Jackfruit Variety Reveals Very High Single Nucleotide Polymorphisms in Inter-Genic Regions


 Background

Jackfruit (Artocarpus heterophyllus Lam.) is a tropical and sub-tropical fruit tree distributed in Asia, Africa, and South America. It is the national fruit of Bangladesh and produces fruit in the summer season only. However, a year-round jackfruit variety, BARI Kanthal-3, developed by the Bangladesh Agricultural Research Institute (BARI), provides fruits from September to June. This study aimed to evaluate the agronomic performance of BARI Kanthal-3 and to generate a draft whole genome sequence to obtain molecular insights into this unique variety.
Results

The number of fruits, average fruit weight, fruit yield per plant, edible portion of the fruit, and β-carotene content of BARI Kanthal-3 (n = 5) were 422/plant/year, 5.60 kg, 236.32 kg/year, 53.5%, and 3614 mg/100 g, respectively. During de novo assembly, 817.7 Mb of the BARI Kanthal-3 genome was scaffolded. In the reference-guided genome assembly, however, almost 843 Mb of the BARI Kanthal-3 genome was scaffolded. In the BUSCO assessment, 97.2% of the core genes were represented in the assembly, with 1.3% fragmented and 1.5% missing. By comparing the single copy orthologues (SCOs) of BARI Kanthal-3 with those of three closely related species and one distantly related species, 706 SCOs were found to be shared across the genomes of the five species. The phylogenetic analysis of the shared SCOs showed that A. heterophyllus is the closest species to BARI Kanthal-3. The estimated genome size of BARI Kanthal-3 was 1.04 giga base pairs (Gbp) with a heterozygosity rate of 1.62%. The estimated GC content was 34.10%. Variant analysis revealed that BARI Kanthal-3 includes 5.7 M (35%) simple and 10.4 M (65%) heterozygous single nucleotide polymorphisms (SNPs), and that about 90% of all these polymorphisms are located in inter-genic regions.
Conclusion

The whole-genome sequence of A. heterophyllus cv. BARI Kanthal-3 reveals extremely high numbers of single nucleotide polymorphisms in inter-genic regions. The findings of this study will support a better understanding of the evolution, domestication, phylogenetic relationships, and year-round fruiting of this highly nutritious fruit crop, and the development of markers for its molecular breeding.


Background
The genus Artocarpus (family Moraceae) comprises approximately 70 species of food-producing plants that are distributed throughout the tropical and subtropical regions of the world [1,2]. Among them, jackfruit, A. heterophyllus Lam., is a tropical evergreen tree that produces the largest edible single fruit in the world (up to 50 kg/fruit) [3,4]. The place of origin of this fruit tree is still unclear, but it is thought to be indigenous to the rainforests of the Western Ghats of India. It is widely cultivated throughout the tropical lowlands in South and Southeast Asia, parts of central and eastern Africa, and Brazil [5][6][7]. Bangladesh is one of the highest jackfruit producing countries in the world. The jackfruit, popularly known as Kanthal, is the national fruit of Bangladesh. Its demand is increasing gradually due to its low price, high nutritional value, diversified uses, and potential for commercial cultivation and the processing industry [7]. The jackfruit is commonly referred to as "poor man's food" due to its low market price as well as its high abundance in the summer season [5,8]. It is grown all over Bangladesh, especially profusely in the central terrace ecosystem [9].
Bangladesh is one of the largest producers of jackfruit, which accounts for about 21% of the total fruit production of the country and is second only to mango as the principal fruit crop. During 2019-20, Bangladesh produced 1.1 million tons of jackfruit from 16,592 hectares [10]. Despite its numerous advantages, the jackfruit tree is not commercially grown as a crop because of an extremely high variation in fruit quality, which is due to its cross-pollinated nature, seed-mediated propagation, short seasonal fruiting, and susceptibility to abiotic stresses [7]. Therefore, the potential of this unique nutritious fruit crop has not yet been utilized in Bangladesh for ensuring food and nutritional security through commercial cultivation and industrial processing. Genetic improvement of the existing germplasm to overcome these problems will accelerate the transition of jackfruit to a commercial crop in Bangladesh [11]. Although several studies have described the high diversity of jackfruit in Bangladesh, none of them is systematic and comprehensive [12,13]. The underlying molecular mechanisms of the trait diversity in jackfruit are largely unknown. The harvesting period of jackfruit is short (June-August), resulting in a large wastage of this fruit, amounting to 20-30% of the crop or even more during the season.
The Bangladesh Agricultural Research Institute (BARI) developed a year-round jackfruit variety, namely BARI Kanthal-3, in 2014. The number of fruits per plant per year ranges between 219 and 245, the fruits are medium in size (averaging 5.43 kg each), and the yield is 1165-1504.2 kg fruit/plant (Figure 1). The ripened edible portion contains 35.06 mg/g β-carotene and 23.6% total soluble sugar (TSS) [1].
Whole genome sequencing provides complete coverage of the coding and noncoding regions of the genome [14], which allows a more comprehensive assessment of the genome of any organism, including plants [15]. It provides the genetic foundation that enables greater efficiency in manipulating genetic diversity at key genes to enhance, reduce, or add certain features to a plant phenotype [15].
Since the first whole genome sequencing of the model plant Arabidopsis thaliana in 2000, a large number of plants from diverse taxonomic groups have been sequenced, and genes responsible for various plant traits have been characterized and cloned [15][16][17]. Recently, whole genome sequencing of economically important organisms such as jute, Corchorus spp. [18], hilsa, Tenualosa ilisha [19], and goat, Capra hircus [20], has created huge public interest in Bangladesh.
Recent advances in genomics analyses have revealed large numbers of single nucleotide polymorphisms (SNPs) as the most common form of DNA sequence variation between alleles in several plant species [21]. Because of their high abundance and significant information content when associated with genes, SNPs have gained center stage as the principal markers of choice for molecular genetics studies. This includes their application in shortening the time needed to breed new varieties in many crops through marker assisted selection [21,22]. SNPs have also been applied for several years to assess diversity in specific genes or genomic regions, revealing the phylogenetic relationships between species. The emergence of high throughput sequencing technologies now allows SNP-based genetic diversity studies to be carried out at scale, which can be useful in conserving diversity in domesticated populations. Plant phylogenetic and evolutionary studies are conventionally based on variation that exists in genes, and hence knowledge of SNPs in these regions is essential for this analysis [23]. It is also important to know the location of SNPs in the whole genome, because if a SNP is present in the coding or regulatory region of a gene, it can greatly affect the functional activity of the resulting protein, such as an enzyme in a biosynthetic pathway [24], by affecting gene expression and transcriptional and translational promoter activities. Therefore, SNPs can often be responsible for phenotypic variations that exist between individuals and can be utilized as selectable genetic markers for improving agronomic traits. However, large scale genomic studies for the identification of SNPs in the jackfruit genome are still not available.
Jackfruit is a high priority fruit plant in Bangladesh, as the National Agricultural Research System (NARS) has recently placed a focus on improving the fruiting characteristics of this species and supporting its commercial development. Until now, only a limited amount of genomic information has been made available for A. heterophyllus [5,25]. Although the development of the year-round fruiting variety BARI Kanthal-3 offers an opportunity for commercial cultivation and processing of jackfruit, nothing is known about the underlying molecular mechanism of its year-round fruiting characteristics and other beneficial traits. Therefore, whole genome sequencing of A. heterophyllus cv. BARI Kanthal-3 and decoding of its genome could offer a potentially significant improvement of jackfruit for desired traits through molecular breeding. A molecular understanding of the extremely high phenotypic variability in jackfruit would facilitate the future development of high yielding, year-round fruiting, biotic and abiotic stress (e.g., flood, salinity, drought, and pest) tolerant jackfruit varieties through molecular breeding, which is essential for establishing a jackfruit-based processing industry in Bangladesh and elsewhere. We therefore report here, for the first time, a draft annotated whole genome assembly of the year-round fruiting A. heterophyllus cv. BARI Kanthal-3. The promising phenotypic characteristics of the BARI Kanthal-3 variety, together with both the de novo and reference-guided assemblies and the identified SNPs, shed light on the genetic variation that exists within the A. heterophyllus genome.

Results And Discussion
Source and phenotypic features of the BARI Kanthal-3 variety
To develop the BARI Kanthal-3 variety, germplasm was collected from all over the country, including the Chittagong Hill Tracts (e.g., Ramgarh in Khagrachari, Bangladesh). In 2014, an accession was certified for cultivation in Bangladesh under the varietal name BARI Kanthal-3, representing a new and unique variety of jackfruit in Bangladesh that bears fruit for almost all of the calendar year (September to June). It is a high yielding variety with an average of 232 (range 219-245) fruits per plant, yielding 1165-1504.2 kg fruit/plant/year (Fig. 1). The fruits of BARI Kanthal-3 were medium in size (average 5.43 kg each) and the average yield was 133.2 t/ha/year (Fig. 1). This variety was not affected by any infectious pathogens or pests (data not shown). The plant is erect and medium bushy. The pulp of the fruit is medium soft, slightly yellow, medium juicy, highly sweet, and aromatic. The amounts of β-carotene and total soluble solids (TSS) in fruits were 35.06 mg/g and 23.6%, respectively. The edible portion of the fruit was 52.5% (Table 1) [1]. A large body of literature has shown that jackfruit is a rich source of carbohydrates, minerals, carboxylic acids, dietary fiber, vitamins, and bioactive compounds [1,26]. Considering the yield, year-round fruiting, and nutritional quality of the fruits, BARI Kanthal-3 could be utilized as a genetic resource for breeding jackfruit for commercial cultivation in Bangladesh and elsewhere. To understand the underlying molecular mechanisms of the unique traits of this new variety, genome sequencing and analysis of genomic data are needed.
The estimated genome size of BARI Kanthal-3 was 1.04 Gb with a heterozygosity rate of 1.62% based on K-mer analysis of the short read data (Fig. 2). The estimated size is similar to the recently reported 1.01 Gb genome size of A. heterophyllus [5], and is consistent with the c-value of 1.20 pg [27].
BARI Kanthal-3 has a higher heterozygosity rate than the only available reference genome, that of a seasonal jackfruit from India recently published by Sahu et al. [5].
After quality filtering of the short reads using Trimmomatic, 218,562 clean reads were obtained (Table 2).
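The K-mer based size estimate can be illustrated with a short sketch. It assumes the common approximation that genome size equals the total number of K-mers divided by the depth at the main peak of the K-mer histogram. The histogram values are invented for illustration and are not the BARI Kanthal-3 data, and the function name is hypothetical.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a K-mer depth histogram.

    histogram maps K-mer depth -> number of distinct K-mers at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (an assumed cut-off, chosen only for this toy example).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no K-mers at or above min_depth")
    # Depth with the most distinct K-mers is taken as the main peak.
    peak_depth = max(usable, key=usable.get)
    # Total K-mer occurrences = sum over depths of depth * count.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (hypothetical values).
toy = {1: 900_000, 2: 200_000, 10: 40_000, 19: 150_000,
       20: 400_000, 21: 160_000, 40: 30_000}
size, peak = estimate_genome_size(toy)
```

In a highly heterozygous genome a second peak at roughly half the main depth is expected, so taking a single maximum is only a first approximation; dedicated tools fit a model to the whole histogram instead.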
The high-quality reads were assembled into contigs using SOAPdenovo2, which ultimately yielded a base assembly of 1.36 M scaffolds, totaling 843 Mb. The scaffold N50 was 1.8 Kb (Table 2). In a recent study from India, the genome of A. heterophyllus was assembled from high-quality reads into 108,267 scaffolds, totaling 982 Mb [5]. The BARI Kanthal-3 contigs were then scaffolded using a reference-guided approach with the existing published draft reference genome of jackfruit [5] and the software RAGTAG, and finally gaps in the scaffolds were filled using the same paired-end Illumina data and the GapCloser software. In this case, SOAPdenovo2 + RAGTAG + GapCloser produced a base assembly with an N50 of 425 Kb in 218,562 scaffolds (Table 2). The GC content of BARI Kanthal-3 was 34.10%, which is comparable to the 32.9% GC content recorded for a seasonal A. heterophyllus of Indian origin [5].
Genome assembly validation
Scaffolds from the BARI Kanthal-3 assembly were compared against the reference genome using the nucmer software and showed a high degree of conservation; the output for a comparison of the BARI Kanthal-3 scaffolds against the largest scaffold in the reference genome is shown in Fig. 3. Only one portion of the genome (approx. 3.5 Mb) is shown, but we found the same high degree of similarity across the rest of the genome.
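The N50 and GC content figures reported for the assemblies are simple summary statistics. A minimal sketch of how they are conventionally computed is given here; the function names and toy inputs are illustrative and are not taken from the study's pipeline.

```python
def n50(lengths):
    """Return the N50: the length L such that scaffolds of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def gc_content(sequence):
    """Return GC content as a percentage of unambiguous (A/C/G/T) bases."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


# Toy scaffold lengths (hypothetical): total 300, half 150.
toy_scaffolds = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_scaffolds)
toy_gc = gc_content("ATGCNN")
```

Because N50 depends only on the length distribution, merging many short scaffolds into fewer long ones raises it sharply even when the total assembled length changes little.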
To assess the representation of a complete conserved core gene set in the BARI Kanthal-3 assembly, the quality and completeness of the draft genome were evaluated using the Benchmarking Universal Single-Copy Orthologs (BUSCO) datasets and an orthologue database (Fig. 4).
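The BUSCO percentages follow directly from the category counts reported for BARI Kanthal-3 (1094 complete single-copy, 475 complete duplicated, 21 fragmented, and 24 missing). The sketch below only reproduces that arithmetic; the function name is hypothetical and it does not run BUSCO itself.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of the searched set."""
    total = single + duplicated + fragmented + missing

    def pct(count):
        return round(100.0 * count / total, 1)

    return {
        "total": total,
        "complete": pct(single + duplicated),
        "single_copy": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }


# Counts reported in the text for BARI Kanthal-3.
summary = busco_summary(single=1094, duplicated=475, fragmented=21, missing=24)
```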
We also identified SCOs from the genomes of four other species (A. heterophyllus, A. altilis, M. notabilis, and A. thaliana) and compared them with the BARI Kanthal-3 genome. In this study, 706 SCOs were found to be shared across the genomes of the five species (Fig. 5A). Remarkably, none of the SCOs was unique to the BARI Kanthal-3 genome. Phylogenetic analysis was performed using the 706 shared SCOs following the neighbor-joining method with 100 bootstraps (Fig. 5B). The phylogenetic analysis showed that the two A. heterophyllus genomes (BARI Kanthal-3 and the A. heterophyllus reference) clustered more closely with each other than with the other three genomes. Therefore, A. heterophyllus is the closest to BARI Kanthal-3, while A. thaliana branched most distantly from BARI Kanthal-3 in the tree (Fig. 5B).
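Selecting the shared SCOs amounts to keeping the orthogroups that contain exactly one gene in every species compared. A minimal sketch of that filter follows, assuming a table of per-species gene counts per orthogroup; the orthogroup identifiers, species labels, and counts are hypothetical and do not come from the study.

```python
SPECIES = ["BARI_Kanthal-3", "A_heterophyllus", "A_altilis",
           "M_notabilis", "A_thaliana"]


def shared_single_copy(orthogroups, species):
    """Return IDs of orthogroups with exactly one gene in every species."""
    return sorted(
        og_id
        for og_id, counts in orthogroups.items()
        if all(counts.get(sp, 0) == 1 for sp in species)
    )


# Hypothetical gene counts per orthogroup and species.
toy_orthogroups = {
    "OG0001": {sp: 1 for sp in SPECIES},
    "OG0002": {**{sp: 1 for sp in SPECIES}, "A_thaliana": 2},  # duplicated
    "OG0003": {sp: 1 for sp in SPECIES if sp != "M_notabilis"},  # absent
    "OG0004": {sp: 1 for sp in SPECIES},
}
shared = shared_single_copy(toy_orthogroups, SPECIES)
```

Restricting the phylogeny to such orthogroups avoids confusing paralogs with orthologs when the sequences are aligned and the tree is built.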

Variant analysis
The processed WGS reads were aligned against the A. heterophyllus draft assembly. Out of a total of 439 M reads, 417 M (95.1%) were aligned in exact pairs. A total of 16 million single-nucleotide polymorphisms (SNPs) were called from the dataset, including 5.7 M (35.0%) simple and 10.4 M (65.0%) heterozygous SNPs (Table 3 and Fig. S1). Approximately 90% of all polymorphisms are located in intergenic regions. In this study, 144,787 (2.5%) and 426,997 (7.5%) of the simple SNPs, and 250,715 (2.4%) and 739,288 (7.1%) of the heterozygous SNPs, were found in exons and introns, respectively (Table 3). We further predicted the effects of variants on genes. As expected, a large fraction of the variants were in the intergenic (64.5%), intronic (5%), and up/down-stream regions (29%) of the genes (Fig. 6A). There are 232,587 missense mutations and 4,750 gained stop codons, suggesting altered protein function in BARI Kanthal-3 (Fig. 6B). One of the important findings of this study is the high level of heterozygosity in the year-round fruiting jackfruit genome. The high level of heterozygosity in the A. heterophyllus genome raises the question of which allele, for each heterozygous locus, is represented by the reference genome (Table 4). Therefore, the inherent differences between individual plants should always be considered when utilizing the reference genome to detect SNP variants [28].
BUSCO assessment. Reference genome [5]: [...] were "complete duplicated", 15 (1%) were "fragmented", and 56 (4%) were "missing". BARI Kanthal-3: 1569 (97.2%) were complete BUSCOs, of which 1094 (67.8%) were "complete single-copy", 475 (29.4%) were "complete duplicated", 21 (1.3%) were "fragmented", and 24 (1.5%) were "missing".
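The exon and intron percentages in the variant analysis can be checked from the reported counts. The sketch below uses the rounded totals given in the text (5.7 M simple and 10.4 M heterozygous SNPs), so the results only approximate the published figures; the variable and function names are illustrative.

```python
# Rounded SNP totals as reported in the text.
SIMPLE_TOTAL = 5.7e6
HET_TOTAL = 10.4e6

# Reported counts of SNPs located in exons and introns.
simple = {"exon": 144_787, "intron": 426_997}
het = {"exon": 250_715, "intron": 739_288}


def pct(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


simple_pct = {region: pct(n, SIMPLE_TOTAL) for region, n in simple.items()}
het_pct = {region: pct(n, HET_TOTAL) for region, n in het.items()}

# Share of all SNPs that fall outside exons and introns.
genic = sum(simple.values()) + sum(het.values())
outside_genes_pct = round(100.0 * (1 - genic / (SIMPLE_TOTAL + HET_TOTAL)), 1)
```

With these inputs roughly 10% of the SNPs fall in exons or introns, which matches the statement that approximately 90% lie outside genes.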

Genome size. The estimated genome size was 1.01 Gbp for the reference genome [5] and 1.04 Gbp for BARI Kanthal-3.
SNPs have been indicated as major factors in the creation of phenotypic variation, and their effect on functional changes of genes is used as a tool in the functional genomics of organisms [29]. The discovery and identification of genomic variants such as SNPs, together with the determination of their location in the genome, can provide valuable information for breeding programs. In plants, many traits of interest have been linked with SNPs [22,30,31]. SNPs have been reported to play a role in metabolism, cellular processes, and signaling, which could be addressed in detail at the breeding level. This study identified, for the first time, the SNPs in the genome of a year-round fruiting jackfruit cultivar, which promises the development of genetic markers associated with the important traits of this economically important plant, including the genes regulating flowering and fruit development. The availability of SNPs within the coding and regulatory sequences also offers the prospect of identifying the causative variations influencing these processes [32].
One of the hallmark findings of this study is that a large share of the SNPs (47.29%) of BARI Kanthal-3 were localized in the intergenic regions. The SNPs occur more frequently in the proximity of genes. Approximately 25% of the intergenic SNPs were detected within the region spanning 10 kb upstream of the gene start site and 10 kb downstream of the gene end site [33], implying the possibility that some of these SNPs affect the expression of the nearest neighboring genes. SNP markers have become extremely popular in plant molecular genetics due to their genome-wide abundance and amenability to high and ultra-high-throughput detection platforms. For example, SNPs are reported to regulate various Quantitative Trait Loci (QTL) responsible for cold tolerance and resistance to diseases such as blight, bacterial canker, and gray mold [34,35]. Novel SNPs associated with flowering in Raphanus sativus inbred lines have recently been discovered by transcriptome sequencing and computational analysis for marker-assisted backcross breeding [36]. Therefore, a further transcriptomics study is needed to identify genes associated with flowering and year-round fruiting in BARI Kanthal-3. In this study, most of the SNPs were identified in non-coding regions (including intergenic regions, 5′ UTR, 3′ UTR, and introns) rather than in the coding regions. It has been reported that the high frequency of genetic variants in noncoding regions likely results from lower selection pressure from natural selection and/or domestication [39]. However, DNA polymorphisms in these regions have been reported to play important roles during evolution and domestication. For example, a mutation in the 5′ regulatory region of the qSH1 gene, an ortholog of the Arabidopsis homeobox gene REPLUMLESS (RPL), results in the absence of abscission zone formation and thus loss of seed shattering in a subset of temperate japonica rice cultivars [40].
Similarly, a considerable number of mutations in introns of pre-harvest sprouting (PHS) genes lead to PHS in rice (Reference). Among the 12 PHS mutants (phs), mutations in genes encoding major enzymes of the carotenoid biosynthesis pathway cause photo-oxidation and ABA-deficiency phenotypes, of which the latter is a major factor controlling the PHS trait in rice (Reference). Interestingly, in jackfruit, MADS-box genes and carotenoid biosynthesis genes were the primary targets of domestication [25]. However, the role of inter-genic SNPs of A. heterophyllus in the domestication of this horticultural plant needs to be explored further.
This study reports, for the first time, the distribution of SNPs in a year-round fruiting cultivar, A. heterophyllus cv. BARI Kanthal-3. The SNPs identified in this study can be used to identify new functional genes and their regulatory activities specific to BARI Kanthal-3. The identified SNPs can also be used as markers to characterize cultivars and wild relatives of A. heterophyllus. Furthermore, the whole genome sequence data and the identified SNPs of a new year-round fruiting jackfruit cultivar of Bangladesh could facilitate further genomics and post-genomics studies for detecting other trait-specific genes that are essential for molecular breeding of jackfruit.

Conclusions
This study analyzed the phenotypic properties and whole genome sequence data of a new year-round fruiting variety of jackfruit, BARI Kanthal-3. The fruit quality, yield, and year-round fruiting properties of BARI Kanthal-3 indicate that it is a unique genetic material for the improvement of jackfruit for commercial cultivation and the development of a jackfruit-based processing industry. The de novo genome assembly with SOAPdenovo2 produced a base assembly with an N50 of 1.8 Kb in 1.36 M scaffolds totaling 817.7 Mb. During reference-guided assembly, SOAPdenovo2 + RAGTAG + GapCloser produced a base assembly with an N50 of 425 Kb in 218,562 scaffolds. A total of 843 Mb of the BARI Kanthal-3 genome was scaffolded. The comparison of scaffolds from the BARI Kanthal-3 assembly against the reference genome using the nucmer software showed a high degree of conservation. The phylogenetic analysis of the shared SCOs showed that A. heterophyllus is the closest species to BARI Kanthal-3.

[...] instruction. Whole genome sequencing (WGS) library preparation was performed using the Nextera XT DNA library preparation kit (Illumina Inc., San Diego, CA, USA) according to the manufacturer's protocol. Briefly, after normalization, DNA samples were fragmented and tagged by tagmentation in a single-tube reaction [41]. The tagmented DNA was amplified through a limited-cycle PCR program using a unique combination of barcode primers, the Index 1 (i7) and Index 2 (i5) sequences, and the full adapter sequences required for cluster generation. Amplification was followed by a cleanup step that purified the library DNA and removed small library fragments using Agencourt AMPure XP beads (Beckman Coulter, Inc. [...]).
De novo genome assembly
The generated WGS data were filtered with Trimmomatic v0.38 [42] using the parameters "LEADING:20 TRAILING:20 SLIDINGWINDOW:4:15 MINLEN:50" to remove Illumina adapters, known Illumina artifacts, phiX, and low-quality regions.
The processed reads were assembled by SOAPdenovo2 v2.04 [43] with k-mer = 39 and subsequently scaffolded using a reference-guided approach with the RAGTAG [44] software with default parameters. GapCloser v1.12 [43] with the parameters "-l 150 -t 32 -p 31" was utilized for gap closing using the paired-end data.
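The effect of the quality-trimming parameters can be illustrated on a single read. The sketch below is a simplified re-implementation of the four steps, applied in the order listed, to a list of Phred quality scores. It is not Trimmomatic's actual code, it omits adapter and artifact removal, and the function name and toy read are hypothetical.

```python
def trim_read(quals, leading=20, trailing=20, window=4, window_q=15, minlen=50):
    """Return the (start, end) slice of a read kept after trimming,
    or None if the trimmed read is shorter than minlen."""
    start, end = 0, len(quals)
    # LEADING: remove bases from the 5' end while quality < threshold.
    while start < end and quals[start] < leading:
        start += 1
    # TRAILING: remove bases from the 3' end while quality < threshold.
    while end > start and quals[end - 1] < trailing:
        end -= 1
    # SLIDINGWINDOW: cut at the first window whose mean quality < window_q.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < window_q:
            end = i
            break
    # MINLEN: discard reads that are too short after trimming.
    return (start, end) if end - start >= minlen else None


# Toy read: 2 poor leading bases, 60 good bases, a 4-base quality dip,
# 10 good bases, and 3 poor trailing bases.
toy_quals = [10] * 2 + [30] * 60 + [5] * 4 + [30] * 10 + [10] * 3
kept = trim_read(toy_quals)
dropped = trim_read([30] * 40)
```

In this example the read is cut just before the quality dip, so the good bases after the dip are also discarded; that is the intended behaviour of a window-based cut from the 5' end.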

Genome assembly validation
The scaffolded sequences were compared against the reference genome using nucmer v4.0.0rc1 [45].

Authors' contributions
All authors contributed intellectually to this study. TI conceived the study, designed the experiment, coordinated the project, provided reagents and laboratory support, interpreted the results, and wrote and revised the manuscript; NA collected the plant samples, extracted DNA, and prepared the draft manuscript; CK assembled and annotated the sequence data, interpreted the results, and wrote the manuscript; MNH interpreted the results and wrote and revised the manuscript; MJR provided plant samples, collected and interpreted phenotypic and biochemical data, and wrote the manuscript; NUM and DRG conducted experiments and prepared the library for DNA sequencing using the Illumina NextSeq 550; AAN and RI analyzed the sequence data, prepared the phylogenetic tree, and wrote the manuscript; PKB wrote and revised the manuscript; AGS coordinated, wrote, and revised the manuscript. All authors read, revised, edited, and approved the final manuscript.
Figure 2. Genome size prediction of Artocarpus heterophyllus cv. BARI Kanthal-3. The X-axis represents the coverage of the genome while the Y-axis represents the frequency levels.