Whole-genome resequencing of ‘Sunset’
The ‘Sunset’ genome was sequenced and assembled using a reference guided assembly approach using Illumina sequencing technology. The sequencing quality of these raw reads was generally high (90% with Phred quality score >27). After filtering, a total of 74 million high quality, 124 bp paired-end (PE) reads were generated. The total read length was 9.197 Gb, representing around 24.72× genome equivalents (Table 1). The sequencing depths were evenly dispersed along the papaya chromosomes. We first mapped the PE reads back to the ‘SunUp’ reference genome by BWA’s short read aligner [21]. After removing multiple mapping reads and PCR duplicates, 48 million clean reads were retained for the following study. Of these ‘Sunset’ reads, as high as 99.97% matched unique ‘SunUp’ genomic locations, showing substantial consistency over most genome regions between ‘SunUp’ and ‘Sunset’. The remaining 15,822 reads (0.03%) were unmapped, and likely correspond to the organelle genomes, ‘Sunset’-specific region or highly repetitive regions that were unassembled in the reference ‘SunUp’ genome. Approximately 46 million (95.78%) clean reads mapped to reference genome in a properly paired orientation.
Detection and characterization of SNPs, small InDels and large SVs in ‘Sunset’
Polymorphisms between ‘Sunset’ and ‘SunUp’ were identified using SAMtools software suite [22] with strict parameters. Polymorphisms with coverage <10 or >100 and quality <50 were discarded to eliminate false positives in low coverage and highly repetitive regions respectively. Polymorphism sites with only one ALT were retained given the diploid nature of papaya. In total, 310,364 SNPs and 34,071 small InDels were found between ‘Sunset’ and the ‘SunUp’ reference genome (Table 2), with an average mutation rate of 0.084% for SNPs vs. 0.009% for InDels. The number of heterozygous SNPs was nearly 7 times higher than that of homozygous SNPs (269,493 vs. 40,871). A more even distribution was observed in the numbers of homozygous and heterozygous InDels, with 19,135 and 14,936, respectively. The genome wide average for polymorphisms across the ‘Sunset’ genome was 84 SNPs per 100 kb and 9 InDels per 100 kb (Table 3 and Fig. S1). SNPs were substantially more prevalent at the genome-wide level than InDels. SNPs had an uneven distribution across the 9 chromosomes of papaya ranging from 24 SNPs per 100 kb in chromosome 2 to 165 SNPs per 100 kb in chromosome 6. InDels were more evenly dispersed across the ‘Sunset’ genome ranging from an average of 7 InDels per 100 kb in chromosome 2/9 to 13 InDels per 100 kb in chromosome 6.
All types of base changes were obtained and subdivided into transitions (Ts) and transversions (Tv) (Table 4, Fig. S2). The total amount of Ts and Tv detected in all SNPs was 205,333 and 105,031 respectively, with a Ts/Tv ratio of 1.95. The average ratios of Ts to Tv for homozygous and heterozygous SNPs were 1.03 and 2.18, respectively. The amount of all four types of Ts were observed to have between 3.4- to 5.8-fold more than that of any types of Tv. The SNPs consisted of 104,312 G/C to A/T transitions (33.6%), 101,021 A/T to G/C transitions (32.6%), followed by 29,222 G/C to T/A transversions (9.4%), 28,910 A/T to C/G transversions (9.3%), 28,835 A/T to T/A (9.3%) and 18,064 G/C to C/G transversions (5.8%). Changes from G/C to A/T (Ts) were observed with the highest frequency whereas G/C to C/G (Tv) were the least frequent changes.
The length of small InDels ranged in size from 1 to 6 bp throughout the entire genome (Fig. 1), of which 1 bp-sized InDels were the most abundant, followed by 2 bp-sized InDels. In general, the amount of InDels decreased sharply as their size increased, especially for the shortest ones (1- to 2-bp) which showed the most dramatic drop in number. An exception was that the number of 3 bp-sized and 5 bp-sized InDels were slightly less than that of 4 bp-sized and 6 bp-sized InDels respectively.
The BLAST result indicated that no additional plasmid derived inserts were found in the available ‘SunUp’ genome with the exception of three previously detected plasmid-derived inserts. In addition to SNPs and small InDels, the prevalence of some other types of larger structural variations (>50 bp) such as larger insertions (INS) and deletions (DEL), inversions (INV), intra-chromosomal translocations (ITX) and inter-chromosomal translocations (CTX) were also assessed using BreakDancer under stringent criteria. A total of 1,200 structural variants were identified in ‘Sunset’ (Table S1). We further validated these SVs by manual inspection of read alignments and found that all of SVs were unreliably predicted or false positives. Although each detected SV was supported by several reads, these regions were also covered by paired-end reads that supported the papaya reference genome arrangement. All false positives were found to be located in the gap regions or regions with high levels of coverage (>100).
Classification of SNPs and small InDels by potential impact on protein function
We predicted the variant effects of SNPs and small InDels according to their potential impact on protein function using SNPEff program [23] and self-built papaya data sets (Fig. 2 and Table 5). All variants that may have an effect on protein function could be categorized into 35 effect types, which were further grouped into the following four larger predefined impact categories on the basis of the assumed severity: HIGH, MODERATE, LOW, and MODIFIER (Table 5). The vast majority of variants (571,039, 97.4%) belonged to the MODIFIER category, which is usually comprised of intronic and intergenic variants and assumed to have only a weak or no impact on the protein. The LOW category is thought to be mostly harmless or unlikely to change protein behavior, such as synonymous mutations. A non-disruptive variant that might change protein effectiveness is defined as MODERATE, including in-frame deletions and missense mutations. In all 7,533 (1.28%) and 6,114 (1.04%) variants had possible MODERATE and LOW impacts on gene function. Only 1,591 variants with HIGH impacts were found, representing 0.27% of the total variants, which are assumed to have disruptive impacts on the protein, probably causing protein truncations, loss of function or triggering nonsense mediated decay. The most common types of mutations were frameshift variants in the HIGH category.
In terms of genomic distribution, approximately 48.5% of SNPs were identified in intergenic regions, while merely 8.4% were present in genic regions. Upstream promoter regions and downstream regulatory regions contained high proportions of SNPs, accounting for about 21% (Fig. 2A). Within the genic region, 5.9% and 2.5% of SNPs were present in the introns and coding sequence (CDS) regions, respectively (Fig. 2B). Overall, a similar distribution pattern of SNPs and InDels was observed across the entire genome. Likewise, a substantial number of InDels (~39%) were identified in intergenic regions (Fig. 2A), whereas only 9.9%were located in genic regions, consisting of 8.1% of intronic InDels and 1.8% of exonic InDels (Fig. 2C). The presence of InDels in the upstream and downstream regulatory regions of genes was also shown with a relatively high percentage (~25%) (Fig. 2A). In order to investigate the effect of SNPs on the amino acid alteration of a protein, the likelihood of non-synonymous and synonymous coding SNPs was estimated. Among all SNPs, 7,589 non-synonymous and 5,272 synonymous type modifications were detected in ‘Sunset’ (Fig. 2B). The ratio of non-synonymous to synonymous SNPs (NS/Syn ratio) was about 1.439. The predominant InDels within the coding regions were frameshift mutations (1,137, 95.7%), i.e. an indel size of which is not multiple of 3 (the length of a codon), whereas a significantly lower amount of codon insertions (31, 2.6%) and deletions (20, 1.7%) was observed (Fig. 2C).
With respect to gene function, all high-impact SNPs were predicted to affect 1,454 genes. For the global functional analysis of HIGH category genes, Gene Ontology (GO) terms were assigned to corresponding genes using BLAST2GO software [24]. Of 1,454 high-impact genes, 751 genes were associated with at least one GO term. We further applied GO category enrichment analysis to elucidate the functional enrichment of potentially high-impact genes, using Fisher’s exact test with an FDR cutoff ≤0.05. A total of 31 GO terms were significantly enriched in biological processes and molecular functions (See Table S2 and Fig. S3). The biological process GO term “ATP catabolic process” was the most significantly and specifically overrepresented term, followed by “ribonucleotide catabolic process”, and “purine nucleotide catabolic process”. Enrichment in the biological process category was also reflected by high numbers of molecular function GO terms “nucleoside-triphosphatase activity”, “hydrolase activity, acting on acid anhydrides, in phosphorus-containing anhydrides” and “ATPase activity”, etc.
Shared and specific nuclear organelle integration sites between ‘SunUp’ and ‘Sunset’
With the aim of conducting genome-wide comparative analysis of the integration of nuclear organelle fragments between ‘SunUp’ and ‘Sunset’, two in-house software pipelines written in a mixture of python scripts (available upon request) were developed for automatic processing and identification of shared and variety-specific norgDNA integration sites between these two varieties. Schematic diagrams of pipelines are shown in Fig. 3 and Fig. 4.
A total of 3,430 NUPT and 2,764 NUMT junction sites were obtained by searching against organelle genomes with the ‘SunUp’ reference genome as the query (Table 6). Out of all 3,430 NUPT junction sites, a large fraction of junction sites (3,327, 97%) were shared by ‘SunUp’ and ‘Sunset’. With BLASTN we identified that shared NUPTs matched the papaya chloroplast (pt) genome with an average identity of 91.92%. The remaining 3% (103) were specific in ‘SunUp’, with a higher average identity of 94.03% to the pt genome (further details of the 103 junction sites are provided in Table S3). Similar to the trend observed for the distribution of NUPTs, out of 2,764 NUMT junction sites, junction sites shared between ‘SunUp’ and ‘Sunset’ numbered 2,642 and account for the major share 95.6% whereas ‘SunUp’-specific junction sites only accounted for 4.4% (122) (further details of the122 junction sites are provided in Table S4). The average similarity in identity between ‘SunUp’-specific NUMTs and papaya mitochondria (mt) genome was 93.77%, which is slightly less than the identity between ‘SunUp’-specific NUPTs and the pt genome (94.03%) but a bit higher than the identity between shared NUMTs and the mt genome (92.97%). In general, higher similarities in identities were apparent between ‘SunUp’-specific norgDNAs and corresponding organelle genomes than between shared norgDNAs and corresponding organelle genomes. We next evaluated the performance of our pipeline through manual inspection of read alignments surrounding those identified as ‘SunUp’-specific norgDNA junction sites in the Integrative Genomics Viewer (IGV) software [25]. The visual display exhibited that no ‘Sunset’ reads aligned to or spanned any ‘SunUp’-specific junction site in the ‘SunUp’ reference genome as we had expected, thus those ‘SunUp’-specific integration events predicted by our pipeline were bona fide. In the ‘SunUp’-specific norgDNA regions, no reads mapped or having a read depth greater than 100× were observed, suggesting that those reads likely correspond to the organellar DNA. The results demonstrate the superior sensitivity and accuracy of our pipeline.
Overall, ‘SunUp’-specific norgDNA integration junction sites were distributed non-randomly across nine chromosomes of papaya, with distinct regions of high and low variation (Table 7). The most distinct region was in Chr2 which had the highest frequency of NUPT junction sites with 11.65% compared to other chromosomes of the genome, followed by Chr6 and Chr8, with 8.74% each. Only a low proportion of NUPT junction sites were found in Chr3 (1.94%) and Chr2 (2.91%). Compared with NUPT junction sites, a smaller range of variation across chromosomes was found at NUMT junction sites. Similarly, NUMT junction sites were highly enriched in Chr6 (10.66%), Chr2 (9.84%) and Chr8 (9.02%), while less prevalent in Chr5 (4.92%) and Chr1 (5.74%).
Using a strict pipeline (Fig. 4), the ‘Sunset’ genome was also scanned for norgDNA integrations by searching the papaya chloroplast and mitochondria genomes. The total amount of either NUPT or NUMT integration junction sites in the ‘Sunset’ genome were slightly fewer than in the ‘SunUp’ genome, with 3,430 NUPT and 2,764 NUMT junction sites, respectively (Table 6). In contrast to ‘SunUp’-specific NUPT integrations (103), the amount of ‘Sunset’-specific NUPT integration junction sites sharply reduced to only 19, with an average sequence identity of 95.64% matching to the papaya pt genome; ‘Sunset’-specific NUMT integration junction sites decreased to 103, having an average identity of 96.95% to the mt genome.
The origin of organelle-like borders of transgenic inserts in ‘SunUp’
BLASTN search analysis of transgenic inserts’ flanking sequences was conducted to investigate the possible identity of sequences around the insertion sites. All six genomic DNA segments flanking the three previously identified transgenic insertions were surprisingly found to share near sequence identity to the papaya organelle sequences (Fig. 5A). Both sides of the single, contiguous 9,789 bp functional transgene insertion encoding intact PRSV cp, uidA and nptII genes were identified to be NUPT sequences, consisting of a 4,000 bp and a 1, 790 bp plasmid-derived segments, which were highly homologous with trn, rps genes of the plasmid genome and part of the ycf3 gene. The genomic DNA flanking both borders of the nonfunctional nptII fragment insert (290 bp) also exhibited homology with papaya plastid genome genes ndhG and atpB, E, with size of 363 bp and 827 bp, respectively. The contiguous 1,533 bp nonfunctional tetA fragment insert, in particular, had one border of NUPT sequence homologous to the plastid gene ycf2, reaching up to 6,199 bp. The other border of the tetA fragment was comprised of non-plastid DNA-like sequence and showed identity to a papaya mitochondria genome segment, totaling 1,708 bp. Sequences of three flanking pairs of transgenic inserts showed significant homology to papaya organelle genome segments, with a range of 98.18~100% identity. Especially two flanking pairs of the functional transgene insert and the nonfunctional nptII fragment insert, having identities approaching 100%. By contrast, organelle-like sequences at both borders of the tetA fragment insert experienced further rearrangements and showed lower similarities of 98.6% and 98.18% with the pt genome and the mt genome, respectively. We estimated the homology between our previously assembled ‘Sunset’ norgDNAs and six flanking organelle-like sequences of inserts in ‘SunUp’. Through a rigid BLAST screening, there were respectively 12, 6, 1, 5, 43 and 2 best BLAST hits detected between ‘Sunset’ norgDNAs and six flanking norgDNAs, with combined lengths ranging from 49 to 4,180 bp (Table 8). Only one (best) hit was found between ‘Sunset’ norgDNAs and border A of the nonfunctional nptII fragment insert, with a size of 49 bp; whereas there was at least 1,231 bp of combined length for the remaining five borders. The sequence identity between ‘Sunset’ norgDNAs and six flanking norgDNAs of insertions in ‘SunUp’ varied from 93.68% to 99.09%. Of which all NUPT borders matched ‘Sunset’ norgDNAs with relatively lower identities in comparison to their matching with the papaya chloroplast. Meanwhile, the sole NUMT border had 99.09% similarity to ‘Sunset’ norgDNAs, which is higher than its similarity to the mt genome (98.18%).
We developed a strategy based on massive paired-end mapping (Fig. 5B) to further investigate whether these organelle-like sequences at both borders of three insertions were present in the genome prior to bombardment or not. If a deletion in ‘Sunset’ relative to the ‘SunUp’ insertion-with-the border region was found (Fig. 5B), we were able to deduce that those organelle-derived fragments flanking transgenic insertions were originally present in ‘Sunset’ prior to bombardment. Deletions were identified using paired ends spanning the specified genomic region in ‘SunUp’ that were longer than the transgenic insert size (cutoff).
As a result, a total of 217,890 ‘Sunset’ short reads could be aligned to the region of the functional transgene insertion with organelle-like borders, of which 183,488 reads were mapped to the reference region in properly paired orientations. According to statistical calculations, the inner distances between PE reads were far less than the cp functional transgenic insert size (9,789 bp), which ranged in size from 0 to 246 bp. Of these, the inner distances of 0 bp were significantly enriched, with 1,320 pairs. Meanwhile, there were 42,273 ‘Sunset’ short reads aligned to the region of the nonfunctional nptII transgene insert with organelle-like borders, among which 22,518 were mapped as a pair, with pairwise distances ranging from 0 to 169 bp in length. All pairwise distances were smaller than the size of the nonfunctional nptII fragment insert at 290 bp. A major fraction of PE reads (223 pairs) were found to have no distance between each other. Regarding the nonfunctional tetA transgene insert with both flanks, the total number of mapped reads were 418,697, including 150,110 optimally mapped PE reads. Of the latter, the sizes of inner distances were in the range of 0 to 969 bp, which is under the cutoff 1,533 bp. There was a significant enrichment for the 0 bp inner distance as well, containing 2,661 pairs. The distribution of mapped paired-end spans in regions of three inserts with flanks is shown in histograms (Fig. 5C). Except for the 0 bp distance, three histograms of paired-end inner spans were normally distributed and showed primary peaks at 59 bp (1,296 pairs), 57 bp (187 pairs) and 56 bp (1,993 pairs). In summary, in all cases of three transgenic insertion events, the inner distance of any pair of mapped PE reads was shorter than a transgenic insert size (cutoff), indicating that the distance between any pair of ‘Sunset’-derived PE reads was not elongated by an insertion. This serves as a strong hint that these flanking norgDNAs were not originally contiguous in GM-free Sunset genome.