Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively.
Results: To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements.
Conclusions: Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads and the generality of these causes and factors should be tested further in other species.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6
This is a list of supplementary files associated with this preprint. Click to download.
Table S1. Performance of 3-class or binary prediction models Table S2. Importance values and Kruskal test of features in Model 34 Table S3. Correlation among densities of HC/LC/BG regions and genomic features Table S4. Importance values of features in Model 35 Table S5. Importance values of features in Model 34B Table S6. Importance values of features in Model 35B Table S7. Number of annotated sequences in HC regions and 10,000 randomly sampled BG regions (based on the number and length distribution of HC regions) Table S8. Gene set enrichment analysis for Pfam domains for HC regions Table S9. Gene set enrichment analysis for GO terms for HC regions Table S10. Gene set enrichment analysis for metabolic pathways for HC regions Table S11. Gene set enrichment analysis for Pfam domains for HC regions vs BG regions Table S12. Gene set enrichment analysis for GO terms for HC regions vs BG regions Table S13. Gene set enrichment analysis for metabolic pathways for HC regions vs BG regions Table S14. Aligned regions between Short-read assembly and Long-read assembly Table S15. Categories of regions in Short-read assembly Table S16. Corresponding variable coverage regions between two assemblies Table S17. HC regions in Short-read assembly and corresponding variable coverage regions in Long-read assembly Table S18. Importance values of features in Model 37 Table S19. Importance values of features in Model 38 Table S20. Choice of bin size
Figure S1. Properties of HC/LC/BG regions with high confidence. a Proportion of Ns in HC/LC/BG regions. b Length distribution of HC/LC/BG regions. c Length distribution of overlapped regions between HC/LC/BG and M2/M3 regions.
Figure S2. Genomic feature distributions in flanking regions of HC/LC/BG regions of different length (0.5~32Kb). Violin plots showing distributions of GC content, and densities of genes, tandemly duplicated genes, pseudogenes, transposable element and SSRs in HC, BG, and LC regions. Red line indicates median value. Statistics analysis was conducted using Wilcoxon rank sum test, *** indicates p<1e-3.
Figure S3. RD distribution of 100 bp BG bins overlapped with aligned region categories. For each plot, left Y-axis: number of BG bins in correctly assembled category (C1, C2 or C3); right Y-axis: number of BG bins in mis-assembled category (M1, M2 or M3). The categories are defined in Figure 5a.
Figure S4. Example mis-assembled HC regions. a Normalized RD in a region containing an SM gene cluster on Chromosome 8. Normalized RD was calculated for each 100bp bin. Two black brackets indicate identified HC regions, with RD around 2. Valley regions with RD=0 indicate gaps in genome assembly filled up with Ns. b Syntenic alignments (colored regions between black lines) and gene annotation (colored boxes) of the same region in (a) between short-read and long-read assemblies of Heinz and the PCR-validated sequence in M82. Corresponding positions in (a) and (b) were delineated with grey dashed lines. Colored regions: BLASTn matches between Heinz and M82, with E-value < 1e-20 and alignment length > 50bp. AOX: alcohol oxidase; TPS: terpene synthases; P450: cytochrome P450; CPT: cis-prenyl transferase; ψ: pseudogene. c Dotplot of the region in (a) between Short- and Long-read assemblies. Five cases were indicated using arrow heads, and all these five short regions were identified to be in BG regions in long-read based assembly (→ BG).
Figure S5. Important features in model distinguishing HC_M2/M3 and HC_C2/C3. Violin plots show distributions of each features or GC contents within regions or in flanking regions. Lines within violin plots indicate median values.
Figure S6. Distributions of important features in model distinguishing HC_M2/M3 and BG_M2/M3. Violin plots show distributions of each features or GC contents within regions or in flanking regions. Lines within violin plots indicate median values.
Figure S7. Sensitivity and accuracy of CNVnator in RD value calculation, and impact of genome coverages on RD values. a RD distribution of new CNVnator runs by mapping re-sampled reads from the tomato genome based on simulated input RD, where the only possible RD values were 0 (LC), 1 (BG), or 2 (HC). b-d Correlation between known, simulated input RDs and new RD values from new CNVnator run using the resampled reads. In (b), the simulated RD values were generated as in (a). In (c), the analysis RD values (those generated with CNVator by mapping dataset2 reads on to the tomato genome, see Methods) were first discretized (rounded) to their closest integers, then the rounded RD values were used for resampling reads for determining new RD values. In (d), the analysis RD values were directly used for resampling reads for determining new RD values. e-i Correlation between RD values using all reads (46X coverage) and RD values using subsets of reads at variable coverage: (e) 30X, (f) 20X, (g) 10X, (h) 5X, (i) 46X.
Figure S8. F1 of HC/LC/BG region calling. a-d CNVnator runs were conducted using resampled reads based on different starting RD values as in Figure S3 using read dataset 2. As in Figure S3, the simulated RD values include: (a) possible RD values of only 0 (LC), 1 (BG), or 2 (HC); (b) analysis RD values discretized/rounded to integers; (c) analysis RD without discretization; (d) analysis RD without discretization but down-sampled to 30X (note the y-axis range is much smaller compared to (a-c)). Each dot indicates an F1 value (y-axis) at a given q-value threshold (x-axis), where F1 was calculated using: (1) numbers of nucleotides in overlapping regions between the HC/LC region designations based on the analysis RD and new HC/LC regions determined using resampled reads (True Positive), (2) numbers of nucleotides in true HC/LC regions but determined as BG regions in new run (False Negative), and (3) numbers of nucleotides in true BG regions but determined as HC/LC regions in new run (False Positive, see Supplementary Text). Orange line: LOESS fitted curve. e-h Same as (a-d) except that the F1 was determined based on numbers of regions as opposed to numbers of nucleotides.
Loading...
Posted 14 Jan, 2021
On 18 Jan, 2021
Received 12 Jan, 2021
On 04 Jan, 2021
Invitations sent on 04 Jan, 2021
On 04 Jan, 2021
On 04 Jan, 2021
On 04 Jan, 2021
Received 31 Dec, 2020
Received 27 Dec, 2020
On 23 Dec, 2020
On 21 Dec, 2020
On 21 Dec, 2020
Invitations sent on 21 Dec, 2020
On 21 Dec, 2020
On 21 Dec, 2020
On 20 Nov, 2020
Received 17 Nov, 2020
On 04 Nov, 2020
Received 23 Aug, 2020
On 01 Jul, 2020
Invitations sent on 29 Jun, 2020
On 29 May, 2020
On 28 May, 2020
On 28 May, 2020
On 27 May, 2020
Posted 14 Jan, 2021
On 18 Jan, 2021
Received 12 Jan, 2021
On 04 Jan, 2021
Invitations sent on 04 Jan, 2021
On 04 Jan, 2021
On 04 Jan, 2021
On 04 Jan, 2021
Received 31 Dec, 2020
Received 27 Dec, 2020
On 23 Dec, 2020
On 21 Dec, 2020
On 21 Dec, 2020
Invitations sent on 21 Dec, 2020
On 21 Dec, 2020
On 21 Dec, 2020
On 20 Nov, 2020
Received 17 Nov, 2020
On 04 Nov, 2020
Received 23 Aug, 2020
On 01 Jul, 2020
Invitations sent on 29 Jun, 2020
On 29 May, 2020
On 28 May, 2020
On 28 May, 2020
On 27 May, 2020
Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively.
Results: To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements.
Conclusions: Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads and the generality of these causes and factors should be tested further in other species.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6
This is a list of supplementary files associated with this preprint. Click to download.
Table S1. Performance of 3-class or binary prediction models Table S2. Importance values and Kruskal test of features in Model 34 Table S3. Correlation among densities of HC/LC/BG regions and genomic features Table S4. Importance values of features in Model 35 Table S5. Importance values of features in Model 34B Table S6. Importance values of features in Model 35B Table S7. Number of annotated sequences in HC regions and 10,000 randomly sampled BG regions (based on the number and length distribution of HC regions) Table S8. Gene set enrichment analysis for Pfam domains for HC regions Table S9. Gene set enrichment analysis for GO terms for HC regions Table S10. Gene set enrichment analysis for metabolic pathways for HC regions Table S11. Gene set enrichment analysis for Pfam domains for HC regions vs BG regions Table S12. Gene set enrichment analysis for GO terms for HC regions vs BG regions Table S13. Gene set enrichment analysis for metabolic pathways for HC regions vs BG regions Table S14. Aligned regions between Short-read assembly and Long-read assembly Table S15. Categories of regions in Short-read assembly Table S16. Corresponding variable coverage regions between two assemblies Table S17. HC regions in Short-read assembly and corresponding variable coverage regions in Long-read assembly Table S18. Importance values of features in Model 37 Table S19. Importance values of features in Model 38 Table S20. Choice of bin size
Figure S1. Properties of HC/LC/BG regions with high confidence. a Proportion of Ns in HC/LC/BG regions. b Length distribution of HC/LC/BG regions. c Length distribution of overlapped regions between HC/LC/BG and M2/M3 regions.
Figure S2. Genomic feature distributions in flanking regions of HC/LC/BG regions of different length (0.5~32Kb). Violin plots showing distributions of GC content, and densities of genes, tandemly duplicated genes, pseudogenes, transposable element and SSRs in HC, BG, and LC regions. Red line indicates median value. Statistics analysis was conducted using Wilcoxon rank sum test, *** indicates p<1e-3.
Figure S3. RD distribution of 100 bp BG bins overlapped with aligned region categories. For each plot, left Y-axis: number of BG bins in correctly assembled category (C1, C2 or C3); right Y-axis: number of BG bins in mis-assembled category (M1, M2 or M3). The categories are defined in Figure 5a.
Figure S4. Example mis-assembled HC regions. a Normalized RD in a region containing an SM gene cluster on Chromosome 8. Normalized RD was calculated for each 100bp bin. Two black brackets indicate identified HC regions, with RD around 2. Valley regions with RD=0 indicate gaps in genome assembly filled up with Ns. b Syntenic alignments (colored regions between black lines) and gene annotation (colored boxes) of the same region in (a) between short-read and long-read assemblies of Heinz and the PCR-validated sequence in M82. Corresponding positions in (a) and (b) were delineated with grey dashed lines. Colored regions: BLASTn matches between Heinz and M82, with E-value < 1e-20 and alignment length > 50bp. AOX: alcohol oxidase; TPS: terpene synthases; P450: cytochrome P450; CPT: cis-prenyl transferase; ψ: pseudogene. c Dotplot of the region in (a) between Short- and Long-read assemblies. Five cases were indicated using arrow heads, and all these five short regions were identified to be in BG regions in long-read based assembly (→ BG).
Figure S5. Important features in model distinguishing HC_M2/M3 and HC_C2/C3. Violin plots show distributions of each features or GC contents within regions or in flanking regions. Lines within violin plots indicate median values.
Figure S6. Distributions of important features in model distinguishing HC_M2/M3 and BG_M2/M3. Violin plots show distributions of each features or GC contents within regions or in flanking regions. Lines within violin plots indicate median values.
Figure S7. Sensitivity and accuracy of CNVnator in RD value calculation, and impact of genome coverages on RD values. a RD distribution of new CNVnator runs by mapping re-sampled reads from the tomato genome based on simulated input RD, where the only possible RD values were 0 (LC), 1 (BG), or 2 (HC). b-d Correlation between known, simulated input RDs and new RD values from new CNVnator run using the resampled reads. In (b), the simulated RD values were generated as in (a). In (c), the analysis RD values (those generated with CNVator by mapping dataset2 reads on to the tomato genome, see Methods) were first discretized (rounded) to their closest integers, then the rounded RD values were used for resampling reads for determining new RD values. In (d), the analysis RD values were directly used for resampling reads for determining new RD values. e-i Correlation between RD values using all reads (46X coverage) and RD values using subsets of reads at variable coverage: (e) 30X, (f) 20X, (g) 10X, (h) 5X, (i) 46X.
Figure S8. F1 of HC/LC/BG region calling. a-d CNVnator runs were conducted using resampled reads based on different starting RD values as in Figure S3 using read dataset 2. As in Figure S3, the simulated RD values include: (a) possible RD values of only 0 (LC), 1 (BG), or 2 (HC); (b) analysis RD values discretized/rounded to integers; (c) analysis RD without discretization; (d) analysis RD without discretization but down-sampled to 30X (note the y-axis range is much smaller compared to (a-c)). Each dot indicates an F1 value (y-axis) at a given q-value threshold (x-axis), where F1 was calculated using: (1) numbers of nucleotides in overlapping regions between the HC/LC region designations based on the analysis RD and new HC/LC regions determined using resampled reads (True Positive), (2) numbers of nucleotides in true HC/LC regions but determined as BG regions in new run (False Negative), and (3) numbers of nucleotides in true BG regions but determined as HC/LC regions in new run (False Positive, see Supplementary Text). Orange line: LOESS fitted curve. e-h Same as (a-d) except that the F1 was determined based on numbers of regions as opposed to numbers of nucleotides.
Loading...