Abundance of tomato genomic regions with higher and lower than average coverage
Two datasets were used to determine how well different genomic regions were covered and to define variable coverage regions: genomic regions with significantly higher coverage (HC) or lower coverage (LC) than average (referred to as background, BG). The first, dataset1, was generated with Illumina Genome Analyzer IIx (GAIIx) sequencer with 90-bp paired-end and 54-bp mate pair reads (~28x coverage) used in the original genome assembly , and the second, dataset2, was generated with Illumina HiSeq 2000 sequencer with 101-bp paired-end reads (~46-fold coverage). To assess the qualities of these two datasets and to see if both datasets should be analyzed, the tomato genome assembly was split into 100bp bins and, for each bin, the read depth (RD) was determined using either dataset1 or 2 (see Methods). After correcting the RD values for GC content bias , the median RDs for dataset1 and dataset2 are 1.04 and 1.05, respectively (an example region on chromosome 1, Fig. 1a). The RD values from these two datasets are significantly correlated (Spearman’s ρ = 0.40, p <2.2e-16), and consistently revealed bins with substantial deviations from the median value in both directions indicating the presence of HC and LC regions (e.g., grey and black arrows in Fig. 1a). However, RDs of dataset1 was significantly more variable across genome (variance = 0.25 using 0~99 percentile values) compared to those of dataset2 (variance = 0.15, F-test, p < 2.2e-16, Fig. 1a). This is not simply due to the higher genome coverage of dataset2 (~46.3x) compared to dataset1 (~28.6x), because a subset of randomly sampled reads from dataset2 to ~30x genome coverage has much more similar RD estimates as dataset2 (Spearman’s ρ = 0.87, p <2.2e-16, Fig. 1a).
The comparably higher RD variance in dataset1 may be due to lower sequencing quality and shorter read lengths that contribute to erroneous read mapping and may lead to overestimates of HC and LC regions. Thus, only results based on dataset2 were discussed further. Based on the RD values, HC, LC, and BG regions were identified with CNVnator (see Methods). However, we found that RD distributions of HC, LC and BG regions called by CNVnator were overlapping (Fig. 1b). Given our goal is to identify HC and LC regions with high confidence, two threshold RD values were chosen: 0.72 and 1.76 that minimize the overlap between LC and BG regions (Fig. 1c), and that between BG and HC regions (Fig. 1d), respectively. Thus, HC regions were defined as regions with RD > 1.76, and LC regions were those with RD < 0.72. BG regions had RD values between 0.72 and 1.76. This resulted in 1,156 HC, 19,451 BG and 15,034 LC regions. The HC and LC regions account for 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of the genome, respectively. The regions that are not classified as HC, LC, or BG due to this dual thresholding scheme are referred to as “other” regions. As expected, 95.5% of LC regions contain gaps filled with Ns, whereas only 18.2% of HC and 42% of BG contain Ns (Fig. S1a). In addition, the median lengths for LC, BG, and HC regions are 1.70, 16.30, and 2.80Kb, respectively (Fig. S1b).
Prediction of HC, LC, and BG regions with a multi-class model using seven genomic features
After the HC/LC/BG regions were defined, we established machine learning model predicting which regions would be HC, LC, or BG regions. By predicting HC/LC/BG regions, we would have a better understanding of what the contributing genomic characteristics were, especially for HC regions where misassembly mostly likely have occurred. Starting out, we used seven features (referred to as base features, yellow box, Fig. 2a), including GC content, density values: all genes, tandem genes, non-tandem genes, pseudogenes, transposable elements (TEs), and simple sequence repeats (SSRs) of genomic regions to build a 3-class (HC, LC, or BG) model (referred to as Model 1, Fig. 2a, Table S1) using the Random Forest algorithm . We should emphasize that an independent, test set (10%) of HC, LC, and BG regions were set aside that was not used for model building. Thus, the test set was ideal for validating our models. Using Model 1, 61.9%, 83.9%, and 74.8% of HC, LC, and BG regions, respectively, were correctly predicted (Fig. 2b). Here the percentage true cases predicted correctly is defined as the recall value. Importantly, the testing set not used for model training were predicted with a similar recall (Fig. 2b). To jointly consider both recall and precision (% predictions that are correct), we determined the F1-score that is the harmonic mean of precision and recall. In our machine learning pipeline, we started with equal numbers of training and testing HC, LC, and BG regions (33% each). Thus, random guess would lead to an accuracy of 33% and F1=0.33. On the other hand, a perfect model would have an accuracy and an F1 of 1. Model 1’s F1=0.73 (Fig. 2a), while it was much better than random guess, the F1 was far from perfect.
Defining HC, LC, and BG regions with additional features
To improve upon Model 1, we included additional features from two sources. The first was the same seven base features but with values from flanking regions. The rationale was that the regions right next to HC, LC, and BG regions may have similar properties which can contributed to a better model. To assess this, we first build prediction models using only sequences flanking HC, LC, and BG regions by 0.5, 1, 2, 4, 8, 16, and 32kbs to build seven models (Model 2-8) and found that the performance of these models was worse than that of Model 1 (accuracy=46~58%, F1=0.46~0.58, Fig. 2a and Table S1). In addition, as the sizes of the flanking regions increased, the prediction performance decreased (Fig. 2a) This is likely because flanking regions can be of different types, i.e., a region flanking an HC region may be LC and/or BG regions. However, this is not because these regions are not important. When the features used for building Model 1 were combined with those for Model 2-8, the resulting model (Model 9) had a substantially improved F1=0.82 (Fig. 2a) compared to Model 1 (F1=0.74). This finding suggests that sequences flanking the HC/LC/BG regions, by themselves insufficient, have information that are useful for the prediction task.
In addition to flanking region, we focused on dissecting if HC, LC, and BG regions have different sequence composition—instead of compositions of much longer sequences (genes and transposons) or single nucleotides (GC content), we investigated whether specific SSRs (repeats with 2-64 bp units, 156,444 features) and/or k-mer (1-6bp, 5,460 features) may be prevalent in HC, LC, or BG regions. Because the number of these SRR and k-mer features was large, we first identified a subset of SSR and k-mer features with p-value < 0.05 (Kruskal-Wallis H test) among HC/LC/BG regions. Top 100 SSR and k-mers were further selected with a feature selection algorithm (see Methods). By incorporating these top SSR and k-mer features with seven base features to predict HC/LC/BG regions (Model 10), the performance of Model 10 (F1=0.84, Fig. 2a) was even better than the Model 9 (F1=0.82) that did not consider SSRs and k-mers but flanking regions. Thus, there exist substantial differences in the short sequence compositions among HC/LC/BG regions. We also included the top 500 and the top 1000 k-mers/SSRs to create Model 18 and Model 26 that improved performance further with F1=0.85 and 0.86, respectively (Fig. 2a). Although additional k-mers/SSRs may further improve predictions, they likely have diminishing contribution judging from the small F1 differences between Model 10, Model 18 and Model 26 (blue bars, Fig. 2a). Next, we combined the features used in Model 9 and those from used in Model 10 to establish an all-inclusive Model 34 with 256 features that had F1=0.87 (Fig. 2a). Importantly, >84% BG, >82% HC and >96% LC regions are correctly predicted in both training and test (not used in model training) datasets (Fig. 2c).
Features important for the prediction of HC/LC/BG regions
Model 34 has the highest F1=0.87 using 256 features (56 base features from 8 regions, 100 top k-mers, and 100 top SSRs, Fig. 2a). We next evaluated which features were among the most informative in distinguishing HC/LC/BG regions (highest feature importance values, see Methods). Table S2 lists the importance values of all 256 features. We found that three types of features stand out: k-mers (median importance rank=57), GC contents (median rank=66), and density of TEs (median rank=70). TEs have long been implicated in their contributions to misassembly due to their lengths and high degree of similarities . Interestingly, the HC regions tend to have a significantly lower TE density compared to BG regions (Fig. 3a), likely reflecting genomic regions differing in recent transposition events. In contrast, LC regions have the highest TE density, although distribution of LC regions across the genome is not correlated with TE distribution (Spearman’s rho = 0.09, Fig. 3b). Furthermore, when we randomly reshuffled HC/LC/BG regions designations 1,000 times and determined the correlation distribution between prevalence of TEs and random genomic regions, the observed correlation value of LC regions was significantly lower than that of the random expectation (z-score=-3.0, Fig. 3c). One potential reason is that the assembler may be confused by the repetitive nature of TE and short length of sequencing reads to assemble sequences correctly, resulting in gaps filled with Ns in LC regions (Fig. S1a) with TE sequences at the breakpoints in one or both ends, which in turn led to higher TE density in LC regions.
As for GC content, it is well documented that, specifically for short read sequencing with the Illumina platform, GC-rich and GC-poor regions tend not to be sequenced and thus contribute to regions with low coverage or breakpoints in assemblies [4, 27]. Consistent with this, LC regions have significantly higher GC content compared to HC and BG regions (Fig. 3a). We also found that a major reason that the top 100 k-mers were important was because of their high AT content (88% with AT content ≥ 80% and 100% with AT content ≥ 60%, Table S2). In addition, the top two SSRs with highest importance in the prediction were “AT” (ranked 152) and its reverse complement “TA” (ranked 153). Contrary to densities of individual SSRs which generally had very low ranks (i.e., less important), densities of all SSRs ranked in the middle (108), indicating that it is more informative in distinguishing HC/LC/BG regions to consider SSRs as a whole. Consistent with this, when only the top 128 features were used, where the density of all SSRs was considered but no individual SSR feature, the model’s performance didn’t decline (Model 34_4 in Table S1).
Pseudogenes (median rank=131), all protein coding genes (139), non-tandem duplicates (141), and tandem duplicates (154) also ranked in the middle (Table S2). Densities of these genomic features in flanking regions ranked similarly or even higher than densities in HC/LC/BG regions (Table S2), suggesting the differences in genomic environment around HC/LC/BG regions. By determining the correlations between the prevalence of HC/LC/BG regions and the prevalence of genomic features in corresponding regions, we found that genomic regions with high densities of not only HC, but also BG regions tend to have higher gene density, regardless if tandem and non-tandem genes were separated or not (all ρ>0.12, p-values <2.0e-6, Fig. 3b and Table S3). Because LC regions tend to contain Ns, regions with higher LC density are expected to have lower gene density (all ρ< -0.17, p-values < 2.2e-11, Fig. 3b and Table S3). Given that density of protein coding genes is informative in distinguishing among HC/LC/BG regions, we next asked whether the types of genes, in terms of functional aspects (e.g., Pfam domains, biological processes and metabolic pathways) also impact RD. To test this, top 100 functional characteristics (domains, biological processes and pathways) of genes within HC/LC/BG regions (9869 features in total) were also combined with Model 34 to build Model 35 (Fig. 2a, Table S1). However, there was no apparent improvement compared to model 34 (model 35’s overall accuracy = 87%, F1 = 0.87).
Features important in binary classification model distinguishing HC and BG regions
Although the importance analysis allows us to pinpoint the features crucial for Model 34’s performance, Model 34 is a 3-class model and thus it is not straightforward to tell if a feature is important because it allows us to distinguish HC from BG and LC regions or other scenarios. Because we are mostly interested in assessing why HC regions exist, we next establish a binary classification model to distinguish HC from background regions. Using the same features as in Model 1, we established a new Model 1B (B=binary) with HC and BG regions as classes and found that it has an accuracy=84% and F1=0.76 (Fig. 2a). As expected, Model 34B that used the same feature set as Model 34 for binary classification had even better accuracy=92% and F1=0.84 (Fig. 2a). Like Model 35 which didn’t lead to improved classification among HC/LC/BG regions by including functional features (Table S1), the comparable binary Model 35B had the same performance as Model 34B (Fig. 2a, Table S1). However, models with only functional features had accuracy=58% and F1=0.69, which is much better than random guess (Table S1), suggesting that functional features are still informative in distinguishing HC from BG regions.
As expected, the important features for binary classification of HC and BG regions differ from those for the 3-class model. For example, densities of k-mer in Model 34B (median rank=73) and in Model 35B (median rank=72) were no longer the most important feature categories as in Model 34 (median rank=57) and 35 (median rank=55, Table S2,4,5,6). In contrast, GC content, density of TE and SSRs had the highest median ranks (12, 13, and 57 in Model 35B, respectively). For HC regions, one hypothesis for their presence is due to the presence of multiple copies of highly similar sequences arranged in tandem that are misassembled. If this is true, one would expect that SSRs and tandem genes would tend to be co-localized with HC regions compared to BG regions. Consistent with the above hypothesis, although the density of SSRs in HC regions was slightly lower than BG regions (Fig. 3a), it was significantly higher than randomly expected (z-score=3.81, Fig. 3c). In contrast, density of SSRs in BG regions was slightly lower than random expectation (z-score=-0.69 Fig. 3c). In addition, the density of SSRs across genome is positively correlated with HC, not BG, regions (Fig. 3b), and the flanking regions of HC also have higher density of SSRs than those of BG regions (Fig. S2). These results suggest the potential contribution of SSRs to misassembly in HC regions, which resulted in underestimation of SSRs density in HC regions. The situation is similar for tandem genes, although it is not as important as SSRs (median rank=155). The observed correlation value (ρ) for HC regions was significantly higher than random expectation (z-score=6.0) compared to that for BG regions (z-score=4.1, Fig. 3c). Note that, although both have positive z-scores due to consideration of LC regions also, the higher z-score for HC regions indicates that tandem gene density is more prevalent in HC than in BG regions. Conversely, compared to BG regions, HC regions tend to have fewer non-tandem genes (Fig. 3c). Thus, the presence of tandem genes also contributes to misassembly.
Properties of genes located in HC regions
In earlier section, functional characteristics (domains, biological processes and pathways) of genes within HC/LC/BG regions were also combined with the seven base features to build Model 35 and 35B (Table S1) that resulted in no apparent improvement compared to model 34 and 34B that did not incorporate functional characteristics (Fig. 2a). This may be because properties contributing to the enriched presence of genes with certain functional characteristics were already considered, it is also possible that, due to the large number of features considered and the fact that functional characteristics tend to be lower ranked, the contribution of functional characteristics was not apparent in Model 35 and 35B because other features dominated. To assess the extent to which functional characteristics could be used to predict whether a genomic region would be BG, HC, or LC, we established three-class models using only functional features and found that it had accuracy=41% and F1=0.31, very close to random guess, no matter how many features were selected (Model 36, Fig. 2a). However, binary model for classifying HC and BG regions using only functional features had accuracy=57.9% and F1=0.69, indicating that they were informative (Model 36B, Fig. 2a, Table S1).
To assess what types of genes tend to be located in BG and HC regions, we first determined if the numbers of different types of genes (Table S7) were over or under-represented in HC compared to BG regions. By generating 10,000 datasets with randomized HC locations, we established the randomly expected numbers of different gene types and the resulting null distributions were used to assess the statistical significance of observed numbers of different gene types (Fig. 4a). In this analysis, two types of genes stand out, specialized metabolism (SM) protein coding genes and RNA genes. SM genes has a z-score=2.1, indicating that SM genes tend to be found in HC regions and thus misassembled. This is consistent with the findings that SM genes tend to belong to large gene families, located in tandem clusters, and be recently duplicated [20, 21]. However, genes in larger families are not necessarily in HC regions (black arrow, Fig. 4a) and number of SM genes that are tandemly duplicated is not significantly higher than random expectations (green arrow, Fig. 4a). Thus, it is likely that the over-representation of SM genes in HC regions is due to their higher duplicate rate, but not always through tandem duplication, resulting in closely related copies that were misassembled. It also can be because tandem duplicated SM genes were misassembled together, which makes the number of tandem duplicated SM genes underestimated. In addition to SM genes, surprisingly, non-coding RNAs (ncRNAs) tend to be enriched within HC regions (z-score=2.5, purple arrow, Fig. 4a). We speculate that tomato ncRNA regions may have a higher-than-average rate of recent duplications, which would indicate there are more ncRNA regions than annotated and ncRNA expression levels may be overestimated because multiple ncRNA regions are assembled together.
Our finding that SM genes tend to be over-represented in HC regions suggests that genes with other functions may have similar behaviors. To address this, we asked if there was enrichment of any Pfam domain family, Gene Ontology (GO) biological process category, or TomatoCyc pathways. Given the number of domain families (Table S8), categories (Table S9) and pathways (Table S10) were large, multiple testing correction was applied and resulted in only one statistically significantly enriched entry (salicylic acid catabolic process). To assess if there are general patterns we may have missed due to the stringency of the multiple testing corrections, we examined the Log Likelihood Ratio (LLRs, see Methods) between the numbers of genes with or without a protein domain X and the numbers of genes within or out of HC regions (inserted table, Fig. 4b). Similarly, we examined the LLRs for biological processes (Fig. 4c) and pathways (Fig. 4d). Here the HC regions were compared with the whole genome. We have also conducted the same analysis but between HC and BG regions that produced similar results (Table S11-13).
There are three general patterns that emerge. The first is the prevalence of nuclear encoded proteins responsible for mitochondrial and plastid functions among the Pfam domains and the GO categories with the highest LLRs—including ATP-synt_A: ATP synthase A chain, NADHdh: NADH dehydrogenase, Photo_RC: photosynthesis reaction center, and RbcS and RuBisCO_small: Ribulose-1,5-bisphosphate carboxylase small subunit (Fig. 4b), as well as mitochondrial proton transport and RuBisCo biogenesis (Fig. 4c). The second general pattern is the occurrence of domain/process related to transcription and translations—including various translational elongation factor G (EFG) domains, translational elongation-related functions, ribosomal proteins, RNA splicing (Fig. 4b,c). One outstanding property of genes that fit these two general patterns is their extremely high level of expression. Such high level of expression is known to lead to the generation of retrogenes and retro-pseudogenes with highly similar sequences that littered around various parts of the genomes (Wang et al., 2006; Zou et al., 2009) thus higher coverage within genomes. Consistent with this hypothesis, the average number of introns in genes found in HC regions was significantly lower than that in BG regions (2.13 vs. 4.38 in average, Kolmogorov-Smirov test, p=4.3e-09). The third general pattern is revealed from the few metabolic pathways with LLR value > 1 where five out of nine were SM pathways (Fig. 4d), as expected.
Evaluation of HC region misassembly by comparing Short-read and Long-read assemblies
To assess the extent to which HC regions tend to be misassembled, Short-read assembly (query) was aligned to Long-read assembly (subject) using MUMmer  (see Methods), and the aligned regions were shown in Table S14-16. Aligned regions were classified into six categories (see Methods, Fig. 5a): 1) non-duplicated, Correctly assembled (C1, 681.0 Mb); 2) non-duplicated, misassembled (M1, 0.6 Mb); 3) locally duplicated (i.e., on the same chromosome of the Long-read assembly), correctly assembled (C2, 11.5 Mb), 4) locally duplicated, misassembled (M2, 8.2 Mb); 5) non-locally (on different chromosome) duplicated, correctly assembled (C3, 29.5 Mb); and 6) non-locally duplicated, misassembled (M3, 4.9 Mb). We found that 86.3Mb of Short-read assembly regions, mostly consisted of Ns, that could not be aligned to the Long-read assembly. Since LC regions tend to consist of Ns, it is not surprising that 93% of the total length of LC regions had no match in Long-read assembly (Fig. 5b). Thus, LC regions were not further examined. Of the 5.1 Mb HC regions, 0.03 Mb (0.53%), 0.97 Mb (19.1%) and 0.44 Mb (8.7%) were in misassembled categories M1, M2 and M3, respectively (Fig. 5b). Compared to HC regions, the proportion of misassembled regions in BG were significantly lower (0.08 %, 0.9% and 0.5% for M1, M2, and M3, respectively; Fisher’s Exact tests, all p<2.8e-09). Among the three categories of misassembled regions, only 1.9% HC regions (M1) did not have duplicated subject regions in the Long-read assembly. The majority of misassembly (67.4% and 30.8%) was due to duplications (M2 and M3, respectively), suggesting that HC regions were much more likely to be misassembled due to duplications, especially when duplications occurred on the same chromosome.
Thus far, HC regions were defined by mapping short reads to the Short-read assembly. To evaluate whether these Short-read assembly-based HC regions were still classified as HC in the Long-read assembly, the short reads were also mapped to the Long-read assembly to determine variable coverage regions. This resulted in 297 HC, 4,479 BG and 1,971 LC regions based on the Long-read assembly. Importantly, among 1,156 Short-read assembly-based HC regions, once we map the reads to the Long-read assembly, only 88 (7.6%) overlapped with the Long-read assembly-based HC regions. In addition, among misassembled HC regions (coverage defined using the Short-read assembly), 96.5% and 91.8% of M2 and M3 were identified as BG based on the Long-read assembly (Fig. 5c, Table S17). These findings further suggest that higher than usual read coverage is a good indicator of misassembly.
HC regions tend to be misassembled compared to BG or LC regions (Fig. 5b). If we broke down the six aligned region categories (Fig. 5a), it was clear that HC regions have higher proportion of M2 (11.8%) and M3 (9.1%) compared to other categories (0.3-4.4%, Fig. 5d). Nonetheless, 74.6% of M2 and 75.7% of M3 were identified in BG regions (Fig. 5d). One potential reason is that some true HC regions were identified as BG in our analysis. If that was the case, we would expect misassembled BG regions (which presumably were HC) would have significantly higher read coverage compared to correctly assembled BG regions. Consistent with this expectation, the median RDs of 100bp BG region bins that were misassembled (1.13 and 1.28 for M2 and M3, respectively) were higher than the median RDs of correctly assembled BG bins (1.03 and 1.06 for C2 and C3, respectively; Wilcoxon signed-rank tests, both p<2.2e-16, Fig. S3). In addition to read coverage differences, we found that misassembled BG regions tend to be much shorter (median lengths=698bp for M2/M3 combined) than misassembled HC regions (2,328bp; Fig. S1c; Wilcoxon signed-rank test, p < 2.2e-16). This is likely because CNVnator  merges adjacent bins based on read depth similarity and in doing so, shorter regions with variable coverage may not be identified. In any case, the read coverage difference is small. Thus, if we relaxed the HC detection threshold, it would significantly increase the false HC calls by calling true BG regions as HC.
An example HC containing region, which is from 500Kb to 590Kb on chromosome 8, had further supported our assumption above (Fig. S4). In this region, there are five tandemly duplicated genes for terpene synthases (TPS) and four cis-prenyl transferase (CPT) genes. By comparing the genomic sequences of the short-read and long-read assemblies, and the Polymerase Chain Reaction (PCR)-validated sequence in the tomato variety M82  (Fig. S4B), the two HC regions (Fig. S4A) were identified as to be mis-assembled in the short-read assembly (case 3 and 4 in Fig. S4C). In addition, two BG regions (case 1 and 5, Fig. S4C), where the read coverages were ~2 times higher than the background, were also identified as misassembly, supporting the idea that the number of HC regions were underestimated using the current criteria. These results also suggest that for a HC region, if the long-read assembly is not available, one can validate the sequence in the region using the genomic PCR approach, as stated in Matsuba et al., (2013) .
Genome features distinguishing correctly and mistakenly assembled regions
To understand why some HC regions were not identified as misassembled, using the same genome features as in Model 35 for classifying HC, BG, and LC regions, a binary classification model (Model 37) was built to distinguish HC regions consisted of mainly M2/M3 (>50%, referred to as the HC_M2/M3 class) or mainly C2/C3 (>50%, referred to as the HC_C2/C3 class). Model 37 resulted in an F1 = 0.79 (balanced positive and negative classes, thus the background F1 was 0.5), indicating that mis- and correctly assembled HC regions were significantly distinct from each other in certain genome features. Among the top 20 most important features (Fig. 6a and Table S18, detailed distribution of feature values was shown in Fig. S5), interestingly the most informative ones were those of regions flanking the misassembled regions. The flanking sequences of HC_M2/M3 regions tend to have higher densities of SSR, pseudogenes, TE, and non-tandem genes. At first it was surprising that it was the features in the flanking regions that were informative. In hindsight, if a region was misassembled, the distinguishing signature would likely be buried with it. From the flanking regions, one can better defined whether the sequence in between is problematic, i.e., in our case, misassembled. Within HC_M2/M3 regions, there tend to be higher densities of four types of k-mers including TATTTC, TGTAA, ATACTT, and GATTTT. However, it is not clear why these k-mers are informative.
In the above modeling exercise, we were able to distinguish HC regions that were likely misassembled from those that were not. Recall that not only HC regions contain misassembled sequences, in fact BG regions have higher proportion of M2/M3 regions (Fig. 5d). To further dissect their differences and to understand why some misassembled regions were not detected as HC, another model (Model 38) was built to distinguish HC_M2/M3 and BG_M2/M3 (>50% of a HC or BG region overlapped with M2/M3), using same features as in Model 35. The resulting F1 was 0.77. Among the top 20 most important features (Table S19), BG_M2/M3 regions tend to have higher GC content (41.3%) and TE density (0.92) compared to HC_M2/M3 regions (GC=33.6%, TE density=0.61, Fig. 6b, feature value distributions shown in Fig. S6). This trend is also true when comparing BG_M2/M3 and HC_M2/M3 flanking regions (Fig. 6b, Fig. S6). Interestingly, the comparatively lower GC content in HC_M2/M3 regions (33.6%) is more similar to the 36.6% overall GC content in the tomato genome. In addition, the GC content in genic region is at 42.4%, suggesting BG_M2/M3 and HC_M2/M3 may be located in relatively gene-rich and poor regions, respectively. Contrary to this expectation, however, HC_M2/M3 regions tend to have significantly higher gene density (average=0.16, rank=142) compared to BG_M2/M3 regions (average=0.04, Wilcoxon signed-rank test, p=1.3e-15). This is also true when comparing flanking regions (Fig. 6b and Fig. S6). With regard to TE, we have already shown that HC regions tend to have a significantly lower TE density compared to BG regions regardless whether they are misassembled or not (Fig. 3a). Taken together, these predictive models perform well for distinguishing mis- from correctly assembled HC regions and for predicting whether a misassembled region lies in BG or HC regions. Using model interpretation strategies, we are able to identify salient genome features underlying the models’ ability to make good quality predictions.