Annotation of SVs across 4 cichlid species
We mapped all previously generated paired-end libraries[22] to the high quality O. niloticus assembly[26] to annotate five different classes of rearrangements (deletion, tandem duplication, inversion, insertion and translocation) in the four available species of the East African radiation (Supplementary Fig. 1). We used a combination of three different tools: Breakdancer [42], Delly [43] and Pindel [44] and identified 6694 deletions, 1550 duplications, 1471 inversions, 34875 insertions and 1354 translocations (Table 1, Supplementary File 1).
Our initial predictions showed a bias towards small (<1kb) deletions (240229). This number might be inflated as a result of our SV detection pipeline, where deletions are identified using read pairs mapped in a concordant way (as opposed to duplications and inversions). This represents an issue particularly when considering small events. Therefore, we decided to retain only deletions with a minimum size of 1kb (Table 1). In the resulting dataset, 5483 deletions fall in the 1-10kb size range, while 1207 represent larger, >10kb events (Fig. 1). We investigated whether the size of a SV correlates with the age of the event. While the size distributions of deletions did not seem to be affected by the number of species sharing the SV, we noticed a tendency for duplications and inversions towards larger sizes as the number of species increased. Species specific events are significantly smaller than those common to 2 species (MW test, p=0.005), which in turn are smaller than the events found in 3 species (MW test; duplications: p=0.008; inversions: p<0.0001; see Supplementary File 2). Moreover, conserved inversions are significantly larger than both conserved deletions (MW test, p<0.0001) and duplications (MW test, p<0.0001).
We investigated the patterns of gain and loss evolution for each SV class, using a Dollo Parsimony approach (see Methods). We identified a high proportion of events predicted to be lineage specific (Fig.2). Additionally, comparison across species allowed us to identify the events common to a single lineage or to all species involved in the African radiation. We will refer to the latter as “conserved SVs”. However, a “conserved SV” could also represent a structural change that occurred in the O. niloticus lineage, and this ambiguity cannot be resolved without the addition of an outgroup species.
We noticed a surprisingly high loss rate of deletions in the M. zebra lineage (Fig.2A). In order to evaluate the reliability of our approach and the accuracy of our annotations, we
compared our results to those obtained through the pairwise, whole genome alignments between the latest M. zebra and O. niloticus assemblies, using Satsuma2 (https://github.com/bioinfologics/satsuma2, see Methods). Out of 2263 deletions annotated in M. zebra, only 54 (2%) were discordant with Satsuma2 alignments. Thus, we show that our annotation in M. zebra has a very high concordance with the high quality genome assemblies of M. zebra and O. niloticus.
With the exception of M. zebra deletions, we observed high proportions of lineage specific events, consistent across all SV classes (Fig.2). However, in the case of deletions we also observed a high number (1711) of events which are ancestral to the radiation. Overall, these results point at a reduction in genome size associated to the African radiation. While this is in concordance with the observation that the M. zebra assembly is 48Mb shorter than the O. niloticus reference [26, 27] , conserved SVs might also reflect a rearrangement event specific to O. niloticus, as stated previously.
We investigated the extent of interval overlap between our predicted SVs and different genomic features. We considered different subsets of our SV annotations, categorising our predictions based on size range (<1kb, 1kb-10kb,>10kb) and number of species sharing the SV event (whole dataset vs conserved SVs). We observed a strong association between >10kb conserved deletions and immunoglobulin chain regions. The association is highly significant for both the constant (14.8 fold change, p=0.01) and variable (10.8 fold change, p=5e-03) gene segment annotation, which suggests a possible involvement of copy number variants in immune response mechanisms. It must be pointed out, nevertheless, that these loci are present in multiple, tandemly repeated copies, and the observed association could possibly reflect assembly issues in repetitive regions.
We also hypothesised that repeats throughout the genome facilitate the evolution of structural changes. In order to test this hypothesis, we looked at the genomic association (interval overlap analysis, see Methods) between our SV dataset and African cichlid specific repetitive elements. These analyses highlighted a significant overlap between 1kb-10kb long conserved inversions and SINE2 elements (10.2 fold change, p=8.7e-03). The association with SINE2 is not significant, however, when we consider all conserved inversions, irrespective of their size (1 fold change, p=0.3).
Conserved duplications are significantly under-represented with African cichlids (AFC) SINE2-1 (0.64 fold change, p=2.7e-03) and REX1-2 AFC elements (0.3 fold change, p=3.14e-02). Conversely, they appear to be enriched for several simple repeats, including (AAGTCTC)n (54.7 fold change, p=1e-04).
Large deletions appear to be negatively associated with AFC RTE-2 elements (0.57 fold change, p=3.7e-02) but positively associated with AFC L1-1 elements (2.17 fold change, p=8.7e-03), as well as several simple repeats. When we considered all conserved deletions, irrespective of their size, we observed a significant association with AFC SINE2-1 (1.42 fold change, p=1e-04) and SINE3 (2.98 fold change, p=1e-04) elements. Similar conclusions were reached in previous studies on the pig genome [45, 46]. Taken together, these results suggest a correlation between repetitive elements and structural evolution in African cichlids.
We next asked the question whether we observe differences in the repeat landscape inside and outside SV regions. In order to identify homologous regions between the reference and each of the remaining species, we converted Satsuma2 whole genome alignments to chain format, and performed a liftover of all SV coordinates from the reference to each of the other species. This allowed us to compare the repeat landscape in a pairwise fashion, considering different SV class separately.
When heterozygous, an inversion can favor the accumulation of mutations and novel transposable elements, as a result of reduced excisions rates [47, 48]. We tested this possibility by comparing the repeat content inside and outside inverted regions. We focused our analysis on the latest O. niloticus and M. zebra genome references, representing the highest quality assemblies among our five species. Previous studies highlighted very high proportions of DNA transposable elements in African cichlids [22], an observation which was confirmed by our data (Figg.3-5). Overlap analyses based on the O. niloticus reference suggested a limited, but significant enrichment in DNA transposons inside inversions (size range: 500nt-5Mb; fold change=1.07, p=1e-04). While for other repeat classes the proportions are very similar inside and outside liftover inverted regions (Fig.3), the LTR representation is higher in the former across all divergence bins (Fig.4). This reflected in a significant enrichment in LTR elements inside inversions (size range: 500nt-5Mb; fold change=1.21, p=1e-04), as opposed to the LTR content outside inversions (fold change=0.92, p=1e-04).
Next, we repeated the analyses considering duplication events. We observed no difference in repeat content inside and outside duplicated regions (Fig.5). The association between duplications and LTR is weaker than expected (fold change=0.94, p=1e-04), while no significant deviation was found when considering DNA transposons (fold change=0.99, p=0.31). As for regions outside duplications, we observed a significant, although very limited, enrichment for both DNA (fold change=1.001, p=0.04) and LTR (fold change=1.05, p=8e-04) elements. It must be noted, however, that during the liftover conversion of the genomic coordinates, many inverted and duplicated regions were lost, limiting the sequence space considered in M. zebra.
SV regions are enriched for developmental and immune related genes
Structural variation can provide important evolutionary novelty for speciation and the evolution of adaptive traits [29-31, 38]. For instance, gene duplication can lead to dosage effects, neofunctionalisation or subfunctionalisation events (Lynch 2002; Katju and Lynch 2006), while inverted regions can experience drastically reduced recombination rates [38]. We took advantage of our SV dataset across 4 species, to investigate which genes are affected by duplication or inversion events. We first considered different subsets of inversions, separating species specific events from the ones annotated in multiple species. When looking at species specific events, we considered each species separately (Supplementary File 3). We identified 559 genes in H. burtoni, 109 in M. zebra, 580 in N. brichardi and 814 in P. nyererei. Results for H. burtoni highlighted GO:0006955 (“immune response”, significant: 13; expected: 5.01; padj =0.0015), GO:0007600 (“sensory perception”, significant: 10; expected: 5.04, padj =0.03).
Inverted genes in N. brichardi, are enriched for GO:0065007 (“biological regulation”, significant: 193; expected: 154.35, padj =0.0411) and GO:0009416 (“response to light stimulus”, significant: 6; expected: 2.45, padj =0.0351). Interestingly, we also found one gene (gja3, coding for an intercellular channel) annotated to GO:0048050, (“post-embryonic eye morphogenesis”). P. nyererei genes are enriched for GO:0007602 (“phototransduction”, significant: 6, expected: 1.44, padj =0.003). This set also includes 2 genes annotated to GO:0002089 (“lens morphogenesis in camera-type eye”, significant: 2, expected: 0.23, padj =0.018), fn1a and foxe3, as well as 3 genes annotated to GO:0061035 (“regulation of cartilage development”, significant: 3, expected: 0.54, padj =0.015): sox32, s1pr2 and pthlha. Members of the sox gene family encode for transcription factors, and play a crucial role in morphological and behavioural variation in teleosts [49]. pthlha is an oral jaw specific gene [50] coding for the parathyroid hormone. In the case, of M. zebra, we could only identify one accession represented by 5 or more genes: GO:0006468 (“protein phosphorylation”; significant: 6; expected: 2.52; ; padj =0.0374).
We also considered inversions which are common to at least 2 species, again not exceeding 5 Mb in size. A total of 854 GO annotated genes (Supplementary File 3) could be identified inside these SV regions. GO:term enrichment on this gene set highlighted accessions GO:0060041 (“retina development in camera-type eye”, significant: 14; expected: 5.7; padj =0.001), GO:0060042 (“retina morphogenesis in camera-type eye”, significant: 7; expected: 2.76, padj =0.0018), GO:0048706 (“embryonic skeletal system development”, significant 10; expected: 5.12; padj =0.03). Among the genes annotated to GO:0060041, we found: vax2 (Ventral Anterior Homeobox 2), a gene known to regulate cone opsin expression [51]; fgf8a (fibroblast growth factor 8a), part of a key pathway in animal evolution [52], and ift172 (intraflagellar transport 172).
We repeated the same procedure for inversions up to 10Mb in length, which increased the number of genes considered to 1,404 (Supplementary File 3). While accession GO:0060041 was still significantly over-represented, we observed additional, immune related processes: GO:0019882 (“antigen processing and presentation”, significant: 8; expected: 2.7; padj =0.0044), GO:0006955 (“immune response”, significant: 19; expected: 12; padj =0.03) and GO:0042445 (“hormone metabolic process”, significant: 6; expected: 2.14 ; padj =0.01). As part of GO:0006955, we found the gene nfil3 (Nuclear Factor, Interleukin 3 Regulated), coding for a transcriptional regulator.
Next, we compared the GO terms across different subtrees of our five-species phylogeny (Supplementary Fig.2, Supplementary File 3). We first identified inversions common to M. zebra and P. nyererei but absent in the other species, for which we could identify 210 inverted genes. We found enrichment for protein modification and processing, including GO:0006508 (“proteolysis”, significant: 14; expected: 7.3; padj=0.014). When considering the branch leading to these two species as well as H. burtoni (134 genes, events absent in N. brichardi), we identified genes involved developmental processes, including 4 annotated to GO:0048598 (“embryonic morphogenesis”) and 2 genes for accession GO:0033339 (“pectoral fin development”): cyp26c1 (cytochrome P450, family 26, subfamily C, polypeptide 1) and sall4 (Spalt Like Transcription Factor 4). The former lies in a sex associated region in H. burtoni [53]. Together with the fact that the inversion is lineage specific, it makes the gene particularly interesting. We can speculate that the inversion event might have helped the maintenance of specific haplotypes (including gene cyp26c1) through the suppression of recombination in the affected region, possibly contributing to the divergence of sex-associated traits.
The enrichment for developmental processes was also observed for genes in conserved inversions (up to 5Mb in size, n=90), among which GO:0060042 (“retina morphogenesis in camera-type eye”, <5 genes) and GO:0048706 (“neuron development”, significant 5; expected: 1.3; padj =0.04) are particularly interesting.
We also looked for genes contained inside tandem duplications, and filtered the resulting set based on evidence of tandem repeat of at least 3 consecutive exons in the target genome assembly (see Methods). When considering species-specific events (Supplementary File 3), we identified 204 genes in H. burtoni, 197 in M. zebra, 143 in N. brichardi and 224 in P. nyererei. For the H. burtoni gene set we identified, among others, GO:0006508 (“proteolysis”, significant: 19; expected: 7.3; padj=0.021) and GO:0060078 (“regulation of postsynaptic membrane potential”, significant: 6; expected: 0.87; padj=0.0002). Duplicated genes in P. nyererei are enriched for immune related processes, including GO:0006955 (“immune response”, significant: 10, expected: 1.83; padj=1.4e-5) and GO:0019882 (“antigen processing and presentation”, significant: 6; expected: 0.41, padj<0.0001). Additionally, GO:0055085 is represented by 18 genes (“transmembrane transport”, significant: 18; expected: 11.18, padj =0.028).
By requiring the duplication event to be shared by at least 2 species, we could identify 152 genes (Supplementary File 3). Results highlighted the presence of GO:0019882 (“antigen processing and presentation”, padj=1e-6), GO:0007229 (“integrin-mediated signalling pathway”, padj=1.6e-5), and GO:0006955 (“immune response”, significant: 8; expected:1.54; padj=1.5e-4). This dataset contains two genes encoding for an H-2 class II histocompatibility antigen chain: ENSONIG00000019943 and ENSONIG00000003904. Accession GO:0048854 (“brain morphogenesis”, significant:2; expected:0.15; padj=0.001) was also significant enriched, however it is represented by only 2 genes: atp1a1 (ENSONIG00000012456), encoding for ATPase Na+/K+ transporting subunit alpha 1a, and shank3 (SH3 and multiple ankyrin repeat domains 3). Similar to inversions, we looked at GO enrichment across the phylogenetic tree (Supplementary File 3). While only one gene (lyz) was found in conserved duplications (after filtering for evidence of tandem repeats and presence in the Ensembl annotation), we had 41 genes for the M. zebra-P. nyererei subtree and 58 for the lineage including H. burtoni as well (Supplementary Fig.3). However, in all of these cases the significantly enriched terms were represented by very low (<4) numbers of genes.
Altogether, our analyses provide the first insights into the possible contribution of SVs to the evolution of adaptive traits in African cichlids, including circadian rhythm, developmental processes and immune response mechanisms.
Validation of selected deletion and inversion events
In order to better understand the reliability of our computational analyses, we decided to validate selected deletions by PCR amplification of the rearranged genomic region (Fig.7, Supplementary Fig. 4, Methods). We focused on 10, medium sized (1-5kb) deletion events annotated in M. zebra (Table 2). For the validation, we compared experimental results obtained using tissue samples for M. zebra (liver and brain) and O. niloticus (testis and fin). In this comparison, O. niloticus represented the SV-free reference sequence, while M. zebra is predicted to carry the deletion event (and hence show a smaller amplification product). Fig. 7 provides an overview of the results of the second PCR run. We could confidently confirm the deletion event in 7 out of 10 cases. In the case of deletions 1 and 2, we were not able to detect the expected products. As for deletion 6, we had discordant results between run 1 (Supplementary Fig.4) supporting the SV, and run 2 showing the expected product in both M. zebra and O. niloticus. Even excluding deletion 6, we obtain a 70% concordance between our computational predictions and the PCR validation, providing evidence for the reliability of our bioinformatics pipeline.
Next, we adapted the primer design strategy for the validation of 9 selected inversions (Table 3), ranging from 1kb to 10Mb in size. We chose 7 events containing genes involved in either retina development (GO:0060041) or innate immune response (GO:0045087), plus 2 additional, smaller (<4kb) inversions. Different primer sets were designed to match sequences flanking either of the two breakpoints (Table 3, Methods). PCR results (Table 3, Fig. 8) confirmed the majority of these inversions, providing strong support for 6 events, partial support for 1 event (inv_4) and poor support for the remaining 2 (inv_5 and inv_7). Among the genes inside validated inversions (Supplementary File 3), we find many genes involved in retina development: sf1 (splicing factor 1) as part of inv_2; vax2 (Ventral Anterior Homeobox 2) as part of inv_3; genes id2 (Inhibitor of Differentiation 2), exoc5 (Exocyst Complex Component 5) and ppm1a (Protein Phosphatase 1A) located inside inv_8. No gene annotated to GO:0045087 (“innate immune response”), however, was found inside these validated inversions. Altogether, these results demonstrate the reliability of our bioinformatics analyses, and provide additional, experimental support to our inferences.