Beta satDNA sequences were identified in genomes of multiple eukaryotic taxa
To study the distribution of beta satDNAs across species, we ran online BLAST searches of human beta satDNA sequences against the NCBI nucleotide collection. In addition to primates, which are known to harbor beta satDNAs, significant hits were found in Spirometra erinaceieuropaei, Onchocerca flexuosa, Enterobius vermicularis, Bos mutus and Nicotiana tabacum. S. erinaceieuropaei, O. flexuosa and E. vermicularis are endoparasites of humans and other mammals, so the presence of beta satDNA in these species suggests HTs. Unexpectedly, a hit was also found in a plant, N. tabacum. This result suggests that beta satDNAs could be distributed much more widely than previously known, and that a global survey of beta satDNA sequences in eukaryotes is needed.
Currently, there are >8,000 genome assemblies of ~4,000 eukaryotic species in the NCBI Genome database. We downloaded 7,821 assemblies of 3,767 species and BLASTed them against a database containing human beta satDNA sequences. After filtering the BLAST outputs, we found 33,150 beta satDNA copies in 166 genome assemblies of 116 species (Fig. 1, Table 1, Supplementary Fig. S1, Supplementary Table S1, and Supplementary File 1). Because beta satDNA sequences are highly repeated and difficult to assemble into chromosome contigs, it is hard to assess the proportion of beta satDNAs in a given genome from its assembly. For example, one human genome assembly, GCF_000002125.1_HuRef, has 7,036 beta satDNA copies, whereas another, GCF_000306695.2_CHM1_1.1, has only 107 copies. Strikingly, beta satDNAs were found in most of the major branches of eukaryotes, including 68 of 1,394 animals, 16 of 415 plants, 22 of 1,656 fungi, 8 of 60 species of Apicomplexa (9 of 176 species of Harosa), and 1 of 20 species of Mycetozoa (1 of 43 species of Amoebozoa). No hit was found in Excavata (0/54), which is in fact a polyphyletic group [17, 18].
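The filtering step can be illustrated with a minimal sketch. It assumes the BLAST results were written in the standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) with assembly contigs as queries. The thresholds and the function names `passing_hits` and `copies_per_query` are hypothetical, chosen only for illustration; the actual filtering criteria are not stated in this section.

```python
import csv
import io
from collections import Counter

# Hypothetical thresholds for illustration only; not the criteria used in the study.
MIN_IDENTITY = 80.0   # minimum percent identity
MIN_ALIGN_LEN = 50    # minimum alignment length in bases
MAX_EVALUE = 1e-5     # maximum e-value


def passing_hits(handle, min_identity=MIN_IDENTITY,
                 min_align_len=MIN_ALIGN_LEN, max_evalue=MAX_EVALUE):
    """Yield rows of 12-column BLAST tabular output that pass all thresholds."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment lines, and rows with too few columns.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        pident = float(row[2])
        align_len = int(row[3])
        evalue = float(row[10])
        if (pident >= min_identity and align_len >= min_align_len
                and evalue <= max_evalue):
            yield row


def copies_per_query(handle):
    """Count passing hits per query sequence (here, per assembly contig)."""
    return Counter(row[0] for row in passing_hits(handle))


# Toy input with made-up contig and satellite names.
example = io.StringIO(
    "contig_1\tbeta_sat_A\t92.6\t68\t5\t0\t101\t168\t1\t68\t1e-20\t95.0\n"
    "contig_1\tbeta_sat_A\t90.0\t68\t7\t0\t169\t236\t1\t68\t1e-18\t90.0\n"
    "contig_2\tbeta_sat_B\t71.0\t40\t11\t1\t10\t49\t5\t44\t0.5\t22.0\n"
)
print(copies_per_query(example))
```

Counting every passing hit as one copy is a simplification: overlapping hits on the same contig would need to be merged before the counts could be read as copy numbers.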
Beta satDNAs can be found in all the major clades of primates, as well as in the tree shrew (Tupaia chinensis), consistent with vertical transfer. Beyond primates, beta satDNAs were also found in the genome assemblies of four Bovinae species (Supplementary Table S1). HT in Bovinae has been reported in several previous studies, and the genomes of Bovinae seem to be especially prone to HT [19, 20]. Beta satDNAs are rare in the rest of the mammals, as well as in other vertebrates and invertebrates (3%), though they are found in most major taxa. Beta satDNAs were found in 1.3% of fungal species. Given that fungal genomes are usually small and easy to assemble, beta satDNAs might be substantially rarer in fungi than in animals. Interestingly, the proportion of species whose genomes contain beta satDNAs is relatively higher in plants (~3.9%), and even higher in Harosa (5.1%). Moreover, the number of sequences found in Harosa is significantly higher than in non-primate animals, fungi and plants (Table 1 and Supplementary Fig. S1). Beta satDNAs typically exist as tandem repeats in the human genome, and a similar pattern was also identified in the contigs/scaffolds of certain non-primate species (Supplementary Fig. S2), indicating that they may play roles similar to those in primate genomes. Considering that most current genome assemblies are far from complete, the actual distribution of beta satDNAs is likely to be substantially wider than the current view, especially in plants, whose genomes are particularly difficult to assemble [21].
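The percentages quoted for each group follow directly from the species counts reported for the genome assembly search. The short sketch below recomputes them. The counts are taken from the text, while the dictionary layout and the function name `percent_positive` are illustrative assumptions.

```python
# (species with beta satDNA hits, species searched), as reported in the text.
counts = {
    "animals": (68, 1394),
    "plants": (16, 415),
    "fungi": (22, 1656),
    "Harosa": (9, 176),
    "Amoebozoa": (1, 43),
    "Excavata": (0, 54),
}


def percent_positive(positive, total):
    """Percentage of searched species in which beta satDNA was found."""
    return 100.0 * positive / total


for group, (positive, total) in counts.items():
    print(f"{group}: {positive}/{total} = {percent_positive(positive, total):.1f}%")
```

These are proportions of species with at least one hit in an assembly, so they depend on assembly completeness as well as on the true presence of beta satDNA.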
Analysis of beta satDNA sequences in raw WGS data
We then analyzed 102 WGS datasets of 73 species from the NCBI SRA database (Fig. 2 and Supplementary Table S2), in order to 1) identify more beta satDNA sequences that have not been assembled into genome data; 2) assess the proportion of beta satDNAs in certain genomes; and 3) obtain a more detailed picture of the distribution of beta satDNAs by looking at representative nodes on the tree of eukaryotes. Consistent with the genome BLAST results, beta satDNAs were found in all primate clades, as well as in Dermoptera and Scandentia. A previous study suggested that there were two bursts of beta satDNA during the evolution of Hominidae: one after the separation of great apes from lesser apes, and the other after the separation of African great apes from orangutans [16]. However, our results showed that the proportion of beta satDNAs in orangutans is similar to that in gibbons or Old World monkeys. Thus, the ‘big bang’ of beta satDNAs took place in the African apes, which may suggest a distinct aspect of heterochromatin regulation in the Homininae. Beta satDNAs compose >0.1% of the human genome and deserve more attention in future studies.
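One simple way to express the proportion of beta satDNA in a WGS dataset is the fraction of sequenced bases that match beta satDNA. The sketch below is an illustrative assumption rather than the procedure used here: the function name `satellite_fraction` and the input format, one (read length, matched bases) pair per read, are hypothetical.

```python
def satellite_fraction(reads):
    """Return the fraction of sequenced bases that match beta satDNA.

    `reads` is an iterable of (read_length, matched_bases) pairs, where
    matched_bases is the number of bases in that read covered by a hit.
    """
    total_bases = 0
    matched_total = 0
    for read_length, matched_bases in reads:
        if matched_bases > read_length:
            raise ValueError("matched bases cannot exceed read length")
        total_bases += read_length
        matched_total += matched_bases
    if total_bases == 0:
        raise ValueError("no reads supplied")
    return matched_total / total_bases


# Toy example: three 100-bp reads, one fully and one half covered by hits.
print(satellite_fraction([(100, 0), (100, 100), (100, 50)]))
```

With unbiased sequencing, this read-level fraction approximates the genomic proportion without requiring the repeats to be assembled.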
Besides Euarchonta, bovines, toothed whales and elephants were the three mammalian clades found to have beta satDNAs. Beyond mammals, beta satDNAs were identified in many parasites, including blood-feeding insects. Since human DNA contamination is hard to avoid when isolating DNA samples from parasites [22], this observation needs to be treated with special caution. Beta satDNAs seem to be more common in plants than in animals, though their abundances are not high. We found significant hits in most major taxa of plants, including angiosperms, gymnosperms, mosses, green algae and red algae. Additionally, all three species of Harosa examined, T. gondii, P. falciparum and Saccharina japonica (a brown alga), contain beta satDNA sequences. Generally speaking, the current WGS data of protists are limited in both quantity and quality. We expect a clearer picture of beta satDNA distribution in protists as more data become available.
PCR amplification of beta satDNAs in 36 species
Since there have been no previous reports of beta satDNAs in non-primate species or of HT of beta satDNAs, our findings are unexpected and require strong supporting evidence. To further validate the search results from genome assemblies and SRA data, we performed PCR assays using genomic DNAs (gDNAs) of 36 species as templates (Fig. 3 and Supplementary Fig. S3). Positive PCR signals of beta satDNAs appear as typical ladders with ~70-bp spacing, reflecting their tandemly repeated organization. Overall, the PCR results were consistent with the BLAST results. Mouse, rat and pig, which served as negative controls, gave no signal, while positive signals were observed in a wide range of animal and plant genomes.
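The ladder pattern arises because primers that anneal within each monomer of a tandem array can pair across one, two or more repeat units, giving products whose lengths differ by one monomer. The sketch below lists the expected band sizes and checks whether a set of observed bands has ladder-like spacing. It takes the ~70-bp spacing from the text, whereas the first band size, the tolerance, and the function names `ladder_band_sizes` and `is_ladder` are hypothetical.

```python
def ladder_band_sizes(first_band, monomer_length=70, n_bands=5):
    """Expected product sizes (bp) when primers pair across 0, 1, 2, ... extra monomers."""
    return [first_band + k * monomer_length for k in range(n_bands)]


def is_ladder(band_sizes, monomer_length=70, tolerance=10):
    """True if consecutive bands are each about one monomer apart.

    Simplification: a missing rung is treated as a broken ladder.
    """
    ordered = sorted(band_sizes)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return bool(gaps) and all(abs(gap - monomer_length) <= tolerance for gap in gaps)


# Hypothetical first band of 120 bp.
print(ladder_band_sizes(120))
print(is_ladder([120, 192, 258, 331]))
```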
It is not surprising that the positive rates of beta satDNA in the SRA data, and especially in PCR, are higher than that in the genome assemblies. Because beta satDNA sequences are difficult to assemble owing to their highly repetitive nature, many assemblies do not contain beta satDNA sequences even though the physical genomes do. For example, we failed to identify beta satDNA sequences in the current genome assemblies of Pongo pygmaeus (Bornean orangutan), which undoubtedly has beta satDNAs.
Of course, contamination is a concern in both WGS and PCR, and even genome assemblies may be affected by contamination and misassembly. We therefore used multiple approaches to rule out false conclusions introduced by possible contamination (see Supplementary Discussion for detailed information).
Identification of beta satDNA sequences in prokaryotic genomes
To see whether beta satDNAs exist in genomes other than those of eukaryotes, we searched the 210,000 prokaryotic genome assemblies (12,398 species: 569 archaea and 11,829 bacteria) in the NCBI Genome database and found hits in 72 species, two archaea and 70 bacteria (Fig. 4). Beta satDNAs are thus far rarer in prokaryotes than in eukaryotes. Since prokaryotic chromosomes are very different from eukaryotic chromosomes and have no centromeres or telomeres, beta satDNAs may not have a function in prokaryotes and could merely be the result of HTs from eukaryotes. Most of the bacteria containing beta satDNAs are symbionts or pathogens of animals or plants (Supplementary Table S3) and might have acquired beta satDNAs from their hosts. In addition, there are >13,000 genome assemblies of organelles and >30,000 genome assemblies of viruses in the NCBI database, but none of them contains beta satDNA. Although viruses and symbiotic organelles are common vehicles for HT [23, 24], they might not play a role in the HT of beta satDNAs.
Characterization of the diversity of beta satDNA sequences
To analyze the diversity and evolution of the beta satDNA sequences of different taxa, we examined the phylogenetic relationships of 1,384 beta satDNA sequences from 92 species (Fig. 5A). There is barely any association between the phylogenetic tree of beta satDNAs and the tree of these species, indicating that multiple HTs have taken place between species. However, unlike coding genes or TEs, the beta satDNA sequences from different species could not be separated in the phylogenetic tree, which may be due to intrachromosomal and interchromosomal exchanges [25]. Moreover, the consensus or centroid sequences of beta satDNAs from different species are also very similar. For this reason, the HT pathways of beta satDNAs cannot be inferred from the phylogenetic relationships of sequences from different species, as they can be for coding sequences. We then compared the sequence libraries of different species in a pairwise manner (Fig. 5B). The diversity of sequences in primates is clearly higher than in the other taxa, suggesting that the pool of beta satDNA was quite small until it expanded in primates. Similarly, the cluster analysis of individual beta satDNA sequences from non-primate species showed a pattern distinct from that of the total sequences (Supplementary Fig. S4 and S5).
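A pairwise comparison of sequence libraries can be sketched with a simple alignment-free measure. The example below uses Jaccard similarity of k-mer sets, which is a stand-in technique chosen for illustration and not necessarily the measure behind Fig. 5B. The value of k, the function names and the toy sequences are all assumptions.

```python
from itertools import combinations


def kmer_set(sequences, k=8):
    """Collect all k-mers occurring in a library of sequences."""
    kmers = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_library_similarity(libraries, k=8):
    """Map each pair of library names to the Jaccard similarity of their k-mer sets."""
    kmers = {name: kmer_set(seqs, k) for name, seqs in libraries.items()}
    return {
        (x, y): jaccard(kmers[x], kmers[y])
        for x, y in combinations(sorted(kmers), 2)
    }


# Toy libraries with made-up sequences; k=4 only because they are so short.
toy = {
    "species_a": ["ACGTACGT"],
    "species_b": ["acgtacgt"],
    "species_c": ["TTTTTTTT"],
}
for pair, similarity in pairwise_library_similarity(toy, k=4).items():
    print(pair, similarity)
```

A library whose sequences are more diverse contributes more distinct k-mers, so the size of each k-mer set gives a rough view of within-library diversity alongside the between-library similarity.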