Comparative genomics of the human genome and six bat genomes using AI: Mb-level CpG and TFBS islands

doi:10.21203/rs.3.rs-1323531/v1

Research Article

This work is licensed under a CC BY 4.0 License

Background

Emerging infectious disease RNA viruses, such as the SARS-CoV-2 and Ebola viruses, are thought to rely on bats as their natural reservoir hosts. Since these zoonotic viruses pose a great threat to humans, it is important to characterize the bat genome from multiple perspectives. Unsupervised artificial intelligence (AI) methods that extract novel information from big sequence data without prior knowledge or particular models are highly desirable for obtaining unexpected insights. We previously established a batch-learning self-organizing map (BLSOM) of the oligonucleotide composition that reveals novel genome characteristics from big sequence data.

Results

In this study, using the oligonucleotide BLSOM, we conducted a comparative genome study of humans and bats. BLSOM is an explainable-type AI that reveals the diagnostic oligonucleotides contributing to sequence clustering (self-organization). When the unsupervised AI reveals unexpected and/or characteristic features, these features can be studied in more detail via the much simpler and more direct standard distribution map method. Based on this combined strategy, we identified the Mb-level enrichment of CG (Mb-level CpG islands) around the termini of bat long-scaffold sequences. In addition, a class of CG-containing oligonucleotides was also enriched in the centromeric and pericentromeric regions of human chromosomes. Oligonucleotides longer than tetranucleotides often represent binding motifs for a wide variety of proteins, e.g., transcription factor binding sequences (TFBSs). By analyzing penta- and hexanucleotide compositions, we observed evident enrichment of a wide range of hexanucleotide TFBSs in centromeric and pericentromeric heterochromatin regions on all human chromosomes.

Conclusion

TFBSs that are enriched in centromeric and pericentromeric heterochromatin regions are thought to play an important role in the formation of nuclear 3D structures. Our AI-based analysis should help us to understand the differential features of nuclear 3D structures in the human and bat genomes.

artificial intelligence

oligonucleotide composition

Self-Organizing Map

centromeric heterochromatin

Covid-19

Emerging infectious disease RNA viruses, such as the SARS-CoV-2 and MERS coronaviruses and Ebola viruses, are considered to rely on bats as their natural reservoir hosts [1]. Several research groups, including ours, have shown that the mono- and short oligonucleotide composition of these virus genomes changes in a time-dependent, directional manner after invading the human population [2-5]. APOBEC enzymes appear to be involved in the directional time-dependent changes [6-8], but many other factors involved in efficient growth in human cells are also expected to contribute. To understand the molecular evolution of these infectious RNA viruses in detail, it is important to characterize the genome of bats, which are thought to be their reservoir hosts, from various aspects and in comparison with the human genome. Due to the societal and biological importance of bats, reference-quality genome sequences have been reported for six species [7], and these sequences were used in the present study.

The composition of short oligonucleotides has been called the “genome signature” [9]; in microorganisms, it differs among species, even among species with the same genome G+C%. In higher vertebrates such as mammals, however, there are clear intragenomic differences in both the mononucleotide composition (e.g., isochores) [10] and the oligonucleotide compositions [11-13]. Here, we conducted comparative genomic analyses of the human genome and six bat genomes by focusing on the oligonucleotide composition. Notably, oligonucleotide sequences longer than tetra- and pentanucleotides often correspond to binding motifs for a wide variety of proteins, e.g., transcription factor binding sequences (TFBSs), and a large amount of accumulated knowledge about oligonucleotide functions is available for the human genome. Since oligonucleotide sequences with important biological functions (e.g., TFBSs) tend to be conserved during evolution, we can expect to obtain novel knowledge about the bat genome by utilizing the information on oligonucleotide compositions accumulated for the human genome [11-13].

Unsupervised AI methods that can extract unexpected insights from big data without prior knowledge or particular models are highly desirable for current genome research. We previously developed a batch-learning self-organizing map (BLSOM) for the oligonucleotide compositions that enables the separation (self-organization) of genomic fragments (e.g., 100 kb) by species and phylogeny [14-16]. Importantly, BLSOM is an explainable-type AI, which can identify the diagnostic oligonucleotides that contribute to the classification (self-organization) of genomic sequences. The basic strategy applied in the present study is to let the AI carry out the majority of initial knowledge discovery without presetting particular models or hypotheses. Then, by focusing on the unexpected and/or characteristic results obtained by the AI, we successively examine the AI’s findings via more direct methods.

BLSOM

The Kohonen self-organizing map (SOM), an unsupervised neural network algorithm, is a powerful tool for clustering and visualizing high-dimensional complex data in a two-dimensional map [17]. We modified the conventional SOM for genome informatics based on batch learning so that the learning process and the resulting map were independent of the order of data input [18]. The initial weight vectors were defined using principal component analysis (PCA) rather than by using random values. The weight vectors (wij) were arranged in a two-dimensional lattice denoted by i (= 0, 1,…, I-1) and j (= 0, 1,…, J-1) and were set and updated as described previously [14,18]; weights in the first dimension (I) were arranged into lattices corresponding to a width of five times the SD (5 × σ1) of the first principal component, and the second dimension (J) was defined by the nearest integer greater than σ2/σ1 × N. Here, N was the average number of sequence data per node; and σ1 and σ2 were the standard deviations of the first and second principal components, respectively. Therefore, the ratio of the two axes of the obtained BLSOMs differs depending on the data to be analyzed. A BLSOM program suitable for PC cluster systems is available on our website (http://bioinfo.ie.niigata-u.ac.jp/?BLSOM).
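The order independence of batch learning can be illustrated with a small sketch. This is a deliberately simplified batch update, not the authors' BLSOM program: every input is first assigned to its closest node using the old weights, and each node is then replaced by the mean of the inputs falling in its square neighborhood. The learning coefficient, the shrinking neighborhood and the PCA-based initialization described above are omitted, and the names `best_node`, `batch_epoch` and `radius` are hypothetical.

```python
import math

def best_node(vec, weights):
    """Return (i, j) of the node whose weight vector is closest to vec."""
    best, best_d = None, float("inf")
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            d = math.dist(vec, w)
            if d < best_d:  # ties go to the first node in scan order
                best, best_d = (i, j), d
    return best

def batch_epoch(data, weights, radius=1):
    """One simplified batch pass (assumption: plain neighborhood mean, no learning rate)."""
    rows, cols, dim = len(weights), len(weights[0]), len(weights[0][0])
    sums = [[[0.0] * dim for _ in range(cols)] for _ in range(rows)]
    hits = [[0] * cols for _ in range(rows)]
    for vec in data:
        # Every input is classified against the OLD weights, so input order is irrelevant.
        bi, bj = best_node(vec, weights)
        for i in range(max(0, bi - radius), min(rows, bi + radius + 1)):
            for j in range(max(0, bj - radius), min(cols, bj + radius + 1)):
                hits[i][j] += 1
                for d in range(dim):
                    sums[i][j][d] += vec[d]
    new_weights = []
    for i in range(rows):
        new_row = []
        for j in range(cols):
            if hits[i][j]:
                new_row.append([s / hits[i][j] for s in sums[i][j]])
            else:
                new_row.append(list(weights[i][j]))  # node received no input: unchanged
        new_weights.append(new_row)
    return new_weights
```

Because all assignments are made before any weight changes, presenting the same inputs in a different order yields the same map, which is the property the batch-learning modification is meant to provide.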

Genome sequences and TFBS data

The sequences of six bat genomes published by Jebb et al. (2020) [7] were obtained from https://bds.mpi-cbg.de/hillerlab/Bat1KPilotProject/ on 2021/02/02; bat species names and their abbreviations are given in the Fig. 1A legend. The genome sequence of Homo sapiens (GRCh38/hg38) was obtained from the UCSC ftp site (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/) on 2021/05/15. When the number of undetermined nucleotides (Ns) in a 100-kb or 1-Mb fragment sequence exceeded 20% of the fragment size, the sequence was omitted from the analysis. When the number of Ns was less than 20%, the oligonucleotide frequencies were normalized to the length without Ns, and the fragment was included in the analysis. Human TFBS sequences were obtained from the SwissRegulon Portal (http://swissregulon.unibas.ch/sr/downloads), which publishes TFBS predictions made with MotEvo, a set of integrated Bayesian probabilistic methods for inferring regulatory sites and motifs on multiple alignments of DNA sequences [19]. Since complementary oligonucleotides are treated as the same oligonucleotide in this study, we used 181 degenerate sets of hexanucleotide TFBSs (DegeHexa TFBSs) obtained from the SwissRegulon Portal; for the TFBS sequences, see Supplemental Table S1.
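The handling of undetermined nucleotides can be sketched as follows. This is one plausible reading of the rule above, not the authors' code: a fragment is dropped when Ns exceed 20% of its length, and otherwise frequencies are computed over the oligonucleotides that contain no N. The function name `oligo_frequencies` and the parameter `max_n_fraction` are hypothetical, and the treatment of a fragment with exactly 20% Ns is an assumption.

```python
def oligo_frequencies(fragment, k, max_n_fraction=0.20):
    """Return {k-mer: relative frequency}, or None if the fragment has too many Ns."""
    fragment = fragment.upper()
    if fragment.count("N") > max_n_fraction * len(fragment):
        return None  # fragment omitted from the analysis
    counts = {}
    for i in range(len(fragment) - k + 1):
        kmer = fragment[i:i + k]
        if "N" in kmer:
            continue  # skip oligonucleotides that overlap an undetermined base
        counts[kmer] = counts.get(kmer, 0) + 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumption: "normalized to the length without Ns" is read as dividing
    # by the number of N-free k-mers rather than by the raw fragment length.
    return {kmer: c / total for kmer, c in counts.items()}
```

With this normalization, the frequencies of a retained fragment sum to 1 regardless of how many Ns it contains, so fragments with different amounts of missing sequence remain comparable.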

DegeDinucleotide BLSOM for bat and human genomes

The genome sequence registered in public DNA databases (e.g., International Nucleotide Sequence Database Collaboration; https://www.insdc.org/about) is only one strand of the two complementary sequences, and the strand is chosen rather arbitrarily in the registration of fragment sequences, including scaffold sequences. In other words, the distinction between complementary sequences is usually not important for understanding the global characteristics of genome sequences. Here, the frequencies of two complementary dinucleotides (e.g., AA and TT) are added together, and the pair is referred to as a degenerate dinucleotide (DegeDi) [15-16]. Figure 1A is a BLSOM of the DegeDi composition of 100-kb fragments from six bat genomes; the number of nodes is set so that an average of 20 sequences are attributed to each node (grid point). Importantly, the learning is unsupervised: no information other than the oligonucleotide composition is provided during machine learning. In the figure, nodes containing sequences of a single species are shown in a color that indicates the species, and nodes containing sequences of multiple species are displayed in black. There are many black nodes, and the separation of the species is poor under these conditions, showing that the simultaneous use of six bat genomes complicates the results. Therefore, in constructing the DegeDi-BLSOM for comparative genomics including the human genome, we selected the following three bat species: the megabat Rousettus aegyptiacus (Aeg); the microbat Rhinolophus ferrumequinum (Fer), which is closely related to the host species of the bat SARS strain that is thought to be the origin of SARS-CoV-2; and the microbat Myotis myotis (Myo). The bats that are not used in this BLSOM analysis are used in later distribution analyses of oligonucleotide frequencies. Figure 1Bi shows the BLSOM results for the three bats and humans; in Fig. 1Bii, each node is colored according to the species with the largest number of sequences attributed to it. Even under these analysis conditions, the separation between bats and humans is poor, as shown in Fig. 1Bi.
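The merging of complementary oligonucleotides can be made concrete with a short sketch. It assumes that each oligonucleotide is represented by the alphabetically smaller of itself and its reverse complement; the function names are hypothetical and the code is an illustration rather than the authors' implementation.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(oligo):
    """Sequence of the complementary strand, read 5' to 3'."""
    return oligo.translate(COMPLEMENT)[::-1]

def canonical(oligo):
    """Single representative for an oligonucleotide and its complement (e.g., GA for GA+TC)."""
    return min(oligo, reverse_complement(oligo))

def degenerate_oligos(k):
    """All degenerate oligonucleotides of length k, as canonical representatives."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})

def degenerate_counts(seq, k):
    """Count degenerate k-mers in seq, skipping k-mers with non-ACGT characters."""
    counts = dict.fromkeys(degenerate_oligos(k), 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts
```

Under this convention, `len(degenerate_oligos(k))` gives 10 variables for dinucleotides, 32 for trinucleotides, 512 for pentanucleotides and 2080 for hexanucleotides, and 122 of the degenerate pentanucleotides contain CG, which matches the variable counts quoted in this study.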

When comparing genomes, fragmenting each genome into short pieces is likely to complicate the BLSOM analysis for various reasons, e.g., some fragments primarily contain protein-coding sequences, but others do not. By extending the fragment size, the effect caused by fragmentation can be reduced, and the comparison between genomes becomes easier. In Fig. 1Ci, the fragmentation window size is set to 1 Mb, and the window is moved in 100-kb steps to reduce the effects of the 1-Mb cutting positions. The number of sequences used for the BLSOM is almost the same as in the analysis shown in Fig. 1Bi, so the effect of increasing the fragment size alone can be easily detected; the number of black nodes is decreased, and the separation between species becomes clearer. Human and bat genomes show clear separation, but for both the human and bat genomes, there is a broad horizontal distribution and subdivision of each species territory, which seems to be related to the mosaic G+C% structures typically observed in higher vertebrates [10,20,21]. Since the separation pattern is simpler than that under 100-kb fragmentation (Fig. 1B), 1-Mb fragmentation is used in subsequent analyses.
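A small sketch shows why a 1-Mb window moved in 100-kb steps yields almost the same number of sequences as non-overlapping 100-kb fragments. The helper `window_starts` is hypothetical, and discarding a final partial window is an assumption made here for simplicity.

```python
def window_starts(length, window, step):
    """Start coordinates of all full-length windows on a sequence of the given length."""
    if length < window:
        return []
    return list(range(0, length - window + 1, step))

# Hypothetical 10-Mb scaffold used only for illustration.
scaffold_length = 10_000_000
fragments_100kb = window_starts(scaffold_length, 100_000, 100_000)
windows_1mb = window_starts(scaffold_length, 1_000_000, 100_000)
print(len(fragments_100kb), len(windows_1mb))
```

For this example the two schemes give 100 and 91 sequences, respectively; the difference is always nine windows per scaffold at these settings, which is negligible for long scaffolds and chromosomes.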

The BLSOM is equipped with a tool known as the U-matrix [22], which displays the Euclidean distances between representative vectors of neighboring nodes as the degree of blackness; a larger distance is reflected by a higher degree of blackness (Fig. 1Ciii). A distinct black zone is observed in the upper right part of the human territory, indicating that the oligonucleotide compositions of the sequences located in that region not only differ distinctly from the majority of human sequences but also differ from each other. In our previous BLSOM analysis of the human genome, which mainly focused on tetra- to hexanucleotides, we found a similar conspicuous black zone of the U-matrix; we called this zone the “specific zone” (Sz), and we reported that the sequences derived from centromeric and pericentromeric heterochromatin regions were clustered there [5,11-13]. In the present study, the similar black region found by the U-matrix is also referred to as Sz. The BLSOM is an explainable-type AI, which provides us with the reason why sequences have been separated (self-organized), by using a red/blue heatmap (Fig. 1D) [14-16]; the contribution level of each oligonucleotide at each node can be visualized by color: red (high), white (moderate) and blue (low). Interestingly, GA+TC shows a distinctive red/blue distribution that stands out from that of other dinucleotides, with a major dark red area in Sz; in more detail, the occurrence frequency is low (blue in Fig. 1D) in most regions, including the bat territories, but evident enrichment (dark red) can be seen in the majority of Sz. Importantly, this is not a property derived from the mononucleotide composition because the distribution is very different from that of AG+CT, which has the same mononucleotide composition. Some enrichment (though not as conspicuous as that of GA+TC) of other oligonucleotides (e.g., AC+GT) is also observed in Sz.
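The U-matrix idea can be sketched in a few lines. The version below assigns each node the mean Euclidean distance between its representative vector and those of its four lattice neighbors; the choice of four neighbors and of averaging is an assumption for illustration, and the BLSOM program may compute or display the distances differently.

```python
import math

def u_matrix(weights):
    """weights[i][j] is the representative vector of node (i, j) on the lattice."""
    rows, cols = len(weights), len(weights[0])
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    dists.append(math.dist(weights[i][j], weights[ni][nj]))
            # Larger values would be drawn as darker cells in the U-matrix display.
            result[i][j] = sum(dists) / len(dists) if dists else 0.0
    return result
```

Nodes inside a homogeneous territory receive values near zero, whereas nodes whose neighbors have very different compositions, as in Sz, receive large values and appear as a black zone.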

DegeDinucleotide distribution on human chromosomes

To investigate the findings made with the BLSOM in detail, we next used a more direct analysis method, a standard distribution map. First, to test the local, evident enrichment of GA+TC in the human genome, which can be predicted from the dark red in Sz, we examined the distribution of the occurrence frequency (%) of the dinucleotide on each human chromosome, as well as that of AC+GT for comparison, in 1-Mb sliding windows with a 100-kb step. Figure 1E provides examples of two autosomal chromosomes (metacentric chr1 and acrocentric chr14) and two sex chromosomes (chrX and Y). The results for the other chromosomes are shown in Supplemental Fig. S1. Human centromeric and pericentromeric heterochromatin regions are composed of highly repetitive sequences and are incompletely sequenced even in the currently available GRCh38/hg38 version, and the unsequenced region is visualized as an open space in Fig. 1E; the central position of the primary constriction of each chromosome shown on the UCSC Genome Browser (https://genome.ucsc.edu/) is indicated by the magenta vertical bar, and the genomic positions of centromeric and pericentromeric heterochromatin regions are presented in Supplemental Table S2. Notably, on chr1, there is a large constitutive heterochromatin region (1q12) adjacent to the centromeric region on the long arm (UCSC Genome Browser, https://genome.ucsc.edu/), and this region has not been sequenced; on acrocentric chr14, the large heterochromatin region on the centromeric side is also left unsequenced and thus blank.

Interestingly, GA+TC was clearly enriched in the centromeric and pericentromeric regions of not only the chromosomes shown here but all chromosomes (Supplemental Fig. S1), showing that these regions have very peculiar oligonucleotide compositions. This is consistent with our previous finding that Sz in the U-matrix is composed of sequences from centromeric and pericentromeric heterochromatin regions [11-13]. AC+GT was enriched in the centromeric and pericentromeric regions of only some chromosomes, and the enrichment, when present, was less evident than that of GA+TC. Furthermore, on many chromosomes, AC+GT tended to be depleted in the regions of interest.

DegeDi-distribution on bat scaffolds

The genome G+C% of both humans and bats is close to 40%, and the number of sequences enriched in C and G is therefore small. The red zone in Fig. 1D for dinucleotides consisting of only C or G is restricted to the left side: the CC+GG content is high (red) in a somewhat broadened area on the left side, whereas the GC content is high in the middle to lower part of the left side. The CG content, however, is high only in the lower part of the left side and extends significantly into the Aeg territory; some other bat sequences are also located around this red region. Considering CG suppression (i.e., CG deficiency) [23], which is typically observed in mammalian genomes and is thought to be related to the C-methylation of CG, the local enrichment of CG in bat territories, especially in the Aeg territory (Fig. 1D), is interesting. We then examined the distribution of the CG frequency in bat genomes to clarify the chromosomal locations of CG-enriched sequences. Figure 1F shows the distribution of the CG composition (%) in 1-Mb fragments with a 100-kb sliding step for the long scaffolds of three bats. The most distinct CG peaks were observed at both ends of the longest and second longest scaffolds (Scaf1 and 2) of Aeg; since only long scaffolds were used in the distribution analysis, these are likely to correspond to chromosomes but are called scaffolds in accordance with the original paper [7]. In the long scaffolds of other bat species, distinct peaks were also observed at one end or both ends (Fig. 1F and Supplemental Fig. S2), but the degree of enrichment was less than that in Aeg, which is consistent with the heatmap result in Fig. 1D. Since each evident peak was composed of data from many 1-Mb sequences, it appears to be appropriate to call it an “Mb-level CpG island”. It should also be noted that, in regions other than the scaffold ends, Fer shows distinct internal peaks, which will be mentioned later.

CpG islands, which play crucial roles in transcriptional regulation, are typically a few hundred bp in length and preferentially occur in gene regulatory regions [24]. By zooming out from the hundred-bp level to the Mb level, conventional CpG islands become inconspicuous, and Mb-level peaks become prominent in bat genomes. In view of the biological importance of CpG islands, we refer to these large-scale structures as “Mb-level CpG islands” [12]. Notably, in a wide range of vertebrate genomes, including the human genome, it has long been known that the G+C% tends to be higher toward the ends of chromosomes [20]. If we use the term CpG island, we should verify that the Mb-level CG enrichment is not a mere reflection of an increase in the G+C%. When we analyzed the distribution of the odds ratio (Obs/Exp) of CG, which is obtained by dividing the observed occurrence of CG by its expected value according to the mononucleotide composition, evident peaks were observed in the terminal regions of the long scaffolds of bat genomes (Supplemental Fig. S3). It is, therefore, appropriate to call the peaks “Mb-level CpG islands”.
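The Obs/Exp odds ratio can be sketched as below. The observed CG frequency among dinucleotides is divided by the product of the C and G mononucleotide frequencies; this is a common formulation and is assumed here, since the exact formula used for Supplemental Fig. S3 is not given in the text. The function name `cg_obs_exp` is hypothetical.

```python
def cg_obs_exp(seq):
    """Observed/expected ratio of CG in one fragment, based on mononucleotide composition."""
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_bases = sum(seq.count(b) for b in "ACGT")
    # Only dinucleotides made of determined bases are considered.
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)
              if seq[i] in "ACGT" and seq[i + 1] in "ACGT"]
    if not dinucs or n_c == 0 or n_g == 0:
        return 0.0  # ratio undefined; reported as 0 by convention in this sketch
    observed = dinucs.count("CG") / len(dinucs)
    expected = (n_c / n_bases) * (n_g / n_bases)
    return observed / expected
```

A ratio well below 1 reflects the CG suppression typical of mammalian genomes, whereas a window in which the ratio rises independently of G+C% indicates CG enrichment that cannot be explained by base composition alone.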

DegeTri-BLSOM for 3 bats and humans

In considering the biological significance of the Mb-level peaks obtained from the dinucleotide analysis, we next examined whether the Mb-level enrichment represents the properties of the dinucleotide itself or is a reflection of the properties of longer oligonucleotides. Figure 2A shows the BLSOM with the degenerate trinucleotide (DegeTri) compositions of humans and the three bats. The upper panels in Fig. 2B show the heatmaps of the DegeTri containing GA (and, thus, TC). The heatmaps of all DegeTri (Supplemental Fig. S4) also show that the addition of a different nucleotide to GA+TC produces different effects. This indicates that the specific enrichment of GA+TC observed in Fig. 1D is a reflection of longer oligonucleotides rather than the dinucleotide itself. The lower panels in Fig. 2B show the heatmaps for the DegeTri containing CG; the addition of A or T (but not of C or G) gives an increasing trend, mainly in Sz in the human territory: from blue to white in the heatmaps.

Figure 2C shows the distribution of GAA+TTC and GAC+GTC, which are examples of the addition of one different nucleotide to GA+TC. In this figure, we purposely chose different chromosomes from those in Fig. 1E to present the results for diverse chromosomes. As also shown in Supplemental Fig. S5, in which the results for all human chromosomes are presented, GAA+TTC is clearly enriched in the centromeric and pericentromeric regions of all chromosomes, whereas GAC+GTC tends to be scarce in these regions. This confirms that the specific enrichment of GA+TC observed in Fig. 1D is a reflection of longer oligonucleotides rather than the dinucleotide itself.

Analyses of DegePenta for 3 bats and humans

By analyzing tetra- or pentanucleotides, relationships with biological functions (e.g., binding to proteins) can be examined more directly. Figure 3A shows the DegePenta-BLSOM of 1-Mb sequences with a 500-kb sliding step and the corresponding U-matrix. The pattern is simpler than that for DegeTri, and the number of black nodes is reduced, which may provide a more accurate understanding of the characteristics of each genome. However, since we are dealing with 512 variables, it is not easy to investigate the relationships with biological functions by analyzing the 512 heatmaps. In the following analyses, we therefore focus on oligonucleotides that are thought to be important from the perspective of biological functions. Figure 3B shows the BLSOM with CG-containing DegePenta for the three bats and humans. Although only 122 variables are used here, the separation according to species becomes clearer and the number of black nodes is reduced relative to the DegePenta-BLSOM with 512 variables. For human territories, the region judged to represent Sz by the U-matrix is separated far from the human main territory by several bat territories, showing that the CG-containing DegePenta composition of Sz sequences differs markedly from that of other human sequences. The heatmaps presented in Supplemental Fig. S6 show that a portion of the CG-containing oligonucleotides are enriched in a limited or wide area in Sz, and others are distributed rather evenly over a broad area in human and/or bat territories. Since these results are still complex, it is convenient to first analyze the human and bat genomes separately and then compare them.

Analyses of CG-containing DegePenta for human

In Fig. 3C, only human genome sequences are analyzed. Here, nodes that contain a mixture of sequences from different chromosomes are shown in black, and nodes that contain only sequences from a single chromosome are shown in chromosome-specific colors. Interestingly, in the Sz area on this BLSOM, there are a large number of colored nodes showing chromosome-dependent separation, but there is practically no separation in the major territory (black nodes). In addition, there are blank (colorless) nodes in Sz that do not contain any sequences. When sequences with clearly distinct compositions from others exist, no sequences are attributed to the nodes adjacent to those for the distinct sequences after machine learning, resulting in blank nodes [11-13]. This shows that the sequences in Sz should distinctly differ in their oligonucleotide composition not only from a majority of other human sequences but also from each other. Our previous distribution analysis of CG-containing oligonucleotides on human chromosomes [12,13] showed that when the sequence surrounding CG is composed primarily of A or T, the oligonucleotides are enriched in centromeric and pericentromeric heterochromatin regions, but if the sequences surrounding CG are composed primarily of G or C, the oligonucleotides tend to be enriched also in subtelomeric regions. The same tendency was of course observed in the present analysis, but when analyzing all 122 CG-containing DegePenta in detail, stricter sequence dependency that could not be explained merely by the composition of the surrounding mononucleotides was observed. Notably, only one-third of the CG-containing oligonucleotides were specifically enriched in Sz, and the patterns often differed completely even among highly similar oligonucleotide sequences (see heatmaps of Supplemental Fig. S6). This finding may be related, at least in part, to the presence of many TFBSs that contain the CG sequence, as mentioned later.

To clarify the enrichment of CG-containing DegePenta in Sz via conventional distribution analysis, we analyzed the distribution on each human chromosome as follows. First, the occurrence frequency (%) of each oligonucleotide in 1-Mb sequences was calculated across each chromosome, and oligonucleotides with high (Max-Ave)/Ave values were selected, where Ave is the average occurrence value of each oligonucleotide on each chromosome, and Max is the highest value for the oligonucleotide among all 1-Mb sequences on the chromosome. Division by the average value allowed us to exclude the case in which the reason for the high peak is a mere reflection of a high basal level across the chromosome. After sorting the obtained ratio, we examined the distribution of the occurrence frequency (%) of the top ten oligonucleotides on each chromosome. Figure 3E shows examples of four chromosomes with moderate chromosome lengths, and Supplemental Fig. S7 shows all chromosomes. Interestingly, for all chromosomes, the top ten oligonucleotides were enriched in centromeric and pericentromeric regions, and slight increases were observed in the subtelomeric region. In Fig. 3E, common symbols are used for different chromosomes, and it is therefore clear that the set of oligonucleotides included in the highest peak differs among chromosomes. This shows that the combination of enriched oligonucleotides in centromeric and pericentromeric regions clearly depends on the chromosome, and this is the reason for the existence of many colored nodes in Fig. 3C.
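The (Max-Ave)/Ave ranking described above can be sketched as follows. The input is assumed to be a mapping from each oligonucleotide to its occurrence frequencies in the successive 1-Mb windows of one chromosome; the function names and the example values are hypothetical.

```python
def peak_ratio(freqs):
    """(Max - Ave) / Ave for one oligonucleotide across the windows of a chromosome."""
    ave = sum(freqs) / len(freqs)
    if ave == 0:
        return 0.0  # oligonucleotide absent from the chromosome
    return (max(freqs) - ave) / ave

def top_oligos(profiles, n=10):
    """Oligonucleotides sorted by decreasing peak ratio, truncated to the top n."""
    ranked = sorted(profiles, key=lambda oligo: peak_ratio(profiles[oligo]), reverse=True)
    return ranked[:n]

# Made-up window frequencies (%) for illustration only.
profiles = {
    "AACGA": [1.0, 1.0, 1.0, 5.0],      # low baseline, sharp local peak
    "CCCGG": [10.0, 10.0, 10.0, 12.0],  # high baseline, modest peak
    "ACGTA": [2.0, 2.0, 2.0, 2.0],      # flat profile
}
print(top_oligos(profiles, n=2))
```

In this toy example the oligonucleotide with the low baseline and the sharp peak ranks first even though its maximum is lower in absolute terms, which is the effect that division by the average is meant to achieve.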

Methylation at C in CG dinucleotides is a typical epigenetic modification, and the binding of methyl-CpG-binding domain proteins (MBDs), as well as several structurally unrelated methyl-CpG-binding zinc-finger proteins, to methylated C bases induces histone deacetylation, subsequent chromatin condensation and heterochromatinization [25]. The human methyl-CpG-binding protein MeCP2 requires an A/T-rich sequence surrounding the methylated C for its binding and is involved in the formation of chromatin loops and the nuclear organization [23,25]. The chromosome-dependent enrichment of CG-containing oligonucleotides in centromeric and pericentromeric regions may be related to the differential sequence specificity of methyl-CpG-binding proteins.

Analyses of CG-containing DegePenta in the 3 bats

The CG-containing DegePenta-BLSOM for the three bats is shown in Fig. 3D. The separation by species is clear, indicating that the genomic composition of CG-containing DegePenta clearly differs by species. Figure 3F shows the distribution of the top ten oligonucleotides according to the ratio, (Max-Ave)/Ave, on long scaffolds. As observed in the CG distribution in Fig. 1F, Aeg shows a pronounced Mb-level peak at both ends and low peaks in the interior region. For Fer, the terminal peak is reduced in height to 30% of the level in Aeg, and clear internal peaks are observed, which is consistent with the results in Fig. 1F. To illustrate that these internal peaks are not limited to a specific scaffold, Fig. 3F shows the pattern of the second longest scaffold (Scaf2) for Fer, in which distinct internal peaks are also observed. In Supplemental Fig. S8, we show examples of the three longest scaffolds of all three bats, and evident terminal peaks and internal peaks are observed, although their height and thickness vary among species. It should also be noted that we have previously observed the Mb-level CpG islands near the ends of frog chromosomes [27], and therefore, the evolutionary processes that formed the Mb-level CpG islands are of interest.

DegeHexa-BLSOM and analyses of human DegeHexa TFBSs

The DegeHexa-BLSOM of 1-Mb sequences with a 500-kb sliding step for the three bats and humans is shown in Fig. 4A; the species-dependent separation pattern is clearer than that of the DegePenta-BLSOM in Fig. 3A. The human territory is largely divided into the main territory and the Sz territory defined by the U-matrix, and these two territories are separated by several bat territories, showing that the TFBS composition in Sz clearly differs from that in the main human territory. Since DegeHexa consists of 2080 variables, knowledge discovery from 2080 heatmaps is quite difficult, and we adopted the following strategy to investigate the relationships with biological functions. In the case of hexanucleotides, many TFBSs have been assigned to the human genome by the SwissRegulon Portal [26]. Transcription factors (TFs) are evolutionarily well conserved due to their functional importance, and a large number of human TFBSs are thought to be used in bat genomes as well. We therefore first performed a BLSOM analysis of the human genome alone, using only the 181 TFBSs assigned to the human genome by the SwissRegulon Portal (Fig. 4B); for the TFBS sequences, see Supplemental Table S1. The nodes are colored according to the chromosome, and in Sz, there are many colored and, thus, chromosomally separated nodes, showing that the TFBS composition of Sz sequences differs by chromosome.

Next, we analyzed the TFBS distribution on each chromosome. When searching for TFBSs with a high (Max-Ave)/Ave ratio, we observed that CG-containing TFBSs often present a high ratio, making it difficult to distinguish the effect of TFBS from that of the simple presence of CG. Therefore, in Fig. 4D, the top ten TFBSs that do not contain CG (w/o CG) are shown for four autosomes and two sex chromosomes; in Supplemental Fig. S9, distribution maps of the top 10 TFBSs are presented regardless of the presence or absence of CG for all chromosomes. Importantly, on all chromosomes, distinct peaks are observed in centromeric and pericentromeric regions (Fig. 4D and Supplemental Fig. S9): “Mb-level TFBS islands”. As shown in Fig. 4D, chrY, on which a specific AATGGA+TCCATT sequence is evidently enriched, clearly differs from the other chromosomes. By analyzing both hexa- and heptanucleotide TFBSs [13], we previously showed that the “Mb-level TFBS islands” are located in centromeric and pericentromeric constitutive heterochromatin regions, which are reported by Strachan and Read (2004) [28] and the UCSC genome browser (https://genome.ucsc.edu/). The positions of Sz sequences obtained from the BLSOM shown in Fig. 4B were found to be almost the same as those of centromeric and pericentromeric constitutive heterochromatin regions (Supplemental Table S2). Since all Sz sequences are derived from centromeric and pericentromeric heterochromatin regions, we could estimate the number of DegeHexa TFBSs enriched in these genomic regions simply by examining the red/blue heatmap level in Sz in the BLSOM presented in Fig. 4B, and we found that approximately 100 out of 181 TFBSs showed a distinct red color in Sz (Supplemental Fig. S11). This showed Mb-level enrichment in these regions for more than half of the TFBSs; some TFBSs were red in a large area in Sz (enrichment for many chromosomes), and others were locally red (enrichment for a limited number of chromosomes).

Analyses of TFBS-containing DegeHexa in the 3 bats

In Fig. 4C, we show the BLSOM of the 3 bats for 181 DegeHexa TFBSs assigned to the human genome. Although the separation of the microbat Myo from other bats is clear, there is a significant overlap between the microbat Fer and the megabat Aeg, which are closely related according to molecular phylogeny; most of the black nodes were found to be a mixture of the latter two sequences (data not shown). We then analyzed the distribution of the DegeHexa-TFBS frequency (per Mb) across long scaffolds to examine whether and where TFBS peaks exist in bat genomes (Fig. 4E). TFBSs were again sorted according to the (Max-Ave)/Ave ratio, and the top 10 TFBSs with and without CG are presented in the upper and lower panels, respectively. In the longest scaffold (Scaf1) of Fer, distinct peaks are observed for TFBSs containing CG (upper panel of Fig. 4E), and the internal peaks are observed at almost the same positions for TFBSs without CG (lower panel of Fig. 4E), although peak heights differ from each other. Similar results are observed for Fer scaffold 2 (Scaf2). For Aeg, the internal peaks of CG-containing TFBSs (Supplemental Fig. S8) are rather inconspicuous because of the evidently high occurrence of peaks at both ends. When the peak positions were examined with different vertical scales, the positions of these internal peaks were almost the same as those of TFBSs without CG (lower panel). In addition, we present an example from Mol, which shows relatively low terminal peaks and easily visible internal peaks. The results for the longest 3 scaffolds of all six bats are presented in Supplemental Fig. S10. The finding that various TFBSs with or without CG form peaks at almost the same positions in long bat scaffolds shows that Mb-level structures enriched in diverse TFBSs exist at internal chromosomal sites not only in the human genome but also in bat genomes, although the enrichment level in the bat genomes is clearly lower than that in the human genome.

Biological significance of Mb-level TFBS and CpG islands

In this comparative genome study, no models or hypotheses were set up in advance, and much of the knowledge discovery was left to the AI, especially at the beginning of the analysis. The most characteristic result obtained by this data-driven study is the remarkable enrichment of TFBSs in human centromeric and pericentromeric regions, which raises various questions. Although such issues appear to be outside the scope of a comparative genome study, we briefly discuss the possible function of Mb-level TFBS islands. By analyzing Hi-C data for Mb-level interchromosomal interactions, we previously found that chromatin segments supporting the interchromosomal interactions were primarily located in the Mb-level TFBS islands [13], and we therefore proposed that TFs, and thus TFBSs, are important in the formation of the 3D architecture of genomic DNA in interphase nuclei. Microscopy-based analyses, such as fluorescence in situ hybridization (FISH), have shown that the centromeric and pericentromeric regions of both homologous and nonhomologous chromosomes are associated at chromocenters in interphase nuclei, and that the size and number of chromocenters differ among tissues of the same organism; notably, the set of chromosomes involved in chromocenter formation (i.e., centromere clustering) differs between cell types [29-32]. The chromosome-dependent enrichment of a combination of TFBSs, as well as of CG-containing oligonucleotides, in centromeric and pericentromeric regions is thought to support cell type-dependent centromere clustering because the cellular contents of individual TFs, as well as the levels of CG methylation enzymes and methyl-CpG-binding proteins, are regulated in a cell type-dependent manner. The Mb-level structures found in the bat genomes may also be involved in the formation of nuclear 3D structures.

To consider the molecular mechanisms underlying the function of the Mb-level structures, we need to know the detailed sequence characteristics of the Mb-level islands. Our preliminary sequence-level analyses of Mb-level TFBS islands [13] indicated that the observed chromosome-dependent enrichment is related in part to chromosome-dependent differences in alpha-satellite monomer sequences [33-35]. The hexanucleotide composition of the alpha-satellite consensus sequence has been analyzed by several groups, who reported that the most evident hexanucleotide is GAAACA and that the others are AGAAAC, GAGCAG, AAACAC and AGAGAA [36, 37]. None of these five sequences, nor their complementary sequences, is included in the hexanucleotide TFBSs reported by the SwissRegulon Portal and used in the present study. Collectively, the enrichment of TFBSs found at and near the centromeres is chromosome-dependent and is not a feature of the alpha-satellite consensus sequence.
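The comparison between the alpha-satellite consensus hexanucleotides and the TFBS set can be sketched as follows. This is an illustration only, assuming that "complementary sequences" means reverse complements. The five hexanucleotides are those listed above, whereas `example_tfbs` is a hypothetical placeholder and not the SwissRegulon-derived list used in the present study; `overlap_with_tfbs` is likewise a hypothetical helper name.

```python
from typing import Dict, Iterable, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hexanucleotides reported for the alpha-satellite consensus sequence.
ALPHA_SAT_HEXAMERS = ["GAAACA", "AGAAAC", "GAGCAG", "AAACAC", "AGAGAA"]


def revcomp(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def overlap_with_tfbs(hexamers: Iterable[str],
                      tfbs_set: Iterable[str]) -> Dict[str, List[str]]:
    """Map each hexamer to the TFBS entries matching it or its reverse complement.

    Hexamers with no match are omitted, so an empty dict means that no
    hexamer (in either orientation) is present in the TFBS set.
    """
    tfbs = {t.upper() for t in tfbs_set}
    hits: Dict[str, List[str]] = {}
    for hexamer in hexamers:
        h = hexamer.upper()
        found = sorted({h, revcomp(h)} & tfbs)
        if found:
            hits[hexamer] = found
    return hits


# Hypothetical placeholder set, NOT the TFBS list used in the study.
example_tfbs = {"CACGTG", "TGACGT"}
result = overlap_with_tfbs(ALPHA_SAT_HEXAMERS, example_tfbs)
```

With the actual hexanucleotide TFBS list in place of `example_tfbs`, an empty `result` would correspond to the statement that none of the five sequences, nor their complements, is among the TFBSs analyzed.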

SARS-CoV-2 and TFBSs

In the Introduction section, we mentioned that one of the goals of comparative genomic analysis of the bat and human genomes is to support research on the molecular evolution of SARS-CoV-2 after its invasion of the human population. The present study has not yet been linked to viral evolutionary studies and is limited to a comparison of the host genomes themselves. Recently, an informatics search for TFBSs that are present in the SARS-CoV-2 genome but absent in the bat coronavirus genome reported 22 TFBSs that are presumed to facilitate viral replication [38]. All of these TFBSs are longer than hexanucleotides, and therefore, their relationship to the TFBSs in the present analysis is unclear. By extending our analysis to TFBSs with longer sequences, we will be able to characterize in detail the TFBSs that are absent from bat-derived coronaviruses but present in human-derived viruses, as well as those whose occurrence is increasing in the SARS-CoV-2 genome during human-to-human transmission; we are planning to conduct such research as a separate project.

Conclusion

By combining the oligonucleotide BLSOM and the distribution map method, we conducted a comparative genome study of humans and bats; bats are considered to be natural hosts of emerging infectious RNA viruses, such as the SARS-CoV-2 and Ebola viruses, and are thus of high biological and societal interest. We found Mb-level enrichment of CG (Mb-level CpG islands) around the termini of long bat scaffold sequences and, thus, most likely in subtelomeric regions. We also found chromosome-dependent Mb-level enrichment of a set of TFBSs (Mb-level TFBS islands) in centromeric and pericentromeric heterochromatin regions of human chromosomes. The Mb-level CpG and TFBS islands are thought to play an important role in the formation of the nuclear 3D structure of genomic DNA.

Abbreviations

AI: artificial intelligence

SOM: self-organizing map

BLSOM: batch-learning self-organizing map

TFBS: transcription factor binding sequence

DegeDi: degenerate dinucleotide

DegeTri: degenerate trinucleotide

DegePenta: degenerate pentanucleotide

DegeHexa: degenerate hexanucleotide

Sz: specific zone

Aeg: Rousettus aegyptiacus

Fer: Rhinolophus ferrumequinum

Myo: Myotis myotis

Dis: Phyllostomus discolor

Kuh: Pipistrellus kuhlii

Mol: Molossus molossus

Scaf1: the longest scaffold

Scaf2: the second longest scaffold

Ethics approval and consent to participate

We did not use any data pertaining to individuals.

Consent for publication

Not applicable

Availability of Data and Materials

All genomic sequences used in this study are available at the UCSC ftp site (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/) and at https://bds.mpi-cbg.de/hillerlab/Bat1KPilotProject/.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported by JSPS KAKENHI Grant Number 18K07151 and CREST Grant Number JPMJCR20H1.

Author contributions

T.I. and T.A. designed this project, and Y.I. conducted the genome analysis. K.W., Y.W. and Y.I. developed computer programs. T.I. wrote the manuscript. All authors participated in discussions throughout the project.

Acknowledgements

We gratefully acknowledge the valuable comments of Dr. Masami Hasegawa, Professor Emeritus of The Institute of Statistical Mathematics (Tokyo).

References

1 Letko M, Seifert SN, Olival KJ, Plowright RK, Munster VJ. Bat-borne virus diversity, spillover and emergence. Nat Rev Microbiol. 2020;18:461-471.

2 Mercatelli D, Giorgi FM. Geographic and genomic distribution of SARS-CoV-2 mutations. Front Microbiol. 2020; doi: 10.3389/fmicb.2020.01800.

3 Wada K, Wada Y, Ikemura T. Time-series analyses of directional sequence changes in SARS-CoV-2 genomes and an efficient search method for candidates for advantageous mutations for growth in human cells. Gene X. 2020; doi: 10.1016/j.gene.2020.100038.

4 Iwasaki Y, Abe T, Ikemura T. Human cell-dependent, directional, time-dependent changes in the mono- and oligonucleotide compositions of SARS-CoV-2 genomes. BMC Microbiol. 2021; doi: 10.1186/s12866-021-02158-6.

5 Ikemura T, Wada K, Wada Y, Iwasaki Y, Abe T. AI for the collective analysis of a massive number of genome sequences: various examples from the small genome of pandemic SARS-CoV-2 to the human genome. Genes Genet Syst. 2021;96:1-12.

6 Simmonds P. Rampant C→U hypermutation in the genomes of SARS-CoV-2 and other coronaviruses: causes and consequences for their short- and long-term evolutionary trajectories. mSphere. 2021; doi: 10.1128/mSphere.00408-20.

7 Jebb D, Huang Z, Pippel M, Hughes GM, et al. Six reference-quality genomes reveal evolution of bat adaptations. Nature. 2020;583:578-584.

8 Ratcliff J, Simmonds P. Potential APOBEC-mediated RNA editing of the genomes of SARS-CoV-2 and other coronaviruses and its impact on their longer term evolution. Virology. 2021;556:62-72.

9 Karlin S, Campbell AM, Mrázek J. Comparative DNA analysis across diverse genomes. Annu Rev Genet. 1998;32:185-225.

10 Bernardi G, Olofsson B, Filipski J, et al. The mosaic genome of warm-blooded vertebrates. Science. 1985;228:953-958.

11 Iwasaki Y, Wada K, Wada Y, Abe T, Ikemura T. Notable clustering of transcription-factor-binding motifs in human pericentric regions and its biological significance. Chromosome Res. 2013;21:461-474.

12 Wada Y, Iwasaki Y, Abe T, Wada K, Tooyama I, Ikemura T. CG-containing oligonucleotides and transcription factor-binding motifs are enriched in human pericentric regions. Genes Genet Syst. 2015;90:43-53.

13 Wada K, Wada Y, Ikemura T. Mb-level CpG and TFBS islands visualized by AI and their roles in the nuclear organization of the human genome. Genes Genet Syst. 2020;95:29-41.

14 Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, Ikemura T. Informatics for unveiling hidden genome signatures. Genome Res. 2003;13:693-702.

15 Abe T, Sugawara H, Kinouchi M, Kanaya S, Ikemura T. Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples. DNA Res. 2005;12:281-290.

16 Abe T, Sugawara H, Kanaya S, Kinouchi M, Ikemura T. Self-Organizing Map (SOM) unveils and visualizes hidden sequence characteristics of a wide range of eukaryote genomes. Gene. 2006;365:27-34.

17 Kohonen T. The self-organizing map. Proceedings of the IEEE. 1990;78:1464-1480.

18 Kanaya S, Kinouchi M, Abe T, et al. Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene. 2001;276:89-99.

19 Arnold P, Erb I, Pachkov M, Molina N, van Nimwegen E. MotEvo: integrated Bayesian probabilistic methods for inferring regulatory sites and motifs on multiple alignments of DNA sequences. Bioinformatics. 2012;28:487-494.

20 Bernardi G. Structural and evolutionary genomics: natural selection in genome evolution. Elsevier, Amsterdam; New York; 2004.

21 Ikemura T. Codon usage and tRNA content in unicellular and multicellular organisms. Mol Biol Evol. 1985;2:13-34.

22 Ultsch A. Self organized feature maps for monitoring and knowledge acquisition of a chemical process. In: Gielen S, Kappen B, editors. Proc. ICANN'93, Int. Conf. on Artificial Neural Networks. 1993. p. 864-867.

23 Klose RJ, Sarraf SA, Schmiedeberg L, McDermott SM, Stancheva I, Bird AP. DNA binding selectivity of MeCP2 due to a requirement for A/T sequences adjacent to methyl-CpG. Mol Cell. 2005;19:667-678.

24 Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev. 2011;25:1010-1022.

25 Bogdanović O, Veenstra GJ. DNA methylation and methyl-CpG binding proteins: developmental requirements and function. Chromosoma. 2009;118:549-565.

26 Pachkov M, Balwierz PJ, Arnold P, Ozonov E, van Nimwegen E. SwissRegulon, a database of genome-wide annotations of regulatory sites: recent updates. Nucl. Acids Res. 2013;41:D214-220.

27 Katsura Y, Ikemura T, Kajitani R, et al. Comparative genomics of Glandirana rugosa using unsupervised AI reveals a high CG frequency. Life Sci Alliance. 2021;4:e202000905. doi: 10.26508/lsa.202000905.

28 Strachan T, Read A. Human Molecular Genetics, 3rd ed. Garland Publishing. NY; 2004.

29 Maison C, Almouzni G. HP1 and the dynamics of heterochromatin maintenance. Nat Rev Mol Cell Biol. 2004;5:296-304.

30 Probst AV, Okamoto I, Casanova ME, Marjou FE, Le Baccon P, Almouzni G. A strand-specific burst in transcription of pericentric satellites is required for chromocenter formation and early mouse development. Dev Cell. 2010;19:625-638.

31 Probst AV, Almouzni G. Heterochromatin establishment in the context of genome-wide epigenetic reprogramming. Trends Genet. 2011;27:177-185.

32 Saksouk N, Simboeck E, Déjardin J. Constitutive heterochromatin formation and transcription in mammals. Epigenetics Chromatin. 2015; doi: 10.1186/1756-8935-8-3.

33 Hayden KE, Strome ED, Merrett SL, Lee HR, Rudd MK, Willard HF. Sequences associated with centromere competency in the human genome. Mol Cell Biol. 2013;33:763-772.

34 Aldrup-MacDonald ME, Kuo ME, Sullivan LL, Chew K, Sullivan BA. Genomic variation within alpha satellite DNA influences centromere location on human chromosomes with metastable epialleles. Genome Res. 2016;26:1301-1311.

35 Sullivan LL, Chew K, Sullivan BA. Alpha satellite DNA variation and function of the human centromere. Nucleus. 2017;8:331-339.

36 Choo KH, Vissel B, Nagy A, Earle E, Kalitsis P. A survey of the genomic distribution of alpha satellite on all the human chromosomes, and derivation of a new consensus sequence. Nucl. Acids Res. 1991;19:1179-1182.

37 Paar V, Pavin N, Rosandić M, et al. ColorHOR: novel graphical algorithm for fast scan of alpha satellite higher-order repeats and HOR annotation for GenBank sequence of human genome. Bioinformatics. 2005;21:846-852.

38 di Bari I, Franzin R, Picerno A, et al. Severe acute respiratory syndrome coronavirus 2 may exploit human transcription factors involved in retinoic acid and interferon-mediated response: a hypothesis supported by an in silico analysis. New Microbes New Infect. 2021; doi: 10.1016/j.nmni.2021.100853.


Editorial decision: Major revision
04 Apr, 2022
Reviews received at journal
14 Mar, 2022
Reviewers agreed at journal
03 Mar, 2022
Reviewers invited by journal
03 Mar, 2022
Editor assigned by journal
03 Mar, 2022
Editor invited by journal
14 Feb, 2022
Submission checks completed at journal
14 Feb, 2022
First submitted to journal
03 Feb, 2022
