DegeDinucleotide BLSOM for bat and human genomes
The genome sequence registered in public DNA databases (e.g., International Nucleotide Sequence Database Collaboration; https://www.insdc.org/about) is only one strand of complementary sequences, and the strand is chosen rather arbitrarily in the registration of fragment sequences, including scaffold sequences. In other words, the distinction between complementary sequences is usually not important for understanding the global characteristics of genome sequences. Here, two complementary nucleotides (e.g., AA and TT) are added together and referred to as degenerate dinucleotides (DegeDi) [15-16]. Figure 1A is a BLSOM of the DegeDi composition of 100-kb fragments from six bat genomes; the number of nodes is set so that an average of 20 sequences are attributed to each node (grid point). Importantly, no information other than the oligonucleotide composition is provided during machine learning: unsupervised AI. In the figure, nodes containing sequences of a single species are shown in the same color to indicate the species, and nodes containing sequences of multiple species are displayed in black. There are many black nodes, and the separation of each species is not good under these conditions, showing that the simultaneous use of six bat genomes complicates the results. Therefore, in constructing DegeDi-BLSOM for comparative genomics including the human genome, we selected the following three bat species: the megabat Rousettus aegyptiacus (Aeg), the microbat Rhinolophus ferrumequinum (Fer), which is closely related to the host species of the bat SARS strain that is thought to be the origin of SARS-CoV-2, and the microbat Myotis myotis (Myo); the bats that are not used in this BLSOM analysis will be used in later distribution analyses of oligonucleotide frequencies. Figure 1Bi shows the BLSOM results for the three bats and humans; in Fig. 1Bii, the sequences attributed to each node are colored according to the species with the largest number of sequences. Even under these analysis conditions, the separation between bats and humans is not good, as shown in Fig. 1Bi.
When comparing genomes, the short fragmentation of each genome is likely to complicate the BLSOM analysis for various reasons, e.g., some fragments primarily contain protein-coding sequences, but others do not. By extending the fragment size, the effect caused by fragmentation can be reduced, and the comparison between genomes will become easier. In Fig. 1Ci, the fragmentation window size is set to 1 Mb, and the window is moved in 100-kb steps to reduce the effects of the 1-Mb cutting positions. The number of sequences used for BLSOM is almost the same as in the analysis shown in Fig. 1Bi, and the effects only of increasing the fragmentation size can be easily detected; the number of black nodes is decreased, and the separation between species becomes clearer. Human and bat genomes show clear separation, but for the genomes of both species, there is broad horizontal distribution and subdivision of each species territory, which seems to be related to mosaic G+C% structures typically observed in higher vertebrates [10,20,21]. Since the separation pattern is simpler than that under 100-kb fragmentation (Fig. 1B), 1-Mb fragmentation is used in subsequent analyses.
The BLSOM is equipped with a tool known as the U-matrix , which displays the Euclidean distances between representative vectors of neighboring nodes as the degree of blackness; a larger distance is reflected by a higher degree of blackness (Fig. 1Ciii). A distinct black zone is observed in the upper right part of the human territory, indicating that the oligonucleotide compositions of the sequences located in that region not only differ distinctly from the majority of human sequences but also differ from each other. In our previous BLSOM analysis of the human genome, which mainly focused on tetra- to hexanucleotides, we found a similar conspicuous black zone of the U-matrix; we called this zone the “specific zone” (Sz), and we reported that the sequences derived from centromeric and pericentromeric heterochromatin regions were clustered there [5,11-13]. In the present study, the similar black region found by the U-matrix is also referred to as Sz. The BLSOM is an explainable-type AI, which provides us with the reason why sequences have been separated (self-organized), by using a red/blue heatmap (Fig. 1D) [14-16]; the contribution level of each oligonucleotide at each node can be visualized by color: red (high), white (moderate) and blue (low). Interestingly, GA+TC shows a distinctive red/blue distribution that stands out from that of other dinucleotides, with a major dark red area in Sz; in more detail, the occurrence frequency is low (blue in Fig. 1D) in most regions, including the bat territories, but evident enrichment (dark red) can be seen in the majority of Sz. Importantly, this is not a property derived from the mononucleotide composition because the distribution is very different from that of AG+CT, which has the same mononucleotide composition. Some enrichment (though not as conspicuous as that of GA+TC) of other oligonucleotides (e.g., AC+GT) is also observed in Sz.
DegeDinucleotide distribution on human chromosomes
To investigate the findings made with the BLSOM in detail, we next use a more direct analysis method, a standard distribution map. First, to test the local, evident enrichment of GA+TC in the human genome, which can be predicted from the dark red in Sz, we examined the distribution of the occurrence frequency (%) of the dinucleotide on each human chromosome, as well as AC+GT for comparison, in 1-Mb sliding windows with a 100-kb step. Figure 1E provides examples of two autosomal chromosomes (metacentric chr1 and acrocentric chr14) and two sex chromosomes (chrX and Y). The results for the other chromosomes are shown in Supplemental Fig. S1. Human centromeric and pericentromeric heterochromatin regions are composed of highly repetitive sequences and are incompletely sequenced even in the currently available GRCh38/hg38 version and the unsequenced region is visualized as an open space in Fig. 1E; the central position of the primary constriction of each chromosome shown on the UCSC Genome Brower (https://genome.ucsc.edu/) is indicated by the magenta vertical bar, and the genomic positions of centromeric and pericentromeric heterochromatin regions are presented in Supplemental Table S2. Notably, on chr1, there is a large constitutive heterochromatin region (1q12) adjacent to the centromeric region on the long arm (UCSC genome browser, https://genome.ucsc.edu/), and this region has not been sequenced; on acrocentric chr14, the large heterochromatin region on the centromeric side is also left unsequenced and thus blank.
Interestingly, GA+TC was clearly enriched in centromeric and pericentromeric regions of not only the chromosomes shown here but all chromosomes (Supplemental Fig. S1), showing very peculiar oligonucleotide compositions in these regions. These are consistent with our previous finding that Sz in the U-matrix is composed of sequences in centromeric and pericentromeric heterochromatin regions [11-13]. AC+GT was enriched in the centromeric and pericentromeric regions only of some chromosomes, and the enrichment, when present, was less evident than that of GA+TC. Furthermore, on many chromosomes, AC+GT tended to be rather avoided in the interested regions.
DegeDi-distribution on bat scaffolds
The genome G+C% of both humans and bats is close to 40%, and the number of sequences enriched in C and G is therefore small. The red zone in Fig. 1D for dinucleotides consisting of only C or G is restricted to the left side, and the CC+GG content is high (red) in a somewhat broadened area on the left side, but the GC content is high in the middle to lower part of the left side. However, the CG content is high only in the lower part on the left side and extends significantly into the Aeg territory; some other bat sequences are also located around of this red region. Considering CG suppression (i.e., CG deficiency) , which is typically observed in mammalian genomes and is thought to be related to the C-methylation of CG, the local enrichment in CG in bat territories, especially in the Aeg territory (Fig. 1D), is interesting. We then examined the distribution of the CG frequency in bat genomes to clarify the chromosomal locations of CG-enriched sequences. Figure 1F shows the distribution of the CG composition (%) in 1-Mb fragments with a 100-kb sliding step for the long scaffolds of three bats. The most distinct CG peaks were observed at both ends of the longest and second longest scaffold (Scaf1 and 2) of Aeg; since only long scaffolds were used in the distribution analysis, these are likely to correspond to chromosomes but are called scaffolds according to the original paper . In the long scaffolds of other bat species, distinct peaks were also observed, at least at one end (Fig. 1F and Supplemental Fig. S2), but the degree of enrichment was less than that in Aeg, which is consistent with the heatmap result in Fig. 1D. Since the evident peak was composed of data of many 1-Mb sequences, it appears to be appropriate to call it an “Mb-level CpG island”. It should also be noted that in regions other than the scaffold end Fer shows distinct internal peaks, which will be mentioned later.
CpG islands, which play crucial roles in transcriptional regulation, are typically a few hundred bp in length and preferentially occur in gene regulatory regions . By zooming out from the hundred-bp level to the Mb level, conventional CpG islands become inconspicuous, and Mb-level peaks become prominent in bat genomes. In view of the biological importance of CpG islands, we refer to these large-scale structures as “Mb-level CpG islands” . Notably, in a wide range of vertebrate genomes, including the human genome, it has long been known that the G+C% tends to be higher toward the end of their chromosomes . If we use the term CpG island, we should verify that the Mb-level CG enrichment is not a mere reflection of increase in the G+C%. By analyzing the distribution of the odds ratio (Obs/Exp) of CG, which is obtained by dividing the observed occurrence of CG by its expected value according to the mononucleotide composition, evident peaks are observed in the terminal region of the long scaffolds of bat genomes (Supplemental Fig. S3). It is, therefore, appropriate to call the peaks “Mb-level CpG islands”.
DegeTri-BLSOM for 3 bats and humans
In considering the biological significance of the Mb-level peaks obtained from the dinucleotide analysis, we next examined whether the Mb-level enrichment represents the properties of the dinucleotide itself or is a reflection of properties of longer oligonucleotides. Figure 2A shows the BLSOM with the degenerate trinucleotide (DegeTri) compositions of humans and the three bats. The upper panels in Fig. 2B show the heatmap of the DegeTri containing GA (and, thus, TC). The heatmaps of all DegeTri (Supplemental Fig. S4) also show that the addition of a different nucleotide to GA+TC produces different effects. This shows that the specific enrichment of GA+TC observed in Fig. 1D is a reflection of longer oligonucleotide rather than the dinucleotide itself. Lower panels in Fig. 2B shows the heatmap for the DegeTri containing CG; addition of A or T (but not of C or G) gives an increasing trend, mainly in Sz in the human territory: from blue to white in the heatmaps.
Figure 2C shows the distribution of GAA+TTC and GAC+GTC, which are examples of the addition of one different nucleotide to GA+TC. In this figure, we purposely chose different chromosomes from those in Fig. 1E to present the results of diverse chromosomes. As also shown in Supplemental Fig. S5, in which the results for all human chromosomes are presented, GAA+TTC is clearly enriched in centromeric and pericentromeric regions of all chromosomes, whereas GAC+GTC tends to scarce in these regions. This shows that the specific enrichment of GA+TC observed in Fig. 1D is a reflection of longer oligonucleotides rather than the dinucleotide itself.
Analyses of DegePenta for 3 bats and humans
By analyzing tetra- or pentanucleotides, relationships with biological functions (e.g., binding to proteins) can be examined more directly. Figure 3A shows the DegePenta-BLSOM of 1-Mb sequences with a 500-kb sliding step and the corresponding U-matrix. The pattern is simpler than that for DegeTri, and the number of black nodes is reduced, which may provide a more accurate understanding of the characteristics of each genome. However, since we are dealing with 512 variables, it is not easy to investigate the relationships with biological functions by analyzing the 512 heatmaps. In the following analyses, we therefore attempt to focus on oligonucleotides, which are thought to be important from the perspective of biological functions. Figure 3B shows the BLSOM with CG-containing DegePenta for the three bats and humans. While 122 variables are used here, the separation according to species becomes clearer and the number of black nodes is reduced relative to the DegePenta-BLSOM with 512 variables. For human territories, the region judged to represent Sz by the U-matrix is separated far from the human main territory by several bat territories, showing that the CG-containing DegePenta composition in Sz sequences should differ markedly from that of other human sequences. The heatmaps presented in Supplemental Fig. S6 show that a portion of the CG-containing oligonucleotides are enriched in a limited or wide area in Sz, and others are distributed rather evenly over a broad area in human and/or bat territories. Since these results are still complex, it appears to be convenient to first analyze the human and bat genomes separately and then compare them.
Analyses of CG-containing DegePenta for human
In Fig. 3C, only human genome sequences are analyzed. Here, nodes that contain a mixture of sequences from different chromosomes are shown in black, and nodes that contain only sequences from a single chromosome are shown in chromosome-specific colors. Interestingly, in the Sz area on this BLSOM, there are a large number of colored nodes showing chromosome-dependent separation, but there is practically no separation in the major territory (black nodes). In addition, there are blank (colorless) nodes in Sz that do not contain any sequences. When sequences with clearly distinct compositions from others exist, no sequences are attributed to the nodes adjacent to those for the distinct sequences after machine learning, resulting in blank nodes [11-13]. This shows that the sequences in Sz should distinctly differ in their oligonucleotide composition not only from a majority of other human sequences but also from each other. Our previous distribution analysis of CG-containing oligonucleotides on human chromosomes [12,13] showed that when the sequence surrounding CG is composed primarily of A or T, the oligonucleotides are enriched in centromeric and pericentromeric heterochromatin regions, but if the sequences surrounding CG are composed primarily of G or C, the oligonucleotides tend to be enriched also in subtelomeric regions. The same tendency was of course observed in the present analysis, but when analyzing all 122 CG-containing DegePenta in detail, stricter sequence dependency that could not be explained merely by the composition of the surrounding mononucleotides was observed. Notably, only one-third of the CG-containing oligonucleotides were specifically enriched in Sz, and the patterns often differed completely even among highly similar oligonucleotide sequences (see heatmaps of Supplemental Fig. S6). This finding may be related, at least in part, to the presence of many TFBSs that contain the CG sequence, as mentioned later.
To clarify the enrichment of CG-containing DegePenta in Sz via conventional distribution analysis, we analyzed the distribution on each human chromosome as follows. First, the occurrence frequency (%) of each oligonucleotide in 1-Mb sequences was calculated across each chromosome, and oligonucleotides with high (Max-Ave)/Ave values were selected, where Ave is the average occurrence value of each oligonucleotide on each chromosome, and Max is the highest value for the oligonucleotide among all 1-Mb sequences on the chromosome. Division by the average value allowed us to exclude the case in which the reason for the high peak is a mere reflection of a high basal level across the chromosome. After sorting the obtained ratio, we examined the distribution of the occurrence frequency (%) of the top ten oligonucleotides on each chromosome. Figure 3E shows examples of four chromosomes with moderate chromosome lengths, and Supplemental Fig. S7 shows all chromosomes. Interestingly, for all chromosomes, the top ten oligonucleotides were enriched in centromeric and pericentromeric regions, and slight increases were observed in the subtelomeric region. In Fig. 3E, common symbols are used for different chromosomes, and it is therefore clear that the set of oligonucleotides included in the highest peak differs among chromosomes. This shows that the combination of enriched oligonucleotides in centromeric and pericentromeric regions clearly depends on the chromosome, and this is the reason for the existence of many colored nodes in Fig. 3C.
Methylation at C in CG dinucleotides is a typical epigenetic modification, and the binding of methyl-CpG-binding domain proteins (MBDs), as well as several structurally unrelated methyl-CpG-binding zinc-finger proteins, to methylated C bases induces histone deacetylation, subsequent chromatin condensation and heterochromatinization . The human methyl-CpG-binding protein MeCP2 requires an A/T-rich sequence surrounding the methylated C for its binding and is involved in the formation of chromatin loops and the nuclear organization [23,25]. The chromosome-dependent enrichment of CG-containing oligonucleotides in centromeric and pericentromeric regions may be related to the differential sequence specificity of methyl-CpG-binding proteins.
Analyses of CG-containing DegePenta in the 3 bats
The CG-containing DegePenta-BLSOM for the three bats is shown in Fig. 3D. The separation by species is clear, indicating that the genomic composition of CG-containing DegePenta clearly differs by species. Figure 3F shows the distribution of the top ten oligonucleotides according to the ratio, (Max-Ave)/Ave, on long scaffolds. As observed in the CG distribution in Fig. 1F, Aeg shows a pronounced Mb-level peak at both ends and low peaks in the interior region. For Fer, the terminal peak is reduced in height to 30% of the level in Aeg, and clear internal peaks are observed, which is consistent with the results in Fig. 1F. To illustrate that these internal peaks are not limited to a specific scaffold, Fig. 3F shows the pattern of the second longest scaffold (Scaf2) for Fer, in which distinct internal peaks are also observed. In Supplemental Fig. S8, we show examples of the three longest scaffolds of all three bats, and evident terminal peaks and internal peaks are observed, although their height and thickness vary among species. It should also be noted that we have previously observed the Mb-level CpG islands near the ends of frog chromosomes , and therefore, the evolutionary processes that formed the Mb-level CpG islands are of interest.
DegeHexa-BLSOM and analyses of human DegeHexa TFBSs
The DegeHexa-BLSOM of 1-Mb sequences with a 500-kb sliding step for the three bats and humans is shown in Fig. 4A; the species-dependent separation pattern is clearer than that of DegePenta-BLSOM in Fig. 3A. The human territory is largely divided into the main territory and the Sz territory defined by the U-matrix, and these two territories are separated by several bat territories, showing that the TFBS composition in Sz clearly differs from that in the main human territory. Since DegeHexa consists of 2080 variables, knowledge discovery from 2080 heatmaps is quite difficult, and we adopted the following strategy to investigate the relationships with biological functions. In the case of hexanucleotides, many TFBSs have been assigned to the human genome by the SwissRegulon Portal . Transcription factors (TFs) are evolutionarily well preserved due to their functional importance, and a large number of human TFBSs are thought to be used also in bat genomes. Then, we performed initially a BLSOM analysis of 181 TFBSs only for the human genome (Fig. 4B), which have been assigned to the human genome by the SwissRegulon Portal; for the TFBS sequences, see Supplemental Table S1. The nodes are colored according to the chromosome, and in Sz, there are many colored and, thus, chromosomally separated nodes, showing that the TFBS composition in Sz sequences differs by chromosome.
Next, we analyzed the TFBS distribution on each chromosome. When searching for TFBSs with a high (Max-Ave)/Ave ratio, we observed that CG-containing TFBSs often present a high ratio, making it difficult to distinguish the effect of TFBS from that of the simple presence of CG. Therefore, in Fig. 4D, the top ten TFBSs that do not contain CG (w/o CG) are shown for four autosomes and two sex chromosomes; in Supplemental Fig. S9, distribution maps of the top 10 TFBSs are presented regardless of the presence or absence of CG for all chromosomes. Importantly, on all chromosomes, distinct peaks are observed in centromeric and pericentromeric regions (Fig. 4D and Supplemental Fig. S9): “Mb-level TFBS islands”. As shown in Fig. 4D, chrY, on which a specific AATGGA+TCCATT sequence is evidently enriched, clearly differs from the other chromosomes. By analyzing both hexa- and heptanucleotide TFBSs , we previously showed that the “Mb-level TFBS islands” are located in centromeric and pericentromeric constitutive heterochromatin regions, which are reported by Strachan and Read (2004)  and the UCSC genome browser (https://genome.ucsc.edu/). The positions of Sz sequences obtained from the BLSOM shown in Fig. 4B were found to be almost the same as those of centromeric and pericentromeric constitutive heterochromatin regions (Supplemental Table S2). Since all Sz sequences are derived from centromeric and pericentromeric heterochromatin regions, we could estimate the number of DegeHexa TFBSs enriched in these genomic regions simply by examining the red/blue heatmap level in Sz in the BLSOM presented in Fig. 4B, and we found that approximately 100 out of 181 TFBSs showed a distinct red color in Sz (Supplemental Fig. S11). This showed Mb-level enrichment in these regions for more than half of the TFBSs; some TFBSs were red in a large area in Sz (enrichment for many chromosomes), and others were locally red (enrichment for a limited number of chromosomes).
Analyses of TFBS-containing DegeHexa in the 3 bats
In Fig. 4C, we show the BLSOM of the 3 bats for 181 DegeHexa TFBSs assigned to the human genome. Although the separation of the microbat Myo from other bats is clear, there is a significant overlap between the microbat Fer and the megabat Aeg, which are closely related according to molecular phylogeny; most of the black nodes were found to be a mixture of the latter two sequences (data not shown). We then analyzed the distribution of the DegeHexa-TFBS frequency (per Mb) across long scaffolds to examine whether and where TFBS peaks exist in bat genomes (Fig. 4E). TFBSs were again sorted according to the (Max-Ave)/Ave ratio, and the top 10 TFBSs with and without CG are presented in the upper and lower panels, respectively. In the longest scaffold (Scaf1) of Fer, distinct peaks are observed for TFBSs containing CG (upper panel of Fig. 4E), and the internal peaks are observed at almost the same positions for TFBSs without CG (lower panel of Fig. 4E), although peak heights differ from each other. Similar results are observed for Fer scaffold 2 (Scaf2). For Aeg, the internal peaks of CG-containing TFBSs (Supplemental Fig. S8) are rather inconspicuous because of the evidently high occurrence of peaks at both ends. When the peak positions were examined with different vertical scales, the positions of these internal peaks were almost the same as those of TFBSs without CG (lower panel). In addition, we present an example from Mol, which shows relatively low terminal peaks and easily visible internal peaks. The results for the longest 3 scaffolds of all six bats are presented in Supplemental Fig. S10. The finding that various TFBSs with or without CG form peaks at almost the same positions in long bat scaffolds shows that Mb-level structures enriched in diverse TFBSs exist at internal chromosomal sites not only in the human genome but also in bat genomes, although the enrichment level in the bat genomes is clearly lower than that in the human genome.
Biological significance of Mb-level TFBS and CpG islands
In this comparative genome study, without setting up models or hypotheses, a main part of knowledge discovery is left to AI, especially at the beginning of the analysis. The most characteristic result obtained by this data-driven study is remarkable enrichments of TFBSs in human centromeric and pericentromeric regions, which raises various questions for us. Although such issues appear to be outside the scope of the comparative genome study, we will briefly discuss the possible function of Mb-level TFBS islands. By analyzing Hi-C data for Mb-level interchromosomal interactions, we found that chromatin segments supporting the interchromosomal interactions were primarily located in the Mb-level TFBS islands , and we therefore proposed that TFs and thus TFBSs are important in the formation of the 3D architecture of genomic DNAs in interphase nuclei. Microscopy-based analyses, such as fluorescence in situ hybridization (FISH), have shown that the centromeric and pericentromeric regions of both homologous and nonhomologous chromosomes are associated at a chromocenter in interphase nuclei, and its size and number differ among tissues of the same organism; notably, a set of chromosomes is involved in chromocenter formation (i.e., centromere clustering) differs between cell types [29-32]. The chromosome-dependent enrichment of a combination of TFBSs, as well as CG-containing oligonucleotides, in centromeric and pericentromeric regions is thought to be involved in supporting cell type-dependent centromere clustering because the cellular contents of individual TFs, as well as the levels of CG methylation enzymes and methyl-CpG-binding proteins, are regulated in a cell type-dependent manner. The Mb-level structures found in the bat genome may also be involved in the formation of nuclear 3D structures.
To consider the molecular mechanisms for the function of the Mb-level structures, we must know the detailed sequence characteristics of the Mb-level islands. Our preliminary sequence-level analyses of Mb-level TFBS islands  indicated that one reason for the observed chromosome-dependent enrichment is related to the chromosome-dependent difference in alpha-satellite monomer sequences [33-35]. Notably, the hexanucleotide composition of the consensus sequence of alpha-satellite sequences has been analyzed by several groups, and it has been reported that the most evident sequence is GAAACA, and the others are AGAAAC, GAGCAG, AAACAC and AGAGAA [36, 37]. None of the five sequences found in the consensus sequence, nor their complementary sequences, are included in the hexanucleotide TFBSs, which were reported by the SwissRegulon Portal and thus used in the present study. Collectively, the enrichment of TFBSs found at and near the centromeres is chromosome-dependent and not a feature of the consensus sequence of alpha-satellite sequences.
SARS-CoV-2 and TFBSs
In the Introduction section, we mentioned that one of the goals of comparative genomic analysis of the bat and human genomes is to help in research on the molecular evolution of SARS-CoV-2 after invading the human population. The present study has not yet linked to the viral evolutionary study but is limited to a comparative study of the host genomes themselves. Recently, an informatics search for TFBSs that are present in the SARS-CoV-2 genome but absent in the bat coronavirus genome has reported 22 TFBSs that are presumed to facilitate the viral replication . All of these TFBSs are longer than hexanucleotides, and therefore, their relationship to the TFBSs in the present analysis is unclear. When extending our analysis to TFBSs with longer sequences, we can characterize in detail the TFBSs that are absent on the bat-derived coronaviruses but are present on the human-derived viruses, as well as that are increasing their occurrence in the SARS-CoV-2 genome during human-to-human transmission, and we are planning to conduct such research as a separate project.