Spread of SARS-CoV-2 Genomes on Genomic Index Maps of Hierarchy

Using visual technologies, COVID-19 patients worldwide are conveniently described by the position information of collected samples, and modern GIS maps are useful for showing the flows and numbers of patients across the regions of a pandemic. From an analysis viewpoint, it is more interesting to organize genomic information into a phylogenetic tree with multiple branches and leaves. Clusters of genomes are organized as phylogenetic trees to represent intrinsic information of genomes. However, there are structural difficulties in projecting phylogenetic information naturally into 2D distributions such as GIS maps. In this paper, a novel projection is proposed that arranges SARS-CoV-2 genomes by genomic indexes to form a structural organization resembling 2D GIS maps. For any genome, there is a unique invariant under certain conditions that provides an absolute position in a specific region. In this hierarchical framework, a visual tool can represent any selected region to cluster genomes at a refined level. Visual effects complementary to phylogenetic tree technology are provided. Sample regions and various projections show the spread of five thousand SARS-CoV-2 genomes in 72 countries, and four countries are selected for detailed genomic index maps.


Introduction
The outbreak of SARS-CoV-2, which causes COVID-19, started in Dec. 2019 and is now a pandemic. As of 25 May 2020, there were more than 5.33 million confirmed cases and 0.34 million deaths worldwide. Understanding the prevalence and contagiousness of the disease, and whether the containment strategies used to date have been successful, is important for planning future containment strategies.
One excellent strategy for the containment of SARS-CoV-2 is to collect sample genomes of the virus globally in the GISAID genetic database [1]. Building on this effort, Nextstrain provides a phylogenetic tree [2] that organizes sample datasets from different places into clusters under the maximum likelihood relationship, exposing intrinsic variations among SARS-CoV-2 genomes. Based on phylogenetic information, a dynamic simulation system provides flexible illustrations of selected branches [3] to support medical doctors, virological experts, biomedical specialists and psychological doctors in the detailed treatment of COVID-19 patients.

Weakness of Phylogenetic Tree
The phylogenetic tree of Nextstrain uses the maximum likelihood relationship to organize genomic datasets as hierarchical clusters under differential information. After a sample genome of SARS-CoV-2 is compared recursively with the root node and the subsequent branch nodes, it is placed in the most likely node, which contains the most similar genomes, as its target group. Since a genome is a long sequence, there are multiple relationships among the various clusters in the phylogenetic tree shown in Fig. 1. GIS maps are useful for seeing how various genomes are distributed worldwide, but further arrangement is not straightforward. Regular zoom operators in GIS could be simulated by moving deeper or higher along the branching nodes of a phylogenetic tree. Since phylogenetic trees correspond to neither 1D nor 2D structures, it is difficult to rearrange various subtrees [4] as visual objects.
In other words, a phylogenetic tree offers a natural projection only for a subset of its structure, and other forms of visual representation cannot be generated from it directly.

Combination, Matrix and Thermodynamics
In modern mathematics and physics, there are many theoretical constructions for handling invariant and variation problems related to entropy [5]-[35], such as combinatorial mathematics, combinatorial theory, combinatorics, multiple variable complex theory, statistical physics, thermodynamics, thermostatistics and statistical mechanics, among others.
The genomic index provides a unique identification for each genome as an invariant under given conditions. Based on such global quantitative characteristics, large numbers of genomes can conveniently be located in a geometric region and collected as clusters.
Different entropy quantities were discussed in separate papers: Visualization of SARS-CoV-2 Genomes on Genomic Index Maps [45], Visualizations of Topological Entropy on SARS-CoV-2 Genomes in Multiple Regions [46], Cluster Analysis of Visual Differences on Pairs of SARS-CoV-2 Genomes [47], Visualizations of Combinatorial Entropy Index on Whole SARS-CoV-2 Genomes [48].
Since this is an extremely important research direction, the topic needs to be handled from a foundational level, providing additional information for exploring, from a visual representation viewpoint, the hidden structures among these multilevel hierarchical constructions.

Input on Four Meta Symbols
For genomes, each element of input sequences is composed of four meta symbols: {A,C,G,T}.

The First Order of Combinations
From a combinatorial viewpoint, the first order of combinations from the four symbols is composed of sixteen states as a lattice of hierarchy, as shown in Fig. 2.
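As a minimal sketch of this enumeration, the following Python code assumes that the sixteen states are the sixteen subsets of {A, C, G, T}, ordered by size as the levels of a lattice. This reading is an assumption, since the states themselves are only given in Fig. 2; the names `SYMBOLS` and `first_order_states` are illustrative.

```python
from itertools import combinations

SYMBOLS = ("A", "C", "G", "T")  # the four meta symbols


def first_order_states(symbols=SYMBOLS):
    """Return all subsets of the symbols, grouped by subset size.

    Assumption: the sixteen first-order states are the 2**4 subsets of
    the four meta symbols; each subset size is one level of the lattice.
    """
    levels = []
    for k in range(len(symbols) + 1):
        levels.append(["".join(c) for c in combinations(symbols, k)])
    return levels


levels = first_order_states()
states = [s for level in levels for s in level]

for k, level in enumerate(levels):
    # The empty subset is shown as an empty string on level 0.
    print(k, level)
```

Under this assumption, the levels contain 1, 4, 6, 4 and 1 states, which gives sixteen states in total.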

Multiple Probability Measures
When a genome contains $m$ elements, the numbers of the four meta symbols can be counted. Let $m_s$, $s \in \{A, C, G, T\}$, be the number of occurrences of symbol $s$ and $p_s$ be a probability measure. We have the following equations for multiple probability measures.
Under multiple probability conditions, there are sixteen distinct probability mea-

Two Workflows from Input to Output
Two workflows (1) and (2) can be identified by the type of output.

Genomic Index Projection and Genomic Index Map
Each workflow is described in three parts: input, output and process. In Step (1), one index of the 16 combinatorial entropies can be generated. In Step (2), a genomic index map can be generated from multiple sets of sixteen indexes.
Input: $\forall (x, y) \in$ multiple sets of sixteen indexes, $x, y \in [0, \log_2(m+1)]$.
Let M be the j-th probability measurement; a relevant information entropy $eZ$ can then be determined and is restricted to the region $[0, \log_2(m+1)]$.
For sixteen combinations of the first order, sixteen entropy measurements of eZ

2D Combinatorial Entropies
Extending this construction to higher orders, the second order of combinations is composed of 16 × 16 pairs of states, that is, a 2D square with 256 positions.

Multiple Genomes
For multiple genomes $\{Z_t\}$, $1 \le t \le T$, with at most $T$ members in each $(i, j)$ projection, a total of $T$ positions $(eZ^t_i, eZ^t_j)$, $1 \le t \le T$, can be collected on the 2D square. This provides a distribution of all $T$ genomes on the $(i, j)$ projection based on combinatorial entropy measurements.
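The projection step can be sketched in Python as follows. The sketch assumes that the sixteen entropy indexes of each genome are already available as one row of a table; because the entropy formula is not reproduced here, the table is filled with random placeholder values. The names `projection` and `all_projection_pairs` are illustrative.

```python
import random

N_INDEXES = 16  # sixteen first-order entropy indexes per genome


def projection(index_table, i, j):
    """Return the T points (eZ_i, eZ_j), one per genome, for projection (i, j)."""
    return [(row[i], row[j]) for row in index_table]


def all_projection_pairs(n=N_INDEXES):
    """Enumerate every (i, j) pair of indexes: 16 x 16 = 256 projections."""
    return [(i, j) for i in range(n) for j in range(n)]


# Placeholder data only: T genomes with random values standing in for
# the real entropy indexes, which are not computed in this sketch.
random.seed(0)
T = 5
index_table = [
    [random.uniform(0.0, 3.0) for _ in range(N_INDEXES)] for _ in range(T)
]

points = projection(index_table, 2, 7)  # one 2D point per genome
pairs = all_projection_pairs()          # the 256 available projections
```

Each projection is then an ordinary list of 2D points that can be passed to any scatter-plot routine.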

Genomic Index Maps
In a phylogenetic tree, a genome has only a relative position, determined by the maximum likelihood relationship. In contrast, a genomic index is an absolute invariant that maps a genome to a quantitative measurement of information entropy based on variant construction. Visual representations of multiple projections are illustrated.

Datasets
From a collection of more than 30K genomes in the GISAID genetic database, more than 5K genomes were selected whose whole sequences contain no uncertain element 'N'. Approximately 25K genomes contain at least one 'N' and were excluded. Seventy-two countries are represented by more than one genome.
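The selection rule can be illustrated with a short Python sketch. It assumes that the genomes are available as FASTA text; the record names and the short toy sequences below are hypothetical, and `parse_fasta` and `without_uncertain` are illustrative helper names rather than part of any GISAID tool.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def without_uncertain(records):
    """Keep only records whose whole sequence contains no 'N'."""
    return [(h, s) for h, s in records if "N" not in s]


# Toy input: three hypothetical records, one containing an uncertain 'N'.
toy_fasta = """>genome_1
ACGTACGT
ACGT
>genome_2
ACGNACGT
>genome_3
ttgacc
"""

records = list(parse_fasta(toy_fasta))
selected = without_uncertain(records)
print([h for h, _ in selected])
```

Only the 'N' test is taken from the text; other ambiguity codes would need an additional check.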

Visual Tool - Plotly
Plotly [36] is a set of open-source visualization libraries for R, Python and JavaScript. In this project, we use this visual tool to illustrate hierarchical distributions of multiple genomes on selected regions of $EZ_{i,j}$ maps.

Clustering on Genomic Index Maps
Since all genomic indexes are associated with absolute invariants, this makes it possible to apply 1D or 2D distributions to represent complicated clusters for multiple genomes in hierarchical structures.
Two distinct schemes are shown in Fig. 3: the phylogenetic tree of Nextstrain and a global genomic index map of five thousand genomes in 72 countries. Different colors distinguish the countries. Various clusters can be identified. For an enlarged region, a regional genomic index map and various projection maps for four countries (Australia, China, the Netherlands and the USA) were selected; the results are shown in Fig. 6(a)-(b) and Fig. 7(a)-(f), respectively.
The package includes further functions for controlling the display. To extract (or remove) a specific country's distribution, double-click (or single-click) its entry in the list of 72 countries; to investigate a particular genome in its clustering neighborhood, draw a box around the region. When the cursor points to a genomic index position, a display box shows the serial number of the genome and the values of its two entropy indexes.
With this package, many complex explorations can be carried out.
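A minimal sketch of such a map with the Plotly Python library is given below. It is not the program used for the figures: the records are hypothetical placeholders, `group_by_country` and `build_figure` are illustrative names, and Plotly is imported only inside `build_figure`, so the grouping step runs without it. One trace is drawn per country, which relies on Plotly's default legend behavior (single click toggles a trace, double click isolates it).

```python
from collections import defaultdict


def group_by_country(records):
    """Group (number, country, x, y) records into per-country lists."""
    groups = defaultdict(list)
    for number, country, x, y in records:
        groups[country].append((number, x, y))
    return dict(groups)


def build_figure(records, x_label="entropy index i", y_label="entropy index j"):
    """Build a scatter map with one colored trace per country."""
    import plotly.graph_objects as go  # requires the plotly package

    fig = go.Figure()
    for country, rows in sorted(group_by_country(records).items()):
        fig.add_trace(go.Scatter(
            x=[r[1] for r in rows],
            y=[r[2] for r in rows],
            mode="markers",
            name=country,
            # The genome number travels with each point for the hover box.
            customdata=[[r[0]] for r in rows],
            hovertemplate="No. %{customdata[0]}<br>x=%{x:.3f}<br>y=%{y:.3f}",
        ))
    # Box selection lets the user draw a box around a region of interest.
    fig.update_layout(dragmode="select",
                      xaxis_title=x_label, yaxis_title=y_label)
    return fig


# Hypothetical placeholder records: (genome number, country, x, y).
records = [
    (1, "Australia", 2.961, 2.667),
    (2, "China", 2.970, 2.650),
    (3, "USA", 2.964, 2.668),
    (4, "USA", 2.974, 2.644),
]
groups = group_by_country(records)
# With plotly installed: build_figure(records).show()
```

Using one trace per country keeps the legend entries aligned with the country list, at the cost of one trace object for each of the 72 countries.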

Discussion
Because the Plotly package autoscales its axes, the visible region is not fixed and differs slightly for each selected dataset.
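If a fixed region is preferred, autoscaling can be turned off by setting explicit axis ranges. The following sketch builds a plain layout dictionary using Plotly's `range` and `autorange` axis attributes; the helper name `fixed_axes_layout` is illustrative, and the example ranges are the region [2.93, 3.01] × [2.62, 2.70] reported for Fig. 5(c).

```python
def fixed_axes_layout(x_range, y_range):
    """Return layout settings that pin both axes to explicit ranges."""
    return {
        "xaxis": {"range": list(x_range), "autorange": False},
        "yaxis": {"range": list(y_range), "autorange": False},
    }


layout = fixed_axes_layout((2.93, 3.01), (2.62, 2.70))
print(layout)

# Applied to an existing Plotly figure `fig` (not created here):
# fig.update_layout(**layout)
```

With the same ranges applied to every map, projections of different countries can be compared position by position.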

Global Projections for Four Countries
In Fig. 4 and Fig. 5, six genomic index maps are presented for all genomes of the 72 countries and for four selected countries: Australia, China, the Netherlands and the USA. In Fig. 4(b) and Fig. 5(b), all genomic indexes of the four countries are selected in a region, and a number of clusters similar to that in Fig. 4(a) can be identified. The centers of the two larger clusters are located at (2.963, 2.669) and (2.973, 2.643).
In Fig. 5(c), all genomic indexes of the USA, in blue, are selected from a region of [2.93, 3.01] × [2.62, 2.70], with a number of clusters similar to that identified in Fig. 4(a). The centers of the two larger clusters are located at (2.963, 2.669) and (2.973, 2.643).
In Fig. 5(e), all genomic indexes of Australia, in light blue, are selected from a region of [2.94, 3.01] × [2.64, 2.69], with much lighter clusters than those identified in Fig. 5(a) or Fig. 5(b). The centers of the two larger clusters are located at (2.963, 2.668) and (2.972, 2.664).

Enlarged Region of Global Projections for Four Countries
In Fig. 6 and Fig. 7, six genomic index maps are presented on enlarged regions for all genomes of the 72 countries and for four selected countries: Australia, China, the Netherlands and the USA.
In principle, the enlargement operation can be applied repeatedly to selected regions to refine locations down to the finest values. Under most conditions, two genomic indexes can be separated once a sufficiently large magnification is applied. This may provide a convenient classification for medical doctors, who could treat COVID-19 patients with similar genomic indexes as one group.
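Repeated enlargement amounts to filtering points by successively smaller rectangles, as the following Python sketch shows. The four points are the cluster centers reported above and the first region is the one given for Fig. 5(c); the two narrower regions are hypothetical zoom boxes chosen for illustration, and `select_region` is an illustrative helper name.

```python
def select_region(points, x_range, y_range):
    """Keep the points that fall inside the rectangle x_range x y_range."""
    (x0, x1), (y0, y1) = x_range, y_range
    return [(x, y) for x, y in points if x0 <= x <= x1 and y0 <= y <= y1]


# Cluster centers reported for the USA and Australia maps.
centers = [(2.963, 2.669), (2.973, 2.643), (2.963, 2.668), (2.972, 2.664)]

# First zoom: the region reported for Fig. 5(c) contains all four centers.
first = select_region(centers, (2.93, 3.01), (2.62, 2.70))

# Second zoom (hypothetical box): two nearby centers remain together.
second = select_region(first, (2.960, 2.965), (2.665, 2.670))

# Third zoom (hypothetical box): the two nearby centers are separated.
third = select_region(second, (2.9625, 2.9635), (2.6685, 2.6695))

print(len(first), len(second), len(third))
```

The two centers that differ only in the third decimal place share a box at the second zoom level and are separated at the third.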

Conclusion
Using combinatorial entropy to form 2D genomic index maps, there are 256 projections available for representing multiple genomes. Applying five thousand SARS-CoV-2 genomes from 72 countries, with special selections for four countries, and using the Plotly libraries, two sets of six genomic index maps show significantly different distributions for each country, illustrating complicated contagiousness patterns among various regions.
Using genomic index maps, further refined classifications and categories of genomes can be explored visually, and this visual tool will be useful in the future medical treatment of COVID-19 patients worldwide.

Conflict of Interest
No conflict of interest has been claimed.