A Visual Framework of Meta Genomic Analysis on Variations of Whole SARS-CoV-2 Sequences

The worldwide outbreak of COVID-19 has become a global pandemic, resulting in millions of confirmed cases and hundreds of thousands of deaths. In facing this global crisis, bioinformatics has played a key role in the diagnosis, follow-up, prognosis and treatment of COVID-19-infected patients. This paper proposes a novel bioinformatic tool for metagenomic analysis of whole genomes that is composed of three projections: global, clustering and genomic index. For each projection, key modules are described. Global projection provides various combinatorial distributions for a whole genome of length N; the m-mer scheme partitions this sequence into M segments and maps them onto 1D, 2D and 3D density matrices for multiple projections. Clustering projection applies special filters to the distributions from global projection to extract specific parts as probability eigenvalues. Genomic index projection provides comprehensive techniques based on the theory of information entropy, including a list of entropy measures such as combinatorial entropy (CE), integrated entropy (IE), mean entropy (ME) and topological entropy (TE). The three projections provide unified information to describe complicated functions, internal structures and refined variations for multiple groups of SARS-CoV-2 variants and for other genomes in comparison. The outputs of the three projections are illustrated on variant maps to support categorization, clustering, classification and root-establishing activities for refined quantitative operations in a bottom-up strategy.


Introduction
Starting in November 2019 or earlier, the outbreak of COVID-19 has turned into a global pandemic affecting over 200 countries, resulting in more than 3 million confirmed cases and hundreds of thousands of deaths. In fighting this unexpected global crisis, bioinformatics has played a key role in the diagnosis, follow-up, prognosis and treatment of COVID-19-infected patients. It is important to attract more researchers and practitioners to this field to stimulate worldwide collaboration and coordination.

Metagenomic Analysis on COVID-19
Metagenomics for COVID-19
Metagenomics [1]-[14] represents a new approach in genomic analysis. This method accesses the potential reservoirs of novel genes in wider applications. To explore this reservoir, RNA genomes from SARS-CoV-2 samples are collected from thousands of COVID-19 patients worldwide. In addition to Koch's postulates [15,16] for the pathogenicity of SARS-CoV-2, an initial investigation of existing metagenomic technologies is necessary.
Using advanced metagenomic analysis methodologies, many statistical and computational tools and databases for metagenomics have been developed [20]. These cover functional and sequence-based analysis of the collective microbial genomes ("full shotgun metagenomics" [21]) and polymerase chain reaction (PCR) amplification of certain genes of interest ("marker gene amplification metagenomics", e.g., the 16S ribosomal RNA gene, or "meta-genomics" [22]). Two typical operations linking next-generation sequence data with data management, storage and sharing are summarized in the accompanying figure. Considering this advanced research direction and its specific driving applications, it is worth examining the details further to understand the specific advantages and weaknesses of existing metagenomic analysis tools in handling the RNA virus genomes of SARS-CoV-2.

Analysis Tools for Metagenomics
From a computational viewpoint, many analysis tools have been developed. Tools such as EULER [23], Velvet [24], SOAP [26] and ABySS [27] were among the first to perform de novo assembly and are still widely used today.
The next generation of assembly tools, such as MetaVelvet/MetaVelvet-SL [25] and Meta-IDBA [28], creates more accurate assemblies from datasets containing a mixture of multiple genomes. They use k-mer frequencies to detect kinks in the de Bruijn graph and then use these k-mer thresholds to decompose the graph into subgraphs.
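The k-mer frequencies mentioned above can be illustrated with a minimal Python sketch. The function name `kmer_counts` and the toy reads are hypothetical and only show the counting step; real assemblers build and decompose a de Bruijn graph on top of such counts.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every overlapping k-mer across a list of reads."""
    counts = Counter()
    for read in reads:
        # A read of length L contributes L - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy input (an assumption for illustration, not data from the paper).
reads = ["ACGTAC", "CGTACG"]
counts = kmer_counts(reads, 3)
```

K-mers whose counts fall below or above a chosen threshold can then be separated, which is the kind of frequency information the text describes as the basis for splitting the graph into subgraphs.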
The IDBA-UD algorithm [29] addresses metagenomic sequencing technologies with uneven sequencing depths involving complicated processes for various k-mers in both low-depth and high-depth regions in multiple levels of hierarchy.
Binning is the process of grouping reads or contigs into individual genomes and assigning the groups to specific species, subspecies or genera. Binning methods fall into two classes: 1) composition-based binning, which relies on each genome having a unique distribution of k-mer sequences (its genomic signature); 2) similarity- or homology-based binning, which uses BLAST or profile hidden Markov models (pHMMs) to obtain similarity information from available databases (NCBI or PFAM).
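As a hedged illustration of the genomic-signature idea, the sketch below computes a normalized k-mer frequency vector for a sequence and a Euclidean distance between two such vectors. The names `signature` and `distance`, the default k = 4 and the choice of Euclidean distance are assumptions made for illustration; they are not taken from any specific binning tool.

```python
from collections import Counter
from itertools import product
import math

def signature(seq, k=4):
    """Normalized frequency of every possible k-mer over the alphabet ACGT."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # k-mers containing other symbols (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]

def distance(a, b):
    """Euclidean distance between two signature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

sig_a = signature("ACGTACGTACGT", k=2)
sig_b = signature("GGGGCCCCGGGG", k=2)
d = distance(sig_a, sig_b)
```

Contigs with a small distance between their signatures would be candidates for the same bin, whereas a similarity-based method would instead query a reference database.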

Hybrid Approaches
Certain tools use a hybrid approach that employs both composition- and similarity-based information to group sequences, such as PhymmBL [31] and MetaCluster [32], which work across a series of metagenomic samples and facilitate the assembly of microbial genomes without the need for reference sequences.
Binning tools are further categorized as 1) ab initio unsupervised classifiers and 2) supervised/training-based classifiers. For example, ClaMS is a classifier for metagenomic sequences [38].
Unsupervised binning refers to the process of using pre-existing bins to classify a given dataset without user supervision. In contrast, supervised binning allows user interaction and supervision in the training process per se. In general, Support Vector Machines (SVMs; PhylopythiaS), hidden Markov models (HMMs; PhymmBL, TETRA) and Self-Organizing Maps (SOMs; ESOMs) provide unsupervised classifiers. PhylopythiaS and TETRA allow little user intervention, while ClaMS and ESOM provide a more supervised training approach that allows optimal classification for the specific dataset under consideration.
Optimal binning results are expected when hybrid tools, such as PhymmBL and MetaCluster, combine both composition- and similarity-based approaches.

Parallel Schemes
Advanced systems include Parallel-META, for efficient metagenomic data analysis based on high-performance computation [33]; MEGAN, for analysis of metagenomic data [34,35]; and Galaxy, a web-based genome analysis tool for experimentalists [36].
To identify genes within the reads or assembled contigs, genes are labeled as coding DNA sequences (CDSs) or noncoding RNA genes, and certain annotation pipelines (e.g., IMG/MER) also predict regulatory elements, such as clustered regularly interspaced short palindromic repeats (CRISPRs).
CDSs are identified by MetaGeneMark, Metagene, ..., and FragGeneScan. CRISPR elements are identified by CRT, PILER-CR and IMG/MER, retaining the longest element prediction in case of overlap. Noncoding RNAs are predicted by tRNAscan.
Ribosomal RNA (rRNA) genes (5S, 16S and 23S) are predicted by rRNA models in IMG/MER and MG-RAST, which use similarity comparisons against three databases (SILVA, Greengenes and the Ribosomal Database Project, RDP).
Because metagenomic datasets are large, predicting protein-coding genes is computationally very expensive and relies on highly automated operations. BLAST or other sequence-similarity-based algorithms run on high-performance computer clusters, using multithreading or other parallel computational approaches to divide jobs among multiple central/graphics processing units (CPUs/GPUs) and reduce the running time. Data repositories include metagenomic databases such as KEGG, SEED, eggNOG, and COG/KOG. PFAM and TIGRFAM are protein domain databases.
BLAT (BLAST-like alignment tool) identifies the best homologs of those genes in the isolated genomes. It misses similarity below 70% identity, and many strong hits to other genes are missed.
The EBI metagenomics service is a web-based portal that uses metadata structures and formats following the Genomic Standards Consortium (GSC) guidelines. EBI uses FragGeneScan to obtain protein coding sequences. CAMERA is an online cloud computing service for the analysis of metagenomic data. MEGAN 5 is another tool for the analysis of metagenomic data. It offers a wide range of visualization schemes for metagenomic annotation results: functional or taxonomic dendrograms, tag clouds, bar charts and Krona taxonomic plots, which allow hierarchical data to be explored in a zoomable pie chart.
Taxonomic analysis for prokaryotes (i.e., bacteria and archaea) is regularly performed using 16S data derived from sequencing technologies. In this area, despite the vast availability of algorithms and software for the analysis of 16S metagenomics datasets, QIIME seems to be established as the "gold standard".

Statistical Analysis and Visualization of Results
Different tools provide a comprehensive representation of a taxonomic tree that can be visualized in applications such as FigTree, together with a file in Biological Observation Matrix (BIOM) format representing OTU tables. Numerous tools and software packages perform statistical analysis. The Primer-E package allows multiple multivariate statistical analyses, such as Multi-Dimensional Scaling (MDS), analysis of similarities (ANOSIM), and hypothesis testing. In the R statistical programming language, packages such as vegan, phyloseq and Bioconductor provide multiple built-in functions and libraries that support the wide range of statistical analysis and visualization required for metagenomic datasets.

Storage, Sharing and Minimum Information
Tools such as IMG/MER, CAMERA, MG-RAST and EBI metagenomics provide an integrated environment for the analysis, management, storage and sharing of metagenome projects. For refined applications, minimum information about a metagenome sequence (MIMS) and minimum information about a MARKer sequence (MIMARKS) were devised to provide a scheme of standard languages for metadata annotation.

Difficulties in Virus Analysis, Big Data Visualization and Others
Analysis usually requires a reference database to find the closest match to an operational taxonomic unit (OTU) from a taxonomic lineage. Existing databases are less suitable for certain groups of organisms, such as protists and viruses, which are extremely diverse and for which considerably less sequence information is available compared to bacteria.
Considering the special importance of Koch's postulates in the era of genomics [15,16], it is necessary to find proper techniques to resolve this type of difficulty.
Facing a sea of biological data, Science magazine in 2005 listed the top 125 scientific problems [40], including Problem 17: How will big pictures emerge from a sea of biological data?
In the current situation, it is truly a top challenge to generate meaningful pictures that emerge from a large number of biological datasets especially from genomes.
Results can vary due to inconsistencies in a number of factors, such as DNA extraction, primer pair and amplification region, sequencing platform and software used. These variations make it difficult to compare results and to obtain trustworthy ones. Through benchmarks, simulations and testing, an initiative should eliminate, or at least minimize, the biases that can be generated by analyzing data using multiple methodologies.
In comparative metagenomic analyses, one can use tools to compare samples from ecological niches and extract information in common and/or unique to a specific environment.

GISAID - International Sharing of Influenza Virus Sequences
GISAID [18] initially promoted the international sharing of all influenza virus sequences, related clinical and epidemiological data associated with human viruses, and geographical as well as species-specific data associated with avian and other animal viruses, to help researchers understand how the viruses evolve, spread and potentially become pandemics.
Up to April 25, 2020, over 12K viral genome sequences of hCoV-19 (SARS-CoV-2) were shared with unprecedented speed via GISAID. One application is shown in the accompanying figure. The GISAID TreeTool uses the nextflu pipeline to construct and visualize the tree. The pipeline consists of an initial approximate maximum likelihood phylogeny reconstruction using FastTree, followed by refinement with RAxML (GISAID TreeTool in v2.0).
If the initial sequence is the origin sequence, there is no problem in making it the root; subsequent new samples are gradually added to the tree as branches and intermediate nodes in a proper cluster, based on the maximum likelihood relationship. However, if the first sequence is not truly the origin sequence, and there is potential stigma [19] for COVID-19, then a fair whole phylogeny requires changing the root.
From a comparison of machine-learning mechanisms, an advanced TreeTool needs to be enhanced with both semi- or fully supervised learning schemes and unsupervised schemes. These would first organize datasets into multiple levels of hierarchy and then, based on well-balanced clustering distributions, establish a suitable phylogeny construction for further exploration. It is essential for the system to allow relocation of the root node in a consistent manner [99,102].
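To make the idea of relocating the root concrete, the following minimal Python sketch re-roots a tree given only its undirected edges. It is a topological illustration under simplifying assumptions: branch lengths and likelihoods are ignored, the function name `reroot` and the node labels are hypothetical, and it does not reproduce the GISAID TreeTool or nextflu implementation.

```python
from collections import deque

def reroot(edges, new_root):
    """Return a child -> parent map for the tree rooted at new_root."""
    # Build an undirected adjacency list from the edge list.
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    # Breadth-first traversal from the new root orients every edge away from it.
    parent = {new_root: None}
    queue = deque([new_root])
    while queue:
        node = queue.popleft()
        for neighbour in adj.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return parent

# Hypothetical tree: three samples joined at one internal node "x".
edges = [("s1", "x"), ("x", "s2"), ("x", "s3")]
rooted_at_s1 = reroot(edges, "s1")
rooted_at_s3 = reroot(edges, "s3")
```

The same set of edges yields a different parent map for each choice of root, which is why the choice of the first sequence matters for how the phylogeny is read.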

The Newest Fighting Fields on COVID-19
Facing the complicated spread of COVID-19 worldwide, scientists and researchers in many countries are cooperating to fight this invisible virus, and research papers have been published across a wide range of areas, such as from SARS to MERS [41], possible origins of SARS-CoV-2 from {Bats, Pangolin ...} [42,43,44,46,49,55,66], human ACE2 receptors [45,77], and the most important questions [47].
Applying similarity comparison, complex tree structures were investigated in terms of similarity and evolutionary relationships [85]. The main steps in the process are shown in Fig. 4.

Fig. 4 Main steps of multiple coronaviruses on similarity networks [85]

Nineteen new variations of SARS-CoV-2 genomes were identified in patient-derived mutations [86], with significant variation in cytopathic effects and viral load; differences of up to 270-fold were observed. The conclusions about such variations will strongly influence further medical practice in the treatment of COVID-19 patients. The identification process is briefly shown in Fig. 5: initial information on genomic variations is obtained from Nextstrain's phylogeny, followed by selection of genomic samples through metagenomic analysis, 3D structural analysis, extractions, complicated biological quantitative measurements and comparison of the results.

Variant Construction
Based on vector logic, modern matrix theory, geometric measure theory, combinatorial algebra and discrete mathematics [87]-[94], variant construction starts from $n$ 0-1 variables to form $2^n$ states and $2^{2^n}$ functions. Vector permutation and complement operations on the state space establish a variant logic framework containing $2^n! \times 2^{2^n}$ configurations as a variation space. Variant measurement acts as the core of quantitative measurement, starting from $m$ 0-1 variables to explore relevant clustering conditions on $2^m$ states. Many sample applications have been developed over 40 years using variant construction [95]-[102], such as content-based image retrieval, medical image processing, bat echo identification, DNA maps, hierarchical organization, phase space classification, feature extraction, filtering, combinations, projections, and conjugate transformations.
This novel theoretical construction provides a solid foundation of multiple hierarchies on multiple probability distributions and invariants to support the metagenomic analysis system at the levels of concept, design and engineering implementation. Since all relevant transformations and variations can be represented as maps, this paper focuses on showing the base structure of this visual framework for SARS-CoV-2 genomes in a series of visual maps.
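The growth of these counts can be checked with a short worked computation. The sketch below assumes the configuration count is read as $(2^n)! \times 2^{2^n}$, that is, all permutations of the $2^n$ states combined with all complement choices; the function name `variant_space_sizes` is hypothetical.

```python
from math import factorial

def variant_space_sizes(n):
    """Return (states, functions, configurations) for n 0-1 variables."""
    states = 2 ** n                 # 2^n states
    functions = 2 ** states         # 2^(2^n) functions on those states
    # Assumed reading: (2^n)! permutations times 2^(2^n) complement choices.
    configurations = factorial(states) * functions
    return states, functions, configurations

sizes_n1 = variant_space_sizes(1)
sizes_n2 = variant_space_sizes(2)
```

Under this reading, one variable gives 2 states, 4 functions and 8 configurations, and two variables give 4 states, 16 functions and 384 configurations, which shows how quickly the variation space expands.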

Aim of The Study
Unlike existing metagenomics, which pursues specific targets, the metagenomic analysis system (MAS) focuses on exploring the general information intrinsically contained in collections of whole genomic sequences, here the RNA genomes of SARS-CoV-2 samples. Multiple distributions, curves and surfaces are illustrated, and the relevant quantitative measurements (invariants) are represented as genomic indices that are mapped into a restricted geometric measurement region to support categorization, clustering and classification activities, so that whole collections of genomes can be viewed on variant maps. This mapping mechanism could be further explored to handle other types of big data collections for categorization and content-based indexing, and to provide supersymmetric properties for managing giant data collections worldwide as a unified thermodynamic scheme under relevant entropy schemes. Further explorations are required.
In this paper, the hierarchical architecture of the MAS framework is briefly described without workflows and core equations. Readers interested in more technical information should consult the third paper of this special issue, which describes the detailed workflows and main equations.

Architecture of Metagenomic Analysis System MAS
The architecture of the metagenomic analysis system (MAS) is composed of three projections: A) global projection; B) clustering projection; C) genomic index projection. The overall architecture is shown in Fig. 6, and the three projections are shown in Fig. 7 (a)-(c).

A: Global Projection
Global projection is composed of nine modules, $A := \{A_1, \cdots, A_9\}$, as follows.

Fifteen Modules in MAS
The MAS comprises fifteen functional modules; their function, description, visual, and mode are listed in Table 1.
Because a series of visual maps is involved, it is convenient for readers to use relevant names for the visual illustrations when applying them to specific purposes in science, technology, medicine, social activities and daily life. The relevant names are listed in Table 2.

Results and Discussion
It is convenient to select one sample map for each module in the MAS framework.

Results of Global Projection Groups
Nine modules {A1,..., A9} are selected to be represented as a set of visual maps as follows.
A1 Both Nextstrain phylogeny maps and genomic index maps provide invaluable categorical information. Only relative differences among clusters are contained in the Nextstrain phylogeny with the nearest likelihood relationships in limited discrete evaluating generations.
However, genomic index maps provide invariant position information for all metagenomic sequences on a flexible-scaling region to support infinite evolutions of different variations from foundation levels in general.

Future Explorations
From genomic index maps to general medical practices.
Extracting precise categorical information from hierarchical genomic index maps provides invaluable information on the classification of genomic sequences for advanced practices of genomics, transcriptomics, proteomics, metabolomics and phenomics using metamodel organisms representing viruses, plants and animals, in addition to medical, pharmacological, pharmaceutical, pathogenic, neurologic and other applications for COVID-19 patients.

Discussion
A series of visual results for the three groups of functions A-C is presented. In groups A and B, one or two genomic sequences can be transformed into distinct probability distributions via the relevant schemes. Multiple distributions are illustrated as 1D to 3D visual maps, and the corresponding distributions can be generated from a single input sequence.
The C group, however, provides further integration: each sequence becomes a single point located at a certain position in a restricted geometric region, and further hierarchical scaling down to infinitely small regions is possible. Under this superinvariant framework of the information entropy family, global variations of SARS-CoV-2 samples worldwide can be identified in one unified map. This provides a powerful capacity to compare all SARS-CoV-2 genomes consistently.
There are potentially infinite variations for this type of dynamic system. Further detailed investigations are required.
From the relevant comparison results, similarity comparisons can be performed between two genomes. The modules provide different transformations that produce various distributions and specific genomic indices. The most important invariants of the MAS are the four entropy parameters, which serve as global invariant parameters to describe any type of variation among the genomes of multiple species. This is the most important contribution of the MAS to genomics in general. Further explorations are required.
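The step from a segment-wise distribution to a single index per sequence, as described in this discussion, can be sketched in Python. This is a minimal illustration under stated assumptions: it partitions a sequence into non-overlapping m-mers, counts one chosen base per segment, and applies the standard Shannon entropy to the resulting distribution. It does not implement the CE, IE, ME or TE measures of the MAS, whose equations are given in the companion papers; the function names and the toy sequence are hypothetical.

```python
from collections import Counter
import math

def segment_counts(seq, m, base="G"):
    """Split seq into non-overlapping m-mers and count `base` in each one."""
    n_segments = len(seq) // m  # trailing bases that do not fill a segment are dropped
    return [seq[i * m:(i + 1) * m].count(base) for i in range(n_segments)]

def distribution(counts, m):
    """Probability of observing 0..m occurrences of the base per segment."""
    if not counts:
        raise ValueError("sequence is shorter than one segment")
    hist = Counter(counts)
    return [hist[c] / len(counts) for c in range(m + 1)]

def shannon_entropy(probs):
    """Standard Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy sequence (an assumption for illustration, not a SARS-CoV-2 genome).
seq = "GGCAGTCAGGTT"
counts = segment_counts(seq, 4)
probs = distribution(counts, 4)
index = shannon_entropy(probs)
```

Here one sequence yields one distribution (as in groups A and B) and then one scalar value (as in group C), so that many genomes could be placed as points on a common axis for comparison.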

Conclusion
It is important to have an integrated framework for analyzing RNA viruses, whose intrinsically strong variations are associated with time, location, micro to macro environments and other complicated conditions involved in their spread, carriage, transmission, prevention and detection.
Based on variant construction of hierarchical organizations from meta levels of analysis, and applying quantum thermodynamics and information entropy facilities, it is feasible to transform various virus genomic sequences into unique sets of genomic indices that organize all relevant information within a restricted geometric region. The foundation of thermodynamic variations and global invariant properties for quantitative characteristics supports the broad usefulness of this supersymmetric framework in future exploration.
The second paper in this special issue, "Input-Output Types of Fifteen Modules on Discrete and Real Measurements for COVID-19", discusses the relevant input-output types further and describes the main equations. It is complementary documentation of the MAS for COVID-19.
Since only brief contents are described in this paper, please find refined illustrations with further detailed information on each module in other supporting papers of this special issue. We look forward to obtaining real metagenomic analysis applications for MAS in the near future.