2D Visual Analysis of SARS-CoV-2

COVID-19 has broken out worldwide. It has caused millions of infections, killed hundreds of thousands of people, and inflicted immeasurable trade losses on all countries. To uncover the secrets of SARS-CoV-2, researchers need to analyze various kinds of variation information, such as multiple coronaviruses collected at different times in distinct countries. In this paper, the metagenetic analysis system (MAS) is used to analyze SARS-CoV-2 genomes collected from different countries as input datasets, and special genomic indices are provided as global characteristic quantities, based on the A1 and C1 modules of the MAS, for visualization. In this method, one RNA sequence is split into M segments, and genetic probability measures are counted for 16 combinations of the four genomic symbols. After statistical probability processing, each probability distribution is transformed into an entropy quantity, and both 2D and 1D histograms are used to show the results for all collected genomes. Under this approach, a pair of combinatorial entropies determines a 2D genomic index map, from which a heatmap is generated that gathers genomes with similar contents into larger clusters. This provides basic quantitative invariants for organizing further collected genomes when constructing a phylogenetic tree. Further explorations are required.


Introduction
Today the whole world is facing its most significant challenge since 1918. COVID-19 has swept the world within three months. There are about 85,000 patients in China and 3 million cases worldwide. The United States is the worst affected country, with nearly 1 million SARS-CoV-2 patients. Moreover, the disease has a high fatality rate, which can reach about 10%. Therefore, to find treatments and antidotes, we need to analyze the coronavirus.
As is well known, the coronavirus genome is composed of RNA. Because RNA is structurally unstable, RNA viruses evolve and mutate more readily. According to [1], there are five subtypes of SARS-CoV-2 in the world. Some countries have one subtype, and others have two or more. Comparing these RNA sequences is therefore very important: we must identify their similarities and differences in order to distinguish the subtypes of the coronavirus.
In this paper, we use the metagenetic analysis system (MAS), in particular its C1 and A1 modules, to process coronavirus RNA sequences from different countries, and we analyze the results graphically.

Method
In this section, we will introduce our method and its principle.

Pre-process
Following [2], we pre-process the RNA sequence of the coronavirus by dividing the whole sequence into several parts. The specific steps are as follows. We first obtain the coronavirus RNA sequence data from the GISAID database. Next, we segment each sequence of length N into subsequences of length m, so there are M = N/m subsequences [2]. We then analyze these subsequences to obtain information about the type of coronavirus.
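The segmentation step can be sketched in Python as follows. This is only an illustration: the function name, the example string, and the choice to drop a trailing remainder shorter than m are assumptions, since the text does not say how a sequence whose length is not a multiple of m is handled.

```python
def split_into_segments(sequence: str, m: int) -> list[str]:
    """Cut a sequence into consecutive, non-overlapping segments of length m."""
    if m <= 0:
        raise ValueError("segment length m must be positive")
    # M = N / m full segments; any shorter remainder is dropped (assumption).
    n_segments = len(sequence) // m
    return [sequence[i * m:(i + 1) * m] for i in range(n_segments)]


# Illustrative 20-base string, not a real GISAID record.
segments = split_into_segments("ATTAAAGGTTTATACCTTCC", m=6)
print(segments)
```

With N = 20 and m = 6 this yields M = 3 full segments, and the last two bases are left out.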

Variant Construction for SARS-CoV-2
First, we take the subsequences of the coronavirus sequence and use variant construction to analyze each of them. Following [3], we count the number of bases and their combinations in each subsequence, using Equation (1).
In Equation (1), m_b is the number of bases and their combinations in the subsequence [4]. There are therefore 15 nontrivial combinations: A, T, C, G, A+T, A+C, A+G, T+C, T+G, C+G, A+T+C, A+T+G, A+C+G, T+C+G, and A+T+C+G.
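One possible reading of this counting step is sketched below in Python. It assumes that the count for a combination such as A+T is the number of positions in the segment whose base belongs to that set; Equation (1) itself is not reproduced here, so this interpretation and the function names are assumptions rather than the MAS implementation.

```python
from itertools import combinations

BASES = "ATCG"


def nonempty_combinations(bases: str = BASES) -> list[frozenset[str]]:
    """All 15 non-empty subsets of the four genomic symbols."""
    return [
        frozenset(combo)
        for size in range(1, len(bases) + 1)
        for combo in combinations(bases, size)
    ]


def combination_counts(segment: str) -> dict[str, int]:
    """Count, for each combination, the positions whose base is in it."""
    counts = {}
    for combo in nonempty_combinations():
        label = "+".join(b for b in BASES if b in combo)
        counts[label] = sum(1 for base in segment if base in combo)
    return counts


counts = combination_counts("ATTAAA")
print(counts["A"], counts["T"], counts["A+T"], counts["C+G"])
```

Under this reading every count lies between 0 and the segment length m, which matches the range 0 ≤ Z_j ≤ m used for the entropy.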

Entropy
Entropy is an efficient way to extract information about a target. We calculate the entropy of the coronavirus RNA sequence. Building on the previous section, we measure the entropy of the M subsequences to obtain information about the distributions over vectors or matrices. From Section 2.2, there are m + 1 vectors or (m + 1)^2 matrices. The entropy E(X) is measured by Equation (2).
In Equation (2), E(X) ∈ R is restricted to the interval [0, log_2 T]; p_j is the j-th probability, obtained as Z_j/m with 0 ≤ Z_j ≤ m; and Z denotes a vector or matrix with m + 1 or (m + 1)^2 entries, respectively. Hence T is m + 1 or (m + 1)^2.
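A minimal Python sketch of an entropy of this kind is given below. It assumes the standard Shannon form E(X) = -sum_j p_j log_2 p_j, which is consistent with the stated range [0, log_2 T] but is not written out in the text. It also assumes the 1D case, where the probabilities come from a histogram of per-segment counts over the m + 1 possible values 0..m, normalised by the number of segments so that they sum to one. The function name and the example counts are illustrative.

```python
import math
from collections import Counter


def shannon_entropy(values: list[int], m: int) -> float:
    """Base-2 entropy of per-segment counts distributed over the bins 0..m."""
    if not values:
        raise ValueError("need at least one segment count")
    if any(v < 0 or v > m for v in values):
        raise ValueError("counts must lie in [0, m]")
    total = len(values)
    histogram = Counter(values)  # bin value -> number of segments
    # Empty bins contribute nothing, so only occupied bins are summed.
    return -sum((c / total) * math.log2(c / total) for c in histogram.values())


# Four hypothetical segment counts for one combination, with m = 6.
entropy = shannon_entropy([4, 2, 1, 4], m=6)
print(entropy)
print(math.log2(6 + 1))  # upper bound log_2 T for T = m + 1 bins
```

When all segments share the same count the entropy is 0, and it approaches log_2(m + 1) as the counts spread evenly over the bins.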

Visualization
After obtaining the entropies of the subsequences, we count how often each entropy value occurs and plot the results in 2D coordinates. The X-axis gives the entropy value and the Y-axis gives the number of occurrences. This yields a histogram like Figure 1. In addition, we choose two combinations to form the X- and Y-axis coordinates of a 2D scattergram across the different coronavirus sequences. An example is shown in Figure 2.
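The data behind both plots can be prepared as in the following Python sketch. It only builds the (x, y) pairs; drawing them is left to any plotting library. Rounding entropies to two decimals before counting is an assumption, as are the function names and the example values.

```python
from collections import Counter


def entropy_histogram(entropies: list[float], decimals: int = 2) -> list[tuple[float, int]]:
    """Return (entropy value, occurrence count) pairs sorted by entropy."""
    counts = Counter(round(e, decimals) for e in entropies)
    return sorted(counts.items())


def scatter_points(entropy_x: list[float], entropy_y: list[float]) -> list[tuple[float, float]]:
    """Pair the entropies of two combinations, one point per sequence."""
    if len(entropy_x) != len(entropy_y):
        raise ValueError("both combinations need one entropy per sequence")
    return list(zip(entropy_x, entropy_y))


# Hypothetical entropy values, for illustration only.
bars = entropy_histogram([1.50, 1.504, 0.81, 1.5])
points = scatter_points([1.50, 0.81], [1.25, 1.00])
print(bars)
print(points)
```

Each pair in `bars` is one histogram bar (X: entropy value, Y: count), and each pair in `points` is one sequence placed on the 2D map.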

Sequence Phylogenetics
According to [5], the coronavirus evolves continually because of the characteristics of RNA. To better understand SARS-CoV-2, we therefore also estimate the evolution time. According to [2], each sequence has a unique entropy. We calculate the difference between entropies as in Equation (3) and combine it with the time information in the sequence data to analyze the sequences phylogenetically.
In Equation (3), S is the difference value, and E_a and E_b are the entropy values of two different sequences [6].
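As a rough illustration, pairwise entropy differences could be computed as in the Python sketch below. It assumes the signed form S = E_a - E_b, which is a guess at Equation (3) and not confirmed by the text; the sequence identifiers and entropy values are hypothetical.

```python
from itertools import combinations


def entropy_differences(entropies: dict[str, float]) -> dict[tuple[str, str], float]:
    """Signed difference E_a - E_b for every unordered pair of sequences."""
    return {
        (a, b): entropies[a] - entropies[b]
        for a, b in combinations(sorted(entropies), 2)
    }


# Hypothetical sequence identifiers and entropy values.
differences = entropy_differences({"seq_a": 1.50, "seq_b": 1.25, "seq_c": 1.00})
for pair, s in differences.items():
    print(pair, s)
```

The resulting differences would then be combined with the time information of each sequence, a step this sketch does not attempt.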

Results
In this section, we will show the visual result of our method for the SARS-CoV-2.

Conclusion
In this paper, we analyze the types of SARS-CoV-2 found in different countries using the metagenetic analysis system and the entropies of their sequences. To present the results clearly and efficiently, we plot them as histograms. We also use entropy to estimate the evolution time of the different coronavirus types.

Conflict of Interest
No conflict of interest has been claimed.