Visualizations of Multiple Probability Measures for SARS-CoV-2 Genomes

SARS-CoV-2 genomes were collected from various open-source genomic banks. A set of these genomes was selected for visualization under the A3 and C1 modules of the metagenomic analysis system MAS. Multiple probability measures are mapped as 1D histograms, which makes it convenient to observe differences among the distributions and to organize similar patterns into groups. Sample genomes were processed, and the visual results are illustrated.


Introduction
Everyone is concerned about the outbreak caused by SARS-CoV-2 [6] and is making an effort to overcome it. In particular, medical personnel are not only racing against time but also fighting the disease. From the frontline of hospital treatment to the frontiers of scientific research, there is a steady stream of good news, which brings us hope.
Samples collected in various places are sequenced to obtain viral gene sequences [1]. When the gene sequences from these places are brought together, the number of viral gene sequences becomes very large. Today, many tools, including online tools [2], help us analyze the similarity or other characteristics of gene sequences. However, quickly obtaining basic information, such as the distribution of a gene across all sequences, remains difficult and often takes considerable time. The metagenomic analysis system MAS contains 15 modules, which provide distinct functions to support a wider range of applications. This paper is mainly concerned with the A3 and C1 modules.

Aim of the Study
This study addresses the difficulty of quickly obtaining specified features from a large number of sequences and aims to visualize the results so that they are easy to understand. First, variant measurement [9] is used to project viral gene sequences onto 1D histograms, and suitable parameters are found through repeated exploration. This mapping mechanism could be explored further for categorization and content-based indexing of other big data collections, where its supersymmetric properties may help manage very large data collections worldwide.
This paper introduces a specific method in which multiple probability measures are mapped as 1D histograms.

Materials and Methods
The materials used in this study are the SARS-CoV-2 gene sequences published in the GISAID (Global Initiative on Sharing All Influenza Data) database [5]. Each gene sequence used was longer than 29,000 nucleotides. The figures shown in this paper were obtained using the gene sequences in Table 1.
The main method is divided into five steps: preprocessing [8], segmentation, statistics, normalization and projection. The first step is preprocessing, which refers to preparing the downloaded viral gene sequences before they enter the program, mainly by checking whether they contain abnormal characteristics and, if so, dealing with them accordingly. The second step is segmentation, which refers to dividing a single viral gene sequence into short sequences of fixed length m. In the variant maps shown in this paper, m is 32. The third step is statistics, which means counting the number of specific characters in each short sequence obtained in the second step. In the maps shown in this paper, the statistics include not only the individual numbers of A, C, G, and T but also the sum of the numbers of C and G in each short sequence. The fourth step is normalization [3,4], which refers to normalizing the multiple values obtained from a viral gene sequence. The last step is projection. The graph used in this paper is the 1D histogram, which shows the frequency of each group and the relationship between the frequencies of the groups.
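The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MAS implementation. The function names (`preprocess`, `segments`, `measure`, `histogram`) are hypothetical. Preprocessing is assumed to mean removing whitespace, uppercasing, and discarding segments that contain characters other than A, C, G, T. The incomplete final segment is assumed to be dropped, and normalization is assumed to mean dividing by the number of segments so that the histogram bars sum to 1.

```python
from collections import Counter

M = 32  # segment length m used for the maps in this paper


def preprocess(seq):
    """Remove whitespace and uppercase (assumed handling; the paper does not specify it)."""
    return "".join(seq.split()).upper()


def segments(seq, m=M):
    """Split seq into consecutive, non-overlapping segments of length m.

    The incomplete tail, if any, is dropped (assumption).
    """
    return [seq[i:i + m] for i in range(0, len(seq) - m + 1, m)]


def measure(segment, symbols):
    """Count the characters of the segment that belong to `symbols`, e.g. "A" or "CG"."""
    return sum(segment.count(s) for s in symbols)


def histogram(seq, symbols, m=M):
    """Return a list h of length m + 1, where h[k] is the fraction of
    segments containing exactly k characters from `symbols`."""
    # Keep only segments made purely of A, C, G, T (assumed treatment of abnormal characters).
    segs = [s for s in segments(preprocess(seq), m) if set(s) <= set("ACGT")]
    counts = Counter(measure(s, symbols) for s in segs)
    total = len(segs)
    return [counts.get(k, 0) / total if total else 0.0 for k in range(m + 1)]
```

For example, `histogram(seq, "A")` would give the distribution of the A count per 32-character segment, and `histogram(seq, "CG")` the distribution of the combined C and G count.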

Results and Discussion
This section shows and analyzes the 1D histograms generated with the materials and methods introduced in the previous section. Figure 1 shows all the histograms, from which their basic characteristics can be observed, such as their approximately normal shape. These histograms have some common features: they are high in the middle, decrease gradually on both sides, and are approximately symmetrical. Most of the low frequency values in these histograms are shared by two or more groups.
We can analyze the similarities and differences between these histograms to determine the distribution characteristics of a specific gene or gene combination in these sequences. It can be observed from Figure 1 that all histograms of sequence g are similar to the corresponding histograms of sequence h. Comparing sequences g and f, histograms g A and f A are similar and histograms g C and f C are similar, so only two pairs of corresponding histograms are similar. Therefore, it is inferred that sequence g is similar to both sequences f and h but is most similar to sequence h. Some histograms with better symmetry are selected from Figure 1 and shown in Figure 2. The highest frequency is not always attained by a single group, so histogram b A, in which two groups share the highest frequency, is selected for display. The other three histograms in Figure 2 have only one group with the highest frequency. All of them show good symmetry: they are centered on the highest group or two groups and decrease gradually toward both sides.
Histograms whose center of gravity lies to the left and histograms whose center of gravity lies to the right are selected from Figure 1 and shown in Figure 3. The center of gravity of c A and e C is to the left, and the gap between groups is small. c CG and d CG are just the opposite: their center of gravity is to the right, and the gap between groups is slightly larger. Therefore, it is inferred that the number of A in a small segment of sequence c is abnormally large, and the number of C and G in a small segment is abnormally small. The number of C in certain small segments of sequence e is abnormally large. The number of C and G in sequence d is abnormally small.
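The comparisons in this section are made by visual inspection. As a hedged illustration of how such a comparison might be quantified, the sketch below uses the total variation distance between two normalized histograms. This measure is an assumption introduced here for illustration and is not the method used in this paper. The function name `total_variation` is hypothetical.

```python
def total_variation(p, q):
    """Total variation distance between two normalized histograms.

    Both inputs are lists of equal length whose entries sum to 1.
    The result is 0 for identical histograms and 1 for histograms
    with no overlapping mass.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of groups")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

Under this measure, a smaller distance between two corresponding histograms (for instance, the A histograms of two sequences) would indicate greater similarity, so the pairwise distances could be used to rank which sequence is closest to a given one.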
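The notion of a center of gravity lying to the left or right can be made concrete with a short sketch. The following Python code computes the frequency-weighted mean group index of a histogram and compares it with the midpoint of the occupied groups. The choice of reference point (the midpoint between the first and last non-empty group) is an assumption, since the paper judges this visually, and the function names `center_of_gravity` and `lean` are hypothetical.

```python
def center_of_gravity(hist):
    """Frequency-weighted mean group index of a histogram (counts or fractions)."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram is empty")
    return sum(k * h for k, h in enumerate(hist)) / total


def lean(hist, tol=1e-9):
    """Classify a histogram as "left", "right", or "centered".

    The center of gravity is compared with the midpoint of the
    occupied range of groups (assumed reference point).
    """
    center = center_of_gravity(hist)
    occupied = [k for k, h in enumerate(hist) if h > 0]
    midpoint = (occupied[0] + occupied[-1]) / 2
    offset = center - midpoint
    if offset < -tol:
        return "left"
    if offset > tol:
        return "right"
    return "centered"
```

A symmetric histogram such as those in Figure 2 would be classified as "centered" by this sketch, while a histogram with most of its mass near the low end of its occupied range would be classified as "left".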

Conclusion
Multiple probability measures [7] are mapped as 1D histograms. The frequency of each group and the relationship between the frequencies of the groups can be observed. By projecting a large number of SARS-CoV-2 genomes, it is possible to analyze regularities and anomalies in the number of certain symbols or symbol combinations in the sequences. The histograms obtained from most sequence projections are similar to one another and exhibit an approximately normal distribution. Variant maps reflecting more diverse information will continue to be explored in subsequent studies.

Figure 2. Some histograms with symmetry.
Figure 3.