2D Similarity Map of Multiple Coronavirus Gene Sequences

The outbreak of a novel coronavirus (SARS-CoV-2) spread to many countries from late 2019 through 2020, infecting millions of people and causing serious social disruption and significant demands on human and material resources worldwide. The novel coronavirus is an RNA virus, and RNA mutation is common in nature, which makes it extremely difficult to develop a vaccine in a short period. The virus evolves through continual mutation, in which certain sequence changes associated with time and environment follow similar distributions. A large number of genomes have been collected in various open-source databases for scientists to explore further. In this paper, a 2D similarity comparison scheme on the A2 module of the MAS is proposed for extracting internal information from a genome partitioned into M segments and for providing visual results based on probability measures and quantitative statistics. First, a genome is segmented and transformed into numerical form, the numbers of the four meta symbols in each segment are counted, and the corresponding probability measures are calculated. Second, the probabilities are transformed into polar coordinates, and the polar coordinates are mapped into an M × M matrix. In this way, a 1D genome is processed into 2D measures that carry similarity properties along the sequence. Relevant similarity results are then analyzed through this correlation matrix.


Introduction
To date, more than three million cases of novel coronavirus pneumonia, caused by SARS-CoV-2, have been diagnosed according to the latest data. SARS-CoV-2 was identified as the seventh human coronavirus; the other six are HCoV-229E, HCoV-NL63, HCoV-OC43, HCoV-HKU1, severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV).
Among them, HCoV-229E, HCoV-NL63, HCoV-OC43 and HCoV-HKU1 are prevalent in the global population, accounting for approximately one third of common cold infections. Because most coronaviruses are difficult to tell apart by their surface appearance, they are distinguished by their genome characteristics.
Traditional methods of sequence analysis include sequence alignment algorithms, for example, algorithms based on dynamic programming that calculate DNA sequence similarity. With the rapid development of sequencing technology, the large volume of data has made alignment-based sequence analysis a computational bottleneck, which has motivated alignment-free methods of biological sequence analysis. Alignment-free algorithms can be divided into statistical model-based methods, geometry-based methods and complexity-based sequence comparison methods. Commonly used measures of the relationship between DNA sequences include the Markov model, Kullback-Leibler divergence (KLD), Mahalanobis distance, chaos theory, Euclidean distance and cosine distance. In addition, Chunting Zhang used geometric ideas to analyze DNA sequences and proposed the Z-curve theory, which is an equivalent representation of a genome sequence: all the information in the genome sequence can be expressed on the Z-curve.
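As a rough illustration of the alignment-free measures listed above, the sketch below compares two sequences through their base-frequency vectors using Kullback-Leibler divergence, Euclidean distance and cosine distance. The function names, the choice of single-base frequencies as features, and the small pseudocount are assumptions made for this example only; they are not taken from the methods cited in this paper.

```python
import math
from collections import Counter

BASES = "ACGT"  # assumed alphabet; other symbols are ignored


def base_frequencies(seq, pseudocount=1e-9):
    """Return the relative frequencies of A, C, G, T as a list of four floats."""
    counts = Counter(ch for ch in seq.upper() if ch in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T symbols")
    # The pseudocount (an assumption) keeps log() finite when a base is absent.
    raw = [counts[b] / total + pseudocount for b in BASES]
    norm = sum(raw)
    return [v / norm for v in raw]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def euclidean_distance(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def cosine_distance(p, q):
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return 1.0 - dot / (norm_p * norm_q)


if __name__ == "__main__":
    p = base_frequencies("AACCGGTT")
    q = base_frequencies("AAAAACGT")
    print(kl_divergence(p, q), euclidean_distance(p, q), cosine_distance(p, q))
```

All three measures are zero (up to rounding) when a sequence is compared with itself and grow as the base compositions diverge; in practice, k-mer frequencies are often used in place of single-base frequencies.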
The novel coronavirus is a pressing matter of the moment. It is imperative to determine the source of the outbreak and to take measures to analyze the virus. Comparative analysis of the virus sequences from each country helps to analyze the coronavirus from all directions. The development of sequencing technology has made it easier to obtain new coronavirus sequences, and the international open-source database GISAID has kept pace with the new data. The large number of coronavirus sequences submitted by various countries is not easy to inspect directly, and visualization technology allows the sequences to be analyzed more intuitively. For many years, visualization methods have been applied in many areas of bioinformatics, including sequence comparative analysis, phylogenetic tree construction, database search, structure comparative analysis, and gene recognition.
New nucleotide sequences of the novel coronavirus are added to the database every day, and it is impractical to compare and analyze the structure of each sequence one by one. Therefore, the accumulated data must be analyzed and screened computationally in order to reach results that would otherwise have to be judged by experiment. At present, gene sequence visualization technology is used in gene sequence similarity analysis, protein structure analysis and phylogenetic tree construction. It is therefore necessary to study methods of gene sequence visualization in bioinformatics. In this paper, a 2D similarity comparison scheme on the A2 module of the MAS is proposed, which shows that different kinds of coronaviruses have their own distinct characteristics.

Model Description
In this paper, the visualization method is combined with a probability statistics method. First, the gene sequence is segmented and the probability of each meta symbol in each segment is counted; then the probabilities are normalized and mapped into a matrix; finally, the matrix is drawn as a heat map. The specific methods are as follows.
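The segmentation and counting step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `segment_probabilities` is hypothetical, the segments are taken as m non-overlapping, near-equal-length pieces, symbols other than A, C, G, T are dropped, and the sliding window parameter used later in this paper is not modeled.

```python
from collections import Counter

BASES = "ACGT"  # the four meta symbols (assumed alphabet)


def segment_probabilities(seq, m):
    """Split a sequence into m segments and return an m x 4 table of
    base probabilities, one row [P(A), P(C), P(G), P(T)] per segment."""
    clean = "".join(ch for ch in seq.upper() if ch in BASES)
    n = len(clean)
    if m <= 0 or m > n:
        raise ValueError("m must satisfy 1 <= m <= number of valid bases")
    rows = []
    for k in range(m):
        # Integer arithmetic spreads any remainder over the segments.
        start = k * n // m
        end = (k + 1) * n // m
        segment = clean[start:end]
        counts = Counter(segment)
        rows.append([counts[b] / len(segment) for b in BASES])
    return rows


if __name__ == "__main__":
    for row in segment_probabilities("ATGCGATACGCTTGAGGCTA", 4):
        print(row)
```

Each row sums to 1, so a single column (for example, the T probabilities) gives one value per segment, that is, a 1D series of length m that can then be mapped into an m × m matrix.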
A set of sequences with values in [0, 2] is obtained and visualized by using the Gramian angular field method. First, a time correlation matrix is created for each pair $(X_i, X_j)$. Second, the time series is rescaled into a range [a, b], where −1 ≤ a ≤ b ≤ 1. Third, the polar coordinates of the scaled time series are calculated. Finally, the cosine of the angle sum, which gives the Gramian angular summation field (GASF), and the sine of the angle difference, which gives the Gramian angular difference field (GADF), are calculated. The steps are as follows.
After the rescaled time series is transformed into a polar coordinate system, the time correlation within different time intervals can be identified from an angular perspective by considering the trigonometric sum and/or difference between each pair of points. The GAFs are defined in equations (3)-(10).
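The rescaling, polar encoding and field construction described above can be sketched in a few lines. This is only an illustrative sketch using the standard definitions GASF = cos(φ_i + φ_j) and GADF = sin(φ_i − φ_j); the function names and the default range [−1, 1] are assumptions for the example and are not taken from the implementation used in this paper.

```python
import math


def rescale(values, a=-1.0, b=1.0):
    """Min-max rescale a series into [a, b], with -1 <= a < b <= 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled")
    return [a + (b - a) * (v - lo) / (hi - lo) for v in values]


def gramian_angular_fields(values):
    """Return (GASF, GADF) as nested lists for a 1D series."""
    scaled = rescale(values)
    # Polar encoding: each value becomes an angle phi = arccos(x).
    # Clamping guards against rounding slightly outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    gasf = [[math.cos(pi + pj) for pj in phi] for pi in phi]
    gadf = [[math.sin(pi - pj) for pj in phi] for pi in phi]
    return gasf, gadf


if __name__ == "__main__":
    gasf, gadf = gramian_angular_fields([0.1, 0.4, 0.2, 0.3])
    for row in gasf:
        print([round(v, 3) for v in row])
```

The GASF is symmetric, while the GADF is antisymmetric with a zero diagonal, so the two fields emphasize different aspects of the same series.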
The calculation formula of GASF is (7):

$$\cos(\theta_1 + \theta_2) = \cos(\arccos(x) + \arccos(y)) = \cos(\arccos(x)) \cdot \cos(\arccos(y)) - \sin(\arccos(x)) \cdot \sin(\arccos(y)) \tag{9}$$

The mapping matrix used in this paper is similar to the Gram matrix in linear algebra. The Gram matrix of a family of vectors $v_1, \ldots, v_n$ in an inner product space is the symmetric matrix of their inner products, with elements $G_{i,j} = (v_j \mid v_i)$. Gram matrix theorem: let $X$ be an inner product space, $u_1, u_2, \ldots, u_n \in X$ matrix:

Data processing with the Gram-like angular field:
A. Get the data s.
B. Compress its data range to [−1, 1].
C. Select the sin or cos method for processing.
D. Use the piecewise aggregation approximation (PAA) to reduce the size of each time series.
E. Flatten the image to one dimension.
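Steps A to E can be sketched as one small pipeline. This is a hedged illustration, not the implementation used in this paper: the names `paa` and `gaf_pipeline` are hypothetical, the chunking rule for PAA (near-equal chunks averaged) is an assumption, and the field is built from the expanded form of the angle formulas, using sin(arccos(x)) = √(1 − x²), rather than by calling arccos explicitly.

```python
import math


def paa(values, size):
    """Piecewise aggregation approximation: mean of `size` near-equal chunks."""
    n = len(values)
    if not 0 < size <= n:
        raise ValueError("size must satisfy 1 <= size <= len(values)")
    out = []
    for k in range(size):
        chunk = values[k * n // size:(k + 1) * n // size]
        out.append(sum(chunk) / len(chunk))
    return out


def gaf_pipeline(data, size, method="cos"):
    """Return (field, flat): a size x size Gram-like field and its 1D form."""
    lo, hi = min(data), max(data)                      # step A: the data s
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled")
    scaled = [2.0 * (v - lo) / (hi - lo) - 1.0 for v in data]  # step B
    reduced = paa(scaled, size)                        # step D
    # sin(arccos(x)) = sqrt(1 - x^2); max() guards against tiny negatives.
    comp = [math.sqrt(max(0.0, 1.0 - x * x)) for x in reduced]
    pairs = list(zip(reduced, comp))
    if method == "cos":                                # step C: cos(t_i + t_j)
        field = [[xi * xj - si * sj for xj, sj in pairs] for xi, si in pairs]
    elif method == "sin":                              # step C: sin(t_i - t_j)
        field = [[si * xj - xi * sj for xj, sj in pairs] for xi, si in pairs]
    else:
        raise ValueError("method must be 'cos' or 'sin'")
    flat = [v for row in field for v in row]           # step E
    return field, flat


if __name__ == "__main__":
    field, flat = gaf_pipeline([0, 1, 2, 3, 4, 5, 6, 7], size=4, method="cos")
    print(len(field), len(flat))
```

Applying PAA before the field is built keeps the matrix at size × size regardless of the length of the input series, which is what makes images from sequences of different lengths comparable.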

Visual Results
Typical human coronaviruses are shown in Table 1.
In the visualization model, a change in a parameter value has a great influence on the resulting graph. Whether a parameter value is too large or too small, some distribution features of the image will be hidden. The two main parameters in this paper are the segment value m and the sliding window value X. Six groups of novel coronavirus sequences were selected for comparative analysis. With m fixed, different X values produce the graphs shown in Figure 1. With X fixed, different m values produce the graphs shown in Figure 2, where panels (a1)-(a3) and (b1)-(b3) correspond to 2019-nCoV MN908947 and human coronavirus NL63 MN908947, respectively.
In visualizing the 15 seed-set sequences, repeated comparisons showed that the features displayed by the T-subsets are effective. Therefore, the 9 coronavirus sequences are all visualized as T-subsets, which gives Figure 4. Compared with the other sequences, the T projections of human coronavirus NL63 and 2019-nCoV MN908947 are shown in Figure 5.
The standard deviation is introduced to verify that the similar distributions observed by the naked eye are reflected in a quantitative difference, and it is represented by a graph. Because the standard deviation is 0 when a sequence is compared with itself, each value of 0 is replaced by the mean value of the column in which it is located, which does not affect the trend. The resulting scatter diagram of the standard deviations among the 9 sequences is shown in Figure 6. The information entropy of a sequence reflects its degree of disorder, and the larger the information entropy of an image is, the greater the amount of information it carries. Figure 7 is a line chart of the sequence information entropy and image information entropy of the 9 coronaviruses.
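The two quantitative measures used here can be sketched as follows. The exact definitions are not spelled out in this paper, so the sketch rests on several assumptions that should be read as such: the difference measure is taken to be the population standard deviation of the element-wise difference between two matrices, a zero diagonal entry is replaced by the mean of the off-diagonal entries in its column, and image entropy is the Shannon entropy of matrix values binned into `bins` equal-width bins over [−1, 1]. All function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy in bits of any iterable of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def image_entropy(matrix, bins=16, lo=-1.0, hi=1.0):
    """Entropy of matrix values after equal-width binning (assumed scheme)."""
    flat = [v for row in matrix for v in row]
    idx = [min(bins - 1, max(0, int((v - lo) / (hi - lo) * bins))) for v in flat]
    return shannon_entropy(idx)


def std_of_difference(mat_a, mat_b):
    """Population standard deviation of the element-wise difference."""
    diff = [a - b for ra, rb in zip(mat_a, mat_b) for a, b in zip(ra, rb)]
    mean = sum(diff) / len(diff)
    return math.sqrt(sum((d - mean) ** 2 for d in diff) / len(diff))


def pairwise_std_table(matrices):
    """Pairwise table; each diagonal 0 is replaced by its column's mean."""
    n = len(matrices)
    table = [[std_of_difference(matrices[i], matrices[j]) for j in range(n)]
             for i in range(n)]
    for j in range(n):
        off_diagonal = [table[i][j] for i in range(n) if i != j]
        if off_diagonal:
            table[j][j] = sum(off_diagonal) / len(off_diagonal)
    return table


if __name__ == "__main__":
    print(shannon_entropy("ATGCGATACGCTTGAGGCTA"))
    a = [[0.0, 0.0], [0.0, 0.0]]
    b = [[1.0, -1.0], [1.0, -1.0]]
    print(pairwise_std_table([a, b]))
```

For a four-letter alphabet the sequence entropy is at most 2 bits, which is reached when all four bases are equally frequent.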

Result Analysis
A change in a parameter value has a great influence on the resulting graph. In this paper, the gene sequence is visualized while the segment value, window value and coordinate value are varied in a controlled way, and the resulting distribution characteristics are displayed one by one in the graphs. When different coronavirus sequences are compared under the same control parameters, the closer the gene sequences are in the phylogenetic tree of the genomes, the more obvious the similarity of their graphical distribution characteristics.
Phylogenetic analysis of the novel coronavirus, the RNA-dependent RNA polymerase (RdRp) gene and the S gene sequence indicated that RaTG13 is a close relative of the novel coronavirus and forms a lineage distinct from other SARSr-CoVs. The novel coronavirus is highly similar to RaTG13 across the whole genome. Its complete genome sequence is 96.2% homologous to that of the SARS-like coronavirus isolated from China, and its nucleotide similarity to bat-SL-CoVZC45 (GenBank MG772933) is 89.1%. The distribution of bases in the novel coronavirus is similar to that of the SARS family. These similarities are reflected in the graphical distribution of the novel coronavirus.
Many published studies have also shown that the sequence of the RaTG13 coronavirus collected from bats is highly similar to the sequence of the novel coronavirus. Figure 4 likewise shows that the base distributions of these two virus sequences are highly similar to each other when compared with those of the other coronavirus sequences. The sequence information entropy shows that the entropy values of the sequences are similar. The image information entropy shows that, for the T-subset, the entropy of the NL63 sequence is the largest, indicating that the amount of information obtained is also the largest.

Conclusion
The visual graphs in this paper show that the distribution of bases in each coronavirus has its own characteristics, which is convenient for the analysis of virus sequences. The focus of this research is the processing of several typical human coronavirus sequences. The nine selected coronavirus sequences are of great value in coronavirus research, and the data come from the NCBI database. The sequences are first preprocessed and then passed through the variant visualization model to produce a two-dimensional matrix. The matrix is used in two ways: it is mapped directly for intuitive comparative analysis, and it is mapped into a graph that can be used to measure the similarity between sequences. The standard deviation is then introduced to measure the base distribution of the sequences; it can be calculated by subtracting the matrix values. The graphical results in this paper include a comparison of graphs under changes in the control parameters and a comparison of graphs of different virus sequences under the same parameters.
In the visual results, both the common distribution characteristics of the virus sequences and their individual characteristics are observed, and the base distributions of sequences from the same type of virus show a certain similarity. With the standard deviation measurement, the distributions of the original sequences can be compared one by one. Accordingly, the model in this paper is feasible.