Comparative Study of Pathogenic Viruses Carried on Pairs of Species

The new coronavirus was first detected on December 12, 2019, and spread rapidly, becoming a global public health event. The source of the virus remains controversial. In this paper, a series of SARS-CoV-2 genomes were collected and visualized using the A1 module of the metagenomic analysis system (MAS). Pairs of genomes are compared to assess the similarity between SARS-CoV-2 and other deadly viruses carried by different species. The proposed method of variant construction provides important information for understanding similarity properties among genomes. The comparison mechanism offers an efficient and fast way to compare whole genomes at multiple levels of hierarchical measurement, providing, to a certain extent, information on internally correlated variation. Sample results are expressed intuitively as 1D line charts of the various distributions.


Introduction
At the beginning of 2020, the global pandemic of the new coronavirus cast a shadow over the new year, and countries around the world are actively responding to it. However, the origin of the new coronavirus is still highly controversial. Combining metagenomics [1][2][3][4][5][6][7][8][9] with sequence alignment is an effective way to investigate it.
The metagenomic analysis system (MAS) comprises fifteen modules in three groups {A, B, C}, which provide distinct capabilities for a wide range of applications. This article demonstrates the A1 module of the MAS in a practical application and discusses the relationships between the deadly viruses carried by different species.
Many comparison methods are currently available, such as the alignment-based Needleman-Wunsch algorithm [10] and Smith-Waterman algorithm [11], and alignment-free algorithms built around k-mers [12]. Using these algorithms, much research has been conducted on the source and the intermediate host of the new coronavirus, and some results have been achieved. However, these comparison methods have high computational complexity.
Sequence alignment lets us judge the similarity between sequences and, from the degree of similarity, whether they are homologous [13]. Highly similar sequences are more likely to be homologous, and homologous sequences are usually highly similar. Neither implication always holds, but both hold in most cases, so inferring homology from similarity is a well-accepted approach.
There are many challenges in finding homologous sequences, and even viruses that may come from an intermediate host. It is very important to screen and judge the virus sequences effectively. Considering the special importance of Koch's postulates in the genomics era [14,15], proper techniques are needed to resolve this type of difficulty. With a sea of biological data everywhere [16], it is a major challenge to make meaningful pictures emerge from such seemingly meaningless datasets.
In this paper, the similarity relationships between deadly viruses carried by different species are analyzed on the basis of variant theory, combined with the random characteristics of gene sequences. Differences in the direction of variation are compared to provide an efficient and fast similarity model that can compare entire genomes at multiple levels of measurement. The model thereby provides, to a certain extent, internally correlated variation information and offers a new perspective for research on the new coronavirus.
Variant theory is based on classical logic [17], and variant mapping is performed by the variant logic function. Variant conversion, variant measurement, and variant projection together form a complete variant measurement system. In this paper, 1D line charts are used for variant projection to show the correlation between sequences clearly and intuitively. Variant theory has produced a series of results. In 2018, a monograph [18] consolidating the work to that stage was published; it introduces the system in detail and elaborates on its applications.

Aim of the Study
By analyzing the similarity relationships between different sequences, we obtain a comparison of the deadly viruses carried by different species with the new coronavirus. By comparing different mutation results of the same virus, we obtain the characteristics of the virus as it mutates and, to a certain extent, internally correlated mutation information. The mutation pattern of the new coronavirus is also explored.

Materials and Methods
The material consists of viral gene sequences downloaded from NCBI and GISAID. Deadly viruses from bats, pangolins, and pigs that can infect humans are selected for comparison with the new coronavirus.
The core method is a probabilistic statistical model based on variant construction. Through segmentation and counting, the comparison results are projected into one dimension and drawn as 1D line charts.

Input and Segment Statistics
The main function of this part is to count and save, for each segment of the two sequences being compared, the number of each of the four bases, providing the measurement basis for the next module.
Suppose the two sequences participating in the alignment, $S_1$ and $S_2$, have lengths $L_1$ and $L_2$ respectively, the segment length is $m$, and $N_1$ and $N_2$ are the numbers of segments, so that $S_1 = a_1 a_2 \dots a_{N_1}$ and $S_2 = b_1 b_2 \dots b_{N_2}$. For each segment in turn, the numbers of bases A, T, C, G are counted and recorded as $\{M_A, M_T, M_C, M_G\}$.
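The segment statistics step can be sketched as follows. This is a minimal illustration rather than the MAS implementation: the function name segment_counts is hypothetical, and the handling of a trailing partial segment (dropped here) is an assumption, since the text does not specify it.

```python
from collections import Counter

BASES = "ATCG"


def segment_counts(sequence, m):
    """Split `sequence` into consecutive segments of length m and
    count the bases A, T, C, G in each segment."""
    if m <= 0:
        raise ValueError("segment length m must be positive")
    sequence = sequence.upper()
    # Assumption: a trailing segment shorter than m is dropped.
    n_segments = len(sequence) // m
    counts = []
    for i in range(n_segments):
        tally = Counter(sequence[i * m:(i + 1) * m])
        # Keep only the four bases; other symbols (e.g. N) are ignored.
        counts.append({base: tally.get(base, 0) for base in BASES})
    return counts


# Toy sequence with segment length m = 4 gives three segments.
example_counts = segment_counts("AATTCCGGAAAT", m=4)
```

Each entry of the returned list plays the role of {MA, MT, MC, MG} for one segment.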

Data Measurement
The main function of this part is to use the counts obtained in the segment statistics module to calculate, for each base, the ratio between corresponding segments of $S_1$ and $S_2$, building a ratio set that serves as the data basis for the projection module.
The selected gene sequences are similar in length, but individual differences remain. We therefore choose $N = \min(N_1, N_2)$ and delete the excess segments at the end of the longer sequence so that the data participating in the comparison have the same length. This affects the integrity of the sequence to a certain extent, but for a global alignment its impact can be ignored.
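The truncation to N = min(N1, N2) amounts to a few lines. The sketch below assumes that each sequence has already been reduced to a list with one entry per segment; the function name truncate_to_common_length is hypothetical.

```python
def truncate_to_common_length(segments_1, segments_2):
    """Keep the first N = min(N1, N2) segments of each list,
    discarding the excess segments at the end of the longer one."""
    n = min(len(segments_1), len(segments_2))
    return segments_1[:n], segments_2[:n]


# Toy example: N1 = 5 and N2 = 3, so N = 3.
kept_1, kept_2 = truncate_to_common_length(
    ["a1", "a2", "a3", "a4", "a5"],
    ["b1", "b2", "b3"],
)
```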
In the calculation, $M_{a_1 A}$ denotes the number of A bases in segment $a_1$, and $R_A$ denotes the set of ratios for base A, formed from the A counts of corresponding segments of $S_1$ and $S_2$.
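A possible reading of the ratio calculation is sketched below. It assumes that each ratio is the count of a base in a segment of S1 divided by the count of the same base in the corresponding segment of S2, which is consistent with identical sequences giving a ratio of 1. The function name base_ratios and the use of NaN when the S2 count is zero are assumptions, not part of the described method.

```python
import math

BASES = "ATCG"


def base_ratios(counts_1, counts_2):
    """For each base, return the list of per-segment ratios
    (count in S1 segment) / (count in S2 segment)."""
    ratios = {base: [] for base in BASES}
    # zip stops at the shorter list, i.e. at N = min(N1, N2) segments.
    for seg_1, seg_2 in zip(counts_1, counts_2):
        for base in BASES:
            denominator = seg_2[base]
            # Assumption: the ratio is undefined (NaN) when the S2 count is zero.
            value = seg_1[base] / denominator if denominator else math.nan
            ratios[base].append(value)
    return ratios


# One-segment toy example with hand-written counts.
toy_ratios = base_ratios(
    [{"A": 2, "T": 1, "C": 1, "G": 0}],
    [{"A": 1, "T": 1, "C": 2, "G": 0}],
)
```

With segment lengths in the hundreds of bases, a zero count should be rare, so the NaN branch is mainly a safeguard.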

Visualization and Output
The main function of this part is to visualize the ratio data set obtained by the data measurement module as a line chart, which is then analyzed. The similarity between sequences is assessed mainly qualitatively, from how tightly the curves are entangled. Figure 1 lists typical illustrations of different similarity situations: (a) the two sequences are homologous and identical, and the four curves representing the different bases coincide in a straight line parallel to the X axis at ordinate 1; (b) the two sequences are non-homologous and differ greatly, and the curves are sparse and disordered with large overall fluctuation; (c) the two sequences are homologous and extremely similar, and the curves are tightly entangled with little overall fluctuation; (d) the two sequences may be homologous but their similarity is relatively low, and the curves are tightly entangled but still show some fluctuation.
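A line chart of this kind could be drawn as in the sketch below. It assumes the ratio data are held as one list of per-segment ratios per base and that matplotlib is available; the function names, axis labels, and output file name are illustrative choices rather than those of the MAS.

```python
def ratio_series(ratios):
    """Turn {base: [ratio per segment]} into {base: (x, y)},
    where x is the 1-based segment index."""
    return {
        base: (list(range(1, len(values) + 1)), list(values))
        for base, values in ratios.items()
    }


def plot_ratio_curves(ratios, path="ratios.png"):
    """Draw one curve per base and save the chart to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for base, (x, y) in ratio_series(ratios).items():
        ax.plot(x, y, label=base)
    # Reference line at 1, where identical sequences would lie.
    ax.axhline(1.0, linestyle="--", linewidth=0.8)
    ax.set_xlabel("segment index")
    ax.set_ylabel("base ratio (S1 / S2)")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)


# Toy ratios for three segments.
toy = {"A": [1.0, 1.1, 0.9], "T": [1.0, 0.95, 1.05],
       "C": [1.0, 1.0, 1.2], "G": [1.0, 0.8, 1.0]}
series = ratio_series(toy)
```

Calling plot_ratio_curves(toy) writes the chart to ratios.png; curves that stay close to the dashed line correspond to the tightly entangled cases described above.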

Results and Discussions
Result
Figure 2 and Figure 3 show, respectively, the comparison results for viruses collected from different species and for viruses collected from different individuals of the same species. Figure 2 shows that the degree of similarity clearly differs between viruses.

Discussions
Analyzing Figure 2 (a)-(c), it can be observed that SARS-CoV-2 has a very high similarity with the virus carried by pangolins, because the sample here is a variant coronavirus collected from pangolins. It has only a high similarity with the SARS virus carried by bats, and it differs significantly from PDcov.
Analysis of Figure 2 (d)-(e) shows that the pangolin-carried virus compares with the bat-carried SARS virus and with PDcov in much the same way as SARS-CoV-2 does.
Analyzing Figure 2 (c), (e), and (f), it can be observed that PDcov has a low similarity with the viruses carried by the other three species. Analysis of Figure 3 (a)-(c) shows that (a) SARS-CoV-2 from different countries varies at the overall level, but the variation is distributed fairly uniformly; (b) the SARS virus collected from bats also varies at the overall level, but the high variation is concentrated in small regions at the head and tail of the sequence; (c) the variation of PDcov from pigs is concentrated at a few sites, which appear as distinct small protrusions.

Conclusion
Experimental comparison shows that SARS-CoV-2 is highly homologous with the SARS virus but has no significant homology with PDcov.
The mutation patterns of the three viruses differ clearly. This suggests that, for sequences with high homology, the distribution of mutation sites can be examined to further determine the relationship between them, which may help in finding more reliable intermediate hosts.