Visualizations of Topologic Entropy on SARS-CoV-2 Genomes in Multiple Regions

The outbreak of the novel coronavirus (SARS-CoV-2) developed into a global pandemic within a few months. Recent studies found that the virus belongs to the beta coronavirus family and is highly similar to Pangolin CoV and BatCoV RaTG. Such studies support virus tracing and vaccine development. Beyond the subgenus classification of the virus, it is worth examining mutations and their transmission in different regions, since new mutations may affect the symptoms of the disease and the effectiveness of vaccination. This paper produces error bar and scatter graphs with the support of the metagenetic analysis system (MAS). Using SARS-CoV-2 genomes from different countries and regions as input datasets, topological entropy values computed by the C4 module provide global characteristic quantities for visualization. Sample results show that the method is a useful way to integrate all genomes consistently on one genomic index map, on which the various countries occupy specific positions and projections under topological entropy. Further exploration is required.


Introduction
The outbreak of the novel severe acute respiratory syndrome coronavirus (2019-nCoV, SARS-CoV-2) [1] has developed into a global pandemic [2]. Seven coronaviruses are known to infect humans. Four of them, HCoV229E, HCoVNL63, HCoVHKU1, and HCoVOC43, usually cause cold symptoms in immunocompetent individuals. Two others, SARS-CoV (Severe Acute Respiratory Syndrome Coronavirus) and MERS-CoV (Middle East Respiratory Syndrome Coronavirus), are of zoonotic origin and cause severe respiratory disease and death [3]. The new coronavirus SARS-CoV-2 has been found to belong to the beta coronavirus family [4], which also includes SARS-CoV and MERS-CoV. Guo et al. proposed a virus-host prediction method based on deep learning [5], which takes a viral DNA sequence as input and predicts its potential infection hosts. SARS-CoV-2 is closely related to other coronaviruses, especially SARS-CoV, Bat-SARS and MERS-CoV.
Genomic analysis indicates that SARS-CoV-2 has a natural origin: its genome is highly similar to that of bat coronaviruses, and bats may be its natural host [6]. The ten 2019-nCoV genomic sequences obtained by Lu and Zhao from 9 patients are very similar, showing more than 99.98% sequence identity. Notably, 2019-nCoV is closely related to two bat-derived SARS-like coronaviruses, bat-SL-CoVZC45 and bat-SL-CoVZXC21, collected in Zhoushan, eastern China, in 2018 (88% identity), and is more distant from SARS-CoV (approximately 79%) and MERS-CoV (approximately 50%) of the same virus family [7]. Because SARS-CoV-2 belongs to the same family as SARS-CoV and MERS-CoV, they have many similarities. The structure and pathogenicity of SARS-CoV-2 are very similar to those of SARS-CoV. SARS-CoV-2 spreads faster than SARS, but its lethality is relatively lower [8]. Furin protease inhibitors may be a potential drug therapy for SARS-CoV [9]. Other studies have also found that the novel coronavirus is closely related to the severe acute respiratory syndrome-like coronaviruses bat-SL-CoVZC45 and bat-SL-CoVZXC21, which are derived from two species of bats [10].
Some studies suggest that Pangolin-CoV, carried by pangolins, is likely to provide a natural gene pool for the novel coronavirus. At the genome-wide level, Pangolin-CoV is 91.02% similar to SARS-CoV-2 and 90.55% similar to BatCoVRaTG. The S1 protein of Pangolin-CoV is much more closely related to that of SARS-CoV-2 than to that of RaTG13 [11]. A recent study [12] has shown that the critical atomic interactions between the receptor-binding domain (RBD) of the SARS-CoV-2 spike protein and the host receptor angiotensin-converting enzyme 2 (ACE2) help to regulate the spread of COVID-19 across species and among humans.

Aim of the study
As of April 26, 2020, the epidemic had spread to 209 countries, with more than 1.8 million confirmed cases and more than 200 thousand deaths. The epidemic has not been adequately controlled worldwide, and the number of confirmed cases is still increasing [13]. The study of virus sequence characteristics is important for further virus classification, surveillance, vaccine development and treatment.
At present, most studies focus on classifying and analyzing the subgenus of the virus and on finding the closest virus cluster, which gives direction to virus tracing, vaccine development and treatment. However, analysis of individual SARS-CoV-2 populations in different regions is lacking. Because the virus relies on RNA-dependent RNA polymerase (RdRp) and has a high mutation rate [14], new mutations are likely to affect the symptoms of the disease and the efficacy of vaccination. Li found 19 lethal new mutations in the novel coronavirus. She pointed out that the novel coronavirus carries modifications that can affect pathogenicity, and that drug and vaccine research and development need to take these mutation factors into account [15]. It has been observed that novel coronaviruses respond to their environment by mutating and producing different variants. SARS-CoV-2 will gradually adapt to the local climate, and mutations in the virus genome play a vital role in the process of transmission. Analysis of genome characteristics shows a strong relationship between sample collection time, sample location and the accumulation of genetic diversity. That analysis found 116 mutations, some of which affected the severity and transmission of SARS-CoV-2.
In this study, the C4 output module of the MAS is implemented, and topological entropy is used as the feature of each sequence. We plot the distribution of novel coronavirus RNA sequences in multiple regions [18]. Complete sequences of SARS-CoV-2 genomes from different areas were selected to study the relationship between the geographical distribution and the sequence characteristics of virus genomes. Finally, the developmental clusters of the strains are identified [19] so that the scientific and diagnostic communities working on the coronavirus can benefit.

Data
All the sequence data used in this study are complete genomic sequences downloaded from NCBI and GISAID, each approximately 30000 bp long. Only high-quality genomes were used, so that unknown nucleotides are as few as possible and do not affect the accuracy of the experiment. The total number of sequences is 1337. The statistics for different countries are shown in Table 1. The statistics for the cities in the different regions involved in the investigation are shown in Table 2, Table 3 and Table 4.
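The selection criteria above could be applied along the following lines. This is only a sketch: the function names `read_fasta` and `is_usable`, the length window, and the 1% limit on unknown nucleotides are illustrative assumptions, since the text states only that sequences are about 30000 bp long and contain few unknown nucleotides.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def is_usable(seq, min_len=29000, max_len=31000, max_unknown_fraction=0.01):
    """Keep near-complete genomes with few unknown nucleotides.

    All three thresholds are hypothetical defaults, not values from the paper.
    """
    if not seq or not (min_len <= len(seq) <= max_len):
        return False
    unknown = sum(1 for base in seq if base not in "ACGT")
    return unknown / len(seq) <= max_unknown_fraction
```

With a file handle `fh` opened on a FASTA download, `[(h, s) for h, s in read_fasta(fh) if is_usable(s)]` would give the retained records.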

Topological Entropy
Information entropy theory is an essential tool in bioinformatics and is widely used in genome sequence analysis. Topological entropy (TE) is a kind of information entropy. Kirillova [16] describes topological entropy through the exponential growth rate that produces it. Topological entropy can distinguish simulated gene sequences from natural gene sequences. Koslicki [17] proposed an approximate calculation of topological entropy for analyzing gene sequences. He selects the substring length n and the fragment of w by analyzing the maximum point of $P_w(n)$, obtaining the exponential growth rate that is closest to the complexity of the sequence. It is defined as

$$\lim_{n \to \infty} \frac{\log_2\left(P_{w^{4^n+n-1}}(n)\right)}{n}$$

In the formula, w is the sequence, with length |w|, and n is the length of the substrings, taken in the range 4 ≤ n ≤ 16. $P_{w^{4^n+n-1}}(n)$ is the number of distinct words of length n appearing in the sequence; these n-words are the k-mers commonly used in molecular biology. In this study, the sequence length |w| is approximately 30000 bp, and the experiments use n = 7 and n = 8.
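One way to read the choice n = 7 is through the fragment length 4^n + n − 1 that appears in the subscript of the definition: it is the shortest sequence that can contain all 4^n words of length n. The sketch below assumes that n is chosen as the largest value for which a fragment of that length still fits in the sequence; the function name `max_word_length` is hypothetical.

```python
def max_word_length(seq_len: int, alphabet_size: int = 4) -> int:
    """Largest n such that alphabet_size**n + n - 1 <= seq_len.

    Assumption: this is the selection rule behind the reported n = 7.
    """
    if seq_len < alphabet_size:
        raise ValueError("sequence too short for any word length")
    n = 1
    # Grow n while the next fragment length still fits in the sequence.
    while alphabet_size ** (n + 1) + (n + 1) - 1 <= seq_len:
        n += 1
    return n


# For a genome of about 30000 bp: 4**7 + 6 = 16390 fits, 4**8 + 7 = 65543 does not.
n_for_genome = max_word_length(30000)
```

Under this reading, n = 8 exceeds the fragment bound for a 30000 bp genome, so it would have to be evaluated on the whole sequence rather than on a fragment.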

Topological Entropy Generation
TE is the C4 module among the four modules {CE, IE, ME, TE}. To generate TE, N sequences are taken as input [20], the average length of the sequences is calculated, the best K-mer length is selected, the number of distinct K-mers is computed as an intermediate quantity [21], and finally the topological entropy of each sequence is obtained [22]. The specific process is as follows.
1. The word length n is determined from the virus sequence length |w|.
2. All n-mers of the sequence are extracted; there are |w| − n + 1 of them in total.
3. Repeated n-mers are filtered out, giving the number of distinct n-mers, $P_w(n)$.
4. The topological entropy $H_w(n)$ of the sequence is calculated by substituting $P_w(n)$ into the formula.
5. Finally, the topological entropy of a country is calculated: the mean and standard deviation of the topological entropy of all sequences from that country are used to draw an error bar graph. In addition, the topological entropy values of each sequence for n = 7 and n = 8 are taken as its characteristic quantities and used as the horizontal and vertical coordinates, respectively, of a scatter diagram.
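The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MAS implementation: the function names are hypothetical, n-mers are counted over the whole sequence as in steps 2 and 3, and the sample standard deviation is used because the text does not say which variant is meant. The logarithm base is left as a parameter; the default of 4 is an assumption that bounds the value by 1 for a four-letter alphabet, in line with the reported values near 0.956, whereas the formula as printed uses base 2.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def topological_entropy(seq: str, n: int, base: float = 4.0) -> float:
    """log_base(number of distinct n-mers) / n for one sequence."""
    seq = seq.upper()
    if n < 1 or len(seq) < n:
        raise ValueError("need 1 <= n <= len(seq)")
    # Steps 2 and 3: slide over all |w| - n + 1 windows and keep distinct n-mers.
    # Windows containing unknown nucleotides are counted as they are.
    distinct = {seq[i:i + n] for i in range(len(seq) - n + 1)}
    # Step 4: substitute the distinct count into the entropy formula.
    return math.log(len(distinct), base) / n


def summarize_by_country(records, n: int = 7):
    """Step 5: map country -> (mean, sample std) of per-sequence entropy.

    `records` is an iterable of (country, sequence) pairs.
    """
    values = defaultdict(list)
    for country, seq in records:
        values[country].append(topological_entropy(seq, n))
    return {
        country: (mean(v), stdev(v) if len(v) > 1 else 0.0)
        for country, v in values.items()
    }


def scatter_point(seq: str):
    """Step 5: (x, y) coordinates from the entropies at n = 7 and n = 8."""
    return topological_entropy(seq, 7), topological_entropy(seq, 8)
```

The dictionary returned by `summarize_by_country` holds the centre and half-height of each error bar, and `scatter_point` gives one point per genome for the scatter diagram.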

Results and Discussions
Error bar diagrams and scatter diagrams were made for the different countries, and the data with city (state) information were used to make diagrams for different regions. Fig. 1(a)-(f) provide error bar diagrams of topological entropy for different countries. The topological entropy of the sequences for n = 7 is used as the ordinate.
Comparing Fig. 1(a) with the actual data shows that the average topological entropy of the different countries is approximately H(w) = 0.9560; a topological entropy of 0.9560 is likely to be an initial feature of SARS-CoV-2. In Fig. 1(b), the entropy of the genomes in most parts of the United States is higher than 0.9560, corresponding to Fig. 1(a); only a few areas, such as Washington, Minnesota and Wisconsin, are lower. New York has the highest entropy value, H(w) = 0.9561751051408733; the lowest entropy value, H(w) = 0.9558286454522853, occurs in the area with the largest error range. Fig. 1(c) shows an interesting pattern: in Wuhan, where the coronavirus was first discovered in China, the entropy value of the sequences is the largest and the error range is also the largest. In Fig. 1(d) and Fig. 1(e), Australia and Italy show characteristics similar to Wuhan: the location with the largest average entropy also has the largest error range. In Fig. 1(b), the largest margin of error in the United States is in Illinois. However, in Fig. 1(d), the average entropy value of Abruzzo, Italy, is as high as 0.9640, which is the highest entropy value observed, and its entropy error range is also the largest. Fig. 1(f) shows that the entropy of the virus sequences found in three French cities is 0.95599; the entropy value is stable across the cities, and the entropy error is 1/10000.
Fig. 2(a)-(f) show the scatter distribution of topological entropy in different countries or regions. Both n = 7 and n = 8 are used as entropy calculation parameters, and the horizontal and vertical coordinates are the corresponding entropy values. In Fig. 2(a), the scatter points of different countries, such as the United States, Belgium, Australia and China, partly overlap and cluster together, while some points from the United States are separated from the rest. Next, the cities within each country are examined. Comparing Fig. 2(b) with Fig. 2(a), the separation of the United States points is related to the sequences from New York: in Fig. 2(b), the entropy points of New York are more distinctive than those of other locations. In Fig. 2(c), the entropy points of Wuhan are widely distributed and mostly coincide with those of other cities, while the entropy points of Hangzhou are separated. Fig. 2(d) shows the entropy scatter distribution in Australia. The sequence entropy values of Queensland and Victoria cluster together, and those of Sydney and Western Australia overlap. The points in Fig. 2(e) and (f) are more evenly distributed.

Conclusion
The entropy diagrams for different countries and regions show that topological entropy is an effective way to characterize the virus sequences. First, the average entropy value of each country is stable. Second, the entropy values of different cities differ. The viruses at the site of first discovery show greater entropy richness and consistency. The entropy scatter diagrams reveal how the entropy values of different regions aggregate and separate.

Conflict of Interest
The authors declare no conflict of interest.