Similarity Comparison of Multiple Coronavirus Sequences from 2D to 1D Linearizing Transformation

Many studies on COVID-19 have been carried out, and it is of interest to apply methods and models that process the whole RNA sequence. Similarity comparison of SARS-CoV-2 genomes plays a key role in the scientific tracing of its origin, and further exploration is required. In this paper, an innovative transformation from a 2D density matrix to a 1D measuring vector is proposed for visualization, based on the A5 module of the MAS. The core transformation projects the whole RNA sequences of multiple coronaviruses into 2D matrices and then forms 1D measuring vectors on variant maps. The relationships among SARS-CoV-2 genomes are compared by their similarity properties, and a genomic index of entropy quantities is applied to classify the results into groups.


Introduction
Since the outbreak of COVID-19 in Wuhan, China, in December 2019, the epidemic has lasted more than four months, and more than 100 countries have been affected successively. According to the World Health Organization (WHO) situation report of April 22, 2020, the cumulative number of global diagnoses was 471136 and the death toll was 169006 [1]. Genomic studies of SARS-CoV-2 help to clarify the origin and evolution of the virus, the development and spread of the disease, clinical diagnosis and treatment, and the development of emergency antiviral drugs and antibody drugs [2]-[6]. Thousands of SARS-CoV-2 genomes from many countries can be found on the GISAID website. Homology modeling is mainly used to explore the possible receptor-binding characteristics of viruses, and it is the main method for comparing gene sequence similarities. One existing study compared SARS-CoV-2 sequences from 6 patients in Wuhan with SARS and MERS sequences [7]. Another used 9 gene sequences and found that SARS-CoV-2 is similar to SARS [8], and a third reached the same conclusion using only 5 sequences [9]. In addition, the large-scale sequence analysis tool I-MLCS and similar algorithms have been used in one paper to compare similarities between sequences [10]. Existing research lacks exploration of the entire RNA sequence of SARS-CoV-2; how to use whole-sequence similarity to compare viruses is therefore a question worth considering [12].
This paper proposes using PMLP-V, based on a variant system [13], to process the entire gene sequence. This visualization method is an innovative transformation from a 2D density matrix to a 1D measuring vector and is based on the A5 module of the MAS. As an emerging technique, its main idea is to use 4-ary symbols as a meta-structure to handle random sequences ranging from cryptographic sequences and DNA/RNA to ECG signals [14], and to observe the global statistical distribution of a sequence from an overall perspective. In the PMLP part, the basic mode starts with sequence input and ends with 1D variant map output. From the variant maps, the relationships among SARS-CoV-2 genomes are compared by their similarity properties. Finally, information entropy is used to verify the variant map results and to classify them into groups.

PMLP-V
The variant system comprises three major theories: variant logic, variant measurement, and variant maps [14]. Applying it in the field of big data is an innovative line of research, and the variant construction is well suited to sequence processing.
The processing of RNA sequences based on variant logic consists of three main parts: sequence input, module processing (PMLP), and 1D diagram output with verification. The basic framework is shown in Fig. 1, where PMLP stands for Processing, Measurement, Linearization, and Projection, and V stands for Verification.
Any one virus sequence is entered into the program. In the processing module, a fixed length k is used to divide the whole sequence into several segments. In the measurement module, each segment is used as one unit for counting bases. The counts of the base combinations AT and AC in a segment are taken as the position coordinates in the matrix: the first time a position occurs, its value is set to 1, and each further occurrence adds 1 to the existing value. The 2D density matrix is output after all segments have been traversed. In the linearization module, the 2D density matrix is transformed into a 1D measuring vector. The measuring vector is then projected as a 1D variant map in the projection module. Finally, the results are verified and classified into groups.
A. Processing
The main job of the processing module is to segment the entire RNA sequence so that every processed subsequence has the same length.
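The segmentation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_into_segments` is hypothetical, and it assumes that segments are consecutive and non-overlapping and that a trailing remainder shorter than k is dropped so all segments have equal length.

```python
def split_into_segments(sequence: str, k: int) -> list[str]:
    """Split a sequence into consecutive, non-overlapping segments of length k.

    Assumption (not stated in the paper): a trailing remainder shorter
    than k is discarded so that every segment has the same length.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    usable = len(sequence) - len(sequence) % k  # largest multiple of k
    return [sequence[i:i + k] for i in range(0, usable, k)]


# Toy input for illustration only.
segments = split_into_segments("ATTAAAGGTTTATACC", 4)
print(segments)  # ['ATTA', 'AAGG', 'TTTA', 'TACC']
```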
B. Measurement
The measurement module obtains a 2D density matrix by counting the bases of each subsequence. The base counts are used as the horizontal and vertical coordinates of the density matrix. This article mainly counts the pair of base combinations AT and AC, with (num_AT, num_AC) as the row and column of the density matrix. The first time a position occurs, it is recorded as 1; each further occurrence adds 1 to the existing value.
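A possible reading of the measurement step is sketched below. The paper does not spell out how num_AT and num_AC are computed, so this sketch assumes that num_AT is the number of bases in a segment that are A or T, and num_AC the number that are A or C; the function name `density_matrix` is hypothetical. With segments of length k, both counts lie between 0 and k, so a (k+1) x (k+1) matrix suffices.

```python
def density_matrix(segments: list[str], k: int) -> list[list[int]]:
    """Count how many segments fall on each (num_AT, num_AC) position.

    Assumption: num_AT counts bases that are A or T, and num_AC counts
    bases that are A or C. Every segment is expected to have length k.
    """
    matrix = [[0] * (k + 1) for _ in range(k + 1)]
    for segment in segments:
        num_at = sum(base in "AT" for base in segment)
        num_ac = sum(base in "AC" for base in segment)
        # First occurrence sets the cell to 1; later ones add 1.
        matrix[num_at][num_ac] += 1
    return matrix


# Toy segments for illustration only.
matrix = density_matrix(["ATTA", "AAGG", "TTTA", "TACC"], 4)
for row in matrix:
    print(row)
```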
C. Linearization
The main function of linearization is to transform the 2D density matrix into a 1D measuring vector by retaining the valid (non-zero) values and deleting all 0 elements. The output of this operation is a one-dimensional matrix, that is, the measuring vector.
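The linearization step can be illustrated with a short sketch. The traversal order is not specified in the text, so row-by-row (row-major) order is assumed here, and the function name `linearize` is hypothetical.

```python
def linearize(matrix: list[list[int]]) -> list[int]:
    """Flatten a 2D density matrix into a 1D measuring vector.

    Assumption: cells are visited row by row; zero entries are dropped
    and non-zero entries keep their relative order.
    """
    return [value for row in matrix for value in row if value != 0]


# Toy matrix for illustration only.
vector = linearize([[0, 2, 0],
                    [1, 0, 0],
                    [0, 0, 3]])
print(vector)  # [2, 1, 3]
```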
D. Projection
This module visualizes the RNA sequence: the whole sequence is projected to a 1D variant map.

E. Verification
The main work in this module is to use information entropy to verify the map results and classify them. To distinguish the similarity of viral sequences effectively, a smaller k value is recommended.

Entropy curves
Information entropy can be used as a measure of system complexity: the more complex the system, the larger the entropy.
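As a minimal sketch of this idea, the Shannon entropy of a vector of counts can be computed as below. The paper does not give its exact entropy formula, so the base-2 Shannon definition and the function name `shannon_entropy` are assumptions made for illustration.

```python
import math


def shannon_entropy(counts: list[int]) -> float:
    """Shannon entropy (in bits) of the distribution given by raw counts.

    Assumption: counts are normalized to probabilities p_i, and the
    entropy is H = -sum(p_i * log2(p_i)) over the non-zero entries.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# An evenly spread distribution has higher entropy than a skewed one.
print(shannon_entropy([5, 5]) > shannon_entropy([9, 1]))  # True
```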
Each curve corresponds to a sequence. The ordinate represents the average information entropy, and the abscissa represents the value of the fixed parameter k = 2, 2^2, 2^3, 2^4, 2^5, 2^6, 2^7, 2^8. Each parameter value corresponds to one average information entropy. That is, a complete viral RNA sequence corresponds to 8 average information entropies, which are then fitted into a curve and output.
The curves of China-COVID-19 and USA-COVID-19 have the highest coincidence, indicating that the gene distribution of the two groups is visually similar. However, there is also a slight difference, and it is speculated that SARS-CoV-2 exhibits gene recombination and mutation. Except for USA-COVID-19, the difference between Pangolin (purple line) and China-COVID-19 is the smallest, indicating that there is not much difference in the proportion of bases between them. The internal complexity of the two systems is similar, as are the gene sequences.
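The construction of one entropy curve can be sketched as follows. The paper does not define how the "average information entropy" is computed, so this sketch assumes it is the Shannon entropy of the normalized non-zero cells of the (num_AT, num_AC) density matrix for each k, with num_AT and num_AC taken as counts of bases in {A, T} and {A, C}. The names `entropy_for_k` and `entropy_curve` are hypothetical, and the input is a random toy sequence, not a real genome.

```python
import math
import random
from collections import Counter

K_VALUES = [2 ** n for n in range(1, 9)]  # k = 2, 4, ..., 256


def entropy_for_k(sequence: str, k: int) -> float:
    """Entropy of the (num_AT, num_AC) cell distribution for segment length k.

    Assumptions: non-overlapping segments, trailing remainder dropped,
    base-2 Shannon entropy over the non-zero density-matrix cells.
    """
    usable = len(sequence) - len(sequence) % k
    cells = Counter()  # non-zero cells of the density matrix
    for i in range(0, usable, k):
        segment = sequence[i:i + k]
        num_at = sum(base in "AT" for base in segment)
        num_ac = sum(base in "AC" for base in segment)
        cells[(num_at, num_ac)] += 1
    total = sum(cells.values())
    entropy = 0.0
    for count in cells.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def entropy_curve(sequence: str) -> list[tuple[int, float]]:
    """Return the eight (k, entropy) points that make up one curve."""
    return [(k, entropy_for_k(sequence, k)) for k in K_VALUES]


# Toy input only: a seeded random sequence standing in for a genome.
rng = random.Random(0)
toy_sequence = "".join(rng.choice("ACGT") for _ in range(30000))
for k, h in entropy_curve(toy_sequence):
    print(k, round(h, 3))
```

Curves for several sequences computed this way could then be drawn on shared axes to compare how closely they coincide.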

Analyses
Final result: As the analyses above show, SARS-CoV-2 is similar to Pangolin, and the two may be homologous sequences.

Conclusion
This paper proposes using a variant logic system to process virus genomes and transforms RNA data into a 1D variant map. The main idea is to use visual analysis and special transformation methods to compare genome similarity. Finally, the comparison results are demonstrated using information entropy curves. The analysis shows that SARS-CoV-2 genomes are highly similar to the Pangolin virus, which is consistent with existing research results [10].
Variant logic has great advantages in processing big data: its processing flow is simple, its data loss is small, and its output is satisfactory. It offers a new approach to processing big data.