Protein Coding of Variations on SARS-CoV-2 Genomes in Various Regions


In this paper, SARS-CoV-2 genomes from COVID-19 cases in different regions are compared. The genomes are segmented and replaced by sequence operations under a protein coding scheme on the A3 module of the MAS. Under this scheme, each genome is transformed and projected into measuring sequences, that is, vectors that can be visualized as maps from two perspectives: the elements of the gene sequence and the positions of those elements, so that the genome can be interpreted more comprehensively. A series of linear diagrams makes it convenient to compare and analyze the genomes of samples collected in different regions more intuitively, which may be conducive to further data mining of genomic information and to refined explorations of COVID-19 for patients.


Introduction
In December 2019, an outbreak of COVID-19 pneumonia began in Wuhan, China. Its clinical symptoms differ from those of the SARS outbreak in 2003, so it was inferred that the causative virus might be a new coronavirus [1]. In just four months, the new coronavirus has swept the world: more than one hundred countries or regions have been affected, the cumulative number of confirmed cases has exceeded two million, and the number of deaths has exceeded two hundred thousand.
Interpretation of the viral gene sequence is the key to defeating the epidemic. It can provide us with methods and ideas to fight the epidemic and provide a corresponding basis for its treatment.

Aim of the Study
Using the framework of the variable value theory system, the element statistics and element position statistics of SARS-CoV-2 gene sequences are displayed through visualization methods in order to observe and analyze the distribution of data features.

Materials and Methods
In this paper, the gene sequence is the input and the corresponding value map is the output. The architecture is divided into four modules: the segment operation module, the replacement operation module, the statistics module, and the visualization module. The data processing flow under this architecture is shown in Fig. 1.

Sequence Segment Operation Module
This module splits the whole gene sequence into a set of fixed-length sequences according to the segment length (dlen). A trailing sequence shorter than the segment length is discarded.
For example, the sequence data string is:
The selection of the segment length value will be determined according to the actual length of the gene data sequence and the visualization effect.
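The segmentation step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the segments are consecutive and non-overlapping, the function name segment_sequence is hypothetical, and the input string and the value dlen = 4 are chosen only for demonstration.

```python
def segment_sequence(sequence: str, dlen: int) -> list[str]:
    """Split a sequence into consecutive segments of length dlen.

    Assumption: segments do not overlap. A trailing fragment shorter
    than dlen is discarded, as described for the segment module.
    """
    if dlen <= 0:
        raise ValueError("dlen must be a positive integer")
    # Length of the prefix that divides evenly into segments of dlen.
    usable = len(sequence) - len(sequence) % dlen
    return [sequence[i:i + dlen] for i in range(0, usable, dlen)]


# Illustrative input only: a 15-base string cut with dlen = 4.
segments = segment_sequence("GTCCACTGGCATGGT", 4)
print(segments)  # ['GTCC', 'ACTG', 'GCAT']; the 3-base tail 'GGT' is dropped
```

In practice dlen would be tuned to the genome length and the desired visual resolution, as noted above.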

Sequence Replacement Operation Module
The replacement operation module mainly includes base replacement and position replacement, where position replacement builds on the result of base replacement. According to the different properties of the four bases, any DNA sequence can be uniquely described by three independent distributions: purines (R) and pyrimidines (Y), amino groups (M) and keto groups (K), and strong hydrogen bonds (S) and weak hydrogen bonds (W). In this paper, the three distributions are used as replacement relationships: A and G map to R, C and T map to Y; A and C map to M, G and T map to K; G and C map to S, A and T map to W.
According to these replacement relationships, one gene sequence can be mapped into three different sequences. This is the base replacement operation, which mainly targets the elements in the sequence. For example, the sequence GTCCACT GGCAT GGT can be replaced with three independent sequences, one for each replacement relationship.
There are four position relationships in any sequence after the base replacement operation, given in Eq. (1). The position replacement operation selects one specific position relationship string as the basis for judgment: substrings of the data sequence equal to it are replaced with '1' and all others with '0', so that the whole sequence becomes a sequence containing only '0' and '1'.
In summary, any data sequence that has undergone base replacement can be converted into four '01' sequences in total, one per position relationship, which serve as input to the statistics module.
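The two replacement operations can be sketched in Python as follows. The base mappings follow the standard IUPAC groupings. The position replacement is only one possible reading of the description: it assumes that a position relationship is a two-character string such as 'SW' and that every position of the sequence is compared with it through a sliding window. The names BASE_MAPS, base_replace, and position_replace are hypothetical.

```python
# Standard IUPAC groupings of the four bases.
BASE_MAPS = {
    "RY": str.maketrans("AGCT", "RRYY"),  # purine / pyrimidine
    "MK": str.maketrans("ACGT", "MMKK"),  # amino / keto
    "SW": str.maketrans("GCAT", "SSWW"),  # strong / weak hydrogen bond
}


def base_replace(sequence: str, relation: str) -> str:
    """Map each base to its class under the chosen relation.

    Characters other than A, C, G, T (for example N) are left unchanged.
    """
    return sequence.upper().translate(BASE_MAPS[relation])


def position_replace(replaced: str, pattern: str) -> str:
    """Assumed reading: emit '1' where the substring starting at a
    position equals pattern, and '0' otherwise (sliding window)."""
    k = len(pattern)
    return "".join(
        "1" if replaced[i:i + k] == pattern else "0"
        for i in range(len(replaced) - k + 1)
    )


sw = base_replace("GTCCACTGGCATGGT", "SW")
print(sw)
print(position_replace(sw, "SW"))
```

Under this reading, the four position relationships of an 'SW' sequence would be 'SS', 'SW', 'WS', and 'WW', each giving one '01' sequence.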

Statistics Module
The statistics module mainly covers two aspects: (1) statistics of the number of gene sequence elements; (2) statistics of the positions of gene sequence elements.
In the element number statistics, only the total count of a single element after the base replacement operation is taken, and the measure is then calculated as that count divided by the length of the sequence.
Similarly, the element position statistics count the total number of '1' characters in the '01' sequence produced by the position replacement operation, and the measure is calculated as that count divided by the length of the sequence.
The same sequence will have different sets of statistical values due to different sequence replacement relationships.
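Both measures reduce to a count divided by a length, as the following Python sketch shows. The function names element_measure and position_measure are hypothetical, the input strings are illustrative, and dividing by the length of the '01' sequence itself (rather than the length of the original segment) is an assumption.

```python
def element_measure(replaced: str, element: str) -> float:
    """Fraction of positions in a base-replaced sequence equal to element."""
    if not replaced:
        return 0.0
    return replaced.count(element) / len(replaced)


def position_measure(bits: str) -> float:
    """Fraction of '1' characters in a '01' sequence.

    Assumption: the denominator is the length of the '01' sequence.
    """
    if not bits:
        return 0.0
    return bits.count("1") / len(bits)


# Illustrative inputs: an 'SW' sequence with 9 'S' out of 15 characters,
# and a '01' sequence with 5 ones out of 14 characters.
print(element_measure("SWSSWSWSSSWWSSW", "S"))  # 0.6
print(position_measure("10010100010001"))       # 5/14, about 0.357
```

Computing these measures for every segment of a genome yields one series of values per replacement relationship, which is what the visualization module then plots.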

Visualization Module
This module performs the average linear visualization operation. It takes the statistics of the sequence data set as input and outputs a series of linear diagrams. The specific operation process is shown in Fig. 2.

Data Introduction
The data in this article cover 18 regions, which are divided into four independent comparison combinations. All data come from the website of the U.S. National Center for Biotechnology Information (NCBI). The data used are sequence samples submitted early on in each region. The gene sequence ID numbers are as follows.

Results Display
The results of this article are mainly the output of the visualization module.
The visualization results for the 'SW' replacement relationship are as follows: (1) Single-element visualization of the gene sequences from five regions (China, Australia, Brazil, Colombia, and France), as shown in Fig. 3.

Discussion
For the new coronavirus gene sequences from the 18 regions, the sequence element statistics and sequence element position statistics measured under the "SW" base substitution relationship show different distribution characteristics within the four comparison groups, and the viral gene sequences of the regions show both similarities and differences. Different gene sequences yield different comparison results for different sequence elements or sequence element position relationships. In the comprehensive analysis of the 18 regions, the four regions whose gene sequences are most similar to the sample data submitted by China are Australia, South Africa, South Korea, and Greece.
In this article the gene sequence of each region is treated as a unit, and only the "SW" base substitution relationship is selected for displaying results; graphs for the other base substitution relationships may show further comparison results.

Conflict of Interest
The authors declare no conflict of interest.