The prevalent RNA modication on SARS-CoV-2 RNAs may confound the SNP prole and evolutionary patterns revealed by previous studies

The recent outbreak of SARS-CoV-2 has caused severe damage to the world. The concomitant papers on the evolutionary patterns of SARS-CoV-2 is continuously emerging. Studies has utilized the publically available RNA-seq data to nd out the so-called SNPs in the virus genome and analyzed their selection patterns. Methods


Background
The outbreak of SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2) in the beginning of year 2020 has caused severe damage to China especially the Hubei province [1][2][3]. Recently, several other countries like America, France, England, Italy, Russia, and Spain are consecutively being hit by this virus. There is urgent need to understand the origin and evolution of SARS-CoV-2 and related coronaviruses [4].
Papers on the evolutionary patterns of SARS-CoV-2 have emerged as rapidly as the outbreak of virus.
Several studies downloaded the publically available RNA-seq data and performed "SNP calling". The SNP distribution or frequency spectrum is usually a super informative inference of selection patterns. The recent study has discussed topics on the origin and continuing evolution of SARS-CoV-2. Although questioned methodologically by other researchers, their attempt to utilize the SNP data called from RNAseq is unchallenged.
To better illustrate the pipeline of SNP calling, let us take the human genome for instance. The SNP calling process is accomplished by mapping the DNA-seq reads of a sample to the human reference genome, and the reliable mismatch sites found should be a potential SNP (Fig. 1A). If the RNA-seq of the same sample is available, one could nd the same nucleotide change from the reference genome to the RNA-seq reads (Fig. 1A), indicating that the mutation takes place at DNA level. However, one could not determine the direction of the mutations without an outgroup (Fig. 1A). In contrast, if a variation site is found in the RNA-seq reads but not the DNA-seq reads, then this is possibly an RNA modi cation site. For example, the vertebrate adenosine deaminase would change adenosine to guanosine [5], which is interpreted as guanosine in the sequencing data. Thus, the A-to-G variations between RNA and reference genome are presumably caused by the deamination enzyme (Fig. 1A). Unlike the unknown direction of DNA mutations, the direction of A-to-G deamination is very clear even without an outgroup because it takes place at the RNA level (Fig. 1A).
SARS-CoV-2 is a positive strand RNA virus. The so-called reference genome is actually the RNA sequence.
Without a DNA template, the mismatches found between the reference and the RNA-seq reads could either be a "SNP" or RNA modi cation site (Fig. 1B). It is futile to try any ltering criteria on these RNA-seq data because the SNPs and RNA modi cation sites are technically indistinguishable. Even when multiple outgroup species are available, the reference sequence (RNA) of the outgroup viruses may also undergo the same RNA modi cation process (by host cells), making it di cult to de ne the ancestral state and the direction of mutations (Fig. 1B).
In this study, using a well-established mutation nding pipeline (see Fig. 2 and Methods), we found prevalent A-to-G RNA modi cations in the RNA-seq data of SARS-CoV-2. If other non-A-to-G variations could not be explained by known RNA modi cation systems and are regarded as SNPs, then we found non-identical codon position preference between the A-to-G and non-A-to-G variations, suggesting that their missense and synonymous pro les are different. Therefore, mixing all the variation sites in the SARS-CoV-2 evolutionary analyses is inaccurate and seems to be a hybrid between traditional SNP analyses and RNA modi cation analyses.

Results
Variation sites identi ed by normal mapping pipeline A recent study (https://www.biorxiv.org/content/10.1101/2020.03.02.973255v2) dealt with similar issues using the same sets of data. We rst compared our results with theirs. Here we used ordinary mapping pipeline (not the transform strategy, see Methods). Theoretically, the transform strategy was designed to retrieve many clustered variation sites while the normal variant calling pipeline might nd fewer sites. Indeed, according to the relevant study, datasets SRR10903401, SRR10903402, SRR11059942, and SRR11059945 contained 25, 163, 208, and 82 variant sites. Our results showed good overlapping with the study (Fig. 3A) but we have identi ed slightly more variation sites, possibly due to the different software used or that study performed additional lters. However, for the shared variation sites found by us and others, the correlation of alternative allele frequency is nearly perfect (Fig. 3A). The percentages of different variation types also conform with the previous study (Fig. 3B).
As we have said, the mutation sites found by traditional variant calling pipeline only represent a minority of all possible RNA modi cation sites in the transcriptome, and may not produce a strikingly high percentage of A > G mutation. In the next sections, we would no longer discuss these sets of variation sites. We would use the transform strategy as introduced in the Methods (Fig. 2) and look at the prevalent base substitution events across the transcriptome.
The prevalent A-to-G variations across SARS-CoV-2 genes We downloaded the reference and a set of RNA-seq data of SARS-CoV-2. We mapped the RNA-seq reads to the reference sequence with a well-established pipeline (see Methods for details). The numbers of (non-unique) mismatch events pro led (Fig. 4A). There are 5310 (59.1%) A-to-G mismatches and 2015 (22.4%) G-to-A mismatches, and the ten other types of mismatches composed only 18.5% of the totally 8989 mismatches (Fig. 4A). The most prevalent A-to-G mismatches could be interpreted as the adenosine-to-inosine deamination conferred by the host cells. For the second-most prevalent G-to-A mismatches, it is possible that the reference sequence (RNA) of SARS-CoV-2 itself suffered from deamination by the host cells. This is also why we worry that the phylogenetic tree constructed by multiple virus sequences could also be skewed by the RNA modi cation from the hosts. The third-most abundant T-to-C and C-to-T mismatches might represent the cytosine-to-uridine deamination system in the host cells.
We treated the A-to-G as the A-to-G RNA modi cation sites in the virus sequences. We found that the density of A-to-G modi cation varied moderately across different genes (Fig. 4B). To technically validate that the mismatches sites are reliably detected rather than artifacts or errors produced from the pipeline, we manually extracted a 150 bp read and aligned it to the reference sequence (Fig. 5). The A-to-G alterations are clearly presented in the reference versus RNA-seq comparison. Pay attention that errors could include the mismatches resulted from mis-alignments or sequencing errors. The validation here is to check the accuracy of the mapping pipeline. The manually examined read told that the sequence alignment is reliable. The control for sequencing errors would be discussed in the following section.

Robustly observed A-to-G variations under different criteria
The nearly nine thousand (non-unique) variation events shown in the above section belong to 4604 unique variation sites. Most of the 4604 unique sites have less than 10 reads supporting the alternative allele (Fig. 6A). There are 2878 (62.5%) unique A-to-G variation sites and 998 (21.7%) unique G-to-A variation sites (Fig. 6A). The signal to noise ratio for A-to-G variations is 1.67. If combined with G-to-A variations, the signal to noise ratio would be as high as 5.32. Among the 2878 unique A-to-G sites, 797 and 2391 sites were found in the two SRR samples, respectively, and their overlapping is 310 sites.
Since these results came from the mapping and variant calling procedures without any ltering criteria, we think it is necessary to see whether the patterns are sensitive to some ltering parameters. We re-did the analysis by requiring mapping quality > 25 and base quality (controlling for sequencing errors) > 35. We found that the number of unique variation sites (3129) slightly declined but the majority (2421, 77.4%) of which are still A-to-G variations (Fig. 6B).
We also wish to prove the reliability of the putative A-to-G modi cation sites from another angle. The base context of the A-to-G variation sites showed an obvious depletion of G upstream to the putative A-to-G modi cation sites (Fig. 7A). In contrast, the G-to-A variation sites did not have such a key validation feature (Fig. 7B). This consolidated our assumption that these A-to-G variations are RNA modi cation sites.

Discussion
The SNPs and RNA modi cation sites could bear completely different mutation rates and position biases, and might undergo different selection patterns. A mixture of SNPs and RNA modi cation sites does not tell either of their evolutionary patterns. Unfortunately, these two mutation sources could not be separated from the RNA-seq data from RNA viruses.
For DNA organisms like human, the pair-ended DNA-seq could be either mapped to the positive or negative strand of reference genome. But for single-ended human transcriptome data mapped to the transcriptome sequence, all reads should be theoretically mapped to the positive strand of transcriptome (except when the library itself is on the opposite direction). The same goes for RNA viruses like SARS-CoV-2 for either single-ended or pair-ended RNA-seq library. The authors did not mention the detailed mapping status of the reads. Although it is technically di cult to distinguish the SNPs and RNA modi cation sites in the viral RNA, at least this unsolvable concern should be stated in the manuscript.
If it is con rmed that the SARS-CoV-2 has been transferred from patient No.1 to patient No.2, then one might consider the RNA-seq from patient No.1 should be the ancestral state. However, the viral RNAs in patient No.1 would also undergo the same A-to-G modi cation by the hosts. The deamination enzyme only modi es a fraction of the total viral RNAs so that in patient No.1 there is still a mixture of A-version and G-version RNA reads. Technically, one could not know whether this is a polymorphic site in the virus population or it is modi ed by the host's enzyme. This uncertainty makes it di cult to de ne the ancestral state.
There is a less important but unsolved question that we think the G-to-A variation sites could also be the A-to-G modi cation on RNA of the "reference genome". However, from the base context of the G-to-A sites, they did not seem to be authentic A-to-G modi cation sites. This strange pattern remains an open question.
In summary, the optimized algorithms (in numerous software) are only to improve the accuracy of the alignments. Even an alignment is manually veri ed, we still do not know any A-G mismatches in the alignment should be SNPs or RNA modi cation sites. We appeal that this issue should be seriously discussed in the studies involving RNA viruses like SARS-CoV-2.

Conclusions
The technically indistinguishable RNA modi cations and SNPs of SARS-CoV-2 have complicated the situation where the researchers intend to reveal the evolutionary patterns behind the mutation spectrum. This is not a problem for DNA organisms but should be seriously considered when we are investigating the RNA viruses.

Data collection
We downloaded the novel coronavirus SARS-CoV-2 genome from the NCBI website (https://www.ncbi.nlm.nih.gov/genome/). The coding sequences were extracted according to the genome annotation. The RNA-seq data were retrieved from NCBI under accession numbers SRR10903401 and SRR10903402. Other two relevant datasets SRR11059942 and SRR11059945 were also downloaded and analyzed.

RNA-seq analyses
We mapped the RNA-seq reads to the CDS of SARS-CoV-2 using BWA mem [6] with default parameters but with a little modi cation (Fig. 2). Reads with too many mismatches could not be aligned to the reference genome. However, the many clustered mismatches could be RNA modi cations. To retrieve more RNA modi cation sites in clusters like this, we transformed the reference sequence and the RNA-seq reads manually [7]. The transformation could be either of the twelve types of mismatches. The transformed reads mapped to the transformed genome were replaced with the original reads and the unmodi ed genome. The mismatch sites were extracted from the alignment. We made two versions here, one version is the variation sites without additional lters, another is the mutation sites under the criteria of mapping quality > 25 and base quality > 35. The "transformation followed by re-mapping" work ow is a well-acknowledged pipeline to detect the RNA modi cation events omitted by traditional mapping procedures [7,8]. We should emphasize that this pipeline only deals with the reads that could not be mapped by normal procedures. For the majority of reads that could be mapped by normal procedures (which might also contain many RNA modi cation events), we would discuss their variation pro le separately. Using the normal variant calling pipeline (or termed ordinary pipeline), we also required mapping quality > 25 and base quality > 35 to conform with the protocol of a very relevant study posted as preprint recently (https://www.biorxiv.org/content/10.1101/2020.03.02.973255v2). The difference is that we used samtools to nd the variants and that study used other software.
The corresponding author designed and supervised this research. All authors contributed to writing the manuscript. All authors read and approved the nal version.  The transform strategy of mapping the reads with multiple mismatches.   An example of an alignment between an RNA-seq read and the reference sequence of SARS-CoV-2. The ve A-to-G mismatch sites are colored in the gure.