The SARS-CoV-2 arginine dimers

Arginine occurs, even as a dimer, in viral polybasic furin cleavage sites, including that of the SARS-CoV-2 S protein, whose acquisition is one of the characteristics that distinguish SARS-CoV-2 from the rest of the sarbecoviruses. The RR dimer of the SARS-CoV-2 furin site is encoded by the CGGCGG sequence. The aim of this work is to report the other SARS-CoV-2 arginine pairs, with particular emphasis on their codon usage. Here we show the presence of RR dimers in the orf1ab-related non-structural proteins nsp3, nsp4, nsp6, nsp13 and nsp14A2 and, in a higher proportion, in the structural nucleocapsid protein. All these RR dimers were strictly conserved in the sarbecovirus strains closest to SARS-CoV-2, and none of them was encoded by the CGGCGG sequence.


Introduction
Arginine (R) is a polar, non-hydrophobic amino acid with a guanidine group, positively charged at physiological pH, linked to a 3-carbon aliphatic chain. Arginine participates in the binding of negatively charged substrates and/or protein active sites (1). Consistently, arginine is involved in viral polybasic proteolytic cleavage sites, even as a dimer, as a substrate of the ubiquitously expressed serine protease furin (2,3).
In this context, a notable characteristic of SARS-CoV-2 is the acquisition of a polybasic cleavage site (PRRAR) at the S1-S2 boundary of the S protein. This site is recognized by the furin protease, a step that greatly mediates the fusion of human cell and viral membranes and the rapid human-to-human transmission of the virus (4-7). The acquisition was achieved through an insertion of four amino acids (PRRA) in the S protein. Neither the other sarbecoviruses nor the bat sarbecovirus strains closest to SARS-CoV-2 have a polybasic cleavage site (4). However, such a site is common in viral proteins, for example the hemagglutinin (H5) protein of avian influenza viruses (2) or the S protein of some of the seven coronaviruses known to infect humans: HCoV-HKU1 (RRKRR-756, coordinate based on the S protein), HCoV-OC43 (RRSRR-764) and MERS-CoV (MLKRR-700) (4).
Another notable characteristic of SARS-CoV-2 is the arginine codon, CGG, used to encode the RR dimer of the polybasic furin site. The CGGCGG nucleotide sequence had not previously been seen encoding RR pairs of viral furin sites (8). Arginine is encoded by the codons CGU, CGC, CGA, CGG, AGA and AGG. In SARS-CoV-2, CGG is the least often used arginine codon (frequency 0.03) (8). The aim of this work is to report the other RR dimers in SARS-CoV-2. That is, how many are there? In which proteins? How are they encoded? And, going a little further, are these RR dimers conserved in the sarbecovirus strains closest to SARS-CoV-2, and how are they encoded there?
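To make the notion of arginine codon frequency concrete, the following minimal Python sketch computes the share of each of the six arginine codons among all arginine codons of an in-frame coding sequence. This is only one common way of defining codon frequency and is an assumption here; the values quoted from (8) may have been obtained differently. The function name `arginine_codon_frequencies` and the toy sequence are hypothetical and do not correspond to viral data.

```python
from collections import Counter

# The six standard arginine codons (RNA alphabet, as listed in the text).
ARG_CODONS = ("CGU", "CGC", "CGA", "CGG", "AGA", "AGG")


def arginine_codon_frequencies(cds: str) -> dict:
    """Return, for an in-frame coding sequence, the fraction of arginine
    codons that belong to each of the six synonymous arginine codons."""
    rna = cds.upper().replace("T", "U")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [rna[i:i + 3] for i in range(0, len(rna) - len(rna) % 3, 3)]
    counts = Counter(c for c in codons if c in ARG_CODONS)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in ARG_CODONS}


# Hypothetical toy sequence (not viral data): AUG AGA AGA CGU CGG UAA
toy = "ATGAGAAGACGTCGGTAA"
freqs = arginine_codon_frequencies(toy)
```

For the toy sequence, two of the four arginine codons are AGA, so `freqs["AGA"]` is 0.5, while CGU and CGG each account for 0.25.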

Methods
The sources of information were the NCBI GenBank and GISAID databases. The SARS-CoV-2 reference sequence (NCBI GenBank NC_045512.2; GISAID EPI_ISL_402124), isolate WIV04, was used as the reference sequence in this study. The sarbecoviruses used were the following: four human SARS-CoV-2 (WIV04, Wuhan-Hu-1, CDC-CruiseA-12, INMI1), nine bat strains closest to SARS-CoV-2 (RmYN02 (9), RaTG13 (19), RShsTT200, RShsTT182 (11), RpYN06 (12), RacCS203 (13), PrC31, ZC45, ZXC21 (14)), six from pangolin (MP789, GX-P5L, GX-P4L, GX-P1E, GX-P5E, GX-P2V), two bat sarbecoviruses phylogenetically more distant from SARS-CoV-2 (BM48-31, BtKY72) and two SARS-CoV sarbecoviruses (human, HKU-39849; and ferret, Tor 2/FP1-10912) (Table 1). First, the SARS-CoV-2 proteins containing an RR pair were identified. Second, for the sarbecoviruses used in this study, an in-house database of these genes and their orthologues was created. Since not all virus genomes were fully annotated, the gene sequences of each sarbecovirus were obtained through successive pairwise BLASTn analyses, using the gene sequences of the SARS-CoV-2 reference sequence as queries. The protein sequences were obtained by translating the gene sequences. Multiple sequence alignments were created with EMBL Clustal Omega (v.1.2.4) using default parameters. Gene and protein multiple alignments were done separately. In the gene multiple alignments, the codons of each RR dimer were located from the position of the RR dimer in the corresponding protein multiple alignment multiplied by three (codon-based multiple alignments were not used to identify the RR codons). All tables are sorted according to the pairwise BLASTn identity percentage between the sarbecovirus genome sequence and the SARS-CoV-2 reference genome sequence.
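The step of locating RR dimers in a protein and mapping them back to their codons by multiplying the position by three can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: it works on a single ungapped coding sequence rather than on Clustal Omega alignments, it assumes the standard genetic code, and the names `translate` and `rr_dimers_with_codons`, as well as the toy sequence, are hypothetical.

```python
import re
from itertools import product

# Standard genetic code, built in TCAG order (DNA alphabet).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence; unknown codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    end = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, end, 3))


def rr_dimers_with_codons(cds: str) -> list:
    """Return (amino acid position, nucleotide position, hexamer) for each
    RR dimer, with 1-based positions of the first R and of its first base."""
    cds = cds.upper().replace("U", "T")
    protein = translate(cds)
    hits = []
    # A lookahead reports overlapping pairs too (e.g. both pairs in RRR).
    for match in re.finditer(r"(?=RR)", protein):
        aa_index = match.start()   # 0-based index of the first R
        nt_index = 3 * aa_index    # protein position multiplied by three
        hits.append((aa_index + 1, nt_index + 1, cds[nt_index:nt_index + 6]))
    return hits


# Hypothetical toy sequence (not viral data): M A R R K R R R *
toy = "ATGGCTCGGCGGAAACGTAGAAGATAA"
hits = rr_dimers_with_codons(toy)
```

With gapped alignments, the same multiplication only holds if the gene and protein alignments are gapped consistently, so in practice the positions would need to be checked against the ungapped sequences.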

Results and Discussion
According to the SARS-CoV-2 genome structure (5), from the 5' end there are two large open reading frames (ORF1a and ORF1b) covering two-thirds of the RNA genome and encoding 15 non-structural proteins (nsp) that compose the viral replication and transcription complex. They are followed by the remaining third of the genome, towards the 3' end, which encodes the structural proteins, namely spike (S), envelope (E), membrane (M) and nucleocapsid (N). All these SARS-CoV-2 proteins were analysed for the presence of RR dimers. Results are shown in Table 2.
In the S protein, the only RR dimer is that of the polybasic furin cleavage site. However, RR pairs were found in some of the orf1ab-related non-structural proteins that make up the RNA polymerase complex. RR pairs were also found in the structural nucleocapsid protein, encoded by the N gene. Among the non-structural proteins, RR pairs were present in the papain-like protease (nsp3); in the proteins involved in the formation of replication compartments (nsp3, nsp4 and nsp6); in the helicase (nsp13); and in the 3′-5′ exonuclease that assists RNA synthesis with a unique RNA proofreading function (nsp14) (5). The nucleocapsid protein is involved in packaging the positive-strand viral genomic RNA into a helical ribonucleocapsid (15). Keeping in mind the positive charge of the arginine guanidine group, the presence of four RR dimers in the SARS-CoV-2 nucleocapsid protein agrees with the known essential role of arginine residues in viral capsid assembly (1).
Interestingly, the RR dimers found in those SARS-CoV-2 proteins (apart from that of the S protein) were strictly conserved in the sarbecovirus strains most closely related to it phylogenetically. This suggests a fundamental biological role for these RR doublets. Table 3 lists again the sarbecoviruses used in this study, with the genomic coordinates of the genes, or parts of them, that encode the RR dimers. Table 4 shows these RR dimers in detail. However, in some orf1ab-related non-structural proteins from the more distant sarbecovirus strains, lysine (K) took the place of arginine. Lysine, which is also amphipathic, is also highly represented in viral polybasic furin cleavage sites (2). The main purpose of Table 4, however, was to show the triplets that encode the RR dimers. It was not a surprise that AGA, the majority arginine codon in SARS-CoV-2 (frequency 0.445) (8), was the codon most frequently encoding the RR dimers. On a few occasions the same codon was repeated within a dimer, and whenever this happened it was in the AGAAGA sequence. On the other hand, the CGGCGG sequence, which encodes the RR dimer of the SARS-CoV-2 furin site, did not appear in any case.
Table 3. Selected sarbecovirus genes encoding proteins that contain the arginine doublet. For genes derived from the orf1ab polyprotein, the coordinates shown are those of the genomic region that matched the corresponding gene of the reference genome (WIV04) in the pairwise BLASTn search. For the N gene, encoding the structural nucleocapsid protein N, the GenBank accession number is shown for the genomes that are annotated. ns: no significant similarity found in the pairwise BLASTn search using the corresponding gene of the SARS-CoV-2 reference genome (NC_045512.2, WIV04) as query.
Table 4. Arginine doublets of SARS-CoV-2 and their codons, conserved in the sarbecovirus species closest to it. Data were extracted from the protein and encoding-gene multiple alignments, respectively. Numbers on the right stand for the amino acid and nucleotide positions in the reference genome (WIV04) sequences. * represents strictly conserved residues in the multiple alignment.
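The kind of codon tally summarized from Table 4 could be reproduced along the following lines. This is a minimal Python sketch under stated assumptions: the RR-encoding hexamers are supposed to have been collected per strain beforehand, the function name `summarize_rr_hexamers` is hypothetical, and the strain labels and hexamers in the example are placeholders invented for illustration, not values taken from Table 4.

```python
from collections import Counter


def summarize_rr_hexamers(hexamers_by_strain: dict):
    """Count the RR-encoding hexamers over all strains and list those in
    which the same codon is used twice (e.g. AGAAGA)."""
    counts = Counter(
        hexamer.upper().replace("U", "T")
        for hexamers in hexamers_by_strain.values()
        for hexamer in hexamers
    )
    same_codon_repeats = sorted(h for h in counts if h[:3] == h[3:])
    return counts, same_codon_repeats


# Placeholder input: strain labels and hexamers are invented for illustration.
example = {
    "strain_A": ["AGAAGA", "CGTAGA", "AGGCGT"],
    "strain_B": ["AGAAGA", "CGCAGA", "AGGCGT"],
}
counts, repeats = summarize_rr_hexamers(example)

# True only if some RR dimer in the input is encoded by CGGCGG.
cgg_pair_present = "CGGCGG" in counts
```

In this placeholder input the only same-codon repeat is AGAAGA and `cgg_pair_present` is False, which mirrors the pattern described in the text without reproducing the actual data.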