Genomic Variations in SARS-CoV-2 Strains at the Target Sequences of Nucleic Acid Amplification Tests

Background: Nucleic acid amplification is the main method used to detect infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). However, the false-negative rate of nucleic acid tests cannot be ignored. Methods: We characterized genomic variations at the target sequences of these tests, and the geographical distribution of the variations across countries, by analyzing whole-genome sequencing data of SARS-CoV-2 strains from the 2019 Novel Coronavirus Resource (2019nCoVR) database. Results: Among the 21 pairs of primer sequences in the ORF1ab, S, E, and N regions, the total length of the primer and probe target sequences was 938 bp, with 131 (13.97%) variant loci in 2415 (38.96%) isolates. Primer targets in the N region contained the most variations, distributed among the most isolates, and the E region contained the fewest. Single nucleotide polymorphisms were the most frequent variation type, with C to T transitions detected at the most variant loci. G to A transitions and G to C transversions involved the most isolates and had the highest isolate density. Genomic variations at the three mutation sites N:28881, N:28882, and N:28883 were the most commonly detected, involving 608 SARS-CoV-2 strains from 33 countries, especially the United Kingdom, Portugal, and Belgium. Conclusions: Our work comprehensively analyzed genomic variations in the target sequences of nucleic acid amplification tests, offering evidence to optimize the selection of primer and probe target sequences and thereby improve the performance of SARS-CoV-2 diagnostic tests.

COVID-19 [9,12]. This is similar to the diagnostic approach for SARS and MERS, the two other coronavirus diseases [13][14][15]. Hence, the sensitivity and specificity of nucleic acid testing are crucial. Although the test is highly specific, its sensitivity is not satisfactory [16]. Various factors may reduce test sensitivity. The number of days between symptom onset and sampling, the site from which a specimen is obtained, insufficient viral material in the specimen, laboratory error during sampling, and restrictions on sample transportation could all influence nucleic acid testing and cause false-negative results [17][18][19][20]. However, few studies mention false negatives due to variations in primer target sequences.
RT-PCR kits detect SARS-CoV-2 by reverse transcription of SARS-CoV-2 RNA into complementary DNA (cDNA) strands, followed by amplification of specific regions of the cDNA with paired primers and probes [21]. Among the SARS-related viral genomes, there are three regions with conserved sequences: 1) the RdRP gene (RNA-dependent RNA polymerase gene) in the open reading frame ORF1ab region, 2) the E gene, and 3) the N gene [22]. Both the RdRP and E genes had high analytical sensitivity for detection, whereas the N gene provided poorer analytical sensitivity [9,22]. Thus, some assays were designed as a two-target system, where one primer set universally detected coronaviruses and a second set detected only SARS-CoV-2 [9]. As of 21 May 2020, 315 commercialized molecular assays had been designed worldwide to detect SARS-CoV-2, according to information provided by the Foundation for Innovative New Diagnostics, and some PCR protocols had been shared [23,24]. The sensitivity of nucleic acid detection kits relies on the binding of primers and probes to the virus genome [9,25]. A high variation frequency, especially variation near the 3′ end, might impair the binding of primers or probes to the virus genome [26]. Thus, variations in the target sequences, and how these variations affect the accuracy of the tests, need to be elucidated.
SARS-CoV-2 is a positive-sense single-stranded RNA virus, with a genome that has typical betacoronavirus organization: a 5′ untranslated region (UTR), replicase complex (orf1ab), S (spike) gene, E (envelope) gene, M (membrane) gene, N (nucleocapsid) gene, 3′ UTR, and several unidentified nonstructural open reading frames [27][28][29][30]. The mutation rate of RNA viruses is dramatically high, and this high rate is correlated with virulence modulation and evolvability for viral adaptation [31]. It has been reported that the most frequent variations are found within open reading frame (ORF) regions [32]; however, some studies suggest that overall variation in ORF regions is low [33].
Therefore, the variation frequency in the aforementioned regions of the genome needs further study, especially in the primer and probe target sequences.
Although most primer and probe sets were designed to target specific and conserved sequences, SARS-CoV-2 strains still showed some variations in the target sequences of the nucleic acid test assays. A number of studies have compared the sensitivity and efficiency of different SARS-CoV-2 detection assays [34,35]. However, few studies have addressed the variations, and their geographic distribution, that may affect the PCR performance of detection kits used on a large scale in the population. Herein, we comprehensively analyzed genomic variations in the target sequences of the nucleic acid amplification tests and their geographic distribution. Our results are expected to provide evidence for optimizing the selection of detection kits used on a large scale in the population, thereby improving the diagnostic accuracy of the tests.

Novel Coronavirus Resource
We used the 2019 Novel Coronavirus Resource [36,37], constructed by the Chinese Academy of Sciences. It integrates genomic and proteomic sequences as well as their metadata from the Global Initiative on Sharing All Influenza Data, National Center for Biotechnology Information, China National GeneBank, National Microbiology Data Center, and China National Center for Bioinformation/National Genomics Data Center.
The 2019nCoVR offers visualization functionalities for genomic variation analysis results based on all SARS-CoV-2 strains collected.

Source of primers and probes used in nucleic acid detection kits.
The WHO has shared molecular assay protocols for SARS-CoV-2 (WHO, 2020) from seven institutes: the Chinese Center for Disease Control and Prevention (China CDC); Institut Pasteur, Paris, France (IP France); Centers for Disease Control and Prevention, the United States (US CDC); National Institute of Infectious Diseases, Japan (NIID Japan); Charité, Germany; School of Public Health, Hong Kong University (HKU); and the National Institute of Health, Thailand (NIH Thailand). In total, 21 pairs of primers and probes from the above protocols were mapped to the reference genome.

Mapping of genomic variations
The sequences of primers and probes were mapped to the SARS-CoV-2 isolate Wuhan-Hu-1, complete genome (NCBI Reference Sequence: NC_045512.2) [36]. The positions of the primers and probes used in the nucleic acid testing kits were identified.
Then, the sequences of the primers and probes were compared with the variant loci offered by 2019nCoVR, with the variant loci and corresponding isolates downloaded from the database.
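The mapping and comparison steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the reference, primer strings, and variant positions are toy values, `locate_primer` and `variants_in_target` are hypothetical helper names, matching is exact, reverse primers are matched through their reverse complement, and coordinates are 1-based and inclusive.

```python
# Sketch only: toy sequences and positions, not real SARS-CoV-2 data.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def locate_primer(reference, primer, is_reverse=False):
    """Return the 1-based inclusive (start, end) of an exact match, or None."""
    query = reverse_complement(primer) if is_reverse else primer
    idx = reference.find(query)
    if idx == -1:
        return None
    return idx + 1, idx + len(query)


def variants_in_target(variant_positions, start, end):
    """Keep the variant positions (1-based) that fall inside [start, end]."""
    return [p for p in variant_positions if start <= p <= end]


reference = "ATGCGTACGTTAGCCGATAGGCTTAACG"  # hypothetical reference sequence
forward = "CGTACGTTAG"                       # hypothetical forward primer
reverse = "CGTTAAGCC"                        # hypothetical reverse primer
variant_loci = [2, 5, 13, 21]                # hypothetical variant positions

fwd_span = locate_primer(reference, forward)
rev_span = locate_primer(reference, reverse, is_reverse=True)
fwd_hits = variants_in_target(variant_loci, *fwd_span)
rev_hits = variants_in_target(variant_loci, *rev_span)
```

Exact matching is the simplest choice for primers taken from published protocols; degenerate bases or mismatch-tolerant alignment would need a different matcher.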

Variation curve
For each variant locus, the virus isolates carrying a variation were counted by country at each time point. The resulting curve shows the trend over time in the number of isolates from different countries at selected variant loci.
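The counting behind such a curve might look like the sketch below. The record layout (`date`, `country`, `variant_loci`) and the function name `variation_curve` are assumptions made for illustration; the actual export format of the database is not described in the text.

```python
from collections import Counter


def variation_curve(records, locus):
    """Count isolates carrying a variant at `locus`, keyed by (date, country)."""
    counts = Counter()
    for rec in records:
        if locus in rec["variant_loci"]:
            counts[(rec["date"], rec["country"])] += 1
    return counts


# Hypothetical isolate records; dates, countries, and loci are toy values.
records = [
    {"date": "2020-03", "country": "A", "variant_loci": {28881, 28882, 28883}},
    {"date": "2020-03", "country": "A", "variant_loci": {28881}},
    {"date": "2020-03", "country": "B", "variant_loci": {514}},
    {"date": "2020-04", "country": "B", "variant_loci": {28881, 514}},
]

curve = variation_curve(records, 28881)
for (date, country), n in sorted(curve.items()):
    print(date, country, n)
```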

Variation density and isolate density calculation
For the ORF1ab, S, E, and N regions of the SARS-CoV-2 genome, the length of the primers and probes and the number of variant loci covered by the primers and probes in each region were calculated first. The variation density of a region was the number of variant loci divided by the length of the primers and probes. In addition, the number of isolates presenting variations in the primer and probe target sequence was calculated. Isolate density was the number of isolates divided by the length of the target sequence.
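The two densities defined above reduce to simple ratios. As a worked illustration, the sketch below applies them to the overall totals reported in the abstract (131 variant loci and 2415 isolates over 938 bp of target sequence); the function names are chosen here for illustration, and the overall isolate density is derived arithmetic rather than a figure reported in the text.

```python
def variation_density(n_variant_loci, target_length_bp):
    """Variant loci per base of primer and probe target sequence."""
    if target_length_bp <= 0:
        raise ValueError("target length must be positive")
    return n_variant_loci / target_length_bp


def isolate_density(n_isolates, target_length_bp):
    """Isolates with a variation per base of primer and probe target sequence."""
    if target_length_bp <= 0:
        raise ValueError("target length must be positive")
    return n_isolates / target_length_bp


overall_variation = variation_density(131, 938)
overall_isolates = isolate_density(2415, 938)

print(f"{overall_variation:.2%}")  # 13.97%
print(f"{overall_isolates:.2f}")   # 2.57
```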

The variance of time and regional variations
First, the frequency of population occurrence at each variant locus was identified for each time point or country. Time variance was the variance of the frequency of population occurrence across time points, used to assess the dispersion of changes at that site. Taking the country as the unit, we calculated regional variance from the frequency of population occurrence in each country, to evaluate the dispersion of variations at that locus.
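A minimal sketch of the regional variance is given below; time variance would follow the same pattern with time points in place of countries. Two assumptions are made here that the text does not specify: the frequency is taken as mutant isolates divided by all isolates in the group, and the population variance (`statistics.pvariance`) is used rather than the sample variance. The counts are toy values.

```python
from statistics import pvariance


def variation_frequencies(counts_by_group):
    """Map each group to mutant isolates / total isolates (assumed definition)."""
    return {
        group: mutant / total
        for group, (mutant, total) in counts_by_group.items()
        if total > 0
    }


def dispersion(counts_by_group):
    """Population variance of the per-group variation frequencies."""
    freqs = variation_frequencies(counts_by_group)
    return pvariance(freqs.values())


# Hypothetical (mutant, total) isolate counts per country at one locus.
by_country = {"A": (5, 50), "B": (0, 40), "C": (10, 50)}

regional_variance = dispersion(by_country)
print(f"{regional_variance:.4f}")
```

Swapping `pvariance` for `statistics.variance` would give the sample variance instead, which differs noticeably when only a few countries or time points are available.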

Statistical analysis
The majority of statistical analyses were performed using SPSS software package version 22. The chi-square test was used to compare differences in count data between groups. Variance was used to evaluate the degree of dispersion, as described above. A two-sided P value less than 0.005 was considered statistically significant.
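For readers without SPSS, the chi-square comparison of count data can be illustrated with a small hand-written version for a 2×2 table. This is a sketch, not a reproduction of the SPSS procedure: it uses the Pearson statistic without continuity correction, assumes all row and column totals are nonzero, and the counts are hypothetical. For one degree of freedom the P value equals erfc(sqrt(χ²/2)).

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, two-sided p-value) with 1 degree of freedom.
    """
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: mutant vs. non-mutant isolates in two groups.
stat, p_value = chi_square_2x2(10, 20, 20, 10)
print(f"chi2={stat:.3f}, p={p_value:.4f}")
```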

Variation landscape of primer and probe target sequences in the SARS-CoV-2 strains
Twenty-one pairs of primers and probes obtained from seven institutes worldwide were analyzed. The primers and probes were mainly focused on the replicase complex (Table 1).
Among the four regions, most primers and probes were located in the ORF1ab region.
As for the S region, there were only three pairs of primers, covering positions 24354 to 24900. [...] isolates per locus. Conversely, the N region had the most isolates and the highest density, with 2102 isolates and 5.78 isolates per locus (Fig. 1G).

Variation analysis of SARS-CoV-2 strains in the target sequence of nucleic acid amplification tests.
To further illustrate the variations, we investigated the specific variation types. Among the variant loci covered by paired primers and probes, the most common variations were single nucleotide polymorphisms (SNPs), accounting for 96% of all variations observed.
Only 3% of the variations were deletions, and 1% involved both SNPs and deletions (Fig 2A). Among the SNPs, 58% were transitions and 35% were transversions. C to T transitions were the most common type of SNP, accounting for 37% of SNP variant loci, followed by G to T transversions and T to C transitions (Fig 2B).
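The distinction between transitions and transversions used above follows the standard definition: a transition exchanges two purines (A, G) or two pyrimidines (C, T), and a transversion exchanges a purine for a pyrimidine or vice versa. A short sketch of that classification, with a function name chosen here for illustration:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


for ref, alt in [("C", "T"), ("G", "A"), ("G", "T"), ("G", "C"), ("T", "G")]:
    print(f"{ref} to {alt}: {classify_snp(ref, alt)}")
```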
Among the 6198 SARS-CoV-2 virus strains, the G to A transition involved the most isolates (1218) and had the highest isolate density (121.8 per locus) of all SNP types, followed by the G to C transversion and the T to C transition. T to G transversions and C to T transitions showed opposite trends. The former were found in fewer virus strains but at a higher density (53 strains, 17.67 strains per site), while the latter were found in more virus strains but at a lower density (172 strains, 3.25 strains per site) (Fig 2C).
The C to T transition was the most common SNP type, accounting for 42% of the SNPs in the ORF1ab region, 38% in the S region, 60% in the E region, and 39% in the N region. The ORF1ab region contained 12 SNPs. After the C to T transition, the most common SNPs were the G to T transversion, T to C transition, and A to G transition. There were eight SNP types in the S region. Apart from the C to T transition and the A to G transition (15%), the other six types each accounted for 7%-8%. The E region had the fewest SNP types, with only three: C to T transition, T to A transversion (20%), and T to C transition (20%). The N region had the most, with 16 types of SNPs. Besides the C to T transition, the most common SNP types were the G to T transversion and T to C transition (Fig 2D). [...] Only one strain was unique, shown at 28881 (Fig 3B). The variation tendencies of the isolates involved in the above three variant loci were almost the same. At the end of [...] Greece and Portugal also had high variation frequencies at the three sites. The regional variances of the three sites equaled 0.0424 (Fig 3D).

Isolate analysis of common variant loci.
Geographical distributions of several other common variant loci that involved many isolates were also examined. At ORF1ab:490, there were 21 (60.00%) mutant isolates from Australia and 11 (31.43%) from the United States. The highest variation frequencies were in New Zealand and Georgia (0.125, regional variance=0.0007) (Fig 4A). The forward primer of the first pair of primers and probes from the National Institute of Infectious Diseases, Japan, targeted this variant locus (Fig 1A and Table 1). At ORF1ab:514, most mutant isolates were from the Netherlands (66 isolates, 75.86%), with a high variation frequency (0.12). The highest variation frequencies were observed in Slovakia (Fig 4B). This variant locus was targeted by the forward primer of the No. 3 pair of primers and probes from the National Institute of Infectious Diseases, Japan (Fig 1A and Table 1). Notably, 96.08% (49 isolates) of the mutant isolates at ORF1ab:13402 were from Belgium, which had the second-highest variation frequency. The highest variation frequency was in Denmark (0.29, regional variance=0.0020) (Fig 4C). As for N:28311, targeted by the No. 5 pair of primers and probes provided by China CDC (Fig 1D and Table 1), 20 (57.14%) isolates were from Australia, the largest share.
However, the variation frequency of strains from Australia was not high. The highest variation frequencies occurred in Saudi Arabia, South Korea, and Malaysia (regional variance=0.0053) (Fig 4D). The probe of the No. 15 pair of primers and probes provided by US CDC targeted this variant locus (Fig 1D and Table 1). The mutant isolates at N:28688 were geographically dispersed. This locus was targeted by the forward primer of the No. 17 pair of primers and probes, which was recommended by US CDC (Fig 1D and Table 1). There were 31 (36.47%) isolates from Australia, 18 (21.18%) from China, and 11 (12.94%) from India. High variation frequencies were observed in Turkey, India, and Kuwait (regional variance=0.0273) (Fig 4E).

Discussion
In the present study, our fundamental purpose was to reveal the variations in the target [...] isolates, especially in the N region, which had the highest variation density and the highest isolate density. The most variant loci and the most isolates were also located in the N region (Figs 1F and 1G). The high variation frequency and isolate density implied that the N region might not be suitable for primer design, because primers targeting it might elevate the false-negative rate. Therefore, nucleic acid testing alone is not sufficient to establish a negative case; it should be combined with other tests such as antibody detection [9]. In contrast, the lowest variation density and the lowest isolate density were observed in the E region. In addition, it had the fewest variant loci and the fewest isolates. Thus, the E region might be the most conserved region and suitable for primer design [26]. This is consistent with the findings of Buddhisha et al. and Corman et al. that the RdRP and E genes had high analytical sensitivity for detection, whereas the N gene provided poorer analytical sensitivity [9,22,38]. However, only two identical pairs of primers and probes were from the E region, which might limit the understanding of this region in the present study.
The variations involved in the primers and probes were mainly SNPs, which were abundant in the SARS-CoV-2 genome [39][40][41]. Recent research has demonstrated that SNPs in SARS-CoV-2 are capable of substantially changing its pathogenicity [41].
Importantly, a small number of SNPs in the primer target sequence might not cause primer binding failure [26]. The C to T transition was the most common type of SNP in the primer and probe target sequences. N:28881, N:28882, and N:28883 were the most influential variant loci, involving 608 or 609 viral isolates. It would be better to avoid designing primers or probes whose target sequences contain these three sites. As only the No. 18 primer pair among the 21 pairs targeted this sequence, verification with different nucleic acid detection kits or ddPCR could elevate sensitivity [42,43]. As viruses evolve during outbreaks, SNPs in primer or probe binding regions could alter the sensitivity of PCR assays [35]. [...] Figure 4A, 4B, S1A and S1B). It would be better not to use the 5th pair in Belgium (Figure 4C). It would also be better to avoid using the 15th pair of primers and probes in Singapore, India, and Australia (Figure 4D and S1C).
There were some limitations to this study. First, the sequences of primers and probes used in the nucleic acid detection kits were those provided by the WHO. However, not all kits used worldwide were available, because some were not open access. Second, although the virus strain data are updated frequently, the strains we used represent only the situation before our research ended. In addition, it has been reported that variations in primer binding sites could affect PCR efficiency [51,52]. Thus, the relationship between variations and PCR efficiency should be further validated in the future. Finally, how SNPs influence virus detection and virus pathogenicity needs to be further investigated.

Consent for publication
Not applicable

Availability of data and materials
The datasets generated and analysed during the current study are available in the 2019nCoVR, https://bigd.big.ac.cn/ncov.