Global landscape of SARS-CoV-2 genomic surveillance, public availability extent of genomic data, and epidemic shaped by variants

Genomic surveillance has shaped our understanding of SARS-CoV-2 variants, which have proliferated globally in 2021. We collected country-specific data on SARS-CoV-2 genomic surveillance, sequencing capabilities, public genomic data from multiple public repositories, and aggregated publicly available variant data. Then, different proxies were used to estimate the sequencing coverage and public availability extent of genomic data, in addition to describing the global dissemination of variants. We found that the COVID-19 global epidemic clearly featured increasing circulation of Alpha since the start of 2021, which was rapidly replaced by the Delta variant starting around May 2021. SARS-CoV-2 genomic surveillance and sequencing availability varied markedly across countries, with 63 countries performing routine genomic surveillance and 79 countries with high availability of SARS-CoV-2 sequencing. We also observed a marked heterogeneity of sequenced coverage across regions and countries. Across different variants, 21-46% of countries with explicit reporting on variants shared less than half of their variant sequences in public repositories. Our findings indicated an urgent need to expand sequencing capacity of virus isolates, enhance the sharing of sequences, the standardization of metadata files, and supportive networks for countries with no sequencing capability.

Some SARS-CoV-2 variants disappeared immediately, while others characterized by several key mutations adapted well, enabling their rapid spread 1 . WHO has designated four Variants of Concern (VOCs) associated with increased transmissibility and various extents of immune escape [2][3][4] , namely the Alpha, Beta, Gamma, and Delta variants, rst detected in the UK, South Africa, Brazil, and India, respectively 5 . Speci cally, the Delta variant is highly transmissible, with an estimated transmissibility increase of 40-60% compared with Alpha variant 6-9 , while the Beta variant has been shown to have the highest reduction in neutralization activity whether from natural infection or vaccination 10 , both re ected in lower vaccine e cacy or effectiveness [11][12][13] .
The identi cation and classi cation of SARS-CoV-2 variants mainly relied on partial or whole genome sequencing, although PCR assays have been used to identify speci c features relatively unique in speci c variants, like spike gene target failure (SGTF) 14 . Since the rst SARS-CoV-2 sequence was published in January 2020 15 , the unprecedented rate of genome data generation was far greater than any other pathogen 16 , with 3.1 million genomes deposited in Global Initiative on Sharing All In uenza Data (GISAID) through August 2021 17 . Genomic data has been vital to the early detection of mutations and monitoring of virus evolution, as well as evaluating the degree of similarities between circulating variants with vaccine strains, especially since the availability of SARS-CoV-2 vaccines 18 .
Several studies have employed genomic data to examine the evolution and associated spread of dominant variants in one country or one region, raising a claim of the rapidity of local transmission of SARS-CoV-2 variants and urgency of genomic surveillance 6, [19][20][21] . However, the lack of representation of genomic data from low and middle-income countries (LMICs) was most concerning in these studies 6, [19][20][21] . Indeed, the impact of genome data is dependent on their quality, and the reliability and accuracy of such data may in uence the global community's ability to track the spread of variants in a timely manner.
In this study, we aimed to investigate the global diversity of SARS-CoV-2 genomic surveillance and the global sequencing coverage of con rmed cases and public availability extent of genomes. In addition, we sought to map the global identi cation and spread of SARS-CoV-2 variants. This data can provide evidence to better inform policy on SARS-CoV-2 surveillance.

Results
We classi ed genomic surveillance strategies for 105 countries, including 72.3% (34/47) of WHO-de ned Africa Region countries, 58.5% (31/53) of European Region countries, 61.9% (13/21) of countries in the Eastern Mediterranean Region, 31.4% (11/35) of countries in Americas Region, 37.0% (10/27) of countries in the Western Paci c Region, and 54.5% (6/11) of countries in the South East Asia Region (Table S4). We downloaded a total of 3.8 million deduplicated SARS-CoV-2 sequence samples from publicly repositories corresponding to samples collected between 1 December 2019 to 15 September 2021 in 164 countries. Additionally, we collected o cial aggregated data of variants from 55 countries, and extra data for the rst identi cation of COVID-19 variants from 33 countries.
The global diversity of SARS-CoV-2 genomic surveillance strategies and sequencing availability We observed marked geographical heterogeneity in genomic surveillance of SARS-CoV-2 across countries. Globally, a total of 60.0% (63) countries had performed routine genomic surveillance, 26.7% (28) countries implemented limited routine genomic surveillance, 13.3% (14) countries had no routine genomic surveillance with the remaining countries (89) having no data on genomic surveillance strategy identi ed (Fig. 1A). Surveillance diversity across various countries was also re ected in the context of target populations, sampling method, sequenced proportion, and diagnostic criteria (Table S4). Speci cally, 32 countries randomly selected or used all con rmed cases with su cient quality for sequencing, and 18 countries adopted PCR assay to screen or con rm variants. From the regional perspective, limited or no routine genomic surveillance was common in the Eastern Mediterranean (84.6%, 11/13) and Africa (61.8%, 21/34), followed by the South East Asia (50.0%, 3/6), Americas (36.4%, 4/11), Western Paci c (20.0%, 2/10), and Europe (3.2%, 1/31) (Fig. 1A). However, among the 167 Member States with data accessible, the availability of SARS-CoV-2 sequences was high in 79 countries, moderate in 80 countries, and low in 8 countries (Fig. 1B).

Sequencing coverage of SARS-CoV-2 con rmed cases
The European (57.3%) and Americas (35.3%) Regions uploaded the most SARS-CoV-2 sequences to public repositories, with marked intra-region heterogeneity across countries, with a range from 3 (Saint Kitts and Nevis) to 1.1 million (United States) as of 15 September 2021 (Fig. 1D).
Since September 2020, no more than 4.0% of global con rmed SARS-CoV-2 infections were sequenced, with a relatively high proportion of infections sequenced in March (3.4%) and early August (3.8%). In any week, the Africa, South East Asia and Eastern Mediterranean Regions (Fig. 1E) sequenced no more than 1.5% of cases. At the end of May 2021(start of Delta widely spreading), Europe had the highest sequenced proportion of 9.6%, followed by Western Paci c (2.1%), Americas (1.4%), Africa (1.1%), South-East Asia (0.1%), and Eastern Mediterranean (0.09%) (Fig. 1E). At the country-level, higher rates of sequencing were observed in Iceland, New Zealand, Denmark, Australia, Luxembourg, Norway, Finland, and United Kingdom, all of which had at least 10% reported infections sequenced as of 15 September 2021. In addition, almost all countries in the Africa and Eastern Mediterranean had sequenced less than 2.5% of con rmed infections, except for Gambia (6.3%) and Djibouti (3.1%) (Fig. 1F).

Public availability extent of SARS-CoV-2 variant sequences
The public availability extent of SARS-CoV-2 genomic data varied across variants and countries. Overall, among countries with aggregated data on the number of variant infections, less than half of sequences of Alpha, Beta, Gamma, and Delta were publicly available in 38.5% (20/52), 20.9% (9/43), 25.8% (8/31), and 45.5% (20/44) countries, respectively (Fig. 2). However, the result for Alpha might be in uenced by SGTF detected via PCR. The public availability extent of Delta variants across countries ranged from 0.0% (Cyprus, Hungary, Laos) to 100.0% (New Zealand, Indonesia, Netherlands, South Africa). At the country level, low sharing proportions (less than 50%) across all VOCs was found in several countries, including Austria, Cyprus, Greece, Hungary, Philippines, Thailand and Senegal. For example, the sharing proportion of Alpha, Beta, and Delta in Thailand was 10.5%, 15.0%, and 4.0%, respectively, which indicated more than 85.0% genomic data of variants were not uploaded.
Moreover, incomplete metadata attached to GISAID sequences was common globally, with about two thirds of sequences missing demographic information (age and sex), and more than 95% of that missing clinical information (e.g., symptom history, clinical outcome, and vaccination status) (Table S5). Highincome regions tend to have low information completeness, especially in the European Region, where less than 25.0% and 3.0% of sequences had demographic and clinical information, respectively, signi cantly lower than other Regions (P < 0.0001, Chi-square test). Besides, 93.4% sequences were reported by subnational geographies, with a low proportion in Western Paci c (58.2%) and Eastern Mediterranean (76.8%).
Earliest identi cation of SARS-CoV-2 variants across regions Alpha was rst identi ed in the Europe, then in Africa, Americas, and South-East Asia in September-October 2020, followed by spread to Eastern Mediterranean and Western Paci c in November 2020 (Fig. 3A). The earliest publicly available sequenced Beta infection was sampled in Africa in May 2020, and subsequently identi ed in other regions (Fig. 3B). Gamma variant detection remained spatially constrained after it was rst identi ed in Brazil (Fig. 3C). After the rst identi cation of Delta in October 2020 in South-East Asia, global spread occurred after January 2021 (Fig. 3D).

Global and regional spread of SARS-CoV-2 variants
The number of new VOC cases dramatically increased until April 2021, with a peak weekly value of about 100,000 VOC cases sequenced in which most of them were Alpha variants (Fig. 5A). Subsequently, another peak of weekly new VOC case occurred in early August 2021, but with a large amount of Delta variants. The number of VOC cases may be an underestimate for the most recent weeks due to a collection-to-report time delay. Notably, this increase was also accompanied by the increase in the volume of new sequenced cases and new COVID-19 con rmed cases.
The global prevalence of reference (non-variant) strains fell into a low level of 0.6% in the period of Jul-Sep 2021, compared with 14.0% of that in 2020 (Fig. 4). Globally, the COVID-19 pandemic was driven by the circulation of Alpha at the start of 2021, with an average prevalence of 50.4% in the rst quarter of 2021. Alpha variants continued to outcompete other strains in the second quarter of 2021, accounting for 59.8% of the contemporary lineages (Fig. 4). However, the rapid global rise of the Delta variant began in May 2021, reaching a global prevalence of nearly 98.8% at the end of August 2021 (Fig. 5B). In contrast, Beta and Gamma variants remained at low prevalence (Fig. 4). Additionally, the shifting of predominant variants from Alpha to Delta rst occurred in South-East Asia where the proportion of Delta exceeded 60.0% in April 2021 (Fig. 5J).

Discussion
Our study characterized the global diversity of genomic surveillance strategies and sequencing availability, sequenced coverage of SARS-CoV-2 cases, public availability extent of variant sequences, as well as current epidemic situation of SARS-CoV-2 variants. We found that genomic surveillance strategies were globally heterogenous, with limited or no routine surveillance among many countries in the Africa and Eastern Mediterranean Regions. Our analysis of publicly deposited SARS-CoV-2 sequences implied that the sequenced coverage is low in most countries, with a low proportion of VOCs sequences shared to public repositories. The pervasive spread of Alpha and Delta variants further highlights the threat of SARS-CoV-2 mutations despite the availability of vaccines in many countries.
The diversity of SARS-CoV-2 genomic surveillance between countries is associated with country-speci c priorities (e.g., surveillance objectives, targeted monitoring, or event-/risk-based sequencing) and available resources. ECDC recommends population-based and/or targeted sampling strategies (e.g., imported cases, cluster cases, and potential vaccine escapers) for genomic surveillance, which could provide a more representative estimate of the relative prevalence of variants. Notably, several countries, many of which are classi ed as low-or lower middle-income countries by the World Bank, lack genomic surveillance data, likely due to limitations in infrastructure capacity and resources. However, even some countries classi ed as high-income, have suffered from a slow and inconsistent process of adopting genomics-based surveillance 22 . Establishment of reference laboratories and networks to provide sequencing services for countries without established sequencing capacity may enable improved detection and tracking of emerging variants worldwide.
The detection of most variants relies on the full-length or partial genomic sequencing, but the sequences only become available for the global community when the laboratories have established sequencing capacity, willing to share, and legally allowed to upload them. The discrepancies in sharing was observed in each region, which con rmed that some countries are sequencing but are not uploading. However, our study observed a sharing extent of exceed 100% exists in some countries, likely due to delays in the o cial reporting of sequencing results, or the incomplete o cial reporting system. The timely sharing of those enables to adequately contextualize local data when looking at introductions and examine transmission routes, as well as to look for sites of repeated mutations that can guide laboratory work on characterizing those mutations effects on therapeutics and vaccine e cacy. The underlying reasons why some countries didn't share might be related with the distrust for publicly repositories in the concept of data security.
We found relatively low completeness of demographic and clinical characteristics in metadata accompanying uploaded sequences. Our analysis revealed that high-income countries frequently did not share demographic information. A possible reason for this is that these regions may having more restrictive data privacy/laws preventing/discouraging the release of this information. Genomic data coupled with those additional data can maximize the utility of genomic sequences in rapid scienti c discovery during this pandemic, which are valuable for in-depth epidemiological analyses to characterize risk factors, clinical severity, and other public health risk of variants [23][24][25] . Therefore, it's vital to optimize the sharing of information in a secure and trusted channel in the context of protect patient anonymity and in accordance of local regulations 26 . In addition, decreasing the lag between sample collection to deposition of these sequences 27 , including the timely sharing and standardizing of metadata 18,23,28 , may facilitate the design and development of treatment and prevention strategies by policy-makers 29 .
An important role of genomic surveillance is to investigate the spread and dynamics of SARS-CoV-2 variants. Amidst the emergence of different variants, the current dominance of the Delta variant suggests that it may possess higher tness than other variants, which might be associated with a combination of higher virus load, and shorter incubation period and serial intervals [30][31][32] . The decrease in real-world vaccine effectiveness against Beta variants 11 and increasing breakthrough cases with Delta variants 33 underlines the importance of determining the local or regional patterns of variant spread, including the need to develop new or modi ed vaccines to achieve adequate protection 34 .
Our results should be interpreted in view of several limitations. First, the lack of data from some countries limited our global mapping. The data completeness and quality could be impacted by key steps in the surveillance or reporting, including differences in diagnostic criteria, under-reporting, delayed reporting, and reporting methods. The inconsistent diagnostic criteria of variants might cause sampling bias, especially when adopting PCR assay to detect Alpha variant owing to its non-speci city 35 . We did an extensive search to collect multi-source data and chose the aggregated data with a priority to sequencing results rather than PCR-screening results. Second, the analysis of global and national spread could be biased as data from public repositories or aggregated dataset are not always representative of the variants circulating in the regions, especially for the regions with relatively limited sequencing capacity or with investigating outbreak-based events. Therefore, the global prevalence of variants may be biased due to the uneven sequencing across the regions. Indeed, it's di cult to obtain a truly representative and random sample, and how to understand these biases will become important 24 . Lastly, the detailed demographical, epidemiological and clinical information about variant cases cannot be accessed, which limited our further epidemiological analysis about variant spread.
In conclusion, our study provides a landscape for genomic surveillance, the global coverage of sequencing and public availability of sequences, as well as the evidence for rapid spread of SARS-CoV-2 variants. Our ndings suggest that global SARS-CoV-2 genomic surveillance strategies and capacity are diverse, and may be limited in some regions, especially in the context of the global spread and dominance of variants of concern. The gap still exists in sequencing availability and magnitude, therefore international efforts are needed to address some genomic bottleneck, and more work are needed to be done in de ning the ideal sampling schemes for different purposes and sharing these data in public repositories to allow for further rapid scienti c discovery.

Data sources and collection
Through extracting country-speci c data from multiple publicly available sources, we built three datasets of genomic surveillance, deposited genomic data in publicly repositories, and o cial aggregated genomic data as of 15 September 2021.

Dataset of genomic surveillance
Each country's genomic surveillance strategy and sequencing availability was gathered from searches of the websites of regional WHO, the country's Ministry of Health, CDC, local academic partners, and o cial news, supplemented by a literature search (appendix). Data extracted included the overall surveillance strategy, sequencing availability, target population, sampling method, diagnostic criteria, and sequenced volume. Given that the surveillance strategy and density may change with time, we only gathered information on the most recent surveillance strategy (as of 15 September 2021).

O cial aggregated dataset
To gain additional insights regarding the public availability (sharing) extent of genomic data of SARS-CoV-2 variants, we extracted country-speci c, variant-speci c, and time-speci c aggregated data on the number of SARS-CoV-2 variant cases from o cial websites, using the same sources as above, except for the literature source. The search was done by either directly locating to the o cial website for each country or indirectly searching in search engines (Google, Bing, Baidu) by using the terms "variant" and country name. To supplement the aggregated data of variants that we collected, we also downloaded the aggregated data with a valid denominator (namely, the number of isolates sequenced is reasonable) from the European Surveillance System (TESSy). The aggregated dataset included country name, date of report or collection, new or cumulative numbers of different SARS-CoV-2 variant cases and total sequenced cases, as well as the new number of con rmed cases. COVID-19 epidemic data were derived from WHO 43 . Considering that the diagnostic criteria of SARS-CoV-2 variants varies in different countries, the general principle for collecting aggregated data was to give priority to the results based on whole and partial genome sequencing instead of those based on a PCR assay.
For countries noted by WHO as having had VOCs identi ed, but no data in publicly repositories, we collected the information about when VOC/VOCs were rst detected from country's Ministry of Health and o cial media news. The search of media news was also done in search engine queries (Google, Bing, Baidu) using combined terms " rst" and "variant" and country name.
All data were entered into a structured database by a trained team (co-authors). All recorded data were cross-checked by coauthors. Details of data sources and data completeness are shown in Appendix (Table S1-S2, S4-S5).

Data analysis
We used the variant naming system proposed by WHO, where four VOCs and ve Variants of Interest (VOIs, included Eta, Iota, Kappa, Lambda, and Mu) had been designated as of 20 September 2021 5 . Our study focused on analyses on four VOCs in the 194 Member States of WHO. We did not integrate data from the overseas territories into that country's data. Given most countries weekly released aggregated data of variant, most of our analyses were performed on a weekly basis.
To characterize the global diversity of SARS-CoV-2 genomic surveillance, we classi ed the surveillance strategy of each country into three categories: 1) routine genomic surveillance; 2) limited routine genomic surveillance; and 3) no routine genomic surveillance. Routine genomic surveillance was de ned as conducting nationwide genomic sequencing, coupled with at least 150 specimens per week or 10% of all samples sequenced 28 . The classi cation of most African countries was from a pre-de ned version from African CDC 44 , while other countries were subsequently de ned according to the de nition criteria (Table   S3). As we could not identify information on the surveillance strategy for some countries through public sources, we also classi ed three extra categories according to the public availability/ability of genomic sequencing: 1) high availability; 2) moderate availability; and 3) low availability (Table S3).
The process of data cleaning in publicly repositories and aggregated dataset is shown in the Appendix. For data that was reported in aggregate, when the date of sample collection was not available, we assumed a xed three-week lag 45 from sample collection to reporting unless other country-speci c information was available to inform this extrapolation.
The sequencing coverage was inferred by using the percent of cumulative positives sequenced as a proxy, de ned by the ratio of the number of isolates sequenced to the number of con rmed cases in the same unit of time. Given that not all sequences will be uploaded to genomic repositories, we analysed the public availability extent of genomic data by comparing the cumulative number of variants between public repositories and o cial aggregated datasets. Since the Alpha variant had a characteristic deletion of amino acids 69-70 in the SGTF that can be detected via a widely used PCR assay 46 , we performed this analysis across VOCs.
We plotted the earliest time when the rst VOC specimen was identi ed in each country. The earliest identi cation was de ned by the earliest sampling time of sequences deposited in publicly repositories. If a VOC was identi ed by WHO but no corresponding sequence in publicly repositories for one country, we used the date collected from other sources. The sequences with sampling date earlier than the earliest sample identi ed in the United Kingdom (for Alpha), South Africa (for Beta), Brazil (for Gamma) and India (for Delta), respectively, were not used in analyses.
We also described the global and regional prevalence trend of variants. The prevalence of variant was de ned as the proportion of the variant number to the sequencing number in the same period. When multiple data sources were available for one country, the most abundant dataset was chosen. For example, if there was publicly available genomic data as well as an o cial aggregated dataset, the priority was given to the one with highest reported sequenced numbers in a speci c week.
The comparison between ratios of two groups was done using the Chi-square test for, and differences were considered statistically signi cant at P-value < 0.05. All statistical analyses and visualizations were done using R (version 4.0.2).

Declarations
Contributors H.Y. designed and supervised the study. Z.C., J.Z., Y.T., R.S, X.X., Y.W., W.L., S.G., and Z.Z. collected data. Z.C., J.Z., Y.T., R.S, and X.X. prepared the tables. Z.C. analysed data, prepared the gures, and wrote the rst draft of the manuscript. A.S.A., X.C., J.Y., D.T.L., D.B.D, and H.Y. interpreted the results and revised the content critically. All authors contributed to review and revision and approved the nal manuscript as submitted and agree to be accountable for all aspects of the work.
Declaration of interests H.Y. has received research funding from Sano Pasteur, GlaxoSmithKline, Yichang HEC Changjiang Pharmaceutical Company, and Shanghai Roche Pharmaceutical Company. None of those research funding is related to COVID-19. All other authors report no competing interests.

Role of the funding source
The funder had no role in study design, data collection, data analysis, data interpretation, or writing of the report. The corresponding author had full access to all the data in the study and had nal responsibility for the decision to submit for publication.

Data sharing
Any additional information is available from the lead contact upon reasonable request.  The public availability extent of SARS-CoV-2 genomic data to publicly repositories by country and the variants of concern. The public availability extent was de ned as the ratio of the cumulative number of variants in publicly repositories to the o cial reported number of variants within the same period. The ratio of exceed 100% in some countries might be due to the delay in the o cial report of the sequencing results or the incomplete o cial reporting system. In view of the availability of o cial data, the cumulative number of variants in different countries corresponds to different time periods, with detailed information in Table S6. Sequence without date of specimen collection in publicly repositories is not included in our analysis. The variant data for China only includes cases reported by mainland China. The o cially reported number of Alpha variants might contain those con rmed by the PCR screen assay. The values beneath the country names show numbers of cumulative variant as of one speci c week: variants in publicly repositories/o cial reported variants.

Figure 3
The earliest identi cation of Alpha, Beta, Gamma, and Delta variant in each country. If information about the earliest sampling date was unavailable but that of the earliest reporting date was available, we extrapolated the sampling date by using a xed three-week lag from sample collection to reporting.
Countries with darker red indicate earlier samples, and with darker blue refers to later samples. The white areas represent countries with variants unreported or data unavailable.

Figure 4
The prevalence and temporal dynamics of reference strains and four SARS-CoV-2 VOCs by country. The date presented in the top refers to the range of date of specimen collection. The prevalence was de ned as the proportion of the strain number (reference strains or variants) to the total number of sequences generated in the same unit of time. Reference strains includes lineage A, A.1, B, and B.1; the sub-lineages of four VOCs are aggregated with the parent lineages. The grey areas represent countries with no COVID-19 epidemic, or no performing sequencing, or no uploading genomic data to publicly repositories.