Host genomes for the unique SARS-CoV-2 variant leaked into Antarctic soil metagenomic sequencing data

Recently Csabai et al. 1 have found a most likely contaminated metagenomic sample set from Antarctica that contained traces of unique SARS-CoV-2 variants. This is a short followup of that report where we attempt to ﬁnd genetic footprint of the hosts. With reasonable conﬁdence we could identify genetic material from mitochondria of Homo sapiens, green monkey and Chinese hamster. The latter two most probably originated from cell lines Vero E6 (or COS-7) and CHO respectively, which are frequent laboratory culture media for studying viruses including SARS-CoV-2 and its closest relatives.


Introduction
Despite the significant worldwide efforts we still do not know the origin of COVID-19. For this reason, detailed analysis of any sample containing early genetic material from SARS-CoV-2 is of paramount importance. Csabai et. al 1 have recently found samples under project ID PRJNA692319 2 with 12 samples, all were collected from soil at King George Island, Antarctica in a 3 week period after 2018-12-24. Three of the samples, SRR13441704, SRR13441705 and SRR13441708 contain enough sequence reads to cover the SARS-CoV-2 genome at average depth of 17x which allows to call mutations. The DNA from soil was extracted in December 2019 and sent to Sangon Biotech, Shanghai, China 3 , but the more exact time of the sequencing is not known. According to the phylogenetic analysis the samples seems to be very early, maybe one of the earliest -to our knowledge not published -variants most closely related to Pango lineage "A". They also contain some unique rare mutations and a long deletion starting at nucleotide 21761. After posting our preprint 1 the public access to the sequencing data was revoked 4 at SRA, but a month later the public access was restored 5 .
We think that the most likely scenario is that the SARS-CoV-2 sequence fragments are not from the Antarctic soil samples, but are mistakenly assigned reads from another sample(s) sequenced on the same flow cell at the Sangon Biotech sequencing facility in Shanghai. The equipment, Illumina HiSeq 4000 that was used for sequencing according to the project's metadata, is known to be prone to barcode misassignment errors 6,7 .
In a nutshell, as detailed in Illumina's whitepaper 6 , this artifact is related to the exponentially increasing density of next-generation sequencing (NGS) chips. Nowadays a single flow cell has such a large capacity that it is rarely filled up by a single sample. For example the 12 antarctic soil samples in PRJNA692319 altogether contain around 320 million paired end reads, while Illumina HiSeq 4000's capacity is up to 10 billion paired end reads. The increased capacity is utilized through multiplexing, which adds unique sequences, called indexes or barcodes, to each DNA fragment during library preparation. Before final data analysis, sequencing reads from pooled libraries need to be identified and sorted computationally in a process called demultiplexing. This allows large numbers of libraries to be pooled and sequenced simultaneously during a single sequencing run. After adapters are ligated to nucleic acid fragments, the products are cleaned up to remove any free, unligated adapters. Libraries can be cleaned up by a bead-based or gel purification step to remove free adapters or primers. Failure to remove free adapters or primers can lead to contamination of prepared libraries and may result in index hopping and misassignment. As a results, sequencing reads from other, not related samples that are sequenced on the same flow cell can get into our samples.
This putative misassignment mechanism may be related to the fact that most of the suspected contaminated paired-end sequenced samples contained mostly only the R2 mates (reverse 5' Illumina reads) of SARS-CoV-2. In the following we attempt to recover genetic sequences of the host(s) of the detected viruses.    Figure 2. The average depth for the human mitochondrial reference genome as a function of the distance from the nearest gene's end. Union of R2 reads from samples SRR13441704, SRR13441705 and SRR13441708 was used. The depth is the lowest at the end position of the genes (distance=0), and growing towards the internal parts (larger positive or negative distances). This may indicate that not DNA, but the expressed mRNA was sequenced.

Results
We hypothesised that, like the virus sequences, the host(s)'s genetic material may have been leaked into the sequencing data in a similar way. All known hosts of betacoronaviruses are mammals that have huge genomes so it is not likely that they will be completely captured in the studied samples. The size of mammalian mitochondrial genome (mtDNA; 16knt), on the other hand, is comparable to that of a coronavirus. Another advantage is that a single mammalian cell may contain thousands of mitochondria, each with its own genome, resulting a higher genome coverage.
We have downloaded the current database of all assembled mitochondria from NCBI and to cast our net somewhat wider than mammals only, we have extracted the ones, altogether 6158 species, that contained the term "Vertebrata" in their taxonomy annotation. We aligned separately the R1 and R2 reads of the of PRJNA692319 to this vertebrate mtDNA reference genome set.
When multiple genomes are used in parallel, the bowtie2 alignment software assigns each read to the most perfectly matching part of the best matching genome but due to the homology between closely related species, this may sometimes    result false alignments. Also, low complexity non-mtDNA reads produce false positive alignments, so simply the number of reads aligned to a genome may not be a good statistical measure of the true presence of a species. Instead we calculated the mitochondrial genome coverage (percentage of nucleotides covered) and average depth (average number of reads aligned to a given nucleotide) for each species. Table 1 shows the top 10 results for samples SRR13441704, SRR13441705 and SRR13441708 that were the most abundant in SARS-CoV-2 reads and as comparison two other samples, SRR13441701, SRR13441702 in which we could not identify SARS-CoV-2 reads. The SARS-CoV-2 abundant samples have in general larger mtDNA content and for certain species we have found orders of magnitudes higher coverage than in the samples where SARS-CoV-2 could not be detected. In these samples the extra counts mostly come from three species, Homo sapiens, Cricetulus griseus and Chlorocebus sabeus. We can observe the same asymmetry as with the SARS-CoV-2 counts (see Table  1 in Csabai et al. 2022 1 ) that there are more hits for the R2 mates than for the R1 ones. Since the abundance of the mtDNA content correlates with the abundance of the SARS-CoV-2 content (see also be considered as the hosts of the detected SARS-CoV-2. The coverage of the mitochondrial genome of this three species can be seen in Figures 1 , 3

and 4.
Since these three mammals are not arctic species, this also confirms the hypothesis that SARS-CoV-2 was introduced as a contaminant in the samples. We do not examine here the other less abundant top alignments, like Geotria australis or pouched lamprey which is lineage of jawless fish widespread in the Southern Hemisphere. They most probably come from the original Antarctic soil samples that were collected around seal and penguin colonies or they are only chance alignments.

Discussion
As we are investigating a small amount of genetic material from putative contaminants of unknown origin, it is difficult to make definitive statements. Accordingly, we caution the reader not to take the following as proven final scientific truth. We try to stick to the facts and refrain from making unsubstantiated hypotheses but at the same time squeeze every last drop of information out of the available data.
Homo sapiens is not a surprising find for a possible host. Since some of the reads aligned to other Homo species like Homo sapiens neanderthalensis, Homo sp. Altai and Homo heidelbergensis we realigned all the SRR13441704,05,08 R2 reads solely to NC_012920.1 Homo sapiens mtDNA genome. As shown in Fig. 1 the mitochondrial genome is mostly covered with average depth of 85, but there are large fluctuations. Aligning the fluctuations to the genome annotation, it is very striking to the eye that the sequencing depth is lower at gene junction positions. We have calculated the average depth as a function of the distance from the nearest gene's end, as shown in Figure 2. Indeed, the depth is the lowest at the end position of the genes, and growing towards the intragenic parts. This indicates that not DNA, but the expressed mRNA was sequenced for these samples, maybe just as a by-product of the SARS-CoV-2 RNA virus sequencing.
With the caveat that the coverage is gappy and low, and there may be sequencing artifacts, we have attempted to call mutations and reconstruct the consensus human mtDNA genome. Altogether we have found 23 mutations: A263G, A750G, A862G, A866C, C870A, T903C, A918G, T921C, A929G, A939G, A1041C, T1055C, A1110G, T1119C, G1130A, A1154C, A1438G, A4769G, A8860G, G15043A, A15326G, C16223T, A16293G. We determined its haplogroup with the mixemt tool 8 , that were designed for forensic analysis of low quality sequence data from complex mitochondrial mixtures. It gave a rare South-Asian haplogroup M48. We have also compared the list of mutations to the 1000 Genomes Project's 9 mitochondrion database. It may be just an interesting coincidence that the closest known relatives of SARS-CoV-2 are coming from bats that live in caves close to Xishuangbanna in Yunnan province, and from the Indochinese peninsula 10 .
Cricetulus griseus or Chinese hamster was first used as a laboratory animal to type pneumococci obtained from human patients 11 and became a valuable animal model in other infectious disease and epidemiology research studies 12 , including MERS-CoV, SARS-CoV 13 and SARS-CoV-2 14 . There is also a Chinese hamster ovary (CHO) cell line which is widely used in biological and medical research, primarily for toxicity screening and for expression of recombinant antibodies. For example in Fukushi et al. 15,16 in vitro passage of SARS-CoV on CHO cells that expressed rat ACE2 yielded a new adapted variant that encoded two new mutations in the spike protein. The same cell line was used for SARS vaccine development 17 and recently also in several SARS-CoV-2 related research studies 18,19 .
Chlorocebus sabeus or green monkey is infamous for zoonosis researchers as it was associated with Marburg virus that was discovered in 1967 in a laboratory outbreak in Germany among workers exposed to cell cultures from green monkeys imported from Africa 20 . The 'Vero' cell line that was isolated from kidney epithelial cells of C. sabeus 21,22 has various applications in molecular and cell biology research. Vero E6 cells enable achieving high titers of SARS-CoV and are also used in the cultivation of Rabies virus, Reovirus and Japanese encephalitis virus 23 . The first isolation of bat coronaviruses that uses the ACE2 receptor was also done in Vero E6 cells in 2013 from Chinese horseshoe bats (Yunnan, China, RsSHC014 and Rs3367) that are known to be among the closest relatives of SARS-CoV 10,24 . Two years later, using also Vero cell line, genetically engineered highly pathogenic chimeric virus was reported that reproduced in human lung cells in cell culture as well as in the mouse lung with the corresponding pathogenesis in animals 25 . This and similar other experiments are one of the sources that feed the lab-leak hypothesis, intensively discussed in scientific and public media 26 . There is another SV40 transformed C. sabeus based fibroblast-like cell line, COS-7 that is used among other applications to study viral entry and inhibition 27 . The Vero E6 and COS-7 cell lines are commonly used in COVID-19 research, too. The first paper from Wuhan Institute of Virology to report isolating SARS-CoV-2 28 also used Vero cell line and a PubMed search for keywords "SARS-CoV-2"+"Vero"+"E6" as of January 2022 lists 322 articles 29, 30 .
It is interesting that while the human mitochondrial genome is mostly covered from end to end, as can be seen on Figures 3  and 4, only the first ∼ 3200nt of the C. sabeus, and almost complementarily, starting around at 2600nt the other end of the C. griseus genomes are covered. The position of these breakpoints may not be coincidence. Looking closer the annotations, we 4/8 noticed that the C. sabeus genome is covered just up to the end of its 16S ribosomal RNA gene at nucleotide 3225 and the coverage for C. griseus start just at the end of its respective 16S rRNA gene at nucleotide 2656. Based on this we hypothesise that the samples may contain heterokaryons from fused cells, and part of each homologous (mitochondrial) genomes are silenced. Though the details are elusive, as other viruses SARS-CoV-2 can also induce cell fusion and result syncytia formation 31,32 .
Combination of the above two cell lines, a binding assay between CHO cells and Vero E6 cells was also developed 33 for SARS-CoV research. In this assay, the SARS-CoV S protein expressed by transfection into CHO cells mediates adhesion to Vero E6 cells that possess ACE2 34 . Cell fusion assays or serial passage of viruses in cell lines 35 may be related to the fragmented coverage of the mitochondrial genomes as seen in Figures 4 and 3. Combination of green monkey and hamster cell lines are also used for SARS-CoV-2 research [36][37][38] .
All these findings support our hypothesis, that the identified species (or their cell lines) are not just chance alignments but likely hosts of the SARS-CoV-2 viruses found in the PRJNA692319 sample set. Based on the available information we cannot tell neither the exact number of the contaminating samples nor if each contained both the mammal cells and SARS-CoV-2, but we suspect that there was one or more SARS-CoV-2 infected human (or human cell line) samples and one or more green monkey (or cell line) and Chinese hamster (or cell line) or cell-fusion samples.
Since SARS-CoV-2 is an RNA virus, without reverse transcription to cDNA an infected researcher could not contaminate the samples during the lab preparation steps. Running PCR on virus containing samples in the same lab may be source of the contamination, but most probably that would not pick up host's genome fragments in comparable concentration to SARS-CoV-2, that we have found. The presence of the animal model genomes indicate that some of the samples are not just simple samples from COVID-19 patients, rather they come from research experiments using simultaneously more cell lines or animal models.
As we have shown previously 1 SARS-CoV-2 mutations in these samples show a unique pattern most closely related to the earliest known variants, but we could not find exact match neither in the published literature nor in the most comprehensive database of GISAID. The absence of more precise information on the contaminator samples leaves open the possibilities that either there are known samples that have escaped our attention, or that there are unpublished results that may be key to identify the origin of SARS-CoV-2.

Materials and Methods
The analysed sequencing project can be found under project ID PRJNA692319 but after the publication of the first preprint 1 the authors have removed the public access to the affected three samples.
The mitochondrial genome database of the 1000 Genome Project 9 was downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz We have used The bowtie2 alignment software 39 with the -local to align the raw sequencing reads to all vertabrate mitochondrial genomes, downloaded from https://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/. The samtools coverage command 40 was used to calculate the depth and coverage for the mitochondrial genomes. For realignment to the human mitochondrial genome alone, we have used the slower but more sensitive bowtie2 -local -very-sensitive-local command. Mitochondrial genome coverages were visualized by the DNA Features Viewer Python library 41 .
The mixemt tool 8 was used to determine the possible haplogroup of the human mitochondria. All the other analysis was done with custom scripts using bash, python 42 and jupyter 43 .