Systematic detection of co-infection and intra-host recombination in more than 2 million global SARS-CoV-2 samples

Orsolya Pipek Dept. of Physics of Complex Systems, ELTE Eötvös Loránd University, Budapest, Hungary Anna Medgyes-Horváth (  horvath.anna@ttk.elte.hu ) Eötvös Loránd University https://orcid.org/0000-0003-4435-5797 József Stéger Eötvös Loránd University Krisztián Papp Eötvös Loránd University Dávid Visontai Eötvös Loránd University Marion Koopmans Department of Viroscience Erasmus Medical Center https://orcid.org/0000-0002-5204-2312 David Nieuwenhuijse Erasmus University Medical Center https://orcid.org/0000-0003-1310-5031 Bas Oude Munnink Erasmus Medical Center https://orcid.org/0000-0002-9394-1189 VEO Technical Working Group Eötvös Loránd University István Csabai Eötvös Loránd University https://orcid.org/0000-0001-9232-9898


Introduction
Due to the high mutation rate of the SARS-CoV-2 virus, numerous variants emerged since the onset of the COVID-19 pandemic. As diverse variants from different lineages may circulate simultaneously, there is a chance of an individual being infected by two (or more) variants (strains) at the same time, de ned here as co-infection. The rst co-infection cases were reported as early as mid-late 2020 1,2 and since then increasing evidence suggests, that co-infections occur frequently, with an estimated rate of ~ 0.2-0.6% of the observed infections [3][4][5][6] . Elevated severity was observed in some case studies 7,8 , but co-infection was also found in a patient with mild symptoms 9 and additional comprehensive studies report further con icting ndings concerning the course of co-infectious disease 2,4,10,11 . Recently, mainly recombinant lineages are circulating around the globe (Omicron XBB-variants) and also several recombinant lineages have been classi ed by pangolin [12][13][14] . For recombination to occur, an individual has to be infected with two different lineages 14,15 which leads to the hypothesis that co-infection should take place frequently. This can happen in immunocompromised individuals, who have been described to be prolongedly infected with SARS-CoV-2 16 , but also in the general population 4-6 .
Co-infections have been identi ed by investigating samples with ambiguous pangolin assignment 17 results, "heterozygous" consensus sequence positions, or inconclusive PCR genotyping 5,10 . However, some studies from different countries (USA, France, United Arab Emirates, Brasilia, Costa Rica) perform a systematic search for samples with evidence of two (or more) strains present, mainly focusing on Variants of Concern (VOCs), based on the presence of minor variant genomes (in other words the observed variant allele frequencies (AFs)) in lineage de ning positions [2][3][4][18][19][20][21] . Nevertheless, to our knowledge, no worldwide comprehensive study was realized as of now that identi es coinfection cases on a large-scale, exhaustive dataset that portrays the global trends of the disease. Through international effort, the Versatile Emerging infectious disease Observatory consortium 22 analyses and interprets genomic data from SARS-CoV-2 sequencing samples as one of its subprojects. Thanks to the worldwide research community, an enormous number of SARS-CoV-2 raw read datasets have been deposited to and are shared openly at the European Molecular Biology Laboratory -European Bioinformatics Institute (EMBL-EBI) European Nucleotide Archive (ENA) 23 European COVID-19 Data Portal [24][25][26] . This has allowed the processing of these raw datasets with a standardised work ow, developed by the VEO consortium, thereby reducing variability in the dataset from the use of different bioinformatic work ows. The outputs of this standardized analysis work ow 27 , including mutations, their positions, and allele frequencies, among other genomic information are also made available, which allowed us to detect co-infection cases in more than 2 million samples around the world in a cohesive and reproducible manner, by analysing the major and minor variations in the samples.
The detection of co-infection in WGS data is hampered by several technical di culties. 1) The vast majority of these data is obtained as a result of amplicon sequencing, where the applied ampli cation primers have a signi cant bias for speci c genomic regions. For example, Bal et. al. 4 showed that observed alternate AFs in arti cially mixed samples of speci c ratios of different variant strains hardly represent the original mixture proportions. 2) Even though an immense number of SARS-CoV-2 sequences have been shared publicly to date, typically only consensus sequences are provided 28 . This approach, however, poses a signi cant obstacle to identifying the coexistence of two or more variants at a single genomic position simultaneously. 3) The limited amount of unique lineage-de ning variations and the otherwise high mutation rate of the genome causes di culties in lineage designation. The presence of variant strains with many lineage-de ning mutations (e.g. Alpha, Gamma, Mu, Omicron etc.) can be more reliably detected than that of ones with only a few unique genomic variations (e.g. Epsilon, Iota, Delta, subvariants of Omicron etc.). 4) The risk of contamination during laboratory procedures also cannot be overlooked. The accidental mixing of samples (especially during the time of emergence of a new variant) prepared for sequencing can create the effect of coinfection in the resulting dataset [29][30][31][32] . Different methods have been constructed for systematically detecting co-infection samples from WGS data, with some limitations. The manual evaluation of suspicious samples (i.e., samples with ambiguous PCR genotyping, "heterozygous" positions in the consensus sequence, or inconclusive lineage designation results) used in multiple case studies 5,10 is not e cient on a worldwide dataset with millions of samples. Most studies employ the basic bioinformatic pipeline of aligning sequenced short reads to a reference genome, using a variant-caller to identify mutations and nally determine the presence of co-infection from the resulting AF distributions of lineage-de ning mutations. Most of these methods differ in their applied ltering criteria and the set of considered lineage-de ning mutations 2,4,18,19 . A metagenomic approach using amplicon sequence variant like (ASV-like) fragments was proposed by Molina-Mora et al. to essentially separately assign each read of the sequenced sample to an appropriate lineage. Their method was tested on arti cially mixed samples, but they identi ed no real co-infection cases in Costa Rica in their dataset 20 . Zhou et. al developed a method named Cov2Coinfect based on a hypergeometric-distribution model to assign the most likely lineages. Whenever the sum of the proportions of candidate lineages with consistently present lineage-de ning mutations was approximately 100%, the sample was deemed to be a co-infection sample. They analysed more than 50,000 samples from the USA, resulting in a co-infection rate of 0.3-0.5%, however, their approach automatically rejects samples with inconsistent AFs (not summing to 100%), which may simply be due to sequencing bias of special ampli cation primer systems or intra-host recombination of the lineages present concurrently 6 .
Once a co-infection sample has been reliably identi ed, it is of interest, whether it is a simple mixture of its composing variant strains or if their genomic material has in parts been fused together (recombined), essentially creating a new, possibly transmissible variant. It is widely documented that recombination frequently occurs across Betacoronaviruses in bats and other animal hosts [33][34][35][36] . However, the frequency of recombination depends on the likelihood of dual infections, which is in part dictated by the epidemiology and ecology of infections in different hosts. For instance, the frequent occurrence of recombination has been attributed to the sharing of habitats of large numbers and different species of bats 37 . Whenever genetically distinct strains of a virus co-exist in a given patient during infection, novel strains can be produced with unique mutational landscapes via the process of recombination. Even though it is suspected that recombination is a common event during virus evolution, its traces may be di cult to detect, particularly in case of the SARS-CoV-2 pandemic where the genomic diversity is limited due to the short evolutionary history, and recombination of highly similar parent strains would cause virtually no distinguishable effect in the resulting recombinant genome. However, if the parent strains contain unique genomic variations that are highly speci c to the given strain, the presence of recombinant genomes can be discovered when these mutations occur concurrently. This would be most likely during time periods when such new variants emerge. A sample obtained from an individual infected with a virus that is a product of recombination would contain a subset of the lineage-de ning mutations of both parental strains as major variants (i.e. with high AFs), depending on the genomic position of the recombination breakpoint(s), thus the presence of recombination would be tractable from the consensus sequence 38 .
For example, GISAID (consensus) sequences assigned to a Pango lineage whose pre x starts with an "X" 17,39 all belong to recombinant lineages (or their non-recombinant descendants), comprising about 2% of the total dataset. However, prior to the transmission and spread of a recombinant genome, a recombination event must take place within a single host co-infected by the non-recombinant parental lineages. This leads to sequencing data in which the genomes of the parental lineages are most abundant, and recombination is only apparent in a small fraction of the short reads. Identifying traces of these in situ/incidental/intra-host/subclonal recombination events is unachievable from consensus sequences and require the examination of raw read datasets. Evidence of intra-host recombination for SARS-CoV-2 viruses has previously been demonstrated [40][41][42] , but these studies either analysed only aggregated AF data obtained from raw sequencing results (thus detecting only indirect signs of recombination) or were limited to the examination of samples secured from a single patient. Additionally, even with an initial dataset of 10,000 randomly selected samples, co-infection is expected to occur in about 30-50 cases, allowing for the manual curation of these in search for traces of intra-host recombination, which has been the practice of previous studies 38 .
Here we aim to construct a computational pipeline that rst identi es SARS-CoV-2 samples exhibiting signs of coinfection from a database of more than 2 million samples collected worldwide and processed with a uni ed bioinformatic pipeline. The extensive information made available by sharing raw read datasets [23][24][25][26] afforded us the opportunity to then further investigate these co-infection cases in a meticulous manner to nd traces of intra-host recombination. As manual inspection would have been unfeasible owing to the overwhelming volume of the data, an automated approach was implemented for intra-host recombinant detection. To this end, both the alternate AF distribution of de ning mutations in the samples are examined for the presence of putative recombination breakpoints, and the raw short reads overlapping the genomic positions of multiple lineage-de ning mutations of the parental strains are interrogated for the simultaneous presence of more than one variation in these positions. Both strategies have previously been used for the identi cation of intra-host recombination by hand 4,5,9,18 , albeit both come with limitations. We additionally discuss and emphasize the theoretical and technical di culties hampering automated intra-host recombinant detection that should be taken into consideration to avoid artefacts.
Data les generated during the study, along with detailed computational pipelines and supplementary materials are available on GitHub at the repository github.com/csabaiBio /SARSCoV2-coinf.

Co-infection samples in the CoVEO database
Co-infection samples were de ned as samples in which a given ratio of the mutually exclusive variant de ning mutations of at least two different viral strains was simultaneously present (see Methods and Supplementary Methods 1 for details). Based on the exact value of the threshold used for the required ratio of mutually exclusive variant-de ning mutations, the number of detected co-infection samples varied considerably (Fig. 1a). Setting the threshold to a reasonable, but stringent 0.8, meaning that at least 80% of the mutually exclusive de ning mutations of all consisting variants have to be present, resulted in a total number of 7,700 co-infection samples out of the 2,172,927 good-quality samples of a human host in the CoVEO database 26,43 , corresponding to an overall co-infection rate of 0.35%, which is in line with previous reports (displayed with the blue shaded area in Fig. 1a). There were 72 samples in the database that contained all mutually exclusive lineage-de ning mutations of more than one variant (indicated with the horizontal black dotted line in Fig. 1a).
The most abundant variant compositions of the identi ed co-infection samples were Delta -Omicron (BA.1), Alpha -Iota, Alpha -Epsilon and Alpha -Delta (Fig. 1b). This is in line with the fact, that the top ve variants with the largest numbers of samples assigned to them in the database are Delta, Alpha, Omicron, Iota and Epsilon ( Supplementary  Fig. 1a). Furthermore, we found a near-linear relationship between the number of samples assigned to a given variant in the database and the number of co-infection cases containing the variant ( Supplementary Fig. 1b), which emphasizes that results are largely in uenced by the temporal and/or geographical distribution of the available sample set, calling attention to the importance of systematic worldwide surveillance. Other notable variant compositions include mixtures of two Omicron strains (Omicron (BA.2.12.1) -Omicron (BA.4)) and some combinations of more than two variants (Alpha -Epsilon -Zeta and Alpha -Epsilon -Iota). Additionally, all frequently observed variant mixtures were found in samples that were collected during the time period when worldwide incidences of the two (or more) variants were concurrently high ( Fig. 1c Supplementary Fig. 2).
To rule out obvious artefacts arising due to wet lab contamination, we collected the list of studies in which coinfection samples were found, and calculated study-speci c co-infection prevalences (Supplementary Table 1). Naturally, the percentages of co-infection samples in the listed studies were usually markedly higher than the average rate of 0.35% (to compensate for the vast majority of the studies in the database that had exactly zero co-infection samples), but generally did not reach 10%. Outliers were studies with ENA accessions PRJNA817870 (prevalence 77%), PRJNA827817 (prevalence 71%), PRJNA853723 (prevalence 85%), PRJNA817806 (prevalence 44%) and, PRJNA809680 (prevalence 63%), the former three of which contain arti cial mixtures of Omicron, Delta and Alpha viral variants in different ratios 4,44 , while the latter two comprise preselected co-infection samples detected during the fth wave of COVID-19 in France, between December 6, 2021 and February 27, 2022 4,5 . The fact that not all samples of these studies were identi ed as co-infection cases highlights the rigorous nature of our detection method that ensures that only high-quality sequences with convincing evidence of the traces of all comprising variants are selected. Thus, our estimated co-infection rate of 0.35% best serves as a lower bound for the actual worldwide value.
Naturally, the possibility of contamination cannot be completely ruled out with this method, but its probability is decreased due to the relatively even distribution of co-infection cases across different studies.
The geospatial distribution of the identi ed co-infection samples was for the most part fairly homogeneous, with some exceptions arising likely due to the uneven sequencing capacities of different countries (Fig. 1d).
For example, the highest prevalence of co-infection was measured in France (34%), but the total number of goodquality samples originating from France was relatively low (738) and many of them (206) were uploaded under the study accession numbers PRJNA817806, PRJNA809680 and PRJNA853723, and thus were speci cally pre-selected as co-infection samples or arti cial mixture of two viral variants (see above) 4,5 . In countries with more than 1,000 good-quality samples (labelled in Fig. 1d), the co-infection rate varied in the 0-1.60% range, however, its value largely depends on the local sampling strategy, i.e. if the countries conducted random continuous monitoring or had preset criteria for the inclusion of speci c samples in their work ows, once again highlighting the value of systematic global surveillance.
Additionally, the possibility of identifying contaminated samples as cases of co-infection cannot be ignored. A distinctive clue for such instances could be that accidental mixing during an experimental procedure is likely to affect more than one sample simultaneously, thus co-infection cases sequenced on the same day, on the same instrument, in the same run, on the same ow cell could be potential products of contamination. However, such detailed metainformation could not be assessed for the samples in our dataset. Nevertheless, the large number of investigated samples ensures that rare events of arti cial inter-sample mixing would not largely in uence the results in a statistical sense.
Traces of intra-host recombination in the AF distribution of coinfection samples A naive hypothesis would suggest that the alternate AFs measured at mutually exclusive de ning mutations in a coinfection sample should directly re ect the variant proportions comprising the sample, i.e. a Delta -Omicron (BA.1) sample with variant ratios 80%-20% (respectively) should have alternate AFs at mutually exclusive Delta-de ning mutations of around 0.8 and at mutually exclusive Omicron (BA.1)-de ning mutations of around 0.2. In such an ideal setup, the presence of intra-host recombinant genomes in a sample could be detectable by distinct shifts in the AFs of de ning mutations at recombination breakpoints ( Fig. 2a). This method has already been used for recombinant detection by manual visual inspection by Bolze et al. 18 . However, published evidence proves that measured AFs are rarely consistent with true variant proportions, mostly due to systematic bias introduced by the PCR primers used for sequencing 4 . Nevertheless, we employed a pipeline for the above-detected co-infection samples that aims to identify shifts in AFs along the genomes that could be consistent with recombination breakpoints, while simultaneously correcting for systematic offsets that were established in a previous analysis step (see Supplementary Methods 2 for the whole pipeline with detailed description). Our approach identi ed 13 putative recombinant samples with the ratios of recombinant genomes ranging from 6-16% within the samples. The majority (11)  Notably, 6 out of the 13 identi ed samples originated from study PRJNA817870 in which arti cial Delta -Omicron mixtures were sequenced that were manually generated by mixing Delta and Omicron viral isolates in different ratios 4 . The fact that these samples were detected as putative intra-host recombinants calls attention to the ambiguities and challenges in recombinant identi cation, as traces of recombination is not expected in such an arti cial setting. However, Supplementary Figure S5 of Bal et al. also directs notice to the formation of chimeric consensus sequences in these samples (based on the majority rule) 4 due to the uneven distribution in measured AFs. Based on the manual inspection of Figure S5, arti cial chimera formation was present in at least 25 of the mixture samples in contrast with the 6 putative intra-host recombinants identi ed by our pipeline.

Recombinant reads in raw sequencing data of co-infection samples
Another approach for nding traces of intra-host recombination in co-infection samples is to investigate the raw reads produced during sequencing to identify ones that simultaneously carry mutually exclusive de ning mutations of multiple parental strains. The whole pipeline including down-sampling of all co-infection samples, data acquisition, raw read processing, and visualization of the results can be seen in Supplementary Methods 3. During the analysis, 118 pre-selected co-infection samples were considered to decrease computation time.
Given that raw reads need to overlap mutually exclusive de ning mutations of both parental strains simultaneously for this analysis, their genomic distribution is largely in uenced by the distribution of mutually exclusive lineagede ning mutations. We found a signi cant correlation (Pearson R = 0.725, p = 0.012) between the density of mutually exclusive de ning mutations and the density of reads meeting the above criteria ("overlapping reads") for different genes of the SARS-CoV-2 genome (Fig. 3a). When considering only those overlapping reads that showed signs of recombination (i.e. carried mutations of multiple parental strains), the genomic distribution of recombination breakpoints suggested by these appeared to be moderately correlated (Pearson R = 0.595, p < 0.001) with that found by Turakhia et al., who analysed consensus sequences for the presence of recombination breakpoints (i.e. searched for clonal recombinants) 38 (Fig. 3b).
When exploring the average percentage of genomic positions per sample where the ratio of recombinant overlapping reads out of all overlapping reads ("recombinant read ratio") reached a given threshold of T, we found that even though arti cial samples do contain a substantial number of recombinant reads, the prevalence of genomic positions overlapped by a recombinant read ratio of more than 0.1 is much lower than in true co-infection samples (Fig. 3c). This result suggests that breakpoints supported by no more than 10% of the overlapping reads might be considered artefacts due to chimera formation during PCR. This is further supported by the fact that when calculating the ratio of duplicate reads among recombinant reads in positions with a recombinant read ratio of lower vs. higher than 10%, genomic positions in which recombination was supported by less than 10% of overlapping reads show considerably higher fractions of duplicates, indicating possible evidence of PCR artefacts (Fig. 3d).
To nd recombination hotspots, we considered each genomic position of the genome and calculated the number and ratio of non-arti cial samples in which the given position is part of a recombination breakpoint range based on the evidence of at least 10 reads and a recombinant read ratio of at least 0.1 of all reads overlapping the genomic position (Fig. 4). We found that many of the genomic regions indicated as putative recombination breakpoint ranges in multiple samples coincide with gene boundaries. When analysing the 13 samples previously identi ed as putative intra-host recombinants, we observed no compelling evidence of recombinant reads supporting any of the presumed breakpoints. This is, in many cases, due to the lack of reads overlapping the breakpoint region. For samples with at least one overlapping read, the percentage of recombinant reads at the breakpoint position ranged from 0 to 9.5% (see Fig. 8 of Supplementary Methods 3), which still falls short of the 10% threshold set to discard potential chimeric sequences.

Discussion
The COVID-19 pandemic has brought about never-before-seen international collaboration in sample collection and high-throughput sequencing, leading to the accumulation of publicly available processed (GISAID 28 ) and raw read datasets (EMBL-EBI ENA 23 , European COVID-19 Data Portal 25,26 ) uploaded by various laboratories and researchers worldwide. With the aim of unifying the bioinformatic processing of such diverse data sources, the VEO consortium (www.veo-europe.eu) has developed a cohesive work ow 27 that produces lists of mutations from raw sequencing data, incorporating additional key information, such as alternate AF, sequencing depth and quality scores. The CoVEO database stores this massive number of records in a queryable, searchable format, appended with detailed sample metadata, which allows for straightforward subsequent analysis. To truly harness the vast volumes of added information, compared to the more traditional approach of consensus sequence analysis, novel tools and methodologies need to be developed. In our current study, we examined more than 2 million samples with an automated pipeline for the purpose of detecting co-infection cases and traces of intra-host recombination aided by the available allele frequency and raw read information in the dataset.
The fact that we could recover a high percentage of known co-infection samples from previous studies 4,5,10,18 , suggests that even with a completely independent bioinformatics work ow used for variant calling and an automated approach for co-infection detection, the results are reproducible and robust. Given the huge number of mutations in the SARS-CoV-2 genome since the onset of the pandemic, it is very di cult to assign samples to lineages based on a limited number of de ning mutations, therefore we used an extended set of mutually exclusive de ning mutations and very strong ltering criteria for co-infection detection, which may result in an underestimation of the number of coinfection samples, but ensures that sequencing artefacts are not identi ed incorrectly as signs of co-infection.
The reliability of the detected co-infection samples was further con rmed by the nding that most abundant variant combinations were the ones in which parental lineages were both separately present in large numbers in the database and that sample collection dates for co-infection samples consistently coincided with worldwide incidence peaks of their comprising variants. However, based on the near-linear trend uncovered between the number of samples assigned to a given variant and the number of co-infection samples containing that variant, the initial sample composition of the investigated dataset inherently determines the most frequent variant combinations in co-infection samples. Thus, to draw meaningful, globally relevant conclusions, the database should contain a representative sample of the worldwide cases. This underlines the value of systematic, international, synchronized surveillance and continuous sequencing.
Countries and research projects with strikingly high incidence of co-infection cases among their sequenced samples turned out to be ones speci cally designed for the detailed investigation of co-infection from the outset 4,5,44 . No other dubious correlations were found between sample metadata, which could possibly indicate signs of contamination during laboratory processing. However, artefacts caused by accidental mixing, or the dropout of sequencing amplicons 45,46 can never be entirely ruled out with certainty, which accentuates the importance of providing detailed metainformation about sample preparation and sequencing procedures. As an example, raw read les (FASTQ) produced by Illumina instruments contain multiple speci c features of sequencing ( ow cell IDs, run, lane, tile numbers, coordinates of cluster, etc.) 47 , which would serve as useful guides for exploring suspicious correlations between co-infection samples. For instance, multiple co-infection samples sequenced during a single run on the same ow cell are likely to be the products of contamination. Unfortunately, given that data uploaders are free to assign generic read IDs prior to data submission, the SRA database does not store this information, as it is frequently unreliable. Nevertheless, one of the main advantages of analysing large datasets is that statistically meaningful results can be gained even in the presence of a few arti cial cases. Subsequent to co-infection detection, even more stringent selection parameters were applied for intra-host recombinant identi cation using multiple approaches, as several technical challenges give rise to distinct artefacts in sequencing data.
Arti cially mixed samples may serve as negative controls for recombinant detection, as in puri ed RNA mixture 'coinfection' samples, no recombination is expected. Sovic 4 . None of the samples in study PRJNA827817 were identi ed as putative intra-host recombinants based on AF distribution, but 6 out of the total 13 intra-host recombinants detected originated from study PRJNA817870. Bal et al. discuss in detail the emergence of arti cial chimeric sequences when creating the consensus sequences of their mixture samples using the majority rule with multiple computational tools. The effect is due to the uneven distribution of AFs at de ning mutations, which also causes the incorrect identi cation of these samples as intra-host recombinants in our pipeline. The offset in AFs is attributed to artefacts caused by primer bias during PCR ampli cation. Regardless, our analysis method detected only 6 of the mixture samples as putative recombinants, while the traditional approach of consensus calling resulted in chimeras for at least 25 of them (based on Supplementary Figure S5 of Bal et al.), emphasizing that raw mutational data can provide more nuanced insights about the samples than consensus sequences. However, whenever possible, putative intra-host recombinants should be con rmed by the re-sequencing of the selected samples and hybrid capture based sequencing is preferred to reduce the risk of PCR artefacts 18 .
On the other hand, based on our read-level analysis results, despite these arti cial samples containing a notable number of recombinant reads, putative breakpoints supported by a recombinant read ratio of more than 0.1 is much lower than in true co-infection samples. This might indicate that using a 0.1 threshold for this parameter can be useful to avoid PCR chimera-related artefacts when identifying recombinants.
A potential mechanism of recombinant genome formation is template switching during RNA replication, i.e. RNAdependent RNA polymerase (RdRp) switches from one RNA strand to another with different sequence during replication 48 . Template switching occurs frequently during subgenomic RNA (sgRNA) transcription, usually around transcription-regulatory sequence in body (TRS-B) sites where RdRp pauses and switches to TRS in the leader (TRS-L) sequence, resulting in a discontinuous transcript 49,50 . Based on our read-level recombination detection on a limited subset of the co-infection samples, recombinant reads that carry genomic traces of sgRNA origin (either they contain the leader sequence and/or were soft-clipped during alignment) are indeed located exclusively near these known junction points 49 . This nding emphasizes that fact that a fraction of the detected recombination events occur during transcription and therefore have little to no effect on viral evolution. However, due to short read lengths and fairly long sequences of the transcriptome, a short read might still be sgRNA-derived, even without the presence of the leader sequence and/or soft-clipping.
The detection of intra-host recombinant genomes in co-infection samples is hampered by multiple technical and theoretical factors that together make it nigh impossible to reliably distinguish between true evidence of recombination and artefacts. Here we brie y summarize the main causes of this di culty.
First, given that co-infection samples usually contain signi cant amounts of the two original parental strains, intrahost recombinant genomes comprise only a small portion of the sequenced viral population. Thus, any attempt at identifying recombinants must reckon with decreased coverage and consequently a limited amount of available data.
Secondly, well-known PCR artefacts may lead to misinterpretation of the sequencing data. PCR ampli cation is the standard method for the generation of su cient genetic material prior to sequencing. Most of the sequencing data for SARS-CoV-2 has been produced by a pipeline that incorporates PCR ampli cation in its initial steps. Bal et al. found inconsistencies in alternate AF distribution in arti cial samples of mixed variants, i.e. that alternate AFs measured at de ning mutations in these samples do not correctly re ect the original mixture proportions, instead they usually overrepresent the proportion of the Delta variant 4 . Additionally, based on our observations, a systematic bias can be identi ed in the alternate AF distribution of de ning mutations in co-infection samples of speci c variant combinations. As a result of this, the presence of intra-host recombinant genomes cannot be reliably detected by subtle shifts of alternate allele frequencies along the genome. Another PCR-related artefact is the formation of chimeric sequences. It has been known for decades that during PCR ampli cation, PCR-mediated recombination or "chimera formation" systematically occurs 51 , generating arti cial sequences that are essentially no different from true recombinants. Even though chimera detection tools like UCHIME 52,53 are widely used in sequence analysis pipelines aiming to assess diversity or compare populations, it is virtually impossible to dependably distinguish between true recombination and chimeric sequences, and most methods simply discard all instances of chimeric/recombinant origin in the data. Generally, one must assume that chimera formation is relatively rare, while viral recombination is well-documented in laboratory settings, hence also expected to occur in co-infection samples. However, based on our estimations, putative recombination breakpoints supported by no more than 10% of the overlapping short reads are likely to be caused by chimeric sequences. By comparing the number of recombinant reads to the total number of sequenced reads in arti cial samples, chimeric sequences can be present at a percentage of 0.04-0.3% or even more, emphasizing that PCR-related bias is non-negligible in sequencing data and the cautious interpretation of the results is crucial. Admittedly, there are a few experimental setups in which hybridization capture sequencing are employed, which eliminates various problematic issues introduced by amplicon-based methods 18 . Moreover, the low number and the uneven distribution of the de ning mutations along the genome also inherently limit detectability. Given that, disregarding the relatively low number of de ning mutations, parental strains in SARS-CoV-2 co-infection samples are highly similar, recombination events might go completely unnoticed. The identi cation of recombination breakpoints is constrained to the genomic ranges between de ning mutations, thus the uncertainty in their location is extremely high. Furthermore, de ning mutations are unevenly distributed across genomic positions, a disproportionately high amount (considering gene lengths) of them is located on genes S and N, making it very di cult to detect recombination breakpoints occurring in other, less frequently mutated regions of the genome.
Finally, short read length also poses a challenge in recombinant detection, as direct evidence of recombination events can come from reads that simultaneously contain the de ning mutations of both parental strains in a co-infection sample. This approach, however, is limited by the relatively short read lengths (100-200 bp) in sequencing data generated by Illumina platforms, as only those de ning mutation pairs are overlapped by the same reads that are located close enough on the genome. The recent advances in long-read sequencing technologies (Nanopore, PacBio) might provide a solution for this problem, as they usually generate reads ranging from 10 to 100 kbp in length, though the increased sequencing error rate compared to Illumina technologies might present additional challenges 54,55 A recent study speci cally utilized sequencing data generated by long-read PacBio single-molecule real-time (SMRT) sequencing technology to detect co-infection and subsequent intra-host recombination in a set of around 7,000 samples from France 21 .
Even though co-infection detection is relatively straightforward, all the above-described obstacles render it demanding to systematically detect traces of recombination in co-infection samples with an automated pipeline. In many cases, one must resort to the manual observation of sequencing data to distinguish between artefacts and true signs of recombination, and even then, the decision usually ultimately rests on an educated guess. In our work, we methodically investigated a database containing the raw sequencing data of more than 2 million good-quality SARS-CoV-2 samples collected in the COVID-19 Data Portal 25,26,43 with worldwide sources, and reliably identi ed 0.35% of them as co-infection cases. We further set out to detect the presence of intra-host recombinants and employed two independent pipelines for their identi cation. We have shown that sensitivity is dominantly determined by the presence and density of de ning mutations along the genome and that a threshold of 0.1 for the ratio of recombinant reads overlapping a given position might be reasonable to get rid of PCR-induced artefacts. Recombination hotspots were usually located at gene boundaries (with a fraction of recombinant reads carrying signs of sgRNA-origin) and three additional intergenic hotspots were identi ed in Delta -Omicron (BA.1) co-infection samples. Our work paves the way for further large-scale studies systematically utilizing raw read sequencing data for the detailed investigation of the recombination potential of SARS-CoV-2 in real-world, non-laboratory settings, which might help monitor and forecast important milestones in viral evolution.

Methods
The CoVEO database Pre ltering steps and initial selection of co-infection samples were carried out with the use of the CoVEO database, a PostgreSQL database storing the mutational data (VCF les) of SARS-CoV-2 sequencing samples uploaded to the European COVID-19 Data Portal 25,43 (www.covid19dataportal.org). This database and the automated analyses leading to a standardised VCF le was developed by the Versatile Emerging infectious disease Observatory 22 (VEO, www.veo-europe.eu) consortium. The dataset is unique in the sense that besides the commonly available consensus sequences of the samples, it also contains low alternate allele-frequency mutations (minor variants) and sequencing depth information in a queryable format, along with sample metadata to allow for simple ltering. Additionally, samples of the database are analysed with a standardized variant calling work ow (available on GitHub 56,57 ) in order to keep technical bioinformatics artefacts at a minimum and to obtain comparable results in spite of multiple sample collectors and various laboratory protocols.

Quality ltering criteria
The altogether 3,093,454 samples of a human host in the CoVEO database were initially ltered to exclude samples that had a total base count of 100,000 or less to avoid misinterpreting sparse sequencing data. Additionally, to further ensure relatively even coverage of the viral genome, we discarded samples that had a sequencing depth of less than 10 in more than 10% of the 29,903 genomic positions of the reference genome (NCBI ID: NC_045512.2). This ltering step resulted in 2,172,927 remaining samples collected in the period from the 30th of December 2019 to the 30th of June 2022.

Unique de ning mutations of SARS-CoV-2 viral variants
Instead of using a precompiled list of genomic variations characteristic of each SARS-CoV-2 viral strain, we used the marker table 58 provided by Valieris et al. 59 in which all distinguishing mutations are listed with the number of GISAID samples containing the reference and alternate alleles for each lineage. Brie y, they treated each consensus sequence of the GISAID SARS-CoV-2 database 28,60 (Global Initiative on Sharing All In uenza Data, https://www.gisaid.org) as a single sequencing read, aligned them to the Wuhan reference sequence and called variants with the highly sensitive GATK Mutect2 tool 61 .
In downstream analyses, we used the mutations listed in this table that were unique and highly indicative of speci c viral strains. More precisely, for each mutation, the lineages with the largest and second-largest prevalence were identi ed. Genomic variations with a largest prevalence of larger than 80% and a second-largest prevalence of less than 10% were considered as "unique de ning mutations" of the lineage with the highest mutational incidence. The list of unique de ning mutations used in the study is listed in Supplementary Identi cation of candidate co-infection samples Samples of the CoVEO database that met the coverage threshold across the SARS-CoV-2 genome (see above) were considered to have moderate evidence of co-infection if more than 50% of the unique variant-de ning mutations of at least two different variant strains were concurrently detected in them. No allele-frequency ltering was applied at this step to enable the identi cation of even trace amounts of the minor variant strain. This analysis step produced 29,666 putative co-infection samples.

Mutually exclusive de ning mutations in speci c variant combinations
Given that the number of truly unique de ning mutations was below 10 for more than half of the investigated variants (Supplementary Table 3), we extended the list of unique de ning mutations by including all mutually exclusive mutations speci c to each variant combination (Supplementary Table 4). For this purpose, for each variant combination indicated in any candidate co-infection sample (see above), we collected the list of mutations that were present in at least 80% of GISAID samples assigned to one of the comprising variants, while simultaneously present in less than 10% of the samples assigned to any other variant(s) of the variant combination. This way, in a hypothetical co-infection sample of Delta Omicron (BA.2) composition, the potential number of mutually exclusive de ning mutations for the Delta and Omicron (BA.2) variants is 18 and 70, respectively, even though they had only 5 and 4 truly unique de ning mutations. This is explained by the fact that Omicron BA.2 shares most of its typical mutations with various other Omicron lineages, but in a Delta Omicron (BA.2) context these can be leveraged to differentiate between the two comprising variants. This expansion of the list of considered mutations allowed the re ning of coinfection detection by largely reducing the bias introduced by using only a few de ning mutations to identify samples of potentially mixed origin.
Re ning the list of candidate co-infection samples The number of supposed co-infection samples was then determined with multiple thresholds for the ratio of required mutually exclusive de ning mutations in the range of 0.5 to 1, with the threshold applied to all consisting variants simultaneously. The nal ltering limit of 0.8 was chosen for the identi cation of a total number of 7,700 co-infection samples. This extremely stringent cut-off ensures that a co-infection sample truly contains at least trace amounts of the whole genomes of its comprising variants and inherently discards samples with only recombinant genomes (i.e. samples that can be assigned to any of the Pango lineages with a pre x starting with "X"). Theoretically, if mutually exclusive de ning mutations were evenly distributed along the length of the genome and recombinants had only one breakpoint, a recombinant genome would have a ratio of mutually exclusive de ning mutations of x for one of its parental variants and a ratio of 1-x for the other one, which can never simultaneously reach 0.8.

Comparison of collection date with the worldwide prevalence of different variants
To validate that the detected co-infection samples could reasonably arise in a natural setting by the simultaneous infection of the same patient by two (or more) different viral strains, we compared the collection date of these samples with the worldwide prevalence of the variants identi ed in them. To this end, we queried the metadata of SARS-CoV-2 samples uploaded to the GISAID website and calculated the number of samples assigned to different Pango lineages 17 for each week, then plotted their incidence curves, along with the collection dates of the samples containing those variants (Fig. 1c, Supplementary Fig. 2).

Distribution of co-infection samples across different studies and countries
In an attempt to identify obvious giveaways of wet lab contamination and see how co-infection samples are distributed across different studies and geographical locations, we determined the number of good-quality samples of a human host with available mutation information of various study accession IDs and collecting countries in the CoVEO database. Study-and country-speci c prevalence rates were calculated as the percentage of identi ed coinfection samples within a given study (Supplementary Table 1) or country.
The details of the whole pipeline including unique and mutually exclusive de ning mutation selection, along with the detection of co-infection samples can be found in Supplementary Methods 1.
Detection of intra-host recombinants from AF distribution at mutually exclusive de ning mutations Clonally recombinant samples would in principle have genomes that were fused together from the appropriate parts of the genomes of parental viral strains at some breakpoint(s), thus would exhibit signs of different sets of mutually exclusive de ning mutations of their parental variants before and after the breakpoint(s). (These are ab ovo discarded with the above co-infection detection pipeline. During intra-host recombination, however, the two parental strains coexist with the recombinant strain (with varying ratios) within a single sample, and putative recombinant breakpoints have previously been identi ed by the shifts in AFs observed for the variant-de ning mutations of the parents 4,5,18 . In an ideal setting, the absolute value of the AF shift corresponds to the ratio of the recombinant genome in the sample (Fig. 2a).
To this end, we developed a pipeline that detects putative breakpoints in co-infection samples where the mean alternate AF of one set of mutually exclusive de ning mutations increases, while the mean alternate AF of the other set of mutually exclusive de ning mutations decreases. To remove presumed artefacts and noise, only those genomic positions were retained as possible breakpoints, that met multiple stringent ltering criteria. AFs were further corrected for systematic bias in their distribution. Odds-ratios (ORs) of a "breakpoint" and a "no-breakpoint" model were calculated for samples that had a potential breakpoint that satis ed prior requirements and ones with OR > 1 were considered to be putative intra-host recombinants. The ratio of recombinant genomes (RR) in the sample was determined from the shift in AFs at the breakpoint.
In this analysis as well, all mutually exclusive lineage-de ning mutations were considered for the given variant combination to increase statistical power. To simplify the procedure, out of the 7,700 co-infection samples identi ed above, only 7,290 were examined that contained traces of exactly two variant strains.
The detailed pipeline of intra-host recombination detection from AF distributions, along with corrections for AF-bias can be found in Supplementary Methods 2.
Detection of intra-host recombinants from short reads overlapping multiple mutually exclusive de ning mutations As an independent analysis approach, we obtained aligned sequencing data (BAM les) of previously identi ed coinfection samples from ENA (European Nucleotide Archive, https://www.ebi.ac.uk/ena/) and queried them for short reads that simultaneously overlapped de ning mutations of both parental strains in the sample. Reads were ltered for Phred-scale mapping quality (larger than 30) and additionally for base quality (larger than 30) at genomic positions of mutually exclusive de ning mutations. If these reads concurrently carried both de ning mutations, they were considered to exhibit signs of recombination. The ratio of recombinant reads out of all overlapping reads at a given genomic position was de ned as "recombinant read ratio". Genomic positions that showed su cient evidence of recombination (at least 10 recombinant reads and a recombinant read ratio of more than 0.1) in multiple samples were considered to be recombination hotspots. The locations of these genomic ranges were compared to the recombination breakpoint ranges of GISAID samples assigned to the XF, XS, and XD Pango lineages, which are clonal recombinants of the Delta and Omicron strains ( Supplementary Fig. 3). Recombinant reads were further explored for traces of subgenomic RNA by determining if they contained the nucleotides of the leader sequence utilized during transcription and/or if they were soft-clipped during alignment.
The detailed pipeline of overlapping and recombinant read identi cation, along with analysis results can be found in Supplementary Methods 3.

Declarations Data availability
Datasets created and further processed in the current study, along with analysis pipelines and all supplementary material are available in the SARSCoV2-coinf GitHub repository at github.com/csabaiBio/SARSCoV2-coinf.   Genomic distribution of short reads overlapping multiple mutually exclusive de ning mutations and recombination breakpoints indicated by them. a. The relationship between de ning mutation density and the density of reads overlapping de ning mutations of multiple parental strains. All overlapping reads of the 118 analysed samples were considered for this gure. Overlapping read location is de ned as the midpoint of the mutually exclusive de ning mutations it overlaps. b. The relationship between the number of recombination breakpoints indicated by overlapping reads and the number of breakpoints identi ed by Turakhia