Sample Extraction and Sequencing.
All the samples were extracted and processed using the Swift Amplicon SARS-CoV-2 Panel (Swift Biosciences) The samples were sequenced on an Illumina NovaSeq platform at the Norwegian Sequencing Centre (NSC) NorSeq.
Generation of SARS-CoV-2 consensus sequences.
SARS-Cov-2 consensus sequences were generated using the “Covid-seq” pipeline developed by the NSC (https://github.com/nsc-norway/covid-seq). Briefly, PCR primers used during library preparation were removed using NSCTrim (https://github.com/nsc-norway/NSCtrim). Then, sequencing adapters, poorly called nucleotides and overall low-quality reads and adapters were removed using fastp (Chen et al., 2018).
Next, the high-quality-trimmed reads were mapped to the Wuhan-Hu-1 reference genome (NC_045512.2) using Bowtie2 (Langmead & Salzberg, 2012). Consensus sequences were generated from the resulting mapping files using samtools, mpileup (Danecek et al., 2021) and iVar (Grubaugh et al., 2019) with a minimum depth threshold of 10 for calling a nucleotide.
Noise calculation.
We define noise as the sum of the ratios of all the nucleotides minus the ratio of the most frequent nucleotide (i.e., the one called in the consensus sequence). To calculate the noise of the samples we developed a tool called NoisExtractor (https://github.com/garcia-nacho/NoisExtractor). NoisExtractor uses indexed bam files as inputs and for each position of the genome it outputs the noise, depth, the nucleotide with highest frequency and the nucleotide with the second highest frequency and their frequencies respectively.
Identification of coinfections/contaminations.
As a part of the sequencing routines at NIPH, a quality control is performed for each sample. In this analysis low-quality samples and individual samples containing more than one virus are flagged, as this could indicate a contaminated sample or a coinfection at the patient level. To do this analysis, we developed a machine learning model. This model is based on linear regression in which noise-related parameters (e.g., mean and standard deviation of noise across the genome, binned number of positions with noise, etc) and depth and coverage-related parameters (e.g., binned number of missing positions, average depth, etc) were used to classify a sample as low-quality, high-quality or contaminant. To train the classification model, we used a subset of 1846 manually curated samples that were assigned into 4 different classes: high-quality-high-contamination, high-quality-low-contamination, high-quality-no-contamination and low-quality. The code to perform the quality control and the trained model is available at here: https://github.com/folkehelseinstituttet/FHI_SC2_Pipeline_Illumina.
Extraction of sequences for the major and minor variants.
Once a possible contamination or coinfection is identified, the sequence of the major variant (most abundant variant) was generated by concatenating the nucleotides with highest frequency at each position of the genome. To generate the sequence of the minor variant (second most abundant variant), the nucleotides in which the noise of sequence was higher than 0.1 were replaced. The nucleotide that replaced the nucleotide with highest frequency (major) was the one with the second highest frequency (minor). We implemented the extraction of sequences by parsing the output of NoisExtractor in R.
Identification of recombinant sequences.
To identify recombinant sequences, we developed PrecFinder (https://github.com/garcia-nacho/Precfinder). For each single mutation in a sequence, PrecFinder calculates the Bayes’ probability of the virus belonging to a particular Pangolin lineage based on the distribution of mutations in different lineages. As the ratios of the different mutations in the virus continues to evolve, the probability is calculated based on a database of sequences which is regularly updated.
To find which sequences are recombinants, PrecFinder uses a 1D-convolutional neural network model. The model consists of three sets of a 1D-convolutional neural network layer (1D-CNN) followed by a 1D-MaxPooling layer. The three 1D-CNN have 64, 32 and 12 filters and kernel sizes of 5, 3 and 3 nucleotides respectively. The pool sizes of all the 1D-MaxPooling layers were set to 2. Then, two feed-forward layers with 24 and 12 layers respectively were included. Finally, a softmax classification layer outputs the score to classify the sample. The input of the model consists of a Bayes’ probability matrix of n by m dimensions. Where n is the number of unique lineages present on the database and m is the maximum number of mutations present in at least one sequence of the database. To train the model we used binary-crossentropy as loss function, adagrad as optimizer and a batch size of 128. The training of the model was scheduled for 60 epochs but it was early-stopped if there was no improvement on the accuracy after eight epochs. The weights of the model with highest accuracy were saved. Moreover F1, precision and recall were calculated. As training set, we used the sequences present in the database which consists of a synthetic set of recombinant sequences that were generated using the sequences present in the database. Sequences assigned to different lineages were recombined in silico through one, two or three breaking-points randomly selected in the genome. Moreover, we augmented the dataset through the reordering the n rows of the training set. The model was implemented and trained using Keras (Chollet et al., 2015) and TensorFlow v2.8 (Abadi et al., 2015) in R.
Recombinant sequences were also identified using the program sc2rf (https://github.com/lenaschimmel/sc2rf).
Lineage assignments.
SARS-CoV-2 lineage assignment was performed using Pangolin (O’Toole et al., 2021) and Nextclade (Aksamentov et al., 2021) and the mutations at nucleotide and amino acid levels were identified using Nextclade (Aksamentov et al., 2021).
To identify the AY.98.1 and BA.5 specific mutations, 2000 AY.98.1 and BA.5 sequences were downloaded from NCBI GenBank using cov-sampler (Cheng et al., 2022). Sequences with low-quality and/or wrong lineage assignment according to Nextclade were removed, and the mutations present in the remaining sequences were extracted using Nextclade (Aksamentov et al., 2021). Based on the frequency of the mutations present on the AY.98.1 and BA.5 lineages, the mutations present in our sequences were classified either as AY.98.1-specific, BA.5-specific or other, where other means that the mutation is not found on any of the lineages or that it can be found in both. All plots to visualize Pangolin lineages were generated in R (R Core Team, 2022) using the library ggplot2 (Wickham, 2016).
Cultivation of recombinant virus.
Vero E6/TMPRSS2 cells (NIBSC #100978) were cultivated in complete Dulbecco’s Modified Eagle Medium (cDMEM) supplemented with 10% fetal bovine serum (FBS) and 1mg/ml G418. In a biosafety level 3 (BSL3) laboratory, clinical samples collected from the patient were added to the cells at approx. 60% confluency in a T-25 flask for 1h at 37°C. The inoculate was subsequently removed and replaced with fresh viral culture medium (DMEM supplemented with 2% FBS, 100 units/ml penicillin, 100 ug/ml streptomycin and 25 mM HEPES). The infected cells were incubated for 3–4 days at 37°C and the supernatant was then diluted 1:1000 and passaged onto fresh cells for a second passage. After 3–4 more days the second passage of virus was harvested. Both the first and the second passage of the virus were sequenced.
Fitness estimation.
To estimate the fitness of the different virus strains, we identified the substitutions that they carried at the amino acid level using Nextclade. Then, we connected those mutations with the fitness estimated from Bloom and Neher (2023). The fitness of each of the variants was computed as the sum of the fitnesses of the individual mutations present in the sample. If a sequence contained mutations absent in the fitness database, no fitness was assigned to that mutation.