Nucleic Acid Sequence Composition of the Oxford–AstraZeneca Vaccine ChAdOx1 nCoV-19 (AZD1222, Vaxzevria)

The vaccine ChAdOx1 nCoV-19 has been widely used, but its purity has been disputed because of rare side effects. We used nanopore sequencing to assess the sequence and genetic purity of a vaccine dose. Because no reference sequence for the antigen cassette was available to us, we provide the annotated sequence of the full vector as obtained here to aid further studies on this topic. Our sample matched the published data and was highly pure (>99.97%), and no copy of the E1 gene, a predictor of replication-competent escape mutants, was found.


Introduction
Vaccination with ChAdOx1 nCoV-19 against COVID-19 is safe and effective, with millions of people around the world already immunized. Very rarely, however, a serious medical condition referred to as vaccine-induced immune thrombotic thrombocytopenia (VITT) is observed after the first inoculation (1)(2)(3), and even more rarely after the Ad26.COV2.S vector vaccine by Johnson & Johnson (4). Research into the causes is ongoing. Preliminary findings hint that residual host cell protein in the vaccine solution and the physical vaccine structure may contribute (5,6). Secretion of soluble spike protein from incorrectly spliced spike transcripts was also recently proposed as an adverse factor in VITT development (7). Although the general design of ChAdOx1 nCoV-19 has been described (8,9), prediction of splicing and the development of other sequence-based assays, such as PCR distribution studies (10) or studies on rare integration events (11), require exact sequence information, which we provide here by nanopore sequencing and curated assembly.
The sequence of the underlying ChAdOx1 vector has been disclosed in patent EP3321367A2 / US20150044766A1 (patent Seq. ID 38 and 40). Publication of specific ChAdOx1 nCoV-19 patent applications will likely disclose further sequences of the expressed transgene (e.g. UK patent application no. GB2003670.3, as mentioned in a preprint conflict of interest statement by Fischer et al. (12)). However, to our knowledge the latter documents have not yet been released to the public.
Previously, we described that direct nanopore sequencing also reveals information on the sequence and quantity of DNA impurities in viral vectors (13). Thus, sequencing of the final product adds to the recently released transcriptomic (14) and proteomic (5) characterizations of the vaccine.

Methods
ChAdOx1 nCoV-19 vaccine solution was retrieved from the expired remainder of a vaccine vial (Vaxzevria lot 210163). Adenovirus vector was precipitated from the bulk at 40% ammonium sulfate saturation (242.3 g/l bulk) by incubation for 2 h at room temperature (RT), followed by centrifugation for 15 min at 1614 rcf (RT), as previously described (15). The pellet was then resuspended in 1/20 of the original volume of Tris-buffered saline (50 mM Tris, 150 mM NaCl, 2 mM MgCl2, pH 8.0) containing 0.02 U/µl of proteinase K (NEB) and incubated for 1 h at 50 °C and then for 10 min at 95 °C. Vector DNA was purified from this solution with a PCR cleanup kit (Macherey-Nagel) according to the manufacturer's instructions, and eluate DNA concentration and purity were determined photometrically on a NanoDrop 2000c (Thermo).
200 ng of extracted DNA was used for nanopore sequencing on a MinION Mk1B device with a Flongle FLG-001 flow cell (R9.4.1 pore chemistry) and the SQK-RBK004 library prep kit (Oxford Nanopore) according to the manufacturer's instructions. Raw reads were basecalled with Guppy 5.0.7 using the super accuracy preset (Oxford Nanopore), and reads with a Q score >= 10 (by Guppy) and a length > 500 nt (by NanoFilt 2.8.0, 100 nt head crop) were analyzed further. The 500 nt cut-off exceeds a 200 nt cut-off suggested for contamination monitoring (16) and was chosen for technical reasons.
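The read filter described above can be illustrated with a short sketch. This is not the Guppy or NanoFilt implementation. It assumes that the Q score of a read is the Phred-scaled mean error probability and that the 500 nt length threshold applies to the read after the 100 nt head crop; both are assumptions made for illustration, and the function names are hypothetical.

```python
import math

MIN_Q = 10       # Q score threshold stated in the text
MIN_LEN = 500    # length threshold in nt stated in the text
HEAD_CROP = 100  # nt removed from the start of each read, stated in the text


def mean_qscore(qual: str, offset: int = 33) -> float:
    """Phred-scaled mean error probability of a FASTQ quality string."""
    if not qual:
        return 0.0
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_read(seq: str, qual: str):
    """Return the cropped (seq, qual) pair if the read passes, else None.

    Assumption: length is evaluated after head cropping.
    """
    if mean_qscore(qual) < MIN_Q:
        return None
    seq_c, qual_c = seq[HEAD_CROP:], qual[HEAD_CROP:]
    if len(seq_c) <= MIN_LEN:
        return None
    return seq_c, qual_c
```

Under these assumptions, a 700 nt read with uniform Q20 is kept as a 600 nt read, while a 550 nt read of the same quality is discarded because only 450 nt remain after cropping.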
De-novo assembly of the genome was performed on 395 reads > 25 000 nt using Flye 2.8.3 (nano-raw preset) (17) and the assembly was polished with all reads using the medaka_consensus wrapper of
This represents a theoretical DNA recovery rate of 32% when compared to the EMA product information, which states 5 × 10^10 viral particles per 0.5 ml dose (20), and our initial guess of a 36 kb genome size. 30 hours of sequencing on a new flow cell (without multiplexing) yielded 44 219 basecalled reads (268 Mb) that passed the initial filtering. These reads had an average length of 6 078 bp. Assembly resulted in a single contig of 35 501 bp. Alignment of the assembly to the chimpanzee adenovirus Y25 isolate (GenBank NC_017825) and human adenovirus 5 (GenBank AC_000008) confirmed the deletion of the E1 and E3 genes and the substitution of the Y25 E4 ORF 4, 6, 7 and 34K CDS regions with their human Ad5 counterparts, as described for the creation of the original ChAdOx1 vector system (8). These alignments were also used to transfer annotations to the final vector sequence where applicable, which is presented in the GenBank file format as Supplementary Information S1 to this manuscript. Notably, compared to the Y25 isolate, 15 bp are missing from each terminus in our assembly, which is probably a limitation of the library preparation method used here. The sequence disclosed in the vector patent lists 16 further bases at the 5′ terminus compared to Y25.
They very likely originated from the biased error pattern of the nanopore flow cell used and are therefore probably not present in the vaccine. For clarity, we decided to manually correct these deletions in the final annotated sequence as provided (insertions are marked in the GenBank file). No further deviations from the reference DNA sequences (and, in the case of spike, the protein sequence) were found.
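The theoretical DNA content of a dose, against which the 32% recovery rate is stated, can be approximated as follows. The particle count and genome size are taken from the text. The average molar mass of 650 g/mol per base pair of double-stranded DNA is a common rule-of-thumb value and an assumption here, as is the premise that each particle carries exactly one genome. The recovered mass printed at the end is only what a 32% recovery would imply under these assumptions; it is not a measured value from this study.

```python
AVOGADRO = 6.02214076e23      # particles per mol
PARTICLES_PER_DOSE = 5e10     # EMA product information, per 0.5 ml dose
GENOME_BP = 36_000            # initial guess of genome size from the text
G_PER_MOL_PER_BP = 650.0      # assumption: average mass of one dsDNA base pair
RECOVERY = 0.32               # recovery rate stated in the text

# Mass of one genome in g, then of all genomes in one dose in ng
genome_mass_g = GENOME_BP * G_PER_MOL_PER_BP / AVOGADRO
theoretical_ng = PARTICLES_PER_DOSE * genome_mass_g * 1e9

# Mass implied by the stated recovery rate (derived, not measured)
implied_recovered_ng = RECOVERY * theoretical_ng

print(f"theoretical DNA per dose: {theoretical_ng:.0f} ng")
print(f"implied recovered DNA:    {implied_recovered_ng:.0f} ng")
```

With these inputs the theoretical content comes out at a little under 2 µg of vector DNA per dose.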
As expected, the expression cassette for the SARS-CoV-2 surface glycoprotein (also S or spike; GenBank QHD43416.1 was used as a reference) is inserted at the E1 site and consists of a wild-type CMV promoter and enhancer (nt 175 298–174 206 of GenBank MN920393.1), two repeats of TetO, a CMV intronic sequence (nt 174 211–173 227 of GenBank MN920393.1), HindIII and KpnI recognition sites, a tPA leader peptide sequence (GenBank E04506; a Kozak sequence, as present e.g. in wild-type tPA, is notably absent), the codon-modified S protein coding sequence (amino acids 2–1273) and the polyadenylation signal of bovine growth hormone, as described (9). The 3′-UTR of the spike gene harbors an additional 78 bp that are not thoroughly described in recent literature. The sequence appears to be a multi-cloning site, including a bacteriophage SP6 promoter in reverse orientation to the transgene. This UTR sequence is identical (by web-based BLASTn) to several patented sequences for adenovirus and DNA vaccines by the University of Oxford and may be a remnant of the original direct DNA vaccine vector pTH, from which the expression cassette design for ChAdOx1 seems to originate (21).
We also note that the coding sequence for the spike protein as sequenced has only 97.8% pairwise identity (90 mismatches) with a DNA sequence obtained for the same amino acid sequence with the GeneArt sequence optimizer (ThermoFisher, analysis performed on August 2, 2021, Homo sapiens setting), which was previously assumed to be identical (7). However, the algorithm of the GeneArt sequence optimizer might be subject to continual changes.
Of 44 219 reads for ChAdOx1 nCoV-19 that passed initial filtering, 44 205 reads (99.97%) mapped to the assembly provided here (average of 99.7% alignment block length per read length and average of 71% matched residues), giving the impression of a very homogeneous payload. Two of the reads that mapped to the assembly also mapped to the human genome build hg38, but these hits were false positives due to low-complexity sequencing artifacts within the reads (as identified by manual inspection). Of the 14 unmapped reads, 13 gave a hit with the nucleotide database (by web-based megablast). However, we note that occasional spurious reads due to our sample handling under non-cleanroom conditions cannot be ruled out.
Lastly, no read mapped to the human adenovirus 5 E1 gene, which would be necessary to form a replication-competent escape mutant. This finding corroborates previously published transcriptomics studies on this safety aspect of the ChAdOx1 vector system (14).
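The mapped fraction reported above follows directly from the read counts, as this short arithmetic check shows. Only the two counts come from the text; the variable names are illustrative.

```python
total_reads = 44_219   # reads that passed initial filtering
mapped_reads = 44_205  # reads that mapped to the assembly

unmapped_reads = total_reads - mapped_reads
mapped_percent = 100 * mapped_reads / total_reads

print(f"unmapped reads: {unmapped_reads}")
print(f"mapped fraction: {mapped_percent:.2f}%")
```

The difference of 14 reads matches the number of unmapped reads discussed in the text, and the percentage rounds to 99.97%.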

Conclusion
Sequence-level assessment of payload homogeneity for viral gene transfer vectors is a relatively new area of research. Our sample proved to be rather unspectacular in this regard. Regulatory bodies may accept small amounts of residual host cell DNA in vector preparations (up to 10 ng per dose for EMA (22)), and the vaccine sample stays below this limit in our analysis. Also, no other significant DNA contamination was found. On the other hand, the exact sequence information for ChAdOx1 nCoV-19 is important for studies on several aspects of vaccine safety but will probably become easily accessible only through patent applications, which is a lengthy process. Given the pace and scale of the SARS-CoV-2 pandemic, high-throughput sequencing, which is cheap and available (e.g., material for this experiment cost less than 100 €), can fill the gap when justified by public interest.
Due to its safety profile, work with the ChAdOx1 vector is classified as biosafety level 1 in many countries (23,24), and the sequence may serve as a blueprint for what can be regarded as safe.

Declarations
Ethics statement. The vial taken for sequencing was first used for vaccination, but the remainder expired due to low demand. It was kindly provided to us by a medical doctor.
Data availability. Read data can be made available upon written request to the corresponding author.
Conflict of Interest. The authors declare no conflict of interest.