Primer design for long range PCR
Primers are depicted in Supplementary Table 1. A total of 310 whole genome sequences were screened for conserved regions in the NCBI Multiple Sequence Alignment Viewer 1.11.1. Optimal primer sequences within the conserved regions were determined with Primer-Blast. In silico primer design was performed using the GeneArt® Primer and Construct Design Tool with the Single Site-Directed Mutagenesis option. Sequences of desired target regions which include highly polymorphic regions within the UL section were derived from strain Merlin (AY446894.2) and a blastn search with the Nucleotide collection database (nr/nt) was performed. Single CDS, partial and artificial genomes were excluded.
Preparation of HCMV DNA used for sequencing
Bacterial artificial chromosomes (BAC) HCMV DNA from E.coli
TB40-BAC4-DNA was purified from 400 ml of E. coli overnight culture using the Nucleobond BAC 100 kit (Machery-Nagel) for low-copy plasmid purification. All steps were done according to the manufacturer. Purified BAC-DNA was eluted in 50 µl nuclease-free deionized water, quantified using Nanodrop and stored at 4°C and under no conditions frozen to avoid DNA fragmentation. Purified BAC-DNA was used to establish long range PCR assays.
HCMV DNA from fibroblast supernatant
Cryopreserved virus-stock vials of strains Merlin (AY446894) and TB40-BAC4-luc (derivative of TB-BAC4, EF999921 [39]) derived from human foreskin fibroblast (HFF) supernatant were used for HCMV-DNA purification. Therefore, 500 µl of thawed virus stock samples were transferred into 2 ml of lysis buffer and eluted in 50 μl elution buffer using the bead-based NucliSens EasyMagextractor (BioMérieux). Quantification of HCMV-DNA was done by HCMV-specific qPCR (see below) and stored at 4°C before being directly subjected to Illumina and PacBio sequencing or taken as template for long range PCR.
HCMV DNA from BAL samples
Six BAL samples stored at -20°C, all from patients who received lung transplants at the Medical University of Vienna between 2014 and 2016, were investigated. DNA was isolated using the QIAamp Viral RNA Mini kit (Qiagen). Therefore, 250 µl of BAL solution was lysed and further purified as described in the manufacturer’s protocol. DNA was eluted from columns with 70 µl elution buffer. HCMV DNA was quantified by HCMV-specific qPCR (see below) and stored at 4°C before being subjected to short and long range PCR.
Artificial mixtures of two distinct HCMV-DNA genomes
HCMV-DNA either purified from Merlin or TB40-BAC4-luc virus stocks were diluted to the appropriate concentrations and mixed to achieve the ratios as listed in Table 5. Five µl of each mixture (total HCMV-DNA ranged from 3x106 to 9x106 copies per reaction) was used as template DNA for amplification by long range PCR.
DNA Quantification
Purity and content of BAC-DNA and cell-culture-derived DNA was quantified using the NanoDrop 1000 tool (Peqlab). Amount of PCR amplicons was determined using the QubitTM double-stranded DNA High-Sensitivity Assay according to the manufacturer’s instructions (Thermo Fisher Scientific) on the Qubit 2.0 fluorometer. HCMV-specific DNA of cell-culture-derived DNA and BAL samples was quantified by an in-house qPCR as previously described [40] and cell-culture-derived HCMV DNA for generation of artificial mixtures was additionally quantified by a gH genotype-specific PCR, also as previously described [41].
Long range PCR amplification
PCR enzymes initially used for evaluation of sensitivity and specificity of long range PCR are listed in Table 1 and Supplementary Table 2. Target regions chosen for evaluation are shown in Figure 1. After evaluation, LA Taq HS DNA polymerase kit from TaKaRa (TaKaRa Bio) conveyed the best performance and was therefore further used. For long range PCR 15µl mastermix (0.25 µl Taq, 2.5 µl 10x buffer, 4 µl dNTPs, 9.75 µl nuclease-free water, 1.25 primer each) was combined with 10 µl extracted DNA. Correlating copy number input for individual samples can be seen in Tables 4a and 4b and Supplementary Table 11. The cycling program was 94°C for 1 min, 30x (98°C for 10 sec, 68°C for 1 min/kb) and without an additional extension step to avoid PCR-mediated recombination. Small aliquots of the PCR products were visualised on analytical agarose gels, then quantified by Qubit and subjected to library preparation for Illumina and PacBio sequencing.
Short range PCR amplification
HCMV-DNA positive BAL samples were screened for mixed genotype infections by short range PCR amplification of polymorphic target regions within envelope glycoprotein N (gN), envelope glycoprotein O (gO), and UL146 (nested PCR). Detailed description of PCR amplicons, associated primers and respective annealing temperatures are listed in Supplementary Table 1. For PCR, 19 µl AmpliTaq Gold 360 Mastermix (Applied Biosystems) and 0.5 µl of each primer was combined with 5 µl of extracted BAL DNA. Initial denaturation was performed at 95°C for 10 min followed by 40 cycles of 95°C for 1 min, annealing for 1 min at 50-61°C, and elongation at 72° for 1 min, with a final extension time of 5 min at 72°. For the second step of the nested UL146 PCR, 40 µl master mix and 8 µl PCR grade water was mixed with 2 µl of the first PCR product. Small aliquots of the PCR products were visualised on analytical agarose gels, then quantified by Qubit and subjected to library preparation for Illumina sequencing.
PacBio SMRT sequencing
SMRT bell library preparation and sequencing were performed by the Next Generation Sequencing Facility at Vienna BioCenter Core Facilities. Long range PCR amplicons and non-enriched DNA were further purified using the QIAEX II Gel extraction Kit (Qiagen). Due to indispensable loss of HCMV-DNA (>80%) during gel extraction, both non-enriched DNA and PCR products were subjected to the protocol: Desalting and Concentrating DNA Solutions. All steps were performed as advertised in the protocol and the HCMV-DNA was eluted in 20 µl of Tris buffer. For library preparation a minimum DNA input of 700 ng and 2.5 µg for amplicon and whole genome sequencing was required, respectively. Initial quantification and purity were measured with the NanoDrop 1000 tool (Peqlab) and a Qubit 2.0 fluorometer (Thermo Fisher). Large fragment analysis was performed with a Femto Pulse system and the genomic DNA 165 kb Kit (Agilent). Non-enriched DNA was sheared to an expected fragment length of <10 kb. Samples were indexed and multiplexed according to the Sequencing facility. Subsequent to library preparation a blue pippin size selection (Sage Science Inc.) was used to isolate target fragment lengths of long range PCR amplicons. Sequencing was performed on a PacBio-Sequel system for 10 and 20 hours for cell-culture derived and BAL samples, respectively. Details on PacBio sequencing run information is listed in Supplementary Table 3.
Illumina Sequencing
Two ng input DNA per sample was used for library preparation using the Nextera XT library preparation kit, and samples were indexed using the Nextera XT index kit (Illumina). After index PCR, samples were purified with 45 µl of Agencourt AMPure XP magnetic beads with a sample to beads ratio of 3:2 (Beckman Coulter), normalized by Qubit quantification and pooled to generate a 4 nM library. To compensate for low diversity libraries, a 12 pM PhiX control spike-in of 2.5% was added. Single and paired-end sequencing (150 to 250 cycles, V2 kits) was done on an Illumina MiSeq instrument with automatic adapter trimming (Illumina). Raw reads in fastq format that passed filters were used for analysis using CLC Genomics Workbench 12.0 (Qiagen). Detailed information on MiSeq sequencing runs is listed in Supplementary Table 4.
Bioinformatical workflows for PacBio and Illumina reads
Generation of circular consensus sequence reads upon PacBio sequencing
Bam files of PacBio raw reads were demultiplexed using Lima (https://github.com/pacificbiosciences/barcoding/) with default parameters and symmetric options. In order to generate Highly Accurate Single-Molecule Consensus Reads, demultiplexed subreads were aligned with the PacBio circular consensus sequence (ccs) tool of Bioconda (https://github.com/PacificBiosciences/ccs). No full-length subreads were required for ccs generation and a minimum predicted accuracy of 0 was chosen. To prevent heteroduplexes, consensus sequences for each strand of a molecule were separated by strand. Resulting ccs reads were used for further analysis using CLC Genomics Workbench 12.0 (Qiagen Bioinformatics).
Trimming and mapping for error calculation
Illumina fastq reads were imported as paired-end reads and PacBio ccs were imported as single reads into CLC Genomics Workbench. Raw reads were quality trimmed by using the modified-Mott trimming algorithm with a base-calling error probability of 0.01 or 0.001 (Phred quality score of 20 or 30, respectively), an ambiguous limit of 2 in a read, a minimum and maximum number of nucleotides in reads of 100 and 9000, respectively. Human genomic DNA reads were filtered out by randomly mapping them against the latest reference genome GRCh38 (accession GCA_000001405.28), with default setting parameters for match/mismatch scores and insertion/deletion costs, length fraction of 0.3, and similarity fraction of 0.8. PacBio-derived reads were further trimmed to exclude reads lower 500 nucleotides. Remaining reads from both sequencing platforms were randomly mapped to the pathogen reference sequences Merlin and TB40-BAC4-luc, respectively, with default setting parameters for match/mismatch scores and insertion/deletion costs, length fraction of 0.3, and similarity fraction of 0.8. Details are listed in Table 2 and Supplementary Table 5.
For error calculation resulting mappings were used. Therefore, the total number of bases as well as the total number of substitutions, insertions and deletions were counted to calculate the respective error rates. Error rate estimation was further divided into the 4 distinct bases in the reference sequence to estimate substitution among bases. Analysis was performed using the function QC for Read Mapping of CLC Genomics Workbench 12.0.
Trimming and mapping for ratio estimation
Illumina fastq raw reads were quality trimmed with a base-calling error probability of 0.001, an ambiguous limit of 2 in a read and a minimum and maximum number of nucleotides in reads of 100 and 9000, respectively. Human genomic DNA reads were filtered out as described.
First, quality trimmed and filtered reads were randomly mapped to the Merlin and TB40-BAC4-luc amplicon reference sequences, with default setting parameters for match/mismatch scores and insertion/deletion costs, length fraction of 0.3, and similarity fraction of 0.8 to determine the overall % of HCMV-specific reads. Second, for ratio estimation genotype-specific reads mapping to highly polymorphic regions within glycoprotein H (gH) (position in CDS of Merlin strain: 1 to 180), gO (1 to 680), and gN (1 to 220), were considered. In total, 2 gH, 8 gN, and 8 gO genotype sequences were used as reference sequences for mapping (Supplementary Table 6). Mapping parameters were: default setting parameters for match/mismatch scores and insertion/deletion costs, length fraction of 0.3, similarity fraction of 0.95, and non-specific matches were ignored. Merlin and TB40-genotype-specific reads were counted to estimate the ratio. Reads mapping to one or more of the other genotype reference sequences were counted to determine the false-positive mapping rate.
Trimming and mapping for genotype determination
Illumina fastq reads were quality trimmed with a base-calling error probability of 0.001, a minimum number of nucleotides in reads of 100, and human genomic DNA reads were removed. Then, quality trimmed and filtered reads were mapped to highly polymorphic regions within gO (1 to 680), gN (1 to 220), and UL146 (position in CDS of Merlin strain: 1 to 360) with default setting parameters for match/mismatch scores and insertion/deletion costs, a length fraction of 0.3 and a similarity fraction of 0.8. A total of 30 genotypic reference sequences were used as previously described [19] and listed in Supplementary Table 6.
A positive genotype mapping was scored if the number of reads was >10 and the consensus length corresponded to at least 80% of the reference length. Consensus sequences were derived from the mappings by using a minimum read depth of 5 reads per base with low coverage regions coded as ambiguities (Ns). Ambiguity codes were inserted using a noise threshold of 0.1 and a minimum nucleotide count of 5. Alignments of derived consensus sequences were screened visually and unique consensus sequences were counted as independent genotypes.
Trimming and mapping for haplotype determination
To retain PacBio ccs reads that span the complete amplicon length, all ccs were initially length trimmed. For this, F3 amplicon-derived reads below 7500 and above 8000 and F4 amplicon-derived reads below 6500 and above 7000 number of nucleotides in reads were excluded. Remaining reads were processed with following parameters: quality trimming with a base-calling error probability of 0.01, ambiguous limit of 2 in a read and exclusion of low quality reads. Only reads that span more than half of the amplicon length (4000 nucleotides for F3- and 3500 nucleotides for F4-derived amplicons) were retained and human genomic DNA reads removed. Remaining reads were mapped to 236 HCMV full genomes (Supplementary Table 7), with default setting parameters for match/mismatch scores and insertion/deletion costs, length fraction of 0.7, and similarity fraction of 0.8 and non-specific mappings were ignored. Consensus sequences were generated from mappings with a consensus length of >6.9 kb for F3- and >5.9 kb for F4-derived amplicons, and with a minimum coverage of >0.1% of the average coverage (Supplementary Tables 12a and 12b). Extracted consensus sequences were aligned by muscle [42] and visually inspected in BioEdit 7.2.5 (Ibis Therapeutics). Unique consensus sequences were used as new reference sequences to repeat the mapping with length and similarity fraction of 0.9. New consensus sequences from mappings spanning over the complete amplicon length showing a uniform coverage were extracted. Ambiguity codes were inserted using a noise threshold of 0.1 and a minimum nucleotide count of 5. Unmapped reads were further mapped to human and HCMV genomes to confirm that no haplotype sequences got lost during the analysis steps. After alignment and visual inspection unique consensus sequences were counted as independent haplotypes. To visualise the diversity among distinct haplotypes, nucleotide alignments and phylogenetic tree analysis of all haplotypes were performed in Geneious Prime® 2019.0.3 and Mega 7.0.14 [43], respectively. Phylogenetic trees were inferred by using the Maximum Likelihood method based on the HKY + G (0.05) + I (0,27) for F3 haplotypes and Kimura-2 + G (0,47) + I (0,62) for F4 haplotypes. Best fit substitution model (lowest BIC score) was assessed with Mega 7.0.14.