Mite sampling on original and novel honey bee hosts
We sequenced adult females throughout the Varroa native ranges where original and novel hosts occur in sympatry (V. destructor N = 31 and V. jacobsoni N = 32; Table 1). For V. destructor, specimens were collected between 1996 and 2003 as part of the taxonomic revision and diversity survey carried by Anderson and Trueman (D. L. Anderson and Trueman 2000) and Navajas et al. (Navajas et al. 2010) and included 14 samples from A. cerana and 17 from A. mellifera collected from Japan, South Korea, China, Thailand, Viet Nam, Myanmar and Nepal. For V. jacobsoni, female specimens were obtained from Anderson and Trueman (D. L. Anderson and Trueman 2000) (1994-2004) and Roberts et al. (Roberts, Anderson, and Tay 2015) (2008-2015) surveys of host shifts reports in Papua New Guinea. A total of 19 adult female V. jacobsoni mites were collected from A. cerana across different areas in India, Thailand, Viet Nam, Malaysia, Indonesia and Papua New Guinea. In contrast to V. destructor, V. jacobsoni infestation on A. mellifera is still restricted to Papua New Guinea from which 10 samples were collected. Additionally, we included three V. jacobsoni mites collected from an alternative original host A. nigrocincta found in Indonesia. A. nigrocincta is not believed to be the source of the host-switched mites, so these samples were included for comparison purposes.
Sampling complete details regarding host, location and year are available in Table S1. Exact geographical coordinates were not always available for samples collected before 2008, and were approximated by the locality provided, or published survey maps (D. L. Anderson and Trueman 2000; Navajas et al. 2010; Zhou et al. 2004; Denis L. Anderson and Fuchs 1998). All individuals were mature sclerotized females collected from single colonies and were preserved in individual Eppendorf tubes. Each collection tube was stored with 70% ethanol and kept at -20°C at the CSIRO in Canberra, Australia.
DNA extraction and whole genome resequencing
Mites were surface sterilized by cleaning them in absolute ethanol using a sterile brush to remove any external debris, and then gently shaken in a 2.0 mL Eppendorf filled with absolute ethanol. Each mite was then dried for 10 sec on a sterile paper towel before being placed in a 1.5 mL Eppendorf tube in liquid nitrogen. Genomic DNA was extracted from each mite by crushing the whole body, using a sterile pestle to obtain a fine powder and processed with a QIAamp DNA Micro Kit (© Qiagen) following the manufacturer’s instructions. Final elution volume was 15 μL. Total dsDNA was measured using a Qubit™ 4 Fluorometer with an Invitrogen dsDNA HS Assay Kit.
For population genomic samples, short-inserts of 150-bp paired-end libraries were prepared for each individual using a Nextera XT DNA Library Preparation Kit (Illumina ®). Size-selection and cleanup were accomplished using CA-magnetic beads (Dynabeads® MyOne Carboxylic Acid, Invitrogen), and 11-11.5% PEG 6000 (Sigma-Aldrich © LLC). Library quality and size were assessed using a Bioanalyzer High Sensitivity DNA kit (Agilent). Libraries were run on HiSeq 4000 and NovaSeq6000 in 150 bp x 2 paired-end mode (Illumina ®) at the OIST Sequencing Center. Biosamples and DRA accession as well as sequencing coverage for each sample are provided in Table S1.
Host DNA contents in mite whole-body metagenomics
We cross-checked honey bee host identities reported during sampling with host read identity to detect potentially migrating mites. Mites feed on honey bees during the phoretic phase (Ramsey et al. 2019) and maintain a consistent feeding regimen by consuming ~1 μL of host fluid (digested fat-body and haemolymph) per day (Posada-Florez et al. 2019). Therefore, we assumed that host DNA would be retrieved from crushed whole-mite tissues. Mitochondrial DNA was targeted, as it is more abundant than nuclear DNA. We mapped raw fastq reads on honey bee host mitochondrial reference genomes using [NC_001566.1] for A. mellifera ligustica (Crozier and Crozier 1993) and [NC_014295] for A. cerana (H.-W. Tan et al. 2011). The number of reads mapped to either one of these honey bee host reference genomes was counted and compared to sampled host identities.
Data filtering, mapping and genotype calling
Commands used for each analysis step are available on our Snakemake script (Köster and Rahmann 2012) available on https://github.com/MaevaTecher/varroa-host-jump. Briefly, we assessed demultiplexed fastq read quality using FastQC (Andrews and Others 2010). We then mapped reads to the V. destructor reference genome on NCBI [GCF_002443255.1] (Techer et al. 2019) separately from the complete mitogenome [NC_004454.2] (Navajas et al. 2002) using the soft-clipping and very sensitive mode of NextGenMap v0.5.0 (Sedlazeck, Rescheneder, and von Haeseler 2013) (following a comparison with Bowtie2 v2.6 (Langmead and Salzberg 2012)). Reads were sorted and duplicates were removed using SAMtools (H. Li et al. 2009), and subsampled to a maximum coverage of 200 using VariantBam (Wala et al. 2016) to speed up processing. Mapping rates and reads depths were computed from the generated BAM files.
We generated three data sets for the mapped reads depending on analysis requirements, following Yamasaki et al. (Yamasaki et al. 2020). First, we obtained a “SNP-only dataset” containing only variant sites using FreeBayes v1.1.0 (Garrison and Marth 2012) with the following parameters: minimum mapping = 10, minimum base quality = 5, use of the four best SNP alleles, and no populations prior. In order to correct for coverage bias and sequencing artefacts in problematic regions, we estimated twice the mean reads depth along the seven main chromosomes. Subsequently, variants were filtered using using VCFtools v0.1.12b (Danecek et al. 2011). The “SNP-only dataset” resulted in 2,728,471 SNPs. Second, we computed an “all-sites dataset” by calling both variants and invariants sites using BCFtools v1.9 mpileup (H. Li 2011). We removed indels and sites at over twice the computed mean depth, as well as sites with any missing data. We excluded sites that were not placed on the seven chromosomes. After filtering, the “all-sites dataset” resulted in 2,130,335 SNPs and 120,279,163 monomorphic sites. Third, we obtained a “mtDNA SNP-only dataset” by calling polymorphic sites only on the mitochondrial genome NC_004454.2. We used FreeBayes v1.1.0 with strict quality parameters. The dataset resulted in 2,091 SNPs. These were validated in the COX1 region by Sanger sequencing (see next section).
Comparative mitogenomes analysis and lineages identification
We used the “mtDNA SNP-only dataset” for mitochondrial variation genomic analysis by converting and generating individual nucleotide sequences using the option vcf2fasta in vcflib scripts (github.com/vcflib). Nucleotide sequences were screened and aligned (16,476 bp) using Geneious Prime® 2019.2.3 together with the reference mitochondrial sequences NC_004454.2 (V. destructor on A. mellifera, France (Navajas et al. 2002)) and also, to avoid problems with reference bias, from AP019523.1 (V. destructor on A. mellifera, Japan (Harada et al. 2020)). Mitotype diversity for each Varroa species and distribution in hosts was assessed using the median joining network method from SplitsTree5 (Huson and Bryant 2006). Divergence levels between Varroa species was estimated by comparing the whole-mitochondrial consensus sequences with Geneious Prime®.
Standard methods exists to determine Varroa species and lineage (also named haplogroup) identity by using the following mitochondrial markers: 1) a 458-bp fragment from the COX1 gene (D. L. Anderson and Trueman 2000; Dietemann et al. 2013) and 2) a concatenated 2,696-bp fragment including COX1, ATP6, COX3, and CYTB (Navajas et al. 2010). Thus, we determined the lineage (e.g. Korean K1, Japan J1, etc.) of each mitotype by extracting the COX1 barcoding region of interest and aligning it together with 60 unique COX1 reference sequences from NCBI using ClustalW and manual check. A neighbor-joining tree was computed on the fasta alignment using IQ-tree (Nguyen et al. 2015) and exported with iTOL (Letunic and Bork 2019). Following Varroa revised nomenclature (Traynor et al. 2020), a sequence was considered from a known lineage only if 100% identical to reference COX1 haplotype. The same process was used for sub-lineages (e.g. Korean K1-1/K1-2, Korean K1-3,...) identification by with the COX1-ATP6-COX3-CYTB concatenated region sequences which were aligned with 22 available reference concatenated sequences for V. destructor only (Navajas et al. 2010; Gajic et al. 2013; Elbeaino et al. 2016).
To ensure that filtering of the “mtDNA SNP-only dataset” did not remove existing and known variants, we also sequenced the COX1 gene using Sanger sequencing using primers 10kbCOIF1 and 6,5KbCOIR (Navajas et al. 2010). PCR reactions were carried out in 25μL containing 5 μL of 5X Phusion® HF buffer; 0.5 μL of dNTP mix (10mM); 0.25 μL of Phusion® High-Fidelity DNA Polymerase (NEB); 1.25 μL of each oligo primer 10kbCOIF1and 6,5KbCOIR (10mM); 1 μL of template DNA (0.5 ng/μL) and Milli-Q water. Samples were denatured at 98°C for 30 sec, and then PCR was performed for 35 cycles of 10 sec denaturation at 98°C, 15 sec of annealing at 59°C and 15 sec of extension at 72°C with a 5 min final elongation at 72°C. DNA amplification success was visualized by loading 3 μL of PCR product with 3 μL of loading dye on 1% agarose gel (110V for 20 min). PCR products were then cleaned-up using Dynabeads® MyOne Carboxylic Acid, CA-beads (Invitrogen) and 19% Polyethylene glycol PEG. Directly, fragments were sequenced in the two directions using the BigDye™ Terminator v3.1 Cycle Sequencing Kit (Thermo Fisher) in a capillary sequencer Applied Biosystems 3730xl DNA Analyzer (Thermo Fisher). FASTA sequences generated by mitogenome mapping and Sanger sequencing were then aligned and checked for differences.
Genome-wide variation and analysis of divergence
We examined levels of genetic diversity and differentiation between host-adapted population of each Varroa species. To investigate differentiation among populations, we first measured the absolute divergence Dxy which requires both variants and invariants sites (Cruickshank and Hahn 2014). We also investigated levels of standing genetic variation (π). For both analyses genome scans were computed using a sliding window method on the “all-sites dataset” (50kb window, sliding every 20kb and containing a minimum of 100 SNPs). This method was implemented using the parseVCF.py and popgenWindow.py python scripts (github.com/simonhmartin/genomics_general) (Martin, Davey, and Jiggins 2015).
Using the biallelic “SNP-only dataset”, we also estimated the genome-wide differentiation by calculating Weir and Cockerham’s FST per site using VCFtools. Subsequently, we assessed the genetic structure within species using a principal component analysis (PCA) performed with the R package vcfR (Knaus and Grünwald 2017). To further investigate the population structure and related ancestry, we conducted an incremental K step analysis using NGSadmix (Skotte, Korneliussen, and Albrechtsen 2013). We reduce the effect of linkage disequilibrium between SNPs by conducting a pruning using PLINK v1.90b3 (Purcell et al. 2007) on the “SNP-only dataset” with the following parameters: --indep-pairwise 20 10 0.5. After removing linked SNPs, the pruned dataset containing 1,276,602 SNPs was converted into BEAGLE format and directly used in NGSadmix. A total of 10 replicates per K, from 2 to 20, were run and the model with the highest likelihood was selected for plotting each K. The number of genetic clusters among and within species was determined following guidelines in (Meirmans 2015; Lawson, van Dorp, and Falush 2018) and making biological sense.
Mutation rate estimation from mother-son pairs
Coalescent analysis requires an estimation of mutation rate, yet none is available for closely related species. To do this we collected and sequenced six V. destructor diploid mothers and their haploid sons for this purpose. These samples were collected from individual capped cells containing A. mellifera red-eye drone pupae. Each family was selected under the following conditions: 1) each cell was only infected by a single mother-foundress mite, 2) a Varroa family at this stage should contained three to four offspring, including one haploid son and two to three diploid daughters, 3) the son was at the deutonymph or adult phase to avoid misidentification with protonymph sisters. We collected three families and preserved in absolute ethanol in February 2018 (#1, #2, and #7), and three more in May 2018 (#15, #17, and #19) from the same colony at the OIST Ecology and Evolution experimental apiary.
Libraries were prepared using an NEBNext® Ultra™ II FS DNA Library Prep Kit (New England Biolabs, Inc) with a fragmentation size incubation time of 15 min at 37°C, and three PCR cycles. Libraries were cleaned up using CA-magnetic beads (Dynabeads® MyOne Carboxylic Acid, Invitrogen), and 17% PEG 6000 (Sigma-Aldrich © LLC). Libraries were pooled and sequenced on two lanes of HiSeq 4000 (Illumina ®) at the OIST sequencing center.
Mutation rate was estimated from GATK variant calls (McKenna et al. 2010) using DeNovoGear's dng-call algorithm (Ramu et al. 2013) which can model Varroa's haplodiploidy sex-determination system. Single-nucleotide mutations were called on the 7 largest contigs in the Varroa genome, avoiding sites that were within 100 bp of an indel. Mutation calls were de-duplicated, retaining only sites that were biallelic and not part of a mutation cluster (with 100 bp of another call or part of a long-run of calls from the same sample). After deduplication, the remaining calls were filtered such that P (denovo| data) was high (DNP >= 0.75) and the fit to the model was good (LLS >= -3). DNP (de-novo probability) and LLS (log-likelihood scaled) are both per-site statistics generated by dng-call. The denominator for the mutation rate analysis was estimated by simulating mutations at 1,000 locations in each son and calculating what fraction of these simulated mutations was recovered by our pipeline. A VCF was generated for these locations, and mutations were simulated by changing the son’s haplotype to one of the other three bases. ALT and AD fields were updated as needed.
Demographic analyses of host switch using SFS
We inferred the demographic history of both V. destructor and V. jacobsoni mites with the coalescent simulator fastsimcoal v2.6 using the site frequency spectrum (SFS) (Excoffier et al. 2013). Demographic inferences were computed independently for each species and samples were subsetted following genome-wide analysis results while considering geographical sympatry and continuity. For V. destructor (N = 27), we excluded Nepal and the Japanese mites (continental island) whereas for V. jacobsoni (N = 12), we included only mites from Papua New Guinea (host switch region). To reduce the effect of selection which can biais the SFS (71), we kept SNPs and invariants from the “all-sites dataset” that were at least 50 kb away from any annotated genes regions (VCFtools exclude-bed option). The 2D-joint folded-SFS was computed for each species/host population on the filtered 12,594,802 sites including 224,568 SNPs, using the vcf2sfs R scripts (github.com/shenglin-liu/vcf2sfs) (S. Liu et al. 2018).
We considered a scenario consistent with the known history of Varroa jumps (Figure S3). We incorporated the following demographic parameters (Table 2): the estimated effective population size for Varroa mites parasitizing A. cerana NVAC1 (in haploid genomes) to be stable before, during, and after the host switch. On the other hand, the mite population size on the new host A. mellifera expanded from NBOTAM or not with a growth rate GAM after the host switch to reach modern population size NVAM0 (in haploid genomes). The host switch founder event occurred at a time TJUMP (in generations) and ended at a time TBOTEND (in generations). Finally, in a case of bidirectional migrations due to sympatry we estimated MAMtoAC/MACtoAM, to be the proportion of haploid genomes migrating from one population to another.
We ran 100 replicates using the observed SFS as follow: a minimum of 20 loops (--minnumloops 20) and a maximum of 150 loops (e.g., ECM cycles, --numloops 100) were performed to estimate the parameters, with one million coalescent simulations per loop (--numsims 1,000,000), and a 0.001 minimum relative difference in parameter values estimated by the maximum composite likelihood (--maxlhood 0.001). The replicate with the highest likelihood set of estimated parameters was retained for model comparison.
To ensure that the upper limit genome-wide mutation rate µ = 8.0 x 10-10 previously used was adequate to estimate parameters from the observed SFS, we tested different µ levels. For this we ran scenario 4 in the same conditions as for scenario choice but µ ranging from 8.0 x 10-10 to 1.0 x 10-11. Additionally, fastsimcoal2.6 offers the possibility to input inbreeding coefficient. As Varroa populations showed high inbreeding, we also ran scenario 4 with (F = 0.7, preliminary tests also ran with 0.8 and 0.9) and without inbreeding coefficient. For all these conditions combined, we ran 100 replicates for each of the 10 combinations per species.
Finally, the overall best estimate parameters set was used to generate 100 pseudo-observed SFS (for a similar number of polymorphic SNPs) for parametric bootstraps. We repeated the parameters estimation for each pseudo-observed SFS and kept the best run. The top 100 runs estimated parameters values were used to calculate 95% confidence intervals. The mean and 95 percentile confidence intervals were computed using the ‘boot’ R package (Canty and Ripley 2019). Finally, the fit of the best expected SFS was visualized against the observed using SFStools R scripts (github.com/marqueda/SFS-scripts).