Sequencing and genome assembly of female Basenji, China
Sampling. China or Zanzipow Bowies China Girl is an Australian Supreme Champion Kennel Club show dog. Her Australian National Kennel Council registration number is 2100018945. She is bred primarily from Australian lines, with her most recent common ancestor coming from Africa 18 generations previously. She was born on 14 Jan 2016, weights 10 kg and lives at Zanzipow Kennel in south east New South Wales, Australia. She is free from all known currently reported diseases. High-molecular weight DNA was extracted from blood taken without sedation from the cephalic vein in July 2018. After 10 min observation she was released back into the care of her owner without any ill effect.
Sequencing. High molecular weight DNA was extracted from 100 µl of blood using the DNeasy Blood and Tissue kit (Qiagen). For long read (Oxford Nanopore) sequencing, 1 µg of DNA was prepared via the Genomic DNA by Ligation kit (SQK-LSK109) as per the manufacturer’s protocol. Prepared DNA (180 ng) was loaded onto a PromethION (FLO-PRO002) flowcell and sequenced with standard parameters. After 48 hours, a nuclease flush was performed and an additional 115 ng was loaded onto the run. Base-calling was performed after sequencing with the GPU-enabled guppy basecaller (v3.0.3) using the PromethION high accuracy flip-flop model with config ‘dna_r9.4.1_450bps_hac.cfg’.
For short read sequencing whole blood was shipped to BGI, Hong Kong. High molecular weight DNA was extracted, a paired-end library prepared, and the sample run on the BGISEQ-500 sequencing system to generate high quality PE100 data. A total of 767,111,208 clean reads (115.1 Gb) were produced with a lower base call accuracy (Q20) of 95.92%.
Assembly. An overview of the China assembly workflow is given in Supplementary Fig. 1A, Additional File 1. The ONT reads were assembled with the Flye (v2.6-release) assembler (16, 17). The resulting contigs were polished with ONT reads using four rounds of Racon (v1.4.3) (18) followed by Medaka (v0.10.0) (19) to minimise error propagation. BGI-seq reads were aligned to the polished assembly with BWA-mem (v 0.7.17) (46) and Pilon (v1.23) (diploid) (20) was used for further error correction. A second assembly was performed using Canu assembler (v1.8.0) (47) and correcting the sequencing errors using 2 rounds of Arrow polishing (48). The Flye assembly was considered more contiguous and therefore selected as the core assembly.
Scaffolding. The ONT assembly was scaffolded to chromosome-length by the DNA Zoo following standard DNA zoo methodology (www.dnazoo.org/methods). Briefly, an in situ Hi-C library was prepared (23) from a blood sample from the same individual and sequenced to ~ 30x coverage (assuming 2.4 Gb genome size). The Hi-C data was processed using Juicer (49), and used as input into the 3D-DNA pipeline (50) to produce a candidate chromosome-length genome assembly. We performed additional finishing on the scaffolds using Juicebox Assembly Tools (21). The matrices are visualized in Juicebox.js, a cloud-based visualization system for Hi-C data (22) and are available for browsing at multiple resolutions on https://www.dnazoo.org/post/basenji-the-african-hunting-dog. After scaffolding, all ONT reads were aligned to the assembly with Minimap2 (v2.16) (-ax map-ont) (26) and used by PBJelly (pbsuite v.15.8.24) (24) to fill gaps. BGI reads were re-mapped with BWA-mem (v 0.7.17) and another round of polishing was performed with Pilon (v1.23) (diploid, SNP and indel correction) (20) .
Final clean-up. The Pilon-polished genome underwent a final scaffold clean-up to generate a high-quality core assembly following Field and colleagues’ processing of Canfam_GSD (14). Scaffolds were reoriented and assigned to CanFam3.1 chromosomes using PAFScaff (v0.3.0) (14) (Minimap2 v2.16 mapping) (26). Diploidocus (v0.9.0) (51) was used to screen contamination, remove low-coverage artefacts and haplotig sequences, and annotate remaining scaffolds with potential issues as described in (14). ONT reads were mapped onto the assembly using Minimap2 (v2.17) (-ax map-ont --secondary = no) (26) and read depth summaries calculated with BBMap (v38.51) pileup.sh (52). Any scaffolds with median coverage less than three (e.g., less than 50% of the scaffold covered by at least three reads) were filtered out as low-coverage scaffolds. Single-copy read depth was estimated using the modal read depth of 35X across the 5,5578 single copy complete genes identified by BUSCO (v3.0.2b) (30). This was used to set low-, mid- and high-depth thresholds at 8x, 25x and 68x for PurgeHaplotigs v20190612 (53) (implementing Perl v5.28.0, BEDTools v2.27.1 (54, 55), R v3.5.3, and SAMTools v1.9 (56)). PurgeHaplotig coverage was adjusted to exclude gap regions and any scaffolds with 80%+ bases in the low/haploid coverage bins and 95%+ of their length mapped by PurgeHaplotigs onto another scaffold were filtered as haplotigs or assembly artefacts. Any other scaffolds with 80%+ low coverage bases were filtered as Low Coverage. Remaining scaffolds were further classified based on read depth profiles. Scaffolds with < 20% diploid coverage and 50%+ high coverage were marked as probable collapsed repeats. Scaffolds with “Diploid” depth as the dominant PurgeHaplotigs coverage bin and > 50% match to another scaffold were marked as a possible repeat sequences (14).
Genome assembly correction. Main chromosome scaffold integrity was checked using D-GENIES (28) comparisons of different assembly stages with CanFam_GSD chromosomes (14). Two pairs of fused chromosomes were identified following incorrect joins made by PBJelly (pbsuite v.15.8.24) (24). Pre-gap-filled HiC scaffolds were mapped onto the assembly using Minimap2 (v2.17) (26) and parsed with GABLAM (v2.30.5) (27) to identify the gap corresponding to the fusion regions. These were manually separated into four individual chromosomes, gap lengths standardised, and scaffolds re-mapped onto CanFam3.1 using PAFScaff (v0.4.0) (25). Finally, scaffolds under 1 kb were removed following correction of Chromosome 29 (below).
Correction of mitochondrial insertion into Chromosome 29. NUMT analysis identified a 33.2 kb region consisting of almost two complete copies of the mitochondrial genome, not present in other dog genome assemblies. GABLAM (v2.30.5) (27) was used to confirm that this region was also absent from the Canu assembly of China v1.1. ONT reads that mapped onto both flanking regions of the 33.2 kb putative NUMT were extracted and reassembled with Flye (v2.7.1) (16, 17). The new NUMT was approx. 2.8 kb long. To avoid repeated problems with mitochondrial reads mis-polishing the sequence, a subset of 4.82M ONT reads (72.7 Gb, ~ 30X) was extracted and mapped onto the assembled region with Minimap (v2.17) (26). Reads mapping to at least 5 kb of the assembled region including some immediate flanking sequence were extracted (66 reads, 1.50 Mb) and polished with one round of Racon (v1.4.5) (18) (-m 8 -x -6 -g -8 -w 500) and Medaka (v0.7.1) (19) (model r941_prom_high). The polished NUMT region was mapped on to the Chromosome 29 scaffold with GABLAM (v2.30.5) (27) (blast + v2.9.0 (57) megablast) and stretches of 100% sequence identity identified each side of the NUMT. The mtDNA sequence between these regions of identity was replaced with the re-assembled NUMT sequence.
Mitochondrial genome assembly. To assemble the mitochondrion, ONT reads were mapped onto a construct of three tandem copies of the CanFam3.1 mtDNA with minimap2 (v2.17) (26) (-ax map-ont --secondary = no). Reads with hits were extracted using SAMTools (v1.9) (56) fasta and mapped onto a double-copy CanFam3.1 mtDNA with GABLAM (v2.30.5) (27) (blast + v2.9.0 (57) megablast). “Pure complete” mtDNA ONT reads were identified as those with 99%+ of their length mapping to mtDNA and 99%+ coverage of the mtDNA. These reads were assembled with Flye (v2.7b-b1526) (16, 17) (genome size 16.7 kb) and polished with Racon (v1.4.5) (18) (-m 8 -x -6 -g -8 -w 500) followed by Medaka (v0.7.1) (19) (model r941_prom_high). The polished mtDNA assembly was mapped onto CanFam3.1 mtDNA with GABLAM (v2.30.5) (27) (blast + v2.9.0 (57) megablast) and circularised by extracting a region from the centre of the assembly corresponding to a single full-length copy with the same start and end positions. Final correction of SNPs and indels was performed by adding the mtDNA to the nuclear assembly, mapping BGI reads with BWA (v0.7.17) and polishing the mtDNA with Pilon (v1.23) (20). The polished mtDNA was then added back to the nuclear genome for the final China assembly.
Genome assembly quality assessment. At each stage of the assembly, summary statistics were calculated with SLiMSuite SeqList (v1.45.0) (58), quality was assessed with Merqury (v20200318) (Meryl v20200313, bedtools v2.27.1 (54, 55), SAMTools v1.9 (56), java v8u45, igv v2.8.0) and completeness assessed with BUSCO (v3.0.2b) (30) (BLAST + v2.2.31 (57), HMMer v3.2.1 (59), Augustus v3.3.2, EMBOSS v6.6.0, laurasiatherian lineage (n = 6253)). To account for fluctuations in BUSCO ratings, presence of complete BUSCO genes across assembly stages was also assessed with BUSCOMP (v0.9.4) (60). Final assembly scaffold statistics and quality assessment was performed with Diploidocus (v0.10.2) (KAT v2.4.2, perl v5.28.0, BEDtools v2.27.1 (54, 55), SAMTools v1.9 (56), purge_haplotigs v20190612, java v8u231-jre, bbmap v38.51, minimap2 v2.17 (26), BLAST + v2.9.0 (57)).
DNA methylation calling. China’s blood DNA methylome libraries were sequenced on the Illumina HiSeq X platform (150 bp, PE), generating 336 million read pairs and yielding 14x sequencing coverage. Sequenced reads were trimmed using Trimmomatic (61) and mapped to the China v1.0 genome reference using WALT (62) with the following parameters: -m 10 -t 24 -N 10000000 -L 2000. The mappability of the MethylC-seq library was 86%. Duplicate reads were removed using Picard Tools (v2.3.0). Genotype and methylation bias correction were performed using MethylDackel with additional parameters: minOppositeDepth 5 --maxVariantFrac 0.5 --OT 10,140,10,140 --OB 10,140,10,140. The numbers of methylated and unmethylated calls at each CpG site were determined using MethylDackel (https://github.com/dpryan79/MethylDackel). Bisulphite conversion efficiency was 99.71%, estimated using unmethylated lambda phage spike-in control.
UMR and LMR calling. Segmentation of basenji’s blood DNA methylome into CpG-rich unmethylated regions (UMRs) and CpG-poor low-methylated regions (LMRs) was performed using MethylSeekR (33) (segmentUMRsLMRs(m = meth, meth.cutoff = 0.5, nCpG.cutoff = 5, PMDs = NA, num.cores = num.cores, myGenomeSeq = build, seqLengths = seqlengths(build), nCpG.smoothing = 3, minCover = 5).
Sequencing and genome assembly of male Basenji, Wags
Wags is registered as American Kennel Club Champion Kibushi the Oracle, born on December 3, 2008. His registration number is HP345321/01. Sire is AM Ch C-Quests Soul Driver, HM827502/02, and his dam is Avongara Luka, HP345312/01, a native female dog imported from the Haut-Ule district of the DRC Congo, 3°24'04.0"N 27°19'04.6"E, in 2006. He weighs 11 kg. DNA was derived from blood drawn from the cephalic vein without sedation in May 2016 while he was in Columbia, Missouri, USA. Within 10 min he was released back into the care of his owner without any ill effect.
SMRT sequences for Wags (Fig. 1B) were generated on the Pacific Biosciences Sequel instrument (V2 chemistry) to approximately 45x genome coverage based on a genome size estimate of 2.5 Gb. An overview of the Wags assembly workflow is given in Supplementary Fig. 1B, Additional File 1. All SMRT sequences were assembled with the HGAP4 algorithm, a Falcon based assembly pipeline available through the SMRT Link interphase (SMRT Link v5.0.1.9585) (63). The assembly was then error corrected with the original SMRT sequences using the Arrow error-correction module (63). Additional polishing of the assembly for residual indels was done by aligning 32x coverage of Illumina data and the Pilon algorithm (20). Chromosomal level scaffolds were generated with the same DNA source using the Proximo™ Hi-C genome scaffolding software (Phase Genomics Inc) and finalized by alignment to the CanFam3.1 reference.
Locus copy number estimation
Copy numbers for specific assembly regions were calculated using Diploidocus (v0.10.0) (runmode = regcnv) (51). For each animal, long reads were mapped onto the assembly with Minimap2 (v2.17) (no secondary mapping) (26) and the modal read depth across single-copy Complete genes identified by BUSCO v3 (30) (laurasiatherian_db) calculated using SAMTools (v1.9) (56) mpileup. This set the expected single-copy read depth, XSC. Copy number for a region, Nreg was estimated by dividing the mean read depth across that region, Xreg, by XSC. The variance and standard deviation of the estimate was calculated using Xreg for all single copy BUSCO genes. For Wags (a male), genes on the X chromosome were excluded from this analysis.
Amylase copy number. The copy number of the beta amylase gene Amy2B was calculated using Diploidocus (v0.10.0) (runmode = regcnv) (51) using a modification of the single locus copy number estimation (above) to account for multiple copies of the gene in the assembly. First, the AMY2B protein sequence from CanFam3.1 (UniprotKB: J9PAL7) was used as a query and searched against the genome with Exonerate (v2.2.0) (64) to identify assembled copies of the Amy2B gene. A full-length (14.8 kb) Amy2B gene repeat was also extracted from the CanFam_GSD assembly and mapped onto the assembly with Minimap2 (v2.17) (-x asm5) (26). Estimated Nreg values were converted into a number of copies by multiplying by the proportion of the query found covered by that region. The total genome Amy2B copy number was then calculated as the summed copy number over each hit. To further investigate the robustness of the method and improve the Amy2B copy number estimate in CanFam_Bas, analysis was repeated with ONT reads at least 5 kb in length and at least 10 kb in length. These reads should be less susceptible to poor mapping at repeat sequences, but at a cost of reduced coverage.
We also used droplet digital PCR (ddPCR) to directly quantify the Amy2B copy number of China DNA (65). ddPCR was performed using a QX100 ddPCR system (Bio-rad). Each reaction was performed in a 20 µl reaction volume containing 10 µl of 2x ddPCR Supermix (Bio-rad), 1 µl of each 20x primer/probe, 1 µl of DraI restriction enzyme (New England BioLabs #R0129S), 5 µl of DNA template (4 ng/µl) and 2 µl ddH2O. Primer sequence for Amy2B: forward 5′-CCAAACCTGGACGGACATCT-3′ and reverse 5′-TATCGTTCGCATTCAAGAGCAA-3′ with FAM probe: 6FAM–TTTGAGTGGCGCTGGG-MGBNFQ. Primer sequence for C7orf28b-3: 5′-GGGAAACTCCACAAGCAATCA-3′ and reverse 5′-GAGCCCATGGAGGAAATCATC-3′ with HEX probe HEX-CACCTGCTAAACAGC-MGBNFQ.
Nuclear mitochondrial DNA (NUMT) fragment analysis
To make sure that any NUMTs were identified as a single region, a double-copy CanFam3.1 mtDNA sequence was constructed and then searched against the CanFam_Bas nuclear genome and compressed to unique hits using GABLAM (v2.30.5) (27) (blast + v2.9.0 (57) blastn, localunique) with a blastn e-value cut-off of 1e-4 (29). Predicted copy number was calculated for each NUMT in China v1.1 using the method described above. Diploidocus (v0.10.0) was also used to calculate number of reads spanning each entire NUMT plus flanking regions of 0 bp, 100 bp, 1 kb and 5 kb. In addition, assembly coverage for each NUMT was calculated for Wags, CanFam3.1 and CanFam_GSD. Each genome was split into 1 Mb tiled fragments and mapped onto CanFam_Bas with Minimap2 (v2.17) (26). Each BAM file was used for Diploidocus (v0.10.0) regcnv (51) analysis with a single-copy read depth of 1x. Mitochondrial genome coverage was analysed by extracting all 291 NUMT regions with SeqSuite (v1.23.3) (66) and mapping them onto the CanFam3.1 mtDNA chromosome using GABLAM (v2.30.5) (27) (blast + v2.10. (57) tblastn).
Genome annotation
Following Field et al. (14) each genome was annotated using the homology-based gene prediction program GeMoMa (35) (v1.6.2beta) and nine reference organisms. For each reference organism, coding exons of full-length transcript were extracted and translated to peptides using the GeMoMa module Extractor. These peptides were searched in each Basenji genome using mmseqs2 (67) (v5877873cbcd50a6d954607fc2df1210f8c2c3a4b). Based on the results of mmseqs2 and Extractor, transcripts were predicted for each Basenji genome from each reference organism independently. These nine gene annotation sets were then combined into a final gene annotation using the GeMoMa module GAF. To make a fair comparison of the influence of genome quality and completeness on annotation, CanFam3.1 was annotated with the same pipeline. Annotation of CanFam_GSD using the same pipeline was obtained from Field et al. (14).
Annotation summary and quality assessment. Annotation summary statistics and the longest protein isoform per gene were generated with SAAGA (v0.4.0) (68). Annotation completeness was estimated using BUSCO v3 (30) (laurasiatherian, n = 6253, proteins mode), run on a reduced annotation consisting of the longest protein per gene. To check for truncated or fragmented protein predictions, predicted proteins were mapped onto the Quest For Orthologues reference dog proteome (36) with mmseqs2 v (67). The best protein hit for each gene was used to calculate a protein length ratio (length of predicted protein / length of reference protein). Percentage coverage of query and hit proteins was calculated with mmseqs2 v (67). A reciprocal search identified the closest predicted protein for each reference protein. Any reciprocal best hits were marked as predicted orthologues.
Annotation copy number and coverage analysis. Predicted copy number was calculated for every protein-coding gene in CanFaBas using the method described above. In addition, assembly coverage for each CanFam_Bas gene was calculated for Wags, CanFam3.1 and CanFam_GSD. Each genome was split into 1 Mb tiled fragments and mapped onto CanFam_Bas with Minimap2 (v2.17) (-ax asm5 -L) (26). Each BAM file was used for Diploidocus (v0.10.0) regcnv analysis with a single-copy read depth of 1x. In addition, SMRT reads from Wags and ONT reads from the GSD were mapped onto CanFam_Bas with Minimap2 (v2.17) (--secondary = no -L -ax map-pb or -ax map-ont) (26) and the standard predicted copy number calculation applied. Genes with zero coverage were marked 0n. Other genes were binned according to coverage: for mapped assemblies, coverage was rounded to the nearest integer; for long read mapping, coverage was rounded to the nearest 0.5n. Genes with greater than zero but less than 50% coverage were assigned to 0.5n. Any genes exceeding a rounded coverage of 2n were grouped as “3+”.
Ribosomal RNA prediction. For each genome, genes for rRNA were predicted with Barrnap (v0.9) (69) (eukaryotic mode, implementing Perl v5.28.0, HMMer v3.2.1 (59) and BEDTools v2.27.1 (54, 55)).
Whole genome assembly comparisons
Whole genome synteny analysis were performed for the main chromosome scaffolds using the D-GENIES (28) web portal.
Long read structural variant detection
Structural variant calls were generated using a combination of minimap2 (v2.17-r943-dirty) (26), SAMTools (v1.9) (56), and sniffles (v1.0.11) (70). In total, four sets of long reads from three samples were analyzed consisting of China the Basenji (Oxford Nanopore), Wags the Basenji (SMRT) and Nala the German Shepherd (Oxford Nanopore and SMRT (14)). Reads were mapped against China v1.0, CanFam3.1 and CanFam_GSD. Analysis was restricted to the main nuclear chromosome scaffolds. Variants for the Basenji and for Nala were annotated with gene model predictions generated using GeMoMa (35) (v1.6.2beta) while CanFam3.1 variants were annotated with Ensembl gene annotations v100 for CanFam3.1 (71)
Short read mapping, SNV / small indel detection
Representative Illumina data was identified from (72). All Sequence Read Archive (SRA) IDs associated with the DBVDC bioproject on NCBI (SRP144493) were downloaded along with their metadata and reduced to 126 samples representing the biggest sequencing run (no. bases) per annotated breed. These 126 samples were used for initial read mapping and variant calling (Supplementary Table 7, Additional File 2). Following removal of unknown/mixed/village dog samples, canids other than domestic dogs, and duplicates the remaining breeds were mapped onto those used by Parker et al. (10). In total, 58 breeds in the significantly monophyletic clades (greater than 70% bootstrap) designated by Parker et al. (10) were considered for analyses (Supplementary Table 6, Additional File 2).
SNVs and small indels were called from the Illumina reads of the 58 representative breeds against three reference genomes (Basenji China v1.0, CanFam3.1, and CanFam_GSD). All Illumina reads were downloaded from the short read archive using the SRA toolkit (v2.10.4) (73). All samples were analysed using a modified version of an existing variant detection pipeline (74). Briefly, the pipeline employs BWA (v0.7.17-r1188) for read alignment (75), SAMTools (v1.9) (56) and picard (v2.4.1) (http://broadinstitute.github.io/picard) for binary alignment map (BAM) file preprocessing, and Genome Analysis ToolKit (v3.6) (GATK) for calling SNVs and small indels (76). The workflow follows GATK best-practices using default parameters throughout. Additionally, SRA reads from each reference genome were aligned using BWA and SAMTools and a consensus variant list generated using an approach described previously (77). The consensus variant lists for each reference genome were utilized in GATK’s BaseRecalibrator step as the ‘-knownSites’ argument to serve as the required variant truthset. Variant alignment statistics were generated using SAMTools flagstat. Joint variant calls were generated for the larger dataset of 126 samples relative to the three reference genomes and the total number of SNVs and small indels for individual samples tabulated. Variants for the Basenji and CanFam_GSD were annotated with gene model predictions generated using GeMoMa (see above) while CanFam variants were annotated with ENSEMBL gene annotations v100 for CanFam3.1. Read mapping statistics for each sample were calculated using SAMTools to remove secondary mapping and then summarized with BBTools (v38.51) pileup.sh (52). Numbers were then converted into relative values for each reference by averaging the score for each breed over the three reference genomes and then calculating the difference from the mean. Breeds were considered individually, but mean values for each clade were also calculated. Three clades are of particular importance as they are closely related, or include, the three reference genome assemblies. The Asian Spitz clade is considered closely related to the Basenji. This clade contained the Alaskan Malamute, Shar-Pei, Shiba Inu and Tibetan Mastiff. The European Mastiff Clade contained the Boxer, Bull Terrier, Cane Corseo, Great Dane, Mastiff and Rhodesian Ridgeback. The New World clade contained the Berger Picard, Chinook and German Shepherd. One-way ANOVA’s using PRISM© were employed to detect significant differences between groups. As the same short read samples were examined relative to the three reference genomes statistical significance was set to be P < 0.01.