De novo assembly. Mischka, a 12-year-old female German Shepherd, was selected as the source for our high-quality reference genome assembly (UU_CFam_GSD_1.0/canFam4; henceforth called GSD_1.0). Mischka was free of known genetic disorders, and when compared with additional German Shepherd sourced from within Sweden, was found to be genetically representative of the breed (Supplementary Fig. 1). We sequenced the genome using ~ 100X coverage PacBio long reads and assembled these in contigs with the standard FALCON method14. Further scaffolding using 94X of 10x and 48X of Hi-C linked reads resulted in 39 single-scaffold chromosomes (total 2.35 Gb) and 2,159 unplaced scaffolds (total 128.5 Mb; Fig. 1a). The latter contigs predominantly contain segmental duplications (58.1%) and centromeric repeats (30.1%, Supplementary Fig. 2).
Reference benchmarking. Compared to CanFam3.1, the contiguity of GSD_1.0 has been improved 41-fold, reaching a contig N50 of 14.8 Mb (Supplementary Fig. 3), with only 367 gaps in the chromosome (chr) scaffolds (Table 1; Fig. 1a). The quantity of sequence with extreme GC content (> 90% in 50 bp-windows) has doubled to 1.7 Mb (Fig. 1b), leading to a 14% increase in the average length of CpG islands (1,056 bp versus 926 bp, P = 8.4 × 10− 4, t-test). Filled gaps were found to have either high GC or high repeat content (Fig. 1c).
Table 1
Assembly statistics of GSD_1.0 compared to CanFam3.1
| GSD_1.0 | CanFam3.1 |
Number of contigs | 2,783 | 27,104 |
N50 (L50) contig | 14,840,767 bp (57) | 267,478 bp (2,436) |
Number of scaffolds | 2,198 | 3,268 |
N50 (L50) scaffolds | 64,299,765 bp (15) | 63,241,923 bp (15) |
Number of Gaps | 585 | 23,876 |
Gap density (gaps/Mb) | 0.24 | 9.9 |
Total bases | 2,482,000,080 bp | 2,410,976,875 bp |
Total ungapped bases | 2,481,941,580 bp | 2,392,715,236 bp |
Repeat structure. Approximately 42.7% of the genome is repetitive sequence, with the three major categories being LINEs (504 Mb), SINEs (253 Mb) and LTRs (120 Mb) (Supplementary Fig. 4, Supplementary Table 1). Long read technology allowed for the further resolution of centromeric repeats, and based on their positions, the orientation of chr 27 and 32 were reversed compared to CanFam3.1. In addition, the q-arms of 21 autosomes now begin with centromeric repeats, and 17 autosomes end in telomeric repeats (Fig. 1a). As expected, the sub-metacentric chr X has telomeric repeats at each end, and a clear centromeric signal at 49.4–49.9 Mb. Throughout the genome we found 10 internal centromeric, and 7 internal telomeric repeats. These may indicate ancient centromere and telomere positions prior to chromosomal rearrangements and most were also present in the previous reference genome assembly.
Functional annotation. To resolve transcript complexity and account for the gap closures in GSD_1.0, we generated more than 70 M nanopore and PacBio full-length cDNA reads from 40 tissues (including 15 brain regions; Supplementary Table 2), and combined this with 24 billion public RNA-seq paired reads (Supplementary Table 3). The annotation consisted of 159 thousand transcripts in 29,583 genes; of which 20,654 had an open reading frame (ORF) of at least 100 amino acids, and 19,691 genes had a significant BLAST hit against proteins in Swissprot or ENSEMBL. Further, 7,725 were defined as long noncoding genes. Compared to proteins extracted from CanFam3.1, our new GSD-1.0 annotation has a higher number of genes with BLAST hits and the number of genes with a full-length match has increased by 11% (Supplementary Fig. 5). Gene predictions and non-dog refSeq alignments were used to identify potentially missed genes that did not overlap with our annotation, yielding an additional 874 protein coding genes with BLAST evidence. Using a combination of new miRNA-seq reads and public data we identified a conservative set of 719 miRNAs, similar to the set found for CanFam3.115. Among the novel miRNAs, a copy of the highly expressed Mirlet-7i was identified in a filled gap region (Supplementary Fig. 6). This miRNA has been implicated in several human diseases, including multiple sclerosis16, gastric cancer17 and breast cancer18, but has yet to be extensively studied in dogs.
We identified 7,468 closed gaps containing either an exon or promoter sequence as defined by ATAC-seq peaks, accounting for 5,743 unique coding exons which were missing in CanFam3.1 (Fig. 1e). Notably, eight genes with expression across multiple tissues were completely absent in CanFam3.1 but were now available for interrogation (PSMA4, CDHR5, SCT, PAOX, UTF1, EFNA2, GPX4 and SLC25A22). These genes have diverse functions ranging from embryonic stem cell co-activator (UTF1), through to osmoregulation (SCT). Both CDHR5 and SLC25A22 (Fig. 2a) have been investigated as biomarkers for either renal19 or colorectal20 cancers.
Implications for research. We assessed the chromosomal order and contiguity of regions essential to the study of cancer and immunological disease. Using the human COSMIC21 gene list as a baseline, we affirmed that 282 tier1 and 78 tier2 genes are now completely captured, including HOXD13 and KLF4 (Supplementary Table 4). Both have been implicated in human breast cancer; HOXD13 methylation status functions as a prognostic indicator22 and deubiquitination of KLF4 promotes metastasis23 (Supplementary Fig. 7). The Dog Leukocyte Antigen (DLA) regions positioned on chr 12 (Fig. 1f) and 35 (Supplementary Fig. 8a) are contiguous in GSD_1.0 (covering 2.58 Mb and 0.61 Mb, respectively) and contain new coding and potential regulatory sequences in filled gaps. Contiguous sequence was also reported for both the T-cell receptor alpha (TRA) and T-cell receptor beta (TRB) loci on chr 8 and 16 respectively (Supplementary Fig. 8b&c).
Comparison to canine assemblies. Five additional canine genome assemblies have recently been deposited in NCBI (Supplementary Table 5). For each assembly, we compared BUSCO24 scores and mappability using in-house Iso-Seq cDNA alignments generated above from a beagle dog (Supplementary Table 2). With GSD_1.0 it was possible to map > 5% more bases from 25,609 of Iso-Seq reads compared to CanFam3.1 (4.8% of total reads; Supplementary Fig. 9). This was a higher fraction than for the other assemblies (Supplementary Table 6). GSD_1.0 had the second highest BUSCO score for complete genes (95.5%), but each canine assembly, including the dingo, is of value to the community and may serve different experimental goals (Supplementary Fig. 10).
Genome variation. Polymorphisms detected in 27 dogs (19 breeds) were extracted from 10x sequencing data to facilitate the investigation of genome features and across-breed variant segregation (Supplementary Table 7). We identified 14,953,199 SNPs, 6,958,645 indels and 217,951 structural variants (SV, average 2.4 kb; Fig. 3b). Of these, 42.1% were private, 57.9% polymorphic across multiple individuals, and 1.4% overlapped with protein-coding regions (295,112 SNPs and 16,654 SVs). Intersection with existing SV catalogues based on either SNP or aCGH arrays25–27 showed between 12.6–39.0% agreement, but these numbers are likely a reflection of within project breed and detection technology. 10x sequencing allowed for the detection of many novel SVs with small to medium size (≥ 30 kb) with accurate breakpoints (Test set of 9 validated with PCR and Sanger sequencing, data not shown).
Genome “dark” regions unmasked. The majority of publicly available dog WGSs were generated with short read technologies. To facilitate the reanalysis of these resources with GSD_1.0 we aimed to identify the genome’s “dark” regions28; those sections either not adequately covered by sequencing method (“covered”, COV) or to which unique alignment is not possible (“camouflaged”, CAM). We defined GSD_1.0 dark regions for Illumina short reads (ISR), 10x, and PacBio (PB) sequencing (see Methods). COV comprised 5.8 Mb, 5.7 Mb and 6.4 Mb respectively, while CAM comprised 15.9 Mb, 6.4 Mb and 1.0 Mb (Fig. 2c). Intersection showed that while 10x could rescue 11.3 Mb dark regions not seen with ISR (9.73 + 1.56 Mb), more than half of this again (5.9 Mb) could be further recovered by PacBio (Fig. 2d). We noted six tier 1 & 2 COSMIC genes that contained COV/CAM regions (EPHA3, RALGDS, LRP1B, CSMD3, ZMYM2, PTEN; 0.8–6.6% of coding region dark), potentially masking drivers of disease. Due to the nature of COV/CAM regions, default practices will not allow for the mapping of IRS reads to, and subsequent variant extraction from, these regions. Instead, we extracted variants overlapping annotated dark regions from our “healthy” 10x data set, and in doing so, identified 51,994 SNPs and indels, including 19,340 intronic and 2,074 exonic variants. Many of these variants were embedded in genes that may be important for morphology or associated with disease. For example, 14 variants were found within seven intronic TYRP1 ISR CAM dark regions (Supplementary Fig. 11a): a gene linked to brown color in dogs29 and melanoma in humans30,31. Likewise, 76 variants were found in ADCY2 ISR CAM regions (Supplementary Fig. 11b). Polymorphisms in this gene have previously been associated with neurological disorders (Bipolar disorder32; and Alzheimer's disease33), and response to associated drug therapies of schizophrenia34 in humans.
Chromosome mis-assembly resolved. A direct comparison of CanFam3.1 and GSD_1.0 revealed a complex ~ 10 Mb inverted region on chr 9 that harboured SOX9 and was previously implicated in canine XX disorder of sex development (DSD)35–37. Three polymorphic regions homologous to parts of MAGI2 on chr 18 (M1, M2, M3), have been inserted upstream of SOX9 (Fig. 3a,b). In DSD, having multiple copies of a CNV overlapping M236, was shown to be associated with altered SOX9 function during gonadal development. Using HiC and BAC end sequencing data, we confirmed the inverted GSD_1.0 orientation was correct and refined the placement of regions M1, M2 and M3 (Fig. 3a). These chr 9 insertions are missing from GSD_1.0, but allelic depth analysis revealed that most 10x dogs (26/27) carry between 2–6 chr 9 copies (Fig. 3c,d), similar to the estimates reported for non-DSD dogs37. Recently it was shown that the DSD phenotype presents in a breed specific manner, and is influenced by the combination of a SNP and CNVs in this region35,37. However, as this inversion contains numerous genes and regulatory elements, this rearrangement including multiple CNV expansions, has the potential to impact additional canine traits.
CYP1A2 locus variation. To further investigate the impact of SVs on coding genes, we examined the 16.2 kb copy number locus which encompassed CYP1A2 (Fig. 4a). Dogs are used as comparative models for human xenobiotic metabolism, and while a CYP1A2 premature stop codon (rs852922442 C > T) has been reported38,39, he locus expansion has not. The homozygous T mutant is polymorphic across breeds40 and results in an array of pharmacokinetic effects, including reduced hepatic drug metabolism41. This T variant was observed in 4/27 10x dogs, but in heterozygous form and not segregating with CNV count (2–5 copies; Fig. 4b). Differential gene expression analyses for this and neighboring genes outside the locus were performed using either liver or spleen tissue from additional individuals (Supplementary Tables 8 and 2). After accounting for CYP1A2 SNP rs852922442-T, no significant relative gene expression difference was observed, leaving the phenotypic consequence of this expansion unresolved (CNV 3 vs > 3; Supplementary Table 9). It may be that the effect in this region is subtle, and so not detectable with qPCR, however, CYP1A2 is an inducible gene and so the true outcome may only be observed after a drug challenge42.