Analysis of Genomic Variations between P. maniculatus and P. polionotus
The ability to generate fertile and viable hybrid offspring from P. maniculatus and P. polionotus provides an opportunity to identify the genetic loci associated with phenotypic traits that differ between the two species. Capitalizing on this rich genetic resource requires identifying the allelic differences between the two species. We sought to identify allelic differences between P. maniculatus and P. polionotus, using publicly available genomic resources. The first assembly of the P. maniculatus genome was used as the reference genome, and the P. polionotus sequencing reads deposited in the SRA database were mapped against the P. maniculatus reference genome. For polymorphism determination we required 10 independent P. polionotus sequencing reads to be mapped to a specific P. maniculatus genomic location (data file available at https://osf.io/4eypx/). With these criteria we determined that 81.9% of the P. maniculatus genome was covered by 10 P. polionotus sequencing reads, resulting in 38,166,334 polymorphisms between P. maniculatus and P. polionotus. Among these variations are 34,084,607 single nucleotide polymorphisms (SNPs) and 4,081,727 insertions or deletions (INDELs) between BW and PO (Table. 1). Lowering the read mapping requirement to five sequencing reads increased the number of identified polymorphisms to 40,815,360. However, we chose to keep the more stringent criteria for subsequent analysis, recognizing that we are likely underestimating the number of polymorphisms between the species. Using the more stringent 38,166,334 polymorphisms results in a variant rate of one variation every 68 base pairs. We used SnpEff to annotate the potential functional effects of the polymorphisms (data file available at https://osf.io/4eypx/), recognizing that individual variant locations can have multiple annotations, which increases the number of annotations above the total number of variants. More than 23 million variants occur in intergenic regions, with more than 7 million occurring both within 5 kb upstream and downstream of an annotated gene, potential impacting transcriptional regulatory sites (Table 1). More than 45 million variants occur in an annotated transcript region, including > 769,000 in annotated exons and > 44 million in annotated introns (Table 1). Within exons, the majority occur in 3’ untranslated regions (UTR).
We characterized the SNPs that occur between P. maniculatus and P. polionotus for changes in predicted protein coding sequences and found that there are 11,588 genes that contain a coding sequence SNP. Of these variants, 70.2% are silent, 29.6% are missense, and 0.2% are nonsense variants. We identified a set of 10,405 genes that contain a nonsynonymous change, which might result in a functional difference in the protein between the two species. Using the annotated Mus genome as a reference, we conducted a gene ontology (GO) term analysis on the list of 10,405 genes with missense or nonsense variants to determine if any biological processes or molecular functions are over or underrepresented within this list of genes. There are 2060 biological processes that are over-represented, and 63 that are under-represented (Table 2 and Additional Files 1 and 2). The processes with the most statistically significant enrichment are typically broad in GO terminology, including cellular process, cellular metabolic process, and metabolic process. P. maniculatus and P. polionotus are known to have distinct physiological differences, including an almost four-fold increase in blood cholesterol levels in P. polionotus compared to P. maniculatus and a two-fold increase in blood triglyceride levels in P. polionotus over P. maniculatus (24). We searched the list of overrepresented GO terms for terms with a possible relationship with cholesterol and triglycerides and found that the GO terms cholesterol metabolic process, cholesterol homeostasis, cholesterol transport, triglyceride metabolic process, triglyceride biosynthetic process, and triglyceride catabolic process are enriched in this list of genes with potential functional protein changes (p = 3.90 x 10-6, 3.05 x 10-5, 3.86 X10-4, 9.86 x 10-6, and 1.81 x 10-5, respectively) (Additional File 1). Each of these GO terms contains from 20 to 77 different genes with nonsynonymous changes, suggesting that there is substantial genetic complexity that may underlie the metabolic differences between the two species. In contrast, there are 63 GO terms that are under-represented in the list, including sensory perception of smell, sensory perception of chemical stimulus, sensory perception, G protein-coupled receptor signaling pathway, and nervous system process, suggesting that these biological processes are more conserved between the two species.
The P. maniculatus laboratory stock, BW, is known to have a significant incidence of repetitive, or stereotactic, behavior, including repetitive jumping (18, 25, 26). The P. polionotus laboratory stock, PO, does not display stereotactic behaviors. In humans, repetitive movements are associated with autism, obsessive-compulsive disorder, and other neurologic disorders, including tics and schizophrenia (20, 27).BW animals have been used to study how neuroactive drugs and environmental enrichment can reduce repetitive behavior (28, 29). BW animals are also less social than PO animals, another hallmark of autism in humans (18, 30). Our expectation is that there are genetic loci that underlie the stereotactic and social behaviors and we sought to determine if there is genetic variation between BW and PO that may be associated with these behavioral differences. We examined the list of overrepresented GO terms for processes that may be related with autism associated behaviors and found that locomotory behavior and social behavior are both enriched GO terms (p = 1.94 x 10-5 and 6.79 x 10-4, respectively) (Additional File 1). In addition, we selected a list of candidate genes that all have a high confidence of being associated with autism in humans from the Simons Foundation Autism Research Initiative (SFARI) database (Table 3). We then identified sequence variations that occur between P. maniculatus and P. polionotus in this list of autism candidate genes. Each gene analyzed has multiple sequence variations between the two species that could result in a functional change to the protein, including missense changes, nonsense changes, in-frame deletions, and nucleotide variations in splicing regions. In addition, there are numerous differences in untranslated regions, introns, and upstream and downstream sequences that could result in differences in transcript or protein levels, if they occur in regulatory regions.
We wanted to further explore how polymorphisms between BW and PO might result in functional changes by examining polymorphisms in one autism candidate gene, recognizing that there are no identified connections between there autism candidate genes and the behavioral differences between BW and PO. ASH1L is a chromatin modifying protein that is associated with transcriptional activation(31). De novo mutations in ASH1L have been identified in multiple people with autism, while rarely occurring in controls (32-34) . We aligned select mammalian ASH1L protein sequences to identify highly conserved amino acids and determine if Peromyscus amino acid substitutions occur in these conserved residues (Fig. 1). None of the nonsynonymous changes between P. maniculatus and P. polionotus ASH1L are in the ASH1L conserved protein domains, SET, BROMO, PHD, and BAH, and generally occur in regions with less conservation between mammalian ASH1L proteins. However, at positions 61, 484, 770, 1632, and 2814 there are amino acid substitutions in one Peromyscus species where the amino acid is conserved in the other mammals. The S484A, S770P, and T1632P substitutions are intriguing as they remove potential phosphorylation sites in one of the Peromyscus species. The potential functional impact of these amino acid substitutions on ASH1L function will require further characterization.
We also examined possible transcriptional regulatory changes between P. maniculatus and P. polionotus for ASH1L by generating a VISTA plot to identify conserved non-coding sequences (CNS) in the ASH1L locus (35). A conserved non-coding sequence occurs in intron 3 of ASH1L in 100 vertebrate species (UCSC Genome Brower: Human GRCh38/hg38 chromosome 1: 155,459,751-155,478,012) and is also found in BW and PO (Fig. 2) (36). Within this CNS there are three SNPs between P. maniculatus and P. polionotus. Two of the three Peromyscus SNPs are in positions that are not conserved between a group of seven mammalian species. However, one SNP occurs in a region of 16 nucleotides that are completely conserved within the selection of mammalian species (Fig. 2c). We used PROMO to identify potential transcription factor binding sites within this region and found that in six mammalian species, including P. maniculatus, the conserved sequences contain a potential NKX2-1 binding site (37, 38). However, in P. polionotus the SNP removes the NKX2-1 binding site and generates a potential EBF1 binding site.
Restriction Enzyme Recognition Sites in P. maniculatus
A QTL analysis or GWAS using Peromyscus is likely to utilize RADseq. RADseq is a flexible approach to genomic analysis, as the choice of restriction enzyme used to digest the genomic DNA can be varied to customize the number of sequenced sites, known as RAD markers, across the genome (39). The number of RAD markers generated is twice the number of restriction enzyme recognition sites. An enzyme that cuts more frequently will generate more RAD markers and, therefore, provide more allelic information than an enzyme that cuts less frequently. We used the P. maniculatus reference genome to determine the number of cuts sites and the average fragment size for the enzymes listed in Table 4. This data provides a range of restriction enzymes with recognition sites from approximately 1000 bp apart (DraI) to approximately 1 million bp apart (AscI), enabling an informed choice for restriction enzyme selection in P. maniculatus RADseq projects. RADseq generates about 400 bp of sequence information flanking a restriction enzyme recognition site. Because a sequence variant between P. maniculatus and P. polionotus occurs approximately every 68 base pairs, it is likely that RADseq analysis on F1 hybrids will generate informative allelic information at most RAD markers.
Linkage Analysis of Dominant spot
Dominant spot is a spontaneous mutation that arose within a wild population of P. maniculatus near Morrison, Illinois (1, 40). The Dominant spot trait (S) is maintained as heterozygotes on the BW laboratory stock of P. maniculatus at the PGSC. PGSC breeding records suggest that S/S homozygotes are likely embryonic lethal. We crossed BW S/+ adults and generated timed pregnancies and observed resorbing embryos at embryonic day of development 14.5 (e14.5), and approximately one quarter of e13.5 embryos have a variable phenotype that includes morphological defects consistent with embryonic lethality (Fig. 3).
We sought to perform linkage analysis to identify genetic loci associated with Dominant spot by crossing BW S/+ (Fig. 4a) with the PO laboratory stock of P. polionotus (+/+). F1 hybrids of P. maniculatus and P. polionotus exhibit developmental dysgenesis (1, 41). When female P. maniculatus are crossed with male P. polionotus, the hybrid offspring are smaller than either parent, but are viable and fertile. In contrast, female P. polionotus crossed with male P. maniculatus result in overgrown fetuses with developmental defects and are not viable. Therefore, a male +/+ PO was crossed with S/+ BW to generate six F1 hybrids with forehead spots (Fig. 4b). These six S/+ offspring were then backcrossed to PO +/+ to generate an N2 generation containing 125 animals, of which 46 have spots (S/+) and 79 did not (+/+) (Fig. 4c).
Disrupted pigmentation patterns in laboratory mice, Mus musculus, are readily identifiable, and characterization of the causative mutations for these spotting defects has identified key members of a neural crest gene regulatory network necessary for normal neural crest development (42). We pursued a candidate gene approach as a first step towards linking Dominant spot with a specific genomic region. Edn3, Ednrb, Kit, Kitl, Mitf,Pax3, Ret, Snail, and Sox10 are all known to cause spotting phenotypes in M. musculus; therefore, we sought to identify allelic differences in P. maniculatus and P. polionotus for each gene to determine if any of these candidate genes are linked with Dominant spot. For our list of candidate genes, we identified a sequence variant that removes a restriction enzyme recognition site in one Peromyscus species. We will call these sites restriction fragment length polymorphisms (RFLPs) because of their similarity to the technique used for genomic variation analysis (Table 5). We then designed polymerase chain reaction (PCR) primers flanking the site to generate an RFLP site specific amplicon. BW and PO genomic DNA, along with genomic DNA from S/+ N2 animals was PCR amplified and then digested with the appropriate restriction enzyme (Fig. 5 and Additional File 3). The S mutation arose in P. maniculatus and has been maintained on the BW stock; therefore, the S mutation occurs in a BW allele. If a candidate gene is linked with the Dominant spot trait, then all S/+ N2 animals will have both a BW allele and a PO allele for that candidate gene. If an S/+ N2 animal has only PO alleles for a candidate gene, then that candidate gene is not linked with the Dominant spot trait. From our list of candidate genes, eight of the candidate genes are not linked with Dominant spot as there are multiple S/+ N2 individuals with only the PO allele. However, Sox10 is linked with Dominant spot; 46 S/+ N2 individuals were genotyped at the Sox10 RFLP site, and all 46 are BW/PO (χ2 (1, N = 46) = 46, p = 1.2 X 10-11). By employing the same RFLP analysis, we have identified a 1.7 Mb region between Tex33 and Pdgfb on chromosome 20 that is linked with Dominant spot (Additional File 4). Among the 53 genes contained in the linkage interval, only Sox10 has a defined role in neural crest development. Therefore, we favor the possibility that the S mutation disrupts Sox10 function. We have sequenced the Sox10 exons, exon/intron junctions, promoter, and several conserved enhancer regions but have not identified a sequence variation that disrupts Sox10 function. We are expanding the sequencing analysis to include the entire linked region.
In generating the N2 generation, we noticed that the spot size was smaller on the F1 and N2 animals compared to the originating BW background. The average spot size for 25 S/+ BW animals is 77.6 ± 36.6 mm2. We crossed one BW S/+ with PO +/+ and generated six F1S/+ animals, which had very small spots in comparison, 3.75 ± 1.56 mm2, suggesting that PO alleles have a dominant effect on the S/+ spot size phenotype. The six F1S/+ were backcrossed with PO +/+ and in the PO N2 generation the spot size for 46 affected animals averaged 14.5 ± 13 mm2, which is significantly smaller than the spot size of S/+ BW animals (Welch’s t (27) = 8.34, p = 5.38 x 10-9), suggesting that genetic background has a significant impact on the S/+ phenotype. A histogram for spot size for S/+ on BW and PO N2 illustrates the quantitative nature of the phenotype and the shift in spot size in the PO N2 animals (Fig. 6). We used the backcross data to estimate the number of loci that affect spot size. The six F1 animals with small spots produced 46 offspring with spots, of which 17 (37%) resembled the F1 parent. In this backcross experiment there are only two possible genotypes for any gene. If one unlinked locus determines the spot size phenotype, then there are two possible genotypes that interact with Dominant spot and 50% of the offspring are expected to resemble the F1 parent. If two unlinked loci determine the spot size phenotype, then 25% should resemble the parent, and if three unlinked loci are involved then 12.5% should resemble the parent. Using a X2 analysis, we can reject a model for three interacting loci (X2 (1, N = 46) = 23.19, p = 1.47 X 10-6), but not models for one interacting locus (X2 = (1, N = 46) = 3.13, p = 0.077) or two interacting loci (X2 (1, N = 46) = 2.82, p = 0.093). These data suggest that there are one to two modifiers that cause the observed variability in the Dominant spot trait. Discriminating between these two possibilities will require a larger sample size.
Further analysis indicates that there is a significant loss of affected animals in the N2 backcross. A total of 125 offspring were produced in the N2 generation. Of these animals, 46 pups had forehead spots, representing the S/+ genotype, while 63 are expected (X2 (1, N = 125) = 9.25, p = 0.0024). Analysis of the PGSC breeding records for BW S/+ X BW +/+ indicate that the S/+ genotype is produced at the expected frequency (310 total offspring of which 165 have spots, (χ2 (1, N = 310) = 1.29, p = 0..256)). These results suggest that there is a significant loss of the S/+ phenotype in the PO N2 offspring, resulting from either a lethality of S/+ in PO N2 animals or because some PO N2S/+ animals have a phenotypic rescue and do not have a forehead spot. In the backcross, PO N2 animals without spots should have two PO alleles for Sox10, barring a rare recombination event that separates the Sox10 RFLP site from the unknown causative mutation. We genotyped 50 PO N2 individuals without spots at the Sox10 RFLP site and found that 21 are BW/PO, supporting the hypothesis that a modifier can phenotypically rescue the S/+ genotype. We are currently conducting a QTL analysis to identify loci associated with the spot size phenotype.