Sample collection, DNA extraction and sequencing
For de novo genome sequencing, blood from a 3-years-old male reindeer from the region of Hardangervidda (Norway, 60⁰ 0’16’’N, 7⁰ 34’24’’E) was extracted by puncture of the heart immediately after killing and placed into EDTA tubes. Blood was frozen in a paraffin-driven freezer and shipped on ice to the IKMB Sequencing Center in Kiel (Germany). High-molecular-weight genomic DNA was extracted using MagAttract HMW DNA Kit (Qiagen) following the manufacturer’s instructions and fragment length > 50 Kb confirmed using Agilent TapeStation 4200. DNA was sent to the NGI/ SNP&SEQ Center in Uppsala, where one 10x Genomics Chromium library [41] was prepared. Sequencing was performed using an Illumina HiSeqX (Illumina, San Diego, CA, USA) and 941 million 2 x 150 bp reads were obtained.
Samples used for WGS are described in Supplementary Table 3, as well as the methods utilized for genomic DNA extraction. All samples that were not collected from already available museum specimens, were collected from dead animals during industrial slaughter or subsitence hunting. All samples were sent to the IKMB Sequencing Center in Kiel (Germany) for sequencing. DNA quality was inspected using an Agilent TapeStation 4200 and quantified using a Qubit 2.0 Fluorometer, after which Nextera DNA Flex libraries were prepared. Samples were sequenced in 2 x 150 bp reads on an Illumina HiSeq4000 or Illumina NovaSeq6000 sequencer (Supplementary Fig. 11). Each sample was sequenced to ~ 6x coverage (~ 50 million paired-end reads) and samples were barcoded, pooled and demultiplexed accordingly. Additionally, 6 libraries from Svalbard samples and 6 libraries from mainland Norway samples were sequenced a second time (~ 100 million additional paired-end reads per sample) to increase coverage (up to a total of ~ 20x per sample). Access and utilization of the genetic material in this study followed Nagoya Protocol obligations, with no operational requirements according to Norwegian law.
Ancient reindeer samples
Late Pleistocene hunter-gatherer groups were heavily dependent on reindeer hunting for subsistence. Especially the locations and the landscape around the archeological sites in the Ahrensburg tunnel valley were repeatedly used by these humans for seasonal hunting of reindeer [42, 43]. We sampled petrous bones of two adult reindeer, which were direct dated to 13,152 − 12,416 calBC (KIA-53518: 12,520 ± 55 BP) and 13,046 − 12,293 cal BC (KIA-53525: 12,460 ± 55 BP) using the radiocarbon method following standard protocols at the Leibniz Laboratory for AMS Dating and Isotope Research, Kiel. Here, “calBC” denotes real calibrated 14C dates.
Bleach was used to remove surface contaminants from petrous bones. DNA extraction and subsequent partial uracil-DNA-glycosylase treated sequencing libraries were prepared from bone powder following previously established protocols [44]. All steps, including sampling, DNA extraction and the preparation of sequencing libraries, were performed in clean-room facilities of the Ancient DNA Laboratory in Kiel. Negative controls were taken along for the DNA extraction and library generation steps. The libraries were paired-end sequenced using 2 x 75 cycles on an Illumina HiSeq 4000. Illumina sequencing adapters were removed and paired-end reads were merged if they overlapped by at least 11 bp using ClipAndMerge, a part of the EAGER pipeline [45]. Merged reads were filtered for a minimum length of 30 bp. Both samples show the expected degradation patterns, increased C > T and G > A substitutions at the 5’ and 3’ ends of the reads (Supplementary Fig. 12).
Genome assembly
Supernova v2.1.0 [46] was used to assemble the 10x barcoded FASTQ reads. A total of 2,458 scaffolds were generated, with a N50 value of 15,262,525 bp. Leading and trailing Ns were removed using a custom script. To further scaffold the assembly, we used ARKS v1.0.2 [47] with default parameters and increased the scaffold N50 to 20,747,807 bp in 2,302 scaffolds. To generate an assembly of the mitochondrial genome, first Trimmomatic v0.33 [48] was used to trim the 10x barcodes using HEADCROP:23 for forward reads and HEADCROP:1 for reverse reads. A single lane of reads (150 million reads) was used as input for MitoZ v2.2 [49] with the option “genetic_code 2” to generate the mitochondrial assembly (Supplementary Fig. 13).
To assess the quality of our reindeer assembly, we downloaded the following available deer genomes (together with one sequencing lane of raw reads in each case): Chinese reindeer genome [16] was downloaded from http://gigadb.org/dataset/100370, caribou genome [17] was downloaded from NCBI (GCA_019903745.2), white-tailed deer (Odocoileus virginianus) genome was downloaded from NCBI (GCF_002102435.1, Ovir.te v1.0), hog deer (Axis porcinus) genome [50] was downloaded from NCBI (GCA_003798545.1, ASM379854 v1) and red deer (Cervus elaphus) genome [51] was downloaded from NCBI (GCA_002197005.1, CerEla1.0). The quality of our Hardangervidda reindeer assembly was compared to these other deer assemblies by FRC analysis. Briefly, we mapped one lane of the original genomic reads on the corresponding assembly using BWA v0.7 [52] and used FRCbam to detect mapping errors [53]. Results were plotted to determine overall congruency between assembly and raw data. Each assembly was also analysed for the presence of known mammalian single-copy orthologs (BUSCO v3.0.2) [18] and determine gene space coverage as a proxy for overall completeness.
One available reindeer RNAseq dataset was downloaded from NCBI (SRR5647658), trimmed using Trimgalore v.0.4.4 [54] and option “length 36 q 5 stringency 1 e 0.1” and assembled into a transcriptome using Trinity v2.4.0[55]. This transcriptome was mapped against our assembly using minimap v2.9 [56] “ax splice” to further assess genome completeness.
Genome annotation
The genome was annotated using an in-house developed pipeline (https://github.com/ikmb/esga) based on ab-initio gene model prediction and model hints from different sources. In short, the genome sequence is first repeat-masked using RepeatMasker v4.0.8 [57], followed by generation of 3 types of hint features: 1) reviewed metazoan proteins were downloaded from UniProt [58] and aligned using Exonerate [59] proteins2genome mode, 2) trimmed raw RNAseq reads (see previous section) were aligned using Hisat2 [60], and 3) assembled transcriptome sequences (see previous section) were aligned using Exonerate est2genome mode. Resulting alignments were used as extrinsic hints to run Augustus v3.2.3 [61] specifying “species = human UTR = off –alternativesfrom-evidence = false”. Finally, UTRs and alternative isoforms were incorporated by mapping the assembled transcripts using PASA [62], to finally produce 28,985 protein coding gene models. InterProScan v5.19 [63] was used to perform functional annotation, with protein domains being assigned to 20,402 genes and GO terms to 13,329 genes.
Evolutionary analysis of reindeer ecotypes
Sequencing reads were first trimmed with Trimgalore v0.4.4 [54] using options “paired retain_unpaired length 36 q 30”. Read quality of trimmed reads was analyzed with FastQC v0.11.9. Trimmed reads were mapped to the genome reference using the MEM algorithm from the BWA v0.7 [52] package. Mapped reads were sorted with Samtools v1.9 [64] and duplicate reads were flagged with Picard v2.17.8 [65]. Genomewide coverage was measured using the genomeCoverageBed script from BEDtools v2.25.0 [66]. Reads from whitetailed deer (used as outgroup), caribou and Chinese reindeer were downloaded from NCBI (SRR4069819, SRR9332022 and SRR5763127, respectively) and processed in the same way as described above.
For PCA analysis, the ANGSD v0.921 [67] package, which takes the uncertainty of lowcoverage sequencing data into account, was used. First, genotype likelihoods were calculated from the mapping information collected in the bam files (one per individual) using ANGSD with parameters “uniqueOnly 1 remove_bads 1 only_proper_pairs 1 trim 0 baq 1 minMapQ 15 minQ 15 setMinDepth 60 setMaxDepth 400 doCounts 1 GL 1 doMajorMinor 1 doMaf 1 skipTriallelic 1 SNP_pval 1e-3 doGeno 32 doPost 1”. Next, the covariance matrix between individuals was computed using the ngsCovar tool from the ngsTools suite [68]. Eigenvectors were calculated with the “eigen” function implemented in R [69] and the first two principal components were plotted using a custom R script.
In order to infer the relationship between all sequenced individuals, variants were detected in all samples using FreeBayes v1.1.0 [70], and afterwards filtered with the vcffilter script from the VCFlib library [71] to call only variants with quality > 20 and a coverage of at least 10 reads across all samples (to discard variants unique to a single sample), and with at least 5 reads in each specific sample (f “QUAL > 20 & DP > 10” g “DP > 5”). Variant information was collected in a single VCF file and allele frequencies were calculated using PLINK v1.90b6.16 [72]. Frequency information for the resulting 24,756,418 positions was converted to TreeMix format using custom a script. 100 bootstrap replicates of TreeMix v1.13 [73] were run, fixing white-tailed deer as outgroup and re-sampling blocks of 500 SNPs (“root White_tailed_deer noss bootstrap k 500”). The 100 bootstrap trees were concatenated and the consensus script from the Phylip package v3.696 [74] was used to identify the tree with highest support, which was visualized using FigTree v1.4.4 [75].
Genomic comparison of Svalbard and mainland Norway ecotypes
In order to study the Svalbard ecotype in more detail, only the samples that had been sequenced more deeply (Svalbard n = 6, Norway mainland n = 6; ~20x coverage) were used in the analyses described in this section. Variant calling was performed using FreeBayes and VCFlib in the same manner as described previously.
We employed Population Branch Statistic (PBS) to scan the genome of Svalbard reindeer for possible selective sweep [76]. PBS values were calculated using the following formula:
$$PBS= \frac{{log}\left(1-{{F}_{ST}}_{RO}\right)-{log}\left({1-{F}_{ST}}_{TR}\right)-log(1-{{F}_{ST}}_{TO})}{2}$$
Where FST is the fixation index between populations, RO is FST between the reference (Norway mainland) and the outgroup (Hamburgian) population, TR is FST between the target (Svalbard) and reference population and TO is FST between the target and the outgroup population. A high PBS value generally points out the amount of allele frequency change for a given locus which corresponds to the regions which are generally highly differentiated branches of the population tree compared to a reference population. We have chosen to calculate PBS over FST as PBS is also directional and can detect the selection signal that happened only in Svalbard reindeer.
We used Svalbard population as the target population, Mainland as the reference population and ancient population as the outgroup. We only kept those SNPs which are present in all the samples, bi-allelic and at least one minor allele is present in the sample (not fixed). We removed all the indels from our analysis. The filtering is done by BCFtools [77] using the “bcftools view g ^miss m2 M2 v snps c 01” command.
PBS was calculated using scikitallel [78]. After calculating PBS for every SNP, we average all the SNPs for a 50 kb region with a 5 kb sliding window with at least 10 SNP. Any scaffold with less than 50kb regions is removed from further analysis. The plots are generated using matplotlib [79].
To infer population size of the two ecotypes over time, the multiple sequential Markovian coalescent method was used, as implemented in MSMC2 v2.1.1 [23], for the three largest genomic sequences (scaffolds 1, 5 and 8, 237 Mbp in total). The segment patterning “p 1*2 + 15*1 + 1*2” was used to reduce overfitting on analysing single scaffolds. Results were scaled by a generation time of eight years and 1.6×10− 8 mutations/generation, as estimated with a phylogenetic analysis incorporating the ancient samples with BEAST v2.6.1 [80].
Heterozygosity values were estimated using the software Rohan [81]. Long runs of homozygosity (RoHs) were inferred using nonoverlapping 1 Mb windows. To achieve higher resolution of the homozygous segments we also assessed the RoHs with BCFtools/RoH [82]. The RoHs were quality filtered by average fwdbwd phred scores > 40. Further, the segments were classified into two length classes: small (< 0.1Mb) and medium (0.1 to 1.0 Mb). These intervals have been considered representative of inbreeding levels corresponding to identity-by-descent from > 500 and 50 to 500 generations, respectively [83].
To study the effect of the genetic variants of the Svalbard ecotype in relation to the mainland ecotype, first we filtered the VCF files using VCFtools and the parameters “minQ 30 minDP 10 maxDP 80”. Next, we generated a VCF with the 6 Svalbard samples and selected only the positions where at least 5 of the 6 samples presented only the alternative allele (genotype = 1/1) using a custom script. In parallel, to discard positions that were different in the Svalbard samples in relation to the reference individual, but not in other Norway mainland samples, we generated a VCF file with the 6 mainland Norway samples and selected all the positions where at least two individuals present the alternative allele, either in homozygosis or in heterozygosis (genotype = 1/1 or genotype = 0/1). The variants present on this later VCF were subtracted from the Svalbard VCF file. Variant Effect Predictor (VEP) v106.1 [84] was run on the resulting VCF file which contained Svalbard-specific, homozygous variants, together with the genome annotation described above. Only genes with coding variants, with more non-synonymous than synonymous substitutions were selected. OrthoMCL v2.0.9 [85] was used to identify the 1:1 orthologs of all the annotated reindeer genes in the B. taurus reference, since this is the most closely related species with a curated BiomaRt database. BiomaRt package [24] was used to map the B. taurus GO terms to orthologs of the selected reindeer genes, and the topGO package [86] was used to run a fisher test to identify significantly enriched biological processes in this set of genes. Following recommendation of the topGO authors, we did not correct for multiple testing. Functional evaluation of variant effects was performed with PolyPhen-2 [87].