Sample collection, laboratory work and sequencing
Blood samples were collected from white-tailed eagle chicks as a part of an ongoing monitoring program in Iceland since 2001 by the Natural History Institute of Iceland. The sex of the chicks was determined in the field based on morphology. Three to ten mL of blood was extracted from each chick. The blood was stored in EDTA buffer at -20 degrees until DNA extraction.
DNA from blood samples from 135 chicks were extracted using the ThermoFisher GeneJET Whole Blood Genomics DNA Purification Mini Kit following the standard protocol [44]. DNA concentration was estimated using the NanoDrop 1000 and run on 0.7% agarose gels to evaluate the fragment size. Samples with concentration higher than 60 ng/µl were selected for library preparation and sequencing. The 133 of 135 extracts were double digest restriction-site associated DNA sequenced on Illumina HiSeq2500 (see supplementary text 1 for full description).
Two individuals of white-tailed eagle (the last two of 135), a male and a female, were selected for high-depth whole genome shotgun sequencing with two lanes each on an Illumina HiSeqX. Library preparation and sequencing was done at deCODE genetics, using the TruSeq Nano sample preparation method [45].
Two reference assemblies from male golden eagles (ZZ), one in 1142 scaffolds and one assembled to chromosome level (GenBank Assembly Accession numbers: GCA_000766835.1 and GCA_900496995.2, respectively) and female chicken (ZW) (GenBank Assembly Accession: GCA_000002315.3) were downloaded from NCBI and used in the analysis [9, 46].
Sequence cleaning and mapping
The white-tailed eagle RADseq data was demultiplexed, sorting sequence reads into individual files, both for forward and reverse sequences using the command ‘process_radtags’ in Stacks version 1.47 [47, 48]. Default settings were used for the RADseq data, applying the option “r” to rescue barcodes and RAD-tags.
After demultiplexing, FastQC[49] was run for quality control. For the RADseq data, an excess of specific sequences (kmers) were removed using AdapterRemoval v2 (version 2.2.2) [50]. The high depth shotgun sequenced individuals were tested in the same way but found no excess of kmers.
The Burrows-Wheeler Aligner (BWA) mem and SAMtools version 0.7.17-r1188 and 1.7, respectively [51, 52] were used to process RADseq and high depth shotgun data and map reads to the golden eagle scaffold assembly of 1142 scaffolds with no identified chromosomes (GCA_000766835.1) [9] using default settings in both instances.
Four different approaches to find the Z-chromosome - Depth, Heterozygosity, Mapping and SNP-loadings
Four different approaches were used to identify scaffolds in the white-tailed eagle genome belonging to the Z-chromosome, by comparison with the golden eagle scaffold assembly with no chromosomes (GCA_000766835.1). An assembly consisting of 1,141 assembled scaffolds, excluding mtDNA, and a total of 1,192,725,744 bp, ranging in size from 913 to 30,727,332 bp with a median of 5,587 bp, and average length of 1,045,334 bp (SD 3,203,066 bp). An overview of the methods is presented in Fig. 5 and the data used in each analysis is available in supplementary Table S1.
Depth. For the high-depth white-tailed eagle sequencing data, the average autosomal sequencing depth was estimated for the male and female separately, as the mode of the number of mapped reads per position across all scaffolds, based on results from the command “bedtools coverage” from Bedtools v2.18.2 [53]. Using these averages, 195 for the female and 181 for the male, the relative sequencing depth was calculated for each position in each scaffold for both individuals. The per-scaffold relative sequence depth was then estimated for the female and male, separately, as the mode across positions. Positions in autosomal scaffolds are expected to have a relative depth of 1 in both sexes, whereas Z-chromosomal scaffolds are expected to have a relative depth of 0.5 in females and 1 in males. As the estimate of relative depth may be less reliable for smaller scaffolds, the dependency of the relative mode depth due to scaffold size was analysed by calculating the variance in the depths per interval of scaffold sizes, transformed to a log scale. The distribution of the proportions of scaffolds at each interval was summarized with a cumulative percentage curve. In addition, the depth per scaffold was evaluated by comparing the per-scaffold relative sequencing depth between the two individuals: male over female. Scaffolds with a relative sequencing depth below 0.25 and above 1.5 were removed (corresponding to 523 scaffolds, and 0.47% of the genome). This ratio is expected to be around two for Z-chromosomal scaffolds and one for the autosomal scaffolds, as the male has two copies of Z and the female one. Thus, a cut-off was set at 1.5.
Heterozygosity. Sex differences in heterozygosity were assessed by comparing numbers of heterozygous sites per scaffold based on genotypes of the high-depth white-tailed eagle male and female, called using Graphtyper [54, 55] with default settings. The variation on the Z-chromosome is expected to be ¾ of the autosomes and it should be restricted to the male, except for the PAR and non-recombining homologous regions. As scaffolds vary in length and may include short variable regions, the variation was also analysed per 50kb window. Genotypes were filtered for quality using vcftools and bcftools version 0.1.15 and 1.7, respectively [56, 57] before counting, using minimum GQ score 20, minimum Q score 1000, missingness 1 (both individuals had to have a valid genotype at the site), mapping quality equal or above 60 (MQ), and only biallelic sites. Two additional criteria were applied to remove sites with likely spurious heterozygous genotypes. First, heterozygous genotypes where the number of mapped reads deviated significantly from the mode depth of the scaffold, based on a two-sided Poisson test (P < 0.01) were excluded. Second, we used a binomial test to assess whether the proportion of reads in heterozygous genotypes, either in the male or the female, deviated from the 50/50 expectation, using P < 0.05 as the exclusion threshold.
Mapping. In order to assign the short reads from the white-tailed eagle to chromosomes, the 1142 scaffolds from the golden eagle scaffold assembly (which the white-tailed eagle genome had been mapped on) were mapped to the chicken genome, which has assigned chromosomes, using LASTZ [26]. Standard settings were used with the following modifications: ambiguous = iupac, gfextend, chain, gapped. Scaffolds in the golden eagle which mapped better to the Z-chromosome than any other chromosome, measured as most bases mapped, were deemed to belong to the golden eagle Z-chromosome.
SNP-loadings. A PCA analysis of 133 low-depth RAD sequenced white-tailed eagle individuals was constructed using PCangsd version 1.0 [58], an extension of ANGSD [59], as described below. A clear split between males and females was observed along the first principal component (PC) (Figure S1). Loadings obtained with PCangsd were used to identify which parts of the scaffolds induced the split, with the “-selection” option [58] and with sites passing the following filters: a minimum 25% of individuals had to have valid genotypes, only unique mapping sites, base quality minimum 20, mapping quality minimum 30, SNP p-value 1e-6. ANGSD uses genotype likelihoods to tackle the restrictions of low depth [59, 60]. To assess which scaffolds contributed to the split on the first axis (PC1), a 95% range of loading values for all SNPs per scaffold was calculated using R and compared between scaffolds with more than 50 SNPs. The distributions of the range of loading values were summarized with accumulation curves, combined for all scaffolds, and separately based on the results obtained by the mapping on the autosomes and Z chromosome. Scaffolds were assigned to the Z-chromosome or autosomes depending on whether the range-values were above or below a threshold of three standard deviations from the mean (covering ~ 99% of a normally distributed variable).
Comparison of the four methods. To evaluate how well the four approaches performed, the golden eagle scaffold assembly (GCA_000766835.1) was mapped to a golden eagle genome with known chromosomes (GCA_900496995.2) using LASTZ with the same settings and cutoff as described previously. In what follows, the outcome of this mapping was used as the true chromosome identity of the 1141 scaffolds that was used to assess the accuracy of our four different approaches to identify Z chromosome scaffolds (Fig. 5 and Table 2). A total of 168 scaffolds were assigned to the Z-chromosome, with a total length of 86,839,530 bp (mean = 516,902, sd = 1,509,132, and median = 5,236), which is slightly smaller than the Z-chromosome in the newly released genome of 88,216,475 bp (GenBank Assembly Accession: GCA_900496995.2). The autosomal loci mapped to 973 scaffolds of a size of 1,105,886,214 bp (mean = 1,136,574, sd = 3,403,676, and median = 5,674).