Nightingale Sampling
Three male individuals from each nightingale species were sampled for whole genome sequencing in allopatric regions (North-Eastern Poland for L. luscinia and South-Western Poland for L. megarhynchos). From each individual, somatic tissue (kidney) and gonadal tissue (testis) were dissected and either used immediately for DNA isolation or frozen in liquid nitrogen and stored in 80°C prior to DNA isolation. The work was approved by the General Directorate for Environmental Protection, Poland (permission no. DZP-WG.6401.03.123.2017.dl.3).
Preparation of meiotic spreads and estimation of GRC size
Measurements were made of immunostained synaptonemal complexes of pachytene chromosomes from Poignet et al. (2021). Chromosomes were immunostained with anti-SYCP3 antibody recognizing the lateral elements of the synaptonemal complex, and human anticentromere serum (CREST, 15–234, Antibodies Incorporated) binding kinetochores (see Poignet et al. 2021 for details). The GRC can be recognized from other chromosomes on these slides by its relatively weaker staining by anti-SYCP3 antibody and the CREST signal which covers the whole chromosome, instead of just the centromere.
The lengths of the 22 largest chromosomes and the GRCs were measured in high-quality cells for each species using ImageJ software (ImageJ 1.50i, Rueden et al. 2017). This resulted in 15 L. megarhynchos cells (12 from one individual and 3 from another) and 16 cells from L. luscinia (9 from one individual and 7 from the other) being used. The measured length of the GRC was divided by 1.5, due to a measurement discrepancy caused by its univalent nature (Malinovskaya et al. 2020), before the size was calculated using a linear regression (see Supplementary Fig. 2). This used the relationship between the logarithmic values of the 22 longest chromosomes lengths and the logarithmic size in base pairs of the 22 largest chromosomes from the F. albicollis genome, FicAlb1.5 (R2 = 0.97 in L. megarhynchos and R2 = 0.98 in L. luscinia). The approximate size was checked against the size of the eliminated GRC micronucleus using rabbit monoclonal anti-H3K9me3 antibody (ab8898, Abcam) (dilution 1:200).
Sequencing of somatic and germline genomes
We sequenced DNA from somatic (kidney) and gonadal (testis) tissues from three individuals of each species. One individual from each species had DNA from both tissues sequenced using 10x linked (Zheng, et al. 2016) Illumina sequencing technology, while the other two individuals had DNA samples sequenced with standard paired-end Illumina sequencing.
For 10x linked sequencing, high molecular weight DNA was extracted from frozen testis and liver samples using a phenol-chloroform methodology (Pajer, et al. 2006). The DNA was sent to SEQme (Dobris, Czech Republic) for 10x linked sequencing library construction and 2x150 bp paired-end sequencing using the NovaSeq 6000 (Illumina). For the standard Illumina sequencing, DNA was extracted from frozen tissue samples using MagAttract HMW DNA Kit (Qiagen) and sent to the Institute of Applied Biotechnologies (Prague, Czech Republic) where the sequencing libraries were prepared using NebNext Ultra II DNA Kit (New England Biolabs) and sequenced with the NovaSeq 6000 (Illuimna) using 2x150 bp paired-end mode.
Testis samples were sequenced to higher depth (105-150x) than the kidney samples (45-120x) to ensure sufficient coverage over the GRC (see Supplementary Table 2).
Identification and assembly of GRC reads
Linked 10x reads that originated from the GRC were identified using the following methods before being assembled using Supernova (Weisenfeld, et al. 2017) and the “megabubbles” option (method visualised in Supplementary Fig. 3):
1) The testis samples were assembled using the 10x linked reads and Supernova (Weisenfeld, et al. 2017). The 10x reads from the testes and kidneys were then processed using Long Ranger v2.2.2, trimmed and checked for adapters using Trimmomatic v0.39 (Bolger, et al. 2014), before being aligned to their respective genome using bwa v0.7.17 (Li and Durbin 2010). Regions of the testis genome assembly were identified using Samtools v1.14 (Li, et al. 2009) which were fully covered by reads from the testis dataset while having no reads align to them from the somatic dataset and that were at least 500 bp long. Reads from testis samples that overlapped these regions by at least 10 bp were used to identify 10x barcodes and their associated reads as originating from the GRC.
2) Sequencing reads from 10x libraries were processed using Long Ranger v2.2.2 before all reads were trimmed and checked for adapters using Trimmomatic v0.39 (Bolger, et al. 2014). These reads were then aligned to draft somatic genomes from their respective species created using Oxford nanopore data and Illumina reads (Rídl, et al. unpublished). SNP variants were identified using GATK v4.1.7.0 (Poplin et al, 2017). If a variant was present in all three testis samples from a species, but in none of the kidney samples from that species, it was considered to be a GRC variant. These SNP variants were used to create 29 bp kmer sequences (i.e. each variant resulted in 29 kmers). Any kmer that was present in the 10x kidney reads was removed and the remaining kmers were used to identify reads from the 10x testis dataset that had a matching sequence. The barcodes from these reads were used to identify any associated reads.
3) A sample of 100 000 reads from each of the 10x datasets was used to identify repetitive elements that might be unique to the GRC of each species using RepeatExplorer (Novak, et al. 2010). While no such repeats were found in L. luscinia, a candidate repeat was found in the testis dataset of L. megarhynchos. This result was confirmed using all the 10x reads from L. megarhynchos and Blastn (Altschul, et al. 1990), with a word size of 8, an e-value of 1e-5, and a max hsps of 1. Any read that matched with at least 100 bp and greater than 90% identity to the repetitive element was selected. Once again, all reads with the associated barcodes were designated as having originated from the GRC.
Gene Annotation
The first frame of the F. albicollis transcriptome coding sequences (v FicAlb1.5) were aligned independently to the two assembled GRCs using Tblastx (Altschul, et al. 1990) and an e-value cut-off of 1e-6. Overlapping alignments on the same strand of the GRC were merged. The resultant regions were aligned back to the F. albicollis coding sequences and the top hit in the positive strand selected to identify which gene (and which portion of the gene) the exon represented.
In order to account for a possible lack of conservation, when calculating what proportion of a gene was present on the GRC, the fraction of a gene that was found in the GRC was normalised by the fraction of the gene that was found in the L. megarhynchos genome, to a minimum of 0.75. In other words, if 80% of a gene was found in the genome, and 80% was found in the GRC, that gene was treated as being 100% present in the GRC. The cut-off of 0.75 fully corrects for 2/3rds of all genes identified on the genome. This correction resulted in an average increase in the measured proportion of the found gene of 8.5% in L. megarhynchos and 8.9% in L. luscinia.
Coverage of GRC scaffolds
In order to identify scaffolds that represent near-identical duplicates and/or erroneous sequences in the GRC assembly, the assembled GRC sequence for each species was combined with the corresponding draft somatic genome assembly (Rídl et al. unpublished). The sequencing reads from both tissue types for all three individuals were aligned to the combined genome and GRC assembly using bwa v0.7.17 (Li and Durbin 2010) for each species. For each individual, regions of the GRC which had zero read coverage from the kidney dataset were identified. The modal testis coverage value of these regions for each individual was used as an estimate of the expected coverage for single copy GRC regions. The ratio of the expected GRC coverage to the modal genomic coverage was used as a proxy for the “expected” GRC coverage in the kidney samples. These expected coverage values were used to normalise the observed coverage for each sample. The coverage of the kidney sample was then subtracted from the coverage of the testis sample to control for A chromosomal reads misaligning to the GRC sequence. Finally, the average GRC coverage was calculated across 1 kb windows along the two GRCs, with a minimum value of zero.
GRC Scaffold Origin and Conservation
Given that GRC sequences appear to originate from A chromosomes, the GRC scaffolds were aligned to the draft L. megarhynchos and L. luscinia somatic genomes (Rídl et al. unpublished) using Blastn and an e-value cutoff of 1e-6 (Altschul, et al. 1990). Additionally, the two GRCs were aligned to each other using Blastn. The A chromosomal origin of the GRC sequences were determined by the top hit in the L. megarhynchos genome, since it was the higher quality genome assembly. The genomic scaffold was then aligned to the F. albicolli genome assembly to determine the chromosome identity using Nucmer from the MUMmer package (Kurtz et al. 2004).
cpeb1 Evolution
The cpeb1 gene sequence was determined from the Tblastx results with the XP_005051706.1 transcript from the two GRCs and their respective genomes. The GRC sequence was used to identify homologous sequences in related species using the ‘nr’ database on NCBI and Tblastx. For each species, the top hit was selected. These species included: Acanthisitta chloris, Atrichornis clamosus, Calyptomena viridis, Corvus cornix cornix, Gallus gallus, Lonchura striata domestica, Sapayoa aenigma, Serilophus lunatus, Serinus canaria, Struthio camelus australis, Taeniopygia guttata and Tyrannus savana. The sequences were aligned using ClustalW (Thompson, et al. 1994) and a maximum likelihood tree drawn with MegaX (Kumar, et al. 2018).