Genome-Wide Analyses of the Relict Gull (Larus Relictus) Insights Into Evolutionary Implications


 BackgroundThe relict gull (Larus relictus), one of the least known Aves, was classified as vulnerable on the IUCN Red List and is a first-class national protected bird in China. Genomic resources for L. relictus are lacking, which limits the study of its evolution and its conservation.ResultsIn this study, based on the Illumina and PacBio sequencing platforms, we successfully assembled the genome of L. relictus, the first reference genome of the genus Larus. The size of the final assembled genome was 1.21 Gb, with a contig N50 of 8.11 Mb. A total of 18,454 protein-coding genes were predicted from the assembly results, with 16,967 (91.94%) of these genes annotated. The genome contained 92.52 Mb of repeat sequence, accounting for 7.63% of the assembly. The phylogenetic tree was constructed using 7,339 single-copy orthologous genes, which showed Charadriiformes located at the basal position and Philomachus pugnax as the closest relative of L. relictus. The divergence time between L. relictus and P. pugnax was ~68.44 Mya. The population dynamics of the Ordos breeding subpopulation in Hongjian Nur is a good confirmation that these birds are suffering from habitat loss and fragmentation.ConclusionsThis assembled genome will be a valuable genomic resource for a range of genomic and conservation studies of L. relictus and helps to establish a foundation for further studies investigating whether the other three breeding subpopulations have combined with the Ordos breeding subpopulation. As the species is threatened by habitat loss and fragmentation, actions to protect L. relictus are suggested to improve the fragmentation of breeding populations.


Background
The relict gull (Larus relictus) (Charadriiformes, Laridae), a middle-sized gull with a black-coloured head, had been known for nearly 50 years before it was regarded as a unique species [1]. L. relictus is one of the least known Aves [2,3]. It was classi ed as vulnerable (VU) on the IUCN Red List and is a rst-class national protected bird in China. Its population size has been estimated at 10,000-19,999 (BirdLife International, 2020), and the vast majority of L. relictus (90%) reside in Hongjian Nur [4]. Their main wintering place is situated on the west coast of the Bohai Sea [5]. A small number of winter migratory individuals have been sighted in Hong Kong [6]. Therefore, the main threats to L. relictus are overwhelming lake shrinkage on breeding grounds and at stopover sites, as well as the loss of intertidal ats on wintering grounds [3]. A novel data-driven habitat suitability ranking approach for L. relictus species using remote sensing and GIS indicated that three threat factors, road networks, developed buildings and vegetation, were the most signi cant factors in the highly suitable region for this species and supported the development of strategies or recommendations for sustainably managing the ecosystem to enhance the protection of L. relictus [7]. Because of the concentration of this population in Hongjian Nur, the birds having to disperse to breed and the phenomenon of abnormal mass mortality of L. relictus edglings in 2011 due to food shortage, habitat loss, and habitat fragmentation are inferred to be the major causes of population fragmentation [4]. On the whole-genome level, genome sequencing technology is usually used to characterize genetic variations and acquire comprehensive molecular characterizations [8]. At present, only limited genetic information, regarding mitochondrial markers and population structures, is available for L. relictus [2,4,9,10], as well as dispersal and migration information [3]. However, no genome information for L. relictus has been published, which limits our understanding of the evolutionary and molecular mechanisms of some signi cant processes.
High-throughput sequencing technology has notably reduced sequencing costs [11] and marked the start of a new era of genomic studies [12]. Among them, long-read sequencing technologies such as Paci c Biosciences (PacBio) [13] can produce average lengths of over 10,000 bp [12]. PacBio technology has been used to obtain high-quality genome assemblies for several avian species, such as Gallus gallus (Galliformes) [14] and Malurus cyaneus (Passeriformes) [15].
In this study, the rst contig-level genome of L. relictus was constructed using both Illumina HiSeq and PacBio sequencing platforms. Then, we assessed various genomic characteristics and performed comparative analyses. These genomic data will facilitate population studies of L. relictus and bene t the comprehensive protection of this vulnerable avian species.

Results
Genome sequencing and assembly Approximately 106.29 Gb of raw sequencing data were obtained using the Illumina HiSeq platform, including three 250-bp insert libraries and two 350-bp insert libraries (Table S1). The sequencing depth was 87.85X. We then used the PacBio sequencing platform to obtain long reads for assembling the genome and retained approximately 30.50 Gb raw data. After ltering out low-quality and short-length reads, the read N50 and mean read length were 12,712 bp and 8,418 bp, respectively. Finally, a 1.21 Gb assembly with a contig N50 of ~8.11 Mb was obtained for L. relictus (Table S2), with a GC content of 43.11%. The genome consisted of 1,313 contigs, with the longest contig being ~29.7 Mb long (Table   S2).
Approximately 99.97 of the clean reads could be mapped to the contigs, with 93.33-93.77% properly mapped reads (Table S3). The CEGMA analysis identi ed 416 CEGs (core eukaryotic genes), accounting for 90.83% of all 458 CEGs, and 175 CEGs (70.56%) could be detected with homology to the 248 highly conserved CEGs (Table S4). In addition, 4,555 (92.7%) of the 4,915 highly conserved Aves orthologues from BUSCO v3.0.2 were identi ed in the assembly (Table S5). These results show that the assembled L. relictus genome sequence was complete and had a low error ratio.

Genome Annotation
The consensus gene set included a total of 18,454 PCGs. The average gene length, exon length, and intron length were 20,749.08 bp, 164.24 bp, and 1,996.77 bp, respectively (Table S6). The nal prediction results revealed 17,452 (94.57%) PCGs supported by homology-based and RNA-seq-based methods (Fig.   S1), which showed a good gene prediction effect. A total of 16,967 (91.94%) predicted PCGs in the L. relictus genome were annotated and functionally classi ed by the GO, KEGG, KOG, TrEMBL and NR databases (Table S7). Noncoding RNAs were also identi ed and annotated, including 208 miRNAs (microRNA genes), 73 rRNAs and 289 tRNAs. A total of 221 pseudogenes were identi ed in the L. relictus genome.
A total of 92.52 Mb of repeat sequence was annotated, composing 7.63% of the total genome length. We found that class I transposable elements (TEs) (RNA transposons or retrotransposons) occupied ~8.22% of the genome assembly. Among class I TEs, 5.85% were LINEs (long interspersed elements), 1.12% were LTR (long terminal repeat) elements, and 0.02% were SINEs (short interspersed elements) ( Table S8). The LINE percentage of L. relictus was larger than that of Charadrius vociferus (4.53%), while SINEs were obviously less represented than in C. vociferus (0.13%) [16]. The L. relictus genome also contained class II TEs (DNA transposons), which occupied ~0.28% of the genome.

Gene families
Comparison of the L. relictus genome assembly with the genomes of seven other avian species showed that a total of 18,454 genes of L. relictus could be clustered into 12,681 gene families, including 192 unique genes belonging to 56 gene families (Table S9). The proportion of species-speci c genes within L. relictus genomes was obviously larger than that of other sampled genomes. In addition, 8,650 gene families were shared among all sampled species. The phylogenetic relationships based on 7,339 singlecopy orthologous genes indicated that Charadriiformes was basal to three other orders ( Fig. 1). L. relictus showed the closest relationship to another member of the order Charadriiformes, P. pugnax.

Positive selection genes and functional enrichment
We found that 519 single-copy orthologous genes were under positive selection in the L. relictus genome. The Gene Ontology (GO) annotation classi es the positively selected genes (PSGs) in terms of three categories: cellular component, biological process, and molecular function ( In addition, we also identi ed the biochemical pathways of the PSGs. The KEGG annotation of the PSGs suggested the presence of 29 pathways related to cellular processes (33 genes), genetic information processing (43 genes), human diseases (10 genes), environmental information processing (29 genes), metabolism (55 genes), and organismal systems (24 genes) (Fig. S2b).

Discussion
Genomic characteristics and evolution The genome size of L. relictus was similar to those of some other Aves, such as P. pugnax (1.25 Gb) [18]. The GC content of the L. relictus genome was similar to that of other Aves, such as 41.43% in Tetraophasis szechenyii [19], an average of 42.5% in P. pugnax and an average of 41.5% in G. gallus [18]. This proportion of repeat sequences is similar to that found in previous studies, in which almost all avian genomes contained lower levels of repeat elements than other animal genomes, with percentages of approximately 4-10% [16].
The timescale results indicated that the ancestral lineages of L. relictus and P. pugnax diverged approximately 68.44 Mya (million years ago) (Fig. 1), later than the divergence time between the complex clade containing the genus Larus and the clade including Philomachus [20]. The population dynamics of the Ordos breeding subpopulation in Hongjian Nur provides a good con rmation that these birds are suffering from habitat loss and fragmentation, and interference by human activity is inferred to be the major cause of population fragmentation.

Conclusions
The whole-genome sequence of L.relictus was successfully assembled employing the Illumina and PacBio sequencing platform. The size of the nal assembled genome was 1.21 Gb, with a contig N50 of 8.11 Mb and 92.52 (7.63%) Mb of repeat sequence, and 18,454 protein-coding genes were predicted with 16,967 (91.94%) of these genes annotated.
The population of relict gull (L. relictus) was expressed very low genetic diversity while lacking a large geographical population. In this study, the genome information of L. relictus which is the rst assembled reference genome of the genus Larus, will be effectively to investigate the evolutionary and molecular mechanisms of some signi cant processes in this species.

Methods
Sampling and sequencing DNA was extracted from the muscle using the CTAB method, and the DNA concentration and quality were measured using a NanoDrop 2000 and a Qubit Fluorometer, respectively. Both Illumina HiSeq 4000 and PacBio RSII sequencing pipelines were used. For the Illumina pipeline, ve short fragment paired-end libraries (three of 270 bp and two of 350 bp) were constructed using the standard Illumina pipeline. The details of library construction are as follows. The genomic DNA was broken randomly using the ultrasonic method, and target fragments were ltered. The small fragment sequencing library was constructed through the steps of end repair, addition of polyA and adaptor, selection of target-size fragments and PCR. The size and quality of the library were evaluated using an Agilent 2100 and qPCR. The Illumina HiSeq 4000 sequencer was used for sequencing, with PE=150. For the long fragment library in the PacBio pipeline, the details of library construction are as follows. The genomic DNA was sheared using g-TUBE, followed by DNA damage-repair and end-repair. The dumbbell-type adapters were ligated, and exonuclease digestion was performed. BluePippin was used to select segments to obtain the sequencing library. In addition, total RNA was extracted from the heart, liver, spleen, lung and kidney of L. relictus using TRIzol, and RNA concentrations were measured using NanoDrop 2000, Qubit 2.0 and Agilent 2100. The Illumina HiSeq 4000 platform was used for sequencing RNA data.

Genome assembly assessment
Whole-transcriptomic data from the liver and an equal mix of ve tissue RNA samples were used to assist genome annotations. Raw reads were ltered to remove adapter sequences and low-quality data, with clean reads assembled using Trinity [26]. After ltering out low-quality and short-length PacBio reads, LoRDEC [27] software was used for error correction of PacBio data employing HiSeq data. The HiSeq data were preliminarily assembled by Platanus [28] software. Using dbg2olc [29] software, mixed assembly was carried out by using the data after error correction and the preliminary assembly results of HiSeq data. Pilon [30] software was used to correct the assembly results using HiSeq data. To assess the completeness of the L. relictus genome assembly, we used two methods, with the rst remapping the Illumina paired-end reads to the assembled genome and the second employing CEGMA v.2.5 [31] and BUSCO v3.0.2 databases.

Phylogeny and positively selected genes
We used the whole-genome sequence of L. relictus and seven published whole-genome sequences of three Charadriiformes species (L. relictus, P. pugnax and C. vociferus), three Pelecaniformes (Phaethon lepturus, Nipponia nippon and Egretta garzetta), one Gruiformes (Mesitornis unicolor) and one Procellariiformes (Fulmarus glacialis). OrthoMCL was used to cluster gene families [61]. A total of 7,339 single-copy orthologues were identi ed, with protein sequences used for constructing phylogenetic trees.
The protein sequences were aligned using MUSCLE [62] and then concatenated into a combined dataset. We then constructed the phylogenetic tree using the ML (maximum likelihood) algorithm with the JTT amino acid substitution model implemented in PhyML [63].
Based on the results of phylogenetic trees, divergence time was estimated using the MCMCTree program in PAML [64]. Divergence times and ages of fossil records were derived from TimeTree (http://www.timetree.org/) and applied as the time control. In addition, the CodeML program in PAML [65] included single-copy genes to detect positively selected genes in the L. relictus genome. Among them, the branch model (model=2, NSsites=0) was used to calculate ω0 (dn/ds) of the foreground clade and obtained the average ω1 of other clades. Then, model=0 and NSsites=0 were used to evaluate the ω2 of the whole tree, and those genes with ω0>ω2 were selected.

Availability of data and materials
The authors declare that the data supporting the nding of this study are available in the article and its supplementary information les, and are available from the corresponding author.
Ethics approval and consent to participate Not applicable.