CpGenomes provide an essential resource for studying plant evolution
Chloroplasts are essential plant organelles and the site of photosynthesis. Their genomes have a simple structure, a small size (~110–165 kb) containing ~90–110 protein-coding genes21, and highly conserved gene regions across species, owing to their non-recombinant, haploid, and uniparentally inherited nature22. These genomic characteristics have given chloroplasts an important role in research on plant origin, evolution, and the phylogenetic relationships among plant species23,24. Many studies have used chloroplast genes to construct phylogenetic trees of plants. For example, Jansen et al25 used 81 chloroplast genes to estimate relationships among the major angiosperm clades; Saarela et al26 found weak support for Amborella as the basal-most angiosperm lineage using 17 plastid genes and the nuclear gene phytochrome C (PHYC). As chloroplast research has deepened, more and more researchers have focused on the complete chloroplast sequence27–29. Kane et al30 suggested that the whole CpGenome could serve as an ultra-barcode for identifying plant varieties.
Hybridization-based probes for target enrichment in large-scale CpGenome sequencing
Chloroplast DNA has traditionally been isolated by sucrose gradient centrifugation6 or by the high-salt method7. Another approach amplifies the entire chloroplast DNA from total cellular DNA by long PCR, using primers designed on conserved sequences8. These methods are not suitable for large numbers of samples because preparing chloroplast DNA in this way is labor-intensive and consumes substantial material resources. Chloroplast reads can also be identified from whole-genome sequencing (WGS) reads by aligning the WGS data to a reference CpGenome. This is a demanding bioinformatics task that requires a closely related reference CpGenome, so it is not suitable for species that lack a closely related reference or whose reference genome sequences are of poor quality. Moreover, because chloroplast reads represent only a small fraction of WGS data, assembling only the CpGenome in this way generates a great deal of unused sequencing data, which consumes much of the sequencing capacity and reduces the efficiency of parallel chloroplast sequencing. Therefore, most existing methods for obtaining DNA and sequencing data for whole CpGenomes cannot meet the needs of large-scale CpGenome sequencing, which greatly limits in-depth research on plant genetics and evolution.
Target enrichment before sequencing is a useful method that allows in-depth analysis of specific portions of the genome. Moreover, a set of universal probes covering the whole CpGenome across the species of a tribe allows the target enrichment strategy to realize its advantages. Large-scale target enrichment of CpGenomes with universal probes is cost-effective and can provide high density and high coverage.
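The cost argument above can be illustrated with a short calculation. The following is a minimal sketch, not the procedure used in this study: it assumes that the sequencing effort needed to reach a given chloroplast depth scales inversely with the fraction of reads that are chloroplast-derived. The function name `bases_required` and every numeric value (target depth, chloroplast fraction before and after enrichment) are hypothetical and chosen only for illustration; only the genome size is taken from the ~110–165 kb range cited earlier.

```python
def bases_required(genome_size_bp: int, target_depth: float,
                   on_target_fraction: float) -> float:
    """Total sequenced bases needed so that on-target (chloroplast)
    reads alone reach the requested mean depth."""
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    return genome_size_bp * target_depth / on_target_fraction


# Hypothetical values for illustration only (not measurements from this study).
CP_GENOME_BP = 150_000     # within the ~110-165 kb range cited in the text
TARGET_DEPTH = 100         # assumed desired mean depth
WGS_FRACTION = 0.05        # assumed chloroplast share of unenriched WGS reads
ENRICHED_FRACTION = 0.80   # assumed chloroplast share after probe enrichment

wgs_bases = bases_required(CP_GENOME_BP, TARGET_DEPTH, WGS_FRACTION)
enriched_bases = bases_required(CP_GENOME_BP, TARGET_DEPTH, ENRICHED_FRACTION)

# Ratio of sequencing effort without vs. with enrichment.
saving_fold = wgs_bases / enriched_bases
print(f"WGS: {wgs_bases:.3g} bases, enriched: {enriched_bases:.3g} bases, "
      f"saving: {saving_fold:.1f}-fold")
```

Under these assumed fractions the saving equals the ratio of the two on-target fractions; real values depend on the species, tissue, and library preparation.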
Efficient target enrichment and comparative analysis of CpGenomes for different clades
Since the first complete CpGenome, that of Nicotiana tabacum, was sequenced32, more than 3,000 chloroplast genomes have been released31. To design probes, we chose 99 representative CpGenomes, including 6 bamboo CpGenomes, from the 3,654 published CpGenomes. These vascular plants included 7 clades (Lycopodiophyta, Moniliformopses, Gymnosperms, Basal angiosperms, Monocots, Eudicots, and Magnoliidae), belonging to 57 families and 40 orders. The alignment of the CpGenomes of the 7 clades to the Arabidopsis thaliana CpGenome shows structural variation of the CpGenome during evolution and indicates differences among the clades (Fig. 1). This structural variation indicates that deriving the pan-CpGenome from CpGenomes of distinct clades is essential, because including more divergent sequences gives the pan-CpGenome broader applicability. At 146–150 kb, 124–129 kb, and 88–92 kb, Poaceae showed alignment gaps relative to the rest of the Monocots, the ANA grade, Magnoliids, and Eudicots. Moreover, Ferns, Horsetails, Gymnosperms, and Lycophytes showed only fragmentary sequences at the corresponding positions. This may suggest that the corresponding CpGenome regions became complete in angiosperms during evolution and were subsequently lost uniquely in Poaceae. However, this observation should be further tested with a broader set of reference genomes and expanded sampling.
In pan-CpGenome construction, unique sequences were selected, and the final pan-CpGenome size was ~15 Mb. A total of 180,519 probes were designed and synthesized using a new hybridization-based approach to enrich chloroplast DNA fragments. Evaluation of the quality of the probes and the pan-CpGenome showed a high mapping ratio, which was stable and efficient in bamboo CpGenomes. Beyond bamboos, the additional plant CpGenomes included in the pan-genome construction step expanded the sequence variation covered and thus the universality of the probes. Accordingly, the probes also had high mapping rates in some other orders, such as Malvales, Rosales, Pinales, and Poales, indicating the applicability of the probes in these clades. Conversely, lower mapping rates were found in Nymphaeales, Solanales, Schizaeales, and Lamiales, among others, which may be due to too few or poor-quality CpGenomes from these orders in the pan-CpGenome construction. This could be addressed by adding CpGenomes from these orders to expand the divergent sequences in the pan-CpGenome, or by relaxing the parameter restrictions. Comparison of the assembled CpGenomes with their published counterparts demonstrated a mapping coverage of over 98%, further confirming the efficiency of the probes in enriching chloroplast DNA fragments. In general, this pipeline of pan-CpGenome construction, pan-CpGenome-based probe design, and CpGenome enrichment performed well in bamboo CpGenomes and suggests a strategy for acquiring CpGenomes at large scale across green plants.
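The two evaluation metrics discussed above, mapping rate and mapping coverage, can be made concrete with a small example. This is a minimal sketch under stated assumptions, not the evaluation code used in this study: the mapping rate is taken to be mapped reads divided by total reads, and mapping coverage is taken to be the fraction of reference positions covered by at least one alignment. The function names, the half-open interval convention, and the toy numbers are all hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the reference


def mapping_rate(mapped_reads: int, total_reads: int) -> float:
    """Fraction of sequenced reads that align to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_breadth(intervals: Iterable[Interval],
                     reference_length: int) -> float:
    """Fraction of reference positions covered by at least one alignment.
    Assumes all intervals lie within [0, reference_length)."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / reference_length


# Toy data (hypothetical): a 1,000 bp reference and three aligned blocks.
toy_alignments = [(0, 400), (300, 700), (720, 1000)]
toy_breadth = coverage_breadth(toy_alignments, 1000)
toy_rate = mapping_rate(mapped_reads=900, total_reads=1000)
print(f"breadth = {toy_breadth:.2%}, mapping rate = {toy_rate:.2%}")
```

In practice the aligned intervals would come from an alignment file, and the precise definition of a mapped read (for example, any mapping-quality filter) should be stated alongside the reported rate.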
Bamboo CpGenomes could provide additional information on large-scale phylogenetic relationships
There are more than 500 bamboo species in China, which play significant roles in the economy, ecology, culture, aesthetics, and technology33,34. Bambusoideae is one of the three Poaceae subfamilies that make up the BEP clade35. Bamboo remains one of the most challenging groups for plant taxonomists and field botanists36, owing to infrequent, incongruent, and unpredictable flowering events and to diverse vegetative characters, which may result from frequent hybridization in bamboos36,37. As a useful strategy in phylogenetics and species classification, sequence-based phylogenetic analysis has been performed in bamboos over the past decades. Extensive sampling and sequencing of the plastid genome has been a major effort in the genetic, phylogenetic, and classification analysis of bamboo. We constructed a phylogenetic tree of 412 samples covering more than 300 species in 40 genera, which is the largest sampling project of bamboo in China and provides a large-scale phylogenetic tree of bamboos. According to the phylogenetic tree, XI (Ampelocalamus calcareus) is the earliest diverging Arthrostylidiinae species, consistent with previous studies19,20,38. The phylogenetic tree supports the (Arundinarieae (Bambuseae, Olyreae)) pattern, which is consistent with previous studies based on smaller-scale plastid sequences and suggests that woody bamboos are a non-monophyletic lineage35,39–41. The results also show that this pattern is stable and may not change under expanded sampling. In contrast, phylogenetic trees based on nuclear sequences suggested a basal position of Olyreae in Bambusoideae and a monophyletic origin of the woody character of bamboo36,42. To resolve this conflict, future analyses should focus on changes in gene duplication and genome structure, caused mainly by multiple hybridization events in bamboo, using greatly expanded sampling and genome-wide sequences.
Additionally, there is a fundamental demand for a bamboo tree of life, especially in China, which has the world’s largest area of bamboo plantations43.
The Phyllostachys genus, with 59 species, is the most economically important among bamboos44–46. Phyllostachys edulis is the most significant Phyllostachys species, accounting for ~73.8% of the bamboo-growing area in China (4.43 million ha), and is the most abundant non-wood resource34. This study included 102 Phyllostachys CpGenome sequences, covering more than 90% of Phyllostachys species, and provides an unprecedented opportunity to expand taxonomic knowledge of the genus. Traditionally, Phyllostachys has been divided into two groups, P. sect. Phyllostachys and P. sect. Heteroclada, based on morphological features such as inflorescences and rhizomes47,48. However, this classification is controversial because some morphological features are intermediate between the two groups44,47. Compared with the traditional taxonomy, the species tree we constructed showed different phylogenetic relationships in P. sect. Phyllostachys and P. sect. Heteroclada; specifically, species of the two groups were intermixed in the species tree. The incongruence between morphological taxonomy and the phylogenetic tree may be due to complex evolutionary processes or to taxonomic treatments. In total, 13 non-Phyllostachys species, including Indocalamus pedalis, Oligostachyum oedogonatum, and Pleioblastus solidus, were found within the Phyllostachys clade, all of them scattered in Phy-II. The presence of numerous non-Phyllostachys species may indicate that the Phyllostachys genus is not monophyletic. This supports the non-monophyly of Phyllostachys proposed in previous studies of plastid sequences37,49,50 and conflicts with previous results based on non-genome-wide nuclear sequences or morphological features44,47,48. The classification should be treated carefully because of the evolutionary complexity of bamboos. Moreover, incongruence between plastid and nuclear gene phylogenies in Arundinarieae was reported in a previous study19.
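The reasoning behind the non-monophyly inference can be sketched in a few lines. This is a minimal illustration, not the analysis performed in this study: it assumes a rooted tree written as nested tuples and treats a group as monophyletic only if some clade contains exactly the members of that group and nothing else. The tree topology and all taxon labels below are hypothetical placeholders.

```python
from typing import FrozenSet, Iterable, Set, Tuple, Union

# A rooted tree is either a leaf label or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def clade_leaf_sets(tree: Tree, out: Set[FrozenSet[str]]) -> FrozenSet[str]:
    """Record the leaf set of every clade in `out`; return this clade's leaves."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(
            *(clade_leaf_sets(child, out) for child in tree)
        )
    out.add(leaves)
    return leaves


def is_monophyletic(tree: Tree, group: Iterable[str]) -> bool:
    """True if some clade of the rooted tree contains exactly `group`."""
    clades: Set[FrozenSet[str]] = set()
    clade_leaf_sets(tree, clades)
    return frozenset(group) in clades


# Hypothetical toy tree: one non-member taxon is nested among group members.
toy_tree: Tree = ((("Phyllostachys_A", "Indocalamus_X"), "Phyllostachys_B"),
                  "Outgroup")

genus_only = {"Phyllostachys_A", "Phyllostachys_B"}
genus_plus_nested = genus_only | {"Indocalamus_X"}

print(is_monophyletic(toy_tree, genus_only))         # nested taxon breaks the clade
print(is_monophyletic(toy_tree, genus_plus_nested))  # clade including the nested taxon
```

In the toy tree the genus-only set fails the check because the smallest clade containing both members also contains the nested taxon, which mirrors the situation described for the non-Phyllostachys species found in Phy-II. The result depends on the rooting and on how well the relevant branches are supported.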
Although the species tree we constructed covers more than 90% of Phyllostachys species, the taxonomy of the Phyllostachys clade should be further tested with phylogenies based on genome-wide nuclear genes.