Characterization of PEBP genes in Rosaceae species
By combining HMM and BLAST searches, we identified 56 PEBP-like proteins across nine Rosaceae tree species (Table 1). Each putative gene was validated by blasting against SMART, Pfam and NCBI CDD to ensure that they contained complete PEBP domain. We then assigned all Rosaceae PEBPs to their closest Arabidopsis homologs (Figure 1; Table 1). In total, these Rosaceae PEBPs included 12 FT/TSF-like, 11 TFL1-like, 11 CEN-like, 10 MFT-like and 12 BFT-like genes (Table 1). TFL1 and CEN-like genes showed the highest identities of 72.25-80.0% with their Arabidopsis orthologs, while BFT-like proteins showed the lowest identities of 62.07 to 67.82% compared with AtBFT. Five to six PEBPs were detected among Prunus species, while the average number of PEBPs almost doubled in M. domestica and Pyrus communis (Table 1). The duplicated paralogous gene pairs, such as MdTFL1 and MdTFL2, PcTFL1 and PcTFL2, were retained in the genomes of M. domestica and Pyrus communis, while only one copy of MFT was present in both species (Table 1).
Phylogenetic analyses
Phylogenetic trees were constructed based on protein sequence alignment of Arabidopsis and Rosaceae PEBPs using three approaches: the neighbor-joining, maximum likelihood, and Bayesian inference methods (Figure 2; Figure S1). All three phylogenetic trees shared similar topologies (Figure 2; Figure S1). The phylogenetic trees showed that the total of 62 PEBP proteins can be clustered into three major clades, which are the FT-clade, TFL1-clade, and MFT-clade (Figure 2). The FT-clade could be further split into FT/TSF-like genes and BFT-like genes, and the TFL1-clade can be split into TFL1-like and CEN-like subfamily genes (Figure 2). Within each subfamily, the genes of Prunus species first group closely together, then group with the genes of Maleae species including M. domestica and Pyrus communis, and finally group with genes of R. occidentalis and Arabidopsis genes (Figure 2). Among Prunus PEBPs, proteins within the same subgenus tend to group together, for example, P. dulcis and P. persica from the Amygdalus subgenus, P. armeniaca and P. mume from the Prunus subgenus, and proteins of P. yedoensis and P. avium from the Cerasus subgenus (Figure 2). The duplicated paralogous gene pairs from M. domestica and Pyrus communis within the TFL1, CEN, and BFT subfamilies were grouped separately, for example, PcTFL1-MdTFL1 and PcTFL2-MdTFL2 form separate clusters, rather than genes of the same species grouping together (Figure 2).
Structural analysis of PEBP family genes
Rosaceae PEBP family genes displayed conserved gene structures and high amino acid sequence similarity (Figure 3; Figure S2). The length of the coding regions of PEBPs ranged from 507 to 576 bps, with FT-like genes falling between 522 to 543 bps, MFT-like genes between 507 to 576 bps, BFT-like genes between 519 to 525 bps, TFL1-like genes between 516 to 519 bps, and CEN-like genes between 519 to 522 bps. All PEBP genes have a rather loose gene structure consisting of four exons and three introns (Figure S2). For example, BFT-like genes harbor the shortest intron total lengths, ranging from 522 to 534 bp (Figure S2).
Sequence alignment revealed a high degree of conservation across the entire protein and within the PEBP domains present in all 65 genes (Figure 1; Figure 3). The phylogeny structure inferred from the alignment of PEBP domains was generally in accordance with that of the whole protein sequence alignment, suggesting the PEBP domain as the major factor driving the evolution of Rosaceae PEBPs (Figure 3a). Five motifs covering 160 amino acids were identified by the MEME program among Rosaceae PEBP proteins (Figure 3b). Among these, Motifs 1, 2, 4, and 5 together spread over the whole PEBP domain (Figure 3b). Motifs DPDXP (Asp-Pro-Asp-X-Pro) and GIHR (Gly-Ile-His-Arg), which are essential for anion-binding activity, were present in the fourth exon of the PEBPs (Figure 1). We also found that residues distinguishing FT-like genes from TFL1-like proteins were conserved among the two gene lineages (Figure S3). Previously reported key residues conferring the flowering-promoting role of FT including V76, Y91, E115, L134, Y140, G144, W145, Q147, and N159 were present in all Rosaceae FT-like proteins (Figure S3) [20, 43, 54]. The corresponding residues (I/T)76, H91, E115, (K/N/T)134, (F/N)140, (P/S)144, S145, D147, and D159 were found in all TFL1-like proteins (Figure S3). Residues determining the 14-3-3 receptor binding interface (R68, F107, R137) were shared by both protein types (Figure S3).
Microsynteny and duplication analysis of PEBP genes
To understand the evolution origin of PEBP family genes, we performed inter- and intra-genomic synteny analysis with MCScanX for Arabidopsis and seven Rosacea species with chromosome-level genome assemblies. We observed large interspecies collinear blocks between four Prunus species, P. avium, P. persica, P. armeniaca, and P. mume, which indicates high level of macrosynteny among Prunus species (Figure S4). The genome comparisons between R. occidentalis and M. domestica and between M. domestica and P. avium revealed large-scale chromosomal rearrangements including translocation and fusion-fission events that possibly occurred during the genome evolution of Rubus, Malus, and Prunus genera (Figure S4). Based on intra-genomic comparisons, we classified the duplication origin of orthologous gene pairs for Arabidopsis and other Rosacea species (Table S1). Among all duplication types, whole-genome duplication (WGD)/segmental duplication was the major type for M. domestica, tandem duplicated genes were mostly found in A. thaliana, P. armeniaca, and P. persica, and dispersed duplication events were enriched in the genomes of R. occidentalis, P. mume, P. avium, and P. dulcis (Table S1).
Furthermore, we characterized the duplication modes of PEBP family genes across species (Table S2; Figure 4; Figure S5). In Arabidopsis, R. occidentalis, and P. armeniaca, all PEBP gene members were predicted to be originated from dispersed duplications (Figure S5; Table S2). In four Prunus species, FTs, MFTs, and BFTs were classified as having dispersed duplication, while TFL1-like and CEN-like genes were classified as exhibiting WGD/segmental duplication (Figure S5; Table S2). The inter-genomic comparison of Prunus species confirmed that TFL1 and CEN genes were within shared syntenic blocks between species, indicating a shared duplication origin of TFL1-like and CEN-like genes in Prunus species (Figure 4). Within the genome of M. domestica, we detected seven syntenic blocks consisting of three WGD/segmental duplication gene pairs, including MdTFL1-MdTFL2, MdCEN1-MdCEN2, and MdBFT1-MdBFT2, and two dispersed duplication events related to MdFT and MdMFT (Figure S5; Table S2). The inter-genomic comparisons between M. domestica and R. occidental and between M. domestica and P. avium also confirmed that the duplicated gene pairs MdCEN1-MdCEN2, MdTFL1-MdTFL2, and MdBFT1-MdBFT2 are likely resulted from an independent WGD event unique to the Malus tribe (Figure 4).
Codon usage bias and other gene parameters
We observed differentially preferred codons and different gene features across five Rosaceae PEBP gene lineages (Figure 5; Figure S6; Table 2). For arginine, codons AGA and AGG were most frequently used by all lineages compared with other codons (Figure S6). Codon UCC encoding serine was mostly used in CEN-like and MFT-like proteins, while codon UCU was mostly employed by the FT and TFL1 lineages (Figure S6). We also observed significant differences in other gene parameters among different PEBP lineages (all Kruskal-Wallis tests pval<0.01) (Figure 5; Table 2). The codon adaptive index (CAI) of MFT and TFL1 genes is significantly larger than those of the other gene groups (Kruskal-Wallis test pval= 2.75e-7) (Figure 5; Table 2). In contrast, the effective number of codons (ENC) estimated for MFT and TFL1 genes is much lower than those of the other groups (Kruskal-Wallis test pval= 0.005) (Figure 5; Table 2). The average ENC values ranging from 51.24 to 54.95 indicated weak codon bias among PEBP genes. Analysis of the GC content revealed that MFT lineage genes had much higher GC content indices compared to other genes (Figure 5; Table 2). In contrast, TFL1 and BFT lineage genes appear to have lower GC1% and GC3% but relatively higher GC2% compared to other groups (Figure 5; Table 2). All gene parameters showed no variation among species (Figure S7; Kruskal-Wallis test pval>0.05). Strong pairwise correlations between gene parameters were observed (Table S3). For example, the CAI was positively correlated with the total GC% and GC3% (both correlation coefficient r≥0.56) but was negatively correlated with the ENC (r=-0.62) (Table S3). On the other hand, the ENC displayed a negative correlation with the total GC content and GC3% (Table S3).
Molecular evolution of different PEBPs lineages
To investigate the evolution of PEBP genes in Rosaceae species, we performed selection scans on coding sequences of all PEBPs using the branch model, site model, and branch site model in the CODEML program of PAML (Table 3; Table S4; Table S5). Branch models with different ω parameters specified for foreground lineages (i.e., FT-like, TFL1-like, CEN-like, and BFT-like lineages and FT/TFL1 clades) were compared with the fixed ratio model (Table S4). The likelihood ratio tests (LRT) on models specifying individual lineages of FT, TFL1, CEN, and BFT genes as the foreground branch showed no significant difference in ω between the foreground and background branch (P>0.05) (Table S4). However, the LRT test on the branch model specifying the FT and TFL1 clades as the foreground branch suggested significant divergence among FT/TFL1 and MFT clade genes (P<0.001) (Table S4). We then applied the site model LRT test and detected signs of positive selection among sites of PEBP proteins (Table S5). The branch-site LRT tests further revealed strong positive selection within TFL1 lineage and slight positive selection within FT lineages at specific protein sites (Table 3). The Empirical Bayes model suggested modest selection at positions 19 and 106 when FT lineage was set as the foreground branch and at positions 11 and 18 when TFL1 lineage was set as the foreground branch (Table 3). We further validated the results by performing selective pressure analysis on five gene lineages separately with the software Selecton. Only the FT and TFL1 lineages showed the signature of positive selection, in which residues 40N, 56N, 128S, and 181L in the FT lineage (with RoFT1 as the reference gene) and 4T, 73V, 134P, 141S, 157L, and 161S in the TFL1 lineage (with PyTFL as the reference gene) were mostly selected (Figure 6). In contrast, the genes of the other three lineages all showed signs of purifying selection across most sites (Figure S8).
Cis-acting element analysis of the FT promoter
We extracted the 2000 bp region of FT genes and scanned for putative cis-elements by searching against the PlanPan and the PlantCARE databases (Table 4). We compared the type and copy number of cis-elements for 11 FT genes from A. thaliana, P. trichocarpa, M. domestica, Pyrus communis, R. occidentalis, P. armeniaca, P. avium, P. mume, and P. persica (Table 4). Within the promoter region of the investigated FTs, three to ten CCACA boxes (binding site for CO) were identified across nine species, while none were found for PtFT2 (Table 4). CArG boxes, the binding site for the MADS transcription factor, were present in all FT promoters, among which the AtFT promoter contained the most (Table 4). Light-response elements including the G-box, AE-box, GATA-motif, GT1-motif, and TCT-motif were present within all FT genes but in different types (Table 4). In addition, binding sites for MYB, MYC transcription factor, ethylene-responsive transcription factor, and abscisic-acid responsive element (ABRE) were present in all FT promoters (Table 4). Gibberellin-responsive elements of different types were detected in FT promoters, with GARE-motif in the promoters of MdFT, PcFT1, RoFT1, PvFT and P-box in the promoters of AtFT, PtFT2, PaFT, PvFT, PmFT, and PpFT. We also observed some cis-elements with species-specific distribution patterns. For example, the low-temperature responsiveness (LTR) element was only detected within the promoters of AtFT, PcFT1, RoFT1, PvFT and PmFT (Table 4). The W-box, which is the binding site for WRKY transcription factor, was detected exclusively in RoFTs, PtFT2, and Prunus FT promoter regions (Table 4).
Tissue-specific expression patterns of PEBPs
To explore the functional roles of PEBP genes, we examined their expression patterns in different tissues of four Rosaceae species, P. persica, P. mume, P. yedoensis, and R. occidentalis (Figure 7a-d). In general, we observed a differentiated expression preference of PEBP genes across different tissues (Figure 7). Among the five PEBP subfamilies, FT-like and TFL1-like genes were expressed in both vegetative tissues such as leaf and stem, and reproductive organs such as flower bud and fruit (Figure 7). The transcription of CENs, as the closest paralogs of TFL1, was barely detected in any organs, except in the root tissues of P. mume (Figure 7). MFT was only detected in seed embryos of P. persica and fruit tissues of P. yedoensis and R. occidentalis (Figure 7). BFT was detected in the fruit tissues of all species but was relatively highly expressed in leaf and stem tissues in P. yedoensis and R. occidentals, respectively (Figure 7). We validated the tissue-specific expressions of five PEBP genes by real-time quantitative PCR (qRT-PCR) in P. mume (Figure S9). PmFT is highly expressed in floral buds compared with its expression in leaf and stem, which is consistent with result of the above tissue transcriptome sequencing in P. mume (Figure 7; Figure S9). PmTFL and PmCEN were relatively highly expressed in root tissues (Figure S9). PmBFT and PmMFT was barely detected in the four examined tissue types (Figure S9). The somewhat inconsistent tissue-specific expression patterns of PEBP orthologs across examined species are likely a result of non-uniformity in the sampling time, plant physiological state, and tissue specificity across four independent studies. Despite the inconsistency, the divergent expression of PEBP members across different tissue types indicates significant functional differentiation of PEBP gene lineages.
Expression analysis of PEBP genes during floral bud development in P. mume
We analyzed the expression of PEBP genes in flower buds of different developmental stages from July 10th, 2019 to January 12th, 2020 by qRT-PCR analysis. The expression of PmFT first decreased as the bud initiated the floral meristem from July to August, increased as floral organ initiated and developed (from August to October), slightly decreased during bud dormancy, and then significant increased as the floral bud exited dormancy and bloomed (Figure 8). PmBFT maintained a low expression level throughout the whole process, with only a minor increase during floral bud development in August and September (Figure 8). The other PEBP members retained barely detected expression levels in floral buds of all developmental stages (Figure 8). These results imply that PmFT is possibly the primary PEBP member participating in regulating floral bud development and bud flushing in P. mume.
Co-expression network analysis of FT during the blooming process in P. mume
To explore the regulatory network of FT in flowering regulation in trees, we reanalyzed the transcriptome changes of P. mume during dormancy release and the floral bud opening process [55] and performed a weighted co-expression network analysis (WGCNA). We identified 23 modules with distinct expression patterns (Figure S10a). Module-trait association analysis revealed four modules, ‘brown’, 'turquoise', ‘dark green’, and ‘salmon’, associated with the progression of bud flushing (R2 > 0.8). Among them, module ‘brown’ showed the strongest correlation with the FPKM of PmFT (Figure S10b). The ‘brown’ module genes were significantly enriched in biological processes including cell cycle (GO: 0007049), flower development (GO:0009908), glucan metabolic process (GO:0009251), auxin transport (GO:0060918), and responses to abiotic stimulus (GO: 0009628). We further identified the top 50 genes most associated with PmFT and 15 known flowering-related genes such as PmLFY, PmAP1, and PmCOL (Table S6) [56, 57]. Among genes in the ‘brown’ module, SVP (SHORT VEGETATIVE PHASE), SOC1 (SUPPRESSOR OF OVEREXPRESSION OF CO 1), GI (GIGANTEA), and CIB1 (CRYPTOCHROME-INTERACTING BASIC-HELIX-LOOP-HELIX 1) were previously identified as key players in the FT-dependent floral regulation in Arabidopsis [58, 59] (Figure 9a). Four tandem-duplicated PmDAMs (PmDAM1, PmDAM4, PmDAM5, PmDAM6) from the ‘brown’ module also exhibited expression patterns negatively correlated with that of PmFT (Figure 9a-b). The expression patterns of other known floral regulators such as COL (CONSTANS-LIKE) from the ‘turquoise’, LHY1 (LATE ELONGATED HYPOCOTYL 1) and AP1 (APETALA1) from ‘dark green’ module were not highly correlated with PmFT (R2 < 0.62) (Figure 9b). PmFT showed a relatively weak transcription level in endodormant floral buds (Figure 9b). As the floral bud continued accumulating chilling units and exiting dormancy, PmFT expression significantly increased and showed the highest expression in flushing buds (Figure 9b). PmCIB1 and 37 other genes showed similar expression patterns to that of PmFT, while PmPHYB (Pm008367), PmGI, PmLHY, PmCOL, PmSVP, PmSOC1, and four PmDAMs displayed contrasting expression patterns, with their expression decreasing as the floral buds exited endodormancy (Figure 9b). The expression patterns of FT and its coexpressed genes were further verified by qRT-PCR analysis (Figure 9c).