Molecular Evolution, Genome-Wide Characterization and Codon Bias Analysis of the SBP-Box Family in Cucumis sativus

Background: SQUAMOSA promoter binding protein (SBP)-box genes encode a group of transcription factors that play essential roles in plant development and stress responses. However, the SBP-box gene family has not been well characterized in cucumber (Cucumis sativus). Results: In the present study, 15 putative SBP-box genes were identified, distributed on 4 chromosomes of cucumber. Evolutionary analysis showed that the green plant SBP family originated from a common ancestor. Phylogenetic analysis divided CuSBPs into 6 groups similar to those of Arabidopsis and rice. Within each group, intron-exon and motif structures shared common features, consistent with the evolutionary analysis. Expression pattern analysis of transcriptional data on flowering and resistance to powdery mildew demonstrated a conserved function of SBP-box genes in the vegetative-to-reproductive transition and potential roles in other regulatory pathways. Moreover, codon bias analysis revealed the mutation and selection pressures exerted on these genes. Conclusions: This study comprehensively characterized the cucumber CuSBP gene family, providing a foundation for exploring the functions of CuSBPs to improve the yield, quality and stress tolerance of cucumber in the future.


Background
Transcription factors (TFs) are proteins that bind DNA in a sequence-specific manner and activate or inhibit the transcription of target genes, so that these genes are expressed specifically in different tissues and environments. TFs have major roles in regulating plant defense and stress resistance [1]. SBP (SQUAMOSA promoter binding protein)-box genes are plant-specific TFs that regulate plant development [2,3] and stress responses [4,5] by targeting specific genes. Since their first identification in Antirrhinum majus [6], SBP-box genes have been identified and functionally characterized in various plant species, including angiosperms such as Arabidopsis thaliana [7], rice (Oryza sativa) [8], maize (Zea mays) [9], tomato (Lycopersicon esculentum Mill) [10] and silver birch (Betula pendula) [11], as well as early divergent land plants such as moss (Physcomitrella patens) [12]. Typically, SBP-box proteins have a highly conserved SBP domain of 76 amino acids containing two zinc finger-like motifs and a nuclear localization signal sequence [13]. However, research on the SBP-box family in cucumber has not been reported.
In the present study, we conducted genome-wide identification, characterization and molecular evolution analysis of the SBP-box gene family in cucumber. Phylogeny, gene structure and codon bias were systematically analyzed. Moreover, comparison of expression patterns showed that the SBP-box genes potentially participate in flowering and contribute to powdery mildew resistance.
Results

Identification of SBP-box genes in cucumber and their chromosome location
A total of 15 CuSBP genes were identified in cucumber and named according to the NCBI database (Table 1). The genomic sequences of these genes ranged from 2 kb to 12 kb in length, and the predicted proteins ranged from 141 to 1031 amino acid residues. These results are similar to previous studies [27,33], in which the lengths of genes and proteins also varied considerably.

Conserved domain and phylogenetic analysis
The common feature of SBP-box family members is the SBP domain, which consists of two zinc finger-like structures (C3H and C2HC) and a highly conserved nuclear localization signal. In the present study, the conserved SBP domains were used for multiple sequence alignment. The sequences found in Cucumis sativus shared all three features (Table S1), which shows that the SBP domains are highly conserved in Cucumis sativus (Figure 1A). A sequence logo was drawn to show the distribution of amino acid residues at each site (Figure 1B). Several highly conserved sequences were found within the SBP domains, such as the CQQC and KRSCR sequences. Notably, the fourth His in Zn-1 of CuSBP7a was replaced by Cys, unlike in the other members.
To further investigate the evolutionary relationships of SBP-box genes in Arabidopsis, rice and cucumber, a phylogenetic tree was constructed using 51 SBP sequences from the three species (Figure 2). The sequences clustered into 6 groups (Group I-Group VI), and the number of members varied greatly among groups. Group IV and Group VI were the largest, with 14 members each, whereas Group I had only three members. CuSBP genes were distributed in all six groups; they were closely related to the Arabidopsis AtSPL genes and more distantly related to the rice genes. Among them, CuSBP7a, CuSBP8, CuSBP9, CuSBP14, CuSBP1a, CuSBP6, CuSBP13a and CuSBP13b were located in the same sub-branches as their Arabidopsis homologs.

Structural organization and conserved motif analysis of CuSBP genes
A phylogenetic tree of the cucumber genes was constructed to further understand the structures of CuSBPs (Figure S1). Exon-intron structures were generated based on the genomic and coding sequences. The results revealed high variation in the number of introns, which ranged from 2 to 10. Group I genes contain 10 introns, Group II 2, Group III 2-3, Group IV 3, Group V 10 and Group VI 3, in agreement with previous studies [27,34].
Thirty conserved motifs were characterized in CuSBPs by MEME (Figure S2). The number of motif types varied from 4 (CuSBP1b, CuSBP1c, CuSBP3) to 13 (CuSBP1a, CuSBP14). Generally, Motifs 1, 2 and 3 constitute the SBP domain. Group VI had greater motif diversity because it contains the largest number of members in the SBP gene family. CuSBP1a and CuSBP14 both belong to Group V and share exactly the same motifs, among which Motifs 8 and 30 are specific to them. Motifs 4, 17, 12 and 9 were shared by Group I and Group V. Motif 28 was the only repeated motif, occurring twice in CuSBP3.

Cis-element analysis of CuSBP genes
The analysis of cis-acting elements in promoter sequences helps to understand gene regulation patterns. In the present study, the cis-acting regulatory elements in the promoter regions of CuSBP genes were classified into two types by function (Table S2). Type Ⅰ elements are involved in stress responses, such as ABRE (involved in abscisic acid responsiveness), MBS (MYB binding site involved in drought inducibility) and ARE (essential for anaerobic induction). Type Ⅱ elements are involved in developmental processes, such as CAT-box (related to meristem expression), GCN4_motif (involved in endosperm expression) and P-box (gibberellin-responsive element). The distribution of these elements varied considerably among CuSBPs, even within the same group.
The second type includes CuSBP3, CuSBP7a, CuSBP9, CuSBP12 and CuSBP16. These genes had high expression levels at the early flowering stage, and their expression decreased over time.
The third type was expressed at a high level overall and at an especially high level at a specific developmental stage. CuSBP1a and CuSBP14 are members of Group V. Both were up-regulated at day 3 and day 5, while at day 1 and day 4 they were down-regulated to the same degree. Taking the promoter analysis into consideration, the promoter regions of all third-type genes contained light-responsive elements (Table S2).

Expression profile of Cucumber SBP-box genes against powdery mildew
Powdery mildew is one of the most severe diseases of cucumber and can reduce production dramatically. To study the role of the SBP-box gene family against powdery mildew, we used the leaf transcriptional data of the powdery mildew-resistant segment substitution line SSL508-28 and the parent
Table S3. First, the codon usage pattern was studied. The GC content was 46.32±3.49%, with GC1 51.09±2.94%, GC2 45.66±3.70% and GC3 42.20±7.31%. GC content was lowest at the third codon position, where it also showed the largest variance. A neutrality plot (GC12 vs GC3s) was drawn (Figure 5a); the correlation between the points was not significant (P>0.05) and the slope of the regression line was close to 0. The association between ENC and GC3s was examined against a standard curve calculated under the hypothesis of no selection (Figure 5b).
Phylogenetic tree analysis showed that the SBP genes clustered into 6 groups (Group I-Group VI) (Figure 2). Preston divided the SBP family genes of 9 species into eight clades [13]. In the present study, splitting Group IV and Group VI at their first nodes into Group IV-1, Group IV-2, Group VI-1 and Group VI-2 yields the same 8 clades as in the previous study. However, there is no CuSBP in Group IV-2.

Evolution of cucumber SBP family genes
A recent study on SBP-box genes proposed that the SBP-box genes of land plants arose through duplication events from one common ancestor. Based on the timeline of duplications, SBP-box genes are divided into group 1, group 2-1 and group 2-2, among which group 2-2 played the major role in the expansion of SBP-box genes [13]. Group I in the present study corresponds to group 1, the first lineage to form. Group V in the present study corresponds to group 2-1, the subgroup of the second group that, compared with group 2-2, retains evolutionary features similar to those of group 1. We further constructed a phylogenetic tree of cucumber SBP genes and analyzed their conserved motifs and introns (Figure S1
To explore the potential functions of CuSBPs in cucumber, two sets of transcriptional data (flowering and PM) were used. Flowering marks the transition from the vegetative phase to the reproductive phase, which makes it crucial for plant reproduction. Therefore, it is important to study the expression patterns of CuSBP family genes during cucumber flowering and to discover their potential functions in flowering, in order to provide resources for future cucumber breeding aimed at improving yield. Based on their expression patterns at different flower development stages, the CuSBP genes could be divided into two groups: one with lower expression levels during flowering (CuSBP8/6/1b/13a/7b/13b) and one with higher expression levels (CuSBP3/12/16/7a/9/1c/14/13c/1a). We speculate that genes with low expression during flowering may be involved in vegetative growth. CuSBP8, CuSBP1b and OsSBP10 are members of Group Ⅲ, and their expression patterns are similar. OsSBP10 was shown to be highly expressed in rice seedlings and young spikelets [44]. AtSPL13, the homolog of CuSBP13, is highly expressed in the hypocotyl, shoot apical meristem, leaf primordia and developing inflorescence.
Disruption of AtSPL13 regulation delays the post-germinative transition from the cotyledon stage to the vegetative-leaf stage.
Powdery mildew, caused mainly by Podosphaera fusca, is one of the world's most damaging diseases of cucumber [47]. It seriously affects photosynthesis and disturbs metabolism, resulting in premature aging and reduced production. Therefore, it is important to study the functions of CuSBP family genes in the regulation of powdery mildew resistance. In Group Ⅴ, CuSBP14 and CuSBP1a maintained high expression levels in both cultivars. After PM treatment, the expression levels of the two genes in SSL508-28 were lower than those in D8. AtSPL14, the homolog of CuSBP14 in Arabidopsis, is associated with sensitivity to the programmed cell death (PCD)-inducing fungal toxin FB1. This indicates that CuSBP14 and CuSBP1a may be involved in the negative regulation of biotic stress responses. CuSBP7a has a lower expression level in SSL508-28 than in D8; AtSPL7, OsSPL9 and CuSBP7a belong to Group I. AtSPL7 and OsSPL9 are considered functional genes in copper regulation pathways, and loss-of-function mutations in SPL9 result in enhanced plant resistance to rice stripe virus [35,48]. Both CuSBP1b and CuSBP13a were strongly up-regulated after powdery mildew treatment in SSL508-28, which indicates that these two genes may positively regulate powdery mildew resistance. CuSBP12 was down-regulated after PM treatment in both varieties, but CuSBP6, its closest homolog, was not detected in this dataset. AtSPL6, a homolog of CuSBP6/12, positively regulates defense genes and the plant innate immune system. NbSPL6 is essential for N-mediated resistance to Tobacco mosaic virus [24]. This also indicates that homologous genes in different species may have diverged in function, but the specific functions remain to be studied.

SBP-box genes had high codon usage bias in cucumber
Codon usage bias reflects genetic information hidden in the RNA sequence and gives us access to the evolutionary history of genes in an organism. In the present work, we found high codon usage bias in cucumber SBP-box genes.
ENC is a widely accepted measure of the degree of codon bias. Previous studies demonstrated that gene expression is negatively correlated with the ENC value [49,50]; in other words, important, highly expressed genes tend to have low ENC values. The ENC value of CuSBPs is 49, indicating that strong selection has acted on them and that they are well functionalized, consistent with the discussion above. Group III members have the top 3 ENC values, suggesting that their roles may not be irreplaceable. In the ENC plot, notably, CuSBPs have low ENC values compared with the expected ENC values, rejecting the null hypothesis of no selection. All gene points except one lie below the curve, and the genes are not narrowly distributed in the plot, which demonstrates that both mutation pressure and selection affect the codon usage pattern, supporting the theory of equilibrium between mutation and selection [51].
In the neutrality plot, the correlation between the points is not significant (P>0.05) and the slope of the regression line is close to 0. The weak correlation between GC12 and GC3s suggests that there is high mutation bias or low conservation of GC content levels, which means that mutation pressure rather than translational selection plays the major role.
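The quantities behind a neutrality plot can be sketched as follows. This is only an illustration under stated assumptions: the function names `gc_by_position` and `neutrality_regression` are hypothetical, the sequences are made-up examples rather than CuSBP data, and GC3 is computed over all third codon positions, whereas GC3s is usually restricted to synonymous codons. The significance test (P value) reported in the text is not reproduced here.

```python
from math import sqrt


def gc_by_position(cds: str) -> tuple[float, float, float]:
    """Return (GC1, GC2, GC3) as fractions for an in-frame coding sequence."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    fractions = []
    for offset in range(3):
        bases = seq[offset:usable:3]  # every base at this codon position
        fractions.append(sum(b in "GC" for b in bases) / len(bases))
    return fractions[0], fractions[1], fractions[2]


def neutrality_regression(gc3: list[float], gc12: list[float]) -> tuple[float, float, float]:
    """Least-squares fit of GC12 on GC3; returns (slope, intercept, Pearson r)."""
    n = len(gc3)
    if n != len(gc12) or n < 3:
        raise ValueError("need at least three paired values")
    mean_x = sum(gc3) / n
    mean_y = sum(gc12) / n
    sxx = sum((x - mean_x) ** 2 for x in gc3)
    syy = sum((y - mean_y) ** 2 for y in gc12)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc3, gc12))
    if sxx == 0 or syy == 0:
        raise ValueError("values must not be constant")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / sqrt(sxx * syy)


# Hypothetical coding sequences, one point per gene in the plot.
example_cds = ["ATGGCCAAGTGA", "ATGGCTTTTAAATAA", "ATGCGCGGCCTGTAG"]
points = [gc_by_position(s) for s in example_cds]
gc12_values = [(g1 + g2) / 2 for g1, g2, _ in points]
gc3_values = [g3 for _, _, g3 in points]
slope, intercept, r = neutrality_regression(gc3_values, gc12_values)
```

Each gene contributes one (GC3, GC12) point, and the slope and correlation of the fitted line are the values discussed in the paragraph above.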
RSCU is commonly used to analyze synonymous codon usage (Table S4). Most of the abundantly used codons end in A or T, as a result of compositional constraints (i.e., A and T) [52], which are part of the mutation pressure. For subsequent research, if transgenic SBP-box genes need to be expressed in cucumber, the inserted sequence can be modified according to the preferred codon patterns identified in this study.
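As a rough illustration of how RSCU values such as those in Table S4 are typically derived, the sketch below divides the observed count of each codon by the count expected if all synonymous codons for that amino acid were used equally. It assumes the standard genetic code and pooled counts across all input sequences; the function name `rscu` and the example sequence are hypothetical and not taken from the study.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(cds_list: list[str]) -> dict[str, float]:
    """Relative synonymous codon usage, pooled over all coding sequences."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skip codons with ambiguous bases
                counts[codon] += 1

    families: dict[str, list[str]] = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, codons in families.items():
        if aa == "*" or len(codons) == 1:
            continue  # stop codons, Met and Trp have no synonymous choice
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        for c in codons:
            # observed count divided by the count expected under equal usage
            values[c] = counts[c] * len(codons) / total
    return values


# Hypothetical example: four alanine codons, two of them GCT.
example = rscu(["GCTGCTGCAGCC"])
```

An RSCU above 1 marks a codon used more often than expected under equal usage, which is how preferred codons are usually picked out.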

Conclusion
In summary, the present study provides a comprehensive understanding of SBP-box genes in cucumber. Genome-wide identification and characterization of SBP-box genes were carried out in cucumber, including phylogenetic, gene structure and promoter analyses. Expression pattern analysis of flowering and powdery mildew resistance indicated that the various potential roles of SBP-box genes in cucumber remain to be explored. Moreover, codon usage analysis of CuSBPs provides a picture of their molecular evolution and a basis for further transformation work.

Identification of SBP-box genes in cucumber
The whole-genome information of cucumber was downloaded from the NCBI genome database (https://www.ncbi.nlm.nih.gov). The protein sequences of SBP-box genes from Arabidopsis thaliana and Oryza sativa were downloaded from UniProt (https://www.uniprot.org/) and used as 'input files'. Two methods were then used to identify candidate SBP-box genes genome-wide in Cucumis sativus: (1) the 'input files' were searched with BLAST against the cucumber genome with an E-value < 1e-5 to obtain putative SBP-box family genes; (2) HMMER3 was used to build the HMM model and obtain results [53]. The two methods led to the same result. The sequences were then confirmed using Pfam (http://pfam.xfam.org/) and Batch CD-Search at NCBI.
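The comparison of the two search methods can be sketched as below. This is an illustration only: it assumes the BLAST results were saved in the 12-column tabular format (where column 2 is the subject ID and column 11 the E-value), which the text does not state, and the gene identifiers, function names and hit lists are hypothetical.

```python
def blast_subject_ids(tabular_lines, max_evalue=1e-5):
    """Collect subject IDs from BLAST tabular lines below an E-value cutoff."""
    hits = set()
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 12:
            continue  # skip comment lines and malformed rows
        if float(fields[10]) < max_evalue:  # column 11: E-value
            hits.add(fields[1])             # column 2: subject ID
    return hits


def compare_methods(blast_ids, hmmer_ids):
    """Split candidates into those found by both methods or by only one."""
    return {
        "both": blast_ids & hmmer_ids,
        "blast_only": blast_ids - hmmer_ids,
        "hmmer_only": hmmer_ids - blast_ids,
    }


# Hypothetical BLAST rows: query, subject, identity, ..., E-value, bit score.
example_blast = [
    "AtSPL3\tgeneA\t62.5\t80\t30\t0\t50\t129\t40\t119\t2e-30\t115",
    "AtSPL3\tgeneB\t35.0\t40\t26\t0\t60\t99\t10\t49\t0.5\t28.1",
    "OsSPL14\tgeneC\t58.0\t78\t32\t1\t100\t177\t90\t167\t4e-25\t101",
]
example_hmmer = {"geneA", "geneC"}  # hypothetical HMMER hit IDs

summary = compare_methods(blast_subject_ids(example_blast), example_hmmer)
```

When the two methods agree, as reported above, the `blast_only` and `hmmer_only` sets are empty.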

Sequence alignment and phylogenetic analysis
Multiple amino acid sequence alignment was performed using DNAMAN software (Lynnon Biosoft, CA, USA). The sequence logo of the conserved sequences was generated using the online platform Weblogo4.
Phylogenetic trees of SBP-box family genes in Arabidopsis thaliana, Oryza sativa and Cucumis sativus were constructed using MEGA 7.0 (https://www.megasoftware.net/index.php) with the neighbor-joining method and 1000 bootstrap replicates.
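The height of each column in a sequence logo reflects how conserved that alignment position is. A minimal sketch of that measure is given below, assuming a 20-letter amino acid alphabet and ignoring gaps; it omits the small-sample correction that logo tools commonly apply, and the function name and the short aligned fragments are hypothetical.

```python
import math
from collections import Counter


def column_information(aligned: list[str], alphabet_size: int = 20) -> list[float]:
    """Information content (bits) of each column of an aligned sequence set."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    info = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]  # gaps carry no information
        if not residues:
            info.append(0.0)
            continue
        n = len(residues)
        entropy = -sum(
            (count / n) * math.log2(count / n)
            for count in Counter(residues).values()
        )
        info.append(math.log2(alphabet_size) - entropy)
    return info


# Hypothetical aligned fragments; the third column is not fully conserved.
bits = column_information(["CQQC", "CQQC", "CQAC"])
```

Fully conserved columns reach the maximum of log2(20) bits, while variable columns score lower.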

Intron-exon structure analysis
Intron/Exon structures were determined by aligning coding sequences to their corresponding genomic sequences. A diagram of intron/exon structures was obtained using TBtools [54].
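Once coding sequences have been aligned to the genomic sequence, introns are simply the genomic gaps between consecutive aligned exon blocks. The sketch below illustrates this under the assumption that exon blocks are given as 1-based inclusive coordinates; the function name and coordinates are hypothetical and do not describe any CuSBP gene.

```python
def introns_from_exons(exons: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return intron intervals lying between consecutive exon blocks.

    Exons are (start, end) pairs in 1-based inclusive genomic coordinates.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:  # a gap between exons is an intron
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Hypothetical gene with three exon blocks, hence two introns.
example_introns = introns_from_exons([(1, 100), (201, 300), (451, 500)])
intron_count = len(example_introns)
```

Counting the returned intervals gives the intron number per gene, the quantity compared across groups in the results.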

Motif analysis
Multiple EM for Motif Elicitation (MEME) software (http://meme.nbcr.net/meme/) was used to search for motifs in all 15 SBP-box genes. The number of motifs was set to 30.

Expression analysis of CuSBPs
Two sets of transcriptome data for Cucumis sativus were obtained from the NCBI database to study the expression profiles of CuSBP genes. The first data set (GSE76358), which includes samples from different flowering periods, was reported previously by Sun et al. [55]. The normal ovary blooms 4-5 days after labeling (when the ovary is visible), with the majority blooming at day 5. Corollas from day 1, day 3, day 4 and day 5 after labeling were collected for RNA sequencing. Log2 RPKM values were used to draw the heatmap. The second data set (GEO accession GSE81234) describes gene expression patterns in response to powdery mildew. To investigate the candidate genes governing Pm5.1 and their effects on powdery mildew resistance, the RNA-sequencing-based transcriptomes of the powdery mildew-resistant segment substitution line SSL508-28 and the recurrent parent D8 were compared 48 h after inoculation with the PM pathogen. Log2 FPKM values were used to draw a heatmap.
To further analyze codon usage bias, a plot of ENC vs GC3s (ENC plot) and a plot of GC12 vs GC3s (neutrality plot) were generated according to a previous study. In the ENC plot, the standard curve was calculated by:
$$\mathrm{ENC}_{\mathrm{exp}} = 2 + S + \frac{29}{S^{2} + (1 - S)^{2}}$$
where S is the frequency of G + C at the third codon position (i.e., GC3s). In the neutrality plot, GC12 is the frequency of GC at the 1st and 2nd codon positions.
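The standard curve can be evaluated with a few lines of code. The sketch below assumes that S (GC3s) is expressed as a fraction between 0 and 1; the function names are hypothetical, and the example simply reuses the mean GC3 (42.20%) and the ENC value (49) quoted in the text as illustrative inputs, not as a re-analysis.

```python
def expected_enc(gc3s: float) -> float:
    """Expected ENC under no selection, for GC3s given as a fraction in [0, 1]."""
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


def lies_below_curve(observed_enc: float, gc3s: float) -> bool:
    """True when a gene's observed ENC falls below the standard curve."""
    return observed_enc < expected_enc(gc3s)


# Illustrative inputs taken from values quoted in the text.
curve_value = expected_enc(0.4220)
is_below = lies_below_curve(49.0, 0.4220)
```

The curve peaks at S = 0.5 and is symmetric around that point in its dominant term, so genes with balanced GC3s have the highest expected ENC.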

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
The datasets analysed during the current study are available in the NCBI genome database (https://www.ncbi.nlm.nih.gov). The protein sequences of SBP-box genes from Arabidopsis thaliana and Oryza sativa were downloaded from UniProt (https://www.uniprot.org/) and used as 'input files'. Two sets of transcriptome data for Cucumis sativus were obtained from the NCBI database to study the expression profiles of CuSBP genes. The data set GSE76358, which includes samples from different flowering periods, was reported previously by Sun et al. [55].

Competing interests
The authors declare that they have no competing interests.

Funding
The research was supported by National Key R&D Program of China (2018YFD0100901). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
