Comparative analysis of GRAS genes in six Cucurbitaceae species provides novel insights into their evolution and function

Qiqi Zhang Chinese Academy of Agricultural Sciences Jun He Southwest University Yuanchao Xu Chinese Academy of Agricultural Sciences Sen Chai Qingdao Agricultural University Tianshu Sun Chinese Academy of Agricultural Sciences Hongbo Li Chinese Academy of Agricultural Sciences Hongjia Xin Chinese Academy of Agricultural Sciences Kuipeng Xu Qingdao Agricultural University Zhonghua Zhang (  zhangzhonghua@qau.edu.cn ) Qingdao Agricultural University

A total of 237 GRAS genes were identi ed in the genomes of six Cucurbitaceae crops. The number of GRAS genes was relatively consistent among these species: 37 in C. sativus, 36 in C. melo, 35 in B. hispida, 37 in C. lanatus and 37 in L. siceraria, with a considerably greater number of 55 in C. moschata (Additional le 1: Table S1; Additional le 2: Table S2). Cucurbitaceae GRAS genes contained 1-12 exons, with most containing only one exon (160, 67.51%). Among the 237 GRAS proteins, the length of the putative encoded proteins varied from 258 (Bhi11P000300) to 1,466 (Lsi06P016090.1) amino acids, the predicted molecular weight (MW) ranged from 29.83 (Bhi11P000300) to 164.43 kDa (Lsi06P016090.1), and the isoelectric point (pI) varied from 4.70 (Lsi01P011330.1, CmoCh01G003940) to 9.67 (Cla004408) with a mean of 5.66, demonstrating that most of the proteins are weakly acidic. These GRAS genes (except for BhiUN508M4, BhiUN508P6 and CmoCh00G001590.1, which are not anchored onto the chromosomes) were unevenly distributed across the genomes of the six Cucurbitaceae species (Additional le 3: Fig. S1a-f). For example, among the seven chromosomes of C. sativus, chromosome 3 is the longest, but it had only 7 GRASs, compared with the 10 GRAS genes present on chromosome 6 (Additional le 4: Table S3).
The GRAS proteins in Cucurbitaceae species shared a highly conserved C-terminus, which contained ve distinct conserved motifs in the following order: LHR I, VHIID, LHR II, PFYRE and SAW (Additional le 7:  Table S5). Some motifs were present only in speci c subfamilies; for example, motif 20 was only present in the LISCL and HAM subfamilies (Additional le 8: Fig. S4b). A comparison of members from different GRAS subfamilies showed that those from most closely related subfamilies contained similar motifs. For example, members of the SHR group contained motifs 9, 16, 8, 2, 1, 13, 5, 14, 7, 6, 12, 3, 10, 19 and 4, but those in the NSP1 group contained motifs 9, 16, 8, 2, 1, 13, 5, 14, 7, 10, 19 and 4 (Additional le 8: Fig. S4b). In addition, we analyzed the gene structures of 237 GRAS genes by comparing the patterns of exon-intron architecture among the six Cucurbitaceae species (Additional le 8: Fig. S4c). The majority of GRAS genes within the same subgroup or between orthologous groups contained a similar gene structure. For instance, all genes in the OG-SHR-2, OG-SCL32-2, RAM1 and Ls subfamilies contained only one intron and two CDS.
Further, we compared the predicted motifs in the six cucurbits with those in other species including Arabidopsis, S. lycopersicum (Additional le 11: Fig. S6; Additional le 13: Table S6). Some motifs were speci c in Cucurbitaceae species; for example, motif 7 (EFGDFNFPSANQSGFYQQDISKIGDQTNYQQPNSDCLIFDELLFGNDFTI) in the SCLB clade, (Fig. 2a) and motifs 9 (LDDTTAASRWVISFSDEFRHK) and 10 (MALDGDGGSFFSTDFTSVGKEDEDTVGD) in the RAD1 clade (Fig. 2b). These speci c motifs may be responsible for the speci c processes or traits in Cucurbitaceae species.
Duplication and synteny analysis of GRAS genes among six Cucurbitaceae species Gene duplication events were surveyed to examine the expansion of the GRAS gene family in the six Cucurbitaceae genomes. Preliminary results showed that four types of duplicated GRAS genes (156 dispersed, 7 proximal, 8 tandem and 66 segmental genes) were present in the six Cucurbitaceae species (Additional le 14: Table S7). Homologous gene pairs (2 dispersed, 4 proximal, 8 tandem and 18 segmental genes) with a protein similarity greater than 80% and coverage greater than 80% were analyzed among the duplicated GRAS genes (Additional le 14: Table S7). In total, 18 segmental duplicated genes were identi ed in the C. moschata genome and two tandem duplicated genes were located in C. sativus, C. melo, C. lanatus and L. siceraria, whereas no tandem and segmental duplications occurred during the evolution of B. hispidaGRAS genes. Additionally, we observed that all duplicated gene pairs possessed Ka/Ks ratios lower than 0.5, suggesting that these genes underwent strong purifying selection during genome evolution (Additional le 15,16: Table S8,S9). Intra-genome synteny analysis for C. moschata showed that the duplicated GRAS genes occurred equally in subgenomes A and B (Fig. 3).
To infer the evolution of GRAS genes, synteny analysis was carried out among the six Cucurbitaceae species ( Fig. 4; Additional le 17: Table S10). A total of 225 GRAS genes (37 in C. sativus, 36 in C. melo, 35 in B. hispida, 36 in C. lanatus, 37 in L. siceraria and 44 in C. moschata) were located within synteny blocks of the six Cucurbitaceae genomes. Some orthologous gene pairs showed a two-to-one syntenic relationship between C. moschata and ve other Cucurbitaceae species. We found six gene pairs which exist among all ve other species have two counterparts in C. moschata, for example, the Csa4P196810.1, MELO3C025282T1, Bhi03P001806, Cla012151 and Lsi01G016490.1 are orthologs, and have two collinear gene pairs (CmoCh03G002750.1 and CmoCh07G012270.1) in C. moschata ( Fig. 4; Additional le 17: Table S10; Additional le 18: Fig. S8). These results show that these GRAS genes were conserved during the evolution of all the six cucurbit species, and the ampli cation in the C. moschata genome due to a recent whole genome duplication (WGD) event was well preserved, suggesting their conserved roles and the speci c traits in C. moschata. In addition, some GRAS genes were found to have been lost in C. moschata. For example, some GRAS genes were existed in C. sativus, C. melo, B. hispida, C. lanatus and L. siceraria, but were lost in C. moschata (the Csa7P322070.1/MELO3C023684T1/Bhi09P001228/Cla015025/Lsi02G020110.1 collinear gene pair) ( Fig. 4; Additional le 17: Table S10; Additional le 18: Fig. S8).

Expression analysis of GRAS genes in six Cucurbitaceae species
To investigate the potential function of Cucurbitaceae GRAS genes, we pro led their expression in different tissues, including roots, stems, leaves, owers and fruits. Among the 237 GRAS genes, 191 were expressed in all analyzed tissues (Additional le 19: Table S11), and 60 were relatively high expressed (FPKM > 10) (Additional le 19: Table S11). Almost all the GRAS genes in the HAM, DELLA, LISCL, SCL4/7 and PAT1 subfamilies were expressed in all ve tissues analyzed (Additional le 20, 21: Fig. S9, S10). Transcripts of three GRAS genes (MELO3C008036T1, MELO3C008170T1 and Lsi03G004130.1) could barely be detected in any of the ve tissues (FPKM < 0.001) (Additional le 19: Table S11). Expression of some GRAS genes were barely detected in speci c tissues (Additional le 19: Table S11; Additional le 21: Fig. S10). For example, MELO3C020907T1 (SHR subfamily) was not expressed in stems and transcripts of CmoCh00G001590.1 and CmoCh17G005960.1 (SHR subfamily) in leaves were nearly absent.
The expression pro les of GRAS genes from some subfamilies or orthologous groups varied greatly. For example, most OG-SCR-3 members were hardly expressed in any analyzed tissues, whereas expression level in most OG-SCR-1 and OG-SCR-2 members were high in all tissues (Fig. 5a). In the NSP2 subfamily, OG-NSP2-2 members were barely expressed in any tested tissues, in contrast to the low level of expression of all Cucurbitaceae OG-NSP2-1 genes in leaves (Fig. 5b).
Besides tissue-speci c genes, genes in some subfamilies showed expression patterns that were speci c to Cucurbitaceae species. For example, all Cucurbitaceae genes in the SCLB clade were relatively high expressed in fruits, where other clade members of species outside the Cucurbitaceae were hardly detectable (Fig. 5d); the same was true for members of the RAD1 clade (Fig. 5e).
In most cases, similar expression patterns were shown by duplicated gene pairs. However, a few exceptions were observed (Additional le 23: Fig. S11; Additional le 24: Table S13). For example, for the Cla_TanDup_1.1/1.2 duplicated gene pair, Cla_TanDup_1.2 was highly expressed in roots, owers and fruits, whereas Cla_TanDup_1.1 weakly expressed in roots, owers and fruits; for the Cmo_SegDup_6.1/6.2 duplicated gene pair, Cmo_SegDup_6.2 was highly expressed in roots and fruits, whereas Cmo_SegDup_6.1 was hardly detected in these tissues, etc. These differences in expression suggest that duplicated GRAS genes may have acquired divergent functions.
Identi cation and evolutionary history of GRAS genes in six Cucurbitaceae species The GRAS gene family contains different numbers of members in different species, such as 32-34 in A. thaliana [4,6], 60 in O. sativa [4,8], 53-54 in S. lycopersicum [7,39] and 106 in P. euphratica [4]. The number of GRAS genes in the genomes of the Cucurbitaceae species C. sativus, C. melo, B. hispida, C. lanatus and L. siceraria of the Benincasae group is similar, whereas C. moschata contains 55 GRAS genes, indicating that the GRAS family diverged together with the evolutionary separation of the Cucurbiteae and Benincasae groups. Collinearity analysis showed that segmental duplication events detected in C. moschata occurred between the two C. moschata subgenomes (Fig. 3), indicating that a whole-genome duplication event was the main reason for the difference in the number of GRAS genes between the Benincasae and Cucurbiteae group [26,44,45]. The number of GRAS genes (except for CmoCh00G001590.1 whose chromosome location is unclear) distributed on the subgenome A of C. moschata (32 GRAS genes) was greater than that on subgenome B (22 GRAS genes) (Additional le 3: Fig. S1f), indicating that the rate of gene loss within this family in subgenome A was lower than that for subgenome B [26].
Our characterization of GRAS family members was expanded by the analysis of conserved CDSs and UTRs in Cucurbitaceae genes and motifs in their encoded proteins (Additional le 8: Fig. S4, Additional le 11: Fig. S6; Additional le 12: Fig. S7). Homologous GRAS genes that clustered together in a subgroup or clade tended to share similar numbers and positions of the CDS and encoded protein motifs (Additional le 8: Fig. S4). However, some genes within the same subgroup also contained different structures, re ecting evolutionary differences of these GRAS gene family members and suggesting that they might have different functions (Additional le 11: Fig. S6 . Gene expression pro les reveal diverse roles for GRAS genes in plant growth and development To identify the potential functions of the Cucurbitaceae GRAS genes, we pro led their expression patterns and observed different abundances of GRAS gene transcripts in different tissues and organs, implying that the genes might differ in function (Additional le 20, 21: Fig. S9, S10).
The tissue-speci c expression of some GRAS genes suggests that they play potential roles in the corresponding tissue. For example, Cla020023 and Cla014025 from the SHR subfamily were particularly expressed in roots (Additional le 20, 21: Fig. S9, S10; Additional le 22: Table S12), which is consistent with the function of their homologous gene AtSHR in root development [15,29,46,47]. Similarly, the root expression of Bhi09P001022, Cla014874, Lsi02G021610.1 (Fig. 5a) from the SCR subfamily is consistent with the function of their homologue AtSCR in root development [29].
If the tissue-speci c expression of genes suggests a potential function in a particular tissue, the speciesspeci c expression of some genes may imply they have a speci c role in that particular species. Genes from the SCLB subfamily were expressed in fruits of Cucurbitaceae plants, whereas their homologous genes were not expressed in fruits of the other analyzed species (Fig. 5d). Therefore, these genes might participate in fruit development only in Cucurbitaceae species. Consistent with this, we also identi ed some motifs in members of the SCLB subfamily that were speci c to the Cucurbitaceae (Fig. 2a).
Furthermore, homologous genes may share similar functions. In this discussion, we particularly focus on genes associated with fruit development, from model plants such as S. lycopersicum. For instance, Lsi07G014150.1 and Csa3P043910.1 in the DLT clade, were more highly expressed during the owerfruit transition, implying that they potentially function in fruit set and development (Additional le 21: Fig.  S10; Additional le 19: Table S11). This pattern is consistent with the expression pattern of their homologous gene SlGRAS41 (Solyc08g080400.1.1), which might play a role in fruit development by modulating brassinosteroid signaling in S. lycopersicum [14,35,39]. Within the DELLA subfamily, Lsi05G003650.1 was highly expressed in fruits (Additional le 20, 21: Fig. S9, S10; Additional le 19: Table S11), which is consistent with the reported function of its homologue SlDELLA in controlling cell division, expansion, and the auxin-signaling pathway throughout fruit set and growth in S. lycopersicum [31,48]. It has been reported that silencing of SlGRAS2 (Solyc07g063940.1.1, PAT1 subfamily) led to reduced fruit weight during S. lycopersicum fruit development through the modulation of ovary growth and cell size [16]. In Cucurbitaceae species, CmoCh09G009100.1, CmoCh01G012140.1, MELO3C018144T1 (Additional le 21: Fig. S10; Additional le 19: Table S11), which are homologous to the SlGRAS2 gene, were highly transcribed in fruits, implying that they may play a role in fruit development.

Conclusions
In this study, 237 GRAS genes were identi ed in the genomes of six Cucurbitaceae species. The GRAS family was divided into 16 subgroups on the basis of their phylogenetic relationships and previous classi cations. We observed Cucurbitaceae-speci c motifs and expression patterns in GRAS proteins of the SCLB clade, which may play a speci c role in Cucurbitaceae species. Duplication and synteny analysis revealed that segmental and tandem duplication might be the main drivers of the expansion of the GRAS family in Cucurbitaceae species, with whole-genome duplication as the main reason for the presence of more GRAS genes in C. moschata than in the other ve analyzed species. The expression patterns of the GRAS members among tissues or at different development stages were surveyed using publicly available transcriptome data, and some tissue-speci c genes were identi ed, and their potential functions were discussed, especially for the genes that might be involved in fruit development. In conclusion, this study should facilitate improvements in agronomic traits of Cucurbitaceae crops.

Identification and characterization of GRAS genes in Cucurbitaceae species
We downloaded the genome sequences, protein sequences and annotation information for C. sativus [23], C. melo [24], B. hispida [19], C. lanatus [25], L. siceraria [27] and C. moschata [26]. The Pfam database (http://pfam.xfam.org/) [49] was used to retrieve the GRAS Hidden Markov Model (HMM) model (PF03514) for HMM searches against annotated protein databases from different genomes with an E-value cutoff of 0.01 with the program HMMER (version: 3.1) (http://hmmer.org/). From the proteins obtained using the raw GRAS HMM, a high-quality protein set (E-value < 1 × 10 −20 and manual veri cation of an intact GRAS domain) was aligned and used to construct a species-speci c GRAS HMM using hmmbuild from the HMMER v3 suite. This new species-speci c HMM was used, and all proteins with an E-value lower than 0.01 were selected [50]. To identify amino-acid sequences that contained GRAS domains, we used the NCBI Conserved Domain Search (https://www.ncbi.nlm.nih.gov/Structure/cdd/) to ensure the accuracy of the identi cation of these transcription factors. In the same way, we identi ed GRAS genes from the A. thaliana and S. lycopersicum genomes and compared the results with the number of GRAS genes previous reported, to verify the accuracy of the method. Finally, GRAS genes were further ltered based on GRAS proteins whose ratio of the GRAS conserved domain length to the GRAS HMM length was less than 60%. The GRAS genes were mapped to the six chromosomes of Cucurbitaceae species based on their genome annotation les, and diagrams of the chromosomal locations were plotted with Mapchart software [51]. The chemical features of each GRAS protein, such as isoelectric point (pI) and molecular weight (MW), were extracted using the ExPASy proteomics server (http://web.expasy.org/protparam/) [52]. We used OrthoFinder [53] to identify single-copy homologous genes from each of the 11 species and used EasySpeciesTree (https://github.com/dongwei1220/EasySpeciesTree) to construct species phylogenetic trees.

Analysis of GRAS gene structures and the phylogenic relationships, and conserved motifs among the GRAS proteins
The 237 candidate Cucurbitaceae GRAS proteins and 165 additional well-described GRAS proteins (encoded by 33 GRAS genes in A. thaliana, 49 GRAS genes in S. lycopersicum, 24 GRAS genes in O. sativa, 29 GRAS genes in V. vinifera and 30 GRAS genes in A. trichopoda.) were multiply aligned using MUSCLE software (version 3.8.1551) [54] to further search the evolutionary relationships of the GRAS gene family and a neighbor-joining (NJ) consensus tree was subsequently constructed using MEGA software (version 6.06) [55] under the Jones-Taylor-Thornton (JTT) amino-acid substitution model and the Partial deletion option of 1,000 bootstrap replicates to assess the reliability of internal branches. Finally, we divided the sequences into different subfamilies according to previous reports [5] and modi ed the phylogenetic trees by iTOL software (https://itol.embl.de) [56]. The other trees that contained 318 GRAS proteins from six Cucurbitaceae plants, A. thaliana and S. lycopersicum were constructed for each subfamily in the same way. Multiple alignments of 237 Cucurbitaceae sequences containing a conserved GRAS domain were performed with DNAMAN software (version 8.0), and the maps were constructed.
To obtain the gene structure, a GFF3 annotation le containing the precise positional information of coding sequences (CDSs) and untranslated regions (UTRs) in GRAS genes was retrieved from the genomic dataset. The conserved and shared motifs for protein sequences from each subfamily were annotated by Multiple Expectation for motif Elicitation tool (version: 4.9.1) (MEME, http://meme-suite.org/) [57] with the following parameters: "minsites 6, maxsites 149, minw 6, maxw 200 and nmotifs 20". Gene structures and conserved motifs were visualized using TBtools software [58].
Duplication and synteny analysis of GRAS genes from six Cucurbitaceae species Duplication events among Cucurbitaceae plants were identi ed using the Basic Local Alignment Search Tool (BLAST) (E-value: £ 10 −5 ) and the duplicate_gene_classi er program with the default parameters of the Multiple Collinearity Scan toolkit (MCScan) [59]. The duplicate_gene_classi er program can classify the duplicate genes of a single species into WGD/segmental, tandem, proximal and dispersed duplicates. WGD/segmental duplicates are inferred by the anchor genes in collinear blocks. Tandem duplicates are de ned as paralogs that are adjacent to each other on chromosomes, which are suggested to arise from illegitimate chromosomal recombination. Proximal duplicates are paralogs physically located near each other but interrupted by several other genes. Proximal duplicates are inferred to result from localized transposon activities, or ancient tandem arrays interrupted by more recent gene insertions. Dispersed duplicates are paralogs that are neither near each other on chromosomes, nor do they show conserved synteny [60]. Subsequently, non-synonymous (Ka) and synonymous substitution (Ks) between duplicated GRAS gene pairs were calculated by the add_ka_and_ks_to_collinearity program of the MCScan tookit. The duplicated GRAS genes in C. moschata were visualized as circle plots generated by Circos software (version: 0.69-8) [61]. Syntenic gene pairs among the C. sativus, C. melo, B. hispida, C. lanatus, L. siceraria and C. moschata genomes were identi ed using MCScan with default parameters. Chromosome-scale synteny blocks plots between six Cucurbitaceae genomes were constructed using the python version of MCScan (https://github.com/tanghaibao/jcvi/wiki/MCscan-(Python-version)).

The expression of GRAS genes in different tissues
We analyzed the expression pro les of GRAS genes in different tissues, including roots, stems, leaves, owers and fruits, using public RNA-seq data [26,[62][63][64][65][66][67][68]. RNA-seq data from ve tissues were mapped to the six Cucurbitaceae plant genomes using Hisat2 (version: 2.1.0) [69] with default settings for parameters and were assembled using stringtie (version: 1.3.4) [70]. The bam les of uniquely mapped reads were used as inputs for the stringtie software, and fragments per kilobase million (FPKM) values were calculated to measure the gene expression levels. Heat maps for GRAS genes were plotted by pheatmap in the R package using the FPKM values as log 2 (FPKM value + 1). To examine the speci city of the GRAS gene expression patterns, we calculated a tissue-speci city index, τ, which is a quantitative, graded scalar measure of the speci city of an expression pro le. The τ values interpolate the entire range between 0 for housekeeping genes and 1 for genes that are strictly speci cally expressed in a single tissue. In this study, 0.85 ≤ τ ≤ 1 was regarded as a tissue-speci c gene, whereas 0 ≤ τ ≤ 0.15 identi ed a housekeeping gene. All needed RNA sequencing data in this study from NCBI (https://www.ncbi.nlm.nih.gov/) Sequence Read Archive (SRA) database with the accession number are shown in Additional le 25:       Expression pro les of GRAS genes in A. thaliana, S. lycopersicum and six Cucurbitaceae species. In the heat maps, columns represent different tissues: roots, stems, leaves, owers and fruits, and the rows represent genes. The color changes from white to black to represent relative gene expression (log2 (FPKM + 1)); gray represents missing values. The Neighbor-joining tree of GRAS proteins is shown on the left. a, SCR; b, NSP2; c, SCL32; d, SCLB; e, RAD1 subfamilies.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.

Supplementary le.pdf
Supplementary le.xls