Transcriptome-wide identi cation and characterization of the MYB gene family in sickle seagrass (Thalassia hemprichii)


 Background: Sickle seagrass ( Thalassia hemprichii ) is one of the most important marine plants living in the tropical climate, mainly distributed in Southeast Asian waters. It is an important food source for marine herbivores and plays important roles in nitrogen fixation, water purification and maintaining the balance of marine ecology. In recent years, the area of aquatic plants has declined rapidly, affecting the ecological balance. However, the molecular mechanism of aquatic plants has been poorly studied.Results: In this study, all transcriptional information of T. hemprichii was obtained by using high-throughput sequencing technology, and 32,097 unigenes were identified by annotation. In addition, 119 MYB transcription factors were screened, and 61 genes with complete ORF were sequenced. Furthermore, 17 clays were identified according to the information of Arabidopsis .Conclusions: This study provides useful information for enriching the genetic information of T. hemprichii , and further exploring the molecular mechanisms of the evolution, development, and physiological functions of Sickle seagrass.


Background
Seagrasses are marine owering plants that form habitats in coastal areas. The seagrass can stabilize sediment, reduce water ow rate, and provide habitat for other aquatic animals and plants. Coastal seagrass habitat provides a wide variety of ecosystem services for marine life. Also, it is also a source of food for large herbivores, nursery for most sh and invertebrates, and plays an important role in the nutrition, nitrogen xation and physical and chemical properties of marine organisms [1][2][3][4]. In recent years, seagrass meadows have been lost, leading to marine ecological damage. Scientists have carried out extensive research on genetics, arti cial cultivation, and ecological restoration of different species of seagrass population worldwide [5][6][7].
Sickle seagrass, Thalassia hemprichii is a common seaweed species worldwide, especially biogeographic sub-region of the Indian Ocean stretching on a latitudinal scale from Somalia to the east coast of South Africa [1,8]. T. Hemprichii can be propagated sexually through seeds or asexually through rhizomes. The range of each component of reproduction has an important impact on local population statistics, diffusion, biogeography and genetic diversity [1]. T. hemprichii seeds seem to sink under the water for 24 hours and may oat ten to hundreds of kilometers for about a month on the water. Studies have shown that adult plants of T. hemprichii can oat for several months and remain alive, and are likely to settle in new areas [9]. Their mature stems and fruits are mainly distributed by passive drifting, and how to improve seagrass management in the western Indian Ocean (WIO) area to maintain high adaptability to the environmental changes. The healthy seaweed population is of great signi cance.
The MYB family is one of the largest transcription factor (TF) families, and it is named after a highly conservative sequence (MYB DNA-binding domain) located in the N-terminus of these proteins [10]. MYB TF regulates the physiological and biochemical processes, such as the development and growth of various plants, and the synthesis of avonol by participating in many physiological and biochemical processes such as petal morphogenesis and cell etc [11]. MYB TF plays a central role in the transcriptional regulation of plant secondary cell wall (SCW) deposition. SCW deposition and ligni cation have very important values for plant growth and development [12]. MYB family genes have been found in Arabidopsis, corn, soybean, maize,pineapple, bamboo and many other plants. [10,[13][14][15][16][17].
T. hemprichii is an important seaweed, and T. hemprichii meadow has signi cance in marine ecology. The transcriptome data of T. hemprichii were studied, and the similarity and classi cation of T. hemprichii MYB were compared with Arabidopsis MYB. The complete transcriptome information of T. hemprichii was obtained for the rst time. Also, the genes of the MYB gene family in T. hemprichii were analyzed.

Result
Transcriptome data assembly and acquisition In general, the high-quality sequence (6.47 GB) comes from the transcriptome sequencing of T. hemprichii. The average error rate of the sequences was 0.01%, and more than 87% of clean reads ( Figure 1A). Using Benchmarking Universal Single-Copy Orthologs (BUSCO) v3, a single copy homologous database, and the integrity of the transcriptome assembly was demonstrated by comparing with the conserved genes ( Figure 1B). The assembled sequence data for these raw reads were deposited in the Sequence Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/Traces/sra) (Accession no. SAMN13255716 and SAMN13255717). The sequencing data were assembled into 51,619 transcripts with the sequencing lengths ranging from 200 to 19,192 bases ( Figure 1C) (mean length = 1045 bases, N50 = 1798 bases, and GC content = 45.73%). And the longest Unigene is 5.40 Mb (53,976,312 bases). All the data show that the production capacity of the product is high enough and the assembly quality is very good, which can be used for subsequent analysis.

Gene annotation and functional classi cation
Seven databases (KEGG), Gene Oncology (GO), NR, NT, SwissProt SWISS-PROT , Protein families (Pfam), and EuKaryotic Orthologous Groups (KOG) were used to annotate the obtained genes. In total, 32,097 Unigenes (62.18% of the total Unigenes) were annotated in the databases in this study (Table 1). Transcoder software was used to identify the candidate coding regions (CDs) in Unigene. First, the longest open reading frame was obtained. Then, through Swissport database and blast in hmmscan, the homologous sequence of Pfam protein database was found and its CDs was predicted.. 24,870 CDS were predicted, with a total length of 26,466,750, N50 of 1,407 bases, length range of 297 to 10,956 bases and GC content of 49.09% ( Figure 1D). According to the results, 11,804 genes were annotated in all databases. Most of the unigenes get annotation information in the NR database (30,772 and 59.52%) ( Figure 2). Regarding to GO classi cation, the highest number of annotations was in the molecular function (MF) category, for which catalytic activity (6317), binding (6269) and transporter activity (676) were the top 3 GO terms; the secondlargest amount of annotations was in the cellular component (CC) category, the top 3 GO terms were membrane part (3948), cell (3746) and organelle part (1512); and the third-largest amount of annotations was in the biological process (BP) category, the top 3 GO terms were cellular process (3613), biological regulation (1404) and localization (879) ( Figure 3A). According to the classi cation of KEGG, the most annotated genes are metabolic related pathways, and the most annotated pathway is metabolic pathway (15,152). In this category, the Global and overview maps pathway enriched the most genes (5,855) ( Figure 3B). Furthermore, Classi cation of Unigene according to KOG. The top 3 classes were general function prediction(5,970), signal transduction mechanisms (3,334), and function unknown(2,071) ( Figure 3C).

Functional homology analysis of whole transcriptome
In the results of gene function annotation, NR database had 59.52% of the annotation proportion. The proportion of different species on the NR annotation was calculated, and the species distribution map was drawn ( Figure 4). The NR annotation of top 5 species were Elaeis guineensis (5,272 and 17.16%), Phoenix dactylifera (3,436 and 11.19%), Ananas comosus (1,896 and 6.17%), Nelumbo nucifera (1,756 and 5.72%) and Zostera marina (1,672 and 5.44%). T. hemprichii and E. guineensis had more closely related genes and were closer to the terrestrial plants.

Identi cation and distribution of MYB TFs encoding genes in transcriptome
The genes were predicted, which were capable of encoding transcription factors (TFs). Then we use getorf to detect the ORF of Unigene, and then useing hmmsearch to compare ORF with transcription factor protein domain (the data comes from TF), and then recognize the ability of UniGene according to the characteristics of transcription factor family described by TFDB. The TFs family was classi ed and counted (Additional les 1: Figure S1). There were 119 MYB genes family in all TF genes (Additional les 2: Table S1).
GO classi cation was used; the results showed that these MYB family genes were divided into 10 subclasses. The most annotated genes was in the molecular function (MF) category; binding (69), and transcription regulator activity (13). In the cellular component (CC) category, the 3 GO terms were cell (42), organelle part (2), and membrane part (1). In the biological process (BP) category, the 5 GO terms were cellular process (39), biological regulation (39), response to a stimulus (6), cellular component organization or biogenesis (2), and multicellular organismal process (2) ( Figure 5A). According to the KEGG classi cation, All MYB family genes were involved in 9 pathways, among which these pathway was Environmental adaptation (24 genes), Global and overview maps (9 genes), Biosynthesis of other secondary metabolites (4 genes), Transport and catabolism (4 genes), Lipid metabolism (3 genes), Translation Through comparative analysis, a set of 119 MYB Unigenes containing MYB DNA-binding domains was identi ed in T. hemprichii (Additional les 3: TableS 2), which included 67 full-length CDS and 52 fragment sequences. MYB Unigenes (ThMYB1-ThMYB119) were identi ed, respectively. The full-length sequence of MYB ranges from 588 to 3393. The amino acid sequence length of the protein is 148-1131 amino acids, the calculated molecular weight of ThMYB is 15.87-127.50 kDa, and the calculated pi is 4.34-110.01.
The majority of ThMYB protein is about 400 amino acids, and its molecular weight is about 50 kDa. Among the 119 ThMYBs, ThMYB042 was the longest protein (1,131 amino acids), while the shortest protein was ThMYB038 (148 amino acids).
In order to study and identify the characteristics of homologous domain of ThMYB protein, 119 amino acid sequences of ThMYBs were used for multiple sequence alignment. According to the domain classi cation, ThMYB protein contains type 11 domain, which shows that ThMYBs has similar domain with other species (Figure 6).
Putative functions of ThMYBs in T. hemprichii MYB TFs in A. thaliana contains 27 clades, and the function of each clades was annotated [14,[18][19][20][21]. The conglomerated homologous proteins usually have similar functions, which indicates that ThMYB has similar functions with AtMYB in the same clades. Therefore, through the conclusion and discussion compared with AtMYBs, the function of ThMYBs is predicted and summarized (Figure 7; Additional les 4: Table S3). Constructing NJ unrooted phylogenetic tree with 09 ThMYBs and 110 AtMYBs (Figure 7). The results showed that MYB TFs of the two species gathered in 36 clades (C1-C36), and 17 of which were found in both species.

Discussion
In this study, high-throughput sequencing technology was used to obtain the complete transcriptome information of T. hemprichii, and 32,097 unigenes were identi ed by annotation. In addition, we showed that there were more than 52 proteins with one domain, 43 proteins with two domains, and ThMYB095 protein contained three domains. Interestingly, it was found that 7 proteins did not have typical MYB domains, but they were still annotated as MYB genes. Enable ThMYB042 contained a Myb_Cef domain. ThMYB004, ThMYB011, and ThMYB086 contained a SWIRM domain. ThMYB087 and ThMYB019 contained a ZZ domain, ThMYB053 and ThMYB022 contained a DnaJ domain, ThMYB098 contained a DMAP1 domain, respectively.
And we found that 60 ThMYBs were clustered into 17 clades of known functional annotation, and 48 ThMYBs were clustered into 9 clades of unknown functional annotation. According to the annotation results, ThMYBs can be divided into six functional classes.There are six clades in class I (C1, C12, C14, C16, C22 and C32), which regulate the biosynthesis and deposition of lignin, cellulose and hemicellulose, and are responsible for the formation of secondary cell wall (SCW). Class II includes ve clades (C2, C6, C13, C28 and C34), which can regulate the ABA pathway to participate in the response of biotic and abiotic stresses. The third category includes ve clades (C4, C20, C25, C29, C31), which play an important role in morphogenesis and organogenesis of root, anther, embryogenesis, epidermal cell, vegetative cell and stomatal cell development, etc. The fourth group includes two clades (C7, C17), which are mainly involved in the regulation of secondary metabolism. Class IV consists of two clades (C4, C31), which are mainly involved in the regulation of secondary metabolism.
Seagrasses are important marine ecological plants that provide sustainable and reliable guarantees for the health status of marine ecology. Seagrasses play important roles in cleaning water, producing oxygen, xing nitrogen, providing food for aquatic animals and maintaining seabed stability. In recent years, the reduction of aquatic plants has seriously affected the ecological stability of the marine ecosystem. The studies have carried out extensive research in various elds, such as physiology, biochemistry, genetic breeding, and geographic pedigree to keep seabed stability by the arti cial restoration of seagrass. T. hemprichii is a common waterborne grass. It grows fast and can be spread by ocean currents. It is rich in nutrients and is a food of many herbivores.

Conclusions
At present, there are a few studies on the molecular biology and genomics of T. hemprichii. Here, we obtained the full transcriptome information of T. hemprichii through assembly prediction. A total of 32,097 unigenes were also obtained. This provides useful genetic information for the further study of the evolution, development, molecular characteristics and molecular breeding of T. hemprichii. Besides, 119 MYB transcription factors were obtained by screening the whole transcript. Therefore, it provides a molecular basis for a better understanding of T. hemprichii.
The similar number of genes of T. hemprichii and E. guineensis, P. dactylifera, and A. comosus have been found, which indicates that T. hemprichii is closer to terrestrial plants. In the evolutionary process, T. hemprichii might migrate from land to water, which is worthy of being studied in the future.

Methods
Collection of plants T. hemprichii was collected at three separate occasions (10 days before the start of an experimental run) from March 2019 at the Lingshui Xincun port seagrass special protected area, Hainan, China. Xincun port (21.97km 2 ) is about 4km long from north to south, and the port door is from Xin cun jiao (18°24′42′′N and 109°57′58′′E) to Shitoucun Shazui (18°24′34′′N and 109°57′42′′E). The selected grassland is located in the shallow water area (0.5 -1 m deep) on the coast of Xincun Lingshui, about 5 km northwest of Xincun Lingshui.. An iron shovel was used to dig up the ground, and the lower part of T. hemprichii together to ensure the integrity of the roots, rhizomes, erect stems, and leaves, and clean the shore with seawater. The seagrasses were packed in tissues wetted with seawater, placed inside plastic bags and transported within 48 h to the the Molecular Biology Laboratory in Hainan Academy of Ocean and Fisheries Sciences. We identify the collected samples by morphological identi cation in order to ensure the accuracy of the samples. The whole plant packed in the sealed pocket and brought to the laboratory. After cleaning, the liquid nitrogen treatment was carried out for 15 min and stored at -80 °C. We identify the collected samples by morphological identi cation. According to the website of World Register of Marine Species, Petermanns Geogr named the Thalassia hemprichii in 1871 (Aphia ID: 208931), and it was described as follows: Rhizomes terete, with persistent leaf sheaths. Leaves curved, 6 12( 40) cm×4 8 mm. Peduncle of male in orescence 2 3 cm, female in orescence without peduncle; spathe linear. Male ower on a pedicel 2 3 cm; perianth segments elliptic, petaloid; anthers oblong; female ower with ovary of 6 carpels; stigmatic branches 1 1.5 cm. Fruit greenish, 2 2.5×1.8 3.2 cm.
After identi cation by associate fellow Shiquan Chen and proofreading by Associate Senior Engineer Zefu Cai, we thought that the sample was Thalassia hemprichii.. The "Plant Sample Collection Information Record Form" is attached.
RNA extraction, library construction and high-throughput sequencing RNA of each sample was extracted using the TRIzol extraction method and Puri cation of RNA products using RNeasy® MinElute® cleanup kit (Qiagen®). RiboZeroTM Magnetic (Plant leaf) kit (Epicentre®) was used to remove rRNA from 4 μg of each sample. The removal of rRNA was con rmed using a Agilent 2100 BioAnalyser system (Agilent Technologies).
After extracting the total RNA from each sample, Oligo DT magnetic beads were used to enrich mRNA and fragment buffer was used to split mRNA into short fragments. Using mRNA as template, the rst cDNA strand was synthesized with six base random primers, and buffer was added. which was end-repaired after purifying by Qiaquick PCR puri cation kit and eluting with EB buffer, plus oya and ligation of the sequencing link. The agarose gel electrophoresis was done for fragment selection. Finally, The constructed library was sequenced with bgi-500 BGI.
Transcriptome assembly, gene annotation and ontology Short read assembly software, SOAPdenovo, which was used to perform transcriptome assembly from scratch [22]. SOAPdenovo rst concatenated reads with a certain length of overlapping fragments. These N-free assembly fragments obtained from the overlapping reads are called contig. Then, the reads back to contig were compared. Different contigs from the same transcript and the distance between these contigs were determined. The contigs were linked together. The paired-end reads were used to make a hole in the Scaffold (N, unknown intermediate sequence), and the sequence with the least N was identi ed, and there was no longer extension at both ends, which is called Unigene. Unigene assembled from different samples, which was further sequenced and de-duplicated by sequence clustering software to obtain the longest non-redundant Unigene. Finally, the Unigene sequences were aligned with no redundant (Nr) protein, Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes (KEGG), and Clusters of Orthologous Group (COG) databases (E-value <0.00001), and the best-aligned protein was determined for the sequence orientation of Unigene. If there was a contradiction between different libraries, then the sequence direction of Unigene was determined according to the priority of Nr, Swiss-Prot, KEGG, and COG. ESTScan software was used to predict the coding region and determine the direction of the Unigene, which is incomparable with the above four libraries (Iseli et al., 1999).

Phylogenetic analysis and ThMYB protein domain prediction
To examine the evolutionary relationships among MYBs in T. hemprichii 119 sequences were used. Also, the MYB sequences of Arabidopsis were downloaded from The Arabidopsis Information Resource (TAIR) (http://www.arabidopsis.org). We constructed a neighbor-joining (NJ) phylogenetic tree of ThMYB and AtMYB amino acid sequences by using the method of ClustalW in MEGA software (version X) with the following parameters: pairwise deletion, bootstrap analysis with 1,000 replicates and Poisson correction,. Additionally, the protein domain of ThMYBs was predicted according to the bio sequence analysis using pro le hidden Markov models (HMMER) (https://www.ebi.ac.uk/Tools/hmmer).

Declarations
Ethics approval and consent to participate The plant material used in our study has been generated by ourselves by the methods described.

Consent for publication
Not applicable.

Availability of data and materials
The datasets used and/or analysed during the current study are available in the supplemental information les or from the corresponding author on reasonable request.

Additional Information
Additional les1: Figure S1. Classi cation and statistics of different TF family genes.
Additional les3: Table S2. Nomenclature and protein information of MYBs in Thalassia hemprichii.
Additional les4: Table S3. The functions of ThMYBs were predicted and summarized by conclusion and discussion compared with those of AtMYBs Figure 1 Unigene assembly and prediction. A. The sequences after insert recognition (Reads Of insert, ROI) processing can be divided into four categories; B. Correspondence before and after Isoform clustering. The quality of assembled transcripts was assessed using a single copy ortholog database BUSCO Gene comparison, to some extent, illustrates the integrity of transcriptome assembly; C. Classi cation statistics of UNIGENE gene with different length intervals; D. Classi cation statistics of different length intervals of CDS sequences.   According to the results of NR, the proportion of different species on the annotation is calculated. According to the annotation results of the NR database, the proportion of different species on the isoform annotation is counted, and the species distribution map is drawn  Phylogenetic relationships and protein domain predict. The amino acid sequences of 119 ThMYBs were aligned by the Clustal W program in MEGA, and the phylogenetic tree Was constructed by the NJ method with 1,000 bootstrap replicates. Bootstrap values >50

Figures
were indicated on the nodes. Different subgroups were marked with alternating tones of a gray background to make subgroups identi cation easier.

Figure 7
Putative functions of the MYB proteins in T. hemprichii based on the phylogenetic tree. Along with MYBs from Arabidopsis. The circular unrooted tree was generated by NJ method with 1,000 bootstrap replicates.