3.1.Identification of USP Genes in T. aestivum
Blastp search revealed, a total of 107 USP genes in the T. aestivum genome. This search was made using the OsUSP genes. Moreover, due to their presence on the ABD genome, they were sub-grouped further. The coding DNA sequence’s length ranged from 186 bp to 6498 bp; (TaUSP13(A), and TaUSP16(B) respectively. The length of protein ranged from 121aa to 844aa (TaUSP7(D), TaUSP27b(D) respectively). Molecular weight of these proteins varied greatly from 11,495.23 g/mol to 91,760.08 g/mol (TaUSP42(D), TaUSP8(A) respectively. The average pI of these genes was 7.1098. Supplementary Table 1
3.2. Multiple Sequence Alignment and phylogenetic evolutionary analysis and nomenclature of USP genes of Triticum aestivum.
Multiple sequence alignment of the 107 USP genes gave insight into the conserved amino acids present in these sequences. A total of 34 USPs were identified comprising the 107 sequences identified by the BLASTp search. This phylogenetic analysis revealed that not all OsUSPs identified in rice were present in B. distachyon and T. aestivum. For instance, i.e., OsUSP10, and OsUSP42 identified in T. aestivum were not present in B. distachyon. Similarly, no USP genes were identified in both plants against OsUSP2, OsUSP6, OsUSP23, OsUSP39, OsUSP40, and OsUSP4.
The phylogenetic analysis revealed that the tree was divided into two distinct clades. This classification was done based on the domain architecture. USP genes contain either a single USP domain or a USP domain with an additional kinase domain belonging to the major classes of protein kinase. TaUSP10, TaUSP27(a, b), TaUSP34 and TaUSP35 also had an extra U-box domain in addition to USP and Protein kinase domain. The study also showed that not all the genes are present in all three wheat genomes ABD. The sequences are present on the either AB, AD, or BD genome. Figure 1.
3.3. Gene structure and motif Analysis and amino acid content of USP of wheat
Identification of the conserved motifs was done using MEME online server. There were 10 motifs identified. The information on these motifs is present in Supplementary Table 2. There are either two, 7, or 10 motifs collectively present in the genes. Motifs 1 and 2 are the characteristic feature of the USP domain whereas 3–10 belong to the protein kinase and U-box domain. All the USPs belonging to Class A had 1–2 motifs whereas Class B had 6–10 motifs in total. This confirms the presence of a single USP domain and a protein kinase domain along with a U-box domain in these proteins. Figure 2a
The gene structure of all the identified TaUSPs in the wheat were analyzed through the GSDS webserver. To have a better understanding the exon and intron distribution was studied. A moderate variation is seen in the distribution of these introns and exons. For introns, the distribution was from 1 to 10 with TraesCS1A02G145700.1, TraesCS1D02G144300.1, belonging to TaUSP37A and TaUSP37D have the least introns that is one. The sequences that contain USP and protein kinase domain had relatively more introns as well as exons. The architectural similarity of the exon-intron distribution indicates that these are also similar at the protein level. The exons were also distributed from as low as one to a maximum of 10. TaCS4B02G228200.1 and TaCS4D02G229100.1 have the highest number of exons 10 and these two belong to the TaUSP16A and TaUSP16D. Only TaUSP1A has 1 exon. The presence of 5’UTR and 3’UTR regions suggests their role in post-translational modifications. Figure 2b.
Moreover, the amino acid (aa) composition of these USPs indicates that USPs with the same domain pattern have similar amino acid content whereas, the USPs with different domain patterns have a slight difference in the amino acid composition. This can be checked by looking at the amino acid composition of TaUSP3 and TaUSP35 both have different domain patterns i.e., the former contains a single USP domain whereas, the latter has an additional domain along with USP. Hence, they also have different compositions. The results of gene structure, motifs, and amino acids indicate consistency of the domain architecture present. Supplementary Fig. 1.
3.4. Chromosomal mapping and gene duplication analysis
The genes are distributed variably on each of the seven chromosomes of the ABD genome of wheat. For instance, Chromosome 5A has the highest number of genes i.e., 9, 5B has 7 and 5D has 10. Similarly, 6A 6B, and 6D with 7, 9, and 7 genes, respectively. Whereas chromosomes 7A, 7B, and 7D have only 1 gene Fig. 3a Supplementary Table 1. These results indicate that all genes might not be present on all three genomes of the wheat, and they might have been lost in the process of evolution. The existence of more than 1 gene on certain chromosomes indicates that duplication events have occurred over time that gave rise to these genes on the chromosomes in separate locations. The duplication analysis revealed that 15 gene pairs had been duplicated in which only 1 tandem duplication and the rest of the 14 gene pairs were segmental duplications Fig. 3b. Divergence time is from 1.51 MYA to 5.01 MYA calculated by the Ka to Ks ratio. As well as the selection pressure was found to be negative as the ka/ks ratio was below 1. Synteny analysis among rice and wheat showed high conservancy and identified several paralogs. Figure 3c
3.5. Gene Ontology (GO) annotation and Subcellular localization
Predicting the subcellular localization of the identified proteins revealed that these genes show expression in many compartments of the cell performing a variety of functions. The results revealed that these proteins are localized within the cytoplasm, mitochondria, endoplasmic reticulum, cytoskeleton, chloroplast, and nucleus. They are also present in the extracellular matrix, and some of them are also present in peroxisomes that indicate their function in anoxic conditions and peroxidase activity. The intensity of the heat map indicates that these genes are strongly localized within the cytoplasm. Their presence in the Golgi complex, on the nuclear plasma membrane, vacuoles elucidated their enormous functioning in the various biological, molecular, and cellular processes. Figure 4
Functional annotations were checked through the web server PANTHER. The different Go ontologies based on cellular processes, biological processes, and molecular functions were predicted. A total of 29 out of 107 TaUSP genes were mapped to eight biological processes as protein phosphorylation (GO:0006468), phosphorylation (GO:0016310), phosphate-containing compound metabolic process (GO:0006796), phosphorus metabolic process (GO:0006793), cellular protein modification process (GO:0006464), Protein modification process (GO:0036211), macromolecule modification (GO:0043412) and no sequence found a match with biological regulation (GO:0065007). Figure 5a
Furthermore, talking about the cellular processes, out of 107 sequences were mapped with the PANTHER IDs. These were mapped to 9 cellular processes as cytoplasm (GO:0005737), intracellular anatomical structure (GO:0005622), nucleus (GO:0005634), cellular anatomical entity (GO:0110165), cellular component (GO:0005575), intracellular membrane-bounded organelle (GO:0043231), membrane-bounded organelle (GO:0043227), intracellular organelle (GO:0043229), organelle (GO:0043226), whereas 67 sequences were categorized as unclassified. Figure 5b
Lastly, talking about the molecular functions, sequences were mapped to functions like AMP binding (GO:0016208), protein kinase activity (GO:0004672), phosphotransferase activity, alcohol group as acceptor (GO:0016773), kinase activity (GO:0016301), transferase activity, transferring phosphorus-containing groups (GO:0016772), adenyl ribonucleotide Binding (GO:0032559), adenyl nucleotide binding (GO:0030554), purine ribonucleotide binding (GO:0032555), purine nucleotide binding (GO:0017076), ribonucleotide binding (GO:0032553), carbohydrate derivative binding (GO:0097367), ATP binding (GO:0005524), purine ribonucleoside triphosphate binding (GO:0035639), nucleotide binding (GO:0000166), nucleoside phosphate binding (GO:1901265), anion binding (GO:0043168), small molecule binding (GO:0036094), catalytic activity, acting on a protein (GO:0140096) and the last one being transferase activity (GO:0016740). Figure 5c
3.6. Cis-regulatory elements analysis of USP promoters of wheat
The promoter sequences downloaded from the database were used to identify different cis-regulatory elements (CREs) which might control the activation of genes under certain conditions. The analysis revealed that a total 7 types or classes of CREs are present in the sequences such as light-responsive elements (LREs), hormone-responsive elements (HREs), development-related elements, promoter elements, abiotic stress-related elements, biotic stress-responsive elements, and then was the 7th class of elements with functions still unknown Fig. 6a. The result of each class is discussed below.
3.6.1. Promoter related elements
There are 7 identified promoter-related elements present in these genes. These are CAAT-box, TATA-box, AT~TATA-box, A-box, AT-rich element, TAAT, unnamed-1. One of the most abundant CREs is located upstream to start codon: TATA-box and CAAT-box at -35 and − 10 positions respectively also called core promoters. In the present sequences, the CAAT-box is the most abundant. Figure 6b.
3.6.2. Light responsive elements
Light responsive elements (LRE) identified were G-box, TCT-motif, AE-box, GATA-motif, Sp1, GT1-motif, Box 4, ACE, LAMP-element, I-box GA-motif, Gap-box, ATCT-motif, 3-AF1, L- box, chs-CMA1a, chs-CMA2, chs-Unit 1 m 1, AT1-motif, ATC-motif, TCCC-motif, Pc-CMA1c, GTGGC-motif. These are supposed to activate different photosystems and hence give a response to light. The results also indicated two or more elements of these classes are near each other, which indicates that more than one element is required for the activation of these promoters. G-box is the most abundant LRE present in the sequences. The presence of light-responsive elements in these sequences strongly tells that these genes might be involved in the activation of pathways regulated by light. Figure 6c.
3.6.3. Hormone-related elements
The upstream regions of the USPs also contained the regulatory elements related to the hormones. A total of 18 hormone-related elements belonging to 6 different classes have been identified. Class 1 is abscisic acid (ABA) related: ABRE, ABRE3a, ABRE4, AT ~ ABRE, class 2 is auxin-related: AuxRE, AuxRE-Core, TGA-box, TGA-element, CGTCA-motif (jasmonic acid), JERE and TGACG-motif are included. Class 4 belongs to salicylic acid (SA): TCA, TCA-element, and salicylic acid-responsive elements (SARE), class 5 is gibberellic acid: P-Box, TATC-box, and gibberellic acid responsive element (GARE-motif) and the last class 6 is the ethylene (ETH) which has ethylene responsive elements (ERE). The results indicate that these proteins also get activated by the change in hormones and affect these pathways giving responses according to them. Figure 6d.
3.6.4. Development-related responsive elements
A total of 25 development related responsive elements identified are AAGAA-motif, CCGTCC-box, AC-I, AC-II, as-I, CGGTCC-box, circadian, CAT-box, dOCT, E2Fb, F-box, GCN4-motif, HD-zip 1, HD-zip 3, MSA like, NON, CARE, re2f1, RY element, NON-box, O2 Site, Unnamed__8, Unnamed__10, Unnamed__12 and Unnamed__14. These factors play a distinguished role in the cellular development process including the cell cycle and the cell proliferation pathways. Some of the genes might also control circadian pathways, and the pathways involved in zein metabolism. These motifs also indicate that these might play role in tissue-specific expression of genes related to the developmental process. Figure 6e.
3.6.5. Abiotic stress-responsive elements
A total of 18 out of 113 elements were identified as abiotic stress-responsive elements. CCAAT-box, Drought responsive elements DRE core, DRE 1, GC-motif, LTR; low-temperature response also called response to cold, MBS, MBS 1, MYB along with its binding and recognition sites, MYC, MYB like the site, STRE is for low pH, osmotic pressure, AT-rich sequence and ARE. Figure 6f.
3.6.6. Biotic stress-related elements
Among 113 sequences 4 biotic stress-related elements were identified and participated in wound healing and response to pathogen attack. Box S, W-box, WRE 3, and WUN-motif are identified motifs. Figure 6g.
3.6.7. Unidentified
The last class contains the unnamed unidentified sequences with no known function but are present in quite a number. They are Unnamed__2, Unnamed__4, Unnamed__6, Unnamed__16. Unnamed__4 with a motif sequence CTCC is the most abundant. Figure 6h.
3.7. Identification of Glycosylation and phosphorylation sites in TaUSP genes
Phosphorylation is one of the significant post-translational modifications and plays a key role in the activation, deactivation, and regulation of cellular pathways. In this process, Protein kinases phosphorylate amino acids such as serine, threonine, and tyrosine. A substantial number of phosphorylated sites are predicted in 107 sequences. TraesCS7A02G084400.3(TaUSP27) has the highest number of phosphorylation sites and TraesCS1D02G108300.1 has the lowest sites 122 and 5, respectively. Supplementary Table 3.
Glycosylation is another post-translational modification that helps proteins in proper folding and gives them their characteristics and functionality and plays role in the stability of these protein structures. This study has revealed that 81 glycosylation sites in 19 out of 34 TaUSPs have been predicted. The highest glycosylation sites were present in TaUSP10 i.e., 7; followed by TaUSP32 with 6 sites, TaUSP27a and TaUSP35 with 4 sites, TaUSP8 and TaUSP27b with 3 sites, TaUSP4 and TaUSP14 with 2 sites while the rest 11 TaUSP have only 1 glycosylation site. N-glycosylation scores of more than 0.5 and a jury score of 9/9 indicate high specificity to glycosylation event and predict that protein might have a stable glycosylated mediated structure Supplementary Table 4.
3.8. Protein modeling, disordered region analysis, and prediction of binding sites
Three-dimensional structures of four TaUSP genes TaUSP4, TaUSP10, TaUSP21, and TaUSP30, chosen from the two groups, were modeled, using the crystal structure of human IRAK1 (PDB_6BFN.1. A), the crystal structure of USP from Arabidopsis Thaliana At3g01520 (PDB_2GM3.1. A) and a hypothetical protein (PDB_1MJH.1. A) respectively. Secondary structure of all these proteins was predicted using the SOPMA webserver and α−helices, β−sheets, extended strands, and random coils were observed between 31.91%-46.3%, 4.59%-6.63%, 9.85%-20.48%, and 37.22%-45.91% respectively. Supplementary Table 5.
The TaUSP4 and TaUSP30 have 19% and15.2% disordered region whereas, TaUSP10 and TaUSP21 have no disordered region and solely contains the functional domains. The predicted protein models were validated by the Ramachandran plot in Supplementary Figs. 2, 3, 4, and 5. The favored regions were above 90%. Supplementary Table 6.
Two binding pockets were predicted in the models plotted. The binding pockets contained several functional amino acids such as alanine (Ala), aspartate (Asp), asparagine (Asn), cysteine (Cys), glutamine (Gln), glycine (Gly), histidine (His), isoleucine (Ile), leucine (Leu), serine (Ser), threonine (Thr), valine (Val), tryptophan (Trp) and tyrosine (Tyr) Fig. 8a, b, c, and d. Supplementary Table 7. The presence of the serine and threonine residues in the binding pockets shows that these might be the sites for phosphorylation and the presence of asparagine give insight into N-glycosylation sites.