Version 4.3-12/08/20 Cotton bZIP Transcription Factors: Characterization of the bZIP Family From Gossypium Hirsutum, Gossypium Arboreum and Gossypium Raimondii

Cotton is an important commodity in the world economy. In this study we have carried out genome-wide identication and bioinformatics characterization of basic leucine zipper domain proteins (bZIPs) from cultivated cotton species G. hirsutum along with two subgenome species of allotetraploid cotton, G. arboreum and G. raimondii. Transcription factors (TFs) are the key regulators in plant development and stress adaptation. Understanding interactions of TFs in cotton crop is important for enhancing stress tolerance and yield enhancement. Among plant TFs, bZIPs plays a major role in seed germination, ower development, biotic and abiotic stress response. Most of the bZIP proteins from cotton remains uncharacterized and can be utilised for crop improvement. In this paper we performed genome-wide identication, phylogenetic analysis, structural characterization and functional role prediction of bZIPs from all three genome species of cotton. In the present study genome-wide identication, phylogenetic analysis, structural characterization and functional role prediction of bZIP TFs from G. hirsutum along with two species and were performed. A total of 228 bZIP genes of hirsutum, 91 bZIP genes of and 86 genes of were from CottonGen database. Cotton bZIP genes were in according their Arabidopsis bZIPs. Multiple genes bZIP were observed in Cotton across all 13 chromosomes Phylogenetic all into eight raimondii bZIPs will be useful in further basic characterization of bZIP proteins of cultivated cotton species G. hirsutum. STRING protein interaction analysis of G. raimondii bZIPs resulted in prediction of interactions among A- D, B-K and C-S subfamily members. Phylogenetic analysis of this study will certainly help in the selection of specic cotton bZIP genes according to the close alignment with Arabidopsis orthologs or subgenome homolog.


Introduction
Cotton is an important commodity in the world economy. Approximately 32.93 million hectare was under cotton cultivation around the globe in 2019, with the highest production of 5.77 million metric tons in India followed by US and China (http://ministryoftextiles.gov.in; https://www.statista.com). The textile industry is a signi cant contributor to the nation's economy and employment, the global textile market share was USD 961.5 billion in 2019 (https://www.grandviewresearch.com).
Functional genomics is a key approach for identifying genes for important traits like yield, biotic-abiotic stress tolerance and ber quality. Cultivated G. arboreum, A subgenome species, and its counterpart, non-spinnable ber producer G. raimondii, D subgenome species are two progenitor species of cultivated allotetraploid cotton, G.hirsutum (Li et al 2014). Estimated genome size of G. hirsutum is ~2305.2Mb . Estimated genome size of G. arboreum is ~1746Mb which is around two fold larger than G. raimondii, ~880Mb (Hendrix et al 2005). Genetic information related to TFs of the two important progenitor species will play major role in the improvement of cultivated allotetraploid cotton. Wang et al. 2012 identi ed 2,706 transcription factors in G. raimondii which includes 208 bHLH and 219 MYB class genes which were preferentially expressed in ber (Wang et al. 2012) . Kushanov et al. 2016 developed CAPs and DCAPs markers for the PHYA1, PHYB and HY5 genes which are associated with the ber quality and owering time traits (Kushanov et al. 2016). Over-expression of G. hirsutum bZIP transcription factor, ABF2 (bZIP36) resulted in improved drought and salt tolerance in cotton and Arabidopsis (Liang et al. 2015).
TFs are the key regulators in plant development and stress adaptation. On the basis of conserved DNA binding domains, TFs are classi ed into different families among which bZIP-TFs play a major role in seed germination, ower development, and stress response. bZIP proteins speci cally bind to DNA with an ACGT core, A-box (TACGTA), C-box (GACGTC) and G-box (CACGTG) motifs and binding speci city is regulated by anking nucleotides (Jakoby et al. 2002). The bZIP domain carries two structural components located on a contiguous alpha-helix, the rst one is a ~16 amino acid residues containing nuclear localization signal, followed by a DNA binding domain invariant N-X7-R/K and the second one is present exactly nine amino acids towards the Cterminus which is a leucines (L) heptad repeat or other bulky hydrophobic amino acids creating an amphipathic helix, L-X6-L-X6-L (Jakoby et al. 2002). In Arabidopsis bZIPs, variation observed in leucine repeats, subfamily 'D' bZIPs contains three repeats whereas; more than eight repeats are present in subfamily C and S (Wolfgang et al.2018).
So far, ZIP proteins are classi ed into 9 to 13 subfamilies in different plant species based on structural and functional characteristics. Marc Jakoby et al. 2002 rst classi ed Arabidopsis thaliana bZIP TFs into 10 subfamilies A to I and S. These groups were named with letters considering their family members, for example A for ABF/AREB/ABI5, B for big protein size, C for CPRF2-like, G for GBF, H for HY5, and S for small protein size (Jakoby et al. 2002). The classi cation of Arabidopsis bZIPs is updated by Wolfgang et al. 2018, where the authors divided 78 bZIP proteins of Arabidopsis in 13 subfamilies A to I, J, K, M and S (Wolfgang et al. 2018). Wang et al. 2019 classi ed Arachis bZIPs into nine subfamilies A, B, C, D, G, H, I, S and U . Nijhawan et al. 2008 classi ed Oryza bZIPs into 10 groups A to J (Nijhawan et al. 2008).
If we look at the cotton bZIP protein characterization, Zhang et al. 2018 identi ed 159 bZIP genes from G. arboreum and classi ed into 13 groups (Zhang et al. 2018). Azeem et al. 2020 identi ed 87 bZIP genes of G. arboreum and 85 bZIP genes of G. raimondii from NCBI and classi ed into 11 subfamilies. Recently Wang et al. 2020 identi ed 207 bZIP genes of allotetraploid cotton G.hirsutum and classi ed into 13 subfamilies, A to I, S, M, K and J, having major contribution from subfamily 'A' and 'S' bZIPs (Wang et al. 2020).
Emphasising their functional importance, bZIP protein family in plants play crucial roles in developmental and stress responses. Abscisic acid responsive element binding factors (ABFs) from bZIP subfamily 'A' are activated by ABA signalling and are involved in downstream gene regulation in conjunction with stomatal closure mediated adaptation in drought stress (Sirichandra et al 2010, Yoshida et al.2015. Wolfgang and Christoph described bZIP family 'C' genes (bZIP9, bZIP10, bZIP25, bZIP63) and S1 genes (bZIP1, bZIP2, bZIP11, bZIP44, bZIP53) interaction network in response to nutritional starvation and metabolic adaptation under nutritional stress (Wolfgang et al. 2018). TGA or subfamily 'D' , bZIP family proteins binds to the TGACG consensus sequences, to form homo and hetero-dimer. TGA family proteins are predominantly involved in systemic acquired resistance through SA mediated interaction with NPR1 (Fan et al.2002, Fu et al. 2013. TGA family member PERIANTHIA (PAN/bZIP46) plays a role in ower development, interact with ROXY-type GRX (CC-type glutaredoxin, GRX-ROXY1) gene to regulate petal development (Gatz 2013, Gutsche et al. 2017). Recently, Ullah et al 2019 studied soybean TGA family genes and differentiated legumes-speci c TGAs structures, involvement in legumes-speci c biological processes like legumes-rhizobia symbiotic nodulation and predicted soybean bZIPs role in response to nitrogen.
The bZIP TF family from cotton yet to be fully understand and explored. Thus, this paper attempts phylogenetic analysis, structural characterization and functional role prediction of bZIPs from cultivated cotton species G. hirsutum along with two subgenome species G. arboreum and G. raimondii.

Materials And Methods
A total of 228 bZIP genes of G. hirsutum, 91 bZIP genes of G. arboreum and 86 bZIP genes of G. raimondii were identi ed from CottonGen database using bZIP domain as a query as well as Arabidopsis bZIP protein sequence as a query or through keyword search to recent genome assembly of Gossypium hirsutum (AD1) 'TM-1' genome UTX_v2.1 (published article in may 2020), Gossypium raimondii JGI proteins and Gossypium arboreum BGI proteins (https://www.cottongen.org/). Arabidopsis-bZIP protein sequences were downloaded from TAIR (https://www.arabidopsis.org/). Cotton bZIP genes were annotated on the basis of E-value, bitscore value and percent identity resulted from Arabidopsis-bZIP match. All the data regarding gene ID, genomic position, exon number, CDS length, protein length, protein sequences and Arabidopsis match ID with E-value are recorded in supplementary le 1.
Phylogenetic analysis of 227 bZIP genes of G. hirsutum, 56 bZIP genes of G. arboreum and 57 bZIP genes of G. raimondii along with 73 bZIP genes of Arabidopsis was performed. Duplicated genes from G. arboreum and G. raimondii were excluded from phylogenetic analysis and selected rst representative gene. For example GabZIP14 is having two entries GabZIP14-1 and GabZIP14-2 so GabZIP14-1 is selected for analysis and designated as GabZIP14. Phylogenetic tree was constructed by maximum likelihood phylogeny method, Dayoff (PAM) protein substitution model and performing 100 bootstrap, using MEGA X (https://www.megasoftware.net/). MEME (http://meme-suite.org) motif search analysis was done for all identi ed bZIP genes, 228 bZIP genes of G. hirsutum, 91 bZIP genes of G. arboreum and 86 bZIP genes of G. raimondii , selecting 10 motifs identi cation option and any number of repetition in case of distrubution . Identi ed motifs were evaluated using SMART (http://smart.emblheidelberg.de/), Pfam (https://pfam.xfam.org/search/sequence), Interpro (https://www.ebi.ac.uk/interpro/) and TOMTOM. MAST le of all searches were submitted along with supplementary documents.
To understand the interaction network of G. raimondii bZIP proteins, STRING (https://string-db.org/) analysis was performed. Network was interpreted for strong protein interactions, sharing functions among themselves with highest edge con dence (0.900), which can be correlated by thickness of the edges.

Results
G. hirsutum, G. arboreum and G. raimondii bZIP protein classi cation Phylogenetic analysis of 227 bZIP genes of G. hirsutum, 56 bZIP genes of G. arboreum and 57 bZIP genes of G. raimondii along with 73 bZIP genes of Arabidopsis was performed. The phylogenetic analysis indicates that G. hirsutum, G. arboreum and G. raimondii bZIPs are closely related to Arabidopsis bZIPs and are conserved among subgenomes. Figure 1A shows phylogenetic tree construction. Bootstrap percentage values are mentioned on each branch. Analyzed G. hirsutum, G. arboreum and G. raimondii bZIP genes are classi ed into 12 subfamilies A B, C, D, E, F, G, H, I, J, K and S and 8 subgroups A1, A2, A3, C1, C2, S1, S2 and S3.
Subgroup classi cation was done according to clade separation within subfamily and predicated functional similarities. Further this classi cation is supported by Arabidopsis-bZIPs alignment, G. raimondii BRLZ domain alignment, exon-intron numbers, protein length and conserved motif identi cation ( Figure 1 and Table 1, 2, 3). For the ease of understanding, in the following elaborated subfamily wise description bZIP proteins from G. hirsutum, G. arboreum and G. raimondiicollectively named as Gossypium bZIPs.
Subfamily B contain endoplasmic reticulum (ER) stress response transcription factor bZIP17, involved in salt and osmotic stress resistance in Arabidopsis (Liu et al. 2008) and observed interacted with subfamily K member bZIP60 in STRING analysis, which is also involved in ER stress response.
Subfamily C, subgroup C1 contains GossypiumbZIP9, subgroup C2 contains GossypiumbZIP10, 25 and 63, reported to form hetero dimer with S1 subfamily bZIPs , are involved in anther development, positive regulation of seed maturation and response to starvation Wolfgang et al. 2018).
Subfamily E of bZIPs functioning in cell wall and pollen development (Gibalova et al. 2009 and contains GossypiumbZIP34, 61 and 76. In Arabidopsis thaliana researchers discovered that a conserved proline residue in the third heptad region of leucine zipper of AtbZIP34 and AtbZIP61 interferes with the formation of homo-dimer whereas change of proline by an alanine in the above mentioned region can form homo dimer and bind to the G-box element (Shen et al. 2007). A conserved proline residue which interferes homodimer formation is also found in the third heptad region of leucine zipper of GossypiumbZIP34 and 61 genes except in GrbZIP61-1 which is carrying serine instead of proline, refer supplementary le 1 for alignment. It is indicated that bZIP34 and bZIP61 is involved in the hetero-dimer formation with I and S subfamily members of bZIPs which are function in vascular development (Shen et al. 2007).
Subfamily F which is well known to be involved in adaptation to zinc de ciency in Arabidopsis, wheat and barley contains GossypiumbZIP19 Conserved motif analysis of G. hirsutum, G. arboreum and G. raimondiibZIP proteins Signature bZIP domain is con rmed in all identi ed bZIP proteins of G. hirsutum, G. arboreum and G. raimondii. MEME motif analysis deciphering the presence Abscisic acid insensitive 5-Like protein 5-related motif in G. hirsutum subfamily 'A' and 'J"

members. Sterile alpha motif (SAM)/Pointed domain which is involved in protein-protein interaction is identi ed in subfamily 'A'
bZIPs of all three species. G. raimondii subfamily 'A' bZIPs are carrying MYND-Zinc binding domain which is involved in proteinprotein interaction in the transcriptional regulation context. G. arboreum Subfamily 'A' bZIPs, GabZIP18-3 and GabZIP11-3 genes are carrying GluR7, glutamate receptor domain from GluR proteins known to be function in light signal transduction and calcium homeostasis.The presence of two DOG1-seed dormancy control motif and TGA2 like motif is identi ed in all three gossypium species 'D' subfamily bZIP proteins. Exclusively G. raimondii subfamily 'D' bZIPs are carrying the RPA interacting motif, which is conserved in eukaryotic DNA repair and replication proteins, presence of tetratricopeptide repeats motif which are involved in protein-protein interactions. Further presence of tetratricopeptide repeats and other protein protein interaction domains in subfamily D should characterize, as TGA transcription factor family members, are involved in biotic stress tolerance through interaction with NPR1 and similarly STRING analysis also identi ed D subfamily protein interaction with A subfamily proteins. A typical example of identi ed motifs in G. raimondii bZIP proteins has shown in Figure 2 and some motifs are listed in Table 3.
BRLZ domain structural characterization of G. raimondii bZIP proteins The conserved basic region N-X7-R/K variant is present in almost all analyzed G. raimondii bZIP proteins, variation is observed in subfamily E and J. Subfamily E-bZIP76 is having K-X7-R motif, subfamily J-bZIP62 is having N-X7-I motif. (Figure 4 and for BRLZ domain alignment of bZIP69, 76 refer supplementary le1).

Leucine heptad repeats variation in G. raimondii bZIPs
Variation in presence of number of leucine heptad repeats motif is observed in G. raimondii bZIP proteins. Figure 3 deciphering the detail structure of leucine heptad repeats of G. raimondii bZIP proteins, for GrbZIP69 and 76 refer supplementary le 1. First two leucine heptad repeats are conserved among all G. raimondii bZIP proteins with exception of GrbZIP17, GrbZIP61, GrbZIP76 andS subfamily proteins GrbZIP 2, 4, 11, and 44 which are carrying either methionine, isoleucine, asparagine or valine at L0, L1 or L2 position (L0 to L9 denotes leucine position in leucine heptad repeats L-X6-L-X6-L motif).
Subfamily-wise leucine heptad repeats are described as follows Subfamily A bZIP proteins are carrying two to four heptad repeats of leucine or presence of other hydrophobic amino acids at leucine position in last heptad.
Subfamily B, C, E, F and S bZIP proteins are carrying seven to nine heptad sequences with leucine repeats or other different hydrophobic or polar amino acids at leucine position; alanine observed predominantly at leucine position in subfamily C and valine is predominantly present in subfamily S proteins at leucine position.
Subfamily D, G, I and J bZIP proteins are carrying three to six leucine heptad sequences or other hydrophobic amino acids are present at leucine position with predominance of glycine in D subfamily and methionine in I subfamily members.
Pure leucine heptad repeats are present in subfamily H member GrbZIP56 and subfamily K member GrbZIP60, with presence of 4 and 2 heptad repeats of leucine respectively.
STRING protein interaction analysis of G. raimondii bZIPs G. raimondii bZIP proteins interaction analysis is performed using STRING, protein-protein association network software, Figure 4 deciphering the observed network (https://string-db.org/). Very strong interaction observed among subfamily A bZIPs, GrbZIP35 (ABF1/AREB1), GrbZIP39 (ABI5), GrbZIP65, GrbZIP66 (AREB3) and GrbZIP67 (DPBF2) genes, which are known for abscisic acid inducible stress regulation, seed germination and seedling development in plants (Lindemose et al.2013, Skubacz et al.2016. Observed interaction between bZIP subfamily A and D proteins of G.raimondii are predicted to be involved in signal transduction pathways, biotic and abiotic stress tolerance. Another strong interaction observed among GrbZIP17 and GrbZIP60, two endoplasmic reticulum or unfolded protein response modulators known to get activated under environmental stresses (Humbert et al. 2012, Howell, 2013. Interaction of GrbZIP35/ABF1 and GrbZIP55/GBF3 is also observed, these genes are involved in abiotic stress tolerance. Interacting proteins GrbZIP9, GrbZIP53 and GrbZIP56 (HY5), indicating their common role in developmental processes viz. seed maturation, circadian rhythm (Alonso et al. 2009;Wolfgang et al.2018). Interaction among C group members, i.e. GrbZIP9, GrbZIP10 and GrbZIP25 and S group members GrbZIP 2, GrbZIP53 predicting role in starvation adaptation. During discerning process of Gossypium bZIPs multiple bZIP genes with similar standard designation are observed, possibly due to gene duplication event. Gene duplication may have happen during speciation of dicot species and also during allotetraploidization. Whole-genome duplication in G.arboreum and G.raimondii before speciation is reported in previous reports. The ndings from this study indicate that Gossypium bZIP proteins are important in stress adaptation, developmental processes, need further evaluation for trait development like ber quality, yield and stress tolerance in cotton. Availability of data and materials Data used in the preparation of this article were obtained from the CottonGen database, Plant transcription factor database. All data generated or analysed during this study are included in this published article and its supplementary information les.

Competing interests
The authors declare that they have no competing interests.

Not applicable
Authors' contributions VK initiated, designed and implemented the study, data analysis, drafted the MS AB conceived the study, data analysis and carefully edited the MS BC directed the study, carefully edited the MS RS edited the MS All authors read and approved the nal manuscript. Table 2: G. hirsutum, G. arboreum and G. raimondii bZIP genes distribution and structural characteristics. Table 2: G. hirsutum, G. arboreum and G. raimondii bZIP genes distribution and structural characteristics, * indicates similar gene may distributed among family. Table 3: G. hirsutum, G. arboreum and G. raimondii bZIPproteins MEME motif analysis, motif function and occurrence Table 4: NLS sequences of Gossypium raimondii and Arabidopsis thaliana bZIP proteins, black font from plant transcription factor database, colour fonts from cNLS Mapper predictions, red font >8 score (predicting nuclear localization), blue font 7 or <8 score (predicting partial nuclear localization), orange font >3and<7 score (nuclear and cytoplasm localization). Figure 1 A Phylogenetic tree construction of G. hirsutum, G. arboreum, G.raimondii and Arabidopsis thaliana bZIPs.

Figure 2
A typical example of MEME motif analysis of G. raimondii bZIPs. MEME motif analysis of Gossypium raimondii bZIPs, bZIP1 domain, DOG1 domain, tetraticopeptide repeat motif, MYND domain, Homeobox associated leucine zipper domain are marked by arrow.

Figure 3
Leucine heptad repeats variation in G. raimondii bZIPs. Leucine heptad repeats variation in G. raimondii bZIPs: basic region motif N-X7-R/K is highlighted in magenta colour, leucine position in luecine heptad motif is highlighted in gray, hydrophobic amino acids other than leucine are highlighted in blue and polar amino acids are highlighted in green