Data sources
The Arabidopsis and rice NBS-LRR protein sequences were downloaded from the TAIR (http://www.arabidopsis.org/) and RGAP (http://rice.plantbiology.msu.edu/) databases, respectively, and were used to build Hidden Markov Model (HMM) profiles. The NB-ARC domain (PF00931) profile and the Arabidopsis and rice HMM profiles were then used to search the banana protein database obtained from the Banana Genome Hub (https://banana-genome-hub.southgreen.fr). Searches were run with HMMER 3.2.1 using the default parameter settings. Hits shared by all three HMM searches were selected and further validated with InterProScan (Zdobnov and Rolf 2001) (http://www.ebi.ac.uk/interpro/search/sequence-search) and MEME (Bailey et al. 2015) (http://meme.nbcr.net/meme/cgi-bin/meme.cgi).
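The selection of shared hits amounts to intersecting the target names reported by the three searches. The following Python sketch illustrates the idea, assuming each search was saved in HMMER tabular output (`--tblout`), where comment lines start with `#` and the first whitespace-delimited column is the target name. The function names and the toy lines (shortened, with made-up sequence IDs) are hypothetical and are not taken from the study.

```python
def read_tblout_targets(lines):
    """Collect target sequence names from hmmsearch --tblout style lines."""
    targets = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        targets.add(line.split()[0])  # first column is the target name
    return targets


def shared_hits(*searches):
    """Return the target names present in every search result."""
    sets = [read_tblout_targets(lines) for lines in searches]
    return set.intersection(*sets) if sets else set()


# Toy, shortened lines with invented IDs (illustration only).
pfam_hits = ["# target name ...", "seqA - NB-ARC", "seqB - NB-ARC", "seqC - NB-ARC"]
arabidopsis_hits = ["# target name ...", "seqA - ath_profile", "seqC - ath_profile"]
rice_hits = ["# target name ...", "seqA - osa_profile", "seqB - osa_profile"]

print(sorted(shared_hits(pfam_hits, arabidopsis_hits, rice_hits)))
```

In practice the line lists would come from reading the three result files, for example with `open(path).readlines()`.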
After removing sequences containing only an NB-ARC domain or only an LRR domain, sequences harboring both NBS and LRR domains were retained as candidate banana NBS-LRR proteins. To minimize sampling error, a Java program was used to screen for coding sequences (CDSs) that met the following conditions (Yue et al. 2008): (1) the total number of bases is an integer multiple of 3; (2) the sequence begins with the start codon ATG; and (3) the sequence ends with the stop codon TAA, TAG or TGA. In total, 74 CDSs met these conditions and were selected.
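The three screening conditions can be stated compactly in code. The sketch below is written in Python rather than Java and is not the program used in this study; the function name `is_complete_cds` is hypothetical, and only the three listed conditions are checked (internal stop codons, for instance, are not tested).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(seq):
    """Check the three conditions: length % 3 == 0, ATG start, stop codon at the end."""
    seq = seq.upper().replace("U", "T")  # accept mRNA-style input as well
    return (
        len(seq) % 3 == 0
        and seq.startswith("ATG")
        and seq[-3:] in STOP_CODONS
    )


# Toy sequences (illustration only).
candidates = {
    "ok": "ATGGCCTAA",
    "bad_length": "ATGGCCTA",
    "bad_start": "GTGGCCTAA",
    "no_stop": "ATGGCCGCC",
}
kept = [name for name, seq in candidates.items() if is_complete_cds(seq)]
print(kept)
```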
Conserved motifs, conserved domains and gene structures of the banana NBS-LRR proteins were analyzed with MEME 5.5.2 (Bailey et al. 2009), CD-Search (Aron et al. 2017) and TBtools (Chen et al. 2020). A neighbor-joining phylogenetic tree was constructed from the conserved NBS-LRR domain sequences using MEGA 11.0 (Tamura et al. 2013) with 1000 bootstrap replicates.
Analysis of synonymous codon usage bias
The GC content and the percentages of A, T, G, C and G + C at the third codon position were calculated with CodonW 1.4.2 (http://codonw.sourceforge.net/). Relative Synonymous Codon Usage (RSCU) is the ratio of the observed number of occurrences of a codon to the number expected if all synonymous codons for that amino acid were used equally. RSCU is calculated as:
$$\mathrm{RSCU}_{ij}=\frac{X_{ij}}{\sum\nolimits_{j=1}^{n_i} X_{ij}}\,n_i \tag{1}$$
where \(X_{ij}\) is the number of occurrences of the jth codon for the ith amino acid, which is encoded by \(n_i\) synonymous codons. An RSCU value of 1 indicates that the codon is used as often as expected under equal usage, with no obvious preference. RSCU > 1 or RSCU < 1 indicates that the codon is used more or less frequently, respectively, than the other synonymous codons (Liu et al. 2004).
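As a worked illustration of the RSCU definition, the Python sketch below counts codons in a single CDS and applies the formula family by family. It assumes the standard genetic code; the helper names (`count_codons`, `rscu`) and the toy sequence are hypothetical, and the values reported in this study were obtained with CodonW, not with this code.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons(cds):
    """Count in-frame codons of a coding sequence."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


def rscu(cds):
    """RSCU_ij = X_ij * n_i / sum_j X_ij for every sense codon of observed amino acids."""
    counts = count_codons(cds)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this sequence
        n_i = len(codons)
        for c in codons:
            values[c] = counts[c] * n_i / total
    return values


# Toy CDS: Ala is encoded by GCT twice, GCC once and GCA once.
toy = "ATGGCTGCTGCCGCATAA"
r = rscu(toy)
print(r["GCT"], r["GCC"], r["GCG"])
```

For the toy sequence, alanine occurs four times across four synonymous codons, so GCT (used twice) has an RSCU of 2, GCC and GCA have an RSCU of 1, and the unused GCG has an RSCU of 0.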
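The Effective Number of Codons (ENC) measures how far codon usage departs from equal use of synonymous codons. It ranges from 20 (only one codon is used for each amino acid) to 61 (all synonymous codons are used equally) and is calculated as: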
$$\mathrm{ENC}=2+\frac{9}{\overline{F_2}}+\frac{1}{\overline{F_3}}+\frac{5}{\overline{F_4}}+\frac{3}{\overline{F_6}} \tag{2}$$
where \(\overline{F_i}\) (i = 2, 3, 4, 6) is the average of the homozygosity F over the amino acids encoded by i-fold degenerate codon families. For each amino acid, F is calculated as:
$$F=\frac{n\sum\nolimits_{j=1}^{i}\left(\frac{n_j}{n}\right)^{2}-1}{n-1} \tag{3}$$
where n is the total number of occurrences of the codons for that amino acid and \(n_j\) is the number of occurrences of the jth codon for that amino acid. The smaller the ENC value, the stronger the codon usage bias (Gupta et al. 2004).
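Formulas (2) and (3) can be combined in a few lines. The Python sketch below is only an illustration under stated assumptions: it uses the standard genetic code, skips amino acids with fewer than two observed codons, substitutes the mean of F2 and F4 when isoleucine is absent, and caps the result at 61, which are conventions commonly attributed to Wright (1990). CodonW, which was used in this study, may handle such edge cases differently, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def homozygosity(codon_counts):
    """F = (n * sum((n_j / n)^2) - 1) / (n - 1) for one amino acid; None if n < 2."""
    n = sum(codon_counts)
    if n < 2:
        return None
    return (n * sum((c / n) ** 2 for c in codon_counts) - 1) / (n - 1)


def enc(cds):
    """ENC = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, or None if a class has no usable data."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    f_by_class = defaultdict(list)
    for codons in families.values():
        k = len(codons)
        if k == 1:
            continue  # Met and Trp have no synonymous codons
        f = homozygosity([counts[c] for c in codons])
        if f is not None and f > 0:
            f_by_class[k].append(f)
    mean_f = {k: sum(v) / len(v) for k, v in f_by_class.items()}
    if any(k not in mean_f for k in (2, 4, 6)):
        return None  # too few codons to estimate ENC
    if 3 not in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2  # assumed fallback when Ile is absent
    value = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(value, 61.0)  # assumed cap at the number of sense codons
```

Two limiting cases serve as a sanity check: a sequence that uses a single codon for every amino acid gives F = 1 in each class and therefore ENC = 20, whereas a sequence that uses all synonymous codons equally reaches the upper bound of 61.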
The Codon Adaptation Index (CAI) measures how closely the synonymous codon usage of a coding region matches the usage of optimal codons in highly expressed genes, and it is used to predict the expression level of genes within a species.
$$\mathrm{CAI}=\frac{\mathrm{CAI}_{obs}}{\mathrm{CAI}_{max}}=\frac{\sqrt[L]{\prod\nolimits_{k=1}^{L}\mathrm{RSCU}_{k}}}{\sqrt[L]{\prod\nolimits_{k=1}^{L}\mathrm{RSCU}_{k\max}}}=\sqrt[L]{\prod\nolimits_{k=1}^{L}\frac{\mathrm{RSCU}_{k}}{\mathrm{RSCU}_{k\max}}} \tag{4}$$
where \(RSCU_{k}\) is the RSCU value of the kth codon in the sequence, \(RSCU_{k\max}\) is the RSCU value of the optimal codon for the amino acid encoded by the kth codon, as determined from highly expressed genes, and L is the total number of codons in the nucleotide sequence of the studied gene. CAI ranges from 0 to 1; the larger the value, the stronger the codon usage preference (Peixoto et al. 2003).
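Because CAI is a geometric mean of the ratios RSCU_k / RSCU_kmax, it is convenient to compute it through logarithms. The Python sketch below is a minimal illustration, assuming the standard genetic code and a user-supplied set of highly expressed reference genes. Within a synonymous family the ratio of RSCU values equals the ratio of codon counts, which is what the code uses. Assigning 0.5 to codons never seen in the reference set and excluding Met and Trp are common conventions assumed here, not details taken from this study; the function names and toy sequences are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(cds):
    """Return the in-frame codons of a coding sequence."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]


def relative_adaptiveness(reference_cds_list):
    """w = RSCU / RSCU_max per codon, estimated from reference (highly expressed) genes."""
    counts = Counter()
    for cds in reference_cds_list:
        counts.update(split_codons(cds))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":  # stops, Met and Trp carry no synonymous choice
            families[aa].append(codon)
    w = {}
    for codons in families.values():
        # Assumed convention: unseen codons get a count of 0.5 to avoid w = 0.
        adjusted = {c: counts[c] if counts[c] > 0 else 0.5 for c in codons}
        top = max(adjusted.values())
        for c in codons:
            w[c] = adjusted[c] / top
    return w


def cai(cds, w):
    """Geometric mean of w over the scored codons of a CDS; None if none are scored."""
    logs = [math.log(w[c]) for c in split_codons(cds) if c in w]
    if not logs:
        return None
    return math.exp(sum(logs) / len(logs))


# Toy reference set: Ala is encoded by GCT three times and GCC once.
w = relative_adaptiveness(["GCTGCTGCTGCC"])
print(cai("GCTGCT", w), cai("GCTGCC", w))
```

In the toy example, a sequence made only of the preferred codon GCT scores 1, while replacing one GCT with GCC (w = 1/3) lowers the score to the square root of 1/3.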
Frequency of Optimal Codons (FOP) is the proportion of optimal codons among the total number of codons. A3s, T3s, G3s, C3s and GC3s denote the percentages of A, T, G, C and G + C, respectively, at the third position of the codon. SPSS 26.0 was used to analyze the correlations among the codon composition and preference parameters (A3s, T3s, G3s, C3s, CAI, FOP, ENC, GC3s, GC).
ENC-plot and neutral plot analysis
For the ENC-plot analysis, the theoretical ENC value was calculated with formula (2), and the standard curve was drawn with the theoretical ENC value as the vertical coordinate and GC3 as the horizontal coordinate (Wang et al. 2020).
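In ENC-plot analyses, the standard curve is often drawn from the closed-form expectation ENC = 2 + s + 29 / (s² + (1 − s)²), where s is GC3s, which describes codon usage shaped by base composition alone (Wright 1990). Whether exactly this expression was used to draw the curve in this study is an assumption; the Python sketch below only illustrates how such a curve can be tabulated, and the function name is hypothetical.

```python
def expected_enc(gc3s):
    """Expected ENC under base-composition constraint only (assumed standard-curve form)."""
    s = gc3s
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)


# Tabulate the curve at 101 evenly spaced GC3s values between 0 and 1.
curve = [(i / 100, expected_enc(i / 100)) for i in range(101)]
print(curve[0], curve[50], curve[100])
```

Genes lying on or near this curve are usually interpreted as being constrained mainly by mutation, whereas genes falling well below it are taken to be affected by additional factors such as selection.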
PR2-plot analysis was used to examine the base composition at the third position of the codons. G3/(G3 + C3) and A3/(A3 + T3) were used as the horizontal and vertical coordinates, respectively (Sueoka 1999). The distribution of points around the center point (0.5, 0.5), where A = T and C = G, shows the degree and direction of the base bias. If codon usage is shaped by mutation pressure alone, the A/T and C/G ratios of degenerate codons are balanced. Conversely, an unbalanced distribution indicates that codon preference is influenced by natural selection and other factors (Xiang et al. 2015).
The factors influencing codon usage preference can be preliminarily assessed by neutral plot analysis. The neutral plot of the banana NBS-LRR genes was constructed with GC3 as the horizontal coordinate and GC12, the mean GC content at the first and second codon positions, as the vertical coordinate.
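The coordinates of both plots follow directly from position-specific base counts. The Python sketch below computes one PR2 point and one neutral-plot point per CDS. For simplicity it uses every codon in the sequence, whereas published analyses often restrict PR2 to fourfold degenerate codons and exclude Met, Trp and stop codons from GC3; those refinements, the function names and the toy sequence are assumptions for illustration only.

```python
def _trim(cds):
    """Upper-case the sequence and drop trailing bases that do not fill a codon."""
    cds = cds.upper()
    return cds[: len(cds) - len(cds) % 3]


def pr2_point(cds):
    """Return (G3/(G3+C3), A3/(A3+T3)) from third codon positions."""
    third = _trim(cds)[2::3]
    a, t, g, c = (third.count(b) for b in "ATGC")
    x = g / (g + c) if g + c else None
    y = a / (a + t) if a + t else None
    return x, y


def neutrality_point(cds):
    """Return (GC3, GC12), with GC12 the mean of GC1 and GC2."""
    cds = _trim(cds)

    def gc(bases):
        return sum(b in "GC" for b in bases) / len(bases)

    gc1, gc2, gc3 = gc(cds[0::3]), gc(cds[1::3]), gc(cds[2::3])
    return gc3, (gc1 + gc2) / 2


# Toy CDS (illustration only).
toy = "ATGGCCAAATTTGGGTGA"
print(pr2_point(toy), neutrality_point(toy))
```

In a neutral plot, a regression slope of GC12 on GC3 close to 1 is generally read as evidence that mutation pressure dominates, while a slope close to 0 points to selection.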