3.1 Number, relative abundance and relative density of SSR in picornaviruses
Genome-wide screening of 88 available picornavirus genomes revealed a total of 2,488 SSRs distributed across the species (Figure 1, Supplementary file 1). On average, 30 SSRs were observed per genome. The lowest incidence (14 SSRs) was observed in Avisivirus C (V11), and the highest (46 SSRs) in Salivirus A (V80) (Table 2). The relative density (RD) of SSRs ranged from 13.39 bp/kb in Avisivirus C (V11) to 45.02 bp/kb in Cardiovirus A (V13). Similarly, the relative abundance (RA) varied from a minimum of 1.953 in Avisivirus C (V11) to a maximum of 5.763 bp/kb in Parechovirus B (V69) (Table 2).
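As a minimal sketch of how the two summary measures can be computed, the example below assumes the usual definitions, which are not restated in this section: relative abundance as the number of SSRs per kb of genome, and relative density as the total SSR length (bp) per kb of genome. The function names and the example genome are hypothetical and are not taken from the dataset.

```python
def relative_abundance(n_ssr: int, genome_length_bp: int) -> float:
    """SSRs per kb of sequence (assumed definition of RA)."""
    return n_ssr / (genome_length_bp / 1000)


def relative_density(ssr_lengths_bp: list[int], genome_length_bp: int) -> float:
    """Total bp covered by SSRs per kb of sequence (assumed definition of RD)."""
    return sum(ssr_lengths_bp) / (genome_length_bp / 1000)


# Hypothetical genome: 8,000 bp carrying four SSRs of 6, 6, 8 and 12 bp.
lengths = [6, 6, 8, 12]
ra = relative_abundance(len(lengths), 8000)  # 4 SSRs over 8 kb
rd = relative_density(lengths, 8000)         # 32 bp of SSR over 8 kb
```

Normalizing by genome length in this way is what allows genomes of different sizes to be compared directly.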
3.2 Number, relative abundance and relative density of cSSR in picornaviruses
The genome-wide scan revealed between one and five compound microsatellites (cSSRs) in the genomes that contained them; 26 genomes lacked any cSSR. A total of 100 cSSRs were observed across the 88 genomes (Table 2). The RD of cSSRs varied widely among genomes, ranging from 1.36 bp/kb in Erbovirus A (V36) to 26.84 bp/kb in Cardiovirus A (V13). Similarly, the RA varied from 0.108 bp/kb in Ampivirus A (V3) to 0.636 bp/kb in Cardiovirus A (V13). The percentage of individual microsatellites that were part of a compound microsatellite (cSSR%) ranged from 2.439 in Gallivirus A (V37) to 15.15 in Cardiovirus A (V13) (Table 2).
3.3 Effect of dMAX on cSSR incidence
dMAX is the maximum distance allowed between two adjacent microsatellites for them to be counted as a single compound microsatellite: if the distance separating two microsatellites is less than or equal to dMAX, they are considered a cSSR (Kofler et al., 2008). To determine the impact of dMAX on cSSR incidence, five genome sequences, namely bovine rhinitis B virus (V4), Cardiovirus A (V13), Enterovirus A (V21), Aichivirus A (V46), and Salivirus A (V80), were chosen at random. Note that the dMAX value in the IMEx software can only be set between 0 and 50 (Mudunuri and Nagarajaram 2007). The number of cSSRs increased overall with higher dMAX values in all selected picornaviruses except Enterovirus A (V21) (Figure 2).
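The dMAX rule can be illustrated with a short sketch. This is not the IMEx implementation; it only assumes that each SSR is given as a hypothetical (start, end) coordinate pair and that adjacent SSRs are chained into one cSSR whenever the gap between them is at most dMAX.

```python
from typing import List, Tuple

# (start, end) positions of an SSR, 1-based and inclusive (hypothetical layout).
SSR = Tuple[int, int]


def group_cssr(ssrs: List[SSR], dmax: int) -> List[List[SSR]]:
    """Chain adjacent SSRs whose separating gap is <= dmax into cSSRs."""
    compounds: List[List[SSR]] = []
    current: List[SSR] = []
    for ssr in sorted(ssrs):
        if current:
            gap = ssr[0] - current[-1][1] - 1  # bases between the two repeats
            if gap <= dmax:
                current.append(ssr)
                continue
            if len(current) > 1:  # a lone SSR is not a compound
                compounds.append(current)
        current = [ssr]
    if len(current) > 1:
        compounds.append(current)
    return compounds


# Hypothetical SSR coordinates in one genome (gaps of 2, 32, 32 and 4 bases).
ssrs = [(10, 17), (20, 27), (60, 67), (100, 107), (112, 119)]
counts = {d: len(group_cssr(ssrs, d)) for d in (0, 10, 30, 50)}
```

In this toy input the count rises from 0 to 2 as dMAX grows, then falls to 1 at dMAX = 50 because all the SSRs merge into one chain. Under this grouping rule, therefore, the cSSR count need not increase monotonically with dMAX.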
3.4 Diversity of derived motifs
The microsatellites identified in this study comprised mono- to penta-nucleotide motifs throughout the genomes of the Picornaviridae family. Mono-nucleotide repeats were the second most abundant class, after dinucleotide repeats. A mono-nucleotide stretch of 141 repeats was observed in Cardiovirus A (V13) (Acc. no. X74312). Poly (G/C) repeats were more prevalent than poly (A/T) repeats (Supplementary file 1). Among the dinucleotide repeats, CT/TC was the most abundant and CG/GC the least abundant in the Picornaviridae; CT/TC repeats were approximately 4.25 times more abundant than CG/GC repeats. Tri-nucleotide repeats were the next most abundant class after mono-nucleotide repeats and comprised 54 codon-type repeats. Among them, GAA/AAG and CAA/AAC showed the highest coding density; they code for glutamic acid/lysine and glutamine/asparagine, respectively.
In this study, seventeen tetra-nucleotide and six penta-nucleotide motifs were observed among the picornaviruses. Of these, three tetra-nucleotide motifs (CTCC, TCCT, and TTAG) were localized in Aichivirus A (V46), and two penta-nucleotide motifs (GTTAA and TTAAG) were localized in Aichivirus F (V51). We also examined the distribution of SSRs across the non-genic and genic regions. The non-genic regions contained 15.50% of the SSRs, and the remaining 84.50% lay in the genic regions. The 3D gene, which codes for the RNA-dependent RNA polymerase (RdRp), contained the highest percentage (14.80%), followed by 2C (12.39%) and VP3 (10.33%), which code for proteins involved in viral replication and capsid formation, respectively (Figure 3). We further examined the presence of mono-, di-, and tri-nucleotide repeats in the genic and non-genic regions (Figure 4).
We then looked for bias of particular repeat classes towards particular genomic regions. The capsid protein VP4 was biased towards (A)6 repeats, which code for lysine, whereas VP0, VP3, and 2A (which codes for a viral protease) were enriched with polyC, which codes for proline. The conserved repeats at the 5' UTR and 3' UTR were enriched with polyC and polyA, respectively. VP0 (which aids RNA encapsidation) was enriched with dinucleotide repeats such as TC/CT, which code for serine/leucine; the remaining regions were highly diverse in both di- and tri-nucleotide repeat types. This diversification of microsatellites suggests rapid evolution of the species to adapt to a broad range of hosts.
3.5 Over- and under-representation of mononucleotide repeats
We noticed considerable variation in the number of mononucleotide repeats (MNRs; ≥6 nt), ranging from one (Enterovirus J, Enterovirus K, Hepatovirus D, Mosavirus A, and Tremovirus A) to 22 (Parechovirus B). The ratio between the observed and expected numbers of repeats (O/E ratio) also varied considerably, from -7.6407 (Rhinovirus A) to 14.8191 (Parechovirus B) (Table 3). Approximately 72.5% of the analyzed sequences showed under-representation of MNRs. A broader host range and the virus genome type might have influenced the O/E ratio to some degree. Z scores were used to assess the statistical significance of mononucleotide repeat representation: Z > 0 when the observed value exceeds the expected value, and Z < 0 when it falls below it. Thus, the larger the magnitude of Z, the higher the statistical significance. Kunsagivirus A, with the highest Z value of 4.85, was the most significant genome in terms of mononucleotide microsatellite evolution. Some viruses showed an over-representation of MNR loci relative to the expected value, which may arise from co-evolution of viral SSRs with the host genome.
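The counting step and the sign convention of the Z score can be sketched as follows. The ≥6 nt threshold comes from the text; the null model that supplies the expected count and its standard deviation is not described in this section, so both are passed in as placeholder arguments, and all names and the example sequence are hypothetical.

```python
import re

# Runs of a single nucleotide that are at least 6 nt long (threshold from the text).
MNR_PATTERN = re.compile(r"A{6,}|C{6,}|G{6,}|T{6,}")


def count_mnr(sequence: str) -> int:
    """Count mononucleotide repeats of >= 6 nt in a DNA-alphabet sequence."""
    return len(MNR_PATTERN.findall(sequence.upper()))


def z_score(observed: float, expected: float, sd: float) -> float:
    """Standard score of the observed count against a null expectation."""
    return (observed - expected) / sd


# Hypothetical fragment: runs of A6, C8 and T6 qualify; the A5 run does not.
seq = "ACGT" + "A" * 6 + "GT" + "C" * 8 + "TG" + "A" * 5 + "T" * 6 + "A"
observed = count_mnr(seq)
# The expectation and SD below are placeholders, not values from the study.
z = z_score(observed, expected=5.0, sd=1.0)
```

Here the observed count falls below the placeholder expectation, so Z is negative, which corresponds to under-representation.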
3.6 Motif complexity and polymorphism of Picornavirus genome
Compound microsatellites (cSSRs) are defined by the presence of two or more adjacent individual microsatellites. The common cSSR patterns m1-xn-m2 and m1-xn-m2-xn-m3 are termed '2-microsatellite' and '3-microsatellite', respectively (Kofler et al. 2008). The analysis of compound microsatellite complexity indicated that all surveyed genomes were rich in '2-microsatellite' cSSRs, followed by a single '3-microsatellite'. Among the 88 species of Picornaviridae analyzed, the cSSR composition showed a random distribution pattern throughout the genome. The non-coding regions contained 19% of the total microsatellites, while the remaining 81% were present in the coding regions. Within the coding regions, VP3 contained 14% of the cSSRs, followed by 3D (13.23%) and VP1 (10.29%) (Figure 5). The CTG-CAG compound microsatellite, composed of self-complementary motifs, has been proposed to arise by recombination (Jakupciak and Wells 1999). However, our study showed no such motifs, suggesting that these compound microsatellites were unlikely to be derived from recombination. Motifs of the form [m1]n-xn-[m2]n can be termed SSR-couples, for example the compound microsatellites (A)6-x2-(TA)3 and (A)8-x2-(TA)5. We identified three types of SSR-couples, (TG)-x-(CT), (TC)-x-(C), and (CT)-x-(C), each of which occurred twice across the analyzed genomes. Some self-complementary motifs were observed in Picornaviridae, namely (GC)-x-(CG), (TC)-x-(AG), and (GT)-x-(CA), which play an important role in secondary structure formation. Motif duplication is the phenomenon in which similar motifs are located on both sides of the spacer sequence, for example (CA)n-(X)y-(CA)z. About 13.23% of the total cSSRs were made up of duplicated sequences with the motif patterns (C)-x-(C), (TG)-x-(TG), (AT)-x-(AT), and (TC)-x-(TC) (Supplementary file 2).
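The motif relationships described above can be checked mechanically. The sketch below is one possible reading, not the procedure used in the study: it assumes a cSSR counts as duplicated when both motifs are identical, and as self-complementary when the second motif is a cyclic shift of the reverse complement of the first. The function names and labels are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif: str) -> str:
    """Reverse complement of a DNA-alphabet motif, e.g. TC -> GA."""
    return motif.translate(COMPLEMENT)[::-1]


def rotations(motif: str) -> set[str]:
    """All cyclic shifts of a motif, e.g. GA -> {GA, AG}."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}


def classify(m1: str, m2: str) -> str:
    """Label a 2-microsatellite cSSR m1-x-m2 by the relation between its motifs."""
    if m1 == m2:
        return "duplicated"
    if m2 in rotations(reverse_complement(m1)):
        return "self-complementary"
    return "other"


labels = {
    pair: classify(*pair)
    for pair in [("TC", "AG"), ("GT", "CA"), ("GC", "CG"), ("TG", "TG"), ("TG", "CT")]
}
```

Under these assumptions, (TC)-x-(AG), (GT)-x-(CA), and (GC)-x-(CG) are labeled self-complementary, (TG)-x-(TG) is labeled duplicated, and (TG)-x-(CT) falls into neither class.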
To check polymorphism, species with five or more strains were taken into consideration. Owing to limited sequence information, only 12 of the 88 species qualified for polymorphism detection. A total of 141 sequences were analyzed to determine species-specific consensus microsatellite motifs (Supplementary file 3). We found that five species contained species-specific consensus sequences in the leader and VP3 regions, presumably the most conserved microsatellites within the genome (Table 4). These conserved microsatellites could serve as potential biomarkers for virus identification and population genetics studies.
3.7 Recombination analysis
When the dataset of 141 complete genomes from 12 species was analyzed, three species were found to lack any recombination event. For the remaining nine species, recombination breakpoints were examined together with the major and minor parent information and their exact positions in the genome. This analysis yielded a total of 15 breakpoints, the majority of which were associated with the structural proteins (VP4-VP1), followed by the nonstructural proteins (2A-2C and 3A-3D) (Supplementary file 4). The fewest breakpoints were observed in the non-coding regions, followed by the leader sequence. Previous studies suggest that repetitive sequences are favored for recombination and act as hotspots, owing to the high affinity of recombinase enzymes for dinucleotide repeats (Biet et al. 1999). This prompted us to check for repeats within the breakpoints present in the nine species. Of the 15 breakpoints, three were devoid of any microsatellite.
In contrast, the other breakpoints were rich in dinucleotide repeats, which contributed 92% of the repeats, followed by tri-nucleotide repeats (4%) and mono-nucleotide repeats (4%). Among the dinucleotide repeats, GT/TG was the most frequent (25%), followed by AG/GA (18%), and CT/TC was the least frequent (2.2%). These data are not conclusive, but they lay a foundation for further research into the relationship between microsatellites and recombination hotspots in the picornavirus family (Supplementary file 4).
3.8 Genomic parameters influencing SSR and cSSR distribution
Regression analysis showed a significant correlation of genome size (R2 = 0.258; P<0.05) and GC content (R2 = 0.055; P<0.05) with the incidence of SSRs. No significant correlation was observed between genome size and either relative abundance (RA; R2 = 0.027; P>0.05) or relative density (RD; R2 = 0.014; P>0.05) (Figure 6A). Similarly, GC content was not significantly correlated with RA (R2 = 0.020; P>0.05) or RD (R2 = 0.022; P>0.05) (Figure 6B). For cSSRs, regression analysis revealed no significant correlation of cSSR incidence with GC content (R2 = 0.015; P>0.05) or genome size (R2 = 0.031; P>0.05). Genome size was also not significantly correlated with cRA (R2 = 0.007; P>0.05), cRD (R2 = 0.007; P>0.05), or the percentage of cSSRs (R2 = 0.003; P>0.05) (Figure 7A). Likewise, GC content was not significantly correlated with cRA (R2 = 0.009; P>0.05), cRD (R2 = 0.006; P>0.05), or the percentage of cSSRs (R2 = 0.000; P>0.05) (Figure 7B).
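For reference, the R2 values reported above correspond to the coefficient of determination of a simple linear regression. The sketch below assumes an ordinary least-squares fit with one predictor, uses hypothetical data rather than values from Table 2, and does not reproduce the software used in the study.

```python
from statistics import mean


def r_squared(x: list[float], y: list[float]) -> float:
    """Coefficient of determination for a simple least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    # With a single predictor, R2 equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


# Hypothetical values: genome size (kb) and SSR count for five genomes.
genome_kb = [7.0, 7.5, 8.0, 8.5, 9.0]
ssr_count = [20, 31, 24, 35, 30]
r2 = r_squared(genome_kb, ssr_count)
```

The accompanying P value would additionally require the t distribution with n - 2 degrees of freedom, for which a statistics library is the practical choice.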