Comparative Genomics Revealed Genus-Specific Encoding of Amino Acids by Tri-Nucleotide SSRs in Human Pathogenic Streptococcus and Staphylococcus Bacteria


 Pathogenic bacteria use phase variation of surface molecules and other characteristics as a significant adaptation mechanism. Repetitive sequences made up of numerous identical repeat units can be found in many phase variable genes. Here, we investigated the frequency and distribution of long-SSRs in 15 human pathogenic Staphylococcus, Streptococcus, and Enterococcus bacteria. Long-SSRs were found to be distributed differently in the genic and intergenic sequences. In the genic sequences, 61.3 SSRs were discovered on average, while 16.2 SSRs were found in the intergenic regions. Staphylococci exhibited the highest frequency of SSRs, followed by Enterococcus, and Streptococci had the lowest frequency of SSRs. Higher A+T content was found to be the best predictor of long-SSR in these human pathogens. Tetranucleotide repeats predominated in intergenic regions, while trinucleotide repeats predominated in genic regions. In human pathogenic Streptococcus and Staphylococcus bacteria, genus-specific encoding of amino acids by tri-nucleotide SSRs was observed. A genetic relationship between these human pathogenic bacteria was derived based on the presence of SSRs in the housekeeping genes and compared to the phylogeny generated based on the 16S ribosomal RNA gene.


Introduction
Microorganisms play a decisive role in our lives. They are responsible for many of the food products that we consume, and at the same time, they are the cause of many human diseases. Infections caused by gram-positive bacteria (GPB) are a common cause of significant illness in individuals. Streptococcus is a genus of gram-positive bacteria that includes a number of clinically relevant species that cause a wide range of infections in humans and animals, each with its own symptoms and course (Krismer, et al. 2017). Streptococci can colonise human and animal mucous surfaces and are considered opportunistic pathogens, meaning they can cause acute infections under certain conditions. Streptococcal species such as Streptococcus pyogenes and Streptococcus pneumoniae are extremely virulent and cause serious infections such as pneumonia, necrotizing fasciitis, sepsis, and meningitis, whereas others (for example, Streptococcus mutans, Streptococcus sanguis, Streptococcus agalactiae, and Streptococcus anginosus) cause endocarditis and abscesses (Kosecka-Strojek, et al. 2019).
Species of Staphylococcus bacteria, which are found in the natural microbiota of humans and animals, also act as pathogens. Staphylococcus aureus, the group's main pathogen, causes a wide range of clinical infections in humans, including bacteremia, endocarditis, and infections associated with invasive medical devices (Rossi, et al. 2020). Meanwhile, coagulase-negative staphylococci (CoNS), particularly S. haemolyticus and S. epidermidis, have emerged as recurrent causal agents of nosocomial infections, particularly those involving indwelling devices (Becker, et al. 2014). Similarly, gram-positive species of Enterococcus are abundant in the human intestine. The two species most commonly responsible for infection in humans are E. faecalis and E. faecium (Safdar and Armstrong 2019).
The pathogenicity of these microorganisms is influenced by a variety of factors. The ability of microbial pathogens to persist in human hosts protected by adaptive immunity is a major biological challenge.
Antigenic variation of surface components is one technique used by microbial parasites to evade host immune responses during long-term colonization (Woolhouse, et al. 2001). Although microbial populations are continually changing in all colonised habitats, many successful long-term colonists have mastered the barrier of producing antigenic variations (Krinos, et al. 2001).
Bacterial pathogens use polymerase slippage of simple sequence repeats (SSRs) to generate phenotypic variation and increased fitness through localised hypermutation. SSRs are made up of tandem repetitions of short oligonucleotides that have functional and/or structural characteristics that set them apart from other DNA sequences (Mrazek, et al. 2007). SSRs occur throughout the genome, but their distribution is not random. They can be found in the 3'-UTR, 5'-UTR, protein-coding, and non-coding sections of the genome (Mahfooz, et al. 2015). SSRs are often utilised as genetic markers since they are highly mutable and exceptionally polymorphic (Datta, et al. 2010). SSR types are expected to alter transcriptional activity and have a functional role in the evolution of gene regulation (Geng, et al. 2020), and SSRs are thought to differ between taxa and between different regions of the same taxon (Song, et al. 2021).
In bacteria, a large number of SSR tracts are present, with motifs spanning from one to six nucleotides (Gur-Arie, et al. 2000). Some of these SSRs are exceptionally long (more than 12 nucleotides). These long SSRs are found within the reading frame or promoter of a class of genes known as contingency loci, whose functions typically involve direct interactions with the host (Moxon, et al. 2006). Since the genome sequences of most pathogenic bacteria are available, the distribution of these loci among human pathogens can readily be analysed to gain insight into their evolution as pathogens. Hence, we set out to analyse the frequency and distribution of these long SSRs among the three most common genera of human pathogenic species. Furthermore, we examined which genes carry these long SSRs and how the presence of these long repeats could affect transcription. Finally, we sought to elucidate a genetic relationship based on the presence and absence of these long SSRs in certain housekeeping genes.

Material And Methods
2.1 Genome sequences. Whole-genome and coding sequences of 15 human pathogenic bacteria were downloaded in FASTA format from the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov). This study did not include plasmid DNA sequences. On the basis of current annotations, genic and intergenic regions within the genomes were determined. Of the 15 bacteria, 4 belong to the genus Staphylococcus, 9 belong to the genus Streptococcus, and the remaining 2 are Enterococcus bacteria (Table 1). Data regarding G+C content and the number of genes were taken from the information provided along with the genome sequences.

Simple Sequence Repeat (SSR) mining:
Prokaryotic organisms have a higher number of short repeats, which typically range from two to four repeat units in length. Mononucleotide repeats are the most common among them, followed by trinucleotide repeats. Sequencing errors result in a large number of mononucleotide repeats. As a result, mononucleotide repeats were excluded, and di- to hexa-nucleotide repeats longer than 12 nucleotides were designated as long SSRs. In this case, a di-nucleotide motif must appear at least six times, a trinucleotide motif at least four times, and tetra-, penta-, and hexa-nucleotide motifs at least three times each. SSR mining was carried out with the SSR Locator software (da Maia, et al. 2008). This software can mine SSRs as well as design primers and simulate PCR reactions against various databases. Each SSR's frequency, relative abundance (RA), and relative density (RD) were computed. RA was calculated as the total number of SSRs, and RD as the total length (in bases) covered by the SSRs, each divided by the size of the sequence in Mb. Pearson correlation analysis and Principal Component Analysis (PCA) were performed in R, and graphs were created in RStudio.
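The thresholds and the RA/RD definitions above can be illustrated with a minimal Python sketch. This is not the SSR Locator algorithm; it is a simple regular-expression scan written only to make the criteria concrete. The function names `find_long_ssrs` and `relative_abundance_density`, and the toy sequence in the usage note, are hypothetical.

```python
import re

# Minimum repeat counts per motif length, as stated in the text:
# di x6, tri x4, tetra x3, penta x3, hexa x3 (mononucleotides excluded).
MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def find_long_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for long SSRs in seq."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One motif of `size` bases followed by at least (min_rep - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves a shorter unit repeated
            # (e.g. AAA or ATAT), so a tract is reported under its
            # smallest motif and mononucleotide runs are ignored.
            if any(motif == motif[:k] * (size // k)
                   for k in range(1, size) if size % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // size))
    return sorted(hits)


def relative_abundance_density(hits, seq_len_bp):
    """RA = SSR count per Mb; RD = bases covered by SSRs per Mb."""
    size_mb = seq_len_bp / 1_000_000
    covered = sum(len(motif) * count for _, motif, count in hits)
    return len(hits) / size_mb, covered / size_mb
```

For a toy sequence such as `"CC" + "AT" * 6 + "GGC" + "GTG" * 4 + "TT" + "GAGCA" * 3 + "C"`, the scan reports one di-, one tri-, and one pentanucleotide SSR. Dividing by the sequence size in Mb makes genomes of different sizes comparable, which is the purpose of RA and RD in this study.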

Statistical analysis:
The presence or absence of repetitive motifs in human pathogens' housekeeping genes was used to produce binary data, which was then analysed using the SIMQUAL approach to generate Jaccard's similarity coefficients with NTSYSpc software version 2.1 (Rohlf 1998). Using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) and SAHN clustering, these similarity coefficients were utilised to create a dendrogram demonstrating genetic relatedness among the species. The polymorphism information content (PIC) was determined as previously described (Botstein, et al. 1980). PIC is defined as the probability that two randomly chosen copies of a gene represent different alleles within a population. The following formula was used to calculate the PIC value: $PIC_i = 1 - \sum_{j=1}^{n} P_{ij}^{2}$, where $P_{ij}$ represents the frequency of the $j$th pattern for marker $i$, and summation extends over $n$ patterns.
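The two statistics described above can be sketched in a few lines of Python. This is an illustration only, not the NTSYSpc implementation. It assumes the commonly used form PIC_i = 1 - sum_j P_ij^2, which matches the symbol definitions given in the text, and the function names `jaccard` and `pic` are hypothetical.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity between two presence/absence (1/0) SSR profiles.

    Shared absences (0, 0) are ignored, as in Jaccard's coefficient.
    """
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def pic(pattern_freqs):
    """PIC for one marker, assuming PIC_i = 1 - sum_j P_ij^2.

    pattern_freqs holds the frequencies P_ij of the n patterns observed
    for marker i; they are expected to sum to 1.
    """
    return 1.0 - sum(p * p for p in pattern_freqs)
```

For example, two species whose profiles are `[1, 1, 0, 1]` and `[1, 0, 0, 1]` share two of the three motifs present in either, giving a similarity of 2/3. A marker with two equally frequent patterns has a PIC of 0.5 under the assumed formula.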
The ClustalW program in the MEGA 5.2 software was also used to create a phylogenetic tree of the 16S rRNA gene. The phylogenetic tree was built using the neighbor-joining algorithm and 1000 replicates of bootstrap analysis.

Gene enrichment analysis
Gene enrichment analysis was performed on the genes that contain long SSRs. This could aid in the discovery of biological pathways that are rich in repeats. For this study, the pathogenic species with the greatest SSR count was chosen from each of the genera Staphylococcus, Streptococcus, and Enterococcus. We employed a custom protocol that included three main steps: 1) detailed description of a gene list derived from "omic" data; 2) computational determination of statistically enriched pathways; and 3) visualisation and analysis of the results. We used PANNZER (Protein ANNotation with Z-scoRE) for the first and second steps, a completely automated tool for functional annotation of prokaryotic and eukaryotic proteins of unknown function. The programme accepts protein sequences and predicts functional descriptions (DE) and GO classifications. In the final phase, visual analysis was performed using visualisation tools provided by ShinyGO v0.61 (http://bioinformatics.sdstate.edu/gonovel). It offers a graphical web tool that can assist in gaining actionable information and provides access to KEGG and STRING for pathway diagrams and protein-protein interaction networks.

Staphylococcus has the maximum relative abundance and density of long SSRs
To analyse the presence of long SSRs in human pathogenic bacteria, we scanned for these SSRs in different species of Staphylococcus, Streptococcus, and Enterococcus bacteria. Among all species, a total of 1286 long SSRs were identified, with an average of 77.5 SSRs/species. Differential distribution of SSRs was observed in the genic and intergenic regions. An average of 61.3 SSRs were found in the genic sequences, whereas 16.2 SSRs were located in the intergenic regions. The maximum frequency of SSRs was identified among Staphylococci (Avg.=117.5), followed by Enterococcus (Avg.=100), while Streptococci had the least abundance of SSRs (Avg.=54.7). Individually, Staphylococcus epidermidis has the highest frequency of long SSRs (125), closely followed by Staphylococcus haemolyticus (123), and Streptococcus mitis has the lowest (45). We observed a statistically significant correlation between the genome size and the frequency of SSRs (r = 0.83, p = 0.0001), which means that genome size could influence the frequency of SSRs. Hence, in order to minimise this effect, we calculated the relative abundance and density of SSRs. With the normalised values, Staphylococcus epidermidis remained at the top with the highest RA and RD (50.6 and 631.5), but Streptococcus suis became the lowest (22.3 and 335.8), displacing Streptococcus mitis (23.4 and 305.2) (Table 2). We further calculated the percentage of repeats in the genomes. Again, the maximum percentage of repeats (0.063) was observed in S. epidermidis, whereas Streptococcus mutans had the lowest (0.029).
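The correlation between genome size and SSR frequency reported above is a Pearson coefficient. The study computed it in R. The following is only a minimal Python sketch of the same statistic, and the function name `pearson_r` and the values in the usage note are hypothetical, not data from this study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

Called as `pearson_r(genome_sizes_mb, ssr_counts)` with one entry per species, a value close to +1 would indicate that larger genomes carry more SSRs, which is why counts are normalised to RA and RD before species are compared.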
Among the classes of SSRs, tetranucleotide repeats were preferred in the intergenic regions (67.1.2%), whereas genic regions were dominated by trinucleotide repeats (49%). The second most frequent motif class in the intergenic region was tetranucleotide repeats (17.6%), while it was tetranucleotide repeats (44.4%) in the genic sequences. Hexanucleotide SSRs were the least preferred (4.8%) in the intergenic regions, whereas genic regions avoided the presence of di-nucleotide SSRs (0.5%). Staphylococcus haemolyticus harbours the maximum SSRs as tetranucleotide repeats in both genic as well as intergenic regions (Supplementary table 1).

3.2 Tetranucleotide repeats were the most preferred classes of repeats among the human pathogens
We further estimated the RA and RD of different classes of repeats among human pathogens. Most pathogens, irrespective of their genera, prefer tetranucleotide repeats within their genomes (Table 3). The species defying this trend were S. epidermidis, S. pyogenes, and S. sobrinus. It was interesting to note that S. saprophyticus had exactly the same RA for tetra and trinucleotide SSRs. Dinucleotide repeats were completely absent in S. aureus, S. haemolyticus, S. mutans, S. salivarius, and S. suis. Similarly, S. gordonii had no hexanucleotide repeats in its genome.

Hexanucleotides constitute the longest SSRs among pathogens
We further evaluated our data to find the longest SSRs within each genome. The majority of the genomes (73.0%) have hexanucleotides which are repeated thrice as the longest SSRs. S. agalactiae and S. sobrinus have trinucleotide repeats as their longest SSRs, along with hexanucleotide and pentanucleotide repeats (Table 4). The majority of the long SSRs were of 18 base pairs in length, with the exception of a 125 bp long repeat in S. suis where a pentanucleotide (GAGCA) was found to be repeated 25 times. It is noteworthy that most of the long SSRs were located in genes that are directly or indirectly related to pathogenesis.

Gene enrichment studies
Gene sequences with long SSRs from S. haemolyticus, E. faecium, and S. agalactiae were taken for gene enrichment analysis. A total of 231 SSR-containing sequences from all three species were aligned to 34 different functional pathways. The highest number of genes with repeats was located in S. haemolyticus (187, 47.5%), followed by E. faecium (114, 28%), and the lowest was observed in S. agalactiae (93, 23.6%). SSRs were discovered in the genes of specific metabolic pathways in all species. Genes involved in the organic substance metabolic process, nitrogen compound metabolic process, catalytic activity, and small molecule binding, for example, were discovered in each of the three species. Some pathways have genes in common for two species. For instance, repeats were discovered in genes linked to biological regulation in S. haemolyticus and E. faecium. Other pathways have genes for only one species (a stress-responsive gene was present only in S. haemolyticus) (Figure 1). The most noteworthy finding in this study was the presence of SSRs in pathogenicity-related genes, which supports an important role of SSRs in pathogenicity.

Genus-wise preference for amino acids
Since trinucleotide SSRs can encode amino acids if they are present in the reading frame, we investigated the occurrence of amino acids encoded by trinucleotide SSRs in the genic sequences of pathogenic bacteria. A total of 1820 amino acids were encoded by tri-nucleotide SSRs among all the pathogens. Isoleucine was the most abundant amino acid (11.0%), followed by lysine (8.2%). Glutamic acid was the third most abundant amino acid (8.1%), whereas proline-encoding repeats were the least abundant (0.3%). It is evident from the data that different bacteria have their own preferences for amino acids. Isoleucine, for example, is the most abundant amino acid in two Streptococcus, two Staphylococcus, and one Enterococcus bacteria, whereas glutamic acid was highest in one bacterium from each genus (Supplemental table 2). Since there are 20 amino acids, it would be difficult to interpret each in every organism individually. Hence, we further performed Principal Component Analysis (PCA), which is a technique for lowering the dimensionality of datasets, improving interpretability while minimising information loss. Staphylococcus bacteria clustered together with higher PC1 values, whereas Streptococcus bacteria clustered together with lower PC1 values. Enterococcus bacteria were placed far apart, one with a higher and one with a lower PC1 value.
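How a trinucleotide SSR maps to an amino acid run depends on the frame in which the motif is read. The sketch below illustrates this with the standard genetic code, whose codon-to-amino-acid assignments are shared by the bacterial translation table. The function names `encoded_amino_acid` and `tally_amino_acids` and the `frame` parameter are hypothetical, and this is not the authors' pipeline.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# at each of the three positions ("*" marks a stop codon).
_BASES = "TCAG"
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def encoded_amino_acid(motif, frame=0):
    """One-letter amino acid encoded by a trinucleotide motif.

    frame (0, 1, or 2) is the offset of the reading frame relative to
    the motif as written; the motif is rotated accordingly.
    """
    motif = motif.upper()
    codon = motif[frame:] + motif[:frame]
    return CODON_TABLE[codon]


def tally_amino_acids(ssrs):
    """Sum repeat counts per amino acid for (motif, frame, repeats) tuples."""
    counts = Counter()
    for motif, frame, repeats in ssrs:
        counts[encoded_amino_acid(motif, frame)] += repeats
    return counts
```

For instance, `encoded_amino_acid("ATT")` gives isoleucine and `encoded_amino_acid("GTG")` gives valine, while a `TGA` motif read at offset 1 becomes `GAT` (aspartic acid). Summing repeat counts per amino acid across all in-frame trinucleotide SSRs yields per-species profiles of the kind that could then be passed to PCA.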

Motif conservation among human pathogens
We selected 13 housekeeping genes that perform important biological functions such as replication, transcription, and translation to study the conservation of motifs within them. The highest concentration of long SSRs was located in the rplD (50S ribosomal protein) gene. This was followed by the infB gene, which had five SSRs, and the uvrC gene, which had three SSRs. Among the housekeeping genes, maximum conservation of motifs was recorded in the rplD gene, where motif (gtg)4 was found conserved within eight species of Staphylococcus and Streptococcus bacteria. Among the genera, Staphylococcus shows more conservation than its counterparts. For instance, motif (atca)3 was conserved within the infB gene of S. aureus, S. epidermidis, and S. saprophyticus. Similarly, motif (tga)4 was conserved in S. aureus and S. epidermidis, and motif (gtc)4 was conserved in the rpsO (30S ribosomal protein) gene of S. epidermidis and S. haemolyticus (Table 5). Motif conservation was also witnessed among Streptococcus bacteria. Motif (tga)4 was conserved between S. agalactiae and S. mutans in the parC (DNA topoisomerase 4 subunit A) gene.

Genetic relationship
To determine the degree of polymorphism, a similarity matrix was created using Jaccard's estimate of similarity, which is based on the probability that an SSR found in one species' housekeeping genes will also be found in the housekeeping genes of another species. Over the 120 combinations, the calculated similarity coefficient ranged from 0 to 1.0, with a mean of 0.73. Staphylococcus bacteria had a higher average genetic diversity of 26 percent when compared to Streptococcus and Enterococcus bacteria, which had lower average genetic diversity of 11.0 percent and 3.0 percent, respectively. Gene rpmH (50S ribosomal protein) was the most informative (PIC = 0.99), whereas rplD was the least (PIC = 0.78) among all. We further constructed a dendrogram based on similarity coefficient values. The dendrogram resulted in two main clusters, A and B. Cluster A comprises S. aureus and S. saprophyticus, whereas the remaining species cluster together in B. B was further subdivided into two clusters, BI and BII. BI has the remaining Staphylococcus species, whereas in BII, Streptococcus and Enterococcus bacteria cluster together in two separate clades. Since this dendrogram is based on hypervariable regions of the genome, it would be interesting to see how much it differs from conserved region-based phylogeny. We therefore constructed a second dendrogram based on the 16S ribosomal regions of the genome. This dendrogram was also divided into two main clusters, A and B. Cluster A groups all the Streptococcal bacteria together, whereas Cluster B separates Staphylococcus and Enterococcus bacteria into two different clades.

Discussion
Bacterial pathogens confront difficult conditions due to numerous unpredictable, frequently abrupt, and dynamic changes that occur in the host environment or during transmission from one host to another (Bayliss 2009). Bacterial adaptation to their hosts entails either a system for sensing and responding to environmental changes or selecting mutation-induced variations within their contingency loci (Moxon, et al. 2006). These loci allow bacterial populations to adapt to or survive selective pressures by generating and spreading genetic variants that are "fitter," or better adapted to, a particular environment than the majority of the population. Through comparative genomics, it is now possible to find differences in genetic variants across entire genomes, and to tie those differences to biological function, as well as to learn more about selective patterns of gene transfer or loss and evolutionary pressures, especially when it comes to virulence in pathogenic organisms (Fitzgerald and Musser 2001). We analysed the frequency and distribution of long SSRs in sequenced species of human pathogenic Staphylococcus, Streptococcus, and Enterococcus bacteria. Among the three, Staphylococcus has a relatively higher frequency of long repeats as compared to the others. To find an appropriate justification for higher repeats in Staphylococcus bacteria, we looked into their G+C content. In our previous reports on fungi, we observed a positive correlation between G+C content and the frequency of SSRs (Mahfooz, et al. 2017, Mahfooz, et al. 2016). However, we found a statistically significant negative correlation (r = -0.83, p = 0.0001) between the two in this study. Hence, we can hypothesise that higher A+T content is a good predictor of the frequency of repeats among these human pathogenic bacteria. We found S. epidermidis harbouring the maximum frequency of long repeats among all the pathogenic bacteria. A close comparison revealed that S. epidermidis has a higher percentage (~9) of genomic elements (genome islands and Staphylococcus cassette chromosome-like elements, insertion sequence elements, integrated prophage, integrated plasmids, and composite transposons) than its closest species, S. aureus (~7) (Hiramatsu, et al. 2001). We can speculate that this could be the reason why we obtained a higher frequency of long SSRs in S. epidermidis.
Among the classes of SSRs, we found an abundance of tetranucleotide SSRs in the intergenic region, whereas trinucleotide SSRs were found dominant in the genic regions. Tetranucleotide repeats present in the intergenic regions are reported to modulate transcription factor binding and consequently modulate gene expression (Martin, et al. 2005). The presence of trinucleotide repeats in the genic regions is expected, as it avoids frameshift mutations and these triplets could code for amino acid runs that may have specific functions in the protein structure (Metzgar, et al. 2000). Further analysis of repeat classes at the whole genome level revealed that in most of the bacterial species, the RA and RD of tetranucleotide SSRs were the highest. Their occurrence in intergenic regions is unsurprising, as they can regulate gene expression by binding RNA polymerase. However, their presence in genic sequences is surprising, as it could change the open reading frame. It is believed that rearrangement within these tetrameric repeats could work as a switch on/off mechanism in phase variable genes (De Bolle, et al. 2000).
Further analysis of the data showed hexanucleotides constituted the longest SSRs in most of the species, which is expected as they contribute the highest number of repeats when compared to other classes of SSRs. We observed an unexpectedly higher repeat number (25) of a pentanucleotide repeat (gagca) that codes for a hypothetical protein in S. suis. We can hypothesise that S. suis could have acquired this repeat through horizontal gene transfer (Perna, et al. 2001).
We further examined the tri-nucleotide SSRs that are likely to be transcribed and translated in frame into amino acid residue repeats. Isoleucine was the most abundant amino acid, followed by lysine and glutamic acid. It has been reported that in bacterial physiology, branched-chain amino acids like isoleucine play a variety of roles, from promoting protein synthesis to signalling and fine-tuning the adaptation to amino acid deficiency. In some pathogenic bacteria, the response to amino acid deficiency includes activation of virulence gene expression. As a result, isoleucine aids not just infection but also evasion of host defences (Kaiser Julienne, et al.). The second most abundant amino acid, lysine, is used for protein synthesis and the peptidoglycan layer of Gram-positive bacterial cell walls. Additionally, the relevance of lysine for bacterial cell survival is highlighted by the availability of numerous biosynthetic routes for lysine synthesis in bacteria (Gillner, et al. 2013). The quantity of glutamic acid in bacteria appears to be linked mostly to tolerance to acidic environments. Food-borne pathogens and spoilage bacteria would be able to grow on acidic foods if they developed acid resistance. This feature is also a virulence factor, as it permits pathogens to survive passage through the very acidic conditions of the stomach barrier (Feehily and Karatzas 2013). We wanted to explore whether any association could be found based on amino acids encoded by tri-nucleotide SSRs. We then performed functional annotation on the genes containing these long SSRs. This allowed us to figure out whether these long SSRs were linked to any particular biological process. SSRs were found in the majority of the important biological function pathways. This raises the risk of non-functional proteins as a result of frameshifts, implying that these species may have evolved SSRs in coding areas to promote phase variation (Lin and Kussell 2012).
Notably, SSRs were located in pathogenicity genes in Staphylococcus and Streptococcus bacteria, which supports their involvement in pathogenicity.
SSRs were found in housekeeping genes in our study; however, the majority of SSRs were tri-nucleotide motifs, with a few tetra- and hexanucleotide motifs. These housekeeping genes are linked to a variety of biological functions. The existence of SSRs in housekeeping genes is surprising because SSRs are known to be mutation hotspots (Lin and Kussell 2012), and any mutation in a housekeeping gene could be fatal. Because most of these SSRs are trinucleotides, their chances of causing phase variation are low. In a previous study, a significant difference was reported in the 5'-UTR while comparing the densities and repeat types of SSRs between housekeeping and tissue-specific genes. According to the report, the GC content of trinucleotide SSRs in the 5'-UTRs of housekeeping genes is higher than that of tissue-specific genes (Lawson and Zhang 2008). Motif (gtg)4 was found to be conserved in the rplD gene in eight species of Staphylococcus and Streptococcus bacteria. GTG is also an alternative start codon in bacteria (Hwang, et al. 2005), but its presence as a repeated motif rules out any such role and points instead to its translation to valine, which has a multifaceted physiological role in bacterial survival and pathogenicity (Kaiser Julienne, et al.).
We further attempted to construct a phylogenetic tree using the presence of SSRs in particular housekeeping genes, as these genes are more conserved with minimal selection pressure. The highest and lowest polymorphism levels were obtained among ribosomal 50S subunit genes. The 50S ribosomal subunit is a frequent target for various antibiotics (Champney, et al. 2003), but with the presence of repetitive sequences, the bacteria may evade the binding of these antibiotics. The SSR-based phylogeny grouped Enterococcus species along with Streptococcus, whereas the 16S ribosomal region-based phylogeny grouped Enterococcus and Staphylococcus together. Until 1984, Enterococcus species were classified as Streptococcus (Andrewes and Horder 1906). However, with the advancement of techniques like DNA-DNA hybridisation and 16S RNA sequencing, a new genus Enterococcus was formed (Schleifer and Kilpper-Bälz 1984). We observed that the resolution power of the SSR-based phylogeny was low, as it could not differentiate between five species of Streptococcus, which the 16S ribosomal region-based phylogeny easily did.

Conclusion
In this study, we have discovered a pattern of long-SSR distribution in the genic and intergenic regions of human pathogenic Staphylococcus, Streptococcus, and Enterococcus bacteria and established a relationship between them. Human pathogens have a higher relative abundance of SSRs as compared to non-pathogenic control. The genic regions were home to the majority of the repeat motifs. As a result, the presence of microsatellites in human pathogens is not random. A novel finding in this study was the genus-specific preference for amino acids encoded by trinucleotide SSRs.

Supplementary Files
Supplementary file associated with this preprint: Supplementarytable.docx