Species and whole-genome sequences
Whole-genome sequences of nine species were downloaded from the UCSC Genome Browser (https://hgdownload.soe.ucsc.edu) and analyzed. The species were as follows (genome sizes, in base pairs, follow each species name): rat (Rattus norvegicus): 2,647,915,728; mouse (Mus musculus): 2,728,222,451; gelada (Theropithecus gelada): 2,889,630,685; olive baboon (Papio anubis): 2,869,821,163; macaque (Macaca mulatta): 2,946,843,737; gorilla (Gorilla gorilla gorilla): 3,063,362,754; chimpanzee (Pan troglodytes): 3,050,398,082; bonobo (Pan paniscus): 3,203,531,224; and human (Homo sapiens): 3,099,706,404. These species encompassed rodents (rat and mouse), Old World monkeys (gelada, olive baboon, and macaque), and great apes (gorilla, chimpanzee, bonobo, and human).
Extraction of STRs from genomic sequences
The whole-genome abundance of mononucleotide STRs of ≥10 repeats, dinucleotide STRs of ≥6 repeats, and trinucleotide STRs of ≥4 repeats was studied in the nine selected species. To that end, we designed a software package in Java (https://github.com/arabfard/Java_STR_Finder). We analyzed all possible mononucleotide motifs (A, C, T, and G), all possible dinucleotide motifs (AC, AG, AT, CA, CG, CT, GA, GC, GT, TA, TC, and TG), and all possible trinucleotide motifs (AAC, AAT, AAG, ACA, ACC, ACT, ACG, ATA, ATC, ATT, ATG, AGA, AGC, AGT, AGG, CAA, CAC, CAT, CAG, CCA, CCT, CCG, CTA, CTC, CTT, CTG, CGA, CGC, CGT, CGG, TAA, TAC, TAT, TAG, TCA, TCC, TCT, TCG, TTA, TTC, TTG, TGA, TGC, TGT, TGG, GAA, GAC, GAT, GAG, GCA, GCC, GCT, GCG, GTA, GTC, GTT, GTG, GGA, GGC, and GGT). The program detected perfect (pure) STRs only. The algorithm started at the first nucleotide of the genome and walked along the genome one nucleotide at a time. At each position, it first examined a window of 2N nucleotides, where N is the length of the STR core. If the first half of the window (the candidate core) was not equal to the second half, the algorithm moved one nucleotide forward. Otherwise, it compared each subsequent block of N nucleotides with the core, continuing until a block differed from the core. The resulting run was recorded as a new STR with a core of length N and the number of repeats found, and the search then continued from the end of the identified STR. We repeated this process for N = 1, 2, and 3.
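The scanning procedure described above can be illustrated with a minimal Python sketch. This is not the authors' Java implementation; the names `STR` and `find_perfect_strs` are hypothetical, and several details are assumptions that the text does not specify: input is converted to upper case, cores containing characters other than A, C, G, and T are skipped, single-nucleotide cores are skipped when N > 1 (consistent with the motif lists above), and a run shorter than the repeat threshold advances the scan by one nucleotide rather than jumping past the run.

```python
from typing import List, NamedTuple


class STR(NamedTuple):
    start: int    # 0-based start position of the repeat tract
    core: str     # repeated motif of length N
    repeats: int  # number of consecutive copies of the core


def find_perfect_strs(sequence: str, core_len: int, min_repeats: int) -> List[STR]:
    """Scan `sequence` for perfect STRs with a core of `core_len` nucleotides."""
    seq = sequence.upper()  # assumption: ignore soft-masking (lower case)
    n = core_len
    found: List[STR] = []
    i = 0
    while i + 2 * n <= len(seq):
        core = seq[i:i + n]
        # Assumption: skip cores with non-ACGT symbols, and single-nucleotide
        # cores such as "AA" or "CCC" when N > 1 (absent from the motif lists).
        if set(core) - set("ACGT") or (n > 1 and len(set(core)) == 1):
            i += 1
            continue
        # Window of 2N: compare the first half (core) with the second half.
        if seq[i + n:i + 2 * n] != core:
            i += 1
            continue
        # Extend the run block by block while the next N nucleotides match.
        repeats = 2
        j = i + 2 * n
        while seq[j:j + n] == core:
            repeats += 1
            j += n
        if repeats >= min_repeats:
            found.append(STR(i, core, repeats))
            i = j  # continue from the end of the identified STR
        else:
            i += 1  # assumption: short runs only advance by one nucleotide
    return found


# Toy sequence: 12 x A, 6 x AC, and 4 x CAG, separated by spacer nucleotides.
toy = "GG" + "A" * 12 + "G" + "AC" * 6 + "T" + "CAG" * 4 + "TT"
mono = find_perfect_strs(toy, core_len=1, min_repeats=10)
di = find_perfect_strs(toy, core_len=2, min_repeats=6)
tri = find_perfect_strs(toy, core_len=3, min_repeats=4)
```

Running one pass per core length, each with its own repeat threshold (10, 6, and 4), mirrors the repetition over N = 1 to 3 described in the text.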
Chromosome-by-chromosome aggregation of STRs
Whole-genome chromosome-by-chromosome data were aggregated and analyzed in the nine species, without normalization (approach 1) and with normalization (approach 2). In approach 1, data from all chromosomes were collected, without removing chromosomes that had no numerical counterpart in the other species. In approach 2, only data from the numerically identical chromosome sets across the nine species were collected, in an array of 20 columns, each column corresponding to a chromosome. Mouse was selected as the reference for this approach because it had the lowest number of chromosomes among the nine species; that is, because the species differ in chromosome number (karyotype), the minimum set of chromosomes across the selected species was used for normalization.
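The two aggregation approaches can be sketched in Python as follows. This is an illustrative sketch only: the function names, the nested-dictionary layout, and the toy counts are hypothetical, and it assumes that "numerically identical" chromosomes are matched by chromosome label, with the species that has the fewest chromosomes supplying the column set.

```python
from typing import Dict, List

# species -> chromosome label -> STR count (hypothetical layout)
Counts = Dict[str, Dict[str, int]]


def aggregate_all(counts: Counts) -> Dict[str, List[int]]:
    """Approach 1: keep every chromosome; rows may differ in length."""
    return {species: list(chroms.values()) for species, chroms in counts.items()}


def aggregate_common(counts: Counts) -> Dict[str, List[int]]:
    """Approach 2: keep only the chromosome labels of the reference species."""
    # The reference is the species with the fewest chromosomes.
    reference = min(counts, key=lambda species: len(counts[species]))
    columns = list(counts[reference])
    # Raises KeyError if a species lacks one of the reference labels.
    return {
        species: [chroms[label] for label in columns]
        for species, chroms in counts.items()
    }


# Made-up counts for two placeholder species, for illustration only.
toy_counts = {
    "species_a": {"1": 50, "2": 40, "3": 30},
    "species_b": {"1": 45, "2": 35},
}
unnormalized = aggregate_all(toy_counts)
normalized = aggregate_common(toy_counts)
```

In the normalized result every species contributes the same number of columns, which is what allows the rows to be compared column by column.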
STR abundance and hierarchical cluster analysis across species
Whole-genome STR abundances across the selected species were analyzed and depicted by boxplots and hierarchical clustering, using the boxplot and hclust functions[33] in R, respectively. The boxplots illustrate abundance differences among segments across the selected species, and the hierarchical clustering plots show the degree of similarity and difference among the obtained abundances. The input data to these functions were the numerical arrays obtained with each approach. Each array consisted of a number of columns, each column corresponding to the STR abundance in one chromosome.
Statistical analysis
The STR abundances across the nine selected species were compared by repeated-measures analysis, using one-way and two-way ANOVA tests. The results of these analyses were confirmed by nonparametric tests.