Tandem repeats ubiquitously flank and select translation initiation sites.


Findings in yeast and human suggest that evolutionary divergence in cis-regulatory sequences affects translation initiation sites (TISs). Here we employed the TIS homology concept to study a possible link between all categories of tandem repeats (TRs) and TIS selection. Human and 83 other species were selected, and data were extracted on all protein-coding genes (n = 1,611,368) and transcripts (n = 2,730,515) annotated for those species in Ensembl 102. On average, each transcript was flanked by 1.19 TRs of various categories in its 120 bp upstream RNA sequence. We detected a statistically significant excess of non-homologous TISs co-occurring with human-specific TRs, and vice versa. We conclude that TRs are abundant cis elements in the upstream sequences of TISs across species, and there is a link between all categories of TRs and TIS selection. TR-induced symmetric and stem-loop structures may function as genetic marks for TIS selection.


INTRODUCTION
Translational regulation can be global or mRNA specific, and most instances of translational regulation affect the rate-limiting initiation step (1,2). While mechanisms that result in the selection of translation initiation sites (TISs) are largely unknown, conservation of the alternative TIS positions and the associated open reading frames (ORFs) between human and mouse cells (3) implies physiological significance of alternative translation. A vast number of human protein-coding genes contain alternative TISs, which are selected based on complex and not yet fully understood scanning mechanisms (3-6). The alternative TISs can result in various protein structures and functions (7,8).
Whereas lack of sufficient knowledge has led to the conclusion that TISs are stochastic for the most part, the probability of using a particular TIS differs among mRNA molecules, and can be dynamically regulated over time (9). Selection of TISs and the level of translation and protein synthesis depend on the cis-regulatory elements in the mRNA sequence and on its secondary structure, such as the formation of hairpins and stem-loops, and its thermal stability (10-15). In fact, the ribosomal machinery has the potential to scan and use several ORFs on a particular mRNA species (16).
A tandem repeat (TR) is a sequence of one or more DNA base pairs (bp) that is repeated on a DNA stretch. While TRs have profound effects in evolutionary, biological, and pathological terms (17-23), the effect of these intriguing elements on protein translation remains largely (if not totally) unknown. A limited number of publications indicate that, when located at the 5′ or 3′ UTR, short tandem repeats (STRs) (core units of 1-6 bp) can modulate translation, with biological and pathological implications (24-28). Abnormal STR expansions impact TIS selection in a number of neurological disorders (29,30).
Based on a TIS homology approach, we recently reported a link between STRs and TIS selection (31). Here, we extended our weighing methods, and developed a comprehensive software package, to study a possible link between all categories of TRs and TIS selection across 84 species.

TRs are ubiquitous cis elements flanking TISs.
A total of 1,611,368 protein-coding genes and 2,730,515 transcripts were investigated across the 84 selected species, which resulted in the extraction of 3 [...]. Across TR categories 1-4, we detected 660, 101, 339 and 404 different types of human-specific TRs, respectively, the most abundant of which are presented in Table 1.

Link between TRs and TIS selection
We employed two weighing settings for designating homologous vs. non-homologous TISs, one of which was similar to our previous approach (31). In both settings, there was significant co-occurrence of human-specific TRs with non-homologous TISs, and non-human-specific TRs with homologous TISs (Fig. 2). The results were replicated in 10-fold validation (Fig. 3).
In addition to the weighing vectors used, we employed the Needleman-Wunsch algorithm (32) to check the robustness of the proposed link. Human TIS homology was checked for the proteins encoded by the orthologous genes in three species: mouse, macaque, and chimpanzee. A dramatically lower homology was observed for human proteins linked to human-specific TRs (Fig. 4).
Similarity calculation between human proteins and those of the three other species was performed via the RESTful API at: https://www.ebi.ac.uk/Tools/psa/emboss_needle (33).
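As a rough illustration of the global alignment idea behind this robustness check, the following Python sketch computes a Needleman-Wunsch score using a simple match/mismatch/linear-gap scheme. The function name and the scoring values (+1, -1, -1) are assumptions chosen for illustration; the EMBOSS needle service uses a substitution matrix and affine gap penalties, so its scores will differ from those of this sketch.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b (linear gap penalty).

    Illustrative scoring only; not the EMBOSS needle defaults.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GCATGCG"))
```

A lower score between a human protein and its ortholog indicates lower global similarity under the chosen scoring scheme.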

Evolutionary and biological implications
Of the 15,256 human genes which had at least one transcript flanked by a TR in their upstream flanking region, 2,991 genes had at least one transcript flanked by a human-specific TR (Supplementary Tables 2 & 3). Text mining of a number of those genes as examples (34,35) yielded predominant expression and functions in the human brain and skeletal muscle (e.g., MYH2, TTN, SLC6A8, CACNA1A, and EIF5AL1) (Table 2). These are examples of expression enrichment in tissues that are frequently subject to species-specific evolutionary processes. Gene ontology and pathway enrichment analyses were performed on the genes from Table 2 using the online tool "Enrichr", an application for high-throughput gene function analysis (https://maayanlab.cloud/Enrichr/enrich) (36-38). The list of genes in Table 2 was given as input, and "GO Biological Process 2021" and "KEGG 2021 Human" were selected as libraries, respectively (Fig. 5). Interestingly, nervous system development (GO 0007399) was the top enriched ontology term, and the calcium signaling pathway was the top pathway. Calcium signaling pathways are increasingly recognized as essential processes in human brain neurogenesis (39).
Our findings provide primary evidence of a link between all categories of TRs and TIS selection, the mechanisms of which are virtually unknown at this time. Our approach was based on homology search, which reliably identifies "homologous" proteins or genes by detecting excess similarity (40).
This approach was performed using two weighing vectors and a confirmatory algorithm, which consistently supported the link.
It is possible that asymmetric and stem-loop structures, which are inherent properties of repeat sequences, result in genetic marks that enhance TIS selection. Asymmetric TR structures have recently been reported to be linked to various biological functions, such as replication and initiation of transcription start sites (41). It remains to be clarified how this intriguingly abundant reservoir of regulatory elements contributes to TIS selection across species.

Conclusion
We conclude that TRs are abundant cis elements that flank TIS sequences, and contribute to TIS selection at the trans-species level. These findings shed light on an underappreciated aspect of evolutionary biology, which warrants future functional analyses.

Data collection
All sequences, species, and gene datasets collected in this study were based on Ensembl 102 (https://www.ensembl.org). A total of 84 species were selected, encompassing orders of vertebrates and non-vertebrates (Fig. 6). Throughout the study, all species were compared with the human sequence as reference. The list of species was extracted via the RESTful API, in the Java language. In parallel, a list of available gene datasets of the selected species was collected by using the "biomaRt" package (42,43) in the R language.
In the next step, for each selected species, all protein-coding transcripts of protein-coding genes were extracted. Subsequently, the 120 bp upstream flanking sequences of all annotated protein-coding TISs were retrieved and analyzed. All steps of data collection, except fetching the primary list of available species and gene datasets, were performed by querying the Biomart Ensembl tool via the RESTful API, implemented in the Java language. For each species, its name, common name and display name were retrieved. For each gene in each species, its gene name, Ensembl ID and annotated transcript IDs were retrieved, and finally, for each transcript, its coding DNA sequence, upstream flanking region and protein sequence were retrieved. All collected data were stored in a MySQL database, which is accessible at https://figshare.com/search?q=10.6084%2Fm9.figshare.15405267 .
A candidate sequence was considered a TR if it complied with the following four rules: (1) for mononucleotide cores, the number of repeats should be ≥ 6; (2) for 2-9 bp cores, the number of repeats should be ≥ 3; (3) for all other core lengths, the number of repeats should be ≥ 2; and (4) TRs of the same core sequence should not overlap if they were in the same upstream flanking region.
We categorized the TRs based on the core lengths as follows: Category 1: 1-6 bp, Category 2: 7-9 bp, Category 3: 10-15 bp, and Category 4: ≥16 bp. This was an arbitrary classification to allow for possible differential effect of various core length ranges.
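The detection rules and core-length categories described above can be sketched in Python as follows. This is a minimal illustration and not the authors' Java implementation: the function names, the `max_core` limit, the restriction to the A/C/G/T alphabet, and the choice to skip cores that are themselves repeats of a shorter unit (for example "AA") are all assumptions made here.

```python
import re


def min_repeats(core_len):
    """Minimum copy number required for a core of the given length."""
    if core_len == 1:
        return 6
    if 2 <= core_len <= 9:
        return 3
    return 2


def category(core_len):
    """TR category by core length: 1-6, 7-9, 10-15, >=16 bp."""
    if core_len <= 6:
        return 1
    if core_len <= 9:
        return 2
    if core_len <= 15:
        return 3
    return 4


def is_primitive(core):
    """True if the core is not itself a repetition of a shorter unit (assumption)."""
    return (core + core).find(core, 1) == len(core)


def find_tandem_repeats(seq, max_core=60):
    """Return (core, copies, start, category) tuples found in an upstream sequence."""
    seq = seq.upper()
    hits = []
    for core_len in range(1, min(max_core, len(seq) // 2) + 1):
        need = min_repeats(core_len)
        # Lookahead so that every start position is examined; group 2 is the core.
        pattern = re.compile(r"(?=(([ACGT]{%d})\2{%d,}))" % (core_len, need - 1))
        last_end = {}  # core -> end of the last accepted run (rule 4: no overlap)
        for m in pattern.finditer(seq):
            run, core = m.group(1), m.group(2)
            if not is_primitive(core):
                continue
            if m.start() < last_end.get(core, 0):
                continue
            last_end[core] = m.start() + len(run)
            hits.append((core, len(run) // core_len, m.start(), category(core_len)))
    return hits


if __name__ == "__main__":
    for hit in find_tandem_repeats("GGCAAAAAAATTCAGCAGCAGTT"):
        print(hit)
```

Note that, following rule 4 literally, only runs with the same core sequence are prevented from overlapping; rotations of a core (for example AC and CA within ACACACAC) are reported separately in this sketch.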

Retrieval of data across species
Using the enhanced query form (Supplementary Table 4) on the Biomart Ensembl tool, along with the RESTful API tools, a Java package was developed to retrieve, store, and analyze the data and information. The source code and the Java package are available at: https://github.com/Yasilis/STRsMiner-JavaPackage_PaperSubmission/tree/develop .

Identification of Human-specific TRs
The TIS-flanking upstream 120 bp of all annotated protein-coding transcripts of protein-coding genes were screened in 84 species for the presence of TRs in four categories based on the core length. The data obtained on the human TRs was compared to those of other species, and the TRs which were specific to human were identified. The selected genes were clustered based on their names (orthologous genes were placed in a cluster). All TRs of each cluster were extracted and categorized based on the species. In the next step, in each cluster, only TRs that were specific to human (not detected in other species) were retained and set as reference.
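The clustering and filtering steps above amount to a per-gene set difference, which can be sketched in Python as follows. The record layout, the function name, the species labels, and the decision to identify a TR by its core sequence alone are assumptions for illustration; they are not taken from the authors' package.

```python
from collections import defaultdict


def human_specific_trs(records, reference="homo_sapiens"):
    """Map each gene cluster to the TR cores seen only in the reference species.

    records: iterable of (gene_name, species, tr_core) tuples (hypothetical layout).
    """
    # Cluster TRs by gene name, then by species.
    by_gene = defaultdict(lambda: defaultdict(set))
    for gene, species, core in records:
        by_gene[gene.upper()][species].add(core)

    result = {}
    for gene, per_species in by_gene.items():
        human = per_species.get(reference, set())
        others = set().union(
            *(cores for sp, cores in per_species.items() if sp != reference)
        )
        specific = human - others  # TRs not detected in any other species
        if specific:
            result[gene] = specific
    return result


if __name__ == "__main__":
    demo = [
        ("TTN", "homo_sapiens", "CAG"),
        ("TTN", "mus_musculus", "CAG"),
        ("TTN", "homo_sapiens", "GCC"),
    ]
    print(human_specific_trs(demo))
```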

Evaluation of TIS homology
Identifying the degree of homology between two transcripts requires assigning a weight value to each position of the sequences. Weighted homology scoring was performed in two different weight settings, as weighing vectors 1 and 2. If M is the first methionine amino acid of the two peptide sequences (position 0 in the two weighing vectors), then for the next five successive positions, represented by $i$ in the formula (Eq. 9), we defined five weight coefficients, $w_{k,1}$ to $w_{k,5}$, for weight setting $k$, as given in the vector.
Homology of the five amino acids and, therefore, of the TIS was inferred from the similarity score, in which a similarity of ≥ 50% was considered "homology". This threshold was derived from three thousand random pairwise similarity checks (BLAST) of the initial five amino acids of randomly selected proteins, as previously described (31).

Scoring human-specific and non-specific TR co-occurrences with TISs.
In both weighing methods, the initial five amino acids (excluding the initial methionine) of the human TISs that were flanked by human-specific TRs and non-human-specific TRs were BLASTed against the initial five amino acids (excluding the initial methionine) of the orthologous genes in the remaining 83 species. This was aimed at comparing the number of events in which human-specific and non-specific TRs co-occurred with homologous and non-homologous TISs. To compute the number of homologous and non-homologous TISs, we first introduced the following notation. We defined $G$ as the set of all human protein-coding genes. Therefore, $g$ denoted a gene that belonged to the set $G$, i.e. $g \in G$ (Eq. 3).
We also defined $T_h(g)$ and $T_o(g)$ as the sets of all annotated transcripts of a gene $g$ which belonged to human and to other species, respectively (Eq. 4, 5).
$$T_h(g) = \{\, t \mid t \text{ was a human protein-coding transcript which belonged to the gene } g \,\} \qquad (4)$$
$$T_o(g) = \{\, t \mid t \text{ was a protein-coding transcript which belonged to the gene } g \text{, but did not exist in human} \,\} \qquad (5)$$
Moreover, $T^{*}(g)$ denoted all filtered transcripts of $g$ which had at least one human-specific TR at the 120 bp genomic DNA interval upstream of the TIS, while $T^{+}(g)$ denoted all filtered transcripts of $g$ which had at least one TR at the 120 bp genomic DNA interval upstream of the TIS.
The following formula was developed to measure the degree of similarity of two peptides in the two weighing settings (Eq. 6).
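In this formula, $\Theta_k$ is a binary function that decides whether two transcripts are homologous or not, and $k \in \{1, 2\}$ refers to each weight setting. If the function $S_k$ measures the similarity score, $\Theta_k$ can be defined as follows (Eq. 7):
$$\Theta_k(t_1, t_2) = \begin{cases} 1 & \text{if } S_k(t_1, t_2) \geq 50\% \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$
For calculating the similarity score, we used another binary function. We defined $\Phi$ as follows (Eq. 8):
$$\Phi(a, b) = \begin{cases} 1 & \text{if } a = b \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$
This function takes two amino acids as arguments and returns 1 if they are the same, and 0 if they are not. Therefore, $S_k(t_1, t_2)$ is defined by the following formula (Eq. 9):
$$S_k(t_1, t_2) = \sum_{i=1}^{5} w_{k,i}\, \Phi\big(a_i(t_1), a_i(t_2)\big) \qquad (9)$$
In this function, the $i$th amino acid in the sequence of the transcript $t$ is denoted by $a_i(t)$, and $w_{k,i}$ is the weight coefficient of position $i$ in weighing vector $k$.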
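The scoring functions described above can be sketched in Python as follows. The two weight vectors shown are placeholders chosen only so that the weights sum to 100; they are assumptions for illustration and are not the weighing vectors used in the study. The function names are likewise hypothetical.

```python
# Placeholder weight vectors (assumptions, NOT the vectors used in the study).
WEIGHTS = {
    1: [20, 20, 20, 20, 20],  # uniform weights over positions 1..5
    2: [30, 25, 20, 15, 10],  # weights decreasing with distance from the methionine
}


def phi(a, b):
    """Binary match function: 1 if the two amino acids are identical, else 0."""
    return 1 if a == b else 0


def similarity(p1, p2, weights):
    """Weighted similarity of the five residues following the initial methionine."""
    # Position 0 is the methionine and is skipped; positions 1..5 are compared.
    return sum(w * phi(x, y) for w, x, y in zip(weights, p1[1:6], p2[1:6]))


def is_homologous(p1, p2, weights, threshold=50):
    """Binary homology call: 1 if the similarity reaches the 50% threshold."""
    return 1 if similarity(p1, p2, weights) >= threshold else 0


if __name__ == "__main__":
    human, other = "MKTAYL", "MKTGFI"
    for k, w in WEIGHTS.items():
        print(k, similarity(human, other, w), is_homologous(human, other, w))
```

With these placeholder vectors, the same peptide pair can be called homologous under one weight setting and non-homologous under the other, which is why comparing results across both settings is informative.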
We replicated the comparisons in 10-fold cross-validation. In each fold, genes in the human non-specific TR group were randomly selected to match the number of genes in the human-specific TR group. This process was repeated for the two methods (two different weight vectors) and for each of the four categories of TRs. For each category and weighing method, the mean of the results across rounds was calculated as the final result. Finally, the Fisher exact test was run for each fold (Supplementary Table 5).
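The resampling and per-fold testing can be sketched in Python as follows. This is an illustration under stated assumptions: the input layout (a dictionary mapping each gene to its counts of homologous and non-homologous TISs), the function names, the fixed random seed, and the 2x2 table layout (HS-TR vs. NHS-TR rows, homologous vs. non-homologous columns) are choices made here, and the two-sided Fisher exact test is implemented directly from the hypergeometric distribution rather than taken from the authors' code.

```python
import random
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, c1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def run_folds(hs, nhs, folds=10, seed=0):
    """Return one p-value per fold.

    hs, nhs: dict gene -> (n_homologous_TIS, n_non_homologous_TIS) (hypothetical layout).
    """
    rng = random.Random(seed)
    hs_hom = sum(v[0] for v in hs.values())
    hs_non = sum(v[1] for v in hs.values())
    p_values = []
    for _ in range(folds):
        # Draw as many NHS-TR genes as there are HS-TR genes.
        picked = rng.sample(sorted(nhs), len(hs))
        nhs_hom = sum(nhs[g][0] for g in picked)
        nhs_non = sum(nhs[g][1] for g in picked)
        p_values.append(fisher_exact_two_sided(hs_hom, hs_non, nhs_hom, nhs_non))
    return p_values


if __name__ == "__main__":
    hs = {"g1": (1, 9), "g2": (0, 10)}
    nhs = {"n%d" % i: (9, 1) for i in range(20)}
    print(run_folds(hs, nhs))
```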

DATA AVAILABILITY
The datasets generated and analyzed during this study are available in the "figshare" repository, with the identifier "10.6084/m9.figshare.15405267". The source code and software are available in the GitHub repository (https://github.com/Yasilis/STRsMiner-JavaPackage_PaperSubmission/tree/develop).

Funding
Not applicable.

TABLE AND FIGURE LEGENDS
Fig. 4: Distribution of similarity between human proteins and those of three species (mouse, macaque, and chimpanzee) in the same gene. For each panel, the first row shows the distribution constructed by BLASTing human proteins produced by human-specific TR (HS-TR) genes. Similarly, the second row of each panel shows the distribution constructed by BLASTing human proteins produced by non-human-specific TR (NHS-TR) genes. The Needleman-Wunsch algorithm (upper panel) was used as a confirmatory measure to our two weighing methods (middle and lower panels). In each method, we detected a significant difference in the distribution of HS-TR genes vs. NHS-TR genes.

Fig. 6: Ratios of gene, transcript, and TR counts for each species. The horizontal axis shows the percentage of each entity, and the vertical axis shows each species. Species can be cross-referenced in Supplementary Table 3.