github.com/deprekate/prfect
All the code and data presented here are available in the GitHub repository. All of the code exclusive to the PRFect package was written in Python3 in order to be user-friendly and easily updateable for future improvements. PRFect is also available on the Python Package Index PyPI (pypi.org) as an easily installable command-line program that downloads and installs with a single command: pip install prfect. The PRFect package does require the third-party dependencies: scikit-learn and numpy (15, 16), as well as the additional packages, genbank, score_rbs, linearfold, and hotknots. The last two were adapted from their original C code libraries (17, 18) into Python packages that auto-install along with the other packages when the previous command is used to install PRFect.
Obtaining Data
To obtain ribosomal frameshift data, we downloaded 3,679 phage genomes in GenBank format from the Actinobacteriophage Database phagesdb.org (19). Genes exist as CDS features within the GenBank format (20) and no explicit designation indicates if or where ribosomal frameshifting occurs within a gene. However, when a CDS feature has discontinuous locations in the GenBank file, they are denoted by using the join keyword in the coordinates. Figure 2 shows a small example GenBank file with two genes. The first example gene occurs from nucleotide 1 to nucleotide 100, while the second example gene exists in two locations, which are also two different frames, through the use of the join keyword.
Since frameshifted genes have multiple locations, they can be found by looking for CDS Features that use the join keyword. However, in addition to ribosomal frameshifting, other reasons can cause a gene to exist in multiple locations, such as splicing, mobile elements that insert into genes, genes spanning break points in circular genomes, and even sequencing errors. To distinguish ribosomal frameshifted genes from other causes, we required only two sets of coordinates within 10 bp of each other, which was chosen through manual inspection of the genomes. Ideally, the two pairs of ribosomal frameshifted genes would be separated by either 0 (backward) or 1 (forward) nucleotides. However, the joined ribosomally frameshifted genes of the SEA-PHAGES data varied from 0 to 7 nucleotides apart, while the genes that were joined due to other causes (discussed below) varied from 50 to 818 nucleotides apart. Of the 3,679 genomes, 2,489 phages had one or more CDS features with the join keyword, giving a total of 2,557 joined CDS Features. Of these, 61 were at opposite ends of the genome, indicating genes split due to the circularization of the genome. There were 20 joined features with coordinates separated by more than 10 bp that were excluded from the training data: six in genes coding for major capsids, lysins, and DNA methylases (caused by inteins from homing endonucleases); twelve in genes coding for minor tail proteins (due to group 1 introns); one encoding a "structural" protein; and one encoding a tail assembly chaperone. The two pieces of the tail assembly chaperone were separated by 72 bp, which we predict was an annotation error, along with all the other joined genes with coordinates further than 10 bp apart. There were 2,476 frameshifted genes (one per genome) and 360,977 genes without frameshifts. The frameshifted genes were split into their two constituent fragments, giving sets of two consecutive "pseudo" gene pairs for positive cases. For example, the frameshifted chaperone gene in Fig. 2 was split into two CDS features with locations (200..300) and (301..400), and the join keyword was removed. Then all of the genes were evaluated pairwise to determine if an overlap could occur between consecutive genes, allowing for the possibility that the two genes are a single frameshifted gene. Figure 3 shows an example of such an overlap in a pair of consecutive genes (denoted in green).
In the example, geneA (frame 1) is 10 codons long and overlaps with 1 codon of geneB; since the first stop codon upstream (in the 5' direction) from geneB (frame 2) is six codons to the left, there is the possibility that a frameshift could occur within this overlap region (yellow in Fig. 3). If there were a frameshift that occurred, geneB would not be a complete gene but rather part of an alternate frameshifted translation of the fusion geneAB. During translation the ribosomes that are not frameshifted end up producing geneA within the cell while the ribosomes that are frameshifted end up producing geneAB. Most consecutive genes were either on opposite strands (thus, ribosomal frameshifting is untenable) or did not have the possibility of overlap, leaving only 67,664 gene pairs to search for negative case slippery sites.
Slippery Site Motifs
The only slippery sequence to appear in the literature is the previously mentioned XXXYYY motif (7). However this "threethree motif" (the number denotes the same nucleotide repeated three times and then another nucleotide repeated three times) is not present in most frameshift overlaps, so we explored recurring patterns that could similarly serve as motifs. Originally, we tried using various automated tools to detect motifs, such as the MEME Suite (21), but this proved troublesome due to the highly repetitive nature of our training data, so we were forced to manually inspect the overlap regions of the annotated frameshifts. Focusing only on those annotated frameshifts that lacked the threethree motif, we looked for novel nucleotide patterns that could function similarly to the threethree wobble base pairing dynamic. For backwards frameshifts, we found the eight different slippery sequence motifs: six, threethree, fivetwo, twofive, twofour, threetwotwo, five, and twoonefour. A description of these motifs is in SuppFig 1. For forward frameshifts we only looked for the two motifs four and three; however, we also required that the codon of the + 1 frame A-site relative abundance (A1) is greater than the codon of the + 0 frame (A0) A-site. Requiring the A1 codon relative abundance to be more favorable limits the number of candidate forward slippery sites found and speeds up runtimes because three (and four) bases in a row occur very often in a genome. Since some motifs are subsets of other motifs (i.e. twofive is also twofour, six is also threethree, and four is also three), the motif with a lower probability of occurring randomly takes precedence. Once all possible motifs were identified, we were left with a set of 106,692 different sites as potential slippery sequences, of which 3,711 were from (true) frameshifted genes, and 102,981 were from not thought to be frameshifted gene pairs (false). Since the frameshifts are not experimentally tested, we do not know the actual location of the slippery site, only the approximate location that the researcher guessed it to be. Therefore we cannot ascertain exactly which of the motifs in a true frameshift overlap region is the actual slippery sequence. To mitigate error induced by including incorrect, randomly occurring slippery sites into the dataset as true cases, any such motifs that occurred further than 10bp away from where the shift was annotated to occur were denoted as false cases. This left 2,718 positive cases (2,368 backward and 350 forward) and 103,989 negative cases.
Properties contributing to translation efficiency (and potential pausing)
For every slippery sequence motif, cellular properties relevant to the translation process were aggregated based on the motif occupying the E-site and P-site of the ribosome, with the A-site being empty and waiting for tRNA recruitment to occur (Table 1). The first property was the direction of the frameshift, forward or backward. The next two properties were the relative frequency of the waiting + 0 A-site codon, and the relative frequency of the − 1 or + 1 frameshifted A-site codon. The frequencies are found during the genome file reading step by iterating through all the annotated coding genes and counting the relative occurrence of the 64 different codons. This accounts for the idea that if the codon waiting in the + 0 A-site has few matching cognate tRNAs available in the cytoplasm while the tRNA for the ± 1 A-site codon is quite abundant, the A1 codon is slightly more favorable than the A0 codon, and so the occurrence of a frameshift is more favorable. The next two properties added were different methods for scoring the presence of a ribosomal binding site (RBS) upstream of the P-site. The first method is a reimplementation of the 28 variable bins utilized by Prodigal for gene calling, where each bin is an integer from 0 to 27 and corresponds to a given RBS motif and nucleotide spacer sequence (22). The other is a reimplementation of the method employed by the RAST website, which uses the observed frequencies of 191 different RBS motifs with 10 different nucleotide spacer sequence sizes (23). To estimate the 3' secondary structure, the minimum free energy (MFE) of the 50 bp and 100 bp windows were added using two different secondary structure prediction tools: LinearFold, which predicts simple hairpins, and HotKnots, which can predict pseudoknots. LinearFold produces MFE prediction scores identical to the widely used (but not available as an easily installable Python package) RNAFold program from the Vienna software package (24). For more complex secondary structure predictions that include pseudoknots, the HotKnots tool was run with the most recent DP09 parameter set. A parameter sweep of the data using pairs of window sizes from 30 bp to 120 bp, in 10 bp increments, and with offsets of 0 to 6, in 3 bp increments was attempted. The results were ambiguous, and no apparent accuracy peaks were observed, so we used the conventional 50 bp and 100 bp windows taken just after the three A-site bases of the slippery sequence (an offset of 3). The MFE scores were scaled by dividing by the window length, and since the MFE score is biased by the GC content of the window in question (25), we further normalized the MFE/bp by also dividing by the GC content. A visual representation of this transformation is shown in Supp Fig. 2.
The last property added to the model was the number of bases between the slippery sequence and the + 0 in-frame stop codon. This property helps distinguish between more probable motifs near the in-frame stop codon from those that occur randomly much further upstream of the in-frame stop codon. For instance, in the example in Fig. 3, slippery sites that occur towards the right side of the yellow box would be slightly more probable, or at least differentiable, than those occurring towards the left side of the box.
Table 1
The various cellular properties used to classify slippery sequences
property | description | range |
---|
DIR | The direction of the frameshift | -1, +1 |
RBS1 | The Prodigal ribosomal binding site score | 0–27 |
RBS2 | The RAST ribosomal binding site score | 0.0–6.3 |
MOTIF | The slippery sequence motif | 0–9 |
A0 | The frequency of the A-site codon usage in all genes | 0.0–1.0 |
A1 | The frequency of the + 1 A-site codon usage in all genes | 0.0–1.0 |
LF50 | The normalized LinearFold minimum free energy calculation of the 50bp window | 0.0–1.0 |
LF100 | The normalized LinearFold minimum free energy calculation of the 100bp window | 0.0–1.0 |
HK50 | The normalized HotKnots minimum free energy calculation of the 50bp window | 0.0–1.0 |
HK100 | The normalized HotKnots minimum free energy calculation of the 100bp window | 0.0–1.0 |
N | The distance of the slippery sequence from the in-frame (+ 0) stop codon | 0 - ∞ |
These eleven (DIR, RBS1, RBS2, MOTIF, A0, A1, LF50, LF100, HK50, HK100, N) properties were then used to train a histogram-based gradient boosting classification tree to predict the direction of the frameshift: 0 for no frameshift, -1 for backwards, and + 1 for forward.
Since only − 1 and + 1 ribosomal frameshifts appear in the SEA-PHAGES data, and for reasons discussed further below, other frameshifts size categories (such as -2 and + 2) were omitted. The HistGradientBoostingClassifier module from the Python Scikit-Learn package was used with default parameters except for the L2 regularization parameter, which was set to 1.0 to help prevent overtraining on the data. In addition, the early_stopping was turned off so that the training/validation/testing results would be deterministic rather than stochastic. Since multiple potential slippery sites can occur within the overlap region of a single pair of consecutive genes, only the highest-scoring slippery site (if any) is returned as the predicted PRF site by PRFect.
Training and Validation Data
The SEA-PHAGES genomes are highly repetitive; many of the phages have near identical, closely related, taxa in the database, and some phages are exact duplications of other genomes. The presence of multiple copies of the same genome makes splitting the data into training and validation more complex; therefore, four different leave-one-out levels were used: CLUSTER, SUBCLUSTER, MASH95, and GENOME. CLUSTER and SUBCLUSTER are the two taxonomic levels that phages are assigned to during the SEA-PHAGES workflow (26). For example, in a CLUSTER split, all of the phage genomes of one CLUSTER are removed, a HistGradBoost model is trained on the remaining genomes, and then predictions are made for the genomes of the omitted CLUSTER. As there are 89 different CLUSTERS, 89 different validation models were built, and the resulting predictions were merged. Likewise, there are 102 SUBCLUSTERS, hence 102 models were built, and the predictions at that level were merged. Since there are so few CLUSTERS and SUBCLUSTERS represented in the data, and the lowest taxonomic level GENOME is dubious due to the training set contamination discussed above, an intermediary taxonomic level, MASH95, was calculated for all of the genomes. This taxonomic level was assigned by using the genome distance estimation of MASH (27) to cluster the genomes at 95% identity, which is analogous to a MinHash distance of 0.05. The parameters used were the same as recommended in the publication: a sketch size of 400 and k of 16.
Testing Data
To assess the performance of PRFect on unrelated genomes and non-chaperone genes, a general PRF model was first trained on all of the SEA-PHAGES genomes that contained a joined tail assembly chaperone (TAC) gene, and then five different sources of frameshift data were tested. The first is the RECODE database (28); because it is currently unavailable, an archive.org snapshot of the 2010 downloadable files was used. The data includes many types of coding anomalies, spans all domains of life, and covers all gene functions. Each entry in the database corresponds to a single gene, for which the nucleotide sequence and site of the frameshift is given. From the SQL file, 725 entries labeled as "ribosomal_frameshift" were extracted and converted to GenBank files, each of which contained only a single gene. The second source of frameshift data was the set of 28 phage genomes listed with a potential frameshift site in the region analogous to the tail assembly chaperone gene of phage lambda (29). We downloaded the accessions that were listed in the supplementary documents from GenBank, and for those genomes that were missing the specified frameshift (as a joined CDS feature), we manually added the gene to the file at the specified slippery sequence location. The third source of frameshift data was the FSDB (Frameshift Database) which contains 253 frameshifts in a graphical website (30). The website was parsed to get the GenBank accession number of each frameshift, the slippery sequence, and the location of the slippery sequence, which was then used to retrieve the files from GenBank. In contrast to the other test datasets, due to the size and number of genomes in the FSDB an automated python script was used to add a joined feature to the GenBank file at the location specified in the frameshift data. Additionally, any pre-existing joined features were left in place. The fourth source of frameshift data was 106 virus genomes with known or predicted occurrences of ribosomal frameshifting in genes of different functions (31). The provided accession numbers were used to download the genome files from GenBank, and as before, frameshifted genes were added to those files lacking the specified feature. The fifth and last source of frameshift data was the quite topical single genome for the coronavirus Covid19. The GenBank file for SARS-Cov2 (accession NC_045512) contains 12 genes, one of which is frameshifted (32).