3.1 Identification of tRF targets from CLASH data
Based on CLASH data from HEK293 cells and CLEAR-CLIP data from Huh-7.5 cells, we obtained 547 tRF-mRNA pairs (532 from CLASH and 15 from CLEAR-CLIP) involving 28 tRFs (20 tRF-3s and eight tRF-5s) in CLASH and 15 tRFs (six tRF-3s and nine tRF-5s) in CLEAR-CLIP. A total of 115,418 seed matches paired with the tRFs were used as the background.
3.2 Characterizing target recognition features
The P-values of features considered in this study that may be related to tRF-mRNA interactions are listed in Table 1. We observed significant differences in most sequence features of individual seed matches, transcripts, or tRFs and the features of the tRF-mRNA duplex. The features most significantly different between the positive group and background are shown in Figure 2.
Table 1. The P-value of features considered in this study that can contribute to tRF-mRNA interaction
Category
|
Feature
|
Including feature
|
P value
|
Sequence feature of target sites
|
type of seed match
|
position 8 match
|
2.87E-28
|
position 1 match
|
7.28E-13
|
position 1 A
|
0.34
|
bases identity immediately flanking seed match (2nt)
|
base identity of position 2 upstream
|
0.80
|
base identity of position 1 upstream
|
2.88E-03
|
base identity of position 1 downstream
|
7.67E-07
|
base identity of position 2 downstream
|
0.43
|
bases component in the vicinity of seed match
|
GC percentage 35nt upstream
|
5.89E-13
|
GC percentage 25nt upstream(excluding 10 nts immediately upstream)
|
4.28E-14
|
GC percentage 15nt downstream
|
2.18E-11
|
score 35nt upstream
|
7.51E-03
|
score 25nt upstream(excluding 10 nts immediately upstream)
|
2.61E-16
|
score downstream
|
7.49E-12
|
base component 35nt upstream
|
See Supplementary Table
|
base component 15nt downstream
|
See Supplementary Table
|
dinucleotide component 35nt upstream
|
See Supplementary Table
|
dinucleotide component 15nt downstream
|
See Supplementary Table
|
distance to the end of tRF
|
distance to the 5’ end
|
2.37E-08
|
distance to the 3’ end
|
1.90E-03
|
distance to the nearest end
|
2.12E-03
|
Sequence feature of transcripts
|
3’-UTR properties
|
length of 3'-UTR
|
4.80E-08
|
GC percentage of 3'-UTR
|
5.40E-06
|
frequency of seed matches in 3’-UTR
|
0.08
|
base component(AGCT) of 3'-UTR
|
See Supplementary Table
|
dinucleotide component of 3'-UTR
|
See Supplementary Table
|
CDS properties
|
length of CDS
|
1.95E-04
|
GC percentage of CDS
|
2.33E-03
|
frequency of seed matches in CDS
|
6.51E-06
|
5’-UTR properties
|
length of 5'-UTR
|
0.40
|
GC percentage of 5'-UTR
|
0.08
|
frequency of seed matches in 5'-UTR
|
0.02
|
Sequence feature of tRFs
|
tRF sequence properties
|
length of tRF
|
2.77E-21
|
GC percentage of tRF
|
2.46E-11
|
TA
|
target abundance in genome
|
7.19E-03
|
target abundance in 3'-UTR
|
0.09
|
SPS
|
GC percentage of seed
|
8.69E-18
|
Stability and thermodynamics of tRF-mRNA interaction
|
secondary structure of target mRNA
|
nucleotides exposed at seed match
|
7.49E-03
|
nucleotides exposed surrounding seed match
|
0.99
|
energy to free base-pairing interactions of target site
|
3.51E-06
|
secondary structure of tRF-mRNA
|
number of bases paired in duplex
|
1.53E-20
|
MFE
|
1.36E-57
|
Sequence features of target sites
Seed match types and surrounding base properties determined the potential of a tRF target site. Position 8 matches (P=2.87E-28) and position 1 matches (P=7.28E-13) were more common in the positive group than in the background, while no significant difference was observed at position 1 A (P=0.34). The number of different kinds of seed types is provided in Supplementary Table 2. Consistent with the finding of PAR-CLIP (14, 15), 7-mer-m8 sites (binding to positions 2–8 in the tRFs) were most enriched in the positive group. The bases flanking the functional sites were more likely to be A or U compared with the bases in the background. The differences in nucleotide identity are listed in Supplementary Table 3 and Supplementary Table 4. The GC percentages were significantly lower for bases upstream (P=5.89E-13) and downstream (P=2.18E-11). When using the formula to incorporate distances to seed matches with different weights, the scores in the positive group were significantly higher for all three kinds of surrounding regions, including upstream (P=7.51E-03), upstream excluding 10 nts (P=2.61E-16), and downstream (P=7.49E-12).
The locations of seed matches in the 3'-UTRs were related to the tRF binding activity of the UTRs. The cumulative distribution curves of the distance to the 3'-UTR ends are shown in Supplementary Figure 2. We observed that effective sites avoided appearing immediately downstream of the stop codon, while tended to reside adjacent to the ends within the rest of the 3'-UTR in the positive group (P=2.12E-03), especially the 5' end (P=2.37E-08).
Sequence features of the transcripts
When focusing on the whole-sequence features to assess their impacts on the efficacies of sites, we found that the positive group had a significantly lower global GC content than the background in the 3'-UTR (P=5.40E-06), but this trend was not as significant as that detected in local comparisons near seed matches. In agreement with the finding that more sites were preferentially adjacent to the ends, the target sites were selectively located in the shorter 3'-UTRs (P=4.80E-08).
Sequence features of tRFs
We investigated whether some tRF sequences might be intrinsically more capable of targeting than others. In tRFdb, tRF-3 and tRF-5 sequences could be categorized into tRF-3a (18 nts), tRF-3b (22 nts), tRF-5a (15 nts), tRF-5b (22 nts), and tRF-5c (31 nts) sequences according to their lengths. Compared to those in the background, we observed significant differences in the lengths of tRFs (lengths in tRFdb) (P=2.77E-21), consistent with the finding that repression was mediated by tRF-3as instead of tRF-3bs derived from the 3' end of the same tRNA (11). tRFs in the positive group had a higher GC percentage for both whole sequences (P=2.46E-11) and seed sites (P=8.69E-18), which could contribute to pairing stability, especially for seed regions (34). In addition, we detected significant differences in target site abundance (TA) in the genome (P=7.19E-03), consistent with the finding that extensive pairing could decrease the function of sRNA pairing to their authentic target sites (35-37).
Structure and thermodynamic properties of the tRF-mRNA duplex
In addition to the sequence features of tRFs and their targets, the secondary structures of target mRNAs likely contribute to target recognition. We showed that nucleotides in the positive group were more exposed than those in the background at the seed matches (P=7.49E-03), while less energy was needed to free base-pairing interactions within the secondary structures of target mRNAs (P=3.51E-06). These features indicated that effective target sites were more accessible for tRF binding.
Thermodynamic features of the tRF-mRNA duplex were then analyzed. Compared with that in the background, the tRF seed-target binding in the positive group was usually more stable, as revealed by a lower MFE (P=1.36E-57), which represented the most stable structure of the helix. Moreover, additional pairing beyond seed matches contributed to target recognition in the positive group (P=1.53E-20), consistent with the experimental discovery of Kuscu et al. (11). These features were considered in our models.
3.3 The target prediction model established by SVM and GA
A total of 547 positive pairs (including 489 tRF-3 pairs and 58 tRF-5 pairs) and 2,000 negative pairs (including 1,596 tRF-3 pairs and 404 tRF-5 pairs) were retained according to our filtering criteria. The features with significant differences between the positive group and negative group are listed in Supplementary Table 5, with no significant differences observed for the vast majority of features between CLASH and CLEAR-CLIP. We considered the 96 features listed in Table 1 as potentially informative in relation to tRF targeting, while 51 features were identified by the GA after parameter optimization (Figure 3). These features were modeled in an SVM framework to determine their individual contributions for model implementation. The result of each fold was stable, with an area under the ROC curve (AUC) = 0.980 (from 0.977 to 0.983) during the training process and an AUC = 0.847 (from 0.83 to 0.861) during the validation process (Supplementary Table 6). Fourteen of 15 pairs detected by CLEAR-CLIP and 455 of 532 detected by CLASH were predicted, which indicated the efficiency of our model in different cell lines.
3.4 Predicting the targets with the probabilistic model
Functional tRF-target interactions, which account for a small proportion of seed pairings, derive directly from coevolution of the tRF and its target. Potential functional 3'-UTR targets should contain more complementary sites overrepresented relative to a random background, which was measured by Ps for each tRF-3'-UTR pair. Ps was significantly lower in the positive group than in the background (P=2.94E-10). We generated the final predictions by ranking adjusted Ps, showing that 26,380 tRF-3 targets and 8,670 tRF-5 targets were overrepresented. This method could be used to independently predict the targets of tRFs by simulating miRNA target prediction tools such as PicTar and PACMIT (38, 39).
3.5 Predicting the targets with conserved seed match properties
Biologically functional target sites tend to be located in conserved tRF-pairing motifs within 3'-UTRs (18). We found that target sites were significantly more conserved than the background (P=3.68E-08), as determined by phastCons scores from comparative sequence analyses of the human genome (hg19) with the genomes of 99 other vertebrate species. Similar levels of performance were observed immediately upstream (P=2.31E-10), immediately downstream (P=8.29E-10), and in the whole transcript (P=3.66E-13). We applied this measure of performance as an independent factor to predict the targets of tRFs (25, 30, 40). A total of 111,874 tRF-3 target sites and 152,464 tRF-5 target sites were predicted with 0.5 as a cut-off. We performed gene ontology enrichment analysis on the target genes predicted by the conservation analysis. The results were listed in Supplementary Table 7.
3.6 Algorithm evaluation of tRF target prediction models
The prediction abilities of tRFTars and common miRNA target prediction programs were assessed by comprehensively comparing their identified pairs with the pairs identified by CLASH/CLEAR-CLIP or reporter assays. The performance of our model was evaluated with ROC curve analysis, yielding an AUC = 0.980 in the training process and an AUC = 0.847 in the validation process (Figure 4A), better than commonly used miRNA target prediction models (intersection of TargetScan and miRanda) (AUC = 0.743, P < 0.0001). Moreover, five of seven positive pairs and two of two negative pairs were predicted by the SVM-GA model (Table 2) (11, 16, 41), while only three of seven positive pairs were predicted by miRNA models. The sensitivity, specificity and MCC are listed in Supplementary Table 8. Both lines of evidence suggested that the SVM-GA model was the most effective tool for distinguishing the targets of tRFs with a relatively high accuracy. In addition, we searched the pairs confirmed by reporter assay with a clear tRF sequence beyond those in tRFdb, and KLF12 was predicted to be the target of tRFGluTTC (positive probability=0.76) (42), which further proved the effectiveness of our model.
Table 2. Validation of the models with tRF-mRNA pairs reported by reporter assay
tRF
|
gene
|
Whether target of tRF
|
References
|
Prediction of SVM-GA model
|
Prediction of
conservative model
|
Prediction
of
probabilistic model
|
Prediction
of
miRNA model
|
3027b
|
RPA1
|
Yes
|
Maute et al
|
0.58(Yes)
|
0(No)
|
0.06(No)
|
Yes
|
3009a
|
DGCR2
|
Yes
|
Kuscu et al
|
0.26(No)
|
0(No)
|
0.72(No)
|
No
|
3009a
|
SLC6A9
|
Yes
|
Kuscu et al
|
0.89(Yes)
|
0(No)
|
0.30(No)
|
No
|
3009a
|
SMAD1
|
Yes
|
Kuscu et al
|
0.73(Yes)
|
0.62(Yes)
|
0.03(No)
|
No
|
3009a
|
TBL1X
|
Yes
|
Kuscu et al
|
0.10(No)
|
0(No)
|
0.32(No)
|
No
|
3009a
|
FER
|
Yes
|
Kuscu et al
|
0.86(Yes)
|
0(No)
|
0.05(No)
|
Yes
|
5030c
|
LRP8
|
Yes
|
Deng et al
|
0.89(Yes)
|
0.47(No)
|
0.17(No)
|
Yes
|
3027b
|
STAG2
|
No
|
Maute et al
|
0.22(No)
|
1(Yes)
|
0.01(No)
|
No
|
3027b
|
NSD3
|
No
|
Maute et al
|
0.21(No)
|
1(Yes)
|
0.02(No)
|
No
|
Moreover, we investigated the relationships between different tRF target prediction models. We discovered that most of the features for effective target site prediction were correlated with the probabilistic model or conservative model (Figure 4B). The intersections of three algorithms and miRNA target prediction models (tRFs found in CLASH) are displayed in Figure 4C, and the intersections are listed in Supplementary Table 9. The proportion of targets predicted by each model is presented in Figure 4D.
3.7 Validation of prediction results with expression correlations
We applied the filtering criteria and selected 1226 tRF-mRNA pairs with negative expression correlations in TCGA database (33). The pairs predicted by the SVM-GA model were more likely to have negative expression correlations (P=8.97E-04), and the model performed better than TargetScan (P=0.79) (Supplementary Table 10). We have listed the number of pairs in different kinds of tumors in Supplementary Table 11. Moreover, by analyzing tsRNA sequencing (tsRNA-Seq) and lncRNA + mRNA microarray data from our institution, 34 tRFs with adequate variability and their expression-related transcripts were chosen. Compared with those of TargetScan and miRanda, the predictions of the SVM-GA model were more enriched in pairs with negative correlations (SVM-GA model: P=3.09E-03; TargetScan and miRanda: P=0.09) (Supplementary Table 12). Additionally, we observed significant consistency between the correlation coefficients of the pairs and the likelihood of a positive result from the SVM-GA model (for tRF-5s: P=1.87E-10; for tRF-3s: P=2.77E-05). This finding indicates that our predicted tRF targets with higher scores tend to be more downregulated. We used Multiple Linear Regression to exclude the influence of miRNA in the sequencing and microarray data from our institution. We chose the pairs with mRNAs regulated by multiple highly expressed miRNAs, and 96 of 397 tRF-mRNA pairs still had significant negative correlation, listed at Supplementary Table 13. It showed that nearly 1/4 tRF-mRNA pairs played a most vital role in gene expression regulation. We showed the correlation heatmap of 20 tRFs/miRNAs and mRNAs with the highest frequency among the target gene predictions (Supplementary Figure 3).
3.8. Validation the predictions by qRT-PCR
We selected potential target genes of tRF-3001a (ELAVL1, SOCS7, ATF6B, RINL, PRR11, and ZNF268), potential target genes of tRF-3003a (CBX5, EIF4E, PRKAA1, TFDP2, SH3TC2, and PDE12) and potential target genes of tRF-3009a (ATF6B, ARF3, CDS2, CLN8, MAP2K7, and SNX12) from the top of the predictions for further confirmation using qRT-PCR (Supplementary methods). We evaluated potential target gene expression levels after transfection with tRF mimic or the mutated tRF mimic in MGC-803 cells. The expression of 13 target genes was markedly reduced at the mRNA levels by transfecting tRF-mimic compared to the corresponding NC group in MGC-803 cells (Supplementary Table 14) (Supplementary Figure 4).
3.9 Website
The predicted targets are available online from tRFTars (http://trftars.cmuzhenninglab.org:3838/tar/) (mirror site at http://trftar.cmuzhenninglab.org:3838/tar/) (Figure 5). Strict matches to the official gene symbols, RefSeq IDs, tRFs in tRFdb or anticodon/amino acids of the source tRNA are necessary as input. Users can choose to search the specified target sites or transcripts according to their needs. Candidates can be ranked according to the likelihood of a positive result assigned by the SVM-GA model, the conservation score or Ps by the probabilistic model, although we recommend adopting the SVM-GA model. Users can choose to comprehensively view the information for all seed pairings. Details of the prediction results can be found on the statistics page of the website.