smORF RNA quantification based on microarray
Microarray is one of the most common transcriptome quantification methods especially before the invention of RNAseq. Although RNAseq is more sensitive than microarray and have less noises[20], microarray requires fewer computational resources and has generally well similarity with RNAseq[21]. Studies using microarrays, such as the IMI MARCAR Project[22] and Microarray Innovations in Leukemia (MILE)[23], have made great contributions to medical researches. We collected 617462 unique smORFs from SmProt[8], sORFs.org[7] and the study by Thomas et al.[9]. Using probe reannotation, we remaped the probes of microarrays to smORFs and estimated smORF RNA expressions (Figure 1a, Method).
Then we tested the accuracy of this quantification. By comparing smORFs and known RNAs (Ensembl v75) using the samples that underwent both RNAseq and microarray, the correlations between the samples decreased in smORFs, but the correlations between the RNAs increased (Figure 1b). For example, KRASIM is a 99-aa microprotein expressed in hepatocellular carcinoma cells, whose overexpression reduces the level of KRAS.[14] In three datasets from Gene Expression Omnibus (GEO), KRASIM expression estimated by our method were significantly negatively correlated with expression of KRAS (Figure 1c), which does not match the same probe as KRASIM, suggests that our method could effectively evaluate the expression of smORFs.
Prediction of microprotein function based on expression similarity
Because of the large abundance of smORFs, it is difficult to construct a co-expression network like previous studies. Calculating correlations between smORFs and genes requires billions times of calculations, which is time-consuming and difficult to store and search. Inspired by the nearest neighbor algorithm, we built a BallTree for each dataset to find the nearest neighborhoods (genes) of smORFs. The estimated expressions of genes and smORFs in each dataset are converted to their rank orders by row (gene/smORF). We used Pearson correlation distance metric to measure the distances between nodes, which is equivalent to Spearman correlation since the expressions were converted to ranks in advance, but the time efficiency is greatly improved. By using the pre-ranking strategy and BallTree algorithm, the time consumption of searching correlated genes changed to 6% of that of no pre-ranking and brute force searching (Table 1).
Time / seconds
|
Brute force
No pre-ranked
|
Brute force
Pre-ranked
|
BallTree
Pre-ranked
|
Pre-rank
|
-
|
0.344 (±0.00299)
|
0.348 (±0.00805)
|
Build model
|
-
|
-
|
21.7 (±0.166)
|
Search
|
17.9 (±0.157)
|
1.713 (±0.0445)
|
1.08 (±0.471)
|
Table 1. The time consumption of searching correlated genes using different methods. The algorithms were run on Intel(R) Core(TM) i7-7700HQ CPU with 24 GB RAM.
Using this speed-optimized correlation algorism, we calculated the Spearman correlation between smORFs and other known genes. Furthermore, the functions of smORFs/microproteins can be predicted using correlated genes through pathway enrichment analysis. Considering that biomolecules have different functions in different tissues and diseases, we collected microarray data from 48 tissues/cells and 82 diseases (and normal) involving 173 data sets and built prediction models respectively. Moreover, by aggregating the predictions of multiple models, we could get more reliable results. After applying our method to several microproteins that have been studied, we found that our method could successfully predict the functions of these microproteins.
For instance, phosphatidylinositol glycan anchor biosynthesis class B opposite strand 1 (PIGBOS), a 54-aa microprotein, as well as mitochondrial elongation factor 1 microprotein (MIEF1-MP), a 70-aa microprotein, were both located in mitochondrion[24, 25]. By merging the results of multiple datasets of normal tissues, our method successfully predicted their subcellular location in mitochondrion (Figure 2a, b).
Additionally, micropeptide regulator of b-oxidation (MOXI), a 56-aa microprotein encoded by muscle-enriched long non-coding RNAs (lncRNA) LINC00116, was found to be located in mitochondrion and enhance fatty acid β-oxidation[15, 16]. By applying our method to several expression datasets of skeletal muscle tissues, we successfully predicted not only its cellular localization, but also the enrichment of cellular respiratory pathways such as oxidative phosphorylation (Figure 2c).
Moreover, non-annotated P- body dissociating polypeptide (NoBody), translated from LOC550643, was previously found to interact with the mRNA decapping complex, which involves in RNA degradation and mediates nonsense mediated decay (NMD)[26]. Using our method in a variety of normal tissue datasets, the functions of NoBody in RNA metabolism and NMD were successfully predicted (Figure 2d).
Lastly, mitochondrial micropeptide-47 (Mm47) is a 47-aa mitochondrial microprotein impacts the activation of the Nlrp3 inflammasome[17]. Although this microprotein is not annotated in the three studies we collected, the result of basic local alignment search tool (BLAST)[27] shows its high similarity to a 21-aa microprotein located at chromosome 7 (+):135358848-135358913 (GRCh37) (Figure S1a). It is reasonable to consider that they have similar functions. Prediction of the function of this 21-aa microprotein in normal tissues shows that it was located in mitochondrion, which is the same as Mm47 (Figure S1b).
Further validation of prediction process
To further observe the validity of our approach, we collected 270 microproteins from the Universal Protein Resource (UniProt)[28], as well as corresponding GO functional annotations. Using the Genotype-Tissue Expression (GTEx) microarray data set (GSE45878), we predicted the functions of these microproteins. The results showed that at least one function of 202 microproteins (74.8%) could be successfully predicted (Figure 3a). Moreover, we downloaded the human protein interactions from the STRING[29] database. Only interactions involving the microproteins we collected were retained. Using the estimated microprotein RNA expression from the GTEx microarray dataset, we calculated the expression correlation between microprotein RNA and known genes. We found that the correlation coefficients (Rho2 of Spearman’s test) of the microprotein-protein pairs with the interactions were significantly higher than those of the pairs without interaction records (Figure 3b). These results further demonstrate the accuracy of our method for the quantitative measurement and functional prediction of smORFs.
The cellular component overview of microprotein
Using our method, we explored the cellular components of microproteins. First, we randomly selected 10,000 microproteins. Then we selected up to 1,000 positively related known genes for each microprotein using the GTEx microarray dataset. The cellular components of these microproteins was predicted by enrichment analysis. The results showed that 52.04% of the microproteins were predicted to be associated with the catalytic complex (FDR<0.2, Figure 3c). The first ranking of the catalytic complex did not change when a stricter FDR (FDR<0.05) was used. Followed by transferase complex and ribonucleoprotein complex, with 44.95% and 44.90%, respectively. The possible reason is that the relatively large size of these gene sets (1355, 778, and 680) makes it more possibly to have significant results. On the other hand it also means that unknown proteins are more likely to belong to these components, providing a potential direction for future research.
A web tool for microprotein function prediction
By the advantage of the speed-optimized correlation algorism, it is possible to perform prediction while requesting. We developed our method into a web tool, smORFunction (http://www.cuilab.cn/smorfunction), which contains 617,462 unique smORFs annotated by SmProt, sORFs.org and the study of Thomas et al. smORFs can be searched by sequence using exact mode or BLAST, or by coordinate in reference genome (GRCh37 or GRCh38). For 526,443 smORFs that can be mapped to at least one probe of one microarray platform, we provide functional predictions in at most 48 tissues/cells, 82 diseases (and normal), including GO terms, KEGG pathways, and REACTOM pathways (Figure 4). This tool will provide inspirations for the research on the functions of smORFs and microproteins.