OpenVar adopts the widely used SnpEff code11 and the deep genome annotation OpenProt3 to annotate a variant call format (VCF) file. To handle multiple ORFs on a single transcript, OpenVar combines the transcript and protein accessions into a unique identifier. This strategy allows the visualization, for a single variant, of all consequences across every transcript and every ORF it affects (Fig. 1A). Alongside the common annotated VCF format, OpenVar outputs a table listing all effects for each variant submitted in the initial VCF. This table is named “annOnePerLine.tsv” as it contains one line for each effect of each variant. It lists the chromosome and position of the variant, the reference and alternative alleles, the description of the effect of the variant, the impact category, the transcript and protein accessions, and the gene name. To ease interpretation of this more complex annotation, OpenVar also outputs a table, named “max_impact.tsv”, listing the most affected canonical and non-canonical ORFs for each variant. The prioritization of effects is based on the predicted impact of the variant on the ORF of interest. Impacts are categorized as “modifier”, “low”, “moderate” and “high”. This classification is identical to that used in the SnpEff algorithm11. For a non-annotated ORF to be reported in the maximal impact table, the impact of a given variant on it must be greater than or equal to its impact on a canonical ORF. OpenVar always reports the effect on canonical ORFs in each of its outputs. Thus, the effect of a given variant on a canonical ORF is always visible, but users also see the effect on any alternative ORF (Fig. 1A).
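The prioritization rule described above can be illustrated with a minimal sketch. This is not OpenVar's implementation: the `Effect` record, the `max_impact` function and the transcript accession are hypothetical names introduced for illustration, and the ranking simply assumes the order “modifier” < “low” < “moderate” < “high”.

```python
from typing import Iterable, NamedTuple, Optional, Tuple

# Assumed ordering of the four impact categories, lowest to highest.
IMPACT_RANK = {"modifier": 0, "low": 1, "moderate": 2, "high": 3}


class Effect(NamedTuple):
    """One effect of one variant (hypothetical record, one table row)."""
    transcript: str
    protein: str
    impact: str
    canonical: bool

    @property
    def identifier(self) -> str:
        # Transcript and protein accessions combined into a unique key.
        return f"{self.transcript}^{self.protein}"


def _top(effects: Iterable[Effect]) -> Optional[Effect]:
    """Return the highest-impact effect, or None if there are none."""
    return max(effects, key=lambda e: IMPACT_RANK[e.impact.lower()], default=None)


def max_impact(effects: Iterable[Effect]) -> Tuple[Optional[Effect], Optional[Effect]]:
    """Return (canonical, alternative) maximal effects for a single variant.

    The canonical effect is always kept. The alternative effect is kept
    only if its impact is greater than or equal to the canonical one.
    """
    effects = list(effects)
    canonical = _top(e for e in effects if e.canonical)
    alternative = _top(e for e in effects if not e.canonical)
    if canonical is not None and alternative is not None:
        rank = lambda e: IMPACT_RANK[e.impact.lower()]
        if rank(alternative) < rank(canonical):
            alternative = None
    return canonical, alternative


# Example: one variant with a "modifier" effect on the canonical ORF and a
# "moderate" effect on an alternative ORF (transcript accession is made up).
example = [
    Effect("TX_hypothetical", "Q9UBP5", "modifier", canonical=True),
    Effect("TX_hypothetical", "IP_145210", "moderate", canonical=False),
]
canon, alt = max_impact(example)
print(canon.identifier, canon.impact)
print(alt.identifier if alt else None, alt.impact if alt else None)
```

In this sketch the canonical effect is never dropped, which mirrors the statement that the effect on canonical ORFs is reported in every output.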
By offering deep annotation of genomic variants, OpenVar can help make meaningful discoveries. Here, we have used it to explore a specific class of variants in cancers. We took advantage of the recently published SynMic database12. The database lists variants leading to synonymous mutations in canonical proteins across 18,028 samples from 88 different tumor entities. Interestingly, synonymous mutations are the second most frequent in cancer samples behind missense mutations12. The mechanisms behind their pathological impact remain unknown for most synonymous mutations. Current theories revolve around splicing junctions and transcript structure and/or stability10,13. However, some of these variants may also have a greater impact on an overlapping ORF as recently demonstrated7.
To illustrate the potential of OpenVar, we used it to analyse all the SynMicDB variants and highlighted an overlapping alternative ORF in the HEY2 gene. In total, 26 mutations were reported within HEY2, all falling within the union of the canonical ORF (Q9UBP5) and an alternative ORF (IP_145210) detected with a unique peptide by mass spectrometry on the OpenProt resource3. Of these, 19 mutations (73.1%) fell exclusively in the overlapping region, whereas the expected value was 10.07±2.48 (38.7%), yielding a z-score of 3.60 (Fig. 1B). Additionally, we retrieved mutations within HEY2 listed in the COSMIC catalog. In total, 243 mutations were reported within HEY2, with 157 falling within the union of the canonical and the alternative ORFs. Of these, 83 mutations (52.9%) fell exclusively in the overlapping region, whereas the expected value was 60.81±6.10 (38.7%), yielding a z-score of 3.64 (Fig. 1B). Furthermore, the COSMIC dataset contained 55 variants that are synonymous for Q9UBP5. Of these, 32 variants (58.2%) clustered on the IP_145210 ORF, whereas the expected value was 21.3±3.61 (38.7%), yielding a z-score of 2.96. Hence, both datasets present a significant enrichment of genetic variants at the locus of the alternative protein IP_145210 within HEY2.
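The enrichment figures above can be approximately reproduced with a short calculation. The sketch below assumes, as the text does not state the model explicitly, that each mutation falls in the overlapping region independently with probability 0.387, so that the expected count and its standard deviation follow a binomial distribution. The function name `binomial_enrichment` is introduced here for illustration only.

```python
import math
from typing import Tuple


def binomial_enrichment(observed: int, total: int, p_region: float) -> Tuple[float, float, float]:
    """Return (expected count, standard deviation, z-score).

    Assumes a binomial model: each of `total` mutations lands in the
    region of interest independently with probability `p_region`.
    """
    expected = total * p_region
    sd = math.sqrt(total * p_region * (1.0 - p_region))
    z = (observed - expected) / sd
    return expected, sd, z


# Fraction of the union of both ORFs covered by the overlap (38.7% in the text).
P_OVERLAP = 0.387

# (label, mutations in the overlapping region, mutations in the union of ORFs)
datasets = [
    ("SynMicDB HEY2", 19, 26),
    ("COSMIC HEY2", 83, 157),
    ("COSMIC HEY2, synonymous for Q9UBP5", 32, 55),
]

for label, observed, total in datasets:
    expected, sd, z = binomial_enrichment(observed, total, P_OVERLAP)
    print(f"{label}: expected {expected:.2f} +/- {sd:.2f}, z = {z:.2f}")
```

Because the percentage is rounded to 38.7% in the text, the values computed this way may differ from the reported ones in the second decimal place.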
To highlight the impact of including alternative ORFs in analyses of genomic variants, we compared the annotation of the SynMicDB and the COSMIC HEY2 datasets when analysed with OpenVar or the most common annotators: the Ensembl Variant Effect Predictor (VEP)14, Annovar15 and SnpEff11 (Fig. 2A-B). Although small differences were observed between VEP, Annovar and SnpEff, none predicted as many high impact variants as OpenVar. As the SynMicDB dataset is a database of synonymous mutations, most variants are classified as low impact with VEP, Annovar or SnpEff (Fig. 2A). By simply considering non-canonical ORFs overlapping the annotated ORF, OpenVar reclassifies many low impact variants as high impact, yielding a 33.6-fold, 13.8-fold and 8.3-fold increase in high impact variants over Annovar, SnpEff and VEP, respectively. Similarly, when considering a more heterogeneous set of variants with the COSMIC HEY2 dataset, OpenVar offers a 2-fold increase over Annovar and a 1.6-fold increase over SnpEff and VEP in high impact variants (Fig. 2B). With both datasets, the increase in moderate and high impact variants observed with OpenVar comes from the reclassification of modifier and low impact variants, as shown by the relative decrease in the latter with OpenVar (Fig. 2A-B). For example, in the HEY2 COSMIC dataset, the variant 6:g.125,759,806 T>G is located 4 nucleotides after the stop codon of the canonical ORF (Q9UBP5 with genomic coordinates 6:125,749,777-125,759,802) and is thus classified as a “modifier” impact by VEP, Annovar and SnpEff. However, it leads to a missense mutation (p.Phe137Leu) in the alternative ORF (IP_145210 with genomic coordinates 6:125,759,396-125,759,827) and is thus classified by OpenVar as a “moderate” impact on IP_145210 and a “modifier” impact on Q9UBP5. Since the predicted impact is higher on the alternative ORF, this variant is counted as “moderate” impact in the general statistics presented in Fig. 2B.
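The positional reasoning in the example above can be checked with a small sketch using only the genomic coordinates quoted in the text. It is a coarse illustration, not an annotator: the genomic span of the canonical ORF includes introns, so membership in a span does not by itself imply that a position is coding. The `Span` type and the helper functions are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Span(NamedTuple):
    """Inclusive genomic interval on chromosome 6 (hypothetical helper type)."""
    name: str
    start: int
    end: int

    def contains(self, position: int) -> bool:
        return self.start <= position <= self.end


# Genomic coordinates as quoted in the text.
CANONICAL = Span("Q9UBP5", 125_749_777, 125_759_802)
ALTERNATIVE = Span("IP_145210", 125_759_396, 125_759_827)


def spans_containing(position: int, spans: List[Span]) -> List[str]:
    """Names of the ORF spans that contain the given position."""
    return [s.name for s in spans if s.contains(position)]


def shared_interval(a: Span, b: Span) -> Optional[Tuple[int, int]]:
    """Inclusive interval common to both spans, or None if they are disjoint."""
    start, end = max(a.start, b.start), min(a.end, b.end)
    return (start, end) if start <= end else None


variant_position = 125_759_806  # 6:g.125,759,806 T>G
print(spans_containing(variant_position, [CANONICAL, ALTERNATIVE]))
print(variant_position - CANONICAL.end)  # distance past the canonical ORF end
print(shared_interval(CANONICAL, ALTERNATIVE))
```

Under these assumptions the variant lies outside the canonical span but inside the alternative one, which is why an annotator restricted to the canonical ORF reports only a “modifier” impact.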
Interestingly, when looking at the relative impact of variants on HEY2 (Fig. 2C), OpenVar predicts higher impact variants towards the carboxyl end of the annotated ORF (Q9UBP5 protein), which corresponds exactly to the position of the alternative ORF (IP_145210 protein). The IP_145210 ORF in HEY2 does not overlap with any known domain of the well-characterized Q9UBP5 protein. Q9UBP5 is a 337 amino acid long Hairy-related basic helix-loop-helix (bHLH) transcription repressor, but its functional domains16 span amino acids 48 to 116 (Fig. 2C). Meanwhile, the IP_145210 ORF overlaps the disordered carboxyl tail of the Q9UBP5 ORF. This observation agrees with previous reports suggesting that intrinsically disordered regions are prone to host dual-coding events7,17,18. Thus, the results highlighted by OpenVar suggest that these variants may be detrimental via their consequence on IP_145210 rather than on the canonical Q9UBP5.