Chromosome distribution of circRNAs
Based on all three data sources, we defined the unique circRNA sets in our database according to their genomic location and strand. The three datasets shared a total of 11,781 circRNA records (Figure 2A). The circNet database was the largest, with 185,019 unique circRNA records, while circBase had the fewest source-specific records (1934 circRNAs). In total, 101,535 circRNAs were supported by two or more data sources. For instance, circNet and circBase shared 75,659 circRNA records, providing cross-validation. In summary, our integration revisited current public data sources for circRNAs and provided a non-redundant circRNA list for experimental verification.
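The integration step described above can be illustrated with a minimal sketch, assuming that each circRNA is identified by the tuple (chromosome, start, end, strand). The function names, the placeholder name "third_source", and the toy records are hypothetical and are not taken from the actual pipeline.

```python
from collections import defaultdict

def merge_sources(sources):
    """Map each unique circRNA key to the set of sources that report it.

    `sources` maps a source name to an iterable of
    (chrom, start, end, strand) tuples. The tuple itself is the identity
    key, so duplicate records collapse into one entry.
    """
    support = defaultdict(set)
    for name, records in sources.items():
        for chrom, start, end, strand in records:
            support[(chrom, start, end, strand)].add(name)
    return dict(support)

def summarize(support):
    """Count unique records and how many are backed by several sources."""
    n_sources = len({name for names in support.values() for name in names})
    return {
        "unique": len(support),
        "shared_by_all": sum(1 for names in support.values() if len(names) == n_sources),
        "validated": sum(1 for names in support.values() if len(names) >= 2),
    }

# Toy records for illustration only (not real coordinates).
toy = {
    "circNet": [("chr1", 100, 500, "+"), ("chr2", 50, 90, "-"), ("chr3", 10, 40, "+")],
    "circBase": [("chr1", 100, 500, "+"), ("chr2", 50, 90, "-")],
    "third_source": [("chr1", 100, 500, "+"), ("chr1", 100, 500, "-")],
}
support = merge_sources(toy)
summary = summarize(support)
print(summary)
```

In this sketch, two records with identical coordinates on opposite strands are kept as distinct circRNAs, in line with the location-and-strand definition used above.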
To explore the distribution of circRNAs across chromosomes, we plotted the number of circRNAs on each chromosome and calculated the ratio of circRNA count to chromosome length in base pairs (Figure 2B). The largest chromosome, chromosome 1, has the most circRNA records (28,161). The length-normalized ratio indicates which chromosomes are enriched for circRNAs. Interestingly, chromosomes 17 and 19 have comparatively high numbers of circRNAs, although these chromosomes are relatively short. In contrast, chromosome X is large but has only 7927 circRNAs, giving a low ratio of 0.00005 circRNAs per base pair. As the first comprehensively integrated human circRNA resource, our database provides insight into the abundance of circRNAs in different genomic regions, which is valuable for linking circRNAs to known genomic events and processes.
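The length normalization can be reproduced with a short calculation. The sketch below uses the two circRNA counts quoted above and assumes GRCh37 chromosome lengths. The assembly used for the circRNA coordinates is not stated in this section, so the lengths are an assumption, and the function name is hypothetical.

```python
def circrna_density(counts, lengths):
    """Return circRNA records per base pair for each chromosome."""
    return {chrom: counts[chrom] / lengths[chrom] for chrom in counts}

# circRNA counts quoted in the text.
counts = {"chr1": 28161, "chrX": 7927}

# Chromosome lengths in base pairs (assumption: GRCh37 assembly).
lengths = {"chr1": 249_250_621, "chrX": 155_270_560}

density = circrna_density(counts, lengths)

# List chromosomes from the highest to the lowest density.
for chrom, value in sorted(density.items(), key=lambda item: item[1], reverse=True):
    print(f"{chrom}: {value:.5f} circRNAs per bp")
```

Under these assumed lengths, the chromosome X value rounds to 0.00005, matching the ratio reported above.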
Linking circRNA variants to population frequency and phenotype by overlapping with genetic variant databases
To link circRNA variants with phenotype information, such as clinical significance and allele frequency in various populations, we downloaded all single nucleotide mutations and INDELs in GRCh37 coordinates from four resources: the 1000 Genomes Project, GWAS Catalog, ClinVar, and COSMIC v81. Using a genome coordinate mapping algorithm, we identified the SNPs/INDELs located within the chromosomal coordinates of the integrated circRNAs. This coordinate mapping yielded: i) 37,399 circRNAs with 93,708 genetic variants from GWAS Catalog; ii) 67,661 circRNAs with 1,858,343 genetic variants from ClinVar; iii) 236,762 circRNAs with 2,597,987 somatic variants from COSMIC; and iv) 291,729 circRNAs with 26,361,367 variants from the 1000 Genomes Project.
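The mapping algorithm itself is not specified above. As an illustration of the general idea, the following minimal sketch finds the circRNAs whose coordinates contain or overlap a variant, using a sorted list of start positions and a binary search. The closed-interval coordinate convention, the function names, and the toy identifiers are all assumptions.

```python
from bisect import bisect_right
from collections import defaultdict

def build_index(circrnas):
    """Group circRNA intervals by chromosome and sort them by start."""
    by_chrom = defaultdict(list)
    for circ_id, chrom, start, end in circrnas:
        by_chrom[chrom].append((start, end, circ_id))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        index[chrom] = ([s for s, _, _ in intervals], intervals)
    return index

def overlapping_circrnas(index, chrom, var_start, var_end):
    """Return IDs of circRNAs whose closed interval overlaps the variant."""
    if chrom not in index:
        return []
    starts, intervals = index[chrom]
    # Only intervals that start at or before the variant end can overlap it.
    hi = bisect_right(starts, var_end)
    return [cid for _, end, cid in intervals[:hi] if end >= var_start]

# Toy data for illustration only.
circrnas = [
    ("circ_A", "chr1", 100, 500),
    ("circ_B", "chr1", 400, 900),
    ("circ_C", "chr2", 50, 90),
]
variants = [
    ("snp_toy1", "chr1", 450, 450),   # inside circ_A and circ_B
    ("snp_toy2", "chr1", 950, 950),   # outside every circRNA
    ("indel_toy", "chr2", 88, 95),    # partially overlaps circ_C
]

index = build_index(circrnas)
hits = {vid: overlapping_circrnas(index, chrom, s, e) for vid, chrom, s, e in variants}
circs_with_variants = {cid for ids in hits.values() for cid in ids}
print(hits)
print(len(circs_with_variants), "circRNAs carry at least one variant")
```

The filter after the binary search is linear in the number of candidate intervals. For datasets with millions of variants, an interval tree or a dedicated tool for genomic interval intersection would likely be a better choice.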
Since the genetic variants from the 1000 Genomes Project are mostly common variants in the healthy population (1000 Genomes Project Consortium et al., 2015), their distribution across circRNA regions on each chromosome may provide a population-based overview. To explore this possibility, we plotted all circRNA-associated common variants by chromosome and normalized their numbers by chromosome length. As shown in Figure 3A, chromosome 2 has the highest number of common variants associated with circRNAs (12,232,567), followed by chromosome 1 with 10,765,543. Since the majority of the variants are at the single nucleotide level, SNPs overlapping with circRNAs may have a neutral or minor effect on the transcript. However, thousands of large-scale structural variants (SVs) were detected in the circRNA regions. For example, circRNAs on chromosome 2 overlap with 7294 SVs. SVs can alter the nucleotide sequence and the copy number of circRNA loci, which, in turn, may have a large effect on transcript expression. Chromosome 17 is short, but its circRNAs carry a comparatively large number of variants. In summary, mapping the 1000 Genomes data to circRNAs may help users to understand the population frequency of common variants within circRNAs.
The ClinVar database aggregates information on genetic variation and its relationship to human health. Variants in circRNA regions that are recorded in ClinVar may therefore have significant effects on cellular processes. As shown in Figure 3B, the majority of these variants are germline and are associated with various clinical phenotypes. Chromosome 2 harbors the largest number of ClinVar variants in circRNA regions, whereas chromosome 1 has only half as many clinical variants as chromosome 2. In addition, chromosome 17 has the highest ratio of clinical variants to chromosome length. It is worth noting that most of the clinical variants are based on studies of protein-coding regions. Since many circRNAs overlap with their corresponding protein-coding transcripts, changes in circRNAs may profoundly affect gene expression.
Variant types of circRNAs in the human genome
To discover which types of genetic variants are more likely to occur in circRNA regions, we overlapped the circRNAs with the GWAS Catalog and COSMIC databases. GWAS Catalog contains the records of all published genetic variants with phenotype information from large-scale genome-wide association studies. The COSMIC database is primarily focused on cancer-related somatic mutations. For all genetic variants, we used the effects described in the original database to define their type. For example, there are 16 categories in the GWAS Catalog, including 3' UTR variants and transcription factor binding site variants (Figure 4A). The data from COSMIC group the variants into seven types: whole, complex, deletion, insertion, nonstop, substitution, and unknown (Figure 4B). By combining the genomic location and variant type for each circRNA variant, we constructed a matrix of the number of variants of each type on each chromosome. Based on these numbers, we performed a Z-transformation to identify counts that were markedly higher or lower than the average.
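The Z-transformation of the count matrix can be sketched as follows. The text does not state along which axis the transformation was applied or which standard deviation was used. This minimal example assumes that each variant type is standardized across chromosomes with the population standard deviation. The function name and the toy counts are hypothetical.

```python
from statistics import mean, pstdev

def z_scores_by_type(matrix):
    """Z-transform the counts of each variant type across chromosomes.

    `matrix` maps a variant type to a {chromosome: count} dictionary.
    Assumption: z = (count - mean) / population standard deviation,
    computed separately for each variant type.
    """
    result = {}
    for vtype, counts in matrix.items():
        values = list(counts.values())
        mu = mean(values)
        sigma = pstdev(values)
        result[vtype] = {
            # A type with identical counts everywhere has no outliers.
            chrom: (n - mu) / sigma if sigma > 0 else 0.0
            for chrom, n in counts.items()
        }
    return result

# Toy counts for illustration only.
matrix = {
    "intron": {"chr1": 10, "chr2": 10, "chr3": 10, "chr4": 10},
    "TF_binding_site": {"chr1": 2, "chr2": 2, "chr3": 2, "chr4": 10},
}
z = z_scores_by_type(matrix)
for vtype, scores in z.items():
    print(vtype, {chrom: round(score, 2) for chrom, score in scores.items()})
```

With this convention, a chromosome with a large positive Z-score for a given variant type carries more variants of that type than the average chromosome, while an evenly distributed type yields scores near zero everywhere.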
Based on the data mapped to GWAS Catalog, we found that a large number of circRNA variants were not within protein-coding regions. For example, chromosome 1 has 4754 intronic and 1836 intergenic circRNA variants. In fact, the intronic and intergenic variants were evenly distributed across all chromosomes (Figure 4A). However, a number of chromosomes had large Z-scores for particular variant types, indicating that they contain a higher than average number of circRNA variants of that type. For instance, chromosome 12 has 33 circRNA variants located in transcription factor binding sites, which is the greatest number found among the chromosomes. As a fundamental regulatory mechanism, transcription factor binding affects the expression of protein-coding genes and may also have profound effects on circRNA expression. In summary, over 90% of circRNA variants from the GWAS Catalog fall in non-coding regions, and these non-coding circRNA variants are spread evenly among all chromosomes. Unlike mutations in coding regions, non-coding changes in circRNAs may have profound effects on the expression of circRNAs, suggesting novel mechanisms for common diseases.
Using variants from the cancer datasets (Figure 4B), we observed an even distribution of cancer-related somatic variants across multiple chromosomes. Chromosome 2 has the highest number of cancer-related variants: 3,246,573. This pattern is not surprising, as the majority of somatic variants are substitutions. For example, 2,605,118 of the 2,810,771 variants on chromosome 1 (about 93%) are substitutions. It is important to note that chromosome 7 has a total of 10,756 complex mutations, nearly five times the average of 2225 across all chromosomes. Complex mutations combine multiple insertions, deletions and substitutions. This large number of complex mutations may provide driver mechanisms that influence the expression of circRNAs.