The functionality of topr is best described with examples of typical usage using the three inbuilt datasets (CD_UKBB, CD_FINNGEN, and UC_UKBB), where any association results with three columns representing variant positions (e.g., CHROM, POS) and p-values (P) could be used instead. The code used to create the figures and tables is available at the topr website.
Visual overview of annotated GWAS results
When an analyst acquires new GWAS results, he or she typically wants an instant visual overview of the results to evaluate the association pattern and identify potential association peaks and nearby genes. Here, I use topr´s inbuilt dataset CD_UKBB as an example of the newly obtained GWAS results. To obtain an immediate visual overview of the results, topr´s manhattan function is called using the association results (CD_UKBB) as the sole input. By also passing the annotate argument with a p-value (e.g., annotate = 1e-09) to the manhattan function, the top variants within each 1 Mb window of the genome with p-values below the assigned p-value are labelled with their nearest gene from genome build GRCh38. If the input dataset already includes gene information (in a column called gene, Gene_Symbol, or Gene_name), topr will use that for labelling instead. The size of the 1 Mb window can be altered with the region_size argument for sparser or denser labelling of association peaks. Alternatively, the annotate_with_vline argument can be used to further highlight the annotated variants by drawing a dashed vertical line through their genomic positions on the plot. To extract the lead variants labelled on the plot, the get_lead_snps helper function is called with the CD_UKBB dataset as the input. The extracted variants can subsequently be annotated with their nearest gene using the annotate_with_nearest_gene function, which takes any R data frame, including one or more genetic positions (represented as CHROM and POS), as input and returns the overlapping or nearest gene. The number of variants labelled on Manhattan and regional plots is controlled with the region_size and annotate arguments, whereas the number of variants extracted with helper functions is controlled with the region_size and thresh arguments. In addition to labelling variants with their nearest gene on the plots, the genomic location of specific genes of interest can easily be highlighted at the bottom of the Manhattan plots by providing their names (in a vector in R) to the highlight_genes argument (e.g., highlight_genes = c(“IL23R”, ”NOD2”)).
After visually examining the genome-wide association results, it is of interest to further explore regions around the association peaks (or other regions of interest). The regionplot function displays association results for smaller genetic regions defined by the gene name, variant identifier (ID), or the coordinates of the genetic region (provided by the user). For example, calling the regionplot function with the association results and the gene argument set to IL23R displays the results within a genetic region spanning the entire length of the gene, with an additional 100 kb extending in each direction. The gene_padding argument is used to alter the size of the extended region. The coordinates of the displayed region are always shown in the regionplot and can thus be easily copied and passed to the get_snps_within_region function to extract the displayed variants. Similar to the Manhattan plot, the top variants are labelled using the annotate and annotate_with_vline arguments, and the labelling density is controlled with the region_size argument. By default, the top variants are labelled with their variant identifier (ID) on regional plots and their nearest gene on Manhattan plots; however, this can be altered by assigning the annotate_with argument to either the gene or ID. Regional plots display exon structure on the gene plots for regions smaller than 1 Mb; otherwise, the gene structure is shown; however, this can be controlled with the show_genes argument.
When exploring regional association results, it is usually of interest to view the linkage disequilibrium (LD) pattern between the top variant and other variants in the region. This can be achieved using topr´s locuszoom function if the input dataset includes pre-calculated correlation scores (R2). The GTLD function in the GORpipe query language [11] was used to calculate the pairwise correlation scores between the top variant and other variants within the region previously shown on the regionplot. To display the regional correlation pattern between the variants, the regional dataset, including the pairwise correlation scores, is provided as the first and only argument for the locuszoom function. Additional input arguments for controlling the labelling density and plot aesthetics are optional and the same as for other topr plot functions.
Comparison of test statistics in two GWAS results
To further evaluate the newly obtained association results, it is useful to compare them with already available results on the same trait/phenotype from other studies or cohorts (if they exist). Here, I use the association results for Crohn’s disease from FinnGen, which also comes with the topr package (CD_FINNGEN), to compare with Crohn´s disease results from UKBB (CD_UKBB) used in the previous section. First, the two association results are displayed separately using the mannhattan function with two different colours to distinguish the datasets and annotate and extract the lead variants in each dataset (Fig. 2A and B). Because the manhattan plot function only provides a comparative overview of association p-values, topr includes another function called effectplot to visually compare effect sizes and p-values of the lead/top variants overlapping in the two datasets (Fig. 2 C and D). It requires two datasets (represented as a list in R), each containing the four columns (P, ALT, REF, OR/BETA) as input, or alternatively one dataset (a snpset) containing eight columns (P1, E1, ALT1, REF1, P2, E2, ALT2, REF2). If two datasets are provided as inputs, they are converted into a snpset in the effectplot function prior to plotting (Fig. 2 E). [was very long para, could be new one here as you change to new topic slightly]
If the input datasets include odds ratios (OR), they are converted into beta estimates prior to generating the snpset. A snpset can also be generated directly from two datasets using the get_snpset helper function. A snpset is generated by 1) extracting the top/lead variants from the first input dataset (D1) and 2) retrieving overlapping variants (by position) from the second dataset (D2). Like in other topr helper functions, the number of variants extracted from the first datasets can be controlled with the thresh and region_size arguments. In the third step 3); overlapping variants are matched by their reference and alternative alleles (REF and ALT), and the variant effect is flipped in the second dataset (D2) if the reference allele matches the alternative allele in the first dataset (D1). If the alleles cannot be matched, they are not included in the snpset. The final step 4) in creating a snpset is to convert the effect sizes so that the effect is shown for the positive allele in the first input dataset (D1).
Each of the 4 steps above can be applied independently within topr using thehelper functions; get_lead_snps, match_by_pos, match_by_alleles and flip_to_positive_allele_for_dat1 (see the get_snpset_code function for more details). To get information on which variants are excluded and why in the process of creating a snpset, the get_snpset function is called with the verbose argument set to TRUE. The function returns three datasets, the snpset (matching variants), the variants from D1 with no overlapping variants in D2 (variants not found) and the variants in D1 that could be matched by position in D2 but not by allele (no allele match).
Multiple GWAS results on the same plot
Instead of using multiple Manhattan or regional plots to compare association results by genome or region, they can be viewed simultaneously on a single Manhattan or regional plot using topr. Displaying the two Crohn’s disease datasets (CD_UKBB and CD_FINNGEN) on the same plot provides a clearer view of the overlapping peaks (Fig 2. F). By default, the two datasets are plotted using a single y-axis and are thus overlapping; however, this can be controlled with the ntop argument. For example, by setting the ntop argument to 1, the y-axis is duplicated with one result appearing at the top and the other at the bottom with a reversed y-axis (Fig 2. G). Displaying multiple association results on the same plot is especially useful when viewing meta-analysis results along with the association results used to generate them, to get a visual overview of dataset compatibility and contribution to respective association peaks. The association results shown in dark blue in Fig. 1A were generated by combining two Crohn´s disease datasets (CD_UKBB and CD_FINNGEN) in an inverse variance-based meta-analysis (4) implemented in the GORpipe query language [11]. It is evident from the plot, that most of the signals in the UKBB dataset (orange) strengthen when combined with the summary statistics from FinnGen (turquoise) in the meta-analysis (darkblue). There are however also peaks where the combined p-value is larger than the single study p-value, indicating that these signals do not replicate in the other dataset. The compatibility of the two datasets can be further explored using regionplot, by zooming into regions centred on specific genes or variants. By calling regionplot with the gene argument set to IL23R and annotate to 1e-06, one gets a more detailed view of the region and the top variant in each dataset. The rsids or rsid_with_vline arguments can be used to label variants by their identifier (ID) in all the dataset on display, and their position on the plot can be controlled with the nudge_x argument, which can be assigned with either a single value or a vector of multiple values.
Control over plot aesthetics
The most common ggplot and ggrepel arguments can be used with topr. The user has full control over the plot point and label aesthetics (e.g., colour, size, shape, transparency, font-face, angle, and position), as well as labelling thresholds and density, where different values can be assigned (with a vector of values) to multiple datasets displayed on the same plot (Fig. 3). For example, on the same plot, the user can label the top variant within each 100 kb window with p-values below 1´1012 in one dataset and the top variant within each 10 Mb window with p-values below 1´108 in another. Different plot points, shapes, sizes, and transparencies can be assigned to the two datasets and their top variants are labelled using distinct angles or positions in relation to the variants (using nudge_x and angle arguments). A legend figure is added by default to all Manhattan and regional plots displaying more than one dataset. The position of the legend can be altered, or it can be omitted from the plots. Furthermore, the user can change the colour and/or transparency of the shades/rectangles used to distinguish between the chromosomes in the Manhattan plot.