GTEx data collection and processing
We downloaded both RNA-seq raw sequencing data and whole-genome genotype data of the v8 release of the GTEx project from dbGAP (accession: phs000424.v7.p2). Expression outlier (eOutlier) and splicing outlier (sOutlier) data, and the metadata of samples (filename: GTEx_Analysis_v8_Annotations_SampleAttributesDD.xlsx) and subjects (filename: GTEx_Analysis_v8_Annotations_SubjectPhenotypesDD.xlsx) were downloaded from GTEx Portal (https://gtexportal.org/home/). The GTEx RNA-seq dataset contains 17,832 samples representing 54 biological tissues collected from 838 donors. For this study, we included data from 49 tissues, each with at least 70 samples. Original GTEx RNA-seq reads were aligned with the human genome (hg38/GRCh38) using STAR v.2.7.3a59, with the following alignment parameters: outSAMtype, BAM; SortedByCoordinate; outSAMstrandField, intronMotif; outFilterMultimapNmax, 10; outFilterMultimapScoreRange, 1; alignSJDBoverhangMin, 1; sjdbScore, 2; alignIntronMin, 20; and alignSJoverhangMin, 8. The aligned BAM files were sorted and further converted to bedGraph format using BEDTools v.2.27.160. The genotype data in VCF format (filename: GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.vcf.gz) was processed with vcftools v.0.1.13 to calculate MAF across all subjects and extract allele information for each variant.
3′ UTR APA and intronic APA quantification
To quantify the 3′ UTR APA, we analyzed alignment files in BAM format using the DaPars2 algorithm. We followed the workflow implemented in our 3′aQTL analysis25,61. Briefly, the BAM files were firstly transformed to bedGraph format with a bin size of 1, which records the read coverage of each position in the genome. Before analyzing APA, we downloaded the gene annotation file containing all transcripts of genome build hg38 in RefSeq database through the UCSC Genome Browser, from which we extracted 3′ UTR region of each transcript using script "DaPars_Extract_Anno.py". The DaPars2 algorithm then detects the proximal poly(A) site in the 3′ UTR region of each transcript and calculates the relative usage of the distal poly(A) site by the script “Dapars2_Multi_Sample.py” for all samples in each of the 49 tissues. This is indicated as the Percent of Distal Poly (A) site Usage Index (PDUI) only if a proximal poly(A) site is detected. For intronic APA detection and quantification, we used IPAfinder18, which is a python-based tool that enables de novo identification and quantification of intronic APA (IPA) events using RNA-seq data. IPAfinder can identify potential IPA sites and calculate the Intronic poly(A) site Usage Index (IPUI), which represents the proportion of total transcripts that are intronic-polyadenylated for each intronic APA event18,33,62. BAM files were analyzed together by IPAfinder and separated by tissues.
Covariate correction and normalization
To avoid batch effects and unobserved confounders in each tissue, we adjusted the sample genotype and APA usages with known covariates, such as population structure, sex, and sequencing platform. Briefly, for genotype data, we first removed sites marked as "wasSplit" from the GTEx analysis freeze variant call format (VCF) using BCFtools v.1.10.2. We further applied the PEER model63 with sex, age, sequencing platform, and the top five genotype principal components as known covariates to estimate a set of latent covariates for PDUI or IPUI values in each tissue. The number of PEER factors was optimized based on tissue sample size, as suggested by the GTEx Consortium; 15 PEER factors were chosen for sample sizes < 150, 30 PEER factors were selected for sample sizes ranging from 150 to 250, and 35 peer factors were chosen for sample sizes > 250. Before running the PEER model for inferring hidden covariates, PDUI/IPUI values in each tissue were quantile normalized to remove batch effects.
APA outlier calling
After inferring the hidden covariates for each tissue, we calculated PDUI/IPUI residuals by regressing out inferred PEER factors and known covariates, including population structure, sex, and sequencing platform, using the function "PEER_getResiduals". In each individual tissue, we obtained normal-distributed \(Z\left(g,t\right)\) score for each gene (\(g\)) in the tissue (\(t\)) by scaling the PDUI/IPUI residuals across samples with the following equation, \({X}_{r}\left(g,t\right)\) denotes the residuals of PDUI/IPUI values, \(\stackrel{-}{{X}_{r}\left(g,t\right)}\) and \(sd\left({X}_{r}\left(g,t\right)\right)\) represent the mean and standard deviation of the residuals across samples, respectively:
$$Z\left(g,t\right)=\frac{{X}_{r}\left(g,t\right)-\stackrel{-}{{X}_{r}\left(g,t\right)}}{sd\left({X}_{r}\left(g,t\right)\right)}$$
We defined two types of aOutliers in the current study. One is single-tissue aOutlier, which is called from a single tissue based on the Z-score of each gene in that tissue. When the absolute Z-score of an individual exceeds a threshold of three for a gene, then the individual is called a single-tissue aOutlier for that gene. The other is multi-tissue aOutlier, for which we calculated a median Z-score7,8 for each APA event across all tissues when data were available and restricted our analysis to individuals with APA measurements in at least five tissues. Multi-tissue aOutliers were defined as those with an absolute median Z-score > 3. The same threshold was used for eOutlier and sOutlier calling7,8. Our method allowed that one gene could have multiple aOutlier individuals, and one individual could also be aOutliers of multiple genes. To account for situations in which widespread aberrant APA might occur in an individual due to non-genetic influences, we removed 11 individuals in which the proportion of tested genes identified as multi-tissue aOutliers exceeded 1.5 times the interquartile range of the distribution for aOutlier gene proportion across all individuals. These 11 individuals were marked as global outliers.
Estimation of replication rates of aOutliers
To estimate the replication rate of aOutliers between different tissues, we selected one of the 49 GTEx tissues each as discovery tissue, and compared aOutliers detected in it with those of the other tissues. For replication rate calculation, we only considered the shared aOutlier genes in both tissues and an aOutlier to be replicated only when the gene and individual of the aOutlier matched between the compared tissue pairs. For multi-tissue aOutliers replication, we used the cross-validation method described in a previous study8 to estimate their replication rate. In brief, the 49 human tissues were separated into two groups, one group with 39 tissues as the discovery group, the other group has the remaining ten tissues as the replication group. Each time we randomly sampled t (t = 10, 15, 20, 25, 30) tissues from the discovery group and called multi-tissue aOutliers in the discovery group using a Z-score threshold of 3 in at least five tissues as described above. Then we estimated the replication rate as the proportion of multi-tissue aOutliers in the discovery group with an absolute median Z-score 2 or 3 in the replication group. We also computed the expected replication rate by randomly selecting individuals in the discovery group with at least five tissues that have APA usage for the gene and determined the replication status in the replication group. For each discovery group size (t), we repeated this process 10 times.
RV annotation
We defined RVs as those with MAF < 1% within the GTEx European individuals and with MAF < 1% in non-Finnish Europeans within gnomAD64. Singletons were defined as RVs with minor allele only presents once in GTEx European individuals and were extracted using vcftools. The annotation of RVs was performed by Ensembl VEP (release 104), which assigned 36 different annotation terms to each RV, including protein-coding gene position (e.g., "splice_donor" "splice_acceptor," "frameshift”) and regulatory regions (e.g., "TFBS_ablation", "TF_binding_site"). Annotation terms were grouped into one of the four classes based on predicted impact: "High", "MODERATE", "MODIFIER" and "LOW". The high-impact one was used for variants assigned with two or more annotations. In addition to 36 VEP annotations, we added two other annotations to each RV; "PAS region" describes variants located within 50 bp upstream of the annotated PAS, and "PAS signal" refers to variants located at the PAS motif "AAUAAA" and its additional 14 variants ("AUUAAA", "UAUAAA", "AGUAAA", "AAAAAA", "AACAAA", "AAGAAA", "AAUAUA", "AAUACA", "CAUAAA", "UUUAAA", "ACUAAA", "AAUAGA" and "GAUAAA"). We also used genomic annotations of variants extracted from CADD v.1.5 release (http://cadd.gs.washington.edu/download).
RV enrichment analysis
We examine the enrichment of Rare Variants (RVs), including single-nucleotide variants (SNV) and small insertion and deletion (indel) near aOutlier genes. Only genes with at least one aOutlier individual were considered, and the remaining individuals for the same genes were treated as nonoutlier controls. We first counted the RVs present within 1kb, 2kb, or 10kb of the outlier genes in both outlier and control individuals and built a 2 \(\times\) 2 contingency table for each of the flanking region, containing the number of aOutliers with RVs, the number of nonoutlier controls with RVs, the number of aOutliers without RVs, and the number of controls without RVs. We then calculated Odds Ratios (ORs), P value, and 95% confidence interval (CI) using Fisher’s exact test in R base package. We grouped variants into four groups based on their MAF (0–1%, 1–5%, 5–10%, and 10–25%), and performed enrichment analysis for each group. We also conducted enrichment for RVs that stratified by VEP annotations and CADD scores as described above.
Enrichment analysis for RVs that influence PAS and AU-rich motifs
To identify potential regulatory variants associated with aberrant APA events, we defined RVs located within the gene body or in the 10-kb region surrounding outlier genes in outlier individuals as aOutlier RVs. Those in nonoutlier individuals in the same region were classified as nonoutlier RVs. For each aOutlier RV and nonoutlier RV located in the 50-bp region upstream (PAS region) of the poly(A) sites annotated in PolyA_DB V.3.265,66, we extracted its upstream and downstream 5 base pairs sequences and examined whether it matched with one of the 15 known PAS motifs by using script "dna-pattern" in RSAT (https://github.com/rsa-tools). We then summarized all tested RVs and conducted PAS motif enrichment analysis using Fisher's exact test, which determined the odds ratios (ORs) and 95% confidence intervals (CIs) for each PAS motif. To perform enrichment analysis at the 12 known AU-rich motifs, we restricted RVs to those within the 100 bp flanking the annotated poly(A) sites. We then counted RV enrichment analysis for each of the AU-rich motifs using the same method as for PAS motifs.
Identification of aOutlier RVs enriched RNA motifs
We focused on multi-tissue aOutlier associated RVs located within the gene body region, which spans from 3 kb downstream of the transcription start site (TSS) to the end of the gene. We extracted the 3 base pairs of sequences flanking each RV from both sides. Next, we used DeepBind v0.1135 to score these 7-mer sequences using 617 pre-built models, including 515 transcription factors and 102 RNA-binding proteins from Homo sapiens. For each 7-mer sequence, we selected the top three motifs with a DeepBind score of at least 0.1. To validate the enrichment of RVs in predicted binding motifs, we created a control set of RVs by randomly shuffling the genomic locations of multi-tissue aOutlier associated RVs within the same gene body regions. We used Fisher's exact test to estimate the level of enrichment.
Identification of aOutlier RVs enriched RBPs
We obtained CLIP-seq data for 166 RNA-binding proteins (RBPs) from the Encyclopedia of DNA Elements (ENCODE) data portal for HepG2 and K562 cells. We only considered significant binding peaks with P-values < 0.01, shared by two biological replicates for each RBP. To assess the enrichment of aOutlier RVs in RBP binding peaks, we selected RVs associated with multi-tissue aOutliers within gene body regions representing the region of 3 kb downstream of the transcription start site (TSS) to the end of the gene. We created a control RV set by randomly shuffling the genomic locations of multi-tissue aOutlier associated RV set within the same gene body regions. We then counted the RVs in binding peaks of each RBP using bedtools. Finally, we compared the RVs between the two sets using Fisher's exact test to determine the enrichment.
Development of a Bayesian prediction model that integrates APA outlier signals
To prioritize rare functional variants with significant impact, we improved the Watershed Bayesian hierarchical model by incorporating APA outlier signals with other layers of transcriptomic outlier signals and genomic annotations. The improved model called aWatershed, includes a layer of genomic annotation features (G) which denotes the 40 observed genomic features aggregated over all RVs in the outlier individual that are within 10 kb region of the gene, a fully connected conditional random field (CRF) layer (Z) represent the unobserved regulatory variables for each of the three transcriptomic outlier signals (APA, mRNA expression, and splicing), and a layer of variables (E) representing the observed outlier status of each transcriptomic data type. The three layers were linked by the following conditional distributions:
Where \(K\) represents the three outlier signals (APA, Expression, and Splice), \({\beta }_{k}\) are parameters defining the contribution of the 40 genomic features to the CRF of the three outlier signals, \(\alpha\) defines the intercept of the CRF for each outlier signal, \(\theta\) represent parameters defining the edge weights between pairs of the three outlier signals, \({\phi }_{k}\) are the paramters denoting the categorical distributions of each of the three outlier signal, and \(C\) and \(\lambda\) are hyper-parameters.
To train and evaluate aWatershed, we utilized all gene-individual pairs that have at least one of the three multi-tissue outlier signals, which are defined as the absolute value of Z-score greater than 3 or P-value less than 0.0027 for splicing outliers, measured in GTEx v8 data. We also used a set of 38 binary and continuous genomic annotation features aggregated across all rare variants within the 10-kb region, flanking each gene. We then trained aWatershed to learn edge weights connecting the three transcriptomic outlier signals, weights representing the contribution of each genomic annotation for each type of outlier signal, and other parameters, as described previously7,8.
To evaluate aWatershed, we selected pairs of individuals with the same set of rare variants associated with the same gene (known as "N2pair") from the training dataset. We estimated the posterior probability of a functional rare variant in the first individual of the pair and used the outlier status of the second individual as a label for evaluation. We also trained and evaluated the genomic annotation model on each layer of the three transcriptomic signals to determine whether the integration of transcriptomic outlier signals contributes to the prediction of rare functional variants. We compared the results to those obtained from the aWatershed model. After evaluation, we utilized the aWatershed prediction model to calculate posterior probabilities.
3′aQTL mapping across 49 GTEx tissues
Genetic associations between GTEx common variants within 1 Mb of each gene and PEER-corrected APA usage were mapped by Matrix eQTL67, as described in our previous study25. Known covariates, including sex, RNA integrity number, platform, top five genotype principal components, and unobserved covariates inferred from PEER, were used during 3′aQTL mapping with Matrix eQTL. The number of PEER covariates for each tissue was used as suggested by the GTEx Consortium. We performed 1,000 rounds of permutation to obtain empirical P-values for each gene, which were then adjusted using the R package qvalue.
Colocalization analysis between GWAS summary statistics and 3′aQTL
We conducted colocalization analysis comparing GWAS summary statistics from the UKBB and literature and 3′aQTL summary data from 49 human tissues using the coloc v.5.1.0.1 package in R68. Only GWAS summary data for traits with at least 10,000 cases (binary traits) or 50,000 participates (continuous trait) and with at least 10 SNPs overlapped with aOutlier-associated RVs were kept, which resulted in 1,186 well-powered traits. We extracted the sentinel SNPs for each GWAS trait, defined as GWAS SNPs with P < 5 × 10‒8, located at least 1 Mb away from more significant variants. We then searched for colocalizing signals within the 1-Mb region surrounding each sentinel SNP. The coordinates from 3′aQTL summary data were converted from human genome build 38 (hg38) to build 37 (hg19) by CrossMap software69 to match the version used in all GWAS summary statistics. As defined by the coloc method, five posterior probabilities under five different null hypotheses were calculated. In detail, PP0 represents the null model of no association. PP1 and PP2 represent the probability that causal genetic variants are associated with disease signals or 3′aQTL only. PP3 represents the probability that the genetic effects of trait signals and 3′aQTL are independent, and PP4 represents the probability that trait signals and 3′aQTL data share causal SNPs. The current study classified colocalized events as those with PP4 > 0.75.
3′ UTR APA transcriptome-wide association study (3′aTWAS) analysis
We used APA quantitative data that was previously used for 3′aQTL mapping25,52,61 and genotype data of individual genomes from whole genome sequencing (WGS) of GTEx consortia to construct 3′aTWAS model using FUSION70 for each of the 49 human tissues. To avoid the effects of confounders, well-established factors used in 3′aQTL mapping, including gender, sequencing platform, and other covariates, were incorporated to adjust APA usages. To build the TWAS model, four different models embedded in FUSION were used for weight calculation, including best linear unbiased predictor (blup), elastic-net regression (enet), lasso regression (lasso), and single best eQTL (top1). Subsequently, the cross-validation approach was employed to choose the optimal 3′aTWAS model for each gene. Of note, only genes exhibiting significant heritability estimates (cis-h2) (Bonferroni-corrected P < 0.05) were retained for subsequent analysis. The built models were then applied to GWAS summary statistics for gene-based association analysis, and a significant association was defined by the FDR < 0.05. The disease risk genes identified by 3′aTWAS in two or more tissues were used for further analysis.
Prioritization of trait-associated RVs
To determine the frequency with which randomly selected aWatershed-prioritized RVs exhibit larger GWAS effect sizes than matched non-prioritized RVs, we conducted a random sampling test (n = 1000) on all RVs using posterior probabilities obtained from the aWatershed prediction model and effect sizes from UKBB GWAS summary statistics. We used aWatershed-prioritized RVs based on aOutlier signals, matched non-prioritized RVs, as well as GWAS effect sizes, gene IDs, and prioritized scores as input data. We defined matched non-prioritized RVs as those with a posterior probability of < 0.1 and MAF within ± 0.001 of the selected prioritized RVs in the UKBB cohort.
For each gene in each trait, we randomly selected one prioritized RV and one matched non-prioritized RV and then identified the one with the largest absolute GWAS effect size in the pair. By summarizing all genes in the trait, we computed the odds of observing a prioritized RV with a larger absolute effect size than a non-prioritized RV across all genes. To generate a null distribution of odds, we repeated this process for matched non-prioritized variants only and randomly selected and compared two non-prioritized RVs for each gene.
Cell culture
HEK293T and MCF7 cells were purchased from the Cell Bank of the Type Culture Collection at the Shanghai Institute of Biochemistry & Cell Biology, Chinese Academy of Science. Cells were maintained in Dulbecco's modified Eagle medium (DMEM; Invitrogen, #11960044) supplemented with 10% fetal bovine serum (Gibco), 100-µg/ml streptomycin, and 100-units/ml penicillin at 37°C in a humidified incubator with 5% CO2.
Plasmid construction
All primers used in this study are listed in Supplementary Table 8. For intronic APA (IPA) minigenes, the candidate intron and its flanking exons were amplified from genomic DNA as wild-type fragments. For 3′ UTR APA minigenes, the 3′ UTR of each gene was amplified from genomic DNA as wild-type fragments, and mutations were introduced by PCR-based site-directed mutagenesis. In short, genomic DNA from HEK293T and MCF7 cells was amplified by PCR using primers to generate two 20–25 bp overlapping fragments containing a mutant site. The IPA wild-type and mutant fragments were subcloned into the EcoRI and BamHI sites of the pcDNA3.1 vector, while 3′ UTR APA wild-type and mutant fragments were subcloned into the XhoI and PmeI sites of the mpCHECK2 vector by the One Step Cloning Kit (Vazyme). Two sets of predesigned shRNAs from Sigma against DDX18 were used to clone into pLKO.1-puro vector.
Transient transfection
For transient transfection, HEK293T and MCF7 cells were plated in a 2-ml culture medium at 6 × 105 cells/well in six-well plates. After 24 h of culture, cells were transfected with 2 µg of wild-type or mutant minigene plasmid using Lipofectamine 2000 (Invitrogen), according to the manufacturer's instructions. The culture medium was replaced at 6 h post-transfection, and cells were harvested for RNA extraction at 48 h post-transfection. Total RNA was extracted using TRIzol reagent (Invitrogen), according to the manufacturer's instructions, and cDNA was synthesized using the FastKing RT Kit (Tiangen, KR116) with the S-CDS primer. All cDNA was diluted 4-fold in nuclease-free double-distilled H2O for further use.
3′ RACE
The total length of 3′ UTR was identified and amplified from the total RNA of NCI-H1299 cells by 3′ RACE using the HiScript-TS 5′/3′ RACE Kit (Vazyme, RA101) following the manufacturer’s protocol. 3′ RACE was performed using the S-PCR primer and pcDNA3.1-F or mpCHECK2-F primer to distinguish minigene RNA from endogenous RNA, respectively. The 3′ RACE PCR products were separated by gel electrophoresis, and excised bands were purified for Sanger sequencing using the Zymoclean Gel DNA Extraction kit. Cleaned DNA fragments were cloned into the PCE2 vector using the 5 min TA/Blunt-Zero Cloning Kit (Vazyme, C601) and bidirectionally sequenced with M13 forward and reverse primers. At least five colonies were sequenced for every gel product that was purified. Primer sequences are listed in Supplementary Table 8.
Dual-luciferase reporter assay
MCF-7 cells were seeded 1 day prior to transfection. The Renilla luciferase in the mpCHECK-2 vector was transfected into cells using Lipofectamine 3000 Transfection Reagent (Invitrogen, cat#: L3000015) according to the manufacturer′s instructions. Forty-eight hours post-transfection, firefly, and renilla luciferase activities were measured by Dual-Luciferase Assay System (Promega, #E1980) on a BioTek Synergy H1 plate reader with full waveband. Each assay was measured in three independent replicates.
Cell viability and proliferation assays for shRNA-mediated knockdown
shRNA-expressing lentivirus was produced with the third-generation packaging system in human embryonic kidney (HEK) 293T cells. For lentivirus infection, target cells (MCF7) were seeded in a 6-well plate 16–18 h before infection and were grown to 70–80% confluency upon transduction. The culture medium was removed, and cells were incubated with virus supernatant along with 8 µg/ml polybrene. Puromycin was applied to kill non-infected cells 2days after infection. After two days of selection, when non-infected control cells were all dead, surviving cells were split and maintained with the same concentration of puromycin. Cells were trypsinized, resuspended at 1 × 104 cells/ml, and seeded in 96-well plates, with each well containing 100ul medium of 1 × 103 cells. Cell viability and proliferation were determined using CCK8 assays (Yeasen, cat#: 40203ES76) at designated time points (day 1, day 3, day 5, and day 7) by measuring the absorbance at 450 nm, following the manufacturer′s instructions. Values were obtained from three replicate wells for each treatment and time point. Results were representative of three independent experiments.
The comprehensive data portal for aOutliers
We have established a database along with a web interface called rareAPA (http://bioinfo.szbl.ac.cn/rareAPA/index.php) on a standard LAMP (Linux + Apache + MySQL + PHP) system, which serves as a comprehensive resource presenting detailed and comprehensive information on rare APA events and their associated RVs. All these data in the rareAPA were stored in MySQL (www.mysql.com). The interactive web pages were implemented using HTML, CSS, JavaScript, and PHP languages (www.php.net), with several JavaScript libraries (JQuery.js, DataTable.js, and IGV.js) and Bootstrap framework (a popular framework for developing interactive websites) on Red Hat Linux powered by an Apache server (www.apache.org). This data portal is valuable for exploring aOutliers and their associated functional rare variants. With rareAPA, users can search, browse, and visualize important information on aOutliers in 49 human tissues. Users can search by gene or tissue name and scrutinize rare APA events among individuals in each tissue. Additionally, users can also visualize aOutliers using a scatter plot or explore them through a genome browser. Furthermore, rareAPA provides a curated list of prioritized RVs using the aWatershed algorithm, allowing users to examine rare variants and their aWatershed posterior scores. Additionally, rareAPA offers batch downloading of all single-tissue aOutliers and multi-tissue aOutliers. The rareAPA is freely available online without registration or login requirements.