A Comprehensive Allele Specific Expression Resource for the Equine Transcriptome

Background: Allele-specific expression (ASE) analysis provides a nuanced view of cis-regulatory mechanisms affecting gene expression. Results: An equine ASE analysis was performed, using integrated Iso-seq and short-read RNA sequencing data from four healthy Thoroughbreds (2 mares and 2 stallions) across 9 tissues from the Functional Annotation of Animal Genomes (FAANG) project. Allele expression was quantified by haplotypes from long-read data, with 42,900 allele expression events compared. Within these events, 635 (1.48%) demonstrated ASE, with liver tissue containing the highest proportion. Genetic variants within ASE events were in histone modified regions 64.2% of the time. Validation of allele-specific variants, using a set of 66 equine liver samples from multiple breeds, confirmed that 97% of variants demonstrated ASE. Conclusions: This valuable publicly accessible resource is poised to facilitate investigations into regulatory variation in equine tissues. Our results highlight the tissue-specific nature of allelic imbalance in the equine genome.


Background
In diploid mammalian cells, autosomal genes are usually equally expressed. 1In some cases however, a gene can exhibit expression biased for one allele over the other. 2This allele-speci c expression (ASE) often results from cis-acting genetic variants on the same chromosome that are in proximity to or within the affected gene.Genetic variants within a gene can affect gene expression, mRNA stability, or mRNA function in different ways, leading to one allele being expressed more than the other.Various cis-acting have the potential to cause ASE.Although not changing the amino acid, synonymous variants could affect mRNA stability, splicing, or translational e ciency, potentially causing ASE. 3 Missense variants can affect the function of the RNA, possibly leading to a difference in expression between alleles. 3ariants in the 3' untranslated region (UTR) can in uence mRNA stability, localization, and translation, all of which can contribute to ASE.Variants identi ed in the 5' prime UTR could in uence ASE by affecting the initiation of translation and the stability of mRNA thus altering the amount of protein produced from each allele. 4Lastly, variants in splice regions can impact allele expression by altering splicing e ciency, exon skipping, creation or loss of splice sites, or affecting regulatory protein binding, all potentially in uencing mRNA function and expression. 4E may also arise from trans effects, or genetic in uences on the other chromosome of an affected gene or elsewhere in the genome. 5,6 pigenetic factors, such as DNA methylation or chromatin structure, also have the potential to signi cantly impact gene expression between alleles. 1,6 nterestingly, ASE predominantly manifests as tissue-speci c phenomena, with loci displaying distinct expression patterns across different tissues. 2,5 herefore, ASE analysis can provide a way to inspect gene regulation patterns and their broader impact on biological pathways in speci c tissues.
Advances in next-generation sequencing technologies, particularly RNA-sequencing (RNA-Seq), have revolutionized our ability to analyze gene expression and genetic variation.Furthermore, long-read RNA-Seq allows for the straightforward identi cation of variants that are inherited together as haplotypes from full-length transcripts. 7,8 he loci of these haplotypes can then be overlaid with short-read RNA-Seq reads.This integration provides the read counts of nucleotides existing at each locus, which can then be used to quantify the expression level of each allele for each gene that contains heterozygous loci.Finally, expression levels of each allele can be compared against one another to identify an allelic imbalance.
The study of ASE in horses is enabled by the availability of a high-quality reference genome 9 and longand short-read sequencing technologies, contributing to a more comprehensive transcriptome annotation. 10,11 [14] In this study, we performed an ASE analysis of the equine transcriptome using a combinatorial Iso-seq and short-read RNA-Seq approach.Our primary goal of this research is to contribute to the Functional Annotation of the Animal Genome (FAANG) project, a large-scale collaborative effort aimed to identify all functional elements for animals. 15To accomplish this, we provide a comprehensive resource of allelic expression detected from linked heterozygous loci.We have designed this publicly accessible database to enable future research into important regulatory variants that may impact equine health and disease at the molecular level.By extending the examination of ASE to the equine species using this method, we can further understand the uses of emerging next-generation sequencing technologies and the mechanisms underlying gene regulation in the equine genome.

Generation of Data
Data from nine tissues from prior FAANG analyses were selected for ASE analyses: lamina, liver, left lung, left ventricle of the heart, longissimus muscle, skin, parietal cortex, testes, and ovary from 4 healthy Thoroughbreds (2 mares and 2 stallions). 16,17 his selection aimed to encompass a broad spectrum of biological functions and, consequently, varied gene expression.
The RNA isolation for Iso-seq was performed separately from the same tissues as the RNA utilized for mRNA-Seq, using an identical protocol. 11For Iso-seq, we selected the highest quality RNA per sex for each tissue (except for parietal cortex and sex-speci c tissues).Since parietal cortex was the pilot tissue, long-read RNA sequencing was performed on all four horses.All selected RNA samples had integrity (RIN) values greater than or equal to 7. Selected tissues were processed for Iso-seq in a single batch.The cDNA libraries were developed and sequenced at the UC Berkeley QB3 Genomics core facility.Two randomly selected libraries were combined and then sequenced together on a single SMRT cell of the PacBio Sequel II system, as previously described. 10,11 IP-seq data for the histone modi cations evaluated in this study were sourced and analyzed from prior publications that examined the same eight tissues. 11,18 ntifying Haplotypes The overall work ow for identifying ASE events is outlined in Supplementary Fig. 1.Key to our approach was the use of Iso-seq data to extrapolate haplotypes for each horse individually, utilizing the isophase Cupcake software v29.0. 19The software's function was to identify variations between otherwise identical full-length reads, thereby pinpointing speci c positions of heterozygous SNPs.

Integration of Short-Read RNA Library
To prepare short-read RNA-Seq data for analysis, sequences were trimmed to remove adapters low-quality read, and PCR duplicates using trim-galore v0.6.10 20 and Cutadapt v4.7. 21Read qualities were inspected using fastQC v0.11.7 22 and multiQC v1.16 23 and ltered for reads > = 50bp and quality > = 30.From the identi ed heterozygous SNP loci, the varying nucleotides expressed at these positions were generated using SAMtools mpileup from SAMtools 1.18. 24We ensured that both the short-read and long-read RNA-Seq analyses were based on the same tissues and samples.

Quantify Expression for Heterozygous Loci
After overlaying reads found at the heterozygous loci, we quanti ed the expressed nucleotides at these positions.This was accomplished using a custom Python tool. 28This tool tallies the occurrences of each nucleotide variant at a given locus from the output of the SAMtools program.For example, if a particular position has both 'A' and 'G' variants, the tool calculates the frequencies of 'A' and 'G' in the reads.This process is repeated across all identi ed heterozygous sites in the genome.

Quantify Allele Expression Per Haplotype
The expression values from these heterozygous positions were then aggregated to quantify expression values representing each allele.Expression values of each nucleotide at each heterozygous loci were summed according to their respective haplotype sequence.When a certain haplotype sequence is expressed more frequently compared to its counterpart, it is evidence that this particular allele is being expressed at a higher level.
The distribution of read counts across all samples is provided in Supplementary Fig. 2.

Identify Signi cant Allele Speci c Expression
After deriving allele expression, we then pinpointed ASE using ltering and statistical techniques outlined below.First, we excluded cases where neither of the allele expression values being compared met or exceeded a threshold of 10.We then assessed ASE by examining the differential expression between haplotypes.To do so, we applied a log transformation to calculate allele expression fold change (aeFC) between both allele expression values.The equation for aeFC is shown below: The distribution of aeFC across all samples is provided in Supplementary Fig. 3.
For further support of aeFC results, we used another statistical test to examine disparities between calculated expression values.Operating under the null hypothesis that allele expression values for any gene should be roughly equivalent between alleles, we calculated p-values using a binomial test.
Considering the count-based nature of our data, we employed the Benjamini-Hochberg procedure to manage the expected false discovery rate, yielding adjusted p-values.
For the nal stage of signi cant ASE identi cation, we established stringent criteria to distinguish instances of signi cant allele expression imbalance: the aeFC between two alleles had to be at least an absolute value of 2, the adjusted p-value needed to be ≤ .05,and at least one of the calculated allele expression values be ≥ 5.

Incorporating Histone Modi cation Data
ChIP-seq data was integrated by delineating a region extending from the initial to the last heterozygous position in our haplotype transcripts.Leveraging this de ned region, we ascertained any overlaps with key chromatin peaks, speci cally H3K27me3, H3K27ac, H3K4me1, and H3K4me3, on a tissue and sample basis.

Identify Genes with ASE
We then identi ed the genes shown to exhibit ASE by employing Ensembl's Variant Effect Predictor (VEP) v10 25 on the delineated heterozygous loci.VEP may provide multiple annotations for a single variant, therefore, variants predicted as intronic or intergenic were ltered out to only annotate variants strictly from RNA-Seq.If a variant's annotation was predicted solely as intergenic or intronic, the data was not included in the downstream analyses.

Validation of ASE Events
Our validation set consisted of short-read RNA-seq data from 66 liver samples, 41 males and 26 females, with no evidence of liver disease.These samples were comprised of 12 breeds including 27 Quarter Horses, 17 Warmbloods, 8 Thoroughbreds, 3 Andalusians, 3 Arabians, 2 Lusitanos, 1 Percheron, 1 Shire, 1 Ponys of the Americas, 1 Friesian, 1 Mustang, and 1 Gypsy Vanner.The average age of this cohort was 3.49 years and ranged from 1 month to 8 years of age.RNA-seq was performed with Illumina polyAselection with a read length of 2 x 150bp.Adapter trimming, poly-A trimming, N trimming, quality trimming, length ltering, and removal of PCR duplicates were conducted using HTStream, version 1.3.3. 26Reads were aligned to EquCab.3.0 using STAR, version 2.7.10b. 27 Expression counts at identi ed ASE loci in liver were generated using SAMtools mpileup from SAMtools 1.18. 24Loci that were not heterozygous were ltered out.We used the same signi cance criteria as our original cohort -aeFC ≥ 2 and adjusted p-value ≤ .05alleleexp.foldchange = log2(allele1exp.value) − log2(allele2exp.value)

Allele Speci c Differentially Expressed Gene Enrichment Analysis
In our investigation of allele-speci c expression (ASE) and its impact on biological pathways, we performed gene enrichment analysis (GEA) using the KOBAS KEGG Orthology-Based Annotation System. 28Our method involved a deliberate combination of allele speci cally expressed genes (ASDEGs) from various samples on a tissue-speci c basis.By inputting these lists of genes into KOBAS for each tissue type, we could identify pathways with a signi cant proportion of ASDEGs.P-values were supplied by KOBAS for each impacted pathway, and signi cant pathways were identi ed as having a p-value ≤ 0.05.
Data frame management and statistical analyses were performed using scipy 29 , numpy 30 , and pandas. 31ta visualization was achieved using matplotlib 32 and seaborn. 33

Allele Speci c Expression in the Horse Genome
Recent data from the equine functional annotation of animal genomes (FAANG) initiative was leveraged for this analysis. 10Haplotypes for each horse/tissue sample were identi ed from Iso-seq data.Across all samples, we identi ed 87,174 heterozygous loci.Using these loci to differentiate alleles, and subsequently quantifying the nucleotide reads at these positions using associated short-read RNA data from the same horse/tissue sample, 42,900 allele expression events were compared.After ltering and performing statistical analyses described in Methods, we compiled this data into an allele expression resource.Using this resource, we identi ed 635 (1.48%) of allele expression events as having demonstrated ASE.ASE was found to occur in each tissue analyzed, with the liver containing the highest proportion of ASE occurrences (Table 1).Of the genes showing evidence of ASE, referred to here as allele speci c differentially expressed genes (ASDEGs), 80 exhibited ASE in more than one tissue or sample (Figure 1).Table 1.Distribution of Analyzed Genes Across Tissues enumerates the number of allele comparisons within each speci c tissue type (2 alleles for each comparison).The "Signi cant Allele Imbalance" column identi es the subset of these alleles that exhibited notable expression differences from the expected equilibrium in our study.

Table 2. Variant Types Detected in ASE Events
Variant types identi ed in allele-speci c expression events within the study's sample set.Variant predictions were made using VEP. 25

Differentially Expressed Gene Enrichment Analysis
We identi ed 168 KEGG pathways containing a signi cant proportion of ASDEGs, including metabolic pathways, endocytosis, and the Ras signaling pathway.In our study, the liver contained the greatest number of pathways signi cantly impacted by ASE (Figure 3).

Validation of Allele Speci c Variants
To validate our putatively identi ed ASE loci in liver tissue, we examined the loci in a larger dataset of liver tissue, consisting of 66 samples from horses of various breeds.All 155 heterozygous loci identi ed in liver tissue from our original FAANG horses were tested in the validation set, resulting in 8,849 heterozygous loci expression comparisons.Speci cally, 7,436 (84%) of the comparisons made across all n=66 samples in our validation set were con rmed to show ASE, with 7019 (94.4%) of these comparisons in the same direction (i.e.allele A demonstrates higher expression and allele B demonstrates lower expression: Figure 4).Of the 155 ASE loci we tested, 96.7% showed ASE in at least one of the samples used in our validation set, with 85 (54.8%) showing ASE in at least 90% of the n=66 samples tested (Supplementary Figure 4).There was no speci c effect of breed (data not shown).

ASE Analysis
The foundation of our ASE analysis was the identi cation and assignment of heterozygous loci.The incorporation of full-length transcripts from Iso-seq from the equine FAANG initiative enabled these assignments.Short-read RNA-Seq's segmented view of genetic sequences presents challenges in congruent sequence construction, 1 while the unfragmented view of the transcriptome obtained from Isoseq facilitates a more robust identi cation of haplotypes. 7,8 [8] To reduce the likelihood of including artifacts or minor variations that do not represent true differential expression, we excluded ASE events with variants that were only identi ed as within intergenic or intronic regions.Given the nature of RNA-Seq data, this ltering ensures that the analysis prioritizes variants that are most pertinent to the corresponding genes.This supplementary data could possibly be used to extend currently annotated transcribed regions, since this data was generated strictly from high-quality RNA-Seq.Additionally, we excluded cases where neither of the allele expression values being compared met or exceeded a threshold of 10 reads.Lastly, sequencing data from sex chromosomes was also ltered out because of this study's focus on autosomal genes.These ltered out events are still available for analysis in supplementary materials.

Tissue Regulation via ASE
One of the main goals of this study was to identify tissue speci c ASE.We discovered genes demonstrating ASE in one tissue, while exhibiting equal expression across alleles in other tissues.This suggests that the regulatory mechanisms contributing to these ASDEGs are unique to the particular tissues where they were found. 1,2 uch speci city could be due to tissue-speci c promoters, enhancers, or other regulatory elements that in uence gene expression differently in each tissue type. 1,3The apolipoprotein E gene (APOE), was among these genes and demonstrated ASE in the two liver tissues examined (one from each sex), while having equal allele expression in testis, parietal cortex, and lung tissues.Notably, all ASE events involving APOE favored the allele with the same missense variant at locus chr10:15714449.Another gene in this study, SLC6A17, is a part of the SLC6 family of transporters.This gene exhibited ASE in 2 out of 4 parietal cortex tissues in this study.Interestingly, both cases of SLC6A17 ASE were in parietal cortex tissue of mares, while stallions had a bi-alleleic expression for SLC6A17 in parietal cortex tissues.Both ASE events involving SLC6A17 favored the allele with the same 3 prime UTR variant at locus chr5:54600372.This example alludes to the use of our resource to compare ASE across sexes, helping to identify putatively sex-speci c tissue regulation.Lastly, transmembrane glycoprotein gene ENPP5 demonstrated evidence of ASE in 2 out of 2 samples of liver tissue used in this study, while having equal expression across alleles in parietal cortex and lung tissues.Both instances of ASE involving ENPP5 involved favoring alleles with 3 prime UTR variants at chr20:45654742.

Histone Modi cations
ChIP-seq data from the same FAANG horse/tissue samples was integrated to provide a broader evaluation of allele-speci c expression in the context of annotated histone modi cations. 21,35 his leads to an improved perspective on the epigenetic acting in uences in allele-speci c gene expression, illuminating the complex interplay between genetic variations and epigenetic regulation.Variants within H3K27ac peaks, typically present at active transcription start sites (TSS), highlight the potential relationship between coding changes within TSS and allelic expression. 35Additionally, intersection of ASE events with H3K4me1 peaks, often associated with enhancer regions, suggests that variants within these regions could lead to enhanced expression relative to the other allele. 35Variants were also identi ed within H3K4me3 peaks, which are commonly identi ed near promoters of actively transcribed genes.This suggests that a particular promoter sequence may be favored for transcription. 35Similarly the repressive mark, H3K27me3, further emphasizes the possible interplay between allele imbalance and the epigenetic landscape. 35The co-localization of ASE with active marks such as H3K27ac and H3K4me1, often found at transcription start sites and enhancer regions, respectively, suggests that these epigenetically active domains may predispose certain alleles for increased expression by facilitating a more accessible chromatin state. 35Simultaneously, the intersection with H3K4me3, associated with promoters of actively transcribed genes, and H3K27me3, a mark of transcriptional repression, indicates a nuanced regulatory landscape where alleles may be differentially expressed due to the combinatorial effects of epigenetic modi cations. 35These overlapping epigenetic regions may serve as hotspots for ASE, where the orchestration of gene activation and silencing is ne-tuned by histone marks and coding variants.However, it is important to note that not all impactful epigenetic markers can be found in transcribed regions.Interestingly, a large portion (~ 30%) of heterozygous loci within ASE events were not found to be within this histone modi ed regions.This could mean that the nucleotide variation across alleles may be directly responsible for the signi cant expression disparity.

Validation
To further validate our ndings, we employed short-read RNA sequencing data from 66 additional equine liver samples across various breeds of horses.Liver tissue was selected for validation because it showed the highest frequency of ASE events in our original cohort (Table 1).This robust validation approach enabled us to con rm the reliability and reproducibility of our ASE observations across breeds and ages of horses, reinforcing the utility of our integrated Iso-seq and short-read RNA-Seq methodologies for uncovering the complexities of gene expression regulation and the prevalence of ASE in the equine genome.

Limitations & Future Direction
This research presents the rst ASE analysis of its kind for the horse, however there are limitations.First, the FAANG sample size was relatively small, and we did not analyze all tissues, which may affect the generalizability of our ndings.Additionally, we did not have parental sequencing to inspect the origin of heterozygous loci.It is also important to note that this combinatorial RNA-Seq method simply will not detect all instances of ASE. 36Our approach of using heterozygous gene expression markers reduces the total number of ASE events we could identify in the study since some locations in the genome may be homozygous yet still exhibit ASE due to epigenetic factors alone, such as parental imprinting. 37Studies utilizing both RNA-Seq and whole genome sequencing have estimated the percent of genes exhibiting ASE from 1-20%, depending on the strength of statistical ltering. 1,36,37 I this study, we used high ltering standards to declare a signi cant allele imbalance and identi ed ASE in 1.48% of alleles compared.While we may not be able to detect all ASE events, using this method provides a way to inspect linked variants and their putatively high impact on gene expression.Future investigations will extend this approach to a larger set of samples and tissues.Furthermore, the identi ed heterozygous loci used in this analysis could be overlaid with short-read RNA-Seq from horses in different developmental stages.This may demonstrate that speci c allele sequences are pertinent to development and become otherwise unneeded as the horse reaches maturity.This integration will lay the foundation for a deeper understanding of the intricate relationship between imbalances in allele expression and their role within the equine genome.

Conclusion
This study introduces the rst multi-tissue analysis of ASE speci c to the equine genome, spanning 4 individual Thoroughbred horses and 9 diverse tissues, with validation of liver ASE across multiple horse breeds.Here, we provide an allele expression resource for the equine community to advance future gene regulation investigations.This resource was designed to easily allow researchers to investigate genes of interest and set their own ltering criteria to detect allelic imbalance.Additionally, this main database was divided into tissue speci c tracts, thereby allowing for tissue-speci c allele expression analysis.With this resource, we inspected the heterozygous loci across alleles, potentially responsible for signi cant regulation of gene expression, identi ed pathways containing a signi cant proportion of these regulatory events, and pinpointed ASE patterns on a tissue-wide scale.As a result, we have demonstrated the potential that this method provides to enrich our understanding of the intricacies of equine genetic regulation by identifying notable loci variations that putatively have a signi cant impact on gene expression.