Evolutionarily young L1 subfamilies are expressed in post-mortem brains, in in vitro models and in blood
L1 transcriptional deregulation is observed in multiple human disorders including ASD(42,43), through mechanisms that are not well understood. To explore the potential impact of L1 elements in ASD, we assessed whether and which L1 subfamilies are expressed in samples and experimental models of ASD. We retrieved public bulk RNA-seq data from post-mortem ASD brains(44). The dataset consists of a total of 31 individuals (from 15 ASD donors and 16 controls) with expression profiling of anterior cingulate cortex (ACC) and pre-frontal cortex (PFC). Data from both tissues was available for 10 individuals (7 ASD and 3 controls(45)), while for the remaining 21 individuals (8 ASD and 13 controls) expression data were produced from either ACC or PFC. Therefore, analyses were performed on a total of 41 RNA-seq derived samples, 18 for ACC (9 ASD and 9 controls) and 23 for PFC (13 ASD and 10 controls). We also retrieved RNA-seq data from induced pluripotent stem cell (iPSC) and neuron cells knockout (KO) for ten ASD-relevant genes (AFF2, ANOS1, ASTN2, ATRX, CACNA1C, CHD8, DLGAP2, KCNQ2, SCN2A, TENM1)(46). The dataset comprises a total of 87 RNA-seq samples, 42 from iPSCs and 45 from neurons. KOs were performed for each ASD-relevant gene through a CRISPR gene editing method to insert premature termination sites(46). Furthermore, we obtained RNA-seq data from whole blood of 73 pairs of discordant ASD siblings for a total of 146 samples.
To quantify L1 expression two software were used: TEspeX(45) and SQuIRE(47). TEspeX measures TE expression not counting reads possibly generated from TE fragments embedded in the exons of annotated transcripts, and cumulatively quantifies the expression of TE subfamilies annotated in RepeatMasker(29) by counting the sequencing reads mapped to the consensus sequence of each TE subfamily. SQuIRE(47) allows to count reads mapped to the genomic loci where annotated TEs are found. We quantified the expression of a total of 117 human L1 subfamilies. Firstly, we used TEspeX to determine which L1 subfamilies are more highly expressed independently from the host transcripts. We then exploited SQuIRE to pinpoint which specific L1 fragments belonging to the most expressed subfamilies are transcribed at the locus-specific level. For each dataset, we defined as expressed all L1s fragments with an average of at least 200 reads mapped among all samples. According to our analysis, L1HS is the L1 subfamily with the highest expression level in all datasets, followed by several L1PA subfamilies (Additional file 1). In ASD brain samples, KO cell lines and all their controls, L1HS and L1PA subfamilies make up ~80-90% of the total normalized L1 expression, while all other L1 subfamilies cumulatively show a modest overall expression (Figure 1A). On the other hand, L1HS and L1PA subfamilies constitute only about 35-40% of the total normalized L1 expression in the blood dataset, for both ASD and controls (Figure 1A).
Evolutionarily young TEs would be expected to show a lower number of mutations which might increase their probability to be transcribed compared to older TEs. A measure of the evolutionary age of TEs is measured as base mismatches in parts per thousand (MilliDiv). It is proportional to the sequence divergence of a given element compared to its consensus sequence. We retrieved MilliDiv values for each L1 fragment annotated in the human genome (version hg19) from RepeatMasker(29) and calculated the average MilliDiv for each subfamily. As expected, the average MilliDiv for L1HS and L1PA fragments is significantly lower compared to that of the fragments of other L1 subfamilies and, interestingly, the average MilliDiv for the group of expressed L1 elements is lower than the corresponding group containing also the non-expressed ones (Figure 1B). We therefore decided to focus our analyses on L1HS and 20 L1PA subfamilies.
L1 elements result upregulated in post-mortem brains of specific ASD individuals and ASD model samples
We wanted to assess whether any of our ASD samples would show altered expression of young L1 elements compared to controls. We therefore measured the expression of L1s annotated in the human genome (version hg19) with SQuIRE(47) in all samples, and performed L1-related differential expression analysis independently comparing ASD samples versus controls. To reduce the risk of quantifying the expression of small L1 fragments whose expression may be strictly linked to their host genes, we limited our analysis on likely full-length (>5,000 nts) fragments showing an average of at least 200 reads mapped among samples of each dataset and belonging to L1HS and L1PA subfamilies. A total of 128,506 L1HS and L1PA fragments are annotated in RepeatMasker(29) (hg19), of these, 9,077 are more than 5,000 bp long. From SQuIRE analysis we classified 100 L1s for Velmeshev et al and 175 L1s for Deneault et al as expressed (>200 average reads mapped among all samples) (Additional File 2). Of them 36 were upregulated in ASD brain samples or KO cell lines compared to controls in both datasets. We then performed L1 expression analyses on each single individual (see methods). Briefly, the number of reads mapped to each L1 fragment was first normalized with DEseq2(48). To infer for differential expression in each single individual in the absence of replicates we calculated a Z-score for each L1 fragment comparing its expression level in each sample against the distribution of its expression levels in the group of controls. For each sample, we classified as potentially upregulated all L1 fragments characterized by a Z-score > 3, and as potentially downregulated all L1 fragments characterized by a Z-score < -3. We then calculated the net number of expressed fragments by subtracting the number of downregulated L1s to the number of upregulated L1s for each sample. Overall, ASD brain and KO samples always show a positive net number of expressed L1s compared to controls. In most control samples the net number of expressed L1s is either null or very low but positive, except for two control samples of the Velmeshev et al ACC dataset, which show a moderately negative value (Figure 1c). In the case of the Velmeshev et al dataset, only three ASD samples in the ACC showed a considerably higher amount of L1 net expression compared to all other samples, showing a net number of about 60 upregulated L1s, while all other samples do not show more than 25 potentially upregulated L1s (Figure 1C). On the other hand, no PFC sample shows a substantially increased number of upregulated L1s compared to all other samples. Furthermore, the three samples characterized by strong L1 upregulation compared to all other samples in ACC (SRR9292614, SRR9292620, SRR9292621) do not show the same trend in their PFC counterpart (SRR9292613, SRR9292619, SRR9292623). Therefore, the observed L1 upregulation probably occurs only in specific areas of the brain. In the Deneault et al dataset, both iPSC and neurons KO for ATRX show a substantially higher mean net number of expressed L1s (about 60) compared to the other KOs, which do not exhibit an average net number of expressed L1s higher than 30 (Figure 1C). It is interesting to note how in all KO samples the average net number of expressed L1 compared to controls is always positive. We consider this result as an indication of L1 upregulation, suggesting that the loss-of-function of ASD-related genes might lead to an increase in L1 transcriptional activity.
We performed the same L1 expression analyses on a total of 146 RNA-seq whole blood samples derived from individuals affected by ASD and their unaffected siblings. However, no L1 fragment showed a Z-score above 3 in any ASD or control sample. We then plotted the cumulative expression of L1 subfamilies quantified with TEspeX. Overall, all blood samples (ASD and controls) show similar total L1HS/L1PA expression levels (Supplementary figure 1A). Furthermore, the distribution of the ratio of the total L1HS/L1PA normalized expression between each pair of discordant siblings (ASD/Control) show the absence of alterations in L1 levels in the blood of discordant siblings (Supplementary figure 1B).
Taken together, these results suggest that only a fraction of ASD samples display a strong increase in the transcriptional activity of young full-length L1 elements probably specific to specific regions of the brain.
Evaluation of L1 expression in the context of mutated genes
iPSCs and neurons KO for ATRX display a pervasive upregulation of young FL L1 elements, while the KO of nine other ASD-relevant genes leads to a more limited increase of L1 expression. Hence, we speculate that the loss-of-function of specific genes may strongly affect L1 transcriptional regulation. ATRX encodes for a protein which contains an ATPase/helicase domain belonging to the SWI/SNF family of chromatin remodeling proteins(49,50). It has been demonstrated that ATRX knockout mouse ES cells exhibit increased chromatin accessibility at genomic loci occupied by retrotransposons(49) suggesting that this gene has a direct role in the establishment and maintenance of heterochromatin also at the level of L1s retrotransposons in human. To validate this hypothesis, we retrieved ChIP-seq data from a recent publication(50), where the authors performed a comprehensive ChIP-sequencing on ATRX in human cell lines. To assess whether ATRX may specifically act on genomic regions occupied by upregulated FL L1s, we computed a profile of ATRX ChIP-seq enrichment separately on the genomic coordinates of total and upregulated loci compared to controls L1HS/L1PA, as well as of 1,000 of 6,000 bp long genomic coordinates randomly selected within introns as controls. The profile was computed with DeepTools(51) computeMatrix utility and plotted with plotProfiles. In both iPSC and neurons, total and upregulated L1HS/L1PA show a strong enrichment compared to both their up/down-stream boundaries and to random intronic segments (Figure 1D). Together with the strong L1 upregulation observed in ATRX KO cells, these results confirm that ATRX loss-of-function may directly lead to an increased transcription of young FL L1 elements.
We then asked whether ASD - ACC post-mortem samples characterized by strong L1 upregulation might be affected by disruptive mutations within genes having a role in repressing TEs transcription. We took advantage of the public blood-derived whole exome data samples of ASD individuals from Velmeshev et al(44) to search for putative deleterious mutations within ASD-relevant genes. Upon aligning raw fastq RNASseq reads to the reference genome (version hg19), the GATK4 variant discovery pipeline was used to detect all non-reference short variants, following GATK4 best practices(52) as described in methods. The variant discovery phase resulted in the detection of ~35,000 total non-reference variants for each sample. A filtering strategy was then employed to select a total of about 20 candidate strong deleterious variants per sample. We then selected only the variants within SFARI genes (SFARI score 1, 2, 3 or S)(11) finding 16 potentially deleterious variants within SFARI genes in a total of 10 samples (Additional file 3). While we were not able to find a suggestive candidate variant for all ASD individuals, the three samples from the Velmeshev et al dataset characterized by a strong L1 upregulation bear candidate variants within genes DNAH3, PRUNE2, MBD1 and LAMA1. Although none of these genes has yet been specifically linked to the transcriptional control of TEs, further studies are needed to fully assess a possible link.
Transcriptome-wide level of deregulation in ASD brain and in vitro neurons correlates with L1 upregulation
Altered gene expression is currently considered as one of the main molecular manifestations of ASD(14–16,18). Since specific ASD samples show strong L1 upregulation compared to controls in independent datasets, we wanted to investigate whether altered L1 expression is associated to gene expression alteration. To define differentially expressed (DE) genes, we used the same approach applied to FL L1HS/L1PA fragments, which allowed us to characterize each sample individually in comparison with controls in absence of replicates. We applied this method only to genes associated to an average number of mapped reads above 200 across samples of each dataset. We refer to this subset of genes as expressed genes. Genes were considered DE if associated to a |Z-score| > 3. Overall, the number of DE genes is not homogeneous among samples, with some ASD samples showing a number of DE genes very close to the controls (Figure 2A). It is interesting to note that samples which showed stronger L1 upregulation are among the samples with the highest number of DE genes, especially in post-mortem brain and in vitro neurons (Figure 2A). For example, sample SRR9292620 is associated with both the highest number of net expressed L1s and with the highest number of DE genes. Furthermore, the top six samples of the Velmeshev et al - ACC dataset with the highest L1 net expression are at the same time the six samples with the highest number of DE genes. Similarly, in vitro neurons KO for ATRX, TENM1, KCNQ2 and DLGAP2 show the highest L1 net expression and gene dysregulation.
In order to further explore this result, we performed gene DE analysis with DEseq239 by exploiting the technical replicates of KO iPSC and neuronal cell lines. The number of DE genes was plotted for all KOs (FDR < 0.05, |Log2FoldChange| > 0.5). While we observed transcriptional deregulation for all KOs in iPSCs, only KO samples for ATRX show a considerable number of DE genes also in neurons (Figure 2B). Given that the very same samples showed the strongest L1 RNA upregulation, consistently with what observed in the Z-score analysis, these results suggest that FL L1HS and L1PA elements might be associated to a pervasive gene deregulation in mature neurons of ASD subjects.
Upregulated L1 elements are expressed within introns of actively transcribed genes
Given the original observation that gene deregulation is generally stronger in post-mortem brain tissue and in vitro neurons showing L1 upregulation, we wanted to explore the genomic context of upregulated L1s and assess whether their transcription might be independent from the host transcripts. First, we analyzed ChIP-seq data from the NIH Roadmap epigenomics mapping consortium(53), focusing our attention on six major histone modifications (H3K4me1, H3K4me3, H3K9me3, H3K36me3, H3K27me3 and H3K27ac) from post-mortem mid frontal lobe of a control individual. Upon retrieving ChIP-seq data in wig format, and converting them to bigwig using USCS wigToBigWig utility, we computed a profile of histone mark enrichment on a bed file containing the genomic coordinates of a unique list of upregulated L1s in all samples and of all FL L1HS/L1PA annotated in the human genome separately. The profile was computed with DeepTools(51) computeMatrix utility and plotted with plotProfiles, as described in methods. Overall, all histone marks profiles show a substantial drop at the level of L1 genomic coordinates, which may be due to the typically lower mappability of repeated regions. However, upregulated L1s show a slightly weaker enrichment for transcriptional repressive marks H3K9me3 and H3K27me3 compared to total FL L1HS/L1PA, both within the boundaries of the L1 fragment and their flanking upstream and downstream regions. Interestingly, upregulated L1s exhibit a considerably stronger enrichment for activator marks (H3K36me3, H3K4me1 and H3K27ac) at the level of their flanking regions (Figure 3A).
Furthermore, we compared the percentage of upregulated, expressed and total FL L1PA/L1HS which overlap expressed genes. In both datasets the percentage of upregulated and expressed L1s found within expressed genes greatly exceeds the percentage of total L1s. However, upregulated L1s are neither enriched nor depleted within expressed genes compared to non-deregulated and expressed L1s (Supplementary figure 2A). Subsequently, we overlapped the genomic coordinates of upregulated L1s with the genomic coordinates of exons and introns retrieved from UCSC. All L1s not overlapping with either introns or exons were considered intergenic. For the vast majority of samples in all datasets, the number of upregulated intronic L1s greatly exceeds the number of exonic and intergenic L1s, especially in the samples with the highest number of upregulated L1s (Figure 3B). These results suggest that upregulated L1s may be enriched within introns of actively transcribed genes.
Given that most upregulated L1s are intronic, we wondered whether their upregulation might be due to an increased intron retention. We therefore performed a differential intron retention analysis focused on the groups of samples characterized by L1 upregulation with IRFinder(54). While we detected only a few differentially retained introns in iPSC and neurons KO for ATRX, we detected 5 downregulated and 806 upregulated introns in the group comprising samples SRR292614, SRR292620 and SRR292621 (Figure 3C). However, only one of these overlapped an upregulated L1, suggesting that intron retention is not at the basis of the observed L1s upregulation. Subsequently, we wanted to test whether intronic upregulated L1s would be spliced into novel transcripts together with nearby annotated exons. To this aim we counted, for all groups of samples characterized by L1 upregulation, the percentage of mapped reads whose fragments are shared by pairs of genomic segments comprising: a) a random expressed exons and its closest expressed exon, b) an upregulated L1 and its closest exon, c) a random 6,000 bp long sequence sampled from introns of expressed genes and its closest exon. The distribution of the percentage of shared reads between upregulated L1s and their closest exon is similar to that of the random intronic segments, and significantly lower than the average percentage of shared reads between couples of neighboring exons (Figure 3D). Taken together, these results suggest that upregulated L1s are mostly located within introns of expressed genes and likely expressed as independent transcriptional units.
Upregulated L1 elements may negatively impact the expression of specific ASD-related host genes
Starting from the observation that intronic upregulated L1s are mostly expressed as independent transcriptional units, we wanted to assess how they relate with the expression of the host genes. We limited our analysis to the samples with the highest L1 upregulation: samples SRR292614, SRR292620 and SRR292621 from post-mortem ACC (Velmeshev et al), and iPSC and neurons KO for ATRX. First, we counted the number of upregulated L1s found within the genomic loci of DE genes for each sample. We considered as DE those genes and L1s featuring a |Z-score| above 3. Overall, most upregulated L1s are not found within DE genes (Figure 4A). However, focusing on L1s mapping within DE genes, we found that 5 out of 11 upregulated L1s are located inside down-regulated genes in post-mortem ACC (Figure 4B, Table 1), suggesting that L1 upregulation might have a negative impact on the expression of a subset of specific host genes in the brain. To further explore this result, we selected all intronic upregulated L1s (Z-score > 3) in each of the three ACC sample displaying strong L1 upregulation (SRR292614, SRR292620, SRR292621) separately. For each of the three samples, we then associated the Z-scores of each intronic upregulated L1 with the Z-score of their overlapping genes. In all the three samples, Z-score associated to genes overlapping upregulated L1s are mostly negative (Additional file 4, Figure 4C), while genes overlapping non-upregulated L1s (|Z-score| < 2) do not show either a preferentially positive or negative Z-score (Additional file 4, Figure 4C). This result suggests that L1 upregulation might have a negative impact on the transcription of host genes for a subset of them.
A recent publication(55) revealed that, in mouse ESCs, L1-enriched genes are transcriptionally silenced by a not yet well understood mechanism which involves the transcription of the L1 element and the subsequent binding of the L1 RNA to the DNA at the level of specific loci. Indeed, through chromatin isolation by RNA purification (ChIRP) experiments, the authors highlighted ~25,000 genomic loci where the L1 RNA binds genomic DNA(55). Around 2,400 genes are found in proximity to these genomic loci (distance < 5 kbp). Interestingly, compared to a random distribution computed with equally sized sets of expressed genes, genes overlapping upregulated L1s and associated to a negative Z-score are enriched within the human orthologs of genes nearby L1-ChIRP peaks in all three ACC samples characterized by a strong L1 upregulation (SRR292614, SRR292620, SRR292621) (Figure 4D, Z-score > 5). This result reinforces the idea that the L1 upregulation occurring in some ASD samples might lead to the dysregulation of a physiological mechanism repressing the expression of overlapping intronic L1 by the host gene.
We then wanted to test whether genes associated to a negative Z-score, but overlapping upregulated L1s (negatively correlated genes), may be relevant for ASD. For each of the three tested samples (SRR292614, SRR292620, SRR292621), negatively correlated genes are enriched for SFARI genes, compared to distributions computed with equally sized sets of randomly expressed genes (Figure 4E). Furthermore, these genes are not significantly enriched for those highly expressed in adult brain(56) (Figure 4E). Interestingly, most SFARI negatively correlated genes are not unique to one of the three samples analysed, since 5 genes (1/3 of the total) are shared among all three samples (CADM2, CSMD1, DLG2, DLGAP1, GPHN) (Supplementary Figure 3A). Notably, the main function of these genes lies in synapse organization and function (Table 2). This evidence suggests that genes whose expression is potentially altered by L1 upregulation have similar roles crucial for neuronal function. One example of an L1 upregulated in all 3 samples, found within a SFARI gene (DLGAP1) and associated to a negative Z-score is displayed in figure 4F. These results may point to a subset of genes expressed in adult brain whose expression is negatively influenced by the expression of the overlapping intronic FL L1s.
To validate this expression pattern in an independent dataset, we exploited data from a recent study(57), where the authors calculated the coefficient of correlation between the expression of intronic and inter-genic L1s and the expression of the overlapping genes in 49 human tissues. The authors stratified L1s into elements expressed in all tissues and in those with a particularly high expression (>1.5 fold higher) in one tissue compared to all others. To ease data representation, we grouped the 49 tissues into 11 classes showing that the average coefficient of correlation between intronic tissue-specific L1s and overlapping genes was substantially lower for brain-derived tissues compared to all other tissues (Figure 4G). This adds further support to the existence of a transcriptional inverse correlation between intronic L1 and the host genes in healthy neurons for a subset of genes.
Taken together, these results suggest that L1 upregulation might, in some instances, lead to the downregulation of host genes. This could happen more frequently in genes important for ASD, with a mechanism not yet well understood.