Inference of molecular mechanisms of transcriptional regulation from co-expression data


 Co-expression analysis can be used to identify transcription factors and/or lncRNAs that directly regulate transcription of a gene. When all samples are in the same state, a transcription factor has to co-fluctuate with the target gene to be identifiable as regulatory. Co-fluctuation implies that a regulatory factor binds with intermediate strength. The binding is not too strong to make a factor constantly bound and not too weak to make a factor irrelevant. It follows that variables defining the binding of a factor are not all independent. For example, for transcription factors the following variables were suggested: expression level, binding strength and binding site accessibility. Any one of these variables can be made dependent. Relations between variable contain information about molecular mechanisms. This is true for regulatory lncRNAs as well. The significant difference is that the list of regulatory lncRNAs is unknown. Because of that, lncRNAs that were identified regulating transcription of a gene can only be compared to random lncRNAs. A method to distinguish regulatory lncRNAs by gene coverage was proposed.

interactions exist, but are not mature yet (Crossley et al., 2020). Our understanding of protein-RNA interactions is also relatively weak.
Binding strength varies among TFs. Strong binders can be reliably identified by ChIP, while weak binders might be missed. In many experiments, TFs identified by ChIP are used as ground truth, though those might constitute only a fraction of relevant TFs. It can be expected that expression of weak binders would fluctuate together with the expression of the target gene; i.e., that they would co-express. Thus, weakly binding TFs may be identified by co-expression analysis. Regardless of how lncRNAs make contact with DNA to regulate transcription, through RNA-RNA, RNA-DNA, protein-RNA or protein-protein interactions, they bind too, like TFs do.
Weakly binding lncRNAs can be identified in the same way. Molecular mechanisms inevitably become imprinted in the relationship between a regulatory factor and it's target. The analysis of the relation by (semi-)quantitative modeling, for example, might provide insight into how transcription factors and lncRNAs perform their functions at the molecular level.

1. Data sets.
Two data sets were (re)analyzed in this work. The data set#1 is the RNA-seq of purified transcription factories from HUVEC cell line (Caudron-Herger et al., 2015). The analysis was started from fastq files.
Intervals identified in CAGE experiment were taken as representative of active promoters in HUVEC cells. Specifically, file ENCFF179TLJ.bed was downloaded from ENCODE project site (Davis, 2018). If an interval was assigned to plus strand, 500 was subtracted from start coordinate and 100 was added to the end coordinate. If an interval was assigned to minus strand, 500 was added to end coordinate and 100 was subtracted from start coordinate. Intervals were annotated by closest gene (Ensembl release 104) based on the output of 'bedtools closest' command.
The data set#2 includes 74 RNA-seq experiments for human monocytic cells (Murray et al., 2021). There were two conditions per sample -unstimulated cells and cells stimulated with rhinovirus. Only expression data for the unstimulated condition were used in the analysis.
Processed data (file GSE166292_Ferreira_Normalised_CPM_counts_noFilter.tsv.gz) were downloaded from the NCBI SRA site.
DNAse-sensitive regions of human monocytic cells identified in Lee et al., 2020 were regarded as regulatory elements. The file ENCFF771GZH.bed was downloaded from the ENCODE project site. Intervals were not modified. If an interval lies closer than 20000nt to the gene start or gene end or overlaps the gene (Ensembl release 104), the interval was annotated by this gene.

Identification of adjusting transcription factors and adjusting lncRNAs.
The list of human TFs was downloaded from HumanTFDB database (Hu et al., 2019). The data set#2 was the source of expression data. For every TF present in the data set#2 (n=1363) the Pearson correlation coefficient between it's expression and expression of other genes (n=28458) was calculated. To identify adjusting TFs of (say) gene A five TFs with largest positive and five TFs with lowest negative were taken. The value of was assumed to measure the 'distance' between genes in a sense that genes that are farther from each other in the transcriptional network have lower absolute value of . Let denote values of Pearson correlation coefficients for ten candidate TFs chosen based on their high correlation with gene A, while Pearson correlation coefficients for whole set of TFs are . Relative distances were calculated: The mean relative distance for the th candidate TF is: A TF with minimal mean relative distance min� � was identified. The TF was assigned an 'adjusting' status. Z-transformation was applied to relative distances. T-test (alpha level 0.95; critical value 1.96) was used to estimate whether atanh( ) − atanh(min� �) is significantly different from zero. If the test was not significant, a TF was assigned an 'adjusting' status. The same calculations were carried out for lincRNAs (n=2170).

Transcription factor binding analysis.
The JASPAR2022 core database (Castro-Mondragon et al., 2021) was downloaded in TRANSFAC format. Only human TFs (n=504) were considered. Entries (i -motif position index, j -nucleotide index) of the binding matrix are numbers of observations. Binding matrix was converted to the matrix of probabilities , i.e. = ∑ 4

=1
. For a nucleotide sequence of length and binding motif of length the score was computed for every subsequence of length : The sequence score is equal to max( ) divided by motif length : The binding strength does not measure specificity. A sequence can contain one strong binding site or multiple weak binding sites and have the same value . This is the reason there is no dependence upon sequence length.

Sequence alignment of lncRNAs and the alignment analysis.
For every lncRNA, all transcripts were downloaded from Ensembl release 104 (Howe et al., 2021). LncRNAs of data set#1 were aligned to promoters of HUVEC cells; lncRNAs of data set#2 were aligned to regulatory sequences of CD14+ monocyte cells (see 'Data sets'). Sequence alignments were produced using blastn (version 2.6.0) with standard parameters. CAGE and DNAse-seq intervals were pre-annotated with gene names. The number of contacts that the lncRNA makes was equal to the number of different genes that transcripts of this lncRNA aligned to.
For lncRNAs from data set#2 distance and co-adjustment analyses were performed. The distance analysis is the evaluation of the minimum genomic distance (in nucleotides) between a gene given an lncRNA adjusts and a regulatory sequence this lncRNA aligns to. An lncRNA can adjust several genes, as well as align to more than one regulatory sequence. All possible combinations of genes and regulatory sequences were tested, and the minimum distance was taken as representative. If in all cases regions belonged to different chromosomes, the minimum distance was set to 2.5e08. Such cases were excluded from further analysis.
The co-adjustment analysis is the evaluation of co-expression (Pearson correlation coefficient) between a gene that a given lncRNA adjusts and a gene that a regulatory sequence that this lncRNA aligns to belongs to. Any lncRNA can adjust more than one gene, as well as align to more than one regulatory sequence. All possible combinations of genes and regulatory sequences were tested, and the following statistics were computed: mean, maximum and minimum Pearson correlation coefficients. For both analyses controls were produced by replacing adjusting lncRNAs with random lncRNAs. The structure of the input was preserved; i.e., the number of adjusting lncRNAs per gene; the number every lncRNA repeats itself, etc. were the same in control.

Analysis of correlation between transcription factor binding strength and DNA
accessibility.
In data set#2 55 human TFs (n=504) were adjusting and had known binding motifs. For the th adjusted gene with regulatory elements mean binding strength � = ∑ =1 and mean accessibility ��� = ∑ =1 were computed. Pearson and Spearman correlation coefficients between ̅ and ��� were estimated.
ChIP-seq results were available for 26 adjusting TFs (n=55). All ChIP-seq experiments were in K562 or HepG2 cell lines. DNAse-seq data were downloaded from ENCODE (file ENCFF274YGF.bed for K562, file ENCFF897NME.bed for HepG2). The correlation between occupancy from ChIP-seq experiments (column#7) and accessibility from DNAse-seq experiments (column#7) was estimated.

Identification of trans-acting lncRNAs.
Fastq files of the data set#1 were aligned to the human genome GRCh38 using bowtie2 with standard parameters. The alignment output was written in the sam format. Sam files were converted to bed format. 'bedtools coverage' command was used to compute gene coverage. The number of bases with non-zero coverage (column 2 of 'bedtools coverage' command output) was analyzed. The Ensembl gene annotation (release 104) was used everywhere.
Genes were divided into protein-coding and lncRNA according to Ensembl biotype and binned (n=20) according to the length with bin size 10000 nucleotides. The mean value of the number of bases with non-zero coverage ̅ was computed for each bin. Parameters of the regression model ̅ = + were computed using ordinary least squares (OLS) separately for proteincoding and lncRNA genes.
The aim was to divide lncRNAs into two groups based on their resemblance to protein-coding mRNAs. First, the lncRNA regression line was shifted, preserving the slope so as to be as close to the protein-coding regression line as possible in the sense that the quantity ∑ ( ̅ the inverse cdf of the t-distribution (the value of random variable such that the value of cdf for it is 0.05 provided degrees of freedom are equal to N). When both conditions were true, iterations were stopped. Removed lncRNAs were put into A-group. LncRNAs that stayed in the list were put into S-group.

Molecular mechanisms of transcription factors can be inferred from co-expression data.
Suppose we would like to identify the distance between TF1 and gene A (Fig.1) using partial correlation analysis. Partial correlation excludes the influence of another gene, for example, TF3 on the correlation. But what happens when the correlation coefficient between TF3 and TF1 is close to 1? In this case, the partial correlation coefficient between TF1 and gene A excluding TF3 1, / 3 would be close to zero. The partial correlation evaluates how well TF1 expression 'explains' gene A expression independently. Since the expression of TF1 is defined by TF3, i.e. not independent, the algorithm concludes that TF1 has little or no influence upon gene A. Also the algorithm can't tell whether TF1 acts upon TF3 or vice versa. Because of that, 3, / 1 would also be (close to) zero. Ultimately, both TFs would be identified as nonregulatory. The correlation between the two TFs can be large because their expression is defined by the 3rd TF. Again, referring to Fig.1, suppose 13 and 23 are both close to 1. Then, correlation between TF1 and TF2 is large, and none of the TFs appear as regulatory for gene A. See Supplemental Data for the numerical example.
The transcriptional network consists of genes (nodes) and inter-gene regulatory relations (edges).
The distance between genes is the length of the shortest path connecting them, measured in the number of edges. The main assumption is > , provided | | < | |. Obviously, 3 − 13 = 1 . Distances 3 and 13 can be replaced by sample Pearson correlation coefficients, i.e. |3 | − |1 3 | = 1 . The quantity 1 is an estimator of 1 and is called 'relative distance between TF1 and gene A'. The value of is calculated relative to a reference gene, which is TF3 in the example. In this work only transcription factor genes (n=1363) were used as references. The distribution of the Pearson correlation coefficient between two normally distributed random variables is close to normal (Hotelling, 1953). Therefore, it is assumed that has standard normal distribution. Fisher's Z-transformation can be applied to widen support from [−1,1] to [−∞, +∞] , but this changes results only slightly. Consequently, mean values of relative distances can be compared by t-test.
For some TFs binding motifs are known. Provided the knowledge of active regulatory elements, i.e. where TFs can bind, is it possible to predict adjusting TFs? It was shown that the occupancy of a sequence can be adequately modeled by TF expression level , strength of binding site and DNA accessibility (Blatti et al., 2015). According to the definition of adjusting TF the occupancy is linearly related to the expression level, i.e. ( , ) = + . It implies the three variables are not independent, i.e. for example, the knowledge of and allows evaluation of , if the TF is known to be adjusting. If single TF is considered, expression becomes constant and the relation between binding strength and accessibility can be analyzed across all genes that this TF adjusts. In data set#2 (n=16256) 55 TFs adjusted at least one gene.
The correlation between mean binding strength and mean accessibility varied between TFs from 0.6 to -0.2 (see Methods). At this point the model allows assigning directionality. TFs are characterized as adjusting due to "balance" between binding strength and accessibility. As one increases, another decreases. Thus, TF occupancy is kept within certain limits. Then, large positive correlation between binding strength and accessibility can only mean that TF binding causes increased accessibility. If accessibility causes binding, negative is to be expected. Notably, the TF with largest positive was ZNF263. ZNF263 is a repressor, which was implicated in trained immunity and enhanced response to immune stimuli (Edgar et al., 2021;Weiss et al., 2020). This functional description agrees with molecular mechanism suggested above. Hypothetically, ZNF263 primes genes for efficient transcription by decondensing regulatory DNA, but keeps transcriptional output low till stimulation.
TF occupancy depends upon sequence accessibility and how the TF binding is affected by DNA accessibility. The effect of accessibility upon TF binding is estimated by . Large positive means "no dependence on sequence accessibility; the TF increases DNA accessibility at the site of binding". Large negative means "highly dependent upon DNA accessibility". If binding of a TF is independent of accessibility, the correlation between occupancy and accessibility is expected to be low. On the contrary, TFs, whose binding depends upon accessibility, would be characterized by large values. Therefore, and must be negatively correlated. For 26 adjusting TFs Pearson and Spearman correlation coefficients between and were -0.31 and -0.28, correspondingly.

Molecular mechanisms of lncRNAs have more unknowns.
LncRNAs can bind due to protein-protein interactions, protein-RNA interactions or by nucleotide hybridization (DNA-RNA or RNA-RNA). If hybridization indeed plays a role, alignment of adjusting lncRNAs to regulatory elements has to differ from the alignment of random lncRNAs. Adjusting lncRNAs were identified for 1000 random protein-coding genes from data set#2. Adjusting lncRNAs hybridized closer to genes they adjust compared to random lncRNAs (p-value 0.064); the mean distance was 15086668nt. For seven random controls (n=110) the mean distance was shorter. For the majority of lncRNAs, a set of genes the lncRNA adjusts and a set of genes the lncRNA aligns to did not overlap. It was hypothesized that lncRNAs could contribute to co-expression between the two gene sets. The mean of mean, maximum and minimum coexpression (Pearson correlation coefficient) for adjusting lncRNAs were 0.154, 0.229 and 0.095, respectively. Control values (n=110) were larger in 0, 1, and 14 cases, respectively. It was concluded that genes that are linked by adjusting lncRNAs tend to coexpress (p-values are 0.0 for the mean coexpression and 0.009 for the maximum coexpression). Adjusting lncRNAs had to be contrasted to random lncRNAs, which can be regulatory and nonregulatory. Regulatory lncRNAs can be modeled. Models for adjusting and non-adjusting regulatory lncRNAs can be compared to draw conclusions about molecular mechanisms. The same is not possible for random lncRNAs. Consequently, the inference had to be less detailed.
To tackle this problem, a method to differentiate regulatory lncRNAs is proposed. The majority of lncRNAs are spliced. Transcriptional regulators stay in the nucleus, while other lncRNAs are exported. RNA molecules are degraded in the nucleus and cytosol by different systems (Bresson and Tollervey, 2018). Consequently, regulatory lncRNAs might be differentiated by splicednon-spliced transcript ratios. The difference in the ratio can be observed as the difference in gene coverage. Data set#1 is RNA-seq of transcription factories (Caudron-Herger et al., 2015). Since lncRNAs themselves are transcribed in factories, there must be two types of lncRNAs inside factories: regulatory (called A-group lncRNAs) and transcribed (called S-group lncRNAs).
Protein-coding mRNAs and lncRNAs had markedly different relations between coverage and gene length (Fig. 2).
A-group lncRNAs were distinguished from S-group lncRNAs based on the relation between gene coverage and length. A-group lncRNAs were identified as more dissimilar to proteincoding mRNAs. It was found that A-group lncRNAs aligned to significantly more promoters.
This was true for total, polyA and chromatin fractions (

Discussion.
The state and type of cells, which are a source of data for the co-expression analysis, define the transcriptional network that is being probed. Sometimes, data are collected across many states/conditions and even from different experiments. When a researcher is interested in TFs that contribute most to the difference between the two states, like cancer vs normal samples, this design must work effectively. When many states are included in the experiments, the likelihood that a regulatory interaction appears in the reconstituted network would depend upon its frequency among studied conditions. How well this network represents any real network is impossible to say.
Presumably, adjusting TFs bind as weak as possible for a regulatory TF. Thus, any weaker binding would make a TF irrelevant, while stronger binding TFs are baseline TFs. How well the division into adjusting and baseline TFs represents reality is an open question. Though several thousand TFs may constitute a continuum of binding strengths, a much lower number regulates transcription of any given gene. Studying adjusting TFs might be beneficial in itself, as they can be expected to generate variation in the transcriptional output between cells and in time.
There seem to be two main reasons why lncRNAs are harder to understand. First, regulatory lncRNAs can't be identified in experiment. In an experiment only physical properties can be discovered. The function has to be identified using physical properties plus certain assumptions.
The second big challenge is the prediction of RNA-protein interactions. RNA molecules perform their functions as ribonucleoprotein particles. DNA binding studies have successfully identified binding motifs for thousands of proteins. Perhaps, the same could be true for RNA-binding proteins. Once all bound proteins are known, prediction of function and targeting would become considerably easier.

Data availability.
All research data pertinent to this work including programs, source code, raw and processed analytical data were uploaded to Zenodo, DOI 10.5281/zenodo.5798678.

Conflict of interest.
I have no conflict of interest to declare.