1. Double shrinking (DOSH) algorithm.
A network consists of nodes connected by edges. In the transcriptional networks genes are connected by regulatory interactions. The neighborhood \({H}_{z}\) is a set of nodes directly connected to the node \(z\). For each of p genes in a network there are N samples of quantification data. Typically, \(N\ll p\). Because of that regression-based models require regularization. The proposed algorithm is called Double Shrinking (DOSH), because shrinking is conducted in two stages. The first stage is ridge regression. Ridge regression is conducted with the gene \(y\) as a regressand and all other genes as covariates. Believing that the linear model is the true one implies that there exists at least one subset of covariates that explains all the variance. Let’s consider what happens when a ridge regression model is estimated using gradient descent. The solution is unique [5]. Consequently, initial values can be arbitrary. Suppose the regression coefficients were initialized to their true values. If there is more than one subset of covariates that explain all the variance, the one with the smallest coefficients is chosen. Let’s call the chosen covariates informative. Regression coefficients of non-informative covariates are shrunk, while regression coefficients of informative covariates grow. The value of the ridge penalty might drive further shrinking. Since changing coefficients of informative covariates increases the error more than changing coefficients of non-informative covariates, the former stay larger. The takeaway is that ridge regression chooses informative covariates that explain all the variance and whose regression coefficients are shrunk less than others.
Before conducting ridge regression it is common to transform data to empirical mean of zero and empirical variance of one. DOSH does the first, but not the second. As a result, small covariate values have large coefficients and are shrunk more. This is a desired effect, since low gene expression values are less reliable; i.e., they are more likely to produce false high correlations.
The ideal situation is when all informative covariates belong to the neighborhood. We seek to estimate the probability that a neighborhood covariate \({x}_{k}\) is exchanged with a non-neighborhood covariate \({x}_{l}\) in the set of informative covariates (Fig. 1). When this occurs the coefficient of determination is decreased by \(0\le s\le 1\). For this to happen including a non-neighborhood covariate has to lower the ridge regression error \(L\). The condition for that is:
$$\frac{\partial L}{\partial {\beta }_{k}}-\frac{\partial L}{\partial {\beta }_{l}}>0$$
$$\frac{\partial L}{\partial {\beta }_{k}}=\sum _{i=1}^{p}2{\epsilon }_{i}{x}_{ik}+2\lambda {\beta }_{k}=2(\sum _{i=1}^{p}\stackrel{ˇ}{{\epsilon }_{i}}{x}_{il}+\lambda {\beta }_{k})$$
According to our assumption, informative covariates explain all the variance. It follows that \(E\left(\sum _{i=1}^{p}{\epsilon }_{i}{x}_{ik}\right)=0\), but \(E\left(\sum _{i=1}^{p}{\stackrel{ˇ}{\epsilon }}_{i}{x}_{il}\right)\ne 0\). The least squares error \(\stackrel{ˇ}{{\epsilon }_{i}}\) is equal to the unexplained fraction of the regressand \((1-s){y}_{i}\) plus random error \({\epsilon }_{i}\):
$${E(\stackrel{ˇ}{\epsilon }}_{i})=\left(1-s\right)E(\sum {y}_{i}{x}_{il})+{E(\epsilon }_{i}{x}_{il})=(1-s)E\sqrt{\sum {x}_{il}^{2}\sum {y}_{i}^{2}}\frac{\sum {y}_{i}{x}_{il}}{\sqrt{\sum {x}_{il}^{2}\sum {y}_{i}^{2}}}={N}^{2}(1-s)*{\sigma }_{{x}_{l}}{\sigma }_{y}{\rho }_{{x}_{l}y}={N}^{2}(1-s)\sqrt{s}*{\sigma }_{{x}_{l}}{\sigma }_{y}$$
$$E\left(\frac{\partial L}{\partial {\beta }_{k}}-\frac{\partial L}{\partial {\beta }_{l}}\right)={N}^{2}\left(1-s\right)\sqrt{s}*{\sigma }_{{x}_{l}}{\sigma }_{y}+\lambda ({\beta }_{k}^{2}-{\beta }_{l}^{2})$$
The number of covariates that \({x}_{k}\) is exchanged with can be more than one:
$$\frac{\partial L}{\partial {\beta }_{k}}-\sum _{l=1}^{m}\frac{\partial L}{\partial {\beta }_{l}}>0$$
To see what happens in this case replace \({x}_{l}\) by a compound variable \(w=\sum _{l=1}^{m}{\alpha }_{l}{x}_{l}\):
$$E\left(\frac{\partial L}{\partial {\beta }_{k}}-\sum _{l=1}^{m}\frac{\partial L}{\partial {\beta }_{l}}\right)={N}^{2}\left(1-s\right)\sqrt{s}*{\sigma }_{w}{\sigma }_{y}+\lambda ({\beta }_{k}^{2}-\sum _{l=1}^{m}{\beta }_{l}^{2})$$
When \(s=1\), only the difference in regression coefficients matter. In the following we assume that this is so, because it profoundly simplifies calculations and seems valid for the analyzed dataset. The validity of the assumption can be checked by establishing that the model including covariates with \(V\) largest ridge coefficients have the coefficient of determination close to 1. Under the assumption only the difference \({\beta }_{k}^{2}-\sum _{l=1}^{m}{\beta }_{l}^{2}\) matters. Since regression coefficients are normally distributed:
\({\beta }_{k}^{2}-\sum _{l=1}^{m}{\beta }_{l}^{2}=\frac{{\beta }_{k}^{2}}{{\sigma }_{\beta }^{2}}-\sum _{l=1}^{m}\frac{{\beta }_{l}^{2}}{{\sigma }_{\beta }^{2}}>0\) ,\(\frac{{\beta }_{k}^{2}}{{\sigma }_{\beta }^{2}} \tilde{\chi }_{1}^{2},\frac{{\beta }_{l}^{2}}{{\sigma }_{\beta }^{2}}\tilde{\chi }_{m}^{2}\)
The value of \(m\), the expected number of covariates that explain all the variance, can be estimated from the data.
The second round of shrinking is based on the following theorem.
Theorem 1, X, Y, Z are normal random variables with zero mean and (possibly) non-equal variances. The quantity is distributed normally with zero mean if and only if and does not depend upon .
\(x={\beta }_{1}z+{\epsilon }_{x}\) , \(y={\theta }_{1}z+{\epsilon }_{y}\)
If there is no dependence, \({\beta }_{1}∼N(0,{\sigma }_{\beta }^{2})\), \({\theta }_{1}∼N(0,{\sigma }_{\theta }^{2})\). The correlation between \((x-{\beta }_{1}z)\) and \((y-{\theta }_{1}z)\) is distributed normally with mean \({\rho }_{XY}\). Consequently, the distribution of \(\left|{\rho }_{XY}\right|-{|\rho }_{x-{\beta }_{1}z,y-{\theta }_{1}z}|=\left|{\rho }_{XY}\right|-\left|{\rho }_{XY|Z}\right|\) has a zero mean. The non-zero mean is possible only when \({\beta }_{1}\) and/or \({\theta }_{1}\) are not equal to zero. The converse is not true; the distribution of \(\left|{\rho }_{XY}\right|-\left|{\rho }_{XY|Z}\right|\) can have a zero mean, when regression coefficients are zero. End of the proof.
\(V<N\) covariates with largest by absolute value ridge regression coefficients are chosen for the second round of shrinking. Regression coefficients in this set of covariates are not expected to have zero mean, because predictive covariates tend to be chosen. According to theorem#1 the likeliness that a regression coefficient is equal to zero can be estimated from the distribution of \(\left|{\rho }_{XY}\right|-\left|{\rho }_{XY|Z}\right|\).
It shall be born in mind that DOSH does not guarantee to discover the whole neighborhood. Covariates that constitute a neighborhood can be co-linear. Then ridge regression would include only subset of them as informative. In the same manner the theorem1 works. If the distribution of \(\left|{\rho }_{XY}\right|-\left|{\rho }_{XY|Z}\right|\) has non-zero mean, then the regression coefficient is not zero. But if the distribution has zero mean, the regression coefficient can be zero or non-zero. The DOSH algorithm is aimed at identifying nodes that belong to the neighborhood with high probability, but does not make an attempt to identify the whole neighborhood. In some applications it is better to have a non-full, but reliable, list of genes that are directly connected with the gene of interest, rather than a list, where only a subset of genes constitute the neighborhood.
2. Assigning a direction to transcription factor interactions using ChIP-seq data.
In the next step, the identified interactions are to be validated. For that purpose, interactions with transcription factors (TF) are especially convenient. Chromatin immunoprecipitation with sequencing (ChIP-seq) identifies the binding sites of a TF in a genome. If a gene is regulated by a TF, its ChIP-seq signal has to be larger. It is important to understand the specificity of our data. Co-expression between a TF and a gene can be observed only when TF occupancy is relatively low. The larger the occupancy the less sensitive the regulated gene to fluctuations in TF concentration. Because of that, identified TF-gene interactions and, consequently, ChIP-seq peaks, are expected to be ‘weak’. The first part of the analysis is aimed at confirming that it is possible to detect ‘weak’ interactions in ChIP-seq data.
ChIP-seq reads cover a large part of the genome. We seek to understand whether low coverage has meaning; i.e., whether it is non-random. By analyzing ChIP-seq data in cell lines that overexpress a TF, this can be achieved. In over-expression conditions the amount of TF increases, the signal becomes stronger and more binding sites are detected. If the signal is non-random, there must be a positive correlation between normal and over-expression conditions. For example, loci that were detected as peaks under over-expression (but not normal) conditions would tend to have a larger number of reads. It would prove that such loci are not random errors. Systematic error depends upon the amount of antibody, not the amount of target TF. It follows that proving the non-randomness of the ChIP-seq signal amounts to proving that the signal measures binding of target TF. Genes that are regulated by the TF have to bind it, and, according to the above, have a ‘higher than average’ ChIP-seq signal. An obviously important covariate is gene length. In this work gene length is taken into account by first regressing the coverage of all genes onto their lengths. A set of genes has higher coverage if genes are positioned non-randomly relative to the regression line; i.e., above the regression line. Quantitatively, this is checked by the sign of residuals.
DOSH identified 138 TFs interacting with one or more genes. ENCODE [6] project site contained a ChIP-seq experiment with overexpression of one TF, NR2C2. Below ‘peaks’ refer to ChIP-seq peaks called by ENCODE pipeline, while ‘signal/coverage’ is the number of reads from ChIP-seq BAM files. First, ChIP-seq peaks that appeared in overexpression and normal conditions (intersect set) were analyzed. The fraction of saturated peaks was 0.45 and 0.87 in overexpression conditions and the intersect set, respectively. Second, peaks present only in overexpression conditions (diff set) were analyzed. Out of 44015 peaks in the diff set, 40661/41316 had non-zero coverage, while chromosome coverage ranged from 0.039 to 0.079. It follows that the ChIP-seq signal outside of peaks is non-random and measures real NR2C2 binding.
With ChIP-seq data we can identify whether a TF binds to a gene that it interacts with or not. This gives interactions a direction. If a gene regulates a TF, the interaction is called an inward interaction. Outward interactions are between a TF and genes regulated by the TF. The direction of the interaction is defined by the sign of the residual in the regression model of coverage and gene length – positive and negative residuals for outward and inward interactions, respectively (Fig. 2).
3. Neighborhood genes as cancer genetic markers.
Cancer’s hallmark is transcription misregulation. The nature and degree of misregulation can be estimated most efficiently by analyzing key players, transcription factors (TF). In disease, TF expression and/or activity may change. Protein mutations alter activity. Genome sequencing identifies mutations; however, how damaging a mutation is often unknown. Expression level also shouldn’t be the sole factor, since damaging mutations can lead to unaltered expression. Neighborhood genes provide a convenient alternative. Changes in TF activity or expression always alter expression of neighboring genes. It makes neighborhood genes better genetic markers than TFs themselves.
Genetic markers can serve as predictors of disease traits, like survivability. Genes from the same neighborhood are expected to be less independent of each other than random genes in explaining a trait. ‘Less independent’ means that such genes are more likely to explain the same part (be co-linear) or, on the contrary, complement each other. The second event can be detected as a change in the p-value of the regression coefficient; i.e., when a gene becomes ‘significant’ only when paired with another gene. We can re-state that as ‘a non-significant gene is more likely to become significant when paired with a gene from the same neighborhood’. This hypothesis was used to check the validity of interactions identified by DOSH. The significance threshold of 0.05 was chosen. Results of a binomial test (see Methods for details) showed that p-values were not uniformly distributed (Table 1).
Table 1
Results of the binomial test#2 (see Methods for details). P-values for separate genes (binomial test#1) can be found in Table S1.
p-value cutoff, p
|
binomial test p-value
|
p-value cutoff, p
|
binomial test p-value
|
0.05
|
0.3648447
|
0.3
|
0.1169005
|
0.1
|
0.1332651
|
0.35
|
0.02140162
|
0.15
|
0.04882253
|
0.4
|
0.02601014
|
0.2
|
0.03622801
|
0.45
|
0.1019576
|
0.25
|
0.09615822
|
0.5
|
0.01576975
|
The directionality of interactions is important as well. Genes that regulate the TF and genes that are regulated by the TF react differently to alterations in TF activity and/or expression level. Consequently, it is expected that genes with interactions of the opposite directions would complement each other more often. DOSH identified only one TF that was suitable for directionality analysis, MGA. MGA was found to have 8 inward (RPGRIP1L, CACNA1F, ZC2HC1A, AC018642.1, ARL6, C2orf16, PLCD3, C1orf145) and 4 outward (CTC-542B22.2, KDM4A-AS1, TTC23, MCF2) interactions. Out of 7 gene pairs, where one of genes changed its significance, 4 pairs were between inward and outward interactions (Table 2, binomial test p-value = 0.028).
Table 2
Change of significance in the neighborhood of transcription factor MGA.
Gene#1
|
Gene#2
|
Interaction direction
|
TTC23
|
ZC2HC1A
|
Out/In
|
ARL6
|
ZC2HC1A
|
In/In
|
KDM4A-AS1
|
ZC2HC1A
|
Out/In
|
KDM4A-AS1
|
PLCD3
|
Out/In
|
RPGRIP1L
|
C2orf16
|
In/In
|
KDM4A-AS1
|
C2orf16
|
Out/In
|
ARL6
|
C2orf16
|
In/In
|