Dark Matter of the Transcription Factor Binding Site Motif Universe


 Background: We study the limits imposed by transcription factor specificity on the maximum size of a genetic regulatory network. Results: Most regular expressions for natural transcription factor binding site motifs are separated in sequence space by only one to three motif-discriminating positions. This mild specificity requirement puts the number of transcription factors that can coexist with minimal crosstalk on the order of ten thousand, which would fully utilize the space of DNA subsequences. An expanded alphabet with modified bases can further raise this limit by several orders of magnitude, at the expense of sequence space usage. Conclusions: Based on this analysis, thousands of transcription factor binding site motifs may await discovery.


Introduction
Specific interactions between proteins and nucleic acids are fundamental to the regulation of gene expression by transcription factors [1] in genetic networks [2]. Transcription factor binding sites (TFBS) are short degenerate DNA sequences up to 30 base pairs long [3]. Characterization of a TFBS usually starts with the experimental and/or computational identification of several DNA subsequences (termed TFBS instances) that perform a certain function. Once multiple instances of a TFBS are known, a TFBS motif is defined as the set of all TFBS instances that match a given model (i.e., the set of sites to which a transcription factor binds preferentially) [4]. The computational definition of the nucleotide pattern for a TFBS motif can be a fixed consensus sequence, a regular expression, or a scoring matrix.
Our main question is how many TFBS motifs can coexist in a genome or, in other words, what is the maximum size of a genetic regulatory network. The SwissRegulon database currently contains annotations for 684 different TFBS motifs in the human genome [5], providing an empirical lower bound. Alternatively, if each predicted human protein with DNA-binding domains recognizes a different TFBS motif, the human genome would harbor 2604 TFBS motifs [2].
Theoretical estimations from first principles provide upper bounds for the number of coexisting TFBS motifs. Different transcription factors usually recognize non-overlapping sets of sequences, possibly because overlapping would lead to detrimental crosstalk between the biological signals read by the two transcription factors. When observed, the overlap of TFBS motifs is generally small [6]. We may consider as an upper bound the maximum number of sequences of length n, which is $A(n) = 4^n$. A finer approach is to calculate the maximal number of TFBS motifs with a minimal Hamming distance d between sequences belonging to different motifs: $A(n, d) \le 4^{n-d+1}$ [7]. Thus, a linear increase in transcription factor specificity leads to an exponential decrease in the maximal number of coexisting TFBS motifs. Coding theory provides a third upper bound for the number of minimally overlapping TFBS motifs: 0.75 · 4 n · (n(4 1) 4)) [8]. The effects of motif length and specificity on the maximal number of TFBS motifs are thus strong. In spite of this, published work does not consider the specificity of natural TFBS motifs.
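The first two bounds can be evaluated numerically. The sketch below is only an illustration: it reads the bounds as A(n) = 4^n and A(n, d) ≤ 4^(n−d+1), and the function names `all_sequences` and `min_distance_bound` are hypothetical helpers introduced here, not part of the original analysis.

```python
# Illustrative sketch: upper bounds on the number of coexisting motifs
# of length n over an alphabet of q letters (q = 4 for canonical DNA).
# Function names are hypothetical, chosen for this example only.

def all_sequences(n: int, q: int = 4) -> int:
    """Total number of sequences of length n: A(n) = q**n."""
    return q ** n


def min_distance_bound(n: int, d: int, q: int = 4) -> int:
    """Bound A(n, d) <= q**(n - d + 1) for a minimal Hamming distance d."""
    if not 1 <= d <= n:
        raise ValueError("d must satisfy 1 <= d <= n")
    return q ** (n - d + 1)


if __name__ == "__main__":
    n = 10  # most common motif length reported for SwissRegulon
    print(f"A({n}) = {all_sequences(n)}")
    for d in range(1, 6):
        print(f"d = {d}: A({n}, {d}) <= {min_distance_bound(n, d)}")
```

For n = 10, each additional unit of required Hamming distance divides the bound by four, which is the exponential decrease described in the text.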
Published estimations for the maximal number of coexisting TFBS motifs assume a four letter DNA alphabet. However, many genomes harbor multiple modified bases [9] that may play a role in TFBS motifs [10]. The effective alphabet size of DNA may be over ten letters, which would significantly increase all theoretical estimates for the maximal number of coexisting TFBS motifs.
We apply regular expressions and theoretical tools developed for protein motifs (Bulavka et al., submitted) to the question of how many TFBS motifs can coexist in a genome. We consider empirical data for transcription factor sequence specificity, the effect of stable nucleotide modifications and sequence space occupancy.

Database of transcription factor binding site motifs
All 684 available regulatory motif weight matrices from the SwissRegulon hg19 database were retrieved in June 2018 [5]. We converted each weight matrix from the original database to a regular expression. For each position of the matrix we used the observed frequencies b for A, C, G and T to calculate the effective alphabet size EAS [11]. We then assigned EAS letters to that position of the regular expression, in order of decreasing frequency. Last, we removed from the regular expression flanking positions that allow for all four bases.
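The conversion steps above can be sketched as follows. This is a minimal illustration under explicit assumptions: the EAS formula is not reproduced in this text, so the sketch assumes the common entropy-based definition (the exponential of the Shannon entropy of the column) and rounds it to the nearest integer. The function names are hypothetical, and the original pipeline may differ in both respects.

```python
import math

BASES = "ACGT"


def effective_alphabet_size(freqs):
    """Exponential of the Shannon entropy of one matrix column.

    ASSUMPTION: the source's EAS formula is not shown here; this
    entropy-based definition is a common choice, not a confirmed one.
    """
    entropy = -sum(p * math.log(p) for p in freqs if p > 0)
    return math.exp(entropy)


def column_to_letters(freqs):
    """Keep the round(EAS) most frequent bases (rounding is an assumption)."""
    k = max(1, min(4, round(effective_alphabet_size(freqs))))
    ranked = sorted(zip(freqs, BASES), key=lambda t: -t[0])
    return "".join(sorted(base for _, base in ranked[:k]))


def matrix_to_regex(matrix):
    """Convert a list of (fA, fC, fG, fT) columns into a regular expression."""
    cols = [column_to_letters(c) for c in matrix]
    # Remove flanking positions that allow all four bases.
    while cols and cols[0] == BASES:
        cols.pop(0)
    while cols and cols[-1] == BASES:
        cols.pop()
    return "".join(c if len(c) == 1 else f"[{c}]" for c in cols)


if __name__ == "__main__":
    toy_matrix = [
        (0.25, 0.25, 0.25, 0.25),  # uninformative flank, trimmed
        (0.97, 0.01, 0.01, 0.01),  # essentially fixed A
        (0.50, 0.50, 0.00, 0.00),  # A or C
        (0.00, 0.00, 1.00, 0.00),  # fixed G
        (0.25, 0.25, 0.25, 0.25),  # uninformative flank, trimmed
    ]
    print(matrix_to_regex(toy_matrix))
```

The toy matrix is invented for the example and does not correspond to any SwissRegulon entry.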

Sequence specificity of transcription factor binding site motifs
We follow previous work on protein linear motifs (Bulavka et al., submitted). Briefly, we define a TFBS motif of length n as a sequence $A = (A_1, \dots, A_n)$ where each $A_i$ is a subset of the alphabet $\mathcal{A} = \{A, C, G, T\}$. A TFBS motif instance is a sequence $(a_1, \dots, a_n)$ with $a_i \in A_i$ for all i. The structure of A is the sequence $(|A_1|, \dots, |A_n|)$, i.e., the number of allowed bases at each position.
Given an alignment of two TFBS regular expressions $A = (A_1, \dots, A_n)$ and $B = (B_1, \dots, B_m)$, the number of motif-discriminating positions, $\mathrm{mdp}_{AB}$, is the number of aligned positions with at most three allowed letters where no letter can match both regular expressions. We calculate $\mathrm{mdp}_{AB}$ for the alignments between the two corresponding regular expressions that do not leave a hanging end for the shorter regular expression and match at least one pair of positions with fewer than four allowed letters. Finally, we take the minimal $\mathrm{mdp}_{AB}$ across all relevant alignments as a lower limit for the distance in sequence space between the two TFBS motifs.
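The procedure above can be sketched in a few lines. The sketch is illustrative only: `parse` and `min_mdp` are hypothetical names, the parser accepts only single letters and bracketed classes such as `[CG]`, and the alignment filter is read here as requiring at least one aligned pair in which both positions allow fewer than four letters, which is one possible interpretation of the text.

```python
def parse(regex):
    """Parse a simple motif regex such as 'A[CG]T' into a list of sets.

    Only single letters and bracketed classes are supported (assumption).
    """
    sets, i = [], 0
    while i < len(regex):
        if regex[i] == "[":
            j = regex.index("]", i)
            sets.append(set(regex[i + 1:j]))
            i = j + 1
        else:
            sets.append({regex[i]})
            i += 1
    return sets


def min_mdp(a, b):
    """Minimal number of motif-discriminating positions over all alignments
    that place the shorter motif entirely inside the longer one.

    Returns None if no alignment passes the filter.
    """
    short, long_ = sorted((parse(a), parse(b)), key=len)
    best = None
    for offset in range(len(long_) - len(short) + 1):
        pairs = list(zip(short, long_[offset:offset + len(short)]))
        # Assumed reading of the filter: at least one aligned pair in which
        # both positions allow fewer than four letters.
        if not any(len(x) < 4 and len(y) < 4 for x, y in pairs):
            continue
        # A position discriminates when the two letter sets are disjoint.
        mdp = sum(1 for x, y in pairs if not (x & y))
        best = mdp if best is None else min(best, mdp)
    return best


if __name__ == "__main__":
    print(min_mdp("ACGT", "ACGA"))      # motifs differing at the last position
    print(min_mdp("A[CG]T", "A[GT]T"))  # motifs sharing the instance AGT
```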
When the number of TFBS motif-discriminating positions is 0 for a given pair of motifs, we calculate an alternative measure of specificity as 1 − (number of sequences that match both regular expressions / number of sequences that match at least one of the regular expressions).
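For two motifs of equal length in a fixed alignment, this measure can be computed by counting rather than by enumerating sequences. The sketch below makes that simplifying assumption (equal lengths, one alignment) and uses hypothetical function names; motifs of different lengths would first need to be padded or restricted to the aligned window.

```python
from math import prod


def count_matches(sets):
    """Number of sequences matching a motif given as a list of letter sets."""
    return prod(len(s) for s in sets)


def overlap_specificity(a_sets, b_sets):
    """1 - (sequences matching both) / (sequences matching at least one).

    ASSUMPTION: both motifs have the same length and are already aligned.
    """
    if len(a_sets) != len(b_sets):
        raise ValueError("motifs must have the same length in this sketch")
    # A sequence matches both motifs iff each letter lies in both sets.
    both = prod(len(x & y) for x, y in zip(a_sets, b_sets))
    # Inclusion-exclusion for the size of the union.
    either = count_matches(a_sets) + count_matches(b_sets) - both
    return 1 - both / either


if __name__ == "__main__":
    motif_a = [{"A"}, {"C", "G"}, {"T"}]  # A[CG]T
    motif_b = [{"A"}, {"G", "T"}, {"T"}]  # A[GT]T
    print(overlap_specificity(motif_a, motif_b))
```

In the example, the two motifs share one instance (AGT) out of three distinct instances in total, so the measure is 1 − 1/3.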

Occupancy of the sequence space
In the case of zero motif-discriminating positions, each motif instance may belong to multiple motifs and we were not able to find a formula for the potential occupancy of sequence space (Bulavka et al., submitted). For k ≥ 1 motif-discriminating positions, motif instances belong to a single motif and the fraction of the sequence space occupied by a motif of structure $e := (e_1, \dots, e_n) \in \{1, \dots, 4\}^n$ is:

Results

Sequence specificity of known transcription factor binding site motifs
We considered positional weight matrices for 684 TFBS motifs in SwissRegulon (section 3.1). We generated a regular expression from each matrix, using information theory to minimize the loss of information. Figure 1A shows the frequency of each motif length in the database and of the number of symbols allowed at each position. TFBS motif length ranges from 4 to 30 characters, peaking at 10 characters. We quantify the distance in sequence space between a pair of TFBS motifs as the number of motif-discriminating positions (section 3.2 and Supplementary Figure 1). This number is the minimal count of positions where no symbol can match both regular expressions, taken over every possible alignment where the number of aligned positions is the length of the shorter regular expression (Bulavka et al., submitted). Since other positions might not fully overlap, this is a lower limit for the distance in sequence space between the two TFBS motifs. We calculated the number of motif-discriminating positions for all 233586 possible pairs of TFBS motifs in our database (Figure 1B, white bars and left Y axis). In 77% of the comparisons, the two regular expressions are separated in sequence space by at least one motif-discriminating position. This is in agreement with the use of regular expressions, where a mismatch at a single position is enough to rule out that a sequence belongs to a given TFBS motif. On the other hand, it is rare to find pairs of regular expressions separated by more than five motif-discriminating positions. The remaining 23% of regular expression pairs are not separated in sequence space by a motif-discriminating position. In this case, we measure the distance in sequence space using the fraction of sequences matching either of the two regular expressions that match only one of them (section 3.2). We find that 95% of motif pairs share less than 5% of sequences (Supplementary Figure 2). We conclude that SwissRegulon motif pairs show significant separation in sequence space, in agreement with our assumption that there is little crosstalk between natural TFBS motifs.

Number of potential transcription factor binding site motifs
We used our theory based on the pigeonhole principle (section 3.3 and Bulavka et al., submitted) and the structures of TFBS motifs in SwissRegulon (Figure 1A) to estimate the number of SwissRegulon-like TFBS motifs that can potentially exist in nature. We first converted the regular expressions in our database to motif structures (section 3.1). For each structure and a number of motif-discriminating positions, we calculated the number of potential TFBS motifs.
As expected from the heterogeneity in motif lengths and structures, the calculated values span several orders of magnitude (Supplementary Figure 3). We report the median of the distribution. Requiring one motif-discriminating position maximizes the number of potential TFBS motifs to over 9700 (Figure 1B, black circles and right Y axis). The lower value for two or more motif-discriminating positions is due to higher non-overlap requirements, while the lower value for zero motif-discriminating positions arises because the overlap imposed by this condition is more restrictive than the non-overlap imposed by one or more motif-discriminating positions. It is interesting to compare bars and circles of Figure 1B. On one hand, natural TFBS motif pairs are most often separated in sequence space by a single motif-discriminating position. On the other hand, this relatively low level of sequence specificity maximizes the number of potential TFBS motifs that can coexist while fulfilling the specificity requirement.

Role of nucleotide modifications
Current genome sequences report only the four canonical bases, and it is often forgotten that nucleotide modifications are varied and frequent [9]. These modifications increase the capacity of DNA to code for TFBS motifs [10]. Figure 1C shows the median number of potential TFBS motifs as a function of alphabet size for 0 to 4 motif-discriminating positions. Increasing the alphabet size from 4 to 10 increases the number of potential TFBS motifs by several orders of magnitude in all cases. When we consider an effective alphabet size of 10 letters, the increase relative to an alphabet of four letters is highest at over 9500-fold for one motif-discriminating position (Supplementary Figure 4). This effect decreases sharply with increasing motif specificity, becoming lower than ten-fold for 9 or more motif-discriminating positions. This is notable since a single motif-discriminating position is the most frequent distance in sequence space between naturally occurring TFBS motifs (Figure 1B).

Sequence space occupancy
A TFBS motif of length n is a subset of the sequence space of all possible $4^n$ DNA subsequences. We used the size of the sequence space for each TFBS motif (Supplementary Figure 5) and the corresponding maximum number of coexisting motifs to calculate the potential occupancy of sequence space for 1 to 10 motif-discriminating positions (section 3.3). The calculated values span several orders of magnitude (Supplementary Figure 6). As done for the number of potential motifs, Figure 1D reports the median of the distribution. For a single motif-discriminating position, all possible DNA subsequences belong to a potential TFBS motif. The potential occupancy of sequence space drops steeply for two or more motif-discriminating positions. The commonest numbers of motif-discriminating positions (Figure 1B) maximize the potential occupancy of sequence space by the resulting TFBS motifs (Figure 1D). For a single motif-discriminating position, the potential occupancy of sequence space is 100% regardless of alphabet size (Supplementary Figure 7). For two or more motif-discriminating positions, the potential occupancy of sequence space decreases as alphabet size increases. Increasing alphabet size thus leads to a trade-off between increasing the number of potential TFBS motifs (Figure 1C) and decreasing the potential occupancy of sequence space (Supplementary Figure 7).

Discussion
Naturally occurring TFBS motifs from SwissRegulon (Figure 1A) are commonly separated in sequence space by one to three motif-discriminating positions (Figure 1B). This level of sequence specificity not only avoids crosstalk between transcription factors but may also help encode a genetic network with several thousand TFBS motifs (Figure 1B) that maximizes sequence space usage (Figure 1D), where increasing the DNA alphabet size would allow for an even larger network (Figure 1C). This network-level view of TFBS motif specificity may inform the design of new specific DNA-binding proteins able to function in a cellular context, be they TALENs, zinc fingers, Cas9 or others.
Our theory is in principle valid for any set of molecules recognizing stretches of a linear polymer, regardless of the interacting partners. The overall picture for TFBS motifs is similar to our previous results for protein-protein interaction networks mediated by linear motifs (Bulavka et al., submitted). In that case, the observed sequence specificity also maximizes the potential size of the network up to around ten thousand motifs. The main differences are that increasing the DNA alphabet size has a much larger effect than increasing the protein alphabet size and that sequence space usage is much larger for the genetic network than for the protein interaction network at the same level of specificity. These differences arise from both alphabet size and the motif regular expressions, i.e., from the physicochemical basis of protein-protein versus protein-nucleic acid complex formation [1].
TFBS motifs from SwissRegulon are commonly ten base pairs long, which corresponds to a space of $\sim 10^6$ sequences. Our theory predicts that this sequence space can be organized into a maximum of $\sim 10^4$ TFBS motifs, separated by a single motif-discriminating position. In turn, coding theory predicts a maximum of $\sim 4.5 \cdot 10^3$ minimally overlapping TFBS motifs of length 10 [8]. A similar maximum of $\sim 1.6 \cdot 10^4$ TFBS motifs can be obtained with the sphere packing approach of [7] and a minimal Hamming distance of 4 mutations between sequences belonging to different motifs. We find it reassuring that three different specificity-focused theories lead to similar estimates for the maximum size of a genetic network.
The actual upper bound for the number of TFBS motifs may be lower than 9700 due to phenomena not included in the theory. For example, the molecular interactions mediating protein-DNA recognition [1] may prevent some sets of DNA subsequences from becoming actual TFBS motifs. A need for mutational robustness [12] may further constrain maximal genetic network size. These factors could be accounted for in future models. Effectively reaching the upper limit may not be a requirement for the regulation of current genomes [2]. On the other hand, the gap between the 684 TFBS motifs in SwissRegulon [5] and the 2604 predicted DNA-binding human proteins [2], together with the observation of conserved DNA sequences of unknown function [13], points to significant amounts of dark matter of the transcription factor binding site motif universe awaiting discovery.

Declarations
Ethics approval and consent to participate

Not applicable
Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.