Exploration of Dark Chemical Genomics Space via Portal Learning: Applied to Targeting the Undruggable Genome and COVID-19 Anti-Infective Polypharmacology

Advances in biomedicine are largely fueled by exploring uncharted territories of human biology. Machine learning can both enable and accelerate discovery, but faces a fundamental hurdle when applied to unseen data with distributions that differ from previously observed ones—a common dilemma in scientific inquiry. We have developed a new deep learning framework, called Portal Learning, to explore dark chemical and biological space. Three key, novel components of our approach include: (i) end-to-end, step-wise transfer learning, in recognition of biology’s sequence-structure-function paradigm, (ii) out-of-cluster meta-learning, and (iii) stress model selection. Portal Learning provides a practical solution to the out-of-distribution (OOD) problem in statistical machine learning. Here, we have implemented Portal Learning to predict chemical-protein interactions on a genome-wide scale. Systematic studies demonstrate that Portal Learning can effectively assign ligands to unexplored gene families (unknown functions), outperforming existing state-of-the-art methods. Compared with AlphaFold2-based protein-ligand docking, Portal Learning significantly improved performance, by 79% in PR-AUC and 27% in ROC-AUC. The superior performance of Portal Learning allowed us to target previously “undruggable” proteins and design novel polypharmacological agents for disrupting interactions between SARS-CoV-2 and human proteins. Portal Learning is general-purpose and can be further applied to other areas of scientific inquiry.


Key terms
To provide common ground for discussion with readers of various backgrounds, we list key terms related to the methods in this Supplemental material. The following list provides explanations at an intuitive level without attempting to establish formal definitions; readers may refer to the referenced materials for more formal definitions.

Deep learning specific

1. model architecture: the design of a model as a set of trainable parameters, without specifying the exact weights of those parameters [1].
2. loss landscape: the geometry of the global loss associated with a model architecture [2].
3. model instance: given a model architecture with a certain number of trainable parameters, a set of weights assigned to those parameters defines an instance of the model; during training, each optimization step yields a model instance.
4. optimization [1]: a neural network is trained by optimizing an objective function, usually in the form of minimizing a loss function.
5. global/local optimum: under the optimization formalization, the optimum is the end point of the optimization process. As explained in [3], in an ideal world where the complete data distribution is available to train a model, the optimum is global, while any stopping point for a particular sub-distribution is a local optimum.
6. model initialization: optimization always starts from an initial model instance.
7. pretraining [4]: train a model on a pretext task before training on the target task; the trained model instance becomes the initialization for the target task.
8. finetuning [4]: train a model on a target task, starting from an initialization pretrained on a pretext task.
9. Independent and Identically Distributed (IID) [5]: a set of observations x_i is IID if each x_i is an independent draw from a fixed ("stationary") probability distribution.
10. Out-of-distribution (OOD) generalization [6]: generalization consists in reducing the performance gap between training data and testing data. When the data-generating process of the training data is indistinguishable from that of the test data, the problem is "in-distribution"; otherwise it is an out-of-distribution generalization problem [6]. As in [7], consider datasets D^e := {(x_i^e, y_i^e)}_{i=1}^{n_e} collected under multiple domains e, each containing samples drawn IID from a probability distribution D(X^e, Y^e). The goal of OOD generalization is to use these datasets to learn a predictor Y ≈ f(X) that performs well across a large set of unseen but related domains e ∈ E_all. Namely, the goal is to minimize max_{e ∈ E_all} R^e(f), where R^e(f) = E_{(X^e, Y^e)}[ℓ(f(X^e), Y^e)] is the risk under domain e. Here the set E_all contains all possible domains.
11. generalization [1]: the most general goal of generalization is to enable the model to make reliable predictions on unseen data; out-of-distribution (OOD) prediction is a more challenging type of generalization problem, requiring the model to generalize to unseen data distributions.
12. mini-batch [1]: as a common practice for robustness and memory concerns, only a subset of the data is sampled to train the model at each optimization step, no matter how large the available data set is.
13. representation [1]: as coined by the line of work named "representation learning", the word representation is interchangeable with "embedding", referring to a vector/matrix of learned features.
CPI specific

1. CPI prediction: formulated as a binary classification task: predict whether or not a protein-chemical pair binds, given only the protein sequence and the chemical SMILES string.
2. protein descriptor, chemical descriptor: the modules of a CPI prediction model that extract protein/chemical embeddings in a Euclidean space.
Portal learning specific

1. universe: a model architecture (which defines a data transformation space) together with a data set.
2. portal: a model instance in a universe, which may be a local optimum in the current universe but which facilitates moving the model toward the global optimum in the ultimately targeted universe.
3. local loss landscape: the loss landscape obtained by optimizing a model on a sub-distribution of the complete underlying distribution of the whole data set.
4. global loss landscape: the loss landscape whose gradient directions point toward the global optimum across all sub-distributions.
5. stress test: a technique [8] to evaluate a predictor by observing its outputs on specifically designed inputs; three common types are stratified performance evaluation, shifted evaluation and contrastive evaluation.
6. shifted evaluation [8]: the stress test employed in this paper, which splits the train/test data set by Pfam families, i.e., proteins in the training and testing sets come from different Pfam families. This is a simple simulation of dark-space model deployment.
7. deployment gap: the difference between the performance evaluated by the test set and that evaluated by the development set.
8. classic deep learning training scheme: randomly split the whole data set into train/dev/test sets; optimize the model on randomly sampled mini-batches; choose the final trained model instance based on the best test evaluation metrics; usually adopts the empirical risk minimization [9] formulation.
2 Methods: PortalCG in a four-universe configuration for the dark chemical genomics space

In this section, we present the detailed methodology used in Portal Learning in the context of a four-universe configuration.
• Protein sequence universe. All sequences from Pfam-A families are used to pretrain the protein descriptor, following the same setting as DISAE [14], which highlights an MSA-based distillation process.
• Protein structure universe. Our protein structure data set contains 30,593 protein structures, 13,104 ligands, and 91,780 ligand binding sites. Binding sites were selected according to the annotations in BioLiP (updated to the end of 2020); binding sites that contact DNA/RNA or metal ions were excluded. If a protein has more than one ligand, multiple binding pockets were defined for that protein. For each binding pocket, the distances between Cα atoms of the binding-pocket residues were calculated. To obtain the distances between the ligand and its surrounding binding-site residues, the distances between atom i of the ligand and each atom of residue j of the binding pocket were calculated, and the smallest was selected as the distance between atom i and residue j. To obtain the sequence features of the binding-site residues in the DISAE protein sequence representation [14], binding-site residues obtained from PDB structures (queries) were mapped onto the multiple sequence alignment of the corresponding Pfam family. First, a profile HMM database was built for all Pfam families, and hmmscan [15] was used to search the query sequence against this profile database to decide which Pfam family it belongs to (for proteins with multiple domains, more than one Pfam family was identified). Then the query sequence was aligned to the most similar sequence in the corresponding Pfam family using phmmer, and the aligned residues of the query sequence were mapped onto the multiple sequence alignment of that family according to this alignment.
• Chemical universe. All chemicals in the ChEMBL26 database constitute the chemical universe.
• Protein function universe. The CPI classification data set is the whole ChEMBL26 [13] database, using the same threshold for defining positive and negative labels as in DISAE [14].
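The per-pair distance computation described for the protein structure universe can be sketched as follows (a minimal NumPy illustration; the function names are ours, not from the PortalCG code):

```python
import numpy as np

def atom_residue_distance(atom_xyz, residue_atom_xyzs):
    # Smallest atom-to-atom distance between a ligand atom and any atom
    # of the residue (the per-pair distance described in the text).
    d = np.linalg.norm(residue_atom_xyzs - atom_xyz, axis=1)
    return float(d.min())

def binding_site_distance_map(ligand_xyz, residues_xyz):
    # ligand_xyz: (n_atoms, 3) array of ligand atom coordinates;
    # residues_xyz: list of (n_j, 3) arrays, one per binding-site residue.
    # Returns the (n_atoms, n_residues) ligand-to-binding-site distance map.
    return np.array([[atom_residue_distance(a, r) for r in residues_xyz]
                     for a in ligand_xyz])
```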

Algorithm
In the four-universe configuration, portal learning starts with portal identification in the protein sequence universe, then travels into the protein structure universe for portal calibration, before finally entering the target protein function universe, where OOC-ML is invoked for model optimization.
Along the way, shifted evaluation, one type of stress model selection, is used to select the "best" model instance; it splits train/test based on Pfam families, i.e., the training and testing sets contain proteins from different Pfam families. Each phase is specified in the following sections.

Chemical representation
A chemical was represented as a graph and its embedding was learned using GIN [16].

Protein sequence pre-training
The protein descriptor is pretrained from scratch on all Pfam families, exactly following DISAE [14], making it a universal protein language model. With standard Adam optimization, shifted evaluation is used to select the "best" instance.

Protein structure regularization
With the protein descriptor pretrained on sequences from the whole of Pfam, a chemical descriptor and a distance learner were plugged in to fine-tune the protein representation. The distance learner follows AlphaFold [17], which formulates distance prediction as a multi-way classification over a distogram. Based on the histogram of binding-site distances, histogram equalization was applied to formulate a 10-way classification on our binding-site structure data, as in Supplemental Figure S11. Since the protein and chemical descriptors output position-specific embeddings of a distilled protein sequence and of all atoms of a chemical, pairwise interaction features on the binding sites were created with a simple vector operation: a matrix multiplication with a selector matrix A [18] extracts the embedding vectors of each binding residue and atom from the embedding matrix H, of size (number_of_residues, embedding_dimension) or (number_of_atoms, embedding_dimension); the selected embedding vectors are then multiplied and broadcast into a pairwise interaction tensor H^interaction_binding_site. This interaction tensor was fed into an Attentive Pooling [19] layer followed by a feed-forward layer for the final 10-way classification. Detailed model architecture configurations can be found in Table S10 and Figure S13. The intuition behind this simplest form of distance learner is to put all the stress of learning on the shared protein and chemical descriptors, which carry information across universes. Again, with standard Adam optimization, shifted evaluation was used to select the "best" instance. Two versions of distance structure prediction were implemented: one formulated as a binary classification, i.e. contact prediction, and one formulated as a multi-way classification, i.e. distogram prediction. The performance of the two versions is similar, as shown in Figure S12.
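The selector-and-broadcast construction of the pairwise interaction tensor can be sketched as follows (a minimal NumPy illustration under our own naming; the explicit 0/1 selector matrix A is shown to mirror the text, although the same selection could be done by direct indexing):

```python
import numpy as np

def pairwise_interaction_features(H_res, H_atom, res_idx, atom_idx):
    # Select binding-site rows of each embedding matrix with a 0/1
    # selector matrix A, i.e. A @ H, as described in the text.
    A_res = np.eye(H_res.shape[0])[res_idx]     # (n_sel_res, n_residues)
    A_atom = np.eye(H_atom.shape[0])[atom_idx]  # (n_sel_atoms, n_atoms)
    h_res = A_res @ H_res                       # (n_sel_res, dim)
    h_atom = A_atom @ H_atom                    # (n_sel_atoms, dim)
    # Multiply and broadcast the selected embeddings into the
    # interaction tensor of shape (n_sel_res, n_sel_atoms, dim).
    return h_res[:, None, :] * h_atom[None, :, :]
```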

Out-of-cluster Meta Learning (OOC-ML) in protein function universe
With the fine-tuned protein descriptor in the protein function universe, a binary classifier is attached: a ResNet [20] block followed by two linear layers, as shown in Table S10 and Figure S13. The major role in this phase is played by the optimization algorithm OOC-ML, shown in pseudocode Algorithm 1 and main content Figure 1(B),(C.1). Local loss landscape exploration is reflected in lines 4-9, and line 10 shows the ensemble over the global loss landscape. Note that more variants could be derived by changing the sampling rules (lines 3 and 5) and the global loss ensemble rule.
OOC-ML is built on MAML [21] but has significant differences. Echoing the steps illustrated in Figure 1 of the main text: 1. As shown in main content Figure 1(B), OOC-ML splits each sub-distribution into a support set and a query set (in MAML's terms, meta-train and meta-test) within both the training and test sets. However, MAML identifies sub-distributions from the label space {Y}, while OOC-ML identifies sub-distributions, i.e. clusters, from the input feature space {X}. In PortalCG, the clusters are identified by Pfam. Further, OOC-ML allows the utilization of very small clusters where very limited labeled data are available for training. For example, in PortalCG, Pfam families with too few samples to be split into support and query sets are treated as query-set-only; they participate only in the global loss optimization, as detailed below.
2. In each mini-batch, a few sub-distributions are sampled. The whole optimization has two layers, an inner loop and an outer loop. In the inner loop, each sub-distribution's data has its own local loss landscape, and the support set is used for in-distribution optimization on that local loss landscape.
3. The locally optimized model is then applied to the query set to obtain a query-set loss, which is fed to the global loss landscape. Each sub-distribution is optimized independently. This step is the same as in MAML; what differs is that OOC-ML also computes query-set losses without local in-distribution optimization for the small clusters.
4. Local query-set losses are pooled together and the model is optimized on the global loss landscape, as in the meta-optimization defined in MAML.

5. After training finishes, the model is deployed.
6. MAML is designed for multi-class classification in few-shot learning: at deployment, it expects to encounter new unseen classes, and it assumes a few labelled samples are available as a support set, hence the name few-shot learning. For each unseen class, the trained model carries out fast in-distribution adaptation on the support set before the final prediction on the query set. This is impossible in the context of dark space illumination: a Portal Learning-trained model has to make robust predictions without any chance of in-distribution adaptation.
Algorithm 1: Portal Learning Optimization Algorithm: Out-of-cluster Meta-learning
input: p(D), the CPI data distribution over all Pfam families, where each D_i ∈ D is the set of CPI pairs for pfam_i; α, β, learning step-size hyperparameters; L, the number of optimization steps in each round of local exploration; T, the number of global training steps; K, the number of points sampled from a local neighborhood
output: θ, the whole model weights
1: initialize the model weights θ (weights transferred from the portal for the protein and chemical descriptors; randomly initialized weights for the binary classifier)
2: for t = 1, ..., T do
3:   sample a mini-batch of Pfam clusters {D_i} from p(D)
4:   for each sampled cluster D_i do
5:     sample K points from D_i and split them into a support set D_i^s and a query set D_i^q
6:     θ_i ← θ
7:     for l = 1, ..., L do
8:       compute adapted parameters with gradient descent: θ_i ← θ_i − α ∇_{θ_i} L_{D_i^s}(θ_i)
9:     compute the query-set loss L_{D_i^q}(θ_i)
10:  θ ← θ − β ∇_θ Σ_i L_{D_i^q}(θ_i)
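To make one outer step of Algorithm 1 concrete, here is a first-order sketch (our own simplification: the query gradient is evaluated at the locally adapted parameters rather than differentiating through the inner loop, and all names are hypothetical):

```python
import numpy as np

def ooc_ml_step(theta, clusters, query_only_clusters, loss_grad,
                alpha=0.1, beta=0.01, L=3):
    # One outer step of OOC-ML, first-order variant.
    # clusters: list of (support_set, query_set) per sampled Pfam cluster;
    # query_only_clusters: small clusters used without local adaptation;
    # loss_grad(theta, data): gradient of the loss at theta on data.
    query_grads = []
    for support, query in clusters:
        theta_i = theta.copy()
        for _ in range(L):                     # local in-cluster exploration
            theta_i = theta_i - alpha * loss_grad(theta_i, support)
        query_grads.append(loss_grad(theta_i, query))
    for query in query_only_clusters:          # small clusters: query-set-only
        query_grads.append(loss_grad(theta, query))
    # ensemble the local query-loss directions into one global update
    return theta - beta * np.mean(query_grads, axis=0)
```

A full implementation would back-propagate through the L inner steps as in MAML; the first-order variant shown here is a common approximation.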

Stress model instance selection
In the common practice of the classic training scheme, there are three data splits: "train set", "dev set" and "test set". The train set, as the name suggests, is used to train the model. The test set, as commonly expected, is used to set an expectation of performance when applying the trained model to unseen data. The dev set is used to select the preferred model instance. In the OOD setting, the data are split (main content Table 1) such that the dev set is OOD with respect to the train set, and the test set is OOD with respect to both the train and dev sets. The deployment gap is calculated by subtracting the OOD-test performance from the OOD-dev performance.

Implementation details
With Portal Learning being a framework, all experiments are based on the configuration of a four-universe design. Four major variants of models were trained, as shown in main content Table 2, for controlled-factor experiments to verify the contribution of the key components of Portal Learning. In this section we present implementation details.
Due to the large number of total samples, all training is carried out under a global-step-based formalization instead of an epoch-based one. Typically, a deep learning model is trained for numerous epochs; in each epoch the model loops over all training data, and evaluation is carried out once on the whole test data set at the end of each epoch. In the global-step formalization, a mini-batch is sampled uniformly at random from the pre-split training data set, and this sampling is repeated for a pre-defined total number of global steps. Training is stopped when loss decreases fall within a pre-defined error margin. To evaluate along the way, every m global steps of training a subset of test data is sampled uniformly at random from the pre-split test set. To compute generalization gaps, in addition to evaluating on the test set split according to the shifted evaluation, a dev set is held out from the train set for evaluation as well; in this way, the dev set and the train set are IID. The performance difference between dev and train is the observed-space generalization gap, while the performance difference between dev and test is the dark-space generalization gap.
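The global-step training scheme described above can be sketched as follows (a schematic with hypothetical model_step/eval_fn callables, not the actual training code):

```python
import random

def train_by_global_steps(model_step, eval_fn, train_data, test_data,
                          total_steps=1000, batch_size=32,
                          eval_every=100, eval_subset=64, tol=1e-4):
    # Step-based training: sample a mini-batch uniformly at random each
    # global step; every `eval_every` steps, evaluate on a random subset
    # of the pre-split test set; stop early when the loss improvement
    # falls within the error margin `tol`.
    prev_loss, history = float("inf"), []
    for step in range(1, total_steps + 1):
        batch = random.sample(train_data, min(batch_size, len(train_data)))
        loss = model_step(batch)          # one optimization step, returns loss
        if step % eval_every == 0:
            subset = random.sample(test_data, min(eval_subset, len(test_data)))
            history.append((step, eval_fn(subset)))
        if prev_loss - loss < tol:        # loss plateau reached: stop
            break
        prev_loss = loss
    return history
```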

Evaluation metrics
Distogram prediction is evaluated by the average accuracy over the distogram. CPI binary classification uses F1, ROC-AUC and PR-AUC for overall evaluation, with per-class breakdowns of F1, recall and precision.
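For reference, ROC-AUC can be computed directly from the rank-sum (Mann-Whitney U) identity; a minimal sketch (not the evaluation code used in this work, which presumably relies on standard libraries):

```python
import numpy as np

def roc_auc(y_true, y_score):
    # ROC-AUC as the probability that a randomly chosen positive is
    # scored above a randomly chosen negative (rank-sum identity).
    # Ties in y_score are ignored for simplicity.
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```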

Docking as baseline
Protein-ligand docking was performed using AutoDock Vina [22]. The whole-protein-surface search implemented in AutoDock Vina was applied to identify the ligand binding pocket. The center of each protein was set as the center of the binding pocket. The largest distance from the protein atoms to the center of the protein was calculated for each x, y, and z direction to define the edge of the protein, and 10 Angstroms of extra space were added to the protein edge to set up the search space for the docking.
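The search-box construction described above can be sketched as follows (a minimal illustration; coordinates would come from the receptor structure, and the resulting center/size triples map onto AutoDock Vina's center/size style box parameters):

```python
import numpy as np

def vina_search_box(protein_xyz, pad=10.0):
    # Blind-docking box as described in the text: centered at the protein
    # center; per axis, the half-size is the largest atom-to-center
    # distance in that direction, plus 10 Angstroms of padding.
    xyz = np.asarray(protein_xyz, dtype=float)
    center = xyz.mean(axis=0)
    half = np.abs(xyz - center).max(axis=0) + pad
    return center, 2.0 * half   # (center_x/y/z, size_x/y/z)
```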

Production level for deployment
To create a production-level model, three models were trained in PortalCG, differing only in data split. The dev set was OOD with respect to the training set, ensuring no overlapping Pfam families between them. By rotating Pfam families between the training set and the OOD-dev set in a cross-validation fashion, each of the three models was trained on a different training set in terms of the Pfam families involved. A voting mechanism was then used to make the final prediction.
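Assuming the voting mechanism is a simple majority vote over the three models (the text does not specify the exact rule), the final prediction can be sketched as:

```python
def ensemble_vote(model_predictions, threshold=2):
    # Majority vote over the cross-validated PortalCG models: a CPI pair
    # is called positive when at least `threshold` of the binary (0/1)
    # per-model predictions are positive.
    return [int(sum(votes) >= threshold)
            for votes in zip(*model_predictions)]
```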

Dark space exploration from a theoretical lens
A neural network classifier is trained by minimizing a loss function of the standard form

J(θ) = −(1/|D_t|) Σ_{(x,y) ∈ D_t} log p_θ(x, y),

where p_θ(x, y) is the probability that a sample x belongs to the class y according to the trained neural network with parameters θ, and D_t is the training data set with |D_t| samples. As laid out in the recent framework of [3], which reasons about generalization in deep learning, the test error of a model f_t can be decomposed as

TestError(f_t) = TestError(f_t^iid) + [TestError(f_t) − TestError(f_t^iid)],

where the bracketed term is the real-world generalization gap. When data are sampled as independent and identically distributed (IID) random variables, the "ideal world" is a scenario where the complete data distribution is available, with infinite data, and optimization is performed on the population loss landscape; by contrast, the "real world" has only finite data, and optimization is performed on an empirical loss landscape. In the dark-space context of the OOD setting, this decomposition becomes

TestError(f_t^OOD) = TrainError(f_t^iid) + [TestError(f_t^iid) − TrainError(f_t^iid)] + [TestError(f_t^OOD) − TestError(f_t^iid)],

where the first bracketed term is the observed-space generalization gap and the second is the dark-space generalization gap. This shows that effort can be devoted to decreasing the observed-space error and/or the dark-space generalization gap in order to reduce TestError(f_t^OOD).
When stochastic gradient descent (SGD) is applied to the optimization, it approximately estimates ∇_θ J(θ), the expectation of the gradient, using a small sample of size m, i.e., a mini-batch drawn uniformly from the training set: g = (1/m) Σ_{i=1}^{m} ∇_θ L(x_i, y_i; θ). When all data are IID, this approximation works fine for updating θ with g. However, for OOD data with unknown distribution, this update rule can easily fall into a local minimum determined by the m mini-batch samples.
The test error of a trained model in the OOD setting includes two parts: the test error in the observed IID space and a generalization gap when stepping into the OOD space. Furthermore, as discussed and proved in [6], [7], not all OOD tasks are equal: depending on how different the OOD data are from the training data, some OOD tasks are more challenging. This is true for predicting ligand binding to dark proteins; it is impossible for the training data to provide sufficient coverage of the whole distribution of the dark chemical genomics space. The motivation of Portal Learning for exploring the dark space follows: one model architecture defines a functional mapping space, and together with a data set it defines a universe. A model instance transferred from an associated universe that initializes the model closer to the global optimum of the ultimately targeted universe is a portal. The CPI dark space cannot be explored if learning is confined to the observed protein function (i.e. CPI) universe, since the known data are far too sparse, as shown in main content Figure 3; hence STL is important for identifying portals. Optimizing a model on a loss function can decrease the IID training error, but will not by itself close the observed-space generalization gap TestError(f_t^iid) − TrainError(f_t^iid) or the dark-space generalization gap TestError(f_t^OOD) − TestError(f_t^iid). With Portal Learning, stress model instance selection narrows the first gap and OOC-ML narrows the second.

When we consider the proteins in Tbio, there are 9,545 proteins that are not in Casas's druggable proteins. If 0.67 was used as the cutoff, 219 proteins were predicted as positive hits. The gene enrichment analysis results for these proteins are listed in Table S4. Diseases associated with these 219 human proteins are listed in Table S5. Since one protein is always related to multiple diseases, the diseases are ranked by the number of their associated proteins, and the top 10 diseases are listed in the table. Most of the top-ranked diseases are related to cancer development.
21 drugs that are approved or in clinical trials are predicted to interact with these proteins, as shown in Table S6.

Additional tables
If the proteins in Tbio were removed from the undruggable list, only 2,930 proteins were left. If 0.67 was used as the cutoff, only 41 proteins were predicted positive, with no significant enrichment in the DAVID gene enrichment analysis. Therefore 0.665 was used as the cutoff, and 348 proteins were predicted as positive hits. The gene enrichment analysis results for these proteins are listed in Table S7. Diseases associated with these 348 human proteins are listed in Table S8. 42 drugs that are approved or in clinical trials are predicted to interact with these proteins, as shown in Table S9.

OOD generalization in deep learning
The recent work Invariant Risk Minimization (IRM) [7] is an algorithm dedicated to OOD generalization, aiming at a transformative solution via invariant representation. However, despite its theoretical completeness, many experiments [23] report that IRM does not perform well on large real-world data sets. Many deep learning tasks are inherently OOD generalization problems. Among the related jargon, some terms are known for defining a type of OOD scenario: domain generalization [24] can be taken as equivalent to OOD generalization, and domain shift [24] rephrases the fact of distribution change in terms of D(X, Y). Other terms define a type of solution: domain alignment [25] minimizes the difference between source-domain and target-domain distributions to obtain an invariant representation, where the distance between distributions is measured by a wide variety of statistical metrics, from simple l2 and f-divergence to the Wasserstein distance; domain adaptation [26] leverages a model pretrained on a different domain and is just one way to achieve domain generalization, the more general term equivalent to OOD generalization in a practical sense; causal learning is proved by [6] to be equivalent to OOD generalization when causality makes sense (taking into consideration the existence of cases where causality is meaningless); robust optimization [27] focuses on worst-group performance instead of the average performance of ERM. Although robust optimization has not quite been adapted to modern deep learning, its sub-field distributionally robust optimization [28] has seen quite a few recent works adapted for deep learning.
It is worth clarifying that many works solving the sub-group or sub-population shift problem address something quite different from the OOD generalization problem discussed in the setting of the dark chemical genomics space. Sub-population shift is more like an imbalanced-data problem, where the test set largely resembles the training data but shifts from a major class to a minor class or vice versa. For example, GroupDRO [29], published in 2018 to address this problem, proposes to incorporate structural assumptions on the distribution, which is straightforward in data sets with rich meta-data, or in multi-label classification, where the label structure can serve as the structural assumption.

Portal learning key components related
(Model architecture) Ever since the debut of the survey [30] popularizing the perspective of representation learning, enormous research effort has been devoted to model architecture design, almost taken as equivalent to deep learning itself and overshadowing all other directions. A key idea that echoes the demand for generalization is to learn global representations that help decrease both TrainError(f_t^iid) and the known-space generalization gap by denoising large data sets. Hence, to solve OOD, good model architecture design alone is not enough.
All existing work in CPI is confined to the known space, and few works have addressed generalization. Proposed CPI deep learning models generally follow the same pattern: a model architecture with three key modules (protein descriptor, chemical descriptor and interaction learner), formulating a classification problem, with a few variants formulated as regression. Innovation is seen mostly in model architecture, particularly for the chemical descriptor, reflecting all the milestones of recent deep learning advancement from CNN and LSTM to Transformer and GNN, as demonstrated in DeepPurpose [31]. Generalization has not appeared as a main research goal in any previous work except DISAE [14], which demonstrated generalizability to orphan-GPCR drug screening, relying mainly on a general-purpose pretrained protein language model fine-tuned on a GPCR data set with shifted evaluation. Hence, DISAE is the baseline model in this work.
(Model initialization) Although it could be categorized as a type of representation learning, transfer learning has become an iconic independent concept thanks to its huge success and breakthroughs in many NLP and CV benchmark tasks. It features a pretraining-finetuning procedure: an intuitive example is to pretrain a language model on a large general English vocabulary with a pretext task such as next-word prediction, and then to finetune the language model on a specific downstream task such as machine translation in the biology domain. Well-known Transformer-based pretrained models, starting from human language models, are a combined success of attention-based architecture design and transfer learning. In computational biology, the most eye-catching equivalent is the protein language model, i.e. protein descriptor, which inspired several similar works by different groups at the same time: TAPE and ESM showcase that pretraining on a large protein vocabulary significantly improves downstream tasks such as protein-protein interaction prediction, while MSA-based-transformer and DISAE incorporate MSAs in pretraining. From the perspective of the target downstream task, the power of transfer learning comes from a better model initialization. This is a major breakthrough that could fill the gap TestError(f_t^OOD) − TestError(f_t^iid), but not necessarily: it depends on how it is incorporated into the whole training scheme at the system level, particularly on the data fed in. DISAE is used in our work as the pretrained protein descriptor. This choice over other protein language models is due to the fact that DISAE is the smallest in terms of the memory required to use and optimize, at the same level of performance. STL is a way to leverage transfer learning to find a better model initialization.
The main difference and innovation is that transfer learning naively relies on the belief that transferring more general knowledge brings better performance, while STL in Portal Learning actively leverages biology-endorsed biases when transferring general knowledge. Further, by defining the goal as learning the portal, which will be closer to the global optimum in the target universe's loss landscape, the whole training system is actively steered toward solving OOD.
(STL) Sparked by the breakthroughs of AlphaFold1 [17] and AlphaFold2 [32] in protein structure prediction, deep learning has been trusted for molecular interaction distance map prediction to learn structural information, and the inclusion of CPI-structure (i.e. protein function prediction) portal calibration is inspired by this success. Specifically, we pretrain the model to predict residue-residue contacts for proteins whose structures are solved, and chemical atom-protein residue contacts given known CPI complex structures. There are three popular ways to formulate pairwise residue-residue distance matrix prediction as a machine learning task. At one end, it is formulated as a binary classification, where a distance threshold defines whether a pair of residues is in contact or not, hence the name contact prediction. At the other end, it is formulated as a regression problem, where the exact distance is the regression target, hence the name exact distance prediction. AlphaFold1 showed a middle way between the two ends: formulate it as a multi-class classification problem, where the distribution of pairwise residue distances is discretized into multiple class labels according to a histogram, hence the name distogram prediction. We first focus on residue-atom pairwise distances at binding sites and then experiment with both contact prediction and distogram prediction. In our results, the two formulations have similar performance in terms of the final CPI prediction, as shown by the ablation study in Supplemental Figure S12.
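The histogram-equalized distogram binning described above can be sketched as follows (a minimal NumPy illustration with 10 bins; the function name is ours, and the exact bin edges used in PortalCG are those of Supplemental Figure S11):

```python
import numpy as np

def distogram_labels(distances, n_bins=10):
    # Histogram-equalized binning: place bin edges at empirical quantiles
    # of the observed distances so each of the n_bins classes receives
    # roughly the same number of residue-atom pairs, then assign each
    # distance its bin index as the multi-class label.
    d = np.asarray(distances, dtype=float)
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]   # interior quantiles
    edges = np.quantile(d, qs)
    return np.digitize(d, edges), edges
```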
(OOC-ML) It has long been known that the order in which training data are exposed to the model affects model generalizability. Active learning [33] actively queries data in an iterative fashion, exposing the model only to data close to the decision boundary. Curriculum learning [34] sorts the training data so that the model is exposed to challenges of increasing difficulty. This element of data logistics has also been closely woven into many optimization algorithms that aim to improve model generalizability; for example, contrastive loss [35] requires a certain ratio of positive vs. negative samples in each mini-batch. Most related to Portal Learning is meta-learning, which can be categorized into metric-based, model-based and optimizer-based "learn to learn" algorithms [36], with applications to few-shot and zero-shot learning. Meta-learning arose from the data-efficiency challenge rather than from generalization or OOD. Although meta-learning is defined very generally, making many algorithms seem like mere variants falling under its umbrella, in practice algorithms proposed under the name of meta-learning are defined on multi-class classification data sets, typically image classification, where the main challenge is the huge number of classes with limited data points per class. Because of this underlying motivation, meta-learning features involved data logistics, with multiple layers of optimization, each with its own meta-train/meta-test sets sampled based on the label distribution. These unstated assumptions reveal that no existing meta-learning algorithm fits CPI data.
However, the idea of "learn to learn" is attractive. MAML [21] is the optimization-based meta-learning work that inspired OOC-ML, proposed as a major component of Portal Learning. The differences are major: OOC-ML expands on MAML by focusing on the data feature distribution instead of the label distribution, encouraging active sampling in local neighborhoods (which simplifies the support/query, meta-train/meta-test data logistics), and ensembling several local loss directions to learn a global gravity direction.

As shown in Figure S2, there are three main ranges in terms of the number of binding targets in a Pfam family for one chemical: [2,5], [5,20], and [20,∞). For each range, a heatmap is shown with the y axis representing chemicals, the x axis representing Pfam families, and each point representing the known binding pairs for one chemical and one Pfam family. As can be seen, the dark space is huge.