Data sets
Benchmark datasets. To compare the performance of PharmGNN with that of other GNN models, we tested our models on eleven benchmark datasets from MoleculeNet [38]. Of the MoleculeNet collections, the physical chemistry, bioactivity and physiology data sets, except for the PDBbind data sets, were used in this work. Three of the eleven datasets were used for regression tasks and eight for classification tasks.
Kinase datasets. The core idea of our model, PharmGNN, is to integrate pharmacophore information; pharmacophores are regarded as abstract molecular features responsible for the recognition of a ligand by a biological target. Therefore, in theory, our model is particularly suited to predicting molecular bioactivity towards targets of interest. To systematically test the prediction performance of various algorithms on bioactivity datasets, we collected inhibitors of ten kinase targets (see Table 1). The kinase targets were selected to cover as many kinase families as possible and to favour targets with strong prospects for drug development. All these datasets were derived from ChEMBL [39]. After operations such as deduplication, salt removal and charge neutralization, ten kinase data sets were prepared for classification tasks, with the numbers of molecules ranging from 807 to 8800 and the proportion of active molecules ranging from 0.19 to 0.82.
Table 1
Basic information of kinase datasets used in this work
| kinase family | kinase full name^a | short name | total^b | active | inactive | ratio^c |
|---|---|---|---|---|---|---|
| TK | Epidermal growth factor receptor erbB1 | EGFR | 8800 (8718) | 4514 | 4286 | 0.51 |
| TKL | Serine/threonine-protein kinase B-raf | BRAF | 4761 (4669) | 3629 | 1132 | 0.76 |
| CAMK | Serine/threonine-protein kinase PIM1 | PIM1 | 4507 (4437) | 3322 | 1185 | 0.74 |
| Atypical | Serine/threonine-protein kinase mTOR | mTOR | 4135 (3924) | 3390 | 745 | 0.82 |
| AGC | Serine/threonine-protein kinase AKT | AKT1 | 3967 (3904) | 2175 | 1792 | 0.55 |
| Other | Serine/threonine-protein kinase Aurora-A | AURKA | 3882 (3854) | 2286 | 1596 | 0.59 |
| TK | Tyrosine-protein kinase BTK | BTK | 2640 (2560) | 1894 | 746 | 0.72 |
| CMGC | Cyclin-dependent kinase 2 | CDK2 | 2603 (2531) | 1302 | 1301 | 0.50 |
| STE | Mitogen-activated protein kinase kinase kinase kinase 2 | MAP4K2 | 897 (891) | 280 | 617 | 0.31 |
| CK1 | Casein kinase I alpha | CK1 | 807 (801) | 154 | 653 | 0.19 |
^a: preferred name listed on the ChEMBL website; ^b: number of all molecules (in parentheses: number of molecules suitable for reduced-graph processing); ^c: ratio of active molecules to all molecules.
Molecular Graph
In graph neural networks (GNNs), a molecule is regarded as a graph G = (V, E), where atoms are treated as nodes V and chemical bonds as edges E. The nodes and edges are encoded according to the rules shown in Table S1 and Table S2. For instance, node features include atom type, formal charge, etc., and edge features include bond type, stereo type, etc. These encoded features are the initial features of the molecular graphs and are used as raw inputs to train GNN models. After training, we obtain the final task predictions, together with the task-learned graph representations, which can also be regarded as molecular fingerprints.
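As a concrete illustration, the sketch below builds such a graph from a SMILES string with RDKit and PyTorch Geometric, the toolkits used in this work. The atom and bond features here are an illustrative subset of the Table S1/S2 encodings, not the exact featurization of the paper.

```python
# Minimal sketch: SMILES -> PyTorch Geometric graph (illustrative features only).
import torch
from rdkit import Chem
from torch_geometric.data import Data

def mol_to_graph(smiles: str) -> Data:
    mol = Chem.MolFromSmiles(smiles)
    # Node features: atomic number, formal charge, aromaticity (illustrative subset).
    x = torch.tensor(
        [[a.GetAtomicNum(), a.GetFormalCharge(), int(a.GetIsAromatic())]
         for a in mol.GetAtoms()],
        dtype=torch.float,
    )
    # Edges and edge features: bond order as a single float (illustrative subset);
    # each bond is added in both directions to keep the graph undirected.
    src, dst, e = [], [], []
    for b in mol.GetBonds():
        i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
        src += [i, j]
        dst += [j, i]
        e += [[b.GetBondTypeAsDouble()]] * 2
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    edge_attr = torch.tensor(e, dtype=torch.float)
    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr)

graph = mol_to_graph("CC(=O)Oc1ccccc1C(=O)O")  # aspirin as a quick example
```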
Reduced Graphs (RGs)
RGs provide simplified representations of chemical structures by collapsing atom groups into super nodes while maintaining the topological properties of the original molecular structures. RGs have mainly been applied to similarity searching, scaffold hopping, de novo design and the extraction of structure-activity relationships [34, 36, 40, 41].
By altering the rules used for collapsing atom groups, RGs provide flexible ways of generalizing super-node features. A current research trend is to collapse atom groups according to pharmacophore rules, so that the resulting RGs can be regarded as topological pharmacophores [36, 40]. It is worth emphasizing that the pharmacophore rules need to be adapted before being applied to graph reduction: each atom in an RG must be mapped to one or more super nodes, whereas under classical pharmacophore rules atoms that do not belong to any pharmacophore remain unlabeled.
In this work, we adopted the graph reduction scheme developed by Harper [34], which defines 18 types of super nodes as shown in Figure 2 (a): three ring types (aromatic ring, aliphatic ring, or acyclic) combined with six feature types (positively ionizable, negatively ionizable, joint H-bond donor and acceptor, donor, acceptor, or no feature). Note that the items within the three ring types and the six feature types are listed in order of priority from high to low. Figure 2 (b) shows comparative examples of molecules and their RGs. More graph reduction schemes can be found in the literature [34, 35, 42].
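For illustration, the sketch below assigns one of the 18 super-node types to a given atom group by checking the ring classes and feature classes in priority order. The donizer/acceptor and ionizability tests are crude stand-ins for the actual pharmacophore definitions of Harper's scheme, so the labels should be read as illustrative only.

```python
# Hedged sketch of the 18-type super-node labelling (3 ring classes x 6 feature classes).
from rdkit import Chem

def _donor(a):      # crude stand-in: N/O carrying at least one hydrogen
    return a.GetSymbol() in ("N", "O") and a.GetTotalNumHs() > 0

def _acceptor(a):   # crude stand-in: any N/O
    return a.GetSymbol() in ("N", "O")

def super_node_type(mol, atom_ids):
    atoms = [mol.GetAtomWithIdx(i) for i in atom_ids]
    # Ring class, checked in priority order (aromatic > aliphatic > acyclic).
    if any(a.GetIsAromatic() for a in atoms):
        ring = "aromatic_ring"
    elif any(a.IsInRing() for a in atoms):
        ring = "aliphatic_ring"
    else:
        ring = "acyclic"
    # Feature class, checked in priority order.
    if any(a.GetFormalCharge() > 0 for a in atoms):
        feat = "pos_ionizable"
    elif any(a.GetFormalCharge() < 0 for a in atoms):
        feat = "neg_ionizable"
    elif any(_donor(a) for a in atoms) and any(_acceptor(a) for a in atoms):
        feat = "donor_acceptor"
    elif any(_donor(a) for a in atoms):
        feat = "donor"
    elif any(_acceptor(a) for a in atoms):
        feat = "acceptor"
    else:
        feat = "no_feature"
    return ring, feat

mol = Chem.MolFromSmiles("c1ccccc1O")              # phenol
print(super_node_type(mol, [0, 1, 2, 3, 4, 5, 6]))  # ('aromatic_ring', 'donor_acceptor')
```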
Message-Passing Neural Network (MPNN)
MPNN is a general framework for supervised learning on graphs. Its forward pass consists of two phases: a message-passing phase and a readout phase. Here we take an undirected graph \(G\) as an example, in which the node (atom) features are denoted \(x_v\) and the edge (bond) features \(e_{vw}\). In the message-passing phase, the message function is \(M_t\) and the vertex update function is \(U_t\), where \(t\) is the time step. During message passing, the hidden state of each node \(h_v^{t+1}\) is updated from the message \(m_v^{t+1}\) according to:
$$m_v^{t+1}= \sum_{w\in N(v)} M_t\left(h_v^{t},\, h_w^{t},\, e_{vw}\right) \tag{1}$$
$$h_v^{t+1}= U_t\left(h_v^{t},\, m_v^{t+1}\right) \tag{2}$$
where \(N(v)\) is the set of neighbors of node \(v\) in \(G\). In addition, \(h_v^{0}\) is derived from the initial node features \(x_v\) through some function.
In the readout phase, a readout function \(R\) makes a task prediction for the whole graph according to:
$$\widehat{y}=R\left(\left\{h_v^{T} \mid v\in G\right\}\right) \tag{3}$$
where the output \(\widehat{y}\) can be a scalar or a vector, depending on whether it is used for single-task or multi-task prediction.
During training, the model takes molecular graphs as inputs and predicts the properties of each molecule. The loss is computed from the predicted and true properties, and its gradient is backpropagated through the readout phase and the message-passing phase.
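A minimal PyTorch sketch of Eqs. (1)-(3) is given below: a summed message over neighbors, a GRU-cell node update and a sum readout followed by a small prediction head. The specific choices of \(M_t\), \(U_t\) and \(R\) and the layer sizes are illustrative, not the settings used in this work.

```python
# Minimal MPNN sketch for a single molecular graph (illustrative M_t, U_t, R).
import torch
import torch.nn as nn

class SimpleMPNN(nn.Module):
    def __init__(self, node_dim, edge_dim, hidden_dim, out_dim, steps=3):
        super().__init__()
        self.embed = nn.Linear(node_dim, hidden_dim)                      # x_v -> h_v^0
        self.message = nn.Linear(2 * hidden_dim + edge_dim, hidden_dim)   # M_t
        self.update = nn.GRUCell(hidden_dim, hidden_dim)                  # U_t
        self.readout = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),   # R
                                     nn.ReLU(), nn.Linear(hidden_dim, out_dim))
        self.steps = steps

    def forward(self, x, edge_index, edge_attr):
        # x: [N, node_dim], edge_index: [2, E], edge_attr: [E, edge_dim]
        h = torch.relu(self.embed(x))
        src, dst = edge_index
        for _ in range(self.steps):
            # Eq. (1): m_v = sum over neighbours w of M_t(h_v, h_w, e_vw)
            msg = self.message(torch.cat([h[dst], h[src], edge_attr], dim=-1))
            m = torch.zeros_like(h).index_add_(0, dst, msg)
            # Eq. (2): h_v = U_t(h_v, m_v)
            h = self.update(m, h)
        # Eq. (3): y_hat = R({h_v^T}), here a sum readout plus a small head
        return self.readout(h.sum(dim=0))
```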
Reduced-Graph MPNN architecture (RG-MPNN)
The reduced-graph message-passing neural network architecture (abbreviated RG-MPNN) proposed in this work consists of four phases: an atom-level message-passing phase, a graph-reducing phase, an RG-level message-passing phase and a molecule readout phase. In short, compared with the common MPNN architecture, RG-MPNN adds a graph-reducing phase and a message-passing phase at the RG level. The RG-MPNN architecture works as follows.
Atom-level Message-Passing. During the atom-level message-passing phase, the operation of the RG-MPNN architecture is very similar to the message-passing phase of typical MPNNs, with one difference: \(e\) is not considered directly in the message function \(M_k\), since \(h_{atom'}\) is derived from \(cat(x_{atom'}, e_{atom\text{-}atom'})\) by a linear transformation. This phase runs for \(K\) time steps. The hidden state of each atom \(h_{atom}^{k+1}\) is updated from the message \(m_{atom}^{k+1}\) according to:
$$m_{atom}^{k+1}= \sum_{atom'\in N(atom)} M_k\left(h_{atom}^{k},\, h_{atom'}^{k}\right) \tag{4}$$
$$h_{atom}^{k+1}= U_k\left(h_{atom}^{k},\, m_{atom}^{k+1}\right) \tag{5}$$
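The sketch below illustrates the edge-handling trick mentioned above: the neighbor state entering \(M_k\) is built by a linear transformation of the concatenated neighbor and bond features. Dimensions are arbitrary placeholders.

```python
# Sketch of h_atom' = Linear(cat(x_atom', e_atom-atom')) used in Eq. (4); sizes are illustrative.
import torch
import torch.nn as nn

node_dim, edge_dim, hidden_dim = 3, 1, 64
fuse = nn.Linear(node_dim + edge_dim, hidden_dim)

x_neighbor = torch.randn(10, node_dim)   # neighbour-atom features, one row per edge
e_bond = torch.randn(10, edge_dim)       # features of the connecting bonds
h_neighbor = fuse(torch.cat([x_neighbor, e_bond], dim=-1))   # neighbour state entering M_k
```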
Graph Reducing. During this phase, the whole graph \(G\) is processed by the function \(Reduce\), which maps each atom to one or more super nodes according to the rules described in the reduced-graph section, resulting in a reduced graph \(RG\). We then define \(RG = (V', E')\), where \(V'\) represents the super nodes, each belonging to one of the 18 predefined super-node types, and \(E'\) represents the edges between super nodes, whose value is set to one plus the number of chemical bonds shared between the two adjacent super nodes. The initial hidden state of each super node, \(h_{rg}^{0}\), is obtained according to:
$$h_{rg}^{0}=Reduce\left(\left\{h_{atom}^{K} \mid atom\in rg\right\}\right) \tag{6}$$
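A short sketch of Eq. (6), assuming a hypothetical `assignment` tensor of (atom index, super-node index) pairs in which an atom may appear more than once:

```python
# Sketch of Eq. (6): initial super-node states as sums of child-atom hidden states.
import torch

def reduce_graph(h_atom, assignment, num_super_nodes):
    # h_atom: [N, H] final atom-level states; assignment: [P, 2] long tensor of
    # (atom index, super-node index) pairs (an atom may map to several super nodes).
    atom_idx, rg_idx = assignment[:, 0], assignment[:, 1]
    h_rg0 = torch.zeros(num_super_nodes, h_atom.size(1), device=h_atom.device)
    h_rg0.index_add_(0, rg_idx, h_atom[atom_idx])   # sum child-atom states per super node
    return h_rg0
```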
RG-level Message-Passing. This phase runs for \(T\) time steps, and the hidden state of each super node \(h_{rg}^{t+1}\) is updated from the message \(m_{rg}^{t+1}\) according to:
$$m_{rg}^{t+1}= \sum_{rg'\in N(rg)} M_t\left(h_{rg}^{t},\, h_{rg'}^{t}\right) \tag{7}$$
$$h_{rg}^{t+1}= U_t\left(h_{rg}^{t},\, m_{rg}^{t+1}\right) \tag{8}$$
Molecule Readout. During this phase, the molecule embedding \(h_{mol}\), which also serves as the task-learned representation of the molecular graph, is obtained by a readout function \(R\) from the hidden states \(h_{rg}^{T}\) within \(RG\):
$$h_{mol}=R\left(\left\{h_{rg}^{T} \mid rg\in RG\right\}\right) \tag{9}$$
then the prediction of the molecular property is obtained through MLP layers:
$$\widehat{y}=MLP\left(h_{mol}\right) \tag{10}$$
where the output \(\widehat{y}\) can be a scalar or a vector, as in the MPNN framework, depending on whether it is used for single-task or multi-task prediction.
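A compact sketch of Eqs. (9)-(10), with a sum readout standing in for \(R\) and an illustrative two-layer MLP head:

```python
# Sketch of the molecule readout: sum over super-node states, then an MLP head.
import torch.nn as nn

class MoleculeReadout(nn.Module):
    def __init__(self, hidden_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))

    def forward(self, h_rg):              # h_rg: [num_super_nodes, hidden_dim]
        h_mol = h_rg.sum(dim=0)           # Eq. (9): readout over super-node states
        return self.mlp(h_mol)            # Eq. (10): property prediction through MLP layers
```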
Theoretically, the RG-MPNN architecture proposed in this paper can be applied to any model under the MPNN framework; that is, before the readout of the whole molecule, graph reducing and RG-level message passing can be added, with the latter operation being optional.
PharmGNN
Applying the RG-MPNN architecture, we propose a model called PharmGNN (short for pharmacophore-based graph neural network), which was designed by adding the RG-MPNN components to a residual graph neural network (abbreviated RGNN). As shown in Figure 1 (b), PharmGNN follows the RG-MPNN pattern, going through the four phases described above in turn: atom-level message passing, graph reducing, RG-level message passing and molecule readout.
Within the atom-level message-passing phase, when gathering messages from neighbor atoms, our model adopts the attention mechanism proposed by Velickovic, Bengio et al. for the GAT model [43]. The core idea of this mechanism is to receive messages from neighbors with weights calculated from the feature vectors of the center atom and its neighbor atoms. This is in line with basic chemical understanding: each atom is influenced by its neighbors to different degrees, reflecting factors such as the strength of electrostatic attraction and shifts of the electron cloud. Moreover, in the update step, the new hidden state of an atom \(h_{atom}^{k}\) is obtained by adding the attention message and residual terms. It is worth emphasizing that there are two residual terms based on the skip-connection mechanism, namely the linearly transformed values of the previous two hidden states \(h_{atom}^{k-1}\) and \(h_{atom}^{k-2}\) (when \(k \geq 2\)), since skip-connection residuals effectively alleviate the vanishing-gradient problem during network training.
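The sketch below outlines one atom-level update step as described above: attention-weighted neighbor messages plus linearly transformed residuals of the two previous hidden states. The attention parameterization and residual transforms are illustrative and do not reproduce the released PharmGNN code.

```python
# Hedged sketch: attention message gathering with two skip-connection residuals.
import torch
import torch.nn as nn

class AttentionResidualStep(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.att = nn.Linear(2 * hidden_dim, 1)        # scores a (centre, neighbour) pair
        self.res1 = nn.Linear(hidden_dim, hidden_dim)  # residual from h^{k-1}
        self.res2 = nn.Linear(hidden_dim, hidden_dim)  # residual from h^{k-2}

    def forward(self, h, h_prev, edge_index):
        # h: [N, H] states at step k-1; h_prev: [N, H] states at step k-2; edge_index: [2, E]
        src, dst = edge_index
        scores = torch.exp(torch.tanh(self.att(torch.cat([h[dst], h[src]], dim=-1))))
        norm = torch.zeros(h.size(0), 1, device=h.device).index_add_(0, dst, scores)
        alpha = scores / norm[dst].clamp(min=1e-9)     # normalised attention per incoming edge
        msg = torch.zeros_like(h).index_add_(0, dst, alpha * h[src])  # weighted neighbour sum
        return msg + self.res1(h) + self.res2(h_prev)  # attention message + two residuals
```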
The graph-reducing process can be regarded as a pharmacophore-based graph reduction combined with one step of message passing. First, the graph \(G\) is reduced to \(RG\) according to the previously defined pharmacophore-based reduction rules, and the sum of the vectors of the child nodes inside a pharmacophore is taken as its initial state \(S_{rg}\). Then comes the message-gathering step of the MPNN framework: each pharmacophore node receives messages from its child nodes weighted by attention. This is consistent with the chemical intuition that each atom contributes differently to its pharmacophore. Finally, the pharmacophore nodes are updated through a GRU (gated recurrent unit) [44], with the expectation that the network learns to weigh the initial state \(S_{rg}\) against the messages passed over.
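A hedged sketch of this phase is shown below; for brevity the child-to-pharmacophore attention is replaced by a plain mean (see the attention sketch above), and the indexing tensors are hypothetical inputs as in Eq. (6).

```python
# Sketch: pharmacophore-based reduction with a GRU update of the super-node states.
import torch
import torch.nn as nn

class ReduceAndUpdate(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)  # weighs S_rg against the gathered message

    def forward(self, h_atom, atom_idx, rg_idx, num_rg):
        # h_atom: [N, H]; atom_idx, rg_idx: long tensors of matching length; num_rg: int
        dim = h_atom.size(1)
        # Initial state S_rg: sum of the child-atom hidden states of each pharmacophore node.
        s_rg = torch.zeros(num_rg, dim, device=h_atom.device).index_add_(0, rg_idx, h_atom[atom_idx])
        # Message gathered from child atoms (plain mean standing in for attention here).
        ones = torch.ones(atom_idx.size(0), 1, device=h_atom.device)
        counts = torch.zeros(num_rg, 1, device=h_atom.device).index_add_(0, rg_idx, ones)
        msg = torch.zeros(num_rg, dim, device=h_atom.device).index_add_(0, rg_idx, h_atom[atom_idx])
        msg = msg / counts.clamp(min=1.0)
        # GRU update: message as input, S_rg as the previous hidden state.
        return self.gru(msg, s_rg)
```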
During the RG-level message-passing phase, the operation is similar to the MPNN step of the graph-reducing process: the attention mechanism is applied to gather messages, and a GRU is then used to update the nodes.
The implementation of the molecule readout is very similar to that of the graph-reducing process, since the readout can be regarded as a special case of graph reducing in which all child nodes belong to a single super node.
Model Evaluation
In this work, we used two methods to split each dataset into a training set, a validation set and a test set. The first was random splitting with a ratio of 8:1:1. Note that in each round of comparing algorithms, the random seed was kept the same to eliminate the impact of different dataset divisions. The second was scaffold splitting, whose core idea is to put molecules with different scaffolds into different sets so as to evaluate the prediction ability on scaffolds not encountered during training. Additionally, each model was run for five repetitions.
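For reference, a simple scaffold-splitting sketch using RDKit's Bemis-Murcko scaffolds is given below; the grouping and ordering heuristics are illustrative and not necessarily those used in this work.

```python
# Hedged sketch: group molecules by Bemis-Murcko scaffold and assign whole groups to splits.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        groups[scaffold].append(idx)
    # Assign larger scaffold groups first so rarer scaffolds tend to end up in valid/test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles_list)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test
```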
For the benchmark data sets, we used RMSE (root mean square error) to evaluate regression tasks and AUC (area under the ROC curve) to evaluate classification tasks, to be consistent with other models in the benchmark evaluation. For the kinase data sets, two indicators were used to evaluate the models, AUC and MCC (Matthews correlation coefficient), as both are insensitive to data imbalance [45]. In different scenarios, the best model can be selected according to different indicators. AUC is suitable for selecting models in scenarios where correct ranking matters, such as shortlisting compounds for bioactivity testing in virtual screening, since it measures the ability of a model to rank positive samples above negative ones. MCC, in contrast, is suitable for scenarios where correct classification matters, such as deciding whether a molecule is active or toxic.
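Both metrics can be computed, for example, with scikit-learn (an assumption; the paper does not specify the metric implementation):

```python
# Example: AUC from predicted scores (ranking quality), MCC from thresholded labels.
from sklearn.metrics import roc_auc_score, matthews_corrcoef

y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.3]                 # predicted probability of the active class
auc = roc_auc_score(y_true, y_score)                 # ranking-based metric
mcc = matthews_corrcoef(y_true, [int(p >= 0.5) for p in y_score])  # classification metric at a threshold
```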
Model training
PyTorch [46], a deep-learning framework, was used for developing all parts of PharmGNN, RDKit (v.2018.09.2) [47] for processing molecules, and PyTorch Geometric [48] for transforming molecules into graphs. MSELoss and CrossEntropyLoss were used as the loss functions for regression and classification tasks, respectively, and Adam [49] was used for gradient-descent optimization.
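A minimal training-loop sketch under this setup is shown below; `model` refers to the earlier MPNN sketch, and `train_loader` is a hypothetical iterable of (graph, target) pairs rather than the actual PharmGNN training pipeline.

```python
# Minimal sketch of the training setup: Adam optimisation with MSELoss (regression).
import torch
import torch.nn as nn

model = SimpleMPNN(node_dim=3, edge_dim=1, hidden_dim=64, out_dim=1)  # from the MPNN sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()                     # CrossEntropyLoss would be used for classification

for epoch in range(50):
    for graph, target in train_loader:       # hypothetical iterable of (graph, target) pairs
        pred = model(graph.x, graph.edge_index, graph.edge_attr)
        loss = criterion(pred, target)       # loss between predicted and true properties
        optimizer.zero_grad()
        loss.backward()                      # gradient flows through readout and message passing
        optimizer.step()
```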