Algebraic Graph-assisted Bidirectional Transformers for Molecular 1 Prediction

The ability of quantitative molecular prediction is of great significance to drug discovery, human 10 health, and environmental protection. Despite considerable efforts, quantitative prediction of various molec11 ular properties remains a challenge. Although some machine learning models, such as bidirectional encoder 12 from transformer, can incorporate massive unlabeled molecular data into molecular representations via a 13 self-supervised learning strategy, it neglects three-dimensional (3D) stereochemical information. Algebraic 14 graph, specifically, element-specific multiscale weighted colored algebraic graph, embeds complementary 3D 15 molecular information into graph invariants. We propose an algebraic graph-assisted bidirectional trans16 former (AGBT) model by fusing representations generated by algebraic graph and bidirectional transformer, 17 as well as a variety of machine learning algorithms, including decision trees, multitask learning, and deep 18 neural networks. We validate the proposed AGBT model on five benchmark molecular datasets, involv19 ing quantitative toxicity and partition coefficient. Extensive numerical experiments suggest that AGBT 20 outperforms all other existing methods for all these molecular predictions. 21

boosted decision tree (GBDT) and deep neural networks (DNNs), including single-task DNN (ST-DNN, Figure S7a) and multitask DNN (MT-DNN, Figure S7b). Our training follows the traditional pipeline. [37] 107 To evaluate the variance from machine learning predictions, we repeat our calculations 20 times on each set 108 of parameters and use the average result as the final prediction. In this work, squared Pearson correlation 109 coefficient (R 2 ) and root-mean-square error (RMSE) are used to assess the accuracy of predictions. Further 110 details on our AGBT model are given in the Section 5 and Section S2 of the Supplementary Information. 111 Toxicity prediction Toxicity, a critical issue to consider in drug lead optimization, measures the degree 112 to which a chemical compound can affect an organism adversely.
[2] Indeed, toxicity and side effect are 113 responsible for more than half of drug candidate failures on their path to the market. [38] The LC50DM set  fine-tuning procedure for task-specific data. We applied MT-DNN in the downstream task and obtained 122 R 2 =0.830 and RMSE=0.743. As shown in Figure 2b, our model yields the best result, which is over 13% 123 better than the previous best score of R 2 = 0.733. This result indicates the power of our method.

124
The IGC50 set is the second-largest toxicity set and its toxicity values range from 0.334 −log 10 mol/L 125 to 6.36 −log 10 mol/L [2]. As shown in Figure 2a,  framework not only can overcome overfitting problem, but also is not sensitive to dataset size.

130
The oral rat LD50 set measures the number of chemicals that can kill half of the rats when orally 131 ingested. [43,35,36] This dataset is the largest set among the four sets with as many as 7413 compounds.   Figure S8c.  to the best performance of 0.783 obtained by BT-FPs. This indicates that the AGBT model can produce  can be significantly improved from R 2 0.587 to 0.781.

217
The above discussion indicates that SSL can acquire general molecular information and universal molec-218 ular descriptors without the guidance of labels. In downstream tasks, the MT-DNN model can also help to 219 extract the task-specific information from related data. As for extremely small datasets, such as the LC50DM matrices [24].

298
As shown in Figure 3d, for a given molecule, we first construct element-specific colored subgraphs using selected subsets of atomic coordinates as vertices, where E = {H, C, N, O, S, P, F, Cl, Br,...} is a set of commonly occurring element types for a given dataset.

301
And ith atom in a N -atom subset is labeled both by its element type α i and its position r i . We denote all the 302 pairwise interactions between element types E k1 and E k2 in a molecule by fast-decay radial basis functions where r i − r j is the Euclidean distance between ith and jth atoms in a molecule, r i and r j are the atomic 304 radii of ith and jth atoms, respectively, and σ is the mean standard deviation of r i and r j in the dataset.
305 Figure 3e gives the illustration of Laplace and adjacency matrices based on the weighted colored subgraph.

306
For the prediction of toxicity, van der Waals interactions are much more critical than covalent interactions 307 and thus the distance constraint ( r i − r j > r i + r j + σ) is used to exclude covalent interactions. In 308 biomolecules, we usually choose generalized exponential functions or generalized Lorentz functions as Ψ,

309
which are weights between graph edges [46]. Here, η k1k2 in the function is a characteristic distance between 310 the atoms and thus is a scale parameter. Therefore, we generate a weighted colored subgraph G(V, W). In 311 order to construct element-specific molecular descriptors, the multiscale weighted colored subgraph rigidity 312 is defined as where µ G i (η k1k2 ) is a geometric subgraph centrality for the ith atom. [47] The summation over j µ G i (η k1k2 ) 314 represents the total interaction strength for the selected pair of element types E k1 and E k2 , which provide 315 the element-specific coarse-grained description of molecular properties. By choosing appropriate element 316 combinations k 1 and k 2 , the characteristic distance η k1k2 , and subgraph weight Ψ, we finally construct 317 a family of element-specific, scalable (i.e., molecular size independent), multiscale geometric graph-based 318 molecular descriptors [24].

319
To generate associated algebraic graph fingerprints, we construct corresponding graph Laplacians and/or 320 adjacency matrices. For a given subgraph, its matrix representation can provide a straightforward description 321 of the interaction between subgraph elements. To construct a Laplacian matrix, we consider a subgraph G k1k2 322 for each pair of element types E k1 and E k2 and define an element-specific weighted colored Laplacian matrix 323 L(η k1k2 ) as [24] 324 L ij (η k1k2 ) = −Ψ( r i − r j ) if i = j, α i = E k1 , α j = E k2 and r i − r j > r i + r j + σ; minimum, maximum, average, and standard deviation of nontrivial eigenvalues. Note that the Fiedler value 334 is included as the minimum.

335
Similarly, an element-specific weighted adjacency matrix can be defined by 336 A ij (η k1k2 ) = Ψ( r i − r j ) if i = j, α i = E k1 , α j = E k2 and r i − r j > r i + r j + σ; Mathematically, adjacency matrix A ij (η k1k2 ) is symmetrical non-negative matrix. The spectrum of the 337 proposed element-specific weighted colored adjacency matrix is real. A set of element-specific weighted 338 labeled adjacency matrix-based molecular descriptors can be obtained by the statistics of {λ A i } i=1,2,3,.. , i.e., such as four toxicity datasets in the present study (Table S4)