Quantitative Toxicity Prediction via Ensembling of Heterogeneous Predictors

Background: Representing molecules with only one type of features and using those features to predict their activities is one of the most common approaches in machine-learning-based chemical activity prediction. For molecular activities such as quantitative toxicity, the performance depends on the type of features extracted and the machine learning approach used. In such cases, using a single type of features and a single machine learning model restricts the prediction performance to the specific representation and model used. Results: In this paper, we study quantitative toxicity prediction and propose a machine learning model for it. Our model uses an ensemble of heterogeneous predictors instead of the typical homogeneous predictors. The predictors that we use vary either in the type of features used or in the deep learning architecture employed. Each of these predictors presumably has its own strengths and weaknesses in terms of toxicity prediction. Our motivation is to build a combined model that utilizes different types of features and architectures to obtain a collective performance beyond that of each individual predictor. We use six predictors in our model and test it on four standard quantitative toxicity benchmark datasets. Experimental results show that our model outperforms state-of-the-art toxicity prediction models in 8 out of 12 accuracy measures. Conclusion: Our experiments show that ensembling heterogeneous predictors improves performance over single predictors and over homogeneous ensembles of single predictors. The results show that each data representation or deep-learning-based predictor has its own strengths and weaknesses; thus, a model ensembling multiple heterogeneous predictors can go beyond the individual performance of each data representation or predictor type. Code Availability: Our implementation of the proposed model is freely available from our GitHub repository at https://github.com/Abdulk084/HPE.


Keywords: Deep Learning; Ensembling; Quantitative Toxicity

Background
Every year a great number of chemical compounds are produced. A large number of them are suspected to be toxic, and many of them are eventually proved so. Toxicity is the degree to which a chemical compound can harm humans or animals. The main metrics employed to measure the toxicity of chemical compounds are the concentration of the compounds and the time of their exposure to the organism [1]. The concentration of compounds is measured by experiments known as endpoint measuring experiments. Toxicity endpoints are mainly either qualitative or quantitative. Qualitative endpoints categorize chemical compounds into two groups: toxic and nontoxic. Quantitative endpoints, on the other hand, record the minimal amount of a chemical compound that reaches a given lethal effect. In this paper, we study quantitative toxicity prediction.
Toxicity predictions, similar to predictions of various other characteristics of chemical compounds, are traditionally performed by in-vivo or in-vitro techniques. However, these techniques are time-consuming and cost-intensive, and they raise ethical concerns because of the involvement of animals. To address these issues, in-silico (computer-aided) methods have recently attracted much attention because they are efficient in time and cost without compromising much on accuracy. There exist many in-silico methods, but the quantitative structure activity relationship (QSAR) method is one of the most successful ones. The main intuition behind the QSAR method is that molecules that are similar in structure should have similar activities and properties.
Therefore, studying the relationships between chemical structures and biological activities of existing chemicals enables prediction of the activities and properties of new chemicals.
QSAR modelling using deep learning techniques has become widely accepted in recent years [2]. Most of these methods work on features generated from the text format of the molecules. The text format is a chemical language named the simplified molecular-input line-entry system (SMILES), which describes the chemical structure of a molecule as a string of characters [3]. SMILES strings follow a special grammar in which different characters represent atoms or the bonds among them. SMILES strings are utilized to obtain various types of numerical features (e.g. physicochemical descriptors) and molecular graphs by using different featurization methods [4,5]. Traditional machine learning approaches such as K-Nearest Neighbours (KNN), Support Vector Machines (SVM), Random Forest (RF), and Fully Connected Neural Networks (FCNN) are based on numerical features, particularly when used to predict the activity or toxicity of a chemical compound [6]. Besides numerical features, SMILES strings can also be used to generate molecular graphs or images, which can then be used in various types of convolutional neural networks (CNN) to predict molecular activities [7]. Using CNNs on molecular graphs or images requires relatively little domain expertise. It should be noted that SMILES strings can also be transformed into a vector representation or into their respective fingerprints (bit strings composed of 0's and 1's) to be used in Recurrent Neural Networks (RNN) for molecular activity prediction [8].
Recently, in the area of quantitative toxicity prediction, a specialized type of features called element-specific topological descriptors (ESTDs) was used in deep neural networks and consensus models by TopTox to predict quantitative toxicity levels [9]. Another recent work, AdmetSAR, used molecular fingerprints to predict toxicity values with RF, SVM, and KNN models [10]. Yet another work, Hybrid2D, used joint optimization of shallow neural networks and decision trees on 2D features only to predict toxicity measurement levels [11].
The performance of all these quantitative prediction methods is restricted by the specific type of features or model used in prediction.
In this paper, we propose a model comprising an ensemble of heterogeneous predictors (HPE). HPE uses six different deep learning methods, hereafter called predictors, to predict the regression values of four benchmark quantitative toxicity datasets. These predictors are: (1) fully connected physico-chemical (FCPC), (2) fully connected physico-chemical extended (FCPCe), (3) convolution 1D SMILES (C1DS), (4) convolution 2D fingerprints (C2DF), (5) molecular graph convolution (MGC), and (6) molecular weave convolution (MWC). FCPC and FCPCe are fully connected neural networks, C1DS and C2DF are two types of convolutional neural networks, and MGC and MWC are two types of graph convolutional networks. In our HPE model, we ensemble the outputs of these predictors to achieve the overall performance. It should be noted that these predictors vary (their heterogeneity) at the class, architecture, or feature level, as shown in Table 3. For instance, FCPC and FCPCe vary at the feature level only: they both use numerical features (different only in number) but share the same architecture. C1DS and C2DF vary at both the architecture and the feature level: C1DS uses SMILES directly as input while C2DF first converts SMILES into fingerprints. MGC and MWC also vary at both the architecture and the feature level. The details of these predictors are given in the Methods section. Thus, by introducing heterogeneity in each predictor with respect to the others, we were able to build a single model that utilizes different types of features and architectures to obtain a collective performance beyond the individual performance of any single predictor type. On four benchmark quantitative toxicity datasets, our proposed method obtains significantly better accuracy than the state-of-the-art quantitative toxicity prediction methods in 8 out of 12 metrics. In terms of datasets, we outperform all other methods on IGC50, LD50 and LC50-DM.
Moreover, our experiments also show that HPE model significantly improves the performance over individual predictors and their homogeneous ensembling for all four quantitative toxicity datasets.

Results
We report the prediction results of the proposed HPE model. We have compared the proposed model with each of the single predictors (i.e., FCPC, FCPCe, C1DS, C2DF, MGC and MWC) used in our HPE model, and also with their homogeneous ensembles. The homogeneous ensemble (Hom) of each predictor is obtained by ensembling the individual predictor with itself six times. We also compared the proposed model against the best-performing models known in the literature: TopTox [9] and the methods used in the development of the TEST software [12], which are based on hierarchical and nearest-neighbor methods. Table 1 presents the prediction results of the individual predictors, their homogeneous ensembles (Hom), and our final model HPE on four datasets using three metrics. It should be noted that HPE is the ensemble of all six predictors. Comparing the columns Ind and Hom for each dataset and each metric, we see that each Hom performs better than the corresponding individual predictor (Ind). These results are expected, and we include them to reaffirm the strength of homogeneous ensembles. Our main results come from ensembling heterogeneous predictors, i.e., HPE. Comparing the columns Hom and HPE, we see that HPE outperforms the homogeneous ensembles in all metrics on all datasets. The difference is in the range of 0.018-0.084, with an average of 0.03825 on a scale of 1.00. This clearly demonstrates the strength of HPE over the homogeneous ensembles.

Comparison of HPE against Individual Predictors and their Homogeneous Ensembles
As can be seen from Table 1, on all four datasets and in all three metrics, the proposed HPE model outperforms all six predictors. These results confirm that a heterogeneous predictor ensembling (HPE) model using six different predictors is better than any single predictor. The results show that each data representation or neural network type has its own strengths and weaknesses; thus, a model ensembling multiple predictors can go beyond the individual performance of each data representation or neural network type. For further clarification, the results are discussed below for each dataset in detail.
• IGC50: the proposed HPE obtained a correlation coefficient (R²) of 0.831, an RMSE of 0.426 log(mol/L), and an MAE of 0.182 log(mol/L). Among the six individual predictors, MGC obtained the best R² value of 0.782, while FCPC obtained the best RMSE and MAE values of 0.472 log(mol/L) and 0.223 log(mol/L), respectively. HPE improves R² by 6.26% and 4.52%, RMSE by 9.74% and 7.59%, and MAE by 9.03% and 8.73% over the best Ind and the best Hom, respectively.
• LD50: the proposed HPE model obtains better results in all three metrics, with an R² of 0.680, an RMSE of 0.536 log(mol/L), and an MAE of 0.407 log(mol/L). HPE improves R² by 7.59% and 4.61%, RMSE by 10.96% and 4.79%, and MAE by 8.94% and 4.23% over the best Ind and the best Hom, respectively.
• LC50-DM: as the table shows, for this dataset the proposed HPE model also obtains better results in all three metrics: an R² of 0.811, an RMSE of 0.787 log(mol/L), and an MAE of 0.620 log(mol/L). HPE improves R² by 8.13% and 6.29%, RMSE by 3.14% and 2.95%, and MAE by 8.01% and 5.05% over the best Ind and the best Hom, respectively.
• LC50: for this dataset, the proposed HPE obtained an R² of 0.742, an RMSE of 0.788 log(mol/L), and an MAE of 0.621 log(mol/L). HPE improves R² by 7.53% and 4.50%, RMSE by 8.26% and 6.52%, and MAE by 15.85% and 11.91% over the best Ind and the best Hom, respectively.

Evaluation of HPE model against several best-performing models
Having established the effectiveness of the proposed HPE model over the individual predictors and Hom, we now examine its performance against the state-of-the-art algorithms in the literature: the models used in the development of the TEST software [12], TopTox [9], and Hybrid2D [11]. The results are shown in Table 2. As can be seen, of the total 12 metrics, the proposed HPE model obtains the best results in 8; in particular, on two of the datasets it dominates the other algorithms by obtaining better results in all three metrics. The detailed results are discussed below.
• IGC50: for this dataset, the TEST consensus obtained the highest R² among the models in the TEST software, with 0.764, while the TopTox model achieved an R² of 0.802. The proposed model obtained an R² of 0.831, which is better than all six compared models including TopTox. The proposed model also obtained better RMSE and MAE values of 0.426 log(mol/L) and 0.282 log(mol/L), respectively.
• LD50: for this dataset, the proposed model dominates the other algorithms in all three metrics, with an R² of 0.680, an RMSE of 0.536 log(mol/L), and an MAE of 0.407 log(mol/L). The results of the TopTox model were better than those of the TEST software models in all three metrics, but worse than those of the proposed model.
• LC50-DM: for R² and RMSE, the proposed model obtained 0.811 and 0.787 log(mol/L), which is better than all other compared models. For MAE, the proposed model obtained 0.620 log(mol/L), which is better than all other models except TopTox with 0.592 log(mol/L).
• LC50: as the table indicates, the proposed model obtained better R² results than the six compared models except TopTox [9], which has an R² of 0.788. TopTox [9] also obtained better results in terms of RMSE and MAE, with 0.677 log(mol/L) and 0.446 log(mol/L), respectively.

Discussion
Representing molecules with a single type of representation and then using homogeneous modeling techniques might not capture all of the information about a molecule. For instance, a basic molecular graph representation does not capture the quantum mechanical structure of a molecule or necessarily express all of its information.
Similarly, models that use molecular graphs as input, such as graph convolutions, cannot distinguish between chiral molecules (molecules having the same graph structure but being mirror images of each other). With fingerprints as input, it is also possible that different molecules have identical fingerprints, which makes it difficult for a model that takes only fingerprints as input to distinguish them. There is also some information loss when one type of features is converted into another.
In our experiments on the quantitative toxicity datasets, HPE obtains the highest performance, followed by Hom and then the individual predictors. The percentage improvement of HPE over Hom and Ind on all four datasets indicates that the various predictors might be learning different knowledge from the same dataset. As can be seen in Table 1, graph-based predictors like MGC and MWC achieve better performance in most of the metrics and datasets. Specifically, in maximizing R² for IGC50, LD50, and LC50, MGC produces the best results, whereas for LC50-DM, MWC produces the best results. The quantitative toxicity datasets considered in this study contain relatively small molecules, which makes them more suitable for graph-based predictors. The second-highest performers on average are FCPCe and FCPC, which use features based on physico-chemical properties. These features have proved to have high predictive power in the literature. It can be noticed that predictors like C1DS and C2DF struggle to perform compared to the other predictors. Yet, when all of them are ensembled to form the HPE model, they help improve the results.
Even though ensembling various heterogeneous predictors enhances the overall accuracy, it would be interesting to see the commonality between the learnt representations of the individual predictors and to what degree one predictor's captured knowledge differs from the others'.

Conclusion
Toxicity prediction methods for chemical compounds have recently achieved enhanced accuracy after the introduction of various deep learning models in this space. Usually, molecules are represented in a fixed representation which is then used as features with a specific machine learning method to predict toxicity. Among the various types of compound toxicity, quantitative toxicity measurement is of paramount importance in pharmaceuticals. The performance of any quantitative toxicity prediction method depends upon the specific features and model used, which restricts the overall performance to a single type of features and a single model. In this paper, we propose a method which uses heterogeneous predictor ensembling (HPE) to achieve better performance in quantitative toxicity prediction on four benchmark datasets. To eliminate the restriction of a model- and representation-bound approach, each predictor in our model varies either at the feature level, at the deep learning architecture level, or both. These predictors are FCPC, FCPCe, C1DS, C2DF, MGC and MWC. FCPC and FCPCe vary at the feature level only: they both use numerical features (different only in number) but share the same architecture. C1DS and C2DF vary at both the architecture and the feature level: C1DS uses SMILES directly as input while C2DF first converts SMILES into fingerprints. Molecular graph convolution (MGC) and molecular weave convolution (MWC) also vary at both the architecture and the feature level. Our motivation is to build a single model that utilizes different types of features and architectures to obtain a collective performance beyond the individual performance of any single predictor type. Our experiments also showed that the heterogeneous ensembling method performs better than ensembling homogeneous predictors. We achieved better performance in 8 out of 12 accuracy metrics on the four quantitative toxicity datasets.

Methods
In this section, we first give an overview of the four datasets used in this paper. Then, we discuss the evaluation criteria for the individual predictors and their ensembles. Lastly, we present the implementation details of the individual predictors of our HPE model.

Data Sets
A mathematical representation of toxicity is the simplest way to understand the unwanted effect of a given compound on human health. These mathematical formulas for toxicity are based on two factors: (i) dose and (ii) time. These two factors together formulate the quantitative toxicity of a compound. Quantitative toxicity, due to its mathematical characteristics, is not only easy to understand but has also proven compatible with prediction algorithms. This study considers four toxicity datasets. The datasets have different endpoint indicators, each a quantitative number that represents toxicity. LC50, IGC50 and LD50 are the three different endpoints used across the four datasets; two datasets (LC50 and LC50-DM) share the same endpoint, LC50, but are tested on different animals. All three measures are used in toxicology for estimating the toxicity behaviour of a given chemical compound. LC50 and LD50 are the concentration and the dose, respectively, of the compound that kills half of the tested population. LC50 is mostly used as an endpoint for testing compounds on aquatic animals, while LD50 is used to indicate toxicity in laboratory mice/rats. LD50 depends on the route of administration: oral administration could cause less toxicity than the intravenous route. IGC50 is the third measure, also used on aquatic animals, but instead of showing the completely neutralizing effect, it shows the growth-inhibition effect on the tested population. In addition to the concentration, these measures also depend on the duration of exposure of the organism to the compound. Thus, each dataset specifies not only the type of endpoint but also the exposure time.
The first dataset is the 96 hr LC50 record on fathead minnow, the second dataset is the 46 hr LC50-DM record on Daphnia magna, the third dataset is the 40 hr IGC50 record on Tetrahymena pyriformis, and the fourth dataset is the oral LD50 record on rat populations. The concentration of compounds in LC50 was measured in milligrams. The datasets were obtained from the work of Wu et al. [9], while the original repositories are available at http://cfpub.epa.gov/ecotox/ and http://chem.sis.nlm.nih.gov/chemidplus/chemidheavy.jsp. The datasets differ in size, ranging from hundreds to thousands of molecules: LC50-DM contains 353, LC50 contains 823, IGC50 contains 1792, and LD50 contains 7413 molecules.

Evaluation Criteria
To evaluate our predictors, we used K-fold cross validation with K = 10. The data was split into 10 equal random parts; one part was kept for testing while the other 9 parts were used for training, and this process was repeated 10 times. All reported results represent the average over the 10 folds.
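The splitting scheme above can be sketched as follows; the helper names and the random seed are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Split sample indices into k roughly equal random folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    return np.array_split(indices, k)

def cross_validate(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs, one per fold."""
    folds = kfold_indices(n_samples, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Example: the 823 molecules of the LC50 set split into 10 folds
splits = list(cross_validate(823, k=10))
```

Each metric would then be computed on the held-out fold and averaged over the ten iterations.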
We have used three evaluation metrics for reporting the performance of our HPE and the individual predictors. The first metric is the mean absolute error (MAE), calculated as

\[ \mathrm{MAE} = \frac{1}{n} \sum_{j=1}^{n} \left| y_j - \hat{y}_j \right|, \]

where $y_j$ is the actual observation, $\hat{y}_j$ is the prediction, and $n$ is the size of the test set. This metric calculates the average absolute error over the test dataset; all errors have equal weights.
The second metric is the root mean squared error (RMSE), which calculates the square root of the average of the squared errors:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( y_j - \hat{y}_j \right)^2 }. \]

In this metric the errors are squared, so large errors receive higher weights; the metric is more useful when large errors are undesirable.
Both of the above-mentioned metrics range from 0 to ∞, and the lower the MAE and RMSE values, the better the model performance.
The third metric used in the paper is the correlation coefficient R², calculated as

\[ R^2 = 1 - \frac{\sum_{j=1}^{n} \left( y_j - \hat{y}_j \right)^2}{\sum_{j=1}^{n} \left( y_j - \bar{y} \right)^2 }, \]

where ȳ is the average of the actual observations. This metric explains the relationship between the prediction and the actual observation. It varies between 0 and 1, and the higher the value of R², the better the model's performance.
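The three metrics can be computed directly; a minimal NumPy sketch, using the coefficient-of-determination form for R² (the paper's exact formula is not shown, so this form is an assumption):

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error: all errors weighted equally
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Root mean squared error: large errors weighted more heavily
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    # R^2 relative to the mean of the actual observations
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Illustrative values, not taken from any of the datasets
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
```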
In our HPE model, we ensemble six different deep-learning-based predictors to achieve the overall performance. It should be noted that these predictors vary (their heterogeneity) at the class, architecture, or feature level, as shown in Table 3. We used an ensemble averaging method to combine the output of each individual predictor and to compute the final output of our model.
We refer the reader to [13] for the concepts and mathematics of deep learning and neural networks. In the rest of this section, we explain these predictors in terms of their classes, architectures and features. FCPC and FCPCe vary at the feature level only; C1DS and C2DF vary at both the architecture and feature levels; MGC and MWC also vary at both the architecture and feature levels.
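The ensemble averaging step reduces to a per-molecule mean over the six predictor outputs; a minimal sketch (the predictor outputs below are illustrative values, not real results):

```python
import numpy as np

# Hypothetical outputs of the six predictors for three molecules
# (rows: FCPC, FCPCe, C1DS, C2DF, MGC, MWC; columns: molecules)
predictions = np.array([
    [0.50, 1.20, -0.30],
    [0.55, 1.10, -0.25],
    [0.40, 1.00, -0.40],
    [0.45, 1.05, -0.35],
    [0.60, 1.25, -0.20],
    [0.50, 1.15, -0.30],
])

# HPE output: unweighted average across the six heterogeneous predictors
hpe_output = predictions.mean(axis=0)
```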

Fully Connected Physico-Chem (FCPC) and Fully Connected Physico-Chem ext (FCPCe)
The first challenge in any machine learning algorithm is selecting a specific representation of the training data. The most common type of representation is numerical features, for which a standard fully connected neural network is typically used. A neural network in which each unit of each layer is connected to all the units of the next layer is termed a fully connected neural network (FCNN). An FCNN operates on a fixed-shape input by passing information through multiple non-linear transformations. The first two predictors of our method (FCPC and FCPCe) use standard fully connected neural networks, as shown in Figure 1. The FCNN in both the FCPC and FCPCe predictors consists of 10 layers with 1000 neurons in each layer. The final layer consists of a single unit with a linear activation function. A sigmoid non-linear activation function is used after each layer except the final one, followed by dropout with a rate of 0.5. The learning rate was kept at 5e-6 with a batch size of 32, and optimization was performed using the ADAM optimizer [14]. Both of these predictors are built using the Keras deep learning framework on a system with an Nvidia Tesla K40 GPU [15].
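The layer structure described above (affine transform, sigmoid activation after each hidden layer, a single linear output unit) can be sketched as a NumPy forward pass. Dimensions are greatly reduced from the paper's 1000-unit, 10-layer network for illustration, and dropout (a training-time operation) is omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fcnn_forward(x, hidden_weights, output_weights):
    """Forward pass: sigmoid after each hidden layer, linear output unit."""
    h = x
    for w, b in hidden_weights:
        h = sigmoid(h @ w + b)
    w_out, b_out = output_weights
    return h @ w_out + b_out  # single linear unit -> regression value

rng = np.random.default_rng(0)
n_features, n_hidden, n_layers = 8, 16, 3  # reduced from 1148 / 1000 / 10
hidden, dim = [], n_features
for _ in range(n_layers):
    hidden.append((rng.normal(size=(dim, n_hidden)) * 0.1, np.zeros(n_hidden)))
    dim = n_hidden
output = (rng.normal(size=(n_hidden, 1)) * 0.1, np.zeros(1))

x = rng.normal(size=(4, n_features))  # batch of 4 molecules
y = fcnn_forward(x, hidden, output)
```

In practice this structure would be expressed as a stack of Keras Dense layers; the sketch only shows the data flow.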
In the FCPC component of our model, we used only 2D physico-chemical features. These features are numerical in nature and are computed using the PaDEL descriptor [4]. Out of a total of 1444 features, we computed 1148 because PaDEL fails to compute features for large molecules due to time and memory constraints. For the FCPCe component, we extended the feature set to 3D (a total of 1826 features) using the Mordred descriptor [16]. It should be noted that 580 features are common between these two sets of physico-chemical features, given in the supplementary as additional information.

Convolution 1D SMILES (C1DS) and Convolution 2D Fingerprint (C2DF)
A convolutional neural network (CNN) is a special type of neural network for image data. CNNs extract low-level features from images and compute increasingly complex features deeper in the network [17]. CNN variants such as Inception, AlexNet and ResNet have been developed and employed as highly accurate image classification models [18]. A 1D convolution is a special type of convolution that applies the convolution operation over one dimension, such as sequence or time series data, as opposed to a 2D convolution which works on two-dimensional data such as images. It should be noted that there is another type of specialized neural network, the recurrent neural network (RNN), which also works on sequential data but suffers from high computational cost compared to a 1D convolutional neural network [19].
We developed a 1D convolutional neural network (C1DS) as the third predictor of our model. C1DS was trained directly on the SMILES strings of the molecules. SMILES is a chemical language that describes the chemical structure of a molecule as a string of characters [3]. SMILES strings have a special grammar in which different characters represent atoms or bonds between the atoms. For instance, a lowercase c represents an aromatic carbon whereas an uppercase C represents an aliphatic carbon. To represent a single or double bond between atoms, special characters like - and = are used between the atom characters. An example of a SMILES string is COc(c1)cccc1C#N, which represents 3-cyanoanisole.
The architecture of the C1DS predictor is shown in Figure 2a. The SMILES strings of molecules are of different lengths, so we pad each SMILES string with "0" to make them all equal to the length of the longest string in the particular dataset. The longest SMILES string has 52, 103, 75 and 181 characters for IGC50, LC50-DM, LC50 and LD50, respectively. Each character of the SMILES string is encoded into a numerical value; thus we obtain an equal-length vector for each SMILES string to be used in the 1D convolution predictor. This fixed-dimensional feature vector goes into the embedding layer of the predictor, where each integer value is embedded into a 400-dimensional vector, creating a matrix of shape [maximum length of a SMILES string, 400]. This matrix is trained along with the rest of the model. After the embedding layer, we applied three 1D convolution layers, each with 192 filters, with kernel sizes of 10, 5 and 3, respectively. A ReLU activation function and batch normalization are used after each convolution layer. After flattening, a fully connected dense layer with 100 units is applied, followed by a ReLU activation and a dropout of 0.5. The output layer is a single neuron with a linear activation function. The learning rate, batch size and optimization algorithm are kept the same as those of the FCPC and FCPCe components. We used Keras with an Nvidia Tesla K40 GPU to build the convolution 1D SMILES predictor [15].
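The padding-and-encoding step can be sketched as follows; the character-to-integer mapping here is built from the data at hand, since the paper does not specify its exact vocabulary:

```python
def encode_smiles(smiles_list, pad_value=0):
    """Map each character to an integer and zero-pad to the longest string."""
    # Build a vocabulary from the characters present; 0 is reserved for padding
    chars = sorted(set("".join(smiles_list)))
    char_to_int = {c: i + 1 for i, c in enumerate(chars)}
    max_len = max(len(s) for s in smiles_list)
    encoded = []
    for s in smiles_list:
        vec = [char_to_int[c] for c in s]
        vec += [pad_value] * (max_len - len(s))
        encoded.append(vec)
    return encoded, max_len

# The 3-cyanoanisole example from the text plus a shorter molecule (ethanol)
vectors, max_len = encode_smiles(["COc(c1)cccc1C#N", "CCO"])
```

The resulting integer vectors are what the embedding layer consumes, one 400-dimensional embedding per integer.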
As described before, a 2D convolutional neural network is a special type of neural network for two-dimensional data such as images. This predictor of our model, C2DF, is based on a 2D convolutional neural network inspired by the FP2VEC model [20], as shown in Figure 2b. Each SMILES string in all four datasets is first converted into its respective fingerprint: we used RDKit to convert the SMILES strings into 1024-bit Morgan fingerprints with radius 2 [21]. Fingerprints are bit strings composed of 0's and 1's; a position holding a 1 represents a chemical feature defined by the specific design of the fingerprint [22]. We computed a fingerprint indices vector by taking only those indices of the fingerprint with the value "1". The length of the fingerprint indices vector was set to 92, as the maximum number of "1"s in the 1024-bit fingerprint of any molecule is 92 across the four datasets under consideration. Molecules with fewer than 92 "1"s in their 1024-bit fingerprint were padded with zeros at the end. Thus we obtain a fixed-length vector, called the fingerprint indices vector, for each 1024-bit fingerprint. This fixed-length vector goes into the embedding layer of the C2DF predictor. Similar to C1DS, each integer value of the fingerprint indices vector is embedded into a 400-dimensional vector, creating a matrix of shape [92, 400]. As in C1DS, this matrix is trained along with the rest of the model.
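The fingerprint-indices construction can be sketched without RDKit by operating on a bit vector directly; the 1024-bit fingerprint below is illustrative, and, following the paper, padding uses zero even though 0 is also a valid bit position:

```python
def fingerprint_indices(bits, max_ones=92, pad_value=0):
    """Collect the positions of set bits and zero-pad to a fixed length."""
    idx = [i for i, b in enumerate(bits) if b == 1]
    assert len(idx) <= max_ones, "more set bits than the padding budget"
    return idx + [pad_value] * (max_ones - len(idx))

# Toy 1024-bit fingerprint with bits 3, 17 and 1023 set
fp = [0] * 1024
for i in (3, 17, 1023):
    fp[i] = 1
vec = fingerprint_indices(fp)
```

In the actual pipeline the bit vector would come from RDKit's 1024-bit Morgan fingerprint of each molecule.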
Unlike C1DS, in C2DF we used a 2D convolution layer followed by a maxpool layer. The output of the embedding layer in C2DF is fed into a 2D convolutional layer with 2024 filters, each of size [4, 400]. A maxpool layer with a kernel size of 89 is applied, followed by a dense layer with 100 units. The rest of the hyperparameters were kept the same as those of C1DS. It should be noted that parameters such as the embedding size (chosen to be 400 for both C1DS and C2DF), filter/kernel sizes, number of filters, learning rate, batch size and optimizer type were inspired by previously published research [20,11,23,24,25] and initial experimentation.

Molecular Graph Convolution (MGC) and Molecular Weave Convolution (MWC)
Molecular Graph Convolution (MGC) and Molecular Weave Convolution (MWC) belong to the third category of our developed predictors. They share similar features and classes but have different architectures, as given in Table 3. As the name suggests, Graph Convolution Networks (GCN) are inspired by convolutional neural networks, redefining them for graphs instead of typical pixel-based images [26]. Typical neural networks such as fully connected, recurrent and convolutional neural networks extract latent representations in Euclidean space, but they fail to work efficiently on graph data [27]. For instance, in chemistry, a molecule can be represented as a molecular graph, where the nodes represent the atoms and the edges represent the bonds, as shown in Figure 3. MGC and MWC are graph convolutional neural networks trained on molecular graphs as input data. Conceptually, MGC requires only the structure or graph of a molecule and a feature vector for every atom (A) that describes the surrounding local chemical environment, whereas MWC requires pair features (P) as well.
The SMILES string of each molecule is converted into its respective molecular graph using RDKit [21]. Atom features such as atom type, chirality, formal charge, partial charge, ring sizes, hybridization, hydrogen bonding and aromaticity are computed using the DeepChem library [28,29]. Pair features include bond type, graph distance, and same-ring membership, as described in previous papers [28,29]. The MGC predictor applies convolution layers to a central atom and its surrounding atoms, thus capturing the local chemical environment. In contrast, the MWC predictor applies global convolutions to a central atom together with all other atoms in a molecule, while also taking into account their corresponding pair features. We used MGC and MWC as two predictors in our HPE model from the DeepChem library with default settings [30]. The specific architecture details of both MGC and MWC can be found in the original molecular graph convolution paper by Google and in the DeepChem open source library [28,29,30].
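A single local aggregation step of the MGC flavor (each atom combining its own features with those of its bonded neighbors) can be sketched on a toy molecular graph. The feature values and the mean-aggregation rule are illustrative; the actual MGC layers are learned transformations from the DeepChem library:

```python
import numpy as np

def graph_conv_step(atom_features, adjacency):
    """One local aggregation: each atom averages itself with its neighbors."""
    n = len(atom_features)
    new_features = np.zeros_like(atom_features)
    for i in range(n):
        neighbors = [j for j in range(n) if adjacency[i][j]] + [i]
        new_features[i] = atom_features[neighbors].mean(axis=0)
    return new_features

# Toy graph for water (H-O-H): node 0 = O, nodes 1 and 2 = H
adjacency = np.array([[0, 1, 1],
                      [1, 0, 0],
                      [1, 0, 0]])
atom_features = np.array([[8.0, 2.0],   # e.g. atomic number, degree
                          [1.0, 1.0],
                          [1.0, 1.0]])
out = graph_conv_step(atom_features, adjacency)
```

Stacking such steps lets information propagate further across the graph, which is how the local chemical environment of each atom is captured.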