Topological Distance-Based Electron Interaction Tensor: A Novel Molecular Structure Representation to Bridge Convolutional Neural Network Studies in Computer Vision to Drug-Like Compound Datasets


 Owing to the success achieved by deep learning, researchers are exploring its application in drug discovery to improve the accuracy of prediction models. Diverse convolutional neural network (CNN) models have achieved significant performance improvements in computer vision, and preparing an input format suitable for CNNs is one of the major questions that must be answered in order to harness these advancements for chemical data. Deep learning studies on molecular structure data have reported improvements in prediction accuracy; however, the improvements were insufficient from an industry perspective. Furthermore, a recent study suggested that conventional machine learning models can outperform deep learning models on chemical data. As only a limited number of feature calculation methods are available for molecules in deep learning studies, it is crucial to develop more methods to calculate features appropriate for deep learning model development. In this study, a topological distance-based electron interaction (TDEi) tensor is introduced to transform a molecular structure into an image-like 3D array based on the electron interactions (Eis) within a molecule; Ei is the fundamental level of information that determines the chemical properties of a molecule. The prediction accuracy of CNN models using the TDEi tensor was tested with four datasets: MP (275,131 compounds), Lipop (4,193), Esol (1,127), and Freesolv (639), and the models achieved desirable prediction accuracy. Because the CNN exhibited outstanding performance in automatic feature extraction, feature space variation was visualized by taking outputs from the middle of the CNN architecture. The correlation between CNN-extracted features and target endpoints strengthened as outputs were extracted from deeper layers of the CNN.


Introduction
Diverse in silico models have been used in drug discovery projects to reduce the time and cost required for drug development [1,2]. Quantitative structure-activity relationship (QSAR) models are computational models that predict the physicochemical properties, potency, pharmacokinetic properties, and safety of drug candidates from their molecular structures alone [3]. Even though QSAR models have been successful in filtering out poor molecules in the early phase of drug discovery, they have failed to discover good drug candidates based on their prediction outcomes alone, which implies that the prediction accuracy of QSAR models is not yet satisfactory [4]. Most QSAR models have been developed using machine learning (ML) algorithms; however, deep learning algorithms have recently been used in QSAR model development to improve prediction accuracy [5]. Typically, the first attempt to apply deep learning in a new field involves reusing the hyperparameters and architectures studied in other deep learning work. As convolutional neural networks (CNNs) have achieved outstanding performance, previous CNN studies were followed by diverse successors; however, this has not been easy in chemistry because chemical data has a completely different structure from image data.
The application of advanced deep learning algorithms to molecular structures requires the development of novel descriptors [6]. Fully connected artificial neural network architectures, such as feedforward neural network (FNN) models, are commonly used in conventional QSAR model development. In an FNN, molecular descriptors calculated from the molecular structures serve as the model input. As molecular descriptors are calculated in a 1D vector format, the FNN was an appropriate architecture for QSAR model development [7]. Thus, when deep learning was first applied to molecular structures, the easiest approach was to use an FNN with multiple hidden layers on 1D descriptor vectors. Although many more sophisticated neural network architectures have been developed, these architectures cannot be used with a 1D descriptor vector as input. Thus, more research is required to develop novel feature tensors for molecular structure representation that are suitable for advanced neural network architectures, in order to exploit the advances made across the wide range of deep learning research.
Graph neural networks (GNNs) have been widely used to train models directly on molecular structures because molecular geometry can be treated as a graph, with chemical bonds as edges and atoms as nodes [8]. In a GNN, the features of each atom and its neighbor information are used as descriptors [9]. The feature vector of each atom represents the character of the atom based on its microenvironment, and neighbor information is described by the distance between atoms or by connectivity features, such as chemical bond features [10]. In natural language processing, string data are one-hot encoded, and word embeddings are used to encode the tensor before model training. The simplified molecular input line entry system (SMILES) code is a string representation of molecular structures [11] and is broadly used in public databases. As molecular structures can be represented by SMILES, one-hot encoding of each SMILES symbol [12], or SMILES embeddings [13], have been used to generate a matrix representing each molecule as an input for CNN architectures.
Most deep learning studies claim that the application of deep learning algorithms improved prediction accuracy in molecular property prediction [6,9,14-16]. However, Jian et al. experimented with diverse datasets used in deep learning model development studies to compare the prediction accuracy of deep learning and feature-based ML models. That study showed that the feature-based ML models outperformed the deep learning models in terms of prediction accuracy [17]. Moreover, ML models do not require demanding computation in the training process; therefore, feature-based ML algorithms are a much more efficient way of developing models. Unfortunately, the volume of datasets used in deep learning studies on molecular structures is much smaller than that in computer vision. A few datasets contain a large number of compounds; however, their labels are significantly imbalanced [18], which hinders the appropriate training of deep learning models [19]. Given that deep learning models in computer vision achieved significantly improved prediction accuracy owing to both deep learning algorithms and huge amounts of data, deep learning models on molecular structures are still not appropriately validated because of the small size of available datasets. This is a hindrance, as mining molecular structure data is one of the paramount tasks required for the meaningful application of deep learning in chemistry.
In this study, a topological distance-based electron interaction (TDEi) tensor has been developed as a novel molecular representation for CNN architectures, transforming molecular structures into image-like 3D arrays. In the TDEi tensor calculation, each atom is represented by its electron configuration, and the number of interactions between each pair of atomic orbitals is calculated to prepare the TDEi tensor. In quantum mechanics (QM), molecular properties are calculated based on the interactions of electrons within a molecule according to the distances between atoms; similarly, the TDEi tensor was designed on the assumption that a CNN can extract significant features from Eis, through the weights in its filters, to predict molecular properties. The TDEi tensor was designed to be adjustable to the size of the data and the complexity of the molecular structures, by changing the electron configuration vector and the topological distance channels, to avoid generating a sparse tensor. The CNN models developed with the TDEi tensor achieved desirable prediction accuracy, and analysis of the features processed by the CNN filters revealed that the extracted features achieved higher correlation with the target properties when they were obtained from deeper layers.

TDEi tensor calculation
Definition of an electron configuration vector
The TDEi tensor was calculated based on the electron configuration (EC) of the atoms in a molecule. The EC vector was defined in a previous study by assigning zero to each unoccupied atomic orbital (AO) and one to each occupied AO, with the two electron spins marked by positive and negative signs [20]. The EC vector can be varied by combining degenerate AOs or electron spins because these electrons possess identical energy levels. Given that datasets in chemistry are generally much smaller than those in computer vision, reducing sparse information and condensing the feature size are significant for efficient model training and accurate prediction model development. Therefore, sparse or invariable features were integrated to condense the information without loss. Such information condensation was successful in prediction model development with a small dataset [21]. The possible variations of EC vectors are summarized in Fig. 1.
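The full EC vector described above can be sketched as follows. This is an illustrative example, not the author's code: the orbital ordering, the minimal second-row basis (1s, 2s, 2px, 2py, 2pz), and the exact slot layout are assumptions made for this sketch.

```python
# Hypothetical "full" EC vector: one slot per atomic-orbital/spin pair,
# +1/-1 marking the two spins of an occupied orbital, 0 for an unoccupied slot.
ORBITALS = ["1s", "2s", "2px", "2py", "2pz"]  # assumed minimal basis

def ec_vector(electron_counts):
    """Build a full EC vector: two spin slots per orbital, +1 then -1 when paired."""
    vec = []
    for orb in ORBITALS:
        n = electron_counts.get(orb, 0)  # electrons in this orbital (0, 1, or 2)
        vec.append(+1 if n >= 1 else 0)  # spin-up slot
        vec.append(-1 if n >= 2 else 0)  # spin-down slot
    return vec

# Carbon, ground state 1s2 2s2 2p2 (one electron in each of two p orbitals)
carbon = ec_vector({"1s": 2, "2s": 2, "2px": 1, "2py": 1})
print(carbon)  # [1, -1, 1, -1, 1, 0, 1, 0, 0, 0]
```

Condensing the vector, as in Fig. 1B-F, would amount to merging the degenerate 2p slots or the two spin slots into single counts.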

Transformation of molecular structure into tensor shape
In QM calculations, molecular orbitals are calculated through a linear combination of AOs, and the coefficients for each AO are estimated during the calculation of the density matrix, a square matrix whose rows and columns are all the AOs in a molecule. As a molecule can thus be translated into a matrix format with AO information, this concept was adapted in the TDEi tensor design. First, the Ei matrix was designed with rows and columns of a fixed size given by the EC vector, such that every input has an identically sized matrix. The size of the density matrix depends on the number of AOs in the molecule; however, the input shape must be equal regardless of the size of the molecule in order to feed it into the CNN. Because the size of the Ei matrix was fixed and the EC vector was based solely on the composition of a molecule, molecular geometry differences were lost in the matrix; molecules with identical compositions can have different molecular geometries. To account for differences in the topological structure of a molecule, a matrix was generated for each topological distance within the molecule.
The matrix for topological distance 0 is shown in Fig. 2A. Topological distance 0 means the atom itself; thus, the EC vector of a C atom was multiplied as a column by a row in order to count the interactions between all electrons within the C atom. The topological distance 0 matrix is the sum of the Ei matrices of all atoms within a molecule. The matrix for topological distance 1D is explained in Fig. 2B. Here, the pairs of atoms within the molecule were considered to calculate the Eis between them. As the Ei matrix was calculated for all ordered atom pairs in a molecule, the Ei matrices between the C and N atoms were calculated twice in the example. Therefore, they were divided by two, and all Ei matrices for topological distance 1D were added. All Ei matrices with topological distances greater than 1 were calculated in the same way, as explained in Fig. 2B. In QM calculations, chemical bond information is not required because the physical distance between atoms is measured from the coordinates of each atom. A precise 3D geometry of the molecule must be prepared to accurately calculate the physical distances between atoms; however, 3D geometry optimization incurs an expensive computational cost and is not always sufficiently accurate. Hence, 2D structural information alone was used, and the topological distance between atoms was used in the TDEi tensor calculation. The GetDistanceMatrix function implemented in RDKit was used to obtain the topological distances of atoms within a molecule after hydrogens were added to it.
The Ei matrices can thus be calculated for each topological distance. In the example molecule (Fig. 3), atom pairs existed up to a topological distance of 4D. Ei matrices from atom pairs with greater topological distances can be calculated as the size of the molecule increases. Once Ei matrices were prepared for a range of predetermined topological distances, they were concatenated to form the TDEi tensor (Fig. 4). As the TDEi tensor size can be varied via the size of the EC vector and the topological distance, it can be flexibly adjusted according to the size of the data or the diversity of the chemical space.
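The construction above can be sketched end to end. This is an illustrative sketch, not the author's code: the EC vectors are shortened to four hypothetical slots, the molecule is a hand-coded three-atom chain, and the topological distance matrix is computed by Floyd-Warshall on the bond list (the paper uses RDKit's GetDistanceMatrix after adding hydrogens).

```python
import numpy as np

# Toy EC vectors (4 hypothetical slots per atom) for a linear chain 0-1-2
ec = np.array([
    [1, -1, 1,  0],   # atom 0
    [1, -1, 1, -1],   # atom 1
    [1,  0, 0,  0],   # atom 2
])
bonds = [(0, 1), (1, 2)]
n_atoms, k = ec.shape

# Topological (bond-count) distance matrix via Floyd-Warshall
INF = 10**6
dist = np.full((n_atoms, n_atoms), INF)
np.fill_diagonal(dist, 0)
for i, j in bonds:
    dist[i, j] = dist[j, i] = 1
for m in range(n_atoms):
    dist = np.minimum(dist, dist[:, [m]] + dist[[m], :])

def tdei_tensor(ec, dist, max_d):
    """Stack one Ei matrix per topological distance 0..max_d."""
    channels = []
    for d in range(max_d + 1):
        m = np.zeros((k, k))
        for i in range(n_atoms):
            for j in range(n_atoms):
                if dist[i, j] == d:
                    # Outer product counts interactions between all AO slots.
                    # Each unordered pair (i, j), i != j, appears twice in this
                    # double loop, so it is halved, matching the division by two
                    # described for topological distance 1D above.
                    m += np.outer(ec[i], ec[j]) / (1 if d == 0 else 2)
        channels.append(m)
    return np.stack(channels)     # shape: (max_d + 1, k, k)

tensor = tdei_tensor(ec, dist, max_d=2)
print(tensor.shape)  # (3, 4, 4)
```

The channel axis plays the role of the image color channels when the tensor is fed to the CNN.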

Datasets
In this study, four datasets were selected for the regression tasks: melting point (MP), water solubility (Esol), octanol/water distribution coefficient (Lipop), and hydration free energy (Freesolv). Various publicly available datasets for classification problems are used in deep learning model development studies [9,10,13,17], such as human immunodeficiency virus replication inhibition (HIV), human β-secretase 1 inhibition (BACE), blood-brain barrier penetration (BBBP), toxicity in clinical trials (ClinTox), drug adverse reactions (SIDER), biological targets screened in Tox21 and ToxCast, and PubChem BioAssay data (MUV); however, they were not used in this study because the labels in these datasets are seriously imbalanced, whether for binary or multi-class classification tasks.
MP was obtained from the study by Igor V. Tetko et al., in which 275,131 compounds with normal melting point values were extracted by mining patent documents [22]; it is the largest publicly available labeled chemical dataset. The dataset was divided into training, validation, and external test sets by a random split in a ratio of 8:1:1. Esol, Freesolv, and Lipop were obtained from the study by Jian et al. [17]. Because these datasets were already divided into the three categories by the authors, I used them as such. The number of data points and the range of each endpoint are listed in Table 1, and the chemical space of the datasets was plotted to verify the structural diversity in the training, validation, and external test sets (Fig. 5).

The network architecture was designed based on VGGNet [23] as a backbone, with modifications such as (1) the size of the initial filter channel was halved from 64 to 32, (2) the filter shape was reduced from three by three to two by two, (3) average pooling was used to minimize information loss, and (4) a convolutional layer was applied once before each pooling layer (Fig. 6). A grid search was performed on the CNN architectures, activation functions, and epoch numbers to obtain the best hyperparameters for model development. Model training was conducted using the NEURON system of the National Supercomputing Center of South Korea (https://www.ksc.re.kr/eng/resource/neuron).
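The effect of the VGG-style modifications on tensor shapes can be traced with a small shape-propagation sketch: one 2x2 convolution (stride 1, no padding) followed by one 2x2 average pooling per block, starting from 32 filter channels that double each block (the VGGNet convention). The input size (4 x 36 x 36) and the block count are assumptions for illustration, not the paper's actual TDEi dimensions.

```python
def conv2x2(c_in, h, w, c_out):
    # 'valid' 2x2 convolution, stride 1: spatial size shrinks by 1
    return c_out, h - 1, w - 1

def avgpool2x2(c, h, w):
    # 2x2 average pooling, stride 2: spatial size halves (floor)
    return c, h // 2, w // 2

shape = (4, 36, 36)               # (topological-distance channels, height, width)
filters = 32                      # halved initial channel count, per the paper
for block in range(3):
    shape = conv2x2(*shape, filters)
    shape = avgpool2x2(*shape)
    print(f"block {block + 1}: {shape}")
    filters *= 2
```

The small 2x2 filter keeps the receptive field growth gradual, which matters because the Ei matrices are far smaller than typical images.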
The prediction accuracy of the models was measured using four metrics: mean absolute error (MAE), normalized mean absolute error (NMAE), R-squared (R²), and Spearman's rank correlation coefficient (S_r).
where y_pred is the model's prediction value, y_obs is the observed value, n is the number of compounds, ȳ is the average of the observed values, and d is the difference between the ranks of each compound. A prediction model with R² higher than 0.6 on the external test set is considered accurate. Even when a model did not achieve R² > 0.6, it could still make accurate predictions of the target value when NMAE was less than 10%. As QSAR models are used in the prioritization of compounds, S_r higher than 0.6 implies that the model's predictions are valid and useful for the relative comparison of chemicals, even if NMAE is over 10% [24].
Model analysis
CNN models were developed for all four datasets; however, only the CNN model developed with MP was analyzed, because this model was trained with the largest dataset. In QSAR studies, the capacity to separate different molecular structures is the most significant consideration in descriptor design, as it is what enables valid predictions using the descriptor. As the CNN extracted features from the TDEi tensor, the performance of these features in distinguishing compounds along the MP endpoint was examined. Model outputs were extracted from the middle of the CNN, before the final prediction value was calculated. Principal component analysis (PCA) was used to project the extracted features into 2D space, and the variation of the extracted features was examined to determine how the model correlates the extracted features with the prediction of target values.
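The projection step can be sketched with a minimal PCA (not the author's analysis code); the random features below stand in for real intermediate CNN outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 64))   # 100 compounds x 64 extracted features

def pca_2d(x):
    """Project rows of x onto the first two principal components."""
    centered = x - x.mean(axis=0)
    # Right singular vectors of the centered matrix are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T          # shape: (n_samples, 2)

proj = pca_2d(features)
print(proj.shape)  # (100, 2)
```

Coloring each projected point by its melting point value, as in Fig. 8, then shows how strongly the 2D layout correlates with the endpoint at each depth of the network.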

Results And Discussion
TDEi parameter search
The TDEi parameter search results are presented in the supplementary tables: MP (Table S1), Lipop (Table S2), Esol (Table S3), and Freesolv (Table S4). As the TDEi tensor can be varied by changing the EC vectors and topological distances, the influence of the different TDEi tensor options on prediction accuracy was analyzed. In the Lipop, Esol, and Freesolv datasets, a dramatic decrease in prediction accuracy was observed, regardless of topological distance, when the EC vector size was reduced from full and the full bit strings were condensed to EC vectors without degenerate AOs and spin numbers, whereas the MP model showed only a mild decrease in prediction accuracy. The full EC vector achieved the highest accuracies in MP, Lipop, and Esol, whose data sizes were greater than 1,000, and the condensed full EC vector achieved the highest accuracy in Freesolv, whose data size was less than 1,000. Thus, it appeared that the EC vector size in the TDEi tensor could be reduced when models were trained with smaller datasets. According to this experiment, full or condensed full EC vectors should be used in the development of CNN models for drug-like compounds.
Desirable prediction accuracy was achieved in MP when the TDEi tensor with the full EC vector and topological distance 3D was used, and a further increase in topological distance did not lead to a significant improvement in accuracy. In the other three datasets, the TDEi tensor with topological distance 2D achieved the highest accuracy. In the MP model, prediction accuracy gradually increased as the topological distance increased up to 3D, whereas it fluctuated in the other datasets. This instability in prediction accuracy in the three datasets implies that the training process of a deep learning model could be stabilized if larger datasets were used. Because the prediction accuracy of the model varied significantly with the topological distance of the TDEi tensor in each dataset, a preliminary search was required to select the most suitable topological distance for each dataset and target endpoint.
Based on this preliminary study, the most suitable TDEi options were selected for each dataset: topological distance 3D with the full EC vector for MP, 2D with the full EC vector for Lipop and Esol, and 2D with the condensed full EC vector for Freesolv. In this study, experiments were performed on small drug-like compounds. Given that models developed for drug-like compounds have shown poor prediction accuracy for molecules whose structural diversity is dissimilar to the drug-like chemical space [25,26], TDEi tensor options should be examined before model development if the structural diversity of a dataset differs from that of drug-like molecules.

Model prediction accuracy
In CNN model development, the TDEi tensors with the leading results in the preliminary search were used for each dataset (Table 2), and the goodness-of-fit of each model is shown in Fig. 7. The R² of the MP model was 0.565 for the external test set. Prediction errors between 0 and 400°C were relatively high, as data points were widely distributed around the best-fit line (Fig. 7A). However, NMAE = 5.27% indicates that the predictions were accurate on average, and S_r = 0.729 implies that the model correctly ordered the molecules by normal melting point. It was arduous to make precise predictions of LogP, as the Lipop model achieved an R² of 0.516 on the external test set and an NMAE of 10.93%. Even though most of the data points were close to the best-fit line, some compounds located away from it were predicted inaccurately (Fig. 7B). The models developed with Esol and Freesolv achieved high R², and Figs. 7C and 7D show that these models achieved good fits. CNN models were also developed with more convolutional layers to examine whether prediction accuracy would improve significantly. However, adding more convolutional layers or increasing the number of nodes within the fully connected layers did not lead to a meaningful improvement in prediction accuracy. Thus, CNNs with deeper architectures, such as ResNet and Inception, were not applied. This is consistent with a previous study in which increasing the number of weights within the neural network architecture did not always improve prediction accuracy [20]. Moreover, it is important to search for a model architecture with the minimum number of weights and the highest prediction accuracy, because the use of an excessive number of weights in the model could induce false positives in prediction outcomes [17]. In QSAR modeling, datasets are collected from a wide range of studies in which experimental values were measured using different experimental protocols.
This difference is a source of experimental error in the target dataset [27]. As the model aims to predict the endpoints with their experimental noises, understanding the experimental errors of the dataset is of great aid in determining whether the prediction accuracy of the model is meaningful. In particular, deep learning studies in chemistry have attempted to utilize large volumes of datasets; thus, it is inevitable to integrate datasets measured by different protocols to increase the size of data, which deteriorates data quality and increases inherent experimental errors. If the prediction errors of the model were lower than the experimental errors, then there is a possibility that such accuracy was not a meaningful achievement, even though prediction accuracy was improved as compared to other methods [3]. In deep learning model studies, the authors compared the prediction accuracy of their models with others to prove that their own methods achieved improvement in prediction accuracy.
However, it is challenging to find studies that have compared the prediction accuracy of their models with the experimental errors of the target endpoint. This may be attributed to dataset curation being performed without an understanding of the inherent experimental errors; however, it is critical to verify the prediction accuracy of a model against the experimental errors to test the validity of any claimed improvement. Among the four datasets used in this study, the MP dataset's experimental errors were analyzed based on 18,058 duplicated compounds, and the inherent experimental error of the dataset was estimated to be 35°C [22], which is larger than the MAE of the MP model in this study. That the inherent experimental error of the MP dataset exceeds the MAE of the MP model suggests that the actual prediction accuracy of the model might be higher than that measured on the external test set when the model is used for the prediction of unseen compounds. Unless the prediction errors of deep learning models in chemistry are analyzed against the experimental errors in the dataset, a simple comparison of the prediction accuracy of deep learning models may not be adequate to provide decisive evidence of significant improvement. As models in computer vision predict unambiguous and invariable labels, a higher prediction accuracy implies a better model; if the problem of mislabeling is excluded from the discussion, prediction models in computer vision achieved great success because of the certainty in the dataset. It is practically impossible to obtain experimental noise-free datasets in chemistry. To replicate that success in chemistry, the inherent experimental errors in the dataset must be understood precisely, such that models are trained and validated reliably.

Model analysis
In QSAR, descriptors aim to distinguish compounds based on their structural similarity. As their purpose is comparison, differences in feature values between different compounds are significant in predicting the target endpoint. To examine the performance of the extracted features in clustering molecules, PCA was performed to show how the feature space varied as the TDEi tensor was processed within the CNN architecture. In Fig. 8, the brightness of the colors indicates the melting point value; dots are brighter for higher values and darker for lower values. Initially, the original TDEi tensor's feature space showed only a low correlation with the normal melting point (Fig. 8A). Once the TDEi tensor was processed up to the last convolutional layer (the eighth layer), compounds were prioritized according to the normal melting point (Fig. 8B). An additional pooling layer strengthened this trend in the data distribution by separating compounds with low melting points to the upper left and those with high melting points to the lower right of the projected space (Fig. 8C). When the features extracted from the convolutional and pooling layers were processed by a fully connected layer, most of the compounds were arranged with an even stronger correlation with their normal melting point values (Fig. 8D).
To examine how feature extraction changes with the CNN architecture, an identical analysis was performed using a CNN model with an increased number of convolutional layers. In Fig. 6, the convolutional layer is applied once before each pooling layer; here, convolutional layers were applied twice with identical hyperparameters before each pooling layer. Features from the last pooling layer and the second fully connected layer were extracted and visualized (Fig. 9). PCA showed that the features extracted after the additional convolutional layers were strongly correlated with the melting point. Although the additional convolutional layers did not improve prediction accuracy, this analysis established that the CNN architecture can be modified for novel feature extraction.
In CNN models trained on image data, fundamental-level features are extracted initially, and higher-level features emerge as the layers go deeper. EC is fundamental-level information compared to atom-level features; thus, the use of EC in a CNN was expected to fully harness the CNN's automatic feature extraction capacity, with the filters identifying the Eis that are significant for predicting the target endpoint. The feature space variation observed in PCA supported this idea, because the extracted features were rearranged with a stronger correlation to the melting point values as the layers went deeper.

Conclusions
The TDEi tensor was introduced in this study as a novel way to represent molecular structure. Since the electron interactions in a molecule determine its molecular properties, the TDEi tensor was designed from the electron interactions between each pair of atoms in a molecule, based on topological distance.
Given that data sizes are much smaller in chemistry than in computer vision studies, and that sparsity in features can significantly deteriorate model performance, the TDEi tensor was devised to be robust to dataset size and structural diversity by changing the EC vector and the topological distances considered in the Ei calculations. Because differences in chemical space significantly influence model prediction accuracy, a preliminary search over TDEi tensor options may be required if the structural diversity of the target dataset differs from the chemical space targeted in this study, namely small-molecule drugs.
The TDEi tensor was used in a CNN model whose architecture was developed based on VGGNet, with reduced weights, because an increase in the number of layers did not improve prediction accuracy. Desirable prediction accuracy was achieved for the four datasets. Unlike image data in computer vision, data in chemistry contain experimental noise in the target endpoint; therefore, deep learning studies in chemistry require a comparison between the prediction errors of the models and the inherent experimental errors in the dataset. As the CNN was suitable for automatically extracting relevant features to predict the target endpoint, feature space changes were traced through PCA on features obtained from the middle of the CNN architecture. A stronger correlation was found between the features extracted from deeper layers and the target endpoint, implying that the CNN correctly modeled the Eis that are significant for the prediction of the property.

Declarations
The MP dataset was obtained from the work of Igor V. Tetko et al. [22], and the others were obtained from Dejun Jian et al. [17].

Competing interests
The author declares no competing interests.

Funding
This work was financially supported by a National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIT) (No. NRF-2019R1F1A1061955).

Author contributions
Not applicable.

Figures

Figure 1
Calculation of electron configuration (EC) vectors for each atom. The EC vector of each atom was used to calculate the Eis within a molecule. (A) The indices of the full EC vector are atomic orbitals with different spins, and the EC vector was designed to be reduced by integrating (B) atomic orbitals at an identical energy level (degenerate orbitals) and (C) different spins. Each EC vector was condensed to reduce sparsity in the feature space (D-F).

Figure 3
Possible atom pairs in the example molecule. In this molecule, the longest topological distance was 4D; however, more distant atom pairs can be found as the size of the molecule increases.

Figure 4
Preparation of the TDEi tensor. The TDEi tensor was prepared by concatenating the Ei matrices at each topological distance and was designed to be adaptable to the structural diversity and size of datasets by adjusting the EC vectors and topological distances.

Figure 7
Examination of goodness-of-fit on the four endpoints.

Figure 9
Results of using a deeper CNN in feature space variation. The results shown in this figure were obtained by slightly modifying the CNN shown in Figure 6, applying the convolutional layer twice before the average pooling layer. (A) Features extracted from the last pooling layer prioritize compounds accurately. (B) After the fully connected layer, the features were in stronger correlation with the target endpoint.

Supplementary Files
Supplementary files (Tables S1-S4) are associated with this preprint.