Group Contribution-Based Graph Convolution Network: Pure Property Estimation Model

Property data for chemical compounds are essential for the design and operation of chemical processes. Experimental values are reported in the literature, but they are too scarce compared with the exploding demand for data. When data are not available, various estimation methods are employed. The group contribution method is one of the standard and simple techniques used today. However, these methods have inherent inaccuracy due to their simplified representation of the molecular structure. More advanced methods are emerging, with improved molecular representations and handling of experimental data, but such approaches also suffer from a lack of valid data for adjusting their many parameters. In this contribution, we suggest a compromise between a complex machine learning algorithm and a linear group contribution method. Instead of representing a molecule as a graph of atoms, we employ bulkier blocks: a graph of functional groups. The new approach dramatically reduces the number of adjustable parameters for machine learning. The results show higher accuracy than the conventional methods. The whole process was also examined in various aspects: incorporating uncertainties in the data, the robustness of the fitting process, and detecting outlier data.

connected networks. However, this method did not include the connectivity of groups. Lu [12], Kearnes [13], Sanchez-Lengeling [14], and others [15] showed successful estimation of quantum mechanics calculation results with molecular graph convolution networks. However, they used about 130,000 to 1,000,000 data points for training the adjustable parameters, an amount that thermophysical experimental data cannot cover.
Such limitations can be overcome by a compromise between the molecular graph and the group contribution framework. Instead of using a fully connected graph of atomic contributions, we employed a connected graph of bulky functional groups to reduce the number of adjustable parameters. With such a technique, the adjustable parameters can be effectively trained on a few thousand data points with reasonable accuracy.
This study proposed a new model, the Group Contribution Graph Convolution Neural Network (GC-gcn), for estimating pure-component thermodynamic properties. The model maintains the conventional group contribution framework, while the group contributions are updated according to their adjacent groups by a graph convolution network.
Fully connected networks are used for the machine learning part of the property estimation. The experimental data are from the NIST TDE SOURCE database [16]. The capability of our model was tested by comparison with commonly used group contribution methods.

Group Contribution Method
The group contribution method is a widely used quantitative structure-property relationship method. The procedure assumes that a functional group contributes identically to a property in all molecules. Some small but complex molecules are treated as independent groups in certain group contribution methods; for example, water (H2O) and methanol (CH3OH) are treated as single groups in UNIFAC [17]. However, due to the oversimplification of the molecular structure, many methods have questionable accuracy for complex molecules, including isomers. Higher-order groups have been suggested to overcome these limitations. However, such higher-order methods are only effective for particular molecules, and it is difficult to generalize the rules for defining higher-order terms.
The general form of the group contribution method for a property X is:

f(X) = X_0 + \sum_j w_j \sum_i n_{j,i} R_{j,i}    (1)

where f is a simple function, X_0 is a constant, n_{j,i} is the number of groups i of order j in the molecule, R_{j,i} is the contribution of group i of order j, and w_j is a weight for the j-th order groups.
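Eq. (1) can be sketched numerically as follows. The contribution values below are illustrative placeholders in the spirit of first-order group tables, not the fitted parameters of any published method.

```python
def group_contribution(x0, orders):
    """Evaluate X0 + sum_j w_j * sum_i n_{j,i} * R_{j,i}.

    `orders` is a list of (w_j, groups) pairs, where `groups` is a list of
    (n_{j,i}, R_{j,i}) tuples for that order.
    """
    return x0 + sum(w * sum(n * r for n, r in grps) for w, grps in orders)

# First-order terms only: 2x group A (R = 23.58) and 1x group B (R = 22.88);
# both R values and X0 are hypothetical.
x = group_contribution(198.2, [(1.0, [(2, 23.58), (1, 22.88)])])
```

Higher-order corrections simply add further (w_j, groups) entries to the list.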

Group Contribution Graph Convolution Network
With the advances in machine learning algorithms, many researchers have tried to analyze property-structure relationships using innovative methods. Among them, converting molecules into a graph format and analyzing them through a graph convolution neural network has been the most successful. Several applications have been suggested [12-14, 18], but most require a large amount of training data for adjusting parameters. In the case of thermophysical properties, the amount of measured data is relatively small, and the data typically contain significant uncertainties.
To reduce the number of parameters, we simplified the graph of atomic constituents into the graph of group contributions. The group contribution graph network can be trained by a machine learning algorithm. The abbreviation of the new method is GC-gcn (Group contribution Graph Convolution Neural Network).

Molecule into a Graph of Group Contributions
A component is fragmented into individual groups, the same as in other group contribution methods. Then, it is represented as a molecular graph G(V, E), an undirected graph consisting of a vertex set V(G) and an edge set E(G). Each element v_i ∈ V represents a constituent group of the compound, and each element e_{i,j} ∈ E represents the edge connecting vertices v_i and v_j. The list of groups used in this study is summarized in Table 1. The table shows that the number of composing groups (48) is minimal compared with conventional group contribution models.

Group Feature Matrix
From the molecular graph G(V, E), we obtain the group feature matrix G^0. The rows of G^0 indicate the vertices, and the columns indicate the group IDs. G^0_{i,j} = 1 means that the group ID of v_i is j; all other elements are zero. For consistency of model inputs, we fixed the size of the group feature matrix G^0 to (N_max × N_G), where N_max is the maximum number of groups constituting a compound and N_G is the number of groups. Compounds with more than N_max groups are excluded. If a compound has fewer than N_max groups, the vacant elements are filled with zeros.
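The construction of G^0 described above can be sketched as follows; the group IDs here are hypothetical (the actual 48 groups are listed in Table 1).

```python
import numpy as np

# Fixed-size one-hot group feature matrix G0 (N_max x N_G).
def group_feature_matrix(group_ids, n_max, n_groups):
    if len(group_ids) > n_max:
        raise ValueError("more than N_max groups: compound is excluded")
    g0 = np.zeros((n_max, n_groups))
    for row, gid in enumerate(group_ids):
        g0[row, gid] = 1.0        # one-hot row: group ID of vertex `row`
    return g0                     # vacant rows remain zero-filled

g0 = group_feature_matrix([0, 1, 1, 3], n_max=5, n_groups=4)
```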

Connectivity Matrix
The adjacency matrix A is obtained from the molecular graph G(V, E): A_{i,j} = 1 if {i, j} ∈ E and A_{i,j} = 0 otherwise. Because the adjacency matrix does not contain information about the vertex itself, an identity matrix is added:

A_I = A + I    (2)

Because the functions of the graph convolution network are designed to have values between −1 and 1, we normalized A_I with the symmetric Laplacian and obtained the connectivity matrix C:

C = D^{-1/2} A_I D^{-1/2}    (3)

where D is the degree matrix of A_I.
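A minimal sketch of the self-loop addition and symmetric Laplacian normalization, guarding the zero-padded vertices:

```python
import numpy as np

# Connectivity matrix: C = D^{-1/2} (A + I) D^{-1/2}.
def connectivity_matrix(a):
    a_i = a + np.eye(a.shape[0])          # A_I = A + I (self-loops)
    d = a_i.sum(axis=1)                   # diagonal of the degree matrix D
    d_inv_sqrt = np.zeros_like(d)
    nz = d > 0
    d_inv_sqrt[nz] = d[nz] ** -0.5        # guard zero-padded vertices
    return d_inv_sqrt[:, None] * a_i * d_inv_sqrt[None, :]

# Two connected vertices: every entry of C becomes 1/2.
c = connectivity_matrix(np.array([[0.0, 1.0], [1.0, 0.0]]))
```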

An Example of Group Contribution Graph Interpretation of a Molecule
This section shows a graph interpretation of 1-propanol as an example. The maximum number of groups N_max was set to 5, and the groups used in this example are shown in Table 2.
First, the compound is fragmented into groups and numbered. This process is shown in Fig. 1.
Then, the group feature matrix G^0 can be represented with rows for the group indices of 1-propanol and columns for their group IDs. The adjacency matrix A (Eq. (4)) has rows and columns that both represent the group indices, and it is filled with 1 where the corresponding groups are connected. For example, groups 1 and 2 are connected, so A_{1,2} and A_{2,1} are 1. From Eqs. (3) and (4), the connectivity matrix C is then calculated.
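The matrices of this example can be reproduced with a short script. The column assignment (1: -CH3, 2: >CH2, 3: >CH2 in a ring, 4: -OH) mirrors the graph convolution example later in the text and is illustrative only.

```python
import numpy as np

# 1-propanol (CH3-CH2-CH2-OH) with N_max = 5 and four assumed group IDs.
n_max, n_g = 5, 4
group_ids = [1, 2, 2, 4]                 # group IDs of vertices 1..4
bonds = [(1, 2), (2, 3), (3, 4)]         # the linear chain of groups

g0 = np.zeros((n_max, n_g))
for row, gid in enumerate(group_ids):
    g0[row, gid - 1] = 1.0               # one-hot rows; row 5 stays zero

a = np.zeros((n_max, n_max))
for i, j in bonds:
    a[i - 1, j - 1] = a[j - 1, i - 1] = 1.0   # undirected: A is symmetric
```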

Graph Convolution
After the conversion, the graph convolution is performed. In this step, the information of the groups is updated according to their surroundings. It starts by multiplying the connectivity matrix C and the group feature matrix G^k, obtaining H^{k+1}:

H^{k+1} = C G^k    (5)

The group feature matrix G^k is then updated to G^{k+1} from H^{k+1} by the following equation:

G^{k+1} = ReLU(H^{k+1} W^{k+1} + b^{k+1})    (6)

where W^{k+1} is a weight and b^{k+1} is a bias producing layer k+1, and ReLU is the Rectified Linear Unit function, which provides non-linearity. This procedure is repeated several times. After the graph convolution process is completed, G^k_i represents the contribution of v_i. These contributions are summed column-wise, just as in group contribution methods. In addition, a recently developed higher-order method [19] could be adopted at the cost of more complexity. However, due to the limited experimental data for training, the number of adjustable parameters should be kept to a minimum. Therefore, we concluded that the summation of contributions, slightly modified by the adjacent groups, is sufficient for our purpose of property estimation. By this process, we obtain the molecular feature vector M.
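One convolution step and the column-wise read-out can be sketched as follows, with placeholder shapes and weights (not fitted parameters):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# One graph convolution step followed by the column-wise read-out.
def gc_layer(c, g, w, b):
    h = c @ g                   # mix each group with its neighbors
    return relu(h @ w + b)      # update the group contributions

def read_out(g):
    return g.sum(axis=0)        # column-wise sum -> molecular vector M

# Shape check with toy values: 3 vertices, feature width 2 -> 6.
g1 = gc_layer(np.eye(3), np.ones((3, 2)), np.ones((2, 6)), np.zeros(6))
m = read_out(g1)
```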

An Example of Graph Convolution Network of Proposed Model
An example of the graph convolution is described for 1-propanol in this section. The length of M is set to 6, and the graph convolution is performed only once. The parameters used in this example are for illustrative purposes only, not actual fitted results. From the previous example of a graph representation of 1-propanol, we can obtain H^1 by Eq. (5). H^1 contains the information of the neighboring groups. The third column refers to >CH2 in the ring, so it is 0 for all rows. Because -CH3 does not exist near group 3 of 1-propanol, H^1_{3,1} is 0. On the other hand, since one more >CH2 than -OH exists near group 3, H^1_{3,2} has a higher value than H^1_{3,4}. Assuming values for W^1 and b^1, we can obtain G^1 by Eq. (6). Then, by summing all the columns of G^1, we obtain M, the one-dimensional molecular vector produced by the read-out layer. Figure 2 shows the outline of the calculation procedure.

Reduction of Parameters
The replacement of atomic segments in the molecular graph with bulkier functional groups is expected to reduce the number of parameters. Although many property estimation models [12-14, 18] use graph convolution networks, none was designed for thermophysical properties. Therefore, we also implemented a simple Molecular Graph Convolution Network (M-gcn) for an illustrative comparison; the specific details can be found in Appendix A. N_max was set to 25, and the length of the feature vector was set to 150. Table 3 shows the number of adjustable parameters for the two models. The number of parameters is dramatically decreased for the GC-gcn model. Since M-gcn uses multiple graphs to distinguish the bonds between atoms, a graph convolution network with a more complex architecture [20] must be used. In the proposed method (GC-gcn), the bonds between atoms are lumped into the group description.

Datasets
The structural information required for the compounds consists of the constituent groups and their connectivity. These data were obtained by analyzing about 50,000 SMILES strings from the PubChem bulk database [21]. The pure-component physical property data were from the NIST TDE SOURCE database [16]. The number of property data points used in this study is summarized in Table 4.

Consideration of Uncertainties in the Data
The advantage of using the NIST SOURCE database [16] as a training set is that each data point has a critically evaluated uncertainty [22]. Conventional cost functions, such as the average absolute error (AAE) or the mean square error (MSE), can be modified to reflect the uncertainty of each point.
Cost = (1/N) \sum_{i=1}^{N} e^{-\sigma_i} (X_i^{pred} - X_i^{exp})^2,  with  \sigma_i = X_i^{unc} / X_i^{exp}

where N is the number of data points, X_i^{pred} and X_i^{exp} are the estimated result and the experimental value of data point i, σ_i is the uncertainty ratio of data point i, and X_i^{unc} is the uncertainty of data point i. The cost function is slightly modified from the conventional ratio-based approaches. The weight of the square error decreases as σ increases, so data with a high σ are penalized for their high uncertainty ratio. On the other hand, when σ decreases toward zero, the exponential term approaches one, so the modified scheme prevents exceptionally high weights.
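A sketch of this weighting scheme, assuming the weight exp(−σ_i) with σ_i = X_unc/X_exp as described above; the exact functional form is our reading of the text, not a quoted equation.

```python
import numpy as np

# Uncertainty-weighted squared-error cost: high-sigma points down-weighted.
def weighted_cost(x_pred, x_exp, x_unc):
    x_pred, x_exp, x_unc = map(np.asarray, (x_pred, x_exp, x_unc))
    sigma = np.abs(x_unc / x_exp)      # uncertainty ratio per data point
    w = np.exp(-sigma)                 # -> 1 as sigma -> 0 (bounded weight)
    return float(np.mean(w * (x_pred - x_exp) ** 2))
```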

Model Training and Evaluation
The proposed model (GC-gcn) was constructed using the open-source toolkits RDKit, Keras, and NumPy. The composed network may fail to converge or may fall into a local minimum when the property data are insufficient. Such behavior can be prevented by initializing the parameters with a well-known prediction model. For this reason, the model was first trained on predicted data produced by the Joback and Reid group contribution method [1]. After the initialization, the model was trained and verified with actual experimental datasets. Five-fold cross-validation was used for determining the training rates. Adam [23], with a learning rate of 0.001, was used as the optimizer. The parameter fitting process with different random initializations (including train-test splitting) was conducted ten times as a reliability test.

Result of Estimation
The estimation results of the proposed model (GC-gcn) are summarized and compared with the well-known conventional group contribution models in Table 5. The results are analyzed with the average absolute error (AAE) and the average relative error (ARE) between estimated and experimental data:

AAE = (1/N) \sum_{i=1}^{N} |X_i^{pred} - X_i^{exp}|

ARE = (1/N) \sum_{i=1}^{N} |X_i^{pred} - X_i^{exp}| / X_i^{exp}

where N is the number of data points. For comparison, we used the Marrero and Gani method [7] for most properties and the Rowley and Wilding method [24] for the flash points. The group contribution methods cannot predict all the properties due to limitations in their representation of the molecules. In contrast, the proposed model covers all the molecules in the database. For a fair comparison, we compared properties only for molecules whose properties the conventional method can predict.
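The two metrics can be written compactly as:

```python
import numpy as np

# Average absolute error and average relative error (Table 5 metrics).
def aae(x_pred, x_exp):
    return float(np.mean(np.abs(np.asarray(x_pred) - np.asarray(x_exp))))

def are(x_pred, x_exp):
    x_pred, x_exp = np.asarray(x_pred), np.asarray(x_exp)
    return float(np.mean(np.abs(x_pred - x_exp) / np.abs(x_exp)))
```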
As shown in Table 5, the proposed model's predictions are much better for all properties considered in this work (T_b, T_c, P_c, V_c, T_m, and T_f). For the boiling point and the critical point, the improvements are evident. For the flash points and melting points, the quality of the data points is not very good, yet the improvements are still substantial. Figure 3 shows a plot of experimental vs. predicted values for the two models. The plotted points are more concentrated near the diagonal for the GC-gcn model than for the group contribution models.

Properties of Isomers
Distinguishing the properties of isomers is a challenging task for group contribution methods. Several methods [6, 7] introduced second- or third-order groups to characterize such properties. However, these methods introduce additional complexity, and it is difficult to generalize their scheme for representing molecules. The proposed method uses the connectivity information contained in the group contribution graph and successfully distinguishes the differences in isomer properties.
The results are shown in Tables 6, 7, 8, and 9 for normal boiling points. The proposed GC-gcn model shows the correct trend with experimental data, while the conventional group contribution method cannot distinguish the difference in boiling points.

Robustness in fitting
Due to the limited availability of experimental data points, robustness to outliers is critical for property estimation models. Anomalous data points in regions where data are scarce may influence the whole estimation network. In this section, we present the robustness of the proposed model compared with a more comprehensive molecular version of the graph convolution network (M-gcn), described in Appendix A. Figure 4 and Table 10 show the estimation results for methyl-branched hydrocarbons at carbon numbers 14 to 18 using GC-gcn and M-gcn. Interestingly, the previous version (9.0) of the NIST SOURCE database [16] contains a suspected outlier (CASRN: 2882-96-4, name: 3-methylpentadecane, boiling point: 539.45 K), which was corrected (548.16 K) in version 10.1 [25]. In addition, Yaws [26] reported some data points, but it is not evident whether they are experimental or estimated values. The two models (GC-gcn and M-gcn) were trained including the suspected data point, and the results are quite different. As shown in Fig. 4 and Table 10, M-gcn overfits the outlier, and its prediction matches the outlier data point exactly. Such behavior is due to an excessive number of adjustable parameters. In contrast, GC-gcn shows robust behavior even though the training set contains the outlier data point. Furthermore, the whole set of GC-gcn predictions matches the data points reported by Yaws [26]. This result shows the robustness of our model compared with a model with more parameters.

Detection of outliers
There are various reasons for outliers in experimental property databases: simple typos, careless experimental procedures, mistakes in database construction, etc. As shown in Sect. 4.4, the proposed model is resistant to outliers. Thus, our new model can detect possible outliers in an extensive database. A similar process has been implemented in the construction of databases, and we show only a small set of outliers in the flash points. Points that deviate excessively from the estimated values are screened and classified as outliers. Table 11 shows the flash points of 1-pentene and 1-hexene. The experimental values are reported in different data sources [27, 28], but the longer-chain molecule is reported with the lower flash point, so one of the data points is definitely an outlier. From the estimation results, we conclude that the flash point of 1-pentene is the suspect. A similar situation can be found for 1-alkanols [29-31], shown in Table 12; in this case, the flash point of 1-pentadecanol is the suspect.
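The screening step can be sketched as below; the 3σ cutoff on the residuals is an assumption for illustration, since the text only states that excessively deviating points are classified as outliers.

```python
import numpy as np

# Flag data points whose deviation from the model estimate is excessive.
def flag_outliers(x_exp, x_pred, n_sigma=3.0):
    resid = np.asarray(x_exp) - np.asarray(x_pred)
    cut = n_sigma * resid.std()        # assumed n-sigma screening threshold
    return np.abs(resid) > cut

# One grossly deviating point among many small residuals gets flagged.
flags = flag_outliers([0.1] * 20 + [50.0], [0.0] * 21)
```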

Conclusion
This study suggested a new architecture for thermodynamic property estimation: a hybrid of the group contribution method and a molecular graph convolution network. The application of the graph convolution network has led to a successful unique group representation and isomer discrimination for a molecule. Compared with conventional group contribution methods, the proposed method shows more accurate estimation and broader applicability to molecular structures. In addition, the simplification using bulkier groups has reduced the number of parameters to fit while maintaining the connectivity among the groups. Finally, the robustness of the model has been examined by comparison with the molecular graph convolution network. The proposed model can be further extended to other properties when experimental data are available.

Implementation
A web page was constructed to conveniently use the new Group Contribution graph convolution network (GC-gcn) model. The result of this work can be accessed at https://www.chemicai.org/products/gcgcn/online. Figure 5 shows an example of a calculation result on the online platform. The input requires a SMILES representation of the molecule. The uncertainty of the model estimation result is also shown. The uncertainty of the machine learning prediction will be discussed in separate material.

Appendix A
Description of the Molecular Graph Convolution Network (M-gcn) used in this work.
Similar to the proposed GC-gcn model, the compound is represented as a molecular graph G′(d′, V′, E′), where d′ represents the dimension of the graph, V′ the vertices, and E′ the edges. A vertex need not be connected to another vertex in every dimension, but it must be connected at least once across the entire dimension d′. The atom feature (also called the node feature) is represented in G′(V′), while the edge feature is represented in G′(d′, E′). The rows of V′ indicate the vertices, and the columns indicate the atom types of the vertices. For consistency of model inputs, the shape of the atom feature matrix is fixed to (N_max × N_A), where N_max represents the maximum number of atoms and N_A the number of atom descriptors. The dimension d′ represents the type of connection between vertices. The description of atoms and bonds is summarized in Table 15. To consider the atom itself, the identity matrix is added to G′(d′, E′). The functions of the multi-dimensional graph convolution network are also designed to have values between −1 and 1, so the matrix must be normalized with the symmetric Laplacian.
B′_d = D′^{-1/2} (G′(d, E′) + I) D′^{-1/2}    (11)

where B′ is called the bond feature matrix, D′ is the degree matrix of G′(d′, E′) + I, and I is the identity matrix.

An Example of a Molecular Graph Interpretation of a Molecule-Glycolonitrile
The maximum number of atoms N_max was set to 4, and the descriptors used in this example are shown in Table 16. First, the atoms are numbered; this process is shown in Fig. 6. From Fig. 6, we can obtain the atom feature matrix V′. V′ can also be expressed as V′^0, where 0 indicates that no processing has been performed yet. The bond feature matrix B′(d′, j, k), where d′ is the dimension and j and k are the indexes of the atoms, can be obtained by Eq. (11). Because there are no double bonds or rings in glycolonitrile, dimensions 2, 4, and 5 of G′(d, E′) are zero matrices.

Multi-Dimensional Graph Convolution
In the multi-dimensional graph convolution process, or message passing phase, the contributions of atoms are updated according to their neighboring atoms and bond types. Unlike the single-graph convolution network, this procedure consists of two steps. First, the atoms are updated in each dimension. The result of this process is the message passing function H′^{k+1}, which can be expressed as:

H′_d^{k+1} = ReLU(B′_d V′^k W′^k + b′^k)

where W′^k is a weight and b′^k is a bias for layer k, and ReLU is the Rectified Linear Unit function. In this work, we assume that there is no interaction across the graph dimensions. Then, the contribution of atoms V′^k is updated as:

V′^{k+1} = ReLU(Concat_d(H′_d^{k+1}) W′′^k)

where Concat concatenates all the matrices and W′′^k is a weight.
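The two-step update can be sketched as follows, with toy shapes and placeholder weights; only the structure (per-dimension messages, then concatenation and a shared update) is taken from the description above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# M-gcn message passing: one convolution per bond dimension (no interaction
# across dimensions), then concatenation and a shared update of atom features.
def mgcn_layer(b_dims, v, w1, b1, w2):
    msgs = [relu(b_d @ v @ w1 + b1) for b_d in b_dims]   # per-dimension H'
    h = np.concatenate(msgs, axis=1)                     # Concat over d'
    return relu(h @ w2)                                  # shared update V'

b_dims = [np.eye(4), np.eye(4)]                 # two toy bond dimensions
v1 = mgcn_layer(b_dims, np.ones((4, 3)), np.ones((3, 2)), np.zeros(2),
                np.ones((4, 5)))
```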