Recognizing Molecular Structural Features by Pattern Recognition Techniques

Recognition of molecular structural features is one of the most attractive fields in chemistry, especially when combined with machine learning techniques. Pattern recognition techniques are straightforward in recognizing graphic features, but little attention has been given to recognizing molecular structural features. In this work, we propose a new method that takes advantage of pattern recognition techniques to analyze structural features and obtain novel chemical insights. Specifically, cluster analysis is used to recognize structural features, which provides an alternative to the most widely used root mean square deviation (RMSD) method and the recently proposed blob detection method. Based on this, the convex hull of the molecule is constructed. The convex hull of a molecule is highly appealing in the sense that one can introduce established theorems and properties from other disciplines into chemistry. Novel molecular descriptors based on convex hulls can be defined and show encouraging results, especially in providing new insights into non-covalent interactions, adsorption processes, etc.


Introduction
Machine learning techniques have prevailed across many disciplines in recent years. In chemistry, they have exhibited great strengths in different fields, such as conformation exploration, catalysis design, reaction optimization, etc. (1) Given the fast development of machine learning techniques and the increasingly complex molecular systems, one would expect machine learning techniques to become more critical in understanding chemical behaviors.
For successful supervised or unsupervised learning, a large amount of input data is critical. Accordingly, comparing and categorizing different samples are of paramount importance to avoid redundancy or bias during the learning process. In chemistry, this identification process could refer to differentiating molecular structures, such as comparing atomic coordinates between theoretical and experimental structures. Such a comparison is one of the most fundamental applications in computational chemistry, as it is often the starting point for various sophisticated computational studies (2)(3)(4)(5)(6)(7)(8). In addition, there are studies that combine existing benchmark sets to generate a larger benchmark set (9). The construction of such a super benchmark set requires removing duplicated samples from the individual sets. Accordingly, it is necessary to recognize unique molecules for building a non-redundant super set.
To examine structural similarities, the root-mean-square-deviation (RMSD) calculation is probably the most commonly used method. It sums the squared distances between corresponding atoms (d_i) in the two structures, divides the sum by the total number of atoms (N), and takes the square root. However, the same molecule in different benchmark sets may have totally different XYZ coordinates, so one has to translate and rotate the molecule to align the molecular orientations first. Beyond this, the RMSD measurement also suffers from other limitations, such as a lack of normalization, being difficult to interpret, and a diminishing ability to distinguish conformers with increasing system size (10)(11)(12).
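The RMSD calculation described above can be sketched in a few lines. This is a minimal illustration and not the authors' code; the function name `rmsd` is hypothetical, and the sketch assumes the two structures list their atoms in the same order and have already been aligned.

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-aligned structures.

    Both inputs are (N, 3) arrays with atoms listed in the same order.
    No translation or rotation is applied here (assumption: done beforehand).
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("structures must have the same number of atoms")
    sq_dist = np.sum((a - b) ** 2, axis=1)  # d_i^2 for each atom pair
    return float(np.sqrt(sq_dist.mean()))   # sqrt of (sum d_i^2) / N

# The same two-atom structure, shifted by 1 along X: the RMSD is 1 although
# the molecule itself is unchanged, which is why alignment matters.
ref = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
shifted = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
shift_rmsd = rmsd(ref, shifted)
```

The shifted example illustrates the orientation problem mentioned in the text: identical molecules give a non-zero RMSD unless they are superimposed first.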
On the other hand, pattern recognition techniques have achieved significant success in recent years.
Notably, there are mathematically proven theorems that can be brought into chemistry for structural analysis. However, very few studies have been carried out in this respect. Previously, the blob detection technique was used to recognize molecules (24). Although it achieved considerable success, some limitations remain. Firstly, during the blob detection, the graphic color was converted to grayscale to boost efficiency. However, such a trick sacrifices the ability to differentiate isotopes or elements in the same family (although the original blob detection study was designed for conformation analysis). Secondly, the blob detection uses a Gaussian function-based kernel for the convolution calculation. Yet this noise-filter step is not necessary as long as the image is not transformed.
In this work, we propose a new method to recognize molecular structural features that takes advantage of pattern recognition techniques. The blob detection was circumvented by applying cluster analysis to the image matrices, which successfully detected all atomic positions. The new method is fast and accurate for molecular structural comparisons. Based on this, the convex hull, which sets up a polyhedron to enclose the molecule, is constructed. Accordingly, the established theorems and properties of convex hulls from other disciplines can be introduced to chemistry to analyze structural features. Specifically, by creating the convex hulls, the molecular volume and surface area can be defined. One can therefore explore new chemistry with these new molecular descriptors. A few applications are presented, which show promising results in providing novel chemical understanding.

Methods
In this work, the proposed method consists of three steps to recognize structural features: pre-treatment of the molecules and images, feature extraction, and post-treatment with convex hull constructions, as shown in Fig. 1.

Pre-treatment of molecules and images
To remove redundancy, the chemical bonds are eliminated from molecular structure images, since the molecular structures are determined solely by the atom positions. As a result, the problem of recognizing molecules is equivalent to identifying a set of scattered atoms/dots. The molecular image is the basis for pattern recognition. Provided that the atoms have been well aligned (25), the molecular 3D image is generated from its XYZ coordinates (Fig. 2), taking the C60 system as an example. The 60 carbon atoms are scattered after removing all chemical bonds. For easier visualization and treatment at the later stage, the azimuthal angle and the elevation angle were set to 90 degrees for exhibiting the image (along the Z axis). Unless otherwise stated, the following discussion is based on this projection angle.
For complicated molecules, it is increasingly difficult to find a projection angle at which all atoms can be projected on a plane without overlaps. One may recognize the molecular features by analyzing its profile. However, the core structure cannot be recognized in this way. To circumvent this problem, we sliced the whole molecule into layers, and took snapshots of each layer to extract features (Fig. 2).
Nonetheless, slicing the molecule is not trivial, as double counting of atoms may take place. Eventually, we sliced the molecule along the projection angle, and set the distance between layers to 0.7 angstroms. This value is close to the H-H bond distance. For any reasonably determined structure, it is not possible to have two atoms separated by less than 0.7 angstroms. Therefore, layers separated by 0.7 angstroms can properly slice the whole molecule.
As with any other pattern recognition application, the quality of the picture is essential. Following the parameters given by the blob detection study, we set the picture height and width to 10 × 10 inches with 80 dots per inch (dpi). Accordingly, the final resolution of the figure is 800 × 800 pixels.
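The slicing step can be illustrated with a short sketch. It is not the authors' implementation: the function name `slice_into_layers` is hypothetical, the projection direction is assumed to be the Z axis, and half-open bins are used here as one possible way of ensuring that no atom is counted in two layers.

```python
import numpy as np

def slice_into_layers(coords, thickness=0.7):
    """Assign each atom to exactly one layer along the Z (projection) axis.

    coords    : (N, 3) array of atomic XYZ coordinates in angstroms.
    thickness : layer spacing in angstroms (0.7 as stated in the text).
    Returns a dict mapping layer index -> array of atom indices.
    """
    coords = np.asarray(coords, dtype=float)
    z = coords[:, 2]
    # Half-open bins [z_min + k*t, z_min + (k+1)*t): each atom falls into
    # one and only one bin, which avoids double counting.
    layer_idx = np.floor((z - z.min()) / thickness).astype(int)
    return {int(k): np.flatnonzero(layer_idx == k) for k in np.unique(layer_idx)}

# Four hypothetical atoms at Z = 0.0, 0.3, 0.8 and 1.5 angstroms.
demo = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.3], [0.0, 1.0, 0.8], [1.0, 1.0, 1.5]]
layers = slice_into_layers(demo)
```

Each layer can then be rendered and analyzed separately, so atoms that would overlap in a single projection end up in different snapshots.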

Feature extraction
To recognize the atoms on each layer, we took advantage of cluster analysis to filter the image matrices.
The image matrices are non-diagonal sparse matrices, with dimensions equal to the resolution. In the previous blob detection study, the colored image was first converted to grayscale. Consequently, if atoms were assigned close color codes, the grayscale conversion would mistakenly treat different atoms as identical.
In this work, the colored image matrix was first separated into three primary-color matrices, namely the R(ed) matrix, G(reen) matrix and B(lue) matrix. As a result, if an atom in the molecule is substituted by its isotope or by another element of the same family, the color matrices can reveal the substitution. In addition, atoms with close color codes can be distinguished.
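The benefit of keeping the three color matrices can be seen in a toy example. The sketch below is illustrative only: the helper `split_channels`, the 4 × 4 image and the two colors are hypothetical, and the grayscale conversion is taken as a plain channel average, which may differ from the conversion used in the blob detection study.

```python
import numpy as np

def split_channels(image):
    """Split an (H, W, 3) or (H, W, 4) image array into R, G and B matrices."""
    image = np.asarray(image)
    if image.ndim != 3 or image.shape[2] < 3:
        raise ValueError("expected an (H, W, 3) or (H, W, 4) image array")
    return image[:, :, 0], image[:, :, 1], image[:, :, 2]

# Toy image with a black background and two single-pixel "atoms"
# drawn in different colors (hypothetical color assignment).
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[1, 1] = (255, 0, 0)   # atom A drawn in red
image[2, 2] = (0, 255, 0)   # atom B drawn in green

r, g, b = split_channels(image)

# Assumed grayscale conversion: unweighted mean of the three channels.
gray = image.mean(axis=2)
same_in_gray = gray[1, 1] == gray[2, 2]   # both atoms collapse to one value
differ_in_red = r[1, 1] != r[2, 2]        # the R matrix still separates them
```

In this toy case the two atoms are indistinguishable after the averaged grayscale conversion, whereas the R and G matrices each pick out exactly one of them.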
By converting a graph into an image matrix, an atom in the graph is represented by a group of pixel coordinates. Ideally, the atom size determines the number of pixel coordinates. However, such a number is not unambiguously determined, as the boundary of an atom may be blurred especially if the resolution is low. The number of pixel coordinates is thus subject to the round-off error. As a consequence, it is generally not helpful to directly compare the image matrices.
Instead, the K-means algorithm (26) of cluster analysis was used in this work. For a given primary-color matrix, the local extreme values were first filtered out. The local extremes represent the pixels occupied by each atom in the image matrix. Whether a local extreme is a maximum or a minimum depends on whether the background color is black or white. Supposing the background color is black, the non-zero elements were first filtered out as the basis for cluster analysis.
The K-means algorithm clusters the matrix elements by assigning each point to its nearest center. In the context of feature extraction, this corresponds to determining the center of each set of pixel coordinates. Each cluster center was updated to the mean of its member points, and this relocating-and-updating process was iterated until the centers converged, with the number of clusters set to the number of atoms in each layer.
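The clustering step can be sketched with a plain NumPy version of Lloyd's K-means iteration. This is a minimal sketch under stated assumptions rather than the authors' code: the function name `kmeans_centers` is hypothetical, a black (zero) background is assumed, and a deterministic farthest-point initialization is used here purely to keep the example reproducible.

```python
import numpy as np

def kmeans_centers(channel, n_atoms, n_iter=100):
    """Locate atom centers in one color matrix with Lloyd's K-means.

    channel : 2D array, black (zero) background assumed.
    n_atoms : number of clusters, i.e. atoms expected in this layer.
    Returns an (n_atoms, 2) array of (row, col) centers.
    """
    points = np.argwhere(np.asarray(channel) > 0).astype(float)
    if len(points) < n_atoms:
        raise ValueError("fewer occupied pixels than requested clusters")

    # Farthest-point initialization (illustrative choice, not from the paper).
    centers = [points[0]]
    for _ in range(n_atoms - 1):
        d = np.linalg.norm(points[:, None] - np.array(centers)[None], axis=2)
        centers.append(points[np.argmax(d.min(axis=1))])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assignment step: each pixel joins its nearest center.
        dist = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = np.argmin(dist, axis=1)
        # Update step: each center moves to the mean of its member pixels.
        new_centers = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_atoms)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

# Two hypothetical 3 x 3 pixel "atoms" on a 50 x 50 black background.
channel = np.zeros((50, 50))
channel[9:12, 9:12] = 255     # centered at pixel (10, 10)
channel[39:42, 29:32] = 255   # centered at pixel (40, 30)
found = kmeans_centers(channel, n_atoms=2)
found = found[np.argsort(found[:, 0])]  # sort by row for a stable order
```

Because each atom is reduced to a single center, the result does not depend on exactly how many pixels the atom occupies, which sidesteps the round-off issue of comparing image matrices directly.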
The cluster analysis yields arrays containing the position of each center. The Euclidean norm of the difference between the centers of the two structures was then computed. If it exceeded 5 pixels, the corresponding atoms were considered to occupy different locations.
To find out which atoms differ between the two structures, a register table was first established to map 2D pixel coordinates to 3D atomic coordinates. The table was constructed by mapping atomic coordinates and pixel coordinates atom by atom. Next, the cluster analysis was carried out for the second structure. By differentiating the cluster centers of the two structures, the atoms at different positions can be filtered out by mapping with the register table.
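A possible form of this comparison is sketched below. The function name `unmatched_centers` is hypothetical, and pairing each center with its nearest counterpart in the other structure is an assumption made here because K-means returns centers in no particular order; the text does not specify how the centers were paired.

```python
import numpy as np

def unmatched_centers(centers_a, centers_b, tol=5.0):
    """Indices of centers in A with no center in B within tol pixels.

    centers_a, centers_b : (N, 2) and (M, 2) arrays of pixel coordinates.
    tol                  : distance threshold in pixels (5 as in the text).
    """
    a = np.asarray(centers_a, dtype=float)
    b = np.asarray(centers_b, dtype=float)
    # Pairwise Euclidean distances between every center in A and in B.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    nearest = dist.min(axis=1)  # distance to the closest center in B
    return np.flatnonzero(nearest > tol)

# Hypothetical centers: the third atom has moved by 8 pixels,
# the second by only 1 pixel (below the threshold).
first = [[10, 10], [40, 30], [20, 45]]
second = [[40, 31], [10, 10], [28, 45]]
moved = unmatched_centers(first, second)
```

The returned indices refer to the first structure, so they could be looked up in a pixel-to-atom table to identify which atoms were displaced.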

Post-treatments of constructing convex hulls
The convex hull is the smallest convex polyhedron that encloses a set of points, such that the line segment connecting any two points in the polyhedron also lies within the polyhedron. Originally, the concept of convex hulls was used in other disciplines such as computational geometry, functional analysis, image processing, etc. It depicts a set of n-dimensional (usually 2-dimensional) data, with many mathematically proven theorems or properties such as the separating hyperplane theorem. Such theorems are very appealing in the context of molecule recognition because, if properly used, they may allow molecular properties to be obtained without complicated calculations. Therefore, we are particularly interested in studying the convex hull of molecules, as established theorems and properties of convex hulls can be borrowed from other disciplines to study molecular interactions.
To construct the 3-dimensional convex hull for a molecule, the QHull algorithm was used (27). A polyhedron enclosing the molecule was generated based on the atomic coordinates. The molecular surface area was calculated as the total surface area of all facets of the convex hull. The specific molecular area was calculated as the molecular surface area divided by the molar mass. Similarly, the molecular density was calculated as the molar mass divided by the total volume of the convex hull.
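These descriptors can be sketched with SciPy, whose `scipy.spatial.ConvexHull` class wraps the Qhull library. The sketch is illustrative and not the authors' code: the function name `hull_descriptors` is hypothetical, and the SF6-like geometry uses an approximate S-F distance of 1.56 angstroms and an approximate molar mass of 146.06 g/mol as example inputs.

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_descriptors(coords, molar_mass):
    """Convex-hull based descriptors for one molecule.

    coords     : (N, 3) atomic coordinates (angstroms assumed).
    molar_mass : molar mass of the molecule (g/mol assumed).
    Units of the results follow from the units of the inputs.
    """
    hull = ConvexHull(np.asarray(coords, dtype=float))
    return {
        "area": hull.area,                          # total facet area
        "volume": hull.volume,                      # enclosed volume
        "specific_area": hull.area / molar_mass,    # area per unit molar mass
        "density": molar_mass / hull.volume,        # molar mass per unit volume
    }

# SF6-like octahedron: S at the origin, six F atoms on the axes.
a = 1.56  # approximate S-F distance in angstroms (illustrative value)
sf6 = np.array([
    [0, 0, 0],
    [a, 0, 0], [-a, 0, 0],
    [0, a, 0], [0, -a, 0],
    [0, 0, a], [0, 0, -a],
], dtype=float)
desc = hull_descriptors(sf6, molar_mass=146.06)

# Axis-aligned bounding box of the same atoms, for comparison.
box_volume = float(np.prod(sf6.max(axis=0) - sf6.min(axis=0)))
```

For this regular octahedron the hull volume is (4/3)a³ and the hull area is 4√3·a², while the bounding box has volume 8a³, six times larger than the hull.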

Results And Discussion
To recognize molecular geometric features and compare their structures, the C60 molecule and a manually distorted C60 were examined. This comparison resembles the comparison between structures in different databases or between theoretical and experimental structures. The distorted molecule was generated by adding a random displacement between −0.5 and 0.5 Å to the XYZ coordinates of the first 5 carbon atoms. A constraint was imposed on the random numbers such that the displacements (d_x, d_y) should be larger than 0.2 Å. Otherwise, the geometric difference might not be recognized. The d_z displacement is left out of the constraint since by convention the projection angle is along the Z direction. Figure 3a shows the graph of the undistorted C60 molecule, where the first 5 carbon atoms are marked as blue squares. The remaining carbon atoms are plotted as gray circles. As a comparison, Fig. 3b shows the distorted C60 molecule, where the distorted carbon atoms are highlighted in red. Figure 3c shows the overlap of the two structures. The atoms at different positions stand out in the figure, and the atoms at the same positions overlap. Since random numbers are involved in generating the distorted molecule, 100 trials were carried out for the recognition. The success rate reaches 100%.
Having extracted graphic features, we construct convex hulls for molecules. The convex hull is the smallest convex polyhedron that encloses the molecule. Fig. 4 shows some examples of molecules with their convex hulls. For high-symmetry molecules, such as SF6 (Oh point group) in Fig. 4a, the convex hull is an octahedron. The 6 fluorine atoms are located at the vertices of the octahedron, while the sulfur atom sits at the center. By definition, all atoms are enclosed in the octahedron, and the connections between any two atoms also lie within the octahedron. Fig. 4b and 4c show two other examples with more complicated geometric features and their convex hulls.
Convex hulls have been widely used in other disciplines. In chemistry, probably the easiest way of taking advantage of convex hulls is to define the molecular density and the specific surface area. The molecular density is calculated as the molar mass divided by the volume of the convex hull. Although one can also calculate the density by dividing the molar mass by the volume of a cubic cell, this cubic cell volume cannot reflect the shape of the molecule (cf. the SF6 instance). Moreover, the volume of the cubic cell is never smaller than that of the convex hull, since the convex hull by definition is the smallest convex polyhedron enclosing the molecule. Such a difference may lead to a significant improvement in data training, as the molecular density obtained from convex hulls might be a better molecular descriptor. Fig. 5 shows the molecular density and corresponding convex hulls for different sizes of fullerenes. It is evident that the molecular density decreases as the sphere size increases.
The calculation of surface area is another possible application of convex hulls. The specific surface area is an important parameter in studying adsorption processes. Fig. 6 shows the specific surface area for different types of fullerenes. It can be seen that the specific area varies less than the molecular density. If we approximate the inner surface as equal to the outer surface of the polyhedron, the method can be further used to study adsorption processes in zeolites or nanotubes.
Lastly, the Temozolomide (TMZ)-C60 system is presented as a preliminary application to study non-covalent interactions by analyzing convex hulls (Fig. 7). The TMZ-C60 system was theoretically studied as a brain anticancer drug. (28) The fullerene loads TMZ and transports the drug across the blood-brain barrier. The drug adsorbs on the fullerene through non-covalent interactions, and such interactions depend on the contact area. However, the relationship between the contact area and the interaction strength is not quantitatively known. Therefore, it would be beneficial to study this correlation for better design of drug delivery. Further study is in progress in this lab.
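The way the distorted structures were generated can be illustrated as follows. This is a sketch under stated assumptions, not the authors' script: the function name `distort` is hypothetical, the 0.2 Å constraint is interpreted here as applying to the magnitude of each of d_x and d_y separately, and rejection sampling is one possible way to enforce it.

```python
import numpy as np

def distort(coords, n_atoms=5, max_disp=0.5, min_xy=0.2, seed=None):
    """Displace the first n_atoms atoms by random vectors.

    Each component is drawn uniformly from [-max_disp, max_disp).
    Draws are rejected until |d_x| and |d_y| both exceed min_xy
    (assumed reading of the constraint); d_z is unconstrained.
    """
    rng = np.random.default_rng(seed)
    out = np.array(coords, dtype=float)  # copy, the input is left untouched
    for i in range(n_atoms):
        while True:
            d = rng.uniform(-max_disp, max_disp, size=3)
            if abs(d[0]) > min_xy and abs(d[1]) > min_xy:
                break
        out[i] += d
    return out

# Ten hypothetical atoms at the origin, so the output equals the displacement.
start = np.zeros((10, 3))
distorted = distort(start, seed=0)
```

Repeating such a call with different seeds would mimic the repeated trials mentioned in the text, with each trial followed by the recognition step.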

Conclusions
In this work, pattern recognition techniques are developed for molecular structure recognition. The method provides a new approach to recognizing molecular geometrical features, and thus can be used for structural identification. The K-means algorithm of cluster analysis was used to determine the pixel centers. This is more straightforward than the previous blob detection technique in the sense that the convolution calculation is avoided. A new post-treatment is proposed to construct convex hulls of molecules. Accordingly, the properties of convex hulls can be borrowed into chemistry and provide novel insight. To illustrate some possible applications, the molecular specific surface area and density were calculated from the total surface area and volume of the convex hulls for different sizes of fullerenes. The results show that such properties are promising as new molecular descriptors in machine learning studies, and they provide a new dimension for understanding molecular interactions. Further study is under development in this lab.

Abbreviations
RMSD, root mean square deviation; TMZ, Temozolomide

Declarations

Availability of data and materials
All data and source code are freely available upon request from the authors.

Competing interests
The authors declare no competing interests.

Funding
The author gratefully acknowledges support from the National Natural Science Foundation of China (No. 22003068) and the Beijing Municipal Natural Science Foundation (No. 2214065).

Figure: Illustration of the molecule pre-treatment.
Figure: The molecular density and convex hulls for different types of fullerenes.