Optimal Division of Molecules Into Training And Test Sets With A New Tool To Predict Pharmacophore In 3D-QSAR

Dividing molecules into training and test sets according to the descriptors in the pharmacophore model helps to create a good model. It is difficult to track the Local Reactive Descriptor (LRD) effect of the pharmacophore at each interaction point in the 3D metric system. A subset of clusters of atoms can correspond to all or part of the pharmacophore structure. In this study, the multidimensional system of the subset was reduced to a one-dimensional index and the Vector Fingerprint Functions (VFF) of the molecules were created. Models were established by dividing molecules with close and similar VFFs into training and test sets. Sub-clusters were examined for all molecules by applying the Genetic Algorithm (GA). The model was evaluated using the Leave-One-Out Cross-Validation (LOO-CV) method and verified with an external test set. The statistical results of the model obtained with the division method developed here (Q² = 0.604 and R² = 0.760 for the training-80 and external test-20 sets, respectively) were compared with the results of random and manual division. Among the splitting methods, VFF has the best performance. In the 3D metric system, the different subsets formed from the atom clusters produced by alignment and superimposition were examined, and their effects on activity were investigated by GA. The contribution of each element of the subset to the activity was observed as a fingerprint of the relevant molecule. The subset that gives the best results shows the geometric and electronic properties of the pharmacophore (Pha). The vectorial index of the interaction points of the subset thought to give the best result is given, and the activity change at each index is plotted for 5 molecules chosen as samples (4 in the training set and 1 in the test set). As the graph shows, how much the activity of a molecule increases or decreases at each interaction point of the subset can be shown for the first time in this study.
How the subset contributes to the activity of any molecule at each point of a one-dimensional vector index forms the molecule's fingerprint. According to the similarities in VFF, the molecules were safely divided into training and test sets, which prevents two molecules that are representative of, or very similar to, each other from accumulating in one set.


Introduction
In 3D-QSAR studies, it is necessary to group local descriptors derived from the geometric and electronic values of the atoms in the 3D metric system into common clusters. According to the similarities of descriptors within a set, molecules can be divided into training and test sets in the most appropriate way. However, the similarity must hold in all clusters of the VFF's index, not in just one cluster. Similarity and dissimilarity in VFF serve this purpose. Similar molecules can be divided into training and test sets at the rates of 4/5 and 1/5. The accumulation of molecules with similar descriptors only in the training set or only in the test set prevents a good model from being found. Therefore, by dividing such molecules in appropriate ratios into two sets, a model is created in one set and validated in the other. Placing similar molecules in two different sets makes the creation and validation of the model meaningful. Optimum division of molecules with similar activities and descriptors into training and test sets increases the reliability of the model.
To create clusters, the molecules must be arranged to interact with the receptor in the same orientation and geometry. For this reason, the compounds are superimposed in 3D on a common chemical structure so that they are compatible with the receptor. Atoms stacked at similar coordinates can interact with a corresponding common point of the receptor. Some, but not all, of the clusters formed may play a role in the activity of molecules. To determine the sub-cluster that constitutes the pharmacophore structure, different sub-clusters formed by combination should be considered. One of the subgroups may have a 3D structure that represents all or part of the pharmacophore. Since such a sub-cluster in the 3D coordinate system creates a multidimensional vector field, the resulting pharmacophore interactions cannot be directly observed. To easily analyze the relationship between the descriptor and the activity, the multidimensional systems of the sub-clusters can be reduced to a one-dimensional vector system as an index. Ordering the sub-cluster as an index can reveal the similarity in the activity change of molecules with similar descriptors at the same index. The activity similarities of two molecules with similar descriptors throughout the elements of the sub-cluster can be clearly observed, because their curves follow each other closely along the one-dimensional index. Accordingly, molecules with close activity values can be followed clearly according to their one-dimensional functional changes. Thus, of two molecules with the same descriptors, one can be placed in the training set and the other in the test set, considering both activity similarity and descriptor similarity.
The purposes of reducing the multi-dimensional vector space to a one-dimensional index are: a) to help obtain a better model from spatially complex data while avoiding the loading of excess information; b) to view the data in a simpler and more compact form rather than in a high-dimensional space; c) to express the main types of variation with fewer numbers (and, if possible, to make estimates from these numbers); and d) to carry the most important variations into the model instead of meaningless and unnecessary dimensions.
The "similarity principle" approach, according to which similar molecules are likely to exhibit similar (physicochemical and/or biological) properties, is often used in cheminformatics. In chemistry, properties close to the defining properties of the leading compounds are an important paradigm as a guide and determinant in the design and synthesis of new analogues [1]. Methods that address similarities in clusters, such as the dimensionality reduction procedure of Locally Linear Embedding (LLE), are widely used [2]. The multidimensional input space has been transformed into a one-dimensional function and the concept of Dimensional Reduction (DR) has been applied [3]. Principal Component Analysis (PCA) is a standard algorithm commonly used to reduce dimensionality [4]. Multi-Dimensional Scaling (MDS) is a technique similar to PCA but more general [5].
There are also many different dimensionality reduction algorithms besides PCA, ranging from classic linear algebra to non-linear techniques such as Kohonen Self-Organizing Maps (SOM) [6], MDS [7], and Stochastic Embedding [8]. These techniques are more widely used for the common purpose of finding similarity and similarity matrices [9]. In previous 3D-QSAR studies, the optimum division into both sets was discussed in terms of the similarities of the descriptors found in the sets in the vector space system rather than in the metric space [3]. In other studies, vector space has been found to perform better than metric space [10][11]. Many different descriptor-based methods are used independently for molecular similarity, and there are other molecular similarity comparisons, including molecular graph matching approaches [12]. The Tanimoto (Jaccard) similarity coefficient is widely used to measure molecular similarity [13][14]. However, many other similarity/distance methods have been considered as vector spaces by researchers [15]. The most common preference is to select the regulated variables following the sequence given by the descriptor variance of sub-clusters [16]. In particular, the Most Predictive Variable Method (MPVM) has been used successfully to select the optimum variables [17]. Clusters can have infinite dimensions in space, but cluster centers with a total of M dimensions for the molecule being studied are discussed [18]. To identify the best sub-cluster of molecular descriptors, the vector space reduction is performed using the Truncated Singular Value Decomposition (TSVD) [19]. The more compatible the structure, descriptors, and activity values of the molecules, the better the model that is built [20].
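As a brief illustration of the Tanimoto (Jaccard) coefficient mentioned above, the following minimal sketch computes it for two fingerprints represented as sets of "on" feature indices. The function name, the example fingerprints, and the convention for two empty fingerprints are assumptions made for illustration; they are not taken from the study.

```python
def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) coefficient: |A & B| / |A | B| for two feature sets."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        # Assumed convention: two empty fingerprints are treated as identical.
        return 1.0
    return len(a & b) / len(union)

# Hypothetical fingerprints given as indices of the features that are present.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8, 9}
similarity = tanimoto(fp_a, fp_b)  # 3 shared features out of 5 distinct ones
```

The coefficient ranges from 0 (no shared features) to 1 (identical feature sets).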
A variety of techniques can produce training and test sets in QSAR studies. Of the four techniques considered here, the simplest are the first, random division, and the second, manual division [21][22]. In these, molecules are placed in training and test sets either by using random computer-generated numbers within the range of the number of members of the data set, or manually by considering the structure and activity values [23]. The third and fourth methods are automatic and rational algorithms that provide activity balancing. In the third method, automatic model generation approaches such as Bayesian Neural Networks (BNN) [24], Relational Neural Networks (RelNNs) [25], and Gaussian Processes (GP) [26][27] are used to separate the data values of the examined molecules into sets. In the fourth method, different rational algorithms are applied for the rational division of sets, such as SOM [28], D-optimal designs [29][30][31][32], sphere exclusion [32], Directional Sphere Exclusion (DISE) [33], and the Kennard-Stone (KS) algorithm [34][35].
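For reference, the random division used as a baseline can be sketched as follows. This is a minimal illustration under assumed names (`random_split`, `test_fraction`, `seed`); the molecule identifiers 1 to 100 are placeholders, and the fixed seed is only there to make the example reproducible.

```python
import random

def random_split(molecule_ids, test_fraction=0.2, seed=0):
    """Shuffle the identifiers and cut off the first test_fraction as the test set."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    ids = list(molecule_ids)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return sorted(ids[n_test:]), sorted(ids[:n_test])

# Placeholder identifiers for a 100-molecule data set, divided 4/5 and 1/5.
train_ids, test_ids = random_split(range(1, 101))
```

Because this division ignores both structure and activity, two very similar molecules can land in the same set purely by chance.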
In this study, the experimental activity values of 100 flavonoid molecules taken from the literature [36] were compared to their theoretical activity values, and the molecules were divided into training and test sets according to their similar VFFs. While the receptor parameter remains constant for each index during the formation of VFF, the electronic values of atoms in the same index can vary from molecule to molecule. The change in the activity of a molecule can be observed graphically in VFF.
The increase or decrease in a VFF occurs because the electronic value of the atom of the molecule in that index is positive or negative and large or small. Since the fingerprint VFFs show how the activity of a molecule varies throughout the index, molecules that show similarity in all index elements are optimally distributed between the training and test sets without clustering in the same set. By sharing similar activity changes of molecules between training and test sets, the pharmacophore model can be proposed more reliably.

Vector Fingerprint Function (VFF)
VFF is a characteristic of each molecule that reduces the vector space field of the sub-cluster responsible for activity from multiple dimensions to one dimension. As the name suggests, the activity of a molecule varies functionally across the sub-cluster elements listed as a one-dimensional index. Increasing or decreasing contributions of atoms are easily seen at the interaction points formed throughout the index. At each of these points, while the parameter of the receptor is constant, the molecule's Local Reactivity Descriptors (LRDs) (large/small and positive/negative) contribute differently and lead to a different fingerprint for each molecule. Therefore, VFF, which shows the interaction at every point, is a good recognition tool that shows the behavior of molecules. In short, the changes in activity at the interaction points are different for each molecule, and by adding up the contributions to the activity at each point, a value close to the experimental activity of the molecule can be calculated. The change in activity calculated throughout the index is the fingerprint of the molecule. Hence, the biological or physicochemical properties of the molecules in the studied series can be easily determined from the one-dimensional VFF value. The increase and decrease in the activity of the examined molecules against VFF values can be monitored graphically. The effect of each interaction point of the proposed pharmacophore is clearly visible with VFF. Here, similarities and differences in VFF can be easily seen in the form of a molecule's fingerprint.
Since the activity change of mutually similar molecules at each point is functionally visible with VFF, it is possible to divide the molecules into training and test sets safely. The molecules can be divided so that the VFFs in the training set are similar to those in the test set. Thus, it may be possible to verify the model proposed in the training set with analogues in the test set. The purpose of preparing VFF is to create a good model by dividing the molecules into two sets in the best way according to the similarities of VFF, without the need for larger molecule sets.
Here, for the first time, we divide the molecules into training and test sets by tracking the activity change along the one-dimensional vector index. Accordingly, we try to create an optimum model by preventing similar molecules from accumulating in one set and distributing the appropriate number between the two sets. We discuss the division of the compounds with an automated and rational approach based on the activity similarities originating from VFF, which is introduced to the literature here for the first time. Since we were not able to reimplement the algorithms of other rational or automatic division methods in the homemade Molecular Conformer Electron Topological (MCET) method, we cannot compare their performance with that of VFF. We compare this separation method only with random and manual separation methods to show the differences, innovations, and developments.

Principle And Method
A series of 100 flavonoid molecules, whose skeletal structures and activities are given in Table 1, was taken from the literature, and the pharmacophore structure responsible for their activities was examined [36]. Molecular conformers were determined using MMFF in the Spartan'10 program, and quantum chemical calculations were made at the Hartree-Fock 6-31G* level. For each conformer to be used in the MCET method, the '*.txt' files produced by Spartan were converted to "Electron Topological Matrix" (ETM) files with the ETM-Program (ETM-P) [37][38][39][40].
The electronic values of atoms, such as Natural, Mulliken and Electrostatic atomic charges, Fukui indices (f+ and f-), and coefficients in the HOMO/LUMO orbitals, can be regarded as LRDs of molecules. These descriptors provide detailed analysis for understanding the 3D interaction of the molecule with the receptor. The Fukui index and the Frontier Orbital approach, which use atomic coefficients in the frontier orbitals, are related to hydrogen bond interactions (H-donor/H-acceptor) and covalent interactions. On the other hand, Natural, Mulliken and Electrostatic atomic charges are calculated from the energy values of all occupied orbitals and their atomic coefficients, and are associated with Coulombic interactions. Since the "Klopman Index" (KI) relates to both ionic and covalent interactions by using atomic charges and atomic coefficients, it is more realistic than the LRDs mentioned above, and it was used for the first time for ligand-receptor (L-R) interactions in a study submitted by us. Owing to these characteristics, the Klopman Index used in the interaction energy between L-R is a very comprehensive and powerful descriptor. In MCET [38][40][41][42][43][44], we developed a new algorithm for KI that considers the total value of both Coulombic and covalent interactions.
The molecules must be matched against a template so that LRDs are optimally clustered in the 3D metric system and brought into a receptor-compatible geometry. For this purpose, a core structure common to all molecules, which is part of the pharmacophore, can form the starting point of the matching. The core structure is a geometric and electronic structure formed by the combination of 3, 4 or 5 atoms in the selected template. At least one of them must be a functional atom (X: O, N or S in C-X and C=X). The core structure must not only be common to all molecules electronically and geometrically, but must also ensure that the maximum number of the remaining atoms match. With this structure, the beginning of the pharmacophore is formed, and the pharmacophore can then be completed by adding only the useful ones from the remaining clusters [45].
both the clustering of other atoms and the determination of the pharmacophore structure. The more realistic the core structure chosen as the common area of the pharmacophore, the better the remaining atoms of the molecules will overlap with those of the template. For this purpose, the first three atoms in the common core structure of the molecules are placed in Cartesian coordinates at (0, 0, 0) for atom 1 (the origin), (x, 0, 0) for atom 2, and (x, y, 0) for atom 3. Thus, the first three atoms of all conformers lie in a common plane, and the coordinates of the other atoms are rearranged accordingly. Atoms of different molecules can be assigned to the same cluster when their coordinates agree within the cubic volumetric tolerance (dτ = dx * dy * dz, where dx = dy = dz). Unlike the atoms in the core structure, the electronic values of the other atoms in the same cluster do not need to be within a certain limit. Molecules containing atoms in a cluster will contribute to the interaction with the receptor according to their positive/negative and small/large electronic values.
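The placement of the first three core atoms can be illustrated with a short sketch. It builds an orthonormal frame from the three atoms (a Gram-Schmidt construction, which is an assumption about how the transformation could be carried out, not a description of the MCET implementation) and re-expresses every atom in that frame. The function name `to_core_frame` and the example coordinates are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _unit(a):
    norm = math.sqrt(_dot(a, a))  # zero if the core atoms coincide or are collinear
    return tuple(x / norm for x in a)

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def to_core_frame(coords, i1, i2, i3):
    """Move atom i1 to the origin, atom i2 onto the x-axis, atom i3 into the xy-plane."""
    p1, p2, p3 = coords[i1], coords[i2], coords[i3]
    e1 = _unit(_sub(p2, p1))                     # new x-axis: atom 1 -> atom 2
    v = _sub(p3, p1)
    proj = _dot(v, e1)
    e2 = _unit(tuple(vi - proj * ei for vi, ei in zip(v, e1)))  # new y-axis
    e3 = _cross(e1, e2)                          # new z-axis, normal to the core plane
    return [tuple(_dot(_sub(p, p1), e) for e in (e1, e2, e3)) for p in coords]

# Hypothetical conformer: the first three entries play the role of the core atoms.
conformer = [(1.0, 2.0, 3.0), (2.0, 3.0, 3.5), (0.5, 3.0, 2.0), (2.0, 0.0, 1.0)]
aligned = to_core_frame(conformer, 0, 1, 2)
```

Because the transformation is a rigid translation and rotation, all interatomic distances are unchanged; only the common reference frame differs.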
For a given core structure, the conformer with the maximum number of atoms matching the template is chosen to represent its molecule. However, different conformers may be representative for each of the core structures derived from the template. The second critical step is to select the molecular conformation that matches the template at the maximum number of atoms and is the most compatible with the receptor. Initially, the molecules are aligned on the core atoms of the template, and the conformation with the highest number of superimposed atoms then represents all conformers of its molecule. It is the structure of the molecule most compatible with the receptor that is involved in the interaction. When the skeleton of a molecule is mentioned, this selected and compatible conformation is meant. By preventing unsuitable structures from representing the molecule, a clean clustering of the atoms is ensured. At a given volume tolerance, the atoms of the conformers can be optimally superimposed, depending on their corresponding positions. The best clustering occurs when all molecules can be superimposed on the template structure with the highest number of atoms. The clustering of superimposed atoms provides a 3D similarity between molecules, and some of these clusters may have similarities that allow interaction with the receptor. With an arrangement resulting from the similarity in the 3D coordinate system, the ligands are all oriented toward the receptor in the same, consistent manner.
As a result of alignment and superimposition, the molecules are clustered in the same vector space, and the molecules and their atoms are recorded as vector elements in two separate sequences. Furthermore, the coordinates of all atoms in the template framework may not carry the entire structure of the receptor and the pharmacophore. Although the template frame is used as a reference, it may not have enough representative power for a mature clustering. In addition to the template, atoms of molecules with different skeletons give rise to different spatial coordinate values. When constructing a 3D-QSAR model, the number of clusters needed to cover all the interaction points of the receptor side is achieved by means of numerous and structurally different molecules. This leads to reference molecules that provide new cluster centers. The idea that the most and least active molecules have atoms that significantly alter activity may mean that their atomic coordinates are worth using as references. Some of these molecules are reference molecules that provide number-rich clustering at different coordinates in addition to the atomic coordinates of the template. The selection of reference molecules with different skeletal structures allows the production of more diverse clusters. If the coordinate values of the atoms in the reference molecules are similar to those in the template, no new field is created, but if they are different, a new cluster field may begin. However, when a cluster source does not have enough atoms within its tolerance limits, it cannot lead to clustering [38]. Since one atom from each molecule can be placed in a cluster, clusters that do not contain enough atoms are neglected. Although "enough" is relative, it can be defined as the ratio of the number of atoms in the cluster to the number of molecules (e.g. 1/3, 2/5, 1/4).
To keep the total number of clusters, M, manageable, the number of atoms in a cluster is increased or decreased by increasing or decreasing the tolerance value, dτ. Clustering atoms by their positions means considering their reactivity depending on where they are located. In ligand-based approaches, the receptor's interaction points can be determined by the clusters. The M clusters in total form an M-dimensional vector space with different atoms in each set. The ligand can interact with the receptor in an m-element portion of the geometric surface consisting of these clusters. This contact can be considered as the geometrical structure of the m-dimensional sub-cluster, that is, the structure of the receptor responsible for activity. To prevent unwanted background noise, clusters that do not provide enough improvement are ignored in the sub-cluster, and an effective sub-cluster is created accordingly. Once the total number of clusters has been determined, sub-clusters are created by forming combinations. A stochastic control is performed with GA for each sub-cluster examined. During the creation of the sub-cluster, a new set is created by adding or removing fields using GA with acceptance or rejection. For this, a filter is applied that minimizes the errors over all compounds, and a sub-cluster is created by rejecting insufficient variables with GA. The most suitable sub-cluster for constructing a pharmacophore model is the one with the best statistical result compared to the other sub-clusters.
As can be seen from many previous applications, GA has been able to find the most appropriate solutions to problems in an equally wide range of research areas [46]. In this context, even the lowest-error solution produced by GA cannot be proven to be the most suitable solution. Optimization problems are solved by checking the accuracy of parameters and by saving and restoring parameter sets so that they can be applied effectively to many problems [47]. Some results produced by GA confirm an optimal solution that is consistent across samples. This does not mean that an analytical solution to the problem has been found, but GA can be of great help in non-polynomial problems [48]. As with other programs, there are many uncertainties in the MCET program that cannot be avoided because of the algorithm to which GA is applied. Therefore, with GA it is more appropriate to speak of a compatible and practical solution rather than an exact analytical solution.
A 3D-QSAR model using LRDs is required to demonstrate the 3D interaction between L-R. A good model can be created only with a correctly selected LRD type. Different models are formed from different types of LRDs of the atoms corresponding to the sub-cluster [43]. The parameter values of the receptor are calculated as adjustable constants according to the LRD values of the ligand atoms corresponding to the sub-cluster. Using the sub-cluster model, the activity of each molecule is calculated from the LRD values of the atoms on the ligand side and the parameters of the receptor side.
The activity of a molecule that contains an atom in a cluster will vary depending on the electronic value of the atom and the corresponding parameter of the receptor. However, given the large number of clusters, only a certain sub-cluster corresponding to the interaction points of the receptor will be meaningful. Since the elements of the sub-cluster span a multidimensional vector space, the contribution of each is difficult to follow. Therefore, it is useful to place the sub-cluster in a consecutive index along an axis to show the activity change of each molecule depending on the atom it contains. Since the sub-cluster elements are listed on the axis with the same index numbers for all molecules, the activity change of each molecule takes an individual shape. The changes in the activity of the molecule at each point along the axis create a VFF. It is then both safe and easy to divide molecules into training and test sets based on VFF similarity.
Consider an example of a series of molecules described by a series of high-dimensional clusters. Another transform (DR with VFF) is

$$x_n \in \mathbb{R}^m \rightarrow z_n \in \mathbb{R}: \quad z_n = g(x_n, w), \quad z \in \mathbb{R} \qquad (3)$$

The function $z_n$ in Eq. (3) shows the fingerprint of the n-th molecule.
According to the z-vector values in the one-dimensional z-axis, the y activities in the y-axis can be calculated as in Eq. (4).
The x-independent variable values and w-parameters in the multi-dimensional vector space are used on the m indices along the z axis. The x and w arguments in multi-dimensional sub-clusters are placed in indexes on the one-dimensional z-axis.
Accordingly, plotting the dependent variable, activity, along the y-axis is an important simplification. The values of z are not plotted, since z simply appears as positive or negative with increasing or decreasing activity at each index. The difference between y and z is that the value of z is the positive or negative contribution at the corresponding index, while the value of y represents the total change in activity up to that index resulting from those positive or negative changes. In short, the values of y are the running sums of the values of z. The y-axis values corresponding to each index along the axis form the VFF. Molecules with similar changes in their VFF can easily replace each other. For the same zn element of any two molecules examined for similarity, the differences on the y-axis were evaluated by the total deviation in the partial least squares method, and the two molecules with the best similarity were placed in the training and test sets.
The activity of a molecule varies in proportion to the electronic value of the atom located in any element of the sub-cluster. For a molecule without an atom at a given index of the sub-cluster, the activity does not change at that index. Here, m is the total number of interactions on the receptor side, that is, the number of interaction points of the pharmacophore. The interaction number of a ligand, m_n, can be at most m (m_n ≤ m), and if a molecule contains an atom in every element of the sub-cluster, m_n = m. Therefore, molecules can have the same or different values of m_n. The formation of different VFFs for each molecule is due not only to the difference in occupied clusters, but also to the difference in the LRD value in each cluster. Therefore, the VFF of each compound is determined using the LRD values of the atoms occupying the sub-cluster. Molecules with different activities due to different VFFs may have different LRDs within the same element, even if they do not contain atoms in different elements of the same sub-cluster. The difference between VFFs is due both to having different sub-clusters and to different LRD values at the same element index. For molecules that contain atoms in different elements of the same pharmacophore sub-cluster, two different sub-clusters can occur. While the pharmacophore sub-cluster belongs to the receptor, the two different sub-clusters belong to the molecules. Accordingly, even if two molecules have the same sub-cluster, they may have different VFFs due to different LRD values. The VFF change resulting from the multiplication of the LRD value of the ligand side in the filled sub-clusters by the parameters on the receptor side is considered to be the change in activity. Similar sub-clusters of molecules with similar activities, together with similar LRD values of the atoms in each cluster, give rise to similar VFFs.
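The construction described above can be sketched in a few lines. The sketch assumes that the contribution at each index is the product of the ligand LRD value and the receptor parameter, that an index without an atom contributes zero, and that the VFF is the running sum of these contributions; the "total deviation" between two VFFs is taken here as the sum of absolute differences, which is an illustrative choice. The names `vff` and `total_deviation` and all numerical values are hypothetical.

```python
from itertools import accumulate

def vff(lrd_values, receptor_params):
    """Running sum of per-index contributions z_i = w_i * x_i.

    lrd_values uses None where the molecule has no atom in that sub-cluster
    element, so the activity does not change at that index.
    """
    z = [0.0 if x is None else w * x for x, w in zip(lrd_values, receptor_params)]
    return list(accumulate(z))

def total_deviation(vff_a, vff_b):
    """Assumed similarity measure: sum of absolute differences along the index."""
    return sum(abs(a - b) for a, b in zip(vff_a, vff_b))

# Hypothetical receptor parameters for a sub-cluster with m = 3 elements.
w = [2.0, -1.0, 0.5]
vff_a = vff([0.5, None, 2.0], w)   # molecule A has no atom at index 2
vff_b = vff([0.5, 1.0, 2.0], w)    # molecule B occupies every index
deviation = total_deviation(vff_a, vff_b)
```

The last element of each VFF is the total calculated activity change, while the intermediate elements show where along the index the two molecules start to differ.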
The similarity or difference between the VFFs of two compounds is calculated from the atoms present and absent in each cluster and the LRD values of those atoms. This similarity or difference leads to different divisions into training and test sets, and therefore to different models. It is pointless to place two VFFs that are very similar to each other both in the training set or both in the test set. However, placing two molecules with similar VFFs in separate sets helps to identify and validate the model. For structure-activity similarity, the molecules are ordered according to their activities so that molecules with close activities are grouped together. Since the training and test sets contain approximately 4/5 and 1/5 of the molecules, the molecules are divided into groups of 5. The maximum structure-activity similarity is due to the similarity in the VFFs used as arguments in the model. If the LRD values of two molecules in a sub-cluster were the same (which is almost impossible), the function curves of their VFFs would overlap exactly and would necessarily show the same activity change at each interaction point. When the VFF curves of two molecules are very close to each other, one of them can be placed in the training set and the other in the test set. Estimating and verifying the model with this arrangement is a good approach.
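One possible reading of this grouping procedure is sketched below: the molecules are sorted by activity, cut into consecutive groups of 5, and in each group one member of the closest VFF pair is moved to the test set. Which member of the pair is held out, the use of the sum of absolute differences as the deviation, and all names and numbers are assumptions made for illustration only.

```python
def vff_deviation(vff_a, vff_b):
    """Assumed deviation measure: sum of absolute differences along the index."""
    return sum(abs(a - b) for a, b in zip(vff_a, vff_b))

def split_by_vff(activities, vffs, group_size=5):
    """Sort by activity, group in fives, hold out one of the two most similar VFFs."""
    order = sorted(activities, key=activities.get)
    train, test = [], []
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        if len(group) < 2:
            train.extend(group)  # a lone leftover molecule stays in training
            continue
        # Closest pair within the group; the second member is held out (assumption).
        _, _, held_out = min(
            (vff_deviation(vffs[a], vffs[b]), a, b)
            for i, a in enumerate(group) for b in group[i + 1:]
        )
        test.append(held_out)
        train.extend(m for m in group if m != held_out)
    return train, test

# Hypothetical data: ten molecules with activities 1..10 and three-point VFFs.
activities = {f"m{i}": float(i + 1) for i in range(10)}
vffs = {
    "m0": [0.1, 0.3, 1.0], "m1": [0.5, 1.5, 2.0], "m2": [0.5, 1.4, 2.1],
    "m3": [1.0, 0.0, 3.0], "m4": [2.0, 3.0, 4.0], "m5": [3.0, 4.0, 6.0],
    "m6": [1.0, 5.0, 7.0], "m7": [2.0, 2.0, 8.0], "m8": [2.0, 2.5, 8.0],
    "m9": [4.0, 6.0, 10.0],
}
train_set, test_set = split_by_vff(activities, vffs)
```

In this toy data set, m1/m2 and m7/m8 are the closest pairs in their groups, so each test molecule has a close analogue left in the training set.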
The LRD value in the sub-cluster can contribute positively or negatively, and to a larger or smaller extent, to molecular activity. Due to LRD similarities at the same indices of the sub-cluster, the activities of the molecules are similar at these points. The VFFs of two molecules whose LRD values are very close at each point of the sub-cluster show very similar activity changes. This means that the index numbers in the sub-cluster and the LRD values of the atoms involved in the activity of a molecule placed in the test set are similar to those of at least one of the molecules in the training set. On the other hand, despite the same clusters, different VFFs are formed from different LRDs. The VFFs of two molecules can be similar only if the molecules occupy the same clusters and have close LRD values. The better this similarity, the stronger the case for including one of the two compounds in the training set and the other in the test set. Thus, the same sub-cluster together with similar LRDs at each point generates similar VFF values, which indicate the degree of similarity of the molecules, and two similar molecules can be divided between the sets accordingly. Here we present an important innovation toward the best and most consistent splitting algorithm by comparing the models obtained with VFFs. For samples in the same series of molecular activity data, VFF-dependent splitting results in different parameters and models. The model selection is based on the predictive (Q²) and validation (R²) power in the training and test sets, using the LRDs in the sub-cluster as independent variables. Moving even a few samples between the original training and (external) test sets leads to different models; the separation of the two sets therefore cannot be left to chance, as in a random or manual algorithm. Therefore, we aim to place the samples in the training and test sets by splitting according to the VFF descriptor, so that the model can optimally explain the interaction between L-R.
To reduce the effect on the parameters of variation in the original training and (external) test sets, advanced models built with LOO-CV are presented (although the number of molecules in the training set is small). The parameters of the model proposed in the training set were checked in the external test set. With the LOO-CV method, the samples in the data set were used more efficiently, without the need for a separate validation set. As a result, with the VFF-based division the model was both trained and externally validated on coherently separated training and test sets.
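The LOO-CV statistic can be illustrated with a minimal sketch. It uses the common definition Q² = 1 - PRESS/SS, where PRESS is the sum of squared leave-one-out prediction errors and SS is the sum of squared deviations from the mean of all training activities; some authors use the mean of each reduced training fold instead. A one-variable least-squares line stands in for the actual multi-parameter model, and all names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x

def q2_loo(xs, ys):
    """Leave-one-out cross-validated Q^2 = 1 - PRESS / SS."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit the model without sample i, then predict sample i.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    ss = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss

# Hypothetical descriptor values and activities.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
exact_activity = [2.0 * x + 1.0 for x in descriptor]
noisy_activity = [3.1, 4.9, 7.2, 8.8, 11.1]
q2_exact = q2_loo(descriptor, exact_activity)
q2_noisy = q2_loo(descriptor, noisy_activity)
```

Each sample is predicted by a model that never saw it, which is why no separate validation set is needed; the external test set is then used only for the final check.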

Result And Discussion
Since the geometric structure of each conformer remains constant, the off-diagonal values, given as lengths in Å, remain the same. However, the electronic data of the atoms at the diagonal positions are given as variables in rows of the matrix. Diagonal values are optionally marked and used as atomic descriptors in Figure 1.
Since the template is chosen as a simple structure with few atoms, and a single conformation cannot contain all interaction points, new reference positions must be created to increase the number of proposed interaction points. In addition to the atomic positions of the template, the atomic positions of the most and least active molecules can be used as reference points for clustering. Positions that increase the activity can be found from the most active molecules, and positions that decrease it from the least active ones. Therefore, the positions of the interaction points can be created from the coordinates of the atoms of some selected molecules. These coordinates are those obtained after the first three atoms of the core structure have been moved to the coordinate center in all molecules. For the interaction points of the proposed pharmacophore; a) atomic numbers of the conformer in the reference According to the tolerance scale (e.g. 0.3 Å), atoms with similar positions were placed in the same cluster. For the atoms in each cluster, this yields two vectors: the sequence numbers of the molecules and those of the atoms. Thus, a multidimensional vector space was created from clusters with different positions in the 3D system and the vector elements in them. Apart from the clusters that make up the core structure, it is unlikely that every molecule has an atom in a given cluster. Generally, clusters do not contain atoms of all molecules. Therefore, a cluster is retained or ignored depending on whether its number of atoms reaches a threshold, and the total number of clusters increases or decreases with the chosen threshold value. If the number of atoms placed in a cluster reached approximately 2/5 of the total number of molecules, the number of sub-clusters was increased by one.
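The clustering step can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure implemented in MCET: each reference position collects at most one atom (the nearest) per molecule if it lies within the tolerance, and a cluster is retained when its population reaches a fraction of the number of molecules. The function names and the `min_fraction` parameter are hypothetical.

```python
import numpy as np

def assign_clusters(reference_points, molecules, tolerance=0.3):
    """For each reference position, collect (molecule index, atom index) pairs
    of atoms lying within `tolerance` (in Angstrom) of that position."""
    clusters = []
    for ref in np.asarray(reference_points, dtype=float):
        members = []
        for mol_idx, coords in enumerate(molecules):
            dists = np.linalg.norm(np.asarray(coords, dtype=float) - ref, axis=1)
            atom_idx = int(np.argmin(dists))      # nearest atom of this molecule
            if dists[atom_idx] <= tolerance:
                members.append((mol_idx, atom_idx))
        clusters.append(members)
    return clusters

def keep_populated(clusters, n_molecules, min_fraction=0.4):
    """Retain clusters whose population reaches about 2/5 of the molecules."""
    return [c for c in clusters if len(c) >= min_fraction * n_molecules]
```

The pairs stored for each cluster correspond to the two number sequences mentioned above, one for molecules and one for atoms. Raising or lowering `min_fraction` decreases or increases the number of retained clusters.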
Among these clusters, the total number of members of the sub-cluster predicted by GA was found to be m = 9 in this study, and their coordinates are given in Table 2. Once the coordinates of a compound are arranged with respect to the core structure, the atoms of the compound are readily clustered. To provide a common orientation for the coordinates of the compounds, the x, y, z coordinates of the first three atoms of the core structure are set to (0, 0, 0; x2, 0, 0; x3, y3, 0). Accordingly, atom-1 (O1) is at the coordinate origin, atom-2 (O2) is on the x-axis, and atom-3 (C2) is in the xy-plane. The other atoms were placed in this common geometric arrangement relative to these three atoms, using internal coordinate values.
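One way to bring a compound into this common orientation is a rigid translation and rotation, sketched below. The function name is hypothetical, the first three rows of `coords` are assumed to be the three core atoms in the order given above, and these atoms are assumed not to be collinear.

```python
import numpy as np

def to_core_frame(coords):
    """Rigidly move a molecule so that atom 0 is at the origin, atom 1 lies on
    the positive x-axis, and atom 2 lies in the xy-plane with y > 0."""
    coords = np.asarray(coords, dtype=float)
    shifted = coords - coords[0]                   # atom 0 -> (0, 0, 0)
    e_x = shifted[1] / np.linalg.norm(shifted[1])  # new x-axis through atom 1
    # Component of atom 2 perpendicular to the x-axis defines the new y-axis.
    perp = shifted[2] - np.dot(shifted[2], e_x) * e_x
    e_y = perp / np.linalg.norm(perp)
    e_z = np.cross(e_x, e_y)                       # right-handed z-axis
    rotation = np.vstack([e_x, e_y, e_z])          # rows are the new basis vectors
    return shifted @ rotation.T                    # coordinates in the new frame
```

Because the transformation is rigid, all interatomic distances are preserved, so atoms of different compounds can be compared position by position afterwards.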
Sorting the activities (e.g. from largest to smallest) makes it easy to group molecules with similar activity. Since the ratio of molecules in the test set to those in the training set is approximately 1/4, groups of 5 molecules are formed, from each of which 1 molecule is assigned to the test set and the other 4 to the training set. The activities of two molecules resemble each other to the extent that the LRDs corresponding to the same index of the interaction points are similar. Therefore, the similarity of molecular descriptors within a group can be followed from the sum of squares of the differences of LRD values at each interaction point. In other words, the smaller the difference between the LRD values of two molecules at each point, the smaller the sum of the Partial Small Square (PLS) values. Since molecules may lack atoms in a given element of the sub-cluster studied, or may contain atoms with different LRDs, each molecule may have a different VFF, and differences therefore arise in the activity of each molecule. The VFFs of a group of 5 molecules selected as an example are shown graphically in Figure 2. The similarity seen in the graph can be quantified by the smallest total value calculated using PLS. Thus, considering the similarities at each point, the two molecules that are most similar at all points are divided between the training and test sets. The division of the molecules into the two sets is judged most appropriate when the predicted and validated activities are very close to the observed ones.
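The grouping and division described in this paragraph can be sketched as follows. This is an illustrative reading, not the exact implementation: molecules are sorted by activity, taken in groups of five, and within each group the pair with the smallest sum of squared LRD differences is split between the sets. Which member of the pair goes to the test set is not specified in the text; sending the second one is an assumption, as are the function name and the convention that `lrd` holds one row of LRD values per molecule.

```python
import numpy as np
from itertools import combinations

def split_by_similarity(activities, lrd, group_size=5):
    """Return (train, test) index lists: one test molecule per activity-sorted group."""
    activities = np.asarray(activities, dtype=float)
    lrd = np.asarray(lrd, dtype=float)
    order = np.argsort(-activities)                # largest activity first
    train, test = [], []
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size].tolist()
        if len(group) < 2:                         # leftover molecule: keep in training
            train.extend(group)
            continue
        # Pair with the smallest sum of squared LRD differences over all points.
        best = min(combinations(group, 2),
                   key=lambda p: np.sum((lrd[p[0]] - lrd[p[1]]) ** 2))
        test.append(best[1])                       # assumption: second member is tested
        train.extend(i for i in group if i != best[1])
    return train, test
```

With a group size of 5 this reproduces the approximate 1/4 ratio of test to training molecules, and every test molecule has at least one close counterpart in the training set.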
In this article, the automated and rational VFF method that we have newly developed is compared with random and manual division techniques for the flavonoid series. VFF can be compared with other automated or rational approaches only within the MCET method. The paradigm of MCET, which operates on LRDs such as the Klopman index, is quite different from that of any other method. Therefore, it does not make sense to compare an automated or rational approach used in another study with VFF, which is applied only in MCET. Moreover, since we have not yet implemented any other automated or rational method in MCET, we were able to compare VFF only with the manual and random methods. Nevertheless, because it deals with the details of the molecular interaction at every point of the pharmacophore, the method we developed is a safe and rational approach. The statistical values of the models from VFF (shown in bold) and the other two approaches are given in Table 3. The higher statistical results of the model resulting from the VFF split, compared with the other approaches, indicate that VFF is more reliable. Since the 100 molecules have very different skeletons and scattered conformations, choosing the pharmacophore was not easy. After the parameters of the receptor side of the interaction between L-R were calculated, the activity changes were shown graphically in VFF form, which contains one-dimensional index elements. From the VFF graph, the user can see how the interaction at each point of the proposed model changes. Therefore, for researchers, the 3D pharmacophore proposed by using VFF has been more attractive than expected.
A model obtained with LOO-CV applied to the training set of molecules was proposed with Q² = 0.604 and was confirmed with R² = 0.760 in the strategically excluded test set. The reasons why R² is greater than Q² are:
a) the number of molecules in the test set is considerably smaller than in the training set; b) the molecules in the training set were divided so as to be sufficiently similar to the molecules in the test set; c) the independent variables used in the model are few and well chosen.
When the Q² and R² values reported in the literature [36], from which the experimental activity values were taken, are compared with those obtained by the MCET method, differences are observed. The reason is that we used the 4D-QSAR MCET method, whereas the literature study [36] used the hologram quantitative structure-activity relationship method. Since the two methods base their calculations on different parameters, their results cannot be compared directly.

Conclusion
To calculate the activity of the flavonoid series derivatives and divide them into training and test sets, three different division methods were used within the MCET method, and the resulting models were compared. The new approach, which we define as VFF, was found to perform best among the splitting methods.
In the 3D metric system, different subsets of the clustered atoms resulting from alignment and superimposition were examined, and their effects on activity were investigated by GA. The contribution of each element of the subset to the activity was observed as a fingerprint of the relevant molecule. The subset that gives the best results shows the geometric and electronic properties of Pha. The vectorial index of the interaction points of the subset considered to give the best result was presented, and the activity change at each index was plotted for 5 molecules chosen as samples, 4 in the training set and 1 in the test set. As seen in the graph, how much the activity of a molecule increases or decreases at each interaction point of the subset is shown for the first time in this study. The contribution of the subset to the activity of a molecule at each point of a one-dimensional vector index forms the molecule's fingerprint. According to the similarities in VFF, the molecules were safely divided into training and test sets, which prevented two molecules that represent or closely resemble each other from accumulating in the same set.
Declarations
Funding This work was financially supported by Erciyes University Scientific Research Projects (BAP) of Turkey (Grant no. FDK-2018-8187).
Conflicts of interest There is no conflict of interest for all participating authors. On behalf of all authors, I declare that there is no conflict of interest.
Author contribution The authors of the current manuscript, Tuğba Alp Tokat, Burçin Türkmenoğlu and Yahya Güzel, contributed equally to this work. All authors read and approved the final manuscript.
Availability of data and material The data generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Code availability Not applicable.
ETM matrix: the distances and bonds between atoms are given in Å as the off-diagonal elements, while the electronic values of the atoms (here, atomic charges) are given on the diagonal.