SubMo-GNN: Property- and Structure-Aware Diverse Molecular Selection with Representation Learning and Mathematical Diverse-Selection Framework

Selecting diverse molecules from unexplored areas of chemical space is one of the most important tasks for discovering novel molecules and reactions. This paper develops a new method for selecting a diverse subset of molecules from a given molecular list by utilizing two techniques studied in machine learning and mathematical optimization: graph neural networks (GNNs) for learning vector representations of molecules and a diverse-selection framework called submodular function maximization. Our method first trains a GNN with property prediction tasks, and then the trained GNN transforms molecular graphs into molecular vectors, which capture both properties and structures of molecules. Finally, to obtain a diverse subset of molecules, we define a submodular function, which quantifies the diversity of molecular vectors, and find a subset of molecular vectors with a large submodular function value. This can be done efficiently by using the greedy algorithm, and the diversity of selected molecules measured by the submodular function value is mathematically guaranteed to be at least 63% of that of an optimal selection. We also introduce a new evaluation criterion to measure the diversity of selected molecules based on molecular properties. Computational experiments confirm that our method successfully selects diverse molecules from the QM9 dataset regarding the property-based criterion, while performing comparably to existing methods regarding a standard structure-based criterion. The proposed method enables researchers to obtain diverse sets of molecules for discovering new molecules and novel chemical reactions, and the proposed diversity criterion is useful for discussing the diversity of molecular libraries from a new property-based perspective.


Introduction
Chemical space, 1-4 a concept to represent an ensemble of chemical species, was originally established in medicinal chemistry 2,5 and is used in a wide area of chemistry. The size of chemical space, i.e., the number of molecules in it, is estimated to be 10^60 even if it is limited to drug-like molecules, 6 and other estimations of chemical-space sizes have also been reported. 4,7 In any case, the number of molecules is too large to explore exhaustively. Currently, more than 68 million molecules are registered in the Chemical Abstracts Service (CAS) of the American Chemical Society, 8,9 and several accessible online molecular databases, e.g., PubChem 10 and ZINC, 11 have been constructed. Moreover, owing to recent advances in high-throughput screening, chemoinformatics, 12 and machine learning, 13 many chemical compounds have been discovered from chemical space in the fields of organic light-emitting diodes, 14 organic synthesis, 15 and catalysis. 16 These are, however, only small fractions of chemical space, and vast areas remain unexplored.
Selecting diverse molecules from chemical space is an important task for discovering molecules that exhibit novel properties and new chemical reactions. 3,17 In medicinal chemistry, diversity selection algorithms have been widely studied for exploring chemical space and discovering bioactive molecules. 5,18-21 The diversity of a set of molecules is also essential in molecular

Figure 1. A high-level sketch of our method. In Step 1, a GNN is trained with property prediction tasks. The black and red arrows indicate the forward pass and backpropagation, respectively. In Step 2, the trained GNN is used to generate molecular vectors of candidate molecules, and then molecules are selected via submodular function maximization.
studies utilize both of them for a single purpose. The only exception is a recent study on multi-robot action selection, 41 which uses GNNs in selection methods, whereas we use GNNs to design submodular functions. In view of this, our work provides a new type of application combining GNNs and submodular function maximization. Furthermore, to evaluate the diversity of selected molecules based on molecular property values, we introduce a new diversity measure using the Wasserstein distance 42,43 to uniform distributions defined on molecular property values. This property-based measure can play a complementary role to the existing structure-based measures such as the mean pairwise dissimilarity (MPD) of Tanimoto coefficients, thus enabling researchers to discuss the diversity of molecules more deeply. Computational experiments compare the proposed method with the existing structure-based methods and confirm that our method selects more diverse molecules regarding molecular properties. Furthermore, although our method does not explicitly use structure-based descriptors (e.g., ECFP and MACCS keys), it successfully selects diverse molecules in terms of MPD values calculated with the Tanimoto coefficient of such structure-based descriptors.

Method
This section presents our molecular selection method, which comprises two steps: training a GNN that generates molecular vectors and selecting GNN-based molecular vectors via submodular function maximization. Figure 1 shows a high-level sketch of our method. In Step 1, we train a GNN and task-specific layers with property prediction tasks, where the GNN converts molecular graphs into molecular vectors and the task-specific layers take them as input and predict molecular properties. In this step, parameters of the GNN and the task-specific layers are updated by the error backpropagation method. In Step 2, we transform graphs of candidate molecules into molecular vectors by using the GNN trained in Step 1, and then select a predetermined number of molecular vectors based on submodular function maximization. We also introduce a new property-based diversity criterion, which quantifies the diversity of selected molecules as the Wasserstein distance to uniform distributions defined on molecular property values. Intuitively, we regard a set of molecules as diverse if the property values of those molecules are evenly distributed.

Graph neural networks for generating molecular vectors
We briefly explain how molecular vectors are generated by GNNs, which are deep learning architectures that work on graph domains. Taking a graph with node and edge features as input, GNNs capture the structure of the graph by iteratively passing messages, which are calculated based on the features. Specifically, each node iteratively receives messages from its neighbors, aggregates them, and passes messages on to its neighbors; after this message-passing phase, a molecular vector, denoted by $\mathbf{x}$, is computed based on the resulting messages of all nodes. Along the way, messages are updated with certain parameterized functions. Our specific choice of a GNN architecture is Attentive FP, 36 which is reported to achieve high performance in molecular property prediction. For the sake of completeness, we present mathematical details of GNNs in the supplementary information.
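To make the message-passing scheme concrete, the following is a minimal, self-contained PyTorch sketch of one message-passing round followed by a sum readout. It is illustrative only: the class name, dimensions, and toy graph are our own choices, and Attentive FP itself uses attention-based aggregation and a gated readout rather than the simple linear layers shown here.

```python
import torch
import torch.nn as nn

class ToyMessagePassing(nn.Module):
    """One message-passing round plus a sum readout (illustrative only; Attentive FP
    uses attention-based message passing and a gated readout instead)."""

    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)   # message computed from a (neighbor, node) pair
        self.upd = nn.Linear(2 * dim, dim)   # node-state update from (state, aggregated messages)

    def forward(self, h, edges):
        # h: (num_atoms, dim) node features; edges: directed (src, dst) index pairs.
        agg = torch.zeros_like(h)
        for src, dst in edges:
            agg[dst] = agg[dst] + torch.relu(self.msg(torch.cat([h[src], h[dst]])))
        h_new = torch.relu(self.upd(torch.cat([h, agg], dim=-1)))  # updated node states
        return h_new.sum(dim=0)              # readout: the molecular vector x

# Example: a 3-atom chain with random 8-dimensional atom features.
h = torch.randn(3, 8)
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
x = ToyMessagePassing(8)(h, edges)           # x has 8 dimensions
```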
In the task-specific layer, molecular properties $\mathbf{y}$ are predicted from the molecular vector $\mathbf{x}$ via simple linear regression as $\hat{\mathbf{y}} = \mathbf{W}\mathbf{x} + \mathbf{b}$, where $\hat{\mathbf{y}}$ is a prediction of $\mathbf{y}$. In the training step (Step 1 in Figure 1), with a loss function defined by the mean squared error between $\hat{\mathbf{y}}$ and $\mathbf{y}$, we update $\mathbf{W}$, $\mathbf{b}$, and the parameters of the GNN by backpropagation. Consequently, the GNN, which can capture structures of molecular graphs via message passing, is trained to predict molecular properties. Therefore, the trained GNN generates molecular vectors that take both structures and properties of molecules into account.
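The following PyTorch sketch illustrates Step 1 under simplifying assumptions: ToyGNN is a stand-in graph encoder (the actual model is Attentive FP), the dimensions and data are dummy values, and only a single training step is shown. It is meant to show how the task-specific linear layer and the GNN are updated jointly by backpropagating the mean squared error, not to reproduce the training setup used in our experiments.

```python
import torch
import torch.nn as nn

dim, num_props = 64, 12                         # vector size, number of predicted properties

class ToyGNN(nn.Module):
    """Stand-in graph encoder: maps atom features to a molecular vector x via a sum readout."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats):              # node_feats: (num_atoms, in_dim)
        return torch.relu(self.lin(node_feats)).sum(dim=0)

gnn = ToyGNN(in_dim=16, out_dim=dim)
head = nn.Linear(dim, num_props)                # task-specific layer: y_hat = W x + b
optimizer = torch.optim.Adam(list(gnn.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

# One training step on a dummy molecule (random atom features and property values).
node_feats, y = torch.randn(9, 16), torch.randn(num_props)
y_hat = head(gnn(node_feats))                   # predicted properties
loss = loss_fn(y_hat, y)                        # mean squared error between y_hat and y
optimizer.zero_grad()
loss.backward()                                 # backpropagation through head and GNN
optimizer.step()
```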

Selection of diverse molecular vectors
Given molecular vectors generated by the trained GNN, we aim to obtain a set of diverse molecules by selecting diverse molecular vectors. For selecting diverse vectors, we utilize the mathematical framework called submodular function maximization.

Submodular function maximization
Submodular function maximization has been studied in the field of combinatorial optimization. This framework enables the development of effective diverse-selection methods by offering both flexible models for representing diversity and efficient selection algorithms with mathematical performance guarantees; we detail these two advantages below.
The first advantage of using the submodular-function-maximization framework is that there are various known functions for representing diversity. To find a diverse subset from a large pool of molecules, researchers usually specify a diversity criterion and search for a diverse subset based on that criterion. Here, a diversity criterion is formally regarded as a set function, which assigns to each subset a real value that indicates how diverse the subset is. Some such functions have a special property called submodularity, and they are called submodular functions. Many submodular functions have been developed as diversity criteria for various kinds of data such as images, documents, and videos. Therefore, we can choose a suitable function from them for modeling the diversity of molecular vectors. For example, the Shannon entropy is known to satisfy submodularity with respect to the selection of random variables. Other diversity criteria that have submodularity include the ROUGE-N score for document summarization 44,45 and facility location functions. 46 In the area of bioinformatics, submodular functions for peptide identification have also been developed. 47

The second advantage of the submodular-function-maximization framework is that we can utilize various simple, efficient, and mathematically rigorous algorithms for selecting a diverse subset. When selecting a subset from a large number of molecular vectors, there are exponentially many candidate subsets. Therefore, we need efficient algorithms for finding diverse subsets. In a series of studies on submodular function maximization, many simple and efficient algorithms for finding subsets with large submodular function values have been developed. Notably, the resulting submodular function values are often guaranteed to be nearly optimal by mathematical analyses. Therefore, once we specify a submodular function as a diversity criterion, we can automatically ensure that those algorithms return highly diverse subsets with respect to that criterion. Among such algorithms, the greedy algorithm is widely used due to its simplicity, efficiency, and strong mathematical guarantee. 39 In the supplementary information, we present mathematical details of submodular function maximization and the greedy algorithm.
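As a concrete reference point, here is a minimal sketch of the greedy algorithm for maximizing a monotone submodular function under a cardinality constraint. The function and variable names are our own, and practical implementations (including the fast one used later for the log-determinant function) avoid recomputing f from scratch for every candidate.

```python
def greedy_max(f, ground_set, k):
    """Greedy maximization of a monotone submodular set function f under |S| <= k.

    f maps a Python set to a float, with f(set()) == 0.  Returns the selected
    elements in the order they were added.
    """
    S, order = set(), []
    for _ in range(k):
        best, best_gain = None, float("-inf")
        for i in ground_set:
            if i in S:
                continue
            gain = f(S | {i}) - f(S)            # marginal gain of adding element i
            if gain > best_gain:
                best, best_gain = i, gain
        S.add(best)
        order.append(best)
    return order
```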

Log-determinant function
In our computational experiments, we use a submodular function called a log-determinant function, which quantifies the diversity of selected molecular vectors based on the volume of a parallelotope spanned by the selected vectors. As depicted in Figure 2a, the more diverse the directions of the vectors are, the larger the volume of the parallelotope spanned by them. Thus the log-determinant function provides a volume-based measure of the diversity of vectors, and it is often used for expressing the diversity of vector datasets. 48 Note that the volume-based diversity can capture relationships of vectors that cannot be represented in a pairwise manner. Therefore, the log-determinant function yields a different selection rule than MaxSum and MaxMin.

Figure 2. Figure 2a is a parallelotope spanned by vectors colored in red. Figure 2b illustrates an example where the log-determinant function value for dissimilar vectors becomes small if vectors are allowed to have negative elements. Here and in the next two figures, points with different colors (red and blue) represent molecules with dissimilar properties, while those with the same colors have similar properties. Figure 2c shows why maximizing the log-determinant function without normalization may result in a non-diverse selection, and Figure 2d presents how normalization helps the log-determinant function maximization to select diverse vectors.
Formally, the log-determinant function is defined as follows. Suppose that the n candidate molecules are numbered $1, \dots, n$ and that the i-th molecule is associated with an m-dimensional molecular vector $\mathbf{x}_i$. Let $\mathbf{X}$ be the $m \times n$ matrix whose columns are given by the n molecular vectors. For any $S \subseteq N := \{1, \dots, n\}$, we denote by $\mathbf{X}[S]$ the $m \times |S|$ submatrix of $\mathbf{X}$ with columns restricted to S. We define the log-determinant function $f_{\mathrm{logdet}}$ by

$$f_{\mathrm{logdet}}(S) = \log \det\left(\mathbf{I}_{|S|} + \mathbf{X}[S]^{\top}\mathbf{X}[S]\right)$$

for any $S \subseteq N$, where $\mathbf{I}_{|S|}$ is the $|S| \times |S|$ identity matrix. The relationship between the $f_{\mathrm{logdet}}$ value and the volume of a parallelotope can be formally described as follows. Let $\tilde{\mathbf{x}}_i$ ($i = 1, \dots, n$) be a vector of length $m + n$ such that the first m elements are given by $\mathbf{x}_i$, the $(m+i)$-th element is 1, and the others are 0. When $S \subseteq N$ is selected, $f_{\mathrm{logdet}}(S)$ equals twice the logarithm of the volume of the parallelotope spanned by $\{\tilde{\mathbf{x}}_i\}_{i \in S}$, so larger values correspond to larger volumes.
Given the function $f_{\mathrm{logdet}}$ and the number k of molecules to be selected, the greedy algorithm operates as follows: it first sets $S = \emptyset$ and sequentially adds the element $i \in N \setminus S$ with the largest $f_{\mathrm{logdet}}(S \cup \{i\}) - f_{\mathrm{logdet}}(S)$ value to S while $|S| < k$ holds. In our computational experiments, we use a fast implementation of the greedy algorithm specialized for the log-determinant function. 49 The function $f_{\mathrm{logdet}}$ satisfies $f_{\mathrm{logdet}}(\emptyset) = 0$, monotonicity (i.e., $S \subseteq T$ implies $f_{\mathrm{logdet}}(S) \le f_{\mathrm{logdet}}(T)$), and submodularity. With these properties, we can mathematically guarantee that the greedy algorithm returns a subset whose $f_{\mathrm{logdet}}$ value is at least $1 - 1/e \approx 63\%$ of that of an optimal selection.
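A minimal NumPy sketch of the log-determinant function is given below; the function and variable names are ours. Plugging f = lambda S: f_logdet(X, S) into a greedy routine such as the one sketched earlier reproduces the selection rule described above, although the fast implementation cited in the text updates the determinant incrementally rather than recomputing it for every candidate.

```python
import numpy as np

def f_logdet(X, S):
    """Log-determinant diversity of the columns of X indexed by S.

    X : (m, n) array whose columns are molecular vectors.
    S : iterable of selected column indices.
    """
    S = list(S)
    if not S:
        return 0.0                                        # f_logdet of the empty set is 0
    Xs = X[:, S]                                          # the m x |S| submatrix X[S]
    _, logdet = np.linalg.slogdet(np.eye(len(S)) + Xs.T @ Xs)
    return logdet

# Example: 5 random 3-dimensional molecular vectors.
X = np.random.default_rng(0).random((3, 5))
print(f_logdet(X, [0, 2, 4]))
```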

Refinements to molecular vector generation: ReLU and normalization
We refine the GNN-based vector generation process so that it works better with the log-determinant function. Specifically, we make the GNN output non-negative, normalized vectors. Below we detail why we need these refinements and explain how to modify the vector generation process.
First, as in Figure 2b, if vectors are allowed to have negative elements, nearly origin-symmetric vectors form a parallelotope with a small volume even though their directions are diverse. Consequently, the log-determinant function fails to indicate that such molecular vectors correspond to diverse molecules. To circumvent this issue, we add a ReLU layer to the end of the GNN, which makes all entries of output vectors non-negative.
Second, if GNNs are allowed to output vectors with different norms, task-specific layers may distinguish molecules with different properties based on the norms of molecular vectors. In such cases, maximizing the log-determinant function may result in selecting non-diverse vectors for the following reason. As mentioned above, the log-determinant function represents the volume of the parallelotope spanned by selected vectors, and the volume becomes larger if the selected vectors have larger norms. Consequently, molecular vectors with larger norms are more likely to be selected, which may result in selecting molecules with almost the same properties, as in Figure 2c. To resolve this problem, after applying the ReLU layer, we normalize molecular vectors so that their norms become 1 by projecting them onto a hypersphere. In other words, we add a layer that transforms each vector into the unit vector pointing in the same direction (i.e., divides it by its norm). As a result, the molecular vector $\mathbf{x}$ output by the GNN is non-negative and its norm is equal to 1. In the training phase, we train the GNN with the additional ReLU and normalization layers, where the non-negative normalized vector $\mathbf{x}$ is used for predicting property values as $\hat{\mathbf{y}} = \mathbf{W}\mathbf{x} + \mathbf{b}$. Due to the above normalization, the task-specific layers cannot distinguish molecular vectors by using their norms, and thus the GNN learns to generate molecular vectors so that the task-specific layers can predict molecular property values based not on norms but on angles of vectors. Consequently, as illustrated in Figure 2d, diverse molecular vectors can be obtained by maximizing the log-determinant function value.
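In an implementation, these refinements amount to appending two parameter-free operations to the GNN output, roughly as in the following PyTorch sketch (the module name and usage line are our own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonNegUnitNorm(nn.Module):
    """Parameter-free layers appended to the GNN: ReLU, then projection onto the unit sphere."""

    def forward(self, x):
        x = F.relu(x)                          # all entries become non-negative
        return F.normalize(x, p=2, dim=-1)     # the norm of x becomes 1

# Usage sketch: x = NonNegUnitNorm()(gnn(graph)); y_hat = head(x), trained end to end as in Step 1.
```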
Property-based evaluation of diversity

By using our selection method, we can select molecules so that the corresponding molecular vectors are diverse. However, even if molecular vectors are diverse, the selected molecules themselves may not be diverse. This issue also applies to the existing structure-based methods, and it has been overlooked in previous studies. That is, the existing methods select molecules that are diverse in terms of the Tanimoto coefficient of molecular descriptors (e.g., MACCS keys or ECFP), and thus it is natural that those methods achieve high mean pairwise dissimilarity (MPD) values, which are also calculated by using the Tanimoto coefficient of such descriptors. If we are to evaluate selection methods fairly, we need diversity criteria that do not use the molecular descriptors employed by the selection methods. This section presents such a criterion for evaluating the diversity of selected molecules in terms of their property values without using molecular vectors. In contrast to the existing structure-based criteria (e.g., the aforementioned MPD values), our criterion is based on the diversity of property values, thus offering a new perspective for evaluating the diversity of molecules. Our idea is to regard molecular property values as diverse if they are evenly distributed over an interval on the property-value line. We quantify this notion of diversity using a statistical distance between the distribution of property values of selected molecules and a uniform distribution. As the distance between two distributions, we use the Wasserstein distance, which is defined by the minimum cost of transporting the mass of one distribution to another, as detailed below. We call this diversity criterion the Wasserstein distance to a uniform distribution (WDUD). A smaller WDUD value implies that the selected molecules are more diverse since the distribution of their property values is closer to being uniform.
Formally, WDUD is defined as follows. Let $v_{\max}$ and $v_{\min}$ be the maximum and minimum property values, respectively, in a given list of molecules. Suppose that k molecules with property values $y_1, y_2, \dots, y_k$ are selected from the list. We assign probability mass $1/k$ to each $y_i$ and compute how far this discrete distribution is from the uniform distribution over $[v_{\min}, v_{\max}]$. Let V and U be the cumulative distribution functions of the discrete distribution and the uniform distribution, respectively. Defining the transportation cost from $y \in [v_{\min}, v_{\max}]$ to $y_i$ as $|y - y_i|$, the WDUD value can be computed as $\int_{v_{\min}}^{v_{\max}} |U(y) - V(y)|\,\mathrm{d}y$, 43 which we use for quantifying the diversity of the property values $\{y_1, y_2, \dots, y_k\}$ of selected molecules.
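Concretely, WDUD can be computed by comparing the empirical cumulative distribution function of the selected property values with that of the uniform distribution, as in the following sketch; the grid-based numerical integration is our simplification (an exact piecewise computation over the sorted values is also possible).

```python
import numpy as np

def wdud(values, v_min, v_max, grid_size=10_000):
    """Wasserstein distance to the uniform distribution on [v_min, v_max]; smaller means more diverse."""
    values = np.sort(np.asarray(values, dtype=float))
    t = np.linspace(v_min, v_max, grid_size)
    V = np.searchsorted(values, t, side="right") / len(values)   # empirical CDF of selected values
    U = (t - v_min) / (v_max - v_min)                            # CDF of the uniform distribution
    return np.trapz(np.abs(U - V), t)                            # integral of |U - V| over [v_min, v_max]
```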
There are other possible choices of statistical distances, such as the variance or the Kullback-Leibler (KL) divergence. However, the Wasserstein distance is more suitable for measuring diversity than the variance and the KL divergence for the following reasons. If we use the variance of property values of selected molecules as a diversity measure, a set of molecules with extreme property values is regarded as diverse, although such a selection is biased since it ignores property values near the mean (see Figure 3a). If we use the KL divergence between the property-value distribution of selected molecules and the uniform distribution, the distance structure of the support is ignored, unlike WDUD, which takes the ℓ1-distance $|y - y_i|$ into account. As a result, we cannot distinguish molecular sets with completely different diversities, as in Figure 3b.

Figure 3. In Figure 3a, the variance of a diverse set (red) is smaller than that of a non-diverse set (blue), which does not suit our idea of diversity. In Figure 3b, the KL divergence of a diverse set (red) is equal to that of a non-diverse set (blue).

Existing selection methods and evaluation criterion
In the computational experiments, we use MaxMin and MaxSum as baseline methods, which greedily select molecules based on dissimilarities of molecular descriptors. We use MACCS keys and ECFP as descriptors and define the dissimilarity of those descriptors based on the Tanimoto coefficient, i.e., given the i-th and j-th descriptors, we compute their Tanimoto similarity and subtract it from 1 to obtain dissimilarity values $d_{i,j}$. Given dissimilarity values $d_{i,j}$, MaxSum and MaxMin operate like the greedy algorithm; formally, MaxMin (resp. MaxSum) sequentially adds the element $i \in N \setminus S$ with the largest $\min_{j \in S} d_{i,j}$ (resp. $\sum_{j \in S} d_{i,j}$) value to S while $|S| < k$ holds, where the first molecule $i \in N$ is set to the one with the largest $\sum_{j \neq i} d_{i,j}$ value. We denote the MaxMin and MaxSum methods by MM and MS, respectively, and MACCS keys and ECFP by MK and EF, respectively, for short. We use, for example, MS-MK to represent the MaxSum method that uses MACCS keys as descriptors.
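For reference, the following sketch computes Tanimoto dissimilarities from RDKit fingerprints and runs the MaxMin rule described above; MaxSum is obtained by replacing the minimum with a sum in the selection criterion. The function names and parameters (e.g., the ECFP radius and bit length) are our own choices and may differ from the settings used in the experiments.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, MACCSkeys

def tanimoto_dissimilarity_matrix(smiles_list, use_maccs=True):
    """Pairwise Tanimoto dissimilarities d_ij from MACCS keys (MK) or ECFP (EF)."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    if use_maccs:
        fps = [MACCSkeys.GenMACCSKeys(m) for m in mols]
    else:
        fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
    n = len(fps)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = 1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])
    return d

def maxmin_select(d, k):
    """MaxMin: start from the molecule with the largest total dissimilarity, then repeatedly
    add the molecule whose minimum dissimilarity to the current selection is largest."""
    S = [int(np.argmax(d.sum(axis=1)))]
    while len(S) < k:
        rest = [i for i in range(d.shape[0]) if i not in S]
        S.append(max(rest, key=lambda i: d[i, S].min()))
    return S
```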
When evaluating selection methods in the experiments, we also use MPD, an existing structure-based criterion, in addition to WDUD. Specifically, given dissimilarity values $d_{i,j}$ for all pairs among n molecules, we compute the MPD value as $\frac{1}{\binom{n}{2}} \sum_{i<j} d_{i,j}$. We define the dissimilarity values by the Tanimoto dissimilarity of MACCS keys or ECFP. Depending on the choice of descriptors, we denote the diversity criterion by MPD-MK or MPD-EF, respectively.
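Given such a dissimilarity matrix, the MPD of a selection is simply the mean of the upper-triangular entries, e.g. (a small helper of our own):

```python
import numpy as np

def mean_pairwise_dissimilarity(d, S=None):
    """MPD: average dissimilarity over all pairs of the (optionally selected) molecules."""
    if S is not None:
        d = d[np.ix_(S, S)]                   # restrict to the selected molecules
    iu = np.triu_indices(d.shape[0], k=1)     # indices of the strict upper triangle
    return d[iu].mean()
```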

Details of computational experiments
We conducted computational experiments with the QM9 dataset in MoleculeNet, 50,51 which is a quantum mechanical dataset with labels of energetic, electronic, and thermodynamic properties computed with density functional theory (DFT). Each molecule in the dataset is associated with 12 property values: dipole moment in Debye (mu), isotropic polarizability in Bohr³ (alpha), highest occupied molecular orbital energy in Hartree (HOMO), lowest unoccupied molecular orbital energy in Hartree (LUMO), gap between HOMO and LUMO in Hartree (gap), electronic spatial extent in Bohr² (R2), zero-point vibrational energy in Hartree (ZPVE), internal energy at 0 K in Hartree (U0), internal energy at 298.15 K in Hartree (U), enthalpy at 298.15 K in Hartree (H), free energy at 298.15 K in Hartree (G), and heat capacity at 298.15 K in cal/(mol K) (Cv). Following the previous work, 36 we used all 12 properties to train GNNs. The QM9 dataset contains 133885 molecules, and we randomly divided them into three datasets as in the previous study: 36 80% (107108 molecules) for training a GNN, 10% (13389 molecules) for validating the prediction accuracy of the trained GNN, and 10% (13388 molecules) for a test dataset, from which we selected molecules. Each method selected 133 molecules (1% of the test data) from the test data. Note that we did not use the test data when training GNNs. We thus created a situation where we select molecules whose property values are unknown in advance.
The diversity of property values of selected molecules was evaluated by computing WDUD values for each molecular property. In this evaluation, we used the above 12 properties in the QM9 dataset. However, among the 12 properties, the use of U0, U, H, and G would be inappropriate for evaluating chemical diversity because their magnitudes depend mostly on the system size. For example, these values are more similar between acetone and acetamide, which are isoelectronic molecules, than between acetone and methyl ethyl ketone, even though most chemists would say that acetone and methyl ethyl ketone are both alkyl ketones and chemically more similar. Therefore, we additionally used molecular energy values divided by the number of electrons (denoted by N_elec) in the evaluation to weaken the system-size dependence and focus more on chemical diversity. These values for U0, U, H, and G are denoted by U0/N_elec, U/N_elec, H/N_elec, and G/N_elec, respectively. Similarly, we used variants of two values, ZPVE and Cv, divided by N_mode = 3N_atom − 6, where N_atom is the number of atoms. These values for ZPVE and Cv are denoted by ZPVE/N_mode and Cv/N_mode, respectively. Consequently, for evaluating molecular diversity based on WDUD values, we used 18 properties in total: the 12 properties of the QM9 dataset and the additional six properties, ZPVE/N_mode, U0/N_elec, U/N_elec, G/N_elec, H/N_elec, and Cv/N_mode.
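For a molecule given as a SMILES string, these normalizers can be obtained with RDKit roughly as follows; the helper name is ours, and the sketch assumes a neutral, nonlinear molecule, so that the electron count equals the sum of atomic numbers and N_mode = 3N_atom − 6.

```python
from rdkit import Chem

def size_normalizers(smiles):
    """Return (N_elec, N_mode) for a neutral, nonlinear molecule given as SMILES."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))                   # include hydrogens explicitly
    n_atom = mol.GetNumAtoms()
    n_elec = sum(atom.GetAtomicNum() for atom in mol.GetAtoms())   # electrons = sum of atomic numbers
    return n_elec, 3 * n_atom - 6                                  # N_mode = 3 * N_atom - 6

n_elec, n_mode = size_normalizers("CC(=O)C")   # acetone; e.g., use U0 / n_elec and ZPVE / n_mode
```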

Results and discussion
We show the results obtained by the molecular selection methods and random sampling (Random), which selects 133 molecules from the test dataset at random. We call our selection method SubMo-GNN (Submodularity-based Molecular selection with GNN-based molecular vectors). As mentioned above, MS-MK and MM-MK (MS-EF and MM-EF) stand for the existing MaxSum and MaxMin methods, respectively, that use MACCS keys (ECFP) as molecular descriptors.

Figure 4. Figure 4a shows the results on the first six properties. Figure 4b shows the results on the six properties correlated with N_elec or N_mode (the WDUD values are computed with the raw property values). Figure 4c is a modified version of Figure 4b, where the property values are divided by N_elec or N_mode so that those values capture subtle molecular characteristics such as connectivity patterns.

Property-based diversity evaluation with WDUD
We evaluated the diversity of property values of selected molecules by the Wasserstein distance to a uniform distribution (WDUD). Note that a smaller WDUD value is better since it means the distribution of selected molecules is closer to being uniform. Table 1 shows the WDUD values attained by the six methods for the aforementioned 18 properties. Since the results of our SubMo-GNN fluctuate due to the randomness involved in training GNNs, we performed five independent runs and calculated the mean and standard deviation. The results of Random also vary from trial to trial, and thus we present the mean and standard deviation of five independent trials. Figure 4 visualizes the results in Table 1, where the WDUD values are rescaled so that those of Random become 1 for ease of comparison.
In this experiment, each method obtains a single set of molecules, for which we calculate the 18 WDUD values. Therefore, choosing a set of molecules that attains small WDUD values for some properties may result in large WDUD values for other properties. Such a choice of molecules does not meet our purpose, and it is better to balance the trade-off so that none of the 18 WDUD values become too large. A reasonable way to check whether this is achieved is to compare the results with those of Random, which selects molecules randomly according to the original distributions. If WDUD values of some properties become larger than those of Random, it is probable that the method selects molecules in a biased manner where the diversity of some properties is sacrificed for achieving small WDUD values of other properties.
We first compare the results of the selection methods with those of Random. Figure 4 shows that SubMo-GNN, MS-EF, and MM-EF attained smaller WDUD values than Random for all molecular properties. This indicates that both our method and the ECFP-based methods were able to choose diverse molecules in terms of WDUD, even though they do not explicitly minimize WDUD. Since we did not feed the test dataset to the GNNs during training, the results suggest that the GNNs generalized well to unknown molecules and achieved diverse selection from the test dataset consisting of unknown molecules. In contrast to SubMo-GNN and the ECFP-based methods, MS-MK and MM-MK resulted in larger WDUD values in mu than Random, as shown in Figure 4a. This suggests that MS-MK and MM-MK selected molecules in a biased manner, trying to achieve small WDUD values in some properties and ending up with a non-diverse selection as regards mu. In particular, MS-MK seems to have selected molecules in an attempt to achieve small WDUD values in U0, U, H, G, and Cv (see Figure 4b) at the sacrifice of diversity in mu.
We then compare our SubMo-GNN with the existing selection methods. Compared to the MaxMin-based methods (MM-MK and MM-EF), SubMo-GNN achieved smaller WDUD values for all properties. SubMo-GNN also outperformed the MaxSum-based methods (MS-MK and MS-EF) for all but six properties (U0, U, H, G, Cv, and ZPVE/N_mode). Note that U0, U, H, and G are related to molecular energies and their values are strongly correlated with each other; previous studies have reported that property prediction methods applied to the QM9 dataset exhibited almost the same performance on these four properties. 36 This is consistent with our results in Figure 4b, where each method attained almost the same performance regarding the four properties. Furthermore, when the energy-related properties are divided by N_elec, MS-MK and MS-EF are outperformed by SubMo-GNN (see Figure 4c). In view of this, the MaxSum-based methods seem to have put too much weight on the diversity of properties correlated with N_elec, which resulted in biased selections and degraded the WDUD values of mu. In summary, in terms of WDUD values, the overall performance of SubMo-GNN is better than that of the existing methods.

Figure 5 shows property-value distributions of all molecules in the dataset (blue) and molecules selected by each method (red). The horizontal and vertical axes represent property values and frequency, respectively. For ease of comparison, the histogram height is normalized to indicate the density rather than the count. We regard a set of molecules as diverse if its distribution is close to being uniform. As expected, the distribution of molecules selected by Random is close to the distribution of the original dataset. By contrast, SubMo-GNN and MS-MK selected molecules that did not appear so frequently in the dataset, particularly for HOMO, R2, U0, and U0/N_elec. As a result, the distributions of selected molecules became closer to being uniform than those of Random. Regarding the results of mu, both SubMo-GNN and MS-MK chose many molecules with near-zero mu values; this seems to be necessary for selecting diverse molecules regarding properties other than mu, due to the aforementioned trade-off between properties. Nevertheless, MS-MK chose too many molecules with near-zero mu values, resulting in a biased distribution. This visually explains why the WDUD value of MS-MK in mu is larger than that of Random. Compared with MS-MK, SubMo-GNN selected more molecules with large mu values, which alleviated the bias and led to diverse selection in all properties. SubMo-GNN selected more molecules with large R2 and high HOMO values than MS-MK, and consequently SubMo-GNN's distributions were closer to being uniform. In U0, however, MS-MK selected more molecules with high U0 values than SubMo-GNN, and MS-MK's distribution was closer to being uniform than SubMo-GNN's. By contrast, as regards U0/N_elec, MS-MK selected too many molecules with high N_elec values compared with SubMo-GNN, resulting in a distribution that is farther from being uniform.
To conclude, by incorporating supervised learning of GNNs into the system of diverse molecular selection, our method can select diverse molecules regarding target molecular properties in the sense that their distributions are close to being uniform. On the other hand, if we use standard molecular descriptors (e.g., MACCS keys and ECFP) that encode only structural information of molecules, selected molecules can be biased regarding some molecular properties.
In the supplementary information, we provide additional experimental results that digress from the main subject. We conduct an ablation study to examine what happens if not only SubMo-GNN but also MaxSum and MaxMin use our GNN-based molecular vectors, and demonstrate that the MaxSum method with our GNN-based vectors can also achieve small WDUD values, although its high performance can be brittle, as detailed therein. We also take a closer look at the GNN-based vector-generation process and examine how the normalization layer affects the selection step.

Structure-based diversity evaluation with MPD
We then evaluated the diversity of the molecular selection methods in terms of molecular substructures. As a criterion for evaluating the diversity of molecular substructures, we used the mean pairwise dissimilarity (MPD), where molecular descriptors were given by MACCS keys or ECFP. We denote these criteria by MPD-MK and MPD-EF for short. A larger MPD value is better since it implies that selected molecules are more dissimilar to each other. It should be noted that MS-MK and MS-EF greedily maximize MPD-MK and MPD-EF, respectively, and thus they are inherently advantageous in this setting. MM-MK and MM-EF also explicitly maximize the diversity calculated with MACCS keys and ECFP, respectively, and thus the setting is also favorable for them. By contrast, our SubMo-GNN uses neither MACCS keys nor ECFP, and thus it has no such inherent advantage, in contrast to the other methods. Table 2 shows the results. Our method and the existing methods outperformed Random in terms of both MPD-MK and MPD-EF, implying that all the selection methods successfully selected diverse molecules in terms of molecular substructures. Notably, although SubMo-GNN did not use the structure-based descriptors (MACCS keys and ECFP), it selected molecules with diverse substructures and attained higher MPD values than Random. This is because, when trained with property prediction tasks, GNNs generate molecular vectors by taking the structures of molecular graphs into account. These results and those in the previous section demonstrate that our SubMo-GNN can select diverse molecules in terms of both molecular properties and substructures.
MS-MK and MM-MK, which explicitly aim to maximize the diversity calculated with MACCS keys, achieved high MPD-MK values. In particular, MS-MK attained a far higher MPD-MK value than the others. This result is natural since MS-MK has the inherent advantage of greedily maximizing MPD-MK. As regards MPD-EF, all methods achieved high MPD values. Notably, SubMo-GNN delivered almost the same performance as MM-EF, which explicitly maximizes the diversity calculated with ECFP. This implies that our GNN-based molecular vectors capture structural characteristics of molecules as well as ECFP does.

Remark on direct minimization of WDUD
In the above experiments, our method selected molecules based on the diversity of vectors generated by trained GNNs, which were then evaluated by WDUD (and MPD) values. Considering the evaluation criterion, one may think that it is better to select molecules by directly minimizing WDUD values instead of selecting molecular vectors. This approach, however, does not fit our goal, as discussed below.
In this study, we aim to select diverse molecules for efficiently exploring the chemical space. To this end, we utilized the power of GNNs, which can generate vectors taking both molecular structures and properties into account. As a result, our method achieved diverse selection not only in molecular properties but also in molecular structures. On the other hand, the direct minimization approach focuses only on a fraction of molecular properties used for computing WDUD values, and thus it is prone to overfitting; selected molecules may not be diverse with respect to unknown properties. In addition, the direct WDUD minimization requires property values of all candidate molecules to be known, hence its limited application scope. A possible remedy for these issues is using DFT or other machine learning methods to predict unknown property values. This, however, is computationally more costly than our method. Moreover, even if unknown property values are successfully predicted, minimizing WDUD values is also a computationally hard task for which no efficient algorithm is available to the best of our knowledge. In a nutshell, although the direct minimization approach may improve WDUD values, it is not a reasonable way to select diverse molecules for efficiently exploring the chemical space.

Conclusion
We addressed the problem of selecting diverse molecules for facilitating chemical space exploration. Our method consists of two steps: construction of molecular vectors using a GNN and selection of molecules by maximizing a submodular function defined on the molecular vectors. Owing to the use of GNNs trained with property prediction tasks, we can take both molecular structures and properties into account when selecting diverse molecules. Moreover, the submodular-function-maximization framework enables the greedy algorithm to return subsets of molecules whose diversity is mathematically guaranteed to be nearly optimal. We also introduced a new evaluation criterion, the Wasserstein distance to uniform distributions (WDUD), to measure the diversity of sets of molecules based on property values. Computational experiments on the QM9 dataset showed that our method successfully selects diverse molecules as regards property values. Regarding the diversity of molecular structures, it performed comparably to the existing structure-based methods (MaxSum and MaxMin with MACCS keys and ECFP). Our diverse selection method helps researchers efficiently explore chemical space, which will bring great advances in the search for novel chemical compounds and reactions.