Computational Approach for Molecular Design of Small Organic Molecules with High Hole Mobilities in Amorphous Phase using Random Forest Technique and Computer Simulation Method

.


Introduction
Hole mobility is an important physical property for designing hole-transport layers in electronic devices, such as organic light emitted diodes [1] (OLEDs) and organic field-effect transistors [2] (OFETs). To realize the flexibility of efficient devices, new organic molecules with high hole mobilities in the amorphous phase have been identified by trial-and-error approaches.
Recently, theoretical and computational approaches have been proposed for evaluating the hole mobility of a given molecule.
A popular approach is the kinetic Monte Carlo (KMC) simulation technique. It can be used to calculate the transit time for a hole to pass a certain distance based on the evaluated kinetic rate constants for all the adjacent molecular pairs in the model amorphous system. In this method, the reliability of the calculated rate constants depends significantly on the computational levels applied to the transfer integral evaluations, which are generally related to the accuracy of the MO calculations. Therefore, a considerable amount of computational time is required to perform the KMC simulation if ab initio MO levels are adapted for the transfer integral calculations. To reduce the computational time, empirical methods such as the extended Hückel MO [3], fragment molecular orbital [4] and semi-empirical MO [5] methods have been applied.
Meanwhile, we recently proposed the SC model [6]. It is a statistical approach to evaluating the hole mobility in the amorphous phase using a limited number of transfer integral calculations. Although the SC model can dramatically reduce the computational time, it is difficult to provide preferable molecules to realize the desired hole mobility in the amorphous phase. This is an important inverse problem for molecular design.
In recent years, machine learning (ML) techniques have become effective approaches [7][8][9][10][11][12] for linking descriptors and observable physical properties. If large datasets, including molecular structures and corresponding hole mobilities, are available, the important descriptors for determining hole mobility are likely to be extracted using ML techniques. Although various types of ML have been proposed, we adopted random forest (RF) [13][14][15][16][17] in this study. It consists of many simple decision trees, and the average predicted value by each tree is used to determine the final expected properties. RF was applied because it has the advantage of handling large numbers of descriptors, even when it uses a relatively small number of datasets (as indicated by Janai et al). [18] A schematic representation of our strategy for generating new molecules with the desired hole mobility is illustrated in Scheme 1. First, a dataset for hole mobilities in the amorphous phase was constructed by collecting 321 experimental data of organic molecules from published papers manually. The RF was then trained using the datasets to predict the hole mobilities for the given molecular structures. The optimized RF was used as a simple, inexpensive, high-throughput function for many molecules designed using the modified ChemTS molecular generation algorithm proposed by Tsuda et al. [19] ChemTS uses a recurrent neural network (RNN) [20] and a Monte Carlo tree search (MCTS) [21]. Finally, the hole mobilities of the molecules by designed using the modified ChemTS were ensured using the SC model. It is a more expensive but reliable theoretical calculation method. Using these methods, high-performance molecules can be proposed without the need for experimental knowledge.
Scheme 1. Schematic representation of the strategy to propose reliable molecules with desired hole mobility without experimental knowledge.

Datasets construction
To construct the datasets for hole mobilities in the amorphous phase, the experimental conditions were examined to determine whether the molecules were small molecules or polymers and what the measuring method was. These were highly labor-intensive tasks. Thus, simple scraping approaches from published papers pose a severe risk of yielding low-quality datasets. In our procedure, the hole mobilities in the amorphous phase of small organic molecules are primarily collected for the flexibility of electronic devices. In this concept, the data for the solid state [22,23] are excluded because the mobilities depend largely on the molecular packing structures in the associated condensed phase. Because the measured mobilities also depend on the experimental methodology [24,25], our datasets include only time-of-flight data for undoped pure molecules [26]. As is well-known, the drift mobility generally increases with an increase in the external electric fields applied on the system. This is known as "Poole-Frenkel behavior" [27]. Thus, the hole mobilities measured under external electric fields of 1.0 − 5.4 × 10 5 V/cm were collected preferentially when certain data were available for different external electric fields applied [24]. Finally, the datasets constructed for hole mobility in the amorphous phase included 321 small organic molecules.
These are effective for a wide range of research, particularly electronic device engineering and/or artificial-intelligence investigations. The MO energies were also calculated as described below for 321 molecules to be the meaningful datasets, which are publicity released on GitHub.

Machine Learning
The descriptors used in this study are listed in Table 1. The distributions of each variable for all the molecules in the datasets were analyzed, as shown in Figure S1 by the kernel density estimation. It should be noted that the predicted values with descriptors outside these ranges are not reliable owing to the limitations of ML.
In this study, Python was adapted for ML. The rdkit.Chem.Descriptor [28,29] and rdkit.Chem.AllChem.MorganFingerprint modules [30] were applied to obtain the structural parameters and Morgan fingerprint indices of functional connectivity fingerprints (FCFP) and extended connectivity fingerprints (ECFP), respectively, to reflect the characteristics of the partial molecular structures. The fingerprints were calculated by specifying a bit length of were performed using the CAM-B3LYP/6-31G(d,p) level of theory to evaluate MO energies for the isolated optimized structures using the Gaussian09 programs [31]. The computational level of CAM-B3LYP/6-31G(d,p) including long-range corrections were same levels as our previous paper [6], in which we have considered the spin-contamination and the charge transfer excitations effects. While the other functionals could be available [32,33], this level could successfully reproduce experimental hole mobilities for small organic molecules in the amorphous phase. Because holes are predominantly transferred between neighboring molecules through the highest occupied MOs (HOMOs), the orbital energies from HOMO -6 to the lowest unoccupied MO (LUMO) + 6 were considered. Supplemental Figure S2 shows To reduce the number of descriptors, the mean R 2 scores were evaluated by performing k-fold cross-validation. [18] Here 321 molecular data points were divided randomly into five datasets consisting of 80 % training and 20 % testing datasets, implemented in the scikit-learn module [34]. Thus, 256 and 65 molecular datasets were used as training and testing datasets, respectively, to examine the quality of the optimized RF.

Molecular structure generation
The ChemTS algorithm was used to generate simplified molecular input line entry system (SMILES) [35] representations and where logP(S) is the octanol-water partition coefficient, SA is the synthetic accessibility [36], and RingPenalty is the penalty by unrealistically large rings. To generate high-performance molecules with high hole mobilities, we modified J(S) as where mobility(S) is the predicted hole mobility evaluated using the optimized structure-based RF.

Calculation of hole mobility by SC model
The SC model proposed in our previous work [6] was applied as a more reliable approach to evaluating the hole mobilities in the amorphous phase. This model is a statistical approach based on the hopping model to evaluate the electron or hole mobilities in the amorphous phase. The charge transfer rate constants were calculated using the Marcus theory. It involves the transfer integrals and reorganization energies, which were evaluated by ab initio MO calculations using the CAM-B3LYP/6-31G(d) level of theory. A model amorphous phase was prepared using conventional molecular dynamics (MD) simulations. Because the detailed procedure for preparing the model amorphous phase structure is described in our previous papers [37,38], only a brief outline is presented. The initial structures of the molecular assemblies were generated using the Leap module in the Amber20 package [39]. Herein, general amber force fields with AM1-bcc atomic charges were applied to the molecules. After the prepared structures were heated to 1000K at 1 atm of pressure, there were cooled gradually to 300 K at this pressure. The prepared structures were used as model amorphous structures to evaluate the hole mobilities.

Random Forest
There are two effective parameters to consider the quality of the ML model: the coefficient of determination (R 2 ; it is the correlation between the predicted and experimental data) and root mean squared error (RMSE; it is the averaged error between these). The smaller the RMSE, the better is the quality of the ML model. The R 2 and RMSE values evaluated using all the descriptors in the structure-based and MO-embedding models are given in Table 2 and Supplemental Table S1, respectively. In Table 2, the highest mean R 2 score in a five-fold cross-validation could be obtained in both models 4 and 5 applying ECFP fingerprint, although the differences were small for all the models. Although model 5 using 2048 bit ECFP showed the highest mean R 2 scores in Table S1, the difference between models 4 and 5 was 0.001. Because a smaller number of descriptors tends to reduce the overlearning in general, the structure-based model 4 with 1024 bit ECFP fingerprint and a radius of two was applied.
Because 1024 fingerprint bits were still excessively large for the descriptors compared with 256 training datasets, the linear dependent bits in the fingerprint with a correlation coefficient of 1.0 were excluded. Subsequently, the number of descriptors including the remaining fingerprint bits was reduced marginally to 917. These descriptors were additionally reduced according to a given threshold of importance evaluated by the RF to increase the stability.  Figure 1). The discrepancy in these values between the training and testing datasets was marginal. This indicated that the optimized RF was not overtrained. Thus, the obtained RF is likely to evaluate the acceptable hole mobilities of a given molecule at a low computational cost.
As described above, one of the characteristic features of the RF is its capability to rank the relative importance of descriptors to predict properties such the mobility. In the obtained RF, num_S and TPSA were extracted as the two most important descriptors (see Figure 2). Meanwhile, the importance of the other variables reduces gradually and spread to many descriptors. The presence of non-aromatic sulfur atoms (num_S) tended to decrease the hole mobility, as shown in Figure3. A likely reason is that more than half of the molecules with non-aromatic sulfur atoms in our constructed datasets were proposed mainly for thermally activated delayed fluorescence (TADF) devices [40][41][42][43] in which the non-aromatic sulfur atoms were included in the phenothiazine ring, sulfone, and thioether. The hole mobilities of molecules with phenothiazine rings have been reported to be lower than those of molecules carbazole groups [14,44]. Meanwhile, sulfone functions as a strong electron acceptor group to destabilize cations. This tends to increase the reorganization energy, and the hole-transfer rate constant decreases according to the Marcus theory. Most of these molecules have been proposed as suitable molecules for light emission layers rather than hole-transport layers. Then the hole mobility is considered to be low if non-aromatic sulfur    Table 1.
atoms are included. The favorable num_S and TPSA values were evaluated 0-1 and three or eight, respectively, indicating high hole mobilities. It is empirically known that TPA and carbazole groups are favorable moieties in the molecule used for the hole-transport layer to design high-performance flexible devices that function as an amorphous phase [45][46][47]. Molecules with two TPA and/or carbazole groups are likely to show high hole mobilities, according to the TPSA values for the TPA (3.24 Å 2 ) and carbazole (4.93 Å 2 ) moieties.
In contrast, the num_S and LUMO + 2 eigen-values were extracted as important descriptors, (see Figure S4) for the MO-embedding model. Meanwhile, the HOMO eigen-values showed a lower importance (< 0.03). This is noteworthy because a hole should be transferred between neighboring molecules, mainly through the HOMOs in the hopping model. From a theoretical perspective, transfer integrals can be calculated as the coupling between the HOMOs of two interacting molecules [6,48]. The interaction of MOs in the molecular pair at the distance of van der Waals contacts is mainly caused by MO polarizations and/or charge transfer interactions according to the energy decomposition scheme by Morokuma et al. [49]. Thus, the interactions between the HOMOs and unoccupied MOs may be important. In contrast, the HOMO-HOMO interaction causes exchange repulsion only at short distances. Then, the unoccupied MO levels are considered to be extracted as the Note that all the descriptors are required in RF to precisely determine hole mobilities. Furthermore, the predicted hole mobility and physical significance of each descriptor need not be directly related. Thus, the significance of the rank of importance should be assigned to each descriptor with caution.

Molecular Generation
Although an optimized RF can predict the hole mobility of a given molecule, it is difficult to design new molecules with the desired hole mobility. Therefore, the ChemTS algorithm was used to design the molecules. The ChemTS Python library was modified to include the hole mobilities predicted by the RF in the reward function J(S) (see eq.3). Ten and twenty thousand molecules were generated independently using modified ChemTS and optimized RF. The molecular structures of the top five molecules in each generation with high predicted hole mobilities are depicted in Figure 4. The corresponding SMILES, hole mobilities predicted by RF, and glass transition temperatures Tg evaluated by the model provided by Zhao et al. [50] are summarized in Table 4. Although 1 to 5 generated by ten thousand molecular generations did not have common substituents or molecular substructures in the moiety generated by MCTS, the predicted hole mobilities were of an almost equal magnitude (of the order of 10 −4 cm 2 /Vs, which is similar to the mobilities of NPD or BPD molecules) [6]. Thus, the mobility in these molecules may have originated from the molecular parts generated by the RNN. In contrast, 6 to 10 generated by twenty thousand molecular generations have a common acridone substructure, which was generated by MCTS and was not included in our datasets. The hole mobilities of these molecules were the order of 10 -3 cm 2 /Vs, which were similar magnitudes of widely used CBP molecule (1.8 × 10 −3 cm 2 /Vs). [51] This is important because the current MCTS approach has the ability to find novel molecular fragments by repeating molecular generations, since different molecules are obtained in each generation cycle. This is because the MCTS develops molecular structures based on the Monte Carlo technique to refine the score of the reward function in eq.1. These production programs are also provided in Github. In addition, to use the molecules generated in electronic devices in the amorphous phase, a high Tg is required. These molecules were predicted to have a higher Tg than the experimental value of 368 K for -NPD [24][25][26], which is widely used as a hole-transport layer in electronic devices Although the quality of the optimized RF appeared reasonable, (see Figure 1), the error between the RF and experimental values was occasionally large. Thus, to increase the reliability of the predicted mobilities obtained by RF, we have calculated the hole mobilities using more reliable alternative approaches. If both the evaluated values show consistently high hole mobilities, the generated molecules are worth synthesizing. For this purpose, the SC model using ab initio MO calculations and Marcus theory was adapted. The model amorphous phases of the molecules in Figure 4 were prepared using MM-MD simulations. Their structural parameters are summarized in Table 5. The closest interatomic distances Dmin and the calculated densities  for all the molecules were 1.7 Å < Dmin < 2.1 Å and ca. 1.2 g/cm 3 , respectively. The coordination numbers Nc ranges from 21 to 26.
Because the SC model is a statistical approach to evaluating charge mobilities in the amorphous phase assuming the hopping model, the "local" hole mobilities [6] should be calculated between a focused (target) molecule and its' surroundings in the model amorphous. In this study, 9 molecules were selected randomly as target molecules in the SC model. The charge transfer integral, t, was calculated for all the pairs of molecules between the target molecules and coordinated molecules using ab initio MO calculations. Table 6 summarizes the calculated hole mobilities, absolute overlap integrals, transfer integrals, and reorganization energies.
It is noteworthy that molecules 5 and 8 were evaluated to have high calculated hole mobility calc of 7.8 × 10 −2 cm 2 /Vs and 2.9 × 10 −2 cm 2 /Vs, respectively owing to both the high transfer integral t and low reorganization energy . Therefore, these molecules are worth synthesizing because both RF and SC calculations showed relatively high predicted hole mobilities.
In contrast, the hole mobilities for molecules 1 and 3 were less than 1.0 × 10 −6 cm 2 /Vs, which were significantly low compared with the RF-predicted values in Table 4. Reorganization energies  were approximately two times as large as those of the others. Therefore, molecules 1 and 3 were considered to have smaller hole mobilities than those anticipated based on the RF. Because the electronic properties t and  were excluded from the current structure-based RF model to save computational time, the predicted mobilities were significantly different from the SC calculated results for certain cases. Figure 5 shows the structures and HOMOs of the molecular pair with the highest contribution to hole mobility max in the SC calculation. Although the HOMOs were spread over the acridone and ketone moieties in molecules 7 and 8 and the contribution to the hole mobility imax were also large in these molecules, the statistically averaged hole mobility calc by the SC model was not highest because of the randomly arranged configurations in the model amorphous phase. In addition, the largest hole-transfer contribution occurred between neighboring carbazole moieties in molecule 5. It is important to note that the hole mobilities of 7.80 × 10 −2 and 2.94 × 10 −2 cm 2 /Vs in 5 and 8, respectively predicted by the SC model were higher than those included in the constructed datasets. It can be concluded that the combined approach of RF and modified ChemTS successfully generated novel molecules without the need for experimental knowledge and the constructed datasets would also contribute to the development of research fields such as artificial intelligence and electronic devices.

Conclusion
To adapt machine learning of hole mobilities to design novel advanced molecules for flexible electronic devices, datasets of hole mobilities in the amorphous phase for 321 organic molecules were constructed by collecting reported experimental data, which is publicity released on GitHub. This should be effective for a wide range of research, particularly electronic device   coefficients of 0.885 and 0.764 for the training and testing datasets, respectively. The favorable number of sulfur atom and topological polar surface area (TPSA) values, which were extracted based on the importance of the descriptors, were evaluated to be 0-1 and three or eight, respectively. These indicated high hole mobilities. Thus, it is considered that the molecules with two TPA and/or carbazole groups are likely to show a high hole mobility according to the TPSA values for TPA (3.24 Å 2 ) and carbazole (4.93 Å 2 ) moieties according to the prepared datasets.
Because it is difficult to design new molecules with the desired hole mobilities, the ChemTS algorithm was modified to design molecules with high hole mobilities. The molecular structures with high hole mobilities predicted by RF were generated with mobilities (almost 10 −3 cm 2 /Vs ) that are similar to those of CBP. These molecules were predicted to have a similar Tg as the experimental value of 368 K for -NPD, which is a widely used molecule for the hole-transport layer in electronic devices such as OLED.
Although the RF could reasonably predict the experimental results, the reliability of the predicted mobilities obtained by the RF was occasionally low, and the SC model using ab initio MO calculations and the Marcus theory was adapted. Molecules generated (molecule 5 and 8 in Figure 4) by the modified ChemTS were verified to have high calculated hole mobilities calc of 7.8 × 10 −2 and 2.9 × 10 −2 cm 2 /Vs owing of the high transfer integrals t and low reorganization energies  by the SC model. It was concluded that these molecules are worth synthesizing for high hole mobilities because both RF and SC calculations predicted high hole mobilities.