Organic semiconducting materials offer great synthetic flexibility, which allows excellent tunability of the bandgap, energy levels, and carrier mobility, and thus great potential for the design of efficient optoelectronic devices such as organic solar cells (OSCs). Compared with their inorganic counterparts, OSCs offer unique advantages such as light weight, good flexibility, and semi-transparency1–3. Advances over the last decade in functional materials design, morphology optimization, and device architecture engineering have led to certified power conversion efficiencies (PCEs) of over 19%, demonstrating the promise of this emerging photovoltaic technology. However, finding suitable organic molecules in the vast organic compound space is extremely difficult, and efficiency breakthroughs in the lab require a constant input of intensive labor and time.
Although density functional theory (DFT) calculations allow us to obtain many electronic structure properties of organic molecules without complex organic synthesis, we still lack an effective mathematical model for calculating PCEs directly from the physical properties of the molecules4–5. In addition, although DFT calculations reduce economic costs, their high time cost still limits their application in the high-throughput screening of molecules. Therefore, there is an urgent need for a Quantitative Structure-Property Relationship (QSPR) model that can perform high-throughput screening of the organic compound space to find more suitable molecules.
As a powerful technology for mining relationships hidden in big data, machine learning, a branch of artificial intelligence, has developed rapidly in recent years6–8. With the development of materials informatics, a new paradigm for materials research and development is gradually taking shape: (1) use a materials database to train machine learning models; (2) use the models to predict new materials; (3) verify the results with experiments or calculations9–11. Machine learning has shown excellent performance in accelerating the discovery of new materials, guiding their design, and exploring the QSPR of materials12–14.
Recent applications of machine learning in the field of OSCs show its potential for effective high-throughput screening of organic molecules15–17. Scharber et al. built a model that calculates the PCE as a function of the bandgap and the energy levels of the conjugated polymer18. The Harvard Clean Energy Project (CEP) collected calculated and experimental data for thousands of organic photoelectric molecules and predicted their PCEs using Scharber’s model19–20. The same team later used Gaussian Process Regression (GPR), a machine learning method, to correct Scharber’s model, which increased its Pearson correlation coefficient (r) from 0.3 to 0.43 and gave a rough estimate of the PCEs21. However, because of its low accuracy and its reliance on time-consuming quantum mechanical calculations of the energy levels, Scharber’s model is not suited to fast and accurate high-throughput screening22.
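To make the structure of such a model concrete, the following is a minimal sketch of the commonly quoted form of Scharber's model, not a reproduction of the implementation used in the cited works. It assumes the usual textbook defaults (an acceptor LUMO of -4.3 eV, an empirical 0.3 V voltage loss, a fill factor of 0.65, and 100 mW/cm² incident power), none of which are stated in this text. In the original model the short-circuit current is obtained from the bandgap by integrating the solar spectrum; here it is simply passed in as an input, and all function and parameter names are hypothetical.

```python
def scharber_voc(homo_donor_ev, lumo_acceptor_ev=-4.3, loss_v=0.3):
    """Estimate the open-circuit voltage (V) from frontier orbital energies.

    Energies are in eV relative to vacuum (negative numbers).
    The default acceptor LUMO and the 0.3 V loss are assumed values.
    """
    return abs(homo_donor_ev) - abs(lumo_acceptor_ev) - loss_v


def scharber_pce(homo_donor_ev, jsc_ma_cm2, lumo_acceptor_ev=-4.3,
                 fill_factor=0.65, p_in_mw_cm2=100.0):
    """Estimate the PCE (%) as Voc * Jsc * FF / P_in.

    jsc_ma_cm2 is supplied by the caller (assumption: in the full model it
    would be derived from the bandgap and the solar spectrum).
    """
    voc = scharber_voc(homo_donor_ev, lumo_acceptor_ev)
    if voc <= 0.0:
        # No usable driving force: report zero efficiency.
        return 0.0
    return 100.0 * voc * jsc_ma_cm2 * fill_factor / p_in_mw_cm2


if __name__ == "__main__":
    # Hypothetical donor with HOMO at -5.3 eV and Jsc of 10 mA/cm^2.
    print(scharber_pce(-5.3, 10.0))
```

Because every input other than the current density comes from orbital energies, each prediction still depends on a quantum mechanical calculation, which is the cost bottleneck noted above.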
By removing the expensive input of quantitative microscopic properties, Sun et al. established a deep learning model that can quickly classify photoelectric molecules23–24. This model takes molecular graphs or fingerprint information as input and predicts the PCE interval (0–3% or 3–14.6%). Since obtaining molecular fingerprints does not require additional quantum mechanical calculations, this model can classify molecules quickly, but it cannot predict the value of the PCE. Moreover, the accuracy of this model is not satisfactory (69.41%) because the existing data are insufficient for the high data demands of deep learning. Including molecular microscopic properties as input is very helpful for improving model accuracy. For example, Alessandro et al. trained a kernel ridge regression (KRR) model that predicts the PCEs better (r = 0.68) by combining structural and electronic descriptors, an accuracy that meets the requirements of high-throughput screening22. Using data collected from the published literature and from quantitative calculations, Ma et al. trained a model that directly predicts the PCEs using the gradient boosted regression trees (GBRT) method25–26. This model takes 13 microscopic properties of the molecules as input and shows high accuracy (r = 0.79). Clearly, adding molecular microscopic properties improves the accuracy of existing models to a level that meets the requirements of high-throughput screening.
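The r values quoted for these models are Pearson correlation coefficients between predicted and reference PCEs. As a reminder of what that metric measures, here is a small self-contained sketch; the function name and the sample numbers in the usage line are illustrative only and are not data from the cited studies.

```python
import math


def pearson_r(predicted, measured):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    # Sums of squared deviations (unnormalised variances).
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0.0 or var_m == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_m)


if __name__ == "__main__":
    # Made-up predicted and measured PCE values (%), for illustration only.
    print(pearson_r([2.1, 4.0, 6.5, 8.2], [1.8, 4.4, 6.0, 9.1]))
```

Note that r only captures how well predictions track the reference values linearly; a model can have a high r and still be systematically offset, so it is usually reported alongside an error metric.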
However, the many expensive calculations of microscopic properties (especially excited-state properties) greatly limit high-throughput screening of the large-scale organic compound space for suitable molecules. Therefore, an accurate machine learning model for the high-throughput screening of organic optoelectronic molecules is needed whose inputs can be obtained easily.
In this work, we established an automated framework that can quickly predict the PCEs of OSCs. First, a small dataset of high-quality experimental data was used to train an ensemble learning model that predicts the PCEs from the physical and chemical properties of molecules. Then we trained a deep learning model that accurately predicts these molecular properties, using a graph neural network (GNN) architecture and a dataset containing a large number of molecular structures and properties. Specifically, we used SLI-GNN (Self Learning Input GNN), which we recently developed27. Based on these two models, we designed a framework that directly predicts the PCEs from the molecular structure. Finally, our framework shows excellent performance and universality in high-throughput screening and is verified by our experimental results. By combining deep learning and ensemble learning, we achieve direct, fast, and accurate prediction of PCEs based on molecular structure.
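The two-stage design described here (structure to properties, then properties to PCE) can be sketched schematically as a composition of two models. This is only an illustration of the data flow, not the authors' implementation: the type aliases, the function names, the use of a string to represent a molecular structure, and the `top_k` parameter are all assumptions introduced for this sketch, and the two models are passed in as plain callables.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces: stage 1 maps a structure to named properties,
# stage 2 maps those properties to a predicted PCE (%).
PropertyModel = Callable[[str], Dict[str, float]]
PceModel = Callable[[Dict[str, float]], float]


def predict_pce(structure: str, property_model: PropertyModel,
                pce_model: PceModel) -> float:
    """Chain the two stages: structure -> properties -> PCE."""
    properties = property_model(structure)
    return pce_model(properties)


def screen(structures: List[str], property_model: PropertyModel,
           pce_model: PceModel, top_k: int = 10) -> List[Tuple[str, float]]:
    """Score every candidate and return the top_k by predicted PCE."""
    scored = [(s, predict_pce(s, property_model, pce_model))
              for s in structures]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

The appeal of this split is that the second stage can be trained on a small experimental dataset while the first stage is trained on a large computed one, so no quantum mechanical calculation is needed at screening time.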