Data analytics accelerates the experimental discovery of new thermoelectric materials with extremely high figure of merit

Thermoelectric (TE) materials are among very few sustainable yet feasible energy solutions of present time. This huge promise of energy harvesting is contingent on identifying/designing materials having higher efficiency than presently available ones. However, due to the vastness of the chemical space of materials, only its small fraction was scanned experimentally and/or computationally so far. Employing a compressed-sensing based symbolic regression in an active-learning framework, we have not only identified a trend in materials' compositions for superior TE performance, but have also predicted and experimentally synthesized several extremely high performing novel TE materials. Among these, we found Ag$_{0.55}$Cu$_{0.45}$GaTe$_2$ to possess an experimental figure of merit as high as ~2.8 at 827 K, which is a breakthrough in the field. The presented methodology demonstrates the importance and tremendous potential of physically informed descriptors in material science, in particular for relatively small data sets typically available from experiments at well-controlled conditions.

Excellent thermoelectric performances have been achieved in different types of thermoelectric materials due to the devotement of a substantial amount of research efforts to identify material compositions and configurations with optimized transport characteristics providing best possible TE performance. Among them, half-Heusler alloys show great potential for high-temperature power generation applications owing to their high stability at high temperature (a high zT of~1.5 at 1200 K was achieved in p-type NbFeSb). 5 Inorganic Ba8Ga16Sn30 clathrates have achieved high peak zT of 1.45 at around 500 K which indicates their great potential for low-and mediumtemperature applications. 6 Skutterudites achieved a high zT of~1.8 at 823 K for (Sr, Ba, Yb)yCo4Sb12 + 9.1 wt% In0.4Co4Sb12. 7 In recent years, chalcogenides were identified as very promising thermoelectric materials due to the discovery of a few members of this class with extremely high thermoelectric performance (zT > 2). [8][9][10][11][12][13][14][15][16] In particular, the binary chalcogenides, including PbTe, Cu2Se, GeTe, and SnSe, were "superstars" in the thermoelectric family. [9][10][11][12][13][14] The ternary chalcogenide compounds are also extensively investigated due to their large variety and high TE performance (zT~1.9 at 585 K was achieved in AgSb0.96Zn0.04Te2). 15,16 The mechanisms governing the performance of TE materials were also proposed. While substitution of Te by Se in PbTe has reportedly enhanced the zT by increasing phonon scattering, 17 doping with thallium has improved the TE performance of PbTe by introducing new states in the band gap and thereby improving the Seebeck coefficient. 18 The "liquid-like" behavior of copper ions around the Se sublattice in Cu2-xSe results in low lattice thermal conductivity, which enables high zT. 19 The n-type SnSe single crystals have an extremely high zT of~2.8 at 773 K by Br-doping via well-controlled synthesis techniques due to a unique 3D charge and 2D phonon transport behavior. 13 The polycrystalline phases of SnSe, which are more easily obtained experimentally, also possess a high zT value of~1.8 at 816 K. 20 Despite all these substantial efforts over the years, only a handful of materials and their optimized working conditions (i.e. temperatures) for best performances have been identified hitherto. Thus, a quick yet reliable means to scan the huge compositional space of TE materials is essential for the accelerated discovery of new highperformance TE materials.
With the advance of computational resources, first-principles calculations have been proven very useful not only for the initial screening process but also for understanding and interpretation of experimental results. 21,22 Electronic transport properties of more than 48,000 inorganic materials have been calculated and reported as a database to be used in many fields from thermoelectricity to electronics and photovoltaics. 23 While a lot of high-throughput screening with densityfunctional theory (DFT) inputs have been performed to identify materials with high power factors or low thermal conductivities, including transition-metal oxides, nitrides, sulphides, etc., the actual performance and stability of the predicted materials have remained unconfirmed. [24][25][26] This is unfortunately a typical situation in many fields of materials science: powerful theoretical and data analytics methodologies are intensively developed and employed to screen materials space, but experimental confirmation is very scarce. The reason is not only the relatively high cost of experiments, but also a lack of effort to produce consistent experimental datasets with well-documented metadata (including experimental conditions, materials synthesis protocol, etc.).
Recently an attempt to address this problem systematically has been made in the field of heterogeneous catalysis. 27 There have been several attempts in developing experimental databases in the field of thermoelectricity, 28 but the sparsity of the available datasets make the screening either biased or wrong. Herein, we generate~600 data points for in-house experimentally synthesized ternary chalcogenide materials A-B-C2 (where A = Cu or Ag; B = Bi, In, Sb, or Ga; C = Se or Te) with proper doping to ensure consistency of environmental and instrumental effects. Using the recently developed compressed-sensing based data analytics approach SISSO (sure independence screening and sparsifying operator), 29 we have identified descriptors that can be used to predict the TE performance of ternary A-B-C2 type chalcogenide materials in a fast yet reliable manner. SISSO enables us to identify the best low-dimensional descriptors (sets of descriptive parameters of materials that are either known or can be easily obtained) in an immensity of offered candidates. The search for materials with optimized zT values is performed in an active learning framework. Starting with an initial pool of experimental data, we first identify a descriptor and predict new candidates. Followed by that, we synthesize few of the predicted high-performance TE materials and include them in the data pool. This procedure ( Figure 1) has been repeated several times until the SISSO model has converged. Thus, our study goes beyond the traditional expensive and time-consuming intuition-driven or trial-and-error experimental/theoretical approaches. It reveals a relationship between TE figure of merit (zT), elemental composition, and physical features of atoms of involved species. Using data driven active learning, we have successfully predicted and experimentally verified several new TE materials of the ternary A-B-C2 type chalcogenide family, which are not only highperforming but also are stable over a broad temperature range.

Results
Around 600 data points (zT values) from experimentally synthesized ternary A-B-C2 type chalcogenide compounds within the temperature window of 300-850K are used as the training dataset. The in-house synthesis provides our dataset the consistency necessary for machine learning (ML). In the SISSO method, a huge pool of candidate features of increasing complexity is first constructed iteratively by applying a set of mathematical operators to a set of primary features. The primary features include only temperature and experimentally known materials properties listed in Table 1. The target property (zT) is expressed as a linear combination of the complex features : The set of is called a descriptor with dimension . A is the intercept. Compressed sensing is used to identify best model and the corresponding descriptor for each dimension up to a maximum value. The initial choice of primary features is crucial for the predictive performance of descriptors identified by SISSO. Materials properties such as atomic species and their relative concentrations, 30, 31 atomic radii, 32, 33 atomic weights, 33 electronegativities, 34,35 ionization energies, 36 and heats of fusion/vaporization [37][38][39] are reported to have strong impact on TE performance of different thermoelectric materials. Therefore, we have constructed the primary feature space with these properties. In SISSO overfitting may occur with increasing dimensionality of the descriptor. To avoid overfitting, 10-fold cross-validation (CV10) 40 is employed to identify the optimal dimension of the model. For each cycle displayed in Figure 1, the materials pool is first split into 10 subsets (the materials datasets for each cycle are collected in the Supplementary Data1), and the descriptor identification along with the model training is performed using 9 subsets. Then the error in predicting properties of the systems in the remaining subset is evaluated with the obtained model. The CV10 error is calculated as the average value of the test errors obtained for these ten subsets.
The results for each cycle are collected in Figure S1. As can be seen in Figure S1, for each cycle the CV10 error reduces gradually from h to h (1D to 7D) descriptors, which suggests that overfitting does not occur for dimensions up to 7D. This is due to the relatively large number of training data points for SISSO (although it is still small compared to typical dataset sizes needed for traditional ML approaches such as neural networks). The root-mean-square fitting error (RMSE) for the 7D descriptor for each cycle is already small and is only slightly smaller than for the 6D descriptor. Since higher dimensions mean higher model complexity and computational cost of training, we have carried out all our analysis with up to 7D descriptors. To further confirm the predictive power of our best model (7D) at the final iteration, we have validated the model based on the zT values from previous reports on chalcogenide materials from other groups in Table S1 alongside with the SISSO-predicted values. Despite experimental uncertainties, the predictions remain overall consistent with experiment.
The overall good performance of our best model at the final iteration is demonstrated by low RMSE in zT (0.14). The distribution of errors in predicted zT for different temperature ranges, different zT ranges, and the overall distribution are shown in Figures 2Sa-c. Figure 2 shows the distribution of zT for different temperature ranges in the final dataset. The top five largest deviations between SISSO model predicted zT values and the experimentally measured zT values are collected in Table S2. The maximum absolute error reaches 0.76 for Ag0.55Cu0.45GaTe2 at 730.3 K. In general, the absolute errors larger than 0.6 are all from Cu1-xAgxGaTe2 systems with experimentally measured zT values larger than two at temperature larger than 730 K. Fewer data points measured at high temperatures (>750K) and high-zT materials (> 1.2) explain why the prediction errors are larger for these cases. However, for all the materials the SISSO model underestimates zT, i.e., all materials predicted to have good thermoelectric properties can be expected to have even better properties in reality. The descriptor components , the coefficients , importance score, and the occurrence number of each component during the CV10 processes for the best SISSO model at the final iteration are given in Table 2 (the results for other iterations are collected in Table S3-5). One of the advantages of SISSO over other data analytics approaches is the physical interpretability of the models. By inspecting the descriptor components, one can see how SISSO selects physically meaningful correlations between the target property (zT) and the combinations of primary features. In particular, occurrence of T 2 in h reflects the fact that zT/T = α 2 σ/κ consistently increases with temperature for the materials in the considered class. The occurrence of 1-CA* in the denominator of h indicates that TE efficiency increases with increasing disorder in the Asite, which reduces thermal conductivity. 4,41,42 It should be noted that the dependence of zT on the dopant concentration is strongly nonlinear. A small dopant concentration can have a strong effect on the TE performance due to the formation of impurity band within the gap. 18 The dependence of h and on atomic features indicates that zT is increased for smaller and lighter atoms at the B-site, with at least the minority species (B*) being as light as possible relative to the majority species at the C-site. According to , additional gain in zT can be achieved by choosing B atoms lighter than majority of atoms at A sites. The disparity of size and weight between B and C/A atoms maximizes the rattling effect, which is responsible for a decrease of thermal conductivity in clathrates. 43 At the same time, smaller atomic radius of majority species at site B and disorder at site A are detrimental for electrical conductivity, consistent with the dependence of on these features. This reflects the well-known dilemma in solid solutions that impurity scattering will reduce both the thermal conductivity and carrier mobility. 33,44 We note that concentration of A* will only improve zT up to a limit, namely while the phonon scattering rises due to the lattice disorder but the carrier mobility remains unaffected. 17 In Cu1−x−δAgxInTe2, zT reportedly increases only until x = 0.2, 45 while in Cu1-xAgxGa0.6In0.4Te2 zT reaches its highest value of 1.64 (at 873K) for x = 0.3. 15 Thus, our data analysis shows that selecting impurity atoms with small radius is indeed an effective strategy to suppress lattice thermal conductivity and at the same time maintain the carrier mobility.
The appearance of the heat of vaporization HVA in can be related to the ability of element A with higher heat of vaporization to form stronger bonds with surrounding atoms and thus also reduce disorder and maintain higher electrical conductivity at elevated temperatures. [37][38][39] The appearance of electronegativity (EN) in accounts for the effects of electrical conductivity (σ) alongside with κ. The larger the ENB is compared to ENC the higher is the zT. This is due to the increase of the carrier mobility and carrier concentration induced by the ejection of electrons from element C. This is also consistent with previous findings that high electronegativity mismatch is beneficial for performance of thermoelectric alloys. 34  Results of HT screening of more than 10,000 data points (zT values) from ternary A-B-C2 type chalcogenide compounds are presented in Figure 3a. We have found many new promising TE materials that have not been realized experimentally before. Moreover, based on these results, we have synthesized the compound Ag0.55Cu0.45GaTe2, showing an experimental zT value as high as 2.8 at 827K, which have been repeated for several times ( Figure S3). The results of thermal diffusivity are also reproduced from a third-party laboratory (see Figure S4   The phase stability of Ag0.55Cu0.45GaTe2 can be seen from the XRD pattern in Figure S5. One can also see in Figure 4 that, although increasing the Ag concentration beyond 0.55 increases Seebeck coefficient, it does not reduce the lattice thermal conductivity any further. Rather, the additional Ag lowers the electrical conductivity so that zT decreases. It is exactly this interplay of various parameters that makes the search for high-performance thermoelectric materials so non-trivial. In summary, we have not only overcome one of the major hurdles of traditional TE material synthesis, namely exploring the very vast configurational space, but have also demonstrated the power of data analytics and machine learning in accelerating thermoelectric materials design. By learning from our own experimental dataset, obtained at consistent well-controlled conditions, we have developed a ML model that allows us to predict many new materials with exceptional TE performance effortlessly and reliably. Due to physical interpretability of our model, we were able to clearly disentangle interconnected, often conflicting effects of basic materials parameters on TE performance, and find a way to overcome the optimization difficulties by fully utilizing the large variability of the parameters within the explored materials class. The complex relationship between the primary features and the target property zT emphasizes the absolute necessity of advanced data analytics tools for the advancement of new functional materials discovery. Moreover, the successful synthesis and characterization of our predicted materials have paved out new lanes in TE research. In addition to the fact that these materials have exceptionally high zT values and stabilities and are therefore very promising for several TE applications, the success of our active-learning ML strategy clearly shows that such concerted approaches have a great potential for future materials design.

Data-analytics:
The descriptors are obtained with SISSO, using~600 experimental zT inputs as training data. SISSO is meant to single out a simple yet physically intuitive descriptor from an immensely large set of candidates. Initially, a huge pool consisting of more than ten billion candidate descriptors, is constructed. Then an iterative approach is employed to search the descriptors by combining pre-defined primary features and a set of mathematical operators (+, −, ·, /, log, exp, exp−, −1 , 2 , 3 , √, 3 √, |−|). The complexity of the obtained descriptors depends on how many times the operators are employed. In the present study, we have considered feature space of complexity level up to two, Φ1 and Φ2. 29 Any given feature space of complexity n (Φn) also contains all of the lower (i.e. n-1) feature spaces. The details of the SISSO model identification procedures and the high-throughput screening of new materials/compositions at each iteration are given in the Supplementary Methods.

Sample preparation
Bulk samples of polycrystalline Cu1-xAgxGaTe2 (x=0-0.6) were prepared by vacuum meltingannealing combined with spark plasma sintering (SPS) using elemental Cu, Ag, Ga and Te (99.999%, Emei Semicon. Mater. Co., Ltd. Sichuan, CN). All the raw materials were weighed and mixed according to the above formulae and sealed in quartz tubes under vacuum, which was gradually heated up to 1273 K at a heating rate of 100 Kh -1 , and incubated for 28 h. Afterwards, the ampoules were slowly cooled to 873 K at a rate of 15 Kh -1 followed by quenching in water and then dwelt at 813 K for 72 h. The obtained chunks were ball-milled into fine powders for 10 h, and then sintered by the spark plasma sintering apparatus (SPS-1030) at 673 K with a pressure of 55 MPa.

Transport property measurements
The densified bulk samples of size~2.5×3×12 mm 3 and ϕ10×1.5 mm were prepared for electrical property and thermal diffusivity measurement. The Seebeck coefficients and electrical conductivities were performed with a ZEM-3 device (ULVAC-RIKO, Japan) under a helium atmosphere from room temperature to~870 K with an uncertainty of < 5.0%. Thermal conductivity (κ) was calculated via κ=DCpρ, where the thermal diffusivity (D) was measured by the laser flash method (TC-1200RH, ULVAC-RIKO, Japan) with a precision of~10.0% and confirmed by NETZSCH LFA457, Germany. The heat capacities (Cp) were estimated following the Dulong-Petit rule, Cp = 3nR (here n is the number of atoms per formula unit and R gas constant). The sample density (ρ) was measured by the Archimedes method. When calculating the electronic thermal conductivities (κe) according to the equation κe = LσT, the Lorenz numbers L were estimated by using the formula L=1.5+exp(-|α|/116) (where L is in 10 -8 WΩK -2 and α in μVK -1 ). 46 The three physical parameters (α, σ, and κ) were finalized after three measurements. The total uncertainty for zT is~20%. 413026, Russia. 4

Supplementary Methods
For constructing the Φ1 and Φ2 feature spaces we made use of the set of algebraic/functional operators given in eq. 1. Ĥ (m) ≡{+, −, ·, /, log, exp, exp−, −1 , 2 , 3 , √, 3 √, |−|}, (1) The superscript m indicates that when applying Ĥ (m) to primary features φ1 and φ2 a dimensional analysis is performed, which ensures that only physically meaningful combinations are retained (e.g. only primary features with the same unit are added or subtracted). All primary features included in this study were obtained from the literature. 1 The values of the primary features for the training data sets at each iteration can be found in the file "Supplementary Data".
The sparsifying ℓ0 constraint is applied to a smaller feature subspace selected by a screening procedure (sure independence screening (SIS)), where the size of the subspace is equal to a userdefined SIS value times the dimension of the descriptor. The SIS value is not an ordinary hyperparameter and its optimization through a validation data set is not straightforward. Ideally, one would want to search the entire feature space for the optimal descriptor. However, this is not computationally tractable since the computational cost of the sparsifying ℓ0 constraint grows exponentially with the size of the searched feature space. Instead, the SIS value should be chosen as large as computationally possible. The reasonable SIS values were chosen based on the convergence of the training error.
The high-throughput (HT) screening of the huge compositional space of ternary based chalcogenide materials A1−xA*xB1−yB*yC2−zC*z (where A = Cu or Ag; B = Bi, In, Sb, or Ga; C = Se or Te; A* = Cu, Ag, Zn, or Na; B* = Ga, In, Bi, Sb, Zn, or Sn; C* = Te or Cl) was performed at the final iteration. The values of x, y, and z were limited to smaller than 0.1, 0.1, and 0.2, respectively, which corresponding to 10% of dopant concentration. The mixtures of CuInTe2, CuGaTe2, AgInTe2, and AgGaTe2 are stable in a large temperature range according to the literature. 2 Thus, their mixtures are also included in our high-throughput screening. The best ternary, quaternary, and quintenary A-B-C2 type based chalcogenide materials were systematically screened at the first, second, and third iteration, respectively. The newly found best systems at each iteration are collected in the Supplementary Data. Figure S1. CV10 error at each iteration.