To showcase the generality and effectiveness of the DQN algorithm, we design multilayer WS-EMs for three applications in emissivity engineering, namely TC, RC and GS, under the same optimization framework and with a common material library.

As mentioned earlier, the reward function must be carefully defined to ensure that the optimization progresses in the desired direction. For TC, an ideal emitter requires low emissivity inside the AM (8–14 µm) but high emissivity outside it. We therefore define the reward *R* as the difference between the blackbody-weighted average emissivity outside the AM (5–8 µm and 14–20 µm) and that inside the AM, which is calculated as:

$$R=\frac{\int_{5}^{8} \varepsilon(\lambda) I_{BB}(\lambda,T)\,d\lambda + \int_{14}^{20} \varepsilon(\lambda) I_{BB}(\lambda,T)\,d\lambda}{\int_{5}^{8} I_{BB}(\lambda,T)\,d\lambda + \int_{14}^{20} I_{BB}(\lambda,T)\,d\lambda} - \frac{\int_{8}^{14} \varepsilon(\lambda) I_{BB}(\lambda,T)\,d\lambda}{\int_{8}^{14} I_{BB}(\lambda,T)\,d\lambda} \tag{1}$$

where \(I_{BB}=h{c^2}/{\lambda^5} \cdot [\exp(hc/\lambda k_B T) - 1]^{-1}\) is the spectral radiance of a blackbody at wavelength *λ* and temperature *T*, *h* and \(k_B\) are the Planck constant and the Boltzmann constant, respectively, *c* is the speed of light, and \(\varepsilon(\lambda)\) is the emissivity spectrum of the designed TC emitter. The temperature is set to 350 K, which is slightly higher than the average surface temperature of military armored vehicles.45 By Eq. (1), the reward *R* lies between −1 and 1, and it is positive when the average emissivity outside the AM exceeds that inside it.

Based on preliminary trials, the iteration threshold is set to 0.2: when the reward *R* of a state falls below this threshold during an iteration, the iteration is stopped, and the agent re-initializes a new state and proceeds to the next iteration. In addition, rewards *R* below 0.2 are forcibly set to −0.2, which signals to the agent that the corresponding states do not meet the design requirements. The state is initialized randomly at the beginning: two materials are randomly selected from the material library, and the thickness of each layer is randomly generated within the range described above. Once the reward *R* of a state exceeds the iteration threshold, the state with the highest historical reward is chosen as the initial structure for the next iteration. This initialization method may introduce randomness into the optimization results. To mitigate its impact, the optimization process is run 5 times to obtain the optimal TC emitter structure. Each run consists of 1000 iterations, which is sufficient to reduce epsilon in the epsilon-greedy algorithm to its minimum value, so that the agent, rather than random exploration, dominates the selection of actions. Once the optimization is completed, the optimal structure is fabricated by magnetron sputtering to demonstrate the feasibility of the structural optimization.
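The reward of Eq. (1) and the threshold rule described above can be illustrated with a minimal numerical sketch. It is not the implementation used in this work: the function names (`planck`, `weighted_integral`, `tc_reward`, `shaped_reward`), the trapezoidal integration, and the toy emissivity spectrum are assumptions introduced only for illustration, whereas the band limits, the temperature of 350 K and the threshold of 0.2 are taken from the text.

```python
import math

H = 6.62607015e-34   # Planck constant, J*s
C = 2.99792458e8     # speed of light, m/s
KB = 1.380649e-23    # Boltzmann constant, J/K


def planck(lam_um, temp_k):
    """Blackbody weight I_BB as written in the text; constant prefactors cancel in Eq. (1)."""
    lam = lam_um * 1e-6  # micrometres -> metres
    return H * C**2 / lam**5 / math.expm1(H * C / (lam * KB * temp_k))


def weighted_integral(f, a, b, temp_k, n=400):
    """Trapezoidal estimate of the integral of f(lam) * I_BB(lam, T) over [a, b] (in um)."""
    step = (b - a) / n
    total = 0.0
    for i in range(n + 1):
        lam = a + i * step
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * f(lam) * planck(lam, temp_k)
    return total * step


def tc_reward(emissivity, temp_k=350.0):
    """Eq. (1): weighted mean emissivity outside the AM minus that inside the AM."""
    one = lambda lam: 1.0
    outside = (weighted_integral(emissivity, 5, 8, temp_k)
               + weighted_integral(emissivity, 14, 20, temp_k))
    outside_norm = (weighted_integral(one, 5, 8, temp_k)
                    + weighted_integral(one, 14, 20, temp_k))
    inside = (weighted_integral(emissivity, 8, 14, temp_k)
              / weighted_integral(one, 8, 14, temp_k))
    return outside / outside_norm - inside


def shaped_reward(r, threshold=0.2):
    """Rewards below the iteration threshold are replaced by the negative threshold."""
    return r if r >= threshold else -threshold


# Hypothetical step-like spectrum (illustrative only, not a simulated structure).
toy_spectrum = lambda lam: 0.2 if 8.0 <= lam <= 14.0 else 0.8
print(shaped_reward(tc_reward(toy_spectrum)))
```

In an actual optimization loop, `emissivity` would be replaced by the spectrum computed for the candidate multilayer, and a shaped reward equal to the negative threshold would trigger the re-initialization described above.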

The schematic of the resulting optimal structure and the corresponding scanning electron microscopy (SEM) image of the fabricated multilayer are shown in Fig. 3a. DQN ultimately chooses ZnS and Ge as the materials for the TC emitter. The thickness of each layer is also presented in Fig. 3a, including both the designed values and those obtained from the SEM image of the fabricated sample. The layer thicknesses in the optimal TC emitter are irregular and aperiodic, which makes such a structure difficult to design accurately by manual optimization. However, owing to limited manufacturing precision, the thicknesses of the fabricated sample deviate somewhat from the designed values, which leads to the discrepancy between the corresponding emissivity spectra in Fig. 3b. Differences between the optical properties of the sputtered materials and the input parameters used in the numerical simulation also contribute. Nevertheless, both the designed and fabricated structures exhibit low emissivity within the AM and high emissivity outside it. In simulation, the calculated average normal emissivity is 0.19 inside the AM and 0.80 outside it, giving a reward of 0.61. The excellent camouflage effect is attributed to low thermal emission in the AM (the band detected by IR cameras) and high emission outside the AM for additional radiative cooling. For further verification, the normalized electric field intensities of the optimal structure at 6.65 µm and 8.93 µm are plotted in Fig. 3c. The electric field intensity at 8.93 µm is strongly attenuated, which means that a forbidden band is formed in the AM, resulting in low absorption (and therefore low emissivity) in this band. In contrast, the intensity outside the AM remains relatively unchanged, resulting in high emissivity for the structure with the lossy SiO2 substrate. The emissivity of the optimal structure as a function of incident angle and wavelength is shown in Fig. 3d, indicating that the excellent performance is angle-independent.

To demonstrate the efficiency of the optimization under the DQN framework, we quantitatively analyze the reward *R* as a function of the percentage of calculated structures. As shown in Fig. 4a, DQN calculated less than 0.2% of all candidate structures to obtain 70% and 90% of the maximum reward, and only 4.428% of the structures to find the optimal structure for TC. As the optimization proceeds, the emissivity within the AM decreases continuously while the emissivity outside the window gradually increases, yielding a better camouflage effect. In addition, the material combinations of the structures with 70% and 90% of the maximum reward are the same as that of the optimal structure, as shown in Fig. S2, which indicates that DQN is able to select appropriate materials quickly and then perform the subsequent structural optimization. The parametric distribution curves of each layer thickness are presented in Fig. 4b, which indicate that the optimal layer thicknesses lie at the peaks of the curves. To further validate the optimal structure, Bayesian optimization (BO) is adopted to optimize the TC emitter under the specified material combination, namely ZnS and Ge. The reward histories are shown in Fig. 4c, which reveal that the maximum reward and the corresponding structure configuration obtained by BO are the same as those achieved by DQN. More detailed information about BO for TC is provided in Supplementary Information Note 1.

For the design of an RC emitter, the objective is to maximize the emissivity within the AM while minimizing it in the solar band, so as to achieve the maximum net outgoing power. This net power, also called the cooling power, can be expressed as

$$P_{cooling}(T)=P_{rad}(T) - P_{atm}(T_{amb}) - P_{sun}(\theta) - P_{cond+conv} \tag{2}$$

where \(P_{rad}\) is the power radiated by the RC emitter, \(P_{atm}\) is the power absorbed from atmospheric radiation, \(P_{sun}\) is the power absorbed from the sun, and \(P_{cond+conv}\) describes the heat exchange between the RC emitter and the environment by conduction and convection. *T* and \(T_{amb}\) are the temperatures of the RC emitter and the ambient, respectively, and \(\theta\) is the angle of solar radiation. A more detailed calculation method for each power term is provided in Supplementary Information Note 2. In the following calculation, the conjugate heat transfer coefficient in \(P_{cond+conv}\) is set to \(h_c = 5\ \mathrm{W/(m^2 \cdot K)}\) and the ambient temperature is held fixed to simulate a breeze situation46.

The greater the cooling power, the better the performance of the designed RC emitter. However, cooling power is not an intuitive reward, and it is difficult to set a suitable iteration threshold for it. The reward *R* is therefore set as the difference between the ambient temperature and the steady-state temperature (\(T_{steady}\)) of the RC emitter, namely the temperature drop below \(T_{amb}\). If \(P_{cooling}\) is positive at the initial temperature \(T_{init}\) (\(T_{init}=T_{amb}\)), the RC emitter starts to cool down. As the temperature of the emitter decreases, the cooling power \(P_{cooling}\) also decreases until \(P_{cooling}(T_{steady})=0\). At that point, the RC emitter reaches an equilibrium state and \(T_{steady}\) can be obtained from Eq. (2).22 Previous studies have shown that the temperature difference (\(\Delta T=T_{amb} - T_{steady}\)) can reach 8 ℃ or even higher6,33,47, so the iteration threshold is set to 5 ℃. As in the TC design, rewards *R* below 5 are forcibly set to −5. The structure is again initialized randomly at the beginning until the reward exceeds the iteration threshold, after which the best structure found so far is selected as the initial structure for subsequent iterations. The optimization is again run 5 times with 1000 iterations each to reduce the effect of randomness on the optimal structures and materials.
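The way the steady-state temperature follows from \(P_{cooling}(T_{steady})=0\) can be sketched with a simple bisection search. The sketch below is illustrative only. The gray-body power model in `make_toy_cooling_power` (with its effective emissivities, absorbed solar power and an ambient temperature of 300 K) is a hypothetical stand-in for the spectral calculation of Supplementary Information Note 2, and all function names are assumptions. Only \(h_c = 5\) W/(m²·K) and the threshold of 5 ℃ come from the text.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W/(m^2 K^4)


def make_toy_cooling_power(t_amb, eps_emitter=0.9, eps_atm=0.6,
                           solar_absorbed=20.0, h_c=5.0):
    """Hypothetical gray-body version of Eq. (2); every default except h_c is illustrative."""
    def p_cooling(t):
        p_rad = eps_emitter * SIGMA * t**4                  # emitted power
        p_atm = eps_emitter * eps_atm * SIGMA * t_amb**4    # absorbed atmospheric power
        p_cond_conv = h_c * (t_amb - t)                     # non-radiative heat gain
        return p_rad - p_atm - solar_absorbed - p_cond_conv
    return p_cooling


def steady_state_temperature(p_cooling, t_amb, t_min=150.0, tol=1e-6):
    """Bisection for P_cooling(T) = 0 on [t_min, t_amb]; assumes P_cooling increases with T."""
    if p_cooling(t_amb) <= 0.0:
        return t_amb  # no net cooling at ambient: treated here as zero temperature drop
    lo, hi = t_min, t_amb
    if p_cooling(lo) > 0.0:
        raise ValueError("no steady state inside the search bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p_cooling(mid) > 0.0:
            hi = mid   # still cooling at mid, so the root lies below
        else:
            lo = mid
    return 0.5 * (lo + hi)


def rc_reward(p_cooling, t_amb, threshold=5.0):
    """Temperature drop below ambient; drops under the threshold are set to -threshold."""
    drop = t_amb - steady_state_temperature(p_cooling, t_amb)
    return drop if drop >= threshold else -threshold


p = make_toy_cooling_power(t_amb=300.0)
print(rc_reward(p, 300.0))
```

Using the temperature drop rather than the cooling power as the reward keeps the threshold in physically interpretable units, which matches the motivation given above.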

The design and optimization results of the RC emitter are presented in Fig. 5a. SiO2 and TiO2 are ultimately chosen as the materials for the optimal RC structure. The layer thicknesses of the optimal RC emitter are again irregular and aperiodic. The emissivity spectra of the designed and fabricated structures are shown in Fig. 5b. The designed RC emitter exhibits near-zero emissivity in the solar spectrum band, allowing it to reflect the incoming solar radiation. In contrast, a high emissivity is obtained within the AM, enabling it to radiate heat efficiently to outer space. Because the thicknesses of the fabricated sample differ from the designed values, the two emissivity spectra are not completely consistent. The reward *R* of the optimal RC emitter is 16.99, which means that it can, in theory, remain 16.99 ℃ below the ambient temperature at thermal equilibrium. The cooling power at the initial temperature is 132.40 W/m². Both the equilibrium temperature difference and the cooling power demonstrate the excellent performance of the designed RC emitter. The normalized electric field intensities of the optimal structure in the visible wavelength band and in the AM are illustrated in Fig. 5c, indicating the strong reflection of the Ag substrate and the high emissivity caused by the electric field enhancement, respectively. Furthermore, the emissivity spectrum is angle-independent for incident angles below 80°, as shown in Fig. 5d.

The optimization process is quantitatively shown in Fig. 6a. In the early stage of optimization, the reward *R* increases sharply, which means that DQN can quickly identify suitable materials for the RC emitter and then optimizes under this material combination until the reward curve levels off (as shown in Fig. S3). The material combination of the structure yielding 50% of the maximum reward is Si/SiO2, which indicates that DQN subsequently replaces Si with TiO2 to achieve better cooling performance, as shown in Fig. S3a. During the plateau stage, the thickness of each layer is continuously optimized to further enhance the radiative cooling performance. After calculating less than 2% of the candidate structures, DQN already finds an RC emitter with a steady-state temperature drop of 14.94 ℃ below the ambient temperature. Over the 1000 iterations, only 6.31% of the candidate structures need to be calculated to find the RC emitter structure with the maximum reward. To further exhibit the details of the optimization, the parametric distribution curves of each layer thickness are shown in Fig. 6b. In addition to the material combination of the optimal RC emitter, the other material combinations explored are shown in Fig. 6c. ZnS and Si3N4 also exhibit potential as materials for the RC emitter, in addition to TiO2 and SiO2. The occurrence of less frequent material combinations can be explained by the random initialization of the DQN and the random action selection of the epsilon-greedy exploration algorithm in DQN.

In the final part of this study, we adopt DQN to tackle a more demanding task, namely achieving peak emissivity at a fixed wavelength for GS. More specifically, the target is a narrow-band emission peak with high emissivity at the wavelength of the absorption peak of the detected gas, while the emissivity at other wavelengths is zero to eliminate the impact of absorption by other gases. Here, we take carbon dioxide (CO2) as the target gas, which has an absorption peak at 4.26 µm. The reward *R* is defined as the product of the emissivity at the target wavelength and the Q-factor of the emission peak:

$$R=\varepsilon_t \times Q \tag{3}$$

where *Q* rewards a narrow-band emission peak in the GS WS-EM and \(\varepsilon_t\) rewards a high emissivity at the target wavelength of 4.26 µm. Maximizing the product of the two terms yields a GS emitter with a narrow-band emission peak that matches the carbon dioxide absorption peak. Based on preliminary trials, the iteration threshold is set to 2. The optimization is run 5 times with 1000 iterations each to reduce the effect of randomness on the materials and structures.
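A minimal sketch of how a reward of the form of Eq. (3) might be evaluated from a sampled emissivity spectrum is given below. The text does not specify how *Q* is computed, so the sketch assumes the common definition of peak wavelength divided by full width at half maximum (FWHM). The function names and the Lorentzian test spectrum are likewise hypothetical.

```python
def interp(x, xs, ys):
    """Linear interpolation of the samples (xs, ys) at x; xs must be increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x outside the sampled range")


def peak_and_fwhm(xs, ys):
    """Wavelength of the tallest peak and its FWHM, with interpolated half-maximum crossings."""
    k = max(range(len(ys)), key=ys.__getitem__)
    half = 0.5 * ys[k]
    i = k
    while i > 0 and ys[i - 1] > half:
        i -= 1
    j = k
    while j < len(ys) - 1 and ys[j + 1] > half:
        j += 1
    if i == 0 or j == len(ys) - 1:
        raise ValueError("peak is not resolved inside the sampled band")
    left = xs[i - 1] + (half - ys[i - 1]) / (ys[i] - ys[i - 1]) * (xs[i] - xs[i - 1])
    right = xs[j] + (ys[j] - half) / (ys[j] - ys[j + 1]) * (xs[j + 1] - xs[j])
    return xs[k], right - left


def gs_reward(xs, ys, target_um=4.26):
    """Eq. (3): emissivity at the target wavelength times Q (assumed here to be peak / FWHM)."""
    eps_t = interp(target_um, xs, ys)
    lam_peak, width = peak_and_fwhm(xs, ys)
    return eps_t * (lam_peak / width)


# Hypothetical Lorentzian peak at 4.26 um with a FWHM of 0.1 um (illustrative only).
wavelengths = [3.0 + 0.001 * n for n in range(3001)]
spectrum = [1.0 / (1.0 + ((lam - 4.26) / 0.05) ** 2) for lam in wavelengths]
print(gs_reward(wavelengths, spectrum))
```

Because the reward is a product, a spectrum with a sharp peak away from 4.26 µm scores poorly through a small \(\varepsilon_t\), and a broad peak at the target scores poorly through a small *Q*.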

As shown in Fig. 7a, Si and SiO2 are chosen by DQN as the materials of the GS emitter. The emissivity spectra of the optimized structure are shown in Fig. 7b. The simulation shows that the optimized structure produces a sharp, high emissivity peak at 4.26 µm, while the emissivity outside the narrow band is close to zero. The peak emissivity is 0.9996, and the reward *R* of the structure is 60.62. This result shows that the designed WS-EM is well suited to CO2 sensing. Owing to the thickness deviation of the fabricated sample, the measured emissivity peak is shifted from the target wavelength but remains within the CO2 absorption peak: it is located at 4.3 µm with a peak value of 0.905. In addition, the sample exhibits some weak emission outside the absorption peak. Figure 7c displays the normalized electric field intensities of the optimal structure at 4.26 µm and 5 µm. Owing to the excitation of the localized Tamm plasmon state, the electric field intensity is significantly enhanced at a position 0.3 µm from the top of the substrate, resulting in peak emissivity at 4.26 µm. In contrast, there is no notable enhancement of the electric field intensity at 5 µm, resulting in near-zero emissivity at this wavelength. The angle-dependent emissivity spectrum is displayed in Fig. 7d. The emissivity is angle-independent only within 30°, but this does not affect gas sensing, since the emitter typically faces the detected gas in the normal direction.

The optimization process of the GS emitter is presented in Fig. 8a. In the early stage of optimization, the emitter has only a small emissivity peak within the investigated band, and the peak wavelength deviates from 4.26 µm. As the iterations progress, more suitable material combinations are found, so that the emissivity peak becomes more pronounced and its wavelength gradually approaches the target wavelength of 4.26 µm. Eventually, a near-perfect emissivity peak is achieved at 4.26 µm with a Q-factor of 60.64. Further insights into the structure evolution during the optimization process can be obtained from Fig. S5. The distribution of each layer thickness and the material combinations are shown in Fig. 8b and 8c, respectively. Compared with the RC emitter, a greater variety of material combinations is generated, but they occur only rarely. This indicates that the combination of Si and SiO2 is undoubtedly the most suitable choice for achieving the target emissivity spectrum for CO2 sensing.
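As a quick consistency check, assuming that *Q* in Eq. (3) is the Q-factor quoted above, the reported peak emissivity and Q-factor reproduce the reported reward of the optimal GS structure:

$$R=\varepsilon_t \times Q = 0.9996 \times 60.64 \approx 60.62$$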