Self-controlling photonic-on-chip networks with deep reinforcement learning

We present a novel photonic chip design for high-bandwidth four-degree optical switches that support high-dimensional switching mechanisms with low insertion loss, low crosstalk, low power consumption, and a short switching time. Such four-degree photonic chips can be used to build an integrated full-grid Photonic-on-Chip Network (PCN). With four distinct input/output directions, the proposed photonic chips are superior to current bidirectional photonic switches, with which a sizable PCN can conventionally be constructed only as a linear chain of bidirectional chips. Our four-directional photonic chips are more flexible and scalable for the design of modern optical switches, enabling the construction of multi-dimensional photonic chip networks that are widely applicable to intra-chip communication networks and photonic data centers. More notably, our photonic networks can be self-controlling with our proposed Multi-Sample Discovery model, a deep reinforcement learning model based on Proximal Policy Optimization. On a PCN, we can optimize many criteria such as transmission loss, power consumption, and routing time, while preserving performance as the network scales up and changes dynamically. Experiments on simulated data demonstrate the effectiveness and scalability of the proposed architectural design and optimization algorithm. The insights perceived by the model make the constructed architecture a self-controlling photonic-on-chip network.

The streaming of immersive multimedia content, the migration of traditional software applications to the cloud computing platform, the widespread deployment of data mining programs and big data applications, and the broadband access demands have led to the explosive growth of bandwidth consumption [1][2][3] . The advent of the data-intensive spectrum has created a playground for large-scale photonic switches, which play a pivotal role for the next-generation telecommunication networks. More noticeably, the large-scale photonic switches also create a premise for developing advanced data center networks and the state-of-the-art photonic neural information processing systems. Recently, silicon photonics has emerged as a powerful platform for realizing high-density photonic integrated circuits because silicon photonics can enable the monolithic integration of complex circuits at a reasonable cost and high yield by utilizing the advanced features of complementary metal-oxide-semiconductor manufacturing technology [4][5][6] .
Various large-scale configurations have been introduced for silicon photonic switches 7-12 , enabling advances in broad bandwidth, high transmittance, fast response time, and low power consumption 13,14 .
Recently, metasurfaces and phase-change materials have been introduced as promising platforms for next-generation active and low-loss optics, owing to their unprecedented ability to control incident electromagnetic fields in the subwavelength regime 15,16 and to their agile, reconfigurable photonic functionalities, whose adjustable properties allow full manipulation of the key features of photons, the information carriers in photonic platforms [17][18][19][20][21][22] . However, such technologies are not yet mature enough for integration with CMOS technology at the silicon-on-insulator (SOI) wafer scale. Hence, most fully programmable and scalable switching fabrics in large-scale silicon photonic switches are currently constructed from multistage structures that rely on phase-shifting control, for instance Mach-Zehnder Interferometers (MZIs) [23][24][25] , multi-mode interference (MMI) couplers 26,27 , and microring resonators (MRRs) 28 . N×N optical switch fabrics are built by interconnecting multiple stages of elementary switch cells in an available switching topology, such as Benes types, via passive waveguide elements [29][30][31][32][33][34] . However, although current silicon photonic switches can support a large number of input/output ports (up to 128 × 128), they are still limited to bidirectional switching and do not support higher dimensions. Furthermore, some approaches have been developed by using multistage structures for wavelength division multiplexing (WDM) 35 , mode division multiplexing (MDM) 36

intelligence (AI) and spreading in wide applications such as strategy games 53 , autonomous driving 54 , autonomous control 55 , language processing 56 , mobile robots 57 , IoT security 58 , and communications 59,60 .
In this investigation, we propose Multi-Sample Discovery (MSD), which builds on Proximal Policy Optimization (PPO), one of the state-of-the-art on-policy deep reinforcement learning models, for routing in optical networks 61,62 . MSD is a hybrid deep learning model that overcomes PPO's limited exploration ability. It is inspired by ideas such as Curiosity-driven Exploration (CDE) 63 and Hindsight Experience Replay (HER) 64,65 , which have been shown to improve performance by creating denser reward signals from the environment. In MSD, we create an Advisor and a Sample Extraction Buffer that provide auxiliary exploration and generate multiple efficient samples in the PCN 66 , so that MSD can converge faster to an optimum. To verify the effectiveness of MSD for routing in photonic-chip networks, we design a simulated PCN environment. We let MSD optimize transmission loss, power consumption, and routing time by providing the optimal routing path. Our experiments compare MSD with several other state-of-the-art RL models to assess whether MSD is well suited to PCNs. The results show that MSD significantly outperforms the others in reducing transmission loss, power consumption, and routing time. In addition, MSD speeds up the training process while effectively handling noisy or sparse signals. Based on the experiments, we find that MSD is easy to apply to routing in optical networks, provides an optimal strategy under strict conditions, adapts to dynamic environments such as a PCN, and remains stable when the number of nodes in the network is large. Such interior-deductive capabilities make the constructed PCN comparable to an all-optical spiking neurosynaptic network 67 .
Figure 1 shows the schematic diagram of the proposed full-grid photonic-chip network (PCN), which is constructed by connecting multi-degree optical switches as unit cells in two-dimensional space following the full-grid topology. Each unit cell is a multi-degree photonic switch based on on-chip silicon waveguide structures, and therefore the proposed PCN can be monolithically integrated on a standard silicon-on-insulator wafer using CMOS-compatible fabrication processes. The PCN is an M×N rectangular structure comprising M rows and N columns of photonic chips. Each multi-degree photonic switch supports four-degree non-blocking switching connections. Furthermore, the proposed photonic switch can be dynamically programmed to realize arbitrary connections among the east, west, south, and north directions of the 3×3×3×3 structure by driving thermo-optic phase shifters that control the light at the crossroads. In the remainder of this section, we first describe the components of a photonic chip. We then explain the control mechanism and report the experiments that confirm the favorable characteristics of our hardware design.

Components of a photonic chip.
Overall structure of a photonic chip. Figure 2 shows the structure of a four-degree 3×3×3×3 photonic chip based on silicon photonic waveguides. The switch has twelve input/output ports distributed in four groups (North, South, East, and West), where each group has three input/output ports. Inside the switch, a 3×3 optical waveguide cross-connect switch (OWXC) plays the central role and is responsible for connecting and switching optical channels at the waveguide level. To the best of our knowledge, such a structure has never been suggested before. The three outputs of the OWXC component are connected to three 1×3 optical switches to divert optical channels in three different directions. The outputs of each 1×3 switch are connected to the redirection couplers that allow the connection to be redirected in three outbound directions. At the center of the chip is a waveguide crossing mechanism that guides optical waves through the intersections. All major components of a photonic chip are sketched in Fig. 2. The whole structure is constructed on silicon-on-insulator (SOI) material using channel waveguides and patterned by electron beam lithography or deep ultraviolet (DUV) lithography 68,69 . Access waveguides are silicon nanowires with a width of w = 500 nm to support the quasi-transverse electric (quasi-TE) single-mode transmission condition at 1550 nm 70 .
Figure 1. The schematic diagram of the proposed multi-degree optical switch based on the silicon photonic waveguide structure. This is a 2D grid of M×N photonic chips. Each photonic chip is a 3×3×3×3 four-directional structure controlled by thermo-optic phase shifters using titanium metallic thin films.

The OWXC component. The operation of the 3×3 OWXC element is based on the multimode interference principle, which reproduces self-images of the input field periodically 71 . The OWXC has two 3×3 MMI couplers placed at the input and the output, playing the roles of the optical channel divider and combiner. The multimode length of these MMI couplers is $L_1 = 3L_\pi/2$. The OWXC also has two 4×4 MMI couplers with the multimode length $L_2 = 3L_\pi/4$ in the middle and six controllable phase shifters (marked in red in Fig. 2). The central output of the first 3×3 MMI coupler and the central input of the second 3×3 MMI coupler form a two-mode waveguide acting as a Mach-Zehnder Interferometer (MZI) passing through the two 4×4 MMI couplers. For switching, the controllable phase shifters (PSs) must apply a phase difference of $\pm\pi/2$. The half beat length $L_\pi$ of the multimode waveguide is defined by 71 :

$$L_\pi = \frac{\pi}{\beta_0 - \beta_1} \approx \frac{4 n_{eff} W_e^2}{3\lambda_0},$$

where $\beta_0$ and $\beta_1$ are the propagation constants of the fundamental and the first-order modes, determined by the following relation 71 :

$$\beta_\nu \approx k_0 n_{eff} - \frac{(\nu + 1)^2 \pi \lambda_0}{4 n_{eff} W_e^2},$$

where $\nu$ is the order of the guided mode in the core waveguide, $k_0 = 2\pi/\lambda_0$ is the vacuum wavenumber, and $n_{eff}$ is the effective refractive index of the silicon core waveguide layer, obtained by solving the wave-propagation differential equation numerically. $\lambda_0$ is the operation wavelength in vacuum, and $W_e$ is the effective width of the MMI coupler 71 :

$$W_e = W_{MMI} + \frac{\lambda_0}{\pi}\left(\frac{n_c}{n_{eff}}\right)^{2\sigma}\left(n_{eff}^2 - n_c^2\right)^{-1/2},$$

where $\sigma = 0$ for TE polarization and $\sigma = 1$ for TM polarization, $W_{MMI}$ is the geometric width of the MMI coupler, and $n_c$ is the cladding refractive index (SiO2 material).

Figure 2. The architecture of our proposed four-degree 3×3×3×3 silicon photonic chip.
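As a rough numerical illustration of the self-imaging relations cited from ref. 71, the sketch below evaluates the effective width W_e = W_MMI + (λ0/π)(n_c/n_eff)^(2σ)/sqrt(n_eff² − n_c²) and the half beat length L_π ≈ 4 n_eff W_e²/(3 λ0), then scales L_π by the factors quoted for the couplers in this section. The effective index and the geometric width used here are hypothetical placeholders, not values reported in the paper.

```python
import math


def effective_width(w_mmi, wavelength, n_eff, n_clad, sigma=0):
    """Effective MMI width W_e (sigma = 0 for TE, 1 for TM polarization)."""
    penetration = (
        (wavelength / math.pi)
        * (n_clad / n_eff) ** (2 * sigma)
        / math.sqrt(n_eff ** 2 - n_clad ** 2)
    )
    return w_mmi + penetration


def half_beat_length(w_e, wavelength, n_eff):
    """Half beat length L_pi = 4 * n_eff * W_e^2 / (3 * lambda_0)."""
    return 4.0 * n_eff * w_e ** 2 / (3.0 * wavelength)


# Hypothetical example values (assumptions, not taken from the paper); SI units.
WAVELENGTH = 1.55e-6  # operation wavelength in vacuum, m
N_EFF = 2.85          # assumed effective index of the silicon core layer
N_CLAD = 1.444        # SiO2 cladding index near 1550 nm
W_MMI = 4.0e-6        # assumed geometric MMI width, m

w_e = effective_width(W_MMI, WAVELENGTH, N_EFF, N_CLAD, sigma=0)
l_pi = half_beat_length(w_e, WAVELENGTH, N_EFF)

# Multimode lengths quoted in the text, expressed as multiples of L_pi.
lengths = {
    "L1 (3x3 OWXC couplers)": 1.5 * l_pi,
    "L2 (4x4 OWXC couplers)": 0.75 * l_pi,
    "L3 (1x3 switch couplers)": 1.0 * l_pi,
    "L4 (redirecting couplers)": 3.0 * l_pi,
}
for name, value in lengths.items():
    print(f"{name}: {value * 1e6:.1f} um")
```

Because L_π grows with the square of W_e, small changes in the coupler width shift every section length noticeably, which is why the widths would need to be fixed before the lengths are chosen.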
The 1×3 switches. The proposed multi-degree photonic switch contains three 1×3 switches. Each 1×3 switch consists of a 1×3 MMI coupler in the first section and a 3×3 MMI coupler in the second section. They have the same multimode region with the multimode length $L_3 = L_\pi$. The two outermost access arms of the two MMI couplers are linked via two controllable phase shifters. Depending on which phase-difference combination is applied, (2π/3, 0), (0, 2π/3), or (−2π/3, −2π/3), the switch selects the output at the left, middle, or right side 72 . A numerical simulation of the electric field envelope distribution and the wavelength-dependent transmission spectrum of a 1×3 MMI coupler are shown in the subfigures of Fig. 2. The transmission spectrum shows that the coupler acts as a nearly ideal three-way divider, with the power at the three output ports (Out1, Out2, Out3) close to a 1/3 dividing ratio (−4.77 dB) over a 20 nm bandwidth around the 1550 nm central wavelength.
MMI couplers. Redirecting waveguides (denoted by the blocks A1÷A5, B1÷B5, C1÷C5 and D1÷D5 in Fig. 2) are 3×3 MMI couplers with the multimode length $L_4 = 3L_\pi$. These couplers operate on the general interference mechanism (GI-MMI), in which the input optical field is mirrored about the central line of the multimode region and reproduced at the output. This mechanism allows optical channels in the proposed multi-degree switch to be redirected flexibly and completely. Images of the electric field distribution and the wavelength-dependent transmission spectra are shown in the corresponding insets of Fig. 2. The coupler acts as a near-perfect waveguide crossover with a high transfer rate (> 96%) and a small interference ratio not exceeding −25 dB, ensuring high optical switching performance over a wide 20 nm bandwidth, as seen in Fig. 2.
Waveguide crossing structure. The central part of the structure is a waveguide crossing structure consisting of three perpendicular silicon nanowires crossing each other. The operating principle is based on the multimode interference effect at the intersection point of the perpendicular waveguides. Single-mode waveguide crossings are indispensable building and connecting blocks for complex photonic circuits in a system-on-chip. By utilizing fully etched and shallowly etched waveguides and linearly tapered waveguides in the MMI coupler region, the waveguide crossing attains ultra-low loss and imbalance 73,74 . The transfer characteristic of the waveguide crossing designed for the proposed multi-degree switch, illustrated in Fig. 2, shows that the transmission loss of this structure is low, fluctuating only between −0.1 and −0.2 dB over a 20 nm wavelength bandwidth.
Control mechanism. Thermo-optic phase shifters controlled by an external voltage source play a critical role in realizing a wide range of integrated photonic applications, such as neural networks 75 and reconfigurable photonic chips 76 , owing to their ultrafast temporal response, high flexibility, high accuracy, compact size, and CMOS compatibility. To change the phase angle in silicon photonic waveguides, the thermo-optic effect is applied through metallic heaters, such as a heater based on a Ti thin film, which changes the silicon refractive index and hence the phase according to the following relation 26,77 :

$$\Delta\varphi = k\,\Delta n\,L_{PS} = \frac{2\pi}{\lambda}\,\frac{dn}{dT}\,\Delta T\,L_{PS},$$

where $\Delta T = T - T_0$ is the change in temperature; $T_0 = 300$ K is the room temperature; $dn/dT = 1.84 \times 10^{-4}\ \mathrm{K^{-1}}$ is the thermo-optic coefficient of silicon; $k = 2\pi/\lambda$ is the wavenumber; $\Delta n$ is the total index change of the silicon material; and $L_{PS}$ is the length of the microheater. A conventional thermo-optic phase shifter has a metallic heater placed on top of the silicon waveguide, and the phase shift results from the combination of heating and the thermo-optic coefficient of silicon. Metals are typically high-loss in the third telecom window. Therefore, an upper cladding layer of silica is applied to optically isolate the heater from the silicon waveguide at a reasonable spacing: a large gap degrades the power consumption and/or the switching speed, whereas a small gap makes the phase shifter suffer a high absorption loss from the plasmonic effect 78 . Microheater-based phase shifters can attain a short switching time of a few µs and a relatively low electrical power consumption of several tens of mW, which benefits optical performance. This explains why microheater-based controllable phase shifters are preferable to other kinds, such as carrier-effect-based phase shifters.
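To make the thermo-optic relation concrete, the following minimal sketch evaluates Δφ = (2π/λ)(dn/dT) ΔT L_PS and inverts it for the temperature rise that yields a given phase. It assumes a uniform temperature rise along the whole heated waveguide core, which is an idealization; the function names are illustrative, and the 200 µm heater length is simply the active length quoted for the TOPS in this paper.

```python
import math

DN_DT = 1.84e-4        # thermo-optic coefficient of silicon, 1/K (from the text)
WAVELENGTH = 1.55e-6   # operation wavelength, m


def phase_shift(delta_t, heater_length, wavelength=WAVELENGTH, dn_dt=DN_DT):
    """Phase shift in radians for a uniform core temperature rise delta_t (K)."""
    delta_n = dn_dt * delta_t  # index change of the silicon core
    return 2.0 * math.pi / wavelength * delta_n * heater_length


def temperature_for_phase(target_phase, heater_length,
                          wavelength=WAVELENGTH, dn_dt=DN_DT):
    """Uniform core temperature rise (K) needed to reach target_phase (rad)."""
    return target_phase * wavelength / (2.0 * math.pi * dn_dt * heater_length)


L_PS = 200e-6  # heater length, m
dt_pi = temperature_for_phase(math.pi, L_PS)
print(f"Core temperature rise for a pi shift: {dt_pi:.2f} K")
print(f"Check: phase at that rise = {phase_shift(dt_pi, L_PS) / math.pi:.3f} pi")
```

Under this idealization the required rise scales inversely with the heater length, so a longer heater trades footprint for a lower operating temperature.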
The thermo-optic phase shifter (TOPS) utilized in the proposed multi-degree switch consists of a metallic Ti thin film with a thickness of $\delta_{Ti} = 100$ nm and a heater width of $W_{PS} = 1\ \mu$m, placed on top of the silicon core layer at a gap $h_{SiO_2}$ in the range from 700 nm to 1000 nm. The active length of the TOPS is initially set to $L_{PS} = 200\ \mu$m to obtain the optimal value of the product $P_\pi \cdot \tau$ during the operation of the optical switch 26 . Figure 3a describes the structure in three-dimensional space, and Fig. 3b shows the side view and the cross-section of the designed TOPS. Figure 3c,d respectively show the distributions of the index change ($\Delta n$) and the temperature rise ($\Delta T$) in the silicon core layer at the switching state, when electric power is applied to reach the required phase difference of π radians, obtained by Finite Element Method (FEM) simulation. The temperature at the metallic heater must increase by about 68.76 K to reach the phase difference of π radians. Figure 3e presents the shifted phase angle ($\Delta\varphi$) as a linear function of the electric power consumption ($P_{\Delta\varphi}$) of the TOPS, simulated with the FEM-based multi-physics tool. The powers needed to reach the required phase angles of π/2 (90°), 2π/3 (120°), −2π/3 (240°) and −π/2 (270°), measured from simulation data, are 9.75 mW, 12.8 mW, 24.8 mW and 27.85 mW, respectively. Figure 3f,g illustrate simulation results of the electric switching power consumption and the switching time, respectively, as functions of the isolation gap $h_{SiO_2}$ between the silicon core layer and the metallic Ti heater.
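The four simulated operating points quoted above can be used to check the stated linear relation between phase and heater power. The sketch below fits a straight line through the origin by least squares; forcing the line through the origin (zero power at zero phase) is an assumption made here for simplicity, not something stated in the paper.

```python
# Simulated operating points quoted in the text: (phase in degrees, power in mW).
POINTS = [(90.0, 9.75), (120.0, 12.8), (240.0, 24.8), (270.0, 27.85)]


def fit_through_origin(points):
    """Least-squares slope of y = slope * x (no intercept)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / sxx


slope = fit_through_origin(POINTS)            # mW per degree
p_pi = slope * 180.0                          # interpolated power for a pi shift
max_residual = max(abs(y - slope * x) for x, y in POINTS)

print(f"slope = {slope:.4f} mW/deg")
print(f"interpolated P_pi = {p_pi:.2f} mW")
print(f"largest deviation from the fitted line = {max_residual:.2f} mW")
```

The interpolated power for a π shift is an estimate derived from the fit, not a value reported in the text.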
Here, the switching power consumption ($P_{\Delta\varphi}$) is a specific parameter representing the power efficiency for reaching a phase shift of $\Delta\varphi$, which can be determined via a modified two-dimensional treatment of the heat-flow model with lateral spreading as follows 79 : where $K_{SiO_2} = 1.4$ W/(m·K) is the thermal conductivity of SiO2, $\lambda$ is the operation wavelength, $W_{PS}$ is the width of the Ti metal film in the lateral direction, and $(\partial n/\partial T)_{\Delta\varphi}$ is the ratio of the index change to the temperature change required to reach the phase change $\Delta\varphi$. The switching time, characterized by the response time of the TOPS, is directly related to the cut-off frequency by $\tau = 1/(e \cdot f_{cut\text{-}off})$, where $e \approx 2.718281828459$ is the base of the natural logarithm. The cut-off frequency is in turn directly related to the switching power as follows 80 : where $\rho_{SiO_2} = 2.203\ \mathrm{g/cm^3}$ is the density of silica, $C_{SiO_2} = 0.703$ J/(g·K) is its specific heat capacity, and $A$ denotes the effectively heated cross-section area, which depends on the geometry parameters of the TO phase shifter.
To supply the heat for creating various temperature change levels in the switching operation, each microheater needs to be connected to an individual pulsed-voltage source. The pulsed-voltage sources ideally provide 5 V peak to peak at a repetition rate of 12 kHz, superimposed on a DC (direct current) biasing voltage applied across two contact points at the beginning and the end of the microheater along its length $L_{PS}$, with wire bonding pads made of noble metals to ensure excellent electrical conduction 81 .

Power transfer and validation experiments.
In general, a photonic device should have low transmission loss over a wide wavelength bandwidth. In particular, because the proposed photonic chip network is designed to be a full-grid, large-scale chip network at the waveguide-capacity level, it should ensure a high optical signal-to-noise ratio from any input port to any output port at a specific wavelength in the third telecom window (near the 1550 nm region). This feature is demonstrated using transfer matrix relations. Optical power transfer functions must first be verified in the multi-degree M×N switch matrix for each switching state, i.e., by sweeping all switch cells through the straightforward, cross, left-hand turning, and right-hand turning states, which requires a full optical power transfer characteristic map. Consider an input port i and an output port j in any direction (East, West, North, South). Let $\rho_i$ be the injected power level at the input port i, and $\rho_{ij}$ the power level at the output. Let $\sigma_{ijk}$ be the leakage power to another port k, where $k \neq j$. The most important parameters describing the optical performance of light paths in an optical switch are the insertion loss $\chi_{ij}$ and the crosstalk ratio $K_{ijk}$, which are defined as follows 82 : For each switching state of a pair from input port i to output port j, the optical power transmission function is symmetric, meaning $\varepsilon_{ij} = \varepsilon_{ji}$. The aggregated crosstalk power at full switch load to an output port k is the total undesired power leaked to k from all i-to-j transmission paths, which can be expressed as: The extinction ratio for output k can be written as: The insertion loss and crosstalk in dB can be expressed by:

$$\mathrm{I.L} = 10\log_{10}(\chi_{ij}), \qquad \mathrm{Cr.T} = 10\log_{10}(\mu_k). \tag{11}$$

From Fig. 4, one can see that the insertion-loss characteristic curves vary less at wavelengths below 1550 nm.
All characteristic curves gradually increase over the 10 nm bandwidth from 1540 nm to 1550 nm, and they are relatively close to one another in the wavelength range from 1545 nm to 1550 nm. Moreover, almost all insertion losses attain their optimal values at the central wavelength of 1550 nm, in agreement with the design targets, because all discrete elements are optimized for insertion loss at the central wavelength of 1550 nm. Furthermore, all insertion-loss characteristic curves decrease when the operation wavelength is larger than 1550 nm. However, the characteristic curves split into two major groups. The first group consists of the curves of connection paths between waveguide channels in adjacent input/output ports, for example, the connection paths from $I_i$ to $K_j$ or $O_j$ channels (i, j = 1, 2, 3). The second group includes the curves of connection paths between waveguide channels in the vertical and horizontal directions that pass through the waveguide crossing region. The insertion-loss spectra in the first group fall off gradually, like the spectral response of a silicon multimode waveguide, as a result of the loss profile of crystalline silicon and the phase-matching condition no longer being preserved. Among these connection paths, the transmission of the cross channels is better than that of the central-straightforward channels; for instance, the insertion losses of the $I_3$-$K_1$ and $I_3$-$K_2$ connections are better than the insertion loss of the $I_2$-$K_2$ connection. This is because the transmission of the outer arms of the 3-dB MMI coupler, which follows the general interference regime, is better than the transmission of the central inner arm of the MMI coupler, which follows the symmetric interference regime, as seen in Fig. 2.
In contrast, the transmission spectra in the second group are always lower than those of the first group, owing to the considerable insertion loss of the connection paths that pass through the waveguide crossing sections. In addition, when the operation wavelength is larger than 1550 nm, the transmission curves of connection paths in the second group fall off sharply because, besides the unpreserved phase-matching condition, these paths suffer a significant accumulated loss from the waveguide crossing region, as seen in the inset showing the transmission of the waveguide-crossing element. Therefore, the 3-dB bandwidths of straightforward cross-connection paths are narrower than the 3-dB bandwidths of adjacent cross-connection paths in a multi-degree optical switch. Furthermore, one can see that the insertion loss attains a best value of about 0.25 dB and a worst value of about 1.1 dB at the central wavelength of 1550 nm. Such low insertion loss and crosstalk levels are maintained over a relatively wide bandwidth of 20 nm, demonstrating the excellent performance of the proposed multi-degree switch.
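The decibel conversion of Eq. (11) can be illustrated with a short sketch. It assumes, purely for illustration, that the linear insertion-loss term is the ratio of output to input power and that the aggregated crosstalk term is the total leaked power divided by the signal power at that port; the function names and the power values below are hypothetical.

```python
import math


def to_db(ratio):
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(ratio)


def insertion_loss_db(p_in, p_out):
    """Insertion loss in dB, assuming the linear term is p_out / p_in."""
    return to_db(p_out / p_in)


def aggregated_crosstalk_db(p_signal, leaked_powers):
    """Crosstalk in dB, assuming the linear term is sum(leaked) / p_signal."""
    return to_db(sum(leaked_powers) / p_signal)


# Hypothetical powers in mW.
il = insertion_loss_db(p_in=1.0, p_out=0.944)
xt = aggregated_crosstalk_db(p_signal=0.9, leaked_powers=[0.001, 0.002])
print(f"insertion loss = {il:.2f} dB, aggregated crosstalk = {xt:.2f} dB")
```

With this sign convention a lossy path gives a negative dB value, so the magnitudes quoted in the text correspond to the absolute values.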

Routing for photonic chip network (PCN)
In the previous section, we described the components of a photonic chip and demonstrated that power can be transferred from an input port to an output port of the same chip with very low insertion loss and crosstalk. In this section, we describe our proposed routing policy for a full-blown optical switch, which is a Photonic Chip Network (PCN) of N×M photonic chips, where we need to route information from multiple inputs to multiple outputs simultaneously and dynamically.

The routing task on a PCN can be formulated as a traffic routing problem, where the PCN is considered as a road network with four directions of size 3M×3N×3M×3N. Under this formulation, transmitting a signal from an input port to an output port is equivalent to directing a traffic agent towards its designated destination, as illustrated in Fig. 5. At each time step, there might be multiple traffic agents in the road network, and our task is to specify the next waypoint for each agent to advance toward its destination. Ideally, we want to find the shortest path for each road agent, but this is a challenging task given the need to optimize for transmission criteria and avoid collisions. Furthermore, we need an efficient algorithm, especially when frequent recalculation is unavoidable due to the dynamics of the network.

Network routing is a well-studied problem, but many existing algorithms are unsuitable for a switch network. Traditional planning algorithms such as A* 83 and D* 84 are too slow for dynamic environments. Simpler heuristics such as Hill Climbing 85 are faster, but the routing path they provide for each traffic agent can be far from optimal because their greedy actions do not account for long-term consequences.
However, finding the optimal routing path is a sequential decision process, where the next way point of the agent will affect the state of the entire traffic network and the future course of actions. This problem is very much amenable to Reinforcement Learning (RL) 84,86 , and we propose to use RL to learn a routing policy.
In the remainder of this section, we first describe the main components of our RL formulation, including the state representation, the reward function, and the action space. We then describe a novel algorithm for learning the RL policy, called Multi-Sample Discovery PPO (MSD). MSD is based on Proximal Policy Optimization 87 , but it is specifically designed for switch network environments.

Reinforcement learning formulation.
We now describe the main components of our reinforcement learning formulation, which are the state representation, the action space, and the reward function.
State representation. Given a PCN with M×N photonic chips, we will use a three-dimensional tensor of size 3M×3N×2 to represent the state of the PCN at each time step. This state representation encodes the current locations of all traffic agents in the network, the cumulative transmission loss of each agent, and also the destination of each agent. This state representation is obtained as follows. First, for each photonic chip c of the PCN structure, we will construct two 5×5 matrices L c and D c to represent the state of the chip. Without counting the corner entries, each of these two 5×5 matrices has exactly twelve entries along the outer edges of the matrix, and each entry corresponds to a specific port of the photonic chip. Let L c i and D c i denote the entries of L c and D c that correspond to the i th port of the chip ( 1 ≤ i ≤ 12) . If there is an agent k at port i, we will set L c i to k and D c i to the cumulative transmission loss of agent k. If the destination of an agent k ′ is at port i, we set D c i to k ′ . Thus, we use D c to encode both the transmission losses and the destinations of agents. This is possible because an agent should not be at the destination of another agent. Second, the matrices L c and D c can be stacked to create a 5×5×2 tensor to represent the state of the photonic chip c. Third, we spatially concatenate the state representations of all photonic chips together to create a 5M×5N×2 tensor, as illustrated by S T and S T+1 in Figure 6. Finally, we resize this tensor to 3M×3N×2 , and it is used as the state representation for the PCN.
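The construction of the 5M×5N×2 tensor can be sketched as follows. This is a minimal illustration in plain Python lists; the order in which the twelve ports are assigned to the border cells is a hypothetical choice, and the final resizing to 3M×3N×2 is omitted because the text does not specify the resizing method.

```python
def border_cells():
    """The 12 non-corner border cells of a 5x5 grid, one per port.

    The port ordering (top, right, bottom, left) is an assumption.
    """
    top = [(0, c) for c in (1, 2, 3)]
    right = [(r, 4) for r in (1, 2, 3)]
    bottom = [(4, c) for c in (1, 2, 3)]
    left = [(r, 0) for r in (1, 2, 3)]
    return top + right + bottom + left


def empty_state(m, n):
    """5M x 5N x 2 tensor: channel 0 holds L (agent ids), channel 1 holds D."""
    return [[[0.0, 0.0] for _ in range(5 * n)] for _ in range(5 * m)]


def cell_of(chip_row, chip_col, port):
    """Global (row, col) of a port (0..11) on the chip at (chip_row, chip_col)."""
    r, c = border_cells()[port]
    return 5 * chip_row + r, 5 * chip_col + c


def place_agent(state, chip_row, chip_col, port, agent_id, loss):
    """Write agent id into L and its cumulative transmission loss into D."""
    r, c = cell_of(chip_row, chip_col, port)
    state[r][c][0] = float(agent_id)
    state[r][c][1] = float(loss)


def place_destination(state, chip_row, chip_col, port, agent_id):
    """Write the id of the agent whose destination is this port into D."""
    r, c = cell_of(chip_row, chip_col, port)
    state[r][c][1] = float(agent_id)


state = empty_state(m=2, n=3)                      # 10 x 15 x 2
place_agent(state, 0, 0, port=4, agent_id=1, loss=0.25)
place_destination(state, 1, 2, port=10, agent_id=1)
print(len(state), len(state[0]), len(state[0][0]))
```

Sharing channel D between losses and destinations works here only because, as stated in the text, an agent is never located at another agent's destination.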
Action space. An RL agent in a PCN has 12 possible actions at each time step: Left1, Left2, Left3, Right1, Right2, Right3, Up1, Up2, Up3, Down1, Down2, Down3, corresponding to four directions and three possible ports per direction. Depending on the action, the agent can remain on the same chip or move to an adjacent chip. For example, consider a specific agent at port O2 (Fig. 7). This agent will remain on the same chip if it takes any Up, Down, or Left action. The agent will move to port K1, K2, or K3 of the chip on the right side of the current chip if it takes the action Right1, Right2, or Right3, respectively. If the agent is already at the rightmost edge of the PCN (i.e., there is no chip on the right side of the current chip), the agent will remain at the same location if it takes a Right action.
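The transition rule described above can be sketched as a small function. The position encoding (row, column, chip side, port index) and the rule that an intra-chip action lands the agent on the named side and port are assumptions made for illustration; the text only fixes the behavior for the O2 example and for the boundary case.

```python
DIRECTIONS = ("Left", "Right", "Up", "Down")
ACTIONS = [f"{d}{i}" for d in DIRECTIONS for i in (1, 2, 3)]

OPPOSITE = {"Left": "Right", "Right": "Left", "Up": "Down", "Down": "Up"}
STEP = {"Left": (0, -1), "Right": (0, 1), "Up": (-1, 0), "Down": (1, 0)}


def step(position, action, rows, cols):
    """Apply one action to position = (row, col, side, port).

    `side` is the chip edge on which the agent currently sits.
    """
    row, col, side, port = position
    direction, index = action[:-1], int(action[-1])
    if direction != side:
        # Heading for another edge: the agent stays on the same chip.
        return (row, col, direction, index)
    d_row, d_col = STEP[direction]
    new_row, new_col = row + d_row, col + d_col
    if not (0 <= new_row < rows and 0 <= new_col < cols):
        # No neighbouring chip in that direction: the agent does not move.
        return position
    # Cross to the facing edge of the neighbouring chip.
    return (new_row, new_col, OPPOSITE[direction], index)


# An agent on the right edge (port 2) of the top-left chip in a 2 x 2 PCN.
start = (0, 0, "Right", 2)
print(step(start, "Right1", rows=2, cols=2))
print(step(start, "Up2", rows=2, cols=2))
```

Encoding the edge explicitly keeps the boundary check to a single comparison, at the cost of assuming that every port is identified by its side and index.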
Reward function. The reward of an agent after each action is set to be the negative of the transmission loss. We first run a PCN simulator to compute the transmission loss corresponding to the photonic chip's input and output. When the agent takes an action at time t, the agent receives the reward $R_t = -IL$, where IL is the insertion loss defined in Eq. (11). By learning a policy to maximize the sum of rewards $\Im = \sum_{t=0}^{T} R_t$, we will obtain a routing policy that minimizes the total transmission loss.
Learning the reinforcement learning policy. To learn the optimal routing policy, we use Proximal Policy Optimization (PPO) 87 , a state-of-the-art reinforcement learning algorithm. PPO is a type of policy gradient algorithm, which is an iterative optimization procedure where the parameters of the policy are updated based on the gradient of a loss function defined from the agent's interaction with the environment and the rewards it receives. PPO is an on-policy algorithm, meaning that the agent uses its own policy to interact with the environment to generate interaction data sequences for optimizing the policy. Each interaction data sequence is called a learning episode, and in our case, it is a sequence of state-action-reward triplets as the agent is routed by the current policy from an input port to a desired output port. A learning episode can be a successful or an unsuccessful routing, depending on whether the agent reaches the designated destination. PPO is robust and easy to use, but it is not data efficient because each learning episode is used only once for training the policy. Furthermore, PPO might be trapped in a vicious cycle of bad policy and bad data, where the bad policy does not generate useful data to improve the policy. To address these problems, we develop a novel algorithm called Multi-Sample Discovery PPO (MSD), which extends PPO by maintaining a Sample Extraction Buffer (SEB) that stores learning episodes corresponding to successful routing. During training, MSD first uses its policy to generate a learning episode. If this learning episode is a successful routing, MSD does not differ from PPO; it uses the learning episode to update the parameters of the actor and the critic functions, which are the main components of the PPO algorithm 87 .
However, if the learning episode is a failed routing attempt, MSD will effectively find in SEB a successful route that shares a common node (in the traffic network) with the failed routing attempt. The part of the failed route after the common node is then replaced by that part of the successful one to create an updated learning episode that corresponds to a successful route, as illustrated in Fig. 8.
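The route repair step can be sketched on plain node sequences. This is a simplified illustration: it assumes the buffer only holds successful routes that end at the failed agent's destination, it picks the earliest common node along the failed route, and it splices node lists rather than full state-action-reward triplets. The function name and the node labels are hypothetical.

```python
def splice_with_success(failed_route, success_buffer):
    """Replace the tail of a failed route with the tail of a stored success.

    Returns the repaired route, or None if no stored route shares a node
    with the failed one.
    """
    for success in success_buffer:
        # Index of each node in the stored successful route.
        positions = {node: i for i, node in enumerate(success)}
        for j, node in enumerate(failed_route):
            if node in positions:
                # Keep the failed route up to the common node, then follow
                # the successful route from that node to the destination.
                return failed_route[:j] + success[positions[node]:]
    return None


failed = ["A", "B", "C", "X"]            # never reaches destination "T"
buffer = [["S", "C", "D", "T"]]          # stored successful route to "T"
print(splice_with_success(failed, buffer))
print(splice_with_success(["P", "Q"], buffer))
```

Turning a failed episode into a successful one in this way gives the learner a reward-bearing sample where it would otherwise have none, which is the sense in which the buffer densifies the reward signal.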
In MSD, we also introduce a novel component called the advisor function, which is maintained in addition to the actor function of the normal PPO algorithm. The advisor is essentially a special actor that is trained on the samples provided by the SEB when bad samples are encountered. The role of the advisor is to provide suggestions to the multiple asynchronous actors that are deployed to explore the environment in parallel, as illustrated in Fig. 9.
Let $\pi_{\theta_{\mathrm{actor}}}$ denote the policy function of the actors, with $\theta_{\mathrm{actor}}$ being the vector of parameters of the policy function. At each training iteration, $\theta_{\mathrm{actor}}$ is updated to the $\theta$ that maximizes the following objective 87:

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$

where the expectation $\hat{\mathbb{E}}_t$ indicates the empirical average over a finite batch of samples, $\hat{A}_t$ is an estimator of the advantage function at timestep $t$, $r_t(\theta)$ is the probability ratio between the sought-after policy and the old policy, and $\mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$ is the clipping function that clips $r_t(\theta)$ between $1-\epsilon$ and $1+\epsilon$.
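As a numerical illustration of the clipped objective described above, the following sketch evaluates the standard PPO surrogate, the batch average of min(r·A, clip(r, 1 − ε, 1 + ε)·A), on given ratios and advantage estimates. The function name and the plain-list interface are assumptions for the example; a real training loop would compute this on tensors and differentiate it.

```python
from typing import Sequence


def clipped_surrogate(ratios: Sequence[float],
                      advantages: Sequence[float],
                      epsilon: float = 0.2) -> float:
    """Empirical mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    if len(ratios) != len(advantages) or len(ratios) == 0:
        raise ValueError("ratios and advantages must be non-empty and equally long")
    total = 0.0
    for r, a in zip(ratios, advantages):
        # Clip the probability ratio to the interval [1 - eps, 1 + eps].
        clipped_r = min(max(r, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum removes the incentive to move r outside the interval.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)
```

With ratios `[1.5, 0.5]`, advantages `[1.0, -1.0]`, and ε = 0.2, the two terms are 1.2 and −0.8, so the objective is about 0.2: the gain from the first sample is capped, while the penalty on the second is kept at its more pessimistic value.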
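In PPO 87, the probability ratio is

$$r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta^{\mathrm{old}}_{\mathrm{actor}}}(a_t \mid s_t)}, \tag{14}$$

where $\theta^{\mathrm{old}}_{\mathrm{actor}}$ is the vector of policy parameters before the update, and $a_t$ is the action taken by the actor. In MSD, there is an advisor, and the advisor suggests which action each actor should perform. The usefulness of the advisor's suggestion is measured by the ratio $H_p$ between the average total sum of rewards of the advisor and that of the actors (averaging over $K$ learning episodes):

$$H_p = \frac{\frac{1}{K}\sum_{k=1}^{K} R^{\mathrm{advisor}}_k}{\frac{1}{K}\sum_{k=1}^{K} R^{\mathrm{actor}}_k},$$

where $R^{\mathrm{advisor}}_k$ and $R^{\mathrm{actor}}_k$ denote the total sum of rewards collected in the $k$-th learning episode by the advisor and by the actors, respectively. If $H_p \le 1$, the advisor function is not better than the actor function, so the actors act based on their own policy. In other words, the action $a_t$ is sampled from the policy function of the actors, i.e., $a_t \sim \pi_{\theta^{\mathrm{old}}_{\mathrm{actor}}}(a_t \mid s_t)$. The probability ratio $r_t(\theta)$ is set based on Eq. (14).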
If $H_p > 1$, the advisor function is better than the actor function, and the actors follow the actions suggested by the advisor. The action $a_t$ taken by an actor is sampled from the advisor function: $a_t \sim \pi_{\theta_{\mathrm{advisor}}}(a_t \mid s_t)$. The probability ratio is:

$$r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{advisor}}}(a_t \mid s_t)}.$$

When the performance of the actor is worse than some expected value, the advisor will revert the actor into the balanced state. In the photonic chip network environment, the reward R is obtained from the photonic simulation, while the advisor, the actor, and the critic are multi-layer perceptron networks with Exponential Linear Unit activation functions.
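The switching rule between actor and advisor can be sketched as follows. This is an illustrative reading of the description above, under stated assumptions: policies are represented as plain lists of action probabilities for one state, episode returns are assumed positive so that a ratio larger than 1 means the advisor performs better, and all function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def advisor_gain(advisor_returns: Sequence[float],
                 actor_returns: Sequence[float]) -> float:
    """Ratio of the mean episode return of the advisor to that of the actors."""
    mean_advisor = sum(advisor_returns) / len(advisor_returns)
    mean_actor = sum(actor_returns) / len(actor_returns)
    return mean_advisor / mean_actor  # assumes a non-zero, positive actor mean


def act_and_ratio(new_probs: List[float],
                  old_actor_probs: List[float],
                  advisor_probs: List[float],
                  gain: float,
                  rng: random.Random) -> Tuple[int, float]:
    """Sample an action from the behaviour policy and return (action, r_t)."""
    # The advisor drives the actor only when it is strictly better (gain > 1).
    behaviour = advisor_probs if gain > 1.0 else old_actor_probs
    action = rng.choices(range(len(behaviour)), weights=behaviour, k=1)[0]
    # Probability ratio of the sought-after policy to the behaviour policy.
    return action, new_probs[action] / behaviour[action]
```

The design choice shown here is that the denominator of the ratio always matches the policy that actually generated the action, which keeps the ratio a valid importance weight in both branches.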

Experimental evaluation
Data and environment. We perform experiments on simulated data generated by the photonic component simulator tool, with a size of 16Mb in plain text. The data consist of transmission losses from an input port (i) to an output port (j) on a photonic chip at various wavelengths. An example of the data structure for one input-output port pair is presented in Table 1. The reward $R_{ij}$ for moving from the input port (i) to the output port (j) is based on the transmission loss and total crosstalk, as specified in Eqs. (11) and (12), and is computed from the entries of Table 1. In the experiment, the PCN under investigation has a size of 36×36×36×36. The number of actors available in the PCN is equal to the number of input-output pairs in the PCN (36 actors). To train the agent to adapt its route when physical port failures occur in the network, we generate between 16 and 32 erroneous nodes, corresponding to these failed ports, by randomly choosing among the available nodes in the PCN environment. If the agent enters an erroneous node, we penalize $R_{ij}$ by doubling the transmission loss from port i to port j.
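The fault-injection procedure can be illustrated with a short sketch. It assumes, for the example only, that nodes are addressed as (row, column) positions on a 36×36 grid and that the penalty is applied to a per-hop loss in dB before the reward is computed; the function names are hypothetical, and the mapping from loss to reward (Eqs. (11) and (12)) is not reproduced here.

```python
import random
from typing import Iterable, Set, Tuple

Node = Tuple[int, int]  # assumed (row, column) address of a node in the grid


def sample_erroneous_nodes(nodes: Iterable[Node],
                           rng: random.Random,
                           low: int = 16,
                           high: int = 32) -> Set[Node]:
    """Pick between `low` and `high` (inclusive) distinct nodes to mark as faulty."""
    count = rng.randint(low, high)
    return set(rng.sample(list(nodes), count))


def penalised_loss(loss_db: float, entered: Node, erroneous: Set[Node]) -> float:
    """Double the transmission loss when the agent enters an erroneous node."""
    return 2.0 * loss_db if entered in erroneous else loss_db


if __name__ == "__main__":
    grid = [(r, c) for r in range(36) for c in range(36)]
    faulty = sample_erroneous_nodes(grid, random.Random(0))
    print(len(faulty), "erroneous nodes generated")
```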
Comparison algorithms and metrics. We compare the performance of MSD with several other state-of-the-art reinforcement learning algorithms: PPO 87, A3C 88, and HER 89,90. We use several performance metrics, including cumulative transmission loss, cumulative power consumption, and routing time.

Table 1. Example of data generated by the photonic component simulator. λ is the wavelength of one specified optical signal. The Output column is the output transmission, and Loss1, Loss2, and Loss3 represent the crosstalk of four directions of the agent going through.

Parameter settings. The model parameters of MSD are fine-tuned through a pilot experiment using a subset of the dataset, which provides the optimal values as follows. The entropy coefficient is set to 0.02, while Co-Efficient is 0.05. The clipping value ε is set to 0.2 and is set to 0.97. For the network architecture, the number of dense layers is 3, with 512 units each. The convolutional layer option is disabled, and K is set to 50. The SEB memory is 100K samples, and the number of permutations on SEB is 50.
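To make the network architecture concrete, the sketch below builds a multi-layer perceptron with Exponential Linear Unit activations in plain Python. It is only an illustration under assumptions: the three dense layers are read as three hidden layers of 512 units followed by an output layer, the input and output sizes are placeholders, the weight initialization is arbitrary, and no training code is shown.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential Linear Unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def build_mlp(n_in: int, n_out: int, rng: random.Random,
              hidden: int = 512, depth: int = 3) -> List[Layer]:
    """Create `depth` hidden layers of `hidden` units plus one output layer."""
    sizes = [n_in] + [hidden] * depth + [n_out]
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(fan_in)  # arbitrary uniform initialization
        weights = [[rng.uniform(-scale, scale) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def dense(x: List[float], layer: Layer) -> List[float]:
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def policy_probabilities(layers: List[Layer], state: List[float]) -> List[float]:
    """Forward pass: ELU on hidden layers, softmax over the output logits."""
    x = state
    for layer in layers[:-1]:
        x = [elu(v) for v in dense(x, layer)]
    logits = dense(x, layers[-1])
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

An actor for a node with four outgoing directions could then be created with `build_mlp(n_in, 4, random.Random(0))`, where `n_in` is whatever state encoding is chosen; the critic would use a single output without the softmax.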

Results and analysis.
Transmission loss. Figure 10 shows the cumulative transmission losses of MSD and several other reinforcement learning algorithms as a function of the number of training episodes. As can be seen, MSD converges faster than the other algorithms and attains a routing policy with the smallest cumulative transmission loss. Once the MSD model is trained successfully, the cumulative average losses for randomly routed optical paths range from 4 dB to 15 dB, even when the M × N size is as large as 36 × 36, in environments both with and without erroneous nodes. These span losses are within acceptable margins for the operating limit of an optical transmitting-receiving system, indicating an excellent routing quality of MSD compared with the other reinforcement learning algorithms investigated in the same experiments. For example, when the number of erroneous nodes varies randomly in the range from 16 to 32, A3C and PPO can incur average losses of up to 90 dB and 100 dB when routing the optical paths, as seen in Fig. 10. Such attenuation is so severe that no photonic system can operate under that condition. As a consequence, MSD-PPO helps the photonic network save power margin, which can be used to enlarge the network size and the propagation distance while ensuring excellent transmission quality in terms of bit error rate and optical signal-to-noise ratio in a defined bandwidth. This means that MSD-PPO can attain the largest optical spectrum among the reinforcement learning-based algorithms compared. Therefore, our PCN can support the routing operation for high-load traffic channels.
Power consumption. Figure 11 compares the power consumption levels of the four reinforcement learning algorithms on chip networks with and without erroneous nodes. In both situations, MSD leads to a final routing policy with the lowest power consumption, which does not exceed 4 W. After a successful deep-learning process, MSD keeps the power consumption under 10 W even when erroneous nodes are present.

Routing time. Figure 12 compares the routing time of MSD with that of PPO, A3C, and HER. As can be seen, MSD achieves a routing time approximately equal to that of HER. MSD outperforms both PPO and A3C with a routing time of about 5 ms, both with and without erroneous nodes, at the large PCN size examined. This result demonstrates that MSD is capable of real-time processing tasks. Recall that MSD is an extension of PPO; its performance advantage can be attributed to the contribution of the Sample Extraction Buffer. With such a short routing time, the PCN can reroute an optical channel almost instantly, maintaining a continuous connection without interruption.
Robustness to erroneous nodes. It is reasonable to conclude that MSD outperforms the other reinforcement learning algorithms on these optimization problems. To evaluate the effectiveness of different routing strategies and of the proposed reinforcement learning algorithm, we also need to examine how the trained MSD model responds to events in the physical environment of the on-chip integrated photonic network, from the worst to the best cases. This issue is vital because the CMOS process for manufacturing a monolithic silicon-photonic chip network can be imperfect, and some fundamental chip units can malfunction over time. The photonic chip network size is set to M = N = 36, which yields a search space so large that brute-force or heuristic algorithms become infeasible or inadequate. The subfigures of Fig. 13 present the cumulative transmission loss $T_C$ (Fig. 13a), the electric power consumption $P_C$ (Fig. 13b), and the routing time $\tau_R$ (Fig. 13c) when routing a random input-output pair, plotted against the number of physical connection errors for the MSD reinforcement learning model in the best and worst cases. The number of errors varies randomly from 1 to 40. In the best case, the routing process is efficient and the three performance factors increase linearly with the number of errors, thanks to the effective operation of the advisors and the environment discovery ability of the Sample Extraction Buffer. In the worst case, when the number of errors is smaller than 20, the model still performs the routing effectively. When the number of errors exceeds 20, MSD clearly has more difficulty routing optical paths, and all performance parameters fluctuate more strongly.
For the cumulative transmission loss, one can see that the loss may be as low as 4 dB to 7 dB if the route happens to follow the shortest path, for example, between adjacent input-output pairs. However, even in the worst case, MSD is still able to route effective optical paths thanks to the stable convergence and the practical feasibility of the off-policy approach. As can be seen in Fig. 13, in the worst case with the longest path, the cumulative transmission loss is below 25 dB, which is within the allowable sensitivity range of current semiconductor photodetectors. In addition, the cumulative electric power consumption is about 3.6 W in the best case and does not exceed 15 W in the worst case. Furthermore, the routing time is below 8.4 ms in the worst case, which is appropriate for real-time routing in photonic connection networks. This robustness to erroneous nodes demonstrates that our chip network design has high stability and a large error tolerance.

Scalability.
To understand the scalability of the proposed PCN and the routing algorithm, we increase the size of the PCN from 8×8 to 9×9, 18×18, 36×36, and 42×42, as illustrated in Fig. 14. When the size of the PCN is 9×9, the resulting transmission loss, power consumption, and routing time are small: 2.8 dB, 1512 mW, and 1.4 ms, respectively. With a considerably larger PCN of 42×42, the results remain good, with approximately 8.8 dB of transmission loss, 8763 mW of power consumption, and 7.6 ms of routing time. This scalability demonstrates that our chip network design is highly modular and that the routing policy is highly scalable, making them suitable for building sizeable inter-chip communication networks and photonic data centers in the future.
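The reported figures can be put in perspective with a small calculation that compares how fast each metric grows against how fast the network grows. The sketch uses only the 9×9 and 42×42 values quoted above; treating the size label N×N as a product of N·N grid positions is an assumption about how the size is counted, and the names are illustrative.

```python
# Values quoted in the text for the 9x9 and 42x42 PCN sizes.
REPORTED = {
    9: {"loss_db": 2.8, "power_mw": 1512.0, "time_ms": 1.4},
    42: {"loss_db": 8.8, "power_mw": 8763.0, "time_ms": 7.6},
}


def growth_factors(small: int, large: int) -> dict:
    """Ratio of each reported metric, and of the size product, between two sizes."""
    factors = {"size_product": (large * large) / (small * small)}
    for metric, small_value in REPORTED[small].items():
        factors[metric] = REPORTED[large][metric] / small_value
    return factors


if __name__ == "__main__":
    for name, factor in growth_factors(9, 42).items():
        print(f"{name}: x{factor:.2f}")
```

Under that assumption, the size product grows by a factor of about 21.8, while transmission loss, power consumption, and routing time grow by factors of roughly 3.1, 5.8, and 5.4, that is, clearly more slowly than the network itself.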

Broader impact
This article proposed a rectangular full-grid silicon photonic-on-chip network architecture that enables four-degree connection and switching through a novel waveguide cross-connect structure. The idea behind this structure can be extended analogously to construct higher-degree photonic switches and more complicated network topologies, as well as to scale up to a vast number of I/O connection ports. In addition, the networks become self-controlling after being fully trained by our hybrid deep reinforcement learning models, which make the proposed PCN more effective in routing optimization and in controlling multiple workers over the optical paths, achieving high performance under dynamic changes of traffic, connection quantity, and erroneous nodes in the network. This work is therefore beneficial to a wide variety of PCN-based applications in terms of transparency, adaptivity, responsibility, and optimality, including reconfigurable optical add-drop multiplexers in optical transmission nodes, all-photonic routers in petascale data centers, and fully connected layers in photonic convolutional neural networks. The proposed algorithm can also be applied to control and optimize network resources and properties in many different topologies of distributed optical fiber communication networks, such as dynamic connection/node quantity, automatic traffic protection switching, optimal routing, ultrafast channel connection permutation restoration, efficient energy consumption, and automatically updated configuration 91,92. In such prospective scenarios, the self-learning capability embedded in the control and administrative planes of distributed communication networks can carry out adaptive network configuration and resource management tasks more effectively, via the deductive sampling capability associated with the Sample Extraction Buffer of the MSD-PPO deep reinforcement learning model.

Conclusion
We have proposed a full-grid photonic switching network based on novel multi-degree silicon photonic switches, with a routing strategy operated by artificial intelligence techniques. The network architecture provides flexible bandwidth configuration for high performance while being energy efficient. Beyond low physical cost and high energy efficiency, optimizing the transmission loss and power consumption over a massive range stands out as a key challenge. The routing strategy, which can be seamlessly incorporated into the switch controller, potentially provides an additional advantage for physical-layer performance optimization at no extra cost. An enhanced version of the PPO algorithm, obtained by applying multi-sample discovery agents to PPO, exhibits significant results for the routing strategy. By defining the number of global input-output pairs of the switch topologies, we reveal their optimal paths based on the current state of the photonic-on-chip network. Our analysis shows the optical routing effectiveness in terms of transmission loss, power consumption, and routing time when applying MSD-based reinforcement learning. This routing strategy also demonstrates excellent efficiency in the presence of erroneous network nodes and fabrication tolerance errors, thus increasing the operating stability of the photonic network. Furthermore, our results show the scalability of the network capacity, demonstrated via both simulation and test platforms, even for moderate-scale silicon switches. These properties make the proposed silicon photonic-on-chip networks self-controlling in use and give them potential for applications in decentralized petabyte data centers, photonic neural networks, big data processing, high-performance computing, and ultrafast optical intrachip communication (Suppl. Information).