Reinforcement learning based on reservoir computing
Figure 1 shows a schematic of reinforcement learning based on reservoir computing, comprising a decision-making agent, whose actions affect the future state of the environment, and the environment, which provides a reward for every action of the agent [4]. The objective of the agent is to maximize the total reward; however, the agent initially has no information about a good action policy. Here, we consider the action-value function \(Q\left({\mathbf{s}}_{n}, {a}_{n}\right)\) for state \({\mathbf{s}}_{n}\) and action \({a}_{n}\) at the \(n\)-th time step [4]. The agent selects the action with the highest \(Q\) value in each state, so the total reward can be increased if the agent knows the value of \(Q\) in advance. In various previous studies, the action-value function was approximated by deep neural networks trained with methods such as Q-learning [2, 3]. In this study, the action-value function is instead approximated by photonic delay-based reservoir computing to reduce the learning cost and realize fast processing.
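As a minimal illustration of value-based action selection (a sketch, not the authors' implementation), the following Python snippet picks the action with the highest estimated Q value; `q_function` is a placeholder for any approximator of \(Q\left(\mathbf{s}, a\right)\), such as a deep neural network or the reservoir readout introduced below.

```python
import numpy as np

def select_greedy_action(q_function, state, actions):
    """Return the action with the largest estimated Q value in the given state.

    q_function is a placeholder for any approximator of Q(s, a), e.g. the
    reservoir readout described in the following sections.
    """
    q_values = [q_function(state, a) for a in actions]
    return actions[int(np.argmax(q_values))]
```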
Delay-based Reservoir Computing
The reservoir of delay-based reservoir computing systems consists of a nonlinear element and a feedback loop that realizes a network with many connected nodes [36]. In this scheme, the nodes in the network are virtually implemented by dividing the temporal output into short node intervals \(\theta\), resulting in an easier implementation because it does not require the preparation of a large number of spatial nodes to construct a network.
The \(n\)-th input to the reservoir is the state vector given by the environment, \({\mathbf{s}}_{n}^{T}=\left({s}_{1,n}, {s}_{2,n}, \cdots , {s}_{{N}_{s},n}\right)\), where \({N}_{s}\) is the number of state elements and the superscript \(T\) denotes the transpose. Before being injected into the reservoir, the state is preprocessed via the masking procedure defined in Eq. (1), in which the state is multiplied by a mask matrix \(\mathbf{M}\) [36, 37]. The elements of the mask are drawn randomly from a uniform distribution over \(\left[-1, 1\right]\).
\({\mathbf{u}}_{n}^{T}=\left(\mu {s}_{1,n}, \mu {s}_{2,n}, \dots ,\mu {s}_{{N}_{s},n}, b\right)\mathbf{M}=\left(\mu {\mathbf{s}}_{n}^{T},b\right)\mathbf{M},\) (1)
where \(\mathbf{M}\) is the mask matrix of size \(\left({N}_{s}+1\right)\times N\), \(\mu\) is the scaling factor for the input state \({\mathbf{s}}_{n}\), and \(N\) is the number of virtual nodes in the reservoir. The elements \({u}_{i,n}\) of the preprocessed input vector \({\mathbf{u}}_{n}\) correspond to the input data for the \(i\)-th virtual node of the photonic reservoir computer. Moreover, a fixed bias \(b\) is added in Eq. (1); it prevents the signal \({\mathbf{u}}_{n}\) from vanishing when the elements of the state \({\mathbf{s}}_{n}\) are close to zero and leads to a different nonlinearity for each virtual node.
The input data \({u}_{i,n}\) for the \(i\)-th virtual node is given by \(\mu \left({m}_{1,i}{s}_{1,n}+{m}_{2,i}{s}_{2,n}+\cdots +{m}_{{N}_{s},i}{s}_{{N}_{s},n}\right)+b{m}_{{N}_{s}+1,i}\), where \({m}_{p,q}\) is the element of the mask matrix \(\mathbf{M}\) in row \(p\) and column \(q\). This expression shows that the input data for the \(i\)-th node oscillates around the bias \(b{m}_{{N}_{s}+1,i}\). The center of this oscillation differs from node to node because the mask element \({m}_{{N}_{s}+1,i}\) differs for each node. Consequently, a different part of the nonlinear function relating the input and output of the reservoir is used for each node, yielding a different nonlinearity per node. Therefore, adding an input bias can enhance the approximation capability of the reservoir.
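A minimal NumPy sketch of the masking procedure of Eq. (1) is given below; the state dimension, scaling \(\mu\), and bias \(b\) are illustrative values, and the mask elements are drawn from a uniform distribution over \(\left[-1, 1\right]\) as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, N_s = 600, 4                 # number of virtual nodes and state elements (example values)
mu, b = 1.0, 0.8                # input scaling and bias (illustrative values)
M = rng.uniform(-1.0, 1.0, size=(N_s + 1, N))   # fixed random mask matrix

def mask_state(s_n):
    """Eq. (1): u_n^T = (mu * s_n^T, b) M, giving one input per virtual node."""
    extended = np.concatenate([mu * np.asarray(s_n), [b]])  # (mu*s_1, ..., mu*s_Ns, b)
    return extended @ M                                      # length-N vector u_n
```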
An input signal injected into the reservoir is generated by temporally stretching the elements of \({\mathbf{u}}_{n}\) to the node interval \(\theta\) as follows:
\(u\left(t\right)={u}_{i,n}, \quad \left(n-1\right){T}_{m}+\left(i-1\right)\theta \le t<\left(n-1\right){T}_{m}+i\theta ,\) (2)
where \({T}_{m}\) is the signal period for each input vector, called the mask period, which corresponds to the product of the number of nodes \(N\) and the node interval \(\theta\) (\({T}_{m}=N\theta\)). The input signal \(u\left(t\right)\) is injected into the reservoir to generate a response signal, from which the virtual node states are extracted by dividing the response into short intervals of \(\theta\). The number of virtual nodes \(N\) corresponds to the number of elements in \({\mathbf{u}}_{n}\).
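The sample-and-hold step can be sketched as follows; the number of waveform samples per node interval is an assumption made only for illustration.

```python
import numpy as np

theta = 0.4e-9            # node interval (0.4 ns, the value used later in this paper)
samples_per_node = 4      # assumed sampling resolution of the injected waveform

def stretch_to_waveform(u_n):
    """Hold each element u_{i,n} for theta so one input vector spans T_m = N * theta."""
    return np.repeat(np.asarray(u_n), samples_per_node)
```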
The output of reservoir computing is a weighted linear combination of the virtual node states, which are extracted from the temporal output of the reservoir at every node interval \(\theta\), and is used as \(Q\left({\mathbf{s}}_{n}, a\right)\) for reinforcement learning. The output associated with action \(a\), denoted \(Q\left({\mathbf{s}}_{n}, a\right)\), is defined as:
\(Q\left({\mathbf{s}}_{n},a\right)={\sum }_{j=1}^{N}{w}_{a,j}{v}_{j,n}={\mathbf{w}}_{a}^{T}{\mathbf{v}}_{n},\) (3)
where \({\mathbf{v}}_{n}\) is the vector of the node states for the \(n\)-th input and \({\mathbf{w}}_{a}\) is the weight vector for action \(a\). The number of reservoir outputs corresponds to the number of possible actions of the agent in the reinforcement learning task, and the action with the highest \(Q\) value is selected. In the following subsection, we explain how \({\mathbf{w}}_{a}\) is trained using Q-learning.
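A sketch of the linear readout of Eq. (3) and the greedy action selection, assuming one weight vector per action stored as a row of a matrix:

```python
import numpy as np

N, n_actions = 600, 2           # number of nodes and possible actions (example values)
W = np.zeros((n_actions, N))    # row a is the readout weight vector w_a

def q_values(v_n):
    """Eq. (3): Q(s_n, a) = w_a^T v_n for every action a."""
    return W @ v_n              # vector of length n_actions

def greedy_action(v_n):
    """Select the action with the highest Q value."""
    return int(np.argmax(q_values(v_n)))
```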
Training of the Reservoir Weights Using Reinforcement Learning
Q-learning is a well-known training algorithm for reinforcement learning [4], and we use it in this study to train the reservoir weights. The update rule of Q-learning is off-policy [4], meaning that the action used for training may differ from the action actually taken by the agent: the maximum Q value over actions at the next state, \({\text{max}}_{a}Q\left({\mathbf{s}}_{n+1}, a\right)\), is used, and the actual action is not always used for training. In our scheme, \(Q\left({\mathbf{s}}_{n}, a\right)\) is approximated using reservoir computing by considering the one-step temporal-difference error \({\delta }_{n}={r}_{n+1}+\gamma {\text{max}}_{a}Q\left({\mathbf{s}}_{n+1},a\right)-Q\left({\mathbf{s}}_{n},{a}_{n}\right)\) and using its square as the loss function. The update rule for the reservoir weights is then described as:
\({\mathbf{w}}_{{a}_{n}}\leftarrow {\mathbf{w}}_{{a}_{n}}+\alpha \left[{r}_{n+1}+\gamma \underset{a}{\text{max}}\,{\mathbf{w}}_{a}^{T}{\mathbf{v}}_{n+1}-{\mathbf{w}}_{{a}_{n}}^{T}{\mathbf{v}}_{n}\right]{\mathbf{v}}_{n},\) (4)
where \(\alpha\) is a constant step-size parameter and \(\gamma\) is the discount rate for future expected rewards. These hyperparameters must be selected appropriately for successful computation. We set \(\alpha\) to a small positive value, which controls the training speed, and \(\gamma\) to a positive value less than 1.
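A sketch of the update of Eq. (4) follows; the `done` flag, which drops the bootstrap term at episode termination, is a common convention and an assumption here rather than something stated in the text, and the numerical values of \(\alpha\) and \(\gamma\) are examples.

```python
import numpy as np

alpha, gamma = 4.0e-4, 0.995   # step size and discount rate (example values)

def q_learning_update(W, v_n, a_n, r_next, v_next, done=False):
    """Eq. (4): w_{a_n} <- w_{a_n} + alpha * delta_n * v_n, with the one-step TD error delta_n."""
    bootstrap = 0.0 if done else gamma * np.max(W @ v_next)   # gamma * max_a w_a^T v_{n+1}
    delta = r_next + bootstrap - W[a_n] @ v_n                 # temporal-difference error
    W[a_n] = W[a_n] + alpha * delta * v_n
    return W
```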
Furthermore, we employ the experience replay method to train the reservoir weights [38]. In this method, the observed data (state, action, and reward) are preserved in a memory, sampled randomly, and used for training. The randomly sampled data are referred to as a mini-batch, and the mini-batch size and the number of preserved data are hyperparameters. Training on randomly sampled preserved data reduces the correlation among the training data and facilitates the convergence of Q-learning.
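A minimal sketch of experience replay, assuming each transition is stored as a tuple such as \(\left({\mathbf{v}}_{n}, {a}_{n}, {r}_{n+1}, {\mathbf{v}}_{n+1}\right)\); the default capacity and mini-batch size are illustrative values.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of observed transitions, sampled uniformly at random."""

    def __init__(self, capacity=4000):
        self.memory = deque(maxlen=capacity)   # oldest data is discarded first

    def push(self, transition):
        """Store one transition, e.g. (v_n, a_n, r_next, v_next, done)."""
        self.memory.append(transition)

    def sample(self, batch_size=64):
        """Return a random mini-batch to decorrelate the training data."""
        return random.sample(self.memory, min(batch_size, len(self.memory)))
```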
Moreover, we use the \(\epsilon\)-greedy method for action selection, in which a random action is chosen with probability \(\epsilon\). The value of \(\epsilon\) is initially set to 1, so the agent first acts randomly, and is reduced as the number of episodes of the reinforcement-learning task increases. The value of \(\epsilon\) is updated as \(\epsilon ={\epsilon }_{0}+\left(1-{\epsilon }_{0}\right)\text{exp}\left(-{k}_{\epsilon }{n}_{ep}\right)\), where \({n}_{ep}\) is the episode index and \({k}_{\epsilon }\) is the attenuation coefficient. As the number of episodes increases, \(\epsilon\) converges to \({\epsilon }_{0}\), which is fixed at 0.01.
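The \(\epsilon\) schedule can be written directly from the expression above; it starts at 1 for \({n}_{ep}=0\) and converges to \({\epsilon }_{0}=0.01\) as the episode index grows.

```python
import numpy as np

eps_0, k_eps = 0.01, 0.04   # converged value and attenuation coefficient

def epsilon(n_ep):
    """epsilon = eps_0 + (1 - eps_0) * exp(-k_eps * n_ep)."""
    return eps_0 + (1.0 - eps_0) * np.exp(-k_eps * n_ep)

# epsilon(0) = 1.0, epsilon(100) is approximately 0.028
```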
Scheme for Optoelectronic Reservoir Computing
In this study, we use an optoelectronic delay system shown in Fig. 2(a), which is commonly applied to explore complex phenomena such as dynamical bifurcation, chaos, and chimera states [39–41], as a reservoir. Moreover, the application of this system in physical reservoir computing has also been studied [42, 43]. The system is composed of a laser diode (LD), a Mach-Zehnder modulator (MZM), and an optical fiber for delayed feedback. In particular, the modulator provides a nonlinear transfer function \({\text{cos}}^{2}\left(\cdot \right)\) from the electrical inputs to the optical outputs. The optical signal is transmitted through the optical fiber with a delay time of \(\tau\) and is transformed into an electric signal using a photodetector (PD). The electric signal is fed back to the MZM after the signal passes through an electric amplifier (AMP). An input signal for reservoir computing is injected into the reservoir by coupling with the feedback signal. The temporal dynamics of the system are described using simple delay differential equations [41]. We use delay differential equations for the numerical verifications of the proposed scheme, which are described in the Methods section.
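Because the exact delay differential equations are given in the Methods section, the following is only a generic low-pass, Ikeda-type sketch of an optoelectronic delay loop with a \({\text{cos}}^{2}\left(\cdot \right)\) nonlinearity, integrated with the Euler method; the response time `T`, feedback strength `beta`, offset phase `phi`, and input coupling `kappa` are assumed parameters, not values from this study.

```python
import numpy as np

def simulate_optoelectronic_loop(u, dt=0.05e-9, T=0.2e-9, tau=236.6e-9,
                                 beta=0.9, phi=0.2, kappa=1.0):
    """Euler integration of T dx/dt = -x + beta*cos^2(x(t - tau) + kappa*u(t) + phi).

    u is the injected input waveform sampled with step dt; the history before
    t = 0 is assumed to be zero.
    """
    d = int(round(tau / dt))          # delay expressed in integration steps
    x = np.zeros(len(u))
    for k in range(len(u) - 1):
        x_delayed = x[k - d] if k >= d else 0.0
        drive = x_delayed + kappa * u[k] + phi
        x[k + 1] = x[k] + (dt / T) * (-x[k] + beta * np.cos(drive) ** 2)
    return x
```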
In our experiment, we employ a system similar to the scheme shown in Fig. 2(a), except that the delayed feedback is absent, as shown in Fig. 2(b). The system is therefore considered an extreme learning machine, which has been studied as a machine-learning scheme [44]. The digital oscilloscope (OSC) and arbitrary waveform generator (AWG) were controlled by a personal computer (PC). First, the state of the reinforcement learning task was calculated on the PC, and an input signal was generated from the state by the masking procedure for reservoir computing. The input signal was transferred from the PC to the AWG, which produced the corresponding waveform. The signal was amplified by the AMP and injected into the MZM, whose optical output was modulated by the injected signal. The optical signal was converted into an electric signal at the PD, acquired by the OSC, and transferred to the PC, where the node states of the reservoir were extracted. The output of reservoir computing was calculated from the weighted sum of the node states, corresponding to a Q value for each action of the reinforcement learning task. An action was selected based on the Q values, the state of the task was updated according to the selected action, and the reservoir weights were updated based on Q-learning. This procedure was repeated until the reinforcement learning task was terminated and was executed online under the control of the PC, AWG, and OSC.
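The online procedure can be summarized by the following sketch; `env`, `awg_output`, `osc_acquire`, `masking`, `readout`, and `q_update` are hypothetical placeholders for the task running on the PC, the instrument control of the AWG and OSC, and the reservoir-computing steps described above, not actual device or library APIs.

```python
def run_episode(env, awg_output, osc_acquire, masking, readout, q_update):
    """One episode of the online experimental loop (terminal update omitted for brevity)."""
    state, done = env.reset(), False
    prev = None                                    # (v_n, a_n, r_{n+1}) awaiting v_{n+1}
    while not done:
        awg_output(masking(state))                 # PC -> AWG: masked input drives the MZM
        v = osc_acquire()                          # OSC -> PC: extract the node states
        if prev is not None:
            q_update(prev[0], prev[1], prev[2], v) # Q-learning update of the readout weights
        action = readout(v)                        # action from the Q values (epsilon-greedy)
        state, reward, done = env.step(action)     # advance the reinforcement-learning task
        prev = (v, action, reward)
```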
In our numerical simulation and experiment, the number of nodes \(N\) is 600 and the node interval \(\theta\) is 0.4 ns, so the mask period is \({T}_{m}=N\theta =240\) ns. In various studies on delay-based reservoir computing, the feedback delay time is set equal to the mask period [36]; moreover, it has been reported that a slight mismatch between the delay time and the mask period enhances the performance of information processing [27, 45]. Therefore, we set the feedback delay time to \(\tau =236.6\) ns, which is related to the mask period and the node interval by \({T}_{m}=\tau +\theta\).
Experimental and Numerical Results on Benchmark Tasks
We evaluate our reinforcement learning scheme based on delay-based reservoir computing using the CartPole-v0 task [46] in OpenAI Gym [47]. A pole is attached by an un-actuated joint to a cart that moves along a frictionless track. The goal of the task is to keep the pole upright during an episode, which has a length of 200 time steps. A reward of \(+1\) is provided for every time step in which the pole remains upright, and the task is considered solved when the pole remains upright for 100 consecutive episodes. The details of the task are described in the Methods section. The hyperparameters for reinforcement learning are fixed at \(\alpha =0.000400\), \(\gamma =0.995\), and \({k}_{\epsilon }=0.04\). The number of preserved data and the mini-batch size for experience replay are 4000 and 64, respectively.
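For reference, a minimal CartPole-v0 interaction loop is sketched below, assuming the classic (pre-0.26) Gym API; a random policy keeps the snippet self-contained, and in the actual scheme the action would instead come from the \(\epsilon\)-greedy reservoir readout with the hyperparameters listed above.

```python
import gym   # OpenAI Gym [47]; assumes the classic (pre-0.26) Gym API of CartPole-v0

env = gym.make("CartPole-v0")
for n_ep in range(300):
    state, done, total_reward = env.reset(), False, 0
    while not done:
        action = env.action_space.sample()         # placeholder for the reservoir-based policy
        state, reward, done, _ = env.step(action)  # +1 reward per step the pole stays upright
        total_reward += reward
    print(n_ep, total_reward)                      # a total reward of 200 means the pole never fell
env.close()
```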
Figure 3(a) shows the numerical results of the total reward for each episode. A total reward of 200 indicates that the pole remains upright over the entire episode, whereas a total reward of less than 200 indicates that the pole falls in the middle of an episode. The black curve in Fig. 3(a) shows the case in which the input bias is applied (\(b=0.80\)). The pole cannot be kept upright when the episode index is small, but it remains upright over an entire episode as the number of episodes increases. The CartPole-v0 task is successfully solved because a total reward of 200 is obtained for 100 consecutive episodes. In contrast, when the input bias is not applied (\(b=0.0\), red curve), the total reward never reaches 200 and the pole cannot be kept upright in any episode, indicating that the input bias is necessary to solve the task. Without the input bias, we observe that a single action is selected throughout an episode, determined by the initial state: a push to the right (left) is selected over the whole episode if the pole angle is positive (negative), which results in an immediate fall of the pole. When the input bias is introduced, actions that prevent the pole from falling are selected after several episodes because the bias enhances the ability of the reservoir to distinguish different input states; the reservoir can therefore be trained to keep the pole upright.
Figure 3(b) shows the experimental results, in which the total reward reaches 200 at the 110th episode and does not vary until the 300th episode, indicating that the task is successfully solved in the experiment. The total reward reaches 200 more slowly in the experiment than in the numerical simulation. This discrepancy is caused by measurement noise in the experiment, which perturbs the Q values estimated by the reservoir; an incorrect action may be selected when the difference between the Q values of the two actions is too small. Therefore, the Q values must be learned until their difference becomes sufficiently large to ensure the selection of the correct action in a noisy environment.
In addition, the system has no time-delayed feedback in the experiment. If the reservoir had time-delayed feedback, it would have a memory effect, which is advantageous because a reservoir with memory can learn a state-action value function that includes both past and current states. This is equivalent to expanding the dimension of the input state space, making it possible to approximate a more complex state-action value function and to achieve a higher total reward. Nevertheless, for the benchmark task used in this study, the time evolution of the state is uniquely determined by the current state, so the task can still be solved using a reservoir without time-delayed feedback in the experiment.
We emphasize that one action of reinforcement learning can potentially be determined at the processing rate of reservoir computing, 4.2 MHz in our scheme, because one virtual network state is constructed from a time series of length \(N\theta =240\) ns (\(N=600\) and \(\theta =0.4\) ns). This is much faster than conventional reinforcement learning, and the processing speed can be further increased by decreasing the node interval \(\theta\) using a faster photonic dynamical system. In addition, the number of trained parameters (600) is much smaller than that of deep neural networks (e.g., 480 M parameters for ImageNet [8–10]). The hardware implementation of photonic reservoir computing is therefore promising for realizing fast and efficient reinforcement learning.
We also demonstrate another benchmark task, MountainCar-v0, provided by OpenAI Gym [46]. The goal of this task is to make a car reach the top of a mountain by accelerating the car to the right or left. A reward of \(-1\) is given at every step until an episode ends, and an episode consists of at most 200 steps; the episode is terminated early if the car reaches the top of the mountain. Therefore, a higher total reward is obtained if the car reaches the top of the mountain faster. Solving this task is defined as obtaining an average reward of \(-110.0\) over 100 consecutive trials [44]. The hyperparameters for reinforcement learning are fixed at \(\alpha =0.000010\), \(\gamma =0.995\), and \({k}_{\epsilon }=0.04\). The number of preserved data and the mini-batch size for experience replay are 4000 and 256, respectively.
Figure 4(a) shows the numerical results of the total reward for each episode. The black curve represents the total reward for each episode, and the red curve represents the moving average of the total reward calculated over the past 100 episodes. The total reward is \(-200\) in the first several episodes, indicating that the car does not reach the top of the mountain at all. As the number of episodes increases, however, the car becomes able to reach the top of the mountain. The moving average increases with the number of episodes and exceeds \(-110\) at the 267th episode, indicating that the task is solved using our scheme.
The experimental results are shown in Fig. 4(b). The moving average (red curve) increases as the number of episodes increases but does not reach the blue dashed line (total reward of \(-110\)). In this task, the total reward obtained in a successful episode ranges from \(-120\) to \(-80\), depending on the initial state, and the moving average becomes larger when a long run of consecutive episodes with a large total reward occurs. This run is shorter in the experiment (episodes 170 to 192) than in the numerical simulation (episodes 197 to 249); thus, the moving average in the experiment does not increase as much as that in the numerical simulation.
We conduct another experiment using the MountainCar-v0 task, in which we use reservoir weights trained in the experiment of Fig. 4(b) without further updates. Figure 4(c) shows the total reward for each episode when the weights trained up to the 180th episode in Fig. 4(b) are used. The moving average (red curve) exceeds \(-110\) at the 141st episode; that is, the average total reward over 100 consecutive episodes exceeds \(-110\), and the task is solved. These results indicate that the trained weights work properly even when the experimental conditions, such as the detected power at the PD, are slightly changed. Therefore, the trained weights are robust against perturbations of the system parameters.
We numerically investigate the dependence of the performance on the input bias \(b\) in the MountainCar-v0 task. In Fig. 5(a), the red solid curve represents the maximum of the moving average of the total reward in 1000 episodes, averaged over 10 trials of 1000 episodes each. The total reward is nearly equal to zero for a small input bias (\(b\le 0.5\)), whereas a large total reward is obtained for a large input bias (\(b>0.5\)). This result indicates that an input bias is necessary for solving the task. An input bias close to 1 is suitable for increasing the total reward, which can be related to the normalized half-wave voltage (\({V}_{\pi }\)) of the MZM, equal to 1 in our numerical simulation: an input bias near 1 drives the nonlinear part of the \({\text{cos}}^{2}\left(\cdot \right)\) transfer function of the MZM, and this nonlinearity is considered to assist in distinguishing different input states.
We also investigate the effect of the time-delayed feedback in the numerical simulation. In Fig. 5(a), the blue dashed curve represents the case in which the reservoir has no delayed feedback. At \(b=0.85\), a total reward of -130.29 is obtained. Thus, the reservoir can be successfully trained for the car to reach the top of the mountain, even if the reservoir has no delayed feedback. However, the performance is lower than in the case with delayed feedback (the red solid curve), which indicates that the delayed feedback can enhance the performance of the task.
For a more detailed investigation, Fig. 5(b) shows the dependence of the total reward on the feedback strength for three different input bias values. For the black (\(b=0.90\)) and red (\(b=0.70\)) curves, a large total reward is obtained at a feedback strength of approximately 1, but the total reward decreases as the feedback strength increases beyond 1. When the feedback strength exceeds 1, the temporal dynamics of the optoelectronic system change from a steady state to a periodic oscillation even without an input signal. If the reservoir dynamics are not in a steady state, the temporal responses of the reservoir can differ when the same signal is repeatedly injected; that is, the reservoir loses the consistency property of dynamical systems [48]. Without consistency, the performance deteriorates because the same state cannot be identified. For a smaller input bias (blue curve, \(b=0.50\)), the total reward is almost equal to \(-200\) for all feedback strengths, indicating that adjusting the feedback strength cannot enhance the performance if the input bias is too small.