A. Multi-level modulation system
We map the adaptive modulation scheme to a finite Markov decision process (MDP). Within this discrete, finite-state framework, the agent and the environment achieve their goals through interactive learning. In a finite MDP, the function \(p\) defines the dynamics of the MDP and specifies a probability distribution over the selection of each state and action.
To improve the bandwidth efficiency of the system, an underwater acoustic communication system usually adopts a multi-level modulation scheme. A set of signal constellation points represents the modulation level \({M}_{t}\) of each data symbol. We select the modulation schemes {BPSK, QPSK, 8-QAM, 16-QAM}, where \({M}_{1}=\left\{2,4\right\}\) covers the circular-constellation multiphase shift keying (MPSK) system and \({M}_{2}=\left\{8,16\right\}\) covers the square-constellation multilevel quadrature amplitude modulation (MQAM) system. Assuming the transmitter uses a constant symbol period \({T}_{s}\), the two sets are combined to obtain \(M=\left\{2,4,8,16\right\}\), and the length of each time slot is \({T}_{s}=\frac{1}{B}\), where \(B\) is the bandwidth of the received signal.
(1) State space: since the receiver obtains the CSI of the feedback link, including channel gain, multipath, noise, and other information, we define the state of each time slot as \({S}_{t}=\left\{{s}_{1},{s}_{2},{s}_{3},\dots \right\}\).
(2) Action space: in the OFDM underwater acoustic communication system model, the transmitter automatically adjusts and selects the modulation scheme according only to the current feedback state in each time slot, so the action of each time slot is defined as \({A}_{t}=\left\{{a}_{1},{a}_{2},{a}_{3},\dots \right\}\), where the modulation mode \({a}_{i}\) adopted for the \({i}^{th}\) time slot is selected from the given constellation set \(M\).
(3) Immediate reward function: each time slot obtains an immediate reward \({R}_{t}=\left\{{r}_{1},{r}_{2},{r}_{3},\dots \right\}\) based upon the feedback, comprising both reward and punishment. The reward is directly proportional to the number of data bits successfully transmitted and is related to the system throughput; the punishment is related only to the BER.
A complete MDP consists of a four-tuple. Given the initial state and action, the probability that \(s'\in S\) and \(r\in R\) occur at time \(t\) is \(p\left(s',r|s,a\right)=\Pr\left\{{S}_{t}=s',{R}_{t}=r|{S}_{t-1}=s,{A}_{t-1}=a\right\}\). However, in our adaptive modulation scheme in an underwater channel environment, this probability distribution for each \(s\) and \(a\) cannot be obtained, so the problem is defined as the MDP triple \(\langle S,A,R\rangle\).
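As an illustrative sketch of the \(\langle S,A,R\rangle\) triple, the Python snippet below models one slot of agent-environment interaction; the state layout, helper names, and stand-in values are our own assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of the MDP triple <S, A, R> for adaptive modulation.
# The state is the fed-back CSI for the slot; values here are assumptions.

ACTIONS = [1, 2, 4, 8, 16]   # constellation set M; M_t = 1 means "do not transmit"

def slot_step(csi_state, policy, measure_reward):
    """One time slot of agent-environment interaction:
    observe s_t, pick a modulation order a_t, receive r_t from feedback."""
    a_t = policy(csi_state)               # choose a_t from ACTIONS given s_t
    r_t = measure_reward(csi_state, a_t)  # reward/punishment, cf. Eq. (15)
    return a_t, r_t

# Example with trivial stand-ins for the policy and the reward measurement:
a, r = slot_step(csi_state=[0.8, 0.1, 12.0],
                 policy=lambda s: 8,
                 measure_reward=lambda s, a: 1.0)
```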
B. Value calculation
We consider designing an adaptive transmission system based on the SNR \({\gamma }_{t}\). If the average data rate is maximized only under a fixed target BER, then \(k\left({\gamma }_{t}\right)\) can be set equal to \({\log}_{2}{M}_{t}\), which accommodates general adaptive M-ary modulation. The accurate BER is obtained through actual transmission: we send known data in each time slot, feed back the bit errors, and compute the proportion of bits incorrectly received by the receiver out of the total transmitted bits.
Considering the various overheads in the communication system, we calculate the total data rate with the coding rate \(r\) held constant. Measured in bits per second per hertz, the spectral efficiency is:
\(\psi =\left(r\times \sigma \right)\times \frac{T}{{T}_{b1}}\times \frac{{K}_{d}}{K},\)  (13)
where \(\sigma\) is the number of bits per symbol, \(\sigma ={\log}_{2}M\), and \(M\) is the modulation order; \({T}_{b1}\) is the OFDM symbol length, \({T}_{b1}=T+{T}_{cp}\), where \({T}_{cp}\) is the length of the cyclic prefix and \(T\) is the basic OFDM symbol interval; \(K\) is the number of subcarriers (the FFT size); and \({K}_{d}\) is the number of data subcarriers.
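For concreteness, a minimal Python transcription of Eq. (13) follows; the parameter values in the example are placeholders, not the system's actual configuration.

```python
import math

def spectral_efficiency(r, M, T, T_cp, K, K_d):
    """Spectral efficiency psi of Eq. (13), in bits/s/Hz.

    r    : coding rate
    M    : modulation order, so sigma = log2(M) bits per symbol
    T    : basic OFDM symbol interval
    T_cp : cyclic prefix length (T_b1 = T + T_cp)
    K    : number of subcarriers (FFT size)
    K_d  : number of data subcarriers
    """
    sigma = math.log2(M)
    return (r * sigma) * (T / (T + T_cp)) * (K_d / K)

# Example with assumed (illustrative) parameter values:
psi = spectral_efficiency(r=0.5, M=16, T=0.1, T_cp=0.025, K=1024, K_d=768)
```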
Unlike the Shannon capacity, in the multi-level modulation system we use the number of bits sent per unit time as the system throughput:
\(T=\psi \times \max\left[{P}_{cf}\times \left(r\times \sigma \right)\right],\)  (14)
where \({P}_{cf}\) is related to the system BER \(\rho\) through \({P}_{cf}=1-\rho\).
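Under one reading of Eq. (14), the max is taken over the candidate modulation orders; the hedged sketch below follows that reading, with the per-order BER measurements as assumed inputs.

```python
import math

def throughput(psi, candidates, r):
    """Throughput per Eq. (14): psi scaled by the best P_cf * (r * sigma)
    over the candidate modulation orders. `candidates` maps each order M
    to its measured BER rho (assumed inputs)."""
    best = max((1.0 - rho) * (r * math.log2(M)) for M, rho in candidates.items())
    return psi * best

# Example with assumed BER measurements per order:
t = throughput(psi=1.2, candidates={4: 1e-4, 8: 1e-3, 16: 2e-2}, r=0.5)
```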
The penalty bit \(\varTheta\) sets the reward to zero: if the BER requirement is not met, the reward is set to 0. The value function is defined as:
\(r\left(s,a\right)=\left\{\begin{array}{ll}-{c}_{1}\cdot \rho +{c}_{2}\cdot \psi +{c}_{3}\cdot T, & \varTheta =0,\ \rho \le {P}_{b}\\ 0, & \varTheta =1,\ \rho >{P}_{b}\end{array}\right.\)  (15)
where \({c}_{1},{c}_{2},{c}_{3}\) are constants representing the weight of each term in the value calculation. No signal is transmitted in the \({t}^{th}\) time slot if the penalty bit \(\varTheta\) is set to one for every action. This happens when the current environment state is not ideal: according to our automatic modulation strategy, even the lowest-order modulation mode cannot meet the system requirements, so the optimal strategy is not to transmit.
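A direct transcription of Eq. (15) follows; the weights \(c_1, c_2, c_3\) are unspecified constants in the text, so the defaults below are placeholders.

```python
def reward(rho, psi, T, P_b, c1=1.0, c2=1.0, c3=1.0):
    """Reward of Eq. (15). Returns (r, Theta): the penalty bit Theta = 1
    zeroes the reward whenever the measured BER rho exceeds the target P_b.
    The weights c1..c3 are placeholders, not the paper's values."""
    if rho <= P_b:
        return -c1 * rho + c2 * psi + c3 * T, 0
    return 0.0, 1
```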
C. Optimization problems
According to the information returned over the feedback link, we formulate the optimization problem as:
\(\begin{array}{c}\underset{{M}_{t}}{\max}\ {r}_{t}\left({s}_{t},{a}_{t}\right)\\ \text{s.t.}\left\{\begin{array}{ll}{M}_{t}\in M=\left\{1,2,4,8,16\right\}, & \forall t=1,2,\dots ,N\\ {\rho }_{t}\le {P}_{b,t}, & \forall t=1,2,\dots ,N\end{array}\right.\end{array}\)  (16)
where \({M}_{t}=1\) is the action of not transmitting, in which the transmitter remains in a static waiting state.
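A brute-force per-slot solution of Eq. (16) can be sketched as follows; `per_order_stats` is a hypothetical container for the measured reward and BER of each candidate order, not an interface from the paper.

```python
def best_modulation(per_order_stats, P_b):
    """Exhaustive per-slot solution of Eq. (16): among the orders whose
    measured BER meets the target, pick the one with the largest reward;
    fall back to M_t = 1 (no transmission) if none qualifies.
    per_order_stats maps M -> (reward, rho); assumed measured inputs."""
    feasible = {M: rwd for M, (rwd, rho) in per_order_stats.items() if rho <= P_b}
    return max(feasible, key=feasible.get) if feasible else 1

# Example: 16-QAM violates the BER target, so 8-QAM is chosen.
M_star = best_modulation({4: (1.0, 1e-4), 8: (1.4, 1e-3), 16: (1.8, 5e-2)},
                         P_b=1e-2)
```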
D. The proposed adaptive modulation scheme based on DRL
In the traditional Q-learning algorithm, the Q value is stored in a table: the horizontal axis of the two-dimensional table indexes the states, the vertical axis indexes the actions, and each entry is the Q value of the corresponding state-action pair. For a low-dimensional state space, the Q table can hold all states, and the optimal action can be selected by directly querying the table. However, the time-varying underwater acoustic channel has a large-scale, continuously changing state space [23]. The Deep Q-Network (DQN) is model-free, aiming to find the mapping between state-action pairs and Q values. The temporal-difference (TD) method combines Monte Carlo sampling with the bootstrapping of dynamic programming (using the value function of the subsequent state to estimate the current value function), so it applies to model-free algorithms and updates in a single step with faster speed. For Q-learning, in which the action is a discrete variable, \({Q}^{*}\left(s,a\right)\) is approximated by a deep neural network. We still transform the continuously changing state into a discrete state for each time slot; because the neural network can automatically extract complex features, we do not quantize the CSI but keep the feature vector as the input.
1) Model input
A common choice is to take the channel frequency-domain response estimated by the receiver, combined with the SNR of the equalized subcarriers, as the input vector of the network. However, owing to the sparsity of the underwater acoustic channel, we instead transform the estimated channel frequency-domain response into the time-domain impulse response. Using a sparse-matrix storage scheme, we can effectively reduce the amount of data and denote the network input signal as \(x\):
\(x=\left[\begin{array}{cc}{h}_{sparse}& SNR\end{array}\right],\)  (17)
To reduce the dimension of the input data, the \(n\) largest peaks of the time-domain impulse response are extracted in advance, keeping the input size consistent:
\({h}_{sparse}=\left[\begin{array}{cccc}{A}_{1}& {A}_{2}& \cdots & {A}_{n}\\ {\tau }_{1}& {\tau }_{2}& \cdots & {\tau }_{n}\end{array}\right].\)  (18)
As in Eq. (1), \({A}_{i}\) and \({\tau }_{i}\) are the amplitude and relative delay corresponding to the different paths.
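A possible construction of the input of Eqs. (17)-(18) is sketched below; the IFFT-based peak extraction and the delay-referencing convention are our assumptions about the preprocessing, not a confirmed implementation detail.

```python
import numpy as np

def sparse_channel_input(h_freq, snr, n=8):
    """Build the network input x of Eqs. (17)-(18): transform the estimated
    frequency response to a time-domain impulse response, keep the n
    strongest taps (amplitudes A_i and relative delays tau_i), and append
    the SNR. A sketch; windowing/normalization details are assumptions."""
    h_time = np.fft.ifft(h_freq)            # frequency -> time domain
    idx = np.sort(np.argsort(np.abs(h_time))[-n:])  # n largest peaks, in delay order
    amplitudes = np.abs(h_time[idx])        # A_1 ... A_n
    delays = idx - idx[0]                   # tau_i relative to the first path
    return np.concatenate([amplitudes, delays.astype(float), [snr]])

# Example with a random 1024-subcarrier channel estimate (illustrative only):
x = sparse_channel_input(np.random.randn(1024) + 1j * np.random.randn(1024),
                         snr=12.0, n=8)
```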
2) Adaptive modulation algorithm based on deep reinforcement learning
The modification of Q-learning by DRL is mainly reflected in three aspects:
(1) DRL uses a deep neural network to approximate the value function;
(2) DRL uses experience replay to train the reinforcement learning process;
(3) DRL independently sets up a target network to handle the TD deviation in the temporal-difference algorithm.
In DRL, the Q-learning algorithm of reinforcement learning and the stochastic gradient descent (SGD) training of deep learning are carried out synchronously. We use the powerful fitting ability of the neural network to approximate the action-value function in Q-learning, so that \(Q\left(s,a\right)\approx Q\left(s,a;\theta \right)\). The update rule is:
\(Q\left(s,a\right)\leftarrow Q\left(s,a\right)+\alpha \left[r+\gamma \max_{a'}Q\left(s',a'\right)-Q\left(s,a\right)\right],\)  (19)
where \(TargetQ=r+\gamma \max_{a'}Q\left(s',a'\right)\). The loss function uses the mean-squared error to update the parameters in each iteration: \(L\left(\theta \right)=E\left[{\left(TargetQ-Q\left(s,a;\theta\right)\right)}^{2}\right]\), where \(\theta\) denotes the network parameters, \(\alpha\) is the learning rate, \(s'\) and \(a'\) are the state and action in the next iteration, respectively, and \(\gamma\) is the discount factor in the TD method.
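A PyTorch sketch of the mean-squared TD loss \(L(\theta)\) follows; the network interfaces and batch layout are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

def dqn_loss(main_net, target_net, batch, gamma=0.9):
    """Mean-squared TD loss L(theta) = E[(TargetQ - Q(s,a;theta))^2].
    `batch` holds tensors (s, a, r, s_next); shapes are assumptions:
    s, s_next: [B, state_dim], a: [B] (long), r: [B]."""
    s, a, r, s_next = batch
    q_sa = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s,a;theta)
    with torch.no_grad():                                    # TargetQ, Eq. (19)
        target_q = r + gamma * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target_q)
```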
In the learning phase of each time slot, we use an \(\epsilon\)-greedy strategy to traverse all possible actions in each channel state: the greedy component exploits the locally optimal action, and \(\epsilon\) is usually set as a constant in a system. A number \(\xi\) between 0 and 1 is generated uniformly at random:
\({a}_{t}=\left\{\begin{array}{ll}rand\left({A}_{t}\right), & \xi <\epsilon \\ \text{arg}\max_{{a}_{t}}Q\left({s}_{t},{a}_{t}\right), & \xi \ge \epsilon \end{array}\right.\)  (20)
The action that maximizes the Q value thus has a higher probability of being selected, while the Q values of all possible actions can still be learned. We therefore anneal \(\epsilon\) during training:
\(\epsilon =\max\left(0.01,\ 0.2-0.1\times \frac{{N}_{episode}}{{N}_{0}}\right),\)  (21)
where \({N}_{episode}\) is the number of episodes the system has completed and \({N}_{0}\) is a constant controlling the decay rate. \(\epsilon\) gradually decreases as training proceeds. If the environment changes rapidly, \(\epsilon\) should be increased so that the system is more likely to train the Q values of other actions in the current state, which maximizes the reward of the selected action.
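Eqs. (20)-(21) can be combined into a short action-selection routine; \(N_0 = 100\) below is a placeholder, since the text leaves the constant unspecified.

```python
import random

def epsilon(n_episode, n0=100):
    """Annealed exploration rate of Eq. (21); n0 is a placeholder value."""
    return max(0.01, 0.2 - 0.1 * (n_episode / n0))

def select_action(q_values, actions, n_episode, n0=100):
    """Epsilon-greedy choice of Eq. (20): explore with probability epsilon,
    otherwise take the action with the largest estimated Q value.
    `q_values` maps each candidate action to its current Q estimate."""
    if random.random() < epsilon(n_episode, n0):
        return random.choice(actions)
    return max(actions, key=lambda a: q_values[a])
```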
Each episode runs over a sequence of time slots during which the channel environment changes slowly. The agent judges whether the performance meets the requirements and computes the reward and penalty by observing the average maximum throughput and the average BER corresponding to each modulation order; the penalty bit \(\varTheta\) marks a zeroed reward. In the same network, the weights learned for one task may change completely when learning a new task, because the optimization targets differ in a time-varying underwater acoustic channel environment with significant differences: the objective function is the same, but the data sets are different. Since the old weights are easily damaged, batch random sampling is adopted. We store 10000 groups of data, each containing the current environmental state information, the modulation mode, and the value score obtained. The agent samples 100 groups of data from the experience replay, learns over the whole mini-batch, computes the average gradient, and then updates the network. During the update, only the Q value corresponding to the current modulation is updated; the others remain unchanged.
We use a sample buffer with a sliding-window mechanism. The sequence stored in the experience replay is \([{S}_{t},{A}_{t},{R}_{t},{S}_{t+1}]\), the cache length is \(L=1000\), and the buffer is initialized empty. Each learned state-transition sequence is appended in turn; once the buffer is full, the oldest sample is deleted before the new sample is stored.
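The sliding-window buffer can be realized with a fixed-length deque; a minimal sketch using the capacities stated above:

```python
import random
from collections import deque

class ReplayBuffer:
    """Sliding-window experience replay: a deque of fixed length L drops
    the oldest transition automatically when a new one is appended."""
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=100):
        return random.sample(self.buffer, batch_size)
```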
DQN contains two networks with exactly the same structure but different parameters. As shown in Fig. 5, the Q-estimation network mainNet is the virtual training network: at each step its parameters \(\theta\) are updated from the samples collected in the mini-batch, whereas the Q-target network targetNet keeps parameters from an earlier iteration. The weights of mainNet are copied to targetNet every \(C\) iterations. \(Q\left(s,a;{\theta }_{i}\right)\) denotes the output of the current network mainNet, used to evaluate the value function of the current state-action pair; \(Q\left(s,a;{\theta }_{i}^{-}\right)\) denotes the output of targetNet, which mainly provides \(\max Q\) for computing the target. Therefore, when the agent acts on the environment, it can compute Q according to the formula and update the parameters of mainNet according to the loss function. To prevent overestimation and keep the Q value closer to its true value, the optimal action is selected using the parameters of the Q network currently being updated. This completes one training episode.
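The periodic weight copy and the overestimation-mitigating target described above (a Double-DQN-style target, matching line "Set \(y_t\)" in Algorithm 1) can be sketched as follows; tensor shapes and the discount value are assumptions.

```python
import torch

def sync_target(main_net, target_net):
    """Copy mainNet weights into targetNet (done every C iterations)."""
    target_net.load_state_dict(main_net.state_dict())

def double_dqn_target(main_net, target_net, r, s_next, gamma=0.9):
    """Double-DQN-style target: the action is selected with the current
    mainNet parameters and evaluated by targetNet, mitigating the
    overestimation discussed above. r: [B], s_next: [B, state_dim]."""
    with torch.no_grad():
        a_star = main_net(s_next).argmax(dim=1, keepdim=True)   # select with theta
        return r + gamma * target_net(s_next).gather(1, a_star).squeeze(1)
```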
Specifically, the pseudo-code of using the DRL algorithm to find the best transmission strategy is shown in Algorithm 1.
Algorithm 1: DRL-Based AM Algorithm With TD Strategy
Input: CSI and SNR parameters
Output: optimal action \({a}_{t}^{*}\) of each time slot
Initialize replay memory \(D\) to capacity \(N\)
Initialize state-value buffer \(B\) to capacity \(P\)
Initialize state-value Q-function with random weights \(\theta\)
Initialize target state-value Q-function with weights \({\theta }^{-}=\theta\)
Initialize sequences \(S\), \(A\), and \(R\)
For episode = 1, \(M\) do
    Initialize sequence \({s}_{1}=\left\{{x}_{1}\right\}\) and the preprocessed sequence
    For step = 1, \(T\) do
        With probability \(\epsilon\) select a random action \({a}_{t}\); otherwise choose the optimal action \({a}_{t}=\arg\max_{a}{Q}^{*}\left({s}_{t},a;\theta \right)\) for the time slot
        Evaluate the BER and \(\varTheta\) after passing through the system in each rate region
        Evaluate the spectral efficiency and maximum throughput
        Obtain reward \({r}_{t}\)
        Send the information to the receiver through the feedback link in every time slot
        Store transition \(\left\{{s}_{t},{a}_{t},{r}_{t},{s}_{t+1}\right\}\) in \(D\)
        Sample a random mini-batch of transitions from \(D\)
        Set \({y}_{t}=\left\{\begin{array}{ll}{r}_{t}, & \text{if the episode terminates at step } t+1\\ {r}_{t}+\gamma \widehat{Q}\left(s',\underset{a}{\arg\max}\,\widehat{Q}\left(s',a;\theta \right);{\theta }^{-}\right), & \text{otherwise}\end{array}\right.\)
        Perform a gradient descent step on \({\left({y}_{t}-Q\left({s}_{t},{a}_{t};\theta \right)\right)}^{2}\) with respect to \(\theta\)
        Every \(C\) steps reset \({\theta }^{-}=\theta\)
    End for
End for