2.1 Data channel algorithm processing of interconnected communication technology

*2.1.1 Data processing flow*

The common physical-layer data processing flow is the same as the data channel processing flow of U-NB-IoT. So that the receiving end can check the correctness of the transmission, a cyclic redundancy check (CRC) code is first appended to the unprocessed bits passed down by the data link layer [7]. Note that the CRC of U-NB-IoT is derived from the CRC24 polynomial. Equation (1) displays the expansion of the CRC24 polynomial:

$$\begin{gathered} {g_{CRC24}}\left( X \right)={x^{24}}+{x^{23}}+{x^{18}}+{x^{17}}+{x^{14}}+{x^{11}}+{x^{10}} \hfill \\ {\text{ }}+{x^7}+{x^6}+{x^5}+{x^4}+{x^3}+x+1 \hfill \\ \end{gathered}$$

1

The CRC result is the remainder of the input bit polynomial divided by the CRC polynomial. Figure 1 displays the baseband processing flow.
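As a minimal sketch, the long-division view of the CRC can be written as follows; the polynomial is taken from equation (1), while the bit-level framing (MSB-first, checksum as a 24-bit integer) is an assumption:

```python
# CRC24 generator polynomial of equation (1), as a 25-bit integer:
# x^24 + x^23 + x^18 + x^17 + x^14 + x^11 + x^10 + x^7 + x^6 + x^5 + x^4 + x^3 + x + 1
G_CRC24 = 0x1864CFB

def crc24(bits):
    """Remainder of the input bit polynomial (times x^24) divided by the
    CRC polynomial, computed by bitwise long division (MSB first)."""
    reg = 0
    for b in list(bits) + [0] * 24:   # append 24 zeros: multiply by x^24
        reg = (reg << 1) | b
        if reg & 0x1000000:           # degree-24 term present: subtract g(x)
            reg ^= G_CRC24
    return reg                        # 24-bit checksum
```

Appending the 24 checksum bits to the message makes the whole block divide evenly by g(x), which is exactly what the receiving end checks.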

The new transmission block consists of the unprocessed bits together with the CRC check code. This combination is then subjected to channel coding [8]: a tail-biting convolutional code with a rate of 1/3 and a constraint length of 7 is adopted. Figure 2 shows the encoder in detail.
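A sketch of such an encoder is given below; the generator polynomials are an assumption (borrowed from LTE's rate-1/3, constraint-length-7 code: 133, 171, 165 octal), since the document only shows them in Figure 2:

```python
def tbcc_encode(bits, polys=(0o133, 0o171, 0o165)):
    """Tail-biting convolutional encoder, rate 1/3, constraint length 7.
    The shift register is preloaded with the last 6 information bits, so the
    final register state equals the initial one (requires len(bits) >= 6).
    Polynomial MSB corresponds to the current input bit (assumed LTE values)."""
    window = 0
    for b in bits[-6:]:                    # tail-biting initialization
        window = (window >> 1) | (b << 6)
    out = []
    for b in bits:
        window = (window >> 1) | (b << 6)  # shift the newest bit into the window
        for g in polys:                    # one output bit per polynomial (rate 1/3)
            out.append(bin(window & g).count("1") & 1)
    return out
```

Because the register starts and ends in the same state, no tail bits are transmitted, which is the rate advantage over an ordinary terminated convolutional code.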

U-NB-IoT uses tail-biting convolutional codes to overcome the rate loss of ordinary convolutional codes; their advantage is that the final state of the register can be set to the same value as the initial state [9]. As shown in Figure 2, the convolution polynomials, denoted G, are XORed with the register contents holding the original bit values to obtain the channel-encoded data. Rate matching then produces a data stream twice as long as the original bits, the original data having already passed through the CRC and tail-biting convolution stages. Rate matching serves to adapt the convolutionally encoded data stream to the bearing capacity of the air interface [10]. After rate matching, the bit stream matched to the air-interface bearing capacity is scrambled: the U-NB-IoT system XORs the codeword bit by bit with a scrambling sequence. Equation (2) gives the initialization seed of the scrambling sequence:

$${C_{{\text{init}}}}=\left\{ {UFrame\left[ {1:0} \right],USubFrame\left[ {3:0} \right],Frame\left[ {3:0} \right],CellID\left[ {5:0} \right]} \right\}$$

2

In (2), UFrame is 2 bits, USubFrame is 4 bits, and Frame is 4 bits; these give the frame numbers corresponding to the scrambling position. The 6-bit CellID is the cell index (ID). Scrambling is adopted to improve the stability of data transmission: it whitens the transmitted signal, with the interference applied at the transmitting end and removed at the receiving end [11].
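The seed packing of equation (2) and the bit-by-bit XOR can be sketched as follows; the pseudo-random sequence generator is a generic 16-bit LFSR stand-in (an assumption, since the document does not specify the actual generator):

```python
def c_init(uframe, usubframe, frame, cell_id):
    """Pack the fields of equation (2) into a 16-bit seed:
    {UFrame[1:0], USubFrame[3:0], Frame[3:0], CellID[5:0]}."""
    return ((uframe & 0x3) << 14) | ((usubframe & 0xF) << 10) \
           | ((frame & 0xF) << 6) | (cell_id & 0x3F)

def scramble(bits, seed):
    """XOR the codeword bit by bit with a pseudo-random sequence.
    The 16-bit LFSR below is an assumed placeholder generator; applying
    the function twice with the same seed restores the original bits."""
    state = seed if seed else 1        # an LFSR must not start all-zero
    out = []
    for b in bits:
        fb = ((state >> 15) ^ (state >> 13) ^ (state >> 12) ^ (state >> 10)) & 1
        state = ((state << 1) | fb) & 0xFFFF
        out.append(b ^ (state & 1))
    return out
```

The self-inverse property is what lets the receiver remove the interference that the transmitter applied.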

Local pilot generation and air-interface mapping are the components of downlink resource mapping in the U-NB-IoT system [12]. In the frequency-domain resource pattern in the figure below, 12 subcarriers are laid out in the frequency domain, and one time slot corresponds to seven Orthogonal Frequency Division Multiplexing (OFDM) symbols. The green unmarked blocks are data resource elements (REs), and the white blocks marked R0 are the REs occupied by the reference signal on antenna port 0. Figure 3 and Figure 4 show the time-domain reference signal mapping for a single transmit and receive antenna (1T1R) [13].

In the signal diagram, white R0 represents the REs occupied by the reference signal of antenna port 0, blue represents the REs occupied by the reference signal of antenna port 1, and the remaining green blocks are data REs. In the same time slot, for antenna port 0, the RE positions occupied by antenna port 1 are set to 0; for antenna port 1, the RE positions occupied by antenna port 0 are likewise set to 0. OFDM symbols are generated from the mapped symbols by IFFT processing. The sampling rate of each channel in the U-NB-IoT system is 1.98 MHz. In 10 ms of baseband data, a 128-point IFFT is performed for each symbol. Figure 5 and Figure 6 show the time-domain reference signal diagrams for two transmit antennas and one receive antenna (2T1R).
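A minimal sketch of the 128-point IFFT step, assuming the 12 subcarriers sit symmetrically around DC with the DC bin unused (the exact bin placement is an assumption, not taken from the document):

```python
import numpy as np

def ofdm_symbol(mapped, n_fft=128):
    """Map 12 modulated subcarrier symbols onto a 128-point IFFT grid and
    return the time-domain OFDM symbol. The placement (6 bins either side
    of DC, DC unused) is assumed; cyclic-prefix insertion is not shown."""
    assert len(mapped) == 12
    grid = np.zeros(n_fft, dtype=complex)
    grid[1:7] = mapped[6:]     # positive-frequency half
    grid[-6:] = mapped[:6]     # negative-frequency half
    return np.fft.ifft(grid)   # one time-domain OFDM symbol
```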

*2.1.2 Data channel algorithm processing*

Transmission time interval (TTI) repetition and Multiple-Input Multiple-Output (MIMO) precoding are the components of coverage-enhancement processing. The Physical Downlink Shared Channel (PDSCH) is the channel U-NB-IoT uses to carry service data [14].

In the U-NB-IoT system, 10 ms is not only the ordinary transmission time interval but also the basic resource scheduling unit of the system [15]. Owing to the characteristics of IoT equipment transmission, the data of each TTI generally cannot reach the SNR threshold for decoding when it arrives at the receiving end, so it is essential to improve the received signal-to-noise ratio [16]. TTI repetition sends the same data frame repeatedly, an integer number of times per the TTI principle, for example 1 time or 5 times. Its disadvantage is that it reduces the code rate and the data transmission rate; its advantage is that it improves the demodulation performance of the receiver, raises the received signal-to-noise ratio, and strengthens coverage.
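The SNR gain from repetition can be illustrated by averaging repeated AWGN copies at the receiver; the BPSK frame and noise level below are illustrative assumptions:

```python
import numpy as np

def combine_repetitions(frames):
    """Average repeated copies of the same TTI at the receiver. With n
    independent AWGN copies, noise power drops by a factor of n, i.e. the
    received SNR improves by 10*log10(n) dB."""
    return np.mean(np.asarray(frames), axis=0)

# Illustrative demo (hypothetical frame and noise level):
rng = np.random.default_rng(42)
tx = rng.choice([-1.0, 1.0], size=4096)                     # BPSK frame
rx = [tx + rng.normal(0.0, 1.0, tx.size) for _ in range(5)]  # 5 TTI repetitions
combined = combine_repetitions(rx)
# residual noise power of `combined` is roughly 1/5 of a single copy's
```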

MIMO precoding is performed after data modulation. However, OFDM technology has limitations, such as sensitivity to carrier frequency offset and a high peak-to-average power ratio [17]. On the other hand, OFDM has the advantage of reducing the complexity of receiver design. From the perspective of the receiver's signal-to-noise ratio, flat fading alone cannot improve the receiver SNR; hence, various diversity technologies are used to partially eliminate the disadvantages of fading channels. Figure 7 displays the composition of teaching resources in the IoT structure [18].

Space Frequency Block Code (SFBC) is a precoding technology used in MIMO systems. SFBC increases the redundancy of the signal so that diversity gain can be obtained [19]. Equation (3) is the expression of SFBC.

$$\left( {\begin{array}{*{20}{c}} {{D^{\left( 0 \right)}}\left( {2i} \right)} \\ {{D^{\left( 1 \right)}}\left( {2i} \right)} \end{array}{\text{ }}\begin{array}{*{20}{c}} {{D^{\left( 0 \right)}}\left( {2i+1} \right)} \\ {{D^{\left( 1 \right)}}\left( {2i+1} \right)} \end{array}} \right)=\frac{1}{{\sqrt 2 }}\left( {\begin{array}{*{20}{c}} {d\left( {2i} \right)}&{d\left( {2i+1} \right)} \\ { - conj\left( {d\left( {2i+1} \right)} \right)}&{conj\left( {d\left( {2i} \right)} \right)} \end{array}} \right)$$

3
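The Alamouti-style mapping of equation (3) can be sketched as follows (a minimal illustration; the antenna/subcarrier indexing convention is assumed):

```python
import numpy as np

def sfbc_encode(d):
    """SFBC mapping of equation (3): each symbol pair d(2i), d(2i+1) is
    spread over two antenna ports and two adjacent subcarriers, scaled by
    1/sqrt(2) so total transmit power equals the input symbol power."""
    d = np.asarray(d, dtype=complex)
    assert d.size % 2 == 0
    y0, y1 = np.empty_like(d), np.empty_like(d)   # antenna ports 0 and 1
    y0[0::2], y0[1::2] = d[0::2], d[1::2]         # D^(0): symbols as-is
    y1[0::2] = -np.conj(d[1::2])                  # D^(1)(2i)   = -conj(d(2i+1))
    y1[1::2] = np.conj(d[0::2])                   # D^(1)(2i+1) =  conj(d(2i))
    return y0 / np.sqrt(2), y1 / np.sqrt(2)
```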

2.2 IoT resource allocation algorithm based on deep reinforcement learning

*2.2.1 Resource allocation system model*

(a) QoS-oriented system model

The research objects are power allocation and the selection of uplink spectrum resources. The research scenario is massive IoT access by teachers and students with a variety of teacher-student services [20]. Figure 8 is the system model diagram.

The transmission power of user equipment (UE) is set to be positively correlated with the time slot, and the location and number of UEs change dynamically. Meanwhile, few frequency bands are available to IoT equipment, so the number of teachers and students considered in the algorithm is greater than the number of usable resource blocks (RBs). Co-channel interference between IoT devices is the main interference source of the system [21]. Given the above, equation (4) is the expression of the Signal to Interference plus Noise Ratio (SINR) of a teacher-student:

$$SIN{R_n}=\frac{{{G_{n,k}}{P_{n,k}}}}{{\sum\nolimits_{{j \in N\backslash \left\{ n \right\}}} {{x_{j,k}}{G_{j,k}}{P_{j,k}}} +{\sigma ^2}}}$$

4

*P**n,k* and *P**j,k* are the transmission powers with which teacher-student *n* and teacher-student *j* multiplex RB *k*. *G**n,k* and *G**j,k* are the channel gains of teacher-student *n* and teacher-student *j* to the base station on RB *k*. *x**n,k* is the indicator of teacher-student *n* on RB *k*: *x**n,k* = 1 means that the *n*-th teacher-student uses the *k*-th RB; otherwise *x**n,k* = 0. σ2 is the additive white Gaussian noise (AWGN) power on the RB. Equation (5) displays the channel capacity of a single teacher-student receiving at this SINR:

$${C_n}=B{\log _2}(1+SIN{R_n})$$

5
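Equations (4) and (5) can be computed directly; the nested-list `[user][rb]` indexing below is an illustrative convention, not from the paper:

```python
import math

def sinr(n, k, x, G, P, noise_var):
    """SINR of teacher-student n on RB k, per equation (4).
    x, G, P are indexed as [user][rb]; x[j][k] is the 0/1 RB indicator,
    so only users multiplexing RB k contribute interference."""
    interference = sum(x[j][k] * G[j][k] * P[j][k]
                       for j in range(len(G)) if j != n)
    return (G[n][k] * P[n][k]) / (interference + noise_var)

def capacity(B, sinr_n):
    """Channel capacity of equation (5): C_n = B * log2(1 + SINR_n)."""
    return B * math.log2(1 + sinr_n)
```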

(b) Utility function for QoE (Quality of experience)

From the beginning of wireless technology to today's 5G, and from the early connectionless services to today's interconnection, the scale of networks has kept increasing. Teachers' and students' needs have likewise changed from simple services to diverse, fast services, and the types of service requirements are even more varied today. From simple real-time message sending and receiving to intelligent control and complex personal LAN configuration, each service request places its own restrictions on the communication system. 5G is the foundation of the mobile Internet, creating value for teachers and students through a sustainable service environment; therefore, focusing on teachers and students is the most crucial service concept of 5G. Previously, network performance was reflected by quality of service (QoS), whose indicators are usually measured through hardware-oriented metrics such as packet loss rate and spectral efficiency [22]. However, as service categories grow, QoS struggles to accurately measure teacher- and student-centered communication networks. QoE refers to the satisfaction of teachers and students when using the Internet to complete teaching tasks; it reflects their feelings about the service and is adopted to measure service quality. Therefore, QoE indicators are used to optimize the management of wireless resources. Nevertheless, the QoE optimization problem is difficult to model and solve, because the modeling methodology is still immature and imperfect. Hence, at present, QoS parameters are still relied on to evaluate network performance and are mapped into a QoE function in order to optimize the resource allocation algorithm.

The measurement of teacher-student service experience adopts a utility function as the data processing tool. The mean opinion score (MOS) can show the distance between teachers' and students' expectations and the current network quality, and also captures their experienced sense of the network. The transmission rate is crucial for QoE, so this exploration focuses on how the rate feeds back into teacher-student satisfaction and uses the utility function to obtain the MOS evaluation.

There are three different types of services requested by teachers and students: best effort (BE) services, QoS-constrained services, and services with special requirements. BE refers to services without QoS restrictions; most teachers and students making such requests have no delay requirement. The second type of service request is QoS-limited and has requirements on the amount of resources. The last type is the service request with the highest complexity and QoS requirements. Given the above, the sigmoid function is adopted to express the MOS of the utility function, with the following expression:

$${U_c}=\frac{1}{{A+D{e^{ - E\left( {C - {C_0}} \right)}}}}+F$$

6

A, D, E and F are shape parameters: E affects the slope of the curve, while A, D and F affect the mapping range from the utility function to MOS. C0 represents the rate requirement, i.e., another form of the resource needs of different groups of teachers and students.
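The sigmoid utility of equation (6) can be sketched directly; the default parameter values below are assumed placeholders (not from the paper), chosen so that the MOS range is (0, 1):

```python
import math

def mos_utility(C, C0, A=1.0, D=1.0, E=1.0, F=0.0):
    """Sigmoid utility of equation (6): U_c = 1 / (A + D*exp(-E*(C - C0))) + F.
    With the assumed defaults A = D = 1, F = 0, the value rises monotonically
    from F toward 1/A + F and crosses 0.5 exactly at the rate requirement C0."""
    return 1.0 / (A + D * math.exp(-E * (C - C0))) + F
```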

The three service types above naturally differ, being based on different utility functions and different rate requirements. For BE service, there is a positive correlation between teacher-student experience and scheduled resources [23]; the utility is therefore a monotonically increasing convex function, which stabilizes once the rate reaches the required transmission-rate threshold. For QoS traffic, the utility function is also monotonically increasing, and since QoS requirements must be met, it grows rapidly: a teacher-student resource request has high priority when the resources obtained are below the QoS requirement, and low priority once the obtained resources exceed the QoS critical value, in which case the growth of the utility function is very slow. The last case is teachers and students with special QoS requirements: their satisfaction can only reach the maximum when the rate exceeds a certain value; otherwise, the satisfaction is 0. Figure 9 is the schematic diagram of the college physical education resource system for IoT.

(c) Optimization model

The QoE optimization model is constructed from the utility function; constraints and the optimization objective constitute the model. The minimum QoS guarantee forms the lower bound of the optimization constraints. Equation (7) shows the resource allocation model of the n-th teacher-student:

$$\begin{gathered} {\text{ }}\mathop {\hbox{max} }\limits_{{p,x}} {\text{ }}{{\text{U}}_n}\left( C \right) \hfill \\ s.t.{\text{ C1 }}{{\text{U}}_n}\left( C \right){\text{>}}{U_{\hbox{min} }} \hfill \\ {\text{ C2 }}\sum\limits_{{k=1}}^{K} {{x_{n,k}}{p_{n,k}}{\text{<}}{P_{n,\hbox{max} }}} \hfill \\ {\text{ C3 }}\sum\limits_{{k=1}}^{K} {{x_{n,k}}} =1,\;{x_{n,k}} \in \left\{ {0,1} \right\} \hfill \\ \end{gathered}$$

7

C1 guarantees that each teacher-student is not below the utility function threshold, and C2 limits the maximum transmission power of each IoT teacher-student. C3 is the condition that each teacher-student can select only one RB, although multiple teachers and students may select the same RB.

*2.2.2 Resource allocation algorithm based on deep reinforcement learning*

Based on the research scenario, a discrete-time Markov decision process with a continuous action space can be adopted to represent the underlying optimal stochastic control problems of power allocation and teacher-student scheduling. The teacher-student terminals cannot obtain accurate transition information because of the complex changes of the external environment. In addition, it is difficult for the previous method in equation (7) above to obtain the optimal solution with low complexity. Therefore, the resource allocation of massive IoT devices under a deep reinforcement learning algorithm is studied below.

(a) Reinforcement learning model

In the discrete-time case, the optimization problem under study is formed into a conventional reinforcement learning problem through the interaction between the environment E and the intelligent module. In each time slot t, the intelligent module receives an observation, takes a timely action, and obtains a real-time reward Rt. The state space S is a product of the resource allocation scenario and represents the current environmental state seen by the intelligent module. In the reinforcement learning model for massive teachers and students, the state covers the RB access pointers and the data transmission pointer. St represents the state in each time period, and its expression is as in (8):

$${S_t}=\left[ {{e_1}\left( t \right),{e_2}\left( t \right), \ldots ,{e_K}\left( t \right),I\left( t \right)} \right]$$

8

In (8), ek(t) describes channel access, namely the number of teachers and students occupying RB k. I(t) is the data transfer pointer: if the value of the teacher-student utility function is less than the minimum threshold, the transmission fails and I(t) = 0; otherwise, I(t) = 1. The intelligent module's action in time slot t is defined as:

$${a_t}=\left[ {{c_t},{p_t}} \right]$$

9

at consists of the RB index and the transmission power with which the intelligent module's equipment multiplexes the current RB. The reward is an evaluation, specifically an evaluation of the current state and behavior, according to which the intelligent module can adjust itself. The reward here can be defined from the utility function in equation (6). The reward Rt = r(st, at) is expressed as follows:

$${r_t}\left( {{s_t},{a_t}} \right)=\left\{ {\begin{array}{*{20}{c}} {{u_i}\left( t \right),{\text{ if }}{u_i}\left( t \right) \geqslant {u_{i,\hbox{min} }}{\text{ and }}{p_{i,k}} \leqslant {p_{i,\hbox{max} }}} \\ {C,{\text{ otherwise}}} \end{array}} \right.$$

10

In (10), C is a constant negative reward value adopted to punish improper action selection. For example, since the system model imposes a maximum power limit on the intelligent module, the intelligent module will be punished if it selects a transmission power greater than the threshold.
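The reward of equation (10) reduces to a simple guard on the two constraints; the penalty value -1.0 below is an assumed placeholder for the constant C:

```python
def reward(u_t, u_min, p, p_max, C=-1.0):
    """Reward of equation (10): return the utility value when both the
    minimum-utility and maximum-power constraints hold, otherwise the
    constant negative penalty C (the value -1.0 is an assumed placeholder)."""
    if u_t >= u_min and p <= p_max:
        return u_t
    return C
```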

(b) Deep reinforcement learning algorithm

A nonlinear function approximator is introduced into the learning algorithm to better deal with the continuous action space and the multi-dimensional state space; for the frequency resources of the unlicensed band, ordinary linear function approximators cannot handle these problems properly. In the algorithm framework of Figure 10, the teacher-student integration framework is the main structure of the algorithm. The student network takes the state St as input and outputs an action. The environment is driven by the action taken by the intelligent module and feeds back the new state St+1. Finally, the loss function is calculated from these data. Equation (11) displays the expression of the loss function.

$$L={\left( {{r_i}+\gamma Q^{\prime}({s_{i+1}},\pi ^{\prime}({s_{i+1}}\left| {{\theta ^{\pi ^{\prime}}}} \right.)\left| {{\theta ^{Q^{\prime}}}} \right.) - Q({s_i},{a_i}\left| {{\theta ^Q}} \right.)} \right)^2}$$

11
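Equation (11) is a squared temporal-difference error; it can be sketched with the network evaluations abstracted into scalars (the discount factor 0.99 is an assumed placeholder):

```python
def critic_loss(r, q_target_next, q_main, gamma=0.99):
    """Squared TD error of equation (11). q_target_next stands for
    Q'(s_{i+1}, pi'(s_{i+1} | theta^pi') | theta^Q') as evaluated by the
    target networks, and q_main for Q(s_i, a_i | theta^Q)."""
    y = r + gamma * q_target_next          # bootstrapped target value
    return (y - q_main) ** 2               # loss driving the Q-network update
```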

The way parameters are updated in the teacher network differs from that in the student network: the teacher updates its parameters by gradient descent, while the student updates its parameters by gradient ascent. Experience replay is adopted for data generalization, implemented by the experience pool for data storage shown in the figure. To help the algorithm converge, replaying experience from the pool is used to break the correlation among different samples. Figure 10 shows the basic framework of the algorithm.

Figure 10 displays that a Main Net and a Target Net appear in both the teacher network and the student network. The same network structure in Figure 10 is used to construct target and evaluation networks with identical structure but different parameters, and the evaluation network's parameters are assigned to the target network after a certain time. The algorithm flow chart in the figure below summarizes the procedures by which each module implements the deep teacher-student integration framework; the algorithm updates the target network at every step. Figure 11 is the flow chart of the resource allocation algorithm based on the integration of teachers and students.
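The experience pool and the target-network refresh described above can be sketched as follows; the capacity and the update coefficient are assumed placeholders, and the parameter lists stand in for real network weights:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool: storing transitions and sampling random mini-batches
    breaks the correlation between consecutive samples, aiding convergence.
    The default capacity is an assumed placeholder."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.buf), batch_size)

def update_target(target, main, tau=1.0):
    """Assign Main Net parameters to Target Net. tau=1.0 gives the hard
    per-step copy described in the text; tau<1 would give a soft (Polyak)
    update instead."""
    return [tau * m + (1 - tau) * t for t, m in zip(target, main)]
```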