DQN-based mobile edge computing for smart Internet of vehicle

In this paper, we investigate a multiuser mobile edge computing (MEC)-aided smart Internet of vehicle (IoV) network, where one edge server can help accomplish the intensive calculating tasks from the vehicular users. For the MEC networks, most existing works mainly focus on minimizing the system latency to guarantee the user’s quality of service (QoS) through designing some offloading strategies, which, however, fail to consider the pricing from the server and hence fail to take into account the budget constraint from the users. To address this issue, we jointly incorporate the budget constraint into the system design of the MEC-based IoV networks and then propose a joint deep reinforcement learning (DRL) approach combined with the convex optimization algorithm. Specifically, a deep Q-network (DQN) is firstly used to make the offloading decision, and then, the Lagrange multiplier method is employed to allocate the calculating capability of the server to multiple users. Simulations are finally presented to demonstrate that the proposed schemes outperform the conventional ones. In particular, the proposed scheme can effectively reduce the system latency by up to 56% compared to the conventional schemes.

To address the above issues of cloud computing, a novel communication and computation paradigm named mobile edge computing (MEC) was proposed. By deploying calculating access points (CAPs) at the network edge, the calculating tasks can be offloaded to the neighboring CAPs through reasonable task partition and offloading in order to achieve low latency and energy consumption [15,16]. To achieve the same goal, the authors in [17] studied a multiuser MEC network, where a deep Q-network (DQN)-based strategy was proposed for the task offloading. With the same reinforcement learning-based method, the authors in [18] focused on the study of channels to obtain better performance. Moreover, the system cost was studied in [19,20] in terms of a combination of energy consumption and latency, where a joint optimization method of offloading decision and resource allocation was proposed to enhance the network performance. Furthermore, a deep deterministic policy gradient (DDPG) method was proposed to resolve the offloading strategy design in the MEC system [21], where a long-term optimization was used. The authors in [22] considered parked vehicles as the computing service providers and proposed a dynamic pricing strategy in order to maximize the revenue of the computing service providers while minimizing the energy consumption of smart user equipment (UEs). It has been shown in [23-25] that the allocation of channel resources could be optimized to improve the performance of networks, and the authors in [26] enhanced the system performance through a dynamic game model.
The above literature review shows that most of the existing works attempted to optimize the system performance of MEC networks through offloading strategy design and resource allocation. To the best of our knowledge, few works have considered the pricing from the server and taken into account the budget constraint from the users. In practice, the pricing from the server may affect the system performance of the MEC networks, as it can affect the calculating capability allocated to the users by the CAPs. In addition, the budget of the users may also affect the network performance, as some users may not have enough budget to buy the computing resources at the CAPs, so that their intensive calculating tasks have to be computed locally. For these reasons, we jointly incorporate the budget constraint into the system design of the MEC-based IoV networks in this paper.
In this paper, we investigate a multiuser MEC-aided smart IoV network, where one edge server can help accomplish the intensive calculating tasks from the vehicular users. For the MEC networks, most existing works mainly focus on minimizing the system latency to guarantee the user's quality of service (QoS) through designing some offloading strategies, which, however, fail to consider the pricing from the server and hence fail to take into account the budget constraint from the users. To address this issue, we jointly incorporate the budget constraint into the system design of the MEC-based IoV networks and then propose a joint deep reinforcement learning (DRL) approach combined with the convex optimization algorithm. Specifically, a deep Q-network (DQN) is first used to make the offloading decision, and then, the Lagrange multiplier method is employed to allocate the calculating capability of the server to multiple users. Simulations are finally presented to demonstrate that the proposed schemes outperform the conventional ones. In particular, the proposed scheme can effectively reduce the system latency by up to 56% compared to the conventional schemes. The main contributions of this paper are as follows:
• We study a MEC network for IoV, where we not only consider the resource allocation but also combine the charging rules with the users' budget constraints to optimize the performance of the MEC.
• We propose a DQN and convex optimization algorithm, in which a convex optimization method is integrated into the DQN framework. This algorithm not only has the advantages of reinforcement learning, but also uses the convex optimization method to reduce the complexity of the algorithm and help convergence.
• Simulations show that the proposed DQN and convex optimization algorithm can outperform conventional methods and can effectively reduce the system latency by up to 56%.
The rest of this paper is organized as follows. After the Introduction, we discuss the system model of the MEC-based IoV network and then present the optimization problem formulation in Sec. 2. After that, we give the DQN and convex optimization algorithm-based method to solve the optimization problem in Sec. 3. We further provide some simulations and discussions in Sec. 4 and, finally, draw some conclusions in Sec. 5.

In the considered network, the CAP can help compute some parts of the tasks with its much more powerful calculating capability, while the other parts can be computed locally. Moreover, when the tasks are offloaded to the CAP for computing, the CAP will charge users according to the amount of offloaded tasks and the calculating capability. The following subsections introduce the local computing model, the offloading model, and the purchase model, respectively. After that, we give the system optimization problem.

Local computing model
As mentioned before, some parts of the tasks can be computed locally, which incurs a local calculating latency.

Fig. 1 System model of the considered vehicular MEC

For the offloaded part, the CAP charges the users, and the calculating fee is based on the calculating capability that the CAP allocates to the users. Hence, the payment of user $u_m$ for offloading consists of two parts, priced by $\eta_l$, the price per bit of the task, and $\eta_f$, the price paid for the CAP capability. As the budget of each user is limited in practice, the payment of user $u_m$ cannot exceed $U_m^{\max}$, the maximum budget of user $u_m$.
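The purchase model can be illustrated with a minimal sketch. The exact payment expression is not reproduced in the text above, so the form used here (a per-bit fee on the offloaded bits plus a fee proportional to the allocated CAP capability) is an assumption based on the descriptions of $\eta_l$ and $\eta_f$. The function and parameter names are hypothetical.

```python
def offloading_payment(alpha, task_bits, cap_capability, eta_l, eta_f):
    """Assumed payment of one user: per-bit fee on offloaded bits plus capability fee.

    alpha          -- offloading ratio of the user, in [0, 1]
    task_bits      -- task size in bits
    cap_capability -- calculating capability allocated by the CAP
    eta_l, eta_f   -- price per offloaded bit and price per unit of capability
    """
    offloaded_bits = alpha * task_bits
    return eta_l * offloaded_bits + eta_f * cap_capability


def within_budget(payment, max_budget):
    """Budget constraint: the payment must not exceed the user's maximum budget."""
    return payment <= max_budget


# Toy numbers chosen only for illustration.
payment = offloading_payment(alpha=0.5, task_bits=100, cap_capability=10,
                             eta_l=2, eta_f=3)
print(payment, within_budget(payment, max_budget=150))
```

With these toy numbers the user offloads 50 bits and buys 10 units of capability, so the assumed payment is 2 × 50 + 3 × 10 = 130, which fits a budget of 150.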

Problem formulation
In practice, the vehicular MEC network involves latency-critical tasks because of the dynamic changes in the vehicles [30], and the system needs to process the latency-sensitive tasks from the users as quickly as possible due to the movement of the vehicular users. Therefore, the optimization problem of the network is to minimize the system latency, which is formulated as problem P1 in (10) under three constraints. $C_1$ is the constraint on the offloading ratio, which indicates what portion of its task user $u_m$ offloads to the CAP. Constraint $C_2$ requires that the calculating capability distributed by the server to the users does not surpass the total calculating capability. Constraint $C_3$ requires that each user's payment for offloading meets the budget constraint. From (10), the offloading ratio and calculating capability can be optimized to minimize the network latency while meeting the budget requirement. However, the optimization problem is complicated and hard to solve by conventional convex optimization methods. Therefore, we propose a DQN and convex optimization algorithm to solve the problem. All notations used in this section are summarized in Table 1.
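The three constraints of P1 can be checked for a candidate solution with a short sketch. It rests on assumptions: the offloading ratio is taken to lie in [0, 1], the capability constraint is taken as a sum over all users, and payments are supplied by the caller. All names are hypothetical.

```python
def is_feasible(alphas, capabilities, total_capability, payments, budgets):
    """Check a candidate (offloading ratios, capability allocation) against C1 to C3."""
    # C1: every offloading ratio is a valid fraction of the task (assumed range [0, 1]).
    c1 = all(0.0 <= a <= 1.0 for a in alphas)
    # C2: non-negative allocations that together stay within the CAP's total capability.
    c2 = all(f >= 0 for f in capabilities) and sum(capabilities) <= total_capability
    # C3: every user's payment stays within that user's maximum budget.
    c3 = all(p <= b for p, b in zip(payments, budgets))
    return c1 and c2 and c3


# Toy example with three users.
print(is_feasible(alphas=[0.2, 0.5, 1.0],
                  capabilities=[1e8, 2e8, 2e8],
                  total_capability=5e8,
                  payments=[100, 150, 200],
                  budgets=[210, 210, 210]))
```

A check of this kind is useful as a guard around any allocation routine, since a solution that violates one of the constraints is not a valid answer to P1.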

DQN and convex optimization algorithm
This section introduces a DQN and convex optimization algorithm for P1 in (10). The proposed algorithm avoids the complicated action space that arises when the DQN alone handles the whole problem, which would make exploration extremely costly and degrade the final training result. In the following subsections, we first describe how to obtain the offloading decision through the DQN and then give the process of the resource allocation through the convex optimization method.

DQN-based offloading decision
To solve the problem P1, we propose the DQN and convex optimization algorithm to obtain the offloading decision and the calculating capability allocation. As shown in Fig. 2, the proposed algorithm is composed of a DQN-based method and a convex optimization method. Specifically, we first employ the DQN-based method to obtain the offloading decision. After the offloading decision is obtained, the convex optimization method is used to obtain the allocation decision of calculating capability. We can model the problem of offloading decisions as a Markov decision process (MDP). In the MDP, the agent first gets the state $s_t \in S$ from the environment at time slot $t$ and then chooses an action $a_t$ according to policy $\pi$. After that, the agent performs $a_t$ in the environment, causing the state of the environment to transit from $s_t$ to $s_{t+1}$, and the agent gets $r_t$ as a reward. Specifically, we define the state space as $S = \{\alpha\}$, where $\alpha = \{\alpha_1(t), \alpha_2(t), \ldots, \alpha_M(t)\}$ is the offloading ratio of users at time slot $t$, and the action space is $A = \{\rho_1, \ldots, \rho_m, \ldots, \rho_M, \rho_1^*, \ldots, \rho_m^*, \ldots, \rho_M^*\}$, where $\rho_m = -\delta$ and $\rho_m^* = +\delta$ are actions to adjust the offloading ratio under the constraint $C_1$. Moreover, the reward of the offloading decision problem is related to the system latency [31,32] and is defined through two positive values $\tau_1$ and $\tau_2$ with $\tau_1 > \tau_2$. Furthermore, we evaluate the policy $\pi$ through the Q function $Q^{\pi}(s, a)$, which represents the accumulative rewards from an action $a$ acting on state $s$. According to the Q function, the best policy $\pi^*$ is the one that maximizes $Q^{\pi}(s, a)$, and the agent observes the environment state $s$, which helps in choosing an action through the best policy $\pi^*$. Based on the above processing, we adopt a deep Q-network (DQN), which uses deep neural networks to approximate the optimal Q function. There are two neural networks in the DQN: the actor-network and the target-network.
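The state transition of this MDP can be sketched as follows. The state is the vector of offloading ratios, and each action moves one user's ratio by $\delta$ down or up. Two points are assumptions rather than details given in the text: the ordering of actions (the first $M$ indices decrease, the last $M$ increase) and the use of clipping to [0, 1] to respect $C_1$. The reward is omitted because its expression is not reproduced above.

```python
def apply_action(alphas, action_index, delta):
    """Return the next state after adjusting one user's offloading ratio by +/- delta.

    Actions 0..M-1 decrease the ratio of user 0..M-1 (rho_m = -delta);
    actions M..2M-1 increase the ratio of user 0..M-1 (rho_m^* = +delta).
    """
    num_users = len(alphas)
    if not 0 <= action_index < 2 * num_users:
        raise ValueError("action_index must be in [0, 2M)")
    user = action_index % num_users
    step = -delta if action_index < num_users else delta
    next_alphas = list(alphas)
    # Clipping keeps the ratio inside [0, 1] (assumed way of enforcing C1).
    next_alphas[user] = min(1.0, max(0.0, next_alphas[user] + step))
    return next_alphas


state = [0.5, 0.5]
print(apply_action(state, action_index=0, delta=0.25))  # user 0 offloads less
print(apply_action(state, action_index=3, delta=0.25))  # user 1 offloads more
```

With $M$ users this gives $2M$ discrete actions, which keeps the DQN output layer small compared with a design that also had to choose the capability allocation.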
The role of the actor-network is to predict the action $a_t \in A$ by inputting the state $s_t \in S$. Generally, $Q(s, a)$ is used to denote $Q^{\pi^*}(s, a)$. To avoid the offloading optimization problem falling into a local optimal value, we obtain the action by the ε-greedy policy applied to $Q(s_t, a; \theta)$, where $\theta$ denotes the weights of the actor-network: a random action is taken with probability ε, and the action with the largest Q value is taken otherwise. In order to better approximate the Q function, we adopt the temporal difference (TD) approach in DQN,

$$Q(s_t, a_t; \theta) = r_t + \gamma \max_{a \in A} Q(s_{t+1}, a; \theta^-),$$

where $\gamma$ is the discount factor and $\theta^-$ denotes the weights of the target-network. To obtain the TD target of DQN, we add a target-network as a copy of the actor-network, which is reset to the actor-network every $T_u$ time slots. The loss function follows [33-35]. We use a back-propagation (BP) algorithm to update the actor-network every $T_l$ time slots. To break the correlation between the data created at consecutive time slots, we adopt an experience replay (ER) and mini-batch sampling. A transition $(s_t, a_t, r_t, s_{t+1})$ is stored into the ER at each time slot, and a mini-batch of transitions is randomly sampled from the ER to update the actor-network with the BP algorithm every $T_l$ time slots.
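The three ingredients described in this subsection (ε-greedy action selection, the TD target computed with the target-network, and the experience replay) can be sketched without a deep learning library. This is a minimal illustration in which Q values are plain lists standing in for network outputs. The names `epsilon_greedy`, `td_target` and `ReplayBuffer` are hypothetical, and the neural networks and the BP update are deliberately left out.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values_target, gamma):
    """TD target r_t + gamma * max_a Q(s_{t+1}, a), using target-network outputs."""
    return reward + gamma * max(next_q_values_target)


class ReplayBuffer:
    """Fixed-size experience replay storing (s_t, a_t, r_t, s_{t+1}) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform random sampling breaks the correlation between consecutive slots.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


buffer = ReplayBuffer(capacity=20000)
buffer.store([0.5, 0.5], 0, -1.0, [0.4, 0.5])
action = epsilon_greedy([0.1, 0.9, 0.3, 0.2], epsilon=0.1)
target = td_target(reward=-1.0, next_q_values_target=[0.2, 0.6], gamma=0.9)
print(action, target, len(buffer.sample(32)))
```

In a full implementation the squared difference between `td_target(...)` and the actor-network's $Q(s_t, a_t; \theta)$, averaged over the sampled mini-batch, would serve as the loss for the BP update.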

Convex optimization-based resource allocation
After obtaining the offloading decision $\alpha_m$ by the DQN, the problem P1 can be transformed into the resource allocation problem P2 in (17). From P2, we can observe that the minimization problem is affected by the total calculating capability at the server and the budget of the user. The feasibility of the solution of the whole optimization P1 highly depends on the training and the design of the DQN. As a part of the designed DQN, the solution of P2 has a limited impact on the training of the DQN. Moreover, a powerful DQN can still get a reliable and feasible solution to the whole problem even if P2 is not solved optimally. Therefore, it is worthwhile to find a solution with a lower complexity for P2. Hence, we first limit the capability allocated to users according to their budget, and then, based on this limit, we obtain the solution constrained by the total capability at the server shared by all users. To this end, we first relax the constraint $C_2$ and transform P2 into a convex problem P3. Then, we adopt the Lagrange multiplier method to solve the problem P3 and write the Lagrange function $L(f_m, \lambda)$ in (19), where $\lambda > 0$ is a Lagrange multiplier. From (19), we set the first partial derivatives of $L(f_m, \lambda)$ with respect to $f_m$ and $\lambda$ to zero. By combining and solving the resulting two equations, we obtain the optimal solution of P3 in (22). After obtaining the optimal solution of the relaxed problem P3, we further consider the constraint $C_2$ and give a feasible solution for P2. According to (8) and (9), we can obtain (23). By jointly considering (22) and (23), we can finally obtain a feasible solution of P2. From the above description, we can summarize the procedure of the proposed DQN and convex optimization algorithm in Algorithm 1.
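The two-step allocation can be sketched under explicit assumptions, since the objective of P3 is not reproduced in the text above. The sketch assumes that the relaxed problem minimizes the sum of CAP computing latencies $\sum_m w_m / f_m$ subject to $\sum_m f_m = F$, where $w_m$ is the offloaded workload of user $u_m$ in CPU cycles. Setting the derivatives of $\sum_m w_m / f_m + \lambda (\sum_m f_m - F)$ to zero then gives $f_m = F \sqrt{w_m} / \sum_k \sqrt{w_k}$. The budget cap assumes the payment form $\eta_l \alpha_m l_m + \eta_f f_m$. The paper's actual expressions (22) to (24) may differ, and all names are hypothetical.

```python
import math


def budget_cap(alpha, task_bits, max_budget, eta_l, eta_f):
    """Largest capability a user can afford under the assumed payment form."""
    return max(0.0, (max_budget - eta_l * alpha * task_bits) / eta_f)


def allocate_capability(workloads, total_capability, budget_caps):
    """Lagrange solution of the assumed relaxed problem, then capped by each budget."""
    roots = [math.sqrt(w) for w in workloads]
    root_sum = sum(roots)
    if root_sum == 0:
        return [0.0 for _ in workloads]  # nothing is offloaded
    # Stationarity of the Lagrangian gives f_m proportional to sqrt(w_m).
    relaxed = [total_capability * r / root_sum for r in roots]
    # Each user keeps the smaller of the relaxed optimum and the affordable amount.
    return [min(f, cap) for f, cap in zip(relaxed, budget_caps)]


caps = [budget_cap(0.5, 100, 130, eta_l=2, eta_f=3), 1.5]
print(allocate_capability(workloads=[1.0, 4.0], total_capability=3.0, budget_caps=caps))
```

In the toy call the relaxed optimum is 1.0 and 2.0 units, and the second user's budget only covers 1.5, so that user is capped. Taking the minimum never increases any allocation, so the total capability limit still holds after capping.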

Some discussions on the system design and optimization
Besides the above works, one should note that there may exist some malicious vehicles which may overhear the confidential messages during data offloading. In this case, some privacy protection methods such as encryption [36] and physical-layer secure schemes [37] should be used to enhance the security of the considered IoV networks. Moreover, some novel wireless techniques should be incorporated into the considered system, such as advanced offloading strategies [38], relaying techniques [39-41], and UAV [42,43]. Furthermore, some intelligent algorithms should be developed to allocate the system resources more intelligently, such as deep learning [44-46], deep reinforcement learning [47], and federated learning [48].

Results and discussion
This section shows the performance of the proposed DQN and convex optimization algorithm in the vehicular MEC network through simulations. Channels used in our work obey Rayleigh flat fading, and the variance of AWGN is 0.01. The task sizes of users are $l_m = (100 + 5m)$ Mb, and the transmit power of each user is randomly set to either 2 W or 3 W. Moreover, the calculating capability of users is set to $2 \times 10^8$ cycle/sec, and the number of required cycles per bit of data for computing is set as $C = 40$. All results given in this paper are the average of 5 experimental results. As to the network structure, we implement both the target-network and actor-network of DQN through two hidden layers with 256 and 64 nodes and employ the BP algorithm as the updater. The values of $T_l$ and $T_u$ in the DQN are set to 50 and 100, respectively. The size of experience replay is set to 20000, and the mini-batch size of sampling is set to 32. Figure 3 plots the latency of the devised scheme versus the number of episodes, where the number of users is set to 3, the budget of users is set to 210, the total calculating capability of CAP is set to $5 \times 10^8$ cycle/sec, and the bandwidth of a wireless link is 1 MHz. To compare with the proposed DQN and convex optimization scheme, we plot the performance of two other schemes. One is the All-local scheme where users compute their tasks locally, and the other is the ALL-CAP scheme where users offload all tasks to the CAP and obtain the calculating capability from CAP with the maximum budget. From this figure, we can see that the system latency of the proposed scheme gradually decreases when the episode varies from 0 to 20, and it converges to about 10 after 20 episodes. In contrast, the system latency of All-local scheme and ALL-CAP scheme remains unchanged at the level of 23 and 17.5, respectively.
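As a sanity check on these parameters, the All-local level can be reproduced with a short calculation. The sketch assumes that 1 Mb equals $10^6$ bits, that users are indexed from $m = 1$, that local latency is task bits times cycles per bit divided by the local calculating capability, and that the system latency is the largest per-user latency. None of these are stated explicitly above, and the function names are hypothetical.

```python
def local_latency(task_bits, cycles_per_bit, local_capability):
    """Time in seconds to compute a whole task locally (assumed latency model)."""
    return task_bits * cycles_per_bit / local_capability


def all_local_system_latency(num_users, cycles_per_bit=40, local_capability=2e8):
    """Assumed system latency of the All-local scheme: the slowest user's latency."""
    task_bits = [(100 + 5 * m) * 1e6 for m in range(1, num_users + 1)]
    return max(local_latency(b, cycles_per_bit, local_capability) for b in task_bits)


print(all_local_system_latency(3))
```

Under these assumptions the three users need 21, 22 and 23 seconds, so the value for three users is 23. This agrees with the All-local level reported for Figure 3, which suggests, but does not confirm, that the system latency is measured as the maximum over users.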
The fast convergence of the proposed scheme indicates that it can obtain an effective offloading decision and calculating capability allocation. Moreover, the proposed scheme has the best performance among the three plotted schemes. Specifically, the system latency of the proposed scheme is about 56% and 10% lower than that of All-local scheme and ALL-CAP scheme. Hence, the proposed scheme can not only converge rapidly but also outperform the other two schemes. Figure 4 demonstrates the convergence of the proposed scheme with different numbers of users, where the total calculating capability at the server, the budget of users, and the bandwidth of a wireless link are set to $5 \times 10^8$ cycle/sec, 210, and 5 MHz, respectively. The figure shows that for the different numbers of users, the system latency of the devised scheme decreases in the first 30 episodes, and it converges to a low latency after 30 episodes. This result indicates that the proposed scheme can converge under various numbers of users. Moreover, the converged latency increases with a larger M. This is because increasing the number of users causes more calculating tasks, which results in larger system latency. This further illustrates that the proposed scheme obtains a reasonable offloading decision and calculating capability allocation for different numbers of vehicles. Figure 5 shows the effect of wireless bandwidth on the system latency, where the total calculating capability at the server is $5 \times 10^8$ cycle/sec, the budget of users is set to 130, and the bandwidth varies from 1 MHz to 5 MHz. This figure shows that the system latency of the proposed scheme and ALL-CAP scheme drops sharply when the bandwidth varies from 1 MHz to 3 MHz and becomes steady when W > 3 MHz, while the system latency of All-local scheme remains unchanged with various values of bandwidth.
The reason for this trend is that a larger bandwidth can reduce the transmission latency of the proposed scheme and ALL-CAP scheme, while the tasks of All-local scheme are not transmitted to the CAP. Moreover, the proposed scheme can obtain a lower latency for various values of bandwidth compared with the other two schemes. Specifically, when the bandwidth is 5 MHz, the latency of the proposed scheme is about 70% and 40% lower than that of All-local and ALL-CAP scheme. Furthermore, for the three schemes, the system latency with M = 3 is always lower than that with M = 7. This is because a larger number of tasks is produced due to the increasing number of users, which causes more communication and computation latency in the network. These results illustrate that the proposed scheme can outperform the other two schemes. Figure 6 reveals the effect of the total calculating capability of the CAP on the system latency of the proposed scheme under different bandwidths, where the number of users is set to 3, the budget of users is set to 210, and the total calculating capability at the server varies from $1 \times 10^8$ cycle/sec to $5 \times 10^8$ cycle/sec. This figure illustrates that the system latency of the devised scheme decreases when the total calculating capability at the CAP increases. This is because the CAP with a larger calculating capability can help users compute the tasks quickly, which leads to a reduction in the system latency. Moreover, with the variation of the total calculating capability from $1 \times 10^8$ cycle/sec to $5 \times 10^8$ cycle/sec, the performance improvement due to the enhanced calculating capability at the CAP becomes larger when the bandwidth increases. Specifically, the system latency drops by about 58% from $F = 1 \times 10^8$ cycle/sec to $F = 5 \times 10^8$ cycle/sec at W = 5 MHz, and it drops by 20% at W = 1 MHz.
This is because the bandwidth affects the transmission latency and the total calculating capability affects the calculating latency at the CAP. These two types of latency both contribute to the system latency. Figure 7 demonstrates the effect of the user budget on the system latency, where the total calculating capability at the CAP is set to $5 \times 10^8$ cycle/sec, the bandwidth is set to 5 MHz, and the budget of users varies from 70 to 150. This figure shows that the system latency of the proposed scheme first decreases as the budget grows from 70 to 110, and then it becomes steady when $U_m^{\max} \geq 110$, while that of All-local scheme is unchanged. This is because when the user budget is small, the calculating capability at the CAP allocated to the users is limited, which results in high calculating latency in the system. Moreover, the calculating capability allocated to each user decreases as the number of users increases. This also verifies that the proposed scheme can make an effective offloading decision and calculating capability allocation compared with the All-local scheme. Figure 8 illustrates the effect of the total CAP calculating capability and the user budget on the system latency, where the number of users is set to 3, the bandwidth is 1 MHz, the total calculating capability of the server varies from $1 \times 10^8$ cycle/sec to $5 \times 10^8$ cycle/sec, and the budget of users varies from 30 to 190. Observing this figure, we can see that the system latency of the proposed scheme is marginally affected by the calculating capability at the CAP when the user budget is small. The reason is that the budget limits the allocation of calculating capability to the vehicles, no matter how the total calculating capability changes. Similarly, when the total calculating capability at the server is small, the system latency is also marginally affected by the user budget, no matter how the budget changes.
This is because the CAP does not have enough calculating capability to help the users compute the tasks. On the contrary, when the budget and total calculating capability at the CAP are both large, the system latency can be reduced to a small value. This is because when the CAP has enough calculating capability and users are rich enough, the allocation of calculating resources is no longer constrained, which makes a low system latency. All the above phenomena show that the system latency is limited by both the CAP calculating capability and the budget of users, and also indicate that the devised scheme in this paper has good performance in reducing system latency.

Conclusion
This article studied a vehicular MEC network, in which the CAP with limited calculating capability could receive parts of the tasks from users and process them faster, which can reduce the system latency. We first formulated the optimization problem of latency by considering the limited calculating capability at the CAP and the budget of users. Then, we proposed a DQN and convex optimization algorithm to solve the problem. Simulations were finally conducted to show that the devised algorithm performs better than traditional methods and that it is robust to practical conditions of the vehicular MEC networks. As to future work, we consider that in the MEC networks, multiple CAPs can provide more options for users to offload their tasks, which can further reduce the calculating latency. Moreover, the computing and the transmission of tasks cause energy consumption, and the energy consumption is another important performance metric of the MEC networks in some scenarios. Therefore, we will consider multiple CAPs for the MEC networks and study the energy consumption in future work.