A reinforcement learning-based computing offloading and resource allocation scheme in F-RAN

This paper investigates a computing offloading policy and the allocation of computational resources for multiple user equipments (UEs) in device-to-device (D2D)-aided fog radio access networks (F-RANs). Because the wireless environment changes dynamically and the channel state information (CSI) is difficult to predict or know exactly, we formulate task offloading and resource optimization as a mixed-integer nonlinear programming problem that maximizes the total utility of all UEs. Because the formulated problem is non-convex, we decouple it into two phases. First, a centralized deep reinforcement learning (DRL) algorithm called dueling deep Q-network (DDQN) is used to obtain the most suitable offloading mode for each UE. In particular, a pre-processing procedure is adopted to reduce the complexity of the DDQN-based offloading scheme. Then, a distributed deep Q-network (DQN) algorithm, based on the training result of the DDQN algorithm, is proposed to allocate the appropriate computational resource to each UE. Combining these two phases, the optimal offloading policy and resource allocation for each UE are finally obtained. Simulation results demonstrate the performance gains of the proposed scheme over existing baseline schemes.

state-action pair increases, the scale of the Q-table becomes too large to manage and query, which makes it difficult to obtain the optimal solution efficiently and thus directly affects the QoS of UEs. Therefore, a deep reinforcement learning (DRL) algorithm called deep Q-network (DQN) is proposed in Ref. [12] to obtain the optimal offloading policy and resource allocation. It uses a deep neural network to estimate the Q value, which is more efficient for problems involving massive data. Studies such as [13][14][15] also show that the DQN algorithm achieves better performance on offloading decision and resource allocation problems.
However, a potential drawback of the DQN algorithm is that, in some states, the value function is independent of the selected action. To overcome this defect, an advanced DRL algorithm called dueling deep Q-network (DDQN) has been suggested [16]. The core idea of the DDQN algorithm is to divide the state-action Q value in the neural network into two parts: a value function that is independent of the action, and an action advantage function that depends on the action. With this network structure, the agent in RL eventually learns more accurate values, which means the DDQN algorithm can outperform the DQN algorithm on problems related to offloading policy and resource allocation. Ref. [17] adopts a DDQN algorithm to predict the offloading behavior of UEs whose tasks follow a semi-online distribution, calculating and updating the total reward after each offloading decision until the total reward reaches its maximum. Similarly, Ref. [18] uses the DDQN algorithm to predict UEs' offloading modes while balancing the load of the MEC server under unknown environment information, which is interpreted as unknown channel state information. By adopting the DDQN algorithm, Ref. [18] effectively improved the offloading efficiency and decreased the resource cost. In our previous work [19], we studied a computing offloading policy for multiple user equipments (UEs) in F-RANs by using the DQN algorithm to optimize the total utility of UEs. However, a limitation of that work is that the computational resource of the FAPs was not optimized. Designing an effective offloading policy together with an efficient resource allocation scheme is therefore essential to improving the total utility of UEs.
Motivated by the aforementioned studies, this paper seeks an optimal offloading policy for UEs' tasks while optimizing the computational resource of FAPs in the considered F-RAN, assuming that each UE has a computationally intensive task to be processed. Since the FAPs are resource-limited, some of the tasks can be processed in the FAP, and the others are forwarded to the cloud server over the fronthaul link. Moreover, idle UEs around the requesting UEs, which can provide additional computational resource through D2D communication, are taken into consideration to enlarge the set of offloading options. Consequently, the task of a requesting UE can be offloaded to an FAP, a nearby idle UE, or the cloud server, or be processed locally. This problem is formulated as a joint optimization of the offloading policy selection and the computational resource allocation for all UEs' tasks, with the objective of maximizing the total utility of all requesting UEs. It has also been proved to be a non-convex mixed-integer nonlinear programming (MINP) problem in Ref. [5]. To solve this challenging problem, we decompose it into two phases. In the first phase, a centralized DDQN algorithm is used to select the most appropriate offloading mode for each UE, and a pre-processing procedure is used to reduce the complexity of the DDQN algorithm. In the second phase, based on the training results of the DDQN algorithm, the tasks offloaded to the FAPs are first classified according to their delay and energy requirements. Then, a distributed DQN algorithm is adopted to optimize the computational resource in each FAP to obtain the final offloading policy and resource allocation.
The contributions of this paper can be summarized as below:
1. Offloading policy selection: A centralized DDQN algorithm is utilized in the proposed scheme to select the most appropriate offloading mode for each UE, which consists of offloading to an FAP, offloading to a nearby idle UE, or processing by itself. Because the complexity of the centralized algorithm in the Base Station (BS) increases with the number of UEs that have offloading requirements, a pre-processing procedure is adopted to reduce the complexity of the DDQN algorithm by directly satisfying some of the UEs' task requirements.
2. Optimizing the computational resource in FAPs: In the second step, multiple UEs might be connected to the same FAP for task offloading. Therefore, to maximize the total utility, some of the tasks in the FAP should be sent to the cloud server for processing. To jointly allocate the computational resource in the FAPs and decide whether each UE's task is sent to the cloud or stays at the FAP, we put forward a distributed DQN algorithm. Combined with the training results of the DDQN algorithm, it yields the final optimal offloading policy and resource allocation.
3. The performance of the proposed DDQN and DQN algorithms: Simulation experiments compare the proposed offloading policy and resource allocation scheme with other existing baseline schemes. The performance of the proposed DDQN algorithm and DQN algorithm is also given.

This paper is organized as follows: Sect. 2 describes the system model, computation model and problem formulation. The proposed offloading policy based on the DDQN algorithm, and the computational resource allocation scheme based on the DQN algorithm, are illustrated in Sect. 3. Simulation results are demonstrated in Sect. 4. Finally, conclusions are drawn in Sect. 5.

Method
This section introduces the methods used in the work. We first build formulas for the latency and the energy consumption of the considered task processing modes, from which the formula describing the total utility of the requesting UEs is derived. Accordingly, a centralized DDQN algorithm is adopted to solve the problem of offloading mode selection. Based on the training results of the DDQN algorithm, a distributed DQN algorithm is then utilized to optimize the computational resource of each FAP. Both deep reinforcement learning algorithms are implemented with Python 3.7.7 and TensorFlow 2.0, and the Adam optimizer is used to carry out the gradient descent that minimizes the loss function of the two neural networks.

System model
The system structure is shown in Fig. 1. We consider an F-RAN architecture with a single-cell scenario consisting of $M$ FAPs, $N$ UEs, and $K$ distributed computing nodes (DCNs), where a DCN is typically an idle UE with adequate computational resource located around the UEs, and can be connected with UEs by D2D communication. Assume each UE has only one computationally intensive task to be dealt with, characterized as $t_n = (B_n, D_n)$, where $B_n$ is the number of CPU cycles required for computing 1-bit data, and $D_n$ is the size of the task data. To improve the gains brought by offloading, we suppose that each UE has four processing options. Specifically, each UE can offload its complete task to an FAP, a nearby DCN, or the cloud server, or process it locally by itself. Accordingly, the vector of the offloading decision made for UE $n$ is defined as $d_n = \{\partial_m, \beta_k, \mathrm{Local}, \mathrm{Cloud}\}$, where each element takes a value in $\{0, 1\}$, $\forall m \in M$, $\forall k \in K$. Specifically, the value "1" indicates that the full task of the UE is processed in the corresponding mode, while the value "0" indicates that it is not. Besides, suppose that all the UEs, FAPs, and DCNs have already cached some task results in their own storage according to an optimal caching matrix proposed in our previous work [20]. Hence, UEs can first search for the desired result of their requested task before carrying out the offloading. If the result is found within the caching storage, the requirement of the UE can be directly satisfied, and there is no need to offload.
To characterize the network topology in the considered F-RAN, we build a matrix $P = [p_{i,j}]_{(M+K) \times N}$ where the columns indicate all the UEs that need to carry out offloading and the rows consist of the FAPs and DCNs which they can be associated with. If the distance between UE $j$ and FAP $i$ (or DCN $i$) exceeds the maximal distance of the FAP or DCN, represented as $d_{FAP}$ and $d_{D2D}$, respectively, then $p_{i,j} = 0$; otherwise $p_{i,j} = 1$. Moreover, considering that in practical situations some DCNs might not be willing to provide their computational resource, we build a matrix $Y = [y_{i,j}]_{K \times N}$ to denote the willingness of DCNs to participate in the offloading process, where $y_{i,j}$ is expressed as
$$y_{i,j} = \begin{cases} 0, & \text{DCN } i \text{ is not willing to serve UE } j, \\ 1, & \text{DCN } i \text{ is willing to serve UE } j, \end{cases} \quad i \in K,\ j \in N. \tag{1}$$
Then, matrix $Y$ is used to update matrix $P$: if $y_{i,j} = 1$, then $p_{i,j} = 1$; otherwise $p_{i,j} = 0$. The entries $p_{i,j}$ of matrix $P$ are updated accordingly.
The important parameters in this paper are listed in Table 1.

Computation model
An orthogonal frequency-division multiple access (OFDMA) method is used to access the FAP through the cellular channel. When a UE needs to communicate with an FAP or a DCN, the BS allocates one sub-channel (cellular link or D2D link) to the requesting UE. Combined with the four possible options, the corresponding four task processing methods are introduced as follows. Local processing: If the task of UE $n$ is processed locally, the local execution delay is calculated as
$$T^l_n = \frac{B_n D_n}{f^l_n},$$
where $f^l_n$ denotes the computational capacity of UE $n$, that is, the CPU cycles per second while the task is being processed. Moreover, we adopt the same energy consumption model for local processing as in Ref. [21], which is expressed as
$$E^l_n = z_n B_n D_n,$$
where $z_n = 10^{-27} (f^l_n)^2$ represents the energy consumption per CPU cycle of UE $n$.
Moreover, the utility obtained by UE $n$ in a given processing mode is defined as the combination of the delay and the energy saved compared with processing the task of UE $n$ locally. Thereby, the utility obtained by UE $n$ in local processing is expressed as a weighted sum of these two savings, where $\rho^t_n$ and $\rho^e_n$ indicate the delay and energy weight factors of each task, respectively. Both parameters range from 0 to 1 and satisfy $\rho^t_n + \rho^e_n = 1$ [22]. Since different tasks have different requirements for delay and energy, $\rho^t_n$ is larger if the task has a stricter low-latency requirement; otherwise, $\rho^e_n$ is larger. FAP processing: If the task of UE $n$ is carried out in the FAP processing mode, it takes three steps to complete the offloading process. Firstly, the task is uploaded to the associated FAP. Then, the task is performed by utilizing the computational resource provided by the FAP. Finally, the result of the task is returned to the requesting UE $n$. Since the transmission rate of the cellular link is relatively high while the task result is comparatively small, the transmission delay for task results is omitted in this paper [23]. In particular, the uploading delay from UE $n$ to FAP $m$ is calculated as $D_n / r^u_{n,m}$, where $r^u_{n,m}$ denotes the uplink rate of UE $n$ to the connected FAP $m$ over the cellular link, which can be expressed as
$$r^u_{n,m} = B \log_2\left(1 + \frac{p^u_{n,m} h^u_{n,m}}{N_0}\right). \tag{7}$$
In expression (7), $p^u_{n,m}$ denotes the transmit power of UE $n$, $h^u_{n,m}$ represents the uploading channel gain, $B$ indicates the bandwidth of each sub-channel, and $N_0$ stands for the noise power in each sub-channel. Thereby, the energy consumption of uploading task $t_n$ is the transmit power multiplied by the uploading delay, $p^u_{n,m} D_n / r^u_{n,m}$. Moreover, the task execution delay to process the task $t_n$ with FAP $m$ can be calculated by $B_n D_n / f_{n,m}$, where $f_{n,m}$ denotes the computational resource allocated to UE $n$ in FAP $m$. Additionally, the utility obtained by UEs is only related to the delay and the energy consumption spent on the UE side.
Thereby, for UE $n$ who has offloaded its task to FAP $m$, the utility is the weighted sum, with weights $\rho^t_n$ and $\rho^e_n$, of the delay and the energy saved relative to local processing. D2D processing: UE $n$ can establish a D2D link to offload its task to a nearby DCN for processing. Let
$$r^{d2d}_{n,k} = B \log_2\left(1 + \frac{p^{d2d}_{n,k} h^{d2d}_{n,k}}{N_0}\right)$$
stand for the achievable data rate of the D2D link between UE $n$ and the associated DCN $k$, where $h^{d2d}_{n,k}$ and $p^{d2d}_{n,k}$ denote the channel power gain and the transmit power between UE $n$ and DCN $k$, respectively. Then, the transmission delay and the energy consumption of UE $n$ over the D2D link are $D_n / r^{d2d}_{n,k}$ and $p^{d2d}_{n,k} D_n / r^{d2d}_{n,k}$, and the execution delay of DCN $k$ is calculated by $B_n D_n / f_k$, where $f_k$ indicates the computational capacity of DCN $k$. Likewise, the utility obtained by UE $n$ in DCN processing is given by the weighted sum of the saved delay and the saved energy. Cloud processing: Since the computational resource of the FAPs is not always adequate for multiple UEs, some tasks in the FAPs should be offloaded to the cloud to ensure that the total utility is not affected. With the cloud processing mode, the offloading takes the following steps. Firstly, the task is uploaded to the connected FAP, which further sends it to the cloud through the fronthaul link. Secondly, the task is executed by utilizing the computational resource provided by the cloud server. Finally, the computation result of the task is returned to the requesting UE $n$. Specifically, we use $T_c$ to represent the round-trip transmission delay in the fronthaul link. The execution delay in cloud processing is expressed in terms of $f^{Cloud}_n$, which denotes the computational resource allocated to UE $n$ at the cloud server. With reference to the FAP processing mode, the utility obtained by UE $n$ in cloud processing is calculated in the same way from the saved delay and the saved energy.
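To make the computation model concrete, the following Python sketch evaluates the local delay and energy, the uplink rate of expression (7), and a utility for the FAP processing mode. It is only an illustration: the function names and the sample numbers are hypothetical, and the unnormalized utility form (weighted sum of saved delay and saved energy) is an assumption chosen to agree with the maximum utility $\rho^t_n T^l_n + \rho^e_n E^l_n$ stated for cached results, not a formula confirmed by the paper.

```python
import math


def local_cost(cycles_per_bit, data_bits, f_local):
    """Return (delay in s, energy in J) for local processing."""
    cycles = cycles_per_bit * data_bits
    delay = cycles / f_local
    z_n = 1e-27 * f_local ** 2          # energy per CPU cycle
    return delay, z_n * cycles


def uplink_rate(bandwidth, tx_power, gain, noise):
    """Shannon rate of one sub-channel, as in expression (7)."""
    return bandwidth * math.log2(1 + tx_power * gain / noise)


def fap_utility(cycles_per_bit, data_bits, f_local, f_alloc, rate, tx_power, rho_t):
    """ASSUMED utility form: weighted saved delay plus weighted saved energy."""
    t_local, e_local = local_cost(cycles_per_bit, data_bits, f_local)
    t_up = data_bits / rate                        # uploading delay
    e_up = tx_power * t_up                         # uploading energy (UE side)
    t_exec = cycles_per_bit * data_bits / f_alloc  # execution delay at the FAP
    rho_e = 1.0 - rho_t                            # weights sum to one
    return rho_t * (t_local - t_up - t_exec) + rho_e * (e_local - e_up)
```

With these definitions, a task that is slow to compute locally but cheap to upload yields a positive utility, while a large task on a poor channel can yield a negative one, which is the trade-off the offloading policy has to learn.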

Problem formulation
The objective of this paper is to maximize the total utility obtained by all UEs by finding an optimal offloading policy for each UE and optimizing the computational resource in each FAP. Combining the system model and the computation model, the optimization problem is formulated as problem (18) with constraints C1 to C5. C1 and C2 ensure that each UE can only select one processing mode. C3 and C4 ensure that the computational resource allocated to the offloaded tasks does not exceed the computational resource of the FAP or the cloud server, denoted by $f_{FAP}$ and $f_{Cloud}$, respectively. C5 states that the number of UEs associated with each FAP should not exceed the maximum accessible number, defined as $C_m$, $m \in M$. Ref. [24] has proved that the problem formulated in (18) is a MINP problem; therefore, it is difficult to find the optimal solution with traditional optimization algorithms. Besides, the scale of problem (18) increases rapidly with the number of UEs, which further increases the complexity of the solution and, consequently, directly affects the total utility obtained by the UEs. To address these challenges, in the following section problem (18) is solved by jointly optimizing the offloading policy and the computational resource allocation. Specifically, we first divide the original problem into two sub-problems. In the first phase, a centralized DDQN algorithm running at the BS is adopted to select the most appropriate offloading mode for each UE, and a pre-processing stage is introduced to decrease the complexity of the proposed DDQN algorithm. In the second phase, based on the training result of the DDQN algorithm, a distributed DQN algorithm is adopted to optimize the computational resource at each FAP. Combining these two phases, the final optimal offloading policy and resource allocation can be obtained.
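The role of the FAP-side constraints can be illustrated with a small feasibility check. The sketch below is hypothetical: the data layout (a dictionary mapping each UE to one FAP index or to `None`) is an assumption that enforces "one processing mode per UE" by construction, and only the FAP capacity constraint and the maximum number of associated UEs are checked.

```python
from collections import Counter


def fap_constraints_ok(assignment, alloc, f_fap, max_ues):
    """Check the per-FAP resource and association limits.

    assignment: {ue: FAP index, or None if the UE does not use an FAP}
    alloc:      {ue: computational resource allocated at its FAP}
    f_fap:      total computational resource of each FAP (indexed by m)
    max_ues:    maximum number of UEs each FAP can serve (indexed by m)
    """
    load = Counter()   # number of UEs associated with each FAP
    used = Counter()   # resource handed out by each FAP
    for ue, m in assignment.items():
        if m is None:
            continue
        load[m] += 1
        used[m] += alloc.get(ue, 0.0)
    return all(load[m] <= max_ues[m] and used[m] <= f_fap[m] for m in load)
```

A check of this kind is what decides, in the reward design of the next section, whether the agent receives the utility of a UE or a zero reward.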

The proposed computing offloading policy and resource allocation
This section introduces, in turn, the proposed DDQN algorithm-based computing offloading policy, the pre-processing stage, and the DQN algorithm-based computational resource allocation scheme.

DDQN algorithm-based computing offloading
In the first phase, the utility obtained by each UE cannot be computed directly in the FAP processing mode, because the computational resource in each FAP has not yet been allocated to the UEs when the offloading decisions are made. Hence, in the first phase we treat the computational resource of each FAP as a whole that can be allocated to the requesting UE, and after making the offloading decision for each UE, we further optimize the computational resource in each FAP. Specifically, a DDQN algorithm is adopted to find the most appropriate offloading mode for each UE. After the appropriate offloading modes are chosen, we decide which tasks should be sent to the cloud center while optimally allocating the computational resource of the FAPs.

Markov decision process
The DDQN algorithm is a DRL algorithm which, as a model-free approach, can address complicated system settings by dynamically interacting with an unknown environment without any prior knowledge [25]. DRL can also handle potentially large state spaces [26]. In our considered F-RAN system, the problem of making offloading decisions for UEs is formulated as a finite Markov Decision Process (MDP) [27]. We assume that each training epoch is divided into $T$ steps, indexed by $t = 1, 2, 3, \ldots, T$, where $T$ denotes the number of UEs that need to offload tasks. Combining the considered F-RAN system and the DRL algorithm, the four essentials of RL, namely Agent, Action Space, Environment & State, and Immediate Reward, are defined for each step $t$ as follows.
Agent: The agent is defined as a learner and a decision-maker in RL. Thereby, in our considered F-RAN system, the BS is selected as the agent of the DDQN algorithm.
Environment & State: The environment in RL is defined as the set of all possible states, and the essence of RL is to perform actions that cause state transitions [28]. Therefore, we set a matrix $S$ as the state, which has the same shape as the matrix $P$. Each entry $s_{ij}$ of $S$ is either 0 or 1: $s_{ij} = 1$ represents that the agent selects FAP $i$ (or DCN $i$) for UE $j$; otherwise $s_{ij} = 0$. At step $t = 0$ in each training epoch, we initialize the matrix $S$ as an all-zero matrix; then, the agent executes actions to interact with the environment and trigger changes of the matrix $S$.
Action Space: The BS makes the offloading decision for each requesting UE according to the network topology matrix $P$, and the optimization objective is to find the optimal offloading mode for each UE. Thereby, in the proposed DDQN algorithm, we use $a_t \in A$ to denote the action in step $t$, where $A = \{\partial_1, \partial_2, \ldots, \partial_M, \beta_1, \beta_2, \ldots, \beta_K\}$.
Immediate Reward: The reward function always needs to be related to the objective function [29]. Accordingly, we set the immediate reward $r_t$ in each step $t$ in two parts. If all the constraints in Eq. (18) are satisfied, the agent obtains a positive immediate reward $r_t$ equal to the utility obtained by the $t$-th UE, denoted here by $U_t$. Otherwise, the reward obtained by the agent is zero. In addition, the reward is also set to zero when $\exists j \in N$, $\sum_{i=0}^{M+K+1} p_{ij} = 0$, which means that UE $j$ cannot be connected with any FAP or DCN. Therefore, a zero reward means that the UE should carry out local processing. We define the reward function at step $t$ as
$$r_t = \begin{cases} U_t, & \text{if all constraints in (18) are satisfied}, \\ 0, & \text{otherwise}. \end{cases} \tag{19}$$
At the end of each training epoch, the accumulated reward represents the total utility obtained by the requesting UEs.
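The state transition and reward rule described above can be sketched as a single environment step. This is a minimal illustration under stated assumptions: the function names are hypothetical, the utility of the selected mode and the outcome of the constraint check are passed in as precomputed inputs, and selecting a node that is unreachable in $P$ is treated as an infeasible action (an assumption, since the paper only states the zero-reward cases in general terms).

```python
def can_connect(P, j):
    """True if UE j (column j) can reach at least one FAP or DCN in P."""
    return any(row[j] == 1 for row in P)


def step(S, P, i, j, utility, constraints_ok):
    """Apply the action 'serve UE j with node i' and return (next state, reward).

    S and P are lists of rows with shape (M+K) x N. A zero reward leaves the
    state unchanged, which corresponds to UE j falling back to local processing.
    """
    if not can_connect(P, j) or P[i][j] == 0 or not constraints_ok:
        return S, 0.0
    S_next = [row[:] for row in S]   # copy, so the previous state is preserved
    S_next[i][j] = 1
    return S_next, utility
```

For example, with two nodes and two UEs where only UE 0 is within range, choosing node 1 for UE 0 sets $s_{1,0} = 1$ and returns the utility, whereas any action for UE 1 returns a zero reward.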

The pre-processing stage
However, the proposed centralized DDQN algorithm in the BS has a high complexity: the dimension of the state space in the DDQN algorithm increases dramatically with the number of requesting UEs, which increases the complexity of the DDQN algorithm and decreases the efficiency of the network training. Thereby, a pre-processing phase is adopted to decrease the dimension of the state space and improve the total utility obtained by all UEs. Specifically, assume that each UE, DCN, and FAP has cached some processing results of different tasks based on the optimal caching matrix $C_{(M+K+N) \times N}$ from our previous research [20]. For our considered F-RAN system, we extend the matrix $P_{(M+K) \times N}$ to the same dimension as $C$ and fill the extended entries with "1". Then, we multiply it element-wise by the matrix $C$ and obtain a matrix $P' = P \circ C$. In this way, each task has its own identity that distinguishes it from the others in $P'$. Accordingly, when a UE has a task to be processed, it first checks whether the task result has been cached in its local cache. If the result is not found locally, the identification of the task is transmitted to the BS, and the BS searches the matrix $P'$ and then selects the closest route to deliver the result to the requesting UE. If the result can be directly obtained in the pre-processing stage, the maximum utility that UE $n$ can obtain is expressed as
$$U_n = \rho^t_n T^l_n + \rho^e_n E^l_n. \tag{20}$$
However, if the result cannot be found in the pre-processing stage, the offloading procedure is adopted. Since the BS is equipped with a powerful computing server, the searching and delivery of the task result can be completed so fast that the transmission delay can be ignored. Therefore, UEs that find their task results during the pre-processing phase no longer need to participate in the task offloading. In this way, the complexity of the DDQN algorithm can be decreased.
The specific algorithm procedure in the pre-processing phase is shown in Algorithm 1.
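The effect of the pre-processing stage, namely removing UEs whose results are already cached before the DDQN algorithm runs, can be sketched as follows. This is not Algorithm 1 itself: the dictionary-based cache layout, the function names, and the convention that each UE's reachable nodes are listed from nearest to farthest are assumptions made for illustration. The maximum utility uses expression (20).

```python
def preprocess(requests, caches, reachable):
    """Split UEs into those served from a cache and those that must offload.

    requests:  {ue: task id}
    caches:    {node: set of task ids whose results the node has cached}
    reachable: {ue: list of reachable FAPs/DCNs, ASSUMED sorted by distance}
    """
    served, remaining = {}, []
    for ue, task in requests.items():
        # The UE checks its own cache first, then the nodes it can reach.
        candidates = [ue] + reachable.get(ue, [])
        holders = [n for n in candidates if task in caches.get(n, set())]
        if holders:
            served[ue] = holders[0]     # closest holder delivers the result
        else:
            remaining.append(ue)        # handled later by the DDQN algorithm
    return served, remaining


def max_utility(rho_t, t_local, e_local):
    """Utility of a UE whose result is found in a cache, as in (20)."""
    return rho_t * t_local + (1.0 - rho_t) * e_local
```

Only the UEs in `remaining` enter the state space of the DDQN algorithm, which is how the pre-processing stage reduces its dimension.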

DDQN algorithm
The DDQN algorithm-based offloading scheme is proposed to select the optimal offloading mode for the UEs that still need to offload after the state space has been reduced. Specifically, the DDQN algorithm is a typical DRL algorithm that utilizes a deep neural network to approximate the state-action Q value, with the aim of maximizing the expected accumulated discounted reward and obtaining the optimal action [30]. The Q function is expressed as
$$Q(s, a) = \mathbb{E}\left[\sum_{i} \gamma^{i} R_{t+i} \,\middle|\, s_t = s,\ a_t = a\right], \tag{21}$$
where
$$R_t = r_t + r_{t+1} + \cdots + r_T, \tag{22}$$
and $\gamma$ is a discount factor between 0 and 1 that stands for the effect of future rewards on the current time-step reward. The larger $\gamma$ is, the greater this effect.
The model and architecture of the DDQN algorithm we designed are shown in Fig. 2, where we use one step $t$ of a training epoch as an instance to introduce our network model. In each step $t$, the input of the DDQN network is the current state $s_t$ and the output is the Q value of each possible action at the state $s_t$, denoted by $Q(s_t, a_t)$. The agent selects an action according to the ε-Greedy policy and then performs it: an action is selected at random with probability $\varepsilon$, and the action with the maximum value of $Q(s_t, a_t)$ is selected with probability $1 - \varepsilon$. The advantage of the ε-Greedy policy is that it makes the agent explore unknown actions and states in each step, so as to avoid the algorithm falling into a locally optimal solution. After the selected action is executed, the state transfers to the next state $s_{t+1}$. Meanwhile, the agent gets an immediate reward $r_t$, and the network carries on the training at the next step $t + 1$ until the end of the training epoch. During the training process, the objective of the DDQN network training is to obtain a series of actions that achieves the maximum accumulated discounted reward. This can be interpreted as the BS aiming to achieve the maximum total utility for all UEs in the considered F-RAN model. To achieve a better training performance, the DDQN algorithm splits the output $Q(s_t, a_t)$ into two parts, the State Value Function $V(s_t)$ and the Action Advantage Function $A(s_t, a_t)$, expressed as
$$Q(s_t, a_t; \omega, \phi) = V(s_t; \omega) + A(s_t, a_t; \phi), \tag{23}$$
where $\omega$ and $\phi$ are the network parameters for $V(s_t)$ and $A(s_t, a_t)$, respectively. Specifically, $V(s_t)$ stands for the expected accumulated reward at the state $s_t$, and $A(s_t, a_t)$ indicates the degree of superiority of action $a_t$ over the average level in state $s_t$, as presented in formulas (24) and (25).
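The ε-Greedy action selection described above can be written in a few lines. The sketch below is a generic illustration with hypothetical names; it takes the list of Q values produced by the network for the current state and returns the index of the chosen action.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))          # explore
    # exploit: index of the largest Q value (first one on ties)
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With a small ε, as used in the simulations, the agent mostly follows the current Q estimates and only occasionally tries another FAP or DCN for the UE under consideration.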
According to Ref. [31], formula (23) can be reformulated as
$$Q(s_t, a_t; \omega, \phi) = V(s_t; \omega) + \left(A(s_t, a_t; \phi) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s_t, a'; \phi)\right),$$
where $|\mathcal{A}|$ is the number of actions. Furthermore, according to the training procedure of DRL [32], we build the loss function of the DDQN algorithm as
$$L(\omega, \phi) = \mathbb{E}\left[\left(r_t + \gamma \max_{a} Q(s_{t+1}, a; \omega^-, \phi^-) - Q(s_t, a_t; \omega, \phi)\right)^2\right],$$
where $r_t + \gamma \max_{a} Q(s_{t+1}, a; \omega^-, \phi^-)$ represents the target network value and $Q(s_t, a_t; \omega, \phi)$ represents the predict network value. These two networks have the same structure but different parameters: the parameters of the former are copied from the latter every $I$ steps. During each training epoch of the DDQN network, the gradient descent algorithm is utilized to minimize the loss function and find the optimal parameters of the predict network, which is further used to evaluate the Q value of each chosen action [33]. In the DDQN algorithm, an experience pool is introduced to ensure the stability of the network training. The latest interaction data $(s_t, a_t, r_t, s_{t+1})$ are put into an experience memory pool, and when the training starts, a mini-batch $(s'_t, a'_t, r'_t, s'_{t+1})$ is randomly sampled from the pool. As a result, the experience replay mechanism not only lets the agent learn from previous experiences repeatedly but also removes the correlations between the observations. Thereby, the DDQN network training becomes more stable and more efficient. The whole procedure of the proposed DDQN algorithm is drawn in Fig. 3, and the proposed DDQN algorithm-based offloading scheme is presented in Algorithm 2.
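The three building blocks of this training procedure, the dueling aggregation of the value and advantage streams, the target value used in the loss, and the experience pool, can be sketched in plain Python. This is an illustration rather than the authors' TensorFlow implementation: the names are hypothetical, the mean-subtracted aggregation is the standard dueling form, and the `done` flag for the last step of an epoch is an assumption not discussed in the text.

```python
import random
from collections import deque


def dueling_q(value, advantages):
    """Combine V(s) with A(s, a) by subtracting the mean advantage."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def td_target(reward, next_q_target, gamma=0.99, done=False):
    """Target value r + gamma * max_a Q_target(s', a) used in the loss."""
    return reward if done else reward + gamma * max(next_q_target)


class ReplayBuffer:
    """Fixed-size experience pool storing (s, a, r, s_next) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest entries are dropped

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Subtracting the mean advantage keeps the decomposition identifiable: adding a constant to every advantage no longer changes the resulting Q values, so the value stream has to carry the state-dependent part.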

DQN algorithm-based computation resource allocation scheme
Since multiple UEs connected to the same FAP cause resource competition, some of the tasks in the FAP should be relayed to the cloud server to maximize the total utility. Meanwhile, the computational resource in each FAP should be allocated to the UEs whose tasks have been offloaded to the corresponding FAP. In this part, we first classify the tasks in each FAP into two groups according to the UEs' latency requirements, which are characterized by the delay revenue coefficient $\rho^t_n$. Specifically, tasks with a higher delay requirement, represented as $\rho^t_n \ge 0.5$, remain at the FAP for processing. Otherwise, when $\rho^t_n < 0.5$, the tasks are sent to the cloud for processing. Since the cloud center has abundant computational resources and powerful processing capability, while the computational resource of the FAPs is limited, we assume that the tasks sent to the cloud center can be processed in parallel [34]. Meanwhile, a distributed DQN algorithm is adopted to optimize the resource allocation in each FAP. The DQN algorithm is also a typical model-free DRL algorithm [35], so the computational resource allocation problem can be formulated as an MDP as well. The Agent, State, Action, and Reward are described as follows.
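The classification rule above amounts to a threshold test on the delay weight of each task. A minimal sketch, with a hypothetical function name and a dictionary of $\rho^t_n$ values per UE as the assumed input:

```python
def split_tasks(rho_t_by_ue, threshold=0.5):
    """Return (UEs whose tasks stay at the FAP, UEs whose tasks go to the cloud)."""
    stay_at_fap = [ue for ue, rho_t in rho_t_by_ue.items() if rho_t >= threshold]
    to_cloud = [ue for ue, rho_t in rho_t_by_ue.items() if rho_t < threshold]
    return stay_at_fap, to_cloud
```

Only the UEs in the first list compete for the resource blocks of the FAP in the DQN-based allocation that follows.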
Agent: In the proposed distributed DQN algorithm, since the objective is to optimize the computational resource in each FAP, we define each FAP as an Agent.
Environment & State: The state is defined as a combination of the available resources in each FAP and the utility obtained by the UEs in each FAP, which can be expressed as $s = (F_m, \sum_{i=1}^{N_m} U_i)$, where $N_m$ stands for the number of UEs that offload their tasks to FAP $m$.
Action Space: The action should contain all possible schemes of resource allocation to the UEs that remain at FAP $m$. Besides, the DQN algorithm is mainly oriented to problems with discrete actions. Thereby, the computational resource in each FAP should also be discrete, and discrete computational resource blocks are allocated to each UE. Suppose the computational resource in FAP $m$ is divided equally into $X$ parts. The action is then expressed as $a_t = (f_1, f_2, \ldots, f_i, \ldots, f_{N_m})$, $f_i \in \{1, 2, 3, \ldots, X\}$, where $f_i$ denotes the number of computational resource blocks allocated to UE $i$.
Immediate Reward: Since each FAP acts as an agent in this distributed DQN-based resource allocation problem, FAP $m$ immediately gets a positive reward equal to the utility of the UEs in FAP $m$, which is expressed as $\sum_{i=1}^{N_m} U_i$. In practice, if the variation of the reward value stays below a small threshold for ten consecutive time steps in a training epoch, we terminate this training epoch, and the network starts the next training epoch. As shown in Fig. 4, the input of the DQN is the state $s_t$ in each step $t$; three fully connected layers then extract the features of the input data, and the output of the DQN is the resource allocation vector. When the DQN algorithm converges, the agent eventually learns the optimal resource allocation vector. Similarly, the DQN algorithm uses the gradient descent algorithm to update the Q-network during each training epoch to minimize the loss function, which is formulated as
$$L(\theta) = \mathbb{E}\left[\left(r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta') - Q(s_t, a_t; \theta)\right)^2\right],$$
where $\theta'$ represents the parameter of the target network, which is copied from the predict network parameter $\theta$ every several steps. As with the DDQN algorithm, the DQN algorithm also adopts the experience replay mechanism to remove the correlation of the data and make the training of the network more stable.
The proposed DQN algorithm-based computational resource allocation is illustrated in Algorithm 3.
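To give a feel for the discrete action space of each FAP agent, the sketch below enumerates all allocation vectors $(f_1, \ldots, f_{N_m})$ with $f_i \in \{1, \ldots, X\}$. The function name is hypothetical, and filtering by the total number of blocks is an assumption that mirrors the constraint that the allocated resource must not exceed the resource of the FAP.

```python
from itertools import product


def feasible_allocations(num_ues, num_blocks):
    """All vectors (f_1, ..., f_N) with 1 <= f_i <= X and sum(f_i) <= X."""
    return [
        alloc
        for alloc in product(range(1, num_blocks + 1), repeat=num_ues)
        if sum(alloc) <= num_blocks
    ]
```

The unfiltered space has $X^{N_m}$ elements, so it grows quickly with the number of UEs per FAP. This is one reason the allocation is learned by a distributed DQN per FAP rather than searched exhaustively.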

Results and discussion
In this section, the parameters and results of the simulation experiment are presented to verify the performance of our proposed offloading and resource allocation scheme.
We consider a single cell with a radius of 400 m which distributed with 20-100 UEs and 2-10 FAPs. Also, some important parameters are listed in Table 2. Moreover, In the DDQN algorithm-based offloading scheme, we set the input layer has (M + K ) × N neurons, where we use three fully connected hidden layers to extract the feature of the input data, each of which comprises 256 neurons. Especially, to divide the State Value Function and the Action Advantage Function, we split the third hidden layer into two halves, which means that 128 neurons represent the State Value Function and another 128 neurons represent the Action Advantage Function. Besides, the discount factor γ in the DDQN and DQN algorithm is set as 0.99. Similarly, the DQN algorithm also adopts three full connected hidden layers and each layer comprises 256 neurons. The DDQN and DQN algorithms are implemented by TensorFlow 2.0 based on Python 3.7.7. Moreover, to train the DDQN and the DQN network, we use Adam optimizer and set the learning rate as 0.0002 to realize the gradient descent algorithm to minimize the loss function. For the action selection of each step, we set the parameter ε in the ε-Greedy policy as decaying from 0.08 to 0.01 through the network training process, which means that the predict Q-network tends to select the action with the maximal Q value to  further improve the learning efficiency. Furthermore, when dividing the computational resources of each FAP, in order to avoid missing the optimal solution, we set that each resource block is 0.1 Figures 5 and 6 display the learning curves representing the accumulated reward obtained by the agent in each training epoch of the DDQN algorithm and the DQN algorithm, respectively. The number of UEs is set as 20, and the average required number of CPU cycles is 760 cycles/bit. 
We can see that, for both the DDQN algorithm and the DQN algorithm, the accumulated reward gradually increases with some fluctuation and eventually stabilizes and converges, which indicates that the two networks have been well trained and that the offloading decision and resource allocation have converged.
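The dueling structure and the decaying ε-greedy policy described in the simulation setup can be illustrated with a minimal sketch in plain Python. It is not the authors' TensorFlow implementation. The mean-subtracted aggregation of the state value and the advantages is the form commonly used for dueling networks and is assumed here, as is the linear shape of the ε decay; the text only gives the endpoints 0.08 and 0.01. The function names `dueling_q_values`, `epsilon_at`, and `select_action` are hypothetical.

```python
import random


def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + A(s, a) - mean(A).

    Assumption: mean-subtracted aggregation, as commonly used in dueling networks.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def epsilon_at(step, total_steps, eps_start=0.08, eps_end=0.01):
    """Decay epsilon from eps_start to eps_end over training.

    Assumption: the decay is linear; only the endpoints come from the text.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise take argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


if __name__ == "__main__":
    # Toy outputs of the two 128-neuron streams (hypothetical numbers).
    q = dueling_q_values(state_value=1.0, advantages=[0.5, -0.5, 0.0])
    for step in (0, 500, 1000):
        eps = epsilon_at(step, total_steps=1000)
        print(step, round(eps, 4), select_action(q, eps))
```

As ε shrinks, the greedy branch is taken more often, which matches the statement that the prediction Q-network increasingly selects the action with the maximal Q value.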
Then, we compare the performance of the proposed DDQN algorithm-based offloading scheme with four other methods: the random task offloading scheme (RO), the non-D2D offloading scheme (ND2D), the non-pre-processing offloading scheme (NP), and the scheme with random task selection between the FAP and the cloud (RS).
Figure 7 shows the total utility of UEs versus the number of UEs, with the average required number of CPU cycles fixed to 760 cycles/bit. It can be seen from Fig. 7 that the proposed DDQN algorithm-based offloading scheme achieves the highest total utility among all schemes. We give the following explanations. Firstly, in the RS scheme, the tasks to be processed by the cloud are selected randomly, so this scheme cannot guarantee the stricter latency requirements of some UEs' tasks. Hence, even if the resources have been optimized at each FAP, the utility obtained by some UEs cannot be maximized. Besides, for the NP scheme, it can be observed that as the number of UEs increases, the gap between the NP scheme and the proposed DDQN scheme gradually becomes larger. This can be interpreted as follows: more UEs produce more task requests, so the probability that a task result can be found within the matrix also increases. As a result, more UEs achieve the desired utility by directly obtaining the task result, which contributes to the gradually increasing total utility of the proposed scheme. Moreover, the performance of the RO scheme is significantly worse than that of the proposed scheme. A possible reason is that the BS selects a farther FAP or proximity device for some UEs, which increases the transmission delay, so the total utility obtained by UEs is unsatisfactory.
In some situations, the utility may even be negative, which accounts for the poor total utility of the RO scheme. Additionally, for the ND2D scheme, the total utility is clearly lower than that of the other schemes. The reason is that, without the assistance of nearby DCNs, the tasks of UEs can only be offloaded to FAPs or the cloud, and the computational resources of nearby DCNs are not well utilized. Thereby, UEs cannot benefit from the DCNs, which causes the lowest total utility among all schemes.
Figure 8 illustrates the total utility obtained by UEs versus the average required computational resource of tasks, where we fix the number of UEs to 100. It can be seen that as the average required computational resource of tasks increases, the total utility obtained by UEs also increases. This trend arises because a larger computational requirement means a larger execution delay under local computing. Accordingly, UEs can benefit by offloading their tasks to an FAP, nearby DCNs, or the cloud through an optimal offloading scheme, which brings a larger total utility than executing the tasks locally.
Figure 9 shows the number of beneficial UEs versus the number of FAPs, where we fix the total number of UEs to 150 and the average required number of CPU cycles to 760 cycles/bit. It can be observed that as the number of FAPs increases, the total number of UEs that obtain utility by offloading tasks to others (Total beneficial) also increases. The reason is that more FAPs provide more alternative offloading modes to more UEs, and consequently UEs can enjoy the more abundant computational resource provided by the FAPs. For the DCN offloading mode (DCN offloading), we can observe that as the number of FAPs increases, the number of UEs offloading to DCNs gradually decreases.
The reason behind this phenomenon is twofold. First, when the resource of FAPs is scarce, the competition for offloading opportunities among UEs is fierce. A UE that cannot connect to any FAP tends to offload to a nearby DCN to increase its utility, so some tasks of UEs have to be offloaded to nearby DCNs. Second, as the number of FAPs increases, the computational resource provided by FAPs gradually becomes abundant enough to satisfy UEs' demands for a lower task execution delay. Since the utility obtained from an FAP is larger than that from a DCN, more UEs choose the FAP offloading mode for higher utility. Therefore, as the number of FAPs increases, more UEs prefer the FAP offloading mode (FAP offloading) to the DCN offloading mode. On the other hand, for UEs that obtain utility directly during the pre-processing phase (pre-obtained), a larger number of FAPs means a larger probability that the task result has been cached, which can satisfy the requirements of more UEs. Therefore, the number of UEs that directly obtain utility in the pre-processing stage grows with the number of FAPs.
Then, we compare the proposed algorithm with three other methods, namely the non-resource optimization scheme, Full-FAP offloading, in which the tasks at an FAP remain at the FAP for processing, and Full-cloud offloading, in which the tasks at an FAP are all sent to the cloud for processing. Figure 10 displays the total utility of UEs versus the number of UEs, where we fix the number of FAPs to 10. It can be seen that as the number of UEs increases, the total utility first increases and then gradually becomes stable.
This is because as the number of UEs increases, the computational resource in the FAPs and the cloud has to be allocated among more users, which incurs a longer execution delay for each UE's task. This is responsible for the slow growth of the total utility. Compared with the non-resource optimization scheme, the proposed scheme uses the DQN algorithm to optimize the allocation of the computational resource at each FAP, which improves the total utility. Besides, the performance of Full-FAP offloading and Full-cloud offloading is not as good as expected. If all UEs' tasks directed to an FAP are executed at that FAP, the computational resource available to each UE is insufficient, which directly affects the execution delay. The lower utility of Full-cloud offloading may be due to the long round-trip delay or congestion in the fronthaul link, which increases the transmission delay of the task and reduces the total utility.
Figure 11 presents the total utility of UEs versus the number of FAPs. It can be seen that the total utility increases with the number of FAPs. This is because more FAPs provide more computational resource, so that more UEs with good channel conditions can choose a nearby FAP to offload their tasks, which reduces the transmission delay and execution delay of UEs' tasks and thus improves the total utility of UEs.
Figure 12 illustrates the total utility of UEs versus the average required computational resource of tasks. The numbers of UEs and FAPs are set to 100 and 10, respectively. It can be observed that the total utility gradually increases. This is because a larger average required computational resource of tasks leads to a longer execution delay. Compared with processing locally, all offloading schemes, such as offloading to an FAP, a DCN, or the cloud, affect the utility of UEs because of this increased delay.
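The delay argument used above for Fig. 10 (a fixed FAP capacity shared among more UEs gives a longer execution delay per task) can be illustrated with a small numerical sketch. It assumes the common model in which execution delay equals the required CPU cycles divided by the allocated CPU frequency, and it assumes equal sharing of the FAP capacity, which corresponds to a baseline without resource optimization rather than to the DQN-based allocation. The task size and FAP capacity are hypothetical values; only the 760 cycles/bit figure comes from the simulation setup.

```python
def execution_delay(task_bits, cycles_per_bit, cpu_hz):
    """Execution delay in seconds: required CPU cycles / allocated CPU frequency.

    Assumption: the standard cycles-over-frequency delay model.
    """
    return task_bits * cycles_per_bit / cpu_hz


def equal_share_delay(task_bits, cycles_per_bit, fap_hz, num_ues):
    """Per-task delay when one FAP splits its capacity equally among num_ues UEs.

    Assumption: equal sharing, i.e., no resource optimization.
    """
    return execution_delay(task_bits, cycles_per_bit, fap_hz / num_ues)


if __name__ == "__main__":
    TASK_BITS = 1_000_000   # hypothetical task size (bits)
    CYCLES_PER_BIT = 760    # average requirement used in the simulations
    FAP_HZ = 10e9           # hypothetical FAP capacity (cycles/s)

    for n in (1, 2, 5, 10):
        delay = equal_share_delay(TASK_BITS, CYCLES_PER_BIT, FAP_HZ, n)
        print(f"{n:2d} UEs per FAP: {delay:.3f} s per task")
```

Under these assumptions the per-task delay grows linearly with the number of UEs sharing an FAP, which is consistent with the slowing growth of the total utility as more UEs are admitted.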

Conclusion
In this paper, we have studied an offloading selection and computational resource allocation scheme in F-RAN. Aiming at maximizing the total utility of all UEs that have tasks to be processed, a DDQN algorithm-based offloading selection scheme is first proposed to make the optimal offloading decision for each UE under unpredictable CSI. Since the proposed DDQN algorithm is a centralized algorithm carried out at the BS, we utilize a pre-processing phase to decrease its complexity before the network training. After obtaining the optimal action for each UE, we then utilize the distributed DQN algorithm to optimize the computational resource at each FAP. Simulation results demonstrate that the proposed offloading and resource optimization scheme effectively increases the utility obtained by the UEs and achieves better performance than the other schemes.