A Collaborative Cache Allocation Strategy for Performance and Link Cost in Mobile Edge Computing

Abstract—Mobile Edge Computing (MEC) is a new paradigm for handling massive content access on mobile networks in support of fast-growing Internet of Things (IoT) services and applications. However, cache placement and utilization are often inappropriate, and requests for cached data are highly variable over time. Moreover, most current optimization approaches lack the adaptive ability to orchestrate dynamic caching environments, and although many studies apply online deep learning, many challenges remain. This paper synthesizes the value of cached content and the cost of transmission links to derive a comprehensive utility function that can meet both the performance and link-cost requirements of edge caching systems by dynamically changing the weight values. To improve the timeliness of the caching policy and the efficiency of deep learning, we propose a collaborative two-stage deep reinforcement learning (CDRL) framework to design the caching mechanism; CDRL combines Double Deep Q-Network (DDQN) learning with deep Sarsa. Experimental results show that the proposed approach effectively improves performance in terms of cache hit rate, service latency, and link cost.

Unreasonable resource allocation is a general problem in edge servers [4].
In recent decades, scholars have conducted extensive research on collaborative storage. For cloud computing, collaborative storage is used to reduce data center energy consumption [5], increase user access speed [6], and reduce the overall load [7]. For content delivery networks (CDNs) and information-centric networks (ICNs), collaborative storage is used to reduce the amount of duplicate data and optimize energy efficiency [8][9][10][11]. Most existing edge collaborative caching research models the content placement problem as a linear programming problem centered on content popularity, and has achieved many results. However, the cost of data transmission links varies greatly, and existing studies rarely consider the value of data, the link cost of data, and the dynamics and uncertainty of the edge cache simultaneously. We propose an edge collaborative cache architecture for edge servers with limited storage space to coordinate the data storage required in the edge computing environment. Through edge collaboration, content that is more likely to be accessed can be offloaded to edge servers closer to the user. Therefore, even if the cache space is limited, each edge server has a high probability of meeting the user's data needs.
In addition, traditional rule-based and metaheuristic approaches have difficulty considering all environmental factors. To reduce the backhaul traffic load and the transmission delay to remote clouds, scholars have proposed Q-learning-based caching mechanisms [12], deep-learning predictive caching mechanisms [13], and collaborative edge caching frameworks based on deep reinforcement learning [14]. However, most learning methods still suffer from unstable returns or slow convergence of the training process.
In general, a new unified model of variable cache utility and an improved deep learning mechanism are needed at this stage to avoid the various bottlenecks of deep learning for caching. To this end, we propose CDRL, a cache optimization strategy based on dynamic cache utility, to solve the cache placement and link-cost issues in edge computing. The cache utility considers both the cache value and the link cost of the data and uses a novel algorithm to dispatch data onto edge nodes. Our contributions are as follows: 1) We analyze the optimization mechanism of the edge cache and define the content placement problem of an edge node. The goal is to make the edge cache meet the actual performance/link-cost requirements of a specific environment, improve overall user QoS, reduce overall link cost, and maximize cache utility.
2) We design a strategy based on dynamic cache utility and CDRL to solve the cache placement problem of edge nodes. Because the CDRL algorithm can effectively reduce overestimation compared with other deep learning algorithms, it achieves better caching results in a delay-sensitive edge computing environment.
3) We evaluate the performance of the strategy based on dynamic cache utility and CDRL through simulation experiments. The simulation results show that, compared with traditional cache replacement algorithms such as LRU, LFU, GREEDY, and DQN, the proposed method effectively increases the cache hit rate and reduces cache cost and latency.
The rest of the paper is organized as follows. Section 2 summarizes related work, and Section 3 describes the problem model and the CDRL strategy. Section 4 introduces a cache placement algorithm for edge computing, and Section 5 proves that the cache placement problem is NP-complete. Section 6 analyzes the experimental results. Finally, we conclude the paper and outline future work.

II. RELATED WORK
This section presents a brief overview of the content caching selection mechanisms defined in different works. Existing studies can be divided into two categories. The first category optimizes traditional methods or applies convex optimization to cloud caching. For instance, Li et al. [6] analyzed the delay, bandwidth, data rate, and fault-tolerance problems of data transmission across different cloud data centers and, targeting the single-node caches of different cloud data centers, proposed for the collaborative caching system a Bayesian-network load-balancing cache placement algorithm based on historical predictions to optimize file access speed in the cloud data center. Jameela et al. [7] introduced the concept of distributed cloud caching, which uses the small caches available on servers or network fog nodes to reduce the load on the data center; two-way parallel transmission reduces the download time of newly released software, games, and movies from the release point to the client. Yi Peng et al. [8] comprehensively considered energy consumption optimization and performance improvement in content delivery networks (CDNs) and proposed an implicit cooperative caching mechanism for energy consumption optimization.
Due to the explosive growth of multimedia service demand in mobile networks, overloaded transmission links and the severe shortage of buffer space have become urgent problems. Researchers have conducted in-depth studies on cache optimization schemes for information-centric networks (ICNs) and content-centric networks (CCNs). For example, Zhijiang et al. [9] focused on optimizing cached content placement in an ICN environment and proposed a cache offloading strategy with edge-priority feedback. Based on the data-rate requirements of backhaul links, Li et al. conducted a systematic study of caching techniques in cellular networks, including macro-cellular networks, heterogeneous networks, device-to-device networks, cloud radio access networks, and fog radio access networks; caching algorithms are introduced from three aspects: content placement, content delivery, and their joint design and interaction. Ge et al. [11] developed an optimization model for a hierarchical collaborative caching system for mobile content distribution networks (MCDNs) to minimize the overall cost of user access to resources and, based on this model, proposed a utility-based heuristic hierarchical coordinated caching strategy. Caching research targeted specifically at edge computing remains comparatively scarce, though it aligns with the research directions of ICNs and CDNs. For example, YAYUAN et al. [15] proposed a generic multimedia caching framework for edge IoT-assisted devices (EACM). This framework facilitates intelligent caching on edge IoT devices; it leverages fine-grained adaptive leasing and tuning of edge IoT devices to adapt to the temporal and spatial dynamics of user demand. Avrachenkov et al. [16] developed a low-complexity distributed asynchronous content placement algorithm to solve the cached content placement problem. Ferragut et al. investigated time-to-live (TTL) caching policies. Ioannidis et al. [18] proposed a distributed adaptive content placement algorithm that performs stochastic gradient ascent on a concave relaxation of the expected cache gain and constructs probabilistic content placements around the expected optimum. Wang et al. [19] proposed a distributed framework to integrate data forwarding and dynamic cache layout. Yan et al. [20] proposed Caching rules in Buckets (CAB) to alleviate the rule-dependency problem by dividing the rule space into regions (buckets) and caching the rules associated with the requested regions. Chen et al. [21] integrated SDN, caching, and computing to propose SD-NCC, a new integrated framework focusing on energy consumption.
Although many collaborative caching mechanisms exist, few optimization approaches consider cache value and link cost together. The key to studying cache-utility optimization mechanisms in edge environments is to develop a collaborative model of edge caching and, on that basis, to research utility optimization methods for high-density content during cache placement. In this paper, we start from the value of content and the cost of links, and propose a corresponding utility optimization scheme for the optimal cache placement strategy in the edge environment.

III. PROBLEM DEFINITION AND SYSTEM MODEL
This section introduces the topology of IoT systems supported by cooperative edge caching and discusses the delay model. Next, the cache replacement model is presented. Finally, we formulate the optimization problem of edge caching for IoT systems. Some key parameters are listed in Table 1.

A. Problem definition
In the environment of mobile edge computing, the placement and optimization of cached content has always been an important research topic. Most existing cache optimization strategies optimize cache placement based on the popularity of cached content or the storage capacity of edge servers. However, there are few studies that optimize caching strategies according to the overall dynamic utility of the cache system. This paper proposes a proactive edge cache placement optimization strategy based on "dynamic cache utility," driven by the cache value and link cost of the content requested by users. Meanwhile, the cache utility weight is dynamically adjusted to achieve dynamic optimization of the cache utility. For the optimal allocation problem of content placement, we define the setting as follows. Base Station (BS): Each base station deploys an edge server for computing and caching. Thus, each base station can cache various contents to satisfy the content service requirements of UEs. Each BS has its cache capacity TOTn and bandwidth bandn. The base station caches the most valuable content when capacity is limited and replaces cached content through optimization methods when capacity is insufficient.
Edge network: The base stations serve a large number of UEs via wireless cellular links, and the base stations are connected to one another via wired optical cables. V denotes the set of BSs, and Vn ∈ V represents BS n; each Vn communicates with the cloud via the core network.
Content: Content sizes vary; for simplicity, we assume each size is an integer number of megabytes (MB). Users (UEs) can fetch requested content from a BS, or the content can be downloaded directly from the cloud to the BSs.
Cache strategy: Given the BS set V and content set C, the content placement strategy is a many-to-many function; that is, elements of the content set C need to be placed on subsets of the server set V. For a given edge network, there can be many content placement strategies. A BS that needs content can obtain it from the storage of a nearby BS or from the cloud.
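The many-to-many placement described above can be sketched as a binary matrix. The following minimal illustration uses our own variable names (they do not appear in the paper):

```python
import numpy as np

# Placement strategy as a binary matrix x[n][i]:
# x[n][i] == 1 means BS n caches content Ci (illustrative sketch).
N, M = 3, 5                      # 3 base stations, 5 contents
x = np.zeros((N, M), dtype=int)
x[0, [0, 2]] = 1                 # BS 0 caches contents 0 and 2
x[1, [2, 4]] = 1                 # BS 1 also caches content 2 (many-to-many)

def edge_holders(i, x):
    """BSs that can serve content i from an edge cache; None means fetch from cloud."""
    holders = np.flatnonzero(x[:, i])
    return holders if holders.size else None
```

Here `edge_holders(2, x)` returns both BS 0 and BS 1, while `edge_holders(3, x)` returns `None`, modeling the fallback to the cloud when no edge server holds the content.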

B. Framework design of CDRL
In Figure 2, we demonstrate the framework of CDRL and its interplay with a typical edge system. The goal of CDRL is to provide a performance- and cost-aware control framework that can dynamically reconcile both the cache action and the utility configuration in an edge system, aiming to achieve a better trade-off between performance and link cost. Therefore, CDRL is designed as a two-layer control framework that relies on a reinforcement learning mechanism to make optimal resource-management decisions at runtime. As shown in Figure 2, CDRL consists of four components: a two-stage learning model, a cache controller, a cache action, and a utility configurator.
Cache controller: This is the critical component of the edge cache system, which is designed to collect information about various resources in the edge cache system, and control the action of the caching system. It transmits cache/time information to the learning model, and it also directly influences the cache system based on cache action and utility configuration.
Two-stage learning model: This component comprises two reinforcement learning models. The first is a conventional cache learning model, which receives the cache information sent by the cache controller and obtains a dynamic cache placement strategy through training. The second receives the time information from the cache controller and returns the utility parameter used to update the utility function according to the time information.
Utility configurator: This component adjusts how the cache utility is calculated according to the learning results and returns the updated utility calculation to the cache controller.
Cache action: The cache action is divided into three parts: caching to the cloud, caching to the local BS, and caching to other BSs. The cache action is derived from the learning results, and the cache action matrix is sent to the cache controller.

C. Cache utility model
When a user requests specific content from BS Vn, its possible sources are the user's own cache, the local BS cache, other BS caches, and cloud storage. We assume that the user's request interests are given and that resource search and request transmission and reception are handled appropriately. In existing systems, user interests are generally obtained by analyzing historical request content or by online real-time methods. Resource search and transmission usually adopt cache exchange protocols and GPRS tunneling technology.
Edge servers are usually deployed in commercial or industrial sites. We mainly consider an edge computing scenario in a city's commercial area containing a set of N BSs, denoted V1, V2, ..., VN. These servers are scattered across the business district, and each BS Vn has a cache utility Wn(t). At a specific time, a total of M contents, C1, C2, ..., CM, need to be cached across the N BSs, and there are U users in the field.
One of the main ideas of this paper is to dynamically balance cache value against link cost; hence a comprehensive utility function is proposed. Content popularity, cache capacity, and cache replacement cost in the cloud-edge cache are combined as the cache value, from which the link cost is subtracted. This combined utility function can be customized according to requirements. The goal is to maximize the overall cache utility, meeting the general caching requirements of edge computing with maximum caching efficiency. Therefore, the maximum cache utility at time t can be expressed by equation (1):

max Σ_{n=1}^{N} Wn(t)    (1)

When a BS is requested to cache content, let the cache utility of server Vn be Wn(t), the cache value be VL_n^i(t), and the link cost be TM_n^i(t). The cache utility model of a single edge server is then obtained as

Wn(t) = Σ_{i=1}^{M} x_n^i (VL_n^i(t) − TM_n^i(t)),    (2a)
x_n^i ∈ {0, 1},    (2b)

where x_n^i = 1 indicates that content Ci is to be cached on Vn, and x_n^i = 0 otherwise. VL_n^i(t) and TM_n^i(t) represent the caching value and the caching link cost of content Ci on server Vn, respectively. According to equations (2a)-(2b), a high caching utility for server Vn indicates that the server obtains a higher caching value and a lower link cost when caching content.
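The per-server utility just described can be sketched as a one-line reduction. This is a minimal sketch under our own reading of the formulas: we assume the link-cost weight (called `lam` here) multiplies the link-cost term, and all names are illustrative:

```python
def cache_utility(x_n, VL_n, TM_n, lam=0.5):
    """Wn(t) = sum_i x_n[i] * (VL_n[i] - lam * TM_n[i]).
    x_n: 0/1 placement decisions; VL_n: cache values; TM_n: link costs.
    lam is our stand-in for the paper's link-cost weight (an assumption)."""
    return sum(xi * (vl - lam * tm) for xi, vl, tm in zip(x_n, VL_n, TM_n))
```

For example, with placements [1, 0, 1], values [2.0, 1.0, 3.0], costs [1.0, 4.0, 2.0], and lam = 0.5, the utility is (2 − 0.5) + (3 − 1) = 3.5; raising lam penalizes link-heavy placements more.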
The cache value and link cost of content Ci are expressed by equation (3), where P_n^i(t), AVAn(t), and RPn(t) represent the popularity of content Ci for server Vn, the available cache size, and the cache replacement cost of server Vn at time t, respectively. TM_n^i(t) denotes the link cost of server Vn caching content Ci, which is defined in detail in equation (8). λ represents the weight of the link cost (expressed by equation (14) in the experiments); the experimental analysis shows that the higher λ is, the lower the overall link cost, but the overall hit rate decreases, and vice versa. The weight λ has a practical advantage: changing it changes the performance requirements the entire cache system satisfies. For example, in environments that require a large amount of cached content, such as Olympic venues or shopping malls on holidays, λ can be reduced to increase overall system performance; when cache demand is small, λ can be increased to lower the overall link consumption and achieve overall energy saving. The cache value and link cost are described in detail below.

D. Cache value
In most of the literature, cache value is related only to the popularity of the content; for BSs, however, the limited capacity of BSs and frequent cache replacements also change the cache value. This article therefore defines cache value as follows. The popularity of the content is generated randomly from a Zipf distribution:

P_n^i(t) = (r_n^i(t) + q)^(−a) / Σ_{j=1}^{M} (j + q)^(−a)    (6a)
s.t. random(t) ∈ (0, 1)    (6b)

The Zipf distribution closely matches the distribution of users' actual resource requests in a cache system [25] and is widely used in the simulation of cache algorithms. Following the experience of the literature [27], r_n^i(t) is the popularity ranking of content Ci at node n at time t, and the parameters are q = 0 and a = 0.8; the parameter values of the Zipf distribution are discussed in the experiments. random(t) denotes a random function over (0, 1). AVAn(t) represents the available cache space of server n. A high value of RPn(t) reflects a high cost of replacing the content cached on server Vn, while a low value indicates that caching content on the server is more worthwhile. If the currently requested content already exists on the server, no cache replacement cost is generated; if the requested content is not on the server, the cache replacement cost must be defined. It is determined as follows: datai represents the size of the content, and bandn denotes the network transmission bandwidth of server Vn. If the size of content Ci is smaller than the available cache, there is no replacement cost; otherwise, the replacement cost is datai / bandn. The cache replacement cost represents the cost of exchanging data between the content and the server when replacement occurs: for any content replaced on any server, the price paid is directly related to the content size, server capacity, and transmission bandwidth.

E. Link cost
Since the minimum-link-cost policy must be respected as much as possible when deciding the content acquisition source, the prioritization of content sources must be determined in the caching decision problem. According to the model assumptions, ordering link cost from smallest to largest, the source priority is: the local edge server > other edge servers > the cloud server. When calculating the link cost of a specific resource acquisition, if a resource placement policy is given, the resource-acquisition caching decision can be derived from this priority and the corresponding link cost obtained.
As shown in Figure 3, in the edge collaborative caching system, when a user requests particular content, its possible sources are the user's own BS cache, other BSs' caches, and the cloud, and the link cost of fetching the content differs by source. The link cost used in the model represents the network cost of the end-to-end connection. It is not an actual physical quantity; it integrates delay and bandwidth and reflects the network cost of end-to-end data transmission. Link quality can be estimated by measuring delay and bandwidth; in the model, we assume the link quality is known. The greater the link cost, the higher the link delay, the lower the bandwidth, and the poorer the stability. Let d_u^i be the request rate of user u for content Ci: it represents the probability that user u requests content Ci within a period and reflects the user's interest in that particular resource.

TM_n^i(t) = Σ_{u=1}^{U} y_u^i · data_i · d_u^i · (Bc_i · Trc_n(t) + Be_i · Tre_n(t) + Tr_n^u(t))    (8)
Trc_n(t) represents the link cost from the cloud to BS Vn, Tre_n(t) represents the link cost from other BSs to Vn, and Tr_n^u(t) represents the link cost of delivering the content from the BS Vn serving user u to user u. Bc_i = 1 means content Ci needs to be transmitted from the cloud to the BS, and Bc_i = 0 otherwise. Be_i = 1 means content Ci needs to be transferred from other BSs to the local BS, and Be_i = 0 otherwise. y_u^i = 0 means the content already exists at user u, so no link cost is incurred; y_u^i = 1 means user u does not hold content Ci and must obtain it from the BS cache. In our hypothesis, when a user issues a request to the edge computing network, the default value is y_u^i = 1.
Fig. 3. Edge system topology supporting caching.
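The per-user accumulation in equation (8) can be sketched directly. This is a minimal sketch under our reading of the notation; the per-user dict layout is our own illustration, not the paper's data structure:

```python
def link_cost(data_i, users, trc_n, tre_n):
    """TM sketch following Eq. (8): a user u with y == 1 contributes
    d * data_i * (bc*Trc_n + be*Tre_n + tr_u); a user with y == 0
    (content already held locally) contributes nothing."""
    total = 0.0
    for u in users:
        if u["y"] == 0:          # user already holds the content: no link cost
            continue
        hop = u["bc"] * trc_n + u["be"] * tre_n + u["tr_u"]
        total += u["d"] * data_i * hop
    return total
```

For instance, a user with request rate 0.5 fetching a 10 MB content from the cloud (bc = 1) over a cloud link of cost 4 plus a last-hop cost of 1 contributes 0.5 x 10 x 5 = 25, while a user who already holds the content contributes nothing.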

F. Collaborative two-stage learning strategy
We combine the deep Sarsa and deep Q-network algorithms to decide which content is suitable to cache at edge nodes. Sarsa learning is introduced into deep Q-networks to remove overestimation. The framework of the two-stage learning is shown in Figure 4. Collaborative two-stage deep reinforcement learning is separated into pre-learning with a Double Deep Q-Network (DDQN) in the cloud and dynamic learning (Sarsa) at the edge. Pre-learning obtains a state matrix, an action matrix, and a reward matrix from historical data contributed by the utility model, and generates an initial strategy for the edge nodes. Dynamic learning selects action a_{t+1} in state s_{t+1} before updating with reward r_{t+1}, and the dynamic cache strategy changes the real-time edge cache state. Experience replay stores all samples and removes correlations through random sampling.
In the overall system training process, we first randomly select a data set in the cloud-edge environment and obtain the utility model, and then train the Q-network to obtain the initial placement strategy by minimizing the loss function in each iteration period. After the initial placement strategy is obtained, the DDQN and Sarsa algorithms are applied to the dynamic caching process to reach the overall optimal dynamic caching decision. We use deep learning in the two-stage learning model to obtain the cache placement strategy and better overall utility.
The dynamic caching strategy is obtained by the dynamic training of the Sarsa algorithm. The difference from the DDQN algorithm is that, before the Q-function is updated, the agent has already performed action a_{i+1} in state s_{i+1} and obtained reward r_{i+1}: its estimate of future reward is the actual reward of the action really taken. The Sarsa algorithm continuously evaluates the behavior policy and adjusts the greediness of the behavior policy with respect to the Q-function. Through the experience replay sequence, behavior transitions smoothly from previous states, avoiding learning oscillation or divergence. Each learning step copies the weights of the online network, which helps make the neural network more stable than standard online Q-learning. The overall reward value changes when a server performs a caching operation.
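The on-policy/off-policy distinction above is easiest to see in the tabular updates. The following sketch (our own illustration, not the paper's network code) contrasts the two: Sarsa bootstraps on the action actually taken, while Q-learning bootstraps on the max, which is the source of the overestimation bias the two-stage design targets:

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap on the action a2 the agent actually took in s2."""
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap on max_a' Q(s2, a') -- the max operator is what
    introduces the overestimation bias."""
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
```

Starting from identical tables where Q(s2, taken-action) = 1.0 but max Q(s2, .) = 2.0, the two rules move Q(s, a) by different amounts: Sarsa toward 1 + 0.9 x 1.0, Q-learning toward the larger 1 + 0.9 x 2.0, illustrating the optimistic bias.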
We model the content cache placement process in the BSs as an MDP [25]. The cache status, cache action, utility configuration, and feedback reward are defined as follows.
Cache status: The cache status of content is represented by s_n^i, where s_n^i = 1 means BS n caches content Ci, and s_n^i = 0 otherwise. We write s for the aggregate state, defined as s = {s_1^1, s_1^2, ..., s_n^i, ..., s_N^M}.
Utility configuration: The utility configuration sets the size of the weight λ. Through learning, λ can be adjusted within the interval (0, 1) to dynamically adapt the utility calculation to the naturally time-varying environment; λ(s_n^i) represents the value of λ in status s_n^i.
Feedback reward: When the local BS takes action a, the system receives a feedback reward, determined by the overall cache value and link cost. To obtain the maximum system reward and ensure maximum utility in the dynamic process, we use the difference in utility as the system reward:

R(s, a) = ΔW, with λ = f(t),    (10)

Formula (10) involves two procedures. The first determines the value of λ according to the current time state f(t), where f(t) is a time-based flow estimation function (which can be designed according to realistic scenarios). The second determines the caching strategy according to the current cache state. In each decision, λ is fixed first, and the caching strategy is then determined. ΔW represents the difference in utility before and after content caching.
In this paper, under the dueling reward framework of DDQN, the agents of the BSs share the weights of the low-level network and have different high-level networks for the rewards; the normalized reward is R(s, A). DDQN performs the pre-training and obtains the initial caching strategy; the action of the learning model is defined in equation (11), and DDQN iteratively updates the state to obtain the initial optimal caching strategy. The cloud server first transmits the contributed historical data to the pre-training module through the utility model. The pre-training module then obtains the cache state matrix, the system behavior matrix, and the system reward matrix. Finally, the cloud server derives an initial cache placement strategy according to the overall system reward and outputs the cache strategy.
Let ϕ denote the caching strategy, represented by a combination of actions, and let ϕ_1 denote the pre-processed strategy. Based on (10) and (11), the objective is to maximize the long-term reward from an arbitrary initial caching state s1:

max E[ Σ_{t=1}^{∞} γ^{t−1} R(s_t, ϕ(s_t)) ],    (12)

Moreover, a single-agent infinite-horizon MDP with discounted utility (12) can generally be used to approximate the expected infinite-horizon undiscounted value, particularly when γ ∈ [0, 1). The operation performed by the BS is based on the current caching state s, and the basic model of its cumulative reward value (Q-function) is:

Q(s, ϕ(s)) ← Q(s, ϕ(s)) + α_i [ R(s, ϕ(s)) + γ Q(s', ϕ'(s')) − Q(s, ϕ(s)) ],    (13)

where α_i ∈ [0, 1) is the learning rate and Q(s', ϕ'(s')) represents the cumulative reward generated after performing strategy ϕ(s) in state s to reach state s' and then selecting the best strategy ϕ'. γ is the discount factor, which exponentially decreases the value of future rewards. After the server performs operation ϕ(si), the current state si changes to si+1, and the instant reward R(si, ϕ(si)) is obtained. This formula represents the overall reward value after operation ϕ(s) minus the overall reward value at the previous time.
We use DDQN to deploy the pre-training process in the local BS, and the approximately optimal Q-value can be obtained according to the following equation:

Q_target = R(s, a) + γ Q(s', argmax_{a'} Q(s', a'; ω); ω^−),

where ω^− denotes the target-network weights and Q(·; ω^−) denotes the Q-value of the target network. Three neural network models are deployed for each BS: the main networks of Rf and RW, and a target network. Traditional double Q-learning stores two value functions, and each value function is updated with a value from the other value function for the next state. Unlike traditional DDQN, this paper combines the two rewards to obtain a new original Q-function and updates this original Q-function in cooperation with the target Q-function. Moreover, the gradient updates of ω_i can be obtained from the loss L(ω_i). To minimize the loss function L(ω_i) and the overestimation in each episode, we propose the two-stage deep reinforcement learning algorithm for caching, shown in Algorithm 1. This algorithm does not simply use DDQN for pre-training and Sarsa for dynamic training; rather, after pre-training with DDQN, actions in dynamic training are selected ɛ-greedily with ɛ = 0.1. In other words, with probability ɛ a random action is performed, and otherwise the action that maximizes the Q-value is taken.
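The decoupling of action selection and evaluation in the DDQN target can be sketched in a few lines. This is a minimal illustration under standard Double-DQN semantics (function name and signature are ours):

```python
import numpy as np

def ddqn_target(r, q_online_next, q_target_next, gamma=0.9, done=False):
    """Double-DQN target: the ONLINE network selects the next action,
    the TARGET network evaluates it, damping the max-operator bias."""
    if done:
        return r
    a_star = int(np.argmax(q_online_next))    # selection by the online net
    return r + gamma * q_target_next[a_star]  # evaluation by the target net
```

With r = 1, online next-state values [0.2, 0.8], and target next-state values [0.5, 0.3], the online net picks action 1 and the target evaluates it at 0.3, giving 1 + 0.9 x 0.3 = 1.27; a plain DQN target would instead take the target net's own max (0.5) and yield the larger, more optimistic 1.45.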
The two-stage deep reinforcement learning algorithm, shown in Algorithm 1, has two main procedures. 1) Procedure 1 (pre-training process): First, the relevant parameters of DDQN are initialized in the experience replay to obtain the low-level weights of the main network and the weights of the target network (lines 1-3). Then the main and target networks are trained with global history information to obtain the initial policy, and the obtained states, strategies, rewards, and value functions are saved to the experience replay (lines 4-6).
2) Procedure 2 (online training process): When a BS receives a content request, the caching action is first executed ɛ-greedily, and the cache state and cache policy are reset and saved to the experience replay (lines 9-13). The greedy strategy is evaluated with the online network, but the target network is used to estimate its value. Then Sarsa is executed to train the caching process and update all parameters (lines 14-17).
In Algorithm 1, the algorithm's complexity mainly depends on the number of transitions and the number of backpropagations in double deep reinforcement learning [26]. The difference from [26] is that we use the deep reinforcement learning network twice, optimizing learning for different goals. Pre-training can be performed before the strategy is deployed, as it only requires historical information and its training process does not interfere with actual cache execution. Therefore, the execution efficiency of the two-stage strategy after deployment is not substantially different from that of other deep reinforcement learning approaches.

IV. CACHE PLACEMENT ALGORITHMS
The edge cache placement problem is NP-complete. Considering the value of cached content and the link cost, we start from user requests and design a heuristic algorithm based on the greedy idea. The core of the algorithm lies in the caching decision process: at each step, the objective value of the global cache utility is guaranteed to increase. Algorithm 2 presents the pseudo-code of the cache placement algorithm based on dynamic cache utility. First, the reward functions Rf and RW are constructed, an empty hash map is initialized to store cache decisions, and the Q-network is initialized. Next, part of the data from the experience replay sequence is taken to pre-train the main network and the target network (lines 1-4).
If the capacity of the BS Vn that needs to cache the content is not full, the content Ci is cached directly, and the status si is updated synchronously. If the requested content is not cached in the BS and the capacity is full, the deep Q-network is executed to train the caching process and update all parameters (lines 5-10). Finally, the caching strategy is placed into the hash map (line 11).

4.  foreach i ∈ C: if BS Vn needs to cache Ci
5.      if Vn is not full then
6.          Cache the content Ci by the two-stage deep reinforcement learning algorithm
7.          Update the state si
8.      endif
9.  endforeach
10. f(HashMap<C, Vn>) ← overall cache decision
11. return HashMap<C, V>

The optimal placement strategy serves as a guiding strategy for each server's cache. However, it is difficult to find an optimal strategy because the decision version of the problem is NP-complete. The following section proves the NP-completeness of the problem.

V. NP-COMPLETE PROOF
If we can reduce the NP-complete set cover (SC) problem to the cache placement problem, we can prove that the decision version of the cache placement problem is NP-complete. First, define the decision version of the cache placement problem: given an edge network G(V, E), where V1, V2, ..., VN ∈ V are BS nodes with cache utilities W1, W2, ..., WN, and a content set C1, C2, ..., CM, the variable x_i^n indicates whether the content Ci is placed on the BS Vn.
The SC problem is defined as follows: given a set F of m elements {F1, F2, ..., Fm}, a collection H of k subsets of F, and an integer q, decide whether there is a sub-collection of H of size at most q that covers all elements of F.
Let <m, k, q, H1, H2, ..., Hk> represent an instance of the SC problem, and let <M, N, W1, W2, ..., WN, x_i^n, R> represent an instance of the cache placement problem. Below we show that the SC instance can be reduced to an instance of the cache placement problem in polynomial time.
The mappings m → M and k → N set the number of contents and the number of BSs, respectively; q → R means the number of BSs used to cache contents should not exceed q; and under the x_i^n rule, x_i^n takes the value 0 or 1 depending on whether the subset Hn contains the element Fi. From these rules, the reduction from SC to cache placement can be completed in polynomial time.
First, assume that the SC instance has a feasible solution, i.e., at most q subsets in H cover all m elements. By the reduction rules, in the corresponding cache placement instance the contents can be cached on at most R BSs while achieving the required overall utility. Conversely, if the SC instance has no feasible solution, at least q+1 subsets are needed to cover all m elements, and correspondingly no placement within the bound R achieves the required utility. Therefore, the cache placement problem is NP-complete.
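The reduction itself is mechanical, which is what makes it polynomial-time. A small Python sketch of the mapping (function and variable names are illustrative, not from the paper):

```python
def set_cover_to_placement(m, subsets, q):
    """Map an SC instance <m, {H1..Hk}, q> to a cache placement instance.

    m       : number of elements F1..Fm      -> number of contents M
    subsets : list of k sets of element ids  -> k BSs
    q       : cover size bound               -> bound R on the BSs used

    Returns (M, N, x) where x[i][n] = 1 iff content Ci sits on BS Vn,
    i.e. iff element Fi belongs to subset Hn.
    """
    M, N = m, len(subsets)
    x = [[1 if i in subsets[n] else 0 for n in range(N)] for i in range(M)]
    return M, N, x
```

The loop touches each (element, subset) pair once, so the mapping runs in O(m·k) time, which is polynomial in the SC instance size.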

VI. EXPERIMENT ANALYSIS
In this section, we evaluate the proposed algorithm from the perspective of the comprehensive utility of the BSs. We consider two to four base stations in the simulation, each with a maximum coverage radius of 300 m. In addition, each base station has 20 channels, the channel bandwidth is 10 MHz, and the transmission power is 40 W. To make the experiment realistic, we obtain global and local content popularity from real data sets. We use a large-scale offline MSN data set derived from the application Xender, which is widely used to share content in India. We used the experimental data from August 1, 2016 to August 31, 2016, covering 450,786 user traces; more than 153,482 files were shared, and the number of requests was 271,785,953 [27]. To match different scenarios, we designed different user request probabilities based on different f-functions.
The simulation was performed on one Dell server as the cloud, which contains a 20-core CPU, 128 GB of RAM, and a 2 TB hard disk, and four Dell servers as the edge, each of which includes a 4-core CPU, 8 GB of RAM, and a 1 TB hard disk. For each request, we record execution information, including the request type, end-to-end completion time, link cost, and whether it is a cache hit. This information is used to calculate the hit rate, average delay, number of backhauls, and link cost for each experiment. The double DQN consists of a fully connected feedforward neural network with one hidden layer of 200 neurons, used to construct the original Q-network and the target Q-network.
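Aggregating the per-request logs into the four reported metrics is straightforward; a minimal sketch (the record field names are assumptions, not the paper's log schema):

```python
def summarize(records):
    """Aggregate per-request logs into hit rate, average delay,
    backhaul count, and total link cost.

    records: list of dicts with keys 'hit' (bool), 'latency_ms' (float),
             'link_cost' (float); misses are served over the backhaul.
    """
    n = len(records)
    hits = sum(r["hit"] for r in records)
    return {
        "hit_rate": hits / n,
        "avg_latency_ms": sum(r["latency_ms"] for r in records) / n,
        "backhauls": n - hits,                 # every miss triggers a backhaul fetch
        "total_link_cost": sum(r["link_cost"] for r in records),
    }
```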

A. Experimental value of weight parameter
In the experiment, we vary the episode and the number of contents to show how these parameters affect the overall cache performance of each algorithm. Following [23], the delay between the BS and the user is proportional to the Euclidean distance; in our simulation, every 1 km is assumed to cause a 1 ms delay. Since the locations of the users and the BSs are known in the simulation, the delay can be calculated directly. The details of the test parameters are given in Table 2.
We first analyze the link-cost weight. The number of contents is 1,000 and 10,000, the number of BSs is 4, and the weight ranges from 0.1 to 0.9. The experiment evaluates the average cache hit rate and the overall link cost over 10 time periods. Figures 6(a) and (b) show that both the hit rate and the link cost decrease as the weight increases. This experiment on the weight is carried out without introducing deep reinforcement learning. Nevertheless, it can be seen that the weight value affects the cached result; therefore, the chosen weight, together with the proposed dual learning framework, determines the final experimental results.
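The trade-off above can be made concrete with a small sketch. The paper's exact utility formula is not reproduced in this excerpt, so the convex combination below is an assumption that merely illustrates how a single weight trades content value against link cost; the delay helper encodes the stated 1 km = 1 ms simulation rule.

```python
def cache_utility(value, link_cost, w):
    """Hypothetical combined utility: larger w penalises link cost more.

    This convex-combination form is an illustrative assumption, not the
    paper's equation; it only shows the direction of the weight trade-off.
    """
    return (1.0 - w) * value - w * link_cost

def delay_ms(user_xy, bs_xy):
    """Simulation assumption: 1 km of Euclidean distance causes 1 ms delay."""
    dx, dy = user_xy[0] - bs_xy[0], user_xy[1] - bs_xy[1]
    km = (dx * dx + dy * dy) ** 0.5 / 1000.0   # coordinates in metres
    return km                                   # numerically equal to ms
```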

B. Comparison and analysis of experimental results
To evaluate the proposed cache placement strategy, we compare it with several widely used caching schemes, as shown below: 1) Least Recently Used (LRU): the least recently used content is replaced first.
2) Least Frequently Used (LFU): the least frequently used content is replaced first.
3) Local greedy on content popularity (GREEDY): the least popular content is replaced first. Figures 7 and 8 show the performance in terms of average content access latency, hit rate, backhaul traffic, and overall link cost. The number of user contents and the server cache size are set to M = 10000 and TOT = 150 Mb. The figures show that GREEDY performs better than the traditional LRU and LFU, and the advantage of collaborative deep reinforcement learning (CDRL) is more prominent still.
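For reference, the LRU baseline can be implemented in a few lines; LFU and GREEDY differ only in the eviction key (access count and popularity score, respectively). This sketch uses Python's `OrderedDict` as the recency list (class and method names are ours, not from the paper):

```python
from collections import OrderedDict

class LRUCache:
    """Least Recently Used baseline: evict the entry untouched the longest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()        # insertion order tracks recency

    def access(self, key):
        """Record an access; return True on a cache hit, False on a miss."""
        hit = key in self.store
        if hit:
            self.store.move_to_end(key)   # mark as most recently used
        else:
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)  # evict the LRU entry
            self.store[key] = True
        return hit
```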
The article defines the meaning of f(t), which can be chosen according to actual conditions. Here we define f(t) as a Poisson-like distribution (formula (14)), where t takes values in [1, 10] and f(t) represents the reward assigned by the deep reinforcement learning system at time t; more f-functions are designed for different scenarios later in this section. Figure 7 shows how the performance of each caching strategy varies over time; as time goes on, the overall system tends to stabilize. The CDRL strategy is compared with classic caching methods such as LFU, LRU, and GREEDY. As the figure shows, CDRL has significant advantages over the other approaches in terms of cache hit rate, delay, backhaul traffic, and overall link cost. The GREEDY strategy, a local greedy algorithm on content popularity, outperforms the traditional LFU and LRU but still lags clearly behind CDRL. Under CDRL, more requests hit the cache nodes, which reduces the average link cost of resource access and improves user experience. From Figures 7(a) and 7(b), compared with the other three strategies (LRU, LFU, GREEDY), the hit-rate improvement of CDRL is 22.1%, 28.8%, and 19.6%, and the delay reduction is 5.1%, 11.3%, and 3%, respectively. Figure 7(c) shows that the proposed algorithm offloads more backhaul traffic, exceeding the other algorithms by 14.2%, 21.1%, and 10.1%. Benefiting from the calculation of the overall link cost for each cache, Figure 7(d) shows that the average reduction of the overall link cost is 9.4%, 9.3%, and 3.3%, respectively.
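Formula (14) itself is not reproduced in this excerpt; as an illustrative stand-in for a "Poisson-like" reward over t in [1, 10], one could use the shape of the standard Poisson pmf (the function name and the rate parameter `lam` are our assumptions):

```python
from math import exp, factorial

def f_poisson(t, lam=3.0):
    """Illustrative Poisson-pmf-shaped reward over integer t in [1, 10].

    Not the paper's formula (14); a stand-in with the stated property of
    peaking at a characteristic time (near t = lam) and decaying afterwards.
    """
    assert 1 <= t <= 10
    return lam ** t * exp(-lam) / factorial(t)
```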
Figure 8 shows the overall performance of the BSs as the number of contents varies from 10,000 to 100,000, again compared with policies such as LFU, LRU, and GREEDY. As shown in Figures 8(a)-(d), the CDRL strategy has a significant advantage over the other strategies. From Figure 8(a), the CDRL algorithm performs well when the number of contents is small; as the number of contents increases with the number of servers unchanged, the overall hit rate decreases, but the hit rate of the CDRL algorithm remains 32.9%, 53.9%, and 5.1% higher than LRU, LFU, and GREEDY, respectively. In Figure 8(b), the latency of all algorithms is low when there is little content, but as the number of contents increases, most servers need to change content more frequently under every policy, which increases latency; the average latency of the CDRL algorithm is 6.4%, 9.2%, and 1% lower than the other algorithms, respectively. In Figure 8(c), the backhaul traffic inevitably increases with the number of contents; compared with the other algorithms, CDRL reduces the backhaul traffic by 7.9%, 14.3%, and 0.6%, respectively. In Figure 8(d), when there is little content, there is little difference in link cost among the algorithms because the link load is small; as the content increases, the link cost of all algorithms grows accordingly, and the link cost of CDRL is 10.1%, 9.1%, and 7.7% lower than the other algorithms, respectively.
The above experimental results show that the performance and link-cost requirements of the overall system can be dynamically balanced according to actual conditions: reduce the weight value when high performance is required, and increase it when a lower link cost is needed. Because the cache cost is considered together with performance, better performance indicators can be obtained than with other strong cache replacement algorithms.
Finally, we discuss the values of the exploration probability ɛ and the scenario function f(t). Figure 9 compares the hit rate of CDRL under different exploration probabilities ɛ = 0.1, ɛ = 0.6, and ɛ = 0.9. The exploration probability can greatly affect caching efficiency, so extensive experiments are needed to determine the best exploration probability before deploying a caching policy. Moreover, we consider three different scenarios: shopping malls, factories, and schools, and accordingly propose three different scenario functions. f1(t) denotes the cache request rate model for the mall scenario. The factory pattern is distinctly shaped by the commute, so the cache request rate model f2(t) for the factory scenario assumes a commute time of 3-10, with a high request rate during working hours and a low request rate at other times. The cache request rate for the school scenario is set to fluctuate within an interval over time:

f3(t) = (sin(t) + 0.5) / 32    (20)

We measured the cache hit rate and link cost for the three f(t) in these three scenarios, as shown in Figure 10. As the figure shows, the cache hit rate and link consumption stay within reasonable ranges and tend toward the optimum over time, even in the scenarios with more volatile requests.
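A minimal sketch of the two knobs discussed above: the school-scenario request-rate function as reconstructed from equation (20), and standard ɛ-greedy action selection (the mall and factory functions f1, f2 are not reproduced in this excerpt, so they are omitted; function names are ours):

```python
import math
import random

def f3(t):
    """School scenario, eq. (20) as reconstructed: (sin(t) + 0.5) / 32."""
    return (math.sin(t) + 0.5) / 32.0

def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With ɛ = 0.1 the agent mostly exploits its learned Q-values, while ɛ = 0.9 keeps it close to random search, which matches why the hit rate in Figure 9 is so sensitive to this parameter.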

VII. CONCLUSION AND OUTLOOK
In this article, we propose a method for cache placement in edge computing. First, a system framework based on an overall variable utility and two-stage deep reinforcement learning is presented, and the details of the general utility and the two-stage deep reinforcement learning are introduced in turn. The cache value and the link cost together determine the total utility, and the system cache is dynamically optimized through two-stage deep reinforcement learning. Experimental results show that the proposed strategy can be applied in edge computing systems and has clear advantages over other widely used caching strategies.
There is still much work to do: 1) optimizing the utility model. Content popularity is modeled with a Zipf distribution, but popularity may vary across times and events; if content popularity can be predicted more accurately through machine learning and deep learning methods, the overall cache utility can be significantly improved. 2) Optimizing the link-cost model. The link cost in this article combines the delay and bandwidth dimensions and is simplified to a known quantity, but how to calculate the link cost in an actual network still needs a better solution.
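For completeness, the Zipf popularity model mentioned above can be sketched in a few lines (the skew parameter `s = 0.8` is an illustrative choice, not a value from the paper):

```python
def zipf_popularity(n_contents, s=0.8):
    """Zipf popularity: the k-th ranked content is requested with
    probability proportional to 1 / k**s, normalised to sum to 1."""
    weights = [1.0 / (k ** s) for k in range(1, n_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]
```

Improving on this would mean replacing the static rank-based probabilities with time-varying, learned estimates, which is exactly the first open problem listed above.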