Mode Selection of Content Delivery at Edge Nodes Based on Learning

Pre-caching at edge nodes may improve resource efficiency in future content-centric cellular networks by bringing contents closer to users. However, which mode (multicast or unicast) should be selected for the efficient delivery of those cached contents is still not well addressed, especially in situations where key parameters such as content popularity and transmission environments are unknown. To solve this problem, a criterion for delivery mode selection is studied based on the learning policy of the multi-armed bandit. Based on this criterion, a mode selection algorithm is proposed, in which edge nodes can choose the better delivery modes in the current slot depending only on observations from previous slots. Performance evaluation results validate our analyses and proposals on mode selection.


Introduction
Recently, the dramatic increase of traffic load over cellular networks has stimulated the investigation of offloading techniques. One valid approach is content caching at network edge nodes, such as base stations (BSs) of macro and small cells (i.e., micro-cells, pico-cells, and femtocells) and user equipments (UEs). This differs from traditional cloud-based content delivery networks (CDNs) [1], where the CDN components are located in the cloud, which may degrade the quality-of-experience (QoE) through initial delay and stalls; an edge (fog) stratum between the cloud and the end user provides a potential solution for efficient content retrieval by caching contents near users.
After caching, the most important step is content delivery, which usually has two transmission modes: unicast and multicast. In unicast mode, content delivery follows a request-to-response pattern, where an edge node responds to one user at a time. For example, in [2], contents are pre-cached at macro BSs (MBSs), and the probability that video requests experience low initial delays is improved by video-aware backhaul and wireless channel scheduling. In [3], considering content delivery by small cell BSs (SBSs), the authors design a content delivery strategy to find the best SBS to serve the requesting UE. In [4], users requesting contents are served by connecting to local wireless access points (APs) with cached contents, and a caching-and-delivery scheme is designed to approximate the optimal tradeoff between transmission rate, storage, and access cost. Besides content delivery by base stations, some studies investigate content delivery by users [5]. For example, in [6], with contents pre-cached at UEs, the content delivery problem is formulated based on a flow model and delivery mechanisms are studied to provide quality-of-service (QoS) guaranteed service. In [7][8][9], internet of things (IoT) devices such as vehicles and unmanned aerial vehicles (UAVs) are also considered as content caching and delivery nodes to reduce latency and improve user experience. In addition, some studies such as [10][11][12] investigate hierarchical models where contents are cached and delivered by heterogeneous devices. In multicast mode, content delivery achieves high spectral efficiency by serving multiple users simultaneously when they request the same content within the serving scope, as analyzed, e.g., in [13]. In [14], content delivery is realized by selected peers receiving multicast content from selected agents, and the selection problem is formulated as an integer linear programming problem to be optimized.
In [15], an optimal dynamic multicast method is proposed to balance delay and spectral efficiency with consideration of delivery costs such as the average network delay, backhaul cost, and transmission power. In [16], the authors propose a multicast delivery strategy to maximize the successful transmission probability. In [17,18], content sharing schemes via device-to-device (D2D) multicast are studied to provide contents for users with poor cellular links while reducing energy consumption. MBS-based multicast delivery is studied in [19], where a BS pre-caches all contents that might be requested by the UEs and each UE caches a part of each file; different requests can then be satisfied by decoding the content from multicast streams.
Both delivery modes have advantages and disadvantages. Hence, a better understanding of delivery mode selection is a prerequisite for efficient content delivery, yet few solutions exist. In our previous work [20], we compared the two modes and indicated which one is better given concrete parameters such as network parameters, physical parameters, and request parameters. In practice, an edge node cannot, or is reluctant to, collect so many parameters before making a decision. In addition, closed-form expressions of the performance metrics cannot be derived when the distributions of some parameters are unknown. Therefore, in this paper we design online learning strategies for the mode selection of content delivery, which can select a set of delivery modes for requests without knowing content popularity or network parameters.

System Model
We consider a cellular network where edge nodes provide high-data-rate service to their users with cached files. When a user requests a file not cached at its serving edge node, the edge node retrieves this file from the data center and sends it to the user. Since this retrieval does not affect the comparison between the delivery modes discussed here, we assume that requested files are cached at edge nodes.
Denote the set of requested files by F. The size of file f ∈ F is S_f. Time is slotted and each time slot spans t. The instantaneous demand for f is d_f, an independent and identically distributed (iid) random variable with mean λ_f. For simplicity, we assume that users have an identical file popularity distribution. Considering that the file popularity distribution may change with spatiotemporal variation, and that widely known models such as Zipf are only suitable for describing the statistical characteristics of a mass of nodes, we assume that the popularity distribution is unknown at edge nodes. For each request, the serving node responds via one of two alternative delivery modes, i.e., multicast delivery or unicast delivery. There is a controller in each node, which decides the delivery mode for a file request. The objective is to maximize the expected total scheduling reward during an observed time period T by assigning the appropriate delivery mode to each request. The selection criterion can be based on closed-form calculations, random decisions, or greedy or exhaustive search in terms of resource efficiency, QoS requirements, load fairness, etc. We discuss our selection criterion in detail in Sect. 3 and show its advantage by comparing it with other criteria in Sect. 4.
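As a concrete illustration of this demand model, per-slot requests can be drawn from a Zipf-like popularity profile. This is only a sketch with hypothetical parameters (number of files, Zipf exponent, requests per slot); in the system model itself, the edge nodes do not know this distribution.

```python
import random

def zipf_popularity(num_files, exponent=0.8):
    """Normalized Zipf popularity over files ranked 1..num_files."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_files + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def draw_demands(popularity, requests_per_slot, rng):
    """Instantaneous iid demand d_f: number of requests per file in one slot."""
    demands = [0] * len(popularity)
    for _ in range(requests_per_slot):
        u, acc = rng.random(), 0.0
        for f, p in enumerate(popularity):
            acc += p
            if u < acc:
                demands[f] += 1
                break
        else:
            demands[-1] += 1  # guard against floating-point rounding at the tail

    return demands

rng = random.Random(0)
pop = zipf_popularity(5)
d = draw_demands(pop, 100, rng)
```

The most popular file (rank 1) receives the largest share of requests, which is the situation in which multicast delivery is expected to pay off.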
In each slot t, a controller selects an action x_t (the set of delivery modes for all requests) from a convex set X ⊂ R^d based on a given mode selection method (to be discussed in Sect. 3). At each time slot, the instantaneous reward is expressed by h^F(x_t), the sum of the instantaneous rewards h_f(x_t) associated with each file f. The available delivery resources per slot, such as power and bandwidth, are bounded, which is denoted by a constant R_0 here. Then the problem of maximizing the expected total reward can be modeled as

max_{x_t ∈ X} E[ Σ_{t=1}^{T} h^F(x_t) ]  s.t.  g^F(x_t) ≤ R_0,  (1)

where g^F(x_t) is the cost function describing the total cost of delivering files for all requests under action x_t.

Learning on Content Delivery Mode
In this section, an online learning selection strategy is designed to obtain the maximal reward. Since the distribution of the expected reward (shown in Eq. (1)) is unknown, a popular learning method, the multi-armed bandit (MAB) [21], is used here to help make transmission decisions. MAB assumes that a machine has several arms. At each slot, the machine selects one or several arms to pull so as to maximize the expected accumulated reward over time based on its current knowledge, while simultaneously acquiring new knowledge. In our problem, an edge node can be seen as a machine and a delivery mode of a content request can be seen as an arm. At each time slot, the machine chooses some arms to play, i.e., chooses the delivery mode of each requested file, and acquires the random rewards of the played arms. Which arms should be played in a given time slot is based on the outcomes of the arms played in previous time slots, so that the expected reward over a time period is as close as possible to that of the optimal arms. Since the reward model is unknown, the essence of the MAB learning problem lies in the well-known tradeoff between exploitation (aiming to gain immediate reward by choosing arms that past observations suggest are the best) and exploration (aiming to learn the unknown reward distribution by choosing new arms). Here the tradeoff policy uses combinatorial upper confidence bounds (CUCB), which chooses a set of arms to minimize the difference between its expected accumulated reward and that of the policy that always pulls the best arms [22]. Based on CUCB, an index is used to select a set of arms from all arms, which will be discussed later in detail.
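The exploitation–exploration tradeoff above can be sketched with the standard CUCB index (empirical mean plus a confidence bonus, as in the CUCB literature [22]). The per-arm statistics and file/mode names below are hypothetical, kept deliberately minimal:

```python
import math

def cucb_index(mean_reward, pulls, t):
    """CUCB upper-confidence index: empirical mean plus an exploration
    bonus that shrinks as an arm is pulled more often."""
    if pulls == 0:
        return float("inf")  # force each arm to be tried at least once
    return mean_reward + math.sqrt(3.0 * math.log(t) / (2.0 * pulls))

# Per-(file, mode) statistics: (sum of observed rewards, pull count).
# "u" = unicast arm, "m" = multicast arm; values are illustrative only.
stats = {("f1", "u"): (4.0, 8), ("f1", "m"): (9.0, 3),
         ("f2", "u"): (0.0, 0), ("f2", "m"): (1.0, 2)}

def best_mode(file_id, t):
    """Pick the delivery mode (arm) with the largest CUCB index for a file."""
    indices = {}
    for mode in ("u", "m"):
        total, n = stats[(file_id, mode)]
        mean = total / n if n else 0.0
        indices[mode] = cucb_index(mean, n, t)
    return max(indices, key=indices.get)
```

Note that an arm that has never been pulled gets an infinite index, so unexplored modes are always tried before the policy settles into exploitation.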
For each request, multicast mode m and unicast mode u have different costs (e.g., energy consumption and bandwidth occupation) and different rewards (e.g., the number of simultaneously satisfied requests). Under delivery mode ν ∈ {m, u}, we denote the reciprocal of the cost and the reward of delivering f via action x_t as C_f^ν(x_t) and r_f^ν(x_t), respectively. Then, in time slot t, the instantaneous reward can be written as

h^F(x_t) = Σ_{f ∈ F} C_f^ν(x_t) r_f^ν(x_t),  (2)

and the objective in Eq. (1) can be rewritten accordingly.
At each slot, C_f^ν(x_t) is determined by the transmission mode selected for f, i.e., it takes the value associated with unicast or multicast delivery. Based on Eqs. (2) and (5), the optimization problem can be rewritten accordingly.
From the perspective of MAB, a set of arms x_t is chosen and their instantaneous rewards are observed at each slot. To maximize the reward obtained during T, we can construct an index set based on CUCB as R(X) = { r̄_f^ν(x_t) : ∀f ∈ F, x_t ∈ X }, where the index r̄_f^ν(x_t) of f can be expressed as

r̄_f^ν(x_t) = r̂_f^ν(x_t) + sqrt( 3 ln t / (2 N_f^ν) ),

where r̂_f^ν(x_t) is the average observed reward of f delivered via mode ν, and N_f^ν is the number of times mode ν has been used for f.
Next, based on the estimated r̄_f^ν(x_t), the set of arms that maximizes the accumulated reward over T is selected at each slot. According to the CUCB policy, this set is composed of the arms that maximize the estimated reward Σ_f α_{f,t} r̄_f^ν(x_t) at each slot, where α_{f,t} ∈ {0, 1} indicates whether f is delivered in slot t. Thereby, the original optimization problem over T can be transformed into the following per-slot optimization problem:

max  Σ_{f ∈ F} α_{f,t} r̄_f^ν(x_t)  s.t.  g^F(x_t) ≤ R_0,  α_{f,t} ∈ {0, 1}.  (8)
This is a knapsack problem and NP-hard; it can be solved optimally by branching algorithms such as branch and bound (B&B). However, the worst-case complexity of B&B is exponential, the same as exhaustive search. Thus, we use relaxation here for this integer linear programming (ILP) problem by relaxing the binary constraints on α_{f,t} to 0 ≤ α_{f,t} ≤ 1. Then Eq. (8) becomes a linear program (LP) and can be solved in polynomial time. We can sort R(X) such that r̄_i > r̄_j if i < j, ∀i, j ∈ F, and take the corresponding coefficients in that order. The LP solution B(X) contains at most one non-integer element, which is rounded to obtain an approximate solution B′(X) to Eq. (8). B′(X) corresponds to greedily delivering files sequentially, starting from the file with the highest reward, where r̄_f(x_t) = max{ r̄_f^u(x_t), r̄_f^m(x_t) }, until the available transmission resources of the current slot have been completely assigned. As defined in [23], the ratio between the optimal and greedy performance approaches 1 as the number of files grows, i.e., the performance of the given approximate scheme is very close to that of the optimal scheme. With the calculated B′(X), the selected arms, i.e., the delivery modes of the requested files in a slot, can be determined by the edge nodes. The selection procedure is given in Algorithm 1.
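The greedy step above can be sketched as follows. The function and data-structure names are illustrative, and the cost model (one scalar resource cost per file and mode, summed against the budget R_0) is an assumption for the sketch:

```python
def greedy_mode_selection(indices, costs, budget):
    """Greedy approximation to the per-slot knapsack of Eq. (8): for each
    requested file take the mode with the larger CUCB index, then serve
    files in descending index order until the resource budget is spent.

    indices[f][mode] -> CUCB index; costs[f][mode] -> resource cost.
    Returns {file: mode} for the files scheduled in this slot.
    """
    ranked = sorted(
        ((max(indices[f], key=indices[f].get), f) for f in indices),
        key=lambda pair: indices[pair[1]][pair[0]],
        reverse=True,
    )
    schedule, used = {}, 0.0
    for mode, f in ranked:
        if used + costs[f][mode] <= budget:
            schedule[f] = mode
            used += costs[f][mode]
    return schedule

# Illustrative per-slot inputs: three files, unicast "u" vs multicast "m".
indices = {"a": {"u": 1.2, "m": 2.0}, "b": {"u": 0.9, "m": 0.5},
           "c": {"u": 1.5, "m": 1.4}}
costs = {"a": {"u": 0.2, "m": 0.4}, "b": {"u": 0.1, "m": 0.3},
         "c": {"u": 0.3, "m": 0.5}}
sched = greedy_mode_selection(indices, costs, budget=0.6)
```

In this example, file "a" is scheduled via multicast (index 2.0, cost 0.4); file "c" would exceed the budget and is skipped; file "b" still fits via unicast.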

Performance Evaluation
In this section, we present simulation results to evaluate the proposed analysis and algorithm. We assume that the size of each content S_f is randomly chosen from {0.1, 0.2, 0.3, 0.4, 0.5} units. In each time slot, requests are produced according to a Zipf distribution. Generally, multicast may use more resources (e.g., resource blocks (RBs)) than unicast in order to cover the request with the worst communication channel. Here, the costs of unicast and multicast are randomly selected from the sets {0.1, 0.2, 0.3, 0.4, 0.5} and {0.3, 0.4, 0.5, 0.6}, respectively. To evaluate the proposed algorithm, two upper-bound algorithms are compared. In the first upper-bound algorithm (UB1-exhaustive), we assume that the popularity profile is known and exhaustive search is used for optimization in each slot. In the second upper-bound algorithm (UB2-greedy), we assume that the popularity profile is known and greedy search is used for optimization. For comparison, a random algorithm is also considered, which randomly delivers requested files until the delivery resources are exhausted. Figure 1 compares the instantaneous rewards of the different schemes. It is shown that the proposed algorithm achieves 90% of the performance of the exhaustive algorithm and 95% of that of the greedy algorithm, without requiring pre-knowledge of the popularity profile. Compared to the random scheme, which likewise needs neither pre-knowledge of popularity nor complicated computation, the proposed algorithm performs about 30 times better. Figures 2 and 3 compare the rewards of unicast, multicast, and the proposed scheme. Comparing Figs. 2 and 3 shows that both unicast and multicast have appropriate applicable situations, e.g., multicast is more efficient when there are a few highly popular files.
However, the proposed scheme performs much better than using only unicast or only multicast for delivery (e.g., its performance can be more than 10 times better than either). This is because the proposed scheme can learn a better delivery mode for each requested content without knowing the concrete channel quality or network deployment.
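A miniature version of this evaluation can be run end to end. The setup below is not the paper's exact simulation: the hidden per-(file, mode) mean rewards, unit costs, and horizon are all hypothetical, and rewards are Bernoulli samples; it only demonstrates that a CUCB-with-greedy-fill learner outperforms random selection without knowing the reward means:

```python
import math
import random

rng = random.Random(7)
N_FILES, T, BUDGET, COST = 4, 3000, 2.0, 1.0  # unit cost: two files fit per slot
# Hypothetical hidden per-(file, mode) mean rewards; the learner sees only samples.
MEAN = {(f, m): mu
        for f, pair in enumerate([(0.9, 0.2), (0.3, 0.8), (0.5, 0.4), (0.1, 0.6)])
        for m, mu in zip("um", pair)}

def sample(f, m):
    """Bernoulli reward with the hidden mean of arm (f, m)."""
    return 1.0 if rng.random() < MEAN[(f, m)] else 0.0

def run(policy):
    """Play T slots, updating per-arm statistics; return the total reward."""
    total = 0.0
    reward_sum = {k: 0.0 for k in MEAN}
    pulls = {k: 0 for k in MEAN}
    for t in range(1, T + 1):
        for f, m in policy(t, reward_sum, pulls):
            r = sample(f, m)
            reward_sum[(f, m)] += r
            pulls[(f, m)] += 1
            total += r
    return total

def cucb_policy(t, reward_sum, pulls):
    """CUCB index per arm, best mode per file, greedy fill of the budget."""
    index = {}
    for arm, n in pulls.items():
        if n == 0:
            index[arm] = float("inf")  # try every arm at least once
        else:
            index[arm] = reward_sum[arm] / n + math.sqrt(3 * math.log(t) / (2 * n))
    best = {f: max("um", key=lambda m: index[(f, m)]) for f in range(N_FILES)}
    ranked = sorted(best.items(), key=lambda fm: index[fm], reverse=True)
    return ranked[: int(BUDGET // COST)]

def random_policy(t, reward_sum, pulls):
    files = rng.sample(range(N_FILES), int(BUDGET // COST))
    return [(f, rng.choice("um")) for f in files]

learned = run(cucb_policy)
baseline = run(random_policy)
```

With these hidden means, the best fixed choice is unicast for file 0 and multicast for file 1, and the learner converges toward that pair while the random baseline averages far below it, mirroring the qualitative gap reported in Fig. 1.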

Conclusion
In this paper, we have studied learning-based mode selection to make content delivery in cellular networks more resource efficient. A solution for selecting a set of delivery modes for requests during a time period is presented, which depends only on the observed instantaneous demands and rewards of content requests, without knowing many affecting parameters such as network parameters and the file popularity distribution. In this solution, a selection criterion based on the multi-armed bandit (MAB) is given. With this criterion, the proposed content delivery scheme achieves sub-optimal performance without exhaustive or greedy search. The performance evaluation shows that the proposed scheme achieves 90% and 95% of the performance of the optimal and greedy-search solutions, respectively. In addition, it outperforms pure multicast or pure unicast content delivery, e.g., by more than 10 times.