Deep Reinforcement Learning with Importance Sampling for QoE Enhancement in Edge-Driven Video Delivery Services


Adaptive bitrate (ABR) algorithms adapt the video bitrate to the underlying network conditions in order to improve the overall video quality of experience (QoE). Further, with the rise of multi-access edge computing (MEC), a higher QoE can be guaranteed for video services by performing computations on edge servers rather than cloud servers. Recently, reinforcement learning (RL) and, in particular, asynchronous advantage actor-critic (A3C) methods have been used to improve ABR algorithms and have been shown to enhance the overall QoE compared to fixed-rule ABR algorithms. However, a common issue in A3C methods is the lag between each actor's behavior policy and the central learner's target policy. As a result, the behavior and target policies are no longer synchronized with one another, which leads to suboptimal updates. In this work, we present a deep reinforcement learning approach with importance sampling for edge-driven video delivery services to achieve a better overall user experience. We refer to our proposed approach as ALISA: Actor-Learner architecture with Importance Sampling for efficient learning in ABR algorithms. ALISA incorporates importance sampling weights to assign more weight to relevant experience, thereby addressing the lag issue in existing A3C methods. We present the design and implementation of ALISA and compare its performance to state-of-the-art video rate adaptation algorithms, including vanilla A3C and several fixed-rule schedulers. Our results show that ALISA provides up to 25%-48% higher average QoE than vanilla A3C, and the gains are even higher when compared to fixed-rule schedulers.


Introduction
The global edge computing market is projected to grow to a multi-billion-dollar scale at a CAGR of 38.4%, reaching $61.14 billion by 2028 [25]. This growth is largely expected to be driven by the deep integration of artificial intelligence (AI) systems on edge devices, helping them make important decisions in a split second. AI has the potential to improve a multitude of mobile and edge services such as video streaming, online gaming, voice over IP, smart home applications, and remote health monitoring. In particular, video streaming is projected to be a major contributor to global Internet traffic, with each user streaming an average of 35 GB of video data per month and video accounting for about 77% of all mobile data traffic [8]. Hence, it is of prime importance to improve the quality of experience (QoE) of video streaming for users.
Dynamic Adaptive Streaming over HTTP (DASH) [28] has become a prominent standard for streaming video content over the best-effort Internet. In general, ABR algorithms have been widely explored for improving QoE in DASH-based video streaming [4]. ABR algorithms dynamically change the video bitrate based on the underlying network conditions, such as buffer occupancy and observed throughput, to provide a higher QoE for users. These algorithms, however, follow a fixed set of rules to make decisions and are often optimized for specific scenarios. This makes such methods difficult to generalize to the wide variety of network conditions prevalent in today's ever-evolving networks.
With the rise of multi-access edge computing (MEC), which offers low latency, high bandwidth, and context awareness by performing computations on network edge devices rather than a centralized cloud [10], there have been several recent works on DASH-based edge-driven video services [3,11,16,34]. In [11], a real-time method for QoE estimation is proposed for edge-driven mobile video applications. Such estimation enables the service provider to optimize the video quality and bit rate to assure a better QoE for users in dynamic environments. In [16], the authors propose a novel caching method that stores only the highest bit-rate version of a video at the edge server instead of storing copies at different bit rates. The encoding workload is offloaded from the cloud server to the edge server, and the proposed method reduces the video load time and increases the cache hit rate compared to traditional store-and-forward caching mechanisms. Additionally, ML-driven content fetching for DASH is proposed in [3] to improve the cache-hit ratio and reduce backhaul link utilization. Recently, an edge-based adaptive bit rate (ABR) algorithm is presented in [34], where the edge server decides the video chunk quality and adapts the rate based on the client's network conditions.

Reinforcement learning (RL) [31] is an area of machine learning concerned with how agents ought to take actions in an environment to maximize some notion of cumulative reward. Several recent works have explored the integration of RL methods into video streaming with encouraging results [19,27], where the goal is to achieve a high QoE. RL techniques based on asynchronous advantage actor-critic (A3C) methods [21] have shown several advantages over fixed-rule ABR algorithms. Several researchers [19,27] have used a vanilla A3C method to generate adaptive bit rates for improving the overall QoE. The A3C [22] agent consists of multiple actors and a central learner with a critic. Each actor generates experience based on its own behavior policy, independently and in parallel with the other actors. The individual experiences are then sent to the central learner, which updates the target policy (the policy that the A3C agent aims to learn) according to the generated experience. However, in many cases, A3C agents require a large amount of data to learn a suitable policy. Increasing the number of actors is a common solution for processing this large amount of data while maintaining a low compute time. However, in such cases, each actor's behavior policy starts lagging behind the central learner's target policy [9]. As a result, the behavior and target policies are no longer synchronized, which results in suboptimal updates. To alleviate this problem, the use of importance sampling weights is suggested in [9], where weights are assigned to give more importance to relevant experience and less importance to experience that is less probable.

In this paper, we propose deep reinforcement learning with importance sampling for QoE enhancement in edge-driven video delivery services. Our solution is referred to as ALISA: Actor-Learner architecture with Importance Sampling for enhancing QoE in ABR algorithms. The proposed method generates adaptive bit rates using an RL-based actor-learner architecture without relying on any pre-programmed model or assumptions about the underlying system.
The novel contribution of the current work is to integrate deep reinforcement learning with importance sampling to train, learn, and generate adaptive bit rates at the edge server. The research contributions of this work are as follows:
• First, we propose the integration of actor-critic methods with importance sampling weights [9] to generate ABR for edge-driven video delivery services. By assigning importance sampling weights and, consequently, giving more importance to relevant experience, our solution learns faster and provides a better overall QoE than state-of-the-art ABR algorithms.
• Second, we present the performance of our proposed approach using different data sets including traces from FCC [1], Norway [26], OBOE [2] and live video streaming [32]. We present a comprehensive study using three different variants of QoE metrics formulated as rewards for utilizing deep reinforcement learning. Finally, we also present the comparison over different network characteristics considering both lossless and lossy scenarios.
• Third, we present a comparison of our proposed approach with several state-of-the-art ABR algorithms. This includes a comparison with the basic implementation of A3C, i.e., vanilla A3C [19], and with other non-RL ABR algorithms including RB [30], BOLA [29], and RobustMPC [33]. Our results show that ALISA provides up to 25%-48% higher average QoE than vanilla A3C, whereas the gains are even higher when compared to fixed-rule schedulers.
The remainder of the paper is organized as follows. Section 2 presents the related work on HTTP-based video services and ABR algorithms. Section 3 presents the relevant background on reinforcement learning and actor-critic methods. Section 4 presents the problem statement, the integration of importance sampling weights, the proposed algorithm, and the system design. We present the experimental setup and results in Section 5 and Section 6, respectively. Finally, we conclude our work in Section 7.

Related Work
In this paper, we focus on HTTP-based video delivery services with ABR generation at the edge server, as shown in Figure 1. Such systems are primarily implemented using DASH [28]. We focus on an edge-driven ABR approach, as in [34], where ABR decisions are made by the edge server instead of the clients, so that the computing power and storage capacity of the edge can be utilized efficiently. In such systems, the videos are stored in discrete chunks on the content server. After the client connects, the edge server requests each chunk at a specific video bit rate from the content server on the client's behalf. The edge server uses an ABR algorithm to select the video bit rate based on the client's network conditions and forwards the corresponding request to the content server. Several ABR algorithms have been proposed [13,24,29,33] to generate adaptive bit rates for video distribution over wireless networks. These algorithms can be primarily classified as rate-based or buffer-based.
Rate-based algorithms [30] predict the bitrate of the future chunk as the maximum supported bitrate based on the available network bandwidth and the past chunk history. Many rate-based algorithms, however, suffer from bias in the system [18], leading to overestimation of the available bitrate. In contrast, buffer-based algorithms [13,29] make predictions based on the client's buffer occupancy.
However, since most of the proposed strategies operate based on pre-defined rules, they suffer from several drawbacks. First, these algorithms are susceptible to sudden changes in network conditions, which can lead to inaccurate predictions. Second, several methods can be used to achieve a higher QoE, but there are trade-offs among them. For example, selecting the highest supported bitrate for every chunk can lead to a loss of smoothness due to fluctuations in video resolution. Finally, the bitrate selection for the present chunk can often affect the bitrate selection for future chunks. For example, downloading chunks at the highest possible bitrate can result in a lower bit rate and quality for future chunks in order to prevent rebuffering.
Recently, there has been a focus on the development of a class of ABR algorithms that rely on reinforcement learning. There have been several attempts to use Q-Learning for this task [6,7]. In these works, tabular Q-Learning is applied, which makes it infeasible to scale to larger state spaces. Some works [6] also assume the Markovian property to hold; however, this is not always a good assumption, because the predicted bitrate need not depend only on the last seen chunk and can instead be affected by multiple historical chunks. To address the issue of Q-Learning becoming intractable with an increasing feature space, actor-critic methods for ABR generation have been explored in [14,19,27]. In these papers, an A3C agent is used to generate ABRs and achieves a better QoE than most other fixed-rule ABR algorithms. However, there are some known issues with A3C [5,9,12,17,20], one of which is that an actor's behavior policy lags behind the central learner's target policy. This impacts the performance of the A3C agent in existing RL-based video delivery systems, resulting in lower sample efficiency and the learning of a suboptimal policy. In this paper, we propose and investigate assigning importance sampling weights [9] to experiences based on their relevance, to overcome a significant drawback of existing implementations of A3C agents in HTTP-based video delivery systems.

Background
In this section, we present a brief overview of reinforcement learning and actor-critic methods.

Reinforcement Learning
A reinforcement learning solution aims to learn a mapping from the state space to the action space through repeated interaction between the RL agent and the environment. The RL problem is modeled as a Markov decision process with agent states and actions. Let us consider a discrete system where, at each time step t ∈ {0, 1, 2, ...}, the RL agent observes its state s_t, takes an action a_t, moves to state s_{t+1}, and receives a reward r_{t+1}. Further, for a sequence of states and actions, the discounted cumulative reward is defined as

R(\tau) = \sum_{k \ge 0} \gamma^{k} \, r_{t+k+1},

where τ is the sequence of states and actions, i.e., {(s_t, a_t), (s_{t+1}, a_{t+1}), ...}, and γ ≤ 1 is a discount factor. The agent selects actions based on a policy, where π_θ(s_t, a_t) is the probability that action a_t is taken in state s_t and θ are the policy parameters upon which the actions are based. Following the policy π, the value function V(s) for a state s is defined as

V(s) = \mathbb{E}_{\pi}\left[ R(\tau) \mid s_t = s \right].

The goal of an RL agent is to find the optimal policy π* that maximizes the overall discounted reward. The optimal policy is given by

\pi^{*} = \arg\max_{\pi} \, \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right].
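As a small, self-contained example of the discounted return defined above (independent of the paper's implementation), the Python snippet below computes R(τ) for a short reward sequence with γ = 0.99.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative reward R(tau) = sum_k gamma^k * r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: rewards observed over a 5-step episode.
print(discounted_return([1.0, 0.5, 0.0, 2.0, 1.0]))   # 1 + 0.495 + 0 + 1.9406 + 0.9606 ~= 4.396
```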

Actor-Critic Methods
As an improvement to value-based methods, actor-critic methods have been proposed [22], which make an update using gradient ascent at every step without having to wait until the end of an episode. The policy update with respect to its parameters θ is defined in terms of the gradient operator ∇ as follows,

\theta \leftarrow \theta + \alpha \, \nabla_{\theta} \log \pi_{\theta}(s_t, a_t) \, q_w(s_t, a_t),

where α is the actor learning rate and q_w(s_t, a_t) is the critic function that indicates how good an action a_t is in state s_t. The parameters w of the critic function are updated as follows,

w \leftarrow w + \xi \, \big( r_{t+1} + \gamma \, q_w(s_{t+1}, a_{t+1}) - q_w(s_t, a_t) \big) \, \nabla_{w} q_w(s_t, a_t),

where ξ is the critic learning rate.
However, a policy network trained in this manner may have high variance, which can cause instability during training. To mitigate this issue, the advantage actor-critic (A2C) framework [22] introduces the advantage function to determine the advantage of the action taken in state s_t compared to the average value of actions in s_t. The advantage function is defined as the temporal-difference (TD) error:

A(s_t, a_t) = r_{t+1} + \gamma V(s_{t+1}) - V(s_t).

The final gradient-based update for the actor is as follows,

\theta \leftarrow \theta + \alpha \sum_{t} \nabla_{\theta} \log \pi_{\theta}(s_t, a_t) \, A(s_t, a_t) + \beta \, \nabla_{\theta} H\big(\pi_{\theta}(\cdot \mid s_t)\big),

where H(π_θ(·|s_t)) is the entropy factor, which promotes random actions, and β is the regularization term. The entropy term is defined as

H\big(\pi_{\theta}(\cdot \mid s_t)\big) = - \sum_{a} \pi_{\theta}(a \mid s_t) \log \pi_{\theta}(a \mid s_t).

The value of β is initially set to a high value to promote exploration early on, and it is reduced as training progresses. To enhance training speed, the asynchronous advantage actor-critic (A3C) framework [22] simulates multiple actors in parallel and asynchronously. These actors synchronize their parameters with the central learner at regular intervals.
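As a generic illustration of these updates (not the paper's implementation), the numpy sketch below computes the per-step TD-error advantage, the policy-gradient loss, the entropy bonus, and a squared-TD critic loss for a discrete bitrate action space; the coefficients β = 0.1 and 0.5 on the critic loss are example values.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def a2c_losses(action_probs, actions, rewards, values, next_values, gamma=0.99, beta=0.1):
    """Per-step A2C quantities: TD-error advantages, policy loss, entropy bonus, critic loss.

    action_probs : (T, A) action probabilities pi_theta(.|s_t)
    actions      : (T,)  indices of the actions actually taken
    rewards      : (T,)  rewards r_{t+1}
    values       : (T,)  critic estimates V(s_t)
    next_values  : (T,)  critic estimates V(s_{t+1})
    """
    advantages = rewards + gamma * next_values - values          # TD error A(s_t, a_t)
    log_probs = np.log(action_probs[np.arange(len(actions)), actions])
    policy_loss = -np.mean(log_probs * advantages)               # minimizing this ascends the objective
    entropy = -np.sum(action_probs * np.log(action_probs), axis=-1).mean()
    critic_loss = np.mean(advantages ** 2)                       # squared TD error for the critic
    total_loss = policy_loss - beta * entropy + 0.5 * critic_loss
    return total_loss, policy_loss, entropy, critic_loss

# Toy example with 3 steps and 4 discrete bitrate actions.
probs = softmax(np.random.default_rng(0).normal(size=(3, 4)))
print(a2c_losses(probs, actions=np.array([1, 0, 2]),
                 rewards=np.array([1.0, 0.5, 2.0]),
                 values=np.array([0.8, 0.9, 1.1]),
                 next_values=np.array([0.9, 1.1, 0.0])))
```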

Problem Statement and Proposed Solution
In this section, we present the problem formulation and the proposed solution including the system design details for deep RL based edge-driven video delivery system.

The Issue
Reinforcement learning agents often require a large amount of experience to model the environment effectively and accurately. Increasing the number of actors during training is a common technique to achieve this goal. However, there is an inherent issue with this method. As shown in Figure 2, the central learner first synchronizes its weights with all the actors (Step 1), and then the actors provide their experience to the central learner (Step 2). However, there are situations when the updates from some of the actors lag. For example, as shown in Figure 2, the central learner updates its target policy even before it receives the experience from the actor on the right side (Step 3). Therefore, the behavior policy corresponding to this experience lags behind the target policy, and the experience may not be as relevant for updating the current target policy. This issue is only aggravated by the presence of more actors, and learning shifts off-policy as actors generate experience with an older version of the behavior policy. The lack of synchronization between the behavior and target policies results in suboptimal updates. Ultimately, this leads to learning an overall suboptimal policy. We aim to develop methods that counteract the lagging of the behavior policy behind the target policy during training, which in turn helps us achieve better performance on unseen test data.

(Figure 2 caption) Step (1): Each actor synchronizes its weights with the central learner; Step (2): One of the actors provides experience to the central learner, which updates the weights of the target policy; Step (3): The other actor provides experience to the central learner. The behavior policy for this experience is not synchronized with the latest version of the target policy (updated in Step 2); hence the experience is based on an older policy.

Integration of Importance Sampling and ALISA Algorithm for Policy Update
Importance sampling is a commonly used method to address the issue of a data distribution mismatch. It enables us to estimate the expected value of a function f(x), where x follows a probability density function q on the domain D, using values sampled from a different distribution r on the same domain D, as

\mathbb{E}_{x \sim q}\left[ f(x) \right] = \mathbb{E}_{x \sim r}\left[ \frac{q(x)}{r(x)} \, f(x) \right]. \quad (9)

Importance sampling transforms the data drawn from the distribution r such that it appears to be sampled from the distribution q. This effectively addresses the distribution mismatch issue, which in our case occurs due to the lagging of the behavior policy behind the target policy.
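As a rough, self-contained illustration of this identity (not taken from the paper), the Python snippet below estimates E_q[x^2] for a target Gaussian q = N(1, 1) using samples drawn from a different Gaussian r = N(0, 1), reweighting each sample by q(x)/r(x). The choice of distributions and the sample size are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Target distribution q = N(1, 1); sampling (behavior) distribution r = N(0, 1).
x = rng.normal(loc=0.0, scale=1.0, size=100_000)   # samples drawn from r, not from q
f = x ** 2                                          # function whose expectation under q we want

weights = norm.pdf(x, loc=1.0, scale=1.0) / norm.pdf(x, loc=0.0, scale=1.0)  # q(x) / r(x)
estimate = np.mean(weights * f)                     # importance-sampled estimate of E_q[f(x)]

print(f"importance-sampled estimate: {estimate:.3f}")   # should be close to E_q[x^2] = 2.0
```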
To overcome the distribution mismatch between the target policy π and the behavior policy µ, we employ importance sampling. Correlating to the notation in Equation (9), we have q ≡ π and r ≡ µ. Similar to the authors of [9], we use the n-step V-trace target to correct for the off-policy shift. The n-step V-trace target serves as an estimate of the value function V for the target policy π using an older version of the behavior policy µ. The n-step V-trace target is defined as

v_t = V(s_t) + \sum_{j=t}^{t+n-1} \gamma^{\,j-t} \left( \prod_{i=t}^{j-1} c_i \right) \delta_j V, \quad (10)

where

\delta_j V = \rho_j \big( r_{j+1} + \gamma V(s_{j+1}) - V(s_j) \big)

is the temporal difference, and

\rho_j = \min\!\left( \bar{\rho},\, \frac{\pi(a_j \mid s_j)}{\mu(a_j \mid s_j)} \right), \qquad c_i = \min\!\left( \bar{c},\, \frac{\pi(a_i \mid s_i)}{\mu(a_i \mid s_i)} \right)

are the importance sampling weights. The importance sampling weights ρ_t and c_i give more importance to experience that is more relevant to the target policy than to the behavior policy. Here, π denotes the target policy and µ denotes the behavior policy.
• ρ̄ and c̄ are threshold (truncation) values for their corresponding importance sampling weights, which we set to 1 throughout our work.
• ρ_t denotes how much more probable the action a_t taken in state s_t is according to the target policy compared to the behavior policy.
• \prod_{i=j}^{t-1} c_i denotes how much more probable the predicted path from state s_j to s_{t-1} is according to the target policy compared to the behavior policy.
Subsequently, the V-trace targets are used in place of V for gradient computation. The n-step V-trace target can also be defined recursively as

v_t = V(s_t) + \delta_t V + \gamma \, c_t \big( v_{t+1} - V(s_{t+1}) \big), \quad (11)

which we use during implementation throughout our work. As a result of these calculations, actions which are more likely to be taken according to the target policy contribute more to the V-trace target. Hence, the importance sampling weights help the reinforcement learning model focus on experience that is more relevant, leading to better parameter updates, while assigning less importance to suboptimal experience. Using the above definition of the V-trace target, Algorithm 1 outlines ALISA's policy update with the importance sampling weights, where n equals the length of the episode. The V-trace target calculation in ALISA takes as input the information related to an episode, consisting of the sequence of states (s), the sequence of action probabilities according to the behavior policy (a_b), and the sequence of rewards (r), along with the metaparameters ρ̄ and c̄ and the actor and critic models. As a consequence of importance sampling, actions which are more likely to be taken by the current target policy contribute more to the gradients than actions which are likely to be taken by earlier, lagging versions of the target policy but not the current one. This algorithm guides the RL model to focus on experience that matters more and to assign less importance to other, less relevant experience.
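To make the V-trace computation concrete, the Python sketch below computes truncated importance weights and V-trace targets for one episode using the recursive form in Equation (11). It is a minimal illustration under our own naming (vtrace_targets, pi_probs, mu_probs, and the toy numbers are not from the paper), not the authors' exact Algorithm 1, and it bootstraps the recursion with the critic's last value estimate.

```python
import numpy as np

def vtrace_targets(rewards, values, pi_probs, mu_probs,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets for one episode via the recursive form of Eq. (11).

    rewards[t]  : reward received after taking action a_t
    values[t]   : critic estimate V(s_t), length T + 1 (includes bootstrap value)
    pi_probs[t] : probability of the taken action a_t under the target policy pi
    mu_probs[t] : probability of the taken action a_t under the behavior policy mu
    """
    T = len(rewards)
    ratios = pi_probs / mu_probs
    rhos = np.minimum(rho_bar, ratios)        # truncated weights rho_t
    cs = np.minimum(c_bar, ratios)            # truncated weights c_t

    v = np.zeros(T + 1)
    v[T] = values[T]                          # bootstrap with the critic's last value
    for t in reversed(range(T)):
        delta = rhos[t] * (rewards[t] + gamma * values[t + 1] - values[t])
        v[t] = values[t] + delta + gamma * cs[t] * (v[t + 1] - values[t + 1])
    return v[:T]

# Toy usage with made-up numbers for a 4-step episode.
rewards = np.array([1.0, 0.5, 0.0, 1.5])
values = np.array([0.8, 0.9, 0.7, 1.0, 0.6])       # V(s_0) ... V(s_4)
pi_probs = np.array([0.5, 0.4, 0.6, 0.3])          # target-policy probabilities
mu_probs = np.array([0.6, 0.5, 0.4, 0.3])          # behavior-policy probabilities
print(vtrace_targets(rewards, values, pi_probs, mu_probs))
```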

ALISA: System Design
ALISA uses the familiar DASH framework and deep reinforcement learning algorithms such as A3C as key components of its design and improves on them to achieve a better QoE for video streaming. Figure 3 shows the main components of our system. The user streams a video on their device, which connects to the edge server. The edge server contains several components. The main component is the ABR controller, which handles the process of choosing actions. It observes several state parameters, such as the bandwidth, the bitrate selection history, and the buffer occupancy, and decides the action to take, i.e., the bitrate selection.

We now describe the training process used for the ABR controller. As shown in Figure 4, the training setup consists of several actors coordinated by a single central learner. Each actor holds a behavior policy as its parameters, while the central learner maintains the target policy and the critic parameters. The behavior policy, the target policy, and the critic function are each modeled as a neural network. The training process can be considered as the repetition of the following steps until convergence:

1. First, an actor simulates an episode and generates a batch of experience consisting of the states, the corresponding actions taken by the actor, and the rewards received as a result.
2. The experience is then passed back to the central learner.
(Algorithm 1: Calculation of V-trace targets using ALISA for an episode.
Input: s: sequence of states; a_b: sequence of action probabilities according to the behavior policy; r: sequence of rewards; ρ̄: threshold for ρ; c̄: threshold for c; actor: target policy from the central learner; critic: critic model.
Result: v: V-trace targets.
V ← critic values for s
p_b ← behavior policy probabilities for the optimal action
a_t ← target policy probabilities for s
p_t ← target policy probabilities corresponding to the optimal actions of the behavior policy, computed using a_t and p_b)

5. The critic gradients are computed using the observed states and their corresponding rewards, while the target policy gradients are computed using the observed states, the corresponding actions, the obtained rewards, and the V-trace targets. The target policy and the critic network are then updated using backpropagation.
6. Finally, the central learner shares the latest version of the target policy with each actor, and the actors set their behavior policies to this newest target policy to generate the next batch of experience.
ALISA effectively decouples the processes of acting and learning, while also correcting for the off-policy shift that can occur as a result. This has significant implications for learning ABR algorithms. Since large amounts of video are streamed by users across the world, and with the upcoming advancements in edge computing, ALISA's architecture makes it possible to use this massive amount of data to continuously fine-tune the ABR algorithm and adapt to ever-changing network conditions, all without putting the privacy of users at risk. While the edge devices continuously make bitrate selection decisions, these decision choices can be passed on to the central learner on a remote cloud server, where learning can take place in a federated [15] manner. The latest policy can be synchronized between the edge devices and the remote server at regular intervals. In this setting, the policy of the central learner may lag behind the policy with which the edge devices select bitrates. This distribution mismatch is effectively handled by the importance sampling integrated into ALISA.

Experimental Details and Training Methodology
In this section, we describe the experimental setup together with the performance metrics used to evaluate ALISA's performance.

Experimental Setup
We use the Python-based framework proposed in [19] to generate and test our ABR algorithms. The client requests chunks of data from the edge server and provides it with parameters pertaining to the observed network conditions, such as the bandwidth, buffer occupancy, and bitrate history. We have integrated importance sampling as described in Section 4 and assigned weights to the target and behavior policies. To emulate network conditions for effectively testing our trained RL models, we use the MahiMahi [23] framework, a record-and-replay HTTP framework suited to this task.
We use four datasets as part of our training set: the broadband dataset provided by the FCC [1], the mobile dataset collected in Norway [26], the OBOE traces [2] and the live video streaming dataset [32], which have been pre-processed according to the MahiMahi format.
Subsequently, we compare ALISA with the following state-of-the-art ABR algorithms:
• Vanilla A3C [19]: uses vanilla A3C, without any additional techniques, to train the agent for delivering adaptive bit rates.
• Rate-Based (RB) [30]: RB predicts the maximum supported bitrate based on the harmonic mean of the past observed throughputs (a rough sketch of this idea is given after this list).
• BOLA [29]: Bitrate selection is done exclusively based on buffer occupancy, using Lyapunov optimization.
• RobustMPC [33]: MPC uses buffer occupancy observations and throughput predictions similar to RB. Additionally, RobustMPC accounts for the error between predicted and observed throughputs by normalizing the throughput estimates by the maximum error seen in the past 5 chunks.
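As a rough illustration of the rate-based idea referenced above (not code from the RB paper), the snippet below predicts the next chunk's sustainable bitrate as the harmonic mean of the last few observed throughput samples and then picks the highest available bitrate not exceeding that estimate; the window size and bitrate ladder are arbitrary assumptions.

```python
def harmonic_mean(samples):
    """Harmonic mean of positive throughput samples (Mbps)."""
    return len(samples) / sum(1.0 / s for s in samples)

def rate_based_select(past_throughputs_mbps, bitrate_ladder_mbps, window=5):
    """Pick the highest bitrate not exceeding the harmonic-mean throughput estimate."""
    recent = past_throughputs_mbps[-window:]
    estimate = harmonic_mean(recent)
    feasible = [b for b in sorted(bitrate_ladder_mbps) if b <= estimate]
    return feasible[-1] if feasible else min(bitrate_ladder_mbps)

# Example: throughput samples in Mbps and an illustrative DASH bitrate ladder.
history = [3.2, 2.8, 4.1, 1.9, 2.5]
ladder = [0.3, 0.75, 1.2, 1.85, 2.85, 4.3]
print(rate_based_select(history, ladder))   # -> 1.85 (harmonic mean ~= 2.72 Mbps)
```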
To quantify the performance of the ABR algorithms, we use the following formulation of QoE:

\mathrm{QoE} = \sum_{i=1}^{N} q(b_i) \;-\; \mu \sum_{i=1}^{N} T_i \;-\; \sum_{i=1}^{N-1} \left| q(b_{i+1}) - q(b_i) \right|, \quad (12)

where b_i and q(b_i) represent the bit rate and quality, respectively, for chunk i. A higher bit rate means a higher quality and a higher QoE. However, there are also penalties due to the rebuffering time T_i (represented by the second term) and fluctuations in video quality (represented by the final term) that hinder the overall smoothness. In this paper, we evaluate the performance of the proposed approaches with three QoE variants [19] that depend on the above general QoE metric:
• QoE_lin: q(b_n) = b_n, with the rebuffer penalty set to µ = 4.3;
• QoE_log: q(b_n) = log(b_n / b_min), which captures the fact that the marginal improvement in quality decreases at higher bitrates, with µ = 2.66; and
• QoE_HD: assigns a higher value to high-quality bitrates and a lower value to low-quality bitrates, with µ = 8.
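For concreteness, the small Python sketch below evaluates the linear QoE variant of Equation (12) with q(b) = b and µ = 4.3; the chunk bitrates and rebuffering times in the example are made-up numbers.

```python
def qoe_linear(bitrates_mbps, rebuffer_secs, mu=4.3):
    """Linear QoE (Eq. (12) with q(b) = b): quality - rebuffer penalty - smoothness penalty."""
    quality = sum(bitrates_mbps)
    rebuffer_penalty = mu * sum(rebuffer_secs)
    smoothness_penalty = sum(abs(b_next - b) for b, b_next
                             in zip(bitrates_mbps, bitrates_mbps[1:]))
    return quality - rebuffer_penalty - smoothness_penalty

# Example: four chunks, one short rebuffering event after the second chunk.
bitrates = [1.2, 2.85, 2.85, 1.85]      # selected bitrates per chunk (Mbps)
rebuffers = [0.0, 0.4, 0.0, 0.0]        # rebuffering time per chunk (seconds)
print(qoe_linear(bitrates, rebuffers))  # 8.75 - 1.72 - 2.65 = 4.38
```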

Dataset Details
We use the following data sets for training, validation, and testing. Our selection of data sets is in line with the previous experimental setups in [19], [2], and [32]. For training, we use three different data sets. The first data set consists of 127 traces, of which 59 belong to the FCC [1] dataset while the remaining 68 belong to the Norway HSDPA [26] dataset. The second data set consists of 428 OBOE traces [2], and the third data set consists of 100 live video streaming traces [32]. To demonstrate the benefits of ALISA across different trained models, we generate three different trained models corresponding to the three data sets described above. For all three trained models, we use the same validation data set, i.e., 142 Norway traces. Finally, for all three trained models, after validation, testing is performed using 205 traces from the FCC dataset and 250 traces from the Norway HSDPA dataset.

Training Methodology
We train three models for each configuration, one each with QoE_lin, QoE_log, and QoE_HD as the reward metric. For each model, we use a consistent set of hyperparameters throughout. The discount factor γ is set to 0.99. The learning rates are set to 0.0001 and 0.001 for the actor and critic, respectively. Additionally, we set both importance sampling thresholds ρ̄ and c̄ to 1. We train multiple models for different configurations of entropy weights. First, we train several models with a constant entropy weight for 100,000 epochs. Next, we use a decaying entropy weight, where the entropy weight is gradually decreased over 100,000 epochs.
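The snippet below illustrates the two entropy-weight schedules described above. The 0.1 floor and the 20,000-epoch step follow the decaying schedule described in the results section; the starting value and decay factor in this sketch are assumptions for illustration.

```python
def constant_entropy_weight(epoch, weight=0.1):
    """Constant entropy regularization weight (e.g., 0.1 for QoE_lin / QoE_log)."""
    return weight

def decaying_entropy_weight(epoch, start=1.0, floor=0.1, step_epochs=20_000, decay=0.5):
    """Step-decayed entropy weight: multiply by `decay` every `step_epochs`, never below `floor`.

    The start value and decay factor are illustrative assumptions; the paper only states
    that the weight is decreased every 20,000 epochs down to 0.1.
    """
    weight = start * (decay ** (epoch // step_epochs))
    return max(weight, floor)

for epoch in (0, 20_000, 40_000, 60_000, 80_000):
    print(epoch, decaying_entropy_weight(epoch))
```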

Testing Methodology
We select the model with the highest validation QoE for testing. We perform testing under both lossless and lossy conditions simulated using the MahiMahi [23] framework. We perform tests under packet loss percentages of 0%, 0.1%, 0.5%, 1% and 2%, where random packets are dropped from the video stream. We evaluate all models on the three different QoE metrics discussed in Section 5.1.

Results
In this section, we present the results and comparison of ALISA with other state-of-the-art ABR algorithms.

Convergence Speed
ALISA takes advantage of the importance sampling strategy during training. As a result, it is often able to achieve a higher QoE than vanilla A3C in a shorter time. Figure 5 presents plots of the epochs elapsed versus the maximum QoE achieved up to that point. These plots are generated during training on the first data set, i.e., the 127 traces from the FCC and Norway data sets. Our results show that by the time ALISA obtains a fairly high QoE of over 40, vanilla A3C is only able to reach a highest QoE of approximately 35. This demonstrates ALISA's advantage in learning and adapting to new conditions faster, resulting in shorter training times. Similar results are observed during training on the OBOE and live video streaming traces.
Comparison with State-of-the-Art ABR Algorithms

Results with Training and Validation Data Sets
We perform a comprehensive set of training runs on the three data sets and report our results on the QoE_lin, QoE_log, and QoE_HD metrics in Table 1. We note that the models do not converge well when trained with a very low or very high entropy weight. We found a constant entropy weight of 0.1 to work well for the QoE_lin and QoE_log metrics, while 0.75 worked well for QoE_HD. We have also trained multiple models using decaying entropy regularization: we start from a high value and gradually decrease the entropy weight every 20,000 epochs, going down to 0.1. We find that decaying entropy regularization is more effective in almost all cases, as seen from Table 1, since after a few epochs a high level of exploration is no longer required to reach an optimal policy.

Results with Test Data Sets
We also compare ALISA to several other state-of-the-art ABR algorithms such as RB, buffer-based (BB), BOLA, and RobustMPC, described in the previous section. We have also compared ALISA with Vanilla A3C, a basic RL-based A3C approach that does not utilize the importance sampling weights. ALISA obtains a higher QoE than BB, a 30% higher QoE than BOLA, a 25% higher QoE than RobustMPC, and a 20% higher QoE than Vanilla A3C when tested under lossless conditions. This performance translates to lossy conditions as well. We note that ALISA is able to obtain up to 25%, 28%, 48%, and 48% higher QoE compared to vanilla A3C under losses of 0.1%, 0.5%, 1%, and 2%, respectively. We summarize the remainder of our testing results for random packet loss percentages of 0.1%, 0.5%, 1%, and 2% in Table 3, Table 4, Table 5, and Table 6, respectively. These results indicate that ALISA achieves significantly better performance than many other fixed-rule ABR algorithms as well as vanilla A3C. Further, we also visualize the different components of the QoE metric from Equation (12) to understand how ALISA performs better than other ABR algorithms. Figure 7 shows that ALISA consistently achieves higher bitrates than the other methods, which increases the first component of the QoE. From Figure 8, we note that ALISA maintains a lower buffer size than vanilla A3C, leading to a decrease in the second component of Equation (12), i.e., a lower rebuffer penalty. Overall, this leads to a higher quality of experience for ALISA over the other ABR algorithms.

Conclusion
We show how importance sampling and a structured entropy selection significantly improve the performance of vanilla A3C methods on the task of generating ABR algorithms for edge-driven video delivery services. By employing these methods as part of our proposed system, ALISA, we consistently achieve an improvement in QoE of 25-48%, and higher in certain cases. We also test our methods under a wider variety of conditions in terms of packet losses and find similar improvements. Finally, we also visualize and compare the bitrate selection and buffer size of ALISA with those of other ABR algorithms and find that ALISA performs better on both aspects, leading to an improved quality of experience. Future work includes the investigation of advanced hybrid cloud-edge architectures for the deployment of ALISA. Further, we also aim to investigate ALISA in a federated setup to utilize distributed training across multiple decentralized edge devices.

Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.

Acknowledgments
Informed consent: For this type of study, formal consent is not required. Authorship contributions: MN and VD contributed equally to this manuscript. MG and PS supervised the process. The manuscript was first written by MN and VD and then edited by MG and PS.