A virtual machine migration policy for multi-tier application in cloud computing based on game theory

Cloud computing technology provides shared computing resources that can be accessed over the Internet. When cloud data centers are flooded by end-users, efficiently managing virtual machines to balance economic cost while ensuring QoS becomes mandatory work for service providers. The virtual machine migration feature brings many benefits to stakeholders in terms of cost, energy, performance, stability, and availability. However, stakeholders' objectives usually conflict with each other. In addition, the optimal resource allocation problem in cloud infrastructure is usually in the NP-Hard or NP-Complete class. In this paper, the virtual machine migration problem is formulated using game theory to ensure both load balance and resource utilization. A virtual machine migration algorithm, named V2PQL, is proposed based on the Markov Decision Process and the Q-learning algorithm. The simulation results, divided into a training phase and an extraction phase, demonstrate the efficiency of our proposal. The proposed V2PQL policy is benchmarked against the Round-Robin policy to highlight its strength and feasibility in the policy-extraction phase.


Introduction
* Correspondence: khietbt@tdmu.edu.vn. 1 Training and Science Technology Department, Posts and Telecommunications Institute of Technology, 11 Nguyen Dinh Chieu, Ho Chi Minh, Vietnam. Full list of author information is available at the end of the article.

Cloud computing technology has made computing resources more and more powerful, abundant, and cheaper, based on the rapid development of processing power and storage; multiple users can access computing resources over the Internet in an on-demand fashion [1]. Cloud computing resources can be adjusted through a user-on-demand mechanism to ensure quality of service (QoS) as well as profits. Through virtualization technology, physical machines (PMs) are accessed by customers in a multi-tenant manner. Virtualization allows creating multiple virtual machines (VMs) on a physical server (PM), and each VM is allocated hardware resources like a real machine, with RAM, CPU, network card, hard drive, operating system, and its own applications. Virtualized resources are flexibly organized for the benefit of applications and software. To take advantage of distributed computing, cloud-deployed applications are often developed based on service-oriented architecture, deployed as a cluster of distinct services communicating with each other through a flexible mechanism [2]. For example, multi-tier applications based on web services, such as three-tier web applications, include a web server tier, an application server tier, and a database server tier; NoSQL applications are deployed based on Cassandra, HBase, Infinispan, and HDFS technologies. Unlike monolithic applications that incorporate tightly integrated modules, applications based on service architecture are well suited to cloud infrastructures. Consequently, resource mismanagement can lead to many problems for customers and their end-users: service level agreement (SLA) violations, energy wastage, increased costs, revenue loss, and so on.
VM migration is one of the major advantages of virtualization for managing cloud computing resources. It allows migrating VMs from one location to another, which frees VMs from the underlying hardware. A VM can be migrated from one PM to another while the VM is still running [3]. This is a major advantage of cloud computing which increases shared-resource utilization. A cloud system can achieve load balance by migrating VMs from over-loaded PMs to lightly loaded ones. VMs running on lightly loaded PMs can be consolidated onto another PM to minimize power consumption by reducing the number of running PMs. A proactive fault tolerance model can be implemented by migrating VMs to other PMs to avoid expected faults before they occur [4]. Besides, the performance of a cloud-based application can be increased by migrating some VMs from resource-limited PMs to resource-rich ones. However, VM migration serves the different purposes of stakeholders, including service providers, customers, and end-users. Optimal resource utilization is essential for the efficient use of resources in a large-scale cloud environment, and optimization problems of this type are usually of the NP-Hard or NP-Complete class [5]. VM migration algorithms can be derived from distributed and parallel computing, such as multiprocessor scheduling, bin packing, and graph partitioning algorithms. Many resource coordination algorithms have been developed, but none is suitable for all applications [6][7][8]. To solve these problems, exhaustive algorithms, deterministic algorithms, or meta-heuristic algorithms [9][10][11] are usually applied according to specific characteristics. In experiments, deterministic algorithms outperform exhaustive algorithms. However, deterministic algorithms are inefficient in large-scale environments [12]. Meanwhile, cloud services need to respond to customers as soon as possible to ensure QoS.
In addition, not only the cloud system but also cloud-hosted applications become more complex at run-time because of the elasticity characteristic and the resource-sharing paradigm. Therefore, it is rarely feasible to have detailed prior knowledge of the cloud system, the cloud-hosted applications, and their interactive dynamics for managing resources effectively. Besides, the majority of physical resources in the cloud computing environment are not homogeneous, nor are customers' resource demands. Heterogeneous resources can cause resource fragmentation, resulting in wasted resources. This issue requires new methods to coordinate resources in stochastic, complex, and heterogeneous systems with limited prior knowledge. These methods should be able to automatically produce effective resource management policies at run-time.
This paper focuses on VM migration solutions that serve the purposes of stakeholders, including cloud service providers and customers. From the perspective of cloud service providers, resource management helps maximize system utilization and reach high profit. For customers, resource coordination helps ensure the service level agreement (SLA). However, when resources are exploited to the maximum, the performance and QoS provided to customers become difficult to satisfy. Meanwhile, customers want to minimize usage costs, and thus minimize service time by requiring more resources. It follows that the objectives of cloud service providers and customers may conflict with each other. Motivated by stakeholder goals, the VM migration problem considered in our work is based on maximizing resource utilization while ensuring customers' SLAs, by balancing the load among PMs to avoid contention and congestion. The problem is modeled as a non-cooperative game based on game theory. To deal with VM migration at run-time, a new approach of continuous learning through interaction, referred to as Reinforcement Learning (RL), should be applied to the dynamic cloud environment. With no prior knowledge about the characteristics of the cloud system, the cloud controller agent takes migration actions and learns on-the-fly about their efficiency through the observed feedback from the cloud infrastructure.
In view of this challenge, this paper proposes a VM migration algorithm based on reinforcement learning that balances the goals of stakeholders, including service providers and customers. The VM migration problem for multi-tier applications is formalized in non-cooperative game theory to ensure the stakeholders' goals. Based on the Markov Decision Process (MDP) [13][14][15], the V2PQL algorithm is developed to trade off load balance and resource utilization in the cloud infrastructure. The optimal VM migration policy is searched for in the training phase: the agents perform actions impacting the environment in order to maximize the total reward resulting from those actions. At discrete moments of time, the agents observe the state of the system and choose an action from the action set impacting the environment. After the training phase, the optimal policy with its Q-values is used to migrate VMs to other PMs. The main contributions of this study are as follows.
• The VM migration for multi-tier applications is modeled using non-cooperative game theory to describe the conflict between the cloud service provider and customers. In this game, the PMs are considered the players, which take on a selfish character in the case of scarce resources [16][17]. Each player tries to maximize its own utility by changing its strategies, trading off load balance and resource utilization.
• The VM migration algorithm, named V2PQL, is proposed by applying the Q-Learning algorithm to solve the VM migration game. Without any prior knowledge, the V2PQL algorithm tries to find an optimal VM migration policy based on the interaction between the agents and the environment. The optimal policy is described as a Q-table that includes states, actions, and Q-values. The Q-table is updated over time by the reinforcement learning mechanism.
• A heterogeneous data center deploying one hundred multi-tier applications is simulated. The optimal VM migration policy is found by the V2PQL algorithm in the training phase. Then, in the VM migration game, the utility of the V2PQL policy is benchmarked against the Round-Robin policy in the extraction phase.
The outline of the paper is as follows. Related work is discussed in Section 2. Section 3 presents the VM migration game approach. The VM migration algorithm based on Q-Learning is described in Section 4. Section 5 presents the evaluation of the proposed method and a discussion of the results. Finally, Section 6 concludes the work.

Related work
According to stakeholder requirements, the decision-making process of selecting a PM with available resources to deploy a VM has different objectives [18][19].
PMs running in an under-loaded state waste energy, while an overloaded state shortens the lifespan of PMs and thus reduces QoS. By migrating VMs, the load can be balanced among PMs in the data center to ensure QoS [20][21][22]. Massimo Ficco et al. [23] proposed a meta-heuristic approach for the allocation of cloud resources based on a biologically inspired coral-reef optimization model. Based on game theory, optimal resource allocation strategies are searched for to meet the objectives of the service providers as well as the requirements of customers. The evolutionary algorithm is derived from observing the structure of coral reefs and coral spawning. It also exploits the dynamics of competition among users and service providers to satisfy the interests of stakeholders. Experiments show that the combined method based on biological inspiration and game theory not only achieves a satisfactory solution in terms of adaptability and elasticity but can also lead to significant performance improvement. In [6], Bai et al. proposed a method to evaluate the performance of applications on cloud computing. Through the analysis of QoS metrics, including average response time and average waiting time, the flow density (usage) of each PM is evaluated in a heterogeneous data center. A complex queueing model of serial and parallel queueing systems is built to evaluate the performance of heterogeneous data centers.
Considering environmental and economic aspects, energy awareness in cloud computing has become a hot topic. In PM consolidation, VMs are migrated so as to use as few PMs as possible. The power cost of VM migration is modeled through the network metrics between the source PM and the destination PM. [24] considers load balancing of PMs, where each PM hosts a set of VMs described as a multi-dimensional vector. VMs are assigned to the smallest number of PMs within the power limit to achieve optimal load. VM allocation is modeled as a non-cooperative game. Moreover, the distribution of resource utilization is resolved with machine learning algorithms to achieve efficiency [25][26][27][28]. Dhanoa et al. [29] analyzed energy consumption during VM migration as a function of VM size and network bandwidth. In [30], Rybina et al. introduced a prediction of the energy cost due to migration based on the resource utilization of the PM. With a more accurate prediction model, better migration decisions can be made in data center power management. Therefore, VM migration with less time will help minimize the power cost during the migration process. Algorithms that decide which VMs should be migrated from each PM should include a phase predicting VM resource demand based on machine learning [31][32][33][34] to support the decision-making.
Resource management in the cloud environment can be treated as an automatic control problem using the reinforcement learning approach. Reinforcement learning is a machine learning method in which agents take actions impacting the environment to minimize the total penalty resulting from each action [35]. In [36], the authors proposed a unified reinforcement learning approach to the VM and application configuration processes: VM resources are adapted to the changing workload to provide service quality assurance. However, the proposed approach does not account for the need for VM migrations. Farahnakian et al. [37] proposed a dynamic consolidation method based on reinforcement learning to minimize the number of active hosts according to the current resource requirements. The host power mode can be determined by an agent; the agent takes decisions about host power mode from collected data, learns, and improves itself as the workload changes dynamically. However, the proposed algorithm focuses only on CPU performance and does not discuss other host resources. In [38], the authors analyze the applicability of reinforcement learning to cloud data center resource management. The proposed method addresses power consumption and the number of SLA violations by using a Q-Learning approach.

Virtual machine migration game approach
In this section, the VM migration problem is modeled as a non-cooperative game with the PMs as players. Each player tries to maximize a utility that trades off load balance and resource utilization. The VM migration game is proved to admit a Nash equilibrium. Suppose the cloud infrastructure has a large number of heterogeneous physical machines and provides computational resources in an on-demand model. Many physical machines deploy VMs based on virtualization technology. Cloud providers offer a set of VM types to remove the complexity of selection for customers, and each type is specified by the number of CPUs, the memory size, and the storage size. Depending on the needs of users, the providers' resource allocation decisions have to be adjusted dynamically. The multi-tier applications are hosted in VM clusters.

VM migration modeling
As shown in Fig. 1, the VM migration problem on the cloud can be modeled as a directed acyclic graph (DAG) [39][40] G(V, E), where V is the set of vertices representing the tasks and E is the set of edges that show the dependency relationships between vertices. The VM migration process is triggered when a deteriorating PM is detected. At that moment, the cloud infrastructure has n VMs which need to be migrated to m safe PMs. Definition 1 (migration decision). A possible resource allocation for migrating n VMs to m PMs can be described as a binary matrix X of size n × m with entries x_{nm} ∈ {0, 1}: migrating VM n to PM m is described by x_{nm} = 1, and otherwise x_{nm} = 0.
Definition 2 (allocation decision). Based on Definition 1, a possible allocation of k kinds of resources in the i-th PM can be described as an allocation matrix M^(i) of size n × k with entries v^{(i)}_{nk} ∈ Z^+, where v^{(i)}_{nk} is the amount of resource type k of the i-th PM provided to VM n.
The set M = {M^{(1)}, . . . , M^{(m)}} defines a possible resource allocation strategy for all PMs. The optimal VM migration problem can then be described as a trade-off between load balance and resource utilization based on non-cooperative game theory.
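The decision structures of Definitions 1 and 2 can be sketched in code. The sizes and resource amounts below are illustrative assumptions, not values from the paper:

```python
# Sketch of the decision structures in Definitions 1-2 (illustrative sizes).
n, m, k = 4, 3, 3   # n VMs to migrate, m safe PMs, k resource types

# Migration decision matrix X: X[v][p] = 1 iff VM v is migrated to PM p.
X = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 1],
     [0, 1, 0]]

# Constraint from Definition 1: each VM is migrated to exactly one PM.
assert all(sum(row) == 1 for row in X)

# Allocation matrix M_i for one PM (Definition 2): M_1[v][j] is the amount
# of resource type j (e.g. CPU cores, RAM GB, storage GB) given to VM v.
M_1 = [[2, 4, 50],   # VM 0 hosted on PM 1
       [0, 0, 0],
       [0, 0, 0],
       [1, 2, 20]]   # VM 3 hosted on PM 1

# Total of each resource type that PM 1 provides to its VMs.
totals = [sum(col) for col in zip(*M_1)]
print(totals)  # [3, 6, 70]
```

The column sums of M_1 are what constraint (8) later bounds by the PM's capacity.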

Load balance
Let σ^(i) denote the resource usage of the i-th PM, which is measured as follows:

\sigma^{(i)} = \sum_{j} \lambda_j \, u_j^{(i)} / c_j^{(i)},

where the coefficient λ_j shows the influence of resource type j, u_j^(i) denotes the usage of resource type j in the i-th PM, and c_j^(i) is the capacity of resource type j in the i-th PM. The load balance of the system is calculated by the following formula:

LB = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \left( \sigma^{(i)} - \bar{\sigma} \right)^2 },

where σ̄ is the average load of the PMs in the cloud computing infrastructure.
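A small pure-Python sketch of this computation may help; the weights λ_j and the per-PM usage/capacity numbers below are illustrative assumptions, and the system-wide load balance is taken as the deviation of the per-PM loads σ^(i) around their mean, as described above:

```python
import math

def pm_load(usage, capacity, weights):
    """Weighted relative usage of one PM: sum_j lambda_j * u_j / c_j."""
    return sum(w * u / c for w, u, c in zip(weights, usage, capacity))

def load_balance(sigmas):
    """Deviation of PM loads around their mean; lower is more balanced."""
    mean = sum(sigmas) / len(sigmas)
    return math.sqrt(sum((s - mean) ** 2 for s in sigmas) / len(sigmas))

weights = [0.5, 0.3, 0.2]               # assumed lambda_j for CPU, RAM, storage
pms = [([8, 16, 200], [16, 32, 500]),   # (usage, capacity) per PM, illustrative
       ([12, 24, 100], [16, 32, 500]),
       ([2, 4, 50], [16, 32, 500])]
sigmas = [pm_load(u, c, weights) for u, c in pms]
print(round(load_balance(sigmas), 4))
```

A migration policy that lowers this value spreads the load more evenly across PMs.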

Resource utilization
For service providers, achieving high profits requires avoiding wasted PM resources. The concept of skewness in [41] is applied to quantify the unevenness of the utilization of different resources on the i-th PM, calculated as follows:

\mathrm{skewness}^{(i)} = \sqrt{ \sum_{j=1}^{k} \left( u_j^{(i)} / \bar{u}^{(i)} - 1 \right)^2 },

where ū^(i) is the average utilization of all resources of the i-th PM.
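The skewness metric from [41] is simple to compute; the utilization vectors below are illustrative:

```python
import math

def skewness(utilizations):
    """Skewness metric from [41]: sqrt(sum_j (u_j / u_bar - 1)^2).
    Lower values mean more even use of the PM's resource types."""
    u_bar = sum(utilizations) / len(utilizations)
    return math.sqrt(sum((u / u_bar - 1) ** 2 for u in utilizations))

print(round(skewness([0.5, 0.5, 0.5]), 4))  # 0.0  (perfectly even usage)
print(round(skewness([0.9, 0.1, 0.5]), 4))  # uneven usage scores higher
```

A PM that is CPU-saturated but memory-idle has high skewness, signalling wasted capacity even when its average utilization looks healthy.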

VM migration game approach
In this section, a game theory approach to VM migration is presented, aiming at keeping the load balanced as well as avoiding wasted PM resources. Game theory is the mathematical study of strategy, in which each player interacts with the others to secure the best possible outcome [16]. The VM migration problem is modeled as a non-cooperative game in which the safe PMs are the players.
Definition 3 (VM migration game model). A VM migration game is described as a three-tuple G = (P, (M^(i))_{i∈P}, (f^(i))_{i∈P}). In this study, the global objective of the migration game is to balance the load, while each individual player tries to minimize its resource wastage. To exploit load balance while also maximizing resource utilization, the utility function f^(i) of the i-th player is designed to reward both low load imbalance and low skewness, as given in Eq. (7). The game's utility function has an important influence on a player's strategic decisions and on the game's outcome. Each player tries to maximize its own utility by adjusting its strategies, subject to the constraints (8) and (9): constraint (8) ensures that the resources provided by the i-th PM do not exceed its capacity, and constraint (9) ensures that each VM is migrated to one and only one PM. The Nash equilibrium of the game is a state in which no player can increase its utility by changing its strategy while the other players keep their strategies fixed. In other words, the Nash equilibrium is a set of strategies from which players have no motivation to deviate. For any player i, every element β^(i) ∈ M^(i) is a strategy for player i, β^(−i) = [β^(j)]_{j∈P, j≠i} describes the strategies of all players except i, and β = (β^(i), β^(−i)) is referred to as a strategy profile.
Definition 4 (Nash equilibrium). A profile β* is a Nash equilibrium of G if and only if every player's strategy is a best response to the other players' strategies, that is,

\beta^{(i)*} \in br^{(i)}(\beta^{(-i)*}) \quad \text{for all } i \in P,

where β^(−i) is the strategies of all players except i and br^(i) is the best response of player i. By defining the set-valued function br : β → β by br(β) = ×_{i∈P} br^{(i)}(β^{(−i)}), Eq. (10) can be rewritten in vector form as β* ∈ br(β*). The existence of a β* for which β* ∈ br(β*) is proved using fixed-point theorems.
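The fixed-point condition β* ∈ br(β*) can be illustrated with a toy two-player game whose payoffs are invented for this sketch (they are not the paper's utility f^(i)); alternating best responses walk to a profile where neither player wants to deviate:

```python
# payoff[p][a][b] = utility of player p when player 0 plays a, player 1 plays b.
# Arbitrary coordination-game payoffs with a pure Nash equilibrium at (1, 1).
payoff = [
    [[3, 1], [0, 2]],   # player 0
    [[2, 1], [0, 3]],   # player 1
]

def br0(b):  # best response of player 0 to player 1's action b
    return max(range(2), key=lambda a: payoff[0][a][b])

def br1(a):  # best response of player 1 to player 0's action a
    return max(range(2), key=lambda b: payoff[1][a][b])

a, b = 0, 1                       # arbitrary starting profile
for _ in range(20):               # alternate until a fixed point (Nash) is hit
    a_new = br0(b)
    b_new = br1(a_new)
    if (a_new, b_new) == (a, b):
        break
    a, b = a_new, b_new
print((a, b))  # (1, 1): each action is a best response to the other
```

In the paper's game the strategy sets are the allocation matrices M^(i) and the utilities are Eq. (7), so this naive iteration is only a conceptual analogue of the equilibrium search.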
Lemma 1 (Kakutani's fixed-point theorem). Let β be a compact convex subset of R^n and br : β → β be a set-valued function such that, for all β ∈ β, the set br(β) is non-empty and convex, and the graph of br is closed. Then there exists β* ∈ β such that β* ∈ br(β*).

Theorem 1
The VM migration game G always has at least one strategy Nash equilibrium.
Proof. We apply Lemma 1. For all i ∈ P, we have: (i) β is compact, convex, and non-empty: each β^(i) = {v_j} is closed and bounded, thus compact and convex, and their product set β is therefore also compact and convex. (ii) br(β) is non-empty: by definition, br^{(i)}(β^{(−i)}) = argmax_{β^{(i)} ∈ M^{(i)}} f^{(i)}(β^{(i)}, β^{(−i)}), where β^(i) is non-empty and compact and f^(i) is a continuous function in β; by Weierstrass's theorem, br(β) is non-empty. (iii) br(β) is a convex-valued correspondence.

VM migration algorithm based on Q-Learning
In this section, the MDP model, including the state and action spaces, transition probabilities, and reward structure, is completely specified. However, transition probabilities are often unknown in a real-world setting, and the state and action spaces are often too large for exact algorithms to handle [42]. To solve this problem, the V2PQL algorithm is applied to find a Nash-equilibrium migration strategy from the observed system states and rewards. Furthermore, the algorithm does not require prior knowledge of the model parameters.

MDP framework for VM migration
A discrete-time MDP model is applied to build the optimal VM migration algorithm. The process of migrating a VM to a safe PM is considered a stochastic process, assuming VM migration requests arrive independently. At each small discrete time step, either exactly one VM migration request arrives or none does, and these arrival events occur with some given probability. Furthermore, a migration request of a given type occurs with a probability following a predefined distribution. Given a sufficiently small discrete time step, this simple stochastic process provides a good approximation to a Poisson process.
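The Bernoulli-per-step approximation of a Poisson arrival process can be simulated in a few lines; the arrival rate and step size below are illustrative assumptions:

```python
import random

random.seed(7)
rate = 4.0          # assumed mean VM-migration requests per unit time
dt = 0.001          # small discrete time step
steps = int(1.0 / dt)

# At each step, at most one request arrives, with probability rate * dt;
# as dt -> 0 this Bernoulli process approximates a Poisson process.
arrivals = sum(random.random() < rate * dt for _ in range(steps))
print(arrivals)  # close to `rate` on average over many runs
```

Averaged over many independent runs, the arrival count concentrates around `rate`, matching the Poisson mean.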
To narrow down the system state space, a fuzzy logic method is applied to the values of load balancing in Eq. (4) and resource utilization in Eq. (6). Fig. 2(a) depicts the membership functions of the load-balance level, which comprises three states, i.e., Good, Normal, and Bad. Fig. 2(b) shows the membership functions of the resource-utilization level, which comprises three states, i.e., Low, Medium, and High. (Figure 2: The membership function charts of load balance and resource utilization.) For a stochastic model, the transition probability matrix P(s′|s, a) can be derived analytically.
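A fuzzy discretization of this kind can be sketched with triangular membership functions. The breakpoints below are illustrative assumptions; the paper's Fig. 2 defines the actual shapes:

```python
def tri(x, a, b, c):
    """Triangular membership function: 0 at a and c, peak 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def load_balance_level(lb):
    """Map a load-balance value to Good/Normal/Bad by highest membership.
    Breakpoints are assumed for illustration only."""
    degrees = {
        "Good":   tri(lb, -0.2, 0.0, 0.3),
        "Normal": tri(lb, 0.1, 0.3, 0.5),
        "Bad":    tri(lb, 0.4, 0.7, 1.2),
    }
    return max(degrees, key=degrees.get)

print(load_balance_level(0.05))  # Good
print(load_balance_level(0.6))   # Bad
```

Collapsing the continuous metrics into three fuzzy levels each keeps the Q-table small enough for tabular learning.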
Definition 6 (Reward structure). The optimization problem (7)–(9) describes the benefit of the current VM migration given the system-state snapshot. The reward R(s, a) of the VM migration MDP is defined using the objective function (7). The optimal MDP policy is a mapping from the MDP state set S to the action set A that maximizes the average discounted cumulative reward over time. The reward function serves as the basic element for changing the policy. Using this reward structure, algorithms like Value Iteration (VI) or Policy Iteration (PI) compute optimal policies. For example, in the Value Iteration algorithm, V(s_0) is set as the initial value and V(s) is updated iteratively until V(s_t) ≈ V(s_{t+1}) according to the Bellman equation:

V(s) \leftarrow \max_a \left[ R(s, a) + \alpha \sum_{s'} P(s'|s, a) V(s') \right],

where α < 1 is the discount factor and n is the number of iterations. The optimal policy is then argmax_a of the action values derived from V*(s), where V*(s) are the values to which Eq. (18) converges.
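Value Iteration is easy to demonstrate on a tiny MDP. The two-state rewards and transitions below are toy numbers, not the paper's model; the update is exactly the Bellman backup described above:

```python
# Value Iteration on a toy 2-state, 2-action MDP:
# V(s) <- max_a [ R(s,a) + alpha * sum_s' P(s'|s,a) * V(s') ].
alpha = 0.9
states, actions = [0, 1], [0, 1]
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.0}
P = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0],   # P[(s,a)][s'] transition probs
     (1, 0): [1.0, 0.0], (1, 1): [0.0, 1.0]}

V = [0.0, 0.0]
for _ in range(500):                           # iterate until (near) convergence
    V = [max(R[(s, a)] + alpha * sum(p * v for p, v in zip(P[(s, a)], V))
             for a in actions) for s in states]

# Greedy policy extracted from the converged values.
policy = [max(actions, key=lambda a: R[(s, a)] +
              alpha * sum(p * v for p, v in zip(P[(s, a)], V)))
          for s in states]
print(policy)  # [1, 0]: bounce between the states, collecting both rewards
```

Value Iteration needs P and R explicitly, which is exactly what is unavailable at run-time; that gap motivates the model-free Q-learning variant in the next section.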
Definition 7 (Policy). The policy Π(s, a) is the probability of selecting action a in state s, evaluated through the expected discounted return under Π:

E_\Pi[R_t], \quad R_t = r_{t+1} + \gamma r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},

where E_Π[·] is the expectation under policy Π and γ is a coefficient that denotes the importance of future reward values.

VM migration algorithm
To find the VM migration strategies, a model-free learning agent is proposed by applying the Q-Learning algorithm [35]. A near-optimal VM migration decision can be generated by interacting with the environment without any prior knowledge. The Q-learning model comprises a set S of environment states that the learning agent can perceive, a set A of actions that the agent can execute on cloud resources, a reward given to the agent, and the change of environment state caused by the action. The agent maximizes its cumulative reward by adapting its actions to its observations. The optimal policy is found by iteratively updating the Q function until convergence. At each step, an action is chosen based on the system state s and the value Q(s, a). As shown in Fig. 3 (Steps to migrate VMs), the process of finding the VM migration strategies is modeled as a robot traveling a graph. Initially, the robot's start state corresponds to no VM having been migrated to any PM. After performing the action of going to PM2, the robot changes to state 1, which corresponds to migrating VM1 to PM2. At each step, the robot selects a PM i to host a VM j. In the stochastic-state model, the transition probabilities are described by the matrix P(s′|s, a). The robot reaches its final state when all VMs have been migrated to PMs. At each step, the robot selects an action that had a good reward in the past, as recorded in Q(s, a).
Before the next interaction of the management process, the Q function, stored as a two-dimensional table Q(s, a), is updated as follows:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta \left[ R_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right],

where Q(s_t, a_t) is the expected long-term reward for executing the current action a_t in the current state s_t, denoting the t-th estimate of Q*; η ∈ [0, 1] is the learning rate, which indicates how fast the data of new states is taken into account in the next steps (when η = 0 the robot does not learn to improve future actions; when η = 1 only the results of the latest interaction are used); γ ∈ [0, 1] is the discount factor, which determines the importance of future rewards (when γ = 1 the robot maximizes long-term reward; when γ = 0 it considers only the latest reward); s_{t+1} is obtained randomly according to the probabilities defined by P, and η is a step-size sequence; max_a Q(s_{t+1}, a) is an estimate of the optimal future Q-value; and the immediate reward R_{t+1} = R(s_t, a_t), observed at every time step, is given to the robot by the environment and can be obtained from a real-world setting or a simulation engine, requiring knowledge of neither P nor R. After a sufficiently large number of time steps, an approximately optimal policy, i.e., the mapping from a given state s to the action a*, is taken from the Q table as follows:

a^* = \arg\max_a Q(s, a).

The objective of the learning agent is to find the best mapping policy S → A that maximizes the expected long-term reward for executing actions. The learning agent can choose control actions using the following strategies: (i) a random action, which can occur at the beginning of the management process; (ii) the action defined by the policy Π. The VM migration algorithm is presented as follows:

Algorithm 1 V2PQL — VMs-migrate-to-PMs Q-Learning algorithm
Input: ε, η, γ. Output: Q*.
1: Initialize Q values: Q[i, j] = 0 for 1 ≤ i ≤ |S|, 1 ≤ j ≤ |A|

The estimates Q converge to Q* with probability 1 (w.p.1) as long as Σ_t η_t = ∞ and Σ_t η_t² < ∞.
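The update rule above can be exercised on a toy environment. The chain-walk dynamics, sizes, and hyperparameters below are illustrative assumptions standing in for the paper's migration environment; the ε-greedy choice and the Q update are the standard tabular Q-learning scheme that V2PQL builds on:

```python
import random

# Minimal tabular Q-learning in the spirit of V2PQL (toy environment: the
# "robot" walks a small chain of states toward a goal; all numbers assumed).
random.seed(0)
S, A = 5, 2                 # state / action space sizes
eta, gamma, eps = 0.5, 0.8, 0.3
Q = [[0.0] * A for _ in range(S)]

def step(s, a):
    """Toy dynamics: action 1 advances toward the goal state S-1."""
    s2 = min(s + 1, S - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s2 == S - 1 else 0.0
    return s2, reward

for episode in range(300):
    s = 0
    while s != S - 1:
        # epsilon-greedy: explore with probability eps, else exploit Q
        if random.random() < eps:
            a = random.randrange(A)
        else:
            a = max(range(A), key=lambda x: Q[s][x])
        s2, r = step(s, a)
        # Q-learning update: Q <- Q + eta * (r + gamma * max_a' Q(s',a') - Q)
        Q[s][a] += eta * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = [max(range(A), key=lambda a: Q[s][a]) for s in range(S)]
print(policy[:-1])  # greedy policy in non-terminal states: always advance
```

No transition model is ever consulted: the agent learns purely from observed (s, a, r, s′) samples, which is exactly the property that lets V2PQL run without prior knowledge of the cloud's dynamics.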
Watkins first proposed the Q-learning algorithm [35], whose convergence w.p.1 was later established by Watkins and Dayan [43]. The V2PQL algorithm starts controlling the VM migration without prior knowledge. The migration policy can be determined by choosing the actions that correspond to the highest Q-values after sufficient exploration.

Evaluation
In this section, the efficiency and effectiveness of the proposed VM migration approach are demonstrated through a large-scale cloud infrastructure simulation. VM migration is evaluated through a prototype implementation of the V2PQL algorithm running on a cloud infrastructure, built on CloudSim, hosting hundreds of multi-tier-application VMs that need migration. The evaluation is divided into a training phase and an extraction phase. Initially, the optimal policies are explored by the V2PQL algorithm in the training phase. These policies are then applied to the real-time VM migration process by lines 3 to 8 of the V2PQL algorithm in the extraction phase. During execution, the VM migration policies, which show the strength of the V2PQL reinforcement learning algorithm, are continuously updated. In the training phase, the cumulative reward and the temporal evolution of the Q-values, which show the efficiency of the exploration/exploitation strategies, are studied by changing the ε parameter of the V2PQL algorithm. In the extraction phase, the utility of players, load balancing, resource utilization, and running time of the V2PQL algorithm are benchmarked against the Round-Robin algorithm.

Environment setup
The simulations to evaluate the performance of VM migration were run on a computer with 8 GB RAM, a Core i5 CPU, and a 256 GB SSD. To reduce the complexity of the simulations, three kinds of resources are considered, i.e., the CPU, RAM, and storage of the PM and VM configurations. The heterogeneous data center deploying multi-tier applications is simulated using the parameters of Table 1. Each multi-tier application is deployed in a VM cluster whose VM configurations are randomly chosen from Table 2. The VM migration process is triggered when deteriorating PMs are detected. We set up a data center including 450 PMs and 200 multi-tier applications, in which 119 PMs are detected as faulty and 1543 VMs need to be migrated to safe PMs.
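A generator for such a scenario might look as follows. The PM and VM type tuples are invented stand-ins (the real ranges are in Tables 1 and 2), while the counts match the setup above:

```python
import random

# Hypothetical generator for the heterogeneous datacenter scenario; the
# PM/VM type values are assumptions, the counts follow the setup described.
random.seed(42)
PM_TYPES = [(16, 32, 1000), (32, 64, 2000), (64, 128, 4000)]  # cores, GB RAM, GB disk
VM_TYPES = [(1, 2, 20), (2, 4, 40), (4, 8, 80)]

pms = [random.choice(PM_TYPES) for _ in range(450)]
faulty = random.sample(range(450), 119)          # PMs detected as deteriorating
apps = [[random.choice(VM_TYPES) for _ in range(random.randint(3, 12))]
        for _ in range(200)]                     # one VM cluster per multi-tier app

print(len(pms), len(faulty), sum(len(a) for a in apps))
```

All VMs hosted on the 119 faulty PMs would then form the migration queue handed to V2PQL.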

Training phase
To evaluate the efficiency of the V2PQL algorithm, different learning strategies are investigated through a group of simulations. The VM migration policies depend on the V2PQL parameters, including the exploration/exploitation parameter ε (cf. step 2 of the V2PQL algorithm), the learning rate η, and the discount factor γ (cf. step 4 of the V2PQL algorithm). The exploration/exploitation strategies are investigated by changing ε ∈ [0.1, 0.9] while the learning rate is set to the constant value η = 0.1 and the discount factor is set to γ = 0.8, as in [44]. The efficiency of the V2PQL algorithm is evaluated in terms of reward and the temporal evolution of the Q-values. The cumulative reward over time is generated by following the actions of a policy, starting from the initial state in which the robot has not chosen a PM for any VM. An episodic task refers to a complete sequence of interactions, from start to finish; the robot reaches a terminal state when the list of VMs needing migration has been fully processed. V2PQL can exploit prior knowledge by initializing the Q values (cf. step 1 of the V2PQL algorithm) with more meaningful data instead of zeros, which can speed up learning convergence. With 1000 episodes, the average rewards as a function of ε = 0.1, 0.3, 0.7, 0.9 are described in Fig. 4. At ε = 0.1 the robot seldom explores to improve future actions, whereas at ε = 0.9 it focuses on exploring to improve future actions. The discounted cumulative reward is depicted in Fig. 5. The temporal evolution of a Q-value refers to one state-action pair in the learning strategy; a Q-value changes when the system is in state s_t and takes a specific action a_i. For instance, in Fig. 6, q(23, 24) shows the change in the Q-value that occurs when the system state is 23 and action 24 is taken. Almost all Q-values have converged by episode 1000.

Extraction phase
After the training phase, the optimal VM migration policies are read from the Q table. The V2PQL policies are benchmarked against the Round-Robin policy. The utilities of the players following Eq. (7) are shown in Fig. 7: the utility of the Round-Robin algorithm is distributed across more players than that of the V2PQL algorithm. As shown in Fig. 9, the resource utilization of the V2PQL algorithm is better than that of the Round-Robin algorithm. However, as shown in Fig. 8, the load balance of the Round-Robin algorithm is better than that of the V2PQL algorithm. As shown in Fig. 10, the running time of the V2PQL algorithm with ε = 0.3 is better than that of the Round-Robin algorithm over all VMs, while with ε = 0.7 and 0.9 it is better than Round-Robin from the 500th to the 1543rd VM. As a result, the V2PQL migration policies have a promising running time.

Conclusions
In this paper, the V2PQL algorithm is proposed to solve the VM migration game based on an MDP. Depending on the characteristics of each algorithm, the strategies constructed to migrate VMs in the game also differ. The action-exploration strategies have been studied by varying ε. Prior knowledge is therefore not needed for the VM migration problem if the training phase of V2PQL is sufficient. The effectiveness of the algorithm is evaluated by comparing it with the Round-Robin algorithm. In the extraction phase, the optimal VM migration policy of the V2PQL algorithm is applied simply by choosing the maximum Q-value for a given system state. In the future, other RL algorithms will be developed and compared with the proposed algorithm. The author(s) read and approved the final manuscript.