In the GSQL-w algorithm, the w parameter [3] governs the convergence of errors, and its possible values give rise to a few options to consider. In a system that needs to optimize resources, the choice of w falls into two cases:
- 0 < w ≤ 1 (Under-relaxation)
- 1 < w ≤ w0 (Over-relaxation)
w0 being the upper bound on the relaxation factor, which depends on the underlying transition probabilities. The mean squared errors were observed to be smaller, showing that the value of w relaxes the convergence of the agent towards appropriate actions. GSQL-w has already been evaluated theoretically with the w parameter, and its scope can be extended with heuristics. Our motivation is therefore to perform experiments that extend GSQL-w over varying values of w, and in parallel over several state- and action-space sizes. This is our novel attempt at heuristics on w, the convergence parameter that was a limitation in [3]. The steps involved in our new framework are presented in the following subsection.
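Before that, as a brief illustration of how w enters the update rule, the following is a minimal sketch of a single tabular Q-update with a successive-relaxation target. The exact blended form used here (one-step bootstrap mixed with the current state's value through w) is a common SOR-style construction and is shown only to make the role of w concrete; it is not necessarily the authors' exact update.

```python
import numpy as np

def sor_q_update(Q, s, a, r, s_next, w=1.0, alpha=0.1, gamma=0.9):
    """One tabular Q-update with a successive-relaxation target.

    For w = 1 this reduces to the ordinary Q-learning target;
    0 < w <= 1 is under-relaxation and w > 1 is over-relaxation.
    """
    # Relaxed target: blend the one-step bootstrap with the current state's value.
    target = w * (r + gamma * np.max(Q[s_next])) + (1.0 - w) * np.max(Q[s])
    # Standard stochastic-approximation step towards the relaxed target.
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```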
Steps in GSQL-wh
The heuristics-based rule-of-thumb strategies shorten decision-making time. It was observed that for w = 1 the update rules are similar to those derived from standard Q-learning. Gradient descent is used to minimize the mean squared error between the target and the current estimates. By modifying the target values, the algorithm learns faster, i.e., it collects rewards quickly. This kind of performance is expected in applications where resources are allocated automatically, and an effective initial phase of learning is crucial to overall system performance. The modification in this algorithm is therefore to first choose the convergence parameter using the existing probabilities of the state-action pairs. In the next section, the authors discuss one such application that benefits from early convergence, i.e., the automatic allocation of resources to cloud applications.
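One possible reading of this selection step is sketched below: a small candidate set of w values is tried for a short warm-up phase and the value with the lowest early mean squared error is kept. The candidate list and the evaluate_mse helper are illustrative assumptions, not the authors' exact rule.

```python
def choose_w(candidates, evaluate_mse, n_warmup=200):
    """Heuristic sketch: pick the relaxation parameter w from a candidate set.

    evaluate_mse(w, n_iters) is assumed to run a short warm-up phase of the
    learner with relaxation parameter w and return its mean squared Bellman
    error; the candidate with the lowest early error is retained.
    """
    errors = {w: evaluate_mse(w, n_warmup) for w in candidates}
    return min(errors, key=errors.get)

# Example call with five illustrative values of w:
# w_best = choose_w([0.8, 1.0, 1.1, 1.3, 1.5], evaluate_mse)
```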
Auto-scaling of Cloud Applications
The demands of systems hosted on the cloud are based on the amount of real-time workload they must operate upon. Auto-scaling means automatically allocating and de-allocating resources to such applications as and when they need them. The resource requirements of a hosted application are agreed upon with the cloud service provider. This definition of SLAs is based on cost and on performance parameters such as throughput and response time. The authors model this kind of cloud-system problem as a Markov Decision Process, as explained below.
1) State (S): A state at time t is (W, v), where W is the workload of an application and v is the number of units of resource currently allocated.
The new state (W′, v′) captures how the workload characteristics W′ evolve and how the previous allocation is updated, i.e., v′ = v + a.
2) Action (A): A decision at time t is an action to allocate or de-allocate a resource for the next time period t+1.
3) Reward (R): The reward is defined by a cost function of the form R : S × A × S → ℝ, such as the one below.
The reward basically increases when the cost of the system reduces; hence it is the negative of the cost function. The authors define here how the cost per unit of resource is arrived at, based on the response time of the application, per (1). The SLA defines a pre-defined performance level, and a penalty applies when it is not met. This Service Level Agreement (SLA) is documented between the client and the cloud service provider.
C((W,v),a) := c × v + SLApenalty(r) (1)
where c is the cost per unit of resource, r is the response time of the application, and SLApenalty is the pre-defined penalty on performance. Deep reinforcement learning became a solution to such problems, as they usually involve large state and action spaces [2]. SOR-DQN proposals attempted to reduce the training time by exploiting the differing statuses of the environment. GSQL-w proposed improvements for further error reduction by arriving at empirical operators equal to the generalized Bellman operator. The authors combine the idea of training-time reduction with heuristics over various values of the parameter w itself, as suggested by GSQL. The authors used the finite-time performance of the algorithm as defined in an existing PAC (Probably Approximately Correct) framework [5]. Algorithm convergence was assured for the over-relaxation regime defined in (2).
1 < w ≤ w0 (Over-relaxation) (2)
where w tends to change over iterations and depends on the underlying MDP, the defined discount factor, etc. It has further been shown that for values of w greater than 1, GSQL-w is superior to Speedy Q-learning. The authors therefore enhance GSQL by applying heuristics over five values of w. Our numerical experiments on the value of w evaluate the theory of successive relaxation on the Bellman operator, which in turn propagates information throughout the state space. The MDPs the authors analyze have discrete state spaces of sizes 3, 5, and 10.

The authors define the details of this new framework as follows. The modified architecture from [7] forms an auto-scaling system based on DRL, with a simulation introduced for the new GSQL-wh based on [11]. Fig. 1 illustrates the main building blocks of the GSQL-wh framework. While the application runs on a cloud platform, the modified GSQL-w guides resource provisioning through error convergence. More precisely, it tracks the number of resources allocated to reach a certain goal through a monitor, analyze, plan, and execute (MAPE) control loop [6], where different characteristics of the application (e.g., workload and response time) are continuously monitored, the satisfaction of system goals is checked, and the resource allocation is adapted in case of deviations that cause any lapse in efficiency. The goals (i.e., SLA, cost, response time) are reflected in the reward function, which is set automatically from the environment parameters. An auto-scaling agent reads the state at time t, i.e., the workload of the application and the number of units of resource currently allocated. The output of the agent is the action, which is rewarded by the cloud platform through the cost and performance functions; the cloud platform also constitutes the main component. The monitoring function collects metrics and feeds the agent. An actuator issues adaptation commands from the heuristic controller (GSQL-wh) at each control time stamp/interval. Overall, it is the application under test that makes use of the resources allocated to it. The framework is illustrated in Fig. 1, adapted from [11] and [7].
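To make the formulation concrete, the following is a minimal sketch of the cost-based reward of the auto-scaling MDP, following equation (1). The cost constant, SLA threshold, penalty value, and the toy response-time model are illustrative assumptions, not values from the paper.

```python
# Illustrative constants -- not values from the paper.
COST_PER_VM = 1.0        # c in (1): cost per unit of resource
SLA_RESPONSE_TIME = 0.5  # SLA response-time threshold (seconds)
SLA_PENALTY = 10.0       # penalty charged when the SLA is violated

def response_time(workload, vms):
    """Toy response-time model: workload shared across the allocated VMs."""
    return workload / max(vms, 1)

def cost(state, action):
    """Cost C((W, v), a) = c * v + SLApenalty(r), per equation (1).

    The sketch charges the post-action allocation v' = v + a and derives the
    response time r from it; this detail is an assumption of the sketch.
    """
    workload, vms = state
    vms_next = max(vms + action, 1)
    r = response_time(workload, vms_next)
    penalty = SLA_PENALTY if r > SLA_RESPONSE_TIME else 0.0
    return COST_PER_VM * vms_next + penalty

def reward(state, action):
    """Reward is the negative of the cost, as described above."""
    return -cost(state, action)
```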
2.3. Experiments
The proposed algorithm GSQL-wh is evaluated in a setup built with Python. Auto-scaling of cloud resources is based on data generated by passing random state-action pairs from simulated workloads. The number of virtual machines allocated to applications is hence calculated from the requests per second represented in a time-series data set. After the initial setup and the estimation-error (Bellman equation) calculations, the authors arrived at a few example relaxation settings based purely on the value of w [5]. The case the authors observed, where convergence of the parameter happened under over-relaxation with w, fits what is explained in GSQL-w. With the randomizing parameter varied for each run, our algorithm converges the errors early in the training runs under over-relaxation. This is demonstrated in Fig. 2 by plotting the mean squared errors obtained for five values of w, showing the average error of the different RL algorithms at the convergence parameter w = 1.3 with 10 states and 3 actions.
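A minimal sketch of how such a simulated workload trace and the corresponding VM counts might be produced is shown below; the sinusoidal trace shape, the per-VM capacity, and the random action draw are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_requests(n_steps=1000, base=100.0, amplitude=50.0):
    """Synthetic requests-per-second time series (illustrative shape)."""
    t = np.arange(n_steps)
    return base + amplitude * np.sin(2 * np.pi * t / 200) + rng.normal(0, 10, n_steps)

def vms_needed(requests_per_sec, capacity_per_vm=50.0):
    """Number of VMs required if each VM serves capacity_per_vm requests/s."""
    return int(np.ceil(max(requests_per_sec, 0.0) / capacity_per_vm))

# Random (state, action) pairs drawn from the trace drive the simulated runs.
trace = simulate_requests()
states = [(w, vms_needed(w)) for w in trace]
actions = rng.integers(-1, 2, size=len(states))  # de-allocate, hold, or allocate
```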
Fig. 2(a) shows the average error of the different algorithms plotted against the iteration number. The average error is defined as the difference between the optimal value function and its estimate based on the current Q function. This is returned by the algorithm as a modified Q value and is averaged across the MDPs over 4 runs for the different state-space sizes, viz., S in {3, 5, 10}. The actions form the matrix of values computed for every state. The more states there are, the more transition probabilities there are, and hence the range of the relaxation parameter w widens as more workloads are read in this setup. It is seen that the average errors of GSQL-wh decrease with the number of iterations at a faster rate than those of GSQL-w. This empirically shows that our framework works well for several different MDPs and is superior to the others. The authors also demonstrate the clear advantage of GSQL-wh over GSQL-w when the relaxation parameter w is large. For this, the authors generated a few random MDPs with a discount factor of 0.9 and with the transition probability defined as in (3):
P(i|i,a) = 0.9 ∀(i,a) such that w∗ = 1.3 (3)
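A minimal sketch of generating such random MDPs, together with the average-error measure plotted in Fig. 2, is given below; spreading the remaining transition mass uniformly and drawing uniform random rewards are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_mdp(n_states, n_actions, self_prob=0.9, gamma=0.9):
    """Random MDP with P(i|i,a) = 0.9 as in (3); remaining mass is uniform."""
    P = np.full((n_states, n_actions, n_states),
                (1.0 - self_prob) / (n_states - 1))
    for i in range(n_states):
        P[i, :, i] = self_prob
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))  # assumed reward range
    return P, R, gamma

def average_error(Q, V_star):
    """Average error: gap between the optimal value function and the value
    implied by the current Q estimate, as plotted in Fig. 2."""
    return np.mean(np.abs(V_star - Q.max(axis=1)))
```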
The result is illustrated in Fig. 2(b). The authors studied the rate of convergence of GSQL-wh experimentally. The results show that the heuristic values of w determine whether the convergence is linear or non-linear. Further, to validate the relaxation parameter and to observe the training-time performance, the authors collected results from the simulated workload. The linearity of the data set determined the training-time reduction; refer to Table 1.
Table 1. The performance of algorithms
Algorithm | Error (convergence) | Training time
SQL | 3.20 | 0.50
GSQL | 3.00 | 0.25
GSQL-w | 2.80 | 0.05
GSQL-wh | 2.50 | 0.05