This section describes the experimental setup, results, and comparison metrics for the proposed GBHT algorithm. The experiments were run on a server machine with an Intel® i7 processor (4 cores, 2.9 GHz clock speed) and 128 GB of RAM, with Python 3.7 and the pandas, scikit-learn, and NumPy libraries installed. The dataset, taken from GitHub, contains CPU usage data of the Microsoft Azure cloud service, sampled every 5 minutes. It has three attributes, i.e., maximum, average, and minimum CPU utilization, as shown in Fig. 2. The parameters used in this work are described in Table 2. The implementation code loads the dataset, splits it into input features (X) and a target variable (y), and then further splits these into training and testing sets. It defines a parameter grid for hyperparameter tuning, covering the number of estimators, the learning rate, and the maximum depth.
The code then creates a gradient boosting regressor model and performs a grid search with cross-validation to find the best combination of hyperparameters.
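The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: synthetic data stands in for the Azure CPU-usage trace, the column names and the choice of maximum utilization as the target are assumptions, and the parameter grid is a reduced version of the ranges in Table 2.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

# Synthetic stand-in for the Azure CPU-usage trace (min/avg/max utilization,
# sampled every 5 minutes); in practice this would be loaded from the GitHub CSV.
rng = np.random.default_rng(42)
n = 500
avg = rng.uniform(10, 80, n)
df = pd.DataFrame({
    "min_cpu": avg - rng.uniform(0, 10, n),
    "avg_cpu": avg,
    "max_cpu": avg + rng.uniform(0, 15, n),
})

# Split into input features (X) and target variable (y);
# predicting max utilization from the other attributes is an assumption.
X = df[["min_cpu", "avg_cpu"]]
y = df["max_cpu"]

# 70% training / 30% test split, as in Table 2.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Reduced parameter grid; Table 2 sweeps wider ranges.
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.1, 0.3],
    "max_depth": [3, 5],
}

# Grid search with cross-validation over the gradient boosting regressor.
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

# Evaluate on the held-out test set and report the comparison metrics.
pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print("MAE:  %.3f" % mean_absolute_error(y_test, pred))
print("MAPE: %.3f%%" % (100 * mean_absolute_percentage_error(y_test, pred)))
print("RMSE: %.3f" % np.sqrt(mean_squared_error(y_test, pred)))
print("R2:   %.3f" % r2_score(y_test, pred))
```

`GridSearchCV` refits the best parameter combination on the full training set, so `search.predict` uses the tuned model directly.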
Table 2
Experimental set-up parameters and their values.
| Parameter | Values |
| --- | --- |
| Epochs | 100–500 |
| Learning_rate | 0.01–0.9 |
| Max_depth | 3–9 |
| n_estimators | 100–500 |
| Training data | 70% |
| Test data | 30% |
| min_samples_split | 2–10 |
| min_samples_leaf | 1–6 |
| subsample | 0.2–1.0 |
| Search | Grid search |
Finally, it evaluates the model's performance on the test set and prints the best hyperparameters, MAE, MAPE, RMSE, and R2. The comparison metrics used in the proposed work are described as follows.
Table 3
Performance comparison of the proposed GBHT model with the baseline models.

| Model | MAPE | MAE | MSE | RMSE | R2 |
| --- | --- | --- | --- | --- | --- |
| Machine Learning Models | | | | | |
| SVM | 3.08% | 38525.54 | 1994839288.27 | 44663.62 | 0.84 |
| KNN | 1.35% | 17938.21 | 794784034.67 | 28191.91 | 0.94 |
| Random Forest | 1.05% | 13629.01 | 342898580.64 | 18517.52 | 0.97 |
| Gradient Boost | 1.09% | 14204.60 | 380139800.03 | 19497.17 | 0.97 |
| Deep Learning Models | | | | | |
| LSTM | 1.35% | 17232.49 | 446265580.40 | 21125.00 | 0.96 |
| RNN | 0.91% | 11692.12 | 245462958.54 | 15667.26 | 0.98 |
| CNN | 1.17% | 15263.26 | 417714732.56 | 20438.07 | 0.97 |
| Time Series Model | | | | | |
| Facebook Prophet model | 0.02% | 29479.49 | 1557694333.18 | 39467.64 | 0.87 |
| Hybrid Models | | | | | |
| Hybrid LSTM + Gradient Boost | 1.08% | 13961.64 | 359225395.53 | 18953.24 | 0.97 |
| Hybrid Gradient Boost + SVM | 3.17% | 39763.68 | 2104915743.43 | 45879.36 | 0.83 |
| Proposed GBHT Model | | | | | |
| Gradient Boost with Grid Search based Hyperparameter Tuning | 0.01% | 166.62 | 286635.90 | 535.38 | 1.00 |

Best Parameters: {'learning_rate': 0.3, 'max_depth': 5, 'n_estimators': 400}
Mean Absolute Error (MAE) measures the average absolute difference between predicted and actual values, giving a direct view of the magnitude of the errors committed by the prediction model.
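Its standard definition, with $y_i$ the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of data points, is:

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$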
Mean Absolute Percentage Error (MAPE) is a relative metric that expresses the average prediction error as a percentage of the actual value, making it easy to judge the relative impact of errors in the real-world domain.
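Using the same notation, the standard formula is:

$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$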
Mean Squared Error (MSE) is the average of the squared differences between predicted and actual values; by penalizing large deviations more heavily, it reflects the variability of the model's predictions.
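With the same notation, it is defined as:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$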
Root Mean Squared Error (RMSE) is the square root of MSE and measures the average magnitude of errors in the same units as the original data.
$$RMSE = \sqrt{MSE}$$
Coefficient of Determination (R2), also known as R-squared, gives the proportion of variance in the actual values that is explained by the predicted values; it shows how well the prediction model fits the observed data.
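The standard definition, with SST the total sum of squares, SSR the sum of squared residuals, and $\bar{y}$ the mean of the actual values, is:

$$R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$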
where SST is the total sum of squares, SSR is the sum of squared residuals, ŷ is the predicted value, and y is the actual value.
In the aforementioned equations, n is the number of data points. Higher values of R2 signify a better fit of the model to the data, whereas lower values of the other metrics signify better prediction accuracy [39]. The effectiveness and dependability of the CPU utilization prediction model in the cloud context are evaluated, and the performance of the proposed GBHT model is compared with other popular models. The results, shown in Fig. 3 to Fig. 16, indicate that the GBHT model exhibits the highest level of accuracy and reliability in CPU utilization prediction.
The performance of the models was evaluated using several key evaluation metrics: MAPE, MAE, MSE, RMSE, and R2, as shown in Table 3. In total, ten models were selected for comparison, divided into four categories: machine learning (SVM, KNN, Random Forest, Gradient Boost), deep learning (LSTM, RNN, CNN), a time series model (Facebook Prophet), and hybrid models combining LSTM with Gradient Boost and Gradient Boost with SVM. Among the machine learning models, SVM had a MAPE of 3.08%, indicating relatively high prediction errors. KNN performed better with a lower MAPE of 1.35%, followed closely by Random Forest and Gradient Boost, both achieving MAPE values below 1.1%. KNN, Random Forest, and Gradient Boost exhibited strong predictive accuracy, with high R2 scores ranging from 0.94 to 0.97, indicating a good correlation between predicted and actual values. Among the deep learning models, LSTM achieved a MAPE of 1.35%, while RNN demonstrated an even lower error with a MAPE of 0.91%. CNN also performed well with a MAPE of 1.17%. These models showed high R2 scores ranging from 0.96 to 0.98, indicating their ability to accurately predict target values. The time series model, Facebook Prophet, reported a very low MAPE of 0.02%, although its R2 score of 0.87 and comparatively high MAE indicate weaker overall accuracy. Among the hybrid models, the Hybrid LSTM + Gradient Boost achieved a MAPE of 1.08% and maintained consistent accuracy, whereas the Hybrid Gradient Boost + SVM exhibited a higher MAPE of 3.17%, indicating relatively high prediction errors.
The proposed GBHT model outperformed all other models, with a remarkable MAPE of 0.01%. It achieved near-perfect predictive accuracy with an R2 score of 1.00, indicating an almost exact correspondence between predicted and actual values. The GBHT model demonstrated its superiority by significantly reducing errors and optimizing hyperparameters through grid search [40]. These results validate the effectiveness of the GBHT model in achieving highly accurate predictions and highlight the importance of hyperparameter tuning in optimizing model performance.