Optimal prediction of cloud spot instance price utilizing deep learning

Cloud platforms offer a variety of virtual machine (VM) models of various types and capacities, enabling users to choose the instances that best meet their requirements. To make the most of their redundant computing resources, cloud providers have devised systems that sell spare capacity at a price that fluctuates dynamically with supply and demand, commonly called the "spot price." To use such an instance, the user must submit a suitable bid above the spot price. Accurate spot price prediction allows users to prepare bid prices and run times in advance, increasing the reliability of this approach. For this purpose, we consider Amazon EC2 as a testbed and use its spot price history to predict future prices by constructing a proposed modified gated recurrent unit (MGRU) model together with a proposed dropout method. Test results show that the proposed method performs better and more accurately than other sophisticated methods.


Introduction
The advantages of cloud computing, such as easy access to computing resources, a variety of payment models, and a scalable environment, have led to significant growth in recent years. Leveraging the cloud is a good model for running programs with high reliability and low cost compared to other platforms [1,2]. Cloud services are often segmented into three types of offerings: infrastructure as a service (IaaS), software as a service (SaaS), and platform as a service (PaaS) [3]. IaaS is the service model at the bottom layer of the cloud computing stack, near the hardware. It provides users with virtual machine-based infrastructure resources, allowing them to elastically lease VMs with various capacities and functionalities [4].
Virtual machines are typically offered under a variety of leasing periods and pricing structures, depending on consumer demand and other conditions [5]. Under the fixed pricing model, cloud providers offer fixed-rate virtual machines such as on-demand instances [6]. With on-demand instances, users pay a fixed amount that depends on the region and availability zone and varies with the number of CPU cores, clock rate, memory size, and other factors. Fixed-price VMs offer a high level of stability and availability, although they are more expensive than other models. Furthermore, on-demand servers are not revoked until the user no longer needs them [7].
Under the variable-price paradigm, cloud providers offer virtual machines with dynamic, changeable pricing, including spot instances. Selling spare computing capacity enables deep discounts that are very beneficial to users, especially those with large workloads, although such instances do not ensure availability or stability and are revocable. As a result, spot instances are frequently used in applications that are fault tolerant and time insensitive [8,9]. Spot instances allow users to specify the maximum price they are willing to pay. If the bid is higher than the spot price, the request is accepted; otherwise, the user waits. Amazon updates the spot price regularly based on demand (users' bids) and supply (resource availability). The instance runs until the client terminates it or the spot price rises above the user's bid [10,11]. If the cloud platform reclaims a spot server at any time, the interrupted instance may be terminated or hibernated, according to recent Amazon updates. When a hibernated instance can run again because the spot price has fallen back below the user's maximum bid, its saved memory and context are reloaded [10]. Today, large companies such as Yelp, NASA JPL, FINRA, Netflix, and Autodesk benefit from spot instances (SIs) for applications such as modeling and time analysis, video and image processing, data processing, and big data [11]. Accurately predicting the spot price increases the reliability of using these instances and helps users find the optimal time and price for their bids.
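To make the lifecycle concrete, the following illustrative Python sketch (hypothetical, not part of any AWS API) simulates the bid-versus-spot-price rules just described:

```python
# Illustrative simulation of the spot-instance rules described above:
# the request is fulfilled while the user's bid stays at or above the
# spot price, and interrupted otherwise.
def simulate_spot(bid, spot_prices):
    running = False
    events = []
    for t, price in enumerate(spot_prices):
        if bid >= price and not running:
            running = True
            events.append((t, "started"))
        elif bid < price and running:
            running = False
            events.append((t, "interrupted"))  # terminated or hibernated
    return events

print(simulate_spot(0.30, [0.25, 0.28, 0.33, 0.29, 0.35]))
# [(0, 'started'), (2, 'interrupted'), (3, 'started'), (4, 'interrupted')]
```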
Much research has been conducted on prediction. Time series prediction techniques fall into several groups. Traditional statistical models such as the autoregressive integrated moving average (ARIMA) model capture only linear components and therefore have limited real-world applicability. On the other hand, several nonlinear statistical methods have been proposed for time series with nonlinear patterns, such as autoregressive conditional heteroskedasticity (ARCH) and generalized autoregressive conditional heteroskedasticity (GARCH), which are usually suitable only for particular nonlinear forms. Finding a good model for a real time series is a difficult real-world task [12,13]. Recently, artificial neural networks (ANNs) have made great strides in time series prediction because they are universal approximators, are data driven, and can model nonlinear patterns in data [14]. Recurrent neural networks (RNNs) are a special class of ANNs used in variable-workload prediction problems due to their superior sequential processing ability. However, RNNs often suffer from vanishing gradients and struggle to learn long-term dependencies. With the advent of long short-term memory (LSTM), gated recurrent unit (GRU), and transformer models, many problems of RNNs have been solved thanks to their greater capacity for information management, although these methods also have limitations in time series prediction [15,16]. ANNs with deep architectures have been found to model high-order nonlinear time series far more accurately than shallow ANNs. However, the substantial number of parameters and the complexity of deep network topologies increase the probability of overfitting. Therefore, regularization methods such as dropout should be used to reduce overfitting and its impact on test data [17,18].
In the highly dynamic cloud computing market, research on spot price bidding is important. By accurately predicting spot prices, users can effectively solve the problems of bid setting and instance selection, saving a considerable amount of money and noticeably improving reliability. In addition, by predicting prices in advance, users can anticipate future spot instance price trends. In this study, a new prediction method named MGRU is proposed, which predicts future prices accurately and quickly while minimizing prediction error. Moreover, the dropout method proposed here makes effective use of the training data to reduce overfitting. After making the necessary modifications to increase the accuracy and flexibility of the structure, the authors determine the model's hyperparameters, perform experiments, and compare the results using the criteria adopted in this work. We apply the proposed method to real Amazon EC2 spot data and compare it with other advanced time series forecasting techniques. The results demonstrate that the proposed method outperforms existing methods. The rest of this paper is organized as follows. Section 2 provides a brief literature review. Section 3 gives a brief introduction to temporal models. Section 4 presents the architecture of the proposed method. Section 5 describes the experimental setup used to evaluate the techniques. Section 6 presents and analyzes the performance evaluation results, and Sect. 7 concludes.

Related works
In recent years, scholars have studied a range of price forecasting problems; accurate spot price prediction is especially important in cloud computing because it reduces resource usage costs and task computation time. As a result, recent scientific efforts have proposed a number of strategies. First, we review some classical spot price prediction methods, and then we discuss RNN-based solutions.


Classic methods for price prediction
For spot price prediction, Singh and Dutta [5] suggested an autoregressive-based model, but linear models do not function well in the complicated cloud environment. The authors in [19] use moving average techniques (simple, weighted, and exponential) to predict next-hour spot prices, and Alkharif et al. [20] use a seasonal ARIMA (SARIMA) model to improve the accuracy of spot price prediction. Due to the extreme volatility of spot prices on cloud platforms, these models are often a poor fit.
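As an illustration of this family of classical baselines, the sketch below fits a seasonal ARIMA with statsmodels; the (p, d, q)(P, D, Q, s) orders and the placeholder series are assumptions, not the configuration from [20]:

```python
# A minimal seasonal-ARIMA baseline sketch (illustrative orders).
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

prices = np.random.rand(300)  # placeholder for an hourly spot-price series
model = SARIMAX(prices, order=(1, 1, 1), seasonal_order=(1, 0, 1, 24))
fitted = model.fit(disp=False)
forecast = fitted.forecast(steps=24)  # next 24 hourly prices
```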
The k-nearest neighbors (kNN) regression model was introduced in [21] to achieve the best performance in spot price prediction. Despite its benefits, kNN often comes with a significant computational cost.
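A minimal sketch of such a kNN regression baseline with scikit-learn follows; the lag-feature setup and k = 5 are illustrative choices, not taken from [21]:

```python
# kNN regression on lagged price windows (illustrative setup).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def make_lag_features(prices, lags=12):
    X = np.array([prices[i - lags:i] for i in range(lags, len(prices))])
    y = np.array(prices[lags:])
    return X, y

prices = np.random.rand(500)  # placeholder for a real spot-price series
X, y = make_lag_features(prices)
knn = KNeighborsRegressor(n_neighbors=5).fit(X[:-50], y[:-50])
pred = knn.predict(X[-50:])   # predict the last 50 held-out points
```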
Wallace et al. [22] present spot price predictions based on a typical multilayer perceptron; however, such network topologies cannot effectively process long sequences.
Most classical approaches proposed by researchers require sequences with fairly regular structure for good prediction, but cloud data centers produce long series with significant volatility. As a result, such methods rarely achieve optimal predictions, necessitating more intelligent strategies to improve the prediction of spot prices that fluctuate so much.

RNN-based approaches for price prediction
RNNs perform well on series with short-term dependencies, but researchers in [23] have shown their inability to cope with long-term dependencies: when the distance between a piece of information and the point at which it is needed is large, the RNN suffers from vanishing gradients during training, and as the data grow, its inability to remember long-term dependencies becomes more serious, preventing further training.
To overcome this RNN challenge, LSTM networks, which can learn long-term dependencies in time series, were proposed. The authors in [24] used an LSTM deep neural network to predict Amazon EC2 spot prices. Similarly, Agarwal and colleagues [25] proposed an LSTM-based model for estimating spot prices using Amazon's 90-day price history. GRU is another RNN-derived model that, despite its structural advantages and computational efficiency, has seen little research on predicting cloud spot prices [26].
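For orientation, here is a minimal Keras sketch of the kind of LSTM spot-price predictor these works describe; the look-back window and layer sizes are illustrative assumptions, not the architectures from [24, 25]:

```python
# A minimal LSTM next-step price predictor sketch (not the authors' code).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

window = 24  # hypothetical look-back window of past prices

model = Sequential([
    LSTM(64, input_shape=(window, 1)),  # one feature: the spot price
    Dense(1),                           # next-step price
])
model.compile(optimizer="adam", loss="mse")
```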
Overall, the limited research on spot price assessment and prediction has not comprehensively and accurately addressed the challenges in this area, which can lead to unnecessarily time-consuming, computationally heavy, and less accurate price predictions. Accordingly, we designed a framework that takes these concerns into account and improves prediction accuracy and computational efficiency in a cloud computing environment. The accuracy of the proposed approach is confirmed by comparison with several other models.

Temporal models
In this section, the recurrent neural network (RNN) and gated recurrent unit (GRU) are briefly introduced. Based on the basic principles of these two techniques, we aim to solve the problem of high variance in cloud spot price prediction.

Recurrent neural networks
Recurrent neural networks (RNNs) are a class of neural networks purpose-built to model complex relationships in time series data through interconnected layers and neurons [27]. An RNN uses a recursive loop so that the network's knowledge can be reused in subsequent computations at the end of each step; as a result, information can be passed from node to node across time steps. All recurrent neural networks consist of a series of repeating neural network modules [28]. In a standard RNN, these repeating modules have a fairly simple structure, such as a single tanh layer. A sketch of a standard RNN is depicted in Fig. 1a, and its structure unfolded along the temporal axis is shown in Fig. 1b [29].
As shown in Fig. 1, at each time step $t$ the RNN computes a hidden state $h_t$ by combining the current input $x_t$ with the previous hidden state $h_{t-1}$. Because there is no mechanism to handle the vanishing gradient problem [30], RNNs become ineffective when information from far back is needed. When training on large datasets, the gradient values become progressively smaller as they propagate back toward the beginning of the network, producing only minor weight changes; this slows down training and, in severe cases, brings it to a halt [31].
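For reference, this recurrent update can be written in its standard textbook form (a standard formulation assumed here, not reproduced from the paper's own figures):

$$h_t = \tanh\left(W_x x_t + W_h h_{t-1} + b_h\right)$$

where $W_x$ and $W_h$ are the input and recurrent weight matrices and $b_h$ is a bias vector.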

Gated recurrent unit (GRU)
In recent years, RNN-derived models have effectively overcome the shortcomings of applying RNNs to long sequences. Cho et al. [32] introduced the GRU model in 2014, derived from the LSTM model [33]. This architecture is intended to alleviate the vanishing gradient problem of recurrent networks while reducing the overhead of the LSTM architecture [34,35].
The cell state and gates of the GRU have the same basic functionality as in the LSTM. Figure 2 shows the GRU's only two gates: the update gate and the reset gate [36,37]. Assuming an input $X_t$ and $C_{t-1}$ denoting the value of the previous time step, their mathematical form for time step $t$ is as follows:

$$\Gamma_u = \sigma\left(W_u\left[C_{t-1}, X_t\right] + b_u\right) \quad (1)$$

$$\Gamma_r = \sigma\left(W_r\left[C_{t-1}, X_t\right] + b_r\right) \quad (2)$$

$$\hat{C}_t = \tanh\left(W_c\left[\Gamma_r \odot C_{t-1}, X_t\right] + b_c\right) \quad (3)$$

$$C_t = \Gamma_u \odot C_{t-1} + \left(1 - \Gamma_u\right) \odot \hat{C}_t \quad (4)$$

The update gate ($\Gamma_u$) acts as a switch that determines whether to use the previous state, the input, or a combination of both. This capacity allows the network to store elements of the distant past in memory.
The reset gate ($\Gamma_r$) also acts as a switch, allowing the network to determine how much previous information is not needed in the current step and how much information from the previous step should be used. More precisely, when this switch is set to 0, the network is forced to act as if it were reading the input sequence from the start, allowing it to forget the previously computed state. The candidate value uses the reset gate to incorporate past data: the input is multiplied by its weights, the previous time step's value is multiplied elementwise by the gate, and the tanh activation then produces the candidate value ($\hat{C}_t$), as in Eq. 3. Finally, the update gate determines how much each of $C_{t-1}$ and $\hat{C}_t$ contributes to the new value ($C_t$), as in Eq. 4.
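A minimal NumPy sketch of one GRU step in the notation above (the standard formulation of Eqs. 1-4); the dimensions and random weights are illustrative:

```python
# One GRU step in the paper's notation (Eqs. 1-4).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, c_prev, Wu, bu, Wr, br, Wc, bc):
    concat = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(Wu @ concat + bu)   # update gate (Eq. 1)
    gamma_r = sigmoid(Wr @ concat + br)   # reset gate (Eq. 2)
    c_hat = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x_t]) + bc)  # candidate (Eq. 3)
    return gamma_u * c_prev + (1.0 - gamma_u) * c_hat                   # new state (Eq. 4)

# Illustrative dimensions: 1 input feature (the price), 8 hidden units.
n_in, n_h = 1, 8
rng = np.random.default_rng(0)
Wu, Wr, Wc = (rng.normal(size=(n_h, n_h + n_in)) for _ in range(3))
bu = br = bc = np.zeros(n_h)
c = np.zeros(n_h)
for price in [0.25, 0.28, 0.33]:  # a toy spot-price sequence
    c = gru_step(np.array([price]), c, Wu, bu, Wr, br, Wc, bc)
```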

Proposed approach
This research aims to obtain an accurate model for spot price prediction. The key components of the proposed approach are shown in Fig. 3. An optimal approach for accurate and fast price prediction under the spot pricing model is fundamental to increasing reliability on the cloud platform. The steps of the methodology are described next.

Data preprocessing
Historical price information from the cloud data center is used in the proposed prediction model. To make the collected data more usable, the raw data $\vec{X} = (x_1, x_2, x_3, \ldots, x_n)$ of length $n$ must first be preprocessed. Data cleaning and normalization are two important tasks that should be carried out in the data preparation stage of the proposed prediction method.

Data cleaning
Data cleaning was employed to handle the noise in the data. If the time series being used contains noise and missing values, the noise should be smoothed, and the missing values replaced before normalizing.

Data normalization
Since there is a large difference in the range of spot data values for different time intervals, it is necessary to normalize the main data before proceeding to the next stage. This improves the convergence speed.

Hyperparameter tuning for proposed approach
Hyperparameter tuning, or optimization, is the procedure of finding the best feasible hyperparameter values for a machine learning model to attain the desired modeling outcome. Hyperparameters are variables that must be specified before training and that determine the structure of the network and how it is trained [38,39]. To analyze all combinations of hyperparameters, the fast and reliable grid search approach [40] is used in this research. First, a subset of values for each hyperparameter is established, and every combination of hyperparameters is evaluated; the optimal combination is then selected and used. Using grid search, the proposed approach tunes eight pre-training parameters: the learning rate, epoch count, batch size, dropout rate, update gate coefficient, number of hidden layers, number of neurons in each layer, and optimization algorithm.
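A minimal sketch of this grid-search loop follows; the hyperparameter grids and the evaluate() stub are hypothetical placeholders, not the paper's actual settings:

```python
# Exhaustive grid search over hyperparameter combinations (sketch).
import random
from itertools import product

grid = {
    "learning_rate": [1e-2, 1e-3],
    "batch_size": [32, 64],
    "p_max": [0.2, 0.5],         # maximum dropout rate
    "gamma": [0.25, 0.5, 0.75],  # update gate coefficient
}

def evaluate(config):
    # Placeholder: train the model with `config`, return validation RMSE.
    return random.random()

best_config, best_score = None, float("inf")
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    score = evaluate(config)
    if score < best_score:
        best_config, best_score = config, score
print(best_config, best_score)
```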

Proposed model
Considering the features and capabilities of the GRU, we propose a model based on a modified GRU (MGRU) structure in this study. Alongside its advantages, the GRU model has disadvantages that can be addressed to make the method more effective.
In LSTM networks, the forget gate and the update gate are independently responsible for forgetting and updating the information flow, and the rate of each gate can be adjusted separately, whereas the GRU performs both operations through the update gate alone, controlling the new value through a trade-off between the previous time step's value and the candidate value [35-37].
According to relations 1-4, if the reset gate is frequently active, the unit is adept at learning short-term dependencies: as $\Gamma_u$ approaches zero, the influence of $\hat{C}_t$ increases and information passes more quickly. On the other hand, we see long-term dependency when the update gate is frequently active: if $\Gamma_u$ tends toward one, the network prefers to keep $C_{t-1}$ and prevents information from passing too quickly [32]. For ease of analysis of $\Gamma_u$ and $(1 - \Gamma_u)$, $C_t$ is a linear interaction between the previous time step's value and the candidate value, whose coefficients can be replaced by $\alpha_1$ and $\alpha_2$ in relation 4:

$$C_t = \alpha_1 C_{t-1} + \alpha_2 \hat{C}_t \quad (5)$$

Evidently, the sum of these two components equals one, and they counteract each other: if $\alpha_1$ moves toward zero at a constant pace, $\alpha_2$ moves toward one at the opposite rate, which creates problems in learning and information flow [41]. When training the model, this interdependence of $\alpha_1$ and $\alpha_2$ can prevent proper information flow, resulting in slower learning. Therefore, two modifications are proposed in this paper to relax the linear constraint and increase the learning rate. First, instead of $\alpha_1 + \alpha_2 = 1$, an alternative relation is suggested (Eq. 6), and the update gate coefficient $\gamma$, which manages memory, is defined as a factor with a value between 0 and 1 (Eq. 7).
Using Eqs. 6 and 7, the value of $\alpha_2$ is defined (Eq. 8); substituting relations 7 and 8 into relation 5 then yields Eq. 9. As indicated in Eq. 9, the magnitude of $\gamma$ determines whether the unit favors short-term or long-term dependence.
The key steps of the proposed MGRU are illustrated in Algorithm 1.
Another advantage of the price prediction structure in the proposed approach, apart from the modification above, is its dropout technique. The dropout layer, which disables a random fraction of connections at each training iteration, is an effective regularization strategy for reducing overfitting [42]. We experimented with various dropout schedules. Many either failed to deal effectively with overfitting or applied regularization so harshly that the network was not used efficiently during training iterations, so they did not produce suitable outputs. The dropout probability scheduling model proposed in this study rests on the observation that the risk of overfitting is lower during the initial training epochs, and more regularization is needed as the probability of overfitting increases. As a result, in the proposed schedule the dropout probability begins at 0, increases steadily until it reaches its maximum value at the end of training, and then stays there. Equation 10 describes this strategy, which aims to utilize the network effectively for better training.
Here $p_l$ is the dropout probability at training iteration $l$, applied over the $M$ epochs.
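A sketch of this schedule, assuming a linear ramp from 0 up to the maximum rate (the exact form of Eq. 10 is not reproduced here, only its stated behavior):

```python
# Scheduled dropout probability: starts at 0, ramps to p_max, then holds.
def dropout_probability(l, M, p_max):
    """Dropout probability at training iteration l, ramping over M iterations."""
    return p_max * min(l / M, 1.0)

# Example: with p_max = 0.5 and M = 100, iteration 25 uses p = 0.125,
# and every iteration past 100 stays at the maximum 0.5.
```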
With the methods mentioned above, dropout is applied at the output of each MGRU layer to ensure strong model performance on large training data. The dropout layer randomly drops the output of each MGRU layer according to the dropout rate.

Dataset description
To reduce error, optimize latency, and manage traffic, Amazon Web Services (AWS) operates different regions and availability zones. Users can designate the location of their resources among a variety of EC2 instance types, each with its own CPU, RAM, storage, and other capabilities.
To validate the proposed model, we use actual Amazon EC2 spot price data for the C3.2xlarge, M3.2xlarge, and M3.medium virtual machines, recorded from March 7 to June 7, 2016, in the US West (Oregon) region [43]; Table 1 shows their main specifications.
A vCPU (virtual central processing unit) represents a portion or share of the underlying physical CPU that is assigned to a particular virtual machine (VM); the vCPU is responsible for processing and orchestrating all applications. An ECU (EC2 compute unit) provides a relative measure of the integer processing power of an Amazon EC2 instance. Besides, each virtual machine consumes memory based on its configured size, plus additional overhead memory for virtualization; the configured size is the amount of memory presented to the guest operating system. To obtain more accurate results with the proposed model, we use a box plot to remove outliers and replace missing values by linear interpolation. The data must then be normalized to accelerate the convergence of the learning-based algorithms, given the large differences in the range of spot price data over time. The normalization method of this article follows the relation below:

$$x_i' = \frac{x_i - \operatorname{mean}(\vec{X})}{\sigma}$$

where $\operatorname{mean}(\vec{X})$ is the mean value of $\vec{X}$ and $\sigma$ is the standard deviation. The data are fed into the prediction model in batches. More specifically, the original dataset is divided into three subsets: 60% of the data form the training set, used to determine the weights and thus train the model; the validation set, responsible for hyperparameter selection and other model selection tasks, comprises 20% of the dataset; and the test set, responsible for evaluating the performance of the selected model after the optimal model is chosen, comprises the remaining 20%.
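A minimal sketch of this preprocessing pipeline, covering box-plot (IQR) outlier removal, linear interpolation, z-score normalization, and the 60/20/20 chronological split, on a placeholder series:

```python
# Preprocessing sketch: outlier removal, interpolation, normalization, split.
import numpy as np
import pandas as pd

prices = pd.Series(np.random.rand(1000))  # placeholder spot-price series

q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
mask = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
prices[mask] = np.nan                       # treat box-plot outliers as missing
prices = prices.interpolate(method="linear")

normalized = (prices - prices.mean()) / prices.std()  # z-score (relation above)

n = len(normalized)
train = normalized[: int(0.6 * n)]
val = normalized[int(0.6 * n): int(0.8 * n)]
test = normalized[int(0.8 * n):]
```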

Performance metric
In the experiments, the performance of the proposed approach is compared with ARIMA, RNN, LSTM, GRU, and transformer models. Several metrics, namely root mean squared error (RMSE), mean absolute error (MAE), the coefficient of determination ($R^2$), and mean absolute percentage error (MAPE), are used to evaluate the prediction accuracy of all models. RMSE is defined by:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

MAE is expressed as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$R^2$ is introduced by the following relation:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

The MAPE, as a percentage error measure, can be computed as follows:

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

where $y_i$, $\hat{y}_i$, and $\bar{y}$ represent the actual, predicted, and mean values, respectively, and $n$ is the number of evaluated data samples.
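For reference, the four metrics in a minimal NumPy sketch (standard definitions):

```python
# RMSE, MAE, R^2, and MAPE for arrays of actual and predicted values.
import numpy as np

def metrics(y_true, y_pred):
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    mape = 100.0 * np.mean(np.abs(err / y_true))
    return rmse, mae, r2, mape
```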

Generating a list containing combinations of hyperparameters
After providing the optimal algorithm for the proposed approach, a crucial step is selecting hyperparameter settings that avoid overfitting and underfitting in the given model. First, a subset of values is set for each hyperparameter; then, by evaluating all the different hyperparameter combinations, the best combination is selected. The learning rate, epoch count, mini-batch size, $p_{max}$ for the dropout schedule, number of hidden layers, number of neurons in each layer, update gate coefficient, and optimization algorithm are the hyperparameters of interest, described as follows.
a. The learning rate is a tunable hyperparameter that controls the amount of weight change. The correct learning rate helps the model converge quickly and accurately: higher learning rates accelerate learning but may fail to converge to an optimum, while lower learning rates lead to prolonged convergence and very slow training [44].
b. The epoch count is the number of complete passes over all training data. With too few epochs, learning remains incomplete and insufficient; with too many, overfitting occurs and new, previously unseen data are not predicted efficiently [45].
c. The batch size determines how many samples are processed at each learning step. It is essential to find the ideal batch size that yields the fastest performance improvement [46].
d. The maximum dropout rate ($p_{max}$) controls the pace at which neurons are dropped. Selecting the best value directly affects model performance and helps avoid overfitting [47].
e. In neural networks, choosing the right number of hidden layers improves learning accuracy. Although deep topologies have been shown to be more flexible than single-layer ones, increased accuracy is not guaranteed by adding hidden layers; to increase performance, the model should be tested with multiple hidden-layer counts [48].
f. Another important parameter is the ideal number of neurons per layer. A network with insufficient neurons cannot learn the problem well, while a network with too many neurons overfits, resulting in poor performance and inaccurate prediction [14].
g. The update gate coefficient $\gamma$ in the proposed method is the memory regulator, which tunes short-term or long-term dependence by modifying the linear constraint.
h. Finally, an efficient and effective optimizer must be selected to minimize the neural network's objective function in the best possible way [27].
The parameter settings selected for the learning process are given in Table 2; they balance the model between overfitting and underfitting.

Results and discussion
To better evaluate the proposed approach, extensive experiments are conducted in this section. We compare the proposed model with five sophisticated methods, ARIMA, RNN, LSTM, GRU, and transformer, to determine its predictive capability. For a fair comparison, all combinations of hyperparameters for each method were evaluated using grid search, and the best combination was selected individually for each. The results for the three criteria shown in Figs. 4 and 5 demonstrate the performance of the approaches: the proposed method has the lowest error (RMSE and MAE) and the best accuracy ($R^2$).
The results show that the modifications and the proposed mechanism are very useful in increasing the performance of the original structure. Overall, two conclusions can be drawn from Figs. 4 and 5. First, deep learning-based approaches perform better than the conventional statistics-based model (ARIMA). Second, the proposed MGRU-based model outperforms all the other approaches studied, suggesting that MGRU can capture long-term dependencies from consecutive data, even when modeling longer sequences. Table 3 presents the performance of the proposed model and the alternative deep learning-based approaches (RNN, LSTM, GRU, and transformer) for predicting spot prices at varying prediction lengths. As shown, the RNN approach yields higher prediction error than the other methods due to the rapid volatility of values in cloud data centers, whereas the proposed method has the lowest error. Figure 6 and Table 3 clearly illustrate that the error grows with the time scale for all models, but the prediction accuracy of our model outperforms the other approaches tested, demonstrating its structure's useful resistance to gradient vanishing.
Another feature built into the proposed framework is the suggested dropout method. To avoid overfitting, dropout layers are placed in the network. Figure 8 shows the improved results of the proposed dropout method compared with an identical dropout-free structure as well as a simple fixed-rate dropout structure, with a clear improvement over time.
As shown in Fig. 9, the cumulative distribution function (CDF) of the MSE for the proposed method performs better than those of other methods at different prediction levels. The vertical axis of each CDF curve is limited to the range 0 to 1, and the closer the curve is to the vertical axis, the greater the accuracy and therefore the lower the MSE. As Fig. 9 shows, as the prediction horizon is extended, the performance advantage of the proposed method over the other techniques grows.
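For reference, empirical CDF curves like those in Fig. 9 can be produced along these lines (a sketch, assuming per-sample squared errors are available for each model):

```python
# Empirical CDF of per-sample squared errors for one model.
import numpy as np
import matplotlib.pyplot as plt

def plot_mse_cdf(squared_errors, label):
    x = np.sort(np.asarray(squared_errors))
    y = np.arange(1, len(x) + 1) / len(x)  # cumulative fraction in [0, 1]
    plt.plot(x, y, label=label)            # nearer the vertical axis = lower MSE
```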

Summary and conclusions
Users can overcome the significant volatility of spot instances and accurately predict future price patterns by using an efficient strategy. By choosing the purchase time, the best price can be obtained, avoiding an excessive bid that leads to prohibitive cost or a low bid that prevents the instance from being used. By examining past Amazon EC2 spot price trends, this work proposes a prediction method to effectively predict future prices. The proposed GRU-based architecture allows for more adaptive and accurate spot price forecasting. We have also demonstrated how the proposed dropout method, based on the correct use of regularization in our model, can be used to train a neural network effectively. ARIMA, RNN, LSTM, GRU, and transformer models were evaluated on the same datasets to assess the proposed model's performance. The results show that the proposed strategy produces more accurate spot price predictions than the other methods studied, and that deep learning-based techniques can accurately predict real-world data, with the proposed model performing best. Furthermore, when predicting with longer prediction lengths, the error of the proposed model grew less than that of the other models, indicating its strength in situations that depend on long-term memory.
This feature helps capture dynamic changes in price history and make accurate predictions. Future research aims to understand and predict the spot price from multiple perspectives, to extend the approach to other time series, and to investigate combinations with other pricing models.

Funding No funding was received to assist with the preparation of this manuscript.
Availability of data and materials All data generated or analyzed during this study are included in this published article.

Declarations
Competing interest The authors declare no competing interests.

Ethical approval Not applicable.