Prediction of construction material prices using ARIMA and multiple regression models

Variations in construction material prices (CMP) have become a major issue in properly budgeting construction projects. Inability to accurately forecast CMP volatility can lead to price overestimation or underestimation. Enhancing the accuracy of CMP predictions can also enhance the accuracy of predictions of total construction costs. The purpose of this study is to present a model for predicting construction material prices that assists decision-makers in making better decisions over the life cycle of a project. The price records for five construction materials, namely steel, cement, brick, ceramic, and gravel, and the indicators affecting them in Egypt were used for the prediction procedures. This research outlines practical methods for using the Box-Jenkins Autoregressive Integrated Moving Average (ARIMA) time series approach and multiple regression models to forecast building material prices. Out-of-sample predictions are used to evaluate the models' performance in predicting future prices. The models are compared according to the Mean Absolute Percentage Error (MAPE). The generated models show good results in predicting month-to-month variations in material prices, with MAPE ranging from 1.4 to 2.8 percent for the selected models. This research can assist both owners and contractors in improving their budgeting processes and preparing more accurate cost estimates.


Introduction
The number of large-scale construction projects for residential, commercial, and government structures has recently surged around the world. Construction costs for mega projects have become a major source of concern under current conditions, due to their high prices and numerous design modifications during their long construction durations. This issue has also hindered contractors from creating accurate cost estimates. Since material prices can account for up to one-fourth of overall project costs (Hwang et al., 2012), the wide variation in CMP makes accurate planning and cost estimating difficult for both owners and contractors. Contractors may lose bids or revenues owing to cost overestimation or underestimation (Ashuri and Lu, 2010). Many researchers have attempted to accurately forecast cost increases, but predicting prices for a variety of construction materials requires a simple and automated procedure. Enhancing the accuracy of material price predictions can also improve the accuracy of total cost estimates. Various project stakeholders might benefit from predicting short- and long-term variations in construction material prices. Contractors can avoid losing bids or profits by enhancing the accuracy of their cost forecasting. This avoidance of losses leads to fewer hidden price contingencies, postponed or canceled projects, budget irregularities, and erratic project flows. Owners of projects might profit from avoiding these undesirable consequences.
To account for probable changes in future material prices, some cost estimators, for example, escalate the estimated material costs to the midpoint of the planned construction period (Anderson et al., 2006). Other cost estimators add a fixed percentage of the overall estimated cost as a risk premium to account for increases in the prices of materials such as asphalt cement (Laryea & Hughes, 2009). These simplistic solutions ignore the fact that CMP varies significantly even over short periods of time. Given the considerable uncertainty regarding the rate of escalation of material prices, a probabilistic approach based on Monte Carlo simulation was utilized to assess the project cost range (Back et al., 2000), with the simulation used to generate random values for the escalation rate of material prices. However, Monte Carlo simulation does not address the impact of autocorrelation in historical CMP, which is a critical flaw in this approach (autocorrelation is the relationship between the values of a time series at different time lags). The results show that the suggested model performs similarly to present practice in terms of expectation while also offering theoretical uncertainty bounds that are well suited to future volatility, which is possibly more relevant.

Literature review
Numerous studies have sought to address cost escalation factors by concentrating on rapidly fluctuating construction material costs in an effort to make cost planning more feasible. The primary problems here are identifying escalation drivers and calculating project costs properly and simply. Shiha et al. (2020) presented three models that employ artificial neural networks (ANNs) to estimate future costs of major building materials, such as steel reinforcing bars and Portland cement, 6 months ahead in the Egyptian construction sector. The three models were built using a Genetic Algorithm (GA), the Neural Tools software, and the Python programming language. Historical data on steel and cement prices in Egypt, as well as macroeconomic indices, were used to train, test, and validate the suggested models. Kaveh and Khalegi (1998) used artificial neural networks to forecast the 7-day and 28-day strength of concrete specimens for various types of concrete mixtures, considering both plain concretes and concretes with admixtures. The backpropagation technique was used to train and compare neural networks with one, two, and three hidden layers. The best-performing networks, with relatively small errors, were then chosen and used to forecast the strength of concrete mixtures. Kaveh and Servati (2001) trained efficient neural networks for the design of double-layer grids, considering square diagonal-on-diagonal grids with spans ranging from 26.5 to 75 m. The networks were trained using the backpropagation algorithm to evaluate the weight, maximum deflection, and design of double-layer grids. To decrease the nonlinearity of the data and speed up training, a special method for data ordering was devised, which also provided the required stability. Using this data ordering, further neural networks were trained and tested for grid design.
Marzouk and Amin (2013) formulated a fuzzy logic model to assess the degree of importance of each material type through three main criteria: (1) the percentage of the elements participating in the total cost items; (2) the difference in the calculated price index of the elements during the research period; and (3) the percentage difference in the price of the cost elements. They also compared artificial neural networks with regression analysis, and the results showed that neural networks surpass regression analysis in terms of the estimated error. Lee et al. (2019) suggested a technique for forecasting raw material prices with the aim of obtaining more accurate predictions. The prediction approach is a multivariate time series analysis, and the prediction target is the iron ore price, which is the primary driver of steel raw material prices. The accuracy of the prediction results over a specified period was compared with past average values, and the proposed method was found to be 2.3 times more accurate than the previous average values. Faghih and Kashani (2018) introduced a vector error correction (VEC) model for estimating construction material prices in the United States. The association between construction material prices and a collection of key explanatory factors was studied using this model. The use of VEC models to forecast construction material prices filled a gap in the literature, which had overlooked the need to forecast both short- and long-term movements of particular construction materials. Kissi et al. (2018) modelled the tender price index (TPI) in Ghana using an autoregressive integrated moving average model with exogenous factors (ARIMAX). The results showed that the ARIMAX model outperformed the single-method approach in terms of predictive ability. The study supports prior research by emphasizing the importance of using an integrated modelling technique to forecast TPI.
Oshodi et al. (2017) studied the accuracy of univariate models for tender price index predictions. The modeling tools used in their study were Box-Jenkins and neural networks, and the results show that the neural network model outperforms the Box-Jenkins model in terms of accuracy. Ilbeigi et al. (2017) characterized and analyzed the observed fluctuations in actual asphalt cement prices over time to create time series forecasting models. Their study investigated whether and how time series prediction models can forecast future prices with higher accuracy than established approaches. Four univariate time series forecasting models, namely Holt Exponential Smoothing (ES), Holt-Winters ES, Autoregressive Integrated Moving Average (ARIMA), and seasonal ARIMA (SARIMA), were developed to study the short-term variation in future prices. The forecast results show that all four time series models can predict prices with better accuracy than current approaches such as Monte Carlo simulation. The ARIMA and Holt ES models were the most reliable of the four predictive models, with errors of less than 2%. To forecast future values of the Engineering News-Record (ENR) Construction Cost Index (CCI) over a 12-month period, Ashuri and Lu (2010) employed an ARIMA model that took seasonality into account. The mean absolute error (MAE), mean square error (MSE), and mean absolute percentage error (MAPE) were used to assess forecasting accuracy. Sonmez et al. (2007) employed regression analysis to develop a model that included 14 potential independent variables to predict cost contingency in international projects. Abu Hammad et al. (2010) used several explanatory parameters, such as project area and duration, to design a probabilistic regression model to predict the cost of public building projects.
These models were beneficial for addressing cost escalation issues and preliminary estimating in the early design phase, but they have certain limitations in handling time-varying variables and in representing different time lags between influencing elements. In fact, much time-related data is dependent, i.e., autocorrelated. Applying time-related techniques to forecasting trends in material prices is one way to address these limitations. Time-series approaches, which predict the future values of a variable based on historical values of the variable and other relevant factors, have been used to handle the time-related problems of the aforementioned methodologies. Time-series models forecast trends in a systematic and time-related manner, so that useful projections can be generated on the basis of historical trends (Wong et al., 2005).

Research objective
To make the research presented in this paper suitable for accurate and updatable material price prediction, an automated forecasting system is developed on the basis of both ARIMA and regression modeling using historical Egyptian data. A time series forecasting model identifies relevant patterns in the past values of a variable and predicts future values using those patterns and earlier observations. Regression models account for the fact that price fluctuations are influenced by a variety of independent variables. Accordingly, this paper aims to show that price projection models can be created that perform well in terms of expectation and produce good estimates of future material price volatility (even when data availability is limited). The data utilized in this analysis are accessible in public databases, and the techniques used are available in several statistical software packages (the analysis is conducted in SPSS and EViews), which makes this research practical and implementable for both practitioners and academics. The goals of this study are to:
(1) Discover and analyze fluctuations in actual material prices.
(2) Apply this information to develop CMP forecasting models.
(3) Evaluate whether the proposed models can estimate future prices more accurately.
To satisfy the research objectives, the remainder of the study is organized as follows. The next section presents the research approach and the steps taken in this study. The material price time series data set is then introduced, and its main features (autocorrelation and stationarity) are studied. The most important indicators influencing the CMP are listed. ARIMA and regression models are constructed based on the stated properties. Finally, each model's predictive performance is assessed and compared.

Research methodology
Accurate forecasting of construction material prices is an essential practice, particularly in developing countries where high price fluctuations can adversely affect the success of projects and even their viability. To avoid this, a system that can predict the size of the change in material prices with acceptable accuracy is required. Accordingly, univariate time-series (ARIMA) and regression approaches are used to forecast material prices. Figure 1 presents the process map of the procedures employed in this study. The methodology covers the required data, where and how it was obtained, and how a sample was chosen. It entails four high-level processes, which are briefly outlined below and expanded upon later.
1. Determine the long-term price trend of construction materials, as well as the most relevant indicators influencing price change over time. Build the ARIMA models using the historical material price data, and build the regression models using the historical prices as the dependent variable and the indicators that have a significant relationship with material prices as independent variables.
2. Validate model performance relative to existing practice. Out-of-sample projections are used to assess the accuracy of the price projection models against current assumptions: price forecasts are made from a point prior to the present and compared with what actually happened.
3. Compare results.
4. Recommend the best-fitting model for each material type.

Data description (input data)
Models were created using publicly available price data from CAPMAS Egypt. The types of materials used in this research are steel, cement, brick, ceramic, and gravel, which represent an important part of all construction work items.

ARIMA model
The Box-Jenkins (ARIMA) approach is a relatively advanced time series prediction method. It is capable of describing the dynamic behavior of a series realistically and, under certain conditions, can be used for statistical analysis and forecasting of time series. The model is particularly well suited to short-term forecasting; when the forecasting horizon is long, large variances occur. There are three stages in building an ARIMA model: (1) model identification, (2) parameter estimation, and (3) diagnostic checking. In the model identification step, the time-series data are assessed for stationarity, and non-stationary data are transformed into stationary data using ordinary differencing or a logarithmic transformation. The transformed data can then be used in the next modeling stage. In prediction form, the ARIMA model can be expressed using Eq. (1).
The order of the autoregressive component (p) and the order of the moving average component (q) are established by examining the autocorrelation function (ACF) plot and the partial autocorrelation function (PACF) plot, which indicate candidate ARMA (p, q) models through the significant lag numbers. The order of differencing (d) is identified during the model identification stage as the number of differencing operations needed to make the data stationary. The ARIMA models are fitted during the parameter estimation stage. For the ARIMA models, EViews software version 10 was utilized. EViews is a software application that is specifically built to process time-series data. The models were built on the published monthly price data of materials for the last 10 years (January 2011 to December 2020) and the last 5 years (January 2016 to December 2020). After testing the appropriate model for each type of material, the model was used to predict price values during the first 6 months of 2021. The lowest values of the mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) can be used to choose the best model. Most researchers suggest MAPE as the criterion for judgment. In prediction models, a MAPE of up to 10% is commonly regarded as acceptable (Fan et al., 2010; Hwang et al., 2012). MAPE is calculated using Eq. (2).
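The estimate-then-forecast steps can be illustrated with a minimal Python sketch. It assumes the simplest possible case, a single autoregressive term fitted by ordinary least squares to an already stationary series, whereas the study itself used EViews and richer ARMA (p, q) orders. The function names `fit_ar1` and `forecast_ar1` and the sample numbers are hypothetical.

```python
def fit_ar1(series):
    """Least-squares estimate of (c, phi) in x_t = c + phi * x_{t-1} + e_t."""
    x_prev = series[:-1]
    x_next = series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    sxy = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    sxx = sum((a - mean_prev) ** 2 for a in x_prev)
    phi = sxy / sxx
    c = mean_next - phi * mean_prev
    return c, phi


def forecast_ar1(c, phi, last_value, steps):
    """Iterate the fitted recursion forward to get multi-step forecasts."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = c + phi * value
        forecasts.append(value)
    return forecasts


# Hypothetical noise-free series generated by x_t = 1 + 0.5 * x_{t-1}.
series = [10.0, 6.0, 4.0, 3.0, 2.5, 2.25]
c, phi = fit_ar1(series)
six_step = forecast_ar1(c, phi, series[-1], steps=6)
print(c, phi, six_step)
```

Because the sample series contains no noise, the fit recovers the generating constant and coefficient; with real price data the estimates would carry sampling error and the residuals would need the diagnostic checks described later.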

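Fig. 1 The process map of the procedures

$$ y_t = C + \sum_{i=1}^{p} \varphi_i \, y_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t \qquad (1) $$
where $y_t$ is the value of the (stationary) series at time t; C is a constant; p is the order of the autoregressive component (AR); q is the order of the moving average component (MA); $\varphi_i$ is the coefficient of the autoregressive model; $\theta_j$ is the coefficient of the moving average model; and $\varepsilon_t$ is the error term.
$$ \mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{Y_t - f_t}{Y_t} \right| \qquad (2) $$
where $Y_t$ is the actual value at any specified time; $f_t$ is the forecasted value at any specified time; and n is the number of forecasts.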
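The three accuracy measures named above (MAE, RMSE, and MAPE) can be computed as in the following Python sketch. The function names and the six sample prices are hypothetical and serve only to illustrate the calculation; they are not data from the study.

```python
import math


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean square error."""
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(actual, forecast)) / len(actual))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((y - f) / y) for y, f in zip(actual, forecast)) / len(actual)


# Hypothetical six-month out-of-sample prices (illustrative numbers only).
actual = [100.0, 102.0, 104.0, 103.0, 105.0, 108.0]
forecast = [101.0, 101.0, 105.0, 104.0, 104.0, 110.0]

print(mae(actual, forecast), rmse(actual, forecast), mape(actual, forecast))
```

MAPE is scale-free, which is why it is convenient for comparing materials whose prices are quoted in very different units.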

Multiple regression models
Multiple Linear Regression (MLR) is a linear statistical technique for investigating the relationship between a dependent variable and two or more independent variables. More generally, regression analysis encompasses numerous approaches for modeling and evaluating multiple variables when the focus is on the link between a dependent variable and one or more independent variables. It helps to show how the typical value of the dependent variable (or 'criterion variable') varies when one of the independent variables is changed while the other independent variables remain constant. As a result, it provides a strong basis for predicting price changes. For the regression model, IBM Corporation's SPSS statistical program version 25 was used to analyze the data. A set of nineteen indicators that can potentially influence the CMP was identified through an extensive literature review. The information collected was split into two categories: independent input variables and dependent output variables. When various inputs are used to predict an output, the primary assumption is that these inputs are independent variables that predict the dependent output variable. Raw prices are treated as the dependent variable, and the indicators affecting construction material prices are used as independent variables. For this study, the indicators used have publicly published data on one of the official websites indicated in Table 1. If Y is a dependent variable and X_1, …, X_n are independent variables, the multiple regression model predicts Y from X in the following manner:
$$ Y = C + b_1 X_1 + b_2 X_2 + \dots + b_n X_n \qquad (3) $$
where Y denotes the output of the dependent variable, C denotes the constant, b denotes the regression coefficients, and X denotes the input of the independent variables. The following assumptions govern the multiple regression models:
1. Linearity: The dependent variable Y is a linear combination of the independent variables X_1, …, X_n.
2. Independence: Observations are chosen from the population independently and randomly.
3. Normality: Observations are normally distributed.
4. Variance homogeneity: All observations have the same variance.
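To make the linear form Y = C + b1 X1 + ... + bn Xn concrete, the following Python sketch estimates the constant and coefficients by ordinary least squares through the normal equations. It is an illustration only: the study used SPSS, the helper names are hypothetical, and the data are synthetic values generated exactly from Y = 5 + 2 X1 - 3 X2.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular (no perfectly collinear predictors).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_mlr(xs, ys):
    """Return [C, b1, ..., bn] minimizing the sum of squared residuals."""
    design = [[1.0] + list(row) for row in xs]  # leading 1.0 carries the constant C
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_mlr(coeffs, row):
    """Evaluate Y = C + b1*X1 + ... + bn*Xn for one observation."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))


# Synthetic observations generated exactly from Y = 5 + 2*X1 - 3*X2.
xs = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
ys = [5.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in xs]
coeffs = fit_mlr(xs, ys)
print(coeffs, predict_mlr(coeffs, (6.0, 5.0)))
```

In practice a statistical package also reports standard errors, t-statistics, and p-values for each coefficient, which this sketch does not compute.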
Regression models were created for each type of material. Each model was then used to predict future price values using the out-of-sample method. The prediction period is the first six months of the year 2021. The results of the prediction process were then compared with the actual values. The error rate was calculated for each value separately, and then the MAPE was calculated for each model following Eq. (2).

Results of Regression modeling
The aim of the research using multiple regression models is to find out whether it is possible to describe the relationship between prices and the influencing indicators through some equations. Interpretation of the results includes the issues of (1) analyzing the data, (2) estimating the model, i.e. fitting the line, and (3) evaluating the validity and efficiency of the model. SPSS software was used to analyze the data. The actual price is the (Y) dependent variable in the regression analysis. The independent variables (X) that have been assigned are shown in Table 1.

Analyzing the data
This study aims to create prediction models using only significant indicators, i.e., indicators with strong t-statistics and a significance value (p-value) of less than 0.05. As a result, the final model may not include all of the initially selected indicators. These tests of significance are useful for determining whether each explanatory variable is required in the model, assuming that the others are already present. The "p-value" column in Table 2 reports the significance level of each indicator. Taking steel as an example, the indicators inflation rate, GDP-construction, GDP, revenue, expenditure, industrial production, import, export, external reserve, and balance of payment have p-values of 0.058, 0.635, 0.983, 0.983, 0.313, 0.52, 0.322, 0.444, 0.801, and 0.983, respectively, all of which are greater than 0.05. These indicators are therefore not significant for the modeling process, while the remaining indicators, which have p-values (0.00) below 0.05, contribute significantly to explaining the change in steel prices, as indicated in Table 2.
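The screening rule can be sketched in a few lines of Python. The ten steel p-values are the ones quoted above; the two entries with p-values of 0.00 use placeholder names, because the text does not list the significant indicators by name. A single-pass filter is also a simplification, since p-values generally change once variables are removed and the model is refitted.

```python
ALPHA = 0.05  # significance threshold used in the study

steel_p_values = {
    "inflation rate": 0.058,
    "GDP-construction": 0.635,
    "GDP": 0.983,
    "revenue": 0.983,
    "expenditure": 0.313,
    "industrial production": 0.52,
    "import": 0.322,
    "export": 0.444,
    "external reserve": 0.801,
    "balance of payment": 0.983,
    # Placeholder names: the significant indicators are not named in the text.
    "indicator A (hypothetical name)": 0.00,
    "indicator B (hypothetical name)": 0.00,
}

# Keep only indicators whose p-value is below the threshold.
retained = [name for name, p in steel_p_values.items() if p < ALPHA]
dropped = [name for name, p in steel_p_values.items() if p >= ALPHA]

print("retained:", retained)
print("dropped:", dropped)
```

Note that the inflation rate, at 0.058, falls only slightly above the threshold, so its exclusion is sensitive to the choice of significance level.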

Estimated models coefficients
The general forms of the equations for predicting the prices of the aforementioned material types are obtained from Table 2. Each coefficient shows how much the dependent variable varies with an independent variable when all other independent variables are held constant; that is, the regression coefficient gives the expected change in the dependent variable for an increase of one unit in the independent variable.

Determine the suitability of the models
The models' summaries are indicated in Table 3. This table provides the values of R, R square (R 2 ), and adjusted R 2 for the estimate, which can be used to determine the appropriateness of the regression models for the data.
The value of R, the multiple correlation coefficient, is given in the "R" column. R can be thought of as a measure of the quality of the prediction of the dependent variable. For the steel model, a value of 0.995 implies a good level of predictability. As displayed in the "R Square" column, the R 2 value (also known as the coefficient of determination) indicates the proportion of variance in the dependent variable that can be explained by the independent variables. The steel model's result of 0.99 shows that the independent variables account for 99 percent of the variability in the dependent variable. R-squared appears to be a simple statistic that measures how well a regression model fits a set of data. However, it does not tell the whole story: the R 2 value must be considered together with residual plots, other statistics, and an in-depth understanding of the subject area to get the entire picture. Another key issue is the correct interpretation of the "Adjusted R Square" (adj. R 2 ). In this example, a result of 0.99 (Table 3) shows that the predictors retained in the model explain 99 percent of the variance in the outcome variable after accounting for the number of predictors. A large difference between the R-squared and adjusted R-squared values suggests a poor model fit. Introducing a superfluous variable into a model tends to reduce adjusted R-squared, whereas adjusted R-squared rises when useful variables are included. Adjusted R 2 will always be less than or equal to R 2 . In this way, adjusted R 2 compensates for the number of terms in a model.
The histogram of regression standardized residuals for the steel model, taken as an example, is shown in Fig. 2a and indicates that the residuals are approximately normally distributed. In the plot of the regression standardized residuals versus the regression standardized predicted values, the points are roughly randomly distributed, indicating that the assumption of homoscedasticity, or equality of variances, is satisfied.
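The way the adjustment penalizes the number of predictors can be seen in a short Python sketch using the standard formulas. The sample size of 120 corresponds to the monthly observations from January 2011 to December 2020; the predictor counts are assumptions for illustration, since the exact number of retained indicators per model is given only in the tables.

```python
def r_squared(actual, fitted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - f) ** 2 for y, f in zip(actual, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors regressors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


r2 = 0.99                                       # R-squared reported for the steel model
adj_9 = adjusted_r_squared(r2, 120, 9)          # assumed: 9 retained predictors
adj_19 = adjusted_r_squared(r2, 120, 19)        # assumed: all 19 indicators kept
print(adj_9, adj_19)
```

For the same R-squared, the model with more predictors receives the lower adjusted value, which is the compensation for model size described above.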

Stationary test
General ARIMA modeling and forecasting methods are outlined in this section. The procedure is depicted in Fig. 1. It is worth noting that this is not a straightforward sequential procedure; it can contain repetitive loops based on the results of the diagnostic and forecasting stages. Because the ARIMA model is applied to stationary time-series data, the data must first be checked for stationarity in terms of mean and variance. The historical price data for steel, cement, brick, gravel, and ceramic are plotted in Fig. 4. The plots show that, for all types of material used, the data were non-stationary on first inspection. The natural logarithm of each material's data was taken to reduce its non-stationarity, and the Augmented Dickey-Fuller (ADF) test was applied to the logarithm; the ADF test statistic was still greater than the critical values at the 0.01 and 0.05 significance levels. The first-order difference was therefore taken, yielding the DLsteel, DLcement, DLgravel, DLceramic, and DLbrick sequences. After taking the logarithm and the first difference for the above-mentioned types of materials, the ADF statistic became smaller than the critical value. That is to say, the series became stationary and the significance test for stationarity was passed, as shown in Fig. 3.
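The log-and-difference transformation behind the DL series, and the back-transformation needed to turn forecasts of the differenced series into price levels, can be sketched in Python as follows. The function names and the price values are hypothetical. The ADF test itself is not implemented here; the study ran it in EViews, and a library routine such as `adfuller` in statsmodels could be used instead.

```python
import math


def log_difference(prices):
    """First difference of the natural logarithm of a price series."""
    logs = [math.log(p) for p in prices]
    return [b - a for a, b in zip(logs, logs[1:])]


def undo_log_difference(last_price, diffs):
    """Rebuild price levels from (forecast) log-differences, starting after last_price."""
    levels = []
    log_level = math.log(last_price)
    for d in diffs:
        log_level += d
        levels.append(math.exp(log_level))
    return levels


prices = [100.0, 102.0, 105.0, 104.0, 108.0]  # hypothetical monthly prices
dl_prices = log_difference(prices)            # one observation shorter than prices
rebuilt = undo_log_difference(prices[0], dl_prices)
print(dl_prices)
print(rebuilt)
```

Each log-difference is approximately the month-to-month percentage change, which is one reason this transformation is common for price series.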

Model identification
After determining the order of differencing required to make the series stationary, the next step is to develop a suitable ARMA form to model the stationary series. The classic method is the Box-Jenkins procedure, which involves an iterative process of model identification, model estimation, and model evaluation. The Box-Jenkins process is a semi-formal approach that relies on the subjective evaluation of plots of the auto-correlogram and partial auto-correlogram of the series to identify models. The auto-correlogram shows the autocorrelation of the time series at different lag lengths, and it must be plotted before a Box-Jenkins model can be identified. The technique consists of evaluating plots of the sample auto-correlogram, partial auto-correlogram, and inverse auto-correlogram and inferring the appropriate type of ARMA model from the patterns detected in these functions, by comparison with the theoretical auto-correlograms of AR, MA, and ARMA models of various orders. EViews software was used to compute the autocorrelation function (ACF) and partial autocorrelation function (PACF) for all the aforementioned material types. Figure 4 shows the ACF and PACF for steel and cement as examples of the model identification process.
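The two functions inspected in this step can be computed as in the Python sketch below: the sample ACF from its definition, and the PACF from the ACF through the Durbin-Levinson recursion. This is an illustration with hypothetical function names; EViews may apply slightly different estimators and small-sample conventions.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]


def pacf_from_acf(acf):
    """Partial autocorrelations for lags 1..len(acf)-1 via Durbin-Levinson."""
    pacf = []
    phi_prev = []  # coefficients of the AR(k-1) fit from the previous step
    for k in range(1, len(acf)):
        if k == 1:
            phi_kk = acf[1]
        else:
            num = acf[k] - sum(phi_prev[j] * acf[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(phi_prev[j] * acf[j + 1] for j in range(k - 1))
            phi_kk = num / den
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Theoretical ACF of an AR(1) process with coefficient 0.5: rho_k = 0.5 ** k.
theoretical_acf = [1.0, 0.5, 0.25, 0.125]
print(pacf_from_acf(theoretical_acf))

# Sample ACF of a short hypothetical series.
print(sample_acf([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=1))
```

The AR(1) example shows the pattern used for identification: the ACF decays geometrically while the PACF cuts off after lag 1, which points to a single autoregressive term.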

Identify the most significant model
Because of the highly subjective nature of the Box-Jenkins methodology, time series analysts have sought alternative, objective approaches for identifying ARMA models. The Akaike Information Criterion [AIC] or Final Prediction Error [FPE] Criterion (Akaike, 1974) and the Schwarz Criterion [SC] or Bayesian Information Criterion [BIC] are examples of the identification criteria that analysts have used to minimize errors in model selection. For this study, eight models were fitted for each type of material, and the best model was chosen based on the values of adjusted R-squared, the Akaike info criterion (AIC), and the Schwarz criterion (SC). The lowest AIC and SC values, however, are not sufficient conditions for the best ARMA model. The procedure followed in this study was to first select the model with the lowest Root Mean Square Error (RMSE), AIC, and SC values, and then to perform a parameter significance test and a residual randomness test on the estimation result. If the model passes the tests, it can be considered the best model. If it fails, the model with the second-lowest AIC and SC values is chosen and the same statistical tests are run, and so on, until the best model has been identified. Table 4 shows the most significant model chosen for each type of material. The criteria used for the judgment are expressed in terms of the following quantities: N is the total number of data points; y t is the actual material price; yˆt is the forecasted material price; y¯t is the mean of actual material prices; and k is the total number of estimated parameters.
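For reference, these selection criteria are often written as follows. This is a sketch of commonly used definitions (the AIC and SC forms follow the per-observation convention that EViews reports), not necessarily the exact expressions used in the study. The log-likelihood symbol \ell is introduced here for illustration and does not appear in the original definitions.

```latex
\[
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(y_t - \hat{y}_t\right)^2}
\]
\[
R^2 = 1 - \frac{\sum_{t=1}^{N}\left(y_t - \hat{y}_t\right)^2}{\sum_{t=1}^{N}\left(y_t - \bar{y}\right)^2},
\qquad
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{N-1}{N-k}
\]
\[
\mathrm{AIC} = -\frac{2\ell}{N} + \frac{2k}{N},
\qquad
\mathrm{SC} = -\frac{2\ell}{N} + \frac{k\,\ln N}{N}
\]
```

Lower RMSE, AIC, and SC values and a higher adjusted R-squared indicate a preferable model. SC penalizes additional parameters more heavily than AIC whenever ln N exceeds 2, which holds for any series of eight or more observations.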

Model diagnostic
The next stage is the formal evaluation of each of the time series models, which entails a thorough examination of each model's diagnostic tests. A variety of diagnostic techniques are available to ensure that an acceptable model is created. A useful first check is plotting the estimated model's residuals. This should highlight any outliers that may have an impact on parameter estimates, as well as any potential autocorrelation or heteroscedasticity issues. Plotting the auto-correlogram of the residuals provides the second test of model adequacy. The residuals should be 'white noise' if the model is appropriately specified. Accordingly, the auto-correlogram of the residuals should show no significant autocorrelation at any non-zero lag (Fig. 5).
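A formal counterpart to inspecting the residual auto-correlogram is a portmanteau statistic. The text does not name the randomness test used in the study, so the Ljung-Box Q statistic in the Python sketch below is an assumption chosen for illustration; the function name and the residual values are hypothetical.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic: n (n + 2) * sum_k r_k^2 / (n - k), k = 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    c0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


# Hypothetical residuals that alternate in sign, i.e. clearly not white noise.
alternating = [1.0, -1.0] * 10
q_stat = ljung_box_q(alternating, max_lag=1)

# 3.841 is the 5% critical value of the chi-square distribution with 1 degree
# of freedom; for residuals of a fitted ARMA(p, q) model the degrees of freedom
# are usually reduced to max_lag - p - q.
print(q_stat, q_stat > 3.841)
```

A Q value above the critical value rejects the white-noise hypothesis, which would send the analyst back to the identification stage to try a different model order.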

Comparison of ARIMA and regression prediction models
To validate the proposed time series models, the predictive accuracy of the Box-Jenkins (ARIMA) model was compared with that of the structural multiple regression models. The actual material price series were used as the basis. The validity of each model was tested using the actual and predicted values for the six-month out-of-sample period from January 2021 to June 2021. The difference in predictive accuracy between the regression models and the ARIMA models was found to be small, as shown in Table 5 and Fig. 6. Given the small forecast error of both models, it may be stated that both performed well in terms of prediction. However, on the test data, the ARIMA model outperforms the regression model in terms of forecasting accuracy, as shown in Fig. 6. This finding demonstrates that, for material prices recorded as time-series data, time-series models are able to predict well. The recommended model for each type of material, selected according to the value of the mean absolute percentage error (MAPE) and denoted by *, is given in Table 5.

Conclusion
The difficulty that construction project partners face in precisely estimating future material prices in the market, especially in the face of economic volatility, is a common challenge. It can often prevent developers from investing, reduce contractor profitability, and cause owners to delay payments. The techniques proposed in this research will help construction contractors and owners estimate material prices more accurately. These prediction models take advantage of ARIMA's predictive power, which learns from historical trends, and of the power of regression models, which take the influencing indicators into consideration. Although much research has presented numerous prediction models, the value of this study lies in its ability to forecast price fluctuations even in economically unstable circumstances, which other approaches would not have been able to capture. In the context of the Egyptian construction sector, this research proposes 6-month price prediction models for steel reinforcing bars, Portland cement, brick, ceramic, and gravel. To this end, relevant Egyptian indicators were collected and associated with material prices during the study period, which ran from November 2018 to January 2020. ARIMA models were built based on the historical data of each type of material using the available data of CAPMAS Egypt. Regression models were built using the historical price data as dependent variables and the most important quantifiable indicators as independent variables. The mean absolute percentage error (MAPE) of each generated model's predictions was used to evaluate it. The results of this research indicate that, because construction material prices are time-series data, ARIMA models outperformed the regression models in predicting future material prices, with a very small error rate.

Limitation
This study has empirically identified 19 indicators affecting CMP that have been used in the prediction process. These indicators were carefully chosen to reflect the relationship between the correlated variables. Other researchers may choose other indicators in the prediction process according to the available data. Further investigations of the principal CMP indicators can be carried out to improve estimating accuracy and efficiency.

Author contributions
All authors reviewed the manuscript.

Funding
No specific grant from funding organizations in any area was given to this research.

Conflict of interest
The authors report no conflicts of interest. The authors alone are responsible for the content and writing of this article. There are no declared competing interests of the authors that are pertinent to the subject matter of this study.