Estimating the volume of civil construction materials by machine learning models

Previous construction cost models, especially those for the project initiation phase, have mainly focused on estimating the total project cost without taking into account its constituent elements such as materials, labor, machinery and equipment. Building a material quantity estimation model can therefore improve the accuracy of the total project cost estimate. Many studies have addressed this issue, but few have built models to estimate the quantity of materials for civil projects with reinforced concrete structures, and those that do use specialized software that is difficult to access for many practitioners in the construction industry. The emergence of machine learning software, especially Weka, makes it possible to build models with powerful and reliable algorithms. This study proposes suitable machine learning models for estimating the quantities of concrete, formwork and steel for four components: foundation, column, beam and floor. The proposed machine learning models are then ranked for each estimation model.


Introduction
During the project initiation phase, project developers always need clear information about the project construction cost, while the information available from design drawings, specifications and preliminary cost estimates is very limited (Badawy, 2020). Having information about construction cost in the project initiation phase therefore helps the project developer make the right decisions and contributes to achieving the project's goals. Many preliminary cost models have been implemented, and they model costs with different levels of accuracy, completeness and relevance. These levels can be improved by modeling preliminary quantities of materials, which in turn results in a more accurate forecast of preliminary cost estimates in the cost model. On the other hand, cost models mainly estimate the total cost without taking into account other influencing factors. Because the construction material market fluctuates constantly, the total cost will not be accurately reflected if the material factor is not taken into account (Son et al., 2013). Furthermore, according to several studies, material cost accounts for about 42% of total construction cost, and in some cases more than 50%, depending on the type of project, construction methods and scope of work (Lam & Runeson, 1999; Wong & Norman, 1997). The quantity and quality of materials are therefore the most influential factors on the construction cost (Idowu & Lam, 2020).
Many studies on AI and machine learning (ML)-based modeling have been published and inform this study. Machine learning is an evolving field of artificial intelligence used for modeling data, and many forecasting models are based on historical data. Among the different machine learning models, the popular ones are artificial neural networks (ANNs), support vector machines (SVMs) and multilinear regression (Sharma et al., 2021).
In the field of construction, the application of artificial intelligence has shown many outstanding advantages. Kaveh and Servati (2001) applied a back-propagation neural network to the design of double-layer grids; the results show that the hybrid neural network model has a lower cost than traditional methods. Kaveh has also extended the application of different types of artificial neural networks: group method of data handling (GMDH) networks to predict the shear strength of fiber-reinforced polymer reinforced concrete (FRP-RC) beams with and without stirrups; a comparison of the backpropagation neural net (BPN) and the counter propagation neural net (CPN) in the analysis and design of large-scale space structures; and neural networks with 1-3 hidden layers to estimate concrete strength after 7 and 28 days. The common finding of these studies is the advantage of artificial neural networks over traditional methods: GMDH networks give more accurate and reliable estimates of the shear strength of FRP-RC beams; BPN and CPN both give good results, but CPN is superior owing to faster convergence, simpler structure, lower memory requirements and better generalization; and neural networks reduce errors in concrete strength estimation compared to other methods (Kaveh & Iranmanesh, 1998; Kaveh & Khalegi, 2009; Kaveh et al., 2018). In addition, artificial intelligence algorithms are widely used in wavefront reduction for finite element method (FEM) engineering problems (Kaveh & Rahimi Bondarabady, 2004), water distribution system design (Pham & Nguyen, 2022), supply chain optimization (Pham & Nguyen, 2023; Son et al., 2021a, 2021b; Son & Hieu, 2021), project schedule optimization (Vu-Hong-Son et al., 2022), transportation optimization to reduce greenhouse gas emissions (Son & Khoi, 2020), and time-cost-quality optimization in construction projects.
For preliminary quantity modeling in civil buildings, Yeh (1998) used a backpropagation neural network combined with statistical regression to estimate the quantity of steel in beams and columns; the quantities of concrete and formwork for these two components were not estimated. The predictor variables used in that study are the number of floors, grid layout, total building height, dynamic and static loads, seismic zone coefficient and compressive strength of concrete (Yeh, 1998). Bakhoum et al. (1998) used an artificial neural network to estimate the quantity of concrete for a bridge in Egypt. Idowu and Lam (2020) used bootstrapped support vector regression models to estimate ranges for the quantities of concrete, steel and formwork for four structural components: foundation, column, beam and floor. The estimated ranges for steel and formwork are only moderately close to the actual quantities, so their accuracy needs to be improved (Idowu & Lam, 2020).
In this study, the machine learning algorithms are selected because they are popular methods in the construction industry. The study considers the Artificial Neural Network (ANN), K-Nearest Neighbors (KNN), Support Vector Regression (SVR) and ensemble methods. These methods have been presented in similar studies in the construction industry (Bayram, 2017; Byerly, 1996; Czarnigowska & Sobotka, 2014; Peško et al., 2017). Finally, the performance of the models is evaluated by the indicators R, RMSE, MAE, MAPE and SI. The following sections describe the machine learning techniques considered, the study methods, the results and the discussion.

AI based predictive model
Building an effective AI-based model requires a standard process for scientific modeling. In data science, the process typically chosen is knowledge discovery in databases (KDD) (He et al., 2021). Models based on artificial intelligence are often divided into single models and ensemble models. A single model uses only one algorithm, whereas an ensemble model combines two or more algorithms to estimate the output (Fig. 1).

Single model
In this study, the single models are drawn from available artificial intelligence-based techniques. The popular single predictive models presented below are ANNs, AANNs, SVRs and KNNs.

Artificial neural networks and additive artificial neural networks
The artificial neural network (ANN) model is an algorithm built to simulate the networks of neurons in the human brain. The multilayer feedforward network is one type that gives good predictive results. It works as follows: the neurons in the input layer receive the input signal, process it (compute the weighted sum and pass it to the transfer function) and produce a result (the output of the transfer function); this result is passed to the neurons of the first hidden layer, which receive it as their input signal, process it and send the result to the next hidden layer; the process continues until the neurons in the output layer give the result (Kiên, 2017). Training is supervised: for every desired input-output vector, the network computes adaptive weights that minimize an error function measuring the difference between the predicted output and the actual output (Chou & Tran, 2018).
The additive artificial neural network (AANN) is a metamodel that can improve the performance of classical ANNs (Friedman, 2002). An implementation of the additive artificial neural network model has been presented by Truong et al. (2021).
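The layer-by-layer signal flow described above can be illustrated with a minimal sketch. The network size, the weights and the choice of a sigmoid transfer function in the hidden layer are hypothetical values chosen for illustration and are not taken from the models of this study; as a simplification, the input layer here only passes the input values on.

```python
import math

def sigmoid(z):
    """Sigmoid transfer function (an assumed choice for the hidden layer)."""
    return 1.0 / (1.0 + math.exp(-z))

def identity(z):
    """Linear transfer function, assumed here for a regression output."""
    return z

def layer_forward(inputs, weights, biases, transfer):
    """Each neuron computes a weighted sum of its inputs plus a bias,
    then passes that sum through the transfer function."""
    return [
        transfer(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Propagate the signal through the layers in order."""
    signal = inputs
    for weights, biases, transfer in layers:
        signal = layer_forward(signal, weights, biases, transfer)
    return signal

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden = ([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0], sigmoid)
output = ([[2.0, 2.0]], [0.0], identity)
prediction = forward([1.0, 1.0], [hidden, output])
```

Training would then adjust the weights and biases to reduce the error between `prediction` and the actual output; that step is omitted from this sketch.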

Support vector regression
The support vector machine was developed by Vapnik in 1995 based on statistical learning theory and the principle of structural risk minimization (Cortes & Vapnik, 1995). SVMs are increasingly used to solve non-linear problems, even with small training datasets (Idowu & Lam, 2020). For non-linear regression problems, the data are mapped into a high-dimensional space by a non-linear kernel function, and a linear hyperplane is then fitted to the transformed data in that space. The kernel function implicitly computes the inner products of the predictor variables in that space, so the performance of the model depends on the selection of the kernel parameters. The goal of support vector regression is to find a regression function, based on an input dataset, that is used to predict the desired output value (Idowu & Lam, 2020). Idowu and Lam (2020) used SVR to model one component of cost, the quantity, during the project initiation phase. The performance of support vector regression is better but still worse than that of KNN (Idowu & Lam, 2020).
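The statement that a kernel implicitly computes inner products in a high-dimensional space can be checked on a small example. The sketch below assumes a homogeneous polynomial kernel of degree 2 on two-dimensional inputs; the vectors and the degree are illustrative and do not represent the kernel parameters used in this study.

```python
import math

def poly_kernel(x, z, degree=2):
    """Homogeneous polynomial kernel: (x . z) ** degree."""
    return sum(a * b for a, b in zip(x, z)) ** degree

def explicit_map(x):
    """Explicit degree-2 feature map for a two-dimensional input."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

# Illustrative input vectors (hypothetical values).
x = [1.0, 2.0]
z = [3.0, 0.5]

# Kernel evaluated directly in the original two-dimensional space.
implicit = poly_kernel(x, z)

# Inner product evaluated after mapping both vectors to three dimensions.
explicit = sum(a * b for a, b in zip(explicit_map(x), explicit_map(z)))
```

Both routes give the same value, but the kernel never builds the mapped vectors, which is what keeps the approach tractable when the mapped space is very large.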

k-nearest neighbors (KNN)
The k-nearest neighbors algorithm is a simple and widely used algorithm that uses historical data to find the nearest neighbors of a given data point. To compute a prediction, the k training instances nearest to the new observation are used. The accuracy of the prediction therefore depends heavily on the value of k (Yu et al., 2016).
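The dependence of the prediction on k can be seen in a minimal sketch of KNN regression. The sketch assumes Euclidean distance and an unweighted average of the neighbors' target values; the data are hypothetical and the function name `knn_predict` is introduced only for illustration.

```python
import math

def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training instances nearest to the query
    (Euclidean distance, unweighted mean)."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical one-feature training data.
train_x = [[1.0], [2.0], [3.0], [10.0]]
train_y = [10.0, 20.0, 30.0, 100.0]
query = [2.2]

pred_k1 = knn_predict(train_x, train_y, query, k=1)
pred_k2 = knn_predict(train_x, train_y, query, k=2)
pred_k4 = knn_predict(train_x, train_y, query, k=4)
```

With k = 4 the distant instance at 10.0 is included in the average and pulls the prediction away from the local values, which illustrates why k has to be tuned.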

Ensemble model
Because they combine predictive models to give more accurate predictions, ensemble models are attracting attention in the machine learning community. This combination helps the ensemble model predict better than a single model (Ayhan et al., 2021). The ensemble models proposed in this study combine different algorithms: ANN-SVRs, ANN-KNNs, ANN-SVRs-LR-CART and ANN-KNNs-LR-CART.
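One simple way to combine base models is to average their predictions. The sketch below shows this averaging scheme only as an assumption for illustration; the text does not state which combination rule the proposed ensembles use, and the two base learners are hypothetical stand-ins for trained models.

```python
def average_ensemble(models):
    """Return a predictor that averages the outputs of the base models."""
    def predict(x):
        predictions = [model(x) for model in models]
        return sum(predictions) / len(predictions)
    return predict

# Hypothetical stand-ins for two trained base learners.
def base_a(x):
    return 2.0 * x          # tends to underestimate in this toy example

def base_b(x):
    return 2.0 * x + 4.0    # tends to overestimate in this toy example

ensemble = average_ensemble([base_a, base_b])
combined = ensemble(3.0)    # average of 6.0 and 10.0
```

When the errors of the base models point in opposite directions, as in this toy case, they partly cancel in the average.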

Evaluation parameters
The parameters R, RMSE, MAE, MAPE and SI are used to evaluate the accuracy of the proposed estimation models (Chou & Tran, 2018). They are defined by formulas (1) to (6), in which y′ is the estimated value, y is the real value, n is the number of samples, m is the number of performance measures and P_i is the i-th performance measure. SI lies between 0 and 1; the closer SI is to 0, the more accurate the estimation model.
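As a sketch, these measures can be computed as follows under their usual definitions. The SI function assumes a min-max normalization of each measure across the compared models followed by an average over the m measures, with every measure expressed so that lower is better (for example 1 - R in place of R); this form and all data values are assumptions for illustration.

```python
import math

def rmse(y, y_hat):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y))

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)

def mape(y, y_hat):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(y, y_hat)) / len(y)

def pearson_r(y, y_hat):
    """Linear correlation coefficient between real and estimated values."""
    n = len(y)
    mean_y, mean_p = sum(y) / n, sum(y_hat) / n
    cov = sum((a - mean_y) * (p - mean_p) for a, p in zip(y, y_hat))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_p = sum((p - mean_p) ** 2 for p in y_hat)
    return cov / math.sqrt(var_y * var_p)

def synthesis_index(scores):
    """Assumed SI: mean of min-max normalized measures per model.
    scores maps a model name to a list of m lower-is-better measures."""
    m = len(next(iter(scores.values())))
    lows = [min(s[i] for s in scores.values()) for i in range(m)]
    highs = [max(s[i] for s in scores.values()) for i in range(m)]
    result = {}
    for name, measures in scores.items():
        total = 0.0
        for i in range(m):
            span = highs[i] - lows[i]
            total += (measures[i] - lows[i]) / span if span else 0.0
        result[name] = total / m
    return result

# Hypothetical real and estimated quantities.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

# Hypothetical measures for three models, two measures each.
si = synthesis_index({"A": [1.0, 2.0], "B": [3.0, 6.0], "C": [2.0, 6.0]})
```

In this toy comparison model A is best on both measures and receives SI = 0, while model B is worst on both and receives SI = 1.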

Data sources
Data were collected from 80 projects (48 civil projects, 32 commercial projects) from 7 structural design firms in Nigeria with 6-17 years of experience in this field. The design details in the technical drawings of the projects have been approved and are provided in the form of a spreadsheet of structural results (Idowu & Lam, 2020). To help machine learning models generalize well and avoid overfitting and underfitting, the initial data are divided into two parts: a training dataset and a test dataset. To get more accurate estimations, Kohavi (1995) proposed using tenfold cross-validation. In this study, the collected data are randomly divided into two datasets using Weka software: a training dataset of 70% (56 projects) and a test dataset of 30% (24 projects). The training dataset is used to generate the single models (ANN, AANNs, SVRs, KNNs) and the ensemble models (ANN-SVRs, ANN-KNNs, ANN-SVRs-LR-CART, ANN-KNNs-LR-CART), while the test dataset is treated as "unknown" data used to evaluate the effectiveness of the proposed models. Supervised learning and tenfold cross-validation are performed on the training dataset. Jo (2019) showed that the normalization method gives the best performance for machine learning models. The recommended models are ranked by SI: the model with the lowest SI gives the most accurate predictions among the proposed models.
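The partitioning described above can be sketched as follows. This is an illustration of a 70/30 random split followed by tenfold cross-validation on the training part, not a reproduction of the procedure implemented in Weka; the project identifiers, the random seed and the function names are hypothetical.

```python
import random

def split_train_test(items, train_fraction, seed):
    """Shuffle a copy of the items and cut it into training and test parts."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_folds(items, k):
    """Yield (training, validation) pairs so that every item appears in
    exactly one validation fold."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation

# Hypothetical identifiers standing in for the 80 projects.
projects = list(range(80))

train, test = split_train_test(projects, 0.7, seed=42)
fold_pairs = list(k_folds(train, 10))
```

With 56 training projects, six of the ten validation folds hold 6 projects and four hold 5; the 24 test projects are never used inside the cross-validation loop.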

Results
The proposed machine learning models, namely the single models (ANN, AANNs, SVRs, KNNs) and the ensemble models (ANN-SVRs, ANN-KNNs, ANN-SVRs-LR-CART, ANN-KNNs-LR-CART), are used to estimate the quantities during the initiation phase using data collected from 80 projects, and are evaluated through R, RMSE, MAE and MAPE in two stages: training and testing. Both the single models and the ensemble models give fairly good estimations. For the models estimating concrete quantity for beam and floor components, the SVR-Poly model gives the most accurate estimation. For the column concrete estimation model, the ANN-1HL model has a relatively low SI value (SI = 0.066); its MAPE, RMSE and MAE in the testing stage are 32.98%, 30,727 m³ and 19,311 m³, respectively (Figs. 2, 3).
In the foundation concrete quantity estimation model, the ensemble models fully show their strengths: four of the five best proposed algorithms are ensemble models (Figs. 4, 5). Although the MAPE gives good results in the testing stage, the remaining parameters are higher than those in the training stage, as Figure 6 shows.
Fig. 2 The graph shows the scattering of the prediction of floor concrete quantity of SVR-Poly
Fig. 3 The graph shows the scattering of the prediction of column concrete quantity of ANN-1HL
Fig. 4 The graph shows the scattering of the prediction of beam concrete quantity of SVR-Poly
For the formwork quantity estimation models, most models using SVR-Poly give the best results. In particular, the model for estimating formwork quantity for beam components has the lowest SI value (0.033), with MAPE = 34.36% in the training stage and, in the testing stage, MAPE = 25.34% with RMSE and MAE of 285,125 m² and 229,224 m², respectively (Figs. 7, 8, 9). Figure 10 shows that the scatter between the predicted and real beam formwork quantities is very good. In addition, the SVR-Poly model gives good results in the concrete estimation models for beams and floors. For the model that estimates the quantity of formwork for column components, the difference between SVR-Poly and the first-ranked model, ANN-SVR(Poly), is rather small.
Artificial neural network models give good results in the steel quantity estimation models for beams, floors and columns. For the column steel quantity estimation model, the 2-hidden-layer artificial neural network (ANN-2HL) gives the best results, with SI = 0.046. For the foundation steel quantity estimation model, although the SVR-Poly model gives the lowest SI of 0.009, the performance measurement parameters for these models are too high, indicating that the outputs of the proposed models are not good (Figs. 12, 13, 14, 15, 16).

Discussion
Most of the proposed machine learning models give good estimations of the concrete, formwork and steel quantities of the foundation, column, beam and floor components. The SVR-Poly model is prominent among them. It gives outstanding results in estimating the formwork quantity of the foundation, beam and floor components (3/4 models have the lowest SI). It also gives the best estimation of concrete quantity for beams and floors (2/4 of the proposed models have the lowest SI). Among the steel quantity estimation models, SVR-Poly gives the best estimation only for the foundation component. Figs. 17 and 18 show a certain difference in SI values between the first-ranked model and the models that follow it. In the formwork and concrete quantity estimation models (foundation, column, beam and floor), this difference is relatively high, which also reflects the difference in accuracy of the models' estimations.
In the steel quantity estimation models, machine learning models related to ANN are clearly dominant. Specifically, for the model estimating the steel of the column component, the 2-hidden-layer ANN (ANN-2HL) gives the best estimations, while the 1-hidden-layer additive artificial neural network (AANN-1HL) gives the best estimations for the beam and floor components. Figure 17 also shows a certain difference in SI between the first-ranked model and the models that follow it. For the foundation steel estimation model, owing to the specific characteristics of this component, the proposed estimation models have not yet given the expected estimation (Fig. 19).

Conclusion and recommendation
The results obtained show that most of the selected machine learning models give quite good preliminary quantity estimations, even though project information is lacking in the project initiation phase. This study supports the planning of necessary resources in the early phase of the project, when the construction cost (especially the cost of materials) has not yet been detailed. When the preliminary estimated volumes are available during this period, the project manager can be proactive in allocating resources efficiently and planning the next implementation steps, which helps to reduce time, effort and money.
Fig. 9 The graph shows the scattering of the predictions of column formwork estimation model
Fig. 10 The graph shows the scattering of the predictions of beam formwork estimation model
Fig. 11 The fluctuation of MAPE of formwork estimation models in the training and testing stages
Fig. 12 The fluctuation of MAPE of steel estimation models in the training and testing stages
The results also show the user-friendliness of Weka software for a wide range of practitioners in the construction industry who need to estimate preliminary quantities from the project initiation phase. The research results have demonstrated that the machine learning models built with its available algorithms can give impressive results. Weka is a powerful support tool for building high-performance artificial intelligence application models. It will not take long for different audiences in the construction industry to access this tool and apply it to their work.
The results from the study show that the ensemble models or the additive neural network models give the best estimations among the proposed models, specifically for the foundation concrete estimation model (ANN-KNN_(Kernel = 2)-LR-CART), which is a newly discovered model, and for the steel estimation models for beam and floor components (AANN-1HL). Across all the proposed models, the models built on the artificial neural network or combined with ANN give dominant results compared to the other machine learning models. In turn, this can be a suggestion for building different material estimation models.
This study leads to a number of possible future study directions, such as: using a dataset for high-rise civil projects or other types of projects; applying the proposed models to finishing work, MEP, etc.; exploring more single, ensemble or hybrid machine learning models; and adding more variables to the model to improve the prediction.
Acknowledgements
The researchers would like to express their sincere thanks to Ho Chi Minh City University of Technology-Vietnam