Surface roughness prediction method of titanium alloy milling based on CDH platform

Generally, off-line methods are used for surface roughness prediction in titanium alloy milling. However, studies show that these methods have poor prediction accuracy. To resolve this shortcoming, a prediction method based on Cloudera's Distribution including Apache Hadoop (CDH) platform is proposed in the present study. A data analysis and processing platform is designed based on the CDH, which can upload, calculate, and store data in real time. This platform is then combined with the Harris hawk optimization (HHO) algorithm and a pattern search strategy, and an improved Harris hawk optimization (IHHO) method is proposed accordingly. This method is applied to optimize the support vector machine (SVM) algorithm and predict the surface roughness on the CDH platform. The obtained results show that the prediction accuracy of the IHHO method reaches 95%, which is higher than that of the conventional SVM, BAT-SVM, gray wolf optimizer (GWO-SVM), and whale optimization algorithm (WOA-SVM) methods.


Introduction
Studies show that various data including tool type, machine properties, workpiece characteristics, and sensor information affect the cutting process. These continuously updated real-time data conform to the characteristics of big data, such as high speed, high diversity, massive volume, and low value density. Therefore, it is of significant importance to establish an appropriate real-time data processing platform to integrate the input, storage, and calculation of cutting data. The surface roughness of a metal workpiece directly affects its wear resistance, matching stability, fatigue strength, corrosion resistance, sealing, and contact stiffness [1]. Accordingly, accurate and real-time prediction of the surface roughness of machined parts can provide a basis for changing the tool on time.
Currently, extensive investigations have been carried out on data platforms. In terms of the composition and architecture of big data platforms, Gong et al. [2] considered different engineering problems and proposed a big data platform architecture consisting of resource, perception access, service, portal, and application layers. To understand the real-time traffic status of a city, Zhang [3] proposed a big data platform architecture that takes the data resource layer as the core and integrates the perception, application, presentation, and user layers. Ramesh et al. [4] demonstrated that a big data platform architecture should include data source, data platform, data storage, platform presentation, data analysis, and mining layers. Through distributed file storage and computing, the above platforms greatly improve the efficiency of data storage and mining, the data throughput, and the control of the whole cluster [5,6]. It is worth noting that the distributed storage and computing framework solves the two major problems of transmission and processing capacity expansion, thereby making big data processing feasible [7,8]. Currently, a close correlation has been established between big data technology and the mechanical field, thereby promoting the development of mechanical automation [9,10].
Reviewing the literature indicates that big data architectures have been applied in numerous applications to solve practical problems. Xian [11] proposed a parallel machine learning algorithm based on the fine-grained Spark pattern. The results showed that the parallel empirical mode decomposition-support vector machine (EMD-SVM) algorithm under the big data Spark cloud computing framework can improve the accuracy, processing speed, and learning convergence speed of intelligent fault classification. Dong et al. [12] built a Hadoop-Spark big data framework, stored the collected experimental data in blocks as data sets in HDFS for later retrieval, and embedded the CSSAE algorithm model with PySpark to evaluate the milling cutter wear state. Furthermore, Min et al. [13] adopted the distributed parallel processing framework of MapReduce to parallelize the signal feature extraction and modeling algorithms and predicted tool wear driven by big data with high accuracy. Jiang et al. [14] used MapReduce distributed calculation to process the data and transformed the carbon footprint accounting model of part processing into MapReduce functions to realize rapid accounting of the carbon footprint in part processing. Sun [15] established a titanium alloy milling data analysis platform based on the Hadoop architecture; using the internal architecture of the platform, an improved k-means algorithm was programmed and implemented on the big data platform, the corresponding optimized milling parameter set was mined with surface roughness and material removal rate as performance indicators, and the optimization results were analyzed. Duan et al. [16] stored the collected data in the Hadoop distributed file system (HDFS), processed the data with the Spark computing framework combined with a regression algorithm from the Spark ML machine learning library, trained the established model, and finally verified the obtained results.
Accordingly, they predicted the service life of the hydraulic actuator of a high-voltage circuit breaker. Related research has improved the accuracy of data calculation, but most studies analyze historical data, so it is difficult to give effective feedback to the machining process in real time.
Surface roughness refers to the small spacing and tiny peak-trough irregularities on the machined surface. The gradual rise of artificial intelligence provides a new approach to predicting the surface roughness of parts. In this regard, machine learning and deep learning have been applied in numerous applications. Eser et al. [17] applied the artificial neural network (ANN) and response surface method (RSM) to improve the experimental model of surface roughness, optimizing a single-hidden-layer network with a 3-8-1 structure. Xu et al. [18] proposed a new particle swarm optimization (PSO) algorithm to improve the convergence speed and global optimization ability of the trained neural network and SVM models. The obtained results showed that, compared with conventional intelligent models, this method has higher prediction accuracy and lower mean square error. Xie [19] proposed a surface roughness prediction model based on energy consumption and used PSO-SVM to predict the turning surface roughness. The results show that the average relative error of this model is low.
It is worth noting that conventional methods mostly adopt off-line processing, which cannot produce timely and effective feedback to the processed parts. Accordingly, it is intended to design and build a data analysis and process platform based on the CDH, which makes full use of the internal structure of the platform to realize the analysis and calculation of real-time data. The present article is expected to provide a reference to simulate the cutting process.
In this article, a CDH platform is designed to realize data analysis, calculation, and storage by integrating the data input, storage, and computing layers, which include Spark, Kafka, Flume, Hbase, and other architectures. IDEA is used to run the Java programs that complete the task scheduling of the platform architecture and connect each platform layer so that the data flow can run through the platform; the whole process is based on the Java running environment. Then the internal algorithms of the proposed platform are used to model the surface roughness of titanium alloy milling, and a hybrid optimization algorithm (IHHO) is proposed to optimize the SVM modeling. Finally, the accuracy of the proposed algorithm and the superiority of the established platform over conventional methods are evaluated.

Data characteristic analysis
Currently, the automation of manufacturing equipment in the field of cutting manufacturing has been preliminarily realized, and the process information can be collected in real time. In this regard, many data have been collected so far in the machining process: (1) data related to the machine tool, tool accuracy, spindle vibration, and so on, (2) data related to measurement and defect of machining parts such as ablation depth and cutting breadth, and (3) the actual monitoring data of the production such as acoustic emission along X-, Y-, and Z-directions, vibration variables, and acceleration in the cutting process.
The real-time data transmitted by sensors and relevant information collection software have large data scales, diverse data types, and fast production speed. In other words, the produced data by the cutting process has the characteristics of big data.

Requirements of the platform design
CDH platform is a core solution to analyze and process a large number of data generated in the cutting process.
Through the deletion of irrelevant data and the correction of wrong data, the validity of the collected data can be ensured. Then data from different sources are integrated to unify the data format and ensure the temporal correspondence of the data. Accordingly, reasonable and reliable data resources can be provided for subsequent data analyses. Currently, most conventional platforms are built on Apache's open-source Hadoop framework. Figure 1 presents the infrastructure of this framework.
It is worth noting that the platform should realize the whole process from data input through calculation to data storage. Therefore, the platform should be designed from three different aspects: the data input, data storage, and data analysis modules. The data input module must sustain high throughput so that key components can withstand sudden access pressure and do not collapse under sudden overload requests. The data storage module consists of a layer that provides automatic data backup, preventing data loss or damage from disrupting the normal operation of the platform. Finally, the data analysis module processes the input data in real time; accordingly, the data processing layer needs real-time processing capability.

Overall architecture of the platform
According to the overall demand analysis of the platform, the data input, data process, and data storage layers are designed respectively, and the appropriate big data architecture is selected.
To this end, Kafka is initially used to upload the collected machine running data to the platform. Then, HDFS and Spark are used to store, process, or analyze input data. Finally, the Hbase database is used to store and manage the real-time data, and the historical data is stored in the Hive data warehouse. Figure 2 illustrates the platform architecture.

Data input module
For the original data collection layer, the platform either reads real-time data files through Flume and uploads them to Kafka to prepare for subsequent consumption, or transfers data directly through the communication protocol (Object Linking and Embedding for Process Control, OPC) between Labview and Kafka. In the latter method, data storage on the hard disk is eliminated, thereby saving time. For the upper application system, the platform provides users with various algorithms for data processing and stores the processed results in either HDFS or Hbase.
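Kafka itself requires a running broker, so it cannot be demonstrated self-contained here. As a hedged illustration of the producer-consumer decoupling this module relies on, the following Python sketch simulates the hand-off with a standard-library queue; all names and the message format are hypothetical, not the platform's actual schema:

```python
import queue
import threading

def producer(q: queue.Queue, samples):
    """Simulates Flume/Labview pushing sensor readings onto the message queue."""
    for s in samples:
        q.put(s)          # non-blocking hand-off, analogous to a Kafka producer send
    q.put(None)           # sentinel: no more data

def consumer(q: queue.Queue, sink: list):
    """Simulates the platform consuming messages for storage/processing."""
    while True:
        msg = q.get()
        if msg is None:
            break
        sink.append(msg)  # stand-in for writing to HDFS/Hbase

q, sink = queue.Queue(maxsize=1024), []
t = threading.Thread(target=consumer, args=(q, sink))
t.start()
producer(q, [{"t": i, "vib_x": 0.1 * i} for i in range(5)])
t.join()
print(len(sink))  # 5
```

The bounded queue plays the role Kafka plays in the platform: it buffers bursts so the consumer is not overwhelmed by sudden access pressure.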

Data storage module
The data collected during production and processing are stored in different databases through different acquisition software. It should be noted that each data type can only reflect a part of the process, so characterizing the real processing state may involve numerous signals. Accordingly, it is of significant importance to improve the database storage structure and adopt a reasonable storage form for the collected signal data to facilitate subsequent data retrieval and processing. HDFS is a data storage architecture based on Hadoop that applies the distributed principle to facilitate data expansion and fast storage. Meanwhile, Hbase and Hive can be applied as a non-relational database and a data warehouse to store real-time data and historical data, respectively.
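The article does not specify the Hbase table design. As one hedged sketch of how time-series signal rows are commonly keyed for efficient range scans (the naming scheme and identifiers below are assumptions, not the authors' schema):

```python
from datetime import datetime

def make_row_key(machine_id: str, signal: str, ts: datetime) -> str:
    """Composite row key: machine, signal type, then a sortable timestamp.

    Hbase keeps rows sorted lexicographically by key, so this layout groups
    each machine's signals together and orders them by time, which makes
    range scans over a time window cheap.
    """
    return f"{machine_id}#{signal}#{ts.strftime('%Y%m%d%H%M%S%f')}"

key = make_row_key("vdl1000e-01", "vib_x", datetime(2022, 3, 1, 9, 30, 0))
print(key)  # vdl1000e-01#vib_x#20220301093000000000
```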

Data analysis module
The data collected during production and processing form large, continuous, time-stamped sequences. Therefore, they should be collected in real time through Labview software and processed with stream processing technology based on Hadoop. However, abnormal data may appear during data collection and transmission, disturbing a correct reflection of the actual processing state. Accordingly, it is necessary to use a screening mechanism on the monitored parameters to detect possible data errors in real time. In this regard, different data processing algorithms have been proposed; among them, the linked list search algorithm has the best data screening performance, while the moving translation algorithm has the best noise-reduction performance.
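The article names the screening and noise-reduction algorithms without giving their details. A minimal Python sketch, assuming a fixed plausible range per channel for screening and a window-3 moving average as a simple stand-in for noise reduction:

```python
def screen(values, lo, hi):
    """Drop readings outside the physically plausible range [lo, hi]."""
    return [v for v in values if lo <= v <= hi]

def moving_average(values, window=3):
    """Simple moving-average noise reduction over a sliding window."""
    if len(values) < window:
        return values[:]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

raw = [0.12, 0.15, 9.99, 0.14, 0.13, 0.16]   # 9.99 simulates a transmission glitch
clean = screen(raw, 0.0, 1.0)                 # glitch removed
smooth = moving_average(clean)                # smoothed series, length 3
print(clean, smooth)
```

In a real deployment the valid range would come from the sensor's specification rather than a hard-coded constant.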

Network and software deployment
Generally, the proposed data platform is composed of hardware and software. In the hardware part, the resources of three workstations in the laboratory are combined to complete the deployment of the cluster and realize the integration of limited computing resources. Hadoop, Kafka, and Hbase are deployed on top of the cluster. The management and computing node is supported by one server (Dell PowerEdge R740XD with two Xeon Gold 5218 CPUs, 32 physical cores in total, 64 logical computing cores, 128 GB of memory, a 500 GB solid-state drive, and four 10 TB SAS hard disks). Moreover, the data computing nodes are undertaken by two servers (Dell PowerEdge R740 with two Xeon Gold 5218 CPUs, 32 physical cores, 64 logical computing cores, 64 GB of memory, and a 250 GB solid-state drive). All nodes are connected through Gigabit Ethernet. Figure 3 illustrates the network configuration diagram.
From the software point of view, virtual machine software (VMware) and the Linux operating system are installed on each node. Then multiple users and the corresponding user permissions are set; generally, non-root users operate the Hadoop platform. Accordingly, SSH password-free login is set for an ordinary user to ensure the safe transmission of data and realize appropriate information sharing between nodes. Finally, the JDK is installed, the environment is configured, and SecureCRT is installed to realize remote control of the Linux operating system. Specifications of the software are shown in Table 1.

CM deployment
Cloudera Manager (CM) is the management platform of the CDH that can be deployed easily and operate the complete big data software stack centrally. Through the CM, the installation process can be automated, thereby reducing the required time to deploy the cluster. Studies reveal that the CM can provide a real-time running state view of nodes in the cluster and configure it. Moreover, CM can be applied to optimize the cluster performance and improve utilization by including a series of reporting and diagnostic tools.
When installing the CM, it is necessary to deploy the server and agents. In this regard, the Hadoop 100 node is set as the server, and Hadoop 101 and Hadoop 102 are set as agents. It is worth noting that CM provides powerful Hadoop cluster deployment capability and can automatically deploy nodes in batches. After installing the CM platform, the user can enter the CM management system login interface through web port 7180.

Integration and implementation of the CDH platform architecture
After deploying the CM, the management interface of CM can be accessed through the web port 192.168.160.100:7180. Then the big data architecture components, including Kafka, are installed; the installed components are listed in Table 2.
Furthermore, Fig. 4 shows the CM management interface of the platform. Through this interface, the big data frameworks integrated within the platform and the operation dynamics of the cluster can be viewed intuitively. At the same time, the cluster can be diagnosed and repaired according to the log information on the interface.

Surface roughness prediction based on the CDH platform
The CDH platform has three stages, namely data uploading, data processing, and data storing, to predict the surface roughness. In this regard, the Labview data acquisition system is utilized to save the data collected by the machine tools to the host computer. Then the Kafka background monitoring program reads the data files on the host computer. Once the data is transmitted, the program is immediately called to upload the data to the HDFS distributed file storage system of the big data platform. Then the machine learning algorithm inside the platform is called to model, predict, and process the uploaded data. Finally, the prediction data is saved to the Hbase database. Figure 5 shows the data processing flow of the established platform.

Data set construction
In order to predict the surface roughness and ensure the diversity, authenticity, and reliability of the data, a milling experiment on Ti-6Al-4V (TC4) titanium alloy was carried out. The test workpiece consists of a rectangular titanium alloy (TC4) block with dimensions of 100 mm × 50 mm × 50 mm.
Then an alumina-coated finishing tool (ST210-R4-20,030, Xiamen Jinlu Company), an integral carbide vertical milling cutter, was applied to achieve the desired surface finish. The milling process was conducted with single-tooth down milling in a dry cutting mode. The experiment was carried out on a three-coordinate vertical CNC machining center (VDL-1000E, Dalian Machine Tool Company). The surface roughness Ra was measured after each test; Figure 6 shows the instruments and test equipment. During the experiment, surface roughness values were measured while adjusting four cutting parameters (f_z, v_c, a_e, a_p). In this regard, a total of 144 experiments were conducted. Each group of parameters was measured three times, and the average value was taken as the final measurement. Of the collected data, the first 110 groups were set as the training set to train the model, and the last 34 groups were used to verify the model. The experimental parameters of the training and validation sets are shown in Table 3 and Table 4.
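The 144-run design and the 110/34 split can be sketched as follows. The parameter levels below are hypothetical placeholders chosen only so that a full factorial design yields 144 runs; the actual levels are those listed in Tables 3 and 4:

```python
from itertools import product

# Hypothetical levels (3 * 4 * 3 * 4 = 144 runs); see Tables 3 and 4 for the real values.
fz = [0.02, 0.04, 0.06]       # feed per tooth (mm/z)
vc = [40, 60, 80, 100]        # cutting speed (m/min)
ae = [0.2, 0.4, 0.6]          # radial depth of cut (mm)
ap = [0.5, 1.0, 1.5, 2.0]     # axial depth of cut (mm)

runs = list(product(fz, vc, ae, ap))      # full-factorial design matrix
train, valid = runs[:110], runs[110:]     # first 110 groups train, last 34 validate
print(len(runs), len(train), len(valid))  # 144 110 34
```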

Data transmission module
The Kafka message queue is introduced into the internal structure of the cluster. The background monitor watches the message queue continuously. Once a message is detected and stored, the relevant information is immediately collected and the program is called to upload the signal file to the platform to facilitate subsequent operations. When the data is successfully uploaded to the platform, a message is displayed on the console, indicating that the upload has completed successfully.

Data processing module
In the previous stage, the milling test data was uploaded to the platform, where the internal support vector machine (SVM) algorithm is used to model and predict the surface roughness. In order to obtain a more accurate prediction of the surface roughness, the pattern search algorithm is combined with the Harris hawk optimization (HHO) algorithm, and an improved hybrid optimization (IHHO) algorithm is proposed. Then the proposed IHHO algorithm is applied to optimize the parameters of the SVM and predict the surface roughness. The proposed algorithm is compared with SVM optimized by BAT, GWO, and WOA, and indexes such as the mean square error, mean absolute error, and curve fitting degree R2 are introduced to demonstrate the superiority of the IHHO algorithm.

Harris hawk optimization algorithm
HHO is a swarm intelligence optimization algorithm proposed by Heidari et al. [20] to simulate the predatory behavior of Harris hawks. Fu et al. [21] combined the sine-cosine algorithm (SCA) and a cyclic variation strategy and proposed an improved HHO algorithm. Then, the MSCAHHO algorithm was applied to optimize the parameters of an SVM, and, finally, the optimized SVM model was used for fault classification, achieving reasonable diagnosis results. Tayab et al. [22] applied the Harris hawk optimization algorithm to a feedforward neural network. The results showed that, compared with the conventional artificial neural network (ANN) based on particle swarm optimization (PSO), the least-squares support-vector machine based on the PSO, and the neural network based on BP, the mean absolute error of this method is reduced by 33.30%, 49.54%, and 60.76%, respectively.
The implementation process of the HHO algorithm can be mainly divided into the exploration stage, the transition stage from exploration to development (exploitation), and the development stage.
In the exploration stage, Harris hawks randomly perch at some locations, wait, and detect prey according to the following two strategies:

$$X(t+1)=\begin{cases} X_{rand}(t)-r_1\left|X_{rand}(t)-2r_2X(t)\right|, & q\ge 0.5\\ \left(X_{rabbit}(t)-X_m(t)\right)-r_3\left(LB+r_4\left(UB-LB\right)\right), & q<0.5 \end{cases}$$

where X(t + 1) is the position of individuals in the next iteration, X_rand(t) is a randomly selected position, X_rabbit(t) is the position of the individual with the optimal fitness, r_1, r_2, r_3, r_4, and q are random numbers between 0 and 1, and LB and UB are the lower and upper bounds of the search variables. Moreover, X_m(t) is the average position:

$$X_m(t)=\frac{1}{N}\sum_{i=1}^{N}X_i(t)$$

The transition stage from exploration to development: the HHO algorithm switches between different exploitative behaviors according to the escape energy of the prey, which can be expressed in the form below:

$$E=2E_0\left(1-\frac{t}{T}\right)$$

where E_0 is the initial energy (a random number between -1 and 1) and T is the maximum number of iterations. When the escape energy |E| ≥ 1, exploration is carried out; when |E| < 1, development is carried out.
Development phase: in the HHO algorithm, four possible strategies are proposed to simulate the attack phase, depending on the escape energy |E| and a random number r. When |E| ≥ 0.5, a soft siege is carried out, while a hard siege is carried out for |E| < 0.5.
(1) When r ≥ 0.5 and |E| ≥ 0.5, the soft siege strategy is as follows:

$$X(t+1)=\Delta X(t)-E\left|JX_{rabbit}(t)-X(t)\right|$$

$$\Delta X(t)=X_{rabbit}(t)-X(t)$$

where ΔX(t) is the difference between the optimal individual and the current individual, J is a random number between 0 and 2, and r is a random number between 0 and 1.
(2) When r ≥ 0.5 and |E| < 0.5, the hard siege strategy can be expressed as follows:

$$X(t+1)=X_{rabbit}(t)-E\left|\Delta X(t)\right|$$

(3) When r < 0.5 and |E| ≥ 0.5, the soft siege with progressive rapid dives can be written in the form below:

$$Y=X_{rabbit}(t)-E\left|JX_{rabbit}(t)-X(t)\right|$$

$$Z=Y+S\times LF(D)$$

$$X(t+1)=\begin{cases} Y, & F(Y)<F(X(t))\\ Z, & F(Z)<F(X(t)) \end{cases}$$

where S is a D-dimensional random vector and LF is the Levy flight function. (4) When r < 0.5 and |E| < 0.5, the hard siege with progressive rapid dives is as follows:

$$Y=X_{rabbit}(t)-E\left|JX_{rabbit}(t)-X_m(t)\right|$$

$$Z=Y+S\times LF(D)$$

$$X(t+1)=\begin{cases} Y, & F(Y)<F(X(t))\\ Z, & F(Z)<F(X(t)) \end{cases}$$
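The four siege strategies and the escape-energy schedule can be sketched as a compact HHO loop. This is a hedged illustration on a sphere test function, not the authors' implementation; the population size, iteration count, and Levy-flight constants are conventional choices:

```python
import math
import numpy as np

def levy(dim, rng, beta=1.5):
    """Levy flight step used in the progressive rapid-dive strategies."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma = (num / den) ** (1 / beta)
    u, v = rng.normal(0, sigma, dim), rng.normal(0, 1, dim)
    return 0.01 * u / np.abs(v) ** (1 / beta)

def hho(f, lb, ub, n=20, dim=2, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, (n, dim))
    best = min(X, key=f).copy()                      # current prey position
    for t in range(iters):
        Xm = X.mean(axis=0)
        for i in range(n):
            E = 2 * rng.uniform(-1, 1) * (1 - t / iters)   # escape energy
            r, J = rng.random(), 2 * (1 - rng.random())
            if abs(E) >= 1:                          # exploration
                if rng.random() >= 0.5:
                    Xr = X[rng.integers(n)]
                    X[i] = Xr - rng.random() * np.abs(Xr - 2 * rng.random() * X[i])
                else:
                    X[i] = (best - Xm) - rng.random() * (lb + rng.random() * (ub - lb))
            elif r >= 0.5 and abs(E) >= 0.5:         # soft siege
                X[i] = (best - X[i]) - E * np.abs(J * best - X[i])
            elif r >= 0.5:                           # hard siege
                X[i] = best - E * np.abs(best - X[i])
            else:                                    # sieges with rapid dives
                ref = X[i] if abs(E) >= 0.5 else Xm
                Y = best - E * np.abs(J * best - ref)
                Z = Y + rng.random(dim) * levy(dim, rng)
                if f(Y) < f(X[i]):
                    X[i] = Y
                elif f(Z) < f(X[i]):
                    X[i] = Z
            X[i] = np.clip(X[i], lb, ub)
            if f(X[i]) < f(best):
                best = X[i].copy()
    return best, f(best)

sphere = lambda x: float(np.sum(np.asarray(x) ** 2))
best, val = hho(sphere, -5.0, 5.0)
print(f"best fitness: {val:.2e}")
```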

Improved Harris hawk optimization algorithm (IHHO)
Pattern search method [23] (also known as the Hooke-Jeeves algorithm) is a direct search algorithm that does not rely on derivatives. Studies show that this algorithm has a good performance at solving multidimensional optimization problems. Moreover, it has a simple structure, high accuracy, and strong local searching ability. The main purpose of this algorithm is to find the best direction of function value decline to search and get the optimal solution.
The pattern search method moves the current search point according to a fixed pattern and step size to find a feasible descent direction. The algorithm consists of two alternating searches: the axial search and the pattern search. The axial search is carried out along the n coordinate axes to determine a new base point and a direction in which the function value decreases. The pattern search is then carried out along the direction connecting two adjacent base points, trying to make the function value decline faster [24]. The optimization steps of the improved Harris hawk optimization (IHHO) algorithm are as follows: (1) Population initialization. Each individual is initialized according to the upper and lower bounds of each dimension of the search space.
(2) Calculating the initial fitness. The position of the individual with the best fitness is set as the current prey position.
(3) Updating the location. The escape energy of the prey is updated by the escape energy equation, and then the corresponding position update strategy (exploration or development behavior) is performed according to the escape energy and the generated random number. (4) Calculating the fitness. The fitness after the position update is calculated and compared with that of the prey; if the updated position is better, it is taken as the new prey position. (5) Steps 3 and 4 are repeated until the number of iterations exceeds the maximum number of iterations; the current prey position is then taken as the estimated position of the target. (6) A new solution is obtained by pattern search around the current optimal individual; compared with the optimal solution of the current population, the better one is retained.
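Step (6) hinges on the pattern search refinement. A minimal Hooke-Jeeves sketch, where the initial step size, shrink factor, and tolerance are illustrative choices rather than the authors' settings:

```python
def pattern_search(f, x0, step=0.5, shrink=0.5, tol=1e-6, max_iter=200):
    """Hooke-Jeeves: axial exploration around a base point, then a pattern
    move along the direction connecting the last two base points."""
    def explore(x, s):
        x = list(x)
        for i in range(len(x)):            # axial search on each coordinate
            for d in (+s, -s):
                trial = x[:]
                trial[i] += d
                if f(trial) < f(x):
                    x = trial
                    break
        return x

    base = list(x0)
    for _ in range(max_iter):
        new = explore(base, step)
        if f(new) < f(base):
            # pattern move: jump along the improving direction, then re-explore
            pattern = [2 * n - b for n, b in zip(new, base)]
            cand = explore(pattern, step)
            base = cand if f(cand) < f(new) else new
        else:
            step *= shrink                 # no improvement: shrink the step
            if step < tol:
                break
    return base

sphere = lambda x: sum(v * v for v in x)
refined = pattern_search(sphere, [1.3, -0.7])
print(sphere(refined))  # ~0
```

In the IHHO, `x0` would be the best individual found by the HHO stage, and the refined point replaces it only if its fitness is better.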

Construction of the surface roughness prediction model based on IHHO-SVM
SVM is a machine learning technique developed from statistical learning theory. It has superior characteristics, including a simple structure, good generalization ability, and global optimality. Compared with other similar algorithms, SVM is a small-sample learning algorithm, which simplifies common problems such as classification and regression. Accordingly, SVM has been widely applied as a data classifier, where the result is an optimal classification hyperplane. In this method, the given data should not be misclassified; that is, the nearest data points distributed on both sides of the hyperplane should have the largest distance from the plane. The hyperplane can be mathematically expressed in the form below:

$$f(x)=\omega^{T}\varphi(x_i)+b$$

where ω, φ(x_i), and b denote the weight vector, mapping function, and offset, respectively.
When the data are linearly separable, the classification accuracy increases as the classification margin increases. In this regard, the following optimization problem should be solved to maximize the margin:

$$\min_{\omega,b}\ \frac{1}{2}\left\|\omega\right\|^{2}\quad \text{s.t.}\quad y_i\left(\omega^{T}\varphi(x_i)+b\right)\ge 1,\ i=1,2,\ldots,N$$

Therefore, the optimal hyperplane problem is transformed into a mathematical optimal-value problem. In the present study, the Lagrange dual transformation is applied for the optimization, and the following expression is obtained [25]:

$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j \varphi(x_i)^{T}\varphi(x_j)$$

where α_i ≥ 0, i = 1, 2, …, N, denote the Lagrange coefficients. Taking the partial derivatives with respect to ω and b and setting them equal to 0 yields the constraint condition:

$$\sum_{i=1}^{N}\alpha_i y_i=0$$

Finally, the classification function can be expressed in the form below [25]:

$$f(x)=\operatorname{sign}\left(\sum_{i=1}^{N}\alpha_i y_i K\left(x_i,x\right)+b\right)$$

Then the SVM can complete linear and nonlinear analyses of the data by adopting an appropriate kernel function K(x_i, x_j). Conventional kernel functions include the linear kernel, the polynomial kernel, and the radial basis function (RBF) kernel:

$$K\left(x_i,x_j\right)=\exp\left(-\frac{\left\|x_i-x_j\right\|^{2}}{2\delta^{2}}\right)$$

It should be noted that the RBF kernel has only one parameter δ, which makes it easier to select the optimal value. Accordingly, the radial basis kernel function is selected in the present study. However, some variables in the training of the SVM model affect the training accuracy; among them, the most important parameters are the penalty factor C and the kernel function parameter δ. In order to ensure the accuracy of model training, it is necessary to use an appropriate intelligent optimization algorithm to optimize the SVM and seek the optimal parameter combination.
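The RBF kernel with its single width parameter δ can be evaluated in a few lines of numpy; the array contents below are illustrative only:

```python
import numpy as np

def rbf_kernel(X1, X2, delta=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * delta^2)).

    delta is the single width parameter that, together with the penalty
    factor C, the optimizer has to tune during SVM training.
    """
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return np.exp(-d2 / (2 * delta ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel(X, X, delta=1.0)
print(K[0, 0], K[0, 1])  # 1.0 on the diagonal, exp(-0.5) off it
```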
The performed investigations demonstrate that the improved Harris hawk optimization (IHHO) algorithm has a strong global searching ability. Consequently, applying the IHHO algorithm to search C and δ can avoid falling into the local optimal solution so that the global optimal solution can be obtained. Figure 7 shows the processing flowchart of the IHHO-SVM algorithm.
It is observed that the original search and optimization method of HHO is used to get the optimal solution, then the pattern search strategy is implemented, and the obtained result is compared with the current optimal solution. The best solution is retained as the output, which is used to build an improved IHHO-SVM model to predict the surface roughness. Figure 8 illustrates the point and line diagrams of the predicted and real values obtained from different schemes. It is observed that compared with other algorithms, the predicted value of the IHHO-SVM algorithm is closer to the experimental value.

Forecast results
In order to characterize the model reliability, the root-mean-squared error (RMSE) and mean absolute error (MAE) are introduced in this article as statistical indicators. These indicators are defined as follows:

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}}$$

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|$$

where n is the number of forecast data points, and y_i and ŷ_i denote the actual and predicted values, respectively. Table 5 presents the RMSE, MAE, and curve fitting degree R2 of the different algorithms. It is observed that the RMSE, MAE, and R2 of the optimized SVM are better than those of the standard SVM. More specifically, the RMSE of IHHO-SVM is 0.0225, the MAE is 0.0186, and R2 is 0.94. Compared with the other optimization algorithms, the prediction error and fitting accuracy are significantly improved. The results show that applying the improved HHO algorithm to optimize the SVM model is an effective scheme to improve the prediction accuracy of surface roughness.
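The two indicators, together with R2, can be computed as follows; the roughness values below are hypothetical numbers used only to exercise the formulas:

```python
def metrics(y, y_hat):
    """RMSE, MAE, and R^2 on paired actual/predicted values."""
    n = len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    rmse = (ss_res / n) ** 0.5
    mae = sum(abs(a - p) for a, p in zip(y, y_hat)) / n
    mean_y = sum(y) / n
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2

# Hypothetical Ra values (um), purely illustrative
y     = [0.42, 0.55, 0.61, 0.48]
y_hat = [0.44, 0.53, 0.60, 0.50]
rmse, mae, r2 = metrics(y, y_hat)
print(round(rmse, 4), round(mae, 4), round(r2, 3))  # 0.018 0.0175 0.937
```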

Data storage module
In the data processing module, the predicted value of surface roughness is obtained, and the predicted value is stored in the Hbase database using the platform integrated database architecture. Meanwhile, the stored data in the Hbase can be queried automatically to facilitate subsequent use.

Conclusion
In the present study, the CDH platform is used to predict the surface roughness of titanium alloy milling and to resolve the shortcomings of conventional methods, namely low prediction accuracy and the lack of real-time feedback. In this paper, the milling test of titanium alloy is completed, and a rectangular block with a size of 100 mm × 50 mm × 50 mm is cut by down milling. The main work is as follows: (1) A data processing platform based on the CDH is designed and implemented, which uses the advantages and characteristics of the various architectures integrated on the platform to meet the requirements of data transmission, data processing, and data storage. In the studied cases, it was observed that the platform can accurately upload, process, and store data online and has a certain real-time performance.
(2) In order to predict the surface roughness more accurately, the pattern search strategy is combined with the HHO algorithm, and an improved optimization method (IHHO) is proposed to improve the global search ability and avoid falling into local extrema. The proposed algorithm is then applied to optimize the SVM, and the global optimization of the parameters C and δ is carried out to find the optimal solution. The obtained results show that the improved IHHO-SVM algorithm outperforms the other optimization algorithms, and a prediction accuracy of up to 95% can be achieved.
Author contribution XL, CY, QS, and XW contributed to the conception of this research. YS designed and built the CDH platform and optimized the prediction algorithm. YQ participated in the debugging of the algorithm program. SYL and LW gave many constructive suggestions on the revision of the thesis.

Availability of data and material The data sets used or analyzed during the current study are available from the corresponding author on reasonable request.
Code availability Not applicable.

Declarations
Ethics approval The content studied in this article belongs to the field of metal processing and does not involve humans and animals. This article strictly follows the accepted principles of ethical and professional conduct.

Consent to participate
The authors would like to opt in to In Review.