Application and functional simulation of data mining technology in Hadoop cloud platform based on improved algorithm

Data mining algorithms process target data to extract useful hidden information that supports decision making. However, current mining algorithms have shortcomings: processing big data is time-consuming, and some algorithms cannot handle massive data at all. Since traditional data mining technology cannot be used directly in a cloud platform environment, the algorithms must be improved to suit that environment. By analyzing the practical application process of the BP classification algorithm, this paper demonstrates its practicability, analyzes the data mining process based on the Hadoop cloud platform, and explains the development ideas behind the BP classification algorithm. The sources of data mining algorithms supported by cloud computing are also discussed. Finally, for the Hadoop cloud platform, this paper designs the corresponding system architecture and data interface, establishes a suitable test environment for the system, and completes simulation experiment tests with this design. The computation time of the algorithm is proportional to the amount of data, showing a linear relationship. In data mining, the optimized BP algorithm presented here can significantly save resources in terms of spatial features. Through comprehensive analysis and an improved algorithm, this paper designs an optimized operating system based on the Hadoop platform.


Introduction
Cloud computing is the most popular technical term in the IT industry in recent years, but it is not a new technology. In general, cloud computing is a fusion of many technologies, including parallel computing, distributed processing, and virtualization. It combines these technologies, using virtualization to pool a large number of computing and storage resources, and then processes those resources in a distributed fashion to realize parallel computing. Building on the grid computing platform, it solves problems of traditional distributed processing and parallel computing, such as the low hardware utilization of grid platforms and the inability to dynamically scale computing and storage resources. Combined with virtualization technology, it also provides functions such as virtual machine replication and migration, which simplify the management of large-scale computer clusters and improve the overall security of Hadoop platform operation. Because of these excellent characteristics, cloud computing plays an important role in scientific research, medical treatment, communication, astronomy, geography, geological research, social science, and other fields. Today, the global cloud computing ecosystem keeps evolving, its technology develops day by day, and related services and products are being fully optimized. At present, a large number of Internet enterprises have begun to build their own distributed computing cloud platforms, which can be used for data analysis and business processing as well as data interaction with other enterprises (Divakarla and Kumari 2010). By applying data mining technology, users can obtain and use target data, which can effectively assist scientific research and business decision-making. However, the traditional data mining process is a single-threaded mode of operation (Khademolqorani and Hamadani 2013).
The operating characteristic of this mode is to load all data into memory for calculation and analysis. When faced with hundreds of terabytes or even petabytes of large-scale data, traditional data mining algorithms are not only inefficient but may fail altogether, resulting in the embarrassing situation of ''too much data but lack of knowledge'' (Shakya 2021; Sangaiah et al. 2021). This situation can be addressed by high-performance computers and optimized algorithms that provide the computing power required to process large amounts of data; as the data grows, computing power can be increased by using clusters (Shukran et al. 2011). The emergence of the cloud platform makes it possible to process large amounts of data and provides a direction for optimizing and improving traditional big data technology (Lin and Wei 2020; Sangaiah et al. 2023). On this basis, this paper uses the improved algorithm for further research on data processing optimization.

Related work
The literature shows that database functionality has developed by leaps and bounds: beyond simple processing of original documents, systems have evolved from hierarchical database systems to relational database systems (Baohua and Ling 2010). Data processing can achieve efficient online transaction processing through indexing, various organizational models and tools, good user interfaces, and optimized query capability. Databases thus provide an efficient relational technology for the storage, retrieval, and management of large amounts of data (Zhang et al. 2020; Javadpour et al. 2023; Sangaiah et al. 2022a, b). The literature also describes the broad concept of data mining technology, which extracts useful features from target big data; these features are often expressed in fuzzy or incomplete forms, and people do not know in advance that such data exists (Kozak and Boryczka 2016; Sangaiah et al. 2022b). The information obtained by data mining is usually expressed implicitly, and mining is also a process of storing information and knowledge. The literature shows that the cloud computing platform is actually the general framework of the Hadoop system, the former being the basis of the latter. The Hadoop platform, written in Java, is a distributed processing platform based on cloud computing technology, with excellent scalability and extensibility (Nandakumar and Yambem 2014). Its strong deployability on other platforms has accelerated the promotion of cloud computing platforms. The literature improves the SPRINT decision tree classification algorithm so that it can run on the Hadoop platform, effectively reducing repeated calculation (Sharmila and Vethamanickam 2015).
The literature studies the Spark computing system and the random forest algorithm, analyzes their applications in text classification, and makes a simple comparison with parallel text classification in Hadoop MapReduce. The literature also reviews machine learning algorithms under the MapReduce model, tests the performance of the improved algorithms, and demonstrates the possibility of integrating machine learning with the MapReduce model (Neshatpour et al. 2018). Starting from the existing Bayesian classification algorithm in MLlib, the machine learning library of the Spark cloud computing platform, the literature improves it and applies it within the Spark Streaming framework to realize real-time classification of data. A parallel softmax regression algorithm has also been designed in the literature, parallelized together with the remote classification algorithm and implemented on the Spark cloud computing platform (Wang et al. 2015); to exercise the system's computing power, it was applied to a loss (churn) prediction system for telecom customers. The literature shows that App Engine is a web application hosting platform based on the Google cloud computing platform, for which Google maintains its own virtual machine clusters internally (Gupta et al. 2020; Javadpour et al. 2023). The platform rewrites the execution of the underlying web service code at run time, so that applications calling the web service automatically perform distributed computing. As long as a network application is created on the platform, it can easily perform distributed data computing and storage, and thus has a certain scalability (Xian 2020). Design experiments have proved that the algorithm can be applied to the field of data mining and has a certain scalability. At the same time, the Mahout open-source project is committed to transforming traditional data mining algorithms into MapReduce models, and has implemented various MapReduce mining algorithms (Alham et al. 2011).
Data mining classification algorithm and its improvement

BP classification algorithm application
Classification is a data mining technique that classifies a target data set to find the correlations and variance among a large number of samples, enabling deeper mining of the data. At present, classification algorithms are widely used in many industries and fields at home and abroad. In information security, classification algorithms can be used to monitor and identify Trojans. In finance and banking, classification can sort customers by their file information, reasonably verify their credit ratings, and reduce the business risk of bank credit. In health care, classification algorithms can group patients according to age, condition, family income, and so on, reasonably evaluate each patient's condition and medication, implement more targeted medical measures, and promote early recovery. In retail, classification algorithms can analyze commodity sales and customer purchasing-power data across regions; classified statistics yield a personalized segmentation of customers, making a store's distribution strategy more focused and targeted.

Theoretical basis of BP classification algorithm
The key to this information propagation process is the updating of each layer's weights and the continuous adjustment of the neural network structure. The gradient descent method keeps the output error within a certain range, below a preset threshold, completing one full cycle.

Forward propagation process. The induced local field of neuron $j$ is

$$v_j(n) = \sum_i w_{ji}(n)\, y_i(n) \quad (1)$$

and its output is

$$y_j(n) = \varphi_j\big(v_j(n)\big) \quad (2)$$

The output error signal of this neuron is

$$e_j(n) = d_j(n) - y_j(n) \quad (3)$$

To make the objective continuously differentiable, the mean square error is minimized. The instantaneous error of neuron $j$ is defined as $\tfrac{1}{2} e_j^2(n)$; accumulating the error values of all output neurons gives the overall error of the network:

$$E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n) \quad (4)$$

where the set $C$ includes all neurons in the output layer. $E(n)$ can be minimized by correcting the weights, which the BP algorithm realizes through the chain rule:

$$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)}\, \frac{\partial e_j(n)}{\partial y_j(n)}\, \frac{\partial y_j(n)}{\partial v_j(n)}\, \frac{\partial v_j(n)}{\partial w_{ji}(n)}$$

Differentiating $E(n)$ with respect to $e_j(n)$ in Eq. (4) gives $e_j(n)$; differentiating $e_j(n)$ with respect to $y_j(n)$ in Eq. (3) gives $-1$; differentiating $y_j(n)$ with respect to $v_j(n)$ in Eq. (2) gives $\varphi_j'(v_j(n))$; and differentiating $v_j(n)$ with respect to $w_{ji}(n)$ in Eq. (1) gives $y_i(n)$. Combining these, Formula (9) is obtained:

$$\frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n)\, \varphi_j'\big(v_j(n)\big)\, y_i(n) \quad (9)$$

Therefore the correction of the synaptic weight connecting neuron $i$ to neuron $j$, $\Delta w_{ji}(n)$, is defined according to the LMS (delta) rule as

$$\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n), \qquad \delta_j(n) = e_j(n)\, \varphi_j'\big(v_j(n)\big)$$

That is, the weight correction equals the product of the learning rate $\eta$, the local gradient $\delta_j(n)$, and the output signal $y_i(n)$ of neuron $i$.
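As an illustration of the delta rule just derived, the following minimal Python sketch performs one forward pass and one weight update for a single sigmoid output neuron. The function name and the toy values are ours, not part of the original system:

```python
import numpy as np

def bp_update(w, x, d, eta=0.1):
    """One forward/backward pass for a single sigmoid output neuron.

    Implements Dw_ji = eta * delta_j * y_i, where
    delta_j = e_j * phi'(v_j), as in the derivation above.
    """
    v = w @ x                      # induced local field v_j(n), Eq. (1)
    y = 1.0 / (1.0 + np.exp(-v))   # sigmoid output y_j(n), Eq. (2)
    e = d - y                      # error signal e_j(n), Eq. (3)
    delta = e * y * (1.0 - y)      # local gradient: e_j * phi'(v_j)
    return w + eta * delta * x     # delta-rule weight correction

# toy example: one input vector, target output 1.0
w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])
w_new = bp_update(w, x, d=1.0)
```

A single update with a small learning rate moves the neuron's output toward the target; repeating the update over all samples and epochs gives the full training loop.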

Improvement of BP classification algorithm
The BP neural network algorithm has a unique advantage: excellent learning ability. At the same time, it has shortcomings, such as a tendency to fall into local minima. The traditional BP neural network is an optimized local search method that can solve relatively complex nonlinear problems. Its weights are adjusted by local optimization, but this process easily drives the system into a local minimum; when the weights converge to a local minimum, the overall training of the network can fail. The traditional BP neural network is also highly sensitive to the initial network weights: different initial values affect the system differently, influencing its prediction results and producing inaccuracy.
This paper optimizes the algorithm so that it can perform batch learning, which is likewise divided into two computational stages: feedforward and backpropagation. After passing through the input layer, the input sample is transferred to the hidden layer for processing and output. If the output value is far from the target value, backpropagation correction is applied: the error is transmitted back to all neurons, each obtains its error signal and is corrected, keeping the transformation within a controllable range.
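A minimal sketch of this batch feedforward/backpropagation cycle, using tanh activations, an appended bias column, and a momentum term as described in this section. All names, dimensions, and learning rates here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_bp_epoch(V, W, X, D, eta=0.05, eta_m=0.3, dV_prev=None, dW_prev=None):
    """One batch epoch of a two-layer BP network.

    X: (P, N+1) inputs with bias column; V: (N+1, J) hidden weights;
    W: (J+1, K) output weights. Updates use a momentum term:
    W += eta * gradW + eta_m * dW_prev.
    """
    # feedforward
    y = np.tanh(X @ V)                           # hidden outputs, (P, J)
    y_b = np.hstack([y, np.ones((len(X), 1))])   # append bias column: (P, J+1)
    o = np.tanh(y_b @ W)                         # network output, (P, K)
    errors = D - o                               # error matrix, (P, K)
    # backpropagation
    delta_o = errors * (1.0 - o**2)              # tanh derivative: o' = 1 - o^2
    gradW = y_b.T @ delta_o                      # (J+1, K) product matrix
    delta_h = (delta_o @ W[:-1].T) * (1.0 - y**2)
    gradV = X.T @ delta_h                        # (N+1, J) product matrix
    dW = eta * gradW + (eta_m * dW_prev if dW_prev is not None else 0.0)
    dV = eta * gradV + (eta_m * dV_prev if dV_prev is not None else 0.0)
    return V + dV, W + dW, dV, dW, float((errors**2).mean())

# toy XOR-like batch: 2 inputs plus a bias column, targets in {-1, +1}
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
D = np.array([[-1], [1], [1], [-1]], dtype=float)
V = rng.normal(0, 0.5, (3, 4))   # N+1 = 3 inputs, J = 4 hidden units
W = rng.normal(0, 0.5, (5, 1))   # J+1 = 5, K = 1 output
dV = dW = None
mses = []
for _ in range(50):
    V, W, dV, dW, mse = batch_bp_epoch(V, W, X, D, dV_prev=dV, dW_prev=dW)
    mses.append(mse)
```

The whole batch is processed with matrix operations rather than per-sample loops, which is what makes the scheme amenable to parallel execution.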
The hidden layer adds a polarization (bias) value, so that for the $(P, J+1)$-dimensional matrix $y$, the final matrix is $y_b = [y, 1]$, which serves as the input matrix of the output layer. If the output-layer neurons are linear, the pseudo-inverse algorithm directly computes the output weight matrix $W$ of dimension $(J+1, K)$:

$$W = y_b^{+}\, d$$

Because the OL neurons are linear, their output is $o = u_o$ and its derivative is $o' = 1$. If the output-layer neurons are nonlinear, the improved batch BP neural network algorithm iteratively updates the output-layer weights $W$. In this case the hyperbolic tangent function can be applied in the output layer: for the $(P, K)$-dimensional output matrix $o = \tanh(u_o)$, the derivative is $o' = 1 - o^2$. For the sigmoid activation function, the output $o$ and its derivative $o'$ are given by

$$o = \frac{1}{1 + e^{-u_o}}, \qquad o' = o\,(1 - o)$$

The network error, a $(P, K)$-dimensional matrix, is the difference between the expected output $d$ and the network output $o$: $\text{errors} = d - o$.

The error signal of each layer's product matrix is multiplied by a scalar learning rate $\eta$. $\nabla V$ and $\nabla W$ denote the product (gradient) matrices of the hidden layer and output layer respectively; $\nabla V$ has dimension $(N+1, J)$, and similarly, for the output layer, $\nabla W$ has dimension $(J+1, K)$. To improve learning efficiency, this paper introduces a momentum factor $\eta_m$, whose contribution is carried over from the previous time step's weight change. The momentum term of the LMS algorithm adjusts the weights of the $W$ and $V$ networks, as shown in Eq. 23:

$$W(t+1) = W(t) + \eta\, \nabla W(t) + \eta_m\, \Delta W(t-1), \qquad V(t+1) = V(t) + \eta\, \nabla V(t) + \eta_m\, \Delta V(t-1) \quad (23)$$

Application of data mining technology in Hadoop cloud platform

Principle of Hadoop cloud platform
In the development process of the Hadoop project, its sub-projects change as the system is continuously updated. After several years of development and updates, version 1.0.0 of the cloud platform achieved functions that earlier versions could not. The project as a whole is called Hadoop, and it includes three sub-projects: HDFS, Hadoop Common, and MapReduce. Its components are shown in Fig. 1. As can be seen from Fig. 1, Hadoop is the general name for the middle- and low-level projects in the system, while the open-source projects relate to the top layer.

Functions of data mining
The function of data mining is to find model types in the mined data. Generally, data mining has two purposes: to determine the internal attributes of all data in the existing database, and to determine the trend of future data. From the current data set, these can be summarized as description and prediction. The types of data models that descriptive and predictive mining can find are as follows: (1) Description of concepts or classes: data can be associated with classes or concepts to summarize some of its internal characteristics. The methods currently used are data characterization and data discrimination. First, collect the data set to be mined by querying the database, then summarize the attributes of the data; the mined data is then classified through retrieval and query. Note that data discrimination requires users to specify the comparison data sets in advance.
(2) Frequent patterns and association models: a frequent pattern is a data object pattern that recurs throughout the data set. Using this model can lead to the discovery of association models in the data set, that is, specific value-based association rules among multiple data objects. Association rules can be not only one-dimensional but also multi-dimensional. (3) Classification and prediction: classification assigns data to classes according to existing classification models, which can be expressed by classification rules, decision trees, mathematical formulas, or neural networks. Naive Bayesian classification, support vector machines, and K-nearest neighbor classification are also used to build classification models. Prediction includes numerical prediction and type prediction: a functional model is established based on the information of known objects to predict the numerical or categorical information of unknown objects. (4) Cluster analysis: divide the data into several clusters so as to minimize the differences within a cluster and maximize the differences between clusters. Unlike classification, the number of clusters is unknown before clustering. Clustering techniques usually include traditional pattern recognition and mathematical classification methods.
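To make the frequent-pattern idea in (2) concrete, here is a naive Python sketch of support-based frequent itemset mining. The exhaustive candidate enumeration and the toy basket data are purely illustrative; real miners such as Apriori or FP-growth prune candidates far more aggressively:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Naive frequent-itemset miner: an itemset is frequent if it
    appears in at least min_support fraction of the transactions."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, size):
            count = sum(1 for t in transactions if set(cand) <= t)
            if count / n >= min_support:
                frequent[cand] = count / n
                found = True
        if not found:   # Apriori property: no larger frequent sets exist
            break
    return frequent

# toy market-basket data
baskets = [{"milk", "bread"}, {"milk", "eggs"},
           {"milk", "bread", "eggs"}, {"bread"}]
freq = frequent_itemsets(baskets, min_support=0.5)
# ("bread", "milk") is frequent: it appears in 2 of the 4 baskets
```

Each frequent itemset can then be split into antecedent and consequent to yield the association rules described above.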

Sources of data mining algorithms supported by cloud computing
Computers and related technologies have been fully integrated into people's daily life in the information age. People can use computers for online shopping, online communication, travel ticket purchases, work, and so on. Different users can use personalized terminals for remote Internet access, drawing on various types of resources in the database and enjoying the powerful functions the server can provide. However, the amount of data produced in various fields also grows exponentially every day, and the data types become more and more complex. Nowadays, in fields such as medicine, mathematics, computing, biology, and chemistry, the amount of data generated every day is huge and complex beyond imagination. The traditional processing method is to use small distributed computer clusters for special data processing, but this approach has limitations and difficulties: when facing data sets that are too large and too complex, the computational cost is too high and the system itself is difficult to maintain. Based on this point, this paper takes the classical K-means data mining algorithm, introduces it onto the MapReduce platform for research, and designs experiments to explore the actual running cost, resource demand, running time, and so on. Implementing a data mining algorithm on a cloud platform requires the support of cloud computing technology; therefore, an open-source parallel data mining platform must be built and developed, based on the MapReduce framework for parallel data processing.
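A single MapReduce-style K-means iteration, as described above, can be sketched as follows. This is a plain-Python illustration of the map/shuffle/reduce structure, not the paper's Hadoop implementation; all names and toy points are ours:

```python
import math
from collections import defaultdict

def kmeans_iteration(points, centroids):
    """One MapReduce-style K-means iteration.

    Map: emit (nearest-centroid-index, point) pairs.
    Shuffle: group the pairs by centroid index.
    Reduce: average the points assigned to each centroid.
    """
    # map phase: assign each point to its nearest centroid
    pairs = []
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: math.dist(p, centroids[i]))
        pairs.append((idx, p))
    # shuffle phase: group values by key
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    # reduce phase: new centroid = mean of its assigned points
    new_centroids = list(centroids)
    for idx, pts in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return new_centroids

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
cents = kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
# cents[0] is the mean of the two lower-left points: (1.25, 1.5)
```

On Hadoop, the map and reduce phases run as distributed tasks over HDFS blocks, and the driver repeats iterations until the centroids stop moving.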

System architecture design
The overall architecture of the system is divided into three layers: the user layer, the service layer, and the bottom layer. Users are the people or developers who use the services the system provides. Users can access the system's user interface through PCs, mobile terminals, and other devices, and developers can use the API provided by the system for secondary development. Whether using the system or developing on it, users need to understand neither the underlying implementation nor data mining itself to obtain the final mining result. The service layer is divided into two parts: the service engine and the mining engine. The service engine provides parallel data mining services through RESTful interfaces, including data management, algorithm, resource management, and logging services. The mining engine is the concrete realization of these services. For data management, data is stored in HDFS, and data upload and download are provided. The algorithm library provides parallel algorithms that can run on the Hadoop platform, including K-means and collaborative filtering, among others. For resource management, the system provides node status and load information. Log management provides log viewing and task downloads. The bottom layer is the specific running environment of the algorithms; here, a Hadoop cluster is used, and different numbers of nodes are opened for users according to specific settings. The structure is shown in Fig. 2.

Data interface design
Data management services provide upload, download and delete data services. In the system, the data required by the data mining algorithm must be stored in HDFS, so the required data must be uploaded to HDFS and managed in HDFS. The results of data mining are stored in HDFS in the form of files, and users can obtain them through the download function.
There are two ways to operate on HDFS files: the DFS command-line interface provided by Hadoop, and the API provided by Hadoop. Here we choose API-based development. Table 1 lists the key interfaces of the data management service.
The algorithm service layer performs data mining on the data provided by the data management service, and divides the algorithm service into data processing, data mining and result analysis.
Data processing service: each algorithm has its own data format. If the data format uploaded by the user does not match or contains noise data, the original data must be processed and converted into usable data. After processing the data, store the data in HDFS to prepare for the next data mining.
Data mining service: use the sorted data to call the corresponding data mining algorithm for operation.
Result analysis: the data mining results are stored in HDFS as files. The results can be analyzed through the corresponding algorithm, and the mining results can also be downloaded for users to process. Different data mining algorithms differ in their data processing, data mining, and result analysis methods. A data mining interface class is therefore defined with three methods: tidy, datamining, and analysis. Every time a new data mining algorithm is added, a class is created to implement this interface and add the algorithm-specific logic.
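The interface described here might look like the following Python sketch. The three method names (tidy, datamining, analysis) come from the text; WordCountMiner is a hypothetical stand-in algorithm used only to exercise the interface:

```python
from abc import ABC, abstractmethod

class DataMiningInterface(ABC):
    """Each new mining algorithm implements these three methods."""

    @abstractmethod
    def tidy(self, raw):
        """Convert raw input into the algorithm's expected format."""

    @abstractmethod
    def datamining(self, data):
        """Run the mining algorithm on the tidied data."""

    @abstractmethod
    def analysis(self, result):
        """Summarize the mining result for the user."""

class WordCountMiner(DataMiningInterface):
    # trivial stand-in algorithm so the interface can be exercised
    def tidy(self, raw):
        return raw.lower().split()

    def datamining(self, data):
        counts = {}
        for w in data:
            counts[w] = counts.get(w, 0) + 1
        return counts

    def analysis(self, result):
        # report the most frequent token
        return max(result, key=result.get)

miner = WordCountMiner()
top = miner.analysis(miner.datamining(miner.tidy("Big data Big mining")))
# the most frequent token is "big"
```

Registering each new algorithm behind the same three-method interface is what lets the service layer invoke any miner uniformly without knowing its internals.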
System test and result analysis

Establishment of test environment
(1) Hardware environment.
The algorithm test scheme is designed as follows: create virtual machines on a server with VMware virtualization software installed, set up six virtual machines in total, comprising four slave nodes and two master nodes, and form them into a cloud computing platform.
The server is configured as follows: Intel(R) Core i7-6700K @ 4.00 GHz processor, 32 GB memory, and 2 TB hard disk space.
The tests in this article are all carried out in a Linux environment. The JDK and Scala language environments must also be installed and configured on each node, and the Eclipse development platform is used for programming. The software configuration of the test cluster is shown in Table 3.
When the cluster platform is built, the node machines need to be assigned host names and network configurations. The relevant configurations are shown in Table 4.

Data set selection
This article uses the standard mushroom data set for testing. The mushroom database is characterized by a dense distribution of frequent sets, and it produces a large number of frequent sets even at high support thresholds.

Analysis of test results
In order to verify the computational efficiency and scalability of the algorithm, the experiment runs 10 GB and 30 GB of data on 5, 10, 15, and 20 computer nodes with support thresholds of 5%, 10%, and 15%. Figure 3 shows the test performance with the 10 GB data, and Fig. 4 shows the test performance with the 30 GB data. The experimental results show that the computation time of the algorithm increases linearly with the amount of data, which reflects how the Hadoop framework partitions data according to block size.

Conclusion
After analyzing the limitations of the BP classification algorithm, this paper optimizes it specifically for the cloud computing environment, demonstrating the feasibility and effectiveness of using data mining algorithms in that environment, solving some practical problems, and extending the work on the Hadoop platform. Due to limited time, this paper only develops and parallelizes the most widely used data mining algorithms.
Deficiencies remain in this research, which need to be further explored and optimized.
Funding The authors have not disclosed any funding.
Data availability Data will be made available on request.

Declarations
Conflict of interest The authors declare that they have no conflict of interests.
Ethical approval This article does not contain any studies with human participants performed by any of the authors.