The tasks of data mining include data extraction, classification, clustering, and association rule discovery. Among them, association rules are mainly used in transactional databases: by mining and analyzing factual data, the associations between various indicators are established so as to obtain valuable information. According to its mode of action, data mining can generally be divided into two categories: predictive mining and descriptive mining. Since its emergence, data mining has penetrated many disciplines and produced many mining techniques and models, each with its own strengths. Figure 3 shows a prototype of a data mining system model.

## 3.1 Regression Analysis Algorithms

Traditional statistics is a science based on empirical data. It uses data samples for analysis and prediction in order to find regularities in the numbers, and then uses these regularities to study objective laws. Statistics has important applications in many fields, such as enterprise applications, scientific research, Web mining, medicine, and finance. The linear regression model can be observed intuitively through a scatter plot (as shown in Fig. 4). As the figure shows, the straight line fitted by the univariate linear regression equation does not pass through all the points; rather, it is the "optimal" straight line, that is, the one for which the sum of the squared vertical deviations from all the points is smallest [21].

The univariate linear regression model is:

$$\left\{ {\begin{array}{*{20}{l}} {Y={\beta _0}+{\beta _1}x+\varepsilon } \\ {\varepsilon \sim N(0,{\sigma ^2})} \end{array}} \right.$$

1
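As a concrete illustration, the "optimal straight line" of formula (1) can be obtained by ordinary least squares. The sketch below fits β₀ and β₁ on synthetic data; the true coefficients (2 and 3) and the noise level are illustrative assumptions, not values from the text:

```python
import numpy as np

# Synthetic data drawn from y = 2 + 3x + noise (illustrative values).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=x.shape)

# Ordinary least squares: minimize the sum of squared residuals.
X = np.column_stack([np.ones_like(x), x])     # design matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # beta = [beta_0, beta_1]
```

On this data the recovered coefficients land close to the generating values, which is exactly the "optimal straight line" property described above.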

The XGBoost algorithm is a classic method in the Boosting family of ensemble techniques. The idea of Boosting is to form a strong classifier by combining a large number of weak classifiers. XGBoost uses CART regression trees as its base learners. A CART regression tree is a binary tree that recursively splits on feature values: if a node is split on the m-th feature of the data with threshold s, samples whose m-th feature value is at most s are assigned to the left subtree, and samples whose value exceeds s are assigned to the right subtree:

$$\begin{array}{*{20}{c}} {{R_1}(m,s)=\{ x|{x^m} \leqslant s\} }&{and}&{{R_2}(m,s)=\{ x|{x^m}>s\} } \end{array}$$

2
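The split of formula (2) can be sketched directly. The function name `split_node` and the representation of samples as feature tuples are illustrative assumptions:

```python
def split_node(samples, m, s):
    """Partition samples on feature index m at threshold s, as in formula (2)."""
    R1 = [x for x in samples if x[m] <= s]  # left subtree:  x^m <= s
    R2 = [x for x in samples if x[m] > s]   # right subtree: x^m >  s
    return R1, R2

data = [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0)]
left, right = split_node(data, m=0, s=2.0)
```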

The XGBoost algorithm grows its model by continually adding trees and feature splits. Each newly added tree learns a function that better fits the residuals of the previous prediction. After K trees have been trained, a sample is predicted from its features: the sample falls into one leaf of every tree, and the leaf scores of all K trees are summed to give the final predicted value for that sample.

$$\widehat {{{y_i}}}=\sum\nolimits_{{k=1}}^{K} {{f_k}({x_i}),\quad {f_k} \in F}$$

3

$$F=\{ f(x)={w_{q(x)}}\} \quad (q:{R^m} \to \{ 1, \ldots ,T\} ,\;w \in {R^T})$$

4

Formula (3) represents the strong prediction model, and formula (4) represents the weak learner model to be combined, where F denotes the space of all CART regression trees, q denotes the tree structure that maps each sample to a leaf, T denotes the number of leaf nodes, and w denotes the vector of leaf weights. In addition to the training loss, the objective function of the XGBoost model introduces a regularization term to improve its generalization ability. The final objective function is expressed as:

$$Obj(\theta )=\sum\nolimits_{{i=1}}^{n} {l({y_i},\widehat {{{y_i}}})} +\sum\nolimits_{{k=1}}^{K} {\Omega ({f_k})}$$

5

$$\Omega ({f_k})=\gamma T+\frac{1}{2}\lambda ||w|{|^2}$$

6

The second term of formula (5) is the regularization term, which penalizes the complexity of the model to safeguard the final learning result and avoid overfitting. The first term is the loss function, which measures the difference between the predicted values and the actual values. The objective function is optimized by additive training: first the first CART tree is optimized, then the second, and so on, until the K-th tree has been optimized:

$${\widehat {{{y_i}}}^{(t)}}=\sum\nolimits_{{k=1}}^{t} {{f_k}({x_i})=} {\widehat {{{y_i}}}^{(t - 1)}}+{f_t}({x_i})$$

7
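A minimal sketch of the additive update in formula (7), using three hypothetical fitted trees represented as plain functions (the tree functions and the input value are illustrative assumptions):

```python
# Hypothetical per-tree predictors f_k, each mapping a feature value to a score.
trees = [lambda x: 0.5 * x, lambda x: 0.25 * x, lambda x: 0.1 * x]

def predict_upto(x, t):
    """Prediction after the first t trees: y_hat^(t) = sum_{k=1..t} f_k(x)."""
    return sum(trees[k](x) for k in range(t))

x = 2.0
y_prev = predict_upto(x, 2)  # y_hat^(t-1)
y_now = predict_upto(x, 3)   # y_hat^(t) = y_hat^(t-1) + f_t(x)
```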

At step t, the newly added CART tree $f_t(x_i)$ is optimal in the sense that it minimizes the objective function given the previous prediction $\hat{y}_i^{(t-1)}$, namely:

$$\begin{gathered} Ob{j^{(t)}}=\sum\nolimits_{{i=1}}^{n} {l({y_i},{{\widehat {{{y_i}}}}^{(t)}})} +\sum\nolimits_{{k=1}}^{t} {\Omega ({f_k})} \hfill \\ =\sum\nolimits_{{i=1}}^{n} {l({y_i},{{\widehat {{{y_i}}}}^{(t - 1)}}+{f_t}({x_i}))} +\Omega ({f_t})+constant \hfill \\ \end{gathered}$$

8

When the structure of the t-th CART regression tree is fixed, the optimal weight of each leaf node j is obtained as follows, where G_j and H_j denote the sums, over the samples falling into leaf j, of the first- and second-order gradients of the loss function:

$$w_{j}^{*}= - \frac{{{G_j}}}{{{H_j}+\lambda }}$$

9

The corresponding optimal value of the objective function is

$$Ob{j^*}= - \frac{1}{2}\sum\nolimits_{{j=1}}^{T} {\frac{{G_{j}^{2}}}{{{H_j}+\lambda }}} +\gamma T$$

10
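Formulas (9) and (10) can be checked numerically for a single leaf. The sketch below assumes the squared loss l = ½(y − ŷ)², for which the first-order gradient is g_i = ŷ_i − y_i and the second-order gradient is h_i = 1; the sample values and regularization parameters are illustrative:

```python
lam, gamma = 1.0, 0.1              # regularization parameters (illustrative)

# Samples falling into one leaf j: targets and the predictions made
# before this tree is added.
y_true = [3.0, 2.5, 4.0]
y_prev = [2.0, 2.0, 2.0]

G = sum(yp - yt for yt, yp in zip(y_true, y_prev))  # sum of first-order gradients
H = float(len(y_true))                              # sum of second-order gradients (h_i = 1)

w_star = -G / (H + lam)                             # formula (9): optimal leaf weight
obj_leaf = -0.5 * G**2 / (H + lam) + gamma * 1      # formula (10), contribution of one leaf
```

Because G is negative here (the previous predictions undershoot), the optimal leaf weight comes out positive, pushing the prediction upward as expected.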

XGBoost improves on the plain CART base classifier by also supporting linear base learners, so it can handle both classification and regression problems and therefore fits more uses and scenarios. In addition, through automatic learning XGBoost assigns samples with missing feature values to a default split direction instead of simply skipping them, reducing data loss and improving the quality and quantity of usable data.

## 3.2 Clustering Analysis Algorithm

Clustering is the process of using a computer to automatically divide data objects into groups according to some criterion, so that objects within a group have similar attributes or relationships. The grouping relies only on the attributes of the analyzed objects themselves to distinguish objects and measure their similarity. At present, cluster analysis is widely used in speech recognition, image segmentation, and other fields, and in business analysis it can effectively identify consumer groups with similar consumption patterns. Simply put, a clustering algorithm partitions a population in order to find structure in the data, making objects as similar as possible within classes and as different as possible between classes. A clustering algorithm generally goes through four steps: feature acquisition, similarity calculation, grouping, and result presentation. There are many clustering algorithms, which fall into several categories; the more common ones are partitioning methods, grid methods, hierarchical methods, and model-based methods [22].

The K-means algorithm is a partitioning method. It divides the data set into a finite number of clusters through cluster analysis, such that the data within each cluster share similar characteristics. Since it was proposed, the K-means algorithm has been applied to many aspects of social life because of its simplicity and flexibility, but its shortcomings have also gradually been revealed in use. The main shortcomings of the K-means clustering algorithm are: (1) the value K must be fixed in advance, before the experiment starts; (2) the clustering result is easily affected by the choice of initial cluster centers; (3) when the class attributes are not distinct, the classification result is poor and the algorithm easily converges to a local optimum; (4) on large-scale data sets, the amount of computation and the time overhead are large.

The K-means algorithm is the simplest and most commonly used clustering algorithm. It uses distance as the measure of similarity: the closer two objects are, the more similar they are. Its clustering criterion function is the sum of squared errors:

$$V=\sum\nolimits_{{i=1}}^{k} {\sum\nolimits_{{{x_j} \in {S_i}}} {{{({x_j} - {u_i})}^2}} }$$

11
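The criterion of formula (11) is easy to compute directly. The sketch below evaluates the sum of squared errors for a toy partition into two clusters (the data points and assignments are illustrative assumptions):

```python
import numpy as np

# Toy data already partitioned into k = 2 clusters.
S = {0: np.array([[1.0, 1.0], [1.2, 0.8]]),
     1: np.array([[5.0, 5.0], [4.8, 5.2]])}

def sse(clusters):
    """Sum of squared errors V over all clusters, as in formula (11)."""
    total = 0.0
    for pts in clusters.values():
        u = pts.mean(axis=0)              # centroid u_i of the cluster
        total += ((pts - u) ** 2).sum()   # squared deviations from the centroid
    return total

V = sse(S)
```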

The objective function of the classical K-means algorithm is given by formula (11), which seeks the optimal partitioning scheme. When attribute weights are introduced into the K-means algorithm, the objective function changes. The new objective function is:

$$V(P,\lambda )=\sum\nolimits_{{i=1}}^{K} {\sum\nolimits_{{l=1}}^{p} {(\lambda _{i}^{l}\sum\nolimits_{{{x_j} \in {S_i}}} {{{(x_{j}^{l} - u_{i}^{l})}^2}} )} }$$

12

where the centroid of cluster S_i is

$${u_i}=(u_{i}^{1},u_{i}^{2}, \cdots ,u_{i}^{p}),\quad i=1,2, \cdots ,K$$

13

To derive the weight of each attribute, let

$${\varphi _l}=\sum\nolimits_{{{x_j} \in {S_i}}} {{{(x_{j}^{l} - u_{i}^{l})}^2}}$$

14

so that formula (12) can be rewritten as:

$$V(P,\lambda )=\sum\nolimits_{{i=1}}^{K} {\sum\nolimits_{{l=1}}^{p} {(\lambda _{i}^{l}{\varphi _l})} }$$

15

When the sum of the distances between the samples in each class and their centroid is smallest and no longer changes, V(P, λ) attains its minimum, so the Lagrange multiplier method can be used to compute the weights under the normalization constraint. The resulting formula for the attribute weights is:

$$\lambda _{i}^{l}=\frac{u}{{{\varphi _l}}}=\frac{{{{(\prod\limits_{{h=1}}^{p} {(\sum\nolimits_{{{x_j} \in {S_i}}} {{{(x_{j}^{h} - u_{i}^{h})}^2}} )} )}^{\frac{1}{p}}}}}{{\sum\nolimits_{{{x_j} \in {S_i}}} {{{(x_{j}^{l} - u_{i}^{l})}^2}} }}$$

16
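Formula (16) sets each attribute's weight to the geometric mean of the per-attribute scatters divided by that attribute's own scatter, so attributes with small within-cluster scatter receive large weights. A minimal sketch, assuming illustrative scatter values φ_l for one cluster:

```python
import math

# Per-attribute within-cluster scatters (phi_l) for one cluster, p = 3 attributes.
phi = [4.0, 1.0, 0.25]            # illustrative values
p = len(phi)

# Numerator u of formula (16): geometric mean of the scatters.
u = math.prod(phi) ** (1.0 / p)

# Weight of each attribute: small scatter -> large weight.
weights = [u / phi_l for phi_l in phi]
```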

The basic process of the improved K-means is as follows:

(1) Randomly select K samples from the data set as the initial centroids;

(2) Calculate the Euclidean distance between each sample and each centroid, find the nearest centroid, and assign the sample to that centroid's class;

(3) Calculate the weight of each attribute of each class according to formula (16);

(4) After obtaining the attribute weights of each class, recalculate the centroid of each class;

(5) Evaluate the clustering criterion function; if its value has converged, stop; otherwise, return to step (2) and repeat.
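The steps above can be sketched as follows. This is a minimal NumPy implementation under some stated assumptions: distances are weighted squared Euclidean distances, a fixed number of iterations stands in for the convergence test of step (5), and a small constant is added to the scatters to avoid division by zero; the function name `weighted_kmeans` is illustrative:

```python
import numpy as np

def weighted_kmeans(X, K, n_iter=20, seed=0):
    """Sketch of the improved K-means: attribute weights per cluster are
    recomputed from formula (16) after each assignment step."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    centroids = X[rng.choice(n, size=K, replace=False)].astype(float)  # step (1)
    weights = np.ones((K, p))            # start from uniform attribute weights
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Step (2): assign each sample to the nearest centroid under the
        # weighted squared Euclidean distance.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2
             * weights[None, :, :]).sum(axis=-1)
        labels = d.argmin(axis=1)
        for i in range(K):
            pts = X[labels == i]
            if len(pts) == 0:
                continue
            centroids[i] = pts.mean(axis=0)                   # step (4)
            # Step (3): per-attribute scatter phi and formula-(16) weights.
            phi = ((pts - centroids[i]) ** 2).sum(axis=0) + 1e-12
            weights[i] = phi.prod() ** (1.0 / p) / phi
    return labels, centroids, weights
```

On well-separated data the per-cluster weights end up emphasizing the attributes with the smallest within-cluster scatter, which is the intended effect of the weighting.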

A clustering algorithm can automatically group samples that lack a classification basis, which matches the hospital's task of classifying departments. First, the hospital has a wide variety of data for evaluating departments, but there are too many evaluation items to rank the departments accurately by hand. Second, only a few departments can be accurately graded by human judgment; for example, the work intensity and technical content of the ICU are unanimously recognized, but most departments are subjectively too close to one another to be judged accurately. Third, the departments' evaluation data can be quantified and readily grouped by analyzing each feature, which makes them suitable for classification with the K-means algorithm. Combining these three points, this paper uses the K-means algorithm to cluster the ungraded departments, then draws on the subjective judgment of professionals to identify the representative departments within the clusters, and finally grades the departments.