The K-means algorithm first randomly selects K objects as cluster centers, then assigns each sample point to the cluster whose centroid is closest under the Euclidean distance, and finally recomputes each centroid as the mean of the sample points in its cluster, repeating until the results converge.
The specific steps of the algorithm are as follows:

Randomly generate K centroids;

Calculate the distance between every sample point and each centroid, and assign each point to the cluster of its nearest centroid. The distance between a point and a centroid is the Euclidean distance, given by formula (1);
$$d\left({x}_{t},{z}_{k}\right)=\sqrt{{\sum }_{i=1}^{N}{\left({x}_{ti}-{z}_{ki}\right)}^{2}}\tag{1}$$
Here xt is a sample point from the sample set {x1, x2, ..., xt}, zk is one of the K centroids {z1, z2, ..., zk}, and N is the number of features of each sample; each sample point is assigned to the class of its closest centroid.
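Formula (1) can be sketched in a few lines of Python; the function name below is chosen for illustration:

```python
import math

def euclidean_distance(x, z):
    """Euclidean distance between a sample point x and a centroid z,
    both given as equal-length coordinate sequences (formula (1))."""
    return math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

# Example: distance between a 2-D sample point and a centroid
print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))  # 5.0
```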

Recompute the K centroids as the mean of all points assigned to each cluster;

Repeat steps 2 and 3 until the sum of the distances between all sample points and their corresponding centroids is minimized; the results converge after multiple iterations.
The flow chart of the K-means algorithm is shown in Fig. 1.
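The steps above can be sketched in pure Python as follows; this is a minimal illustration (function and variable names are chosen here, and the data is invented for the example), not a production implementation:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means sketch: random initial centroids, Euclidean
    assignment, mean-based centroid update, stop on convergence."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: K random centroids
    for _ in range(max_iter):
        # step 2: assign each point to the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((pi - ci) ** 2
                                      for pi, ci in zip(p, centroids[c])))
            clusters[j].append(p)
        # step 3: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:         # step 4: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Toy example: two well-separated groups of 2-D points
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers, clusters = kmeans(pts, k=2)
```

As the table later in this section notes, the random initialization means the result may be a local optimum; practical implementations typically rerun with several seeds.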
DBSCAN assumes that the clustering result is determined by the tightness of the sample distribution: data clustered into the same category are closely connected, that is, around any sample of a category there must be other samples of the same category. The final clustering result is obtained by dividing all closely connected samples in the set into categories and displaying them as scatter plots, with different categories shown in different colors for a more intuitive visual presentation.
According to [11], for the sample set D = (x1, x2, ..., xm), the DBSCAN algorithm rests on five core definitions: the Eps neighborhood, core and border objects, direct density reachability, density reachability, and density connectivity. A cluster can contain one or more core points. If there is only one core point, all other non-core samples of the cluster lie in the Eps neighborhood of that core point. If there are multiple core points, the Eps neighborhood of any core point in the cluster must contain at least one other core point; otherwise the two core points would not be density reachable. The union of all samples in the Eps neighborhoods of these core points forms a DBSCAN cluster.
The quality of DBSCAN clustering depends on the parameters Eps and MinPts, so multiple experiments are needed to find a well-performing pair of values. After many experiments, Eps = 1 and MinPts = 1 were selected. K-means and DBSCAN each have advantages and disadvantages during implementation; the comparison is given in Table 1.
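The density-based expansion described above can be sketched in pure Python; this is a minimal illustration under the definitions in [11] (function names and the toy data are chosen here), not the implementation used in the experiments. Note that with MinPts = 1 every point counts as a core point, so no sample is labeled as noise:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: labels[i] is the cluster id of points[i],
    or -1 for noise. eps is the neighborhood radius; min_pts is the
    minimum neighborhood size (the point itself included) of a core point."""
    def region(i):
        # Eps neighborhood of points[i], by Euclidean distance
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = -1                  # noise (may become a border point)
            continue
        cluster += 1                        # i is a core point: new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:                        # expand density-reachable points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster         # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(region(j)) >= min_pts:   # j is itself a core point
                queue.extend(region(j))
    return labels

# With Eps = 1 and MinPts = 1 (the values selected above),
# every point is a core point and no sample is marked as noise.
print(dbscan([(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)], eps=1.0, min_pts=1))
```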
Table 1
Analysis of advantages and disadvantages of clustering algorithms
Algorithm  Advantages  Disadvantages
DBSCAN  (1) No need to specify the number of clusters in advance; (2) outliers and clusters of arbitrary shape can be found.  (1) When the sample data is large, clustering convergence is slow; (2) when the sample densities differ greatly, the clustering effect is poor; (3) the parameters are complex to tune.
K-means  (1) The clustering principle is simple and there are few parameters, so clustering is fast and the process is simple; (2) the clustering effect is good and interpretability is strong.  (1) The value of K must be specified in advance, and its choice is hard to grasp; (2) generally only applicable to convex datasets; (3) the initial centroids are chosen at random, so the result may be a local optimum.
As the table shows, the K-means algorithm has simpler parameters than DBSCAN, is easy to implement, and does not take much time. DBSCAN, by contrast, has more parameters, which strongly affect the clustering result, but it does not require specifying the number of clusters K in advance, and its cluster centers are actual samples from the data, whereas the centroids of K-means are randomly assigned or computed as means and need not correspond to real samples.