A hybrid clustering approach for link prediction in heterogeneous information networks

In recent years, researchers from academia and industry have become increasingly interested in extracting meaningful information from social network data. This information is used in applications such as link prediction between groups of people, community detection, protein module identification, etc. Clustering has therefore emerged as a technique for finding similarities between social network members. Recently, most graph clustering solutions combine the structural similarity of nodes with their attribute similarity. The results of these solutions indicate that the graph's topological structure dominates the outcome. Since most social networks are sparse, these solutions often make insufficient use of node features. This paper proposes a hybrid clustering approach as an application for link prediction in heterogeneous information networks (HINs). In our approach, an adjacency vector is determined for each node such that it records the weight of the direct edge or the weight of the shortest communication path between every pair of nodes. A similarity metric is presented that calculates similarity using the direct edge weight between two nodes and the correlation between their adjacency vectors. Finally, we evaluated the effectiveness of our proposed method on the DBLP, Political blogs, and Citeseer datasets under the entropy, density, purity, and execution time metrics. The simulation results demonstrate that, while maintaining cluster density, our approach significantly reduces entropy and execution time compared with the other methods.


Introduction
Nowadays, social networks are very popular for facilitating and modeling communication between different social groups [1]. These networks provide a place for exchanging opinions and sharing people's views and feelings. Social networks contain vast and valuable data, and helpful information can be obtained from analyzing these data. Networks are divided into homogeneous and heterogeneous. In a homogeneous network, all objects and the connections between them are of the same type. A heterogeneous network consists of nodes, which represent different types of objects, and edges, which establish relationships between them. Information in a social network can be modeled with heterogeneous information networks (HINs).

Research motivation and challenges
Various networks, including computer networks, social networks, signaling networks, etc., are usually modeled by graphs as an effective tool for examining objects and their relationships. The objects are associated with different attributes that enrich the information content of a network. Graph clustering is an exciting and challenging research field due to the complex structures and connections between objects in the real world. As a result, various aspects of graph clustering have been studied to gain a better understanding of network structure and semantics [2]. A decisive factor in clustering is finding a similarity criterion between objects that is consistent with the purpose of the clustering [3]. The similarity between objects is calculated according to their topological structure or their features, and the state-of-the-art methods use only one of these two aspects. The S-Cluster algorithm is a baseline clustering algorithm that considers only topological structure [4,6]. The other baseline algorithm, K-SNAP, partitions a graph such that each partition contains nodes with identical attribute values [4]. In other words, the similarity of objects is measured based on only one of the two aspects. The clustering quality of these methods is low because much of the network information is ignored during the similarity calculation and the clustering process. A combined similarity measure effectively addresses this limitation [2, 5-18]. However, when clustering the objects of a network based on a combination of the two aspects, the structural relationships are still weighted more heavily than the characteristics of the nodes; most of these methods cannot fully exploit the node attributes. Therefore, the extracted clusters may be inaccurate, especially when the network is sparse.

Our approach
This paper aims to perform a clustering process on HINs, with particular attention to node attributes, as an application to link prediction. Link prediction is a task that, due to its wide variety of applications in real-world problems, e.g., recommender systems [19], protein-protein interaction prediction [20], etc., has been the center of attention of many researchers in the area of network analysis. Link prediction focuses on predicting a network's spurious, missing, and forthcoming links [21]. However, most of the algorithms and techniques provided to solve this problem focus on the network's topological structure. Since the interactions between network members can be affected by their common characteristics, we can use the information on these characteristics to analyze the interactions. The proposed solution uses graph clustering that considers both structure and content to achieve the desired quality at a lower computational cost. It takes into account the type of connection between nodes, calculates an adjacency vector for each node based on its relationships with other nodes, and provides a similarity measure using the Pearson correlation coefficient. After that, the K-Medoids algorithm is applied to cluster the nodes based on their similarity scores.

Contributions
The contributions of this work are summarized as follows:
• We propose a hybrid clustering approach for heterogeneous information networks as an application for link prediction, which uses the K-Medoids technique to partition nodes based on a combined similarity value.
• We account for the presence of disconnected nodes when calculating the similarity between nodes.
• To evaluate our solution, we perform experiments on the DBLP, Political blogs, and Citeseer datasets regarding the density, entropy, and purity metrics.

Organization of the paper
The remaining parts of this paper are organized as follows: Sect. 2 examines related work on graph clustering for link prediction in social networks. In Sect. 3, we explain the proposed solution in more detail. Section 4 provides an evaluation of the proposed method and discusses the results. Finally, we present conclusions and future research to develop the current work in Sect. 5.

Related works
This section discusses approaches to the graph clustering and link prediction problems that use structure and attribute similarities in complex networks, and summarizes the corresponding research studies. Ghorbanzadeh et al. [22] proposed a new method for solving the link prediction problem using common neighborhoods in directed graphs. Their method uses authority, hub, and neighborhood direction, and works in both supervised and unsupervised models. They evaluated their strategy on the SmaGri, Wiki-vote, Political blogs, and Kohonen real-world datasets and showed that it outperforms other methods in terms of the precision and sensitivity metrics. Zarei et al. [23] proposed an approach for link prediction using hidden relations among users in social networks. Their method categorizes each node's neighbors to calculate the similarity score between a pair of nodes. They used nine real-world datasets and demonstrated that their method was more accurate than the other methods. In [24], the authors presented a link prediction approach for HINs via a deep convolutional neural network. Their community-detection-based link prediction method is performed in four steps: local neighborhood discovery, local subgraph tensorization, embedded learning, and link prediction. The approach was evaluated on four different types of HINs. In addition to applying to many scenarios, it has a reasonable execution time and can be used for various tasks.
According to [25], a solution is proposed to rank and predict links in a network that expands random walks via a distinct restart probability for each node. The results on two datasets reveal that the proposed method outperforms the classic random walk with restart (RWR) in link prediction. The label propagation algorithm for graph clustering was improved by Berahmand et al. [26]. Their version produces a weighted graph created from the initial graph by considering the node attributes and topological structure. They evaluated their method on real and artificial datasets and indicated that their approach is more efficient and precise under the density, entropy, and normalized mutual information (NMI) criteria. Agrawal et al. [27] studied graph clustering for detecting communities, combining topological and attribute similarities according to communication type to provide an efficient scheme. Their scheme balances the distance function and executes clustering within a K-Medoids framework. They used the DBLP and Political blogs datasets and measured density, entropy, and NMI to demonstrate the effectiveness of their algorithm.
In [28], a strategy is proposed to solve link prediction in complex networks. The technique uses path properties of different lengths to compute the similarity score between pairs of nodes and builds on the concept of network resource allocation. It increases the quantity of information received at the destination node by limiting information leakage through shared neighbors, thereby maximizing the similarity score of the two nodes. This work was tested on various datasets and evaluated against two measures, AUC and average precision. The results revealed that the strategy performs considerably better than the baseline techniques. Kumar et al. [29] introduced a new method to predict links based on level-2 node clustering coefficients. Their method uses level-2 common nodes and clustering coefficients to gather cluster information from the seed node pair's level-2 neighbors. They used eleven real-world datasets and evaluated their method against the baseline methods using the ROC curve, AUPR curve, precision, and recall metrics; their proposed method showed superiority over state-of-the-art algorithms. Ghasemi et al. [30] proposed a clustering-based method to improve link prediction. Their method has two steps. The first step is offline and executed once: it calculates local and global metrics for each node using the available data, and a classification algorithm is then used to develop the classification-based link prediction model, with AdaBoost as the best classifier. In the second step, a clustering technique is employed to group social items using estimated similarity criteria. They tested their method on the Facebook, HepTh, and Brightkite datasets and evaluated it based on precision, recall, and fitness metrics.
Lande et al. [31] presented a solution to predict links between objects in HINs. The HINs are analyzed to extract a meta-path, then links below a certain threshold level are removed, and their algorithm is used to calculate the connectional power. They used the Web of Science datasets to demonstrate their method's effectiveness.
Wei et al. [32] proposed a method for embedding heterogeneous networks for community detection. Their approach uses a random walk strategy based on node degree, with stay phases followed by a jump phase. They used the DBLP, ACM, and IMDB datasets for clustering and evaluated their method by measuring NMI to demonstrate its superiority over state-of-the-art methods. Berahmand et al. [33] proposed a new version of the spectral clustering algorithm for community detection in social networks, emphasizing the combination of structural similarities and node characteristics. They used this combination to construct the affinity matrix via a stochastic method. They evaluated their method on real and artificial datasets and indicated that their approach is more efficient and precise than baseline methods. In [34], the authors provided a model for link prediction between researchers in academic environments. This model considers both the dynamic structure and content information. Their experiments show that the method can predict academic collaborations dynamically and effectively. According to [35], a solution is proposed to predict links in multiplex social networks (MSNs) using local random walks along dependable pathways. MSNs are a subset of complex networks with the same nodes but different types of relationships. The authors designed a similarity measure that develops a local random walk measure for MSNs, and extensive experiments demonstrated its effectiveness. Ghasemi et al. [30] also proposed a link prediction approach using graph clustering, combining similarity-based and learning-based techniques to improve link prediction in social networks; their simulation results show better accuracy compared to state-of-the-art methods.
Table 1 summarizes the reviewed graph clustering and link prediction approaches and compares them in terms of datasets, techniques, and performance metrics.

Proposed approach
In this section, the proposed approach is described. First, a framework based on combining the structural characteristics and attributes of nodes is presented. The clustering problem is then formulated. Finally, the proposed algorithm for graph clustering of heterogeneous information networks is explained.

Proposed framework
This section discusses a framework for combining the topological structure and attributes of nodes to implement the suggested approach. As shown in Fig. 1, the proposed framework includes five main steps: data pre-processing, connection extraction, similarity calculation, combining structural and attribute similarities, and clustering and cluster evaluation, each of which is explained in the following:

Data pre-processing
This step is responsible for pre-processing the input dataset and is divided into two processes: filtering and coding. Filtering is in charge of cleaning the input dataset, and coding is responsible for building relationships between records within the data. The pre-processing step is carried out once, and its results are used in all other steps.

Connection extraction
The connection between node pairs is extracted once, and these connections are used in various steps. The nodes' connections are divided into three types: Directly connected, Indirectly connected, and Disconnected. Directly connected: there is a direct edge between the two nodes. For example, in Fig. 2-a, nodes A and B or A and D are directly connected. Indirectly connected: there is no direct edge between the two nodes, but a communication path passing through other nodes may establish a connection between them. In Fig. 2-a, nodes E and C are indirectly connected. Disconnected: there is neither a direct edge nor a path between the nodes. In the proposed method, such nodes may still be related to other nodes in the network through common features. In Fig. 2-a, the connection between nodes A and F is called disconnected. After extracting the connections, the data are modeled in the form of two graphs: the simple graph (G1) and the bipartite graph (G2).
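The three connection types above can be sketched with a simple reachability check. This is an illustrative Python snippet; the adjacency structure and function names are our own, and edge weights are omitted:

```python
from collections import deque

def connection_types(adj, nodes):
    """Classify every node pair as directly connected, indirectly
    connected, or disconnected, using BFS reachability.
    `adj` maps each node to the set of its neighbours (an illustrative
    structure; the paper's G1 is weighted, weights are omitted here)."""
    def reachable(src):
        seen = {src}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    q.append(v)
        return seen

    reach = {n: reachable(n) for n in nodes}
    types = {}
    for i, n in enumerate(nodes):
        for m in nodes[i + 1:]:
            if m in adj.get(n, ()):
                types[(n, m)] = "direct"          # direct edge exists
            elif m in reach[n]:
                types[(n, m)] = "indirect"        # path through other nodes
            else:
                types[(n, m)] = "disconnected"    # no edge and no path
    return types
```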

Similarity calculation
In this step, the structural similarity between node pairs in G1 is calculated separately according to the type of connection between them; the result is the structural similarity matrix. In addition, the attribute similarity between node pairs in G2 is calculated separately according to the type of connection between them; the result is the attribute similarity matrix.

Combining similarities
The hybrid similarity combines the structural and attribute similarities between pairs of nodes according to their connection type. In this combination, the structural similarity is based only on the edges or communication paths between the nodes, and the attribute similarity is based only on node features. The output of this step, weighted by the influence degree of the two similarities, is the hybrid similarity matrix.

Clustering and evaluating
In implementing the proposed algorithm, the K-Medoids method uses the distance values to partition the vertices. The outcome of the clustering is k clusters, each of which contains a set of vertices. Clusters are mutually disjoint and collectively complete. After the clustering process, the clusters are evaluated using three criteria: density, entropy, and purity.
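The entropy and purity criteria mentioned above can be computed as follows. This is a minimal sketch using the common textbook definitions, which may differ in detail from the exact formulas used in the paper; `clusters` and `labels` are hypothetical structures:

```python
import math
from collections import Counter

def purity(clusters, labels):
    """Purity: fraction of nodes matching the majority label of their
    cluster (higher is better).  `clusters` is a list of node lists;
    `labels` maps each node to its ground-truth class."""
    total = sum(len(c) for c in clusters)
    hit = sum(Counter(labels[v] for v in c).most_common(1)[0][1]
              for c in clusters)
    return hit / total

def entropy(clusters, labels):
    """Size-weighted average of per-cluster label entropy
    (lower is better)."""
    total = sum(len(c) for c in clusters)
    h = 0.0
    for c in clusters:
        counts = Counter(labels[v] for v in c)
        hc = -sum((k / len(c)) * math.log2(k / len(c))
                  for k in counts.values())
        h += len(c) / total * hc
    return h
```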

Problem statement
As shown in Table 2, this section introduces the notations and equations used in the proposed solution. The dataset is an undirected, weighted, multi-attributed graph G = {V, E, W, M, A}, not necessarily connected, where V and E are the sets of vertices and undirected edges, respectively, W is the weight of each edge, M is the number of node attributes, and A is the set of values of each attribute, A = {attr_1, attr_2, ..., attr_M}. Two graphs, G1 and G2, are extracted from graph G. G1 is an undirected, weighted graph G1 = {V1, E1, W1}, not necessarily connected, where V1, E1, and W1 are adapted from G. If there is a direct link between a pair of vertices, e.g., V_n and V_m, then W1_nm > 0. G2 is an undirected, weighted bipartite graph G2 = {V2, E2, W2, M, A}. In a bipartite graph, the vertices are divided into two separate sets such that no two vertices from the same set are connected. In G2, the nodes of G1 form one set, and the attribute values of the G1 nodes form the second set, as shown in Fig. 2-b. Thus, the number of nodes in one set of G2 equals the number of attribute values of graph G1, and the number of nodes in the other set equals the number of nodes in G1. Each attribute value appears as a single node in the bipartite graph; therefore, V2 is the union of the G1 nodes and their attribute values. E2 contains an attribute edge between a node and each attribute value of that node: if the node has a particular attribute value, there is an edge in G2 between that attribute value and the node. W2 represents the edge weight of each attribute edge; by default, its value equals one. M and A are adapted from G.
In G1 and G2, the parameter d_n indicates the degree of each node, i.e., the number of edges incident to it. In G1, CN_nm is the number of common neighbors of two nodes, e.g., V_n and V_m; in G2, CN_nm is the number of common attributes between two nodes. Table 2 also lists the notations for the hybrid similarity and the correlation coefficient between two vertices V_n and V_m. The goal is to partition the graph into k segments using the combination of topological and attribute similarities such that the nodes in a partition have strong structural relationships and homogeneous attribute values.
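As a rough illustration of how G1 and G2 could be derived from the dataset, the following sketch builds both graphs from an edge-weight map and a node-to-attribute-values map. The data structures are our own, not the paper's implementation:

```python
def build_graphs(edges, attrs):
    """Build the simple graph G1 and the bipartite graph G2 described
    in the problem statement.  `edges` is {(n, m): weight}; `attrs`
    maps each node to the set of its attribute values."""
    # G1: weighted adjacency dictionary over the original nodes.
    g1 = {}
    for (n, m), w in edges.items():
        g1.setdefault(n, {})[m] = w
        g1.setdefault(m, {})[n] = w
    # G2: bipartite graph linking each node to its attribute values.
    # Attribute-value nodes are tagged ("attr", value) to keep the two
    # sets disjoint; attribute edges default to weight 1 as in the paper.
    g2 = {}
    for n, values in attrs.items():
        for a in values:
            g2.setdefault(n, {})[("attr", a)] = 1
            g2.setdefault(("attr", a), {})[n] = 1
    return g1, g2
```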

Proposed graph clustering algorithm
This section provides a detailed explanation of the clustering algorithm, as shown in Algorithm 1. Initially, the dataset must be pre-processed before the other steps can use it. The pre-processing consists of two processes: filtering and coding. The filtering process extracts a coherent dataset with a smaller volume than the initial dataset by applying appropriate filters. The coding process encodes the data with a simple coding method for greater integrity. The output of the pre-processing phase is three sets: the nodes, the edges, and the node attributes. In the proposed algorithm, data pre-processing (line 3) and extraction of the connection types between vertices (line 4) are performed once. Then, a simple, undirected, and weighted graph (G1) is extracted as the model for the structural similarity problem, and a bipartite, undirected, and weighted graph (G2) serves as the model for the attribute similarity problem. Based on the output of line 4, the structural and attribute similarity calculations (lines 6-17) are repeated for each pair of vertices. Then, the hybrid similarity and distance function of each pair of vertices are calculated (lines 17-22), and finally the clustering process is performed (line 23).

Structural similarity
This section calculates the structural similarity between two vertices of the graph according to the connection type between them, as shown in Algorithm 2. Similarity-based methods in heterogeneous networks that place absolute emphasis on the number of common neighbors cannot calculate the structural similarity between pairs of nodes well. Beyond direct relationships, hidden relationships between pairs of vertices, such as indirect and disconnected connectivity, may also contribute to the structural similarity calculation. First, the adjacency vector is calculated for each node of G1 (line 4). Then, the union neighborhood set of the pair of vertices is formed (line 5), and the correlation between their vectors is calculated to determine the correlation between the two vertices (line 6). Finally, the structural similarity of the pair of vertices is obtained (line 7). The details of calculating the structural similarity of two nodes using the neighborhoods of both nodes and their indirect interaction strength, in the directly connected, indirectly connected, and disconnected cases, are described in the next section.
In the following, the method of calculating the structural similarity between directly and indirectly connected nodes is described. Most current techniques that consider the connection between nodes in the similarity calculation restrict indirect connections to paths of length two. Since longer paths may also contribute to the structural similarity of indirectly connected nodes, the proposed method does not limit the path length in its adjacency vector. The adjacency vector of each node is calculated by Eq. (1), where n is the node whose adjacency vector should be calculated. If the index of the adjacency vector equals n, the sum of the weights of all edges incident to node n is placed in this index. If the connection between n and m is indirect, the index value is the weight of the shortest path between n and m in the simple graph. If n and m are directly connected, the weight of the direct edge between them is placed in the index. If the two nodes are disconnected, the index is set to zero. After calculating the adjacency vectors of all nodes of the simple graph, the union neighborhood set of each directly or indirectly connected pair is calculated by Eq. (2). To indicate the correlation between a pair of nodes, the correlation coefficient between their adjacency vectors, restricted to the union neighborhood set, is calculated by Eq. (3), where the mean of the adjacency vector AV_n over the union neighborhood set is obtained from Eq. (4). Finally, the structural similarity of any two directly or indirectly connected nodes is calculated by Eq. (5), where W_nm is the weight of the common edge between nodes n and m in the simple graph and corr_nm is the correlation coefficient between them.
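A possible reading of the adjacency vector and correlation computations described above is sketched below. The exact forms of Eqs. (1)-(4) are not reproduced in this text, so the code follows the prose description only: shortest-path weights are assumed precomputed (e.g., with Dijkstra), and the mean is taken as the sum of nonzero entries divided by the size of the union neighborhood set:

```python
def adjacency_vector(g1, n, nodes, sp_weight):
    """Adjacency vector of node n, following the prose description of
    Eq. (1).  `g1` is a weighted adjacency dictionary;
    `sp_weight[(n, m)]` is the precomputed shortest-path weight between
    n and m, absent (treated as 0) when the nodes are disconnected."""
    av = {}
    for m in nodes:
        if m == n:
            av[m] = sum(g1.get(n, {}).values())   # self index: total incident weight
        elif m in g1.get(n, {}):
            av[m] = g1[n][m]                      # direct edge weight
        else:
            av[m] = sp_weight.get((n, m), 0)      # indirect: shortest-path weight, else 0
    return av

def correlation(av_n, av_m, union):
    """Pearson-style correlation of the two vectors restricted to the
    union neighbourhood set (a sketch of Eqs. (2)-(4)); the mean uses
    the nonzero-sum-over-union-size rule described for Eq. (9)."""
    xs = [av_n[i] for i in union]
    ys = [av_m[i] for i in union]
    mx = sum(v for v in xs if v != 0) / len(union)
    my = sum(v for v in ys if v != 0) / len(union)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0
```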
Also, the structural similarity between disconnected nodes is assumed to be zero.

Attribute similarity
There are different types of nodes in heterogeneous networks; each node in such a network can have M attributes, and each attribute can take values from a different set A. Since the goal is to calculate the hybrid similarity in such a network, the node attributes must be considered. For example, in a bibliographic network, one node type is authors, and one node attribute is each author's interest in different research fields. As shown in Fig. 3, an attribute is defined for each node that contains four values (e.g., data mining, database, programming, and machine learning are four values of the interest attribute). The attribute similarity, like the structural similarity, is calculated based on the three connection types. To calculate the attribute similarity and simplify the calculations, G2 is extracted from the sets V and A. In a bipartite graph, there are two disjoint sets of nodes such that the nodes within each set are not related and are connected only to nodes of the opposite set, as shown in Fig. 2b. All attribute similarity calculations in this section are performed on the bipartite graph according to the connection types. The attribute similarity procedure calculates the attribute similarity between two vertices of the graph, as shown in Algorithm 3. First, the adjacency vector is calculated for each node of G1 based on G2 (line 3). Then, the union neighborhood set of the pair of vertices is formed (line 4), and the correlation between their vectors is calculated to determine the correlation between the two vertices (line 5). Finally, the attribute similarity of the two nodes is obtained (line 6). The attribute similarity calculation is described in detail in the next section.
The adjacency vector of each node over G2 is calculated by Eq. (6), where n denotes the node whose adjacency vector should be calculated. If m equals n, the total weight of all edges incident to node n in G2 is placed in the m index. If n and m are directly connected, the weight of the common attribute edges between n and m in G2 is summed with the weight of the common edge between them in the simple graph; the edge weight of each attribute is always considered to be one. If the connection between n and m is indirect, or the nodes are disconnected, the m index value is the sum of the weights of the common attribute edges in G2. After calculating the adjacency vectors of all nodes, the union neighborhood set of each pair, whether directly connected, indirectly connected, or disconnected, is calculated by Eq. (7). A higher correlation between the vectors AV_n and AV_m over the union neighborhood set UNION_nm indicates a higher similarity between nodes n and m. The correlation coefficient between the vectors over the union neighborhood set is calculated by Eq. (8), where the mean of AV_n over the union neighborhood set is obtained from Eq. (9). In Eq. (9), the fraction's numerator is the sum of the nonzero values of the nth node's adjacency vector, and the denominator is the number of members of the union neighborhood set of the adjacency vectors of the two nodes n and m.
Finally, the attribute similarity between pairs of nodes, based on the connection type, is calculated by Eq. (10), where W_nm is the weight of the common edge between nodes V_n and V_m in the simple graph, corr_nm is the correlation coefficient between them, and M is the number of attributes of the graph nodes.
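Following the prose description of the attribute adjacency vector, a hedged sketch might look like this. The structure names are our own, and the exact equation is not reproduced in the text:

```python
def attribute_vector(g1, g2, n, nodes):
    """Adjacency vector of node n over the bipartite graph G2,
    following the prose description above.  `g1` is the weighted
    simple-graph adjacency; `g2` links each node to its attribute-value
    nodes with weight 1 (illustrative structures)."""
    own_attrs = set(g2.get(n, {}))
    av = {}
    for m in nodes:
        # total weight of attribute edges shared by n and m
        shared = sum(g2[n][a] for a in own_attrs & set(g2.get(m, {})))
        if m == n:
            av[m] = sum(g2.get(n, {}).values())   # all attribute edges of n
        elif m in g1.get(n, {}):
            av[m] = shared + g1[n][m]             # common attributes + direct edge weight
        else:
            av[m] = shared                        # indirect/disconnected: common attributes only
    return av
```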

Hybrid similarity and distance function
The overall similarity of two nodes, combining the structural and attribute similarities, is calculated by Eq. (11). In Eq. (11), the degree of influence of the two similarities is not necessarily the same. The α parameter is a weighting factor used to control the influence of the two similarities; it must be specified in advance in the range [0, 1]. A suitable value of α divides the graph into k clusters such that the nodes of each cluster have coherent communication structures and similar attribute values. In our method, based on the analysis of the results, the value of this coefficient is set to 0.5, giving identical importance to the structural and attribute similarities. After calculating the hybrid similarity measure according to the connection types, the distance value for each pair of nodes in the graph is computed with Eq. (12) for the graph clustering process. The distance value is the inverse of the similarity value: the smaller the distance between the nodes placed in a cluster, the better the clustering quality.
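Since Eq. (11) itself is not reproduced here, the sketch below assumes a convex combination controlled by α, and implements Eq. (12) as the stated inverse of the similarity. The small `eps` guard against division by zero is our addition:

```python
def hybrid_similarity(s_struct, s_attr, alpha=0.5):
    """Combine structural and attribute similarity; a convex
    combination controlled by alpha is assumed here (the exact form
    of Eq. (11) may differ).  alpha=0.5 matches the paper's setting."""
    return alpha * s_struct + (1 - alpha) * s_attr

def distance(sim, eps=1e-9):
    """Eq. (12): the distance value is the inverse of the similarity;
    `eps` avoids division by zero for dissimilar pairs."""
    return 1.0 / (sim + eps)
```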

Graph clustering
The K-Medoids algorithm is applied for graph clustering. K-Medoids is an iterative partitioning solution, as shown in Algorithm 4. The clustering process is carried out until the clusters converge. The number of iterations and the number of clusters (k) are inputs to the proposed algorithm. The top k vertices with the maximum degree in V are selected as the k initial cluster centers (line 4). The remaining nodes are assigned to clusters according to their distance from the initial centroids (line 7). In each iteration of the algorithm, the node with the highest degree among the remaining nodes is selected (line 12) as the new centroid of its cluster. The distance of the newly selected center to all other graph nodes is calculated, and the clusters are updated. Next, the distance between each cluster's nodes and its centroid is computed, and the total distances of all clusters are added together (lines 13-15). If the obtained value is better than the corresponding value in the previous clustering, the new centroid is fixed and the process continues (lines 16-18); otherwise, the centroid is discarded, the node with the next maximum degree is chosen, and the process is repeated.
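The degree-seeded K-Medoids procedure described above might be sketched as follows. This is an illustrative local-search variant, not the paper's Algorithm 4 itself, and it assumes a precomputed distance map covering all node pairs:

```python
def k_medoids(nodes, degree, dist, k, max_iter=20):
    """Degree-seeded K-Medoids sketch.  `dist[(n, m)]` is the
    precomputed pairwise distance (Eq. (12)-style); the top-k
    highest-degree nodes seed the clusters, and a medoid is replaced
    only when the total intra-cluster distance improves."""
    d = lambda a, b: (0.0 if a == b
                      else dist[(a, b)] if (a, b) in dist else dist[(b, a)])
    medoids = sorted(nodes, key=degree.get, reverse=True)[:k]

    def assign(meds):
        clusters = {c: [] for c in meds}
        for n in nodes:
            clusters[min(meds, key=lambda c: d(c, n))].append(n)
        return clusters

    def cost(clusters):
        return sum(d(c, n) for c, members in clusters.items() for n in members)

    clusters = assign(medoids)
    best = cost(clusters)
    for _ in range(max_iter):
        improved = False
        # try swapping each medoid with a non-medoid candidate,
        # highest degree first, keeping the swap only if cost drops
        for i, _c in enumerate(medoids):
            for cand in sorted(set(nodes) - set(medoids),
                               key=degree.get, reverse=True):
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                trial_clusters = assign(trial)
                if cost(trial_clusters) < best:
                    medoids, clusters = trial, trial_clusters
                    best = cost(trial_clusters)
                    improved = True
                    break
        if not improved:
            break
    return clusters
```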

Performance evaluation
This section validates the proposed solution using three real datasets: DBLP, Political blogs, and Citeseer. It then describes the simulation parameter setup and performance metrics. Finally, it provides a discussion of the simulation results.

Experimental setup
The experiments were performed on a 64-bit machine with a 2.80 GHz Intel Core i7 processor, 8 GB of main memory, and Windows 10 as the operating system. Python 3.9 was used as the open-source language to implement the suggested method. We compare the proposed method with the following five baseline approaches. The IGC-CSM, AR-Cluster, and SAG-Cluster methods were fully simulated and implemented; for the SA-Cluster and W-Cluster approaches, published results were used in the comparisons [8,9]. We chose these methods because, like ours, they calculate a collaborative similarity using topological structure and features in undirected, multi-attribute, weighted networks. The compared methods are as follows. IGC-CSM [2]: A collaborative approach for clustering a weighted, multi-attribute, undirected graph. This method computes topological and attribute similarity depending on the type of connection between nodes. The similarity of directly connected nodes is based on the Jaccard similarity and the weights of the nodes' neighbors. The structural and attribute similarities of indirectly connected nodes are the linear products of the structural and attribute similarities, respectively, of the directly connected node pairs along the path between them. This approach uses a shortest-path strategy to decrease the computation cost and search space. The K-Medoids method is used to cluster the graph.
AR-Cluster [12]: A collaborative approach for graph clustering is based on the type of connection between nodes. Attracting and recommending degrees are used in this algorithm to compute the structural similarity. In addition, the K-Medoids method is used to cluster the graph.
SAG-Cluster [27]: A cooperative approach that clusters the graph within the K-Medoids framework according to the type of connection between nodes. When calculating the structural similarity between directly connected nodes, the weights of all edges to each node's neighbors are considered. The similarity between each indirectly connected pair is calculated using the classical Basel theorem and the maximum weighted average. SA-Cluster [7]: A hybrid approach for graph clustering based on combining the structural and attribute aspects of nodes. It uses a random-walk strategy.
W-Cluster [8]: A clustering algorithm that combines structural and attribute similarities through a weighting function.
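As a rough illustration of the Jaccard-based structural similarity that IGC-CSM applies to directly connected nodes, the following sketch computes the Jaccard coefficient of two neighbor sets. The function name and toy neighbor sets are ours, not code from the cited work, and the sketch omits the neighbor-weight term that IGC-CSM additionally uses.

```python
def jaccard_similarity(neighbors_a: set, neighbors_b: set) -> float:
    """Jaccard coefficient |N(a) & N(b)| / |N(a) | N(b)|; 0.0 if both empty."""
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)

# Two nodes sharing two of four distinct neighbors
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 2/4 = 0.5
```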
In our experiments, we utilize three real datasets: DBLP, Political blogs, and Citeseer. Political blogs 1 : includes 1,490 blogs about US politics, with 19,090 links among these blogs. The attribute of each blog is its political leaning, whose value is either liberal or conservative. In the experiments, the edge weight between blogs is set to one, and each node has one attribute with the two values liberal and conservative. DBLP 2 : We use a subset of the DBLP bibliography data. This network includes information on articles, citations of articles, authors, and collaborations between authors. The sub-network used was collected between 2004 and 2014 and covers four research areas: artificial intelligence (AI), information retrieval (IR), data mining (DM), and databases (DB). This network is weighted and multi-attribute. In the experiments, the node attributes are the authors' interests in the different research fields, and the edge weight between two authors is the number of their co-authorships. Each node has four attributes, and each attribute has a value. Considering the datasets in terms of the communication types among nodes, the numbers of connections of the various types differ: Political blogs contains more indirectly connected links than DBLP, while DBLP contains more disconnected links than Political blogs.
Citeseer 3 : The CiteSeer dataset consists of 3,312 scientific publications, each classified into one of six classes. The citation network consists of 4,732 links. Each publication is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary, which consists of 3,703 unique words. In the experiments, the edge weight between nodes is set to one, and each node has one attribute with six possible values, one per class.
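To make the DBLP edge-weight convention concrete, the following sketch builds a weighted co-authorship graph in which the weight of an edge is the number of jointly authored papers. The toy paper list and all names here are illustrative; this is not the authors' preprocessing code.

```python
from collections import defaultdict

# Hypothetical paper list: each entry is the author list of one paper.
papers = [
    ["alice", "bob"],
    ["alice", "bob", "carol"],
    ["carol", "dave"],
]

# Edge weight = number of co-authored papers between each author pair.
edge_weight = defaultdict(int)
for authors in papers:
    for i in range(len(authors)):
        for j in range(i + 1, len(authors)):
            pair = tuple(sorted((authors[i], authors[j])))
            edge_weight[pair] += 1  # one more joint paper

print(edge_weight[("alice", "bob")])  # alice and bob co-authored 2 papers
```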

Performance metric
We used the following performance measures to validate the proposed solution against the other algorithms. Density: Density is the ratio of the number of edges inside a cluster to the number of edges in the entire graph; the ratios of all clusters are accumulated to assess the overall quality [2]. The density values lie in the range [0, 1] and are calculated by Eq. (13):

$$\text{Density} = \sum_{i=1}^{k} \frac{\left|\{(v_m, v_n) : v_m, v_n \in V_i,\ (v_m, v_n) \in E\}\right|}{|E|} \tag{13}$$

where $k$ is the number of clusters, $V_i$ is the node set of the $i$th cluster, $|E|$ is the total number of edges in the graph, and the numerator counts the edges $(v_m, v_n)$ whose endpoints both lie in cluster $i$. Entropy: This metric measures the relationships between vertices in terms of attributes [11]; a lower entropy means a better quality of clustering. The entropy value lies in the range [0, 1] and is expressed by Eq. (14):

$$\text{Entropy} = \sum_{i=1}^{k} \frac{|V_i|}{|V|} \left( -\sum_{j} p_i(attr_j) \log_2 p_i(attr_j) \right) \tag{14}$$

Purity: The purity of a cluster is the proportion of its dominant attribute value:

$$\text{Purity}(V_i) = \max_j \; p_i(attr_j)$$

where $i \in \{1, 2, \ldots, k\}$, the $i$th cluster consists of the node set $V_i$, $|V|$ is the total number of graph nodes, and $p_i(attr_j)$ is the proportion of the $j$th attribute value in the $i$th cluster.
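The three metrics above can be sketched directly in Python. The following minimal implementation, with an illustrative toy clustering of our own (not data from the paper), treats clusters as lists of node ids, edges as pairs, and attributes as a node-to-value mapping.

```python
import math
from collections import Counter

def density(clusters, edges):
    """Fraction of edges whose two endpoints fall in the same cluster (Eq. 13)."""
    intra = 0
    for c in clusters:
        members = set(c)
        intra += sum(1 for u, v in edges if u in members and v in members)
    return intra / len(edges)

def entropy(clusters, attr):
    """Size-weighted attribute entropy over all clusters (Eq. 14)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = Counter(attr[v] for v in c)
        h_c = -sum((m / len(c)) * math.log2(m / len(c)) for m in counts.values())
        total += (len(c) / n) * h_c
    return total

def purity(clusters, attr):
    """Size-weighted proportion of the dominant attribute value per cluster."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(attr[v] for v in c).values()) for c in clusters) / n

# Toy example: 4 nodes, 3 edges, 2 clusters, one binary attribute
edges = [(1, 2), (2, 3), (3, 4)]
clusters = [[1, 2], [3, 4]]
attr = {1: "a", 2: "a", 3: "b", 4: "a"}
print(density(clusters, edges))  # 2 of 3 edges are intra-cluster
```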

Experimental analysis
Simulation parameters are set in our proposed solution and the other implemented algorithms as shown in Table 3, with α equal to 0.5.
Table 3: Simulation parameter settings for the DBLP, Political blogs, and Citeseer datasets.
The quality of the results is evaluated using three criteria: density, entropy, and purity. The final results are presented as follows. Figure 4 compares the density criterion of the six approaches on the Political blogs dataset, with the number of clusters k set to 3, 5, 7, and 9. As the figure shows, the cluster density of every approach decreases as k increases. When k = 3, the density of IGC-CSM is higher than that of the other approaches; in the other cases, the SA-Cluster approach has the greatest density. When k = 7 or 9, the density of the proposed approach is higher than that of the SAG-Cluster and AR-Cluster approaches, and at k = 9 it is also higher than that of IGC-CSM. The density of the AR-Cluster approach is the lowest in every case.
The density and entropy values of W-Cluster and SA-Cluster methods are extracted from their original articles [7,8] and compared with [2] for confirmation.
Entropy is used to determine the feature relationships between nodes; a lower entropy value means greater homogeneity of a cluster's nodes in terms of their characteristics. Figure 5 compares the entropy measure of the six approaches on the Political blogs dataset with cluster numbers k = 3, 5, 7, and 9. At k = 3, the entropy of the proposed method is lower than that of the other five methods; in the other cases, the SA-Cluster approach has the lowest entropy. The comparison shows that at k = 5, 7, and 9 the proposed method performs better than the IGC-CSM, AR-Cluster, W-Cluster, and SAG-Cluster methods, with only SA-Cluster ahead of it. In this ranking, the SAG-Cluster method takes third place, with better entropy than IGC-CSM and AR-Cluster. In all cases, AR-Cluster and W-Cluster have much higher entropy than the other four approaches, indicating these methods' poor performance. Figure 6 compares the density measure of the six approaches on the DBLP dataset using cluster numbers k = 10, 30, 50, and 70. The density of IGC-CSM is the highest. The density of the proposed method on this dataset is higher than that of the SA-Cluster method in all cases, while at k = 10 it is lower than that of the SAG-Cluster and AR-Cluster approaches. As k increases, the entropy of the proposed method becomes more optimal while its density weakens. The performance of the W-Cluster method at k = 50 or 70 is acceptable compared with the SAG-Cluster, IGC-CSM, and AR-Cluster methods. The entropy of the proposed method is lower than that of the other five methods for all values of k, from which we conclude that the proposed method pays much more attention to the characteristics of the nodes than the other methods. Figure 8 compares the density measure of the IGC-CSM, AR-Cluster, SAG-Cluster, and proposed approaches on the Citeseer dataset with cluster numbers k = 3, 5, 7, and 9.
The density of the proposed method is higher than that of the other three methods when k equals 3, 5, or 9. At k = 7, its density differs only slightly from that of AR-Cluster, which performed better in this experiment. Density and entropy data for the SA-Cluster and W-Cluster methods were not available for the Citeseer dataset. The entropy comparison on Citeseer is shown in Fig. 9: when k = 9, the entropy of the SAG-Cluster method is better than that of the other methods, and, as on the Political blogs dataset, the AR-Cluster method performs worse than the IGC-CSM method.
Our proposed method's time complexity is quadratic, making it suitable for small- and medium-sized graphs. Figure 10 shows the execution time of the proposed method as a function of graph size (number of nodes) on several subsets of the Political blogs and DBLP data.
The execution time of the proposed approach is shorter than that of the IGC-CSM, AR-Cluster, and SAG-Cluster approaches, especially on the Political blogs dataset, which has more indirect relationships. Since all three of these methods calculate collaborative similarity based on the shortest path between indirectly connected nodes, this step increases their overall execution time. For example, according to Fig. 10, the execution time of the proposed approach on a subset of the Political blogs dataset with about 382 nodes is approximately 158 s, while the execution time of the SAG-Cluster approach on the same set exceeds 5,400 s. Thus, the proposed approach has a superior runtime compared with the other methods. Figures 11 and 12 plot density versus entropy. A line connects all points belonging to one algorithm, and its direction shows the behavior of that algorithm as the number of clusters increases; the arrowhead and tail indicate the minimum and maximum numbers of clusters. The best performance lies in the upper-left corner of the x-y plane, where density is highest and entropy is lowest. As Fig. 11 shows, the quality of the proposed and SAG-Cluster approaches on the DBLP dataset is quite effective compared with the other techniques. On the Political blogs dataset, the SA-Cluster approach is the most effective of the six, the proposed method ranks second, and the W-Cluster approach is the weakest, as shown in Fig. 12.
For further evaluation, we use the purity criterion. Our proposed algorithm is compared with IGC-CSM, AR-Cluster, and SAG-Cluster in terms of purity on Political blogs, DBLP, and Citeseer, as shown in Fig. 13. The experimental results show that the purity of the proposed method is the highest on all three datasets, especially on DBLP and Citeseer. In Fig. 13, the purity values are averaged over k = 3, 5, 7, and 9 for the Political blogs and Citeseer datasets, and over k = 10, 30, 50, and 70 for the DBLP dataset. A higher purity value indicates better clustering quality; in other words, the nodes within a cluster have more similar characteristics.
To evaluate the effectiveness of the proposed method, we compared it with the five previous methods with respect to density, entropy, runtime, and purity on the three datasets DBLP, Political blogs, and Citeseer. Tables 4, 5, and 6 show the performance of all six approaches on the subsets of data used. According to these tables, in most experiments the entropy of the proposed method is lower than that of the other methods. When the number of nodes in the selected network is small, the proposed method achieves better density than entropy; as the number of nodes increases, its entropy becomes better than its density. The purity of our method is consistently higher than that of the compared methods. We can therefore infer that the proposed method considers feature similarity accurately. In Tables 4, 5, and 6, bold values indicate the best performance among the compared methods.
Finally, we provide a statistical test analysis to show that the proposed solution significantly outperforms the other existing algorithms. We repeated the experiments 16 times on different subsets of the datasets; the sample means of the density and entropy variables are given in Table 7. Using these data, their variance, and the number of observations, we can compute a 95% confidence interval for each of the two variables. According to Table 7, the density variable, with a sample mean of 0.8, can belong to a population whose mean lies between 0.7387 and 0.8581; likewise, the entropy variable, with a sample mean of 0.05, belongs to a population whose mean lies between 0.03685 and 0.05671. The bold values in Table 7, from top to bottom, are the mean density and mean entropy over the 16 repetitions of the experiments, respectively.
Table 4: Comparison of the different approaches on the Political blogs dataset.
Table 5: Comparison of the different approaches on the DBLP dataset.
Table 6: Comparison of the different approaches on the Citeseer dataset.
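The confidence-interval computation described above can be sketched as follows. The 16 sample values here are fabricated for illustration only (they are not the paper's measurements); the t critical value 2.131 for 15 degrees of freedom comes from a standard t-table.

```python
import math
import statistics

# Illustrative density measurements from 16 hypothetical repetitions.
samples = [0.79, 0.81, 0.80, 0.78, 0.82, 0.80, 0.79, 0.81,
           0.80, 0.80, 0.78, 0.82, 0.81, 0.79, 0.80, 0.80]

n = len(samples)
mean = statistics.mean(samples)
sem = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
t_crit = 2.131                                  # t(0.975, df = n - 1 = 15)
lower, upper = mean - t_crit * sem, mean + t_crit * sem
print(f"95% CI for the mean: [{lower:.4f}, {upper:.4f}]")
```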

Conclusion
With the rapid development of social networks, analyzing their data to extract valuable information has become a significant research area, and clustering is one of the main approaches to such analysis. The fundamental challenge in the clustering process is to account for both the structural relationships and the homogeneous characteristics of nodes. In this study, we proposed a hybrid clustering solution as an application for link prediction in heterogeneous information networks. The study pays special attention to the characteristics of social network nodes and pursues this goal by optimizing entropy, which makes it possible to discover potential links between network members for a wide range of applications, such as academic ones. The method combines the structural and attribute similarities of nodes: we proposed a similarity measure based on the type of connection and on the correlation between the adjacency vectors of nodes, and for indirectly connected nodes this measure does not limit the path length. We evaluated the effectiveness of our solution on three real datasets. Compared with the existing methods, the simulation results showed that it is more effective in terms of entropy, purity, and execution speed while preserving cluster density. The proposed method has quadratic time complexity and is therefore suited to small- and medium-sized graphs. The limitations of this study include the lack of access to web-based cloud computing services, owing to limited conditions, and to additional suitable datasets. In the future, we will work on large-scale networks and study the clustering of information networks with directed connections. Furthermore, we will develop a function that detects the convergence of the clustering algorithm based on density and entropy.
In addition, we will pursue finding the best value of k from the ratio of density to entropy, so that k no longer needs to be an input parameter of the clustering algorithm.
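The adjacency-vector correlation idea summarized above can be sketched with a plain Pearson correlation between two nodes' vectors of path weights. The toy vectors and function name are ours; the paper's actual measure additionally accounts for the connection type and direct edge weight.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical adjacency vectors: edge or shortest-path weights from two
# nodes to four reference nodes (0.0 = no path considered).
vec_u = [1.0, 2.0, 0.5, 0.0]
vec_v = [1.0, 2.0, 0.5, 0.0]
print(pearson(vec_u, vec_v))  # identical neighborhoods -> correlation near 1
```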
Author contributions ZSS was involved in conceptualization, data curation, formal analysis, methodology, software, validation, writing-original draft. ME helped in conceptualization, data curation, supervision. MG-A contributed to conceptualization, data curation, supervision. BM helped in conceptualization, writing-review & editing.