Overlapping community detection algorithm based on similarity of node relationship

Community discovery is a vital part of social network research. To address the shortcomings of current local-expansion-based community discovery algorithms in local community discovery and expansion, we propose an overlapping community detection algorithm based on relationship similarity and local expansion (RSLO). First, a node relationship similarity strategy is used to find tightly connected seed communities. Then, for each discovered seed community, the similarity between the community and its neighboring nodes is calculated, and the nodes whose similarity meets the threshold are selected. After that, an adaptive optimization function is used to expand the community. Finally, the free nodes that have not yet been assigned to any community are assigned to communities, thereby achieving a more comprehensive community discovery. We conduct experiments on classic datasets and artificially generated networks. The results show that the RSLO algorithm can find accurate and objective community structures.


Introduction
As humans have become the main actors in networks, their social relationships have also been projected onto those networks. A network structure containing social relationships is called a social network. Because the relationships between people are complex, it can also be regarded as a complex network (Girvan and Newman 2002). People with high similarity in a social network can be grouped into a community. A community is composed of a group of closely connected nodes in the network: these nodes are densely connected to each other but sparsely connected to nodes in other such groups. Intuitively, community discovery is a clustering process. It aims to discover the most reasonable and realistic community structure in social networks, and it has become one of the important tasks in exploring network structure (Radicchi et al. 2004). To discover high-quality community structures, researchers have proposed many community discovery algorithms. At present, the results of community discovery research are applied in many fields, such as personalized recommendation, network public opinion analysis, and disease infection networks (Networks 2013).
Because some nodes in the community structure of social networks belong to more than one community, the classic algorithms for discovering nonoverlapping communities are not suitable for overlapping community structures. Therefore, the discovery of overlapping community structures has gradually become a research hot spot, and a large number of relevant results have appeared over the years. These methods can be roughly divided into clique partitioning algorithms, local expansion algorithms, edge division discovery algorithms, and label propagation algorithms (Javed et al. 2018). The following is a brief introduction to the classic algorithms in these categories and to recent research. The clique-based algorithm CPM was first proposed by Palla et al. (Palla et al. 2005). They regard a community as a union of connected complete subgraphs. A k-clique is a set of k nodes that are all connected to each other; the algorithm finds the k-cliques in the network and then searches for neighboring k-cliques that share k-1 nodes in order to find overlapping community structures. Later, Palla et al. (Palla et al. 2007) proposed the CPMd algorithm, which can handle directed graphs. It uses directed k-cliques instead of the k-cliques of the CPM algorithm, completing overlapping community discovery on directed graphs. Lu et al. (Lu Zhigang 2019) proposed an overlapping community discovery algorithm based on greedy clique expansion. It first searches for the largest cliques in the network, calculates the link strength according to the degree of association between the cliques, and converts the original network graph into the largest clique graph. It then greedily expands the seed cliques in the largest clique graph so as to maximize the fitness function. Zhang et al. (Zhang and Wang 2015) merged the cliques found in the network according to their coupling strength and hierarchically divided the resulting tree diagram to obtain the overlapping community structure. This type of method uses cliques as the medium for exploring overlapping community structure, but its results are not ideal on relatively sparse networks, and its time complexity is relatively high.
The basic idea of the local expansion algorithm is to find seed nodes in the network according to a chosen strategy and then expand communities from these seed nodes using a local optimization function to obtain the optimal community division (Junyu et al. 2016). For example, the LFM algorithm proposed by Lancichinetti et al. (Lancichinetti et al. 2009) discovers a community based on the fitness function of nodes and then selects nodes outside the community as seed nodes for further community expansion. Su et al. (Su et al. 2017) use a random walk strategy for community expansion. For such methods, the most important step is the selection of seed nodes. For this reason, Wang et al. proposed the concept of a structural central node and used it as the seed node for local community expansion. Sun et al. (Sun et al. 2018) proposed the vertex cover growth rate to select seed nodes and combined it with a random walk strategy to expand communities and discover overlapping communities. Li et al. (Yan et al. 2019) proposed an overlapping community discovery algorithm based on the greedy expansion of seed nodes, which uses a greedy strategy based on the fitness function to expand the seed set. The algorithm can find high-quality overlapping community structures.
In methods based on label propagation (Raghavan et al. 2007), each network node is assigned a label containing overlapping membership relationships. Through the propagation of these labels between neighboring nodes, each node's membership in each community finally reaches a stable state, which yields the community discovery result. Typical representatives of this type of method in the field of overlapping community discovery are the multi-label COPRA algorithm (Gregory 2010), the speaker-listener model-based SLPA algorithm (Xie et al. 2011), and the LPANNI algorithm proposed by Lu et al. (Lu et al. 2018).
The algorithms above mainly study the nodes of the network and their attributes in order to find overlapping community structures. In contrast, edge division discovery algorithms discover the community structure from the perspective of edges. Wang et al. (Gang 2018) proposed a label propagation algorithm based on edge propagation probability to improve the traditional overlapping label propagation algorithm COPRA. Guo et al. (Kun et al. 2018) proposed an overlapping community discovery algorithm based on edge density clustering. First, taking edges as the research object, it uses density clustering to detect closely connected core edge communities. Then, according to a boundary edge attribution strategy, each boundary edge is assigned to the core edge community closest to it. For isolated edges, a processing strategy based on edge degree and edge community attribution is proposed. Wang et al. (Wang Qing et al. 2019) proposed an adaptive overlapping community discovery algorithm based on the mixed parameters of edge trust, which defines the neighbor set of an edge and the trust function between an edge and its neighbors, and obtains the total amount of information of each edge through information transmission to realize adaptive discovery of overlapping communities. At present, edge division discovery algorithms have become an important class of overlapping community discovery algorithms.
More recently, community detection techniques have appeared that use the attributes associated with each node, known as attribute-based community detection. In addition to the topology of nodes connected by edges, the nodes or edges themselves usually carry attribute information, so that they form an attribute network. Attributes can be used as supplementary information to overcome topology sparsity (Yang et al. 2013).
From the perspective of attribute information processing, existing algorithms can be divided into the following categories. The first category is the distance-based approach, whose basic idea is to create a new network by designing a distance measure between nodes that considers both the topological structure and the node attributes. For example, in CODICL (Ruan et al. 2013), the signal strength between two nodes is measured through a trade-off parameter that fuses link strength and content similarity. The second category defines a model that fuses structure and attribute information for community detection in attribute networks. It mainly includes methods based on generative models (networks using content and links, in: The 22nd International Conference on World Wide Web, pp. 1089-1098, 2013) and methods based on nonnegative matrix factorization (Xu et al. 2012). The third category expresses the community detection problem in the attribute network as a multi-objective optimization problem; for example, Liu et al. proposed MOEA-SA.
However, many existing community discovery algorithms rely only on the topology of the network, that is, the local information between nodes, while ignoring the influence of the connection relationships between nodes. Real social networks are built on relationships between real people, and each person is an individual with individual differences. A realistic community should therefore fully consider the importance of the connections between nodes; only in this way can the behavior patterns of the entire community be studied through individuals. Many algorithms perform cluster analysis simply to discover communities and thus ignore the influence of node connection relationships. To overcome this problem, we propose the relationship similarity of nodes to evaluate the value of a node to a community. Nodes in the same community have a certain degree of similarity, which we express through the local clustering coefficient of the node. In addition, each node in a community has a higher similarity with the nodes directly connected to it, which we express through the similarity of the relationship between the nodes. Combining the two, we propose an overlapping community discovery algorithm based on relationship similarity and local expansion. The algorithm sets the node with the highest centrality as the core node and then judges, according to the tendency of the nodes in the community, whether a node can belong to multiple communities at the same time. If so, it is marked as an overlapping node and assigned to the community; the node is then removed from the network, and the division result of overlapping communities is obtained iteratively. Experimental results show that the algorithm performs better than several classic overlapping community discovery algorithms.
The rest of this paper is organized as follows. Sect. 2 briefly introduces the concept and computation of node similarity. Sect. 3 describes the proposed method in detail, and Sect. 4 presents the simulation results. Finally, Sect. 5 draws the conclusions.

Related work
For an undirected and unweighted network $G = (V, E)$, $V$ is the set of all nodes in the network and $E$ is the set of all edges in the network. Specifically, $\Gamma(v_i)$ is the set of neighboring nodes of node $v_i$, and $k_i$ represents the degree of node $v_i$.

Local clustering coefficient
The clustering coefficient (Watts and Strogatz 1998) is a parameter used to measure the degree of clustering between nodes in the network. In a real social network, the clustering coefficient represents how closely a person's friends are related to each other. Specifically, it measures how densely the neighbors of a node are connected. The local clustering coefficient of a node in an undirected graph can be defined as follows:

$$LCC_i = \frac{2 e_i}{k_i (k_i - 1)}$$

where $k_i$ is the degree of node $v_i$ and $e_i$ is the number of edges between the neighbors of node $v_i$. $LCC_i$ is the local clustering coefficient of node $v_i$, and its value lies in $[0, 1]$. In particular, $LCC_i = 0$ means that the neighboring nodes of $v_i$ have no connections with each other, and $LCC_i = 1$ means that all neighboring nodes of $v_i$ are connected to each other.
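The definition above can be illustrated with a minimal Python sketch. It assumes the graph is stored as a dictionary mapping each node to the set of its neighbors; the function name `local_clustering_coefficient` and the convention of returning 0 for nodes with fewer than two neighbors are assumptions made for illustration, not part of the original algorithm description.

```python
from itertools import combinations


def local_clustering_coefficient(adj, v):
    """Local clustering coefficient of node v.

    adj: dict mapping each node to the set of its neighbors (undirected graph).
    """
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        # Assumed convention: the ratio is undefined here, so return 0.
        return 0.0
    # e: number of edges that connect two neighbors of v
    e = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * e / (k * (k - 1))


# Small example: a triangle 1-2-3 with a pendant node 4 attached to node 1.
adj = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2},
    4: {1},
}

for node in adj:
    print(node, local_clustering_coefficient(adj, node))
```

In this example, node 2 has two neighbors that are connected to each other, so its coefficient is 1, while node 1 has three neighbors with only one edge among them.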

Node relationship similarity
In social networks, the similarity of nodes reflects how close the relationship between them is. Lü and Zhou (Lü and Zhou 2011) summarized the currently popular similarity indexes, shown in Table 1, which include the Salton index, Jaccard index, Sorensen index, and resource allocation (RA) index.
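For reference, the four indexes named above can be sketched in Python using their standard neighborhood-based definitions from the link prediction literature. The graph representation (a dictionary of neighbor sets) and the function names are assumptions chosen for illustration.

```python
import math


def salton(adj, x, y):
    """Salton (cosine) index: common neighbors over sqrt(k_x * k_y)."""
    denom = math.sqrt(len(adj[x]) * len(adj[y]))
    return len(adj[x] & adj[y]) / denom if denom else 0.0


def jaccard(adj, x, y):
    """Jaccard index: common neighbors over the union of neighborhoods."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0


def sorensen(adj, x, y):
    """Sorensen index: twice the common neighbors over k_x + k_y."""
    denom = len(adj[x]) + len(adj[y])
    return 2.0 * len(adj[x] & adj[y]) / denom if denom else 0.0


def resource_allocation(adj, x, y):
    """RA index: sum of 1/k_z over the common neighbors z of x and y."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])


# Triangle 1-2-3 with a pendant node 4 attached to node 1.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}

print(salton(adj, 2, 3), jaccard(adj, 2, 3),
      sorensen(adj, 2, 3), resource_allocation(adj, 2, 3))
```

All four indexes depend only on the common neighbors of the two nodes, so none of them rewards a direct edge between the nodes, which is the limitation that the relationship similarity discussed next is meant to address.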
In most real networks, nodes tend to form relatively closely connected groups, which are characterized by high local connection density. The higher the clustering coefficient of a node, the stronger the cohesion of its neighboring nodes. Researchers have done a lot of work on cluster analysis and its optimization. Many studies of clustering coefficients focus on the closeness between two nodes and their adjacent nodes while ignoring the connection between the two nodes themselves, which reduces the accuracy of the similarity. In a real social relationship, two people who are friends are directly connected, and this connection increases the similarity between them. However, if only the clustering coefficients of common neighbors are considered, the similarity of two directly connected nodes with no common neighbors is obviously 0. Therefore, when studying similarity, we consider not only a node and its neighboring nodes but also the connection between the two nodes themselves. Building on these works, our algorithm uses a node relationship similarity based on the local clustering coefficient. In its definition, $LCC_t$ is the local clustering coefficient of $v_t$ and $t$ is the set of common neighboring nodes of nodes $v_i$ and $v_j$. The relationship importance of nodes is then expressed using a parameter $z$, for which there are two situations depending on whether $v_i$ and $v_j$ are directly connected; when they are not directly connected, $z = t$. The similarity not only measures the clustering coefficient of the common neighbors of two nodes but also considers the connection between the two nodes themselves. The closer the relationship between the common neighboring nodes, the closer the relationship between the two nodes. The clustering coefficient of the common neighbors is an important indicator for calculating the similarity of two nodes: the larger its value, the higher the similarity of the nodes. The direct connection between the nodes also affects their similarity to a certain extent. Therefore, the similarity between two nodes is positively correlated with both their common neighbors and their direct connection.

Local expansion related concepts
In the community discovery algorithm based on local expansion, in addition to precise similarity, several basic concept parameters are still required, which are defined as follows:

Community neighbor set
The community neighbor set $N_v(C)$ represents the set of nodes connected to the community $C$:

$$N_v(C) = \left( \bigcup_{v \in C} \Gamma(v) \right) \setminus C$$

where $C$ represents a community and $\Gamma(v)$ represents the neighboring nodes directly connected to node $v$.
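A minimal sketch of the community neighbor set follows. It assumes that the set excludes the members of the community itself and that the graph is stored as a dictionary of neighbor sets; the function name `community_neighbors` is hypothetical.

```python
def community_neighbors(adj, community):
    """Nodes outside the community that are adjacent to at least one member."""
    community = set(community)
    frontier = set()
    for v in community:
        frontier |= adj[v]      # collect all neighbors of every member
    return frontier - community  # drop nodes that are already members


# Triangle 1-2-3 with a pendant node 4 attached to node 1.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}

print(community_neighbors(adj, {2, 3}))
print(community_neighbors(adj, {1, 2}))
```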

Similarity between node and community
The similarity $S_{vc}(v_i, C)$ between node $v_i$ and community $C$ is defined for a node $v_i$ that does not belong to the community $C$. The larger the value of $S_{vc}(v_i, C)$, the greater the probability that the node $v_i$ belongs to the community $C$.

Community similarity
The similarity $S_{cc}(C_i, C_j)$ between community $C_i$ and community $C_j$ is:

$$S_{cc}(C_i, C_j) = \frac{|C_i \cap C_j|}{\min(|C_i|, |C_j|)}$$

where $|C_i \cap C_j|$ represents the number of nodes shared by community $C_i$ and community $C_j$, and $\min(|C_i|, |C_j|)$ represents the smaller of the number of nodes in community $C_i$ and the number of nodes in community $C_j$. The larger $S_{cc}(C_i, C_j)$, the more similar the structure of community $C_i$ and community $C_j$. Usually, when $S_{cc}(C_i, C_j) > 0.5$, the two communities can be merged.
Table 1 Similarity indexes
Salton index: Also called the cosine similarity in the literature. It measures the similarity between two vectors by the cosine of the angle between them.
Jaccard index: Compares the similarities and differences between finite sample sets. The larger the Jaccard coefficient, the higher the sample similarity.
Sorensen index: Also known as the community coefficient, similar to the Jaccard index. The larger the Sorensen index, the higher the sample similarity.
Resource allocation index: Motivated by the resource allocation dynamics on complex networks. The index predicts missing links in complex networks.

Adaptive function
The adaptive fitness function is used to measure the tightness of a group of nodes. In its definition, $Com^{g}_{in}$ and $Com^{g}_{out}$ are, respectively, the sum of the internal degrees and the sum of the external degrees of community $C$. In addition, two further quantities are used: the value of $Q_f$ after adding node $v$ to the current community, and the value of $Q_f$ after node $v$ is removed from the current community. The parameter $\alpha$ is a positive real number used to control the size of the communities that are discovered.
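As an illustration only, the sketch below assumes that the fitness takes the same form as the LFM fitness function (Lancichinetti et al. 2009), namely the internal degree sum divided by the total degree sum raised to the power $\alpha$. This form is an assumption consistent with the quantities described above, not a formula confirmed by the text; the names `community_fitness` and `fitness_gain` are hypothetical.

```python
def community_fitness(adj, community, alpha=1.0):
    """Assumed LFM-style fitness: k_in / (k_in + k_out) ** alpha."""
    community = set(community)
    k_in = 0   # sum of internal degrees (each internal edge counts twice)
    k_out = 0  # sum of external degrees
    for v in community:
        for u in adj[v]:
            if u in community:
                k_in += 1
            else:
                k_out += 1
    total = k_in + k_out
    return k_in / total ** alpha if total else 0.0


def fitness_gain(adj, community, v, alpha=1.0):
    """Fitness with v added minus fitness with v removed (hypothetical helper)."""
    community = set(community)
    with_v = community_fitness(adj, community | {v}, alpha)
    without_v = community_fitness(adj, community - {v}, alpha)
    return with_v - without_v


# Triangle 1-2-3 with a pendant node 4 attached to node 1.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}

print(community_fitness(adj, {1, 2, 3}))
print(fitness_gain(adj, {1, 2, 3}, 4))
```

Under this assumed form, a larger $\alpha$ penalizes communities with many edges more strongly and therefore tends to yield smaller communities.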

RSLO algorithm
In this part, we will briefly sort out the algorithm flow and the pseudocode of RSLO; see Algorithm 1.
The overlapping community discovery algorithm based on relationship similarity and local expansion consists of four parts: (1) seed community discovery; (2) seed community merging; (3) local community expansion; and (4) final community optimization. The seed community discovery part calculates the relationship similarity of the nodes in the network, selects the core node of each seed community according to the local information of its neighboring nodes, and forms the seed community from the core node and its tightly connected neighbors. The expansion part uses the information of local nodes and their neighbors to select nodes that have high relationship similarity with the community and that improve the adaptive fitness function, adds them to the community, and on this basis realizes the community division of all nodes. Since each seed community is expanded based on its own community neighbor set, overlapping structures in the network can be found.

Seed community discovery stage
In community discovery algorithms based on local expansion, the selection of core nodes directly affects the accuracy of seed community discovery. This algorithm designs a new core node selection process. First, calculate the local clustering coefficient of the nodes in the network and obtain the relationship similarity of each node $v$ based on this coefficient. Then, for each node $v$, count the number $n_{cc}$ of neighboring nodes whose relationship importance is lower than that of $v$. If the ratio of $n_{cc}$ to the number of neighbors of node $v$, $|\Gamma(v)|$, is greater than the set threshold $\gamma$, node $v$ is selected as a core node of a seed community, and the similarity $S_{vc}$ between each neighboring node of $v$ and the seed community is calculated. Finally, the neighboring nodes whose similarity is greater than the threshold $\delta$ are added to the seed community to obtain the final seed community $PC$. The specific steps are shown in lines 1-12 of Algorithm 1.
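The core node selection rule can be sketched as follows. The sketch assumes that a relationship importance value has already been computed for every node and is passed in as a dictionary; how that value is computed is not part of this sketch. The function name `select_core_nodes` and the example importance values are hypothetical.

```python
def select_core_nodes(adj, importance, gamma):
    """Return nodes whose importance exceeds that of enough of their neighbors.

    adj:        dict mapping each node to the set of its neighbors
    importance: dict mapping each node to its (precomputed) relationship importance
    gamma:      threshold on the fraction n_cc / |neighbors|
    """
    cores = []
    for v, neighbors in adj.items():
        if not neighbors:
            continue  # isolated nodes cannot be core nodes
        # n_cc: number of neighbors with lower importance than v
        n_cc = sum(1 for u in neighbors if importance[v] > importance[u])
        if n_cc / len(neighbors) > gamma:
            cores.append(v)
    return cores


# Triangle 1-2-3 with a pendant node 4 attached to node 1.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
# Illustrative importance values, not derived from the paper's formula.
importance = {1: 0.9, 2: 0.5, 3: 0.4, 4: 0.1}

print(select_core_nodes(adj, importance, gamma=0.6))
```

With these illustrative values, only node 1 dominates more than 60% of its neighbors, so it is the only core node.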

Seed community merger stage
In the process of discovering seed communities, two seed communities may have a high degree of similarity. To avoid repetitive calculations in the third part, seed communities with high community similarity are merged first. Calculate the similarity between the seed community $PC_i$ and the seed community $PC_j$ according to $S_{cc}$. If $S_{cc}(PC_i, PC_j)$ is greater than the threshold $\delta$, the two seed communities are merged to obtain a tighter set of seed communities $PC_t$. See lines 13-22 of Algorithm 1 for the specific steps.
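The merging step can be sketched in Python as below. The community similarity is the shared-node count divided by the size of the smaller community, as defined earlier; the repeated pairwise scan until no pair exceeds the threshold is an assumed, straightforward way to realize the merge, and the function names are hypothetical.

```python
def community_similarity(a, b):
    """Shared nodes divided by the size of the smaller community."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def merge_seed_communities(seeds, delta=0.5):
    """Repeatedly merge any two communities whose similarity exceeds delta."""
    merged = [set(s) for s in seeds]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if community_similarity(merged[i], merged[j]) > delta:
                    merged[i] |= merged[j]  # absorb community j into i
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan after every merge
    return merged


seeds = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(merge_seed_communities(seeds, delta=0.5))
```

Restarting the scan after each merge keeps the sketch simple at the cost of extra comparisons; a production implementation would likely track only the pairs affected by the merge.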

Community expansion phase
After the tightly connected seed communities are obtained in the second stage, the communities can be expanded. The steps are as follows. First, obtain the neighbor set $N_v$ of the seed community, calculate the similarity $S_{vc}$ between each node in $N_v$ and the community, and select the nodes whose similarity is greater than the threshold $\delta$ as candidate nodes. Then, for each node in the candidate set, calculate the fitness function value of the local community after the node joins. A candidate node that increases the fitness function value is added to the community; otherwise it is set as a free node in the network. Any node that causes the fitness function value of the community to become smaller is deleted from the community. Finally, update $N_v$ and repeat the above steps until $N_v$ is empty. See lines 23-32 of Algorithm 1 for the specific steps.
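The expansion loop can be sketched as follows under several stated assumptions: the fitness uses the LFM-style form $k_{in} / (k_{in} + k_{out})^{\alpha}$, the node-community similarity is replaced by a stand-in (the share of a node's neighbors that lie inside the community), and a node that has been examined once is not reconsidered so that the loop terminates. None of these choices is confirmed by the text, and all function names are hypothetical.

```python
def fitness(adj, community, alpha=1.0):
    """Assumed LFM-style fitness: k_in / (k_in + k_out) ** alpha."""
    k_in = sum(1 for v in community for u in adj[v] if u in community)
    k_out = sum(1 for v in community for u in adj[v] if u not in community)
    total = k_in + k_out
    return k_in / total ** alpha if total else 0.0


def stand_in_similarity(adj, v, community):
    """Hypothetical placeholder for S_vc: share of v's neighbors in the community."""
    return len(adj[v] & community) / len(adj[v]) if adj[v] else 0.0


def expand_community(adj, seed, delta, alpha=1.0, similarity=stand_in_similarity):
    community = set(seed)
    free_nodes = set()
    examined = set()  # nodes already rejected or removed (assumed: never revisited)
    while True:
        # Community neighbor set, minus nodes that were already examined.
        frontier = set().union(*(adj[v] for v in community)) - community - examined
        if not frontier:
            break
        for v in sorted(frontier):
            current = fitness(adj, community, alpha)
            is_candidate = similarity(adj, v, community) > delta
            if is_candidate and fitness(adj, community | {v}, alpha) > current:
                community.add(v)
            else:
                examined.add(v)
                free_nodes.add(v)
        # Delete members whose presence lowers the community fitness.
        for v in sorted(community):
            if len(community) > 1 and (
                fitness(adj, community - {v}, alpha) > fitness(adj, community, alpha)
            ):
                community.remove(v)
                examined.add(v)
    return community, free_nodes


# Two triangles (1-2-3 and 4-5-6) joined by the bridge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}

community, free_nodes = expand_community(adj, {1, 2}, delta=0.3)
print(community, free_nodes)
```

In this example the seed grows to the first triangle and stops at the bridge, because adding node 4 would lower the fitness; node 4 is left as a free node for the later optimization stage.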

Community optimization stage
In the third part, some nodes are set as free nodes and may not belong to any community, and after expansion some communities may have high similarity to each other. Therefore, the expanded communities need to be optimized. First, calculate the similarity $S_{vc}$ between each free node and each community. When $S_{vc}$ is greater than the threshold $\delta$, add the node to the community; otherwise let it become a separate community. Then, calculate the similarity $S_{cc}$ between communities again and merge the communities whose similarity is greater than the threshold $\delta$. Finally, the final community division result is obtained.

Experiment and result analysis
This section presents a series of experiments on real complex networks and computer-generated networks to demonstrate the validity and effectiveness of the proposed method. The RSLO algorithm performs well on different community structures and on artificial synthetic networks of different scales. The experiments were run on an all-in-one computer with an Intel(R) Core(TM) i5-8400 CPU @ 2.80 GHz and 8 GB of memory, running 64-bit Windows 10 Home Edition. The algorithm is implemented in Python 3. The Python modules used include networkx 2.4, python-igraph 0.8.2, and matplotlib 3.3; the Java-based network visualization tool Gephi 0.9.2 is also used.

Datasets
To test the performance of the RSLO algorithm, we selected several real network datasets that are widely used in research experiments, as well as artificially synthesized datasets. The real network datasets are the karate club network Karate, the dolphin network Dolphins, the American college football network Football, and the American political books network Pol books. These have become the most classic datasets in the field of community discovery: almost all algorithms are evaluated on them, and they can be regarded as benchmark datasets for measuring community discovery algorithms. The artificially synthesized datasets are generated with the LFR benchmark (Lancichinetti et al. 2008) program. They offer good node representation and community distribution differences and have been used in many studies in recent years. A brief overview of these datasets is given below (see Table 2).
LFR benchmark: A program used to generate artificial synthetic networks. Unlike real datasets, an LFR synthetic network has a clearer community structure and plays an important role in testing the performance of community discovery algorithms. See Table 3 for the specific parameter settings.
The parameters of the artificial datasets used in this article are given in Table 4.

Evaluation index
In the field of overlapping community discovery, the two most commonly used evaluation indicators are modularity EQ (Shen et al. 2009) and normalized mutual information (NMI) (Danon et al. 2005). The two are briefly summarized below.
Modularity EQ is used to evaluate the quality of an overlapping community structure and is mainly used as an evaluation index for the division of real network datasets. The closer the EQ value is to 1, the better the quality of the communities discovered by the algorithm. Usually, if its value is between 0.3 and 0.7, the community discovery result of an algorithm is considered reasonable. The definition of modularity EQ is as follows:

$$EQ = \frac{1}{2m} \sum_{l=1}^{c} \sum_{i \in C_l,\, j \in C_l} \frac{1}{O_i O_j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right]$$

where $m$ is the total number of edges in the network, $c$ is the number of communities found by the algorithm, $C_l$ is the $l$th community, $O_i$ is the number of communities to which node $i$ belongs, $k_i$ is the degree of node $i$, and $A_{ij}$ indicates whether node $i$ and node $j$ are connected: $A_{ij} = 1$ if the connection exists, and 0 otherwise.
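A direct, unoptimized Python sketch of the EQ computation is given below. It assumes the graph is stored as a dictionary of neighbor sets and the cover as a list of node sets; the function name `extended_modularity` is hypothetical.

```python
def extended_modularity(adj, communities):
    """Extended modularity EQ of a (possibly overlapping) cover."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    if m == 0:
        return 0.0
    # O[v]: number of communities that node v belongs to
    membership = {}
    for community in communities:
        for v in community:
            membership[v] = membership.get(v, 0) + 1
    total = 0.0
    for community in communities:
        for i in community:
            for j in community:
                a_ij = 1.0 if j in adj[i] else 0.0
                expected = len(adj[i]) * len(adj[j]) / (2 * m)
                total += (a_ij - expected) / (membership[i] * membership[j])
    return total / (2 * m)


# Two triangles (1-2-3 and 4-5-6) joined by the bridge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}

print(extended_modularity(adj, [{1, 2, 3}, {4, 5, 6}]))
print(extended_modularity(adj, [{1, 2, 3, 4, 5, 6}]))
```

When every node belongs to exactly one community, all $O_i$ equal 1 and EQ reduces to the standard modularity; placing all nodes in a single community gives a value of 0.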
Table 2 Real network datasets
Karate (Zachary 1977): A well-known real network dataset based on long-term observation of the 34 members of an American university karate club. The nodes represent the members of the karate club, and the edges represent the relationships between the members. There is an obvious community structure, with one community around the coach and the other around the club owner.
Dolphins (Lusseau et al. 2003): A real network dataset based on long-term observation of the contact behavior between 62 dolphins. The nodes represent dolphins, and the edges represent frequent contacts between dolphins.
Football: A real network dataset obtained from the 2000 regular season of American college football. The nodes represent the teams, and the edges represent the games played between them.
Pol books (Krebs et al. 2004): A network of books about US politics sold by the online bookstore Amazon around the 2004 presidential election. The nodes represent books, and the edges connect books purchased by the same buyers.

Normalized mutual information (NMI) measures, when the real community division is known, how well the division result fits the actual division. It measures the similarity between two divisions from the perspective of information theory, so that the accuracy of the algorithm's community discovery result can be evaluated objectively against the real community structure. Its value ranges between 0 and 1. For two divisions A and B, the formula is defined as follows:

$$NMI(A, B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} C_{ij} \log\left( \frac{C_{ij} N}{C_{i\cdot} C_{\cdot j}} \right)}{\sum_{i=1}^{C_A} C_{i\cdot} \log\left( \frac{C_{i\cdot}}{N} \right) + \sum_{j=1}^{C_B} C_{\cdot j} \log\left( \frac{C_{\cdot j}}{N} \right)}$$

where $N$ is the number of nodes, $C$ is a confusion matrix whose element $C_{ij}$ represents the number of nodes that belong to community $i$ in division A and to community $j$ in division B at the same time, $C_A$ is the number of communities in division A, and $C_B$ is the number of communities in division B. $C_{i\cdot}$ represents the sum of the elements in the $i$th row of the confusion matrix $C$, and $C_{\cdot j}$ represents the sum of the elements in the $j$th column of the confusion matrix $C$. The larger the value of NMI, the closer the community discovery result of the algorithm is to the real community structure. In particular, when the value of NMI is 1, the two divisions are identical.
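The confusion-matrix form of NMI can be sketched in Python as follows. The sketch assumes that each node belongs to exactly one community in each division, as the confusion-matrix definition requires; returning 1 when both divisions consist of a single community is an assumed convention, and the function name `nmi` is hypothetical.

```python
import math


def nmi(division_a, division_b):
    """Normalized mutual information between two divisions (lists of node sets)."""
    n = sum(len(c) for c in division_a)  # total number of nodes
    # Confusion matrix: conf[i][j] = nodes shared by community i of A and j of B
    conf = [[len(set(a) & set(b)) for b in division_b] for a in division_a]
    row = [sum(r) for r in conf]
    col = [sum(r[j] for r in conf) for j in range(len(division_b))]

    numerator = 0.0
    for i, r in enumerate(conf):
        for j, c_ij in enumerate(r):
            if c_ij > 0:  # empty cells contribute nothing
                numerator += c_ij * math.log(c_ij * n / (row[i] * col[j]))

    denominator = sum(x * math.log(x / n) for x in row if x > 0)
    denominator += sum(y * math.log(y / n) for y in col if y > 0)
    if denominator == 0:
        # Assumed convention: both divisions are a single community.
        return 1.0
    return -2.0 * numerator / denominator


truth = [{1, 2, 3}, {4, 5, 6}]
print(nmi(truth, [{1, 2, 3}, {4, 5, 6}]))
print(nmi([{1, 2}, {3, 4}], [{1, 3}, {2, 4}]))
```

Identical divisions give a value of 1, while two divisions that share no information, such as the second example, give a value of 0.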

Algorithm comparison
In order to verify the performance of the RSLO algorithm, the algorithm is compared with several classic overlapping community discovery algorithms. The contrasted algorithms include LFM algorithm, CPM algorithm, SLPA algorithm, and DEMON algorithm (Coscia et al. 2012).
The algorithms are compared on both the real network datasets and the artificially synthesized datasets to verify the effect of the RSLO algorithm. Figure 1 shows the EQ values of the RSLO algorithm and the four other classic community discovery algorithms on the four real datasets. It can be seen from the figure that the RSLO algorithm performs well in general, although on some datasets it does not perform as well as other algorithms. This is because the algorithm can find high-quality seed communities and then, through community expansion, community optimization, and other steps, effectively discover the community structure in real networks. Figures 2 and 3 show the results on Karate and Dolphins. The differently colored node is the overlapping node of the two communities.

Comparison on synthetic datasets
For the network datasets generated by the LFR benchmark, the parameter $\mu$ represents the complexity of the community structure. The closer the value of $\mu$ is to 1, the more complex the community structure of the synthesized network; conversely, the smaller $\mu$ is, the simpler the community structure. The following experiments examine the effect of different algorithms for different values of $\mu$. Figure 4 shows the results of the different algorithms on the LFR1 artificial synthetic network dataset. It can be seen from the figure that, for the different values of $\mu$, the NMI value obtained by the RSLO algorithm is higher than that of the other algorithms, and as $\mu$ increases, the NMI of the RSLO algorithm also declines more slowly than that of the other algorithms. This shows that the algorithm performs better than the other algorithms on more complex networks, which is mainly due to the accurate selection of the core nodes of the seed communities.
The previous experiment presented the results of the different algorithms under different community complexity. Next, we carried out an experiment on LFR2 to examine the performance of each algorithm for different network scales. It can be seen from Figs. 5 and 6 that, for different values of the parameter $\mu$, as the total number of nodes in the network gradually increases, the NMI value of the RSLO algorithm remains basically stable and is higher than that of some algorithms. This is mainly because, in addition to the precise selection of the core nodes of the seed communities mentioned above, the algorithm also processes the free nodes in the network to ensure the rationality of the community structure to a certain extent. Therefore, the RSLO algorithm performs well on different community structures and on artificial synthetic networks of different scales.
To evaluate performance on overlapping community structures, we compared the algorithms for different network overlap degrees $O_m$, using the dataset LFR3. It can be seen from Fig. 7 that, for the different values of $\mu$, the NMI value of the RSLO algorithm tends to be stable and leads the other algorithms to a certain extent. As $O_m$ increases, the discovery of the network structure becomes more difficult, and the performance of all algorithms deteriorates. The advantage of RSLO is due to the fact that, after the initial seed communities are discovered, they are preferentially merged to ensure the quality of the communities when they are expanded, and then the adaptive fitness function and relationship similarity are used to expand the different seed communities and discover the nodes in overlapping structures in the network. Therefore, the RSLO algorithm can realize the discovery of overlapping structures in the network structure.

Conclusion
This paper proposes an overlapping community discovery algorithm (RSLO) based on relationship similarity and local expansion, which can discover overlapping community structures in the network. First, the local clustering coefficient of each node is calculated, and the relationship similarity of the nodes is then calculated from the local clustering coefficient and the connection relationship between the nodes. The core node of each seed community is found according to the local clustering coefficient and is combined with its closely related nodes to constitute a seed community. Then, the discovered seed communities are merged according to their community similarity to reduce the amount of calculation in the community expansion stage. After that, the similarity between the neighboring nodes of the seed community and the community is calculated, and the adaptive fitness function is used to expand the community. Finally, the result of the community division is optimized: free nodes in the network are added to the community with the greatest similarity, and communities with too high a similarity are merged again to ensure the quality of the community structure discovered by the algorithm. Experimental results show that the algorithm performs well on some real network datasets and artificially synthesized datasets.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Code availability The code can be written by following the algorithm described in this paper.
Ethical approval This article does not contain any studies with human participants performed by any of the authors.
Informed consent Informed consent was obtained from all individual participants included in the study.