Network structure optimization for social networks by minimizing the average path length

Network structure plays an important role in the natural and social sciences. Optimizing network structure to achieve specified goals has been a major research focus. In this paper, we focus on structural optimization in terms of minimizing the network's average path length (APL) by adding edges. We propose a memetic algorithm to find the minimum-APL solution when a given number of edges is added. Experiments show that the proposed algorithm can solve this problem efficiently. Further, we find that APL ultimately decreases linearly in the process of adding edges, a behavior that is governed by the network diameter.


Introduction
Social networks have been analyzed over the past several decades. Most social network research can be classified into the following three fields: (1) analysis of network characteristics, which focuses on finding indices to quantify features of network structure (such as modularity) and utilizing these indices to distinguish broad categories of network structure (such as small-world phenomena, scale-free property) [1][2][3][4]; (2) exploration of mechanisms of network structure formation using statistical or dynamic models such as the Watts-Strogatz or Barabási-Albert models [5][6][7]; (3) the dynamic evolution of networks, which focuses on the transformation of network structures based on specific models in many fields, such as transmission of infection [8] and human cooperation [9]. Obviously, network structure plays a fundamental role in the analysis of social networks, and optimizing network structure can help to improve network performance.
Previous research on optimizing network structure has shown that the primary methods for improving network performance include adding edges, rewiring edges and adding nodes, each of which changes the network topology. Watts and Strogatz [5] found the small-world network structure by rewiring network edges through optimization of the clustering coefficient and average path length (APL). Other scholars have concentrated on improving network characteristics by adjusting network structure [10][11][12]. In this paper, we call this process "network structure optimization". It is the process of adjusting the network structure (adding, deleting or rewiring edges) to improve performance with respect to a specified outcome. Although there are different methods of improving structural performance, a standard definition of structural optimization has yet to be proposed. Generally, a network can be defined as a graph $G = (V, E)$, where $V = \{v_i\}$ is the set of nodes and $E = \{e_{ij}\}$, with $e_{ij} \in \{0, 1\}$, is the set of edges; $e_{ij} = 1$ denotes a relationship between nodes $i$ and $j$, and $e_{ij} = 0$ indicates no relationship between nodes $i$ and $j$. Network structure optimization is then defined as follows: for a specific performance measure $P_G$ of a network $G$, structural optimization is the process of optimizing $P_G$ by changing the network structure from $G = (V, E)$ to $G' = (V, E')$, which produces $P_{G'}$ such that $P_{G'} < P_G$.
Optimizing network structure has several applications, such as in transportation networks [13,14] and communication networks [15]. Take the airline network as a real-world example: because of resource constraints, the creation of new routes between cities should take full account of the efficiency of the whole network. We therefore want to select a few connections that maximize the overall benefit by making the network a smaller world.
Problems related to structural optimization have been widely discussed [10][16][17][18][19]. Barahona et al. [20] treated balance as an optimization problem, and Wang et al. [21] proposed that the transformation of an imbalanced network into a balanced one at the lowest cost is also an optimization problem. Yang and Wang [22] proposed a decentralized small-world optimization method for improving search efficiency. Different indicators can reflect the effect of network optimization; among them, the small-world properties attract the most attention. For example, Schleich et al. [23] found a high-quality solution to improve small-world properties on vehicular ad hoc networks. Du et al. [24] presented a more effective method for optimizing the small-world network generating process. Alstott et al. [25] proposed a method that creates a highly clustered network by starting with an existing network and rearranging edges, without adding or removing them.
The small-world properties can be characterised by several methods. However, our focus in this article is the APL. APL is a fundamental quantity that contains important information for understanding the behaviors of networks [26]. For example, APL can be used to analyze topological structures of real-world systems such as the World Wide Web [7], to measure the transfer efficiency of a metabolic network [27], or to measure robustness with respect to random failures [28]. By minimizing APL, we can decrease communication cost [29] or increase synchronization [30][31][32][33][34]. Here, we take APL as the index to be optimized.
In this paper, we propose a memetic algorithm to optimize APL. This is useful because APL correlates with a set of important indices such as robustness and transmission efficiency; a lower APL is highly desirable since it facilitates and amplifies dynamic processes in the network, such as information dissemination or viral marketing. Several methods for optimizing APL have been proposed, but their effectiveness can still be improved, especially on random networks, and this paper proposes a more effective method. Moreover, we analyze features of the optimal networks in order to determine the effect of optimizing APL on different networks and to identify which features of the added edges decrease APL most efficiently. To this end, we compute several indices of the optimal networks, such as the assortative index and the network diameter. In the analysis, we also find that APL ultimately decreases linearly in the process of adding edges; this occurs because the network diameter affects the optimization of APL, a factor ignored in previous studies. This phenomenon can be used to design more effective methods for optimizing APL.
The remainder of the paper is structured as follows. We define structural optimization in terms of minimizing APL in Sect. 2; we propose an optimizing algorithm in Sect. 3; the algorithm is tested on different computer-generated networks in Sect. 4; Sect. 5 gives our conclusions and suggests further work.

Related background
Minimizing APL has been the subject of much research. Donetti et al. [35] proposed a new class of networks, named "entangled networks", which have small APL and large girth. To speed up the convergence in Donetti's model, Xuan et al. [36] suggested a simulated annealing method to achieve the optimal network by minimizing APL. Keren [37] explored the reduction of APL in binary decision diagrams using a spectral technique based on properties of the Walsh spectrum of a Boolean function and its autocorrelation function. Garijo et al. [38] proposed a novel method to minimize the largest distance between any two points in a plane Euclidean network.
In this paper, we focus on how to add edges to a network to obtain an optimal APL; this process can be applied to real optimization problems such as enterprise mergers or location planning. Specifically, for a connected network $G = (V, E)$ with $n = |V|$ nodes and $|E|$ initial edges, we add $k$ edges to produce a network $G' = (V, E')$ with the minimum APL. Our objective function can be expressed as Eq. (1):

$$\min_{E'} \ \mathrm{APL}(G') = \frac{2}{n(n-1)} \sum_{i<j} d_{ij}(G'), \quad \text{s.t. } E \subset E',\ |E'| = |E| + k, \tag{1}$$

where $d_{ij}(G')$ is the shortest path length between nodes $i$ and $j$ in $G'$.

It should be noted that optimization of APL suffers from the combinatorial explosion problem. There are $\binom{n(n-1)/2 - |E|}{k}$ ways of adding $k$ edges to a network with $n$ nodes. Since computing APL takes $O(n^3)$ time, finding the best solution by exhaustive search takes $O(n^{2k+3})$ time, which is high even for a small network. This problem can be seen as a variant of link recommendation, namely selecting from the candidate links those that improve the APL, and many related algorithms can be employed to accomplish this [39,40]. Meyerson and Tagiku [41] first defined this link recommendation problem as Average Shortest Path Distance Minimization (ASPDM) and proposed an approximation method (ASPDM Algorithm) to solve it; this method finds a source vertex $S$ with a low sum of path lengths to the other nodes, and then adds edges from $S$ to $k$ other nodes to give a lower APL. The complexity of this method is $O(n^3)$, which is quite time-saving, and the quality of its solution comes with a proven guarantee, but the solution can still lie far from the global optimum. Upon adding a new edge, the path lengths of some pairs of nodes change, while those of other pairs remain unchanged; Parotsidis et al. [42] proposed the EdgeEffect Algorithm to maximize the reduction in the path lengths of the changed pairs. The complexity of the EdgeEffect Algorithm is $O(n^3 \cdot |E|)$, which is higher than that of the ASPDM Algorithm, but its solutions are better.
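To make the objective and the size of the search space concrete, the following minimal sketch computes the APL of a small unweighted network and counts the candidate solutions. It is an illustration only: the function names `apl` and `num_candidate_solutions`, the 0-based node labels and the edge-list representation are assumptions of this sketch, and breadth-first search is used in place of an $O(n^3)$ all-pairs routine.

```python
from collections import deque
from math import comb


def apl(n, edges):
    """Average path length of a connected, undirected, unweighted graph.

    Nodes are assumed to be labelled 0..n-1; edges is a list of (u, v) pairs.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    total = 0  # sum of distances over ordered pairs (each pair counted twice)
    for source in range(n):
        dist = [-1] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist)
    return total / (n * (n - 1))


def num_candidate_solutions(n, num_edges, k):
    """Number of ways to pick k of the n(n-1)/2 - |E| nonexistent edges."""
    return comb(n * (n - 1) // 2 - num_edges, k)


path = [(0, 1), (1, 2), (2, 3)]     # path graph on 4 nodes
print(apl(4, path))                  # pairwise distances sum to 10 over 6 pairs
print(apl(4, path + [(0, 3)]))       # closing the cycle shortens the 0-3 path
print(num_candidate_solutions(30, 60, 10))
```

Even for a hypothetical network with 30 nodes and 60 edges, the number of ways to place 10 new edges is far too large to enumerate, which motivates the heuristic methods discussed in this section.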
Different variants of these algorithms have been considered, leading to qualitatively similar results [43,44]. A greedy algorithm adds edges one by one, and has proved to be the most efficient method in optimizing APL [42,45,46]. Gozzard et al. [47] suggested two efficient algorithms to minimize the APL by link addition, the FirstMinAPL algorithm and the SecondMinAPL algorithm, of which FirstMinAPL performs better on small-scale networks. The complexity of the FirstMinAPL algorithm is $O(n^3) + O(n^2 \cdot |E|)$. Papagelis and Manos [48] proposed a novel method named "path screening", which quickly identifies important shortcuts to guide the augmentation of the graph. The complexity of the Path Screening Algorithm is $O(n^2 \log n) + O(n^2 \cdot |E|)$, which is lower, and its solutions are also effective. In this paper, we modify the greedy algorithm as follows: a new edge is added at each step; specifically, in the first step we choose the edge giving the minimum APL from the $\frac{n(n-1)}{2} - |E|$ nonexistent edges, in the second step we choose the best edge from the remaining $\frac{n(n-1)}{2} - |E| - 1$ nonexistent edges, and in the $k$th step we choose the best edge from the edges that still remain. The total time complexity of this greedy algorithm is $O(n^5 \cdot k)$. This algorithm has difficulty finding the globally optimal solution since it performs a local search: its solution for adding $k$ edges depends completely on its solution for adding $k - 1$ edges. Therefore, a new optimizing algorithm that uses a global search technique is required to find more precise solutions.
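The step-by-step greedy procedure described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments; the names `apl` and `greedy_add_edges` and the edge-list representation with 0-based node labels are assumptions of the sketch.

```python
from collections import deque
from itertools import combinations


def apl(n, edges):
    """Average path length of a connected, undirected, unweighted graph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    total = 0
    for source in range(n):
        dist = [-1] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist)
    return total / (n * (n - 1))


def greedy_add_edges(n, edges, k):
    """Add k edges one by one, each time taking the edge with the lowest APL."""
    current = {tuple(sorted(e)) for e in edges}
    added = []
    for _ in range(k):
        # All node pairs that are not yet connected by an edge.
        candidates = [e for e in combinations(range(n), 2) if e not in current]
        if not candidates:
            break  # the graph is already complete
        best = min(candidates, key=lambda e: apl(n, list(current) + [e]))
        current.add(best)
        added.append(best)
    return added, apl(n, list(current))


path5 = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(greedy_add_edges(5, path5, 1))  # the single best edge joins the two ends
```

Each step commits to one edge and never revisits it, which is why the result for $k$ edges is tied to the result for $k - 1$ edges.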

The algorithm
Since most structural optimizations suffer from the combinatorial explosion problem, evolutionary algorithms (such as genetic algorithms or simulated annealing) that combine global and local search techniques have been widely used and can be effective in mitigating combinatorial explosion [49,50]. Memetic algorithms were first defined by Moscato and Norman [51]. In their early definition, a memetic algorithm was a modification of a genetic algorithm that also employed a local search operator, and it was used to address the Traveling Salesman Problem. Since memetic algorithms were not proposed as specific optimization algorithms, but as a broad class of algorithms inspired by the diffusion of ideas and composed of multiple existing operators, the community began to pay increasing attention to these algorithmic structures as a general guideline for addressing specific problems. In recent years, memetic algorithms have been successfully applied to complex real-world problems and have displayed high performance in a large number of cases [52]. In this section, we propose a memetic algorithm that combines a genetic algorithm with a heuristic local search to minimize the APL.

Framework
Algorithm 1 gives the framework of our proposed algorithm. We first input the necessary parameters and load the matrix of the network to be optimized. Then we generate an initial population with the function Initial_Population(). Until the iteration number reaches $I_{max}$, or the objective function (APL) has remained unchanged for 40 iterations, we repeat a process for finding the best solution. In this repeated process, we first select the parent population for genetic operations using the function Tournament_Selection(); then we apply crossover and mutation in Genetic_Operation() to generate the offspring population; next, we use Local_Search() to find a better solution after Genetic_Operation(), which can speed up convergence; finally, Update_Population() sorts all chromosomes by their objective function values and selects the better-performing ones to construct a new population. The process of the memetic algorithm is shown in Fig. 1. First, we evaluate the population's fitness, i.e. APL. Second, we choose the population by natural selection. Then we execute the genetic operation, which comprises crossover and mutation. Finally, combined with local search, we find better results by iterating this process.

Algorithm 1 Framework of our algorithm
Require: maximum number of iterations: I_max; size of population: S_pop; size of mating pool: S_pool; size of tournament: S_tour; probability of crossover: P_c; probability of mutation: P_m.
Ensure: the positions of the added edges; the updated network's matrix; APL of the updated network.
1: Load the initial network adjacency matrix;
2: P ← Initial_Population(S_pop);
3: repeat
4:   P_parent ← Tournament_Selection(P, S_pool, S_tour);
5:   P_offspring ← Genetic_Operation(P_parent, P_c, P_m);
6:   P_offspring ← Local_Search(P_offspring);
7:   P ← Update_Population(P, P_offspring);
8: until Termination.
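The loop of Algorithm 1 can be sketched in a problem-independent way as follows. This is a hedged illustration, not the implementation used in the experiments: the details of the tournament (sampling `tour_size` entrants without replacement and keeping the fittest) and of the elitist update are assumptions, as are all names. The callables `fitness`, `vary` (standing in for Genetic_Operation) and `local_search` are hypothetical placeholders supplied by the caller; in our setting `fitness` would return the APL of the decoded network.

```python
import random


def tournament_selection(population, fitness, pool_size, tour_size, rng):
    """Fill the mating pool; each slot goes to the fittest of tour_size entrants."""
    pool = []
    for _ in range(pool_size):
        entrants = rng.sample(population, tour_size)
        pool.append(min(entrants, key=fitness))  # lower fitness (APL) is better
    return pool


def update_population(population, offspring, fitness):
    """Keep the best len(population) chromosomes among parents and offspring."""
    merged = sorted(population + offspring, key=fitness)
    return merged[:len(population)]


def run(population, fitness, vary, local_search, i_max,
        pool_size, tour_size, patience=40, seed=0):
    """Iterate selection, variation, local search and update until termination."""
    rng = random.Random(seed)
    best, stall = min(map(fitness, population)), 0
    for _ in range(i_max):
        parents = tournament_selection(population, fitness, pool_size, tour_size, rng)
        offspring = [local_search(c) for c in vary(parents, rng)]
        population = update_population(population, offspring, fitness)
        new_best = fitness(population[0])
        stall = stall + 1 if new_best >= best else 0
        best = min(best, new_best)
        if stall >= patience:  # objective unchanged for `patience` iterations
            break
    return min(population, key=fitness)


# Toy usage: chromosomes are 1-tuples and the "fitness" is the stored number.
toy = [(5,), (3,), (8,), (1,)]
result = run(toy, fitness=lambda c: c[0],
             vary=lambda parents, rng: [(max(0, p[0] - 1),) for p in parents],
             local_search=lambda c: c, i_max=20, pool_size=4, tour_size=2)
print(result)
```

Because the update step keeps the best chromosomes of the merged population, the best objective value never worsens from one iteration to the next.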

Representation and initialization
Our target is to determine in which positions we should add new edges to reduce the APL. To this end, we first find the positions of those pairs of nodes that are not connected (nonexistent edges). Then we give a set of sequence numbers to these positions; thus our solutions can be encoded as chromosomes consisting of these corresponding sequence numbers. An example is shown in Fig. 2, where the second entry of the chromosome changes from 2 to 3, so the added edge between Node 1 and Node 4 changes to the edge between Node 2 and Node 4.
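The encoding can be sketched as follows, assuming 0-based node labels and 0-based sequence numbers (Fig. 2 may number nodes and positions differently). The function names are illustrative only.

```python
from itertools import combinations


def nonexistent_edges(n, edges):
    """List the unconnected node pairs; the list index is the sequence number."""
    existing = {tuple(sorted(e)) for e in edges}
    return [e for e in combinations(range(n), 2) if e not in existing]


def decode(chromosome, candidates):
    """Translate a chromosome of sequence numbers into the edges to be added."""
    return [candidates[gene] for gene in chromosome]


path = [(0, 1), (1, 2), (2, 3)]
candidates = nonexistent_edges(4, path)
print(candidates)                  # the three unconnected pairs of the path graph
print(decode([1, 2], candidates))  # a chromosome that adds two of them
```

Changing a single entry of the chromosome therefore replaces exactly one added edge, which is the property that the genetic operators rely on.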
At initialization, we generate a population and randomly assign a set of numbers to every element of the chromosomes in the population. To speed up convergence and retain population diversity, we add two new sub-populations: (1) each element of the chromosomes (i.e., the added edge) connects the two highest degree nodes; (2) each element of the chromosomes connects the highest degree node and the lowest degree node. In other words, the initial population can be divided into three parts: solutions with random assignment, assortative connecting assignment and disassortative connecting assignment.
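One possible reading of this three-part initialization is sketched below. The text does not fix how the degree-based chromosomes are built, so the ranking rules here are assumptions: the assortative chromosome takes the $k$ candidate edges with the largest degree sums, and the disassortative chromosome takes the $k$ candidates with the largest degree differences. The function name and the choice of one chromosome per degree-based sub-population are also assumptions.

```python
import random
from itertools import combinations


def initial_population(n, edges, k, size, seed=0):
    """Build a population of chromosomes (lists of candidate-edge indices)."""
    rng = random.Random(seed)
    existing = {tuple(sorted(e)) for e in edges}
    candidates = [e for e in combinations(range(n), 2) if e not in existing]
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    idx = range(len(candidates))
    # Assumed assortative rule: prefer pairs with the largest degree sum.
    big_big = sorted(
        idx, key=lambda g: -(deg[candidates[g][0]] + deg[candidates[g][1]]))[:k]
    # Assumed disassortative rule: prefer pairs with the largest degree gap.
    big_small = sorted(
        idx, key=lambda g: -abs(deg[candidates[g][0]] - deg[candidates[g][1]]))[:k]
    population = [big_big, big_small]
    while len(population) < size:
        # Random assignment: k distinct nonexistent edges.
        population.append(rng.sample(range(len(candidates)), k))
    return population, candidates


path5 = [(0, 1), (1, 2), (2, 3), (3, 4)]
population, candidates = initial_population(5, path5, k=1, size=6)
print([candidates[g] for g in population[0]])  # assortative seed
print([candidates[g] for g in population[1]])  # disassortative seed
```

Seeding the population with both kinds of degree-based solutions gives the search a starting point near each connecting strategy while the random chromosomes preserve diversity.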

Genetic operation
The genetic operation consists of two parts: crossover and mutation.
In the crossover procedure, we randomly select two chromosomes from the parent population with probability $P_c$ and generate two offspring chromosomes. The elements common to the two chromosomes remain unchanged. The remaining elements, which differ between the chromosomes, are given an ordering, and for every unshared element we draw a random number $a$: if $a < 0.5$, the element remains unchanged; if $a \geq 0.5$, the corresponding elements (with the same ordering number) of the two offspring are swapped. Figure 3 gives an illustration of crossover. The sequence numbers 11 and 7 are common to both offspring chromosomes, so they do not need to be exchanged. For the other four elements, assuming that $a$ equals 0.4, 0.8, 0.7, and 0.2, respectively, the first and fourth unshared elements remain unchanged since $a < 0.5$, and the second and third unshared elements are swapped between the two chromosomes since $a \geq 0.5$.
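A minimal sketch of this crossover rule is given below. The gene values other than 11 and 7 are hypothetical (the actual values in Fig. 3 are not reproduced here), the helper class `FixedDraws` exists only to replay the draws 0.4, 0.8, 0.7, 0.2 from the example, and the selection of parent pairs with probability $P_c$ is left to the caller.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Keep shared genes; swap each pair of unshared genes when the draw >= 0.5.

    Both parents are assumed to have the same length and no repeated genes.
    """
    common = set(parent_a) & set(parent_b)
    shared = [g for g in parent_a if g in common]
    rest_a = [g for g in parent_a if g not in common]
    rest_b = [g for g in parent_b if g not in common]
    for i in range(len(rest_a)):
        if rng.random() >= 0.5:
            rest_a[i], rest_b[i] = rest_b[i], rest_a[i]
    return shared + rest_a, shared + rest_b


class FixedDraws:
    """Stand-in random source that replays a fixed list of draws."""

    def __init__(self, values):
        self._values = iter(values)

    def random(self):
        return next(self._values)


a = [11, 7, 1, 2, 3, 4]   # hypothetical chromosome sharing 11 and 7 with b
b = [7, 11, 5, 6, 8, 9]
print(crossover(a, b, FixedDraws([0.4, 0.8, 0.7, 0.2])))
print(crossover(a, b, random.Random(0)))
```

Because shared genes are never exchanged and unshared genes are only swapped position by position, neither offspring can contain the same candidate edge twice.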
In the mutation procedure, we randomly select a chromosome with probability P m , and for a selected element of the chromosome, we randomly choose a different sequence number to replace that element.
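The mutation step can be sketched as follows. Restricting the replacement to sequence numbers not already present in the chromosome is an assumption of this sketch (it prevents the same edge from being added twice); the function and parameter names are illustrative.

```python
import random


def mutate(chromosome, num_candidates, p_m, rng):
    """With probability p_m, replace one randomly chosen gene by an unused one."""
    child = list(chromosome)
    if rng.random() < p_m and num_candidates > len(child):
        position = rng.randrange(len(child))
        unused = [g for g in range(num_candidates) if g not in child]
        child[position] = rng.choice(unused)
    return child


rng = random.Random(0)
print(mutate([0, 3, 5], num_candidates=10, p_m=1.0, rng=rng))
print(mutate([0, 3, 5], num_candidates=10, p_m=0.0, rng=rng))  # left unchanged
```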

Local search
By incorporating prior knowledge, a local search can efficiently reduce useless explorations and speed up the convergence of genetic algorithms [53]. In this section, we employ Path Length Learning and Neighbor Learning in turn. Path Length Learning is shown in Algorithm 2. We first update the network matrix by decoding the chromosome. Then, we compute the degrees of the nodes connected by the added edges and delete the added edge whose two end nodes have the minimum degree sum. Finally, we add a new edge between the pair of nodes that has the longest path length. This learning technique helps produce a better result. In Neighbor Learning, illustrated in Algorithm 3, for each element in the offspring chromosome, we try the value of its best neighbor (a neighbor of an added edge is one of the nonexistent edges sharing one of the same nodes) to judge whether this choice produces a better result. If the new choice decreases APL, then we accept the new element, and the result improves.
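Path Length Learning can be sketched as follows. Several details are assumptions of this sketch rather than statements of Algorithm 2: degrees are measured in the updated network, the longest path is searched after the weakest added edge has been removed, ties are broken by iteration order, and no acceptance test is applied to the move. Function names are illustrative.

```python
from collections import deque


def bfs_distances(n, adj, source):
    """Hop distances from source to every node of a connected graph."""
    dist = [-1] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_length_learning(n, edges, added):
    """Swap the added edge with the lowest degree sum for the farthest node pair."""
    deg = [0] * n
    for u, v in list(edges) + list(added):
        deg[u] += 1
        deg[v] += 1
    weakest = min(added, key=lambda e: deg[e[0]] + deg[e[1]])
    kept = [e for e in added if e != weakest]
    adj = [[] for _ in range(n)]
    for u, v in list(edges) + kept:
        adj[u].append(v)
        adj[v].append(u)
    far_pair, far_dist = None, 1  # only pairs at distance >= 2 are unconnected
    for s in range(n):
        dist = bfs_distances(n, adj, s)
        for t in range(s + 1, n):
            if dist[t] > far_dist:
                far_pair, far_dist = (s, t), dist[t]
    if far_pair is None:  # every pair is already adjacent
        return list(added)
    return kept + [far_pair]


path6 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(path_length_learning(6, path6, [(0, 2), (1, 3)]))
```

In this toy case the edge (0, 2) has the lower degree sum and is replaced by an edge between the two nodes that are farthest apart.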

Algorithm 3 Neighbor Learning
Require: the chromosome X_offspring2; the number of added edges k; the adjacency matrix of the initial network M.
Ensure: X_offspring2.
1: X_offspring3 = X_offspring2;
2: Update the matrix M by decoding X_offspring3;
3: for i = 1; i ≤ k; i++ do
4:   for m = 1; m ≤ number of X_offspring3(i)'s neighbors; m++ do
5:     X_offspring3(i) = the m-th neighbor of X_offspring3(i);
6:   end for
7:   find X_offspring3(i)'s neighbor q with minimum APL(X_offspring3);
8:   if APL(X_offspring3) < APL(X_offspring2) then
9:     X_offspring2 = X_offspring3;
10:    APL(X_offspring2) = APL(X_offspring3);
11:  end if
12: end for
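A runnable sketch of Neighbor Learning is given below. It works directly on edge tuples rather than sequence numbers, and it accepts a neighbor only when the APL strictly decreases; both choices, like the function names, are assumptions of this illustration.

```python
from collections import deque


def apl(n, edges):
    """Average path length of a connected, undirected, unweighted graph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    total = 0
    for source in range(n):
        dist = [-1] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist)
    return total / (n * (n - 1))


def neighbor_learning(n, edges, added):
    """Replace each added edge by its best neighbor if that lowers the APL."""
    base = {tuple(sorted(e)) for e in edges}
    added = [tuple(sorted(e)) for e in added]
    best_apl = apl(n, list(base) + added)
    for i in range(len(added)):
        u, v = added[i]
        occupied = base | set(added)
        # Neighbors: nonexistent edges sharing one end node with added[i].
        neighbors = [tuple(sorted((a, w))) for a in (u, v) for w in range(n)
                     if w != a and tuple(sorted((a, w))) not in occupied]
        for candidate in neighbors:
            trial = added[:i] + [candidate] + added[i + 1:]
            value = apl(n, list(base) + trial)
            if value < best_apl:  # keep the best improving neighbor so far
                added, best_apl = trial, value
    return added, best_apl


path5 = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(neighbor_learning(5, path5, [(0, 3)]))
```

Here the added edge (0, 3) is moved to a neighboring position that shares node 0 and yields a lower APL.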

Complexity analysis
If we add $k$ edges to a network of size $n$, the time complexity of the proposed algorithm can be estimated as follows. At each iteration, we first carry out crossover $S_{pool}/2$ times and mutation $S_{pool}$ times, where $S_{pool}$ is the size of the mating pool for the genetic operation. Since the time complexity of computing the APL is $O(n^3)$, the genetic operation costs $O(S_{pool}(k + n^3))$. Second, in path length learning, updating the matrix costs $O(k)$, finding the added edge with the minimum node degrees costs $O(kn)$, and finding the nonexistent edge with the longest path length costs $O(n^3)$, so path length learning costs $O(n^3 + kn + k)$. Third, in neighbor learning, we consider $n$ neighbors of each added edge, and computing the APL of all those neighbors costs $O(kn^4)$. As a result, the complexity of our algorithm is $O(kn^4)$ at each iteration. If the network size $n$ is large enough, this is lower than the $O(n^5 \cdot k)$ complexity of the Greedy Algorithm.

Experiments
In this section, we test our algorithm on different computer-generated networks and real-world networks.

Computer-generated networks
The experiments were carried out in MATLAB on a machine with a 2.40 GHz CPU, 4.00 GB of memory and the Windows 10 operating system. Since the proposed algorithm is not parameter-free, we need to set values for its parameters in advance, such as the size of the population or mating pool. Some of these parameters are fixed at values that we found by trial and error in order to ensure that the proposed algorithm performs well. Table 1 shows the parameters used in the experiments. We first ran our algorithm on random networks with different network sizes, and compared it with seven other methods: (1) add k edges at one time, where each of the added edges connects the two highest degree nodes (denoted as Big-Big, which is an assortative connecting method); (2) add k edges at one time, where each of the added edges connects the highest degree node and the lowest degree node (denoted as Big-Small, which is a disassortative connecting method); (3) add k edges one by one, with each of the added edges giving the minimum APL as described in Sect. 2 (denoted as the Greedy Algorithm); (4) the ASPDM Algorithm proposed by Meyerson and Tagiku [41]; (5) the EdgeEffect Algorithm proposed by Parotsidis et al. [42]; (6) the FirstMinAPL Algorithm proposed by Gozzard et al. [47]; (7) the Path Screening Algorithm proposed by Papagelis and Manos [48]. We set k from 1 to 50. Since evolutionary algorithms have inherent randomness, many of them, including our proposed algorithm, may fall into a local minimum in some cases. We therefore ran the above-mentioned algorithms ten times.
For convenience, Table 2 lists the abbreviations that appear in the article together with their full names.
In terms of the mean APL over these ten runs, the Greedy Algorithm finds the best solutions, while the Memetic Algorithm finds better solutions than the remaining methods. In terms of the minimum APL over the ten runs, the Memetic Algorithm performs best. To keep the experimental results easy to read, we list only the minimum APL over the ten runs of each algorithm. With the number of added edges set to 10, Fig. 4 shows the optimization results for different network sizes. Compared with the other methods, our algorithm and the Greedy Algorithm consistently find lower-APL solutions across network sizes.
Then we ran our algorithm on three different networks (a random network, a regular network and a scale-free network, each of size 30) with different numbers of added edges to explore the effect of each added edge. Figure 5 shows the optimizing results for the different networks. Since the degrees of the nodes in a regular network are all equal, the results of Big-Big and Big-Small are the same; here we record just the result of Big-Big on the regular network. Since the ASPDM Algorithm adds edges from a source vertex S to other nodes, once all nodes are connected to S in the process of adding edges, no more edges can be added, as shown in Fig. 5.
Compared with the other methods, our algorithm finds lower-APL solutions, except for a few solutions of the Greedy Algorithm and the EdgeEffect Algorithm that are better than those of the Memetic Algorithm on the regular network (this is because our prior knowledge is less useful on regular networks). We conclude that in most cases, our proposed algorithm can effectively minimize APL. Further, we find that the solutions of Big-Small are close to the optimizing solutions in Fig. 5a and c. In a sense, disassortative connecting can find good solutions for minimizing APL. Moreover, we find that when the number of added edges increases to a threshold value, APL begins to decline linearly no matter which method is used. This phenomenon can be explained by the feature of Critical Diameters [54,55]. Specifically, when we add enough edges to the network, the longest path length of the network becomes 2 (i.e. the network diameter is 2). In that case, adding a new edge between any pair of unconnected nodes only changes the path length between these two nodes from 2 to 1, and thus the APL is reduced by just $\frac{2}{n(n-1)}$ at each step, which constitutes a linear decline.
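The critical-diameter argument can be checked numerically on a small example. The sketch below uses a star network, chosen here only because its diameter is 2; the function names and the 0-based labelling are assumptions of the illustration. It verifies that every possible new edge lowers the APL by the same amount, $2/(n(n-1))$.

```python
from collections import deque
from itertools import combinations


def apl(n, edges):
    """Average path length of a connected, undirected, unweighted graph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    total = 0
    for source in range(n):
        dist = [-1] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist)
    return total / (n * (n - 1))


def apl_drops_on_star(n):
    """APL reduction obtained by each possible single edge added to a star."""
    star = [(0, leaf) for leaf in range(1, n)]  # node 0 is the hub; diameter 2
    base = apl(n, star)
    return [base - apl(n, star + [pair])
            for pair in combinations(range(1, n), 2)]


n = 6
step = 2 / (n * (n - 1))
drops = apl_drops_on_star(n)
print(step, min(drops), max(drops))  # all drops coincide with the step size
```

Once the diameter is 2, no choice of edge can do better than this fixed step, so all methods decline at the same linear rate.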
In this section, we analyze the structure of the optimal networks to determine the effect of optimizing APL on different networks. The assortative index $r$ and cluster coefficient $c$ are important indices for measuring network characteristics [56]. The assortative index $r$ measures the characteristic of assortative (or disassortative) mixing, and can be formulated as:

$$r = \frac{M^{-1}\sum_i j_i k_i - \left[M^{-1}\sum_i \tfrac{1}{2}(j_i + k_i)\right]^2}{M^{-1}\sum_i \tfrac{1}{2}(j_i^2 + k_i^2) - \left[M^{-1}\sum_i \tfrac{1}{2}(j_i + k_i)\right]^2},$$

where $j_i$, $k_i$ are the degrees of the vertices at the ends of the $i$th edge, with $i = 1, \ldots, M$ and $M$ the number of edges [57].

For the random network experiments, as shown in Fig. 6, $r$ does not show a regular trend, but the optimal networks turn out to be disassortative since most $r$ values are negative, while $c$ shows an increasing trend as more edges are added to the initial network, which means that a random network gradually evolves into a small-world network. This figure shows two interesting phenomena. First, disassortative connection for adding edges may contribute more to the decrease of APL. Second, decreasing APL in random networks can simultaneously increase the cluster coefficient, which is helpful in constructing small-world networks.
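The assortative index can be computed directly from the edge list. The sketch below assumes the standard edge-based form of the degree assortativity coefficient, in which $j_i$ and $k_i$ are the end-node degrees of the $i$th edge and the sums run over all $M$ edges; the function name is illustrative. Note that the denominator vanishes when all nodes have the same degree, so $r$ is undefined for a perfectly regular network.

```python
def assortativity(n, edges):
    """Degree assortativity r computed from the end-node degrees of each edge."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    m = len(edges)
    mean_product = sum(deg[u] * deg[v] for u, v in edges) / m
    mean_half_sum = sum(0.5 * (deg[u] + deg[v]) for u, v in edges) / m
    mean_half_sq = sum(0.5 * (deg[u] ** 2 + deg[v] ** 2) for u, v in edges) / m
    denominator = mean_half_sq - mean_half_sum ** 2
    if denominator == 0:
        raise ValueError("r is undefined when all nodes have the same degree")
    return (mean_product - mean_half_sum ** 2) / denominator


star = [(0, 1), (0, 2), (0, 3)]   # hub linked only to degree-1 nodes
path = [(0, 1), (1, 2), (2, 3)]
print(assortativity(4, star))     # a star is perfectly disassortative
print(assortativity(4, path))
```

Negative values indicate that high-degree nodes tend to connect to low-degree nodes, which is the pattern observed for most of the optimal networks.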
For the regular network experiments, r and c both decline with the first few edges added as shown in Fig. 7. Thus, we can conclude that when adding a small number of edges, a disassortative connection is a good choice to decrease APL. Since a regular network initially has a high cluster coefficient, c will decrease when we add edges in a disassortative way. However, when we add a larger number of edges to a regular network, the disassortative connection becomes less effective. This conclusion is different from that for the random network.
For optimization of the scale-free network, as shown in Fig. 8, all values of r are negative, which indicates that the optimal networks are all disassortative. However, c increases as new edges are added, because adding more edges to a scale-free network, especially edges linking high-degree nodes with other nodes, has a high probability of constructing more triangles, which results in a higher c. The property of connections in scale-free networks is similar to that in random networks, and the disassortative feature of the connections is more obvious in scale-free networks. Here, the assortative index r differs greatly between the greedy algorithm and the memetic algorithm, even though both perform excellently in optimizing APL, as shown in Fig. 7a. This is because different algorithms have different edge addition strategies: even if they all give the same low APL, the edges they add may be completely different. This is confirmed by comparing Fig. 9b and c, where the added edges are not the same. Furthermore, in Fig. 5c, when the number of added edges exceeds 10, the network enters the state of critical diameter, and the changes in APL are basically similar. The difference in edge addition strategy between the two algorithms then leads to a difference in r. Specifically, when the network reaches the critical diameter, the greedy algorithm tends to add edges randomly, whereas the memetic algorithm tends to link nodes with large degree values in the scale-free network due to the mechanism of path length learning. Therefore, when the number of added edges reaches a certain level, the result of the memetic algorithm is similar to that of Big-Big, and its difference from the greedy algorithm becomes larger.

Real-world networks
After the experiments on computer-generated networks, we find that the greedy algorithm and our proposed memetic algorithm perform best. We therefore further compare the performance of these two algorithms on three real-world networks: (1) the Dolphins network, which is a social network of bottlenose dolphins [58]; (2) the Terrorists network, which contains contacts between suspected terrorists involved in the train bombing of Madrid on March 11, 2004, as reconstructed from newspapers [59]; (3) the Windsurfers network, which contains interactions between 28 Grévy's zebras in Kenya [60]. In all of these networks, we ignore the weights of the edges for convenience. Table 3 shows the results on each network. The results show that the memetic algorithm performs as well as or even better than the greedy algorithm, which demonstrates the effectiveness of the memetic algorithm on real-world networks.
We further examine how the edges added by the two algorithms differ in terms of network topology. Figure 9 shows the topology of the k = 10 added edges on the Dolphins network; Fig. 9b and c show the edges added by the greedy algorithm and the memetic algorithm, respectively. Overall, both algorithms focus on edges between big communities and on nodes of higher degree, which suggests that such a choice is indeed an effective strategy. More specifically, the memetic algorithm follows this strategy more clearly, because some of the edges in Fig. 9b lie within a community.

Fig. 9 Topology of the k = 10 added edges on the Dolphins network. a The initial network. b, c The initial edges are hidden and only the added edges are shown: b edges added by the greedy algorithm; c edges added by the memetic algorithm.

Conclusion
In this paper, we have focused on optimizing a network by adding edges to minimize its APL, and we have proposed a memetic algorithm to find optimal solutions. The experimental results show that our algorithm can minimize APL efficiently. In the analysis, we also find that APL ultimately decreases linearly in the process of adding edges, which is directly affected by the network diameter. Further, we compute the assortative index and cluster coefficient for optimal networks with different initial network structures and find that these two properties of optimal solutions may be quite different. We optimize only one indicator, APL, in this paper, but in many cases more than one feature of a network needs to be optimized, for example minimizing APL and maximizing the cluster coefficient at the same time. This multi-objective structural optimization should be explored in future work.
Funding This work is supported by the National Natural Science Foundation of China (72104194), the China Postdoctoral Science Foundation (2021M700107) and the Fundamental Research Funds for the Central Universities (SK2021003).