Software Module Clustering Using Grid-Based Many-Objective Particle Swarm Optimization

Traditional multi-objective optimization algorithms perform poorly on many-objective optimization problems, which has led to the development of a variety of many-objective optimization algorithms. Recently, several such algorithms have been proposed to address different classes of many-objective optimization problems. Most of them were designed from the perspective of synthetic many-objective optimization problems. Despite the considerable effort devoted to algorithms for synthetic problems, real-world many-objective optimization problems have received little attention. In this work, we propose a grid-based many-objective particle swarm optimization (GrMaPSO) for the many-objective software optimization problem. In this contribution, grid-based selection strategies, along with supportive strategies such as two-archive storing and crowding distance, are exploited within the framework of particle swarm optimization. The performance of the proposed approach is evaluated and compared to three existing approaches over five problem instances. The results demonstrate that the proposed approach is more effective and has significant advantages over existing many-objective approaches designed for software module clustering problems.


Introduction
The many-objective software module clustering problem (MaSMCP) is a special case of the software module clustering problem (SMCP) that involves optimizing four or more objective functions simultaneously (Mkaouer et al. 2015). SMCPs with a single objective are commonly referred to as single-objective SMCPs (SoSMCPs), and SMCPs with more than one objective are commonly referred to as multi-objective SMCPs (MoSMCPs). In real-world software engineering, SMCPs associated with purposes such as software remodularization, architecture recovery, refactoring, and restructuring often require optimization of four or more objective functions simultaneously. Therefore, the MaSMCP is an important form of optimization problem in the software engineering field and deserves special attention.
To address the different aspects of MaSMCPs, several many-objective software module clustering approaches (MaSMCAs) have been proposed (e.g., Praditwong et al. 2011; Kumari and Srinivas, 2016; Amarjeet et al. 2018). Most of the existing MaSMCAs are based on traditional multi-objective optimization approach (MoOA) frameworks such as NSGA-II (Deb et al. 2002) and multi-objective harmony search (Praditwong et al. 2011). However, traditional MoOA frameworks are well suited only to multi-objective optimization problems (MoOPs) with a limited number of objectives, typically fewer than four. These approaches generally face various difficulties with optimization problems having more than three objectives (Deb et al. 2014). In general, MoOPs with more than three objectives are commonly referred to as many-objective optimization problems (MaOPs) (Gong et al. 2016; Zhou et al. 2018).
The aforementioned many-objective optimization strategies have been widely adopted to design MaOAs for different synthetic as well as real-world science and engineering MaOPs. Examples include industrial scheduling (Sulflow et al. 2007), molecular design (Kruisselbrink et al. 2009), control system design (Herrero et al. 2009), brain-computer interfacing (Pal and Bandyopadhyay, 2016), improvement of existing package designs (Prajapati and Chhabra, 2019), and scientific workflow scheduling in cloud computing (Saeedi et al. 2020). Current developments in the domain of MaOAs provide a variety of alternatives for addressing different aspects of MaOPs, and many of them show promise.
Although research on MaOAs has achieved great success in addressing different aspects of synthetic MaOPs, their exploitation, customization, and application to real-world MaOPs still need more study. Formulating a complex real-world problem as a MaOP and designing an appropriate many-objective metaheuristic optimizer is a promising and ongoing line of research. The MaSMCP is inherently a MaOP, and its solution has a wide range of applications in the software engineering field. Despite extensive development of many-objective metaheuristic approaches, MaSMCPs have received little attention in the search-based software engineering (SBSE) community. To address the different aspects of MaSMCPs, some approaches have been proposed in the literature (Mkaouer et al. 2015; Amarjeet et al. 2018). These approaches have exploited various existing many-objective optimization strategies and have been applied successfully to various forms of MaSMCPs; still, many promising many-objective optimization strategies have not been explored in this direction. The high importance of MaSMCPs in software engineering motivates us to explore and design more effective many-objective optimization approaches for MaSMCPs by exploiting the most effective many-objective optimization frameworks and strategies.
For complex optimization problems such as discrete and multimodal MoOPs, swarm intelligence (SI) algorithms are an effective and well-suited approach (Yue et al. 2018). Particle swarm optimization (PSO) (Kennedy and Eberhart, 1995), a class of SI algorithm, has been applied widely and successfully to different discrete and multimodal MoOPs (Coello and Lechuga, 2002; Martínez and Coello, 2011; Moubayed et al. 2014; Dai et al. 2015). The strength of the PSO framework on many-objective optimization problems has also been validated in various studies (e.g., Figueiredo et al. 2016; Hu et al. 2017; Lin et al. 2018).
Recently, it has been demonstrated that the grid-based ranking strategy (Yang et al. 2013) for selecting the personal best and global best in a many-objective PSO framework is effective for discrete and multimodal MaOPs. In addition, two-archive-based external storage strategies combined with effective update techniques have been found to be a supportive component for many-objective metaheuristic algorithms.
The favourable characteristics of the PSO framework, the grid-based selection strategy, and two-archive-based external storage in solving discrete and multimodal MaOPs inspired the development of a grid-based PSO for MaSMCPs. The discrete and multimodal many-objective nature of MaSMCPs supports the suitability of the adapted strategies. Overall, in this study, various existing strategies favourable to MaSMCPs have been integrated into the PSO framework. The major contributions of this study can be summarized as follows:
• An effective many-objective metaheuristic search optimizer named grid-based many-objective particle swarm optimization (GrMaPSO) is proposed for the MaSMCP.
• To balance convergence and diversity in GrMaPSO, the procedures for determining the personal best position and selecting the global best are redefined based on grid-based fitness computation (i.e., grid ranking, grid crowding distance, and grid coordinate point distance).
• To balance exploitation and exploration, the operators used for computing the new velocity and position of particles in the swarm are redefined to suit the characteristics of MaSMCPs.
• To produce a sample of the best non-dominated solutions (i.e., a well-distributed approximation of the Pareto front) that is a good representative of all non-dominated solutions of the search space, a two-archive external non-dominated storing strategy is used.
The rest of this paper is organized as follows. Section 2 presents related works on search-based software module clustering. Section 3 explains the basic background of the concepts and strategies used in the proposed approach. Section 4 describes the proposed GrMaPSO approach. Section 5 presents the experimental setup used to evaluate GrMaPSO. Section 6 presents results and analysis. Section 7 discusses possible threats to validity and their mitigation. Section 8 concludes the paper with a discussion of future work.

Related works
Over the last two decades, many researchers in the software engineering community have exploited efficient search-based optimization algorithms, such as metaheuristic algorithms, to support the automation of large and complex SMCPs. Based on the different categories of SMCPs, we divide the related works into single-objective software module clustering approaches (SoSMCAs), multi-objective software module clustering approaches (MoSMCAs), and many-objective software module clustering approaches (MaSMCAs).

SoSMCAs:
For the SoSMCP, the work of Mancoridis et al. (1998) is considered the base study, in which the authors exploited search-based optimization to automate SMC for source code architecture extraction. They treated the SMCP as a graph partitioning problem in which a high-level module dependency graph (MDG) serves as an abstract description of the source code structure. To guide the optimization process, modularization quality (MQ) was designed and incorporated into hill-climbing and genetic algorithms as the fitness function. Motivated by Mancoridis et al. (1998), many other researchers presented further aspects of SoSMCAs. Doval et al. (1999) presented an enhanced version of the SoSMCA in which they customized the genetic algorithm with more effective strategies.
To understand the influence of different operators and parameters on the performance of SoSMCAs (Mancoridis et al. 1998; Doval et al. 1999), Mitchell and Mancoridis (2002) conducted an empirical study. Mahdavi et al. (2003) used the same optimization model (Mancoridis et al. 1998; Doval et al. 1999) for the formulation of the SMCP, but used multiple hill-climbing as the metaheuristic search optimizer. A fitness function that is robust in the presence of software uncertainty is essential in search-based optimization. To evaluate the robustness of existing fitness functions designed for SoSMCPs (i.e., MQ and EVM), Harman et al. (2005) conducted an empirical study. To increase the automation and usefulness of the different SoSMCAs, a framework named Bunch was developed. In the Bunch framework, genetic algorithm, hill-climbing, and simulated annealing are integrated as the optimization algorithms.
Apart from designing a robust fitness function, appropriate encoding also plays an important role in the SoSMCP. Praditwong (2011) introduced a Grouping Genetic Algorithm (GGA) for the SoSMCP, which uses a group encoding method to represent candidate solutions. Many SoSMCAs use direct link-based fitness. In contrast, Huang et al. (2016) utilized a similarity-based fitness evaluation approach to measure the quality of candidate solutions. Recently, some researchers have contributed to SoSMCAs by designing and tailoring novel metaheuristic search algorithms. For example, a harmony search-based SoSMCA has been designed to address SoSMCPs, and Pourasghar et al. (2020) designed a graph-based clustering approach for the same purpose.

MoSMCAs:
Recently, many researchers have presented improvements both in the multi-objective formulation of the problem and in the design of multi-objective optimization algorithms for MoSMCPs. Jalali et al. (2019) used structural and non-structural source code features to model the multi-objective formulation of the SMCP and used multi-objective metaheuristic optimization algorithms to optimize the defined objective functions. Another work used different aspects of source code information to design competing objective functions for the multi-objective formulation of the software module restructuring problem, and used NSGA-II, a multi-objective evolutionary algorithm, to optimize the defined objective functions simultaneously. Prajapati and Geem (2020) proposed harmony search-based multi-objective software module clustering for software architecture reconstruction. Prajapati and Kumar (2020) tailored the particle swarm optimization algorithm to address the MoSMCP for software remodularization.

MaSMCAs:
Many search-based software module clustering approaches are designed by considering only a few software quality criteria as objective functions. Moreover, such approaches generally treat SMCPs as MoOPs. However, in the real world, SMCPs employed for software engineering purposes such as software remodularization and architecture recovery mostly involve a large number of objectives. Therefore, such MoSMCAs cannot be effective for MaSMCPs. Mkaouer et al. (2015) are believed to be the first to treat the SMCP as a MaSMCP; they solved it by applying NSGA-III, a many-objective evolutionary algorithm, to software remodularization. Later, other researchers contributed to this direction by exploiting existing frameworks and strategies. Recently, Prajapati and Chhabra (2018) used the concept of fuzzy-Pareto dominance to design an artificial bee colony-based many-objective optimization approach for the MaSMCP. Further, the same authors (Prajapati and Chhabra, 2019) exploited the potential of the harmony search algorithm, along with several supportive strategies, to design a many-objective optimization approach for the MaSMCP.
In the past decade, a large number of many-objective optimizers have been developed to solve large and complex MaOPs (Bader and Zitzler, 2011; Yang et al. 2013; Xiang et al. 2017; Gong et al. 2018). In the last few years, many-objective optimization has become a hot topic in the field of metaheuristic search optimization (e.g., Liu et al. 2019; Sun et al. 2019; Liu et al. 2020; Pan et al. 2020). Despite this extensive development of many-objective optimizers, their application to real-world MaOPs such as MaSMCPs has received little attention. Moreover, most existing many-objective optimizers are designed with the characteristics of synthetic many-objective optimization problems in mind. Therefore, these approaches can work well on synthetic problem instances but may not work well on real-world optimization problems.
Even though there is a large opportunity for applying many-objective optimization concepts in software engineering, the topic has not gained sufficient attention from researchers in this community. Only a few works have been carried out in this direction, each covering a particular aspect of many-objective software engineering problems (e.g., Mkaouer et al. 2015; Prajapati and Chhabra, 2019). Therefore, more study is required to explore the potential of existing MaOAs in solving the various aspects of many-objective software engineering problems. However, making effective use of the strategies developed for many-objective optimization when designing a many-objective software engineering approach is a challenging task. To increase the applicability of existing optimization models and strategies of many-objective metaheuristic algorithms in software engineering, we propose a many-objective optimization framework that exploits the particle swarm optimization framework, a grid-based selection approach, and two-archive external storage.

Basic backgrounds
This section provides basic descriptions of the problem formulation and the framework and strategies exploited to design the proposed approach.

Problem formulation
Like other many-objective optimization problems (MaOPs), the many-objective software module clustering problem (MaSMCP) involves optimizing four or more objective functions simultaneously, and it can be described in the same way as a general MaOP. The mathematical description of the MaOP is as follows:

$$\begin{aligned} \text{optimize} \quad & F(x) = \big(f_1(x), f_2(x), \ldots, f_M(x)\big) \\ \text{subject to} \quad & g_j(x) \ge 0, \quad j = 1, 2, \ldots, J \\ & h_k(x) = 0, \quad k = 1, 2, \ldots, K \\ & x_i^{L} \le x_i \le x_i^{U}, \quad i = 1, 2, \ldots, n \end{aligned}$$

In the context of the MaSMCP, the objective functions $f_1(x), f_2(x), \ldots, f_M(x)$ can be viewed as the clustering quality criteria, and the terms $g_j(x)$ and $h_k(x)$ denote the clustering inequality and equality constraints, respectively.
The terms $x_i^{L}$ and $x_i^{U}$ are the lower and upper bounds of the i-th decision variable.
To define the decision variables for MaSMCPs, the problem can be encoded using an integer-based representation. The integer-based representation for an object-oriented software system (especially a Java-based system) is presented in Fig. 1. To demonstrate integer-based software module clustering, we consider an artificial object-oriented software system consisting of eight source code classes, i.e., c1, c2, c3, c4, c5, c6, c7, and c8. These eight classes are connected by some relationships. Now suppose we generate a module clustering solution by randomly partitioning these eight classes into three clusters/packages, i.e., m1, m2, and m3. For MaSMCPs, previous researchers (e.g., Praditwong et al. 2011; Amarjeet et al. 2018) have used various software quality criteria as objective functions. In our proposed approach, we consider the following two objective models for the MaSMCP.

• Extended Equal-size Cluster Approach (E-ECA):
The E-ECA objective model includes the following module clustering criteria: 1) difference between the maximum and minimum number of modules in a cluster (minimizing), 2) number of clusters (maximizing), 3) sum of inter-edges of all clusters (minimizing), 4) sum of intra-edges of all clusters (maximizing), 5) modularization quality (MQ) (maximizing), 6) average shortest path length between a source and all other reachable clusters (minimizing), 7) cluster cyclic dependencies (minimizing).
The software quality criteria used in the above two objective models have been widely used in the literature on multi-objective software module clustering. Therefore, we also use these objective models in the many-objective formulation of the software module clustering problem.
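To make the integer encoding and the first five E-ECA criteria concrete, the following minimal sketch evaluates one clustering of the eight-class example. The edge list and the class-to-cluster assignment are hypothetical (the source does not list them), crossing edges are counted once in the inter-edge total, and MQ is computed as the sum of per-cluster factors intra / (intra + inter / 2), which is an assumed definition since this section does not define MQ.

```python
from collections import Counter

# Hypothetical directed dependencies among classes c1..c8 (illustrative only).
EDGES = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (3, 4), (6, 7), (7, 8), (8, 6)]

# Integer encoding: element i holds the cluster index (m1..m3) of class c(i+1).
# This assignment is an assumption made for the example.
SOLUTION = [1, 1, 1, 2, 2, 3, 3, 3]


def evaluate(solution, edges):
    """Return a few E-ECA style criteria for one integer-encoded clustering."""
    sizes = Counter(solution)   # number of classes per cluster
    intra = Counter()           # edges with both ends inside the same cluster
    inter = Counter()           # crossing edges, counted for each touched cluster
    for src, dst in edges:
        a, b = solution[src - 1], solution[dst - 1]
        if a == b:
            intra[a] += 1
        else:
            inter[a] += 1
            inter[b] += 1
    # Assumed MQ: sum of per-cluster factors; clusters without intra-edges add 0.
    mq = sum(intra[k] / (intra[k] + 0.5 * inter[k]) for k in sizes if intra[k] > 0)
    return {
        "size_difference": max(sizes.values()) - min(sizes.values()),
        "num_clusters": len(sizes),
        "inter_edges": sum(inter.values()) // 2,  # each crossing edge was counted twice
        "intra_edges": sum(intra.values()),
        "mq": mq,
    }


if __name__ == "__main__":
    print(evaluate(SOLUTION, EDGES))
```

For this toy input the size difference is 1, there are 3 clusters, 2 inter-edges, and 7 intra-edges. The remaining criteria (average shortest path length and cyclic dependencies) would need a cluster-level graph and are left out of the sketch.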

Particle swarm optimization
Particle swarm optimization (PSO) (Kennedy and Eberhart, 1995) is an effective search-based metaheuristic optimizer. Its working model is inspired by the behaviour of bird flocking and fish schooling. The PSO framework is highly adaptable and has been customized by researchers to address different types of continuous and non-continuous complex science and engineering problems. An individual bird or fish of a swarm is modelled as a particle moving in the search space, characterized by a velocity and a position. To reach a destination, the particles use their personal experience as well as the experience of the swarm, and every particle continuously changes its velocity. The standard velocity and position update rule is as follows:

$$v_i(t+1) = w\, v_i(t) + c_1 r_1 \big(pbest_i - x_i(t)\big) + c_2 r_2 \big(gbest - x_i(t)\big)$$

$$x_i(t+1) = x_i(t) + v_i(t+1)$$

where $w$ represents the inertia applied to restrain the current speed of the particle. The terms $v_i$ and $x_i$ represent the velocity and position of particle i. The symbol $pbest_i$ is the personal best position of the i-th particle and $gbest$ is the global best of the swarm. The constants $c_1$ and $c_2$ are the learning factors of the personal experience and social experience components, respectively, and $r_1$ and $r_2$ are random numbers drawn uniformly from [0, 1]. By changing their velocity and position, the particles exploit and explore the search space and finally reach the expected position, i.e., the optimal solution of the optimization problem.
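The standard update rule can be sketched for one particle as follows. This is the continuous textbook form, not the redefined discrete operators that GrMaPSO uses for the MaSMCP; the function name `pso_step` and the default values of `w`, `c1`, and `c2` are illustrative assumptions.

```python
import random


def pso_step(position, velocity, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """Apply one standard PSO velocity and position update to a single particle.

    All vectors are lists of floats of equal length. The defaults for w, c1,
    and c2 are illustrative, not values taken from the paper.
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        r1, r2 = rng.random(), rng.random()  # fresh random factors per dimension
        v_next = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


if __name__ == "__main__":
    pos, vel = pso_step([0.0, 4.0], [0.1, -0.1], pbest=[1.0, 3.0], gbest=[2.0, 2.0])
    print(pos, vel)
```

When a particle already sits on both its personal best and the global best, the two attraction terms vanish and only the inertia term moves it, which is a quick sanity check on any implementation.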

External archives
In the design of multi/many-objective optimization frameworks, most approaches exploit an external archive to store non-dominated solutions and to guide the optimization process (e.g., Praditwong and Yao, 2006). The size of the archives, the number of archives, and the archive management scheme vary from approach to approach and with the problem context. Effective use of the external archive can help the optimization approach approximate the Pareto front well. Many approaches use a fixed-size single external archive to store the non-dominated solutions found in every generation of the optimization algorithm. However, using multiple variable-size external archives to store the non-dominated solutions and guide the optimization process has been found more effective in the design of multi/many-objective optimization frameworks. Praditwong and Yao (2006) proposed the concept of two variable-size external archives for this purpose, namely the convergence archive (CA) and the divergence archive (DA). In both archives, non-dominated solutions collected from every generation of the many-objective optimizer are stored according to their updating rules. The CA helps in achieving a better approximation of the Pareto front, whereas the DA helps in distributing the optimal points evenly along the Pareto front. Inspired by the effectiveness of the two variable-size external archives, we exploit these concepts to design our proposed many-objective software module clustering approach.
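The updating rules are not spelled out in this section, so the following is only a minimal sketch under an assumed rule in the spirit of the two-archive idea: a new non-dominated candidate that dominates at least one archived member goes to the CA, while a candidate that is merely incomparable to all members goes to the DA. All objectives are assumed to be minimized, and size-based pruning is omitted.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_archives(ca, da, candidate):
    """Insert one candidate objective vector and return the new (ca, da) lists.

    Assumed rule: dominated or duplicate candidates are discarded; a candidate
    that dominates some member replaces those members and joins CA; otherwise
    it joins DA. The input lists are not modified.
    """
    members = ca + da
    if candidate in members or any(dominates(m, candidate) for m in members):
        return list(ca), list(da)
    beaten = [m for m in members if dominates(candidate, m)]
    new_ca = [m for m in ca if m not in beaten]
    new_da = [m for m in da if m not in beaten]
    if beaten:
        new_ca.append(candidate)   # improved on existing members: convergence
    else:
        new_da.append(candidate)   # incomparable to every member: spread
    return new_ca, new_da


if __name__ == "__main__":
    ca, da = [], []
    for point in [(2, 2), (1, 3), (1, 1), (3, 3)]:
        ca, da = update_archives(ca, da, point)
    print("CA:", ca, "DA:", da)
```

In a full optimizer a pruning step would follow whenever the combined archive grows beyond a chosen limit; which members to drop is a design decision that this sketch deliberately leaves open.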

Grid-based fitness evaluation
The main goal of every multi-objective search optimizer is to produce a good approximation of the Pareto front, commonly understood as a set of solutions that is close to the true Pareto front and evenly distributed along it. To produce such an approximation, multi-objective search optimizers employ various strategies during the optimization process to maintain diversity and convergence among the candidate solutions. The fitness evaluation of the candidate solutions and its use in mating or environmental selection strongly affect the diversity and convergence of the final results. In the context of MaOPs, multi-objective metaheuristic search optimizers based on traditional fitness evaluation and selection methods generally fail to generate a well-distributed and close approximation of the Pareto front.
A variety of fitness evaluation strategies have been designed for the selection task in many-objective optimization. However, most of them promote either diversity or convergence, not both simultaneously, which makes it difficult to balance the two. The grid-based fitness evaluation method (Yang et al. 2013) incorporates both convergence and diversity information when assessing a candidate solution. Hence, the grid-based method can be a good alternative for selecting candidate solutions in many-objective optimization.
However, it has received little attention in the many-objective optimization literature. In this article, we adapt the grid-based selection strategy of Yang et al. (2013) for fitness evaluation. To assign fitness to all individuals, three grid-based criteria are used, namely grid ranking (GR), grid crowding distance (GCD), and grid coordinate point distance (GCPD). The definitions of GR, GCD, and GCPD are based on the concepts of grid-dominance and grid-difference, which in turn are defined in terms of a grid frame.
The definitions of all these concepts are described as follows.

Grid setting: In grid-based optimization techniques, the grid is viewed as a frame which is used to determine the location of the candidate solutions of a population or swarm (in the case of PSO) in their objective space. The size of the grid depends on the current candidate solutions of the particular swarm; hence, it varies from swarm to swarm. The grid lower boundary ($LB_m$) and upper boundary ($UB_m$) of the m-th objective for a swarm S are determined as follows:

$$LB_m = \min_m(S) - \frac{\max_m(S) - \min_m(S)}{2 \cdot div}, \qquad UB_m = \max_m(S) + \frac{\max_m(S) - \min_m(S)}{2 \cdot div}$$

where $\min_m(S)$ and $\max_m(S)$ are the minimum and maximum values of the m-th objective among the candidate solutions of S, and $div$ is the number of grid divisions in each objective (Yang et al. 2013). The width of a hyperbox in the m-th objective is $d_m = (UB_m - LB_m)/div$.

Grid location of an individual: After determining the lower boundary, the upper boundary, and the grid divisions with their width, we can determine the grid location of a candidate solution s in the swarm. The grid location or grid coordinate of s corresponding to the m-th objective is computed as follows:

$$GL_m(s) = \mathrm{Floor}\left[\frac{f_m(s) - LB_m}{d_m}\right]$$

where Floor[.] is the floor function and $f_m(s)$ denotes the actual objective value corresponding to the m-th objective function.

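Grid-dominance: Let p, q ∈ S. Solution p grid-dominates solution q, written $p \prec_{grid} q$, if and only if

$$\forall m \in \{1, \ldots, M\}: GL_m(p) \le GL_m(q) \quad \text{and} \quad \exists m \in \{1, \ldots, M\}: GL_m(p) < GL_m(q)$$

where M is the number of objective functions, $GL_m(\cdot)$ is the grid location in the m-th objective, and the grid is constructed for the individual candidates of swarm S.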
Grid difference: Grid-difference is used to determine the difference between two candidate solutions based on their grid coordinates. Let p, q ∈ S; the grid-difference between solutions p and q is defined as follows:

$$GD(p, q) = \sum_{m=1}^{M} \left| GL_m(p) - GL_m(q) \right|$$

Grid ranking: The grid ranking (GR) is used to determine the rank of an individual of the swarm. The GR is defined as the aggregation of the individual's grid coordinates over all objectives:

$$GR(s) = \sum_{m=1}^{M} GL_m(s)$$

where $GL_m(s)$ represents the grid location of candidate solution s in the m-th objective and M denotes the number of objective functions.
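Grid crowding distance (GCD): The grid crowding distance is used to estimate the density of a candidate solution with respect to the distribution of its neighbors. The GCD of an individual p is defined as follows:

$$GCD(p) = \sum_{q \in N(p)} \big(M - GD(p, q)\big)$$

where M denotes the number of objectives, $GD(p, q)$ is the grid-difference between p and q, and $N(p)$ represents the set of neighbors of p, i.e., the solutions q whose grid-difference from p is smaller than M (Yang et al. 2013).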
Grid coordinate point distance (GCPD): The GCPD computes the normalized Euclidean distance between a candidate solution in the objective space and the utopia point of its hyperbox. The GCPD of an individual p of swarm S is defined as follows:

$$GCPD(p) = \sqrt{\sum_{m=1}^{M} \left( \frac{f_m(p) - \big(LB_m + GL_m(p) \cdot d_m\big)}{d_m} \right)^{2}}$$

where $f_m(p)$ and $GL_m(p)$ represent the actual objective value of candidate solution p and its grid coordinate, respectively, in the m-th objective, and $d_m$ and $LB_m$ denote the width of the hyperbox and the lower boundary of the grid, respectively, for the m-th objective.
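The grid-based criteria can be tried out on a small swarm with the sketch below. It assumes all objectives are minimized and follows the grid construction of Yang et al. (2013) as understood here (boundaries padded by half a hyperbox, neighbors defined by a grid-difference smaller than M); the function names and the number of divisions `div` are illustrative choices, not part of the source.

```python
import math


def grid_setup(objs, div):
    """Return per-objective lower boundaries and hyperbox widths for a swarm."""
    lower, width = [], []
    for m in range(len(objs[0])):
        lo = min(o[m] for o in objs)
        hi = max(o[m] for o in objs)
        margin = (hi - lo) / (2 * div)       # pad the grid on both sides
        lb, ub = lo - margin, hi + margin
        lower.append(lb)
        width.append((ub - lb) / div or 1.0)  # guard: objective with no spread
    return lower, width


def grid_location(obj, lower, width):
    """Grid coordinates GL_m of one objective vector."""
    return [math.floor((f - lb) / d) for f, lb, d in zip(obj, lower, width)]


def grid_difference(gp, gq):
    return sum(abs(a - b) for a, b in zip(gp, gq))


def grid_ranking(g):
    return sum(g)


def grid_crowding_distance(index, locations):
    """Sum of (M - GD) over neighbors, i.e. solutions with GD smaller than M."""
    m_count = len(locations[index])
    total = 0
    for j, gq in enumerate(locations):
        if j != index:
            gd = grid_difference(locations[index], gq)
            if gd < m_count:
                total += m_count - gd
    return total


def gcpd(obj, g, lower, width):
    """Normalized distance from a solution to the best corner of its hyperbox."""
    return math.sqrt(sum(((f - (lb + gm * d)) / d) ** 2
                         for f, gm, lb, d in zip(obj, g, lower, width)))


if __name__ == "__main__":
    swarm = [(0.0, 4.0), (2.0, 2.0), (4.0, 0.0)]   # toy two-objective values
    lower, width = grid_setup(swarm, div=2)
    locs = [grid_location(o, lower, width) for o in swarm]
    for i, (o, g) in enumerate(zip(swarm, locs)):
        print(o, g, grid_ranking(g), grid_crowding_distance(i, locs),
              round(gcpd(o, g, lower, width), 3))
```

Smaller GR and GCPD values indicate better convergence, while a smaller GCD indicates a less crowded region, so a selection rule can compare the three values in that order of priority or combine them as the optimizer requires.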

Framework of the proposed work
The framework of the proposed work is primarily divided into two major components: 1) encoding of the problem, and 2) application of GrMaPSO. The first part includes the extraction of software entities and their dependencies, the formation of the MDG, the encoding of the problem as a candidate solution, and the representation of objective vectors.
The second part consists of the design of the various parts of GrMaPSO. An abstract description of the proposed approach is depicted in Fig. 2, and detailed working descriptions of each component are given in the subsequent sub-sections.

Construction of problem optimization model
In metaheuristic optimization, the problem to be optimized must be represented or encoded in a suitable format so that the operators of the algorithm can be applied effectively. To develop an optimization model that serves as input to a metaheuristic algorithm, the essential components of the optimization problem must be defined, namely the decision variables, their ranges, the constraints, and the objective functions.
The proposed approach focuses on clustering the source code classes of an object-oriented system into packages; therefore, we define the classes as entities and the method calls, inheritance, etc. between classes as relationships.
To extract this information from the source code of object-oriented software systems, especially those developed in the Java programming language, we use the PF-CDA and Structure 101 tools. After extracting the classes and their relationships from the source code, a class dependency graph (CDG) is formed. In the CDG, classes are denoted by nodes and class relationships by edges. Consequently, the overall software system to be clustered is transformed into a higher-level abstraction, i.e., the class dependency graph. The encoding of the problem as a candidate solution and the objective model are discussed in Section 3.1.
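As a small illustration of the CDG, the sketch below builds a weighted adjacency map from a list of extracted relationships. The class names and the tuple format are hypothetical and do not reflect the output format of PF-CDA or Structure 101; counting parallel relationships as an edge weight is likewise an assumption.

```python
from collections import defaultdict

# Hypothetical extracted relationships: (source class, target class, kind).
RELATIONS = [
    ("Order", "Customer", "method_call"),
    ("Order", "Item", "method_call"),
    ("SpecialOrder", "Order", "inheritance"),
    ("Order", "Customer", "field_reference"),
]


def build_cdg(relations):
    """Return {class: {dependent-on class: weight}} for the given relationships.

    The weight is the number of relationships between the two classes
    (an assumed weighting). Self-references are ignored.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst, _kind in relations:
        if src != dst:
            graph[src][dst] += 1
        _ = graph[dst]  # make sure the target also appears as a node
    return {node: dict(targets) for node, targets in graph.items()}


if __name__ == "__main__":
    for node, targets in build_cdg(RELATIONS).items():
        print(node, "->", targets)
```

Mapping each class name to an index 1..n then yields the node numbering that the integer-based candidate solution refers to.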

GrMaPSO
Many variants of the PSO algorithm have been designed for different types of science and engineering optimization problems (e.g., single-, multi-, or many-objective, continuous, or discrete optimization). It is well known that a metaheuristic framework and its associated strategies designed for a specific class of optimization problems may not work well on other types of problems, because each variant has its own merits and limitations. However, it is possible to design an effective approach by customizing an existing metaheuristic framework with strategies suited to a particular optimization problem. The MaSMCP is a special kind of discrete combinatorial optimization problem that needs special care when designing the metaheuristic search optimizer.
A new GrMaPSO for the MaSMCP is proposed by exploiting the existing PSO framework and various suitable selection and removal strategies. The proposed GrMaPSO has two main features: 1) a grid-based selection strategy for the personal best and global best positions, which helps guide GrMaPSO towards the Pareto front, and 2) the external archives CA and DA with effective updating and pruning strategies, which help GrMaPSO produce a well-diversified Pareto front. The proposed GrMaPSO differs from existing many-objective software module clustering approaches in how the personal best and global best positions are determined and in its archiving strategy. In this approach, every aspect of the SMCP relevant to many-objective optimization has been considered while exploiting and redesigning the existing many-objective strategies.
The general framework of the proposed approach is similar to that of most PSO-based multi-objective optimization algorithms, i.e., initialization of the swarm, update and maintenance of the external archive, update of the personal best and global best positions, update of the current velocity and position of each particle, and generation of the swarm for the next generation. In GrMaPSO, the positions and velocities of the particles are first generated randomly to form an initial swarm. Then, the objectives of each particle of the swarm are computed, and the external archives CA and DA are updated and pruned if required. Next, the personal best and global best positions are updated according to the adapted strategies. Finally, the velocities and positions of the particles of the current swarm are updated for the next iteration. The major steps involved in the proposed GrMaPSO are provided in Algorithm 1.

Initialization of particle's position and velocity:
The proposed grid-based PSO begins with the initialization of the swarm, i.e., the initialization of the position and velocity of each particle in the swarm. This initialization strongly affects the speed at which the solutions converge towards optimal solutions. It is commonly known that initial particle positions that are uniformly distributed in the search space lead the optimization process towards the optimal front effectively. In this study, we use a random strategy to initialize the position and velocity of the particles in the swarm. To initialize the position vector, the value at each index of the solution vector is selected randomly from its range, i.e., between 1 and n (the number of software entities). To initialize the velocity vector, the value at each index is selected randomly as either 0 or 1.
Finally, a single best solution is returned, which is considered as the global best position. The detailed procedure for selecting the global best position is provided in Algorithm 4.
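The random initialization described above can be sketched as follows. The function name `init_swarm`, the dictionary layout of a particle, and the optional `seed` parameter are illustrative assumptions.

```python
import random


def init_swarm(num_particles, num_entities, seed=None):
    """Create a random initial swarm for a system with num_entities classes.

    Each position element is a cluster index drawn from 1..n (n = number of
    software entities) and each velocity element is 0 or 1, as in the text.
    """
    rng = random.Random(seed)
    swarm = []
    for _ in range(num_particles):
        position = [rng.randint(1, num_entities) for _ in range(num_entities)]
        velocity = [rng.randint(0, 1) for _ in range(num_entities)]
        swarm.append({"position": position, "velocity": velocity})
    return swarm


if __name__ == "__main__":
    for particle in init_swarm(num_particles=3, num_entities=8, seed=42):
        print(particle["position"], particle["velocity"])
```

Passing a seed makes runs repeatable, which is convenient when comparing optimizers over several independent runs.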

Experimental design
This section presents the experimental setup designed for the proposed GrMaPSO. This experimental setup includes a selection of test problems, competitor algorithms, result collecting procedure, and statistical test.

Test problems
The test problems include a variety of software projects of different sizes and characteristics. The proposed approach can be applied to any type of software project, but this study focuses mainly on software projects developed in the object-oriented programming paradigm. The selected software projects are Java Servlet API, JUnit, XML API DOM, JavaCC, DOM4J, JHotDraw, and JFreeChart. Brief information about these software projects is provided in Table 1. The main reason for selecting these projects to evaluate the proposed approach is that they are widely used in different applications. Additionally, the search-based software engineering research community has also used these projects as test problems to evaluate similar research methods.

Competitor approaches
In the field of search-based software engineering, various metaheuristic search optimizers have been designed and developed by tailoring or customizing traditional metaheuristic algorithms to solve different aspects of software engineering problems. As the proposed approach targets the many-objective optimization aspect of the SMCP, we have selected for comparison only those metaheuristic search optimizers from search-based software engineering that are based on many-objective optimization concepts. The selected existing approaches are briefly described as follows:
• Two-archive Pareto optimal genetic algorithm (TA-PGA): The TA-PGA is a genetic algorithm-based software module clustering approach in which the external archive concept is exploited within the traditional multi-objective genetic algorithm.
• NSGA-III-based software remodularization (NBSR) (Mkaouer et al. 2015): This approach was designed to address the many-objective software remodularization problem. In this method, the traditional NSGA-III metaheuristic search optimizer is tailored to optimize the software modularization and thereby improve its quality.
• Fuzzy Pareto-Dominance Driven Artificial Bee Colony (FP-ABC) (Amarjeet et al. 2018): This approach was specially designed to address the many-objective software clustering problem. In this approach, a fuzzy-Pareto dominance selection strategy is introduced into the traditional artificial bee colony algorithm.
The main reason for selecting the above many-objective search optimizers is that they are directly related to the proposed approach and were designed to address many-objective software engineering problems.

Parameter settings
Metaheuristic search optimizers generally have many parameters, and the values of these parameters strongly influence the final results. Hence, setting the parameters of these algorithms for a specific problem is a challenging task. Various tuning methods are used to determine the most suitable parameter values of a metaheuristic search optimizer for a specific problem. For the proposed approach, we used a trial-and-error approach to determine the parameter values. For the existing approaches, we used the same parameter values as used by their designers. The parameter values of the different optimizers are provided in Table 2.

Collecting results and statistical tests
Metaheuristic search optimizers are not deterministic. The various randomized components in their design prevent them from behaving deterministically. Because of this stochastic nature, a metaheuristic search optimizer may not generate the same output when it is executed multiple times on the same problem input. In this situation, it becomes difficult to draw conclusions from the collected output.
To overcome this problem, researchers in the metaheuristic search community generally suggest the use of statistical tests. Another challenge is that many-objective metaheuristic search optimizers do not produce a single best solution but a set of non-dominated solutions (i.e., a Pareto set). Selecting a single solution that exhibits the best trade-off among all objective functions is therefore another challenging task. In this study, we use the trade-off worthiness metric (Rachmawati and Srinivasan, 2009), a more appropriate approach for the selection of the best solution. We run each metaheuristic search optimizer 31 times, select the best solution of each run using the trade-off worthiness metric, and collect the resulting 31 best solutions as a sample for the statistical test. For the statistical test, we use the Mann-Whitney U test at a 95% confidence level (α = 0.05).
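To make the statistical comparison concrete, the following dependency-free Python sketch computes a two-sided Mann-Whitney U test for two samples, such as the 31 best values obtained by two optimizers on one project. It uses the normal approximation with tie correction and no continuity correction. This is an assumption: the text does not state which variant or tool was used, and in practice a statistics library would normally be preferred. The function name `mann_whitney_u` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic of x, two-sided p-value) via the normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and all receive their average rank.
        mid_rank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all observations are identical
    z = (u1 - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return u1, p_value
```

A difference between two optimizers would then be reported as significant at the 95% confidence level when the returned p-value is below α = 0.05.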

Assessment criteria
Coupling and cohesion are two important design quality metrics for measuring the quality of the modular design of a software system. Apart from coupling and cohesion, modularization quality (MQ) is another widely used design quality measure in software module clustering. Along with these design quality measures, we also use a criterion that measures the quality of the Pareto front generated by the metaheuristic search optimizers. These quality metrics are briefly described below:
• Coupling and Cohesion Assessment Criterion: A software system with low coupling and high cohesion is considered a better software design. Coupling measures the degree of dependency between software entities in different modules, and cohesion measures the degree of dependency between software entities within the same module.
• MQ Value as Assessment Criterion: The MQ measures the trade-off between the coupling and cohesion of the software design. The MQ rewards cohesion and penalizes coupling. Hence, a software system with a higher MQ value is considered a better design than a software system with a lower MQ value.
• Pareto Optimality as Assessment Criterion: Metaheuristic search optimizers produce their results in the form of a set of non-dominated solutions, commonly known as the obtained Pareto front. An obtained Pareto front that is uniformly distributed and close to the true Pareto front is considered a good Pareto front. The inverted generational distance (IGD) is considered an effective metric for measuring both the diversity and the convergence of the obtained Pareto front.
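The design quality criteria listed above can be illustrated with a short Python sketch over a weighted dependency graph. The exact definitions used in this study are not given in this section, so the sketch makes the following assumptions: coupling is the total weight of inter-module dependencies, cohesion is the total weight of intra-module dependencies, and MQ follows the TurboMQ form from the Bunch literature, MQ = Σ CF_i with CF_i = 2μ_i / (2μ_i + Σ_j (ε_ij + ε_ji)), where μ_i is the intra-module weight of module i and ε_ij is the weight of dependencies from module i to module j. The function name `design_quality` and the example graph are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple

Edge = Tuple[Hashable, Hashable]


def design_quality(edges: Dict[Edge, float],
                   module_of: Dict[Hashable, int]) -> Tuple[float, float, float]:
    """Return (coupling, cohesion, MQ) for one module assignment."""
    intra = defaultdict(float)  # mu_i: weight of edges inside module i
    inter = defaultdict(float)  # sum over j of (eps_ij + eps_ji) for module i
    coupling = 0.0
    cohesion = 0.0
    for (src, dst), weight in edges.items():
        m_src, m_dst = module_of[src], module_of[dst]
        if m_src == m_dst:
            intra[m_src] += weight
            cohesion += weight
        else:
            inter[m_src] += weight
            inter[m_dst] += weight
            coupling += weight
    mq = 0.0
    for module in set(module_of.values()):
        if intra[module] > 0:  # CF_i is 0 when the module has no internal edges
            mq += 2 * intra[module] / (2 * intra[module] + inter[module])
    return coupling, cohesion, mq


# Hypothetical example: four entities in two modules, chained a -> b -> c -> d.
example_edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0}
example_modules = {"a": 1, "b": 1, "c": 2, "d": 2}
coupling, cohesion, mq = design_quality(example_edges, example_modules)
```

In the example, one edge crosses the module boundary and two edges stay inside a module, so each module has a cluster factor of 2/3.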

Results and discussion
This section presents the results of the experimental setup designed for the proposed GrMaPSO approach and the existing approaches. The performance of GrMaPSO is compared with that of three existing many-objective software engineering approaches, namely FP-ABC, TA-PGA, and NBSR, to examine their performance on the MaSMCPs. The coupling, cohesion, MQ, and IGD results of each algorithm are collected according to the method discussed in Section 5.4 and systematically analysed with the Mann-Whitney U test to validate the superiority of the proposed approach over the existing approaches. The significance of the differences between the results of the algorithms is determined at the 95% confidence level (for the coupling, cohesion, and MQ performance metrics) and at the 99% confidence level (for the IGD performance metric).
In each table, the coupling, cohesion, MQ, and IGD results of the existing FP-ABC, TA-PGA, and NBSR approaches are annotated relative to the proposed approach as follows: 1) if the existing approach performs significantly worse than the proposed approach, the symbol "[-]" is attached to the result of the existing approach; 2) if the existing approach performs significantly better than the proposed approach, the symbol "[+]" is attached to the result of the existing approach; 3) if there is no significant difference between the proposed approach and the existing approach, the symbol "[≈]" is attached to the result of the existing approach.
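The annotation rule above can be written as a small helper. The sketch assumes that the direction of a significant difference is judged by comparing the sample medians of the two approaches, which the text does not specify; the function name `annotate` and its parameters are hypothetical. The flag `higher_is_better` would be true for cohesion and MQ and false for coupling and IGD.

```python
def annotate(p_value: float, existing_median: float, proposed_median: float,
             higher_is_better: bool, alpha: float = 0.05) -> str:
    """Return the symbol attached to the result of an existing approach."""
    # No significant difference (or no difference in direction): "[≈]".
    if p_value >= alpha or existing_median == proposed_median:
        return "[≈]"
    if higher_is_better:
        existing_is_better = existing_median > proposed_median
    else:
        existing_is_better = existing_median < proposed_median
    return "[+]" if existing_is_better else "[-]"
```

With the confidence levels stated above, `alpha` would be 0.05 for coupling, cohesion, and MQ, and 0.01 for IGD.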

Coupling as assessment criterion
The coupling results obtained by the proposed and existing approaches with the E-MCA and E-ECA formulations on each problem instance are presented in Table 3 and Table 4, respectively. The coupling results obtained with the E-MCA formulation (Table 3) show that the proposed approach performs significantly better than the existing approaches in most cases. Compared with FP-ABC, the proposed approach produces better coupling in six out of seven cases, of which four are significantly better.
Compared with TA-PGA, the proposed approach produces better coupling in all seven cases, of which five are significantly better. The proposed approach also outperforms NBSR in all seven cases, of which six are significantly better. Similarly, the coupling results achieved with the E-ECA many-objective software module clustering formulation show that the proposed approach performs significantly better than the existing approaches in most cases. Overall, the coupling results presented in Tables 3 and 4 validate that the proposed approach is able to generate module clustering solutions with better coupling values than the existing approaches.

Cohesion as assessment criterion
The cohesion results of the proposed and existing approaches, evaluated over seven software projects with the E-MCA and E-ECA many-objective software module clustering formulations, are given in Tables 5 and 6, respectively. The results in both tables show that the existing approaches, i.e., FP-ABC, TA-PGA, and NBSR, perform significantly worse than the proposed approach in most cases.
The cohesion results achieved with the E-MCA formulation show that the proposed approach performs significantly better than FP-ABC, TA-PGA, and NBSR in three, five, and five out of seven cases, respectively. The cohesion results achieved with the E-ECA formulation show that the proposed approach performs significantly better than FP-ABC, TA-PGA, and NBSR in four, four, and five out of seven cases, respectively. There is only one case, JavaCC against TA-PGA, in which the proposed approach performs worse.

MQ as assessment criterion
The MQ results achieved by the proposed and existing approaches with the E-MCA and E-ECA many-objective software module clustering formulations are provided in Tables 7 and 8, respectively. As with the coupling and cohesion results, the proposed approach also performs better than the existing approaches in terms of the MQ metric. The MQ results presented in

Pareto optimality as assessment criterion
To test the quality of the approximation sets achieved by the proposed and existing many-objective optimizers, we use the IGD metric as the Pareto optimality assessment criterion. In this section, we compare the proposed approach with the existing approaches in terms of how well each many-objective algorithm produces good approximations to the Pareto front. To achieve well-diversified and well-converged approximations to the Pareto front, we have exploited the grid-based approach and integrated it into the PSO algorithm. Here, the grid-based selection strategy is based on both convergence and diversity information. Therefore, the proposed approach is expected to perform better than the existing approaches in producing good approximations to the Pareto front.
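For reference, a minimal Python sketch of the IGD computation is given here. It uses the common arithmetic-mean form, in which IGD is the average Euclidean distance from each point of a reference front to its nearest point in the obtained front, so lower values are better. How the reference front is constructed for the software projects is not stated in this section, so the sketch takes it as an input; the function name `igd` is hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def igd(reference_front: Sequence[Point], obtained_front: Sequence[Point]) -> float:
    """Mean distance from each reference point to its closest obtained point."""
    if not reference_front or not obtained_front:
        raise ValueError("both fronts must contain at least one point")
    total = sum(min(math.dist(r, a) for a in obtained_front)
                for r in reference_front)
    return total / len(reference_front)
```

An obtained front that covers the whole reference front closely yields a small IGD, whereas a front that is far away or that misses a region of the reference front yields a larger value, which is why the metric reflects both convergence and diversity.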
The IGD values of the proposed and existing software module clustering approaches for the E-MCA and E-ECA many-objective software module clustering formulations are provided in Tables 9 and 10, respectively. The statistical results given in brackets for all existing approaches show that the proposed approach outperforms the existing approaches in most cases. For example, the IGD results of the proposed approach and FP-ABC for the E-MCA formulation show that the proposed approach performs significantly better in four out of seven cases. Similarly, the other comparative results presented in Tables 9 and 10 show that the proposed approach can achieve significantly better IGD values than the rest of the existing approaches. Overall, the results presented in Sections 5.1 to 5.4 for the coupling, cohesion, MQ, and IGD performance metrics demonstrate that the proposed many-objective metaheuristic optimizer is able to produce good module clustering solutions and, at the same time, outperform the existing approaches. The balance between exploitation and exploration, together with a fitness evaluation defined in terms of both convergence and diversity information, helps the proposed approach to generate these results.

Threats to validity
In this section, various threats that can affect the validity of the obtained results are discussed. Several factors can influence the validity of the results. These factors can be broadly divided into two main categories: external and internal threats to validity.
External validity concerns how well the results generalize to other sets of test problems. In software engineering, there is a wide range of software projects developed in different languages. Hence, validating the proposed approach over different types of software projects is an important point. In our approach, this threat to validity is mitigated by using an abstract description (dependency graph) of the software system as input. It is possible that a large number of software systems can be mapped to a single dependency graph. To cover a diverse set of dependency graphs as input, we have selected a diverse set of open-source software systems.
Internal validity concerns the experimental treatments that affect the final results of the algorithms. In this study, various quality measures such as coupling, cohesion, MQ, and IGD have been used. The information used in computing the coupling, cohesion, and MQ is based on existing works. These quality measures are widely used to evaluate software quality. Another factor that can affect the validity of the results is the selection of the parameter values. To mitigate this threat, we determined the parameter values of the proposed algorithm using the trial-and-error method as well as the settings of the existing approaches.

Conclusion and future works
In this work, we have presented a many-objective optimization approach named GrMaPSO for the many-objective software module clustering problem. In this contribution, we have exploited the PSO framework along with grid-based ranking and external archive-based solution storing strategies. The grid-based ranking strategy is used to select the swarm leaders (i.e., the personal best and global best positions) effectively, which helps GrMaPSO to converge towards the Pareto front efficiently. Moreover, the concept of two external archives (i.e., a convergence-oriented archive and a divergence-oriented archive), along with effective update and pruning strategies, has also been incorporated in GrMaPSO to balance diversity and guide the optimization process.
In the experimental studies, we applied the proposed GrMaPSO to the five MaSMCPs under the E-MCA and E-ECA many-objective formulations. The obtained results are compared with those of three existing many-objective optimization approaches designed for similar many-objective optimization problems. The results indicate that the proposed GrMaPSO outperforms the existing many-objective optimization approaches in terms of the MQ, coupling, cohesion, and IGD quality indicators. Overall, the results demonstrate that the proposed approach is more effective and has significant advantages over existing search-based software module clustering approaches. Future work can include evaluating the proposed approach on other industrial MaSMCPs in which a greater number of module clustering criteria are required.