GP-DMD: a genetic programming variant with dynamic management of diversity

The proper management of diversity is essential to the success of Evolutionary Algorithms. Specifically, methods that explicitly relate the amount of diversity maintained in the population to the stopping criterion and elapsed period of execution, with the aim of attaining a gradual shift from exploration to exploitation, have been particularly successful. However, in the area of Genetic Programming, the performance of this design principle has not been studied. In this paper, a novel Genetic Programming method, Genetic Programming with Dynamic Management of Diversity (GP-DMD), is presented. GP-DMD applies this design principle through a replacement strategy that combines penalties based on distance-like functions with a multi-objective Pareto selection based on accuracy and simplicity. The proposed general method was adapted to the well-established Symbolic Regression benchmark problem using tree-based Genetic Programming. Several state-of-the-art diversity management approaches were considered for the experimental validation, and the results obtained showcase the improvements both in terms of mean square error and size. The effects of GP-DMD on the dynamics of the population are also analyzed, revealing the reasons for its superiority. As in other fields of Evolutionary Computation, this design principle contributes significantly to the area of Genetic Programming.


Introduction
Genetic Programming (GP) is a successful paradigm of Evolutionary Algorithms (EAs) [1,2] that has been applied to a wide variety of domains [3][4][5]. In this paradigm, a population of computer codes or mathematical models is evolved through the iterative application of selection, genetic and replacement operators. In the broader field of Evolutionary Computation (EC), the proper balance between exploration and exploitation is considered a cornerstone for proper performance [6]. Since there is a clear relationship between this balance and the amount of diversity maintained in the population, several strategies to alter the amount of population's diversity have been devised [7]. They are usually classified in terms of the component that is altered, and they can alleviate numerous inconveniences [8], such as premature convergence and oversampling of neutral networks, among others.
In 2015 Segura et al. [9] proposed a novel design paradigm that explicitly relates the amount of diversity maintained in the population to the stopping criterion and elapsed period of execution. Subsequent studies, performed by our research group, have yielded important advances for both single-objective continuous and combinatorial optimization [10]. The main principle behind this design paradigm is that the internal operations of EAs should depend on the amount of computational resources available, and in particular, that the initial phases should focus on exploration while the final phases should be devoted to exploitation, with a gradual transition between those stages. This principle has been successfully used to design new components, such as replacement strategies [8], and crossover operators [11]. In fact, this principle was used to design the winning strategy of the extended round of Google Hash Code 2020 1 , which featured more than 100,000 participants, thus indicating the potential of this principle.
In the case of GP, several authors have already noted the impact of diversity management on performance. Several ways of measuring diversity and strategies to alter the amount of diversity maintained in the population have been devised [12]. However, to our knowledge, the design principle previously described has never been applied in the area of GP. This work analyzes the hypothesis that this design principle is also useful for GP and that state-of-the-art strategies can be advanced further. Accordingly, this paper presents a new general methodology for GP that considers an explicit and dynamic management of diversity that takes into account the stopping criterion and elapsed period. The novel proposal is called Genetic Programming with Dynamic Management of Diversity and, similarly to some of the successful single-objective optimizers, it incorporates the principle discussed through a novel replacement strategy. Specifically, it uses a multi-objective selection that considers both the accuracy and simplicity. Additionally, as in the case of singleobjective optimization, the diversity is managed by means of a dynamic penalization scheme that takes into account a measure of similarity between individuals.

Diversity in genetic programming
Most variants of population-based optimizers lack mechanisms to systematically alter the amount of diversity maintained in the population. Since premature convergence is one of the most common drawbacks found in the design of this kind of optimizers, several diversity management techniques have been proposed in order to influence the amount of diversity maintained in the population. In most cases, they aim to increase the amount of diversity preserved in the population so the term diversity promotion is used [13]. In the particular case of GP, several diversity-aware proposals that are inspired by well-established diversity management strategies [14], as well as techniques that are specific to GP, have been developed [15]. Due to the particular kinds of individuals evolved in GP, measuring the diversity of a population of individuals is a complex task. This section provides a representative sample of diversity measures and strategies to alter the amount of diversity maintained in GP. Most of the techniques discussed are used to validate the proposal put forth in this paper.

Diversity measures
Several ways of calculating diversity in GP, both for analysis and design, have been devised. However, different nomenclatures have been used to refer to similar measures and probably worse, the same term has been used to refer to different measures. For instance, the term phenotypic distance has been used both to refer to distances based on the fitness values [16] and distances based on the mathematical models [17]. In order to avoid misunderstandings, we present a classification of diversity measures that considers the most typical use of the terms.
Structural diversity is related to the variety of mathematical models associated with each individual. Note that in this paper, our mathematical models are trees, so the discussion is restricted to this kind of structure. Since calculating tree-based diversity is complex and expensive [1], several ways of estimating it have been devised. These approximations can be categorized into the following classes: -In edit distances [16], a set of edit transformations and associated costs are established. Then, the edit distance is defined as the minimum-cost sequence of edit operations that transform a given tree into another one. One of the most popular is the ed2 distance. In this case, the two trees compared ( t 1 and t 2 ) are first overlapped and brought to the same structure by adding null nodes. Then, the distance is defined as ed2(t 1 , t 2 ) = dist(p, q) + w ∑ l=1 ed2(t 1,l , t 2,l ) , where p and q are the roots of t 1 and t 2 , is the number of children of the roots, t i,j is the subtree number j of the root of t i and w ∈ ℝ is a constant. dist(p, q) is usually defined as one for unequal nodes p and q and zero for equal nodes and this was the decision adopted in this paper.
Note that the constant w sets the importance of each level. Specifically, the values w < 1 , w > 1 and w = 1 grant more importance to the upper levels, to the lower levels or the same for all levels, respectively. The value w = 0.5 is broadly used in GP and is the value adopted in this paper.
-Subtree based distances take into account the set of subtrees appearing in each candidate solution to estimate their differences. The proposal by Keijzer (kj) [18] is probably the most popular belonging to this group. This distance is defined as follows: In a more recent paper [19], a related proposal that considers both subtrees with limited sizes and their positions is devised.
Behavioral diversity is based on checking the output of the mathematical models with the training cases. There has been an increasing interest in analyzing behavioral diversity because structural diversity does not guarantee behavioral diversity [20]. These approximations can be categorized into the following classes: -In fitness-based measures, the fitness, which summarizes in a single scalar the overall performance of the individual, is used in some way to estimate the diversity. For instance, it can be done through the variance of the fitness values appearing in the population [21]. -In semantic-based measures, the semantic ( (t) ) of a program t is defined as a vector containing, for each training case, its output or a value related to its performance. Then, this vector is used to establish the differences between individuals [22]. The Sampling Semantic Distance (SSD) [23] is a popular example belonging to this category, and it is calculated as is the semantic value associated to program t j in the i-th test case and m is the total number of training cases.
Finally, note that entropy-based measurements are also quite typical, but they can be considered as special cases of the previous ones. In these kinds of schemes, the mathematical models are partitioned in some way and the spread of the population over the partition is taken into account. In order to partition the population, some of the methods previously discussed are used. For instance, in [21], edit distances to a reference tree are used to establish the partitions. Then, the entropy of the population P is calculated by taking into account the fraction of individuals belonging to each group of the partition. Thus, this is an entropy-based structural diversity. An example where the partitions are established in terms of the fitness function values, i.e., an entropy-based behavioral diversity, is presented in [24].

Diversity management strategies
Since a large amount of diversity management strategies have been devised, several taxonomies to classify them have been proposed. The taxonomy provided in [13] takes into account three independent categories: the element that is considered (lineage, genotype or phenotype), the type of selection that is modified (parent, survival or both), and the context-dependency. The taxonomy provided in [7] also takes into account the component that is modified but with a more fine-grained view. It identifies selection-based, crossover/mutation-based, population-based and replacementbased techniques, among others. Additionally, it also differentiates among maintaining, controlling and learning techniques. In the maintaining techniques, the diversity is not measured globally but a modification that aims to alter the amount of diversity maintained in the population (usually to increase it) is included. Note that schemes that measure some distances among individuals but do not use a global measure of diversity, such as the crowding schemes, belong to this group. In the controlling techniques, the diversity is measured and it is used as feedback to steer the evolution. Finally, in the learning techniques, a long-term history of the diversity is stored and used in combination with machine learning techniques to alter the optimization process. This research focuses on management strategies that can be classified as maintaining techniques and in order to further classify them, the altered component is identified.
In the selection-based techniques, the parent selection procedure is modified, usually with the aim of biasing the search towards uncrowded regions. Some of the most popular strategies belonging to this category are the following: • -lexicase selection ( -lex) [25] is an adaption of the well-known lexicase selection [26] to the symbolic regression case. Lexicase selects parents by iteratively considering single cases and discarding non-elite solutions, meaning that the semantic diversity is taken into account. -lex modules the pass condition with the aim of diversifying further the selected parents for symbolic regression by considering a threshold value ( ) in the filter step. Several ways of setting were tested, with the Sum-MAD strategy, which is based on considering the median of absolute deviations, providing quite robust results. • Semantic in Selection (SIS) [27] relies on performing two kinds of selections to promote quality and behavioral diversity. For each pair of parents, the first one is selected by a tournament based on the fitness value, whereas the second one is selected by sampling a pool of ps individuals and then selecting, among the ones with a different semantic than the first parent, the one with the best fitness value. If none of the sampled individuals generate a different semantic, the individual is selected randomly. Note that in the case of symbolic regression, a threshold value ( ) is used to compare behaviors. Thus, the behavior for a test case of two candidate solutions is considered to match when their absolute difference is lower than . • k-nobelty selection (KNOB) [17] is also based on performing two kinds of selections. Specifically, selections based on lexicase (or -lexicase for continuous semantics) are used with probability (1 − k) , whereas the remaining selections are based on novelty. Thus, the first kind of selection comprises both diversity and quality, whereas the second type is based solely on diversity. The novelty is approximated by calculating the mean Hamming distance (with the binarization scheme presented in [28]) to a sampling of the previously generated solutions. The distance is thus calculated as is the error of the tree t j in the i-th case and the binarization b j (e) is 1 if the error e is less than or equal to the mean error of the tree t j in every case, and 0 otherwise. Specifically, a bounded archive A that stores the last |A| evaluated solutions is maintained, and sp individuals are sampled to estimate the novelty. Several ways of setting k were analyzed, and schemes based on a logarithmic decay excelled.
Crossover and mutation operators are quite important for the dynamics of the populations. Thus, redefining these operators by taking into account desired effects on diversity is also quite typical. Crossover/mutation-based strategies alter the genetic operators and/or their probabilities. One of most popular strategies that act on the variation phase is the Diverse Partner Selection with Brood Recombination (DPSBR) [29]. DPSBR alters the probabilities of crossover ( p c ) and mutation ( p m ) dynamically and promotes crossing semantically distant individuals by forcing certain conditions related to the expected improvement. After selecting the parents with a typical tournament based on fitness, those conditions are checked and if they are not fulfilled, alternative parents are tested. After n/2 attempts, where n is the population size, the crossover is abandoned and the last pair of selected parents is accepted. Since this might be indicative of a behavioral diversity that is too low, p c is reduced and p m is increased by 1/n. In order to calculate the semantic distance, the same binarization scheme as in k-nobelty is used. Additionally, brood recombination is applied, meaning that for each pair of parents, multiple recombinations (mr) are performed.
Replacement-based schemes modify the process of selecting the survivors of the next generation. Since this phase is in charge of erasing information, it is quite a natural component to promote diversity. The following strategies are some of the most popular in this category: • Find Only and Complete Undominated Sets (FOCUS) [30] applies a multiobjective approach, where the objective triplets consist of fitness, size and the average squared distance to other individuals. The normalized ed1 distance [12] is used, which is defined as ed1(t 1 , t 2 ) = dist(p, q) + ∑ l=1 ed1(t 1,l , t 2,l ) , where p and q are the roots of t 1 and t 2 , is the minimum between the number of children of p and q, and dist(p, q) is defined as one for unequal nodes p and q and zero for equal nodes. Note that unlike ed2, this distance only takes into account the overlapped nodes, so no expanded tree is created, and it is normalized by dividing by the size of the smaller tree. Distances consider the extended population made up from the union of the current population and the offspring. Then, it applies the Pareto dominance criterion to identify the nondominated solutions, which are selected to survive. • Age Fitness Pareto Optimization (AFPO) [31] also applies multi-objective concepts but using a tuple that consists of fitness and age. The key idea is the concept of genetic age, which is related to the number of generations that the genetic information has been in the evolutionary process. Specifically, the age of solutions of the initial population is set to 0 and at each iteration, the age of every member of the population is increased. The age of the individuals created by crossover is inherited from the oldest parent and, in addition, in each generation a new random individual with its age set to 0 is created. Note that this step is similar to the random immigrants approach [32], but, together with the novel selection process, it further favors young individuals. Finally, the non-dominated individuals survive. However, in this case, there is a target population size n meaning that if less than n non-dominated individuals exist, some dominated individuals also survive. The dominated individuals are selected through Pareto tournaments. • Genetic Marker Density Genetic Programming (GMD-GP) [33] is also based on Pareto dominance, but in this case each individual is associated with a tuple that consists of genetic marker density and fitness. Genetic markers are the main novelty of this proposal and they are partial trees generated by traversing the generated trees from the root to a specified depth (md). Then, the partial tree generated for each individual is used to calculate its density by considering the number of times that such a tree appears in the population. The logic behind this proposal is that, for several cases, the population converges very quickly in the top levels [34]. Note that the same replacement procedure as in AFPO is applied, meaning that all nominated individuals survive and, additionally, some dominated individuals might survive to reach the target population size.
In the case of population-based diversity-management strategies, the main mechanism is not focused in a single evolution component, rather the typical panmictic model with a fixed size is modified, for instance, by considering a notion of sub-population or by altering the population size dynamically. One of the most popular strategies belonging to this category in the GP area is the Age Layered Population Structure [35,36] (ALPS) scheme. The core idea behind ALPS is to protect recently created individuals in the population from being overshadowed by fitter and more evolved individuals. Thus, only individuals with similar ages compete among them. In order to attain this aim, the population is partitioned into layers and each layer has a capacity (lc) and an age limit. Specifically, the age limit of layer i ( age i ) is set to agegap × i 2 , except for the last layer, which has no age limit. Inside each layer, a generational strategy with elitism is applied, meaning that the best el solutions in terms of their fitness, as well as the offspring, survive. At each generation, the individuals' ages increase and they are promoted to the next layer if their age is greater than their current layer limit. Additionally, individuals in the first layer are regenerated randomly at specific generation intervals (agegap), meaning that short new individuals enter in the competition. Genetic operators are applied inside layers to generate offspring and the age is inherited from the oldest parent involved. Finally, note that some hybrid diversity management strategies that combine variants of several of the schemes presented above have also been devised. For instance, GMD-GP has been combined with lexicase to simultaneously consider the behavioral and structural diversity [37].

Proposal
The proposal put forth in this paper was designed by considering the hypothesis that relating the diversity management to the stopping criterion and elapsed period of execution might bring additional benefits to the GP area. Specifically, this principle was used to design a novel replacement phase. This component is an adaptation of a replacement algorithm proposed by our research group, the Replacement with Multi-objective Dynamic Diversity Control (RMDDC), which is a strategy that has been applied successfully in combinatorial single-objective optimization [8]. One of the most important features of RMDDC lies in the incorporation of penalties in the replacement phase with the aim of avoiding the survival of individuals that are too similar. The definition of similarity is based on distance-like functions which are problem-dependent, whereas the notion of being too similar is dynamic. In particular, an initial threshold distance to distinguish between penalized and non-penalized individuals is established. Then, this threshold is decreased linearly during the evolution in such a way that it attains the value 0 at the end of the optimization process. Note that this means that as the evolution progresses, closer individuals are accepted with the aim of shifting gradually from exploration to exploitation. The penalization promotes the survival of less fit but diverse solutions, but at the same time, the dynamic threshold promotes the search to focus on the most promising regions during the last phases of the optimization.
In order to test the potential of adapting RMDDC to GP, and with the aim of comparing it fairly against other diversity management strategies, we implemented quite a large number of different strategies using a Simple GP (SGP) template (Algorithm 1). The template is quite general and standard. First, an initial population ( P (0) ) with n individuals is created (line 1) and evaluated (line 2). Note that g denotes the number of the current generation in the pseudocode. Then, several generations are evolved until the stopping criterion is met (lines 3-10). The evolution process is as follows. First, the parents ( P s ) are selected from the current population P (g) (line 4). Then, they are subjected to crossover (line 5) and mutation (line 6) to create the offspring (O), which is evaluated (line 7). Finally, the new population ( P (g+1) ) is selected from the offspring and current population (line 8). Note also that, at each generation, the best individual ( t best ) is updated (line 9). The best individual is returned at the end of the evolutionary process (line 11).
In this general template, the components enumerated in the pseudocode are considered as input parameters of the strategy, meaning that different GP variants and adaptations to specific applications can be developed. Additionally, when instantiating specific components, some other input parameters might be required. When integrating a specific diversity management strategy, only the components involved in the strategy are altered, whereas the remaining ones take a default behavior. In this paper, Symbolic Regression (SR) is used for comparison purposes. Thus, default components were specified for the SR task, and in particular, a tree-based GP with very standard behavior is considered. The components selected as default for SR are the following 2 : -Population-initialization: The standard ramped half and half strategy is used.
It combines grow and full tree generation techniques to generate n individuals.
where m is the number of training cases, and y i , ŷ i are the expected and predicted outputs for the i-th test case of the training set. Note that individuals are only evaluated if necessary, i.e., in cases where crossover or mutation alters the tree. In our experimental validation, the terms fitness and MSE are used interchangeably.
-Selection: n parents are selected by applying tournament selections. In each selection a pool of individuals of size tsize is randomly sampled from the population and the fittest individual is selected. In case of ties, they are broken at random. The selection returns an array of parents pairs P s . -Crossover: The parents pairs are subjected to the subtree-exchange operator with probability p c . In this operator, two nodes, n a and n b , each from a parent, are selected at random. Then, the subtrees rooted at these nodes are exchanged to create the offspring. Individuals belonging to pairs that are not subjected to crossover are moved to O ′ without any change. -Mutation: For each individual in the offspring, the subtree-replacement method is applied with probability p m . In this mutation, a node n is selected randomly and the subtree rooted in n is replaced by a new tree that is generated by the grow method. The method returns a new offspring array O where p m × |O � | elements are expected to be mutated. -Replacement: A generational scheme with elitism is used, meaning that the next population P (g+1) consists of the offspring together with the best solution from the current generation population P (g) , but only if this solution is better than any of the offspring. In the novel Genetic Programming with Dynamic Management of Diversity proposal, the only component that is modified from the SGP template is the replacement phase. Algorithm 2 details the proposed replacement. The inputs are the population (P), the offspring (O), the number of desired survivors (n) and some information that is used by the penalization approach, which is the number of elapsed evaluations ( e e ), the stopping criterion set as a maximum number of evaluations ( e s ) and Genetic Programming and Evolvable Machines (2022) 23:279-304 the initial distance threshold ( d i ). The aim is to select n members to form a new population (NP) for the next generation.
GP-DMD's replacement strategy initially joins the population and offspring into a set of candidates (C) (line 1). The best individual in terms of fitness is selected from the candidates to form part of the new population (line 2) and removed from the candidates (line 3). In case of ties, the smallest individual is selected and, if there is still a tie, it is broken randomly. Then, a threshold value ( ) is calculated (line 4), which is used in subsequent steps to penalize the candidates. After the previous initialization, the algorithm performs n − 1 iterations to select the survivors from the candidates (lines 5-15) to form the new population. At each iteration, the candidates are categorized and included either in the set of penalized candidates ( C p ) or the set of non-penalized candidates ( C np ) (line 6). Specifically, any candidate whose Distance to Closest Survivor (DCS) is lower than the threshold is categorized as a penalized candidate; otherwise, it is categorized as non-penalized, i.e, individuals are classified depending on its distance to its closest already selected individual. If there are non-penalized candidates, a multi-objective approach based on a tuple with two objectives related to accuracy and simplicity is used for selecting a random nondominated candidate (lines 7-9) and the penalized individuals are ignored. Otherwise, i.e., if all candidates are penalized, the farthest individual is selected (line 11). Note that this situation might indicate that the diversity is too low, so selecting the most distant individual seems promising and aligned with previous research on RMDDC. Finally, the selected candidate is included in the new population and removed from the candidates set (lines [13][14]. Note that this general replacement strategy is quite similar to the one applied in RMDDC. However, the process to select among the potential non-penalized individuals differs. In the case of RMDDC and its variants, two different strategies were analyzed. In the first variant, a multi-objective approach based on a tuple of fitness and diversity was applied and the selection process chooses survivors at random from among the non-dominated individuals. Subsequently, a second simpler strategy based only on fitness also behaved properly [38], so the step to categorize individuals into the penalized and non-penalized classes seems to be key to proper performance. In the case of GP, both the accuracy and simplicity of the models are important. Thus, in this case, the multi-objective approach is based on a tuple of objectives that considers accuracy and simplicity, respectively. The minimum-requireddistance function is used to set the threshold that is applied to distinguish between penalized and non-penalized individuals. In this paper, this function is similar to the one applied in RMDDC, i.e., it starts from an initial distance value d i (a parameter of GP-DMD), which is reduced linearly. Specifically, the threshold is set as follows: In order to fully specify our proposal for a given application, some problemdependent decisions must be made; specifically, the objectives related to accuracy and simplicity, as well as the way to calculate distances between individuals, must be defined. In order to deal with the SR task, the following decisions were made. The accuracy and simplicity are considered by minimizing the MSE and tree sizes, respectively. In the case of the distance-like function, we considered a normalized ed2 distance 3 , which was previously presented. In order to restrict the distance values between two given trees t 1 and t 2 to the range [0, 1], the ed2 distance is divided by an upper bound of the maximum attainable distance ( max ed2 (t 1 , t 2 ) ). This upper bound is calculated by considering a tree whose structure is formed by overlapping the trees of t 1 and t 2 and creating nodes in any position where at least one of the trees has a node. Then, max ed2 (t 1 , t 2 ) is calculated by considering this structure and assuming that differences appear in each of the nodes. The normalized distance used is ed2 norm = ed2(t 1 ,t 2 ) max ed2 (t 1 ,t 2 )

Experimental validation
The main focus of this paper is to show that the design principles related to the management of diversity previously discussed provide important benefits to the field of GP. In order to show the potential of GP-DMD, the experimental validation is focused in symbolic regression, a simple and well-established benchmark for GP. The goal of SR is to generate a predictive model as an analytic function, from a set of input/output pairs [39]. GP has been shown to yield impressive results with SR, and both the simplicity and accuracy are important [40]. However, simple variants of GP that evolve a single tree have not excelled, and the state-of-the-art algorithms are hybrid schemes that incorporate other kinds of procedures, such as multiple linear regressions and/or operators specifically devoted to SR but that suffer from exponential trees growth [41,42].
The focus of this paper is not to further develop the state-of-the-art strategies for SR. Thus, these kinds of ad-hocs and hybrid methods are not considered in our experimental validation, which is focused on comparing GP-DMD against other strategies that also evolve a single tree and that provide some specific strategies to alter the degree between exploration and intensification. As it is subsequently shown, GP-DMD provides significant advances in comparison to other diversity management schemes. However, to further explore the generality of GP-DMD, additional problems should be taken into account. This is beyond the scope of this paper, so in this sense, this validation should be considered as a first proof-of-concept.

Experimental setup
In order to validate the achievements of GP-DMD, our extensive comparison considers the following proposals: -LEX [25], SIS [27], DPSBR [29], FOCUS [30], GMD-GP [15], AFPO [31], KNOBELTY [17], ALPS [35] and SGP (Algorithm 1 with default components). All these schemes were previously presented and they were implemented by changing some of the default components of Algorithm 1 to incorporate the specific changes devised in each of these strategies. Table 1 details the parameters applied in each scheme and clarifies the component that each of the methods altered with respect to SGP. These parameters are similar to those put forth in the original proposals, with some minor tweaks to adapt each proposal to the specific problems that are addressed in this paper. The common parameters used in all the algorithms are detailed in Table 2.
In order to facilitate reproducibility, free software implementations are available for public download 4 , 5 . The programming languages used are Crystal (v1.0) for the core framework and Ruby (v3.0) and Python (v3.9) for pre and post processing scripts. Instructions and requirements for replicating the experiments presented in this paper are provided in the repository.
Experiments were performed on 25 SR problems from the benchmark proposed in [43]. Note that some preliminary executions with additional problems were performed. Since similar conclusions could be drawn and due to time restrictions, we limited the experimentation to the first 25 problems, which are among the most popular ones. The symbolic regression problems are listed in Table 3. For each problem, the training and validation sets were created as specified in their original papers, and in order to provide a fair comparison, the same data were considered for all the executions. Note that E indicates an equidistant point sampling, and U a uniform sampling. In order to facilitate future comparisons, codes to generate the training and validation sets are included in our repository.
Finally note that, since stochastic algorithms were considered in this study, each execution was repeated 30 times and comparisons were carried out by applying a set of statistical tests. Specifically, the following tests were applied, assuming a significance level of 5%. First, a Shaphiro-Wilk test was applied to check if the values of the results followed a Gaussian distribution. If so, the Levene test was used to check for the homogeneity of the variance. When similar variances were confirmed, an ANOVA test was done; otherwise, a Welch test was performed. For non-Gaussian distributions, the non-parametric Kruskal-Wallis test was applied. The statement "algorithm A is superior than algorithm B" means that the differences between them are statistically significant and that the median obtained by A is lower than the median achieved by B. Note that the same kinds of statistical tests are used to compare fitness and sizes. The following subsections include analyses related to accuracy and simplicity, as well as some studies regarding the dynamics of the population which contribute to a better understanding of the internal behavior of the set of GP strategies considered. Additionally, some variants that alter the penalization scheme of GP-DMD are tested with the aim of better understanding the implications of this particular component.

Fitness comparison
In order to validate GP-DMD in terms of accuracy, it is compared against the above mentioned methods in the 25 selected benchmark problems in terms of fitness. Table 4 summarizes the performance of the algorithms. Specifically, it shows the number of instances where the given algorithm was either the best algorithm (lowest median fitness) or it did not exhibit a statistically significant difference with the algorithm that reported the lowest median (for detailed pairwise comparisons and  the specific fitness values obtained, see the supplementary material). This is reported both for the training and validation sets. It should be noted that GP-DMD was in the group of best-performing algorithms in more instances than any other approach, and probably more importantly, when considering the validation set, it remained in this group in 23 of the 25 instances. Thus, not only is the accuracy of GP-DMD competitive, but it is attained in quite a robust way. Figure 1 shows boxplots of the fitness in the validation set for 30 independent executions in three selected instances 6 . Note that GP-DMD achieved a very low koza-01 nguyen-01 nguyen-03 nguyen-04 median and the lowest variability. While other methods eventually yield good solutions, GP-DMD consistently generates good models, so the way to explore the search space provided by GP-DMD, with a more explicit management of the diversity, results in a much more robust behavior.

Solution size
Generating easily interpretable models is important in machine learning tasks [44,45]. One key aspect related to interpretability is the size of the mathematical models (trees in this case). However, a typical drawback in the design of GP algorithms, is the phenomenon known as bloat [46], which is the uncontrolled growth of code without a significant performance improvement [30]. Avoiding this uncontrolled growth is important for several reasons. First, concise final trees provide advantages in terms of reduced computational complexity, increased generalization and easier examination of the structure. Second, in the presence of bloat, the structural diversity measures become an unreal measure of the difference between individuals, so the diversity management schemes might not behave as expected [17]. Thus, diversity management strategies, and especially those based on structural diversity, should take size into account. For the above reasons, this section is devoted to analyzing the implications of the different tested algorithms on the sizes of the trees. Table 4 presents a summary of the results of the statistical tests when applied to compare tree sizes (number of nodes). It shows the number of instances where the given algorithm was either the best algorithm (lowest median size) or it did not exhibit a statistically significant difference with the algorithm that reported the Table 4 Number of instances for each method where there was no statistical difference with respect to  the best (lowest) median for fitness in training and validation sets and model size   GP-DMD KNOBELTY GMD-GP ALPS  -LEX FOCUS SGP DPSBR SIS AFPO   Training  21  15  14  11  11  5  4  3  3 2   Validation 23  11  10  20  8  8  9  8  8 8   Size  17  2  3  14  1  10  2  These results show that the proposed method has a positive and significant impact on the size of the solutions, which is usually considered beneficial for avoiding overfitting and improving generalization. Note that this benefit is confirmed with the performance analyses in the validation set.

Population dynamics
In order to shed some light on the reasons behind the good performance of our proposed method, a discussion of some of the properties related to the dynamics of the population is presented. To this end, above is a set of graphs with the trend, through the optimization process, of several measurements for three test cases. The behavior observed in the remaining 22 cases is quite similar. Figure 3 shows the median among executions of the fitness for the validation set. Note that GP-DMD yields improvements in a continuous and gradual manner, which indicates that the algorithm does not suffer from premature convergence. This contrasts with the remaining methods, where the improvements are not very significant for a large period of the execution. Figure 4 depicts the median among executions of the mean size (number of nodes) of the solutions in the population. It is clear that ALPS, GP-DMD, DPSBR, and FOCUS exhibit a slow but steady increase in the sizes, whereas for the remaining methods, there is initially a fast increase in the population tree size that never shrinks, meaning they are searching much more complex models. Note also that those schemes that maintain lower sizes are faster, meaning larger populations or executions could be used to potentially improve the quality of solutions further. Figure 5 shows the median among executions of the mean ed2 norm distance to the closest individual in the population (DC). As expected, in the case of GP-DMD there is quite a linear decrease in this measurement. It is also important to point out that the diversity resulting from the population initialization method is not very high, and GP-DMD is capable of promoting the diversity to the desired value despite the initial diversity value. This behavior is quite different to the ones exhibited in the remaining methods, where in most cases, a relatively constant amount of diversity is maintained for a large period of the execution. This means that GP-DMD is in fact moving from exploration to exploitation, which is the main aim behind the design principle studied in this paper.
As previously stated, the behavioral diversity is also important. In order to analyze the trend in the behavioral diversity, Fig. 6 shows the median among executions of the median pair-wise fitness differences appearing in the population. A large value indicates that solutions with different kinds of behaviors are present in the population. Differently, a low value showcases a convergence in this property. Note that this does not necessarily imply a convergence in terms of the tree structures; however, presenting a too homogeneous behavior might be an indication that several of the trees are in a neutral network, which in the case of GP might be due to the appearance of introns. While GP-DMD does not explicitly control the behavioral diversity, it attains a progressive decrease in this metric. This is particularly clear in the keijzer-06 and koza-03 cases. In contrast, the other approaches tend to maintain a relatively fixed amount of this kind of diversity after a relatively small period of execution. The gradual reduction presented by GP-DMD indicates that the explicit management of structural diversity indirectly manages the behavioral diversity in a proper way in the case of GP-DMD. The gradual decrease of the median pair-wise fitness difference exhibited by GP-DMD in combination with the controlled diversity in DC, seems to indicate that the proposed algorithm is successful in promoting and managing both behavioral and structural diversities.
These trends show that the search features are quite different in the proposal put forth in this paper than in other methods. These trends are quite similar to the ones observed in the field of combinatorial optimization, meaning that these interesting features could be translated to the realm of GP and that in this case, it also provides significant benefits to the results. Specifically, there are important benefits in terms of robustness, meaning that the standard deviation of the MSE and tree size are quite low, and there were no problems with a significant degradation in performance, which is also one of the advantages found in complex combinatorial optimization problems.

Analysis of execution time
The previous experiments involved 7500 independent executions that were performed in the High Performance Computing (HPC) cluster "Laboratorio de Supercómputo del Bajio". Servers with two Intel Xeon E52620 v2 processors with 6 cores at 2.10 GHz and 32 GB of DDR3 RAM were used. The stopping criterion was set by evaluations, as previously described. However, there were important variations in terms of execution times.
The analyses of the runs show that in some cases, most of the time is invested in evaluating the solutions and the time associated to the diversity management strategy is not meaningful. However, in other cases, the overhead associated to the diversity management strategy is more significant. Note that, at each generation the evaluation stage takes O( mn) , where is the mean size of the solutions, m is the number of test cases and n is the population size. From the tested algorithms, GP-DMD presents the highest worst-case overhead complexity due to the diversity management strategy. In particular, the current implementation of the replacement operator takes O( n 2 + n 3 ) . This is a relatively high complexity and it might be an issue for very large populations. However, for typical population sizes, such as the value 200 used in this paper, this overhead is not too significant.
The properties of the instances used in this paper are very diverse. In particular, there are instances with very short m values, such as keijzer-06, and instances with very large m, such as vladislavleva-04. Figure 7 shows the median of the time of the 30 independent executions for each algorithm. Particularly, the relation between time and number of evaluations is shown. In the vladislavleva-04, GP-DMD is the fastest algorithm. Since in this case m is much larger than n, the most computationally expensive part is the evaluation of the solutions, with complexity O( mn) . Thus, in this case, the cost is proportional to the size of the trees and in fact, the relation between the time invested in the executions and the sizes analyzed in Table 4 is clear.
In the case of the instances with lower m, such as keijzer-06, the overhead associated to the diversity management strategy is more important and as a result, GP-DMD is slower than some other methods, such as ALPS and FOCUS. The reason is that, while the tree maintained by GP-DMD are slightly smaller, the total cost is dominated by the O(n 3 ) part associated to the diversity management strategy. However, in comparison to the rest of the methods, which maintain much larger trees, GP-DMD is faster. Finally, it is also important to mention that in both instances, GP-DMD is slower than several other methods in the initial phases. In these initial exploration stages, the trees in all the algorithms are small, so the time associated to the diversity management strategy dominates. However, as the evolution progresses and GP-DMD is able to maintain short trees, the advantages described above appear. In summary, it is important to mention that significant drawbacks in terms of performance might appear when considering very large populations and, less significantly, with very low training samples. Thus, for such cases, some modifications of GP-DMD might be required.

Analysis of GP-DMD configurations
This last experiment is devoted to analyze some variations of GP-DMD with the aim of better identifying the reasons behind its significant advantages. Since this research is focused on exploring the advantages of the dynamic diversity management, several dynamic and static schemes are tested. Specifically, 11 variants of GP-DMD with different features, including the baseline proposal used in previous experiments that considers a linear decrement with d i = 1∕8 , are analyzed. Three variants test different initial d i . Particularly, the values 1 8 , 1 4 and 1 2 are considered. Four variants use a static threshold value ( ) in the penalization approach. Thus, the minimum-required-distance just returns a fixed value. The tested values were 0, 1 8 , 1 4 and 1 2 . Note that using = 0 means that only the clones are penalized. Additionally, two schemes that consider a dynamic threshold schedule different to the linear one are included. Particularly, the schedules follow a bezier curve [47] with control points (0, 0) (bezier-b) and (e s , 0.5) (bezier-t), respectively. Note that all the previous proposals consider the normalized ed2 distance. Finally, two additional distances are used. They are the normalized ed1 and the normalized hamming distances; the ed1 is a structural distance, while the hamming distance represents a drastic modification because it is a semantic distance. Note that in these two last cases, the linear dynamic schedule for the penalization approach used in our previous experiments is maintained. Figure 8 shows, for all these variants, the trend of the median of several features through the execution considering 30 independent runs in the koza-03 instance. The behavior with other instances is quite similar. The subplot (A) refers to the fitness of the best solution, considering the validation set. The subplot (B) shows the mean DC distance of the population. Finally, the mean tree size is shown in subplot (C). Probably, the most obvious insight is that when using very low threshold values in the penalization, bloat appears. This is particularly clear when using = 0 , but it is also obvious in the schedule beizer-b, which maintains a low threshold for a long period. Our hypothesis is that this increase is due to the appearance of introns but more analysis is required to fully understand the reasons. However, it is clear that applying not too low thresholds is good for maintaining small trees. It is also remarkable that the DC distances follow quite precisely the schedule applied. The only exception appears when applying the Hamming distance, meaning that applying penalties in terms of semantic distances do not allow to maintain appropriate structural diversity, resulting in not so proper results.
In order to summarize the results attained with all the instances, Table 5 presents the results of the statistical tests for the size and fitness in the validation and training sets. As in previous experiments, for each variant it shows the number of instances where the corresponding model attained the lowest median among all the variants or did not exhibit a statistically significant difference with the algorithm that reported the lowest median. In terms of fitness, it is clear that using a linear dynamic threshold is superior than using a static value. In fact, for all the tested values ( 1 8 , 1 4 and 1 2 ) the schemes considering a linear decrease attained a superior performance than the ones using a fixed value. These results confirm that, as in combinatorial optimization, promoting a gradual shift from exploration to intensification provides important benefits. Regarding the proposals with a linear decrease, it is also clear that the initial threshold value significantly impacts its effectiveness. Using lower values provokes an increase on the number of instances with adequate fitness, but this is at the cost of increasing the size of the resulting trees. Thus, different parameterization might be used depending on the desired performance and interpretability. It is also worth noting that the comparisons among the results of these three d i values with linear decrement, and the remaining diversity management strategies analyzed in previous experiments, reveal the superiority of the proposals with linear decrement. Thus, while the value 1 8 was used to report the experimental results in the previous sections, any of the tested values allow reaching similar conclusions (for a summary of the statistical tests considering all the algorithms, see the supplementary material). Finally, using the ed1 distance did not result in significant changes in the results, so the proposal is robust against minor modifications in the distance-like functions. However, the Hamming distance resulted in poor results. As previously discussed, the application of this distance did not ensure the maintenance of a proper structural diversity, so in order to properly apply behavioral diversity distance-like functions in our proposal, additional research is required.

Conclusions and future work
The proper balance between exploration and exploitation is one of the keys to designing effective Evolutionary Algorithms. A design principle proposed and studied by our research group, which is based on relating the amount of diversity maintained in the population to the elapsed period of execution and stopping criterion, has yielded significant benefits in the field of combinatorial optimization. This work studies the application of this design principle in the area of GP; specifically, this paper presents a novel GP variant, called GP-DMD, which manages diversity explicitly through a novel replacement strategy that incorporates the previously mentioned design principle and at the same time favors fit and simple solutions by applying multi-objective concepts. The novel proposal was compared to a diverse set of well-established algorithms, including several diversity-aware techniques. Specifically, the validation is carried out using the symbolic regression task. The experimental validation shows that the approach presented allows for the generation of high-quality solutions with impressively small sizes, significantly improving the algorithm's robustness. The dynamics of the population in terms of size, fitness and diversity shows the remarkable differences in GP-DMD when compared to other related algorithms. Additionally, extensive experiments show that the method is quite robust, in the sense that small variants of the proposal also behave properly. Particularly, proposals that follow different kinds of schedules to penalize individuals and distance-like functions behave similarly. Moreover, the findings are somewhat similar to those achieved in the field of combinatorial optimization, meaning that this design principle proposed in that area could be successfully transferred to GP.
One of the weaknesses of our proposal is that it requires setting a stopping criterion based on time or evaluations in order to promote the gradual shift from exploration to exploitation. In practical terms, using a stopping criterion related to quality is beneficial in some cases, so in future work, we plan to analyze the application of the ideas explored in this paper but in a way that is not incompatible with setting a stopping criterion based on quality. Also note that, while GP-DMD is quite robust for different parameterizations, using more advanced parameter control techniques to adapt some of its internal components might bring additional benefits. Finally, we would also like to apply GP-DMD to other kinds of applications and perform some extensions to consider additional interpretability metrics that might be used in combination with the minimization of the sizes of the trees.