HCPC: A New Parsimonious Clustering Method based on Hierarchical Characters for Morphological Phylogenetic Reconstruction

Background: Phylogenetic trees are reconstructed frequently to provide a better interpretation of the evolutionary history of species. However, most traditional methods ignore the hierarchical relationships among characters and neglect the inapplicable state that frequently exists in the morphological data, resulting in poor performance of the phylogenetic analysis. Results: In this study, we propose a phylogenetic clustering method based on hierarchical characters. Accordingly, we call our method Hierarchical Characters Parsimonious Clustering(HCPC). To combine prior phylogenetic knowledge and treat the inapplicable state more reasonably, two stages are proposed, i.e., Phylogenetic reconstruction and parsimonious tree search. During phylogenetic reconstruction, HCPC is able to infer the shared ancestral relationships among species. For the search of the parsimonious tree, we use a simulated annealing algorithm to heuristically search the phylogenetic tree based on the parsimony criterion. In addition, HCPC


Introduction
Phylogenetic reconstruction is an interesting topic in biology and an important analysis method to investigate the evolutionary process and to develop an understanding of when and where speciation events may have occurred (Yang et al., 2012). A phylogenetic tree consists of edges, internal nodes, and leaves. The leaves represent the Operational Taxonomic Units (OTUs), which are the actual species used to reconstruct a tree. The internal nodes are the Hypothetical Taxonomic Units (HTUs), which represent the hypothetical shared ancestors to all other species arising from them. The species arising from an internal node are the direct descendants of this internal node.
The leaves connected to the same internal node are called a sister group and have the closest shared ancestral relationship (Scott et al., 2012). The edges represent the relationship between the two nodes.
Traditionally, there are two types of methods for phylogenetic reconstruction: distance-based methods and optimization-based methods (Goloboff et al., 2005;Yang et al., 2012). In a general way, distance-based methods depend on various types of data to reconstruct a tree, i.e., the genetic distance of the sequences and the Euclidean distance (Felsenstein, 1988;Scott et al., 2012). In terms of distance-based methods, there are commonly used algorithms, i.e., the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (Sneath et al., 1973), Neighbor-Joining (NJ) (Saitou et al., 1987;Tamura et al., 2004), and the Fitch-Margoliash method (Farabee, 2007). Each algorithm possesses known properties so that the reconstructed tree is very similar to the real tree (Scott et al., 2012). However, optimization-based methods rely on a variety of phylogenetic characters including genetic, morphological, behavioral, and molecular to perform the analysis (Goloboff et al., 2005). Generally, these algorithms such as Maximum Parsimony (MP), Maximum Likelihood (ML), and Bayesian Inference (BI) are based on an optimization criterion (Lewis, 2001;Goloboff et al., 2005;O'Reilly et al., 2016;Puttick et al., 2017).
However, existing methods consider characters to be logically independent and ignore the hierarchical relationships among them (Zaragüetabagils et al., 2007;Nelson et al., 2010). If one character state depends on another character state in a species, it means that there is a hierarchical relationship between the characters (Lee et al., 1999;Seitz et al., 2000). The hierarchical relationship indicates that characters in phylogenetic analysis have some logical dependencies because complex biomorphologic characters can be broken down into secondary characters with logical relationships. These decomposable complex characters are called upper characters, and the secondary characters decomposed by complex characters are called lower characters. The hierarchical relationship among characters implies that the upper character and its lower character are logically dependent. In other words, the lower character is only applicable to a species that has its upper character (Seitz et al., 2000). For example, suppose some species have no tails, some have blue tails, and others have red tails. The tail color character and tail length character depend on the presence of tail character. The hierarchical relationship between the characters in this example is shown in Fig. 1. The lower characters, such as tail color and tail length are applicable only for species that have a tail. The species with no tails are recorded as inapplicable state in tail color character (Strong et al., 2010;Wilkinson, 2010). As is well known, the treatment of inapplicable state is a concern in phylogenetic reconstruction, especially for paleontological research, in which morphological data is the only material for phylogenetic reconstruction (Zaragüetabagils et al., 2007). Unfortunately, both distance-based methods and optimization-based methods are problematic when some inapplicable states exist in the morphological data (Zaragüetabagils et al., 2007;Goloboff et al., 2017;O'Reilly et al., 2018).
Meanwhile, there are two main approaches to treat inapplicable state. One frequently used method is Missing Data Replacement (MDR), which treats inapplicable state as missing data. However, this result in an assumption of homology where no homology exists (Maddison, 1993;Seitz et al., 2000;Platnick et al., 2010). The second is Separate Value Replacement (SVR), which treats the inapplicable states as separate character states; this approach is often used in the MP method.
However, the inapplicable state is not comparable to other states defined by the characters and this method implicitly weights the upper characters of the species (Lee et al., 1999;Seitz et al., 2000).
In other words, all the aforementioned methods often result in poor performance of phylogenetic analysis (Congreve et al., 2016). Different from other character states, the inapplicable state does not imply that a certain homology hypothesis can be used to reconstruct phylogenetic trees (Lee et al., 1999), whereas the above-mentioned method assumes that the inapplicable states implies an assumption of homology.
In the morphological data, the hierarchical relationship leads to inapplicable state. Furthermore, when manually reconstructing phylogenetic trees, the species with the shared derived character states can be distinguished from other species based on the polarity of the characters (Huang, 1996).
Thus, it is crucial to explore the hierarchical relationships and the polarity of the characters when reconstructing phylogenetic trees. Therefore, we divide the framework of HCPC into two steps, namely, phylogenetic reconstruction based on the hierarchical relationship and a parsimonious tree search. HCPC reconstructs phylogenetic trees based on the asymmetric binary relationships and hierarchical architecture of characters. In particular, the polarity of characters is quantified into a distance calculation to measure the shared ancestral relationship among species, and the character vectors of the internal nodes are inferred from the hierarchical relationships. Therefore, when reconstructing a phylogenetic tree, no homology hypothesis regarding inapplicable states is required. Our experimental results show that in most cases, HCPC achieves better performance than traditional approaches such as MDR and SVR in terms of the Robinson-Foulds (RF) distance.
The paper is organized as follows: Section 2 introduces the framework of HCPC. We present the experimental results and the analysis in Section 3. Finally, Section 4 describes the conclusions drawn from the work.

Method
It is well known that if species share more derived character states, they may have closer shared ancestral relationships (Huang, 1996). Moreover, in phylogenetic analyses, inapplicable states are attributed to the hierarchical relationships among characters (Lee et al., 1999). Therefore, when 6 inapplicable states exist in the morphological data, HCPC combines the polarity of the characters with the hierarchical relationships among the characters to reconstruct a phylogenetic tree.
Specifically, HCPC reconstructs the phylogenetic trees using phylogenetic clustering and searches for the parsimonious trees using a simulated annealing algorithm. Fig. 2 shows the framework of the proposed HCPC method.

Phylogenetic clustering for phylogenetic reconstruction
When reconstructing phylogenetic trees, the shared ancestral relationships among the species are inferred based on the shared derived index. The species with the maximum value of the shared derived index are selected to generate an internal node. Then the character vector of this internal node is inferred and added to the data matrix D as a "new" species instead of its direct descendants.
The shared ancestral relationships among species at different levels are established after repeating this process. The flowchart of the phylogenetic clustering based on the character hierarchy for phylogenetic reconstruction is shown in Fig. 3.

Measuring the shared ancestral relationship using the shared derived index
Let D{ 1 , … , } be a finite character matrix of n species where represents the i -th species' character states. The state of the p-th character for is denoted as . The number of characters used to reconstruct phylogenetic trees is denoted as m; therefore, the character vector of species is { 1 , … , }. Suppose that the upper character of the p-th character is the q-th character. That is, when some species lack the q-th character, the p-th character is inapplicable.
According to the polarity of characters, the character states can be divided into two categories, i.e., derived states and ancestral states. The polarity of characters is a dialectical relationship between the derived states and the ancestral states. Let be the ancestral state of the p-th character. Thus, S{ 1 , … , } marks the ancestral state of each character, which is an ancestor's inference. Thus, the character states in the character matrix that are same as the ancestor's inference are the ancestral states. Otherwise, the remaining states are derived states. Specifically, if = , is an ancestral state for species ; otherwise, is a derived state for species .
The derived state and the ancestral state are an asymmetric binary relationship for characters according to the principles of phylogenetic analysis. The number of identical derived character states between and is defined as a shared derived index d( , ), which measures the shared ancestral relationship between and . The shared derived index d( , ) is obtained as follows: , ℎ ≠ "-" and ≠ "-". (1) For each species, the shared derived index is calculated in pairs. From this, we can obtain the metric matrix defined in Eq.
(2). , if { = , ≠ , ≠ " − ", ≠ , and ≠ " − ". = "-" or = "-",and = . , if = "-" or ="-",and = . (3) The character vector of species H is inferred according to the ancestral state of the p-th character . In particular, when the p-th character is an inapplicable state, the state of is inferred by the state of the q-th character . Therefore, the character vector of node H can be determined and the internal node H is added to the matrix D as a "new" species instead of , .
After generating a new hypothetical ancestor H, the difference between any two nodes , (including the HTUs) can be calculated by: Let represents a phylogenetic tree. The tree length of is denoted as ( ). ( ) is the sum of the character changes between all adjacent nodes.

An example to illustrate phylogenetic clustering
In this section, an example is given to illustrate the specific implementation process of phylogenetic clustering. As shown in Fig. 4, there are five characters in the data matrix. Character 1 and character 3 are solitary characters. Character 2 is an upper character so that character 4 and character 5 are applicable only when character 2 is "1". There are four species including Suppose that the ancestor's inference is S{00000}, then the shared derived index is calculated based on the ancestor's inference S{00000}. 3 and 4 have the maximum value of shared derived index so that 3 and 4 are selected to generate a new hypothetical ancestor 5 . The character vectors of 5 are inferred according to 3 and 4 . Subsequently, 5 is added to the data matrix as a "new" species instead of 3 and 4 . Repeat the above steps, then 2 and 5 are selected to generate a new hypothetical ancestor 6 . There are some inapplicable states in the character vectors of 2 so that the states of character 1, 2, and 3 for 6 are inferred according to the polarity of character 1, 2, and 3, whereas the states of character 4 and character 5 for 6 are inferred based on the state of character 2. The state of character 2 is "0"; therefore, the states of character 4 and character 5 are "-" for 6 . After inferring the character vectors, 6 is added to the data matrix as a "new" species instead of 2 and 5 . We repeat this process until the parsimonious tree has been reconstructed.

Simulated annealing algorithm for the parsimonious tree search
There are several methods for establishing the polarity of the characters, including outgroup comparison, stratigraphic positions, and developmental biology (Baum et al., 1996;Foote et al., 2007;Tian et al., 2017). However, none of these is an absolutely convincing method (Tian et al., 2017). Therefore, an exhaustive search is conducted on the ancestor's inference and various possible phylogenetic trees are calculated. However, it is an NP-hard problem to search for the parsimonious tree in the tree space. What should be of concern is that the tree with the minimum evolutionary steps to explain the change in the characters is the most parsimonious tree in the tree space based on parsimony principle (Desper et al., 2002;Farris, 2010). In other words, the tree with the shortest tree length is the most parsimonious tree and represents the object of interest here.
Therefore, a simulated annealing algorithm (SA) is applied to search for the parsimonious tree heuristically according to the principle of parsimony.
The SA is a stochastic optimization algorithm based on the Monte-Carlo iterative strategy. The SA with asymptotic convergence is a global optimization algorithm and has resulted in a probability of l in the search for the optimal solution (Mahapatra et al., 2018). Furthermore, the parsimonious tree obtained by the SA is independent of the initial solution.
In this study, the SA is decomposed into three parts: the ancestor's inference space, objective function, and initial ancestor's states. The ancestor's inference space consists of all possible states for each character. According to the parsimony principle, the parsimony score L( ) is the objective function. S{00 … 00} is the initial ancestor's states, which consists of state "0" of each character. For each character, a random state ("0" or "1") is selected to obtain a new ancestor's inference . The difference between L( ) and L( ) is an attenuation factor. In the SA, the inner loop termination criterion consists of a small change in the attenuation factor in the successive loops, whereas the outer loop termination criterion is to reach a certain number of iterations.

Experimental Results and Analysis
To evaluate the performance of HCPC, we conducted experiments on seven morphological datasets, including three datasets of living species and four paleontological datasets. In the studies by Bouamer et al. (2003), Tang et al. (2014), and Chen et al. (2015), datasets of living species were used and the model trees were obtained using state-of-art methods such as MP and BI. Yang et al. appropriate for binary characters. Therefore, multistate characters have to be transformed into binary characters using binary coding.
We use the RF distance to assess the difference between the reconstructed phylogenetic trees; it determines the number of unshared portions between the inferred tree and the model tree (Goloboff et al., 2010). The RF distance calculation can be expressed by Eq. (8): where is the number of species. | (tree)| is the number of split tree sets of the tree. 1 represents the inferred tree, 2 represents the model tree. The inferred tree refers to the evolutionary tree constructed with the proposed method and the model tree refers to the phylogenetic tree accepted by biologists. The range of the RF distance is 0 ≤ RF ≤ 1. Generally speaking, the shorter the RF distance between the inferred tree and the model tree, the more similar the trees are. (Robinson et al., 1981;Wilkinson, 2010).

Experimental results
Based on the characterization of the morphological datasets and the prior knowledge of biologists, a character hierarchy model is constructed. We use the paleozoic lobopodians of Yang et al. as an example (Fig. 5).   Fig. 6 shows that the inferred trees and the model trees are basically the same for the living species. In this study, we compare the NJ method with HCPC. Fig. 7 shows the performance of the HCPC and NJ methods for the three living species datasets. We observe that the inferred trees reconstructed by HCPC are more similar to the model trees than those based on the NJ method ( Fig. 7). We compare the HCPC method with MP-MDR, MP-SVR, ML, and BI. Fig. 8 shows the RF distance between the inferred tree and the model tree obtained by different methods for the paleontological datasets. Compared with the other methods, HCPC has a smaller RF distance for the datasets of Yang et al. (2015), Liu et al. (2011), andHan et al. (2017). The HCPC method results in a decrease in the RF distance by 45.3% compared to the ML method for the dataset of Liu et al. (2011) and a decrease by 50.1% compared to the BI D for the dataset of Yang et al. (2015).
The performance of the HCPC method is worse than that of the ML method for the dataset of Liu et al. (2010) (Fig. 8).

Discussion
In terms of the RF distance, HCPC outperforms the NJ method for morphology datasets of living species, indicating that HCPC results in a tree that is more similar to the actual phylogenetic tree.
In phylogenetic analysis, the shared derived character states among species are the only evidence for shared ancestral relationships (Foote et al., 2007). Thus, the proposed phylogenetic clustering method measures the shared ancestral relationship among the species based on the shared derived index. Furthermore, under the parsimony criterion, no hypothesis is required regarding parallel evolution and reverse evolution when inferring the character vectors. Therefore, HCPC is a greedy algorithm based on the principle of parsimony, so that HCPC always ensures that the current internal node has evolved its descendants with the least number of evolutionary steps when reconstructing a phylogenetic tree. That is, HCPC prefers that the similarity based on the parsimony assumption is interpreted as parallel evolution or evolutionary reversals. HCPC is effective when the evolution rate is low (the probability of parallel evolution and reverse evolution is also low) or the characters have evolved independently. To some extent, HCPC is also effective when the evolution rates of all species are equal.
Since HCPC is based on hierarchical characters for phylogenetic tree construction, HCPC outperforms the other methods when morphological data are used, especially paleontological morphology data. Thus, HCPC avoids explaining inapplicable states through the assumption of homology. Moreover, the character vectors of the internal nodes are inferred based on the polarity of the characters and the hierarchical relationships among the characters. However, MDR often results in an increase in missing data (Zaragüetabagils et al., 2007). For missing data, every state is possible, whereas, for inapplicable state is impossible . Large amounts of missing data reduce the accuracy of the phylogenetic tree for traditional methods (Maddison, 1993).
SVR results in extra emphasis on the absence of a single structure and acts as a form of weighting in MP; this approach often results in an inappropriate phylogenetic tree (Lee et al., 1999).
There are two reasons for the poor performance of HCPC for the Yang et al. (2015) dataset.
Firstly, the problem of inapplicable states only occurs if two or more regions with applicable states in the branches are separated by regions with species with inapplicable character states (Maddison, 1993). Secondly, some distortions inevitably occur when determining the shared ancestral relationships among species; therefore, HCPC may fall into local optimum. However, in most cases, HCPC provides inferred trees that are very similar to the model trees and the main branches exhibit good consistency between the inferred trees and the model trees. The failure to determine a sufficient number of characters for the Tang et al. (2014) dataset is an important reason for the low confidence in that particular phylogenetic tree. Furthermore, some individual diversity is lost when multivariable data are reduced to a single dimension to measure the shared ancestral relationship among species.

Conclusion
To address inapplicable states more elegantly, the proposed HCPC method incorporates additional prior phylogenetic knowledge including the polarity of the characters and the hierarchical relationships among characters to reconstruct the phylogenetic trees. In this way, the processing of inapplicable states does not involve an assumption of homology. The experimental results show that HCPC results in better performance for phylogenetic tree construction from morphological data than other common methods, especially for early paleontology. In future research, we will focus on using HCPC to obtain global optimal solutions at lower computational costs. In addition, we will also improve HCPC for multivariate character conversion.