HCPC: A New Parsimonious Clustering Method based on Hierarchical Characters for Morphological Phylogenetic Reconstruction
Background: Phylogenetic trees are reconstructed frequently to provide a better interpretation of the evolutionary history of species. However, most traditional methods ignore the hierarchical relationships among characters and neglect the inapplicable state that frequently exists in the morphological data, resulting in poor performance of the phylogenetic analysis.
Results: In this study, we propose a phylogenetic clustering method based on hierarchical characters, which we call Hierarchical Characters Parsimonious Clustering (HCPC). To combine prior phylogenetic knowledge and treat the inapplicable state more reasonably, HCPC works in two stages: phylogenetic reconstruction and parsimonious tree search. During phylogenetic reconstruction, HCPC infers the shared ancestral relationships among species. For the parsimonious tree search, we use a simulated annealing algorithm to search heuristically for the phylogenetic tree under the parsimony criterion. In addition, HCPC combines asymmetric binary relationships and character hierarchies to resolve the ambiguity of the inapplicable state.
Conclusion: The experimental results show that the proposed method outperforms existing methods in phylogenetic analysis and gives biologists a scientific and quantitative basis for studying species evolution.
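The abstract describes a simulated annealing search under the parsimony criterion without giving details. As a rough illustration only, the sketch below pairs the standard Fitch parsimony count for a single character with the usual Metropolis acceptance rule. It is not HCPC's own scoring (the hierarchical treatment of inapplicable states is not reproduced), and the names `fitch_score` and `accept`, the nested-tuple tree encoding, and the temperature parameter are all assumptions made for this example.

```python
import math
import random


def fitch_score(tree, states):
    """Standard Fitch pass for ONE character on a rooted binary tree.

    tree   : a leaf name (str) or a 2-tuple of subtrees (assumed encoding)
    states : dict mapping leaf name -> state token, e.g. "0" or "1"
    Returns (candidate state set at this node, number of changes below it).
    Note: this is plain Fitch parsimony, not the HCPC inference rule.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_score(tree[0], states)
    right_set, right_cost = fitch_score(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def accept(current_cost, proposed_cost, temperature, rng=random):
    """Metropolis rule: always accept improvements (or ties); accept a
    worse tree with probability exp(-(increase in cost) / temperature)."""
    if proposed_cost <= current_cost:
        return True
    return rng.random() < math.exp((current_cost - proposed_cost) / temperature)


# Hypothetical four-taxon example with one binary character.
states = {"A": "0", "B": "0", "C": "1", "D": "1"}
grouped = (("A", "B"), ("C", "D"))   # like states are sisters
mixed = (("A", "C"), ("B", "D"))     # like states are separated
```

In a full search, a proposal step would rearrange the tree, the cost would be summed over all characters, and the temperature would be lowered over time; those parts are omitted here.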
Figures 1-8
This looks like an intriguing approach to an important problem, though there are a number of details that I'd like to see clarified.
Firstly, I'm not clear that your clustering method favours similarity due to common ancestry: could internal nodes simply represent 'averages' of species that are similar due to convergence? It seems that the concerns that have been raised against neighbour-joining approaches are equally relevant to this method.
Secondly, I'm not convinced that the inference of character vectors given in equation (3) is sufficient to reconstruct the most parsimonious situation for an ancestor in all cases. Admittedly, I've not had the time to think this through carefully, but are all of the edge cases advanced in the Supplementary Materials to Brazeau et al. (2019, Syst Biol - see https://rawgit.com/TGuillerme/Inapp/c4d502fa4ae3702eb207b2da3a692e469ba3b6ab/inst/gitbook/_book/index.html) really treated in a satisfactory manner? As one example, why must the ancestor in fig. 4u bear inapplicable tokens: could not X2 represent a case of secondary loss of the parent character? I do not see why a reconstruction of X6={10101} should not be considered equally valid.
If the new method really does outperform alternatives, then perhaps these concerns can be dismissed as merely theoretical. But I'd want to be thoroughly convinced that the perceived improvement in performance is genuine. A cynic might raise a number of concerns about your analysis.
(i) Are the model trees really true? If 'the opinions of palaeontologists' are sufficient to guarantee that a tree is correct, then why bother with phylogenetic analysis? We could just ask a palaeontologist! I've seen two approaches here: simulating data from known trees (perhaps difficult as there is no good model with which to simulate inapplicable data), and using trees that are well corroborated by molecular data (see for example Pattinson et al. 2015 Syst Biol; Asher et al. 2019 Royal Society Open Science).
(ii) Is the normalized RF distance misleading? For thoughts on normalization, see Smith (2019, Biology Letters); for thoughts on the limitations of the RF distance, and more suitable alternatives, see Smith (2020, Bioinformatics).
(iii) Are the datasets correctly coded? The Yang et al. dataset is exemplary in that it explicitly lists all hierarchies; have all datasets under consideration been coded with equivalent care, scrutiny and transparency?
(iv) Why not compare your method with other methods that explicitly handle inapplicable data, such as those of Brazeau et al. (2019, Syst Biol) and Tarasov (2019, Syst Biol)? (Surely these approaches warrant discussion in your paper?)
(v) How are the comparison analyses performed? I couldn't see details of the BI runs, for example: had they converged? Were characters partitioned? Why compare with equal-weights parsimony, when implied weighting is known to perform better (e.g. Smith 2019, Biol Lett)?
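To make point (ii) concrete, here is a minimal sketch of how a normalized Robinson-Foulds distance can be computed for rooted trees, by counting clades present in one tree but not the other. The nested-tuple encoding, the function names, and the choice to normalize by the total number of non-root clades in both trees are assumptions for illustration; the manuscript's own normalization may differ.

```python
def clades(tree):
    """Collect the leaf set of every internal node of a nested-tuple tree,
    excluding the root clade (which every tree on the same taxa shares)."""
    found = set()

    def walk(node):
        if isinstance(node, str):          # a leaf is a taxon name
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)
    return found


def rf_distance(t1, t2):
    """Number of clades found in exactly one of the two trees."""
    return len(clades(t1) ^ clades(t2))


def normalized_rf(t1, t2):
    """RF distance divided by the total clade count of both trees
    (one possible normalization; assumed here, not taken from the paper)."""
    c1, c2 = clades(t1), clades(t2)
    total = len(c1) + len(c2)
    return len(c1 ^ c2) / total if total else 0.0


# Hypothetical four-taxon trees.
ladder_ab = ((("A", "B"), "C"), "D")
ladder_ac = ((("A", "C"), "B"), "D")
balanced = (("A", "B"), ("C", "D"))
```

With these trees, `ladder_ab` and `ladder_ac` differ only in which pair is innermost, while `ladder_ab` and `balanced` differ in a deeper grouping, yet both pairs receive the same normalized score under this scheme. That coarseness is the kind of limitation that motivates asking whether the measure is informative.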
Posted 06 Jan, 2021