Phylogenies From Unaligned Whole Genomes Using Sequence Environments of Amino Acid Residues


 Alignment-free methods for sequence comparison and phylogeny inference have attracted a great deal of attention in recent years. Several algorithms have been implemented in diverse software packages. Despite the great number of existing methods, most of them are based on word statistics. Although they propose different filtering and weighting strategies and explore different metrics, their performance may be limited by the phylogenetic signal preserved in these words. Herein, we present a different approach based on the species-specific amino acid neighborhood preferences. These differential preferences can be assessed in the context of vector spaces. In this way, a distance-based method to build phylogenies has been developed and implemented into an easy-to-use R package. Tests run on real-world datasets show that this method can reconstruct phylogenetic relationships with high accuracy, and often outperforms other alignment-free approaches. Furthermore, we present evidence that the new method can perform reliably on datasets formed by non-orthologous protein sequences, that is, the method not only does not require the identification of orthologous proteins, but also does not require their presence in the analyzed dataset. These results suggest that the neighborhood preference of amino acids conveys a phylogenetic signal that may be of great utility in phylogenomics.


Introduction
It is a well-established fact that different genes from the same set of organisms often lead to different phylogenetic trees 1 . That happens even with mitochondrial-encoded genes 2,3 , despite the fact that such genes are inherited together without recombination, and the risk of confusing orthologous with paralogous sequences is non-existent. If, in addition, phenomena such as horizontal gene transfer, recombination, unrecognized paralogy, and highly variable rates of evolution are in place, the task of reconstructing accurate phylogenetic topologies can be seriously compromised. Not surprisingly, species phylogenies derived from comparison of single genes are seldom consistent with each other. To overcome this problem, two strategies are currently used when resolving phylogenies based on multiple alignments. In the so-called supermatrix approach, individual aligned genes or proteins are concatenated into a supermatrix, which is then subjected to phylogenetic analyses using either maximum likelihood or Bayesian inference 4 . In the alternative supertree method, gene or protein data sets are analyzed separately. Afterwards, the trees derived from these independent analyses are used to infer a single joined phylogeny 5,6 . Each of these alternatives has its own strengths and weaknesses, which has led to extensive discussions regarding the best strategy to conduct phylogenetic analyses of sequence data from multiple genes or proteins [7][8][9][10] . Nevertheless, both approaches have in common that they are time consuming, and they often require manual intervention. On the other hand, among the diverse sources of error in molecular phylogenies, incorrect sequence alignments rank high 11,12 . Therefore, those methods based on sequence alignments are prone to artefacts when used in phylogenomics 13 . Indeed, a number of previous studies have shown that the alignment method can have a considerable impact on tree topology [14][15][16][17][18] . 
Although attempts have been made to deal with multiple sequence alignment uncertainty during phylogeny reconstruction 19 , a satisfactory and computationally tractable way to do so is still lacking. Alignment artefacts have become an even bigger problem in the era of phylogenomics, where thousands of genes are automatically analyzed without accounting for alignment uncertainty 14 .
With the advent of modern genome sequencing techniques, it is now possible to consider phylogeny inference based on whole genome sequences. However, given that most genomes contain millions of nucleotides, the standard approach based on positional homology (where each column of a multiple sequence alignment is considered a homologous character) becomes impractical. Consequently, alternative approaches to compare whole genomes have been proposed. Thus, gene arrangement 20 and gene content 21,22 have been explored as strategies to compare whole genomes. More recently, a large number of alignment-independent methods to compare sequences have been developed, and their utility in phylogenomics has been evaluated 23 . These so-called alignment-free approaches include methods based on word counting 24,25 , some of which implement diverse strategies to discriminate signal from noise 26-31 . Other published methods are based on matching statistics (i.e., they compute the length of common substrings with or without allowing mismatches) 32,33 , information theory 34,35 , splits driven by common subsequences 36 , or micro-alignments 37,38 .
In this study, we describe a new and fast method for generating molecular phylogenies using multiple proteomes or protein-coding genomes. This method, which does not require sequence alignment or the identification of orthologous proteins, is based on a rationale previously unexplored in the context of phylogeny: the preference of each amino acid to be surrounded by other amino acids [39][40][41] . These species-specific preferences seem to possess a phylogenetic signal strong enough to reconstruct accurate tree topologies, even when the proteins analyzed from each species are functionally unrelated to the proteins selected from the other species.

Material And Methods
The species vector space
It has been shown that every amino acid has a characteristic sequence environment in proteins 39,46 . In previous works, we have analyzed the sequence environment (10 residues on each side) around methionine residues in human proteins 40,41,47 . Thus, based on methionine residues alone, the human proteome can be characterized by a matrix, $M$, whose elements ($m_{ij}$) provide the absolute frequency of amino acid $i$ at position $j$ in the environment of methionine residues (Figure 1A). Similarly, for each proteinogenic amino acid, X ∈ {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V}, a $20 \times 20$ matrix with elements $x_{ij}$ can be computed. In this way, the protein-coding genome of a given species of interest can be characterized by a set of 20 square matrices of order 20 or, equivalently, by a vector, $u \in U^{8000}$, of dimension 8000 (Figure 1B). More generally, when coding species as vectors, the dimension of these vectors ($n$) depends on the radius chosen for the environment. Thus, $u \in U^{n}$, where $n = 20 \cdot (2 \cdot \mathrm{radius}) \cdot 20 = 800 \cdot \mathrm{radius}$. In this way, each vector represents an organism (its protein-coding genome), and the components (coordinates) of the vector reflect the preference of the different amino acids at the different positions of the sequence environments.
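The encoding described above can be illustrated with a minimal Python sketch. It is not the EnvNJ implementation: the function name `species_vector`, the ordering of the vector components, and the decision to skip neighbours that fall beyond the ends of a sequence are assumptions made here for illustration only.

```python
from collections import Counter

# The 20 proteinogenic amino acids, in the order listed in the text.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def species_vector(proteins, radius=10):
    """Encode a collection of protein sequences as a vector of 800 * radius counts.

    Hypothetical helper: each component counts how often a neighbour residue
    is found at a given offset from a given central residue.
    """
    offsets = [k for k in range(-radius, radius + 1) if k != 0]
    counts = Counter()
    for seq in proteins:
        for pos, centre in enumerate(seq):
            if centre not in AMINO_ACIDS:
                continue
            for off in offsets:
                j = pos + off
                # Assumption: positions outside the sequence are ignored.
                if 0 <= j < len(seq) and seq[j] in AMINO_ACIDS:
                    counts[(centre, seq[j], off)] += 1
    # Flatten in a fixed order: central residue, neighbour residue, offset.
    return [counts[(c, a, off)]
            for c in AMINO_ACIDS for a in AMINO_ACIDS for off in offsets]


# Two toy sequences standing in for a proteome.
u = species_vector(["MKTAYIAKQR", "MSTNPKPQRK"], radius=2)
print(len(u))  # dimension is 20 * (2 * radius) * 20
```

With `radius=10` the same function returns a vector of dimension 8000, matching the 20 matrices of order 20 described in the text.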
A suitable metric for the species vector space
Once species are encoded as high-dimensional vectors, we can make use of the extensive toolkit of numerical linear algebra. Since we are interested in assessing distances between species, we must endow this vector space with a metric suited to our purpose. To this end, we look for functions, $d$, able to provide a distance between vectors:
$$d: U^{n} \times U^{n} \to \mathbb{R}$$
In general, any function $d$ must satisfy the following four properties to be considered a distance: (i) non-negativity: $d(u_i, u_j) \geq 0 \;\; \forall u_i, u_j \in U^{n}$; (ii) coincidence axiom: $d(u_i, u_j) = 0 \iff u_i = u_j$; (iii) symmetry: $d(u_i, u_j) = d(u_j, u_i) \;\; \forall u_i, u_j \in U^{n}$; and (iv) triangle inequality: $d(u_i, u_j) \leq d(u_i, u_k) + d(u_k, u_j) \;\; \forall u_i, u_j, u_k \in U^{n}$. A wide variety of metrics can be used to measure relatedness between vectors. However, as illustrated in Figure 1C and 1D, not all of them are equally suitable for establishing evolutionary relationships between species. Furthermore, the link between a given metric and its performance is not always obvious 48 . In the context of latent semantic analysis, a common measure of similarity between two vectors is the cosine of the angle between them 48,49 . Since protein sequence data can be regarded as a complex written language, Stuart and coworkers have proposed the cosine between two vectors as a suitable measure of similarity when the vectors contain information related to protein sequences 26,27 . For instance, given the protein-coding genomes of species $i$ and $j$ (Figure 1C), their similarity can be assessed by the expression:
$$\cos\theta_{ij} = \frac{u_i^{T} u_j}{\|u_i\| \, \|u_j\|}$$
where $u_i^{T} u_j$ is the dot product of the vectors $u_i$ and $u_j$, and $\|\cdot\|$ is the Euclidean vector norm. It should be noted that the function $f(u_i, u_j) = \cos\theta_{ij}$ is not a distance properly speaking.
For instance, suppose that two species have identical genomes. In this case $u_i = u_j$, and we would expect a null distance between them. However, $f(u_i, u_j) = \cos\theta_{ij} = \cos 0 = 1$. Nevertheless, pairwise cosine values can be converted into pairwise evolutionary distances using the following formula:
$$d(u_i, u_j) = -\ln\left(\frac{1 + \cos\theta_{ij}}{2}\right)$$
This formula converts a similarity measure into a distance measure 26 . It is important to note that this evolutionary distance is not a proper distance metric, as it violates the coincidence axiom. For instance, $d(u_i, 2u_i) = 0$ but $u_i \neq 2u_i$. However, this violation is very convenient for our goals.
A concrete example will be useful to understand this assertion. Suppose that we have a population that splits into two species. Suppose, further, that one of these species undergoes a genome duplication event, but otherwise their proteomes are identical. In such a scenario, the computed species vectors would satisfy $u_j = 2u_i$. Although the two vectors have different lengths, their directions are identical, so we obtain $d(u_i, u_j) = 0$, which conveniently reflects the fact that their proteomes are equal and therefore the neighborhood preferences of their sequence environments are the same in both species (Figure 1D).
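The genome-duplication example can be checked numerically with a short Python sketch. It assumes the logarithmic conversion $d = -\ln((1 + \cos\theta)/2)$ for turning cosines into dissimilarities; the function names and the toy vectors are hypothetical and are not taken from the EnvNJ package.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Dissimilarity -ln((1 + cos) / 2), written as ln(2 / (1 + cos))."""
    # Clamp to [-1, 1] to absorb floating-point rounding.
    c = max(-1.0, min(1.0, cosine(u, v)))
    if c == -1.0:
        return math.inf  # opposite directions: the logarithm diverges
    return math.log(2.0 / (1.0 + c))


u_i = [3.0, 1.0, 4.0, 1.0, 5.0]
u_j = [2.0 * x for x in u_i]      # "duplicated genome": same direction, twice the length
u_k = [5.0, 1.0, 4.0, 1.0, 3.0]   # a vector pointing in a different direction

print(cosine_distance(u_i, u_j))  # numerically zero although u_i != u_j
print(cosine_distance(u_i, u_k))  # strictly positive
```

Because only the direction of the vectors matters, scaling every count by the same factor leaves the dissimilarity unchanged, which is the behaviour the text describes as convenient.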

Environment-based trees
After encoding the genome of each species into a vector, as described above, these vectors are used to obtain a matrix of pairwise cosine values that are subsequently converted into a matrix of pairwise evolutionary distances using the formula given in the previous section. Alternatively, other metrics can be used to obtain a distance matrix. In the EnvNJ package accompanying this paper, we have implemented 28 different metrics among which the user can choose. However, in our experience the cosine-based dissimilarity and the Jensen-Shannon distance are among the best performing metrics for phylogenetic analyses using sequence environments. In any case, the obtained distance matrix can be used to produce a phylogenetic tree employing the neighbor joining algorithm 50 .
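To make the last step concrete, the following Python sketch applies neighbor joining to a small distance matrix. It is a simplified illustration, not the code used by EnvNJ: it recovers the topology only (branch lengths are not computed), the function name `neighbor_joining` and the nested-tuple tree representation are assumptions, and the toy distances were chosen here to be additive on a known five-taxon tree.

```python
from itertools import combinations


def neighbor_joining(labels, dist):
    """Return an unrooted tree topology as nested tuples.

    labels: list of distinct taxon names.
    dist:   dict mapping (a, b) pairs to distances, one entry per unordered pair.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    nodes = list(labels)
    while len(nodes) > 3:
        n = len(nodes)
        # Sum of distances from each node to all the others.
        total = {a: sum(d[frozenset((a, b))] for b in nodes if b != a)
                 for a in nodes}
        # Join the pair minimising Q(a, b) = (n - 2) d(a, b) - total(a) - total(b).
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]])
        rest = [c for c in nodes if c != a and c != b]
        new = (a, b)
        # Distance from the new internal node to every remaining node.
        for c in rest:
            d[frozenset((new, c))] = (d[frozenset((a, c))]
                                      + d[frozenset((b, c))]
                                      - d[frozenset((a, b))]) / 2
        nodes = rest + [new]
    return tuple(nodes)


labels = ["A", "B", "C", "D", "E"]
dist = {("A", "B"): 3, ("A", "C"): 5, ("A", "D"): 6, ("A", "E"): 5,
        ("B", "C"): 6, ("B", "D"): 7, ("B", "E"): 6,
        ("C", "D"): 7, ("C", "E"): 6, ("D", "E"): 3}
tree = neighbor_joining(labels, dist)
print(tree)
```

In this toy example A groups with B and D groups with E, as expected from the tree used to generate the distances. For real analyses an established implementation is preferable.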

Implementation
The Env-NJ tree building method has been implemented in an R package, EnvNJ. The package, which works on all major operating systems (Windows, macOS and Linux), can be installed either from CRAN, install.packages("EnvNJ"), or from its Bitbucket repository, by typing the following three commands consecutively in an R terminal: install.packages("devtools"), library(devtools), install_bitbucket("jcaledo/envnj", subdir = "REnvNJ"). Since the protein sequence datasets analyzed in the current work (see below) are also included in the package, once it has been installed, the trees shown in Figures 2 and 3C can be easily obtained with the commands envnj(bovids, r = 2) and envnj(reyes, r = 46), respectively. Further help can be obtained from the package documentation by typing ?envnj in the R terminal. A vignette about the use of the EnvNJ package can be found as Supplementary Material.
In addition to the Env-NJ method described in this paper, the package EnvNJ also implements the method based on the SVD-n-Gram approach, previously described by Stuart and coworkers 26 . The aim was to facilitate its use for comparative purposes (check the documentation, ?svdgram).

Mitogenome Datasets
The mtDNA-encoded protein sequences were obtained from the NCBI genome database (https://www.ncbi.nlm.nih.gov/genome/organelle). Two sets of mitogenomes have been analyzed in the current work. The first set is formed by 11 species of bovids: Bison bison, Bison bonasus, Bos grunniens, Bos indicus, Bos javanicus, Bos primigenius, Bos taurus, Bubalus bubalis, Bubalus depressicornis, Pseudoryx nghetinhensis and Syncerus caffer. An R dataframe containing these sequences can be loaded and examined by typing data(bovids) after having installed the R package EnvNJ.
A second set of mitogenomes analyzed in this study is the one formed by 34 mammalian species spanning 13 orders, first used by Reyes and coworkers 42 .

Env-NJ trees using non-orthologous protein datasets
We chose three closely related primate species (human, chimp and gorilla) and two Arabidopsis species (A. thaliana and A. lyrata). Since we wanted to make sure that no orthology could be established between any pair of proteins from the dataset subjected to analysis, we proceeded as described next. First, we identified a set of 907 one-to-one orthologous proteins present in the five species. To achieve that, we took advantage of the REST API for the OMA orthology database 51,52 . Both the dataset (oseq.Rda) and the script (Oma_PlantAnimal.R) used to obtain it can be downloaded from https://bitbucket.org/jcaledo/envnj/src/master/Datasets and https://bitbucket.org/jcaledo/envnj/src/master/AncillaryCode, respectively. The oseq.Rda object is a dataframe with five columns (one per species) and 907 rows (one per orthologous protein), and each entry contains the corresponding protein sequence. To form a dataset of non-orthologous proteins we proceeded as follows. For the first column (species) we randomly chose 180 rows (proteins). The randomly selected rows were then discarded from the dataframe before proceeding with the next column (species). Among the remaining rows, we again randomly selected 180, took the corresponding proteins from the second species, and removed these rows from the dataframe. This operation was repeated until reaching the last species, at which point we had a collection of 900 non-orthologous proteins (180 per species). This randomly selected dataset was then subjected to Env-NJ.
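The sampling scheme described above amounts to drawing disjoint sets of rows, one set per species. A minimal Python sketch follows; the author's actual procedure is the R script Oma_PlantAnimal.R, so the function name, its arguments and the seed used here are hypothetical.

```python
import random


def sample_non_orthologous(n_rows, n_species, per_species, seed=None):
    """Pick disjoint sets of row indices, one set per species (column).

    Since each row of the table holds one orthologous group, using a row for
    at most one species guarantees that no two sampled proteins are orthologs.
    """
    rng = random.Random(seed)
    remaining = list(range(n_rows))
    chosen = []
    for _ in range(n_species):
        picked = rng.sample(remaining, per_species)
        chosen.append(sorted(picked))
        picked_set = set(picked)
        # Discard the selected rows before moving on to the next species.
        remaining = [r for r in remaining if r not in picked_set]
    return chosen


rows = sample_non_orthologous(n_rows=907, n_species=5, per_species=180, seed=1)
print([len(r) for r in rows])
```

Each inner list gives the rows whose sequence would be taken from the corresponding species column; with 907 rows and 180 rows per species, 7 rows remain unused.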

Robinson-Foulds distance
As a measure of the accuracy attributable to each phylogeny, the normalized Robinson-Foulds (nRF) distances between the reconstructed trees and the reference trees were computed. The Robinson-Foulds algorithm to compute distances between tree topologies 53 , as implemented in the R package phangorn 54 , was used for this purpose. Briefly, let $T_1$ and $T_2$ be the two sets formed by all the splits at internal edges of tree 1 and tree 2, respectively (the two trees whose topologies we want to compare); then the cardinality of the symmetric difference of these two sets provides the Robinson-Foulds distance.
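The definition above can be sketched in a few lines of Python. This is an illustration of the split-based definition, not the phangorn code: trees are represented as nested tuples of leaf names (an assumption made here), both trees are expected to share the same leaf set, and the normalization divides by $2(n - 3)$, the maximum distance between two unrooted binary trees with $n > 3$ leaves.

```python
def leaves(tree):
    """Set of leaf names below a node (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial splits induced by the internal edges of the tree."""
    all_leaves = leaves(tree)
    ref = min(all_leaves)  # each split is stored as the side lacking this leaf
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        side = all_leaves - clade if ref in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def robinson_foulds(t1, t2, normalized=False):
    """Cardinality of the symmetric difference of the two split sets."""
    rf = len(splits(t1) ^ splits(t2))
    if normalized:
        n = len(leaves(t1))
        return rf / (2 * (n - 3))
    return rf


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(t1, t2), robinson_foulds(t1, t2, normalized=True))
```

The two example trees share the split that separates D and E from the rest but disagree on how A, B and C are grouped, so one split from each tree goes unmatched.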

Results
As a proof of concept, the phylogenetic relationships of 11 species of bovids were addressed using their protein-coding mitogenomes and the new method described above, hereinafter referred to as Env-NJ. The topology of the reconstructed tree is shown in Figure 2. This topology fully matches that of the tree inferred using traditional alignment-based methods (reference tree), which reflects the accepted phylogeny for this group of bovids. For comparative purposes, the same mitogenomes were analyzed with the following alignment-free tree-building packages: Feature Frequency Profile (FFP) 24 , alfpy (a stand-alone Python application that implements different approaches as well as different metrics to assess distances between vectors) 25 , CVTree 30 , ALFRED-G 33 , SANS-serif 36 and Prot-SpaM 37 . In all cases, the resulting trees were compared to the reference tree. Table 1 shows the corresponding normalized Robinson-Foulds distances.
Table 1. Normalized Robinson-Foulds distances between trees obtained using different alignment-independent methods with respect to the reference tree shown in Figure 2. The whole protein-coding mitogenomes of a group of 11 species of bovids were analyzed. For details regarding the input parameters used for each method, please consult the given reference. Briefly, cos, jsd and cheb stand for cosine, Jensen-Shannon and Chebyshev metrics, respectively. The abbreviation f.p. stands for 'further processing'. Thus, (a) filters considering background word frequencies; (b) uses singular value decomposition, SVD, to analyze the 4-mer frequency data; (c) the matrix used for the W-metric analysis was Blosum62; (d) the parameters used for Prot-SpaM were a weight of w = 6, with d = 40 (don't-care positions) and m = 5 patterns. The filter parameter of SANS-serif was set to 'strict'. NCD is the acronym of normalized compression distance.
We next challenged the Env-NJ method with a larger, more diverse, and more controversial dataset consisting of the mitogenomes of 34 mammalian species spanning 13 orders. The phylogenetic relationships between the organisms of this set were first analyzed by Reyes and coworkers using a maximum likelihood approach 42 . Figure 3A reproduces the topology of the tree obtained by these authors. Nevertheless, since alternative hypotheses for the phylogeny of this set of species have been proposed by different authors 27,34 , in order to adopt a reference tree we resorted to the community resource VertLife 43 to draw the topology of the tree (Figure 3B) that relates the species under study. As shown in Figure 3, the Env-NJ method yields a credible phylogeny where rodents, primates, carnivores, cetartiodactyls and perissodactyls are some of the well-established mammalian lineages that appear as uninterrupted groupings within the Env-NJ tree. In order to obtain a more quantitative comparison, we next built 14 trees using the same dataset and different alignment-free tree building approaches.
Afterwards, the normalized Robinson-Foulds symmetric difference between these trees and the reference tree was computed (Table 2). As can be observed in this table, Env-NJ with the Jensen-Shannon metric provided the best result, understood as the one with the lowest Robinson-Foulds distance to the reference tree.
Table 2. Normalized Robinson-Foulds distances between trees obtained using different alignment-independent methods with respect to the reference tree shown in Figure 3B. The whole protein-coding mitogenomes of a group of 34 mammalian species were analyzed. For details regarding the input parameters used for each method, please consult the given reference. Briefly, cos, jsd and cheb stand for cosine, Jensen-Shannon and Chebyshev metrics, respectively. The abbreviation f.p. stands for 'further processing'. Thus, (a) filters considering background word frequencies; (b) uses singular value decomposition, SVD, to analyze the 4-mer frequency data; (c) the matrix used for the W-metric analysis was Blosum62; (d) the parameters used for Prot-SpaM were a weight of w = 6, with d = 40 (don't-care positions) and m = 5 patterns. The filter parameter of SANS-serif was set to 'strict'. NCD is the acronym of normalized compression distance.
At this point, we reasoned as follows. If the amino acid neighborhood preference in sequence environments is a species feature, then it may be possible to reconstruct phylogenetic relationships using non-orthologous protein sets. To explore the potential of the Env-NJ method in this respect, we selected five species (three animals and two plants) whose phylogenetic relationships are well established. For each species a random set of 180 proteins was selected, with the only restriction that no protein belonging to this set could be orthologous to any of the proteins selected for the remaining species.
These random sets of non-orthologous protein sequences were used to generate the Env-NJ tree. To compare the performance of our method with that of previously proposed alignment-free approaches, the same dataset was subjected to 7 alternative methods: Prot-SpaM 37 , W-metric 25,31 , FFP 24 , CVTree 30 , ALFRED-G 33 , normalized compression distances (NCD) 25 and SANS-serif 36 . As shown in Figure 4, the new methodology yielded the correct tree topology even when non-orthologous proteins were employed, and under these specific conditions it seems to outperform other tree building methods.

Discussion
Traditionally, the starting point to construct a molecular phylogeny has been identifying and gathering a set of evolutionarily related (orthologous) sequences. However, before using these sequences to build a tree, it is important to ensure that each nucleotide or amino acid in each sequence is compared only with the corresponding homologous nucleotide or amino acid in the other sequences, which is referred to as positional homology. This preliminary task, which is one of the trickiest parts of the whole phylogenetic reconstruction process, is performed by aligning the sequences to one another. It should be noted that none of the frequently used alignment programs is capable of consistently producing perfect alignments, even when moderately divergent sequences are employed 44 . For that reason, it is always important to check the alignment quality before continuing with the phylogenetic reconstruction procedure. Obviously, this protocol does not scale to phylogenomics. Since most genomes contain millions of sequence characters, these traditional methods based on positional homology comparisons, carried out over ambiguously resolved large-scale alignments, are impractical 14 . Thus, it seems that the problem of phylogenomic reconstruction based on site evolution is unlikely to be solved in the near future. To overcome this problem, different approaches have been explored.
One of these approaches to whole-genome phylogenetic analysis has focused on the ordering of the genes along the chromosomes, while others have resorted to gene content as their primary data. The use of gene-order and gene-content data in the context of phylogeny is the subject of important research efforts. However, important challenges remain. Mapping a full genome is a demanding task. Furthermore, the subsequent analysis of the annotated genome is computationally expensive and time consuming because of the extreme mathematical complexity of gene orders. For instance, for a chromosome with $n$ distinct single-copy genes, the number of possible states is $2^{n-1}(n-1)!$ 45 . This computational burden means that all reconstruction methods face a considerable challenge, even on small datasets consisting of only a few genomes. Furthermore, in these approaches the information contained in a genome is largely simplified, in the sense that point mutations are completely ignored. That is, these methods make use of a form of lossy data compression, so that relevant information contained in a genome is not used to infer its evolutionary history.
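To give a sense of how quickly the number of gene orders grows, the expression $2^{n-1}(n-1)!$ can be evaluated directly. The following Python lines are only a worked arithmetic example; the function name and the values of $n$ are arbitrary choices made here.

```python
import math


def gene_orders(n):
    """Number of possible states for n distinct single-copy genes: 2^(n-1) * (n-1)!"""
    return 2 ** (n - 1) * math.factorial(n - 1)


for n in (3, 5, 10, 20):
    print(n, gene_orders(n))
```

Already for a few tens of genes the state space is far beyond exhaustive enumeration.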
In this report, we describe a method for generating phylogenetic trees from whole genomes without resorting to lossy data compression. The protein-coding genomes of the set of organisms being analyzed are converted into a matrix, where each column vector represents a species. More concretely, these vectors represent the species-specific amino acid neighborhood preferences (Figure 1). During our previous investigations, we had observed that different species exhibited a differential preference for amino acids in the vicinity of their methionine residues, despite the fact that the relative frequencies of the proteinogenic amino acids were very similar in the analyzed species. This observation prompted us to explore the potential of sequence environments to accurately reconstruct phylogenies using genome datasets of unaligned sequence information.
As a first approach, we decided to carry out a pilot study using an optimal set of genomes, in the sense that the expected tree topology is widely accepted. To this end, we chose the protein-coding mitogenomes of 11 species of bovids. This dataset is small and simple, comprising only 143 proteins whose sequences are curated by the NCBI and are expected to be very accurate. Furthermore, the orthologous relationships among the proteins belonging to this set are obvious and undisputed. Moreover, mitochondrial sequences are often used to generate metazoan phylogenies, hence the tree generated by Env-NJ can be easily compared to those generated by other methods, either alignment-based or alignment-independent. Given that the group was formed by closely related species, and given the optimal conditions discussed above, it is not surprising that most methods consistently produced the same tree topology (Figure 2 and Table 1).
Encouraged by this success, we next tested the Env-NJ method with a larger, more diverse and more controversial dataset consisting of the mitogenomes of 34 mammalian species spanning 13 orders. This dataset, first analyzed by Reyes and coworkers using alignment-based methods 42 , has since been used by different authors employing different tree building methods. Thus, the same genome set has been analyzed by Stuart and coworkers using the SVD-4-Gram method 27 and also by Li and colleagues, using a method that also works on unaligned sequences, but in this case exploiting the Kolmogorov complexity concept to estimate distances between genomes 34 . In the R package EnvNJ accompanying the current paper, we have implemented, in addition to the Env-NJ method, those utilities required to reproduce the [...] Table 2 summarize the comparison of the topologies of the trees obtained with the different methods. Overall, the Env-NJ tree seems to be a reasonably good approximation to the true tree, at least as good as any of the trees obtained by alternative methods. However, the Env-NJ method, as discussed below, may outperform other alignment-independent methods when operating with non-orthologous protein sequence datasets.
Hitherto, we have assessed the performance of the Env-NJ method on mitogenomes. In this context, the new method seems to be a valid alternative for phylogenomics since it has three valuable properties: (i) accuracy, (ii) speed and (iii) independence of positional homology. Indeed, Env-NJ does not rest on positional homology, and it does not require the identification of orthologous proteins to proceed with the computation. However, it is one thing for the method not to require the identification of orthologous proteins, and quite another for it not to require the presence of orthologous proteins in the dataset. The second property cannot be tested with mitogenomes, where the presence of one-to-one orthologous proteins is guaranteed. Therefore, we next wondered whether the amino acid neighborhood preference is a global property of the species, containing a phylogenetic signal strong enough to infer its evolutionary history. In other words, is Env-NJ able to reconstruct a phylogeny by analyzing non-orthologous proteins, that is, when each species contributes a set of proteins completely unrelated to the protein sets contributed by the other species under analysis?
To address this issue, we chose a small set of species formed by three animals (human, chimp and gorilla) and two plants (Arabidopsis thaliana and A. lyrata). For each species we randomly sampled 180 protein sequences from its proteome. The selection process was random, with the only restriction that there were no pairs of orthologous proteins among the 900 sequences that made up the dataset (both the script to sample the sequences and the sequences themselves can be obtained at https://bitbucket.org/jcaledo/envnj/src/master/AncillaryCode/Oma_PlantAnimal.R, and https://bitbucket.org/jcaledo/envnj/src/master/Datasets/oseq.Rda, respectively). When this dataset was subjected to Env-NJ, the recovered tree was the expected one (Figure 4), where human was closer to chimp than to gorilla, and the two plants appeared as sister operational taxonomic units (OTUs). More interestingly, under these stringent conditions, the strategy based on sequence environments was the only one that provided an acceptable result. Having a tree building method that does not require the identification of orthologs is a good thing, and Env-NJ is indeed such a method. Having a method that does not even require the presence of orthologs in the dataset is even better, and Env-NJ may fulfil this requirement.

Conclusion
In the current report we describe a new tree-building method and its implementation in an R package (EnvNJ). This new method presents several advantages: (i) it does not resort to lossy data compression; (ii) it is computationally very fast, making it suitable for addressing whole genomes; (iii) because the method makes use of whole genomes, there is no gene tree versus species tree problem; (iv) there is no need for multiple sequence alignment, which contributes to the speed of the method and avoids the impact of misalignments on the tree topology; (v) it does not require orthology identification, which further contributes to shortening computation times. Finally, the possibility that Env-NJ may perform well even with non-orthologous protein datasets is a line of research that deserves further work in the future.

ACKNOWLEDGEMENT
The author is indebted to Pablo Aledo for his advice and help with all the aspects related to linear algebra. The author also thanks Alicia Esteban and Elena Aledo for helpful discussion during the preparation of this work. This work was supported by the European Regional Development Fund and the University of Málaga [UMA18-FEDERJA-149].

DATA AVAILABILITY
The Env-NJ method is implemented in the R package EnvNJ. Release versions are available via CRAN and work on all major operating systems. The development version is maintained at https://bitbucket.org/jcaledo/envnj/src/master. The mtDNA-encoded protein sequences of the 11 species of bovids can be obtained from the EnvNJ package by typing data(bovids) after loading the package. Similarly, the 442 protein sequences that make up the dataset referred to as Reyes can be obtained by typing data(reyes). Alternatively, all the data employed in the current work, together with their corresponding descriptions, can be obtained from the Bitbucket repository at https://bitbucket.org/jcaledo/envnj/src/master/Datasets.

SUPPLEMENTARY DATA
Supplementary data are available online.

CONFLICT OF INTEREST
The author declares that he has no conflict of interest.
Figure 2. Molecular phylogeny of bovids. The phylogenetic relationships of 11 species of bovids were addressed using their protein-coding mitogenomes and three different tree building methods, one depending on multiple sequence alignment (MSA-NJ) and two alignment-independent methods, SVD-4-Gram and Env-NJ (radius 2). All three methods produced the same tree topology. Thus, this tree was taken as the reference tree in subsequent benchmarking analyses.