Sensitivity of A Frequency-Based Alignment-Free Approach For Phylogeny Rreconstruction.

Background: Phylogenies enrich our understanding of how genes, genomes, and species evolve. Traditionally, alignment-based methods are used to construct phylogenies from genetic sequence data; however, this process can be time-consuming when analyzing the large amounts of genomic data available today. Additionally, these analyses face challenges due to differences in genome structure, synteny, and the need to identify similarities in the face of repeated substitutions resulting in loss of phylogenetic information contained in the sequence. Alignment Free (AF) approaches using k-mers (short subsequences) can be an ecient alternative due to their indifference to positional rearrangements in a sequence. However, these approaches may be sensitive to k-mer length and the distance between samples. Results: In this paper, we analyzed the sensitivity of an AF approach based on k-mer frequencies to these challenges using cosine and Euclidean distance metrics for both assembled genomes and unassembled sequencing reads. Quantication of the sensitivity of this AF approach for phylogeny reconstruction to branch length and k-mer length provides a better understanding of the necessary parameter ranges for accurate phylogeny reconstruction. Our results show that a frequency-based AF approach can result in accurate phylogeny reconstruction when using whole genomes, but not stochastically sequenced reads, so long as longer k-mers are used. Conclusions: In this study, we have shown an AF approach for phylogeny reconstruction is robust in analyzing assembled genome data for a range of numbers of substitutions using longer k-mers. Using simulated reads randomly selected from the genome by the Illumina sequencer had a detrimental effect on phylogeny estimation. Additionally, ltering out infrequent k-mers improved the computational eciency of the method while preserving the accuracy of the results thus suggesting the feasibility of using only a subset of data to improve computational eciency in cases where large sets of genome-scale data are analyzed.

genomes do not share the same set of genes, or if sequences have very low similarity and thus are not alignable to each other 9 . Additionally, alignment-based phylogeny reconstruction is affected by rearrangements and inversion events, and different sequence lengths among different species 10 . A high number of insertions and deletions in the sequences also negatively in uence the accuracy of the alignment 11 .
Sequence alignment can also be complex and error-prone due to the erroneous positioning of gaps during reconstruction 12,13,14 . Some aligners substantially underestimate insertion rates and overestimate deletion rates and thus produce alignments with inaccurate indels 15 . These errors affect the construction of the phylogenetic tree that is built from DNA or RNA sequences 16, 9 . Alignment free methods Alignment-free (AF) approaches are less susceptible to these challenges and have been proposed as an effective alternative for phylogeny reconstruction 17 . AF methods do not take into account the exact position of the nucleotides 18 and thus are more resistant to the effects of structural variants such as inversions, translocations, chromosome fusions, chromosome ssions, and reciprocal translocations 19 .
AF methods are especially successful in genome-based phylogenetic studies because they are capable of processing large numbers of genome-size sequences 20 . Additionally, AF approaches do not require genome assembly and alignment steps, which are complex and computationally expensive 21,22 . The scalability of AF methods allows processing large sets of genome data more e ciently than the traditional alignment-based algorithms.
AF methods typically analyze k-mers, which are short substrings of the genome sequence 23 . AF methods can be classi ed broadly 17 into methods based on k-mer frequencies 24,25,26 , a number of word matches 27,28 , lengths of common substring [29][30][31] , and micro-alignments [32][33][34][35]36 . However, various AF methods perform differently depending on the data type (mitochondrial vs. nuclear genomes), taxonomic groups (animal, bacterial, or plants), and data size (assembled genomes and unassembled sequencing reads of different coverage) 17 . No single method is successful across all types of data 17 .
Several factors must also be considered when working with AF approaches that can affect the accuracy of the phylogeny: k-mer length used in the analysis, number of substitutions per site in the genome, a metric used to calculate genome distances between sequences, and the use of assembled genomes vs. unassembled raw reads. Different AF methods use k-mers differently and thus there is no universal length of k-mer used in all AF methods 37 . Instead, different optimal k-mer lengths have to be determined for the different methods 17 . Shorter k-mers are more likely to be present in a sequence (e.g. 1-mers); thus they are less informative in analyzing closely related genomes 38 . Longer k-mers are more unique to particular species and are therefore more useful for similarity identi cation across species 39 . Additionally, the problem of k-mer homoplasy should be considered when identifying k-mer length for accurate phylogeny reconstruction using AF methods. Shorter k-mers are more likely to have greater sensitivity to single evolutionary events. However, identical k-mers could be obtained from the genome regions which are not identical in functional or evolutionary properties and thus are not homologous (k-mer homoplasy) 40 . Longer kmers are less susceptible to k-mer homoplasy. Thus, there is a tradeoff between sensitivity, which requires shorter k-mers, and reducing k-mer homoplasy, which requires longer k-mers 40 . Optimal k depends on the AF approach used 37 . The optimal lengths of k-mers differed for the "word count" AF approach and the "match length" approach 37 . K-mer length and genome size are correlated, which affects the accuracy of phylogeny reconstruction 41 .
Evolutionary distance is another important parameter to consider in phylogeny reconstruction 42,43,44 .
The presence of a high number of substitutions among samples causes loss of phylogenetic signal contained in the sequence (saturation) and is an important source of systematic bias 45,46 . Multiple substitutions at the same site due to a high substitution rate can make individuals appear more similar than they actually are 47 . In extreme cases, the similarity between the sequences will depend entirely on the similarity in nucleotide frequencies that often does not re ect phylogenetic relationships 46,48 . More speci cally, the long branches present in the phylogenies may have acquired many substitutions that cause a historically random resolution of the internode. Saturation is known as a cause of erroneous outgroup placement in phylogeny reconstruction 49 . Additionally, genetic saturation introduces a longbranch attraction problem (LBA) 50 , where one lineage falsely appears to be closely related to another long-branched lineage due to the large amount of molecular change accumulated in the genomes rather than due to the actual relatedness by descent 51,40 . The AF methods based on "number of word matches" and "match length" had greater accuracy of similarity identi cation for sequences with low divergence 41 whereas "word count" methods performed better at an intermediate level of sequence divergence 37 .
The metric used to calculate distances between genomes is also an important factor to consider in AFbased phylogeny reconstruction, as the distance metric affects both computational e ciency and sensitivity 52 . Distance metrics include Minkowski (e.g., Euclidean) 53 , D2 (e.g., D S 2 ) 41 , mismatch (e.g., Jaccard) 40 , inner product (e.g, normalized vectors, cosine) 54 , and length difference 55 . Differences in the accuracy of phylogeny reconstruction depend on the metric used 37 . The accuracy of phylogenies based on D 2 statistics (based on the number of word matches) and kmacs (based on length of common substring) varied by the k-mer length and sequence divergence level [56][57][58][59]41 .
Here, we use simulated genomes and HTS data to assess the sensitivity of AF methods based on k-mer frequencies to a range of numbers of substitutions per site in the genome using different k-mer lengths and two genome distance metrics for phylogeny reconstruction. We analyzed phylogenies with branch length (in numbers of substitutions per site) from 0.00001 to 0.1 based on differences among primates reported in the Tree of Life 60-62 to understand how sensitive these methods are to different sequence divergence levels. We show that AF approaches based on k-mer frequencies of common k-mers alone can generate accurate phylogenies from assembled genomes with high evolutionary distances, so long as k-mers are su ciently long. This work provides a better understanding of optimal parameter selection for future phylogenetic studies using frequency-based AF methods.

Results
Phylogenies from simulated assembled genome data Phylogenies built from Euclidean distances had a median RF score of 0 for all substitution rates with kmer lengths of 9 -16 (  Figure A2).

Phylogenies from simulated reads at uniform coverage
Comparison of the phylogenetic trees generated from cosine distances calculated between read sequences simulated by the "sliding window" method at uniform 10X coverage to the reference tree produced a median RF score of 0 for all rates, and k-mer lengths of 10 through 23 (means = 0.06 -0.36, sd = 0.34 -1.18). Median RF scores of 0 -10 were produced for phylogenies generated using k-mer lengths 4 through 9 (means = 0.64 -9.26, sd = 1.13 -2.64) for all rates (Supplementary material Appendix 1, Figure A3).

Analysis of simulated reads with non-uniform coverage
Comparison of phylogenetic trees generated from cosine distances calculated between simulated Illumina read sequences at 10X non-uniform coverage (as would occur in empirical sequencing data) produced median RF scores of 8 through 10 for k-mer lengths of 10 through 23 (means = 8.68 -10, sd = 0 -0.99) for a range of substitutions per site, 10 −5 to 10 −1 (Supplementary material Appendix 1, Figure A4). Comparison of phylogenetic trees generated from Euclidean distances calculated for the same dataset produced the same median RF scores of 10 for the k-mer length of 10 through 23 (mean = 10, sd = 0) (Supplementary material Appendix 1, Figure A5).
To analyze the method's performance at higher sequencing coverage we increased the coverage of the simulated Illumina reads to 100X for 10 samples and recalculated the distances. Comparison of phylogenetic trees generated from cosine distances produced median RF scores of 8 to 10 for k-mer lengths 10 through 23 (mean = 7.6 -10, sd = 0 -2.27) for all rates (Supplementary material Appendix 1, Figure A6).

Filtering of infrequent k-mers and e ciency improvements
Each sample's nearest neighbor remained the same after ltering out k-mer frequencies below 10 5 . We rebuilt the dendrograms and con rmed that the clustering of samples was not affected by ltering out frequencies below this threshold value while reducing computation time. Calculating cosine distances on ltered frequencies showed to be 25 times faster than calculating on un ltered frequencies. Calculating cosine distance between a pair of 12-mer human genome pro les took 48 seconds using TensorFlow 63 with a GPU-based framework nVidia Tesla K40c with 128Gb of RAM; thus, distance calculations for 25 samples would take four hours. However, calculating distance on a pair of 12-mer pro les that do not include frequencies below 10 5 takes only two seconds.

Discussion
In this work, we show that a frequency-based AF approach is suitable for analyzing data with high numbers of substitutions per site. Our AF approach was able to recover the expected phylogeny from data containing a high number of substitutions using longer k-mers and cosine and Euclidean distances. These results suggest that an AF approach may be applied in place of traditional alignment-based methods where the problem of genome saturation and LBA in phylogeny reconstruction exists 64 . Identi cation of k-mer length ranges that are appropriate for phylogeny reconstruction from a dataset with a range of numbers of substitutions per site may serve as a recommendation for the selection of suitable parameters for frequency-based AF methods.
In the analysis of the whole genome data our AF approach successfully reconstructed the phylogeny from the genome with realistic substitution rates, using a range of higher k-mer lengths using cosine and Euclidean distance metrics. However, the method produced less accurate phylogeny from sequences using shorter k-mers, even from more closely related sequences. Our results of successful phylogeny reconstruction with k-mer frequencies using longer k-mers are consistent with prior work suggesting longer k-mers are required to produce accurate phylogenies 65,66,24 . For example, another frequencybased method (FFP) successfully reconstructed the phylogeny from assembled microbial genomes using longer k-mers and Jensen-Shannon Divergence, while a method based on the number of word matches (D 2 ) was shown to successfully estimate phylogeny from nucleotide and amino acid sequences, using D 2 statistics family and a higher k-mer length. Here we con rm these results with alternative distance metrics.
AF approaches using HTS reads are accurate with uniform but not stochastic coverage The AF approach based on k-mer frequencies was able to capture the correct phylogeny from reads at a uniform level of coverage. The analysis of the reads at precisely 10X coverage, where each base was sequenced exactly 10 times, produced a correct phylogeny for all datasets using longer k-mers. However, the method was not able to recover the correct phylogeny for the same dataset using a lower range of k-mer lengths. This result mirrors the results with whole-genome data but expands to allow for the truncation of k-mers due to short reads. In reality, a uniform level of coverage is not plausible due to the randomized nature of the sequencing process. However, analysis of simulated read data generated by the Illumina simulator at non-uniform coverage did not produce the expected phylogeny for datasets even at high coverage levels. The random nature of sequencing likely introduces stochasticity to the k-mer frequencies and negatively affects the distances between sequences 67 . While low sequencing coverage and sequencing errors are known to have a negative effect on phylogeny reconstruction using AF methods based on k-mer frequencies 40 , even at high read coverage, the counts of k-mers are distorted due to the randomized nature of the sequencing algorithm. Cosine and Euclidean distances were sensitive to these distortions resulting in a phylogenetic error in our analysis.
Because the accuracy of the AF approach based on k-mer frequencies is highly dependent on the precision of the read coverage, using read data would depend on correction for sequencing error and coverage depth. Both cosine and Euclidean metrics take exact counts of k-mer frequencies present in a genome. Incongruence in the frequencies due to variable read coverage and sequencing error skews the distance between two vectors of k-mer counts.
Alternatively, other non-frequency-based methods may be considered. For instance, methods based on micro-alignments 32,34 and methods based on the number of word matches 27,28,40 . These methods employ several techniques to correct for the effects of coverage and sequencing error. The techniques range from estimating the amount of sequencing error and coverage and including the estimates into the distance formula 28 to correcting tip branches and bootstrapping of the obtained phylogenetic trees 40 .

Choice of distance metric
The choice of metric for calculating genome distances affected the accuracy of phylogeny reconstruction. While cosine and Euclidean distance metrics produced distances from which an accurate phylogenetic tree was recovered from genomes when analyzing reads, the Euclidean distance metric produced a less accurate phylogeny. Prior work has also found that AF tools that employ different metrics performed differently depending on the data type -assembled genomes and unassembled raw reads 17 . For instance, AFKS 52 , an AF tool that employs the Jaccard index metric, was more accurate at phylogeny reconstruction from assembled genomes and less accurate for phylogeny reconstruction from unassembled raw reads using the same metric. On the other hand, mash 27 , a method that also uses the Jaccard index metric, produced accurate results for both data types.

Filtering low-frequency k-mers
Filtering out infrequent k-mers improved the computational e ciency of the method. Earlier work of 68,69 had discussed ltering out low k-mer frequencies, in particular, singletons due to them being a result of sequencer error. Our analyses showed that infrequent k-mers constitute the majority of k-mer content in the genome. Distance calculation using only frequent k-mers showed no negative effect on the accuracy of the results and greatly reduced the computational time of the method. We believe the majority of the information comes from a small fraction of the most frequent k-mers. These results suggest the feasibility of using only a subset of data to improve computational e ciency in cases where large sets of genome-scale data are analyzed.

Conclusion
In this study, we have shown an AF approach for phylogeny reconstruction is robust in analyzing assembled genome data for a range of numbers of substitutions using longer k-mers. Similarly, an AF approach recovered an accurate phylogeny from reads with uniform coverage from data with a high number of substitutions using higher k-mer lengths. However, using simulated reads randomly selected from the genome by the Illumina sequencer had a detrimental effect on phylogeny estimation. Thus we recommend using longer k-mers to accurately reconstruct phylogeny from assembled genome data with high numbers of substitutions. However, phylogeny reconstruction using frequency-based AF methods with read data is not recommended. Additionally, k-mer frequencies below 10 5 may be ltered out to improve the e ciency of the distance calculation between sequences while preserving the accuracy of the results.

Methods
To nd the parameters for which frequency-based AF methods produce accurate phylogeny from sequences of different levels of divergence using k-mers, we simulated genomes for multiple related individuals and estimated phylogenies using AF methods based on k-mer frequencies. We also simulated and estimated phylogenies from simulated HTS read data to compare the accuracy of the phylogenetic trees between two types of sequence data.

Trees
We initially generated a balanced tree for simulations with eight individuals and equal branch lengths ( Figure 1). We varied these branch lengths for simulations from 10 −5 to 10 −1 substitutions per site incremented by orders of magnitude. These datasets provide a testable range of sequence divergence to assess the sensitivity of alignment-free methods in recovering the topology.

Assembled genome data simulation.
We simulated genomes using a subset (for e ciency purposes) of the reference human genome, chromosome 19 70 , as an ancestral sequence. A custom script was used to generate 5 sets of data (one per order of magnitude) with 8 genomes per set (Figure 1). We repeated the analysis 100 times per set to examine statistical signi cance.
We simulated HTS read sequences on each of the simulated genomes using two methods -the Illumina read simulator in BBMap 71 and a custom "sliding window" method to generate reads containing identical coverage of every location in the genome. This allowed us to compare the sensitivity of the approach to stochastic variation in coverage. Simulated Illumina read sequences were generated for 10X and 100X coverages to compare how e ciently the method works at the two levels of coverage with no sequencing errors. We used a custom script to generate read sequences at 10X coverage using the "sliding window" method where each read was generated by sliding a window of 150 bp along the sequence with 15bp overlap. We repeated the analysis of reads at 10X coverage generated by the custom script and the reads generated at 10X coverage by the Illumina read simulator 100 times per set to examine statistical signi cance. Analysis of the Illumina read dataset at 100X coverage was only repeated 10 times due to high memory usage. 4. Distance calculation between simulated assembled genomes based on k-mer frequencies.
To calculate distances between genomes we rst generated k-mer pro les using Jelly sh 72 , a tool for fast, memory-e cient counting of k-mers in the DNA sequence, for k values in the range 4 through 16 for each sequence in the sets of the simulated data. Each k-mer pro le contains the frequency of canonical k-mers (k-mer or its reverse complement, whichever comes rst lexicographically) present in the genome. We performed genome distance calculations using two metrics, Euclidean and cosine distances. Both metrics use the exact count of k-mers present in the sequences and thus represent the extent of similarity that sequences share with higher sensitivity. The Euclidean distance is based on the inclusion of all of the k-mer frequencies through the union operation and therefore takes into account the complete k-mer pro le of a sequence. Conversely, cosine distance uses the set intersection of k-mer pro les and uses less information about the sequence, but has the advantage of not being sensitive to the number of reads. Speci cally, cosine distance looks at the angle between two vectors in high-dimensional space, while ignoring their length. Greater read coverage changes the magnitude of a vector but not its distance to any other vector.
We calculated the set intersection (for cosine distance) and the set union (for Euclidean distance) for each pair of k-mer pro les using custom scripts for distance calculations. Computational resources required to calculate distances on such a large set of genome data, based on the frequencies of k-mers, grow exponentially as the length of k increases. Thus, to accommodate multiple datasets with a large number of genomic-scale sequences, we computed distances between each pair of k-mer pro les using TensorFlow 63 and multiple GPUs. 5. Distance calculation between simulated read sequences based on the k-mer frequencies.
To calculate distances between read sequences we built k-mer pro les using the same approach as in the genome analysis for k-mer length 4 through 16 (for reads of uniform coverage), and 10 through 23 (for reads of non-uniform coverage) for each sequence. Our preliminary results showed lower accuracy in phylogeny estimation from reads of non-uniform coverage using shorter k-mers thus we increased the kmer length range for this dataset. To emulate the real-life analysis process we ltered out singletons, which are usually considered as possible errors produced by the sequencer in the real dataset 68 . We analyzed longer k-mers to overcome possible k-mer homoplasy in read analysis 40 . We calculated the set intersection and set union of k-mers. For reads of non-uniform 10X coverage, we calculated cosine and Euclidean distances. We saw lower accuracy in phylogeny reconstruction using Euclidean distance metric in raw reads and thus used only cosine distances for analyzing reads of uniform 10X coverage and reads of non-uniform 100X coverage. We used TensorFlow 63 and multiple GPUs to accommodate large datasets of k-mer pro les which grow exponentially as k-mer length increases.

Phylogeny estimation
To analyze the accuracy of our approach we examined the grouping of samples based on the distances among them. We built phylogenetic trees using the Neighbour-Joining algorithm for each dataset 73 . We measured the similarity between the generated tree and the reference tree by calculating the Robinson-Foulds (RF) distance 74 . 7. Adding e ciency to calculations.
We hypothesized that the majority of k-mers are infrequent. Approximately 99% of 12-mer frequencies in a human genome at 30X coverage are < 10 5 ( Figure 2) and will not contribute to the dot product in the cosine distance calculation. Thus, removing these k-mers from the calculation should reduce computation time without impacting the cosine distances. To better understand the distribution of k-mer frequencies, we built histograms. Frequencies were normalized by the level of genome coverage. In the analysis of complete genomes, we ltered out k-mers with frequencies below thresholds ranging from 10 0 -10 7 at different orders of magnitude, recalculated distances for each set of ltered k-mers, and examined the closest sample to each sample. This ensured an appropriate comparison between the tree generated by analysis of the full dataset and the tree from the ltered data.

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.   Example violin plots of the distribution of Robinson-Foulds scores calculated for 100 trees generated from simulated genomic data when compared to the reference tree for a range of substitution per site of 10 -5 to 10 -1 incremented by orders of magnitude. The median RF score was 0 for all values. Each sample is an assembled genome. a) Trees were generated from Euclidean distances calculated using the k-mer length of 12. K-mer lengths 10 -16 produced comparable plots. Each sample is an assembled genome. b) Trees were generated from cosine distances calculated using the k-mer length of 9. K-mer lengths 10 -16 produced comparable plots.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. SuppMatFigA1A6.pdf