Natural Family-Free Genomic Distance

doi:10.21203/rs.3.rs-198423/v1

Research Article

This work is licensed under a CC BY 4.0 License

Background: A classical problem in comparative genomics is to compute the rearrangement distance, that is, the minimum number of large-scale rearrangements required to transform one given genome into another.

The traditional approaches in this area are family-based, i.e., require the classification of DNA fragments of both genomes into families. Furthermore, the most elementary family-based models, which are able to compute distances in polynomial time, restrict the families to occur at most once in each genome. In contrast, the distance computation in models that allow multifamilies (i.e., families with multiple occurrences) is NP-hard. Very recently, Bohnenkamper et al. (J. Comput. Biol., 2020) proposed an ILP formulation for computing the genomic distance of genomes with multifamilies, allowing structural rearrangements, represented by the generic double cut and join (DCJ) operation, and content-modifying insertions and deletions of DNA segments. This ILP is very efficient, but must maximize a matching of the genes in each multifamily, in order to prevent the free lunch artifact that would otherwise let empty or almost empty matchings give smaller distances.
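As an illustration of the polynomial-time setting mentioned above (every family occurring exactly once in each genome), the following is a minimal sketch of the classical DCJ distance restricted to circular chromosomes, using the well-known formula d = n - c, where n is the number of genes and c the number of cycles in the adjacency graph. The formula is not stated in this abstract, and the function names and the signed-integer genome encoding are assumptions made for the example; indels, linear chromosomes and multifamilies are deliberately left out.

```python
def extremity_matching(genome):
    """Map each gene extremity to its neighbor in a genome of circular chromosomes.

    A genome is a list of chromosomes; each chromosome is a list of signed
    integers (hypothetical encoding: +g is read tail-to-head, -g head-to-tail).
    """
    match = {}
    for chrom in genome:
        k = len(chrom)
        for i in range(k):
            a, b = chrom[i], chrom[(i + 1) % k]  # wrap around: circular
            out_a = (abs(a), "h" if a > 0 else "t")  # extremity where we leave a
            in_b = (abs(b), "t" if b > 0 else "h")   # extremity where we enter b
            match[out_a] = in_b
            match[in_b] = out_a
    return match


def dcj_distance_circular(genome_a, genome_b):
    """DCJ distance n - c for two circular genomes with identical gene sets."""
    ma = extremity_matching(genome_a)
    mb = extremity_matching(genome_b)
    if ma.keys() != mb.keys():
        raise ValueError("both genomes must contain exactly the same genes once")
    seen, cycles = set(), 0
    for start in ma:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Walk the cycle, alternating an adjacency of A with an adjacency of B.
        while x not in seen:
            seen.add(x)
            y = ma[x]
            seen.add(y)
            x = mb[y]
    n_genes = len(ma) // 2  # two extremities per gene
    return n_genes - cycles


# A single inversion of gene 2 separates these two genomes.
print(dcj_distance_circular([[1, -2, 3]], [[1, 2, 3]]))  # prints 1
```

The walk visits each extremity once, so the computation is linear in the number of genes, which is what makes this singleton case tractable compared with the NP-hard multifamily case.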

Results: In this paper, we adopt the alternative family-free setting that, instead of family classification, simply uses the pairwise similarities between DNA fragments of both genomes to compute their rearrangement distance. We adapted the ILP mentioned above and developed a model in which pairwise similarities are used to assign weights to both matched and unmatched genes, so that an optimal solution does not necessarily maximize the matching. Our model then results in a natural family-free genomic distance that takes into consideration all given genes, without prior classification into families, and has a search space composed of matchings of any size. Despite its larger search space, our ILP seems to be boosted by a reduction of the number of co-optimal solutions due to the weights. Indeed, it converged faster than the original one by Bohnenkamper et al. for instances with the same number of multiple connections. We can handle not only bacterial genomes, but also fungi and insects, or sets of chromosomes of mammals and plants. In a comparison study of six fruit fly genomes, we obtained accurate results.
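To make concrete the remark that weighting both matched and unmatched genes can favor a matching that is not of maximum size, here is a toy sketch. It is not the objective function of the ILP described above: the cost (1 - similarity per matched pair, plus a flat `unmatched_cost` per unmatched gene), the gene names, the similarity values and the brute-force search are all assumptions chosen for illustration, and the rearrangement term of the distance is ignored entirely.

```python
from itertools import combinations, permutations


def best_partial_matching(sim, genes_a, genes_b, unmatched_cost=0.4):
    """Exhaustively search matchings of every size and return (cost, pairs).

    Toy cost (an assumption, not the paper's model): each matched pair costs
    1 - similarity, each unmatched gene costs `unmatched_cost`.
    """
    total = len(genes_a) + len(genes_b)
    best_cost, best_pairs = float("inf"), ()
    for k in range(min(len(genes_a), len(genes_b)) + 1):
        for sub_a in combinations(genes_a, k):
            for sub_b in permutations(genes_b, k):
                pairs = tuple(zip(sub_a, sub_b))
                # Only pairs with a known similarity may be matched.
                if any(p not in sim for p in pairs):
                    continue
                cost = sum(1 - sim[p] for p in pairs)
                cost += unmatched_cost * (total - 2 * k)
                if cost < best_cost:
                    best_cost, best_pairs = cost, pairs
    return best_cost, best_pairs


# Hypothetical similarities: one strong pair, the rest weak.
sim = {
    ("a1", "b1"): 0.9,
    ("a2", "b2"): 0.1,
    ("a1", "b2"): 0.2,
    ("a2", "b1"): 0.2,
}
cost, pairs = best_partial_matching(sim, ["a1", "a2"], ["b1", "b2"])
print(pairs)  # prints (('a1', 'b1'),)
```

In this toy instance the single strong pair beats both the empty matching and the maximum matching, because pairing the two weakly similar genes costs more than leaving them unmatched.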

Molecular Biology

Comparative genomics

Genome rearrangement

DCJ-indel distance

additionalfile1.pdf
Additional file 1 | Online supplemental material. This file is a document containing supplementary material on the model and on the experiments, including supplementary figures and tables.
additionalfile2.ods
Additional file 2 | Complete table of Drosophila homologies inferred by our approach. This file contains tables with the lists of inferred gene matchings for each pairwise comparison of Drosophila genomes. It also contains homolog gene sets (gene families) for D. melanogaster, D. simulans and D. yakuba defined by FlyBase (release FB2020_04) with gene identifiers converted to NCBI identifiers.

Review #1 received at journal
11 Apr, 2021
Review #2 received at journal
01 Mar, 2021
Reviewer #2 agreed at journal
08 Feb, 2021
Reviewers invited by journal
04 Feb, 2021
Reviews received at journal
04 Feb, 2021
Reviewer #1 agreed at journal
04 Feb, 2021
Editor assigned by journal
02 Feb, 2021
Submission checks completed at journal
02 Feb, 2021
Editor invited by journal
02 Feb, 2021
First submitted to journal
01 Feb, 2021
