DIVIS: A Semantic Distance to Improve the Visualization of Incomplete Heterogeneous Phenotypic Datasets

doi:10.21203/rs.3.rs-742853/v1


Research


This work is licensed under a CC BY 4.0 License


Background

Thanks to the wider spread of high-throughput experimental techniques, biologists are accumulating large numbers of datasets, which often mix quantitative and qualitative variables and are not always complete, particularly when they concern phenotypic traits. To gain a first insight into these datasets and reduce the size of the data matrices, scientists often rely on multivariate analyses. However, such approaches are not always easy to apply, in particular to mixed datasets with missing values. Moreover, displaying large numbers of individuals leads to cluttered visualizations that are difficult to interpret.

Results

We introduce a new methodology to overcome these limitations. The underlying principle consists of (i) grouping similar individuals, (ii) representing each group by emblematic individuals that we call archetypes, and (iii) building sparse visualizations based on these archetypes. As a preliminary step to the clustering, we design a new semantic distance tailored to both quantitative and qualitative variables, which allows a realistic representation of the relationships between individuals. This semantic distance is based on ontologies engineered to represent real-life knowledge about the underlying variables. Our approach is implemented as a Python pipeline and illustrated with a rosebush dataset that includes passport and phenotypic data.
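To make the three steps more concrete, the following is a minimal sketch and not the DIVIS implementation. It assumes a toy ontology for one qualitative variable, an edge-count distance between ontology terms, a Gower-style average over the variables observed in both individuals (so missing values are simply skipped), and a medoid as the archetype of a group. The variable names (`height_cm`, `petal_colour`), the ontology terms, and all function names are hypothetical, and the grouping itself is taken as given rather than computed.

```python
from math import isnan

# Hypothetical toy ontology for a qualitative variable (petal colour):
# each term maps to its parent; "colour" is the root.
PARENT = {
    "colour": None,
    "warm": "colour",
    "cool": "colour",
    "red": "warm",
    "pink": "warm",
    "violet": "cool",
}


def ancestors(term):
    """Return the path from a term up to the root, term included."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


# Longest possible path between two terms: twice the depth of the tree.
MAX_EDGES = 2 * (max(len(ancestors(t)) for t in PARENT) - 1)


def semantic_distance(a, b):
    """Edge-count distance between two terms, scaled to [0, 1] (assumed form)."""
    path_a, path_b = ancestors(a), ancestors(b)
    common = next(t for t in path_a if t in path_b)  # lowest common ancestor
    return (path_a.index(common) + path_b.index(common)) / MAX_EDGES


def is_missing(value):
    return value is None or (isinstance(value, float) and isnan(value))


def mixed_distance(x, y, ranges):
    """Average per-variable distance over variables observed in both records."""
    total, used = 0.0, 0
    for key in x:
        a, b = x[key], y[key]
        if is_missing(a) or is_missing(b):
            continue  # a missing value contributes nothing
        if key in ranges:  # quantitative: range-normalized absolute difference
            total += abs(a - b) / ranges[key]
        else:  # qualitative: ontology-based distance
            total += semantic_distance(a, b)
        used += 1
    if used == 0:
        raise ValueError("records share no observed variable")
    return total / used


def archetype(group, records, ranges):
    """Pick the medoid: the member with the smallest summed distance to its group."""
    return min(
        group,
        key=lambda i: sum(mixed_distance(records[i], records[j], ranges) for j in group),
    )


# Hypothetical individuals; r3 has a missing height.
records = {
    "r1": {"height_cm": 80.0, "petal_colour": "red"},
    "r2": {"height_cm": 90.0, "petal_colour": "pink"},
    "r3": {"height_cm": None, "petal_colour": "red"},
    "r4": {"height_cm": 180.0, "petal_colour": "violet"},
}
ranges = {"height_cm": 100.0}  # assumed observed range of the quantitative variable

print(mixed_distance(records["r1"], records["r2"], ranges))
print(archetype(["r1", "r2", "r3"], records, ranges))
```

In this sketch, red and pink are closer to each other than either is to violet because they share the parent term `warm`, which is the kind of background knowledge a purely categorical match/mismatch distance would ignore. A sparse visualization would then plot only the archetype of each group instead of every individual.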

Conclusions

The introduction of our new semantic distance and of the archetype concept allows us to build a comprehensive representation of an incomplete dataset characterized by a large proportion of qualitative data. The methodology described here could find wider use beyond information characterizing organisms or species, and beyond plant science. Indeed, the same approach could be applied to any incomplete mixed dataset.

Bioinformatics

mixed datasets

heterogeneous datasets

phenotypic traits

multivariate analysis

ontologies

semantic distance

clustering

visualization


Editorial decision: Major revision
15 Sep, 2021
Review #1 received at journal
14 Sep, 2021
Review #2 received at journal
12 Sep, 2021
Reviewer #2 agreed at journal
05 Sep, 2021
Reviewer #1 agreed at journal
01 Sep, 2021
Reviewers invited by journal
30 Aug, 2021
Editor assigned by journal
22 Jul, 2021
First submitted to journal
22 Jul, 2021
Submission checks completed at journal
21 Jul, 2021
Editor invited by journal
21 Jul, 2021
