Thanks to the wider spread of high-throughput experimental techniques, biologists are accumulating large numbers of datasets that often mix quantitative and qualitative variables and are not always complete, in particular when they concern phenotypic traits. To gain a first insight into these datasets and to reduce the size of the data matrices, scientists often rely on multivariate analyses. However, such approaches are not always easy to apply, in particular to mixed datasets with missing values. Moreover, displaying large numbers of individuals leads to cluttered visualizations that are difficult to interpret.
We introduce a new methodology to overcome these limitations. The underlying principle consists in (i) grouping similar individuals, (ii) representing each group by emblematic individuals that we call archetypes and (iii) building sparse visualizations based on these archetypes. As a preliminary step to the clustering, we design a new semantic distance tailored to both quantitative and qualitative variables, which allows a realistic representation of the relationships between individuals. This semantic distance is based on ontologies engineered to represent real-life knowledge about the underlying variables. Our approach is implemented as a Python pipeline and illustrated with a rosebush dataset that includes passport and phenotypic data.
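The principle above can be illustrated with a minimal sketch. It is not the pipeline itself: the toy colour ontology, the path-length term distance, the range scaling of quantitative variables, the skipping of missing values and the use of a medoid as archetype are all assumptions made for illustration, and every name in the code is hypothetical.

```python
# Hypothetical toy ontology: each term maps to its parent (None for the root).
PARENT = {
    "colour": None,
    "reddish": "colour",
    "pale": "colour",
    "red": "reddish",
    "pink": "reddish",
    "white": "pale",
    "cream": "pale",
}
MAX_STEPS = 4  # longest leaf-to-leaf path in this toy ontology


def ancestors(term):
    """Return the path from a term up to the ontology root, term included."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def term_distance(a, b):
    """Edge count between two terms via their closest common ancestor, scaled to [0, 1]."""
    pa, pb = ancestors(a), ancestors(b)
    common = next(t for t in pa if t in pb)
    return (pa.index(common) + pb.index(common)) / MAX_STEPS


def distance(x, y, kinds, ranges):
    """Average per-variable distance over the variables observed in both individuals."""
    total, used = 0.0, 0
    for xv, yv, kind, rng in zip(x, y, kinds, ranges):
        if xv is None or yv is None:
            continue  # assumption: a missing value simply drops the variable
        if kind == "quant":
            total += abs(xv - yv) / rng  # assumption: scale by the variable's range
        else:
            total += term_distance(xv, yv)
        used += 1
    # Assumption: individuals sharing no observed variable are maximally distant.
    return total / used if used else 1.0


def archetype(group, kinds, ranges):
    """Pick the medoid: the member with the smallest summed distance to its group."""
    return min(group, key=lambda c: sum(distance(c, o, kinds, ranges) for o in group))


# Invented individuals: (petal count, flower colour); None marks a missing value.
kinds = ("quant", "qual")
ranges = (40, None)
group = [(20, "red"), (25, "pink"), (None, "red")]
print(archetype(group, kinds, ranges))
```

In this sketch the ontology makes "red" closer to "pink" than to "white", which a plain match/mismatch comparison of categories would not capture.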
The introduction of our new semantic distance and of the archetype concept allows us to build a comprehensive representation of an incomplete dataset characterized by a large proportion of qualitative data. The methodology described here could find wider use beyond information characterizing organisms or species and beyond plant science. Indeed, the same approach could be applied to any incomplete mixed dataset.