dom2vec: Capturing domain structure and function using self-supervision on protein domain architectures

doi:10.21203/rs.3.rs-58816/v1

Research article

https://doi.org/10.21203/rs.3.rs-58816/v1

This work is licensed under a CC BY 4.0 License

Journal Publication

published 19 Jan, 2021

Read the published version in Algorithms →

Version 1

Background: Word embedding approaches have revolutionized natural language processing (NLP) research. These approaches map words to a low-dimensional vector space in which words with similar linguistic features cluster together. Embedding-based methods have also been developed for proteins, where words are amino acids and sentences are proteins. The learned embeddings have been evaluated qualitatively, via visual inspection of the embedding space, and extrinsically, via performance comparison on downstream protein prediction tasks. However, these sequence embeddings have a caveat: biological metadata do not exist for each individual amino acid, so the quality of each unique learned embedding vector cannot be measured against such metadata.

Results: Here, we present dom2vec, an approach for learning protein domain embeddings using word2vec on InterPro annotations. In contrast to sequence embeddings, biological metadata do exist for protein domains and relate to each domain separately. Therefore, we present four intrinsic evaluation strategies to quantitatively assess the quality of the learned embedding space. To perform a reliable evaluation in terms of biological knowledge, we selected the metadata related to the most distinctive biological characteristics of domains: the structure, enzymatic function, and molecular function of a given domain. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment; we can therefore draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the applicability of dom2vec to protein prediction tasks by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction.
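To make the analogy between sentences and domain architectures concrete, the following is a minimal sketch of how skip-gram style (center, context) training pairs could be drawn from domain architectures, which is the kind of self-supervised signal word2vec learns from. It is not the authors' pipeline: the function name `skipgram_pairs`, the window size, and the placeholder domain identifiers (`DOM_A`, `DOM_B`, `DOM_C`) are all assumptions chosen for illustration, and no real InterPro data are used.

```python
from collections import Counter


def skipgram_pairs(architecture, window=2):
    """Yield (center, context) pairs from one domain architecture.

    `architecture` is an ordered list of domain identifiers, treated
    the way word2vec treats a sentence of words. `window` is the number
    of neighbouring domains on each side counted as context (assumed value).
    """
    for i, center in enumerate(architecture):
        start = max(0, i - window)
        stop = min(len(architecture), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, architecture[j]


# Hypothetical architectures: each inner list stands for the ordered
# domains of one protein. The identifiers are placeholders, not real
# InterPro accessions.
architectures = [
    ["DOM_A", "DOM_B", "DOM_C"],
    ["DOM_B", "DOM_C"],
]

# Count how often each (center, context) pair occurs across all proteins.
pair_counts = Counter(
    pair
    for arch in architectures
    for pair in skipgram_pairs(arch, window=1)
)

for (center, context), count in sorted(pair_counts.items()):
    print(center, context, count)
```

Domains that repeatedly share neighbours produce many overlapping pairs, which is what would push their vectors together when an embedding model is trained on such pairs. In practice a word2vec implementation would consume the architectures directly rather than explicit pair counts.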

Conclusions: We report that applying word2vec to InterPro annotations produces domain embeddings with two significant advantages over sequence embeddings. First, each unique dom2vec vector can be quantitatively evaluated against its available structure and function metadata. Second, the produced embeddings can outperform sequence embeddings for a subset of downstream tasks. Overall, dom2vec embeddings are able to capture the most important biological properties of domains and surpass sequence embeddings for a subset of prediction tasks.

Bioinformatics

Protein domain architectures

InterPro

Machine learning

Neural networks

Word embeddings

Quality assessment

SCOPe secondary class

Enzymatic Commission class

dom2vecsupplementary.pdf
