Recognition of arthropod species names using bigram-based classification

doi:10.21203/rs.3.rs-26532/v1

Research article

This work is licensed under a CC BY 4.0 License

Version 1

Background

Recognizing species names in scientific articles is an essential step for many applications in high-throughput text mining and data analytics, such as species-specific information collection, construction of species food networks, and trophic relationship extraction. These tasks become even more important in fast-paced species-discovery areas such as entomology, where a large number of new arthropod species are discovered each year. This article explores the use of two-character n-grams (bigrams) in machine learning models for arthropod species name recognition. This method has previously been applied successfully to language identification [1], but its application to species name identification had yet to be explored.

Results

Arthropod species names, regular English words used in scientific publications, and person names were collected from the public domain; bigrams were extracted from them and used as classifier features. A number of learning classifiers spanning 7 algorithmic categories (tree-based, rule-based, artificial neural network, Bayesian, boosting, lazy and kernel-based) were tested, and the highest accuracies were consistently obtained with the LIBLINEAR [2], Bayesian Logistic Regression [3], Multilayer Perceptron [4], Random Forest [5], and LIBSVM [6] classifiers. Compared with dictionary-based external software tools such as GNRD [7] and TaxonFinder [8], our top-3 classifiers were insensitive to word capitalization and correctly classified novel species names that are absent in dictionary-based approaches, with accuracies between 88.6% and 91.6%.
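To make the feature representation concrete, the following is a minimal Python sketch of character bigram extraction. It is an illustration only, not the authors' implementation: the function names `char_bigrams` and `bigram_features` are hypothetical, and lowercasing the input is an assumption made here to reflect insensitivity to capitalization.

```python
from collections import Counter


def char_bigrams(word, lowercase=True):
    """Return the overlapping two-character substrings of a word.

    Lowercasing is an assumption of this sketch, not a detail
    confirmed by the abstract.
    """
    text = word.lower() if lowercase else word
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigram_features(word):
    """Count how often each bigram occurs in the word.

    The resulting mapping can serve as a sparse feature vector
    for any of the classifier families mentioned above.
    """
    return Counter(char_bigrams(word))


features = bigram_features("mellifera")
```

Each distinct bigram becomes one feature, so a word is described by the counts of its adjacent character pairs rather than by a dictionary lookup.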

Conclusions

Our results suggest that character bigram-based classification is a suitable method for distinguishing arthropod species names from regular English words and person names commonly found in scientific literature. Moreover, our method can also be used to reduce the number of false positives produced by dictionary-based methods.
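The false-positive reduction mentioned above could be arranged as a post-filter over the candidates returned by a dictionary-based tool. The Python sketch below shows only that arrangement, under stated assumptions: `toy_is_species_name` and the `LATIN_LIKE` bigram set are hypothetical stand-ins for a trained bigram classifier, and the candidate list is invented for illustration.

```python
def char_bigrams(word):
    """Return the set of lowercase two-character substrings of a word."""
    text = word.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


# Hypothetical stand-in for a trained classifier: a hand-picked set of
# bigrams, chosen only so that the example runs. Not from the paper.
LATIN_LIKE = {"us", "um", "ae", "ii", "ra", "is"}


def toy_is_species_name(word):
    """Accept a word if it shares at least one bigram with LATIN_LIKE."""
    return bool(char_bigrams(word) & LATIN_LIKE)


def filter_false_positives(candidates, is_species_name):
    """Keep only the candidates that the classifier also accepts."""
    return [word for word in candidates if is_species_name(word)]


# Invented candidates, as if returned by a dictionary-based tool.
candidates = ["mellifera", "Smith", "table"]
kept = filter_false_positives(candidates, toy_is_species_name)
```

In practice the predicate passed to `filter_false_positives` would be the prediction function of a classifier trained on bigram features, not the toy rule used here.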

Bioinformatics

machine learning

classification

species names

arthropod

bigram

Additional file 1 – ZIP archive including all datasets used in this work. Each file is labelled by applying the naming convention used in the manuscript.

All datasets are also made publicly available at:

http://animalbiosciences.uoguelph.ca/~dtulpan/papers/specrec2020

Additionalfile1.zip
