Recognition of arthropod species names using bigram-based classification


Background: The task of recognizing species names in scientific articles is a quintessential step for a large number of applications in high-throughput text mining and data analytics, such as species-specific information collection, construction of species food networks and trophic relationship extraction. These tasks become even more important in fast-paced species-discovery areas such as entomology, where an impressive number of new arthropod species are discovered each year. This article explores the use of two-character n-grams (bigrams) as features for machine learning models that recognize arthropod species names. This method has previously been applied successfully to the task of language identification [1], but its application to species name identification had yet to be explored.

Results: Arthropod species names, regular English words used in scientific publications and person names were collected from the public domain, and character bigrams were extracted and used as classifier features. A number of learning classifiers spanning seven algorithmic categories (tree-based, rule-based, artificial neural network, Bayesian, boosting, lazy and kernel-based) were tested, and the highest accuracies were consistently obtained with the LIBLINEAR [2], Bayesian Logistic Regression [3], Multilayer Perceptron [4], Random Forest [5] and LIBSVM [6] classifiers. When compared with dictionary-based external software tools such as GNRD [7] and TaxonFinder [8], our top three classifiers were insensitive to word capitalization and correctly classified novel species names that are missed by dictionary-based approaches, with accuracies between 88.6% and 91.6%.

Conclusions: Our results suggest that character bigram-based classification is a suitable method for distinguishing arthropod species names from the regular English words and person names commonly found in scientific literature. Moreover, our method can also be used to reduce the number of false positives produced by dictionary-based methods.


Background
With recent advancements in technology and scientific methods, research findings are produced at unprecedented rates. There are approximately 28,000 scientific journals, publishing over 2.5 million articles every year [9]. The majority of these journals have been digitized, providing access to online articles; nevertheless, a considerable amount of valuable research remains hard to find and to connect with other results. To this end, natural language processing tools have been developed to help wade through the vast amount of available information.
Species name identification is a process utilized to link information from different publications [10]. This is useful when attempting to extract information about a specific organism or taxon and can be formulated under the Named Entity Recognition (NER) framework, which focuses on finding and classifying named entities in text [11].
Recognition of species names from a text corpus has been attempted by various methods that could be grouped in three categories: dictionary-based, rule-based and machine learning-based.
A dictionary-based approach involves creating a large list of words relevant to the knowledge domain and then matching text against that dictionary. This approach is particularly powerful and useful when large collections of species names are available from well-trusted taxonomic sources such as the Catalogue of Life [12] and the Entomological Society of America's Common Names of Insects Database [13]. The dictionary method is advantageous because it guarantees a high percentage of true positives: a curated dictionary will only contain species names that have been vetted and checked, ensuring that only pertinent text is extracted. However, this approach requires extensive memory due to the size of the required dictionaries, and it is unable to identify misspelled words, optical character recognition errors or species names that are absent from the dictionary.
The rule-based approach uses a set of rules such as capitalization, word variants, phrases and patterns to match text [14]. The main advantage of rule-based methods is speed, because cues such as capitalization and punctuation can identify putative species names quickly. This allows fast filtering of large amounts of data, so that more arduous processing can be run on smaller subsets. The main disadvantage is that highly variable fields in the text may fail to match any rule, leading to false negatives.
The machine learning approach generally involves building a model to detect and classify specific entities. These tools can be generic or specific to a given ontology and they do not depend on a dictionary. The strength of the machine learning approach consists in its ability to recognize species names that may be misspelled, may not match any rules or may not be included in a dictionary. It widens the scope of what can be found but some models might be susceptible to biases towards their training sets.
In practice, most tools represent hybrid combinations of two of the three approaches.
For instance, TaxonGrab [15] incorporates both the dictionary and the rule-based approaches to identify taxonomic names. Similarly, LINNAEUS [16] uses a combination of rules and a dictionary to identify species names in biomedical literature.
Both NetiNeti [10] and Organism-tagger [17] utilize a combination of rules and machine learning approaches to identify species names. Solr-Plant [18] is part of a subset of tools that focus on plant name identification and utilizes an Apache Solr-based system that incorporates the Smith-Waterman alignment algorithm [19]. Whatizit [20] is another suite of text mining tools that finds mentions of organisms in databases, and COPIOUS [21] is a corpus of biodiversity entities that can be used to train text analysis tools.
The Global Name Recognition and Discovery (GNRD) tool [7] identifies species names from unstructured text by looking up the words in multiple species name repositories.
The GNRD tool is part of a suite of web tools, the Global Names Architecture, that perform multiple tasks related to biological species names. It encompasses tools that parse, index and resolve text. The Biodiversity Heritage Library uses GNRD to analyse any new or updated information on their pages. Until recently, GNRD utilized TaxonFinder [8], a dictionary-based approach to detect Latin scientific organism names at all ranks, including kingdom, phylum, class, order, family, genus, species and subspecies, in plain text.

N-grams have been previously used to automatically identify the language or dialect used in a text corpus [1]. Language-specific n-gram models were built for different languages, since n-gram frequencies tend to be unique to each language or dialect. The n-gram method has also been applied to several other tasks such as authorship attribution, detection of malicious code and file type identification.
This manuscript introduces an approach that uses character bigrams extracted from two-word combinations to train, test and validate machine learning classifiers with the purpose of identifying arthropod species names in the presence of regular English words and person names that appear in scientific publications. We used the frequencies of character bigrams to build 15 models utilizing machine learning classifiers spanning seven algorithmic categories (tree-based, rule-based, artificial neural network, Bayesian, boosting, lazy and kernel-based) and we tested the models on three data sets corresponding to three classification problems described in detail in the Methods section.

Results and Discussion
In this work, we investigate the ability of character n-gram-based classifiers to distinguish arthropod species names from regular English words and person names by solving the three problems described in the Methods section. Classification accuracies are summarized in Table 1. The Bayesian Logistic Regression classifier ranked 3rd and 4th on P1 and P2, respectively, while it could not be executed on P3 because it is limited to binary classification. The Decision Table [22] rule-based classifier's accuracy ranked in the lower half of the models for all three classification problems. The J48 method performed well in both two-class problems but ranked 11th, with an accuracy of 77.67%, on the three-class problem. The poorest performers across the board were AdaBoost (ranked last on all problems) and Lazy IBK, which struggled on all three datasets.

Classification accuracy
The three-class problem proved to be more difficult for all models and we observed a significant overall decrease in performance ranging between 5.9% for Random Forest and 33.6% for AdaBoost.
Classifying two-word phrases that represent person names, species names and English words (P3) proves to be more difficult than differentiating species names from English words (P1) or species names from person names (P2). Different languages have different n-gram frequencies [23] and require independently built classifiers to distinguish among them. Species names tend to be based on Latin and ancient Greek, while the text of a scientific article is largely English. Their n-gram frequencies differ, and some n-grams are more prevalent in one of the two categories, which explains the higher classification accuracy when solving problem P1. This also explains why n-grams are good features for classifiers applied to language identification [24].
Moreover, each language has its own n-gram frequency distribution. Similarly, person names originating from different cultures and language backgrounds have particular n-gram frequency distributions. As a majority of first and last names typically used in English-speaking regions have either a Latin or Greek origin, the frequency of certain bigrams is high in both the SCI and PEO class instances. In contrast, there is less overlap between the sets of high-frequency bigrams in the SCI and ENG class instances, which could explain why the overall classifier accuracies are higher when solving P1 than P2.

Comparison with other species name identification tools
While we attempted to compare our results with external tools such as NetiNeti [10], TaxonGrab [25], LINNAEUS [16], SpeciesTagger [26], Organism-tagger [17], Solr-Plant [18], Whatizit [20] and COPIOUS [21], we could only perform comparisons with the Global Name Recognition and Discovery (GNRD) tool [7] and TaxonFinder [8]; the other tools were either not available or did not function as described in their accompanying documentation. The two selected tools are both dictionary-based, so they are expected to recognize known species names perfectly, whereas our method does not use any type of dictionary. In comparison, our top four classifiers solved the three-class problem with accuracies ranging from 85.6% to 91.8% on the validation datasets, which is consistent with the performance obtained on the combined training and testing dataset (SCI+ENG+PEO) of 9,000 two-word phrases.
Our classifiers can also be used as a post-processing step to further improve the predictions of dictionary-based species name recognition tools and reduce the number of false positives. When we filtered the GNRD results for the V_PEO dataset using the LIBLINEAR and MLP bigram-based classifiers, only 13 person names remained misclassified, while 40 were correctly classified as non-species names, thereby improving the accuracy of GNRD by 9% to a total of 97.4%.

Runtime analysis
Classifier runtimes were measured on an HP Notebook 14-cf0018ca equipped with a 4-core Intel Core i5-8250U CPU (base frequency of 1.6 GHz).

Conclusions
We observed that the predictions of all 15 classifiers were most accurate on the SCI-ENG (P1) binary problem but decreased significantly on the three-class problem (P3). Moreover, we noticed that their accuracies decreased less severely when distinguishing between arthropod species names and person names (P2: SCI-PEO). We hypothesize that the decrease in accuracy on P2 compared to P1 could be due to a larger number of high-frequency bigrams shared between the SCI and PEO datasets than between the SCI and ENG datasets. With respect to execution time, neither the fastest nor the slowest classifiers were top performers. When compared with two dictionary-based software packages (GNRD and TaxonFinder) on a validation dataset, our bigram-based classifiers maintained their performance, which was comparable with, but slightly lower than, that of GNRD and TaxonFinder.
Moreover, our approach can be used to improve the performance of other species name identification approaches by reducing the number of false positive identifications.
When applied to the GNRD dictionary-based method, our approach improved its overall detection accuracy by 9% to a total of 97.4%.
In the future, we plan to explore the use of 3- and 4-character n-gram features on the same classification problems and estimate the trade-off between the increased computing cost of such methods and the gain in accuracy. We will also investigate how increasing the size of the training dataset impacts the overall prediction accuracy of the models. Moreover, hybridizing the n-gram-based classification method with other approaches such as part-of-speech identification and capitalization rules could lead to significant improvements for arthropod species name identification in non-structured text.

Methods

Datasets
We prepared a comprehensive dataset for training the classifiers, which includes the following types of information (classes): (i) SCI: 3000 two-word phrases representing arthropod species names, (ii) ENG: 3000 two-word phrases consisting of two consecutive English words commonly used in scientific literature published in English, and (iii) PEO: 3000 two-word phrases representing first and last person names. We selected a well-balanced and sufficiently large number of class instances so that basic machine learning models could be trained for predictive purposes, while prediction accuracy remained an appropriate evaluation measure. To validate our results, we prepared one more dataset corresponding to the same types of information, including 500 two-word phrases for each class, and we affixed the prefix "V_" to each class name to help the reader distinguish validation results from testing/training results (V_SCI, V_ENG and V_PEO). The data from the validation dataset was not used in the classifiers' training and testing. The arthropod species names were collected from BugGuide, a community site for entomologists who share information and photos of arthropod species [16]. Arthropod species names were scraped from BugGuide and curated using a Python script, resulting in a collection of 11,720 unique two-word phrases, from which we randomly selected 3000. Sets of two consecutive English words were gathered by pre-processing sentences from five research papers, and each word was checked against an English dictionary to ensure its correctness. The person names were created from combinations of first and last names obtained from GitHub repositories [27,28].
The dataset instances were processed and the frequency of each of the 676 two-letter combinations (bigrams) over the standard 26-letter English alphabet was calculated for each dataset entry. Therefore, each two-word phrase in the three categories is represented by a row of 676 bigram frequencies.
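The manuscript does not reproduce the processing script itself; a minimal sketch of how such a 676-element feature vector could be computed might look like the following (the per-phrase normalization of counts into relative frequencies is an assumption, as the exact convention is not stated):

```python
from itertools import product
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# All 676 two-letter combinations, in a fixed column order.
BIGRAMS = ["".join(p) for p in product(ALPHABET, repeat=2)]

def bigram_frequencies(phrase):
    """Return the 676-element bigram frequency vector for a two-word phrase.

    Letters are lower-cased and non-alphabetic characters are dropped,
    making the representation insensitive to capitalization.
    """
    counts = Counter()
    for word in phrase.split():
        letters = [c for c in word.lower() if c in ALPHABET]
        # Count overlapping character bigrams within each word.
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values()) or 1  # guard against empty input
    return [counts[b] / total for b in BIGRAMS]

vec = bigram_frequencies("Apis mellifera")  # 3 + 8 = 11 bigrams in total
```

Each row of the resulting dataset is then one such vector plus the class label (SCI, ENG or PEO).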

Classification problems
The goal of this work was to identify ML algorithms capable of distinguishing arthropod species names in scientific publications. The two major challenges specific to this task were to distinguish arthropod species names from regular English words and from person names, respectively, which represent major confounders in a text corpus.
To address these challenges, we defined three classification problems, two of which are two-class problems and one is a more generic three-class problem.
The first problem (P1) is defined such that, given a set of two-word arthropod species names (SCI) and a set of pairs of consecutive English words commonly used in scientific literature (ENG), the classifier must assign each instance to the correct class. The second problem (P2) similarly requires distinguishing species names (SCI) from person names (PEO), and the third problem (P3) is a three-class problem requiring the classifier to discriminate among all three classes (SCI, ENG and PEO). The datasets used for P1 and P2 contain 6000 instances each, with 3000 instances in each class, while the dataset for P3 contains 9000 instances, with 3000 instances in each class. In each dataset, the instances are represented by bigrams in the forward direction.

Classification methods
Machine learning models were trained and tested using Weka [29]. Ten-fold cross validation [30] was used to evaluate 15 classifiers spanning seven algorithmic categories: tree-based (Random Forest, J48), kernel-based (LIBLINEAR, LIBSVM), artificial neural network (Multilayer Perceptron), Bayesian (Naïve Bayes, Bayesian Logistic Regression, Bayesian Networks, Complement Naïve Bayes, Naïve Bayes Multinomial, Naïve Bayes Updateable), lazy (K*, IBK), rule-based (Decision Table) and boosting (AdaBoost). The default settings for each classifier were used. All dataset files were saved in CSV format and converted to the Weka ARFF format.

Tree-based classifiers
Random Forest [5] is a popular ensemble-based classification method that uses a group of decision trees built on bootstrap aggregations and makes predictions by averaging the predictions of the individual trees. The J48 classification method is an implementation of the C4.5 algorithm, which in turn represents an improvement of the ID3 algorithm developed by Ross Quinlan [31].
The algorithm is used to create decision trees by iterating over the features of a dataset.
In each iteration the algorithm considers the candidate features and calculates the entropy (H) of the resulting splits, or equivalently their information gain (IG). It then selects the attribute with the smallest H (i.e. the largest IG), splits the set of data instances based on the selected attribute and organizes them as a tree.
The algorithm continues to split each data subset using the same principles until all instances are labelled. A new instance is then classified by traversing the tree based on the values of its features until a leaf node is reached. The class of the leaf node becomes the class of the instance.
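The entropy and information-gain computations described above can be sketched as follows (a generic illustration of the formulas, not Weka's J48 implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """IG of splitting (rows, labels) on the value of one feature index.

    IG = H(parent) - weighted sum of H(child) over the value-based splits.
    """
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[feature], []).append(y)
    remainder = sum(len(sub) / n * entropy(sub) for sub in by_value.values())
    return entropy(labels) - remainder
```

A feature that perfectly separates the classes yields an IG equal to the parent entropy, while an uninformative feature yields an IG of zero.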

Kernel-based classifiers
LIBLINEAR is an open source library for large-scale linear classification that supports two popular binary linear classifiers: Logistic Regression (LR) [32] and linear Support Vector Machines (SVM) [33]. For a given set of instance-label pairs (xi, yi), where i = 1..n and yi = {-1, 1}, LR and SVM solve an unconstrained optimization problem with one of three loss functions (f1, f2 and f3), as described in Equation (2). LIBSVM [6] is a related open source library that supports kernel functions and multi-class classification scenarios [34]. The LIBSVM algorithm in this paper used a radial-basis function kernel.
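The equation referenced above did not survive the conversion of this manuscript; restated from the LIBLINEAR literature [2], the unconstrained problem and the three candidate loss functions are:

```latex
\min_{w}\; \frac{1}{2}\,w^{T}w \;+\; C \sum_{i=1}^{n} f\!\left(w;\, x_i,\, y_i\right),
\qquad
\text{where } f \text{ is one of}
\begin{cases}
f_1 = \max\!\left(1 - y_i\, w^{T} x_i,\; 0\right) & \text{(L1-loss SVM)}\\[2pt]
f_2 = \max\!\left(1 - y_i\, w^{T} x_i,\; 0\right)^{2} & \text{(L2-loss SVM)}\\[2pt]
f_3 = \log\!\left(1 + e^{-y_i\, w^{T} x_i}\right) & \text{(logistic regression)}
\end{cases}
```

Here C > 0 is a penalty parameter balancing the regularization term against the training loss.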

Artificial neural network (ANN) classifiers
The Multi-Layer Perceptron (MLP) classifier is one of the most popular types of neural networks consisting of a feedforward architecture with three or more layers of artificial neurons (nodes): an input layer, an output layer and one or more hidden layers. Each node sums up the weighted inputs forming an activation potential, which is further processed by an activation function (e.g. sigmoid). The input weights represent connection strengths among nodes, and they get continuously updated during the training phase. The activation function acts as a threshold allowing only strong signals with sufficiently high activation potentials to produce an output. At each iteration, an input vector x is processed and the MLP outputs the vector y(x, w), where w is a vector of adapted weights. The error between calculated and expected outputs is estimated and minimized iteratively using a backpropagation approach [35] that adapts the weight values during each iteration until an acceptable error level is reached.
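As a generic illustration of the forward pass described above (a single-hidden-layer sketch with sigmoid activations, not the Weka MLP implementation; the layer sizes are arbitrary):

```python
import numpy as np

def sigmoid(a):
    """Sigmoid activation: squashes the activation potential into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass through a single-hidden-layer perceptron.

    Each layer computes a weighted sum of its inputs (the activation
    potential) plus a bias, then applies the sigmoid activation function.
    During training, backpropagation adjusts the weight matrices to
    minimize the error between computed and expected outputs.
    """
    hidden = sigmoid(w_hidden @ x + b_hidden)
    return sigmoid(w_out @ hidden + b_out)

# A 676-dimensional bigram vector would enter at the input layer;
# here a 3-input, 4-hidden, 2-output network is shown for brevity.
y = mlp_forward(np.zeros(3), np.zeros((4, 3)), np.zeros(4), np.zeros((2, 4)), np.zeros(2))
```

With all weights at zero, every node outputs sigmoid(0) = 0.5, which is why training must iteratively adapt the weights before the outputs become informative.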

Bayesian classifiers
Bayesian classifiers are probabilistic models that use Bayes' Theorem [36] to relate inputs with outputs. Bayes' Theorem expresses a conditional probability in terms of its inverse (Equation 3): P(A|B) = P(B|A) P(A) / P(B).
The learning module of a Bayesian classifier is tasked with constructing a probabilistic model over all data features and using that model to predict the class of new data instances. The Bayesian classifiers used in this study are: Naïve Bayes [37], Bayesian Logistic Regression [3], Bayesian Networks [38], Complement Naïve Bayes [39], Naïve Bayes Multinomial [40] and Naïve Bayes Updateable [37].
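For illustration, the simplest member of this family, Naive Bayes, applies Bayes' Theorem under the assumption that features are conditionally independent given the class; a deliberately minimal, unsmoothed sketch (not any of the Weka implementations listed above):

```python
from collections import Counter

def naive_bayes_posteriors(train_rows, train_labels, new_row):
    """Unnormalized class posteriors P(class) * prod_j P(feature_j | class).

    Probabilities are estimated by counting; no smoothing is applied,
    so an unseen feature value zeroes out a class's score.
    """
    n = len(train_labels)
    priors = Counter(train_labels)
    scores = {}
    for cls, count in priors.items():
        rows = [r for r, y in zip(train_rows, train_labels) if y == cls]
        score = count / n  # the prior P(class)
        for j, value in enumerate(new_row):
            matches = sum(1 for r in rows if r[j] == value)
            score *= matches / count  # the likelihood P(feature_j | class)
        scores[cls] = score
    return scores
```

The predicted class is simply the one with the highest score; dividing by the sum of the scores would recover the normalized posteriors of Equation 3.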

Lazy classifiers
Lazy classifiers operate by delaying generalization beyond the training data until a new instance is encountered [41]. When a new instance is presented to the classifier, a set of similar instances is retrieved from the training set and the new instance is labelled based on its similarity to the retrieved instances. Lazy classifiers are able to solve several problems concurrently because target functions are approximated locally for each query. The disadvantage is the space required to store the training data when large training sets are involved; although training is fast, the evaluation phase is usually slower. An example of a lazy classification method is the K* algorithm [42], an instance-based learner that measures the similarity between instances with an entropy-based distance function. The IBK algorithm [43] is another example of a lazy learner; it relies on a k-nearest neighbour classifier in which the final classification is based on the majority vote of the k nearest neighbours.
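The k-nearest-neighbour vote underlying IBK can be sketched as follows (a generic illustration using Euclidean distance, not Weka's implementation):

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, new_row, k=3):
    """Classify new_row by majority vote of its k nearest neighbours.

    No model is built ahead of time: all work happens at query time,
    which is the defining trait of lazy learning.
    """
    dists = sorted(
        (math.dist(row, new_row), y)
        for row, y in zip(train_rows, train_labels)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]
```

For the bigram datasets, `train_rows` would be the 676-dimensional frequency vectors and the labels the SCI/ENG/PEO classes.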

Rule-based classifiers
The Decision Table classifier [22] summarizes the training data as a table over a selected subset of features. A new instance is classified by finding the table entries that match its feature values and returning their majority class; if no entry matches, the majority class of the entire training set is returned.

Boosting classifiers
AdaBoost is a boosting algorithm introduced in 1995 by Freund and Schapire [44]. Initially, AdaBoost assigns equal weights to the training instances and iteratively applies a base classifier to the data. In each iteration, the algorithm increases the weights of incorrectly classified instances, forcing the base classifier to focus on the hard-to-classify instances of the training set and thereby boosting its performance.
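One round of the instance re-weighting described above can be sketched as follows (an AdaBoost.M1-style update; a generic illustration, not Weka's AdaBoost implementation):

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost round of instance re-weighting.

    `weights` are the current (normalized) instance weights and `correct`
    flags whether the base classifier got each instance right.  Assumes
    the weighted error is strictly between 0 and 1.  Misclassified
    instances gain weight so the next round focuses on them.
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - err) / err)  # the classifier's vote weight
    new = [w * math.exp(alpha if not ok else -alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], alpha
```

A well-known property of this update is that, after re-weighting, the misclassified instances carry exactly half of the total weight, so the previous round's mistakes dominate the next round's training.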

Experimental setup and performance metrics
All 15 classifiers were evaluated with Weka [29] using a 10-fold cross validation approach [30]. All tests were performed using the default settings for each classifier.
Since the classes are balanced, with 3,000 items each for training/testing and 500 each for validation, we report the percentage of correctly predicted instances (accuracy, %) and the execution time (in seconds). N-gram extraction was performed using the Python 3 re library.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
All datasets used in this work are provided in Additional file 1.

Additional files
Additional file 1 -ZIP archive including all datasets used in this work. Each file is labelled by applying the naming convention used in the manuscript.