The proposed system takes a sequence of paragraphs from a neuroscience article and extracts a list of metadata entities that best characterize the data described in the article. The core extraction mechanism is a sequence labeling algorithm, which requires training. Once the sequence labeling algorithm is trained, we use a probabilistic sentence classifier, tuned with term statistics extracted from the publication and from NeuroMorpho.Org (Ascoli et al., 2007), to sort the extracted entities and provide a ranked metadata suggestion list. The following sections describe the algorithm and the training dataset, and explain the structure of the entity suggestion system.
Corpus preparation and preprocessing
The data utilized in our metadata suggestion system consists of two main parts. The first part is the dataset prepared for training the sequence labeling algorithm. Starting from a corpus of over 2000 neuroscience articles processed for NeuroMorpho.Org, we selected 13,688 sentences via active learning (Chen et al., 2015). We then used the open-source annotation software DataTurks (https://github.com/DataTurks) to manually annotate the sequences of words with nearly 40,000 target metadata labels of interest (Table 1). The length of the annotated output sequence (labels) is thus the same as that of the input sentence. For this NER annotation, we adopt the BIO format (Ratinov and Roth, 2009), a tagging scheme that captures both the boundaries and the types of named entities (Fig. 1). All training data described above, including publication identifiers and annotated sentences, are available at https://gitlab.orc.gmu.edu/kbijari/neuroner-api/-/tree/master/data.
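For illustration, a BIO-tagged sentence might look as follows (this toy example is ours, not taken from the training corpus; B- marks the beginning of an entity, I- its continuation, and O a non-entity token):

```python
# Toy BIO annotation (tokens and tags invented for illustration).
sentence = ["Pyramidal", "cells", "of", "adult", "rat", "CA1", "were", "filled", "with", "biocytin"]
tags     = ["B-CEL",     "I-CEL", "O",  "B-DEV", "B-SPE", "B-REG", "O", "O",      "O",    "B-STA"]

for token, tag in zip(sentence, tags):
    print(f"{token:>10}  {tag}")
```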
Table 1
List of neuroscience entities of interest for NeuroMorpho.Org, with the abbreviation used in this article, an example of each entity type, and their distribution (count) across the annotated sentences.
Entity | Abbreviation | Example | Count |
Cell type | CEL | Interneuron | 7549 |
Developmental stage | DEV | Adult | 1243 |
Experimental condition | EXP | Control | 9860 |
Sex or Gender | GEN | Female | 685 |
Objective type | OBJ | Oil | 156 |
Protocol | PRO | In vivo | 1482 |
Reconstruction software | REC | Imaris | 1209 |
Brain region | REG | Amygdala | 8314 |
Slicing direction | SLI | Coronal | 362 |
Species | SPE | Rat | 4481 |
Staining method | STA | Biocytin | 1402 |
Strain | STR | Wistar | 3169 |
Total | - | - | 39876 |
The second part of the data consists of metadata entities of specific collections of neural reconstructions publicly shared on NeuroMorpho.Org and associated with 812 published articles (Bijari et al., 2020). These highly curated metadata summaries constitute the gold standard for benchmarking the automated suggestion system (Fig. 2). It is important to note that typically not all NeuroMorpho.Org metadata are explicitly mentioned in the publication: certain entities (such as the name of the collection and the structural domains included in the data) are only provided directly by the dataset contributors at the time of submission to the repository (Parekh et al., 2015). Therefore, this work focuses solely on extracting the subset of metadata most commonly reported in publications (Table 1). The 12 metadata entities to be extracted can be logically grouped into three broad categories: animal, anatomy, and experiment, as briefly explained below.
The four metadata entities pertaining to the animal category specify information about the subject of the study: the species, strain, sex, and developmental stage. These details are the simplest to extract from the article, as they are almost always clearly stated in the text of the publication.
The two metadata entities in the anatomy category represent the nervous system region and the cell type of the reconstructions described. These are the most difficult characteristics to recognize automatically for three main reasons. First, their semantics strongly depend on the species: many regions and cell types only exist in vertebrates (e.g., cerebellum and pyramidal cells) or invertebrates (e.g., antennal lobe and Kenyon cells). This creates tight contextual constraints that require considerable specialization for proper interpretation. The second reason is that anatomical entities can often be labeled according to different criteria, which typically vary based on the specific focus of the study. For example, regions can be partitioned functionally (e.g., visual vs. somatosensory vs. motor cortex) or structurally (e.g., occipital vs. parietal vs. frontal lobes); and cell types can be classified electrophysiologically (fast-spiking vs. regular spiking) or molecularly (calbindin-expressing vs. calretinin-expressing). These labels often overlap to a certain extent across criteria, considerably complicating the annotation task. The third challenge is that NeuroMorpho.Org divides both anatomical regions and cell types into three hierarchical levels, from generic to specific (e.g., hippocampus / CA1 / pyramidal layer and interneuron / basket cell / horizontal). Not all hierarchical descriptors may be explicitly mentioned in the article, as authors often rely on the reader’s tacit knowledge for correct understanding. To overcome these obstacles, we take advantage of the substantial information contained in the manually curated metadata from thousands of publications by more than 900 labs, as provided by NeuroMorpho.Org and its public metadata hierarchies (Polavaram et al., 2017). Specifically, from these records we constructed a lookup table listing, for each individual term, correlations with other metadata dimensions, potential hierarchies, synonyms, and frequencies.
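For illustration, one plausible in-memory structure for such a lookup-table record is sketched below in Python (all field names and values are our own invention; the actual NeuroMorpho.Org records may be organized differently):

```python
# Hypothetical structure for one lookup-table record (field names and values
# are illustrative; the actual NeuroMorpho.Org records may differ).
lookup_table = {
    "basket cell": {
        "category": "cell type",                      # metadata dimension (CEL in Table 1)
        "hierarchy": ["interneuron", "basket cell"],  # generic-to-specific levels
        "synonyms": ["basket interneuron"],
        "correlated": {                               # co-occurring metadata values
            "species": ["rat", "mouse"],
            "brain region": ["hippocampus", "neocortex"],
        },
        "frequency": 312,                             # usage count in curated metadata
    },
}

# Example query: resolve a candidate term to its canonical entry.
term = "basket interneuron"
match = next((key for key, record in lookup_table.items()
              if term == key or term in record["synonyms"]), None)
print(match)  # -> "basket cell"
```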
The last six metadata entities targeted by our suggestion system, belonging to the experiment category, describe methodological information: the preparation protocol, experimental condition, label or stain, slicing orientation, objective type, and the tracing software. These details are also relatively straightforward to extract since peer-reviewed publications usually mention the experimental specifications explicitly.
If any metadata detail is not provided by the contributor nor mentioned in the publication, the corresponding entry is marked “Not reported” in NeuroMorpho.Org. Moreover, certain details are labeled as “Not applicable” depending on the specific preparation: for instance, if the experimental protocol is “cell culture”, then the slicing direction is not a relevant entity.
Metadata Extraction – Problem Definition & System Architecture
Given the full text of a publication (P), captured from the publisher’s application programming interface (API), and an empty list of target metadata dimensions (M) (Table 1), the entity extraction task is to collect all metadata entities mentioned within the sentences of P and add them to the appropriate sub-list of M. The algorithm must then sort all sub-lists of M based on the relevance of the extracted entities.
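In schematic terms, the task can be summarized as follows (a minimal Python sketch; extract_entities and score are placeholders for the components described in the next subsections):

```python
from collections import defaultdict

def extract_entities(sentence):
    """Placeholder for the sequence labeling extractor described below;
    yields (entity_type, term) pairs in the real system."""
    return []

def score(term, P):
    """Placeholder for the ranking score defined under 'Metadata Entity Ranking'."""
    return 0.0

def suggest_metadata(P):
    """P: list of sentences from the publication; returns the ranked metadata lists M."""
    M = defaultdict(list)                  # one sub-list per metadata dimension
    for sentence in P:
        for entity_type, term in extract_entities(sentence):
            if term not in M[entity_type]:
                M[entity_type].append(term)
    for entity_type in M:                  # rank candidates within each dimension
        M[entity_type].sort(key=lambda t: score(t, P), reverse=True)
    return dict(M)
```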
The metadata extraction architecture we designed to solve the above task consists of three main elements: preprocessing, entity extraction, and entity collection/ranking (Fig. 3).
Preprocessing starts by resolving all abbreviations mentioned in the full text using the Schwartz–Hearst algorithm (Schwartz and Hearst, 2003) and a list of common abbreviations collected through our curation work on the NeuroMorpho.Org knowledge repository. Afterward, we replace all Roman numeral mentions with their corresponding Arabic numerals (e.g., i/ii→1/2). Once this is done, we use the Natural Language Toolkit (NLTK; Loper and Bird, 2002) to break the text into paragraphs and sentences; next, sentences are tokenized into sequences of words and punctuation marks. With the full text of the publication transformed into a list of word sequences, we can turn to tagging metadata entities.
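A minimal sketch of these preprocessing steps, assuming NLTK for tokenization (abbreviation resolution is omitted for brevity, and the Roman numeral map covers only the lowest values):

```python
# Preprocessing sketch: Roman numeral replacement plus sentence/word tokenization.
import re
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models, downloaded once

ROMAN = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5"}

def replace_roman(text):
    # Replace standalone lowercase Roman numerals with Arabic digits.
    return re.sub(r"\b(i{1,3}|iv|v)\b", lambda m: ROMAN[m.group(0)], text)

def preprocess(full_text):
    text = replace_roman(full_text)
    sentences = nltk.sent_tokenize(text)               # split into sentences
    return [nltk.word_tokenize(s) for s in sentences]  # words and punctuation

print(preprocess("Neurons in layer ii were recorded. Slices were cut coronally.")[0])
# ['Neurons', 'in', 'layer', '2', 'were', 'recorded', '.']
```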
The sequence-to-sequence algorithm to extract named entities consists of two sub-parts. First, NeuroBERT encapsulates the preprocessed sequence of words with special tokens ([CLS] and [SEP]) to demarcate the sentence boundaries and creates appropriate metadata tags for each term. Second, to ensure that all entities are fully extracted from the text and nothing is missed, we cross-check the terms in the sequence for exact matches in the lookup table of NeuroMorpho.Org metadata terminology.
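A sketch of the BERT-based tagging step using the Hugging Face transformers API (the model path is a placeholder for the fine-tuned NeuroBERT checkpoint; any token-classification model with BIO labels would slot in the same way):

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL = "path/to/neurobert-ner"  # placeholder for the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL)

sentence = "Recordings were made from CA1 pyramidal cells of adult rats."
inputs = tokenizer(sentence, return_tensors="pt")  # adds [CLS] and [SEP]
with torch.no_grad():
    logits = model(**inputs).logits                # (1, seq_len, num_labels)
pred_ids = logits.argmax(-1)[0].tolist()

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, pred in zip(tokens, pred_ids):
    print(token, model.config.id2label[pred])      # one BIO tag per word piece
```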
The above process produces labels for every term in all publication sentences. This dense annotation is then trimmed by removing labels that cannot be mapped to suitable target terms from the NeuroMorpho.Org lookup dictionary. Matching between extracted labels and NeuroMorpho.Org target terms is assessed by Jaro similarity (Jaro, 1989). This metric measures string similarity in the [0, 1] range, with 1 indicating identical strings and 0 indicating strings with no characters in common. For our purpose, we set a similarity threshold of 0.85 to retain a label.
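A minimal sketch of this filtering step (we use the jellyfish library’s Jaro implementation here; the production code may compute the metric differently):

```python
import jellyfish

THRESHOLD = 0.85
TARGET_TERMS = ["pyramidal cell", "basket cell", "granule cell"]  # toy lookup entries

def best_match(label, targets=TARGET_TERMS, threshold=THRESHOLD):
    """Return the closest NeuroMorpho.Org term, or None if no target reaches the threshold."""
    score, term = max((jellyfish.jaro_similarity(label.lower(), t), t) for t in targets)
    return term if score >= threshold else None

print(best_match("pyramidal cells"))  # -> "pyramidal cell" (close variant retained)
print(best_match("astrocyte"))        # -> None (label discarded)
```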
The result of this sequence-to-sequence processing is a list of candidate terms for each metadata entity type (category). At this stage, the terms within each entity type must be ranked to identify the most accurate metadata suggestion.
Metadata Entity Ranking
In order to rank terms within the list of identified candidates for each metadata entity type, we assign to each candidate term a score that is a function of its occurrence frequency and location in the text, its usage rate in NeuroMorpho.Org, and the structure of the sentence in which it appears. Specifically, the score of each extracted entity is determined based on the following equation.
$$Score\left(term,sec,sen\right)=\alpha \times Freq\left(term\right)+\beta \times Rate\left(term\right)+\gamma \times SecScore\left(sec\right)+\delta \times SenScore\left(sen\right)$$
Here, ‘term’ is the identified metadata entity, while ‘sec’ and ‘sen’ are respectively the section of the publication (e.g., Introduction, Materials and Methods, etc.) and the sentence in which the term is found. ‘Freq’ calculates the frequency of ‘term’ by simply counting the number of times the term appears within the publication. ‘Rate’ computes how often NeuroMorpho.Org uses the term by dividing the number of times a group of neural reconstructions is annotated with that specific entity by the count of all groups of reconstructions annotated with any entity within that metadata category.
‘SecScore’ returns the importance of the section in which the term is identified, assigning, for example, greater weight to Materials and Methods or Results than to Introduction or Discussion (Table 2). Figure legends are assigned the SecScore value of the section they belong to (typically Results). If a term is found in multiple sections within the publication, the maximum SecScore value is utilized.
Table 2
List of the different publication sections considered, along with their relative importance for the model. “Summary” is considered synonymous with Abstract and “Conclusions” with Discussion. “Others” includes Acknowledgments and References as well as any additional section.
Section | Title | Abstract | Keywords | Introduction | Methods | Results | Discussion | Others |
Importance | 1.0 | 1.0 | 1.0 | 0.5 | 1.0 | 1.0 | 0.6 | 0.4 |
‘SenScore’ calculates the relevance of the sentences containing the term. For this purpose, we trained a logistic regression classifier from Scikit-learn (Pedregosa et al., 2011), using default parameters, on 375 sample sentences randomly selected from neuroscience articles associated with NeuroMorpho.Org data and manually labeled as 0 or 1 based on their informativeness. This classifier reads the embedded sentences from the last layer of NeuroBERT (Fig. 3) and uses a sigmoid function to produce a likelihood value based on the sentence structure. For example, the label ‘Species = rat’ in the sentence "experiment was performed on 55 adult male Sprague-Dawley rats" will have a higher value (SenScore=0.85) than in the sentence "previous in vitro studies of adult rat have shown that correlation depends on the level of excitation" (SenScore=0.40). If a term is found in multiple sentences within the publication, the maximum SenScore value is utilized.
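Putting the four components together, the ranking score can be sketched as follows (section weights from Table 2, coefficients from Table 3; the Freq, Rate, and SenScore inputs are assumed to be already min-max normalized as described in the next subsection):

```python
SECTION_WEIGHTS = {"title": 1.0, "abstract": 1.0, "keywords": 1.0,   # Table 2
                   "introduction": 0.5, "methods": 1.0, "results": 1.0,
                   "discussion": 0.6, "others": 0.4}
ALPHA, BETA, GAMMA, DELTA = 0.20, 0.25, 0.35, 0.20                   # Table 3

def score(freq, rate, sections, sen_scores):
    """Combine the four normalized components into a single ranking score."""
    sec_score = max(SECTION_WEIGHTS[s] for s in sections)  # best section wins
    sen_score = max(sen_scores)                            # best sentence wins
    return ALPHA * freq + BETA * rate + GAMMA * sec_score + DELTA * sen_score

# Example: a term found in Methods and Discussion, with two sentence scores.
print(score(freq=0.6, rate=0.4, sections=["methods", "discussion"],
            sen_scores=[0.85, 0.40]))
# 0.20*0.6 + 0.25*0.4 + 0.35*1.0 + 0.20*0.85 = 0.74
```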
Model Training and Parameter Settings
The ranking values (Freq, Rate, SenScore, and SecScore) were min-max normalized to the [0, 1] range. Their coefficient values [α, β, γ, δ] were optimized by grid search over the [0, 1] interval in 0.05 increments, selecting the combination that maximized annotation performance (Table 3). We used default values for most BERT hyperparameters (Devlin et al., 2019), with the following exceptions. The learning rate was set to 0.00002 after testing the five recommended values, and a batch size of 8 was used after comparing the results with a value of 16. We chose 10 as the number of training epochs, which produced the best results among all values from 1 to 50. Model training and optimization were performed using Python 3.10 under the Linux operating system on a Tesla K80 GPU with 32 GB of RAM and lasted three days. The full list of packages and libraries used is available at https://gitlab.orc.gmu.edu/kbijari/neuroner-api.
Table 3
Best-performing parameter values for the model.
Parameter | α | β | γ | δ |
Value | 0.20 | 0.25 | 0.35 | 0.20 |
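For concreteness, the coefficient search described above can be sketched as an exhaustive grid scan (annotation_performance stands in for the evaluation of a weight combination against the gold-standard metadata):

```python
import itertools

STEPS = [round(0.05 * i, 2) for i in range(21)]  # 0.00, 0.05, ..., 1.00

def grid_search(annotation_performance):
    """Return the (alpha, beta, gamma, delta) combination maximizing performance."""
    best_perf, best_weights = -1.0, None
    for weights in itertools.product(STEPS, repeat=4):  # all 21**4 combinations
        perf = annotation_performance(weights)
        if perf > best_perf:
            best_perf, best_weights = perf, weights
    return best_weights, best_perf
```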