The accurate evaluation of the biodiversity of any given ecosystem is a keystone element, even an imperative, in numerous biological and applied disciplines, including ecology, conservation biology, food regulatory compliance, forensics, and ecosystem monitoring and assessment [1–2]. In response to these needs, DNA-based taxon identification built on the Cytochrome Oxidase I (COI) gene is routinely used to assess biodiversity, including species identification, species boundaries and species diversity analyses. The inventory of DNA barcodes is deposited in BoLD [3], a cloud-based data storage and analysis platform that also serves as a curation tool. BoLD currently contains (as of April 2020) about eight million barcodes, encompassing > 310,000 animal, plant and fungal species. The rationale for using the COI gene in species barcoding rests on the understanding that intraspecific diversity in the COI gene is lower than interspecific diversity, and is reinforced by the difficulties associated with traditional taxonomy, that is, morphologically based species identification [4]. The major benefit of using BoLD emerges when an unknown sequence is compared against the database to determine its closest species match, an evaluation that depends strictly on the correctness and reliability of the data stored in BoLD and on the quality of the barcode libraries [5]. It should be noted, however, that while barcodes may be excellent tools for identifying species that are already in BoLD, they may have poor predictive power for identifying unknown species.
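The barcoding rationale stated above (intraspecific diversity lower than interspecific diversity) can be illustrated with a minimal sketch. The sequences and species labels below are invented toy data, the function names are hypothetical, and the uncorrected p-distance is used only for simplicity; it is not presented as the distance model used by BoLD or in this study.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcode_gap(records):
    """Return (max intraspecific distance, min interspecific distance).

    records: list of (species_name, aligned_sequence) tuples.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        # Same label -> within-species comparison, otherwise between-species.
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    return max(intra, default=0.0), min(inter, default=1.0)


# Toy, invented data: two specimens for each of two hypothetical species.
toy = [
    ("Species A", "ACGTACGTAC"),
    ("Species A", "ACGTACGTAT"),
    ("Species B", "ACGAACCTGC"),
    ("Species B", "ACGAACCTGT"),
]

max_intra, min_inter = barcode_gap(toy)
# A "barcode gap" exists when every within-species distance is smaller
# than every between-species distance.
has_gap = max_intra < min_inter
print(f"max intraspecific = {max_intra:.2f}, min interspecific = {min_inter:.2f}, gap = {has_gap}")
```

When the largest within-species distance approaches or exceeds the smallest between-species distance, a closest-match query can no longer separate the two species reliably, which is why the quality of the reference records matters.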
As BoLD is a curation tool, it is generally assumed that all barcode sequences stored in the database are backed by vouchered specimens that were thoroughly identified by taxonomic experts. Yet, being a public database, BoLD, like any similar curation tool, will inevitably accrue erroneous data, sometimes in significant amounts [2, 6]. Taxonomic misidentifications and/or taxonomic conundrums, composites of cryptic species complexes, technical faults such as deficient DNA extraction, primer bias and/or PCR-based errors, and contamination by foreign DNA, including bacterial COI sequences, are just some of the causes that may unavoidably generate erroneous data and inaccurate sequences [2, 6–10]. These difficulties may dramatically affect the accuracy of barcoding. For example, it has been claimed that they lead to the lack of an unambiguous species-level identification by the BIN (Barcode Index Number) tool in the BoLD system, and to taxonomic conflicts through the assignment of more than a single species name per BIN [11].
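The kind of taxonomic conflict mentioned above, more than one species name attached to a single BIN, can be flagged with a simple grouping step. The following is a minimal sketch under stated assumptions: the record layout, the BIN identifiers and the species names are all hypothetical placeholders, and no real BoLD export format or API is implied.

```python
from collections import defaultdict


def bin_conflicts(records):
    """Return {bin_id: sorted species names} for BINs carrying more than one name.

    records: iterable of (bin_id, species_name) pairs (hypothetical layout).
    """
    names_by_bin = defaultdict(set)
    for bin_id, species in records:
        names_by_bin[bin_id].add(species)
    # Keep only BINs to which two or more distinct species names are assigned.
    return {b: sorted(n) for b, n in names_by_bin.items() if len(n) > 1}


# Invented placeholder records, not real BoLD data.
toy_records = [
    ("BIN-0001", "Genus alpha"),
    ("BIN-0001", "Genus alpha"),
    ("BIN-0002", "Genus beta"),
    ("BIN-0002", "Genus gamma"),  # second name in the same BIN -> conflict
    ("BIN-0003", "Genus delta"),
]

conflicts = bin_conflicts(toy_records)
print(conflicts)
```

A flagged BIN does not by itself say which record is wrong; it only marks where a misidentification, a synonym, or a genuinely shared barcode would need to be checked by a taxonomist.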
The European Register of Marine Species (ERMS [12]) is an authoritative taxonomic checklist of the species found in all European marine environments. This all-taxon marine species inventory extends from the Canaries and Azores to Greenland and northwest Russia, and includes the Mediterranean Sea and the Baltic Sea. It covers the deep sea, all continental shelf areas up to the splash zone above the high tide mark, and estuaries down to a salinity of 0.5 psu. During 1997–1999, ERMS was published on the internet and subsequently as a book, containing a list of about 30,000 marine species of the kingdoms Animalia, Plantae, Fungi and Protoctista occurring in the European marine environment [13]. This marine species inventory is expected to serve as the standard reference and technological tool for marine research and for management of the marine environment in Europe.
Until recently, the standardized methodologies available to practitioners for biological monitoring and management in marine environments were restricted to traditional morphological taxonomy, a tedious and time-consuming approach that requires the involvement of expert taxonomists whose skills can only be attained through years of practice. These analyses are currently being replaced by molecular approaches such as DNA barcoding, environmental DNA (eDNA) and metabarcoding [5, 14–17]. For these approaches, it is of special interest to identify gaps in existing or developing DNA barcode reference libraries, primarily those that are pertinent in the context of the EU Water Framework Directive (WFD) and the Marine Strategy Framework Directive (MSFD). A recent global study from this perspective [5] revealed that barcoding coverage varies strongly among taxonomic groups and among geographic regions. It pointed to many missing species and to imperfect data (e.g., errors in species identification, discordance among taxonomists) that are relevant to monitoring, and highlighted the need to improve quality assurance of the barcode reference libraries.
Following the global analysis of Weigand et al. [5], we aim here to investigate potential gaps in the DNA barcode coverage (based on publicly available data in the BoLD database) of organisms listed in two reference libraries of the ERMS inventory. We discuss the necessity of quality control (QC) when building and curating a barcode reference library, and provide recommendations for filling the gaps in the barcode library of European aquatic taxa.
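In its simplest form, a gap analysis of this kind compares a taxonomic checklist with the set of species that have at least one public barcode. The sketch below illustrates that comparison only; the function names, the name-normalisation rule and the species names are hypothetical, and it is not a description of the workflow actually used in this study.

```python
def normalize(name: str) -> str:
    """Collapse whitespace and case so trivially different spellings match."""
    return " ".join(name.split()).lower()


def barcode_gaps(checklist, barcoded):
    """Return (sorted checklist species lacking a barcode, coverage fraction)."""
    wanted = {normalize(n) for n in checklist}
    have = {normalize(n) for n in barcoded}
    missing = sorted(wanted - have)
    # Coverage counts only checklist species; extra barcoded names are ignored.
    coverage = (len(wanted) - len(missing)) / len(wanted) if wanted else 0.0
    return missing, coverage


# Invented placeholder names standing in for a checklist and a barcode export.
checklist = ["Genus alpha", "Genus beta", "Othergenus gamma", "Othergenus delta"]
barcoded = ["genus  alpha", "Othergenus gamma", "Unlisted epsilon"]

missing, coverage = barcode_gaps(checklist, barcoded)
print(f"coverage = {coverage:.0%}; missing = {missing}")
```

Exact string matching of this sort treats synonyms and misspellings as gaps, so in practice the names would need to be reconciled against an accepted taxonomy before coverage is computed.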