NPDB: a natural product database with relational data between natural products and biological sources


We designed a natural product database populated with hand-curated data from the journal literature in order to establish the relationships between natural products and their biological sources. With it, users can find all the biological sources reported in the literature for a specific natural product, and can obtain a holistic view of the components, metabolites or extracts of a specific organism.


Introduction
Published studies on deriving natural products from biological sources provide abundant information on both biological and chemical diversity. Although knowledge of natural products has inspired new medicines, agrochemicals and materials, broad research across natural products derived from many different organisms remains limited. [1][2][3] For example, the correlations between natural product molecules and their host species are still unclear. What are the complete chemical components or metabolites of a given species? What are the relationships between species that share identical chemical components or metabolites? Such questions may be studied by computational approaches, and several data resources useful for natural product research already exist. [4][5][6][7][8][9][10][11][12][13][14] These resources range from free-access to commercial and from comprehensive to specialized. Nevertheless, the demand of scientific exploration still exceeds the available datasets, as results on the derivation of natural products are published in journals at an increasing rate. [4][5][6][7][8][9][10][11][12][13][14][15][16] We herein describe a natural product database (NPDB), a data resource that records the relationships between natural products and biological sources reported in publications. The relational data link a specific species to all the natural products derived from it and, conversely, link a specific natural product to all of its biological sources; each relation carries its corresponding bibliographic references. In this database, natural products are represented by the molecular structures of the molecules derived from organisms, and biological sources are represented by the species names of the organisms. Other available information, such as the parts of the organisms from which the products were derived, the names of the natural products, and computable molecular properties, is also included in NPDB.
The volume of the database is growing continuously as the journal literature on the derivation of natural products increases, and we intend to include more publications in the data acquisition stage. [18,19]
Insert Table 1 here.
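The two-way relation described above (a species linked to its natural products, a natural product linked to its sources, each link backed by references) can be sketched as a small in-memory index. This is only an illustration under stated assumptions: the class and field names are hypothetical, plain strings stand in for molecular structures and species names, and nothing here reflects the actual NPDB schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One reported link between a natural product and a biological source."""
    product: str    # hypothetical identifier for the molecule (name or structure key)
    species: str    # species name of the source organism
    reference: str  # bibliographic reference supporting the link


class RelationIndex:
    """Hypothetical two-way index over product/source relations."""

    def __init__(self):
        self._by_species = defaultdict(set)
        self._by_product = defaultdict(set)
        self._refs = defaultdict(set)

    def add(self, rel):
        # Register the link in both directions and remember its reference.
        self._by_species[rel.species].add(rel.product)
        self._by_product[rel.product].add(rel.species)
        self._refs[(rel.product, rel.species)].add(rel.reference)

    def products_of(self, species):
        """All natural products reported from one species."""
        return sorted(self._by_species.get(species, ()))

    def sources_of(self, product):
        """All biological sources reported for one natural product."""
        return sorted(self._by_product.get(product, ()))

    def references_for(self, product, species):
        """All references supporting one product/species link."""
        return sorted(self._refs.get((product, species), ()))
```

Keeping both directions in the same structure means that either question posed in the Introduction (which products come from a species, which species share a product) is answered by a single lookup.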

Utility And Discussion
The ten natural product molecules derived from the largest number of biological sources in NPDB, shown in Table 2, are eight terpenoids, one steroid and one aliphatic acid; each is derived from over 4,000 biological sources. Terpenoids are a large group of substances that occur in most organisms and play vital biological roles, for example as antioxidants and nutrients. [20] The steroid "β-sitosterol" and the aliphatic acid "palmitic acid" also occur widely in organisms as important classes of bioorganic molecules. [21]
Insert Table 2 here.
The molecular features of the natural products in NPDB, shown in Figure 1, differ chemically from those of the molecules in general chemical databases. [22] In terms of structural complexity, more than 86% of the natural products contain a ring system, over one third have more than 3 rings (Figure 1A), and 56% are heterocycles (Figure 1B). Approximately half of the natural products are aliphatic, and 58% have chiral centers (Figure 1B). The natural products also have a markedly high oxygen content on average: over 93% contain at least one oxygen atom and 16% contain more than 10 oxygen atoms, which is striking when compared with their nitrogen content (Figure 1C).
Insert Figure 1 here.
For those interested in using natural products as starting points for medicinal chemistry and drug discovery, the parameters of Lipinski's rule of five offer a useful reference for assessing the "drug-likeness" of the molecules in NPDB. [23] Over half of the natural products fall within the bounds of the five primary parameters (Figure 2F): molecular weight less than 500, fewer than 5 hydrogen bond donors, fewer than 10 hydrogen bond acceptors, fewer than 10 rotatable bonds, and LogP less than 5. Natural products are clearly a rich source of potential drug candidates.
Insert Figure 2 here.
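The drug-likeness screen described above can be sketched as a simple threshold check. The sketch assumes the five descriptors have already been computed (the paper reports using RDKit for molecular properties) and are supplied as plain numbers; the class, the function names and the strict "less than" comparison are assumptions that follow the wording of the text, not the authors' actual code.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties of one molecule (hypothetical container)."""
    mol_weight: float
    h_bond_donors: int
    h_bond_acceptors: int
    rotatable_bonds: int
    logp: float


# Upper bounds as listed in the text; a value must be strictly below its bound.
LIMITS = {
    "mol_weight": 500,
    "h_bond_donors": 5,
    "h_bond_acceptors": 10,
    "rotatable_bonds": 10,
    "logp": 5,
}


def violations(d):
    """Return the names of the parameters that reach or exceed their bound."""
    return [name for name, limit in LIMITS.items() if getattr(d, name) >= limit]


def within_bounds(d):
    """True when all five parameters are below their bounds."""
    return not violations(d)
```

Reporting the violated parameters by name, rather than a single pass/fail flag, makes it straightforward to count how many molecules fail each criterion.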

Construction And Content
To screen the literature for data sources, our data analysts browse the contents of each journal by volume and issue, select the relevant articles based on their titles and columns, and obtain the full text as a PDF, or the abstract and bibliographic information when the PDF is unavailable. The journals cover the major publications on the derivation of natural products, both domestic (Chinese) and international (English): 45% of them are devoted to natural product research in general, and the others cover phytochemistry, traditional Chinese medicine, the food industry and miscellaneous topics, as shown in Fig. 3. The main list of publications covered by NPDB is provided in the supporting information file as supplemental material.
Insert Figure 3 here.
In the early stage of this work, we collected the raw data of NPDB from publications manually. Our data analysts reviewed the literature and indexed the information on biological sources and natural product molecules. Drawing on this practical experience and a large body of hand-curated data, we developed a rule-based text-mining system for natural product data acquisition (NPDsys) and attempted to extract the required information automatically from the text of the journal articles (Fig. 4).
Insert Figure 4 here.
The biological information recognized by NPDsys comprises the species names of the organisms, such as "Alternaria alternata" in Fig. 4B, [24] and the parts of the organisms from which the products are derived, such as "secondary metabolites", "leaves" and "aerial parts". The chemical information recognized by NPDsys comprises the trivial, systematic or semi-systematic names of the natural product molecules and the author-numbers of the molecules. Author-numbers such as "Compound A" or "Compound (1)" are used to associate different representations of the same natural product that appear in the abstract, introduction, results and experimental sections, or the English and Chinese names of the same natural product in Chinese literature. Once the data become large-scale, this information will complement natural product name translation and molecular structure conversion. As an exploration of chemical text mining, NPDsys not only extracts the chemical entities from the literature but also attempts to recognise the connections between them. Nevertheless, a large part of the literature is not available as text-mining material, and some articles do not provide appropriate systematic names for newly found natural product molecules, which requires the extra step of drawing structures by hand or searching the web. Therefore, the data set currently available in NPDB is hand-curated.
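As an illustration of how a rule might pick up author-numbers, the following sketch uses a single regular expression. The pattern and the function name are hypothetical: the actual rules of NPDsys are not described in the text, and a production system would need many more cases (ranges, Chinese text, numbers attached directly to compound names).

```python
import re

# Hypothetical rule: the word "compound" followed by a bracketed or bare label
# such as "(1)", "1", "A" or "2a". Not the actual NPDsys rule set.
AUTHOR_NUMBER = re.compile(
    r"\b[Cc]ompounds?\s+\(?(\d+[a-z]?|[A-Z])\)?(?![A-Za-z0-9])"
)


def find_author_numbers(text):
    """Return author-number labels in order of first appearance, without duplicates."""
    seen = []
    for match in AUTHOR_NUMBER.finditer(text):
        label = match.group(1)
        if label not in seen:
            seen.append(label)
    return seen
```

Because the same label is returned wherever it occurs, mentions of a compound in the abstract and in the experimental section can be grouped under one key before their names are compared.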
The raw data were processed before being added to NPDB. We first assume that the primary literature is correct unless it contains apparent errors such as typos, and we go back to the original document when abnormal data are encountered in subsequent processing. When the same biological source is reported in multiple articles, we merge the data, retain the distinct natural products and list all the references. We take a similar approach when the same natural product is reported from different biological sources in multiple articles. When one article reports multiple biological sources (for example, studies of the components of mixed species), [25] we label the relational data as "optional"; once data reported elsewhere corroborate that the natural products come from one of those biological sources, the relational data are marked "confirmed".
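The merging and labelling steps above can be sketched as follows. The record layout (a dict with "reference", "species" and "products" keys) is hypothetical, and the sketch makes one assumption the text leaves open: a product/species pair is treated as "confirmed" only when some article reports it with that species as its single source, so the other species of a mixed-source article remain "optional".

```python
from collections import defaultdict


def merge_records(records):
    """Merge per-article records into one entry per (product, species) pair.

    Each record is a dict with keys "reference", "species" (list of names)
    and "products" (list of identifiers); this layout is hypothetical.
    """
    merged = defaultdict(lambda: {"references": set(), "status": "optional"})
    for rec in records:
        single_source = len(rec["species"]) == 1
        for species in rec["species"]:
            for product in rec["products"]:
                entry = merged[(product, species)]
                # Duplicate pairs collapse into one entry; references accumulate.
                entry["references"].add(rec["reference"])
                if single_source:
                    entry["status"] = "confirmed"
    return {
        pair: {"references": sorted(e["references"]), "status": e["status"]}
        for pair, e in merged.items()
    }
```

With this layout the order in which articles are processed does not matter: a mixed-source report read before or after the corroborating article yields the same final status.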
During natural product data acquisition, the structures of the molecules are generated by the name-to-structure modules of ACDLabs, ChemOffice and OPSIN for computer-aided data analysis, and we use ChemDraw and Reaxys to draw and search for structures manually. [26][27][28][29] A machine translation tool for chemical nomenclature is used to translate compound names between Chinese and English. [30] We evaluate the structures generated by the different toolkits: structures generated from the same compound name are scored by comparing their molecular formulas, and the final structures are standardized to MDL Molfile format. RDKit is used in Python to compute the molecular properties in NPDB. [31,32]

Conclusions
With continuous upgrading and optimization of NPDsys, we may obtain an efficient and low-cost tool for expanding NPDB in the future. Owing to the limits of our expertise, we have not concentrated on the classification and analysis of the biological source data. Nevertheless, the current datasets in NPDB show considerable diversity in both biology and chemistry, and the relational data provide significant clues to the correlations between some particular natural products and their host species. Compared with the leading natural product databases, NPDB does not exceed them in volume or coverage, but it offers greater clarity about the relationships between natural products and biological sources. [4][5][6][7][8][9][14] Besides mining information from the literature, making published data quickly accessible and combining them logically for further applications, the main purpose of this database is to clarify the relationships between natural products and biological sources, since these relationships intersect.
For example, the same natural product and its homologues can originate from various biological sources, so users have a wider choice of biological sources to suit their own needs. Conversely, integrated constituent information for a specific biological source can inspire novel applications of that source; for example, users can analyse the potential toxicology or pharmacology of a Chinese medicinal herb at the molecular level. We look forward to discovering more about natural products using new cheminformatics approaches, and to providing sophisticated data support for pharmaceutical research through a web interface and retrieval functionality.

Competing interests
The authors declare that they have no competing interests.

Figure 3 The coverage of the journals involved in the database.

Figure 4 The design of the natural product data acquisition system. (A) Illustration of natural product data acquisition system. (B) Demonstration of the procedure for natural product data acquisition.