Chemical space of InflamNat database
Among the 1351 InflamNat compounds, the largest structural class is the flavonoids, followed by the triterpenoids and the diterpenoids (Figure 2A). As discussed in our previous study, these structural classes are the ones most frequently isolated and reported in natural product research. Furthermore, the phenolic hydroxyl groups and aromatic rings in flavonoids may contribute to their wide range of bioactivities by forming intermolecular interactions with protein targets. Triterpenoids are structurally similar to steroid hormones, which play important roles in modulating immunological reactions. The scaffolds of the NPs in InflamNat are very diverse (Figure 2B), ranging from simple aromatic natural products with a single ring to complicated skeletons with a 5-6 ring system.
The distribution of the physicochemical properties of InflamNat compounds is shown in Figure 3A. According to Lipinski’s rule, 60% of the InflamNat compounds are drug-like (MW < 500, LogP < 5, #HBD < 5, #HBA < 10, and #RotB < 10), while 29% have a topological polar surface area (TPSA) < 60, indicating their potential to cross the blood-brain barrier (BBB). As shown in Figure 3B, InflamNat compounds cover a chemical space similar to, but smaller than, that of approved drugs.
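The filters above can be sketched in a few lines of Python. This is a minimal illustration, assuming the descriptors have already been computed by a cheminformatics toolkit; the `Descriptors` container, the function names, and the example values are hypothetical and are not taken from InflamNat.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one compound (hypothetical container)."""
    mw: float    # molecular weight
    logp: float  # octanol-water partition coefficient
    hbd: int     # number of hydrogen-bond donors
    hba: int     # number of hydrogen-bond acceptors
    rotb: int    # number of rotatable bonds
    tpsa: float  # topological polar surface area


def is_drug_like(d: Descriptors) -> bool:
    """Apply the thresholds quoted in the text, with strict inequalities as written."""
    return d.mw < 500 and d.logp < 5 and d.hbd < 5 and d.hba < 10 and d.rotb < 10


def may_cross_bbb(d: Descriptors) -> bool:
    """TPSA < 60 is the blood-brain barrier heuristic used in the text."""
    return d.tpsa < 60


def fraction(compounds, predicate) -> float:
    """Share of compounds satisfying a predicate; 0.0 for an empty list."""
    if not compounds:
        return 0.0
    return sum(predicate(c) for c in compounds) / len(compounds)


# Illustrative values only, not InflamNat records.
examples = [
    Descriptors(mw=286.2, logp=2.0, hbd=4, hba=6, rotb=1, tpsa=107.2),
    Descriptors(mw=456.7, logp=6.5, hbd=2, hba=3, rotb=1, tpsa=57.5),
    Descriptors(mw=150.2, logp=2.8, hbd=1, hba=1, rotb=1, tpsa=20.2),
]
print(fraction(examples, is_drug_like))   # share passing the drug-likeness filter
print(fraction(examples, may_cross_bbb))  # share with TPSA < 60
```

Keeping the two predicates separate mirrors the text, which reports the drug-like and BBB-permeable fractions independently.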
Bioactivity overview of InflamNat compounds
The anti-inflammatory activity of InflamNat compounds in cells was obtained from the literature. In addition to the major indices, such as the inhibitory effect on the production of NO, PGE2, IL-1, IL-6, IL-8, and TNFα, cytotoxicity data were collected to exclude the effects of cell viability on the production of inflammatory factors. Inhibition of NO production was the most frequently reported index. Notably, NO production reflects only specific inflammation signaling pathways, such as the classical NF-κB pathway, whereas other pathways may be represented by different indices, such as IL-1β. However, data on the inhibition of the production of IL-1β and other inflammatory factors were insufficient to develop a machine learning model (Figure 4A). Therefore, only the inhibitory activity against NO production was selected to train the prediction model of anti-inflammatory activity.
Since the anti-inflammatory effects were sensitive to the cellular model, the cell types used in the assays were also recorded (Figure 4B), with the majority of the assays performed in mouse macrophage models (including RAW264.7 and J774A.1). Cells of the mouse microglial line BV-2 are macrophages residing in the central nervous system. The data acquired in macrophages were selected for model construction.
Only about one-third of InflamNat compounds had reported protein targets. The top 100 targets of InflamNat are shown in Figure 4C, where the length of each protein name corresponds to the frequency with which the protein appeared in the records. The targets of InflamNat compounds were related to a wide range of diseases, including cancer (Tyrosyl-DNA Phosphodiesterase 1, TDP1), inflammation (15-Hydroxy-prostaglandin dehydrogenase, HPGD), nervous system disease (Amyloid-β, Abeta), and diabetes (Protein Tyrosine Phosphatase 1B, PTP1B). Enzymes related to drug metabolism, such as the cytochrome P450 proteins (CYPs), represented another type of target.
Model Training and Prediction Performance Evaluation
The machine learning-based predictive tools in InflamNat, namely AI-A and C-T, were implemented with the open-source machine learning framework PyTorch (https://pytorch.org). The details of model training and the evaluation results for AI-A and C-T are presented in this subsection. Ten-fold cross-validation was used for evaluation: the dataset was divided into ten parts, of which one served as the test set, another as the validation set, and the remaining eight as the training set. The model was trained on the training set and checked on the validation set, and then assessed on the test set. Each part was used as the test set in turn, and the average classification accuracy over the ten folds was used to evaluate the performance of the classifier. In this study, the receiver operating characteristic (ROC) curve and the area under the curve (AUC) were used to evaluate the prediction performance of the proposed models. All experiments were carried out on a Dell Precision T5820 workstation running Windows 10, with an 8-core 3.7 GHz Intel W-2145 CPU and 64 GB of memory.
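The fold rotation described here can be sketched as follows. This is a minimal illustration rather than the authors' code: the text does not say which part serves as the validation set in each round, so the sketch assumes it is the fold that follows the test fold, and the function name `ten_fold_splits` is hypothetical.

```python
import random


def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train, val, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so that fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        v = (k + 1) % n_folds  # assumption: validation fold follows the test fold
        test = folds[k]
        val = folds[v]
        train = [i for j, fold in enumerate(folds) if j not in (k, v) for i in fold]
        yield train, val, test


# 890 is the size of the labeled NP dataset mentioned in the text.
splits = list(ten_fold_splits(890))
for train, val, test in splits[:2]:
    print(len(train), len(val), len(test))
```

A fixed seed keeps the partition reproducible, so competing models can be compared on identical folds.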
(a) Tokenization and Pre-training
A total of 1,938,745 SMILES sequences were collected from ChEMBL, and 476,715 protein sequences from UniProt, as a corpus for pre-training. For compound SMILES, byte pair encoding (BPE) and Extended-Connectivity Fingerprints (ECFP) were used to produce tokens. BPE is a data-driven tokenization algorithm. It first learns a vocabulary of high-frequency SMILES substructures from a large chemical dataset (ChEMBL was used in this study), then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. ECFPs are a class of fingerprints specifically designed to capture molecular features associated with molecular activity. In ECFP, the substructures surrounding each heavy atom of a molecule within a defined radius are generated and assigned unique identifiers. In our study, radii of 1 and 2 were used, and the resulting tokenizations are called ECFP1 and ECFP2, respectively.
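The two BPE stages described above (learning merges from a corpus, then applying them to a new string) can be illustrated with a toy sketch. It is a simplified, character-level version written for this example: it starts from single characters, so multi-character atoms such as Cl or Br are initially split, and the function names and the tiny corpus are hypothetical rather than the authors' implementation.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each adjacent occurrence of `pair` in `tokens` with the joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(corpus, num_merges):
    """Learn an ordered list of merges from a list of SMILES strings."""
    sequences = [list(s) for s in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        if pair_counts[best] < 2:
            break  # no pair repeats any more
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges


def tokenize(smiles, merges):
    """Apply the learned merges, in order, to a new SMILES string."""
    tokens = list(smiles)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


toy_corpus = ["CCO", "CCN", "CCC", "c1ccccc1O", "c1ccccc1N"]
toy_merges = learn_bpe(toy_corpus, num_merges=5)
print(tokenize("c1ccccc1CCO", toy_merges))
```

Because the merges are applied in the order they were learned, the same vocabulary always yields the same segmentation of a given string.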
Figure 5 displays the statistical results of BPE, ECFP1, and ECFP2 tokenization for the collected ChEMBL dataset. The mean lengths for BPE, ECFP1, and ECFP2 tokenization were approximately 6, 22, and 25 tokens, respectively. These results show that different tokenization methods produce different token sets and therefore partition the same sequence into units with different semantics. For protein sequences, k-mers and BPE were adopted to generate various tokens.
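For the protein side, k-mer tokenization amounts to sliding a fixed-width window over the sequence. The sketch below is illustrative only: the text does not state the value of k or whether the windows overlap, so both `k=3` and `stride=1` are assumptions, and the sequence is a made-up fragment.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a sequence into k-mers: overlapping if stride < k, disjoint if stride == k."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    # A trailing fragment shorter than k is dropped.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


fragment = "MKTAYIAK"  # hypothetical protein fragment
print(kmer_tokens(fragment, k=3, stride=1))
print(kmer_tokens(fragment, k=3, stride=3))
```

Overlapping windows preserve local context at the cost of longer token sequences, whereas disjoint windows are shorter but depend on the reading frame.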
The tokens were considered as “words” and compounds (or proteins) as “sentences”. The Word2vec algorithm was then applied to the drug (or protein) corpus to obtain high-dimensional embeddings of tokens, where the vectors for chemically related tokens occupied the same part of vector space. These token embeddings were used as the initial feature representation of drugs (or proteins).
(b) Training and Evaluation of AI-A
Following the ten-fold cross-validation protocol, 890 NP molecules labeled as having anti-inflammatory activity (represented by 1) or as inactive (represented by 0) were used to train the MTT-based encoder and binary classifier.
After tuning of the model hyperparameters, the dimension of the feature vector was set to 128, the number of attention heads in the transformer to 6, the number of transformer layers to 5, and the learning rate to 0.01. Figure 6 compares the prediction performance of MTT(ECFP), MTT(BPE), and MTT(ECFP+BPE), where the term in parentheses denotes the tokenization supplied to the MTT encoder. The results revealed that the adoption of multiple tokenizations can improve prediction performance. The final MTT model achieved an AUC of 0.8476.
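For readers who want to reproduce the metric, an AUC can be computed directly from labels and classifier scores. The sketch below uses the rank-sum (Mann-Whitney U) identity, which is equivalent to the area under the ROC curve; it is an illustration of the metric and not the evaluation code used in this study, and the function name `roc_auc` and the example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC via the rank-sum identity, using average ranks for tied scores."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = avg_rank
        i = j + 1
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Made-up scores: 1 = active, 0 = inactive.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In a cross-validation setting, the per-fold values returned by such a function would simply be averaged.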
To evaluate the effectiveness of MTT with multi-tokenization, we compared the prediction performance of the MTT-based classifier with that of other methods, namely SA-BiLSTM, PaDEL-SVM, and PaDEL-RF, on our NP classification dataset. PaDEL-SVM and PaDEL-RF are prediction methods that use PaDEL for compound description, with SVM and random forest as the classifier, respectively. The comparison is shown in Figure 7.
(c) Training and Evaluation of C-T
The aim of C-T was to predict the interactions between compounds and targets. In this study, C-T was likewise modeled as a binary classification problem: deciding whether a given compound-protein pair interacts or not. MTT was used as the encoder for both compound SMILES and protein sequences. Once the embedding of the compound-protein pair was obtained, it was fed into the MLP-based classifier, which produced the final interaction score.
A total of 9126 compound-protein pairs, labeled “1” (the compound and protein interact) or “0” (they do not interact), were used as the dataset for training the prediction model. The dataset included 325 compounds and 796 proteins, with 7164 positive pairs (“1”) and 1962 negative pairs (“0”).
Ten-fold cross-validation was used to evaluate the prediction performance of the C-T model. Specifically, 10% of the positive pairs and 10% of the negative pairs were randomly selected from their respective datasets as the test set. The remaining pairs were used as the training set.
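Drawing the same fraction from each class keeps the positive-to-negative ratio of the test set equal to that of the full dataset. A minimal sketch of one such split is given below; the function name `stratified_holdout`, the seed, and the rounding of fractional counts (truncation) are assumptions made for illustration.

```python
import random


def stratified_holdout(labels, test_fraction=0.1, seed=0):
    """Hold out `test_fraction` of each class; return (train_idx, test_idx)."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_test = int(len(members) * test_fraction)  # truncates fractional counts
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx


# Class sizes from the text: 7164 positive and 1962 negative pairs.
pair_labels = [1] * 7164 + [0] * 1962
train_idx, test_idx = stratified_holdout(pair_labels)
print(len(train_idx), len(test_idx))
```

Repeating the call with ten different seeds gives ten random stratified splits; a strict ten-fold partition would instead divide each class into ten disjoint parts.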
The dimension of the feature vector was set to 128, the number of attention heads to 4, the number of transformer layers to 5, and the learning rate to 0.001. With these settings, C-T obtained an AUC of 0.8724. Figure 8 shows the prediction comparison. MTT(ECFP + BPE) represents the classifier using MTT with ECFP1, ECFP2, and BPE tokenization. MTT(BPE) represents the classifier using MTT with only BPE tokenization. PreTrain+MLP represents the classifier that takes the pre-trained embedding vectors directly as input, without Transformer layers for representation learning. The experimental results show that the adoption of multiple tokenizations can improve prediction performance.
Website interface
InflamNat (http://www.inflamnat.com/ or http://18.104.22.168/) combines one database and two machine learning-based predictive tools (Figure 9). Users can search the database using several approaches: 1) providing the NP structure (SMILES, MOL2, SDF), 2) selecting a range of molecular properties, and 3) entering the name or ChEMBL ID of target proteins. The retrievable data include the basic compound information (Name, IUPAC, SMILES, InChiKey, ChEMBL_ID, PubChem_ID, compound class, and origin organism), physicochemical properties (MW, molecular formula, LogP, #HBA, #HBD, and #RotB), cell-based anti-inflammatory bioactivity (inhibiting the production of NO, PGE2, IL-1, and cytotoxicity), and protein targets (IC50 < 50 μM). The NP-target network can be visualized by downloading the complete dataset (including negative NP-target interaction data) via the links on the home page.
Furthermore, users can predict the anti-inflammatory activity of natural products by uploading their structures. The results will be sent via e-mail and presented as the probability of having an IC50 (inhibition of NO production in macrophages) < 50 μM. For InflamNat compounds and targets that are collected in the database but lack existing relationship data, one can predict the relationship of the given compound and target, as well as retrieve all the potential targets for a specific compound.