MolData, A Molecular Benchmark for Disease and Target Based Machine Learning

doi:10.21203/rs.3.rs-968557/v2

Research Article

https://doi.org/10.21203/rs.3.rs-968557/v2

This work is licensed under a CC BY 4.0 License

Version 2

Deep learning’s automatic feature extraction has been a revolutionary addition to computational drug discovery, bringing the ability both to learn abstract features and to discover complex molecular patterns from molecular data. Since biological and chemical knowledge is necessary for overcoming the challenges of data curation, balancing, training, and evaluation, it is important for databases to contain meaningful information regarding the exact target and disease of each bioassay. Existing depositories such as PubChem or ChEMBL offer the screening data of millions of molecules against a variety of cells and targets; however, their bioassays contain complex biological information which can hinder their usage by the machine learning community. In this work, a comprehensive disease and target-based dataset is collected from PubChem in order to facilitate and accelerate molecular machine learning for better drug discovery. MolData is one of the largest efforts to date for democratizing molecular machine learning, with roughly 170 million drug screening results from 1.4 million unique molecules assigned to specific diseases and targets. It also provides 30 unique categories of targets and diseases. Correlation analysis of the MolData bioassays unveils valuable information for drug repurposing for multiple diseases including cancer, metabolic disorders, and infectious diseases. Finally, we provide a benchmark of more than 30 models trained on each category using multitask learning. MolData aims to pave the way for computational drug discovery and accelerate the advancement of molecular artificial intelligence in a practical manner. The MolData benchmark data is available at https://github.com/Transilico/MolData as well as within the supplementary materials.

Computational Chemistry

Artificial Intelligence

Benchmark

Biological Assays

Big Data

Database

Drug Discovery

Machine Learning

PubChem

In the last decade, Artificial Intelligence (AI) has played a major role in modern computer aided drug discovery (CADD). Major improvements in both structure-based and ligand-based virtual screening have been recorded by training smart systems capable of identifying hidden molecular patterns. Learning models for Ligand-Based Drug Discovery (LBDD), or non-structural drug discovery, have been truly revolutionary in multiple aspects of the early drug discovery process. Deep Learning (DL) models have demonstrated the ability to discover abstract features of small molecules, allowing for better screening of both cell-based and target-based CADD. Using conventional methods, scientists would need to screen every molecule in a library on a specific target or cell, which is expensive, labor intensive, and time consuming. Virtual screening algorithms have introduced more affordable and faster alternatives that eliminate most of the early drug discovery costs. However, despite advances in CADD, the accuracy of traditional molecular modeling methods in most cases had not been satisfactory, prior to the introduction of Machine Learning (ML). Automatic feature extraction from molecules, and learning of hidden features in a large molecular library, are just some examples of what AI has changed forever in the drug discovery field [1][2].

One of the most important factors of a reliable model is its training data, and deep learning models utilize this data to automate both pattern extraction and the prediction of bioactive molecules [3][4]. In general, datasets that are large, more diverse, and less biased result in training smarter systems with better inner features, performance, and generalization. Therefore, the first goal for machine learning scientists should be identifying and curating the right dataset per disease state. In addition, understanding the biological knowledge behind a dataset is as important as the data quality. Since data curation, model training, and model evaluation are time consuming and tedious, it is crucial to know the exact applications of the biological target for the disease of interest. Current input datasets need to be improved upon. Firstly, biomedical datasets tend to be very biased and imbalanced based on the biological assay and the chemical library [5]. Secondly, understanding the exact cellular and molecular mechanism of the assays requires expert knowledge that ML scientists or cheminformaticians might not possess. Without knowing the biological background of the data, it would be difficult to devise solutions for data balancing and model evaluation. This knowledge is also necessary for finding appropriate public datasets due to their complicated descriptions and goals. Lastly, the chemical diversity, druggability, and toxicity of the predicted molecules need to be investigated [6][7]. With the emergence of AI in non-structural drug discovery, there has been a renewed need for cleaned and clustered public molecular databases with simple and sufficient biological information, including the proper disease and targets involved in each bioassay.

There are multiple molecular depositories containing millions of molecules and hundreds of thousands of bioassays for specific biomedical aims. PubChem bioassays, ChEMBL datasets, and ChemSpider are among the most comprehensive and well-known examples [8][9][10]. These databases collect large sets of molecular activity outcomes for specific cells or protein targets. Even though these databases are excellent resources for model training, finding and curating the right bioassays and categorizing them by disease, target, and signaling pathway can often be challenging and non-intuitive. Therefore, the scientific community has been benchmarking datasets and methods with these depositories, and in-house databases, in order to facilitate their usage and accelerate the advancement of molecular machine learning. Researchers often curate, analyze, and publish datasets with intended targets for discovering specific patterns in bioactive molecules. One of the first examples is the ‘Merck Molecular Activity Challenge’, which had 15 biological assay tasks. In this dataset, targets are selected based on their cellular pathway relevance [11]. In the toxicity field, the Tox21 dataset from the National Center for Advancing Translational Sciences (NCATS), containing 12 specific assays for nuclear receptor (NR) and stress response (SR) signaling pathways, has been one of the most popular sources for advancing different learning methods such as transfer learning, multitask learning, and few-shot learning [7][12][13]. Additionally, the PCBA dataset [14] from the MoleculeNet and Massively Multitask Learning projects provided more than 120 PubChem bioassays with diverse sets of targets. MoleculeNet consists of curated public datasets, metrics for evaluation, and an open-source Python library called DeepChem [15][16].

Even though these benchmarks have helped cheminformaticians and ML scientists discover candidate drugs and have allowed for better modeling, their bioassays lack essential information such as disease and target relevance. In addition, they generally do not cover a diverse set of diseases with a high number of screening molecules. We believe a reliable and practical ML model can be designed based on a known set of targets for a specific disease, which creates the need for a benchmark dataset that provides a comprehensive set of related assays. MolData is one of the most comprehensive disease and target-based benchmarks for democratizing molecular machine learning. It consists of 600 diverse bioassays from PubChem which are curated and clustered into 16 different diseases and 14 unique protein target classes. More than 1.4 million distinct molecules are present in this benchmark, which consists of more than 170 million molecular screening data points. MolData aims to assist in the discovery of better and more diverse candidate drugs via the meaningful aggregation of large datasets. In doing so, it can be one of the main sources for the ML and Data Science community to develop practical molecular machine learning models. To demonstrate the application of MolData, we have run a correlation analysis to investigate drug repurposing, from which we have discovered three sets of bioassays highly correlated in both active and inactive molecules. Lastly, we trained more than 30 different multitask learning-based models, each for a specific disease or target, and one for all bioassays combined. These models can serve as a baseline for the data science community in order to advance molecular machine learning and enable better drug discovery.

1 – Benchmark Creation Pipeline

The overview of the benchmark creation pipeline is depicted in Figure 1. The process started by downloading the descriptions and summary of each data source from PubChem. Due to the large number of selected bioassays, computational methods were implemented to aid in the creation of the benchmark and to serve as a guideline for the manual tagging of each bioassay. The assay descriptions were first grouped into 10 clusters using BioBERT [17][18], and tagged using a similar disease entity recognition model. Having done so, each description was tagged manually with the assistance of the computational model results. By tagging assays in clusters separately, the similar keywords used for tagging were easier to detect. Manual tagging resulted in sixteen different disease-based categories of data. In addition, we used the ChEMBL repository [10] to identify each task’s target class. After assigning each bioassay to one or more disease categories (using specific keywords) and target categories, the benchmark was analyzed with multiple approaches, such as mapping the molecular domain. For the application of drug repurposing, we ran a correlation analysis on the data and discovered three sets of correlated bioassays. Finally, different multitask graph convolutional neural networks (GCNNs) [19] were trained in order to create a baseline for the performance of multitask learning models in each disease-related category.

2 – Data Aggregation Results

The MolData benchmark originates from 9 open data sources on PubChem, which are the largest in terms of number of screened molecules and number of active bioassays [20], as shown in Table 1. Initially, the collected data contained more than 1,000 bioassays, which were then triaged to 600 bioassays (AIDs) after filtering out datasets with fewer than 100,000 screened molecules or fewer than 15 active molecules. We included the updated Tox21 source [21] with more than 55 different bioassays due to their applicability to drug screening. As seen in Table 1, the activity percentage of each screening task was usually less than 1%, showing the imbalanced nature of the screening datasets. The bioassay IDs of the datasets used, as well as their related sources, are available in supplementary file 1.

Table 1- Data source summary.

PubChem Source	Aid count	Active data points	Total data points	Activity percentage (%)	Unique active molecules	Total unique molecules	Unique molecule activity percentage (%)
Broad Institute	67	125627	22.2m	0.56%	85579	472858	18.1%
Burnham Center for Chemical Genomics	67	139021	21.9m	0.63%	77159	381794	20.21%
Emory University Molecular Libraries Screening Center	12	24195	2.47m	0.98%	20964	348231	6.02%
ICCB-Longwood Screening Facility, Harvard Medical School	11	8358	2.1m	0.39%	6656	564021	1.18%
Johns Hopkins Ion Channel Center	22	48545	6.8m	0.71%	35487	344497	10.30%
NMMLSC	42	48186	11.5m	0.42%	37949	369431	10.27%
National Center for Advancing Translational Sciences (NCATS)	174	720319	53.4m	1.35%	240096	592616	40.51%
The Scripps Research Institute Molecular Screening Center	148	275224	47.6m	0.58%	142055	920418	15.43%
Tox21	57	21475	0.47m	5.67%	4183	8743	47.84%

3- Data Description Domain

To better understand the diversity within the 600 gathered bioassays, the description of each assay was fed to a BioBERT model [17]. This model, which is trained on a large corpus of biomedical text, can create meaningful representations from the description of each assay and map the domain which these descriptions cover. Figure 2 depicts this map after clustering, showing how descriptions from different sources can have similar context to each other (e.g., bioassays from Johns Hopkins Ion Channel Center and The Scripps Research Institute Molecular Screening Center in cluster 2) or be distinct from the rest (e.g., bioassays from Tox21 in cluster 9).

The same model trained on disease entity recognition was also used to identify disease related key words in each description [22]. While each cluster had some degree of similarity in terms of the diseases covered within each domain, it was far from perfect in correctly dividing the data domain based on their disease categories. Therefore, manual tagging was performed using the clusters and the disease entities as guidance. This process included highlighting disease related words within each bioassay’s description and using them as tags to represent each bioassay. The dataset descriptions as well as their highlighted words are available in supplementary file 1.

4- MolData

4.1. Data Summary

After collecting all the specific disease identifiers or key words, we clustered them into 16 different categories. These categories were selected after carefully investigating all disease related words and their counts. The categories are: 1) Cancer, 2) Aging, 3) Bacterial, 4) Viral, 5) Fungal, 6) Parasitic, 7) Cardiovascular, 8) Immunological, 9) Nervous System, 10) Diabetes, 11) Epigenetic and Genetics, 12) Pulmonary, 13) Obesity, 14) Metabolic Disorder, 15) General Infection, and 16) Toxicity. The count of assays for each disease category is shown in Table 2. Overall, MolData consists of 600 bioassays with 1.4 million unique molecules, with nearly half of the molecules possessing activity in at least one bioassay. Moreover, MolData contains 224 tasks belonging to 2 or more disease categories. The MolData benchmark data is available at https://github.com/Transilico/MolData. All molecules, binary labels and splits are available in one file (supplementary file 2), with two mapping files containing the mapping of each bioassay to each disease category (supplementary file 3) and to each target category (supplementary file 4).

Table 2 - Disease-based information for the MolData Benchmark

Tag	Aid Count	Active Data Points	Total Data Points	Activity Percentage (%)	Unique Active Molecules	Total Unique Molecules	Unique Molecule Activity Percentage (%)
All Diseases	600	1410950	168345532	0.84	672935	1429989	47.06
Cancer	236	575454	68649771	0.84	230049	1323311	17.38
Nervous System	174	378812	54753975	0.69	170353	651249	26.16
Immune system	129	322362	38418661	0.84	157333	579658	27.14
Cardiovascular	94	212162	28660627	0.74	124270	542902	22.89
Toxicity	54	48653	2452656	1.98	30936	487219	6.35
Obesity	53	90837	14516199	0.63	65993	545513	12.1
Virus	47	113946	14679312	0.78	81702	621945	13.14
Diabetes	43	61408	11645151	0.53	47830	543600	8.8
Metabolic Disorders	42	126772	9985491	1.27	70665	527382	13.4
Bacteria	40	132593	12314737	1.08	89554	1290782	6.94
Parasite	24	98950	7302206	1.36	75027	500228	15
Epigenetics, Genetics	23	92837	6815597	1.36	65244	439537	14.84
Pulmonary	19	45940	6122297	0.75	36467	524167	6.96
Infection	11	93444	3312920	2.82	63782	521473	12.23
Aging	10	9030	3079580	0.29	8527	511471	1.67
Fungal	7	9253	2147751	0.43	8824	444373	1.99

The composition of each data category is depicted in Figure 3, showing how combining data from each data source resulted in the creation of each category. This combination demonstrates one of the main motivations for this work’s data aggregation, as each disease category has related bioassays from multiple data sources. Furthermore, when large screening data is examined, some categories such as Aging and Pulmonary are underexplored compared to those like Cancer and Nervous System. These categories were selected based on their importance and the number of occurrences.

The protein targets of MolData in Figure 4 were classified by either 1) direct mapping to the ChEMBL database, 2) finding a highly similar target in ChEMBL, or 3) manual curation (see Methods). Of the 419 total unique targets in MolData, 296 were classified into 14 classes (Figure 4). Enzymes (167/296) (Enzyme (other) + Hydrolase + Protease + Kinase + Transferase + Oxidoreductase + NTPase + Phosphatase) are the most prevalent class, followed by membrane receptors (44/296) and nuclear receptors (25/296). The occupancy of target classes is also reflected in the total assays for each class. For example, enzymes constitute the most prevalent class among the targeted assays (182/383), followed by membrane receptors (85/383) and nuclear receptors (53/383). The assays are overall enriched in the “privileged” targets, that is, membrane receptors, kinases, nuclear receptors, and ion channels. These four classes have historically been the most prevalent among approved drug targets [23], accounting for 70% of the total approved drugs. In our dataset, however, 199 assays (52% of the total) represent targets from classes other than membrane receptors, kinases, nuclear receptors, and ion channels. When counting the total unique targets, these historically “unprivileged” targets have an even higher representation in the dataset, with 190 targets (64% of the total). Therefore, MolData captures a higher diversity in the target classes compared to those of the approved drugs.

There are, additionally, classes that are overrepresented in our dataset compared to the set of targets with available approved drugs. For example, NTPases are targeted by 76334 unique compounds (29% of the total compounds from targeted assays), while only 2% of drugs target NTPases. Additionally, epigenetic regulators are the target of 51776 unique compounds (20% of the total compounds from targeted assays), while only 0.3% of drugs interact with this class of proteins [23]. This higher hit rate in the targets of MolData compared to the approved drugs could imply the inherent low druggability of such target classes or the lower significance of the targets for pharmaceutical industries.

Table 3 - Target-based information for the MolData Benchmark

Target	Aid count	Unique target count	Active data points	Total data points	Activity percentage (%)	Unique active molecules	Total unique molecules	Unique molecule activity percentage (%)
All Targets	383	296	862370	103440515	0.83	261715	675161	38.76
Membrane receptor	85	44	146956	25922533	0.56	91489	458818	19.94
Enzyme (other)	54	51	83657	16210090	0.51	57808	632142	9.14
Nuclear receptor	53	25	74776	6083509	1.22	42838	442487	9.68
Hydrolase	36	32	113185	10830324	1.05	66195	526391	12.57
Protease	29	26	37943	7965313	0.47	30619	606793	5.05
Transcription factor	27	18	53416	4775685	1.11	40067	503249	7.96
Kinase	24	23	38257	7369690	0.52	31327	377519	8.29
Epigenetic regulator	23	20	76793	6840095	1.12	51776	523904	9.88
Ion channel	22	14	37402	6745762	0.55	28853	511873	5.63
Transferase	18	17	43955	6279651	0.7	30432	519646	5.85
Oxidoreductase	10	8	33956	2953760	1.15	30054	432578	6.94
Transporter	9	8	15390	2538579	0.60	15046	369621	4.07
NTPase	6	5	114465	1981575	5.78	76334	439967	17.34
Phosphatase	5	5	8090	1693773	0.48	6913	368329	1.87
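
The class-level counts quoted in the text above (167 enzyme targets, 182 enzyme assays, 199 assays and 190 targets outside the four "privileged" classes) can be re-derived from the Aid count and unique target count columns of Table 3. The short sketch below does only that arithmetic; the variable and function names are illustrative and are not part of the MolData release.

```python
# (Aid count, unique target count) per class, copied from Table 3.
TABLE3 = {
    "Membrane receptor": (85, 44),
    "Enzyme (other)": (54, 51),
    "Nuclear receptor": (53, 25),
    "Hydrolase": (36, 32),
    "Protease": (29, 26),
    "Transcription factor": (27, 18),
    "Kinase": (24, 23),
    "Epigenetic regulator": (23, 20),
    "Ion channel": (22, 14),
    "Transferase": (18, 17),
    "Oxidoreductase": (10, 8),
    "Transporter": (9, 8),
    "NTPase": (6, 5),
    "Phosphatase": (5, 5),
}
TOTAL_ASSAYS, TOTAL_TARGETS = 383, 296  # "All Targets" row of Table 3

# Class groupings as named in the text.
ENZYMES = ["Enzyme (other)", "Hydrolase", "Protease", "Kinase",
           "Transferase", "Oxidoreductase", "NTPase", "Phosphatase"]
PRIVILEGED = ["Membrane receptor", "Kinase", "Nuclear receptor", "Ion channel"]


def class_totals(names):
    """Sum assay counts and unique target counts over the given classes."""
    assays = sum(TABLE3[name][0] for name in names)
    targets = sum(TABLE3[name][1] for name in names)
    return assays, targets


enzyme_assays, enzyme_targets = class_totals(ENZYMES)
privileged_assays, privileged_targets = class_totals(PRIVILEGED)

# "Unprivileged" counts are obtained by subtraction from the totals.
other_assays = TOTAL_ASSAYS - privileged_assays
other_targets = TOTAL_TARGETS - privileged_targets

print(enzyme_targets, enzyme_assays)
print(other_assays, round(100 * other_assays / TOTAL_ASSAYS))
print(other_targets, round(100 * other_targets / TOTAL_TARGETS))
```

Note that the per-class Aid counts in Table 3 sum to more than 383, which suggests that some assays are counted in more than one class; the subtraction above simply mirrors the arithmetic used in the text.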

4.2. Molecular Domain

To investigate the diversity of the screened molecules, all collected molecules are represented as vectors using ECFP4 [24]. The results after applying Principal Component Analysis (PCA) are shown in Figure 5. The color in this figure represents the density at each point, with denser areas becoming darker. The resulting map shows that while the selected molecules can occupy a large area within the fingerprint domain, a large percentage of them reside within the dark area, denoting a large degree of similarity within most of the screened molecules.

4.3. Correlation Analysis, a Showcase for Drug Repurposing

Drug repurposing is the process of finding new applications for already approved molecular drugs. These new applications can be target or disease based depending on the specific case of study. For example, during an outbreak, drug repurposing could be the fastest and most efficient option due to a lack of information about the new virus or bacterium, while novel drug discovery and the drug's subsequent approval may take many years [25]. Azithromycin, a macrocyclic antibacterial, has been shown to be effective against Ebola virus with an EC₅₀ of 5.1 µM [26]. It has also shown promising results as a potential antimalarial (Plasmodium falciparum) when prescribed alone or in combination therapy [27][28][29]. For this benchmark, we hypothesized that correlating bioassays screened on different sets of targets would provide interesting information for better and faster drug repurposing. Therefore, the correlation score between the molecule bioactivity labels of each pair of bioassays was calculated using the Pearson correlation coefficient.
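
As a rough illustration of this step, the sketch below computes a Pearson correlation coefficient between the binary activity labels of two bioassays. It assumes that only molecules screened in both assays are compared, which the text does not specify; the function names and the toy molecule identifiers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns None when the coefficient is undefined (fewer than two
    points, or one of the sequences is constant).
    """
    n = len(x)
    if n < 2:
        return None
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def assay_correlation(labels_a, labels_b):
    """Correlate two assays given {molecule_id: 0 or 1} label dictionaries.

    Assumption: molecules missing from either assay are ignored.
    """
    shared = sorted(set(labels_a) & set(labels_b))
    return pearson([labels_a[m] for m in shared],
                   [labels_b[m] for m in shared])


# Toy labels (hypothetical molecules, not taken from MolData).
assay_1 = {"m1": 1, "m2": 0, "m3": 0, "m4": 1, "m5": 0}
assay_2 = {"m1": 1, "m2": 0, "m3": 0, "m4": 1, "m6": 1}
assay_3 = {"m1": 0, "m2": 1, "m3": 1, "m4": 0}

r_same = assay_correlation(assay_1, assay_2)      # identical on shared molecules
r_opposite = assay_correlation(assay_1, assay_3)  # inverted on shared molecules
print(r_same, r_opposite)
```

For binary labels this coefficient is equivalent to the phi coefficient, so with activity rates below 1% a high value requires the two assays to agree on their few active molecules.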

Among all categories, toxicity showed the highest correlation between tasks, which is understandable due to the nature of toxicity and the close biological relationship between the assays. In Figure 6, correlation heatmaps are shown for Toxicity assays and for all non-toxicity assays with a correlation of 0.5 or more which have different targets. The second group indicates that higher correlation can exist between the labels of bioassays from the same, or different, sources. Two sets of correlating targets and a viral similarity were discovered through this analysis. The first set of targets with a high correlation was 1) NPC1, 2) SMN1, 3) ATAD5, 4) Rab9, and 5) STAT1. NPC1 and Rab9, with a 98% correlation, are important players in cholesterol metabolism and Niemann-Pick Disease Type C (supplementary file 5) [30][31]. AIDs 485297 and 485313 were designed to discover activators of the mentioned proteins using luciferase reporter assays. Their high correlation to assays targeting STAT1 or ATAD5, which are important in cancer and immune disorders [32][33][34], is a valuable finding from the analysis of the MolData benchmark for drug repurposing. Another interesting discovery was infectious disease based, as molecules targeting the Lassa virus and Marburg virus showed a high correlation. The Lassa virus is a single-stranded RNA virus with a circular morphology from the family Arenaviridae, and is the cause of Lassa hemorrhagic fever [35][36]. On the other hand, the Marburg virus belongs to the family Filoviridae, has a shepherd's crook morphology, and causes symptoms similar to those of the Ebola virus, with a fatality rate of ~50% [37][38]. Both bioassays used the viruses' envelope glycoproteins on a pseudotype virus system. We were curious to see whether any candidate drug has shown promising potency against both viruses. Favipiravir is a pyrazine carboxamide derivative that has shown effectiveness against both the Lassa and Marburg viruses [39][40]. These data suggest that MolData would be a valuable source for further drug repurposing investigations.

4.4. Benchmark Classification Modeling, a Showcase for Bioactivity Prediction

The data from each disease and target category, as well as the aggregation of all bioassays, are used as training inputs for GCNNs. The classification results are shown in Table 4 and Table 5 as the baseline for each category. These results are from the imbalanced (untransformed) test set, weighted to ignore missing data points for each task (weight of 0), then averaged across all tasks within each category. The detailed results for each model and bioassay are presented in supplementary file 6. These results show the baseline performance for multitask models, with ROC AUC serving as the most important comparison metric due to the imbalanced nature of the data.
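
The evaluation scheme described above (missing data points given a weight of 0, then metrics averaged across the tasks of a category) can be sketched in plain Python as follows. This is an illustrative re-implementation under stated assumptions, not the code used to produce Tables 4 and 5: missing labels are represented as `None`, a task with only one class present is skipped, and ROC AUC is computed by direct pairwise comparison, which is simple but quadratic in the number of molecules.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random active outranks a random
    inactive (ties count as one half). Returns None if a class is absent."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def category_mean_auc(label_matrix, score_matrix):
    """Average per-task ROC AUC over a category.

    Rows are molecules, columns are tasks. A label of None means the
    molecule was not screened in that task; dropping it is equivalent
    to giving it a weight of 0.
    """
    n_tasks = len(label_matrix[0])
    aucs = []
    for t in range(n_tasks):
        kept = [(labels[t], scores[t])
                for labels, scores in zip(label_matrix, score_matrix)
                if labels[t] is not None]
        auc = roc_auc([l for l, _ in kept], [s for _, s in kept])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs) if aucs else None


# Toy example: 4 molecules, 2 tasks; the first molecule is missing in task 2.
labels = [[1, None], [0, 1], [0, 0], [1, 0]]
scores = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.35, 0.6]]
mean_auc = category_mean_auc(labels, scores)
print(mean_auc)
```

In this toy example the first task has an AUC of 0.75 and the second an AUC of 1.0, so the category average is 0.875.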

Table 4 - Classification results on the test set of disease categories, averaged on all tasks within each category

Disease Benchmark	Accuracy score	Recall score	Precision score	ROC AUC score
All Tasks	72.62 %	67.17 %	4.45 %	0.7756
Cancer	73.31 %	63.91 %	3.62 %	0.7648
Nervous System	72.22 %	61.79 %	2.43 %	0.7389
Immune System	71.35 %	62.87 %	2.82 %	0.7532
Cardiovascular	67.88 %	63.69 %	2.22 %	0.7307
Toxicity	59.87 %	74.72 %	14.75 %	0.7324
Obesity	72.02 %	61.74 %	3.71 %	0.7406
Virus	73.99 %	59.89 %	2.57 %	0.7447
Diabetes	69.87 %	64.51 %	3.82 %	0.7412
Metabolic Disorders	70.95 %	59.81 %	5.28 %	0.7200
Bacteria	72.87 %	67.02 %	3.26 %	0.7764
Parasite	73.15 %	72.11 %	4.66 %	0.8046
Epigenetics-Genetics	74.30 %	56.84 %	3.75 %	0.6974
Pulmonary	58.38 %	68.09 %	2.14 %	0.6951
Infection	70.56 %	68.76 %	6.51 %	0.7679
Aging	80.10 %	55.94 %	1.38 %	0.7625
Fungal	79.69 %	51.95 %	1.90 %	0.7484

Table 5 - Classification results on the test set of target categories, averaged on all tasks within each category

Target Benchmark	Accuracy Score	Recall Score	Precision Score	ROC AUC Score
All Tasks w/ Targets	71.9 %	67.68 %	4.44 %	0.7714
Membrane receptor	66.91 %	62.36 %	1.69 %	0.7051
Enzyme (other)	66.72 %	74.39 %	1.92 %	0.7871
Nuclear receptor	62.63 %	73.6 %	11.25 %	0.7483
Hydrolase	71.7 %	67.23 %	2.85 %	0.7774
Protease	72.41 %	67.33 %	2.47 %	0.7606
Transcription factor	71.59 %	66.59 %	10.07 %	0.7565
Kinase	65.41 %	56.9 %	1.39 %	0.6664
Epigenetic regulator	70.35 %	68.11 %	3.91 %	0.7865
Ion channel	67.58 %	59.85 %	1.76 %	0.7104
Transferase	79.51 %	66.39 %	3.36 %	0.8079
Oxidoreductase	78.28 %	65.11 %	4.49 %	0.7868
Transporter	67.66 %	48.54 %	1.67 %	0.6525
NTPase	82.09 %	42.3 %	19.03 %	0.7703
Phosphatase	73.24 %	68.72 %	1.91 %	0.796

There are 383 tasks within the overall dataset that have both a disease-related tag as well as a target-related tag. These tasks are used for training multiple models including models trained on each disease category, each target category, and aggregation of all tasks with or without targets. Due to the repetition in training, different models’ performance on these shared tasks can be compared to assess which multitask learning model was able to perform the best on each task. The results from this comparison are shown in Figure 7.

As shown in Figure 7, combining all tasks results in a higher average ROC AUC with the model trained on all 600 tasks being the best performer for the majority of tasks. However, there are 159 tasks which had their best performing model trained on fewer tasks, such as the models trained on specific disease or target categories, or the model trained on all tasks with target tags. This demonstrates that multitask learning on fewer tasks may be beneficial in some scenarios.
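
The per-task comparison behind Figure 7 amounts to picking, for each shared task, the model with the highest test ROC AUC. A minimal sketch is given below; the model names, task identifiers, and scores are made-up placeholders, and ties are resolved in favor of the model listed first, which is an arbitrary choice.

```python
def best_model_per_task(auc_by_model):
    """Map each task to the model with the highest ROC AUC.

    auc_by_model: {model_name: {task_id: roc_auc}}. A model only needs
    entries for the tasks it was trained on.
    """
    best = {}
    for model, task_aucs in auc_by_model.items():
        for task, auc in task_aucs.items():
            # Strict '>' keeps the earlier model on ties.
            if task not in best or auc > best[task][1]:
                best[task] = (model, auc)
    return {task: model for task, (model, _) in best.items()}


# Placeholder scores (not results from the paper).
aucs = {
    "all_tasks": {"task_A": 0.80, "task_B": 0.70, "task_C": 0.75},
    "disease_category": {"task_A": 0.78, "task_B": 0.74},
    "target_category": {"task_B": 0.72, "task_C": 0.75},
}
winners = best_model_per_task(aucs)
n_smaller_model_wins = sum(1 for m in winners.values() if m != "all_tasks")
print(winners, n_smaller_model_wins)
```

Counting the tasks whose winner is not the all-task model gives the kind of figure reported in the text (159 tasks in the paper's comparison).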

One of the main topics worth discussing is bias within the dataset. MolData consists of roughly 170 million data points. However, this screening was performed on 1.4 million molecules, meaning that each molecule appears on average in roughly 118 assays (168,345,532 data points over 1,429,989 unique molecules, Table 2). Since the data sources are different, this level of repetitiveness shows a large overlap of molecules within the original data sources. Furthermore, as seen from the results of the molecular domain mapping, many of the molecules lie within a small section of the fingerprint domain, emphasizing their similarity. Therefore, a degree of bias exists within the gathered dataset, with similar molecules being screened for each assay and in all data sources. We speculate this bias is due to the traditional rules used for selecting molecules as candidates for screening. One effective way to increase the diversity of chemicals would be switching from screening synthetic libraries to natural product-based ones. Naturally derived compounds have shown a higher hit rate with the potential of targeting unknown and complex biotargets [41].

Another important topic to consider is the benchmark modeling result. The model architecture was selected to be shared within all models; however, this is suboptimal, and hyper-parameter optimization can be performed to find better possible architectures for each data category. This can apply to other hyper-parameters such as learning rate and batch size, which can be improved via a grid-search hyper-parameter optimization. Lastly, the low precision of the models is a focus of improvement since precision plays an important role in selecting molecules for future screening at inference time, directly affecting the cost and time of screening.

MolData is one of the largest efforts in the collection, curation, and categorization of labeled molecular datasets. It consists of roughly 170 million screens of 1.4 million unique molecules distributed in 600 different bioassays and 16 disease categories, from cancer to infectious diseases. It also consists of a state-of-the-art target benchmark with 14 categories. We explored all the disease and target-related details in each bioassay for the development of a comprehensive benchmark to assist data scientists and the ML community in improving model development and computational drug discovery. We believe a key feature of any learning system is the training data, and the validation of a model is only possible with appropriate molecular and biological knowledge of the dataset. MolData takes advantage of a greater amount of labeled data compared to other benchmark datasets, which is an important addition to CADD. It is beneficial for the data science community to have a similar dataset for comparison of model performances; therefore, baseline performance is presented for 32 different categories. MolData hopes to take a step in furthering the molecular machine learning revolution, by providing the means for drug discovery and model development.

A - Data Aggregation

The dataset was collected from PubChem bioassays due to its comprehensiveness and the high diversity of diseases and targets. We started by selecting the PubChem sources with the highest live bioassay counts, High Throughput Screening (HTS) capabilities, and no requirement for licensing. Hence, we selected nine sources as follows:

1) National Center for Advancing Translational Sciences (NCATS) is one of the most comprehensive centers for drug screening, with a goal of therapeutic development through collaborative research [42].

2) Broad Institute of Harvard and MIT focuses on assay development and scientific collaboration for the advancement of Discovery Science and Translational Pharmacology. It has the capability of screening hundreds to thousands of compound plates a day [43].

3) Sanford-Burnham Center for Chemical Genomics is a well-established screening center working on multiple projects, including the NIH Molecular Libraries Program (MLP), with applications to multiple diseases [44].

4) NMMLSC is a screening center capable of using high throughput flow cytometry to discover molecules as chemical probes for drug discovery [45].

5) Emory University Molecular Libraries Screening Center focuses on Biological Discovery through Chemical Innovation, with interests ranging from molecular pathogenesis to global pandemics [46].

6) Tox21 contains thousands of medicinal or environmental substances and is a collaboration between NCATS and the National Toxicology Program. Tox21 is an ongoing project with yearly updates [47].

7) The Scripps Research Institute Molecular Screening Center is an automated center with projects on a variety of diseases such as Alzheimer's disease and cancer. It also has capabilities in assay development, compound synthesis, cheminformatics, mechanism of action discovery, etc. [48].

8) Johns Hopkins Ion Channel Center focuses on membrane proteins and transporters which are permeable to ions [49]. Due to the importance of this class of targets, we decided to include it as one of the main sources.

9) ICCB-Longwood Screening Facility, Harvard Medical School, performs mostly HTS assays, with over 500,000 molecules available for screening [50].

These sources were also selected for the credibility of their HTS data. Because the final goal of this article is to provide the machine learning community with a large, clustered dataset, we decided to include bioassays with 100,000 or more screened molecules, as well as bioassays with more than 15 unique active molecules. This threshold was not applied to the Tox21 assays, which have fewer screened molecules but were selected because of the importance of toxicity prediction to drug discovery. Table 1 shows the bioassay count for each source, as well as the number of active and inactive molecules.
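The size filter described above can be sketched as follows. This is a minimal illustration rather than the script used to build the dataset: the `Assay` record, its field names, and the `keep_assay` helper are hypothetical, and the sketch assumes that both thresholds must hold together and that Tox21 assays are exempt only from the screened-molecule threshold.

```python
from dataclasses import dataclass


@dataclass
class Assay:
    """Hypothetical summary record for one PubChem bioassay."""
    aid: int          # PubChem assay identifier
    source: str       # depositing screening center
    n_screened: int   # number of molecules screened
    n_active: int     # number of unique active molecules


def keep_assay(assay: Assay,
               min_screened: int = 100_000,
               min_active_exclusive: int = 15,
               size_exempt_sources: frozenset = frozenset({"Tox21"})) -> bool:
    """Return True if the assay passes the size thresholds.

    Assumption: both thresholds are required jointly, and exempt sources
    skip only the screened-molecule threshold.
    """
    enough_actives = assay.n_active > min_active_exclusive
    if assay.source in size_exempt_sources:
        return enough_actives
    return assay.n_screened >= min_screened and enough_actives
```

The AIDs and counts passed to `keep_assay` in any usage would come from the PubChem assay summaries; none of the values here are taken from the dataset.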

B – Mapping the Data Domain with Natural Language Processing

After the assays are gathered and filtered by a size threshold, the process of understanding the context of the assays begins. Each assay contains information including the title of the assay, a general description, and optionally the biological target of the screening. To understand the diversity of the assays and map the domains which they cover, the description of each assay is analyzed using natural language processing tools, as elaborated upon in the following subsections.

B.1. Description Pre-Processing

The description of each bioassay was acquired from the PubChem website. Each description can contain a complete molecular and biological background, the goal of the assay, and finally a brief description of the biological assay. However, a description may also contain unusable information such as the affiliated center, references, scientists involved in the screening, and grant information. Using Python string parsing capabilities, manual rules were written for each of the eight data sources to filter out the lines containing the unusable information, resulting in cleaned descriptions explaining the assays’ goal. These rules can include deleting lines pertaining to Principal Investigators, grant numbers, Screening Center Affiliation, Network, Assay Provider, Grant Proposal Number, etc., so that only the assay description remains in the text.
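A rule of this kind can be sketched as a line filter. The pattern below is only an illustration built from the field names listed above; the actual per-source rules are not reproduced here, and the requirement that a field name be followed by a colon is an assumption made to avoid deleting ordinary sentences.

```python
import re

# Hypothetical rule: drop lines that start with one of the metadata field
# names mentioned in the text, followed by a colon.
UNUSABLE_LINE = re.compile(
    r"^\s*(principal investigators?|grant(\s+proposal)?\s+numbers?|"
    r"screening center affiliation|network|assay provider)\s*:",
    re.IGNORECASE,
)


def clean_description(raw: str) -> str:
    """Keep only the lines that do not look like metadata fields."""
    kept = [
        line.strip()
        for line in raw.splitlines()
        if line.strip() and not UNUSABLE_LINE.match(line)
    ]
    return " ".join(kept)
```

In practice each source would likely need its own pattern list, since the layout of the descriptions differs between screening centers.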

B.2. Feature Extraction using BioBERT

The cleaned descriptions were then lower-cased and fed to a BioBERT model for feature extraction. BioBERT is a bidirectional transformer model constructed of multi-head attention modules. The model is pre-trained for language modeling on a large body of biomedical literature by predicting masked tokens from raw unlabeled text. Thanks to this pre-training, the model can generate meaningful representations of biomedical text and encode the input as a discernible feature vector. Leveraging this capability, each description is transformed into a numerical vector of size 2048 representing the content of the assay’s description. One disadvantage of this technique is the limited input size of BioBERT (512 tokens), which resulted in concatenation of some of the descriptions.

B.3. Clustering

Having acquired the feature vectors of the assay descriptions, we cluster them using K-Means clustering. Since the goal of this clustering is to explore the domain that the descriptions cover, the number of clusters is unknown. To find the optimum number of clusters, the sum of squared distances of the data points to their closest cluster center (SSE) is calculated and plotted against the number of clusters. The optimum number of clusters is then found by detecting the knee point of the plot.
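The knee detection step can be illustrated with a short sketch. The text does not state which knee detector was used, so the rule below is an assumption: both axes are normalized to [0, 1] and the knee is taken as the point lying furthest below the straight line joining the first and last points of the SSE curve. The function name `knee_point` is hypothetical, and the SSE values are assumed to have been computed beforehand for each candidate number of clusters.

```python
def knee_point(ks, sse):
    """Return the k at the knee of a decreasing SSE curve.

    ks  : candidate numbers of clusters, in increasing order
    sse : sum of squared distances for each k (same length as ks)
    """
    if len(ks) != len(sse) or len(ks) < 3:
        raise ValueError("need at least three (k, SSE) pairs")
    x0, x1 = ks[0], ks[-1]
    y0, y1 = sse[0], sse[-1]
    if y0 == y1:
        return ks[0]  # flat curve: no knee to detect
    best_k, best_gap = ks[0], float("-inf")
    for k, s in zip(ks, sse):
        x = (k - x0) / (x1 - x0)   # 0 at the first k, 1 at the last
        y = (s - y1) / (y0 - y1)   # 1 at the first SSE, 0 at the last
        gap = (1.0 - x) - y        # how far the curve sits below the chord
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k
```

For a curve that drops sharply and then flattens, the returned value is the k at which adding further clusters stops reducing the SSE substantially.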

C – Tagging the Assays

After distinct clusters are formed from assay descriptions and the domains covered by the datasets are better defined, different assays can be grouped together to form a benchmark. The main form of distinction between the assays chosen in this work is disease category relations. As previously mentioned, it is important for a dataset to provide each bioassay with simple disease and target categories for better computational drug discovery. To find the related disease categories for each assay, the process of tagging is used, during which certain words in the description are chosen as tags to represent the assay. This process was implemented both using AI assistance and manual annotation.

C.1. BioBERT Disease Category Entity Recognition

The first approach implemented in this work to extract the disease related words from the description text of an assay uses a BioBERT model trained for disease category entity recognition. This model takes a text sequence as input and returns the entity class related to each token, with the classes consisting of disease and non-disease category entities. Using this model, all related disease keywords are extracted from each assay, automating the process of tagging. However, one major disadvantage of this technique is that many words within a description are disease category related but not defining for that assay. As an example, a description might state that an older drug for a specific virus is a carcinogen, which would falsely add a disease tag related to “cancer” to the assay. Such an assay has nothing to do with cancer and is solely an effort in antiviral drug discovery.

C.2. Manual Tagging

Since many descriptions contain biomedical terms that are not defining for that specific task, understanding the exact biological assay and the diseases related to the screening is crucial for tagging. A bioassay description contains a large amount of information regarding the target, related disease categories, other proteins/RNAs/DNA downstream or upstream, and in some cases the experimental details of the bioassay. As an example, the task description below is taken from BioBERT cluster zero, for AID 1259313 from the Burnham Center for Chemical Genomics, entitled “uHTS identification of small molecule modulators of NR3A”. We first read the description to understand the assay as a whole, review the tags found by the computational method, and then highlight any words that could direct us to a specific disease category. Here, Central Nervous System (CNS), Down Syndrome, and Neurological Disorders are the main terms that direct us to the subcategories of ‘Nervous System’ and ‘Epigenetics-Genetics’.

Activity of N-methyl-D-aspartate subtype of glutamate receptor (NMDAR) is essential for normal central nervous system (CNS) function. However, excessive activation of NMDAR mediates, at least in part, neuronal or synaptic damage in many neurological disorders, including hypoxic-ischemic brain injury and in Down syndrome. The dual role of NMDARs in normal and abnormal CNS function imposes important constraints on possible therapeutic strategies aimed at ameliorating or abating developmental disorders and neurological disease: blockade of excessive NMDAR activity must be achieved without interference with its normal function. We propose an approach for NMDAR modulation via modulation of the NR3A subunit, a representative of a novel family of NMDAR subunits with the goal to modulate the NMDAR activity. NR3 subunits have a unique structure in their M3 domain forming part of the channel region that contributes to decreased magnesium sensitivity and calcium permeability of NMDARs. It potently and specifically binds glycine and D-serine, but not glutamate. In addition, we have shown that glycine binding to the ligand-binding domain (LBD) of NR3A is essential for NR1/NR3 receptor activation, as opposed to internalization caused by ligand binding to NR1 LBD.

D – Benchmark Creation

After the disease related words are highlighted and extracted, each assay can be represented by its tags. The next step of the process is to use these tags for grouping related assays together, and to create the benchmark. To do so, major disease categories were first identified which could encompass all tags; and second, each tag was assigned to one or more related major disease categories. The relation between each tag and the major disease category can be found in supplementary file 7.

The classification of the protein targets in our dataset was obtained by downloading and searching against the ChEMBL 29 [10] database (https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_29.fa.gz). If the sequence of a given target, retrieved from UniProtKB using its UniProt ID, was identical to that of a ChEMBL target, the target classification was copied from ChEMBL. To classify the targets missing from ChEMBL, an all-by-all pairwise alignment was performed between the MolData targets and the ChEMBL 29 dataset using phmmer 3.3 [51]. If the top-scoring phmmer hit from ChEMBL aligned to the query sequence with a bit score of at least 100 and shared more than 80% similarity in sequence length, the classification was copied from the ChEMBL hit. Targets that neither mapped to ChEMBL nor aligned confidently to ChEMBL under these criteria were annotated manually. The dataset originally contained 17 classes, but the list was reduced to 13 classes by removing those with an assay occupancy of fewer than 5.
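The decision rule for copying a classification from ChEMBL can be summarized in a few lines. This sketch covers only the rule, not the phmmer search itself, and all function names are hypothetical. It also assumes that "similarity in sequence length" means the ratio of the shorter to the longer sequence length, which the text does not define explicitly.

```python
from typing import Optional


def length_similarity(query_len: int, hit_len: int) -> float:
    """Ratio of the shorter to the longer sequence length (assumed definition)."""
    return min(query_len, hit_len) / max(query_len, hit_len)


def transfer_classification(query_seq: str,
                            hit_seq: str,
                            hit_bit_score: float,
                            hit_class: str,
                            min_bit_score: float = 100.0,
                            min_length_similarity: float = 0.8) -> Optional[str]:
    """Return the ChEMBL class to copy, or None if manual annotation is needed."""
    # Identical sequences: copy the classification directly.
    if query_seq == hit_seq:
        return hit_class
    # Otherwise require a confident top-scoring alignment of comparable length.
    similar_length = (
        length_similarity(len(query_seq), len(hit_seq)) > min_length_similarity
    )
    if hit_bit_score >= min_bit_score and similar_length:
        return hit_class
    return None
```

Targets for which the function returns `None` correspond to those that were annotated manually.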

D.1. Molecular Data Pre-Processing

Having populated the disease categories, the molecular data for each assay was downloaded in the form of Simplified Molecular-Input Line-Entry System (SMILES) and their related bioactivity. The SMILES input for each molecule was canonicalized with isomeric information included using RDKit version 2020.09.1. Duplicate or missing SMILES entries were then deleted using Python 3.6 and Pandas library version 1.1.5. Regarding the bioactivity of the molecules, the existing labels in all assays are “Active”, “Inactive”, “Inconclusive”, and “Unspecified”. For the sake of consistency, molecules with inconclusive and unspecified labels were removed, and active and inactive molecules were respectively labeled as 1 and 0.
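The label clean-up can be illustrated with a small sketch. It leaves out the RDKit canonicalization step and assumes the SMILES strings have already been canonicalized; the `preprocess_records` name and the (SMILES, label) tuple layout are hypothetical, and keeping the first occurrence of a duplicated SMILES is an assumption, since the text does not say which duplicate is retained.

```python
LABEL_TO_INT = {"Active": 1, "Inactive": 0}


def preprocess_records(records):
    """Clean (smiles, label) pairs for one assay.

    Steps, in the order described in the text:
      1. drop entries with missing SMILES and duplicated SMILES (keep first),
      2. drop "Inconclusive" and "Unspecified" labels,
      3. map "Active" to 1 and "Inactive" to 0.
    """
    seen = set()
    cleaned = []
    for smiles, label in records:
        if not smiles or smiles in seen:
            continue
        seen.add(smiles)
        if label not in LABEL_TO_INT:
            continue
        cleaned.append((smiles, LABEL_TO_INT[label]))
    return cleaned
```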

After the datasets are aggregated and preprocessed, Extended-Connectivity Fingerprints (ECFP4) are used to represent each molecule as a binary vector of length 1024. The fingerprints were extracted using RDKit and DeepChem version 2.5.0. The scripts for the pre-processing and fingerprint extraction are available in the GitHub repository alongside the data. This fingerprint represents the presence or absence of certain sub-graphs within each molecule, which in turn makes it suitable for measuring similarity between molecules. To assess the diversity within all the collected molecules, Principal Component Analysis is applied to the fingerprint vectors, projecting the fingerprints into a 2D space. To highlight the denser areas within this 2D map, Gaussian kernel density estimation is used.

D.2. Correlation Analysis

To find correlating bioassays, the bioactivity labels of all molecules are taken as the vector representing each bioassay. First, the labels that are non-missing in both bioassays are identified. The Pearson correlation coefficient is then calculated between the two resulting vectors. This process is repeated for all bioassay pairs within each disease category, as well as across all the data. The resulting matrices are depicted in the results section. To find interesting correlations, the bioassay pairs with a correlation coefficient higher than 0.5 or lower than -0.5 are selected. If the AID numbers of the two bioassays are within 5 of each other (neighbors), the pair is dismissed, because in most cases such bioassays are very closely related screens. The remaining bioassays are further examined to check for any biological cause for the correlation.
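The procedure above can be sketched in plain Python. Each bioassay is represented here as a dictionary mapping a molecule identifier to its 0/1 label, with missing labels simply absent; this data layout and the function names are assumptions for illustration. The sketch also assumes that "within 5" means an absolute AID difference of at most 5.

```python
from math import sqrt


def pearson_on_shared(labels_a, labels_b):
    """Pearson correlation over molecules labeled in both bioassays.

    Returns None when fewer than two molecules are shared or when either
    label vector is constant (the coefficient is undefined in that case).
    """
    shared = sorted(set(labels_a) & set(labels_b))
    if len(shared) < 2:
        return None
    xs = [labels_a[m] for m in shared]
    ys = [labels_b[m] for m in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def interesting_pairs(assays, threshold=0.5, neighbor_window=5):
    """List (aid_a, aid_b, r) for non-neighboring pairs with |r| > threshold.

    assays : dict mapping AID (int) to a {molecule_id: label} dict
    """
    aids = sorted(assays)
    pairs = []
    for i, aid_a in enumerate(aids):
        for aid_b in aids[i + 1:]:
            if abs(aid_a - aid_b) <= neighbor_window:
                continue  # neighboring AIDs are usually closely related screens
            r = pearson_on_shared(assays[aid_a], assays[aid_b])
            if r is not None and abs(r) > threshold:
                pairs.append((aid_a, aid_b, r))
    return pairs
```

The pairwise loop is quadratic in the number of bioassays, which is acceptable for a few hundred assays; a matrix-based implementation would be preferable at larger scale.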

D.3. Classification and Performance Benchmark

After the data is categorized based on the related diseases, the data is split using DeepChem into training, validation, and test sets, with 80, 10, and 10 percent shares respectively. This splitting is done after finding the Bemis-Murcko scaffold of each molecule [52], and molecules with shared scaffolds are put into the same split. Splitting based on the scaffolds creates more distinct splits, making the classification problem harder and more like real-world scenarios, where the inference set can often have a different distribution than the training set. After the split, some tasks may have no positive data points in the smaller splits, which creates a problem for calculating performance metrics. Those tasks are therefore identified, and one of their positive data points from the training set is randomly moved to the smaller split.
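The grouping logic of a scaffold split, together with the positive-sample fix, can be sketched without any cheminformatics dependency. The sketch assumes the scaffold of each molecule has already been computed as a string (for example with RDKit) and follows the common strategy of assigning the largest scaffold groups to the training set first; it is an illustration of the idea, not DeepChem's implementation, and the function names are hypothetical.

```python
import random


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that each scaffold stays in a single split."""
    groups = {}
    for idx, scaffold in enumerate(scaffolds):
        groups.setdefault(scaffold, []).append(idx)
    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


def ensure_positive(train, small_split, labels, rng):
    """Move one random positive from train if small_split has none (in place)."""
    if any(labels[i] == 1 for i in small_split):
        return
    positives = [i for i in train if labels[i] == 1]
    if not positives:
        return  # the task has no positives to move
    moved = rng.choice(positives)
    train.remove(moved)
    small_split.append(moved)
```

Note that the moved molecule may share its scaffold with molecules remaining in the training set, so the fix trades a small amount of scaffold separation for computable metrics.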

The molecules are featurized and converted into graphs with the chirality included in the features. DeepChem was used to featurize the molecules and convert them into undirected graphs with nodes representing atoms and edges representing bonds. These graphs are computationally represented as two matrices: the connectivity matrix and the feature matrix. The feature matrix includes 75 features for each node (atom) within the graph, which include one-hot encoding of the atom type, number of directly bonded neighbors, number of implicit Hydrogens on the atom, formal charge, number of radical electrons, one-hot encoding of the atom's hybridization, and aromaticity. DeepChem also has the option to add chirality features to the feature vectors, which adds three additional values to each vector (78 features in total) representing if the chirality property exists and if so, the classification of the chirality to right-hand or left-hand. The script for featurization of the molecules is available in the Github repository.

To assist the process, the training split is balanced using weight transformers that affect how the loss is aggregated, amplifying the effect of positive samples during training. The training split is used to train a GCNN in a multitask manner for each category, including one model trained on 600 bioassays combined. The parameters for training and the related model are shown in Table 6.

Table 6 – Parameters of the training model.

Parameter	Value	Parameter	Value
Split	Specified	Dropout	0.1
Featurizer	GraphConv	Initial Learning Rate	0.0001
Epoch Number	10	Batch Size	128
Graph Conv. Layers	[512, 512, 512]	Dense Layer Size	1024
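The effect of the balancing weights mentioned above can be illustrated for a single task. The exact weighting rule is not stated in the text, so this sketch assumes one common choice: positives are up-weighted so that the total weight of the positive class equals that of the negative class, and missing labels (here `None`) receive zero weight so that they do not contribute to the loss. The function name is hypothetical.

```python
def balancing_weights(labels):
    """Per-sample loss weights for one task.

    labels : list of 1 (active), 0 (inactive), or None (not measured)
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        # Nothing to balance: weight measured samples equally.
        return [0.0 if y is None else 1.0 for y in labels]
    pos_weight = n_neg / n_pos
    weights = []
    for y in labels:
        if y is None:
            weights.append(0.0)
        elif y == 1:
            weights.append(pos_weight)
        else:
            weights.append(1.0)
    return weights
```

With one active and four inactive molecules, the single active receives a weight of 4.0, so both classes contribute equally to the aggregated loss.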

The evaluation metrics selected in this work are accuracy, recall, precision, and Area Under the Receiver Operating Characteristic Curve (ROC AUC). While accuracy is an intuitive metric of performance, it is not suitable for comparing models in imbalanced scenarios, where ROC AUC represents performance more faithfully. Moreover, recall and precision are important in evaluating virtual screening models: recall denotes how many of the valuable active molecules were correctly predicted, while precision indicates how well the trained model performs at inference time, when it must select active molecules from a plethora of possible candidates for screening.
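A small worked example makes the point about imbalance concrete. The sketch below computes the four metrics in plain Python; the function name, the 0.5 decision threshold, and the convention of returning 0.0 for precision or recall when the denominator is zero are assumptions. ROC AUC is computed as the fraction of (active, inactive) pairs in which the active molecule receives the higher score, counting ties as half.

```python
def screening_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, precision, recall, and ROC AUC for one binary task."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if pos and neg:
        wins = sum(
            1.0 if p > n else 0.5 if p == n else 0.0
            for p in pos for n in neg
        )
        auc = wins / (len(pos) * len(neg))
    else:
        auc = None  # undefined without both classes present

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "roc_auc": auc,
    }
```

For a task with one active among ten molecules, a model that scores every molecule below the threshold still reaches 90% accuracy while recovering no actives, whereas recall and ROC AUC expose how well the active molecule is actually ranked.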

Availability of Data and Materials:

MolData is available in Supplementaryfile2.zip as well as in an open-source GitHub repository at: https://github.com/Transilico/MolData

Supplementaryfile1.csv: The bioassay IDs of the used datasets as well as their related source

Supplementaryfile2.zip: MolData Benchmark

Supplementaryfile3.csv: Two mapping files containing the mapping of each bioassay to each disease category

Supplementaryfile4.csv: Two mapping files containing the mapping of each bioassay to each Target category

Supplementaryfile5.csv: Sets of targets with a high correlation and their related information

Supplementaryfile6.xlsx: The detailed results for each model and bioassay

Supplementaryfile7.csv: The relation between each tag and the major disease category

Competing interests

We declare no conflict of interest

Funding

There is no funding for this project

Authors' contributions

Arash Keshavarzi Arshadi wrote the biological and chemical sections, collected the data, and manually labeled them. Milad Salem implemented the algorithms to clean and categorize the data, trained models and wrote the data science and analysis related sections. Arash Firouzbakht clustered the data to target related ones and wrote the target benchmark section. Jiann Shiun Yuan provided guidance and advised the project.

Acknowledgements

We would like to thank Hani Goodarzi for his pieces of advice and ideas for this project. We also thank Jennifer Collins and Julia Web for their contribution in improving the written sections.

1. Deng, D., Chen, X., Zhang, R., Lei, Z., Wang, X., & Zhou, F. (2021). XGraphBoost: Extracting Graph Neural Network-Based Features for a Better Prediction of Molecular Properties. Journal of Chemical Information and Modeling, 61(6), 2697–2705. https://doi.org/10.1021/ACS.JCIM.0C01489

2. Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., … Pande, V. (2018). MoleculeNet: A benchmark for molecular machine learning. Chemical Science, 9(2), 513–530. https://doi.org/10.1039/c7sc02664a

3. Minnich, A. J., McLoughlin, K., Tse, M., Deng, J., Weber, A., Murad, N., … Allen, J. E. (2019). AMPL: A Data-Driven Modeling Pipeline for Drug Discovery. Retrieved from http://arxiv.org/abs/1911.05211

4. Duan, Y., Edwards, J. S., & Dwivedi, Y. K. (2019). Artificial intelligence for decision making in the era of Big Data – evolution, challenges and research agenda. International Journal of Information Management, 48, 63–71. https://doi.org/10.1016/J.IJINFOMGT.2019.01.021

5. Hussin, S. K., Abdelmageid, S. M., Alkhalil, A., Omar, Y. M., Marie, M. I., & Ramadan, R. A. (2021). Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms. Complexity, 2021. https://doi.org/10.1155/2021/6675279

6. Karim, A., Mishra, A., Newton, M. A. H., & Sattar, A. (2019). Efficient Toxicity Prediction via Simple Features Using Shallow Neural Networks and Decision Trees. ACS Omega, 4(1), 1874–1888. https://doi.org/10.1021/ACSOMEGA.8B03173

7. Mayr, A., Klambauer, G., Unterthiner, T., & Hochreiter, S. (2016). DeepTox: Toxicity Prediction using Deep Learning. Frontiers in Environmental Science, 3, 80. https://doi.org/10.3389/fenvs.2015.00080

8. PubChem. (n.d.). Retrieved October 7, 2021, from https://pubchem.ncbi.nlm.nih.gov/

9. ChemSpider | Search and share chemistry. (n.d.). Retrieved October 7, 2021, from http://www.chemspider.com/

10. Davies, M., Nowotka, M., Papadatos, G., Dedman, N., Gaulton, A., Atkinson, F., … Overington, J. P. (2015). ChEMBL web services: Streamlining access to drug discovery data and utilities. Nucleic Acids Research, 43(W1), W612–W620. https://doi.org/10.1093/NAR/GKV352

11. Merck Molecular Activity Challenge | Kaggle. (n.d.). Retrieved October 7, 2021, from https://www.kaggle.com/c/MerckActivity

12. Richard, A. M., Huang, R., Waidyanatha, S., Shinn, P., Collins, B. J., Thillainadarajah, I., … Tice, R. R. (2020). The Tox21 10K Compound Library: Collaborative Chemistry Advancing Toxicology. Chemical Research in Toxicology, 34(2), 189–216. https://doi.org/10.1021/ACS.CHEMRESTOX.0C00264

13. Unterthiner, T., Mayr, A., Klambauer, G., & Hochreiter, S. (2015). Toxicity Prediction using Deep Learning, 3(February). https://doi.org/10.3389/fenvs.2015.00080

14. Wang, Y., Xiao, J., Suzek, T. O., Zhang, J., Wang, J., Zhou, Z., … Bryant, S. H. (2012). PubChem’s BioAssay Database. Nucleic Acids Research, 40(D1), D400–D412. https://doi.org/10.1093/NAR/GKR1132

15. Ramsundar, B., Kearnes, S., Riley, P., Webster, D., Konerding, D., & Pande, V. (2015). Massively Multitask Networks for Drug Discovery.

16. Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., … Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. Chemical science, 9(2), 513–530. https://doi.org/10.1039/c7sc02664a

17. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240. https://doi.org/10.1093/bioinformatics/btz682

18. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (n.d.). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

19. Kearnes, S., McCloskey, K., Berndl, M., Pande, V., & Riley, P. (2016). Molecular Graph Convolutions: Moving Beyond Fingerprints. Journal of Computer-Aided Molecular Design, 30(8), 595–608. https://doi.org/10.1007/s10822-016-9938-8

20. Data Sources - PubChem. (n.d.). Retrieved October 7, 2021, from https://pubchem.ncbi.nlm.nih.gov/sources/#sort=Live-BioAssay-Count

21. Tox21 - PubChem Data Source. (n.d.). Retrieved October 7, 2021, from https://pubchem.ncbi.nlm.nih.gov/source/824

22. Li, J., Sun, Y., Johnson, R. J., Sciaky, D., Wei, C.-H., Leaman, R., … Lu, Z. (2016). BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database: The Journal of Biological Databases and Curation, 2016, 68. https://doi.org/10.1093/DATABASE/BAW068

23. Santos, R., Ursu, O., Gaulton, A., Patrícia Bento, A., Donadi, R. S., Bologa, C. G., … Overington, J. P. (2017). A comprehensive map of molecular drug targets. Nature Publishing Group. https://doi.org/10.1038/nrd.2016.230

24. Rogers, D., & Hahn, M. (2010). Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling, 50(5), 742–754. https://doi.org/10.1021/ci100050t

25. Keshavarzi Arshadi, A., Webb, J., Salem, M., Cruz, E., Calad-Thomson, S., Ghadirian, N., … Yuan, J. S. (2020). Artificial Intelligence for COVID-19 Drug Discovery and Vaccine Development. Frontiers in Artificial Intelligence, 3, 65. https://doi.org/10.3389/frai.2020.00065

26. Madrid, P. B., Panchal, R. G., Warren, T. K., Shurtleff, A. C., Endsley, A. N., Green, C. E., … Tanga, M. J. (2015). Evaluation of Ebola Virus Inhibitors for Drug Repurposing. https://doi.org/10.1021/acsinfecdis.5b00030

27. Schachterle, S. E., Mtove, G., Levens, J. P., Clemens, E., Shi, L., Raj, A., … Sullivan, D. J. (2014). Short-Term Malaria Reduction by Single-Dose Azithromycin during Mass Drug Administration for Trachoma, Tanzania. Emerging Infectious Diseases, 20(6), 941–949. https://doi.org/10.3201/EID2006.131302

28. Arshadi, A. K., Salem, M., Collins, J., Yuan, J. S., & Chakrabarti, D. (2020). Deepmalaria: Artificial intelligence driven discovery of potent antiplasmodials. Frontiers in Pharmacology, 10. https://doi.org/10.3389/fphar.2019.01526

29. Sagara, I., Oduro, A. R., Mulenga, M., Dieng, Y., Ogutu, B., Tiono, A. B., … Dunne, M. W. (2014). Efficacy and safety of a combination of azithromycin and chloroquine for the treatment of uncomplicated Plasmodium falciparum malaria in two multi-country randomised clinical trials in African adults. Malaria Journal 2014 13:1, 13(1), 1–10. https://doi.org/10.1186/1475-2875-13-458

30. Lamri, A., Pigeyre, M., Garver, W. S., & Meyre, D. (2018). The Extending Spectrum of NPC1-Related Human Disorders: From Niemann–Pick C1 Disease to Obesity. Endocrine Reviews, 39(2), 192. https://doi.org/10.1210/ER.2017-00176

31. K, N., A, C., K, D., DK, S., EL, H., DL, M., … RE, P. (2005). Protein transduction of Rab9 in Niemann-Pick C cells reduces cholesterol storage. FASEB journal : official publication of the Federation of American Societies for Experimental Biology, 19(11), 1558–1560. https://doi.org/10.1096/FJ.04-2714FJE

32. Giovannini, S., Weller, M.-C., Hanzlíková, H., Shiota, T., Takeda, S., & Jiricny, J. (2020). ATAD5 deficiency alters DNA damage metabolism and sensitizes cells to PARP inhibition. Nucleic Acids Research, 48(9), 4928–4939. https://doi.org/10.1093/NAR/GKAA255

33. Pensa, S., Regis, G., Boselli, D., Novelli, F., & Poli, V. (2013). STAT1 and STAT3 in Tumorigenesis: Two Sides of the Same Coin? Retrieved from https://www.ncbi.nlm.nih.gov/books/NBK6568/

34. Chapgier, A., Wynn, R. F., Jouanguy, E., Filipe-Santos, O., Zhang, S., Feinberg, J., … Arkwright, P. D. (2006). Human Complete Stat-1 Deficiency Is Associated with Defective Type I and II IFN Responses In Vitro but Immunity to Some Low Virulence Viruses In Vivo. The Journal of Immunology, 176(8), 5078–5083. https://doi.org/10.4049/JIMMUNOL.176.8.5078

35. Richmond, J. K., & Baglole, D. J. (2003). Lassa fever: epidemiology, clinical features, and social consequences. BMJ : British Medical Journal, 327(7426), 1271. https://doi.org/10.1136/BMJ.327.7426.1271

36. Lassa fever. (n.d.). Retrieved October 7, 2021, from https://www.who.int/health-topics/lassa-fever#tab=tab_1

37. OG, G., BE, J., MR, V., WJ, V., GW, T., & HE, L. (2009). Drug targets in infections with Ebola and Marburg viruses. Infectious disorders drug targets, 9(2), 191–200. https://doi.org/10.2174/187152609787847730

38. Marburg virus disease. (n.d.). Retrieved October 7, 2021, from https://www.who.int/news-room/fact-sheets/detail/marburg-virus-disease

39. Rosenke, K., Feldmann, H., Westover, J. B., Hanley, P. W., Martellaro, C., Feldmann, F., … Safronetz, D. (2018). Use of Favipiravir to Treat Lassa Virus Infection in Macaques. Emerging Infectious Diseases, 24(9), 1696–1699. https://doi.org/10.3201/EID2409.180233

40. SL, B., TM, B., J, W., KS, W., SA, V. T., L, D., … TK, W. (2018). Efficacy of favipiravir (T-705) in nonhuman primates infected with Ebola virus or Marburg virus. Antiviral research, 151, 97–104. https://doi.org/10.1016/J.ANTIVIRAL.2017.12.021

41. Wilson, B. A. P., Thornburg, C. C., Henrich, C. J., Grkovic, T., & O’Keefe, B. R. (2020). Creating and screening natural product libraries. Natural Product Reports, 37, 863–1032. https://doi.org/10.1039/c9np00068b

42. Early Translation Branch (ETB) | National Center for Advancing Translational Sciences. (n.d.). Retrieved October 22, 2021, from https://ncats.nih.gov/etb

43. Broad Institute. (n.d.). Retrieved October 22, 2021, from https://www.broadinstitute.org/

44. Home | SBP. (n.d.). Retrieved October 22, 2021, from https://www.sbpdiscovery.org/

45. UNM Center for Molecular Discovery | University of New Mexico flow cytometry research center. (n.d.). Retrieved October 22, 2021, from http://nmmlsc.health.unm.edu/

46. Biological Discovery through Chemical Innovation | Emory University | Atlanta GA. (n.d.). Retrieved October 22, 2021, from https://bdci.emory.edu/

47. Toxicology in the 21st Century (Tox21) | National Center for Advancing Translational Sciences. (n.d.). Retrieved October 22, 2021, from https://ncats.nih.gov/tox21

48. Lead Identification | Scripps Florida. (n.d.). Retrieved October 22, 2021, from https://hts.florida.scripps.edu/

49. Johns Hopkins Ion Channel Center - PubChem Data Source. (n.d.). Retrieved October 22, 2021, from https://pubchem.ncbi.nlm.nih.gov/source/Johns Hopkins Ion Channel Center

50. ICCB-Longwood Screening Facility. (n.d.). Retrieved October 22, 2021, from https://iccb.med.harvard.edu/

51. HMMER. (n.d.). Retrieved October 7, 2021, from http://hmmer.org/

52. Bemis, G. W., & Murcko, M. A. (1996). The Properties of Known Drugs. 1. Molecular Frameworks. Journal of Medicinal Chemistry, 39(15), 2887–2893. https://doi.org/10.1021/JM9602928
