**Impact of Neural Network Architecture and Training Set Size.** To evaluate the effect of network architecture and training set size, we used the largest NCI-60 dataset – the growth inhibition data of 50,606 compounds against the A549 cell line. We randomly selected 1,000 compounds as the test set for DNN model performance evaluation. From the remaining compounds, we randomly selected subsets ranging in size from 500 to 40,000 compounds as training sets and used the leftover compounds as validation sets. Although this introduced variability in the size of the validation set used to determine the training stopping point, the test set used to evaluate the results remained fixed for these calculations.

For single-hidden-layer networks, we evaluated networks of different widths, with the number of hidden neurons ranging from 100 to 6,000. For each network architecture and training-set size, we repeated model optimization five times using different initializing conditions and took the average test-set MSE of the five resulting models as the MSE of that architecture. Table 1 summarizes the numerical evaluation as a function of the number of hidden neurons and the number of compounds in the training set, showing that with a large number of training samples (i.e., ≥ 10,000), the number of hidden neurons did not have an impact on model performance. However, when the training set was smaller, models with too few (i.e., 100) or too many (i.e., 4,000 and 6,000) hidden neurons performed worse than models with 500–2,000 hidden neurons. Overall, the results are in line with the well-known observation that the more training samples, the better the model: doubling the training-set size reduced the relative error by a roughly constant 10%. Because the absolute error is largest for the smallest training sets, the largest absolute benefit of reducing prediction errors through transfer learning is expected for the smallest training sets.
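The fixed-test-set, repeated-subsampling protocol above can be sketched compactly. The sketch below is purely illustrative: it uses synthetic fingerprint-like data and a closed-form ridge regression as a cheap stand-in for DNN training (the helper names `fit_ridge` and `mean_test_mse` are hypothetical, not from the paper), and shrinks the data dimensions for speed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the A549 data: fingerprint bits -> pGI50-like values.
n_total, n_bits = 3000, 64          # shrunk from 50,606 x 1,024 for illustration
X = rng.integers(0, 2, size=(n_total, n_bits)).astype(float)
w_true = rng.normal(size=n_bits)
y = X @ w_true + rng.normal(scale=0.5, size=n_total)

# Fixed held-out test set, as in the paper's protocol.
test_idx = rng.choice(n_total, size=300, replace=False)
pool_idx = np.setdiff1d(np.arange(n_total), test_idx)

def fit_ridge(X_tr, y_tr, lam=1.0):
    """Closed-form ridge regression as a cheap stand-in for DNN training."""
    d = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

def mean_test_mse(train_size, n_repeats=5):
    """Average fixed-test-set MSE over n_repeats random training subsets."""
    mses = []
    for seed in range(n_repeats):
        r = np.random.default_rng(seed)
        tr = r.choice(pool_idx, size=train_size, replace=False)
        w = fit_ridge(X[tr], y[tr])
        mses.append(np.mean((X[test_idx] @ w - y[test_idx]) ** 2))
    return float(np.mean(mses))

for n in (200, 400, 800, 1600):
    print(n, round(mean_test_mse(n), 3))
```

The loop reproduces the qualitative trend of Table 1: average test MSE falls as the training subset grows, while the test set itself never changes.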

Table 1

Mean squared error (standard deviation) of test-set compounds for predicting A549 cell inhibition using single-hidden-layer neural networks trained with increasing training-set sizes and a variable number of neurons in the hidden layer. The data show that models with too few (e.g., 100) or too many (e.g., 6,000) hidden neurons do not perform well when trained on small training sets. Doubling the number of compounds in the training set reduced the relative error by roughly 10%. The errors are given in units of (log10(mol/l))², and the smallest error for each training-set size is indicated in boldface font.

Column headers give the number of training compounds; rows give the number of hidden neurons.

| Number of hidden neurons | 500 | 1,000 | 2,000 | 3,000 | 4,000 | 10,000 | 20,000 | 30,000 | 40,000 |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.82 (0.04) | 0.74 (0.02) | 0.65 (0.02) | 0.62 (0.03) | 0.59 (0.02) | 0.47 (0.02) | 0.43 (0.02) | 0.40 (0.01) | 0.38 (0.01) |
| 500 | **0.73 (0.03)** | **0.67 (0.03)** | 0.61 (0.03) | 0.58 (0.03) | 0.55 (0.02) | 0.47 (0.01) | 0.42 (0.03) | **0.38 (0.01)** | **0.37 (0.01)** |
| 1,000 | **0.73 (0.04)** | **0.67 (0.02)** | **0.60 (0.02)** | **0.57 (0.03)** | **0.54 (0.03)** | 0.46 (0.01) | 0.42 (0.02) | **0.38 (0.01)** | 0.38 (0.01) |
| 2,000 | **0.73 (0.04)** | **0.67 (0.02)** | 0.61 (0.02) | **0.57 (0.03)** | 0.55 (0.03) | **0.45 (0.02)** | **0.41 (0.02)** | 0.39 (0.01) | **0.37 (0.01)** |
| 4,000 | 0.74 (0.03) | 0.69 (0.02) | 0.62 (0.02) | 0.58 (0.02) | 0.55 (0.02) | 0.46 (0.02) | 0.42 (0.02) | 0.39 (0.01) | **0.37 (0.01)** |
| 6,000 | 0.77 (0.04) | 0.70 (0.02) | 0.63 (0.02) | 0.59 (0.02) | 0.56 (0.02) | 0.46 (0.01) | **0.41 (0.02)** | 0.39 (0.01) | 0.38 (0.02) |


Figures 1 and 2 show the corresponding numerical results for DNNs with two and three hidden layers, with the complete datasets presented in Tables S2 and S3, respectively, of the Supplementary Information. Similar to the results of one-hidden layer networks in Table 1, these results show that the most important determinant of model quality was the training sample size. Compared to variations in training-set size, the depth and width of the neural networks had a much smaller impact on model performance, especially when there were more than ~ 4,000 compounds in the training set.

**Effect of Transferring Parameters from a Data-Rich Model to Develop a Model with Limited Training Data – a Proof-of-Concept Study.** To evaluate the effect of transferring parameters from a model trained with a large number of compounds to develop a model with limited experimental data, we initially used the A549 data as the data-rich dataset. We first developed several DNN models for predicting pGI50 by randomly splitting the dataset into a 90% training set and a 10% validation set to train A549 prediction models with an increasing number of hidden layers. For these models, we used network architectures sized 1,024:1,000:1, 1,024:1,000:1,000:1, and 1,024:1,000:1,000:500:1, where the initial 1,024 nodes correspond to the number of input features and the final single output node represents the predicted pGI50 value. The other integers give the numbers of hidden neurons in the first, second, and third hidden layers.
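The layer-size notation above maps directly onto weight-matrix shapes. The NumPy sketch below (with hypothetical helper names `init_mlp` and `forward`, and ReLU hidden activations assumed for illustration) shows how the three architectures determine shapes and parameter counts:

```python
import numpy as np

def init_mlp(sizes, seed=0):
    """Build (weight, bias) pairs for a fully connected net, e.g. [1024, 1000, 1]."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out))  # He init
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def forward(params, x):
    """ReLU hidden layers, linear single-output layer (pGI50 regression)."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

# The three A549 architectures from the text, as layer-size lists.
archs = {
    "1 hidden": [1024, 1000, 1],
    "2 hidden": [1024, 1000, 1000, 1],
    "3 hidden": [1024, 1000, 1000, 500, 1],
}
x = np.random.default_rng(1).random((5, 1024))   # 5 fake 1,024-bit fingerprints
for name, sizes in archs.items():
    p = init_mlp(sizes)
    n_params = sum(W.size + b.size for W, b in p)
    print(name, forward(p, x).shape, n_params)
```

For example, 1,024:1,000:1 carries 1,024 × 1,000 + 1,000 weights and biases in the hidden layer plus 1,000 + 1 in the output layer, i.e. 1,026,001 parameters in total.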

Next, we designated the HTB132 (a breast cancer cell line) pGI50 data (5,612 compounds in total) as a data-limited dataset. Figure 3 schematically shows the steps executed in evaluating the transfer-learning approach. We randomly selected 10% of the HTB132 data as a test set for evaluating DNN model performance. From the remaining HTB132 data, we randomly selected 10% as a validation set. We then trained a series of HTB132 models with the same architectures as the A549 models, using 500, 1,000, and 2,000 compounds to simulate models trained with small datasets. We also trained an HTB132 model with ~ 80% of the HTB132 dataset (4,546 compounds), with the remaining 20% as the validation and test sets, to establish a reference for the best model one could derive from the HTB132 data alone (without transfer learning). We used the MSE of the DNN models on the test-set compounds as the performance measure. Finally, we repeated the training of the HTB132 DNN models, but with one to three hidden layers of the A549 models transferred while freezing the values of their weights and biases, and optimized the rest of the model parameters using the HTB132 training sets. We then calculated the MSE of the test-set compounds using the resulting HTB132 models. Due to the stochastic nature of gradient descent optimization and the random assignment of the initial weights and biases, each optimization ended with a different set of model parameters. We therefore repeated all model training 10 times with randomly selected training and validation compounds to derive statistically reliable results.
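The freeze-and-retrain step can be illustrated minimally. The sketch below uses synthetic stand-in data (not the A549/HTB132 sets): the first-layer weights of a "source" model are frozen, and only the output layer is refit on a small target set. With a single frozen hidden layer, optimizing the remaining parameters of a regression net reduces to linear least squares on the frozen activations, which is used here in place of SGD for brevity; comparing against a frozen randomly initialized layer mimics the benefit of transferring related-task features:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

d_in, d_hid = 32, 64

# First-layer parameters "pretrained" on the data-rich task (A549 stand-in).
W1 = rng.normal(scale=np.sqrt(2.0 / d_in), size=(d_in, d_hid))
b1 = np.zeros(d_hid)

# Small target dataset whose labels depend on the same hidden features,
# mimicking a correlated assay (HTB132 stand-in); 200 train / 200 test.
X = rng.normal(size=(400, d_in))
y = relu(X @ W1 + b1) @ rng.normal(size=d_hid) + rng.normal(scale=0.1, size=400)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

def fit_frozen(feat_tr, y_tr, feat_te):
    """With the hidden layer frozen, fitting the remaining (output-layer)
    parameters is a linear least-squares problem on the activations."""
    A_tr = np.hstack([feat_tr, np.ones((len(feat_tr), 1))])  # bias column
    A_te = np.hstack([feat_te, np.ones((len(feat_te), 1))])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return A_te @ coef

# Transferred (pretrained) frozen layer vs. a randomly initialized frozen layer.
H_tr, H_te = relu(X_tr @ W1 + b1), relu(X_te @ W1 + b1)
W_rnd = np.random.default_rng(99).normal(scale=np.sqrt(2.0 / d_in),
                                         size=(d_in, d_hid))
R_tr, R_te = relu(X_tr @ W_rnd), relu(X_te @ W_rnd)

mse_transfer = float(np.mean((fit_frozen(H_tr, y_tr, H_te) - y_te) ** 2))
mse_random = float(np.mean((fit_frozen(R_tr, y_tr, R_te) - y_te) ** 2))
print(f"frozen transferred layer, test MSE: {mse_transfer:.3f}")
print(f"frozen random layer,      test MSE: {mse_random:.3f}")
```

When the two tasks share their hidden representation, the transferred frozen layer yields a markedly lower test MSE than the random one from the same small training set, mirroring the effect reported for the A549-to-HTB132 transfer.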

Figure 4 shows the results of our evaluation, where each data point represents the average MSE over the 10 models trained with the same number of randomly selected training samples, and the vertical bars represent ± 2 standard deviations. The three panels show the results as a function of the number of hidden layers in the networks, i.e., N = 1, 2, or 3. The complete datasets are given in Table S4 of the Supplementary Information. Figure 4 (top) shows that, for each network architecture, without transfer learning, model performance depended strongly on the number of compounds in the training set, with the variability decreasing with increasing training-set size, as expected. The minimum MSE achievable using the complete HTB132 data could not be reached with the limited-compound training sets. However, using the frozen parameters transferred from the A549 model, optimization of the remaining parameters with the same HTB132 training sets resulted in a marked performance improvement, both in terms of considerably smaller average MSEs and lower variability. Even with the smallest training set of 500 compounds, transfer learning produced considerably better models than training with all HTB132 compounds without transfer learning. For networks with two or three hidden layers, we transferred parameters for up to three hidden layers, with the results consistently indicating that transferring the first-hidden-layer parameters was the most effective. Transferring parameters from additional layers resulted in slightly worse models as judged by the MSE of the test-set compounds. This is most likely due to more specialized, A549-specific parameters appearing in the second and third hidden layers of the A549 growth inhibition DNN model. Transferring these parameters provides no additional benefit to a non-A549-specific model and can instead degrade the prediction performance of the HTB132 model.

**Conditions for Transfer-Learning Success and Expected Benefits.** The results of transferring parameters from the A549 model to develop an HTB132 model are promising, yielding models better than what could be achieved using the entire HTB132 dataset itself. The benefits can be partially explained by the high correlation and similarity of the assays themselves, i.e., both measure chemically induced growth inhibition in cell-line cultures. In fact, the pGI50 values of A549 and HTB132 cells were highly correlated, with a squared Pearson’s correlation coefficient (*r*²) of 0.60 calculated from the 5,532 common compounds tested in both growth inhibition assays. Because Xu et al. suggested that assay correlation might be the key to the success of multi-task DNN molecular activity models,41 we hypothesized that assay correlation may also be an important contributing factor to the success of transfer learning. Trivially, given an assay correlation of 1.0, transfer learning is by definition the optimal choice of weights. To test this hypothesis non-trivially, we needed to assess transfer learning across many pairs of datasets with a broad range of inter-assay *r*² values. Consequently, we selected a number of NCI-60 growth inhibition dataset pairs that included cell lines from different tissue origins and complemented them with additional chemical activity data covering a broad range of inter-assay correlations.
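The inter-assay *r*² is computed only over compounds tested in both assays. A minimal sketch, using hypothetical synthetic assay tables keyed by compound ID (the NSC-style identifiers and the `inter_assay_r2` helper are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical pGI50 tables keyed by compound ID for two partially
# overlapping assays; values share a common component plus assay noise.
rng = np.random.default_rng(3)
ids = [f"NSC-{i}" for i in range(1000)]
base = rng.normal(size=1000)
assay_a = {c: v + rng.normal(scale=0.5) for c, v in zip(ids, base)}
assay_b = {c: v + rng.normal(scale=0.5) for c, v in zip(ids[200:], base[200:])}

def inter_assay_r2(a, b):
    """Squared Pearson correlation over compounds tested in both assays."""
    common = sorted(set(a) & set(b))
    x = np.array([a[c] for c in common])
    y = np.array([b[c] for c in common])
    return float(np.corrcoef(x, y)[0, 1] ** 2), len(common)

r2, n_common = inter_assay_r2(assay_a, assay_b)
print(f"r^2 = {r2:.2f} over {n_common} common compounds")
```

This is the same computation as the A549/HTB132 figure of *r*² = 0.60 over 5,532 common compounds, just on fabricated data; note that the estimate is only as good as the size and diversity of the overlap.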

We examined the NCI-60 MALME-3M (a human skin cancer) cell line dataset paired with 28 other cell lines, providing a range of inter-assay pGI50 correlations *r*² between 0.45 and 0.87. Similarly, we included the MDA-MB-435 (a human breast cancer) cell line paired with 18 other cell lines, with a range of inter-assay pGI50 correlations *r*² between 0.47 and 0.95. Given the nature of the NCI-60 assays and their relatively high correlations (*r*² > 0.4), we complemented the NCI-60 dataset pairs with other chemical activity data, such as chemical binding affinity to drug targets and potency to inhibit enzyme functions, as well as physicochemical properties, including lipophilicity and aqueous solubility. Details of these datasets and their pairings are provided in Tables S1 and S5 of the Supplementary Information.

We evaluated the transferability of the hidden layers of pre-trained neural networks across the dataset pairs using the 1,024:2,000:1, 1,024:2,000:100:1, and 1,024:1,000:1,000:100:1 network architectures, where the evaluation procedure followed the steps outlined in Fig. 3.

For each dataset pair, we designated the larger dataset as the data-rich set and the smaller one as the data-limited set. We used a random 90–10% split for training and validation of the data-rich models to create the weights and biases of the hidden layers to be transferred for the development of the data-limited models. From each data-limited dataset, we first randomly selected 10% of the compounds as a test set. From the remaining compounds, we randomly selected 10% as the validation set. We then randomly selected 500 and 1,000 compounds from the remaining compounds as our data-limited training sets to train neural network models with and without transfer learning. We calculated the MSEs of the test sets using the resulting models and calculated the TLE from the MSEs of the models trained without and with transfer learning. Figure 5 shows the results for training sets of 500 compounds, and Fig. 6 shows the corresponding data for 1,000-compound training sets. The numerical results are given in Tables S5 and S6 of the Supplementary Information. Figures 5 and 6 are similar, with both showing that when the *r*² of a dataset pair was 0.4 or higher, the TLE was larger than zero, and the higher the *r*², the larger the TLE. When *r*² was lower than 0.4, the results were less clear-cut and depended on network architecture. Using the shallow network with a single hidden layer, transfer learning lowered the MSE (TLE > 0) in a little over 50% of the cases (19 out of 35 with a training set of 500 compounds and 21 out of 35 with a training set of 1,000 compounds). However, using a deeper network with two or three hidden layers, transfer learning resulted in a positive TLE in the majority of cases even when *r*² was lower than 0.4.
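The exact TLE formula is defined earlier in the paper and not restated in this section; as an assumption, the sketch below uses one natural definition consistent with the usage here (TLE > 0 exactly when transfer learning lowers the test-set MSE, with larger values for larger gains), namely the fractional MSE reduction:

```python
def transfer_learning_efficacy(mse_without, mse_with):
    """Fractional reduction in test-set MSE from transfer learning.

    NOTE: this formula is an assumption for illustration, consistent with
    TLE > 0 meaning transfer learning lowered the MSE; the paper's own
    definition is given elsewhere in the text.
    """
    return (mse_without - mse_with) / mse_without

# Example: MSE 0.80 without transfer and 0.60 with -> 25% reduction.
print(round(transfer_learning_efficacy(0.80, 0.60), 4))
```

Under this convention, TLE = 0 means no change and negative values mean transfer learning degraded the model, matching the sign interpretation used in Figs. 5–7.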

Figure 7 shows the mean TLE values as a function of *r*² and illustrates that the higher the *r*² between a data-rich and a data-limited dataset, the larger the benefit of transfer learning. The increase in TLE, and the consequent reduction in prediction error, ranged from 10–20% for every 0.10 increase in *r*² between the datasets. Where the inter-assay correlations *r*² were lower than 0.4, there was no benefit to using transfer learning for a one-hidden-layer network, whereas two- or three-hidden-layer networks could still benefit.

**Limitations.** We introduced the concept of dataset similarity as a metric for deciding when transfer learning could be beneficial for augmenting the training of DNNs on small datasets. Here, we used Pearson’s correlation coefficient as a purely numerical evaluation of dataset similarity, and this may not capture all considerations relevant to evaluating transfer learning. Furthermore, the correlation metric is not known *a priori*, as it has to be estimated from the datasets themselves based on a potentially limited number of compounds tested in both assays. Although this can be a practical limitation when the chemical diversity of the data is narrow, for chemical property applications based on minimal datasets of ~ 10² compounds, the transfer-learning approach described here might be the only practical way forward to implement a data-driven prediction model.