Data collection
Two datasets were constructed to train and validate ML models for fractional drug release prediction: (i) a training dataset and (ii) an external validation dataset. The training dataset was used for ML model screening and development. It consisted of 2638 data entries, which described 102 drug release profiles for 22 drug-polymer combinations. The data were either generated in-house and previously published by our lab or collected from Web of Science. The data obtained from Web of Science were collected using the keyword combination of “polymeric microparticle” and “drug delivery”. From appropriate manuscripts, information related to the preparation, final composition, and release kinetics of drug-loaded LAIs was collected. The latter was primarily extracted from figures of drug release profiles using the “GetData Graph Digitizer” application. The external validation dataset was collected exclusively from an additional Web of Science search using the same search terms. This dataset consisted of 1016 data entries, which described the release profiles of 79 unseen LAI formulations made up of 21 drug-polymer combinations.
The initially collected dataset was composed of a table of drug and polymer names, formulation physicochemical properties, and fractional drug release values at various timepoints. To use these data to construct and train ML models, it was necessary to describe the various elements using computationally recognizable descriptors, which were calculated with RDKit (i.e., from SMILES codes). The polymers and LAI formulations were described exclusively using information reported in the relevant published articles; these included: polymer molecular weight, lactide-to-glycolide ratio (set to zero for non-PLGA systems), molecular cross-linking ratio of the polymer (set to zero for non-cross-linked systems), initial drug-to-polymer ratio, drug loading capacity (DLC), surface area-to-volume (SA-V) ratio of the LAI system, and the percent of surfactant present in the release media (SE; set to zero where no surfactant was present).
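As an illustration, molecular descriptors can be computed from a SMILES code with RDKit as sketched below. The SMILES string (aspirin) and the descriptor selection are placeholders for illustration only; they are not the drug set or descriptor set used in this study.

```python
# Sketch: computing molecular descriptors from a SMILES code with RDKit.
# The molecule (aspirin) and descriptor choices are illustrative only.
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, as an example drug
mol = Chem.MolFromSmiles(smiles)

descriptors = {
    "MolWt": Descriptors.MolWt(mol),            # molecular weight
    "MolLogP": Descriptors.MolLogP(mol),        # lipophilicity estimate
    "TPSA": Descriptors.TPSA(mol),              # topological polar surface area
    "NumHDonors": Descriptors.NumHDonors(mol),  # hydrogen-bond donors
}
```

A descriptor dictionary of this kind can be concatenated with the formulation properties listed above to form the final feature table.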
Data splitting strategy for ML model training
In all cases, data entries were grouped by drug-polymer combination to allow cross-validation against drug-polymer-based splits. Data splitting was performed using the group-shuffle-split (GSS) method combined with the leave-one-group-out (LOGO) cross-validation strategy. The GSS and LOGO implementations were imported from the Scikit-learn library in Python.22
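The splitting strategy can be sketched as follows, assuming each sample carries a drug-polymer combination group label; the feature matrix, targets, and group assignments below are synthetic placeholders.

```python
# Sketch of group-based splitting: GSS holds out whole drug-polymer groups,
# LOGO holds out each drug-polymer combination in turn. Data are synthetic.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))            # placeholder feature matrix
y = rng.random(60)                      # placeholder fractional release values
groups = np.repeat(np.arange(6), 10)    # 6 mock drug-polymer combinations

# Group-shuffle-split: test set contains only unseen drug-polymer groups
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups))

# Leave-one-group-out: one cross-validation fold per unique group
logo = LeaveOneGroupOut()
n_folds = logo.get_n_splits(X, y, groups)
```

Because the splits are made at the group level, no drug-polymer combination ever appears in both the training and test partitions of a fold.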
ML model development
In total, eleven ML algorithms were trained and investigated for this task. These included multiple linear regression (MLR), MLR with least absolute shrinkage and selection operator regularization (LASSO), partial least squares (PLS), decision tree (DT), random forest (RF), light gradient boosting machine (LightGBM), extreme gradient boosting (XGBoost), natural gradient boosting (NGBoost), support vector regressor (SVR), k-nearest neighbors (k-NN), and neural network (NN) models. All of these ML models were built and evaluated in Python. NN models were built using the Keras package with a TensorFlow backend,23 LightGBM models were built using the LightGBM package, XGBoost models were built using the XGBoost package,24 NGBoost models17 were built using the NGBoost package, and all other models were built using the Scikit-learn library.22
In all cases, prior to training, a data preprocessing step was conducted to standardize the data before input into the ML models. This was done using the StandardScaler class available in the Scikit-learn library.22 ML model hyperparameters were tuned using the randomized search functionality in Scikit-learn,22 with the negative mean absolute error as the scoring metric. The hyperparameters screened, and the numerical ranges for their values, varied depending on the ML model being trained. This screening process is summarized in Tables S1-S7.
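A minimal sketch of this preprocessing and tuning workflow is shown below, using a random forest as the example model. The data and the hyperparameter grid are illustrative placeholders, not the ranges from Tables S1-S7.

```python
# Sketch: StandardScaler preprocessing + randomized hyperparameter search
# scored with negative MAE. Data and the parameter grid are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 8)), rng.random(100)   # placeholder data

pipe = Pipeline([
    ("scale", StandardScaler()),                    # standardize inputs
    ("rf", RandomForestRegressor(random_state=42)),
])
param_distributions = {                             # illustrative ranges only
    "rf__n_estimators": [50, 100, 200],
    "rf__max_depth": [None, 5, 10],
}
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=5,
    scoring="neg_mean_absolute_error", cv=3, random_state=42,
)
search.fit(X, y)
best_params = search.best_params_
```

Wrapping the scaler and model in a single pipeline ensures the scaler is fit only on the training portion of each cross-validation fold, avoiding information leakage during tuning.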
Pseudo-prospective study (ML model evaluation)
To better assess the predictive performance of all the trained ML models, a pseudo-prospective study was conducted using the external validation dataset. This dataset (~1000 data samples) enabled a more quantitative evaluation of the prediction accuracy of each model. This was done by determining the absolute error (AE) for each prediction made by each model, as well as the mean absolute error (MAE) for various groups (i.e., drugs, polymers, and drug-polymer combinations) within the external validation dataset. The AE, MAE, and standard error of the mean (\({\sigma }_{M}\)) were determined using the equations shown below:
\(\text{absolute error} \left(AE\right)=\left|{y}_{i}-{x}_{i}\right|\) | Eq. 1 |
\(\text{mean absolute error} \left(MAE\right)=\frac{\sum _{i=1}^{n}\left|{y}_{i}-{x}_{i}\right|}{n}\) | Eq. 2 |
\(\text{standard error of the mean} \left({\sigma }_{M}\right)=\frac{s}{\sqrt{n}}\) | Eq. 3 |
where \({y}_{i}\) is the predicted fractional drug release value; \({x}_{i}\) is the experimental fractional drug release value obtained from either the training or external validation datasets; \(n\) is the total number of data points; and \(s\) is the standard deviation.
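Eqs. 1-3 translate directly into NumPy; the predicted and experimental values below are illustrative placeholders, with \(s\) taken as the standard deviation of the absolute errors within the group being summarized.

```python
# Direct translation of Eqs. 1-3; y_pred/y_true values are illustrative.
import numpy as np

y_pred = np.array([0.10, 0.35, 0.60])    # predicted fractional release (y_i)
y_true = np.array([0.12, 0.30, 0.70])    # experimental values (x_i)

ae = np.abs(y_pred - y_true)             # Eq. 1: per-prediction absolute error
mae = ae.mean()                          # Eq. 2: mean absolute error
sem = ae.std(ddof=1) / np.sqrt(len(ae))  # Eq. 3: standard error of the mean
```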
Feature engineering
Agglomerative hierarchical clustering analysis was performed using the hierarchical clustering package from SciPy in Python25 to arrange the initial input features into a hierarchy of clusters using the farthest neighbor (complete linkage) clustering algorithm. The performance of the optimal ML algorithm (RF) was then assessed following the removal of select clusters based on their linkage distance. The hyperparameter structure for the RF model that was identified during training (Table S8) remained fixed, and only the number of input features was varied.
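The feature clustering step can be sketched as follows; the feature matrix, distance metric, and distance threshold are illustrative assumptions, not the study's settings.

```python
# Sketch: cluster input features with SciPy's complete ("farthest neighbor")
# linkage, then cut the dendrogram at a chosen linkage distance.
# The feature matrix, metric, and threshold are illustrative only.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 12))   # 50 samples x 12 input features

# Cluster the feature *columns*, so work on the transposed matrix
dist = pdist(features.T, metric="correlation")
Z = linkage(dist, method="complete")   # farthest-neighbor linkage

# Cut at a linkage distance to assign each feature to a cluster
labels = fcluster(Z, t=1.0, criterion="distance")
```

Each entry of `labels` assigns one input feature to a cluster; redundant clusters can then be dropped and the fixed-hyperparameter RF retrained on the reduced feature set.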
Confidence intervals for random forest model
Confidence interval values for the optimal RF model were determined based on the variance estimates for bagging originally proposed by Efron et al.: the jackknife and infinitesimal jackknife estimators.19,20 In this study, the variances for the optimal random forest model were calculated using the forestci package in Python and used to portray the confidence intervals for the fractional drug release predictions.26
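For illustration, the sketch below approximates per-prediction uncertainty from the spread of individual tree predictions. This is a simpler proxy, not the jackknife or infinitesimal jackknife estimator computed by forestci, and the data are synthetic placeholders.

```python
# Illustration of prediction uncertainty from between-tree spread.
# NOTE: this is a simple proxy, NOT the (infinitesimal) jackknife
# estimator of Efron et al. used by the forestci package.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(80, 6)), rng.random(80)
X_test = rng.normal(size=(10, 6))

rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Predictions of every individual tree for each test sample
tree_preds = np.stack([t.predict(X_test) for t in rf.estimators_])
mean_pred = tree_preds.mean(axis=0)    # equals rf.predict(X_test)
std_pred = tree_preds.std(axis=0)

# Approximate 95% interval from the between-tree spread
lower = mean_pred - 1.96 * std_pred
upper = mean_pred + 1.96 * std_pred
```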
Model interpretation
Shapley additive explanations (SHAP) analysis was conducted on the 12-feature RF model trained for the prediction of fractional drug release from LAIs. The effect of the various input features on the fractional drug release predictions for the external validation dataset was assessed using the TreeSHAP implementation and the force plot visualizations from the SHAP library in Python.27,28