2.1 Thermodynamic Basis of the General Solubility Equation (GSE)
Yalkowsky and coworkers [15, 22, 23] developed the General Solubility Equation (GSE), Eq. 1, to predict the aqueous solubility of liquid and solid nonelectrolytes (mostly industrial organic chemicals). The method is particularly appealing because it requires no ‘training’: only the melting point (mp, in °C) and the octanol-water partition coefficient, either measured (log P) or calculated (clogP), are needed to predict the solubility (in log molar units):
log S0 GSE(classic) = 0.5 - 1.0 log P - 0.01 (mp - 25) (1)
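As a minimal sketch, Eq. 1 can be written as a short Python function. The function name is illustrative, and the clipping of the melting-point term to zero for compounds that melt below 25 °C is an assumption based on the usual GSE convention for liquids; it is not stated in the text here.

```python
def gse_classic(log_p: float, mp_celsius: float) -> float:
    """Eq. 1: predicted log S0 (log molar) from log P (or clogP) and mp in deg C."""
    # Assumption: for solutes that are liquid at 25 deg C, the melting-point
    # term is set to zero (no crystal lattice penalty).
    lattice_penalty = 0.01 * max(mp_celsius - 25.0, 0.0)
    return 0.5 - 1.0 * log_p - lattice_penalty


# Hypothetical solute (not from Table 1): log P = 3.0, mp = 125 deg C
print(gse_classic(3.0, 125.0))
```

For this hypothetical solute, the lipophilicity term contributes -3.0 and the melting-point term contributes -1.0 log unit, so the prediction is log S0 = -3.5.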
The thermodynamic basis of the equation was reviewed recently [13]. Briefly, the dissolution of a crystalline substance in water consists of two main contributions: (i) the crystal lattice effect (XTL), i.e., the energy needed to break down the lattice to form a hypothetical ‘supercooled liquid’ (SCL), and (ii) the solvation effect, i.e., the energy released as the SCL dissolves in water. The total solubility can be expressed as [22, 23]:
log Sw = log SwXTL + log SwSCL (2)
where log SwXTL = - ∆Sm (Tm – T) /(2.303 RT); ∆Sm is the standard molar entropy of phase transformation, T is the absolute temperature (K) and Tm is the melting point (K). At 25 °C:
log SwXTL ≈ - 0.01 (mp - 25) (3)
Hansch and coworkers [24] showed that log S of simple liquid solutes correlated linearly with the octanol-water partition coefficients, log P ≈ log (Soctliq/ Swliq). On re-arrangement,
log Swliq ≈ log Soctliq - log P (4)
where log Soctliq is the log solubility of a liquid solute in octanol, ranging from − 0.3 to + 0.9 for small molecules [24]. Yalkowsky and coworkers rationalized log SoctSCL = 0.5 in Eq. 1 [15].
Hansch’s studies suggest that the constant coefficients in Eq. 1 might need to be modified for compounds from novel classes of chemical space. If the ‘supercooled liquid’ form of a large polar solute is not fully miscible with octanol, then the log SoctSCL contribution could be a negative number. A large molecule with a decreased SoctSCL (due to decreased miscibility with octanol) is expected to have an increased SwSCL. This would lessen the contribution of lipophilicity to the predicted solubility.
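The -0.01 coefficient of the melting-point term in Eq. 1 can be checked numerically from the crystal lattice expression, log SwXTL = - ∆Sm (Tm – T) /(2.303 RT). The sketch below assumes ∆Sm ≈ 56.5 J/(mol K), the Walden's-rule value commonly used with the GSE; that value is an assumption introduced here and is not given in the text above.

```python
R = 8.314          # gas constant, J/(mol K)
T = 298.15         # 25 deg C in kelvin
DELTA_S_M = 56.5   # J/(mol K); ASSUMED entropy of melting (Walden's rule)


def lattice_term(mp_celsius: float, delta_s_m: float = DELTA_S_M,
                 temp_k: float = T) -> float:
    """log SwXTL = -dSm (Tm - T) / (2.303 R T), with Tm taken from mp in deg C."""
    tm_k = mp_celsius + 273.15
    return -delta_s_m * (tm_k - temp_k) / (2.303 * R * temp_k)


# Slope per degree of melting point above 25 deg C
coefficient = DELTA_S_M / (2.303 * R * T)
print(round(coefficient, 4))   # close to the 0.01 used in Eq. 1
print(lattice_term(125.0))     # roughly -1 log unit for mp = 125 deg C
```

Under this assumption the slope is about 0.0099 per degree, which rounds to the 0.01 used in Eq. 1.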
2.2 ‘Flexible-Acceptor’ General Solubility Equation, GSE(Φ,B)
It was found [12] that the sum of Kier’s molecular flexibility (Φ) [25] and Abraham’s [16, 26] basicity descriptor, B, could be incorporated into a nonlinear variant of the GSE to produce a trainable model suitable to predict solubility of various classes of drugs, including large NMEs (MW > 800 g/mol). The resultant GSE(Φ,B) has the form:
log S0GSE(Φ,B) = c0 + c1 ⋅ clogP + c2 ⋅ (mp – 25)/100 (5)
with the variable coefficients modeled here as:
c0 = b0 + b1 exp(-b2 ⋅ (Φ + B)) (5a)
c1 = b3 + b4 [ 1 - exp(-b5 ⋅ (Φ + B)) ] (5b)
c2 = b6 + b7 ⋅ (Φ + B) (5c)
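Eqs. 5 and 5a–5c can be sketched in Python as follows. The function and argument names are illustrative, and the b-constants in the usage example are not the fitted values from this study: they are chosen only to show that the model reduces to the classic GSE (Eq. 1) when the Φ + B dependence is switched off.

```python
import math


def gse_phi_b(clogp, mp_celsius, phi, abraham_b, b):
    """Eqs. 5, 5a-5c: log S0 with coefficients that depend on (phi + B).

    b is a sequence (b0, ..., b7) of trained constants.
    """
    x = phi + abraham_b
    c0 = b[0] + b[1] * math.exp(-b[2] * x)           # Eq. 5a
    c1 = b[3] + b[4] * (1.0 - math.exp(-b[5] * x))   # Eq. 5b
    c2 = b[6] + b[7] * x                             # Eq. 5c
    return c0 + c1 * clogp + c2 * (mp_celsius - 25.0) / 100.0


# Illustrative constants only (NOT the trained values): with b1 = b4 = b7 = 0,
# the model collapses to c0 = 0.5, c1 = -1, c2 = -1, i.e. Eq. 1.
B_CLASSIC = (0.5, 0.0, 0.0, -1.0, 0.0, 0.0, -1.0, 0.0)
print(gse_phi_b(3.0, 125.0, phi=5.0, abraham_b=1.0, b=B_CLASSIC))
```

Because c2 multiplies (mp – 25)/100 rather than (mp – 25), the classic melting-point coefficient of -0.01 corresponds to c2 = -1.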
The c-coefficients as functions of Φ + B were determined by partial least squares (PLS) analysis, using the open-source ‘pls’ package (https://cran.r-project.org/web/packages/pls), of solubility data sorted on values of Φ + B and uniformly binned into 18 groups of 123–775 points, to ensure nearly constant Φ + B increments, as described previously [12, 13]. Since our last study [13], the database has accumulated nearly 1000 new entries. A new set of b-constants was therefore determined in the current investigation, using drug-relevant molecules as the training set and excluding the new drugs. Values of Φ were calculated from the two kappa descriptors and the heavy atom count provided by Landrum’s RDKit open-source chemoinformatics library [27]. Table 1 lists these Φ and B values.
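The calculation of Φ from the three descriptors can be sketched as below, assuming the standard Kier definition (the product of the first- and second-order kappa shape indices divided by the number of heavy atoms); the text does not spell out the formula. The function name is illustrative. In RDKit, the inputs would presumably come from `Descriptors.Kappa1`, `Descriptors.Kappa2` and `Descriptors.HeavyAtomCount`; the example keeps to plain Python and uses made-up descriptor values.

```python
def kier_phi(kappa1: float, kappa2: float, n_heavy: int) -> float:
    """Kier molecular flexibility index: phi = kappa1 * kappa2 / n_heavy.

    ASSUMED definition (standard Kier form); kappa1 and kappa2 are the
    first- and second-order kappa shape indices, n_heavy the heavy atom count.
    """
    if n_heavy <= 0:
        raise ValueError("n_heavy must be a positive heavy-atom count")
    return kappa1 * kappa2 / n_heavy


# Made-up descriptor values for a hypothetical 30-heavy-atom molecule
print(kier_phi(kappa1=20.0, kappa2=9.0, n_heavy=30))
```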
*** Table 1 goes here ***
2.3 Abraham Descriptors and the ABSOLV Linear Model for Predicting Solubility
Abraham introduced five solvation descriptors: A, B, Sπ, E, and V [16, 26]. Two of these constitute H-bond potentials: A is the H-bond acidity (donor strength) and B is the H-bond basicity (acceptor strength) of the solute. Sπ is the dipolarity/polarizability, E is an excess molar refraction in units of (cm3/mol)/10, and V is the McGowan characteristic molar volume in units of (cm3/mol)/100. Values of the descriptors were calculated from 2D structures using the ABSOLV algorithm [26] (cf., www.acdlabs.com) and are listed in Table 1 for the new drugs.
Abraham and Le [16] amended the ABSOLV model to predict intrinsic solubility (log molar):
log S0ABSOLV = d0 + d1 A + d2 B + d3 Sπ + d4 E + d5 V + d6 A∙B (6)
The independent variables are the five solute descriptors, plus the cross product of the H-bond terms. The seven d-coefficients were determined by PLS regression, using the training set database, exclusive of the new drugs set. Quaternary ammonium drugs and drugs with MW > 800 Da were each treated separately. The rest of the molecules were divided into four acid-base classes according to the predominant charge state at pH 7.4: acids(-), bases(+), neutrals(0), and zwitterions(±), as was done previously [10]. For each class, separate sets of d-coefficients were determined by PLS regression.
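The class-specific use of Eq. 6 can be sketched as a lookup of the d-coefficients followed by a linear sum. All names and numbers below are illustrative: the coefficients and descriptor values are placeholders, not the PLS-fitted values or Table 1 entries from this study.

```python
def absolv_log_s0(descr: dict, d: tuple) -> float:
    """Eq. 6: linear Abraham model with an A*B cross term.

    descr holds the solute descriptors A, B, S (for S-pi), E and V;
    d is the sequence (d0, ..., d6) for the solute's acid-base class.
    """
    a, b, s, e, v = (descr[k] for k in ("A", "B", "S", "E", "V"))
    return (d[0] + d[1] * a + d[2] * b + d[3] * s
            + d[4] * e + d[5] * v + d[6] * a * b)


# PLACEHOLDER coefficients (not fitted). Each class (acid, base, neutral,
# zwitterion) would have its own entry in a real coefficient table.
D_BY_CLASS = {
    "neutral": (0.5, -1.0, 2.0, -0.5, -1.0, -3.0, 0.5),
}

# Made-up descriptors for a hypothetical neutral solute
solute = {"A": 0.5, "B": 1.0, "S": 2.0, "E": 1.5, "V": 2.0}
print(absolv_log_s0(solute, D_BY_CLASS["neutral"]))
```

Keeping the coefficients in a per-class table mirrors the separate regressions described above, while the functional form stays the same for every class.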
2.4 Statistical Machine Learning Random Forest Regression (RFR) Model
The RFR open-source ‘randomForest’ library for the R statistical software was downloaded from https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm. The method works by constructing an ensemble of hundreds of decision trees employing about 200 RDKit-generated molecular descriptors [27]. The method was re-trained with the presently enlarged database, excluding the newly-approved drugs.
2.5 Sources of Solubility Data for the Test (New Drugs) and Training (Wiki-pS0 Database) Sets
The 2021 mini-review of FDA drug approvals by Mullard [1] was a convenient starting point to identify the new drugs and to begin the search for their solubility values. Since the drugs are new, there are hardly any journal publications reporting properties of the compounds. Almost all the data were found in FDA filing documents. As part of the New Drug Application (NDA) process, the FDA Center for Drug Evaluation and Research (CDER, www.accessdata.fda.gov) publishes reports listing some physicochemical properties of compounds under consideration.
There was virtually no experimental detail about the measurements in the published regulatory reports. Many of the reported solubility values are of drugs in water (Sw), with saturation pH not reported. When the temperature was not stated or was reported as ‘room’ or ‘ambient’, it was assumed to be 23 °C for the purpose of calculations here. Given the dearth of experimental detail, it is a challenge to assess the quality of the reported measurements in most of the FDA reports. Nevertheless, there are high quality data in some of the documents, where solubility measurements were published as a function of pH. Examples of some of these are presented below.
Of the 36 small-molecule NMEs approved in 2021, 39 independent quantitative solubility measurements were found for only 28 NMEs [28–62], since some solubility data are redacted or presented only as qualitative values (e.g., ‘insoluble’, ‘poorly soluble’, ‘very soluble’) in FDA reports. The reported values were transformed into the intrinsic solubility scale, S0, using known (or predicted when unavailable) pKa values, and adjusted to 25 °C [63] using the program pDISOLX (inADME Research) [64–70]. Table 1 lists the solubility data (normalized as intrinsic values), along with the pKa values used in the data analysis at the temperatures of measurement.
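To illustrate what the transformation to the intrinsic scale involves in the simplest case, the sketch below applies the Henderson-Hasselbalch relation to a monoprotic acid or base. This is a simplified stand-in and not the pDISOLX procedure: it assumes a single ionizable group, a known saturation pH, no salt precipitation and no aggregation, and it omits the temperature adjustment. The function name and example values are hypothetical.

```python
import math


def log_s0_monoprotic(log_s: float, ph: float, pka: float, kind: str) -> float:
    """Convert log S measured at a known pH to intrinsic log S0.

    Simplified Henderson-Hasselbalch model (ASSUMED, monoprotic only):
      acid: S = S0 * (1 + 10**(pH - pKa))
      base: S = S0 * (1 + 10**(pKa - pH))
    """
    if kind == "acid":
        ionized_ratio = 10.0 ** (ph - pka)
    elif kind == "base":
        ionized_ratio = 10.0 ** (pka - ph)
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    return log_s - math.log10(1.0 + ionized_ratio)


# Hypothetical acid: pKa 4.0, log S = -2.0 measured at pH 6.0
print(log_s0_monoprotic(-2.0, ph=6.0, pka=4.0, kind="acid"))
```

In this hypothetical case the acid is about 100-fold ionized at the measurement pH, so the intrinsic value lies roughly two log units below the reported solubility.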
The Wiki-pS0 intrinsic aqueous solubility database of mostly druglike molecules (currently with 7655 deeply-curated entries) was used to train the ABSOLV, GSE(Φ,B) and RFR models. Several hundred values from the database have already been published [10–13, 63–74], and the entire database is currently being prepared for publication as a book. The newly-approved drugs were used as external test sets and were excluded from the training process.
The structures of the 28 new drugs considered here are shown in Fig. 1. In dual-API drug products, each API was treated as a separate ‘drug’ in the data analysis.
*** Fig. 1 goes here ***
2.6 Sources of Octanol-Water Partition Coefficients (clogP) and Melting Points (mp)
Values of clogP were used in Eqs. 1 and 5 in place of experimental log P values. These were calculated by the Wildman-Crippen sum of atomic contributions method in the open-source RDKit chemoinformatics library [27]. Experimental mp values were employed where available or were calculated otherwise [75].