3.1. Development and validation of prediction models
3.1.1. Individual RT and CCS model development
There is no assumption of an underlying variance structure with the multivariate MARS analysis, and there was no facility to define one within the earth package at the time of implementation. However, for the univariate analyses, a linear model variance structure was defined. This meant the standard deviation was estimated as a function of the predicted response and, hence, allowed for the construction of prediction intervals.
When the goal is to predict future values, prediction intervals must be used rather than confidence intervals. A prediction interval is wider than a confidence interval and, at the 95% level, provides bounds within which 95% of future observations are expected to fall.
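As an illustration of how a variance structure of this kind leads to prediction intervals, the following minimal sketch may help. It assumes (this is not stated in the text) that the absolute residuals are regressed linearly on the predicted response and rescaled to a standard deviation under normally distributed errors. The function names and the toy data are hypothetical and do not reproduce the earth implementation.

```python
import math
from statistics import NormalDist


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_variance_model(predicted, observed):
    """Assumed variance model: |residual| as a linear function of the prediction."""
    abs_resid = [abs(o - p) for o, p in zip(observed, predicted)]
    return fit_line(predicted, abs_resid)


def prediction_interval(pred, var_model, level=0.95):
    """Symmetric interval pred +/- z * sd, with sd taken from the variance model."""
    a, b = var_model
    # For normal errors E|e| = sd * sqrt(2/pi), hence sd = E|e| * sqrt(pi/2).
    sd = max(a + b * pred, 0.0) * math.sqrt(math.pi / 2)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return pred - z * sd, pred + z * sd


# Toy data (hypothetical, not from the paper): the spread grows with the response.
predicted = [100.0, 150.0, 200.0, 250.0, 300.0]
observed = [101.0, 148.0, 203.0, 246.0, 305.0]

var_model = fit_variance_model(predicted, observed)
low_small, high_small = prediction_interval(100.0, var_model)
low_large, high_large = prediction_interval(300.0, var_model)
print(f"interval at 100: {low_small:.1f} to {high_small:.1f}")
print(f"interval at 300: {low_large:.1f} to {high_large:.1f}")
```

In this sketch the interval widens as the predicted response increases, because the estimated standard deviation depends on the prediction itself.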
All analyses considered the full set of 1666 molecular descriptors as candidate model inputs. The assumptions of normality, linearity and homoscedasticity were assessed for the univariate models, to which those assumptions apply. The univariate MARS fit to RT violated the assumptions of linearity and homoscedasticity, so a square root transform was applied, after which the assumptions were reasonably met.
In summary, three different univariate models were developed for the prediction of RT (Equation 1), CCS data for (de)protonated molecules (CCSH) (Equation 2) and CCS data for sodium adducts (CCSNa) (Equation 3). As an example and to assist with interpretation, in equation 1, the term 0.099·max(0,(nDB-3)) is equal to 0 for nDB ≤ 3, and equal to 0.099·(nDB-3) for nDB > 3.
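The hinge term used as an example above can be sketched in a few lines. Only the coefficient 0.099 and the knot at nDB = 3 are taken from the text; the function names are hypothetical, and the sketch covers this single term rather than the full equation.

```python
def hinge(x, knot):
    """MARS hinge function max(0, x - knot)."""
    return max(0.0, x - knot)


def ndb_term(n_db, coef=0.099, knot=3):
    """Contribution of the term 0.099 * max(0, nDB - 3)."""
    return coef * hinge(n_db, knot)


# The term is zero up to the knot and grows linearly beyond it.
for n_db in (1, 3, 5):
    print(n_db, ndb_term(n_db))
```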
The univariate models obtained a cross-validated R2 of 0.855 for the RT model, 0.966 for the CCSH model and 0.954 for the CCSNa model. Table 1 reveals that the univariate models (RT and CCSH) do not share a single descriptor, which supports the argument that univariate models fit the data better than the previously explored multivariate model.


3.1.2. RT, CCSH and CCSNa model validation
MARS models were fitted using 3-fold cross-validation with thirty iterations. This procedure splits the data into three sections, fits the model to two of those sections (training data) and then tests the accuracy of the resulting model on the final section (test data). The procedure is repeated thirty times, each time randomly dividing the data into three sections. The measure of accuracy used to assess goodness of fit is the cross-validated R2, which is the average R2 value obtained across all thirty iterations when the fitted model was applied to the test data. This value is usually lower than the R2 for the best model fit, but a dramatic drop suggests volatility in the data or overfitting in the modelling procedure.
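The repeated cross-validation scheme described above can be sketched as follows. This is a minimal illustration only: a simple straight-line fit stands in for the MARS model, the data are synthetic, and all function names are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (stand-in for the MARS fit)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(y_true, y_pred):
    """Coefficient of determination evaluated on held-out data."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def repeated_kfold_r2(x, y, k=3, repeats=30, seed=0):
    """Return the mean test-fold R2 and the individual fold scores."""
    rng = random.Random(seed)
    indices = list(range(len(x)))
    scores = []
    for _ in range(repeats):
        rng.shuffle(indices)                         # new random split each iteration
        folds = [indices[i::k] for i in range(k)]
        for test in folds:
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            a, b = fit_line([x[i] for i in train], [y[i] for i in train])
            preds = [a + b * x[i] for i in test]
            scores.append(r_squared([y[i] for i in test], preds))
    return sum(scores) / len(scores), scores


# Synthetic data (not from the paper): a linear trend with Gaussian noise.
noise = random.Random(1)
x_data = [float(i) for i in range(60)]
y_data = [2.0 * xi + 1.0 + noise.gauss(0.0, 3.0) for xi in x_data]

cv_r2, fold_scores = repeated_kfold_r2(x_data, y_data)
print(f"cross-validated R2 over {len(fold_scores)} test folds: {cv_r2:.3f}")
```

With three folds and thirty repetitions, the average is taken over ninety test-fold R2 values.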
As an additional validation step, and to obtain an overview of model performance, RT and CCS data were predicted for the molecules used for model development. Comparing predicted and empirical RT data (Figure 1A, top) showed that the average deviation obtained with the RT model (Equation 1) was ± 0.72 min (Table 2), and 95% of the predictions fell within ± 2.32 min. Deviations in the predicted data were also normally distributed around 0% deviation (marked as a red line in Figure 1A, bottom). The 95% interval is narrower than those of previously developed models (± 4.0 min using a logKow predictor [20], ± 2.80 min using ANNs [21]) and in line with the ANN model developed by Mollerup et al. (over ± 2 min deviation) [13]. The model presented here also improves on the prediction accuracy of Barron et al., who obtained an average deviation of ± 1.02 min [18]. As another way of presenting prediction accuracy, Figure 2 plots the predicted vs. empirical data with the 95% prediction intervals (blue coloured area) for the univariate MARS analysis of RT. Only approximately 8% of predicted RT values were more than 2 min away from the empirical ones.
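Deviation summaries of the kind reported in Table 2 can be computed along the following lines. The text does not spell out the exact definitions, so this sketch assumes that the average deviation is the mean absolute deviation and that the 95% figure is the 95th percentile of the absolute deviations. The function names and the toy values are hypothetical.

```python
def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in 0..100)."""
    ordered = sorted(values)
    rank = (q / 100.0) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def deviation_summary(predicted, empirical, relative=False):
    """Mean and 95th percentile of absolute deviations (in % if relative=True)."""
    deviations = []
    for p, e in zip(predicted, empirical):
        d = p - e
        if relative:
            d = 100.0 * d / e        # e.g. CCS deviations expressed in percent
        deviations.append(abs(d))
    return sum(deviations) / len(deviations), percentile(deviations, 95)


# Toy retention times in minutes (hypothetical, not from the paper).
empirical_rt = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_rt = [1.5, 1.8, 3.0, 4.9, 4.0]

mean_dev, p95_dev = deviation_summary(predicted_rt, empirical_rt)
print(f"average deviation: +/- {mean_dev:.2f} min")
print(f"95% of cases within: +/- {p95_dev:.2f} min")
```

Passing relative=True gives the same summary in percent, which corresponds to the way CCS deviations are expressed.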
Prediction accuracy for CCS data was also studied. The deviation observed for CCS data of [M+H]+ using the CCSH model averaged ± 1.23 %, and was within ± 4.05 % for 95% of the cases (Table 2). Figure 1B, bottom shows that deviations were randomly distributed around 0% (marked as a red line), with no bias in the predicted data. By comparison, previous models predicted CCS data for protonated molecules with an accuracy of ± 5 – 6% for 95% of the cases using ANNs [13,22], or slightly over ± 5 % deviation (95% confidence interval) using machine learning [25]. Figure 3A shows the 95% prediction intervals (blue coloured area) for the univariate MARS analysis of the CCSH model. The blue lines are placed at predicted values ± 2 Å2 and the purple lines at ± 5 Å2. The model still predicts well at higher CCS values; however, since fewer data are available in that range, the prediction intervals are much wider to accommodate the uncertainty. This improvement in accuracy could be explained by the larger database used for model development, as well as by the better fit of MARS to the empirical data compared with ANNs.
Additionally, the CCSH model was tested for the prediction of CCS values for deprotonated molecules, yielding highly accurate predictions (Figure 1C, top). For a set of 169 molecules ionized in negative mode, the differences between observed and predicted CCS for [M-H]- fell, 95% of the time, between -13.4 and 9.3 Å2, with a slight tendency to under-predict CCS values (Figure 1C, bottom). In relative terms, the average deviation for [M-H]- data was ± 2.79 % (± 5.86 % for 95% of the cases, Table 2). These deviations are larger than those observed for [M+H]+ data, which was expected since the model was developed with [M+H]+ data. Nevertheless, it was assumed that the CCSH model could be extended to the prediction of CCS data for [M-H]-, as no remarkable improvement was expected from a model developed exclusively for deprotonated molecules.
Ideally, a single model would predict CCS for both (de)protonated molecules and sodium adducts. Therefore, the CCSH model was also tested against [M+Na]+ data. However, high deviations were observed (± 4.77 % on average, ± 10.86 % for 95% of the cases, Table 2), which could be expected given the likely greater impact of the volume of the sodium atom on the overall CCS of the molecule. In light of these results, a separate CCS prediction model was required for [M+Na]+ data. The procedure for CCSNa model development was equivalent to the process described above (section 2.3), but used as input a dataset of 249 CCS values for [M+Na]+ ions. The accuracy of the model was likewise evaluated by comparing predicted and empirical data (Table 2). Prediction deviations were ± 2.08 % on average (± 5.25 % for 95% of the cases), a great improvement over the predictions obtained with the CCSH model. Figure 3B depicts the predicted vs. empirical CCS values with the 95% prediction intervals (blue coloured area) for the univariate MARS analysis of the CCSNa model. The fact that different predicted values can be obtained for protonated molecules and sodium adducts is of great help when both species are observed empirically for a suspect substance: confidence in the tentative identification increases when both observed CCS values match the predicted data.
The CCSNa model presented here also improves on the prediction accuracy of the model previously developed by the authors [22]. In that work, we evaluated the performance of the ANN predictive model for sodium adducts and found that deviations between predicted and empirical data were below 8.7% for 95% of the cases. The development of a dedicated MARS model for sodium adducts therefore improves the prediction accuracy.
3.1.3. Blind testing of the models
Several reference standards were purchased as part of different research projects during the development of the MARS-based predictors and were therefore not included in the training and validation datasets. These compounds were used to verify the utility of our prediction models for chemicals not previously considered in the training steps, and thus to assess whether the models can be applied to upcoming RT and CCS predictions of real suspect compounds. To this end, we calculated deviations between predicted and empirical data for this dataset and compared them with the accuracies previously calculated at different percentiles (shown in Table 2). Table 3 lists the empirical and predicted values of RT and CCS for the different adducts observed for the additional set of 25 reference standards, together with the deviation between them. The RT predictions are generally in agreement with the empirical data, with the 95th percentile of the observed deviations (± 4.15 min) being in the same range as that observed during validation. Furthermore, the vast majority of CCS values for [M+H]+ are in agreement with the values calculated using the CCSH model. For these compounds, 95% of the cases showed deviations below ± 3.71 %, an even better result than that obtained for the initial database during model validation. Only 3,4-dichloroaniline shows a deviation greater than 4%, which could be explained by the small CCS value calculated. For CCSNa, higher deviations are observed, specifically for di(2-ethylhexyl) terephthalate and vildagliptin (-8.61% and 8.74%, respectively). These deviations could be explained by particular structural features of the molecules, such as the adamantyl group in vildagliptin, which is large and rigid, or the high rotatability of the alkyl chains in di(2-ethylhexyl) terephthalate.
However, if these adducts are treated as outliers, 95% of the CCSNa values show deviations within ± 3.15 %, in good agreement with the data obtained during model validation. Finally, for [M-H]-, only a small set of molecules was gathered, and all of them fit well within the ± 5.8 % deviation.
3.2. Open access prediction platform
To aid future researchers working with UHPLC-IMS-HRMS, a free online platform incorporating these models has been released. The models are available to the scientific community at https://datascience-adelaideuniversity.shinyapps.io/Predicting_RT_and_CCS/ . Figure 4 illustrates the layout of the online platform for the prediction of RT and CCS for both (de)protonated molecules and sodium adducts.
The platform is user-friendly and easy to follow. As an example, the step-by-step procedure to obtain a prediction for omeprazole is shown. First, the parameter to be predicted must be selected (Figure 4A). In this case, CCS for the protonated molecule is selected by indicating ‘Select Response: Collision Cross Section’ and ‘Sodiated: No’. After the appropriate descriptors for the molecule of interest have been downloaded using Dragon v5.4 integrated within OChem (www.ochem.eu) [37], they can be entered in the corresponding editable fields (Figure 4B). The CCS value can then be predicted, and the output is shown together with its corresponding prediction interval (Figure 4C). In this case, the predicted CCS value for the protonated molecule of omeprazole is 181.51 Å2, with a prediction interval of 171.93 – 190.08 Å2. The empirical value for [M+H]+ of omeprazole is 180.58 Å2, so the prediction deviated by only 0.52% from the empirical value.
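The arithmetic behind the omeprazole example can be checked with a short sketch. The numerical values are those quoted above; the variable names are hypothetical, and the deviation is assumed to be expressed relative to the empirical value.

```python
# Values quoted in the text for the [M+H]+ ion of omeprazole (in square angstroms).
predicted_ccs = 181.51
empirical_ccs = 180.58
interval_low, interval_high = 171.93, 190.08

# Relative deviation of the prediction from the empirical value, in percent.
deviation_pct = 100.0 * (predicted_ccs - empirical_ccs) / empirical_ccs

# Check whether the empirical value lies inside the reported prediction interval.
inside_interval = interval_low <= empirical_ccs <= interval_high

print(f"deviation: {deviation_pct:.2f} %")
print(f"empirical value inside prediction interval: {inside_interval}")
```

The same two checks can be applied to any suspect compound once its predicted value and prediction interval have been obtained from the platform.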
The ease of prediction and the open access of this online platform are of great help to researchers working with UHPLC-IMS-HRMS instruments who do not have an in-house prediction model.