The aim of this study was to develop ML models for the prediction of individual time-trial performance and changes therein following training in recreational cyclists, and to identify the most important predictors. Because of the small sample size, we applied techniques to reduce overfitting. The ML models predicted time-trial performance very accurately and with high internal validity, indicating that these predictions generalize well to unseen data, both at baseline (R2 = 0.923, MAE = 0.183 W/kg) and after training (R2 = 0.758, MAE = 0.338 W/kg). Changes in time-trial performance, however, were more difficult to predict, including on unseen data (R2 = 0.483, MAE = 0.191 W/kg). The most important predictors were identified based on feature importance, demonstrating contributions from physiological profiles, individual training load and well-being during the intervention.
4.1 Predictions of time-trial performance
Machine learning models in the present study allowed for very accurate predictions of time-trial performance at baseline and after training. Differences in individual 4-km time-trial performance at baseline were largely explained by the best ML model (R2 = 0.99) in the training set used to create the ML model. Similarly, previous cross-sectional studies have explained cycling time-trial performance using multiple linear regression models [11, 57–59]. In those prior studies, differences in 15-km to 1-h time-trial performance were also largely explained by physiological determinants in trained, well-trained and elite athletes (R2 ≥ 0.78). The main difference from these previous studies is that those models were explanatory and were not tested for their predictive performance, either using internal validation (testing the model on a subset of the collected data not used to build the model) or external validation (testing the model on newly gathered data from another cohort). In this study, we assessed predictive performance using internal validation and demonstrated high generalizability and good predictive performance of our ML model for cycling performance at baseline (R2 = 0.92, MAE = 0.18 W/kg).
To the best of our knowledge, this is the first study to predict time-trial performance after a prescribed training intervention using predictors from physiological profiles at baseline and individually monitored training load and well-being during the intervention. Our findings show that 4-km time-trial performance after a 12-week training intervention could be predicted with high accuracy and high predictive performance on unseen data (R2 = 0.76, MAE = 0.34 W/kg).
Individual changes in time-trial performance, however, were more difficult to predict. When quantifying inter-individual differences in the response to training, researchers typically visualize the inter-individual responses in a bar plot ranking responses from low to high (similar to Fig. 2). However, one should be careful not to assume that all of this variability reflects true inter-individual differences in trainability [60]. In fact, part of the inter-individual differences is due to measurement error, (random) within-subject variability and regression to the mean [60–64]. Regression to the mean is the phenomenon whereby participants with relatively high or low values at baseline are naturally expected to regress towards the mean at follow-up [60]; a simple simulation of this effect is sketched below. Within-subject variability, in turn, can be partly explained by behavioural or environmental factors, such as sleep, stress, nutrition and circadian rhythm [64]. Not recognizing these three factors confounds the interpretation of differences in trainability, as well as of responders and non-responders to a training intervention [62].
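To make this concrete, the following minimal simulation (illustrative numbers only, not data from this study) shows how measurement error alone produces apparent "responders" and "non-responders" when participants are grouped on their baseline scores:

```python
# Hypothetical illustration of regression to the mean: two noisy measurements
# of the same underlying ability, with zero true training effect.
import numpy as np

rng = np.random.default_rng(42)
n = 1000
true_ability = rng.normal(4.0, 0.4, n)   # assumed "true" power output (W/kg)
noise_sd = 0.14                          # assumed measurement error (W/kg)

baseline = true_ability + rng.normal(0, noise_sd, n)
followup = true_ability + rng.normal(0, noise_sd, n)  # no real change occurred

# The lowest baseline quartile appears to "improve" and the highest to
# "decline", purely because extreme baseline scores are partly noise.
low = baseline < np.quantile(baseline, 0.25)
high = baseline > np.quantile(baseline, 0.75)
print(f"mean change, lowest quartile:  {(followup[low] - baseline[low]).mean():+.3f} W/kg")
print(f"mean change, highest quartile: {(followup[high] - baseline[high]).mean():+.3f} W/kg")
```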
In this study, we demonstrated how ML analyses can help to gain more insight into these inter-individual differences in responses to training. Approximately 50% of the changes in time-trial performance in the test set could be explained by our ML model using predictors from physiological profiling and monitored training load and well-being. We are aware of only one other study that used ML analyses to predict individual changes in time-trial performance, in elite short track speed skaters following 3 months of training [22]. There, the best model explained approximately 20% of the differences in time-trial performance. The higher explained variance in the present study could be related to the fact that we also included predictors on comprehensive physiological profiles and on well-being, such as sleep and stress, that accounted for part of the within-subject variability in our ML model [61, 64]. In summary, using an ML approach to predict performance after training provides novel insights into performance optimization and inter-individual differences in the response to training.
4.2 Most important predictors of time-trial performance
It is well known that endurance performance can be largely explained by V̇O2max, V̇O2 at the lactate threshold, performance V̇O2 and efficiency [65–67], as well as by underlying skeletal muscle determinants [65]. Accordingly, previous studies have identified key predictors of time-trial performance across different (longer) distances: ventilatory or lactate thresholds for 40-km performance [68]; performance V̇O2 and gross efficiency for 1-h performance [58]; V̇O2max, first ventilatory threshold, cycling economy and NIRS-derived recovery kinetics for 25-km performance [57]; performance V̇O2, blood oxygen-carrying capacity and leg oxygenation for 15-km performance [11]; and mitochondrial oxidative capacity, submaximal blood lactate and leg oxygenation for 26-km performance [59]. In contrast to these longer time trials, 4-km time-trial performance has a greater anaerobic energy contribution [69], which may involve different key predictors.
To the best of our knowledge, this is the first study to reveal predictors of time-trial performance before and after a 12-week training intervention based on ML modelling. Based on the feature importance of our ML models, we found predictors similar to those in the literature on endurance performance, such as performance V̇O2, ventilatory thresholds, power at V̇O2max, gross efficiency and leg oxygenation. However, we also observed other important predictors, including lean body mass, body fat percentage, skinfolds and leg circumferences, as well as predictors related to Wingate power output or jump height, reflecting the greater importance of sprint-related capacities. Additionally, we demonstrated important predictors derived from training impulse scores, resting heart rate and sleep duration during the intervention. Interestingly, the ML model for changes in time-trial performance also indicated other predictors related to training and well-being, illustrating how sickness and psychological well-being, as well as compliance during the intervention, may impact training adaptations. It should be noted that these predictors should be interpreted with caution, as the prior feature selection meant that not all candidate predictors were included in the best model. Still, these findings support the notion that key predictors of endurance performance may differ with the distance/duration of the time trial and that within-subject variability related to sleep, stress and health also contributes to the prediction of training adaptations [61, 64]. In summary, ML modelling can provide new insights into key predictors of endurance performance.
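As an illustration of how such feature importances can be obtained, the sketch below uses permutation importance on a random forest in scikit-learn; the data, model settings and feature indices are placeholders, not the study's actual pipeline:

```python
# Minimal permutation-importance sketch (synthetic data, illustrative settings).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=40, n_features=10, noise=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)

# Permutation importance: the drop in score when one predictor is shuffled,
# which breaks that predictor's relationship with the outcome.
result = permutation_importance(rf, X_te, y_te, n_repeats=30, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"± {result.importances_std[i]:.3f}")
```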
4.3 Machine learning with small sample sizes: how to avoid overfitting?
ML is typically performed with large datasets and becomes more challenging with small sample sizes, as the risk of overfitting increases. Robust evaluation is imperative for assessing the generalizability of ML models, and even more so when training and test samples are small [70, 71]. To obtain unbiased performance estimates on unseen data, regardless of sample size, researchers are advised to perform nested cross-validation and train-test split validation [72]. Cross-validation has also been recommended for hyperparameter tuning to avoid overfitting and improve generalizability [73]. It is important to note that k-fold cross-validation techniques may lead to biased results and should be avoided with small sample sizes [74]. Correspondingly, we completely separated training and test data sets using train-test splits for the evaluation of model performance, and applied nested leave-one-out cross-validation to obtain the best hyperparameter settings for each model. Another important factor to consider is the feature-to-sample ratio [72]. To reduce this ratio, one can perform feature selection prior to modelling or apply ML algorithms that reduce the number of features (e.g. by regularization with glm and principal components with pcr). In this study, we did both; the sketch below illustrates this validation scheme.
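A minimal sketch of this validation scheme, assuming a scikit-learn implementation (the study itself used the glm, pcr and rf algorithms; the data, grid values and feature counts here are illustrative only):

```python
# Train-test split for evaluation, with leave-one-out cross-validation nested
# inside the training set for hyperparameter tuning and feature selection.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, LeaveOneOut, train_test_split
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=40, n_features=60, n_informative=8,
                       noise=0.3, random_state=1)

# 1) Completely separate training and test data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# 2) Feature selection inside the pipeline reduces the feature-to-sample
#    ratio without leaking information from the test set.
pipe = Pipeline([
    ("select", SelectKBest(f_regression)),
    ("rf", RandomForestRegressor(random_state=1)),
])
param_grid = {"select__k": [5, 10, 20], "rf__max_features": [0.3, 0.6, 1.0]}

# 3) Leave-one-out cross-validation on the training set tunes hyperparameters.
search = GridSearchCV(pipe, param_grid, cv=LeaveOneOut(),
                      scoring="neg_mean_absolute_error").fit(X_tr, y_tr)

# 4) The held-out test set provides the internal-validation estimate.
print("best params:", search.best_params_)
print("test MAE:", np.mean(np.abs(search.predict(X_te) - y_te)))
```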
Overfitted ML models are overly complex and fit random noise in the training data, which makes their predictions on unseen test data less accurate. Overfitting can be reflected by a large increase in error from the training to the test set, together with poor model performance on the test set. We did not observe this for modelling time-trial performance at baseline or after training. However, we did observe overfitting for the changes in cycling performance, especially for the glm algorithm. For changes in time-trial performance, the best out-of-sample performance was obtained by the rf algorithm using hyperparameter tuning and feature selection, reducing the feature-to-sample ratio by ~50%. Another way to look at overfitting is to compare the prediction errors observed for the test set with the typical error that can be expected for the target. In our case, this typical error can be calculated from the coefficient of variation for time-trial performances (3.3% [75]) and the standard deviation in baseline performance of our participants [64], yielding a typical error of 0.135 W/kg. Interestingly, the MAEs of our best ML models at baseline (0.183 W/kg), after training (0.338 W/kg) and for changes in time-trial performance (0.191 W/kg) were only slightly higher than this typical error. This supports the idea that unbiased performance estimates can be obtained with ML regardless of sample size, when applying techniques to reduce overfitting.
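Continuing the sketch above (and reusing its fitted `search` and data splits), such an overfitting check can be as simple as comparing training and test errors:

```python
# Overfitting diagnostic for the sketch above: a large train-to-test error gap
# combined with a poor test score indicates overfitting.
from sklearn.metrics import mean_absolute_error, r2_score

mae_train = mean_absolute_error(y_tr, search.predict(X_tr))
mae_test = mean_absolute_error(y_te, search.predict(X_te))
print(f"train MAE = {mae_train:.3f}, test MAE = {mae_test:.3f}, "
      f"gap = {mae_test - mae_train:.3f}, "
      f"test R2 = {r2_score(y_te, search.predict(X_te)):.2f}")
```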
4.4 Perspectives and limitations
On average, changes in time-trial performance were small and did not differ between the four types of training interventions. This may be explained by the relatively small sample size combined with the large inter-individual variability in responses to training. As mentioned, this variability could be partly due to measurement error, within-subject variability and regression to the mean [60, 62–64]. Moreover, training supervision, compliance and prior training volume could have influenced the inter-individual variability in training responses. Many training sessions were supervised, but moderate-intensity endurance sessions were not (although heart rate was monitored throughout all cycling sessions). Compliance was used as a predictor in the ML modelling rather than as a cut-off to exclude participants who completed fewer training sessions. Finally, the training volume that cyclists were accustomed to prior to the study could have influenced their subsequent responses to training. Therefore, selecting a group of athletes that is homogeneous in terms of training history, together with strict training supervision, is advised for future studies.
The absence of group differences in changes in time-trial performance could also be due to the chosen outcome measure, 4-km time-trial performance. Previous studies have reported 8% improvements in 40-km cycling time-trial performance [43] and 10% improvements in 10-km running performance [44] based on a similar training impulse, following 6 and 10 weeks of polarised endurance training, respectively. Whereas those studies focused solely on endurance training, we investigated the effects of concurrent strength and endurance training and therefore assessed a 4-km time trial, which requires a greater anaerobic energy contribution than longer time trials [69]. Concurrent plyometric and polarised training resulted in a 1.8% increase in 5-km running time-trial performance in well-trained runners after 8 weeks of training [12], which is comparable to the 2.3% increase in our concurrent eccentric/plyometric and polarised training group. Notably, the magnitude of adaptations in these shorter time trials is much closer to the random error for measuring time-trial performance (typically 2–3%, although lower in athletes [75, 76]), making it more difficult to detect group differences in the changes in performance.
Performing ML with small sample sizes, as is common for training studies, can benefit from techniques to increase the sample size. Rather than upsampling records by duplication, one could generate synthetic datasets that contain more observations with similar variable distributions and correlation structure, thereby augmenting the data available to existing ML algorithms [77, 78]; a minimal example is sketched below.
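As one simple possibility (the generators cited in [77, 78] are more sophisticated), synthetic observations can be drawn from a multivariate normal distribution fitted to the real data, which preserves its means and correlation structure:

```python
# Minimal synthetic-data augmentation sketch, assuming roughly multivariate
# normal predictors; X_real stands in for the study's actual predictor matrix.
import numpy as np

rng = np.random.default_rng(0)
X_real = rng.normal(size=(40, 6)) @ rng.normal(size=(6, 6))  # correlated stand-in

mu = X_real.mean(axis=0)
cov = np.cov(X_real, rowvar=False)

# Draw new observations with the same mean vector and covariance matrix.
X_synth = rng.multivariate_normal(mu, cov, size=400)
X_augmented = np.vstack([X_real, X_synth])
print(X_augmented.shape)  # (440, 6)
```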
Predictive performance of all ML models was reported for both the training and test sets. Typically, the best model is selected based on the training set (or on a validation set during model development). For predictions at baseline and after training, the best models based on the training set also performed best on the test set. However, this was not the case for changes in time-trial performance, as these models were overfitting. It may therefore be better to also examine model performance on the test set, taking the internal validity of the model into account, especially when the sample size is too limited to create a sufficiently large validation set during model development (with leave-one-out cross-validation, the validation set comprised only a single sample in our case). Subsequently, these best models can be externally validated as additional proof of their generalizability.
4.5 Practical applications
This study shows that predictive modelling using machine learning may aid coaches, athletes and sport scientists in their pursuit of performance optimization. New insights can be obtained when athletes assess their comprehensive physiological profiles before and after specific training blocks, and when they monitor their training load and well-being (including sleep and stress) during these blocks. ML models can not only provide individual predictions of performance after training, but also give new insights into the predictors that play an important role in establishing these predictions. Ultimately, such an ML approach could help coaches to determine the most promising training strategies for their individual athletes, guiding personalized advice for talent development and performance optimization.