Training load responses modelling in elite sports: how to deal with generalisation?

This study aims to provide a transferable methodology in the context of sport performance modelling, with a special focus on model generalisation. Data were collected from seven elite short-track speed skaters over a three-month training period. To account for training load accumulation across sessions, cumulative responses to training were modelled by impulse, serial and bi-exponential response functions. The variable dose-response (DR) model was compared to elastic net (ENET), principal component regression (PCR) and random forest (RF) models, using cross-validation within a time-series framework. ENET, PCR and RF models were fitted either individually (M_I) or on the whole group of athletes (M_G). The root mean square error criterion was used to assess model performance. ENET and PCR models provided significantly greater generalisation ability than the DR model (p = 0.012, p < 0.001, p = 0.005 and p < 0.001 for ENET_I, ENET_G, PCR_I and PCR_G, respectively). Only ENET_I, ENET_G and RF_I were significantly more accurate in prediction than DR (p = 0.020, p < 0.001 and p = 0.043, respectively). In conclusion, ENET achieved greater generalisation and predictive accuracy. Thus, building and evaluating models within a generalisation-enhancing procedure is a prerequisite for any predictive modelling.


Introduction
The relationship between training load and performance in sports has been studied for decades. A key point of performance optimisation is the training prescription delivered by coaches, physical trainers or the athletes themselves. Such programming involves both various modalities of exercise (i.e. the type of training with respect to the physical quality required to perform) and an adjusted training load. Training load is usually dissociated into i) an external load, defined by the work completed by the athlete independently of his or her internal characteristics 1 , and ii) an internal load, corresponding to the psycho-physiological stresses imposed on the athlete in response to the external load 2 .
Models of training load responses emerged with the impulse response model promoted by Banister et al. 3 to describe human adaptations to training loads. Afterwards, a simplified version of the original model, built on a two-way antagonistic first-order transfer function (fitness and fatigue components, the so-called Fitness-Fatigue model), attracted considerable interest for describing the training process [4][5][6][7][8] . However, several limitations regarding model stability, parameter interpretability, ill-conditioning and predictive accuracy were reported 9,10 . Such models may be considered non-linear according to their component structure 11 and therefore require a sufficient number of observations (i.e. performances) to correctly estimate relationships between training load and performance 9,12 . To overcome some of these limits, refinements of the former impulse response model were proposed, using a recursive algorithm to estimate parameters according to each model input (i.e. the training load) 11 and introducing variations in the fatigue response to a single training bout 13 . Further adaptations of the Fitness-Fatigue model were also developed with the aim of improving both goodness-of-fit and prediction accuracy 14,15 . Nonetheless, because impulse-response models aiming to predict training effects in both endurance (running, cycling, skiing and swimming) 6,11,[16][17][18][19][20][21] and more complex (hammer throw, gymnastics and judo) 8,22,23 activities condense the physiological processes underpinning exercise into a small number of entities, prediction accuracy may greatly suffer 24 . Moreover, these models assume that the training effect is maximal by the end of the training session. This assumption is reasonable only for the negative component of the model (i.e. "Fatigue"), whose value is maximal immediately after the session.
Regarding the positive effects induced by training (i.e. "Fitness"), such a response is questionable since physiological adaptations continue after the end of the exercise session. For instance, skeletal muscle adaptations to training, described by increases in muscle mass, fibre shortening velocity and modifications of myosin ATPase activity, are known to be progressive (i.e. short- to long-term after-effects) rather than instantaneous [25][26][27] . Consequently, serial and bi-exponential functions were proposed to counteract these limitations and better describe training adaptations through exponential growth and decay functions, in line with physiological responses observed in rats 28 .
A more statistical approach was used to investigate the effects of training load on performance, using principal component analysis and linear mixed models over different time frames 12 . Such models infer parameters from all available data (i.e. combining subjects instead of fitting by-subject models) but allow parameters to vary with respect to heterogeneity between athletes. Since the model is multivariate, the multi-faceted nature of performance can be preserved by including psychological, nutritional and technical information as predictors 12,16,18 . However, the authors did not consider the cumulative facet of daily training loads, for which exponential growth and decay cumulative functions such as those proposed by Candau et al. 17 may be suitable for performance modelling.
Alternatives from the computer science field were also used to clarify the training load - performance relationship with a predictive aim. Most notably, machine learning approaches usually focus on the generalisation of models (i.e. a model's ability to make accurate predictions on unknown data). Various approaches tend to maximise this criterion. For instance, one can perform cross-validation (CV) procedures, where data are separated into training sets for parameter estimation and testing sets for prediction 29 . Such a procedure fosters the determination of optimal models, relative to the family of models considered and with regard to their ability to generalise. At the same time, CV procedures allow diagnosing under- and over-fitting of the model. Underfitting commonly describes an inflexible model unable to capture noteworthy regularities in a set of exemplary observations 30 . In contrast, overfitting represents an over-trained model, which tends to memorise each particular observation, thus leading to high error rates when predicting on unknown data 31 . Since the aforementioned studies aimed to describe the training load - performance relationship by estimating model parameters and testing the model on a single data set, generalisation of the models cannot be ensured. This challenges their usefulness in a predictive application. On the other hand, modelling methodologies using CV procedures are the standard for a predictive, rather than merely descriptive, aim. To our knowledge, only a few recent studies modelled performances with Fitness-Fatigue models using a CV procedure 10,32,33 , and one separated data into two equal training and testing data sets 34 . Two of them reported an overfitting of the model at the expense of accurate predictions 10,33 . Consequently, interpretations drawn from predictions as well as model parameters may be incorrect.
The physiological adaptations induced by exercise being complex, some authors investigated the relationship between training and performance using Artificial Neural Networks (ANN), non-linear machine learning models 35,36 . Despite the low prediction errors reported (e.g. a 0.05-second error over a 200 m swimming performance 35 ), methodological considerations in their study (mostly a small sample size) and the "black-box" nature of ANN question their use in sport performance modelling 9,37 . Computer science offers plenty of machine learning models, although the field is often reduced to ANN for physical performance prediction. Some algorithms might be preferred for sport performance modelling because they prevent high-dimensionality and multicollinearity issues. To cite a few, non-linear approaches such as Random Forest (RF) models 38 and regularised linear regressions 39,40 have proved their efficiency and could be profitable in a sport performance context.
To date, no consensus exists about which model family (i.e. impulse response and physiologically based, statistical or machine learning models) should be preferred for physical performance prediction from a given data set, mainly due to a lack of evidence and confidence in training effect modelling and performance prediction accuracy. In addition, because generalisation ability is not systematically appraised, practical and physiological interpretations drawn from models may be incorrect and should at least be taken with caution.
Beyond the model employed to elucidate the relationship between training load and performance in a predictive application, we hypothesised that modelling approaches including CV procedures, regularisation and dimension-reduction methods would lead to greater generalisation than former impulse response models.
In order to prescribe optimal training programming, sport practitioners need to understand the physiological effects induced by each training session and its after-effects on physical performance. Hence, this study aimed to provide a reliable and transferable methodology relying on model generalisation in a context of sport performance modelling. To do so, the variable dose-response (DR) model 13 was considered as a baseline framework and compared to two regularisation methods and one machine learning regression model: a principal component regression (PCR), an Elastic net (ENET) regression and an RF regression model, respectively.

Participants
Seven national elite short-track speed skaters (mean age 22.7 ± 3.4 years; 3 males, body mass 71.4 ± 9.4 kg, and 4 females, body mass 55.9 ± 3.9 kg) voluntarily participated in the study. The athletes had competed at the 2018 Olympic Winter Games in PyeongChang, South Korea (n = 2) or were preparing for the Olympic Games in Beijing, China (n = 7). The whole team was trained by the same coach, responsible for training programming and data collection. Mean weekly training volume was 16.6 ± 2.5 hours. Data were collected over a three-month training period, beginning one month after resuming training for a new season. Participants were fully informed about data collection and written consent was obtained from them. The study was performed in agreement with the standards set by the Declaration of Helsinki (2013) involving human subjects. The protocol was reviewed and approved by the local research Ethics Committee (EuroMov, University of Montpellier, France). The present retrospective study relied on the collected data without causing any change in the athletes' training programming.

Dependent variable: Performance
Participants performed standing-start time trials (distance = 166.68 m, i.e. 1.5 laps) each week after a standardised warm-up. At the finish line, a photocell timing system (Brower timing system, USA) recorded individual time-trial performance. A total of n = 248 performances were recorded, for an average of 35.4 ± 2.23 performances per athlete. The performance test being a gold standard for the assessment of acceleration ability 41 , all athletes were familiar with it prior to the study. In the sequel, let $\mathcal{Y} \subset \mathbb{R}$ be the domain of definition of such a performance and $Y \in \mathcal{Y}$ the random variable. In this context, each observation $y_j \in \mathcal{Y}$ can be referenced by both its athlete $i$ and its day of realisation $t$ as $y_{i,t}$.

Independent variables
Independent variables stem from various sources, which are summarised in Table 1. In the sequel, let $\mathcal{X} \subset \mathbb{R}^d$ with $d \in \mathbb{N}$ be the domain of definition of the random variable $X = [X_1, \ldots, X_d] \in \mathcal{X}$. The variable $X$ is thus defined as a vector composed of the independent variables detailed hereafter. First, $\{X_1\}$ refers to the raw training loads (TL, Figure 1c), calculated from on-ice and off-ice training sessions (see details in the Supplementary material Appendix). Then, $\{X_2, X_3\}$ represent two aggregations of daily TL. Those aggregations come from the daily training loads $w(t)$ (also denoted $X_1$ here) convoluted with two transfer functions adapted from Philippe et al. 28 , denoted $g_{imp}(t)$ and $g_{ser}(t)$.
The associated impulse response $g_{imp}(t)$ reflects the acute response to exercise (e.g. fatigue) and is defined as an exponential decay with time constant $\tau_I$:

$g_{imp}(t) = e^{-t/\tau_I}$

The serial response $g_{ser}(t)$ combines an exponential growth phase followed, after a time delay $T_D$, by an exponential decay phase:

$g_{ser}(t) = 1 - e^{-t/\tau_G}$ for $t < T_D$, and $g_{ser}(t) = (1 - e^{-T_D/\tau_G})\, e^{-(t - T_D)/\tau_D}$ for $t \geq T_D$

The time delay $T_D$ ensures the decay phase begins only after the growth phase; here, $T_D = 4\tau_G$. Both $\tau_G$ and $\tau_D$ are the time constants of, respectively, the growth phase and the decline phase, with $\tau_G = 1$ day and $\tau_D = 7$ days (Figure 1b). Note that the time constants $\tau_I$, $\tau_G$, $\tau_D$ were averaged and based on empirical knowledge and previous findings 13 . Hence, for a given athlete,

$X_2(t) = (w * g_{imp})(t), \quad X_3(t) = (w * g_{ser})(t)$

where the symbol $*$ denotes the convolution product. Similarly, some characteristic components of each session were aggregated. This encompasses the Rate of Perceived Exertion (RPE) $\{X_4, X_5\}$, average power $\{X_6, X_7\}$, maximal power output $\{X_8, X_9\}$, relative intensity $\{X_{10}, X_{11}\}$, session duration $\{X_{12}, X_{13}\}$ and session density $\{X_{14}, X_{15}\}$. Also, for each session, ice quality $\{X_{16}\}$ and rest between two consecutive sessions $\{X_{17}\}$ were considered. Since some models may benefit from time through autocorrelated performances $y_{i,t}$, the preceding performance $y_{i,t-k}$ with $k = 1$ was included as a predictor, denoted $\{X_{18}\}$. Finally, the athlete $\{X_{19}\}$ was considered, except for individually built models.
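The aggregation of daily training loads by convolution with a transfer function can be sketched as follows. This is a minimal Python illustration (the study's computations were done in R); the functional forms follow the growth/decay description in the text with $\tau_G = 1$ day, $\tau_D = 7$ days and $T_D = 4\tau_G$, while the value of $\tau_I$ is an assumption for demonstration.

```python
import numpy as np

def g_imp(t, tau_i=1.0):
    """Impulse response: exponential decay (acute response, e.g. fatigue)."""
    return np.exp(-t / tau_i)

def g_ser(t, tau_g=1.0, tau_d=7.0):
    """Serial response: exponential growth, then delayed exponential decay."""
    td = 4 * tau_g  # delay before the decay phase begins
    growth = 1.0 - np.exp(-t / tau_g)
    decay = (1.0 - np.exp(-td / tau_g)) * np.exp(-(t - td) / tau_d)
    return np.where(t < td, growth, decay)

def cumulative_load(w, g):
    """Discrete convolution (w * g)(t), truncated to the training horizon."""
    t = np.arange(len(w), dtype=float)
    kernel = g(t)
    return np.convolve(w, kernel)[: len(w)]

w = np.array([100.0, 0.0, 80.0, 120.0, 0.0, 90.0, 60.0])  # daily TL (a.u.)
x2 = cumulative_load(w, g_imp)   # {X2}: impulse-aggregated load
x3 = cumulative_load(w, g_ser)   # {X3}: serial-aggregated load
```

Note that the serial kernel is continuous at $t = T_D$ by construction, so the aggregated load transitions smoothly from the growth phase into the decay phase.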
Applied to the observed data of this study, a data set of n = 248 observations of performances associated with 19 independent variables was obtained (see Table 1). To formalise, assuming that $(X, Y) \sim f$ with $f$ a density function, the built data set is a sample $S = \{(x_j, y_j)\}_{j \leq n} \sim f^n$.

Modelling Methodology
Formally, the goal is to find a function $h : \mathcal{X} \to \mathcal{Y}$ which minimises the generalisation error

$R(h) = \mathbb{E}_{(X,Y) \sim f}\left[(h(X) - Y)^2\right]$

In practice, the minimisation of $R$ is unreachable. Instead, we get a sample set $S = \{(x_i, y_i)\}_{i \leq n} \in \mathcal{X} \times \mathcal{Y}$ and define the empirical error

$R_e(h) = \frac{1}{n} \sum_{i=1}^{n} (h(x_i) - y_i)^2$

The objective becomes to find the best estimate $\hat{h} = \mathrm{argmin}_{h \in \mathcal{H}}\, R_e(h)$ with $\mathcal{H}$ the class of functions we accept to consider. Four families of models are evaluated in this context. With the exception of the DR model, all models were computed both individually and collectively ($h_I$ and $h_G$, respectively).

Reference: Variable dose-response
The non-linear mathematical model developed by Busso 13 was considered as the model of reference. Formally, and according to the previously introduced notation, this model is a function $h^{(busso)} : X_1 \to \mathcal{Y}$. It describes the training effects on performance over time, $y(t)$, from the raw training loads $X_1$. TL are convoluted with a pair of transfer functions $g_{apt}(t)$ and $g_{fat}(t)$, relating respectively to the aptitude and fatigue impulse responses:

$y(t) = y_0 + k_1 \sum_{i=1}^{t-1} w(i)\, e^{-(t-i)/\tau_1} - \sum_{i=1}^{t-1} k_2(i)\, w(i)\, e^{-(t-i)/\tau_3}, \quad \text{with } k_2(i) = k_3 \sum_{j=1}^{i} w(j)\, e^{-(i-j)/\tau_2}$

The five parameters of the model (i.e. $k_1$, $k_3$, $\tau_1$, $\tau_2$ and $\tau_3$) are fitted by minimising the residual sum of squares (RSS) between modelled and observed performances,

$RSS = \sum_{t \in T} \left(y(t) - \hat{y}(t)\right)^2$

where $t \in T$ denotes a day on which performance is measured. A non-linear minimisation was employed according to a Newton-type algorithm 42 .
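A minimal Python sketch of this fitting procedure is given below (the study used R with a Newton-type optimiser; Nelder-Mead is used here for robustness of the illustration). The data, starting values and parameter scales are made up for demonstration, and parameter bounds are omitted for brevity.

```python
import numpy as np
from scipy.optimize import minimize

def dr_predict(params, w, p0):
    """Predicted performance under the variable dose-response structure."""
    k1, k3, tau1, tau2, tau3 = params
    n = len(w)
    pred = np.full(n, float(p0))
    for t in range(1, n):
        i = np.arange(t)  # days 0 .. t-1
        # the fatigue gain k2(i) varies with accumulated load (the "variable" part)
        k2 = np.array([
            k3 * np.sum(w[: j + 1] * np.exp(-(j - np.arange(j + 1)) / tau2))
            for j in i
        ])
        aptitude = k1 * np.sum(w[i] * np.exp(-(t - i) / tau1))
        fatigue = np.sum(k2 * w[i] * np.exp(-(t - i) / tau3))
        pred[t] = p0 + aptitude - fatigue
    return pred

def rss(params, w, perf, days, p0):
    """Residual sum of squares on the days a performance was measured."""
    return float(np.sum((perf - dr_predict(params, w, p0)[days]) ** 2))

rng = np.random.default_rng(0)
w = rng.uniform(0.0, 100.0, 60)          # 60 days of synthetic training loads
days = np.arange(5, 60, 7)               # days with a measured performance
perf = 100.0 + rng.normal(0.0, 2.0, len(days))
res = minimize(rss, x0=[0.05, 0.001, 30.0, 20.0, 5.0],
               args=(w, perf, days, 100.0),
               method="Nelder-Mead", options={"maxiter": 300})
```

The nested sum over `k2` is what distinguishes the variable dose-response model from the classical Fitness-Fatigue model, where the fatigue gain is a constant.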
Unlike this model of reference, the models presented next benefit from the augmented data space $\mathcal{X}^* = \mathcal{X} \setminus X_1$.

Regularisation procedures
Elastic net. In high-dimensional contexts, multivariate linear regressions may lead to unstable models, being excessively sensitive to the expanded space of solutions. To tackle this issue, cost functions exist that penalise some parameters on account of correlated variables. On one side, the Ridge penalisation reduces the space of possible functions by constraining the parameters, shrinking their amplitude towards almost-null values. On the other side, the Least Absolute Shrinkage and Selection Operator (LASSO) penalisation has the capacity to set parameter coefficients exactly to zero. The ENET regularisation combines both the Ridge and LASSO penalisations 39 . Hence, the multivariate linear model $h^{(enet)} : \mathcal{X}^* \to \mathcal{Y}$ is

$y = x^\top \beta + \varepsilon_t$

with $x \in \mathcal{X}^*$ the predictors, $\beta \in \mathbb{R}^d$ the model parameters and $\varepsilon_t$ the error term. The regularisation stems from the optimisation of the objective

$\hat{\beta} = \mathrm{argmin}_{\beta} \left\{ \left\| y - x^\top \beta \right\|_2^2 + \lambda \left[ \alpha \|\beta\|_1 + \frac{1 - \alpha}{2} \|\beta\|_2^2 \right] \right\}$

where $\alpha \in [0, 1]$ denotes the mixing parameter which balances the Ridge and LASSO regularisations, and $\lambda \geq 0$ denotes the strength of the penalty. For $\alpha = 0$ and $\alpha = 1$, the model uses a pure Ridge and a pure LASSO penalisation, respectively. Thus, as $\alpha \to 1$ for a fixed value of $\lambda$, the number of removed variables (null coefficients) increases monotonically from 0 up to that of the most reduced LASSO model. The model was adjusted through the hyper-parameters $\alpha$ and $\lambda$ during model selection, as part of the CV process (described below).
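An elastic net fit can be sketched as follows. This is an illustrative Python example (the study used R); note that scikit-learn's `l1_ratio` plays the role of the mixing parameter $\alpha$ above and its `alpha` argument plays the role of the penalty $\lambda$. The hyper-parameter values reuse those reported later for the group model ($\alpha = 0.28$, $\lambda = 0.02$), but the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 19))                 # 19 predictors, as in the study
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Standardise predictors before penalised regression, then fit the elastic net.
model = make_pipeline(
    StandardScaler(),
    ElasticNet(alpha=0.02, l1_ratio=0.28),     # lambda = 0.02, mixing alpha = 0.28
)
model.fit(X, y)

coefs = model.named_steps["elasticnet"].coef_
n_removed = int(np.sum(coefs == 0.0))          # LASSO-like exact-zero coefficients
preds = model.predict(X)
```

In practice, `alpha` and `l1_ratio` would be tuned inside the time-series CV loop rather than fixed a priori.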
Principal component regression. In this multivariate context with potential multicollinearity issues, principal component analysis aims to project the original data set from $\mathcal{X}^*$ into a new space $\tilde{\mathcal{X}}^*$ of orthogonal dimensions called principal components. These dimensions are built from linear combinations of the initial variables. One may use the principal components to regress the dependent variable: this is known as Principal Component Regression (PCR). The regularisation is performed by using as regressors only the first principal components, which by construction retain the maximum of variance of the original data. In our study, and according to Kaiser's rule 43 , the $p$ principal components with an eigenvalue higher than 1 were retained and further used in the linear regression. Such a model, $h^{(pcr)} : \tilde{\mathcal{X}}^* \to \mathcal{Y}$, can be defined as a linear multivariate regression over principal components:

$y = \tilde{x}^\top \beta + \varepsilon_t$

with $\tilde{x} \in \tilde{\mathcal{X}}^* \setminus \{\tilde{X}^*_{p+1}, \ldots, \tilde{X}^*_d\}$ the predictors, $\beta \in \mathbb{R}^p$ the model parameters and $\varepsilon_t$ the error term. In addition to being a regularisation technique using only a subset of principal components, PCR also exerts a discrete shrinkage effect on the low-variance components (those with the lowest eigenvalues), nullifying their contribution in the original regression model.
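The PCR procedure with Kaiser's rule can be sketched as below (illustrative Python, synthetic data; the study used R and retained $n_{comp} = 3$ components on its own data).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 19))
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200)

# Standardise, then project onto principal components.
Xs = StandardScaler().fit_transform(X)
pca = PCA().fit(Xs)

# Kaiser's rule: keep components whose eigenvalue exceeds 1.
eigenvalues = pca.explained_variance_
p = int(np.sum(eigenvalues > 1.0))

# Regress the dependent variable on the first p component scores only.
scores = pca.transform(Xs)[:, :p]
reg = LinearRegression().fit(scores, y)
```

Discarding the components beyond the $p$-th is what produces the discrete shrinkage effect described above: their contribution to the regression is set exactly to zero.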

Random Forest
A Random Forest model consists of a large number of regression trees operating as an ensemble. RF is random in two ways: (i) each tree is grown on a random subset of observations and (ii) each split within each tree is created from a random subset of candidate variables. The overall prediction of the forest is defined as the average of the predictions of the individual trees 44 . In this study, the size of the random subset of variables and the number of trees were the two hyper-parameters used to adjust the model within the model selection. The model is a function $h^{(rf)} : \mathcal{X}^* \to \mathcal{Y}$.
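A sketch of this setup in Python (illustrative; the study used R). The two tuned hyper-parameters in the text map onto `n_estimators` (number of trees) and `max_features` (size of the random subset of candidate variables at each split); the values and data below are assumptions for demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 19))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.2, size=200)

rf = RandomForestRegressor(
    n_estimators=500,   # number of trees in the ensemble
    max_features=6,     # candidate variables considered at each split
    bootstrap=True,     # each tree sees a random subset of observations
    random_state=0,
)
rf.fit(X, y)
pred = rf.predict(X)    # ensemble prediction = average over the trees
```

Both sources of randomness decorrelate the trees, which is what makes averaging their predictions reduce variance.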

Time series cross-validation and prediction

Data were separated, with respect to the time index, into one training data set for time-series CV (80% of the total data) and the remaining data for an unbiased model evaluation (evaluation data set). In this procedure, model selection occurs first, searching for the hyper-parameter values that minimise the model's predictive error over the validation subsets. The procedure is detailed in Algorithm 1.

Require:
A data set of m time-ordered elements, $S = \{(x_j, y_j)\}_{j \leq m}$
The number of partitions into which to split the training data, $K \leq (m - s_{min})/s_{val} + 1 \in \mathbb{N}$
The minimum size of the training set within a partition, $s_{min} \geq 1$
The size of the validation set within a partition, $s_{val} \geq 1$
A class of functions, $\mathcal{H}$
Ensure:

Algorithm 1 iteratively evaluates a class of functions $\mathcal{H}$, in which each function $h^{(i)}$ differs in its hyper-parameter values.
A time-ordered data set $S$ is partitioned into training and validation subsets ($S_{train}$ and $S_{valid}$, respectively). For each partition $k$ with $k \in \{1, \ldots, K\}$, the functions $h^{(i)}$ are fitted on the incremental $S_{train}$ and evaluated on the fixed $S_{valid}$ subset that occurs after the last element of $S_{train}$. Once the functions $h^{(i)}$ have been evaluated on the $K$ partitions of $S$, the function $h^{(i^*)}$ that provides the lowest root mean square error (RMSE) averaged over the validation subsets defines the optimal model, denoted $h^*$. Afterwards, and for each partition of $S$, $h^*$ is adjusted on new time-ordered training subsets $S'_{train}$ which combine both $S_{train}$ and $S_{valid}$. Hence, the generalisation capability of $h^*$ is evaluated on fixed-length subsets of the evaluation data, saved for that purpose.
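The rolling-origin scheme of Algorithm 1 can be sketched as follows (illustrative Python; the split sizes, candidate models and data are assumptions for demonstration, and scikit-learn's `TimeSeriesSplit` implements a similar scheme).

```python
import numpy as np

def time_series_splits(m, k, s_min, s_val):
    """Yield (train_idx, valid_idx) pairs over m time-ordered observations.

    The training window grows by s_val at each partition; the validation
    window has fixed size s_val and always follows the training window.
    """
    for j in range(k):
        train_end = s_min + j * s_val
        if train_end + s_val > m:
            break
        yield np.arange(train_end), np.arange(train_end, train_end + s_val)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy model selection: pick the candidate with the lowest mean validation RMSE.
y = np.linspace(0.0, 1.0, 50) + np.random.default_rng(0).normal(0, 0.05, 50)
candidates = {
    "last_value": lambda tr: y[tr][-1],      # naive persistence forecast
    "train_mean": lambda tr: np.mean(y[tr]), # constant mean forecast
}
scores = {}
for name, h in candidates.items():
    errs = [rmse(y[va], np.full(len(va), h(tr)))
            for tr, va in time_series_splits(len(y), k=5, s_min=20, s_val=5)]
    scores[name] = float(np.mean(errs))
best = min(scores, key=scores.get)
```

Because the series trends upward, the persistence forecast tracks the validation window better than the lagging training mean, so it is selected as $h^*$ here.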

Statistical analysis
For each model, the goodness of fit according to linear relationships and to performance was described by the coefficient of determination (R 2 ) and the RMSE criterion, respectively. Generalisation ability is described by the difference between the RMSE computed on the training and testing data. The prediction error was reported through the Mean Absolute Error (MAE) between observations and predictions. After checking normality and variance homogeneity of the dependent variable by a Shapiro-Wilk and a Levene test, respectively, linear models were used to assess differences in model generalisation and differences in predictive performance for each group-computed model compared to the DR model. For individual model comparisons, linear mixed models were used to account for between-subject variability in performance modelling by including athletes as a random effect. The significance threshold was fixed at p < 0.05. Unstandardised regression coefficients b are reported along with the 95% confidence interval (CI) as a measure of effect size. Model computation and statistical analysis were conducted with the R statistical software (version 4.0.2). The DR model was computed with a custom-built R package (version 1.0) 45 .
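The evaluation criteria named above are simple to compute; a minimal Python sketch of the definitions used in this analysis (the study computed them in R) is:

```python
import numpy as np

def rmse(y, yhat):
    """Root mean square error: goodness of fit / prediction criterion."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    """Mean absolute error between observations and predictions."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.mean(np.abs(y - yhat)))

def r2(y, yhat):
    """Coefficient of determination."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def generalisation_gap(train_rmse, test_rmse):
    """Train/test RMSE difference: smaller gap = better generalisation."""
    return test_rmse - train_rmse
```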

Results
Through the time-series CV, models provided heterogeneous generalisation and prediction performance. Distributions of RMSE per model are illustrated in Figure 2.

Models generalisation
Comparisons to the DR_I model showed significant differences and a greater generalisation capability for both the ENET and PCR models, but not for the RF models (regression coefficients and 95% CIs for all comparisons, including RF_I and RF_G, are reported in Table 2). Therefore, the ENET and PCR models computed on the overall data were the most generalisable models, followed by the individual ENET and PCR models. A summary of pairwise model comparisons is provided in Table 2.

Prediction performances
For all models, the RMSE observed on the evaluation data set showed significant differences from the reference DR_I, with the exception of PCR_I and RF_G (p = 0.072 and p = 0.088, respectively; see Table 2). The ENET_G model provided the greatest performance (RMSE = 0.176 ± 0.012, MAE = 0.15 ± 0.01 and R 2 = 0.179 ± 0.063), closely followed by RF_G.
Only PCR_G showed lower prediction performance than DR_I. Distributions of RMSE on the evaluation data showed heterogeneous variance between models. The greatest standard deviations were found for DR_I and PCR_G, with s = 0.053 and s = 0.062, respectively. The ENET, PCR_I and RF models provided more consistent performances, with lower standard deviations (see Table 3).
Predictions made from the two most generalisable models, ENET_G and PCR_G, and the reference DR_I illustrate the sensitivity of the models for a representative athlete (Figure 3). Performances modelled by the DR_I model were relatively steady and less sensitive to real performance variations. Standard deviations calculated on the data used for model evaluation supported such smooth prediction, with s = 0.015, s = 0.071 and s = 0.062 for DR_I, PCR_G and ENET_G, respectively. Hyper-parameters of the PCR and ENET models were n_comp = 3 and a = 0.28, l = 0.02, respectively.

Discussion
In the present study, we provided a modelling methodology that encompasses data aggregation relying on physiological assumptions and model validation for future predictions. Data were obtained from elite athletes, capable of improving their performance through training while being very sensitive to physical, psychological and emotional states. The variable dose-response model 13 was fitted on individual data. It was compared to statistical and machine learning models fitted on individual and on overall data: the ENET, PCR and RF models. Cross-validation outcomes revealed significant heterogeneity in model performance. The main criterion of interest, generalisation, was significantly greater for both the ENET and PCR models than for the DR_I model. One can explain this result by the ability of the statistical models to better capture the underlying skating performance process using up to 19 independent variables when associated with regularisation methods. Conversely, the DR_I model relies on two antagonistic components strictly based on the training load dynamics. It does not deal with any other factors that may greatly impact performance (e.g. psychological, nutritional, environmental, training-specific factors) 12,18,46 . Thus, such a conceptual limit can be overcome by employing multivariate modelling, which may result in a greater comprehension of the training load - performance relationship for the purpose of future predictions 9,12 .
Distributions of RMSE from the training and evaluation data sets allow us to establish a generalisation ranking of the models (Table 2). Linear models computed on the overall data offer better generalisation. This finding is essential because, by handling the bias-variance trade-off, such models are better suited to capturing a proper underlying function that maps inputs to the target even on unknown data. Hence, it allows further physiological and practical interpretations from the models. Sample size might be one explanation for this behaviour. It is well known that statistical inference on small samples leads to poor estimates and consequently to poor predictive performance 47,48 . A greater sample size obtained by combining individual data led to more accurate parameter estimates, being more suitable for sport performance modelling 12 . This is particularly important to consider when we aim to predict very few discipline-specific performances throughout a season. However, predicting non-invasive physical quality assessments that can be performed daily (e.g. squat jumps and their variations for an indirect assessment of neuromuscular readiness 49 , short sprints) may be an alternative to small-sample-size issues. Also, regularisation tends to stabilise parameter estimators and favour generalisation of the models. For instance, multicollinearity may occur in high-dimensional problems, and stochastic models generally suffer from such conditioning. One would note that the ENET and PCR models attempt to overcome these issues in their own ways: (i) by penalising or removing features (or both) that are mostly linearly correlated and (ii) by projecting the initial data space onto a reduced space, optimised to keep the maximum of the data variance from linear combinations of the initial features. Both approaches limit the number of unnecessary, or noisy, dimensions.
In contrast, in this study non-linear machine learning models expressed a lower generalisation capability than linear models, even when the models combined data from several athletes. We believe that such models may be powerful in multidimensional modelling but require an adequate data set, in particular one with a sufficient sample size. Otherwise, the model may overfit, leading to inaccurate predictions on unknown data.
As reported previously, and with the exception of PCR_G, models were more accurate in prediction than DR_I (Table 3). The large average RMSE as well as the large standard deviations of the DR_I across performance criteria tend to agree with the literature, since the model is prone to weak stability and ill-conditioning raised by noisy data, which impact its predictive accuracy 9,10 . This suggests that the linear relationships between the two components, "Aptitude" and "Fatigue", and the performance are not clear. However, given the lack of cross-validation procedures applied to impulse-response models, and particularly to the DR model employed in our study, our results cannot be validly compared with the literature. Despite the lower standard deviations of R 2 reported for the ENET and PCR models, the weak average R 2 values suggest that linear models can only explain a small part of the total variance. Therefore, if the data allow it (i.e. a sufficient sample size and robustly collected data), non-linear models may still be effective and should be considered during the modelling process.
The sensitivity of the models to gains and losses of performance differed between the two most generalisable models, ENET_G and PCR_G, and the reference DR_I. Such differences can be explained by the influence of variables other than the training load dynamics that may affect performance (e.g. ice quality on the day of performance, cumulative training loads following serial and bi-exponential functions, the last known performance) or by a failure of the DR_I model in parameter estimation given the variability of the data. However, this applied example informs us neither about the generalisation ability of the models nor about the accuracy of the predictions, because it concerns only a particular set of data, in which the selected models are trained on the first 80% of the data and evaluated on the remaining 20%.
This study presents some limits. The first concerns the data we used, and particularly the criterion of performance: standing-start time trials performed a few times a week during an approximately three-month period. Even though it is a highly discipline-specific test with which athletes were familiar, conducted in standardised conditions, each trial requires high levels of arousal, buy-in and motivation. Therefore, monitoring of psychological states and cognitive functions such as motivation and attentional focus 50,51 should have been done prior to performing each trial. Secondly, the time-series cross-validation presented here has a certain cost, most notably when only few data are available (e.g. when models are individually computed). The rolling-origin re-calibration evaluation, performed as described by Bergmeir et al. 52 , implies model training only on an incremental sub-sequence of the training data. Hence, the reduced sample size of the first training sub-sequences may cause model failures in parameter estimation and, consequently, an increase in prediction errors. In addition, the training and evaluation data sets present some dependencies. In order to evaluate models on fully independent data, some modifications of the current CV framework exist, at the expense of withdrawing even more data from the learning procedure. According to Racine 53 , the so-called hv-block cross-validation is one of the least costly alternatives to the CV used in our study, requiring a certain gap between each training and validation subset. However, due to the limited sample size, we deliberately chose not to adapt the original CV framework described in Algorithm 1. Nonetheless, we recommend that researchers and practitioners consider such alternatives in case of significant dependencies and when the sample size is sufficient.
Finally, backtesting was performed in order to evaluate model performance on historical data. From a practical point of view, the models are able to predict the upcoming performance given the features known until day t. However, the contribution of training load response modelling also concerns the simulation of training after-effects over a longer time frame. Having identified a suitable model, simulating the independent variables within their own distributions would allow practitioners and coaches to simulate changes in performance following objective and subjective measures of training load, and any performance factors that are monitored. Conditional simulations that consider known relationships between independent variables (e.g. relationships between training load parameters) 54,55 may improve the credibility of the simulations.
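The simulation idea sketched above can be illustrated with a simple Monte Carlo loop: draw future training loads from their (here, empirical) distribution and propagate each scenario through the selected model to obtain a distribution of predicted performances. Everything below is an assumption for demonstration, including the placeholder `fitted_model`, which stands in for a previously selected model $h^*$.

```python
import numpy as np

rng = np.random.default_rng(7)
hist_loads = rng.uniform(40.0, 120.0, size=90)   # historical daily loads (a.u.)

def fitted_model(loads):
    """Placeholder h*: a toy mapping from a block of loads to a race time (s)."""
    m = np.mean(loads)
    return 17.0 - 0.005 * m + 0.00002 * m ** 2

n_sims, horizon = 1000, 14
sim_perf = np.empty(n_sims)
for s in range(n_sims):
    # Bootstrap one 14-day training scenario from the historical distribution.
    future = rng.choice(hist_loads, size=horizon, replace=True)
    sim_perf[s] = fitted_model(future)

# Uncertainty band a coach could act on.
lo, hi = np.percentile(sim_perf, [2.5, 97.5])
```

Conditional simulation would replace the independent bootstrap draws with draws that respect known dependencies between predictors (e.g. between volume and intensity), as discussed in the text.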

Conclusion
In this study, we provided a transferable modelling methodology that relies on the evaluation of model generalisation ability in a context of sport performance modelling. The mathematical variable dose-response model, along with Elastic net, principal component regression and random forest models, was cross-validated within a time-series framework. The DR model was outperformed in generalisation by the ENET and PCR models. The ENET model provided the greatest performance both in terms of generalisation and prediction accuracy. Increasing the sample size by computing models on the whole group of athletes led to better-performing models than the individually computed ones. The methodology highlighted in our study can be re-employed whatever the data, with the aim of optimising elite sport performance through training protocol simulations. Moreover, we believe that model validation is a prerequisite for any physiological interpretation made for the purpose of future predictions. Further research involving training session simulations and model evaluation in forecasting would highlight the relevance of some model families for training programming optimisation.