Mental fatigue (henceforward, fatigue) is a psychobiological state characterized by a decline in performance and a decreased motivation for effort, and is usually caused by engaging in cognitively demanding activities for extended periods1,2. Fatigue is a complex state that involves changes in brain activity (e.g., in the norepinephrine and dopaminergic systems)3–5, subjective feelings (e.g., lack of energy) and cognitive performance6–8. Because fatigue impairs cognitive performance, it is highly relevant to human safety. For example, the risk of work-related and road accidents is substantially higher when people are fatigued9,10. Accordingly, both the prevention and the detection of fatigue are imperative11. In line with this, research on biomarkers of fatigue has become important and suggests that fatigue can be estimated from markers that reflect the activity of the cerebral cortex (e.g., theta activity obtained by neuroimaging techniques) or the autonomic nervous system (e.g., Heart Rate Variability), in order to prevent its negative consequences12.
Machine learning is a relatively novel way of utilizing biomarkers to detect fatigue, and this approach has captured the attention of both science and practice13–15. A few studies used biological signals obtained by electroencephalography (EEG) to train machine learning models that are capable of effectively detecting mental fatigue (i.e., classification models that successfully distinguish between fatigued and non-fatigued states; for a review, see16). Even though the high accuracy (> 80–90%) of these models is impressive, EEG has several limitations, such as the difficult and time-consuming procedure of setting up the electrodes and its sensitivity to external electromagnetic fields17. Due to these limitations, and because fatigue has also been associated with changes in the autonomic nervous system18, other studies have investigated whether fatigue detection is possible based on biological signals obtained by peripheral measures, for example, electrooculography19 or electrocardiography (ECG)20. Most fatigue studies utilizing ECG calculated the variation of the intervals between consecutive R-wave peaks, denoted in the literature as Heart Rate Variability (HRV)21. HRV reflects the activity of the autonomic nervous system and is a potentially reliable biomarker of fatigue, since many previous studies have confirmed the association between HRV and fatigue18,22−24.
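To make the HRV computation concrete, the following is a minimal sketch (in Python with NumPy) of how two common time-domain HRV indices, SDNN and RMSSD, are derived from a series of R-peak times; the function name and the example values are illustrative assumptions, not taken from the cited studies.

```python
import numpy as np

def time_domain_hrv(r_peak_times_s):
    """Compute two common time-domain HRV indices from R-peak times (seconds).

    SDNN: standard deviation of RR intervals (overall variability).
    RMSSD: root mean square of successive RR differences
           (short-term, vagally mediated variability).
    """
    rr_ms = np.diff(r_peak_times_s) * 1000.0   # RR intervals in milliseconds
    sdnn = np.std(rr_ms, ddof=1)
    rmssd = np.sqrt(np.mean(np.diff(rr_ms) ** 2))
    return {"SDNN": sdnn, "RMSSD": rmssd}

# Example: R-peak times from a short, artifact-free ECG segment (made-up values)
peaks = np.array([0.00, 0.81, 1.63, 2.42, 3.25, 4.05])
print(time_domain_hrv(peaks))
```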
Machine learning studies have shown that models trained exclusively on HRV are capable of effectively detecting fatigue, with the best accuracy scores ranging between 75% and 91%20,25−27. Laurent et al. (2013), for example, showed that a support vector machine (SVM) algorithm trained on HRV data recorded during the performance of a fatiguing switching task (i.e., an algorithm trained on task-related HRV data) was able to detect fatigue with an accuracy of up to approximately 80%. In contrast to this study using task-related data, another study tested algorithms on resting HRV data26. In that study, the resting period prior to prolonged task performance was labelled as the non-fatigue state, while the resting period after task performance was labelled as the fatigue state. The authors found that the k-nearest neighbors (KNN) algorithm was capable of detecting fatigue with an accuracy of approximately 75% based on four HRV indices.
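A minimal sketch of the classification setup described in these studies might look as follows, assuming scikit-learn and a synthetic placeholder feature matrix in place of real HRV data; the hyperparameters shown are illustrative, not those of the cited papers.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))        # placeholder: 4 HRV indices per sample
y = rng.integers(0, 2, size=90)     # placeholder: 0 = non-fatigued, 1 = fatigued

# Compare the two classifier families used in the cited studies
for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {acc:.2f}")
```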
The differences in predictive accuracy reported in these studies are probably caused by various factors. The most important ones might be the methodological differences across the studies, such as differences in the cognitive tasks used for fatigue induction, in the time window used for HRV calculation, in sample size (ranging between 13 and 45), or in whether the ECG was recorded during rest or during active task performance. In contrast to the time-window and sample-size factors, whose effects have been extensively studied in the machine learning literature28,29, no studies have used multiple cognitive tasks for fatigue induction or have directly compared the predictive performance of models trained on resting versus task-related HRV data, even though this would have significant implications for both practice and research. Therefore, in the present study, we extend the literature in this field by analyzing a multi-task dataset and comparing models trained on task-related as well as resting HRV data.
Besides predictive performance, models are also expected to demonstrate comprehensive generalizability, that is, to make accurate predictions on previously unseen samples30. To increase the generalizability of machine learning models, it is recommended to use a reasonably large dataset, avoid information leakage (e.g., by performing feature selection only on the training dataset), conduct cross-validation and test the models on previously unseen data31,32. In addition, one might argue that using more than one cognitive task to induce fatigue might strengthen generalizability, because different tasks affect different cognitive and affective systems33 and models trained on such heterogeneous data might be less sensitive to noise and task-specific characteristics34. In other words, models that accurately detect fatigue irrespective of the type of task at hand should be of greater value because they could be effectively utilized in a variety of situations, reflecting both higher reliability and greater usefulness in practice.
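One way to implement these leakage-safe recommendations, assuming scikit-learn, is to nest scaling and feature selection inside a pipeline so that they are fitted on training folds only; all data and parameter choices below are placeholders, not our actual analysis settings.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(85, 12))       # placeholder: 12 candidate HRV features
y = rng.integers(0, 2, size=85)     # placeholder labels

# Because scaling and feature selection live inside the pipeline, they are
# re-fitted on each training fold alone during cross-validation, so no
# information from the held-out fold leaks into preprocessing.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=4)),
    ("clf", SVC(kernel="rbf")),
])
print(cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean())
```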
In this study, we aimed to train machine learning models that achieve comprehensive generalizability and are robust to task-specific characteristics. To achieve this goal, we first combined the datasets of three fatigue-related experiments that applied different cognitive tasks requiring different cognitive operations; thus, the fatigue induction was variable. Second, for the same reason, we were able to conduct analyses on a larger dataset (n = 85) than previous studies (highest n = 45)25. Third, to avoid information leakage, the data were preprocessed after the separation of training and test data. And fourth, the models were trained using cross-validation on the training dataset, but the final evaluation was based on a previously unseen dataset to see how well the models generalize to new data. This analytic approach differs from that of previous studies20,25−27, which used cross-validation only and did not test the models on unseen holdout data.
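A sketch of this split-before-preprocessing workflow, under the assumption of a scikit-learn implementation with placeholder data (the split ratio and hyperparameter grid are illustrative, not our actual analysis settings):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(85, 10))       # placeholder: 85 participants, 10 HRV features
y = rng.integers(0, 2, size=85)     # placeholder labels

# 1) Split before any preprocessing so the holdout set stays untouched.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# 2) Tune with cross-validation on the training set only.
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_tr, y_tr)

# 3) Single final evaluation on the previously unseen holdout set.
print("holdout accuracy:", accuracy_score(y_te, grid.predict(X_te)))
```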
Besides fatigue detection, predicting the level of fatigue caused by prolonged cognitive performance is another important question that could be addressed by machine learning. In line with this, a few attempts have been made to train machine learning models on biomarkers and other variables, such as demographics or psychometric data, to predict the consequences of performing a fatiguing task35–37. Highly relevant to our study, Mun and Geng (2019) utilized machine learning to predict the level of post-experiment subjective fatigue based on various types of self-reported and biological data, including resting HRV. The most predictive features were self-reported measures (e.g., pre-experiment fatigue, anxiety), but other indices reflecting cardiac activity, such as blood pressure and the low-frequency HRV component, also contributed to the prediction of post-experiment fatigue. As in the studies using machine learning to detect fatigue, however, fatigue was induced by a single task. Consequently, the question remains unanswered whether post-experiment fatigue induced by different cognitive tasks could be predicted by models trained on pre-experiment variables.
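A regression analogue of this prediction problem could be sketched as follows; the random-forest regressor, the placeholder predictors standing in for pre-experiment self-reports and resting HRV, and the 0–10 fatigue rating scale are all illustrative assumptions rather than details of the cited study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(85, 8))      # placeholder: pre-experiment predictors
y = rng.uniform(0, 10, size=85)   # placeholder: post-experiment fatigue rating

# Evaluate a continuous fatigue prediction with cross-validated MAE
reg = RandomForestRegressor(n_estimators=200, random_state=2)
mae = -cross_val_score(reg, X, y, cv=5,
                       scoring="neg_mean_absolute_error").mean()
print(f"mean CV MAE = {mae:.2f}")
```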
In sum, the present study had three main goals. First, we trained classification algorithms to detect fatigue, and regression algorithms to predict the severity of post-experiment subjective fatigue induced by prolonged cognitive performance, based on a dataset that was heterogeneous in terms of fatigue induction. Second, we compared the predictive performance of classification models trained on resting and on task-related HRV data; to our knowledge, no previous study has made such a comparison. Third and finally, we explored the effects of time-window length to find the shortest time window that still results in accurate predictions, because the use of shorter ECG recordings would be beneficial for research as well as practice.
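As an illustration of the time-window question, the sketch below computes RMSSD over consecutive non-overlapping windows of different lengths from a simulated R-peak series; the window lengths and the simulated recording are illustrative assumptions, not our study parameters.

```python
import numpy as np

def rmssd(rr_ms):
    """Root mean square of successive RR differences (milliseconds)."""
    return np.sqrt(np.mean(np.diff(rr_ms) ** 2))

def windowed_rmssd(r_peak_times_s, window_s):
    """RMSSD over consecutive non-overlapping windows of a recording."""
    rr = np.diff(r_peak_times_s)          # RR intervals (seconds)
    ends = r_peak_times_s[1:]             # end time of each RR interval
    values = []
    t0, t_last = ends[0], ends[-1]
    while t0 + window_s <= t_last:
        mask = (ends >= t0) & (ends < t0 + window_s)
        if mask.sum() >= 2:               # need >= 2 intervals for a difference
            values.append(rmssd(rr[mask] * 1000.0))
        t0 += window_s
    return np.array(values)

# Compare candidate window lengths (seconds) on a simulated ~5-min recording
peaks = np.cumsum(np.random.default_rng(3).normal(0.8, 0.05, size=400))
for w in (30, 60, 120):
    print(w, "s windows:", windowed_rmssd(peaks, w).round(1))
```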