Reliability of the Heartbeat Tracking Task to Assess Interoception

Interoception refers to the competence in perceiving and interpreting internal sensations emerging from the body. The most common approach to assessing interoception is through cardiac interoceptive tests such as the heartbeat tracking task (HTT), which measures accuracy in perceiving and counting heartbeats during a given period. However, the literature provides little adequate reliability evidence for this measure, which may compromise the assessment of interoception. In addition to HTT accuracy, it is possible to determine sensibility (self-reported confidence) and interoceptive awareness (correspondence between accuracy and sensibility). Thus, we measured the test–retest reliability of HTT and also investigated the behavior of HTT outcomes along the task. To this end, 31 healthy adults (16 males) aged 27.8 (9.4) years performed the HTT twice, with the two sessions separated by one day. Intraclass correlation coefficient (ICC), standard error of measurement (SEM) and minimal detectable difference (MD) analyses showed 'Good' relative reliability for interoceptive accuracy (ICC = 0.880; SEM = 0.263; MD = 0.728; p < 0.001) and 'Moderate' for sensibility (ICC = 0.617; SEM = 0.648; MD = 1.797; p < 0.001) and awareness (ICC = 0.593; SEM = 0.227; MD = 0.628; p < 0.001). The absolute reliability shows low threshold values for observing true effects in HTT outcomes. The results also showed that reducing the number of HTT blocks did not impact the outcomes. The HTT proved reliable in determining interoceptive competences in healthy adults.


Introduction
Interoception can be defined as the competence in perceiving and interpreting body-derived sensations such as heartbeats, pain and temperature (Quadt et al., 2018). Usually, these sensations are related to the activity of body signals arising from the heart, lungs, thermoregulation, pain sensation and physical effort (Craig, 2002; Marcora, 2009). According to a conceptual framework developed by Garfinkel et al. (2015), interoception consists of three different dimensions, namely accuracy, sensibility and awareness. In this regard, interoceptive accuracy (IAcc) is the objective representation of perceived internal sensations, while interoceptive sensibility (ISen) is the individual's subjective evaluation regarding confidence in one's own perception of bodily signals (Garfinkel et al., 2015). The metacognitive correspondence between IAcc and ISen is interoceptive awareness (IAwa), which refers to how adequate the conscious comprehension of what has been happening internally in the body is (Garfinkel et al., 2015).
The literature has shown that detecting and interpreting internal bodily signals may play a fundamental role in physical and mental health (Khalsa et al., 2018; Machado et al., 2019; Paolucci et al., 2017). In fact, variations in health conditions have been suggested to be involved in different profiles of perceiving and reporting signs and symptoms (Murphy et al., 2017; Quadt et al., 2018) as well as behaviors (Critchley & Garfinkel, 2017). It is well established that different psychiatric disorders affect individuals' interoceptive skills in specific ways (Khalsa et al., 2018). Therefore, psychologically and/or emotionally impaired individuals are affected by interoceptive mechanisms that can act by minimizing or attenuating symptoms of disorders such as depression, anxiety, alexithymia, autism and panic, among others (Brewer et al., 2016; Dunn et al., 2010; Ehlers & Breuer, 1992; Pollatos et al., 2007a, 2007b; Schauder et al., 2015). In addition, interoceptive processing may be linked to fatigue mechanisms that arise from signals provided by physical exercise (McMorris, 2020). Given this, it seems relevant to investigate whether the measures used to assess interoceptive competences are in fact reliable in terms of maintaining stability and consistency over multiple trials.
Approaches to assess interoception have used cardiac responses (Brener & Ring, 2016). For example, as originally suggested by Schandry (1981), the Heartbeat Tracking Task (HTT) indicates interoceptive competences through the mental counting of heartbeats during a given period of time.
Despite slight changes such as the number and duration of the blocks (Brener & Ring, 2016; Garfinkel et al., 2015), HTT has been used as originally proposed, based on the proximity between measured and perceived heartbeats (Schandry, 1981). For example, individuals who count their heartbeats accurately are considered to have higher IAcc. In order to minimize individuals' attempts to predict the counted heartbeats based on time, self-report measures were used for assessing counting confidence. These were incorporated into the HTT through scales or questionnaires that evaluate the individuals' confidence in their counting (Garfinkel et al., 2015; Porges, 1993). This confidence variable allows both ISen and IAwa to be determined, providing a more robust interpretation of interoceptive competences.
The literature has raised some criticism of HTT regarding its ability to reflect interoception as an overarching concept (Zamariola et al., 2018) and whether heartbeat counting is fully related to cardiac detection (Ring & Brener, 2018). Despite the conceptual limitations of the HTT, it has been suggested that this test is capable of adequately determining interoceptive competences based on cardiac signals. Some studies have found the test to be reliable (Ehlers & Breuer, 1992; Ferentzi et al., 2018; Herbert et al., 2011; Larsson et al., 2021; Pollatos et al., 2007a, 2007b; Ring & Brener, 2018; Stevens et al., 2011; Van der Does et al., 1997; Wittkamp et al., 2018). However, some of those previous studies had design flaws (e.g. inappropriate statistical and reliability procedures) that could threaten their conclusions.
Reliability refers to the reproducibility of a test or a measure performed several times (Hopkins, 2000) between examiners (e.g. inter-rater) and within examiners (e.g. intra-rater, test-retest) (Koo & Li, 2016). In this regard, reliability analysis is important to provide information about the measurement error and statistical power, thereby being essential for estimating sample size in scientific research studies, as well as for providing parameters for interpreting measurements in clinical settings. Therefore, the aim of the current study was to determine the test-retest reliability of HTT. In addition, we observed the behavior of IAcc, ISen and IAwa along the blocks, given that different HTT variations (i.e. number and duration of the blocks) have been used in previous studies.

Participants
Thirty-one healthy adult males (n = 16) and females (n = 15) ranging from 20 to 53 yrs old voluntarily agreed to participate in this study after being invited through posts on social media (i.e. Instagram and WhatsApp). They were included if they: (1) were aged between 18 and 59 yrs; and (2) had no clinical diagnosis of neurological disorders that could impact performance in HTT. Importantly, all participants completed the research protocol, so no dropouts were registered. This research was approved by the institutional Ethics Committee in Research with human beings under registration number 4.606.418 (National Health Council number 466/12) and was conducted according to the Declaration of Helsinki.

Study Design
This is a test-retest reliability study performed in two visits separated by 24 h. In the first visit, participants signed an informed consent form before assessment of body mass, height, level of physical activity, profession, illness history and heartbeat perception. Afterward, they completed the multidimensional assessment of interoceptive awareness (MAIA) (Mehling et al., 2012). Still in the first visit, participants were familiarized with HTT (3 to 5 blocks) before performing the full task with 60 blocks 5 min later. In the second visit, they repeated HTT as in the first visit. The characteristics of the study were reported following the recommendations of the guidelines for reporting reliability and agreement studies (GRRAS) (Kottner et al., 2011).
The sample size (n = 29) was estimated through the web calculator (https://wnarifin.github.io/ssc_web.html) provided by Arifin (2021). The sample size was estimated considering a design having two replicate observations, 80% statistical power and a 5% significance level. The minimum acceptable reliability (ρ0) was set at 0.76 and the expected reliability (ρ1) at 0.91, according to the ratings by Koo and Li (2016), to ensure that the intraclass correlation coefficient (ICC) was rated as 'Good' or 'Excellent'.
It is important to emphasize that 'Moderate' ICC values (0.50-0.75) would not represent insufficient reliability. However, the range between 'Good' and 'Excellent' was used to ensure that even in the worst-case scenario, the desired ICC remains around the 'Good' rating.
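The sample size reasoning above can be illustrated with a short sketch. It assumes, without confirmation from the calculator's documentation, that the estimate follows the approximation of Walter, Eliasziw and Donner (1998) for testing H0: ICC = ρ0 against H1: ICC = ρ1, with a two-sided α. The function name `icc_sample_size` and its parameters are hypothetical and only serve the illustration.

```python
from math import ceil, log
from statistics import NormalDist


def icc_sample_size(rho0, rho1, k=2, alpha=0.05, power=0.80, two_sided=True):
    """Approximate number of participants needed to test ICC = rho0 vs ICC = rho1.

    Assumption: Walter, Eliasziw & Donner (1998) approximation;
    k is the number of replicate observations per participant.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    # Convert each ICC into a variance ratio theta = rho / (1 - rho).
    theta0 = rho0 / (1 - rho0)
    theta1 = rho1 / (1 - rho1)
    c0 = (1 + k * theta0) / (1 + k * theta1)
    n = 1 + (2 * (z_alpha + z_beta) ** 2 * k) / (log(c0) ** 2 * (k - 1))
    return ceil(n)


# Inputs taken from the text: rho0 = 0.76, rho1 = 0.91, two replicates.
print(icc_sample_size(0.76, 0.91))
```

Under these assumptions the sketch returns 29 for the two-sided case, which agrees with the reported estimate; a one-sided α would give a smaller number.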

Heartbeat Tracking Task (HTT)
Briefly, HTT assesses the individual's accuracy by measuring the difference between perceived and actual heartbeats (Garfinkel et al., 2015). The closer the count is to the actual number of heartbeats, the greater the accuracy. Participants were seated in a comfortable position with their arms resting on a comfortable support, and the oximeter (Nonin OEM, Xpod 3012LP, Minnesota, US) was placed on the right-hand middle finger. The participants performed the HTT as recommended by Larsson et al. (2021). The following instructions were given based on the aforementioned authors: "You should keep your eyes closed and try to perceive and silently count your heartbeat without manually or electronically checking your pulse. A software will give a 'Start' and 'Stop' signal so that you initiate and end your count. At the 'Stop' signal, you will be asked to report how many heartbeats you counted during that period, as well as self-report your confidence in the count".
To assess IAcc, the task was performed with 60 blocks of different durations. The blocks lasted 18 s (n = 6), 20 s (n = 48) and 22 s (n = 6) and were randomly presented to participants, as suggested by Larsson et al. (2021). Participants were instructed to mentally count their heartbeats and, to assess ISen, to report at the end of each block their level of confidence in the count (i.e. perceived heartbeats). They rated their ISen on a scale ranging from 1 to 4, with items corresponding to: (1) I did not sense my heartbeats; I am completely guessing about the number of beats; (2) I sensed something about my heart, but I had no idea what I was counting, and I have no confidence at all in my counting; (3) I sporadically or faintly picked up on my heartbeat; my counting is based on something, but it may be off by a small margin; and (4) I clearly sensed my heartbeat, and have full confidence in my count (Larsson et al., 2021). The IAwa was assessed through the correspondence between IAcc and ISen.
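The block structure described above can be sketched as follows. This is only an illustration of a randomized 60-block schedule; the names `BLOCK_PLAN` and `build_block_schedule` are hypothetical, and the software actually used to run the task is not described in the text.

```python
import random
from collections import Counter

# Block durations in seconds mapped to how many blocks use each duration.
BLOCK_PLAN = {18: 6, 20: 48, 22: 6}


def build_block_schedule(seed=None):
    """Return the 60 block durations in random presentation order."""
    rng = random.Random(seed)  # a seed makes the order reproducible
    schedule = [duration for duration, count in BLOCK_PLAN.items()
                for _ in range(count)]
    rng.shuffle(schedule)
    return schedule


schedule = build_block_schedule(seed=1)
print(len(schedule))      # number of blocks
print(Counter(schedule))  # blocks per duration
print(sum(schedule))      # total counting time in seconds
```

With this plan the total counting time is 1200 s (20 min), not including the time spent reporting counts and confidence ratings after each block.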

Statistical Analysis
A two-way mixed effects ICC analysis (ICC 3,1) assessed the relative reliability of the interoceptive measures (accuracy, sensibility and awareness), being interpreted as 'Poor' (< 0.5), 'Moderate' (0.5 to 0.75), 'Good' (0.75 to 0.9), and 'Excellent' (> 0.9) (Koo & Li, 2016). The absolute reliability was calculated using the standard error of the measurement (SEM) and the minimal difference (MD) (Brietzke et al., 2021; Weir, 2005). Importantly, the SEM represents the minimal difference necessary to identify a true effect, so that pre-to-post differences higher than the MD are interpreted as a real and clinically relevant change (Weir, 2005).
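As an illustration of these indices, the sketch below computes a single-measure, consistency ICC (3,1) from a participants × trials table, together with SEM and MD. It assumes SEM = SD × √(1 − ICC) and MD = SEM × 1.96 × √2, which are common formulations following Weir (2005); the text does not state which variants were used, although the reported MD/SEM ratios (about 2.77) are consistent with the second formula. The analysis in the study was run in SPSS, and all function names here are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def icc_3_1(scores):
    """ICC(3,1): two-way mixed effects, consistency, single measures.

    scores: list of rows, one row per participant, one column per trial.
    """
    n, k = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


def standard_error_of_measurement(scores, icc):
    """Assumed form: SD of all scores times sqrt(1 - ICC)."""
    sd = stdev(x for row in scores for x in row)
    return sd * sqrt(max(0.0, 1.0 - icc))  # guard against rounding above 1


def minimal_difference(sem):
    """Assumed form: SEM x 1.96 x sqrt(2)."""
    return sem * 1.96 * sqrt(2)


# Made-up two-trial scores for five participants (not study data).
toy = [[0.60, 0.65], [0.72, 0.70], [0.81, 0.85], [0.55, 0.50], [0.90, 0.88]]
icc = icc_3_1(toy)
sem = standard_error_of_measurement(toy, icc)
print(icc, sem, minimal_difference(sem))
```

Applying `minimal_difference` to the reported SEM values (0.263, 0.648 and 0.227) gives results within rounding of the reported MD values.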
The IAcc was calculated (Eq. 1) as proposed by Schandry (1981), while ISen was assessed as the participant's average score on the confidence scale. The IAwa was assessed through a Pearson correlation coefficient between IAcc and ISen scores, as suggested elsewhere (Garfinkel et al., 2015). A one-way repeated measures ANOVA verified whether the number of HTT blocks affected test performance, with HTT blocks grouped into three different moments (blocks 1-20, 21-40 and 41-60). Bonferroni's correction was used in multiple comparisons if F-values were significant. Moreover, a paired t-test was performed to compare blocks 1-30 versus blocks 31-60. The Bland-Altman analysis was performed to determine the data dispersion.
where: IAcc = interoceptive accuracy; actual heartbeats = real number of heartbeats; counted heartbeats = reported number of heartbeats.
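The three outcomes can be illustrated with a minimal sketch. It assumes that Eq. 1 takes the form commonly attributed to Schandry (1981), i.e. IAcc = 1 − |actual heartbeats − counted heartbeats| / actual heartbeats for each block, averaged across blocks, and that IAwa is the within-participant Pearson correlation between block-wise accuracy and confidence. Both are assumptions, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def iacc_block(actual, counted):
    """Accuracy for one block (assumed Schandry-type formula)."""
    return 1 - abs(actual - counted) / actual


def pearson_r(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def htt_outcomes(actual, counted, confidence):
    """Return (IAcc, ISen, IAwa) for one participant in one session."""
    per_block = [iacc_block(a, c) for a, c in zip(actual, counted)]
    iacc = mean(per_block)                    # mean accuracy across blocks
    isen = mean(confidence)                   # mean confidence rating (1 to 4)
    iawa = pearson_r(per_block, confidence)   # accuracy-confidence correspondence
    return iacc, isen, iawa


# Made-up data for four blocks (not study data).
actual = [25, 25, 25, 25]
counted = [15, 18, 20, 24]
confidence = [1, 2, 3, 4]
print(htt_outcomes(actual, counted, confidence))
```

In this toy example confidence rises together with accuracy, so the awareness estimate is strongly positive; a participant who gives the same confidence rating on every block would have an undefined correlation and would need separate handling.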
Results were expressed as mean, standard deviation (SD) and 95% confidence interval (CI 95%). All analyses were performed using SPSS software (v.23, IBM, New York, US) and the graphs were produced in GraphPad Prism (v.6, GraphPad Software, San Diego, US).

Results
Table 1 presents the participants' characteristics such as body mass, height, and MAIA scores. For example, 83.9% of them had already noticed changes in their heart rate (HR) at some moment in their lives. The mean (SD) HR during the task was 74.5 (10.1) bpm for trial 1 and 77.0 (8.7) bpm for trial 2. In addition, the actual and counted heartbeats were respectively 24.8 (3.4) and 16.6 (5.3) beats for trial 1, and 25.7 (2.9) and 18.9 (6.2) beats for trial 2. Importantly, there were no significant differences when HTT was divided into three parts with 20 blocks each (F(1.820; 111) = 0.005; p = 0.991) or into two parts with 30 blocks each (t(61) = 0.436; p = 0.664), thus indicating that the behavior of the IAcc, ISen and IAwa variables did not change throughout the task.
Absolute reliability analysis indicated that the IAcc, ISen and IAwa variables showed an SEM of 0.263, 0.648 and 0.227, respectively, thereby indicating the threshold values for observing true effects in HTT outcomes. Furthermore, MD results indicated that true effects in pre-to-post HTT outcomes may be detected with changes greater than 0.728, 1.797 and 0.628 in the IAcc, ISen and IAwa variables, respectively. Table 2 presents relative and absolute reliability values.
Bland-Altman analysis suggested no bias in IAcc, ISen and IAwa variables, given the roughly random data dispersion. Figure 1 depicts Bland-Altman results for IAcc, ISen and IAwa respectively.
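For readers unfamiliar with the method, a Bland-Altman analysis can be sketched as below: the between-trial difference of each participant is examined against the mean of the two trials, and the bias and 95% limits of agreement (bias ± 1.96 × SD of the differences) summarize the dispersion. This is a generic illustration with made-up data, not the GraphPad Prism procedure used in the study; the function name is hypothetical.

```python
from statistics import mean, stdev


def bland_altman(trial1, trial2):
    """Return per-participant means, differences, bias and 95% limits of agreement."""
    diffs = [a - b for a, b in zip(trial1, trial2)]
    means = [(a + b) / 2 for a, b in zip(trial1, trial2)]
    bias = mean(diffs)       # systematic difference between trials
    sd_diff = stdev(diffs)   # spread of the differences
    limits = (bias - 1.96 * sd_diff, bias + 1.96 * sd_diff)
    return means, diffs, bias, limits


# Made-up IAcc scores for five participants in two trials (not study data).
trial1 = [0.60, 0.72, 0.81, 0.55, 0.90]
trial2 = [0.65, 0.70, 0.85, 0.50, 0.88]
means, diffs, bias, limits = bland_altman(trial1, trial2)
print(bias, limits)
```

A bias close to zero and differences scattered without a visible trend across the range of means correspond to the pattern of no systematic bias described above.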

Discussion
The aim of the current study was to determine the reliability and influence of the total number of blocks on HTT outcomes. Together with non-significant differences in HTT outcomes between trials, we observed moderate-to-good relative reliability in interoceptive competences such as IAcc, ISen and IAwa, thereby suggesting that HTT can yield stable and consistent responses. Furthermore, we showed that reducing the number of blocks in the task did not exert any impact on HTT outcomes.
A relative reliability index expressed as ICC may be considered adequate to reflect the stability and consistency of repeated measures (Koo & Li, 2016). In fact, alternative relative reliability methods such as Pearson's correlation and split tests raise concerns, as errors due to small sample-derived bias and inappropriate divisions of single trials may impact the interpretation of the results (Hopkins, 2000; Weir, 2005). In this regard, caution is necessary when interpreting the results of previous studies that assessed relative reliability in HTT outcomes through approaches such as Pearson's or Spearman's correlation coefficients (Ferentzi et al., 2018; Herbert et al., 2011; Larsson et al., 2021; Pollatos et al., 2007a, 2007b; Ring & Brener, 2018; Stevens et al., 2011; Van der Does et al., 1997), given that high and significant correlation coefficients do not necessarily indicate strong reliability but, rather, a certain proportionality between trials.
According to Weir (2005), Pearson's correlation coefficient is not adequate for test-retest reliability as it cannot handle systematic errors derived from instrumentation and procedures. To the best of our knowledge, the study by Wittkamp et al. (2018) was the first to use ICC measurements to determine the reliability of the HTT. The authors found lower reliability (ICC = 0.42; CI 95% = 0.27, 0.58) than the current study (ICC = 0.880; CI 95% = 0.684, 0.949) for IAcc. This difference may be explained by the fact that our study was designed exclusively to test the reliability of HTT. In addition, we highlight the use of a more robust protocol (e.g. with 60 blocks) and, as an original contribution, the reliability measurement for sensibility and awareness. Using ICC results, we observed that interoception assessed through HTT accuracy showed good reliability, whereas the reliability of sensibility and awareness was moderate. It is worth highlighting that absolute reliability may act as a complement to relative reliability, as absolute measures can be more applicable than ICC because they reflect the measurement error while maintaining the assessed variables as originally acquired (Weir, 2005). In our results, all participants presented values below the SEM and MD thresholds for IAcc (0.263 and 0.728, respectively), indicating no true change in HTT performance between trials. As no values exceeded the SEM and MD scores, relevant absolute reliability between trials can be suggested. In contrast, the ISen results showed that 7 individuals presented values above the SEM and MD thresholds (0.648 and 1.797, respectively), thus suggesting true changes between trials for this variable. As IAwa reflects the correspondence between IAcc and ISen, its values were affected by the moderate stability and consistency shown by ISen.
Therefore, HTT may be recommended to assess IAcc with a certain level of confidence, although the ISen and IAwa variables may be less consistent. Future studies are required to investigate whether SEM and MD thresholds can be used as cutoff points to separate individuals with low and high interoception, and whether they remain similar in different populations (e.g. athletes, older adults, sedentary individuals, and emotionally or psychiatrically impaired individuals).
Establishing cutoff points for the interoception level is a very common strategy, although it still lacks substantial evidence. The way individuals are stratified according to their level of perception of internal signals (e.g. good or bad perceivers) does not seem to have sufficient theoretical robustness, as different cutoff points have been used in the literature (i.e. 0.4; 0.6; 0.65; 0.66; 0.69; 0.7; 0.8; 0.85; 0.86) to divide the sample into groups of low and high interoception (Filippetti & Tsakiris, 2017; Garfinkel et al., 2015; Herbert et al., 2012; Hill et al., 2017; Machado et al., 2019; Zamariola et al., 2018). Another concern regards the equations used to determine IAcc levels: we identified four different formulas used across studies (Chick et al., 2020; Garfinkel et al., 2015; Machado et al., 2019; Schandry, 1981). Rae et al. (2020) highlighted that using an alternative equation to the one proposed by Schandry (1981) seems to be more appropriate in cases where individuals overestimate the heartbeat count. However, no participant in the current study showed this pattern. Although these equations produce similar results, it is necessary to establish better defined criteria to determine IAcc, especially criteria that consider the measurement error and not only the mean difference between the counted and actual heartbeats.
A relevant point to be highlighted from the results of the current study is that it does not seem necessary to carry out very extensive HTT protocols. Originally popularized with three blocks and durations of 25 s, 35 s and 45 s (Schandry, 1981), HTT has undergone changes over the years either with different total number of blocks (i.e. 4, 5, 6 and 60) or durations (i.e. 25 s; 30 s; 35 s; 40 s; 45 s; 50 s; 55 s) (Garfinkel et al., 2015; Herbert et al., 2012; Larsson et al., 2021; Ueno et al., 2020). Our results suggest that performing the test with 20 or 30 blocks does not result in reduced performance. In the current study, some participants reported tiredness, somnolence, loss of concentration, loss of attention and mental fatigue during the HTT with 60 blocks. It is then suggested to perform HTT with fewer blocks to minimize possible effects of cognition loss due to sustained attention for an extended period of time.

Fig. 1 Bland-Altman analysis of HTT outcomes. IAcc interoceptive accuracy, ISen interoceptive sensibility, IAwa interoceptive awareness, a.u. arbitrary units, SD standard deviation

From the applicability viewpoint, some aspects and limitations need to be considered. In the first place, it is questionable whether HTT measures reflect the concept of interoception in its entirety. It is possible that objective tests to determine IAcc, such as HTT, measure different aspects than those offered by subjective self-report measures (i.e. MAIA, Likert scales) or other types of tests (i.e. discrimination tasks, water load test, breathing learning task). This idea is supported by the study by Murphy et al. (2020), which shows that there is a distinction between estimates of interoceptive accuracy and attention. The authors also recommend that the relationship between objective and subjective interoception measures be established if the interoceptive aspects are the same.
The simple representation through the perception of cardiac signals can generate limited interpretations, as interoception is a capability that includes other aspects such as visceral, breathing, pain, thirst, hunger and several other sensations (Craig, 2002). The study by Zamariola et al. (2018) criticized the theoretical concept behind cardiac signal perception tests, suggesting that they are not capable of providing sensitive data regarding global interoception. However, a few studies (Herbert et al., 2012;Ketai et al., 2021;van Dyck et al., 2016) have investigated the relationship between the perception of cardiac signals and other internal signals (i.e. gastric and urinary). Other studies have also investigated the relevance of interoceptive signals on physical exercise (Hill et al., 2017;Kósa et al., 2021;Machado et al., 2019).
Secondly, as highlighted in previous studies (Brener & Ring, 2016;Larsson et al., 2021), it is possible that factors such as the individual's overall intelligence, time perception capacity and prior knowledge of their heart rate may exert impacts on the HTT results. However, a well-organized protocol reduces the possibility of bias during performance of HTT, including the confidence scale used (Larsson et al., 2021), which predicts possible situations that may occur during the perception and counting of heartbeats. In the third place, the Bland-Altman analysis offers an observation of the data dispersion that corroborates the results of relative and absolute reliability, showing that the individuals remained in a concentration area with low inter-individual variation. Finally, it is possible that certain occurrences in the participants' lives (i.e. emotional stress, anxiety, poor sleep quality) as well as external environmental factors (i.e. noise, cold, heat) may impact performance during the HTT test. However, all the procedures used and the guidelines given to the participants were intended to minimize these effects.

Conclusion
HTT outcomes were reliable in determining interoceptive competences in healthy adults. While IAcc was stable (i.e. not changing within days) and consistent (i.e. not changing within hours) in relative and absolute reliability analyses, ISen and IAwa showed lower reliability levels, thus requiring caution in their application.