Intra-rater Reproducibility in Physical Function Testing Among Haemodialysis Patients: The Impact of Measurement Timing

Background. Accurate evaluation of physical function in patients undergoing haemodialysis (HD) is crucial when analysing the impact of exercise programs in this population. Objective. To evaluate the reproducibility of several physical function tests, depending on the timing of their implementation (before the HD session versus non-HD days). Design. Prospective, non-experimental, descriptive study. Methods. Thirty patients on haemodialysis were evaluated twice, one week apart. The test session was performed before the haemodialysis session started and the retest was performed on a non-dialysis day. The testing battery included the Short Physical Performance Battery, sit-to-stand-to-sit tests, six-minute walking test, one-leg stand test, timed up and go, and handgrip strength with and without forearm support. Intra-rater reproducibility was determined by intraclass correlation coefficients and agreement was assessed by Bland-Altman analysis. Results. The intraclass correlation coefficients ranged from 0.86 to 0.96, so all tests showed good to excellent relative reliability. The mean differences between trials for the sit-to-stand-to-sit 10 and 60 tests, timed up and go and all the handgrip tests were close to zero, indicating no systematic differences between trials. A large range of values between trials was observed for the six-minute walking test, gait speed, one-leg stand test and Short Physical Performance Battery, indicating a systematic bias for these four tests. Conclusion. The sit-to-stand-to-sit 10 and 60 tests, timed up and go and handgrip tests had good to excellent test-retest reliability when physical function was measured on different dialysis days in patients undertaking haemodialysis. Minimal detectable change values are provided for this population. Bias was found for the six-minute walking test, gait speed, Short Physical Performance Battery and one-leg stand test when the testing day changed.


Introduction
Chronic kidney disease (CKD) is an important global health problem because of its high incidence, prevalence, morbidity and mortality rates, and socioeconomic cost 1,2 . Globally, an estimated 850 million people suffer from kidney disease, amounting to more than 10% of the adult population and accounting for at least 2.4 million deaths annually 3 .
Haemodialysis (HD) is the most common form of renal replacement therapy (RRT). Patients on long-term HD experience physical function problems, as well as impaired health-related quality of life 4 .
Physical activity (PA) is any body movement produced by muscles that results in increased energy expenditure. Exercise is a subset of PA that is planned, structured, repetitive and purposeful 5 . PA level is lower in CKD patients at any stage than in healthy counterparts 6,7 . PA level also seems to affect physical function, since patients on HD or peritoneal dialysis with impaired PA had worse physical function than more active patients 8,9 . Several recently published studies report the beneficial effects of exercise on the physical function of patients receiving HD 4,10-12 .
Physical function tests are commonly used to assess the effectiveness of exercise programs, but they may be challenging for patients and their assessors to complete due to time constraints before the start of the HD session.
Previous studies have investigated the relative and absolute reliability of several physical function tests, many of which have demonstrated excellent test-retest intra-rater reliability when the tests are undertaken before the start of the HD session 13-17 . Reproducibility refers to the variation in measurements made on a subject under changing conditions 18 . It remains unknown whether physical function tests are reproducible when the same rater measures on non-HD days.
Therefore, the aim of this study was to evaluate the reproducibility of several physical functional tests, depending on the timing of their implementation (before the HD session versus non-HD days).

Design
This was a reproducibility and method comparison study in which the day of testing changed (HD day versus non-HD day).

Setting and participants
The participants were recruited from the HD unit in the Hospital de Terrassa in August 2019 and signed a written informed consent. This study was approved by the Ethics Committee at Hospital de Terrassa, was carried out in accordance with the standards set out in the Declaration of Helsinki, and was registered at ClinicalTrials.gov (NCT04049708). Patients were included in the study if they had been receiving maintenance HD for at least 3 months and did not have any acute or chronic medical conditions that would preclude collection of the test data. Individuals were excluded if they had recently had a myocardial infarction (in the 6 weeks prior), or had unstable angina, malignant arrhythmias, or any disorder that would be exacerbated by activity. Demographic and clinical data from the patients' medical histories were registered.

Procedure
The study consisted of repeating the same tests on two different occasions, trials 1 and 2, to evaluate the reproducibility. The tests were always performed by the same experienced nurse. The test session was performed before HD treatment, as described elsewhere in the literature 17,19,20 . Before the first HD session of the week, the participants underwent the Short physical performance battery (SPPB), One legged stance test (OLST), and Time up and go (TUG) tests. Before the second HD session in the same week, the patients performed the Sit to stand to sit 10 (STS-10) and sit to stand to sit 60 (STS-60) tests. Finally, the participants undertook the Six minutes walking test (6MWT) before their third HD session in the week.
The retest session was performed on a non-HD day by the same nurse. Participants completed the same battery of tests in a single test session.
Definition of the tests:
Short physical performance battery (SPPB). It objectively measures lower extremity function and includes several tests: balance, gait speed, and sit to stand to sit 5 repetitions (STS-5). This is a commonly used test in patients undertaking HD 17,21 .
One-legged standing test (OLST). It consists of maintaining a one-legged stance for as long as possible, with a maximum of 45 seconds per leg in three trials 19 .
Timed up-and-go test (TUG). The participants were given verbal instructions to stand up from a standard armchair (using their arms if necessary), walk three meters as quickly and safely as possible, turn back at a cone set out by the researchers, walk back, and sit down again in the chair. The patients could wear their regular footwear and use a walking aid if needed. A stopwatch was started on the word "go" and stopped when the patient was fully seated again with their back against the backrest. The time taken to complete the test was recorded in three consecutive trials, using the first one to familiarise the patients with the test. The best time from the three trials was analysed 22 .
Sit-to-stand-to-sit tests (STS). The STS-10 consisted of performing 10 complete movements of sitting down and standing up as fast as possible, with the arms held tightly against the chest. The elapsed time for the STS-10 was recorded. In the STS-60 test, the number of repetitions performed in 60 seconds was recorded 17,20,23 .
Handgrip (HG) with or without arm support. Two different procedures were compared, with and without arm support. In the HG test without support, the participant was seated in a chair. Participants performed three consecutive three-second repetitions using an approved Jamar hand dynamometer, with 15-second rest periods between repetitions. The same test was then performed with the arm supported on the surface of a table 17,26 .
The six-minute walking test (6MWT). It consisted of assessing the maximum distance walked during a 6-minute period 26 .

Statistics
The normality of the data distribution was assessed using the Kolmogorov-Smirnov test. Normally distributed descriptive data were reported as the mean and standard deviation (SD), and non-normally distributed data were reported as the median and range. We also performed paired comparisons with paired t-tests or Wilcoxon signed rank tests to assess any systematic bias between the trials.
Bland-Altman plots were used to visually assess the disagreement between the measurements taken on the two different measurement days. Each participant's mean score was plotted against the difference between that participant's two scores (non-dialysis day minus before the HD session) to check for possible systematic bias. The Bland-Altman plots displayed the 95% limits of agreement (95% LOA), which give the range within which 95% of future differences in measurements between measurement days are expected to lie. The 95% LOA were calculated as the mean of the score differences ± 1.96 × the SD of the score differences.
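The limits of agreement described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used in the study (which was run in SPSS); the function name `bland_altman` and the sample values are hypothetical and are not study data.

```python
from statistics import mean, stdev

def bland_altman(before_hd, non_hd, z=1.96):
    """Return (mean difference, lower LOA, upper LOA) for paired scores."""
    if len(before_hd) != len(non_hd):
        raise ValueError("paired samples must have equal length")
    # Same direction as the figure axes: non-dialysis day minus before HD.
    diffs = [b - a for a, b in zip(before_hd, non_hd)]
    bias = mean(diffs)      # mean of the paired differences
    sd = stdev(diffs)       # sample SD of the paired differences
    return bias, bias - z * sd, bias + z * sd

# Hypothetical TUG times in seconds, for illustration only.
before = [8.0, 9.5, 10.0, 7.5, 12.0]
non_hd = [8.5, 9.0, 10.5, 7.0, 12.5]
bias, lower, upper = bland_altman(before, non_hd)
print(f"mean difference {bias:.2f} s, 95% LOA {lower:.2f} to {upper:.2f} s")
```

A mean difference close to zero with narrow limits suggests agreement between testing days, whereas a mean difference far from zero points to systematic bias.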
The intraclass correlation coefficient (ICC; model alpha), based on a two-way random-effects model, was used to assess relative intra-rater reliability, which was rated 'excellent' (ICC ≥ 0.900), 'good' (≥ 0.750) or 'fair' (0.600 to 0.749) 27 . We assumed that there was no systematic bias between measurements within subjects and that the within-subject SDs were equal for all measurements, since the same rater measured participants one week apart.
We calculated the absolute reliability, expressed as the standard error of measurement (SEM), and the minimal detectable change at the 90% confidence level (MDC 90 ) for these tests. The SEM and the MDC 90 were calculated following previously published formulas 17,23 .
The SEM measures absolute reliability and represents the extent to which a variable can fluctuate during the measurement process 28 . To be 90% confident about the range for a measurement, the calculation 1.68 × SEM was used 15,16 . The MDC is defined as the amount of change in a measurement required to conclude that the difference is not attributable to error and is the smallest change that falls outside the expected range of error 16,29,30 . We set the level of significance to p ≤ 0.05 for all our statistical analyses, and the data were managed and analysed using the Statistical Package for the Social Sciences (SPSS) version 20.0 for Windows (IBM Corp., Armonk, NY).
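For readers who want to see how a two-way random-effects ICC is built from the data, the following sketch computes the single-measure, absolute-agreement form, often written ICC(2,1), from ANOVA mean squares. The text does not state which ICC form was requested in SPSS, so this choice is an assumption; the function name `icc_2_1` and the sample values are hypothetical.

```python
from statistics import mean

def icc_2_1(trial1, trial2):
    """ICC(2,1): two-way random effects, absolute agreement, single measure."""
    rows = list(zip(trial1, trial2))   # one row per participant
    n, k = len(rows), 2                # n participants, k = 2 trials
    grand = mean(x for row in rows for x in row)
    row_means = [mean(row) for row in rows]
    col_means = [mean(trial1), mean(trial2)]

    # Sums of squares for participants, trials, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical scores: the retest is shifted by exactly one unit for everyone.
trial_1 = [1.0, 2.0, 3.0, 4.0]
trial_2 = [2.0, 3.0, 4.0, 5.0]
icc_shifted = icc_2_1(trial_1, trial_2)
icc_identical = icc_2_1(trial_1, trial_1)
```

In this form, a constant shift between trials lowers the ICC even though participants keep the same rank order. This is one reason the ICC is usually read alongside a Bland-Altman analysis.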
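As an illustration of how SEM and MDC values are commonly derived, the sketch below uses the conventional expressions SEM = SD × √(1 − ICC) and MDC = z × √2 × SEM. These expressions are an assumption, because the formulas themselves do not appear in the text. The multiplier `z` is left as a parameter: 1.645 is the standard normal value for 90% confidence, whereas the text above mentions 1.68. The input values are hypothetical.

```python
import math

def sem(sd, icc):
    """Standard error of measurement from the sample SD and the ICC."""
    return sd * math.sqrt(1.0 - icc)

def mdc(sem_value, z=1.645):
    """Minimal detectable change; sqrt(2) reflects error in two measurements."""
    return z * math.sqrt(2.0) * sem_value

# Hypothetical inputs, for illustration only (not study data).
example_sem = sem(sd=5.0, icc=0.91)
example_mdc90 = mdc(example_sem)
print(f"SEM = {example_sem:.2f}, MDC90 = {example_mdc90:.2f}")
```

A change in an individual patient smaller than the MDC cannot be distinguished from measurement error at the chosen confidence level.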

Results
Thirty participants with a mean age of 66.4 years (SD = 16.30), mean time on HD of 34.4 months (SD = 51.4), and mean Charlson comorbidity index of 8.5 (SD = 2.5) completed this study. The demographic and clinical data statistics for all the participants are shown in Table 1. No adverse events occurred during the testing.
Descriptive statistics for trial 1 (before the HD session) and trial 2 (non-dialysis day), as well as the differences between them, are shown in Table 2. Given the MDC value calculated in the present study, a change in individual performance of less than one third of the mean cannot be distinguished from measurement error for the STS-10.
Intraclass correlation coefficients ranged from 0.86 to 0.96, so all tests showed good to excellent relative reliability (Table 3). Confidence intervals were narrow, except for the relatively wide confidence intervals obtained for the gait speed test and the STS-10.
Bland-Altman scatterplots were created to estimate disagreement between the two trials. The mean differences for the STS-10, STS-60, TUG and all the handgrip tests were close to zero, indicating no systematic differences between trials. All tests, except for the handgrip tests, presented better values on the non-HD day. A large range of values between trials was observed for the 6MWT, gait speed, OLST and SPPB (Table 3). Thus, Bland-Altman plots indicated a systematic bias for these four tests. The mean difference scores between the different days for the same rater differed significantly from exact agreement (p < 0.001). Figures 1 to 3 show the agreement between measurements taken before the HD session and on a non-dialysis day for the STS-10 (Fig. 1), STS-60 (Fig. 2), and TUG (Fig. 3). For the STS-10 there was a mean difference of 0.9 seconds between the days (LOA −11.6 and 9.9 seconds). For the STS-60 there was a mean difference of −0.5 repetitions (LOA −6.6 and 5.6 repetitions). For the TUG there was a mean difference of 0.2 seconds (LOA −2.3 and 2.8 seconds). All figures show that the differences changed little as the mean increased, while the variation of the data remained constant.

Discussion
The study attempted to clarify if physical function tests measured in patients undertaking HD are reproducible when changing the testing day (before the HD session vs. non-dialysis day). The sample size reached the recommended number of 30 31 .
Although high ICC values were obtained, the ICC is a ratio index of within-subject and between-subject variability; therefore, agreement between groups of subjects does not provide information about individual change or error in scores. Additionally, the ICC depends on the sample variability, and thus should not be used in isolation 32 . The Bland-Altman plots were useful in exposing the relationship between the trials, showing a tendency towards better scores when the physical function test was performed before the HD session (except for the HG tests).
The present study shows a high degree of agreement between measurements on different days (HD day before the session vs. non-HD days) and good or excellent ICC results (above 0.86) only for some tests (STS-10, STS-60, TUG and HG tests) demonstrating lack of systematic bias when the measurement day changed. Thus, our results support the use of these tests when there is a change in the timing for assessment.
The scores from our participants were similar to those reported in previous research by our group (handgrip left: 20.5-20.5 kg vs 23.8-23.4 kg) 19,23 . Our sample was around 5 years older than the samples studied previously. Compared with other studies, in HD patients aged around 62 and 57 years, results are also similar for the STS-60, with 26-28 repetitions 23 and 20.5-19.8 repetitions 33 ; this last article differs from the rest, probably because of its small sample of only 10 patients. For the TUG, 8.9-8.1 seconds has been reported 33 .
Our results suggest that the HG test without arm support is also reliable and has even lower MDC values, which would make it easier to find true changes beyond the variability of the measurement.
The present ICC results concur with those from our previous studies in similar samples (39 participants for the STS-10, STS-60, HG) 17 . Our results show that there was no systematic bias for the STS-10, STS-60, TUG, or HG tests, and so these tests can be measured on different days. Nevertheless, this study shows a systematic bias for the SPPB, gait speed, and 6MWT when the timing (before the HD session vs. non-dialysis day) changes. Systematic bias has been explained by the learning effect, whereby the participant repeats the test and improves results during the retest, albeit to a non-significant degree 34 . A previous intra-rater study also showed no learning effect 19 . Our results do not show this learning effect, since gait speed and 6MWT performance were better before the HD session in trial 1 than in the retest session on non-HD days (Table 2). Some authors suggest that testing before the HD session may reduce the effects of fatigue from the previous HD session 33 . Additionally, the high variability of functional results in this cohort is well known 17,20 , so it seems very important to keep the same testing circumstances when testing this cohort.
Hence, the Bland-Altman method showed that the 6MWT, gait speed, OLST and SPPB had substantial bias and disproportionately wide LOA. This situation, in which ICC values are high but the Bland-Altman method shows a lack of agreement, has also been found when establishing the reliability of some motor tests 32 . Gait speed and the 6MWT achieved higher results when tested before the HD session, while balance achieved higher results on non-HD days. Fatigue resulting from administering all the tests in a row on a non-HD day could explain why some tests obtained poorer results on non-HD days; such fatigue should not affect balance.
Previous research has tested a battery of three tests on non-HD days 33 . Clinical feasibility does not allow us to test patients on several non-HD days, because these participants already spend many hours in a clinical setting for their treatments and so it would be difficult to convince them to spend extra time there for physical function testing alone. Finally, our results may help to clarify which tests could be measured before the HD session by the same rater, because there is no consensus in this regard and clinical applicability should be considered in order to extend testing into routine treatment.
The main strength of this study was that, to the best of our knowledge, this was the first time that the reproducibility of physical function tests in patients undergoing HD has been tested with different test administration timings. Assessment at nephrology units could be difficult to implement because of a lack of human resources and logistics in many clinical settings. Thus, it is important to be flexible regarding the test timing in this cohort, but it is also important to note that these changes impact the reproducibility of several commonly used physical function tests. The main weakness of this work was that the sample size was relatively small. Another limitation is that we did not make two measurements with each timing. Since there was only a one-week difference between measurements, we believe we may assume that there were no systematic biases between measurements within subjects and that the within-subject SDs were similar for all measurements.
Our results have important implications for the implementation of physical function testing in HD units and indicate that the same assessors should test patients. Future work should be multicentric and include larger sample sizes to confirm these findings, and should also aim to clarify the ideal battery for clinical assessments in this population by assessing other tests, such as lower-limb muscle strength tests.

Conclusion
The STS-10, STS-60, TUG and handgrip tests had good to excellent test-retest reliability when physical function was measured on different dialysis days in patients undertaking HD. The MDC values are provided for this population. Bias was found for the 6MWT, gait speed, SPPB, and OLST when the testing day changed. Future studies should be conducted to clarify the ideal battery for routine clinical assessments in this population, including lower-limb muscle strength tests.

CONTRIBUTORS.
All authors designed the study; or collected, analysed, or interpreted the data; and drafted or critically revised the article and approved the final draft.

Figure 1
Bland-Altman plots showing agreement for the time required to perform the sit-to-stand-to-sit 10 test, obtained before the haemodialysis session and on a non-dialysis day by the same rater. Y axis: difference (non-dialysis minus before the haemodialysis session) in seconds. X axis: average, (non-dialysis + before the haemodialysis session)/2, in seconds.

Figure 2
Bland-Altman plots showing agreement for the number of repetitions performed in the sit-to-stand-to-sit 60 test, obtained before the haemodialysis session and on a non-dialysis day by the same rater. Y axis: difference (non-dialysis minus before the haemodialysis session) in repetitions. X axis: average, (non-dialysis + before the haemodialysis session)/2, in repetitions.

Figure 3
Bland-Altman plots showing agreement for the time required to perform the timed up-and-go test, obtained before the haemodialysis session and on a non-dialysis day by the same rater. Y axis: difference (non-dialysis minus before the haemodialysis session) in seconds. X axis: average, (non-dialysis + before the haemodialysis session)/2, in seconds.

Figure 4
4A and 4B. Bland-Altman plots showing agreement for the kilograms achieved with the handgrip strength test, right and left with forearm supported, obtained before the haemodialysis session and on a non-dialysis day by the same rater. Y axis: difference (non-dialysis minus before the haemodialysis session) in kilograms. X axis: average, (non-dialysis + before the haemodialysis session)/2, in kilograms.

Figure 5
5A and 5B. Bland-Altman plots showing agreement for the kilograms achieved with the handgrip strength test, right and left without support, obtained before the haemodialysis session and on a non-dialysis day by the same rater. Y axis: difference (non-dialysis minus before the haemodialysis session) in kilograms. X axis: average, (non-dialysis + before the haemodialysis session)/2, in kilograms.