Inter-Rater Agreement and Test-Retest Reliability of the Performance and Fitness (PERF-FIT) Test Battery for Children: A Test for Motor Skill Related Fitness.

Background: The Performance and Fitness (PERF-FIT) test battery for children is a recently developed, valid assessment tool for measuring motor skill-related physical fitness in 5- to 12-year-old children living in low-income settings. The aim of this study was to determine: (1) inter-rater agreement and (2) test-retest reliability of the PERF-FIT in children from three different countries (Ghana, South Africa and the Netherlands). Method: For inter-rater reliability, 29 children (16 boys and 13 girls, 6-10 years) were scored by 2 raters simultaneously. For test-retest reliability, 72 children (33 boys and 39 girls, 5-12 years) performed the test twice, minimally one week and maximally two weeks apart. Relative and absolute reliability indices were calculated. ANOVA was used to examine differences between the three assessor teams in the three countries. Results: The PERF-FIT demonstrated excellent inter-rater reliability (ICC, 0.99) and good test-retest reliability (ICC ≥ 0.80) for 11 of the 12 tasks. A poor ICC was found for the Jumping item only, due to low spread in values. Overall, measurement error, Limits of Agreement and Coefficient of Variation had acceptable levels to support clinical use. No systematic differences were found between first and second measurement between the three countries, except for one item (Overhead throw). Conclusions: The PERF-FIT can reliably measure motor skill-related fitness in 5- to 12-year-old children in different settings and help clinicians monitor levels of power and agility, and fundamental motor skills (throwing, bouncing, catching, jumping, hopping and balance).

In total, 101 children between 5 and 12 years of age were recruited in two elementary schools in low-income areas in Cape Town, South Africa (SA), two elementary schools in low-income areas in Accra, Ghana (GH) and two elementary schools in middle-income areas in Tilburg, the Netherlands (NL) (see Fig. 1). The sample was randomly chosen by the teachers from the class list.
First, we examined the inter-rater agreement of the PERF-FIT test battery. Twenty-nine 6- to 10-year-old South African children were included in this part of the study. Next, we examined test-retest reliability to evaluate possible variance in performance between two test moments in the children, and whether the variance was stable under the different testing circumstances. In total, 72 children between 5 and 12 years of age completed the test-retest part of the study: South African children (24), Ghanaian children (23) and Dutch children (25). Permission to approach the head teachers was obtained from the school districts. Verbal and/or written explanations of study purpose, test procedures, benefits and risks were provided to parents. Children were included after parents or caretakers signed the consent forms and children gave assent to participate. Children included were a random sample of school children aged 5-12 years with understanding of the local language. Children with health-related conditions were excluded based on the Physical Activity Readiness Questionnaire (PAR-Q) (Warburton et al., 2011). In addition to PERF-FIT scores, data sought included age, height, weight and gender. No other information was available to the raters about the children.
Assessments were performed under the circumstances that the test was developed to be used in: on the school's premises outside (GH and SA), in the gym or hall (NL and SA) and in a physiotherapy practice (NL). Participants completed standardized warm-up procedures prior to testing as prescribed in the manual. They were allowed practice trials for each test item before the scored trial, as indicated in the manual. Children who did not have proper shoes performed the test barefoot on both occasions. All of the Ghanaian children were tested barefoot; some of the South African children wore uniform shoes and some were barefoot. All the Dutch children wore sneakers.
The lead author trained at least one rater per country but was not present at any of the test sessions. The trained raters instructed the other raters in SA, GH and NL during a half-day training, where they practiced in small groups to obtain a solid routine. Raters were selected as being representative of the future users: pediatric physiotherapists, physiotherapists and occupational therapists, teaching assistants and a nurse. The raters conducted all the testing during school hours, except in the Netherlands where part of the testing was done on a day off.

Inter-rater Agreement Study
The consistency of two different clinicians rating the PERF-FIT was tested. When establishing inter-rater agreement with two observers, one tests if the instructions for scoring were unambiguous and if this led to similar results. Overall results are excellent (mean ICC 0.99), indicating that the two raters did get the same results for the same subjects. Since the children were selected randomly by the teachers, the results can be generalized for the child population within this age range (D'Olhaberriague et al., 1996).

Test-retest Reliability Study
Test-retest reliability concerns the reproducibility of the observed value when the measurement is repeated in a stable population. Studying reliability may seem straightforward, as it is just a matter of repeating the measurement on a reasonable number of individuals. However, interpreting the ndings is less simple and a combination of approaches is more likely to give a true picture of reliability (Bruton, et al., 2000). The type of data (continuous) of the PERF-FIT requires standard error of measurement (SEM) (De Vet et al., 2006;Stratford and Goldsmith, 1997) and proportions of agreement within speci ed limits to provide useful information concerning reliability (De Vet et al., 2006). Given the ICC's found in this study, one can assume that the PERF-FIT is a reliable tool. ICC's for 4 items are 80 or higher and 7 items have an ICC of 90 or higher. The relative nature of the ICC is re ected in the fact that the magnitude of an ICC depends on the between-subjects variability. That is, if subjects differ little from each other (homogeneous sample), ICC values can be low even if trial-to-trial variability is small as shown in the Jumping item. This item, which is easy in this population, showed low ICC but good agreement (85%). It would also be of interest to test the reliability of this item in young children and with DCD. Importantly, if we were to include participants with neurodevelopmental delays the between-subjects variability will change as well as the ICC (Strainer, Norman, & Cairney, 2014).
ICC is not sensitive to disagreement due to systematic bias as was shown by the comparison between test 1 and test 2, half the items showed a very small but signi cant improvement but have high ICC's. The need to perform the test twice will cause performance variability, due to changes in motivation and familiarization with the tasks. Detailed analysis of the Side jump data, with good ICC (0.90), showed that ve children "improved" ten jumps or more (max 13). However, this was not due to instruction or circumstances since the ve children came from three different countries. Still these differences cannot be attributed to improved anaerobic tness, or improved motivation since these children showed no improvement on the other items. Hence this nding points more towards a short-term learning effect or getting the clue of the agility required in this task for some children. We therefore added the recommendation in the manual to offer one extra practice opportunity if a child is still struggling with the weight shifts of the Side jump.
Throwing and catching series also showed small improvements. Some of the African children were less used to this task, which may have increased the learning effect. Consequently, we will emphasize to consistently use the two practice trials per level of di culty, to reduce the learning effect.
Test location. The subject population of interest for the PERF-FIT is the group of children in elementary school age living in low socioeconomic circumstances. Children with different lifestyles (level of daily physical activity, participation in structured physical education and sports) and testing in different contexts may respond differently to re-testing of some tasks. Therefore, we gathered data in three countries with many raters (n = 16), to analyze the reliability across these different populations and environments making the results clinically more widely applicable (Bruton, Conway, & Holgate, 2000). Although the testing was done in a standardized way, raters, sites and children were very different. Still, no country-related bias was found except for the Overhead throw, where the difference in scores between the two test occasions was larger in the Ghanaian children. Scoring this item requires the tester to focus on the landing spot, preferably on sand, dirt oor or grass so the sandbag leaves a landing imprint. The practice trial given in this task is done with submaximal force, which may have decreased the familiarization in the rst testing.
Despite the noise and distraction, inherent to testing at the school premises in open space, the test results were considerably stable, which implies that the children were able to attend to the instructions under these circumstances. These ndings point to the fact this test is enjoyable and engaging for the children. It is to be expected that if children are tested in a more clinical one-on-one situation, the variability between test and retest will be even less.
In this study we choose for a wide variety of outcomes because they all have advantages and disadvantages. Both the SEM and LoA were calculated because they differ in the type of measurement error that they describe and in the coverage probability of the reference interval (0.68 versus 0.95%). If the variability in test-retest outcomes depends on the magnitude of the mean values, the use of a ratio statistic is useful to the researchers. The advantage of CV being unitless is that it can be used to compare different instruments, but this makes it harder to translate results into clinical practice.
Outcome Measure: PERF-FIT

The PERF-FIT measures motor skill-related physical fitness in children aged between 5 and 12 years old. The test has two subscales: a Performance part and a Fitness part (see Table 1). The PERF-FIT test battery is easy to administer, low-cost and developed for measuring performance-related physical fitness in school-aged children living in low-income settings, and has excellent content validity and good structural validity (Smits-Engelsman et al., 2020a, 2020b). A full description of the PERF-FIT test battery is available through the first author (Smits-Engelsman, 2018).

Bouncing and Catching
Children bounce a tennis ball to the floor and catch it. This series involves five bouncing and catching items of increasing skill difficulty. All children start at the easiest level. This series is discontinued if the child scores less than 6 out of 10 catches.

Throwing and Catching
Children throw a tennis ball in the air to at least eye-level height and catch it. This series involves five throwing and catching items of increasing skill difficulty. All children start at the easiest level of this series. The series is discontinued if the child scores less than 6 out of 10 catches.

Jump
Children are asked to jump inside an agility ladder. This series involves four jumping items of increasing difficulty. Two test trials are allowed if the maximum score is not obtained.

Hop
Children are asked to hop inside an agility ladder. This series involves four hopping items of increasing difficulty for each leg. Two test trials are allowed if the maximum score is not obtained.

Balance
Children are asked to perform two static balance tasks for each leg and three dynamic balance tasks. Tasks involve knee hugging, grasping the foot and picking cans from the floor at close and far distance.

Agility and Power items
Running
Children are asked to run (one foot per square) in a 3.5 m agility ladder, run around a bottle placed at a distance of 50 cm from the starting line and return the same way as fast as possible. Two test trials are given for each child. The time taken (in seconds) to complete this task and the number of mistakes made are recorded.

Stepping
Children are asked to step with two feet in each square of a 3.5 m agility ladder, run around a bottle placed at a distance of 50 cm from the starting line and return the same way as fast as possible. Two test trials are given for each child. The time taken (in seconds) to complete this task and the number of mistakes made are recorded.

Side Jump
Children are required to jump sideways on their feet, one foot per square, within the same three squares of the agility ladder. The total number of correct landings in 15 s is recorded for each of the two test trials.

Long Jump
Children are asked to jump forward as far as possible and land on their feet in a controlled manner (i.e. balanced landing). The distance between the starting line and the heel of the foot that landed closest to the starting line is measured in centimeters. Two test trials are given.

Overhead Throw
Children kneel just behind a starting line and throw a sandbag (2 kg) forward as far as possible. The bag is held over the head and thrown from a starting position behind the head. The distance between the starting line and the part of the sandbag closest to the starting line is measured in centimeters. Two test trials are performed.

Agility and power subscale
This subscale contains five items: Running, Stepping, Side Jump, Long Jump, and Overhead Throw. For the Agility and power subscale, children perform two trials for each item with 15 seconds rest in between.

Motor Skill Performance subscale
This subscale contains five Skill Item Series (SIS) of increasing difficulty: Bouncing and Catching, Throwing and Catching, Jumping, Hopping (left and right), and Balance. All children start at the easiest level, and a series is terminated when they do not reach the criterion number of points for the item after two trials. If a child obtains the maximum number of points after the first trial, no second trial is given and the child proceeds to the next level of difficulty.
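The administration rules above (start at the easiest level, skip the second trial after a maximum score, discontinue when the criterion is not met after two trials) can be sketched as follows. The function name and the way trial scores aggregate into a series total are illustrative assumptions, not the official PERF-FIT scoring algorithm:

```python
def score_skill_series(trial_results, criterion, max_points):
    # trial_results: per-level trial scores (one or two trials per level),
    # ordered from easiest to hardest level.
    # Aggregation (best-of-trials, summed over levels) is an assumption.
    total = 0
    for trials in trial_results:
        best = max(trials)
        total += best
        if trials[0] >= max_points:
            continue          # max on first trial: proceed to next level
        if best < criterion:
            break             # criterion not met after two trials: stop
    return total
```

For example, with a catch criterion of 6 out of 10, a child scoring 10 on level 1, then 5 and 7 on level 2, then 4 and 3 on level 3, would be stopped after level 3.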
After the first round of collecting validity data in Brazil (Smits-Engelsman et al., 2020a), it was found that most children obtained a maximum score on the static balance series, and it was decided to increase the total number of seconds of the static balance series from 40 to 60 seconds for future studies. At that moment the data collection for SA had already started with the 40-second protocol. Therefore, the South African data on one item, Static balance, were discarded in the current paper. This was the only adaptation in the protocol, which was then used for data collection in GH and NL.

Data analysis
Descriptive data were calculated in terms of mean value and standard deviation (Mean ± SD). Relative reliability, which is the degree to which individuals maintain their position in a sample over repeated scoring or testing, was determined by calculating the two-way random intraclass correlation coefficient (ICC(2,1)) for absolute agreement of single measures. The 95% confidence interval (CI) was calculated for each ICC (Shrout & Fleiss, 1979). A paired t-test was used to compare the means of test (T1) and retest (T2) to evaluate whether there was any statistically significant bias between the test results.
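For readers who wish to reproduce the relative reliability index, a minimal sketch of ICC(2,1) for absolute agreement of single measures, computed from the two-way random-effects mean squares, could look as follows. The function name and matrix layout are our own; the analyses in this study were run in SPSS:

```python
import numpy as np

def icc_2_1(scores):
    # scores: n_subjects x k_raters (or k_occasions) matrix
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)            # per subject
    col_means = scores.mean(axis=0)            # per rater/occasion
    ss_total = ((scores - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between raters
    ss_err = ss_total - ss_rows - ss_cols            # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    # ICC(2,1), absolute agreement, single measures (Shrout & Fleiss, 1979)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

With perfect agreement between the two columns the function returns 1.0; systematic rater differences lower the value, reflecting the absolute-agreement definition.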
Next, indicators of absolute reliability were calculated to determine the degree to which repeated measurements vary for individuals, expressed in the actual units of measurement or as a proportion of the measured values. The Standard Error of Measurement (SEM), Bland and Altman's 95% Limits of Agreement (LoA) (Bland & Altman, 1986) and the coefficient of variation (CV) are all measures of absolute reliability that were used in this study.
The calculation of SEM and LoA does not depend on sample size, but the precision of their estimate for the population parameter does. Bland and Altman recommended sample sizes of at least 50 individuals in a study to consider the sample LoA to be a precise estimate of the population LoA (Bland & Altman, 1986). Since we were also interested in a group comparison we oversampled, aiming at 25 subjects per country.
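The 95% LoA are computed as the mean difference ± 1.96 times the SD of the differences (Bland & Altman, 1986); a minimal sketch, with an illustrative function name:

```python
import numpy as np

def limits_of_agreement(t1, t2):
    # t1, t2: paired scores from the two test occasions
    d = np.asarray(t1, float) - np.asarray(t2, float)
    bias = d.mean()                  # systematic difference T1 - T2
    sd_d = d.std(ddof=1)             # SD of the differences
    return bias - 1.96 * sd_d, bias + 1.96 * sd_d
```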
The SEM, as a measure of precision of the assessment, was determined using the ICC through the formula SEM agreement = SD × √(1 − ICC agreement), in which SD is the sample SD of the grand mean and ICC is the calculated intraclass correlation coefficient (Weir, 2005). Absolute reliability statistics were also calculated using the standard deviation of test-retest differences (SD differences) and its derivatives.
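The SEM formula above translates directly into code; `sem_agreement` is an illustrative name, and pooling all observed scores to obtain the grand-mean SD is an assumption about the computation:

```python
import numpy as np

def sem_agreement(values, icc):
    # values: all observed scores pooled over T1 and T2
    sd = np.std(values, ddof=1)          # sample SD around the grand mean
    return sd * np.sqrt(1.0 - icc)       # SEM = SD * sqrt(1 - ICC)
```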
SD differences is the SD of the differences between values on T1 and T2.
The Coefficient of Variation (CV), or relative standard deviation, is the individual SD expressed as a percentage of the mean of T1 and T2, using the formula (SD/Mean) × 100. The higher the SD, the greater the percentage of the mean and the higher the %CV. A %CV of < 10% is considered excellent, 10-20% medium, implying good precision, 20-30% high, meaning low precision, and > 30% very high, indicating very low precision (Atkinson & Nevill, 1998; Lee et al., 2013).
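A sketch of the %CV as described above, computing (SD/Mean) × 100 per test-retest pair and averaging over children; the helper name and the averaging step are assumptions about the exact computation:

```python
import numpy as np

def mean_percent_cv(t1, t2):
    # t1, t2: paired scores from the two test occasions
    t1, t2 = np.asarray(t1, float), np.asarray(t2, float)
    pair_mean = (t1 + t2) / 2.0
    # sample SD of a pair (x, y) with ddof=1 reduces to |x - y| / sqrt(2)
    pair_sd = np.abs(t1 - t2) / np.sqrt(2.0)
    return float(np.mean(pair_sd / pair_mean) * 100.0)
```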
To test for possible dissimilarities in the degree of the error between T1 and T2 in the participating countries, an ANOVA was run on the difference score (T1 − T2) for all items, with country (3) as between-group factor, followed by post hoc Bonferroni tests.
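The one-way ANOVA on the difference scores with country as between-group factor can be sketched from first principles (the analyses in this study were run in SPSS; this numpy version is only illustrative and omits the p-value and the post hoc Bonferroni tests):

```python
import numpy as np

def one_way_anova(*groups):
    # groups: per-country arrays of difference scores (T1 - T2)
    groups = [np.asarray(g, float) for g in groups]
    pooled = np.concatenate(groups)
    grand = pooled.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_b = len(groups) - 1                 # between-groups df
    df_w = pooled.size - len(groups)       # within-groups df
    f = (ss_between / df_b) / (ss_within / df_w)
    return f, df_b, df_w
```

Identical group means yield F = 0; the p-value would be obtained from the F distribution with (df_b, df_w) degrees of freedom.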
Statistical data analyses were carried out using SPSS version 25.0. A value of p < .05 was considered statistically signi cant in all analyses.

Inter-rater agreement
Very high ICCs (≥ 0.98) were found for all items. The results of the inter-rater agreement (n = 29) of the two raters are shown in Table 2.

Test-retest Reliability
Test-retest reliability results (n = 72) of the sixteen raters in the three countries are depicted in Table 3. Overall test-retest reliability was good to excellent on 11 of the 12 items; all ICCs were 0.80 or higher (Table 3). Only the item Jumping showed a low ICC, due to lack of spread in the data. This was the easiest item and many children had a maximum score (63% and 78% at T1 and T2, respectively). Percentage agreement within plus or minus 1 point was 84.7%.
Comparison between the first and second test occasion showed a statistically significant difference on half of the items. However, as shown in Table 3 (column Mean Difference 1-2), this systematic bias was small, except for Bouncing and catching and Side jump (p < 0.001). The SD of the differences in scores between the two test occasions and the LoA for each variable with its 95% confidence interval are also shown in Table 3. The mean %CV was 9.6%, which indicates excellent stability, and the highest %CV (21% for Hopping on left foot) was still considered acceptable.

Comparison Per Country
The ANOVA showed that the differences between T1 and T2 were not significantly different between countries for 11 of the 12 scores (Table 3). Only for the Overhead throw was the difference larger in the Ghanaian children. Post hoc tests showed that the Ghanaian children differed from the Dutch children: the Dutch children were slightly worse on the second test, while the Ghanaian children generally performed better the second time on this item (see Fig. 2).
Discussion

A new tool, the PERF-FIT, was developed because none of the currently available norm-referenced motor performance tests for children of elementary school age combines fundamental skills and musculoskeletal fitness. This study aimed to evaluate whether the PERF-FIT is a reliable tool and whether the measurement error is acceptable for practical use. Because widely accepted criteria or guidelines for reliability and agreement reporting in the health care and medical fields are lacking (Kottner et al., 2011), we chose a wide variety of outcomes to evaluate the reliability of the PERF-FIT. Inter-rater agreement depends primarily on good training of the raters, and on good standardization and description of the tasks (Smits-Engelsman, 2018). Data in this study indicate that this was the case for the PERF-FIT. Test-retest reliability is highly dependent on the situation and on the state and stability of the participants, and is therefore characterized by larger variability, which was confirmed by our results, although the agreement between the first and second test occasion was good. A small learning or familiarization effect was seen in 6 of the 12 items. No systematic differences in test-retest differences were found between the testing sites in the three countries in randomly selected children between 5 and 12 years old, except for one item. An average CV of 10%, obtained in the current study, means that, assuming the data are normally distributed, 68% of the differences between tests lie within 10% of the mean of the data (Atkinson & Nevill, 1998).

Inter-rater agreement study
The consistency of two different clinicians rating the PERF-FIT was tested. When establishing inter-rater agreement with two observers, one tests whether the instructions for scoring were unambiguous and whether this led to similar results. The overall results are excellent (mean ICC 0.99), indicating that the two raters obtained the same results for the same subjects. Since the children were selected randomly by the teachers, the results can be generalized to the child population within this age range (D'Olhaberriague et al., 1996).

Test-retest reliability study
Test-retest reliability concerns the reproducibility of the observed value when the measurement is repeated in a stable population. Studying reliability may seem straightforward, as it is just a matter of repeating the measurement on a reasonable number of individuals. However, interpreting the findings is less simple, and a combination of approaches is more likely to give a true picture of reliability (Bruton et al., 2000). The type of data (continuous) of the PERF-FIT requires the standard error of measurement (SEM) (De Vet et al., 2006; Stratford & Goldsmith, 1997) and proportions of agreement within specified limits to provide useful information concerning reliability (De Vet et al., 2006). Given the ICCs found in this study, one can assume that the PERF-FIT is a reliable tool. ICCs for 4 items are 0.80 or higher and 7 items have an ICC of 0.90 or higher. The relative nature of the ICC is reflected in the fact that the magnitude of an ICC depends on the between-subjects variability. That is, if subjects differ little from each other (homogeneous sample), ICC values can be low even if trial-to-trial variability is small, as shown in the Jumping item. This item, which is easy in this population, showed a low ICC but good agreement (85%). It would also be of interest to test the reliability of this item in younger children and in children with DCD. Importantly, if we were to include participants with neurodevelopmental delays, the between-subjects variability would change, as would the ICC (Streiner, Norman, & Cairney, 2014).
The ICC is not sensitive to disagreement due to systematic bias, as shown by the comparison between test 1 and test 2: half the items showed a very small but significant improvement yet had high ICCs. The need to perform the test twice will cause performance variability, due to changes in motivation and familiarization with the tasks. Detailed analysis of the Side jump data, with good ICC (0.90), showed that five children "improved" by ten jumps or more (max 13). However, this was not due to instruction or circumstances, since the five children came from three different countries. Still, these differences cannot be attributed to improved anaerobic fitness or improved motivation, since these children showed no improvement on the other items. Hence, this finding points more towards a short-term learning effect, or some children getting the clue of the agility required in this task. We therefore added the recommendation in the manual to offer one extra practice opportunity if a child is still struggling with the weight shifts of the Side jump.
The Throwing and catching series also showed small improvements. Some of the African children were less used to this task, which may have increased the learning effect. Consequently, we will emphasize consistently using the two practice trials per level of difficulty, to reduce the learning effect.
Test location. The subject population of interest for the PERF-FIT is the group of children of elementary school age living in low socioeconomic circumstances. Children with different lifestyles (level of daily physical activity, participation in structured physical education and sports) and testing in different contexts may respond differently to re-testing of some tasks. Therefore, we gathered data in three countries with many raters (n = 16), to analyze the reliability across these different populations and environments, making the results clinically more widely applicable (Bruton, Conway, & Holgate, 2000). Although the testing was done in a standardized way, raters, sites and children were very different. Still, no country-related bias was found except for the Overhead throw, where the difference in scores between the two test occasions was larger in the Ghanaian children. Scoring this item requires the tester to focus on the landing spot, preferably on sand, a dirt floor or grass so the sandbag leaves a landing imprint. The practice trial given in this task is done with submaximal force, which may have decreased the familiarization in the first testing.
Despite the noise and distraction inherent to testing on school premises in open space, the test results were remarkably stable, which implies that the children were able to attend to the instructions under these circumstances. These findings suggest that this test is enjoyable and engaging for the children. It is to be expected that if children are tested in a more clinical, one-on-one situation, the variability between test and retest will be even smaller.
In this study we chose a wide variety of outcomes because they all have advantages and disadvantages. Both the SEM and the LoA were calculated because they differ in the type of measurement error that they describe and in the coverage probability of the reference interval (68% versus 95%). If the variability in test-retest outcomes depends on the magnitude of the mean values, the use of a ratio statistic is useful to researchers. The advantage of the CV being unitless is that it can be used to compare different instruments, but this makes it harder to translate results into clinical practice.

Limitations And Future Research
Given the way the inter-rater reliability was examined, variability as a result of instruction was not tested. During field-based testing, not all sources of variability can be controlled; therefore, the design chosen for this study is close to the context this test was developed for. Results of agreement and reliability studies are intended to provide information about the amount of error inherent to a measurement tool in a specific population and context. High ICCs reflect adequate relative reliability for use of the PERF-FIT in the population that has been investigated. However, measures of reliability are generated by distribution-based methods and are dependent on the mean and variance in the group. The Minimal Detectable Change is very susceptible to increased variance, given its formula. Reliability studies should be repeated in the population the instrument will be applied in, since variability may be different in groups of children with known poor motor performance, low levels of fitness, or learning disabilities. Also, the impact of BMI on the scores and the reliability should be investigated in different weight categories. Additionally, studies are needed to evaluate the responsiveness of the PERF-FIT, or the ability of the test to measure changes after intervention.

Conclusion
The present study examined inter-rater agreement and test-retest reliability of the PERF-FIT in a manner that replicates how the test is typically used in its actual everyday context. Inter-rater agreement and test-retest reliability were adequate to support clinical use. Hence, the PERF-FIT was relatively stable over time, based on the small differences between the repeated measurements and on the calculated SEMs. The Coefficient of Variation was on average 10%, indicating good stability. Hardly any systematic differences were found between the testing sites in the three countries, which supports the use of the PERF-FIT by trained raters from a variety of backgrounds in different contexts.

Declarations

Contributors All individuals listed as authors meet the appropriate authorship criteria and have approved the acknowledgement of their contributions. The primary author, BCM, was responsible for setting up the project, development of the research design, drafting of the paper and liaising with the coauthors on findings and conclusions. DJ contributed to the paper through interpretation of data, completing methodological assessments and revising manuscript content throughout its development. JF was responsible for the logistics of the whole project and rater supervision. WA supervised data collection in the Netherlands, RD and SL were responsible for the project in Ghana, and JF, ES and DJ for the project in SA. All contributed to the paper through assisting with the interpretation of data and revising manuscript content throughout its development.

Funding No funding
Competing interests The authors declare that they have no competing interests.
Consent for publication Not applicable.
Patient consent Ethical approval for the study was obtained from the University of Cape Town, the Ghana Health Service and the University of Groningen (UCT HREC Ref 598/2019; HREC 139/2019; GHS-ERC 084/04/19; PSY-1920-S-0107). Written informed consent to participate was obtained from the parents/guardians of the minors included in this study, and assent was signed by the children. Permission was also obtained from the head teachers of the schools.
Data sharing statement The datasets used and analyzed during the current study are available from the corresponding author on reasonable request. The PERF-FIT manual and instruction videos can be accessed free of charge by the intended users, after registration via the first author, for use in low-resource communities.

Figure 2
Scatterplot of the mean values (cm) obtained by the children for the Overhead throw at Time 1 (test) and Time 2 (retest) in the three countries.