Standardized Measure for Performance Assessment of Athletes in The CrossFit Open: Theoretical Structuring and Item Response Theory

Rafael da Silva Fernandes (  rafasfer2@ufra.edu.br ) Federal Rural University of the Amazon https://orcid.org/0000-0002-3035-8025 Bruna Gabriele Biffe Centro Universitário Católico Salesiano Auxilium https://orcid.org/0000-0001-9650-5713 Mário Jefferson Quirino Louzada Centro Universitário Católico Salesiano Auxilium https://orcid.org/0000-0002-5744-2235 Antônio Cézar Bornia Universidade Federal de Santa Catarina https://orcid.org/0000-0003-3468-7536 Dalton Francisco de Andrade Universidade Federal de Santa Catarina https://orcid.org/0000-0002-4403-980X


Introduction
CrossFit®, a sport discipline that has been growing worldwide, has become popular, with more than 15,000 members all over the world. This growth is evident in the number of athletes registered in its annual competition, "The CrossFit Open": approximately 26,000 athletes participated in 2011, and 572,653 athletes registered in 2019 (CrossFit; Glassman, 2004).
On its official website, CrossFit® is defined as follows: "CrossFit is a lifestyle characterized by safe, effective exercise and sound nutrition. CrossFit can be used to accomplish any goal, from improved health to weight loss to better performance. The program works for everyone - people who are just starting out and people who have trained for years." Every sport discipline, and especially CrossFit®, which presents multiple physical requirements, needs effective techniques to analyze performance through a small number of influential variables, thus facilitating the analysis and the development of training programs that enhance relevant physical skills. Due to its practical nature, this performance enhancement usually happens in an evolutionary and adaptive manner (Gómez-Landero & Frías-Menacho, 2020).
Most of the studies dedicated to CrossFit® have been directed towards understanding physiological and nutritional factors, training strategies, physical and psychological recovery, and other aspects that may directly influence the performance of the athletes (Claudino et al., 2018; Mangine, Stratton, et al., 2020; Mangine, Tankersley, et al., 2020; Schlegel, 2020).
Typically, such studies vary depending on whether one is analyzing beginners, athletes with longer sport experience, athletes who are focused on maintaining their health, high-level competitors, or the several classes relating to age group and sex.
Regarding functional limitations, sporting performance is regulated by several factors related to the ability to repeat contractile motor activity efficiently. It is limited, however, by the progression of fatigue, characterized by a decrease in strength or in musculoskeletal energy production that reduces the capacity to sustain exercise intensity, so that greater fatigue leads to poorer performance (García-Pinillos et al., 2019; Hargreaves & Spriet, 2020; Khassetarash et al., 2021; Potvin & Fuglevand, 2017; Taylor et al., 2016; Wan et al., 2017).
is the reduced variability in the performance of athletes when they are positioned in a specific class. Thus, after the decisive aspects of performance are considered, strategies are designed to produce the adaptive response or responses that induce the greatest gain in the athlete's output (Hanin & Hanina, 2009; Silva-Grigoletto et al., 2013).
Currently, the criterion for evaluating an athlete's performance is the score obtained in the execution of a workout. In the CrossFit® context, performance can therefore be conceptualized as an output or score, expressed in time, number of repetitions or pounds, that results from an athlete's execution of a workout and makes it possible to distinguish between the efficiency and the efficacy of that execution.
Based on this concept, three points can be highlighted: i) Performance, based on the output, is a tool to measure physical conditioning.
ii) Performance ascertains the athlete's physical conditioning, in other words, the efficiency and efficacy of the workout performed by the athlete.
iii) Performance allows the conditioning of two or more athletes to be differentiated.
It is important to clarify that conditioning is not directly identifiable or observable. What is directly observable are the outputs obtained by the athletes in the various workouts performed. In other words, conditioning is perceived through the performance of the athletes in the workouts proposed in a competition; it therefore depends on how well the workouts, in their interpretation and scope, measure this execution.
For this reason, determining the athlete with the best conditioning based on a single workout is flawed, since a single workout cannot encompass the wide variety of exercises proposed by CrossFit® itself.
Despite the validation of the "CrossFit Games" as a tool for measuring an athlete's conditioning, principles of Classical Test Theory (CTT) are applied to rank athletes on each workout. Thus, CTT determines the "final score" as a simple rating score, which in CrossFit® is the sum of the "ranks" obtained (Nunnally, 1975).
Due to the large number and diversity of athletes who can sign up for the "CrossFit Games", it is expected that, when CTT is applied, subgroups of athletes are placed at the same rank, so information regarding performance on the different workouts is lost. Hence, Item Response Theory (IRT) has been employed to measure latent traits and characteristics of the measurement (Fernandes, Luz, Reis, Luz, & Guimarães, 2022; Fernandes, Luz, Reis, Luz, Guimarães, et al., 2022).
The application of IRT provides information regarding the performance of each athlete in the different workouts; moreover, it becomes possible to obtain a scale for measuring and interpreting the scores in the CrossFit® setting (Bonifay, 2019; Henninger & Meiser, 2020a). In this scope, the first objective of this study is methodological: we propose applying a probabilistic model of Item Response Theory (IRT), the Graded-Response Model (Bock & Zimowski, 1997), to describe the probability of an athlete executing a workout and obtaining a certain output, given his/her physical conditioning, the latent trait being measured. Thus, we have the "CrossFit Games" as the measurement tool to assess the performance (output) of the athletes in different workouts (items) and to determine their latent traits (physical conditioning).
The second objective of this work is practical: the application of the model adds to the performance analysis mechanisms that provide additional information identifying execution characteristics per workout, as well as data regarding the quality of the measurement tool as a criterion for performance evaluation. In particular, by analyzing the athlete's performance in various workouts it is possible to distinguish or differentiate the physical conditioning of two or more athletes.

Measuring Instrument
"The CrossFit Open" is a qualifying event that, since 2012, has consisted of 5 workouts and mobilizes thousands of athletes around the world to compete in the biggest participative CrossFit® event, the "CrossFit Games" (CrossFit; G. CrossFit).
For this purpose, a workout can be described as a group or repetitive series of exercises that require some combination of strength, cardiopulmonary ability and/or gymnastics, to be performed within a specific time frame (Time Cap).
In most cases, the stop criterion is determined in one of two ways. In the first, called truncated by repetitions, the athlete must complete a specified number of repetitions within a predetermined time frame (time cap), and the score is given by the execution time. In the second, called truncated by time, the athlete completes the maximum number of repetitions within a specified time, and the score is given by the number of repetitions. Workouts that use neither the number of repetitions nor time as the metric may occasionally appear; the most common is the one defined by a strength movement, in which the athlete's score is the maximum load lifted within a specified time frame. It is worth highlighting that, in workouts truncated by repetitions, an athlete who cannot complete the execution before the time cap is scored by the number of completed repetitions. Therefore, the complexity of a workout can be considered to be defined by the number of repetitions to be executed, the time frame allowed for execution, the number of types of exercises, and the complexity of their execution, and this complexity differentiates the various workouts.
Thus, the following specifications define its complexity: • Number of repetitions: the total number of repetitions to be completed (when truncated by repetitions) or the number of repetitions completed (when truncated by time); it is usually divided into repetitions per exercise or number of rounds.
• Execution time or Time Cap: the time limit available for the athlete to execute all the repetitions, or the maximum number of repetitions, in a workout.
• Number of types of exercises: the number of different exercises proposed in a workout. The difference among the exercises can be determined by an increase in complexity and/or an increase in the imposed load.
• Complexity of execution of each exercise: the categorization of the exercises by type, that is, exercises that include gymnastic elements, Olympic weightlifting, or aerobic conditioning.
In general, workouts are defined to account for the diversity of athletes and can be classified according to two classes: type and division.
The first class aims to differentiate beginner athletes from those with more experience in the sport; its types are known as Rx'd and Scaled. It is worth highlighting the assumption that athletes with more practice time tend to be able to execute more complex workouts or have better skills, whereas beginners need workouts with adapted movements and reduced load.
The second class consists of factors such as sex and age range and takes physical and biological characteristics into consideration. The workout is specified with varying complexity according to these characteristics. Figure 1 shows the distribution of the athletes across the various divisions and types for 2019, and Figure 2 shows it for 2020.

Data Set
For this study, data from the 2019 and 2020 editions of "The CrossFit Open", obtained from the CrossFit Games website (G. CrossFit), were used. The data were subdivided, according to Table 1, by year, category and division. The data reflect the variety of participants in terms of age range and sex. Athletes show greater interest in competing in the "Rx'd" type, and in the 18 to 34 age range men outnumber women. Athletes in the "Rx'd" type are expected to have longer practice time.
In turn, the "Scaled" type indicates less practice time and/or noncompetitive objectives; in other words, beginner athletes may or may not seek the practice for health purposes and face the competition only as a personal challenge.

Item Response Theory
In the CrossFit® context, the outputs obtained in a set of workouts have traditionally been used as an assessment and selection process to find the most conditioned athlete. However, due to the complexity of each workout, a given workout may benefit athletes who are skilled in certain movements. To avoid this situation, and as a premise of physical conditioning itself, the set of workouts needs to cover a wide variety of requirements.
In "The CrossFit Open" in particular, the score obtained in a specific workout serves as the criterion to rank the athletes, and after the 5 workouts the general ranking is calculated as the sum of the rankings in each workout. Consequently, a worse placement in a workout adds a larger value to the final sum, and the athlete with the lowest total is considered the most conditioned one.
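The rank-sum procedure described above can be illustrated with a minimal Python sketch. The athlete names, the scores and the tie rule (tied athletes share the same rank) are assumptions made only for this example and are not taken from the competition rulebook.

```python
from typing import Dict, List


def rank_workout(scores: Dict[str, float], higher_is_better: bool) -> Dict[str, int]:
    """Return competition ranks (1 = best); tied scores share the same rank."""
    ordered = sorted(scores.values(), reverse=higher_is_better)
    # list.index returns the first position of a value, so ties get equal ranks.
    return {athlete: ordered.index(score) + 1 for athlete, score in scores.items()}


def rank_sum(workouts: List[Dict[str, float]], higher_is_better: List[bool]) -> Dict[str, int]:
    """Sum the per-workout ranks; the lowest total is the best overall result."""
    totals: Dict[str, int] = {}
    for scores, hib in zip(workouts, higher_is_better):
        for athlete, rank in rank_workout(scores, hib).items():
            totals[athlete] = totals.get(athlete, 0) + rank
    return totals


# Hypothetical data: one workout scored in repetitions, one scored in seconds.
reps = {"A": 200, "B": 180, "C": 180}
seconds = {"A": 600.0, "B": 550.0, "C": 700.0}
totals = rank_sum([reps, seconds], [True, False])
print(totals)
```

In this toy example athletes A and B end with the same total even though their results differ in both workouts, which is the kind of information loss attributed to the rank-sum approach.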
This method of evaluation relies on the specific set of workouts that compose the competition; thus, analysis and interpretation are always tied to the competition as a whole, which is the main characteristic of Classical Test Theory. Consequently, it is not feasible to compare people who were not subjected to the same competition, or at least to what are called parallel methods of evaluation (Andrade et al., 2000; Mangine, Tankersley, et al., 2020).
In this context, Item Response Theory (IRT) refers to a set of probabilistic models that represent the probability of an athlete obtaining a specific score in a workout as a function of the characteristic parameters of the workout and the athlete's physical conditioning. This relation is always expressed in such a way that the better the physical conditioning, the higher the probability of obtaining a greater score in a workout (Andrade et al., 2000).
From the concept of performance, the main characteristics of a workout are its complexity and its ability to distinguish between two or more athletes. As a result, from the IRT point of view, the two-parameter logistic model is adequate for this context (Chalmers, 2012; Hori et al., 2020a, 2020b). As an extension of the dichotomous model, polytomous models can handle items with three or more ordered or unordered classes. In particular, Bock & Zimowski (1997) present the Graded-Response Model (GRM) as an extension of the two-parameter model. Thus, in the CrossFit context, the intention is to describe the probability of an athlete fitting a certain group based on his/her physical conditioning; a better conditioned athlete is expected to have a higher probability of performing better in a set of workouts and thereby obtaining better outputs. This means that the sectioning of the athletes may be done gradually and in order: athletes with better performance are assigned to the primary groups and, as the score decreases, to the last groups.
However, an issue arises when establishing criteria to define the sectioning of the athletes, mostly because the scores are continuous, when measured in time, or discrete, when measured in number of valid repetitions, which biases the precision of the estimates. In particular, it is possible to make an empirical comparison between IRT models that include characteristic parameters of the evaluator, presenting a notation for performance-evaluation data as well as a discussion of the characteristics common among evaluators (Ueno & Okamoto, 2008; Uto & Ueno, 2016). In this project, our aim is to discuss the rater biases on types. The usual rater characteristics on which accuracy depends are as follows:
• Severity: the tendency to assign lower positions than the results justify.
• Consistency: the extent to which the evaluator classifies results of similar quality similarly.
• Range restriction: the tendency to overuse some classes from restricted sections.
In practice, the consistency bias is disregarded: since the evaluator does not assign a result to the athlete, it does not represent a bias.
For the purpose of this project, it is assumed that a single evaluator, described as a specialist or professional in the field, determines the range restriction.
Accordingly, the sectioning of the athletes may be completed in one of two ways: based on the score obtained, called grouping by score, or based on the ranking of the athlete, called grouping by rank.
Thus, given that the athletes performed a specific workout and that the number of groups and the range restriction are pre-established, we can define them as follows: • Grouping by score refers to the distribution of the athletes according to their obtained score. This grouping has a discriminatory nature and aims to compare athletes in classes; consequently, it results in an inference about the athletes' common characteristics and predictor factors. Due to its subjectivity, it is reasonable to consider it an intuitive process that must be carried out by specialists or professionals in the field.
• Grouping by rank refers to the distribution of athletes according to their obtained classification. This grouping has a qualifying nature.
Whether by score or by rank, the grouping criteria also rely on the type of truncation and on the tiebreaker criteria. In addition, it is still necessary to introduce the assumption of increasing grouping per range, which represents the classification of athletes according to their competitive objectives, presuming that better conditioned athletes will tend to perform a greater number of repetitions and be placed in the primary groups.
To exemplify the process of grouping by score, consider first Figure 3, which shows the frequency histogram of the number of repetitions performed in Workout 19.1. Note that this workout is truncated by time, so it is reasonable to group according to the athlete's performance.
Furthermore, Figure 3 shows an approximately normal shape, and Figure 4 includes the assumption of increasing grouping.
The premise is that the grouping by the specialist will tend to be increasing: as the level of competitiveness rises, fewer people are expected to be willing to dedicate the necessary time and effort and, consequently, to fit the primary groups.
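As a rough illustration of grouping by score for a workout truncated by time, the Python sketch below assigns athletes to groups using cut points on the number of repetitions and checks the assumption of increasing grouping. The cut points and scores are hypothetical; in the procedure described here they would be chosen by the specialist.

```python
from bisect import bisect_right
from collections import Counter
from typing import List, Sequence


def group_by_score(reps: int, cut_points: Sequence[int]) -> int:
    """Map a repetition count to a group; Group 1 holds the highest scores.

    cut_points must be sorted in ascending order; a score equal to a cut
    point falls in the better (lower-numbered) group.
    """
    n_groups = len(cut_points) + 1
    return n_groups - bisect_right(cut_points, reps)


def group_sizes(scores: Sequence[int], cut_points: Sequence[int]) -> List[int]:
    """Number of athletes in Group 1, Group 2, ..., last group."""
    counts = Counter(group_by_score(r, cut_points) for r in scores)
    return [counts.get(g, 0) for g in range(1, len(cut_points) + 2)]


def is_increasing(sizes: Sequence[int]) -> bool:
    """True when no group is larger than the next, less competitive group."""
    return all(a <= b for a, b in zip(sizes, sizes[1:]))


# Hypothetical cut points (repetitions) and scores, for illustration only.
cut_points = [100, 150, 200, 250]
scores = [60, 80, 90, 95, 120, 130, 140, 160, 170, 210, 220, 260]
sizes = group_sizes(scores, cut_points)
print(sizes, is_increasing(sizes))
```

A check such as `is_increasing` is only one possible way to express the premise that the primary groups hold fewer athletes than the later ones.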
It is also possible to group by score for workouts truncated by repetitions, as is the case of Workout 20.1. In such a workout, an athlete who reaches the Time Cap interrupts the execution and is scored by the number of repetitions completed, whereas an athlete who finishes is scored by the execution time. Thus, as in the graphic example of Figure 5, after the time of 900 seconds the time is adjusted as a function of the number of repetitions to be done within the Time Cap and the number of repetitions the athlete was able to perform.
Thus, if the athlete has finished all the repetitions within the time cap, their time is kept. It is worth highlighting that in practice, if the time cap were a very large number, athletes would tend to take longer to complete the remaining repetitions, given that they would be spending more of the body's energy resources.
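The adjustment formula itself is not reproduced in this text. Purely as an illustrative assumption, the Python sketch below extrapolates the athlete's average pace up to the time cap, so that an athlete who completed only part of the repetitions receives a time larger than the cap. The function name, the parameter names and the proportional rule are hypothetical and may differ from the adjustment actually used.

```python
def adjusted_time(elapsed_s: float, reps_done: int, reps_total: int,
                  time_cap_s: float) -> float:
    """Return a time-like score for a workout truncated by repetitions.

    Assumed rule (not taken from the source): athletes who finish keep their
    time; athletes stopped by the time cap get the cap scaled by the ratio
    between the repetitions required and the repetitions performed.
    """
    if reps_done >= reps_total:
        return elapsed_s  # finished within the cap: the time is kept
    if reps_done <= 0:
        raise ValueError("at least one completed repetition is required")
    return time_cap_s * reps_total / reps_done


# Hypothetical athletes in a workout with 240 repetitions and a 900 s cap.
finisher = adjusted_time(780.0, 240, 240, 900.0)
capped = adjusted_time(900.0, 180, 240, 900.0)
print(finisher, capped)
```

A constant-pace extrapolation of this kind probably understates the time a fatigued athlete would need for the remaining repetitions, which is consistent with the caveat above about athletes slowing down.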
In this way, Figure 5 demonstrates the grouping by score for a workout truncated by time and thus facilitates the grouping by the specialist, shown in Figure 6, which adapts to the Graded-Response Model, with ordered categories in a single dimension, and takes into account the assumption of increasing grouping.
Finally, grouping by rank is simply a matter of grouping according to the frequencies or number of athletes of interest. As can be seen in Figure 7, the value in each column of the figure needs to be determined according to interest. This grouping is important especially when it is necessary to define the first places in a competition, as is the case of "CrossFit Games - The Open Stage".
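A minimal Python sketch of grouping by rank follows, assuming that the analyst fixes how many athletes enter each of the first groups and that everyone else falls in one last group. The group sizes and ranks are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import Dict, Sequence


def group_by_rank(ranks: Dict[str, int], sizes: Sequence[int]) -> Dict[str, int]:
    """Assign groups from ranks (1 = best) given the desired group sizes.

    sizes[0] athletes go to Group 1, the next sizes[1] to Group 2, and so on;
    athletes ranked beyond sum(sizes) go to one additional final group.
    """
    bounds = list(accumulate(sizes))  # last rank included in each group
    return {athlete: bisect_left(bounds, rank) + 1 for athlete, rank in ranks.items()}


# Hypothetical ranking of six athletes; Group 1 takes 2 athletes, Group 2 takes 3.
ranks = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
groups = group_by_rank(ranks, sizes=[2, 3])
print(groups)
```

Unlike grouping by score, this rule depends only on the athlete's position, so the group boundaries can be set directly from the number of qualifying places of interest.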

Graded-Response Model
The Graded-Response Model (Samejima, 1968, 1969) assumes that the response classes of an item can be ordered. This model obtains more information from people's answers than simply whether they answered yes or no (Andrade et al., 2000; Bonifay, 2019; Uto & Ueno, 2016).
In the CrossFit context, we assume that the groupings (classes) by score, which represent the output of the athlete's performance in a workout (item), can be ordered, so the Graded-Response Model may be applied. Furthermore, the GRM allows an estimate of the probability of an athlete obtaining a score in a workout given his/her physical conditioning. In other words, the athletes can be classified gradually and in an orderly manner, with the best performing athletes in the primary groups and, as their score worsens, in the final groups.
For instance, assume that the scores of the workout classes are arranged in order, from lowest to highest, and denoted by $k = 0, 1, \dots, m_i$, where $(m_i + 1)$ is the number of classes of the $i$-th workout. The probability of an athlete $j$ being placed in a certain group $k$, or a higher one, of workout $i$ is given by the extension of the two-parameter logistic model

$$P^{+}_{i,k}(\theta_j) = \frac{1}{1 + e^{-a_i(\theta_j - b_{i,k})}},$$

with $i = 1, 2, \dots, I$, $j = 1, 2, \dots, n$ and $k = 0, 1, \dots, m_i$, where $b_{i,k}$ is the difficulty parameter of the $k$-th class of workout $i$ and $\theta_j$ represents the physical conditioning (latent trait) of the $j$-th athlete.
In models for dichotomous items, the slope parameter $a_i$ is the item discrimination.
In models for non-dichotomous items, however, the discrimination of a specific class depends on the slope parameter, common to all the classes of the item, as well as on the distance between adjacent class difficulties.
Thus, the probability of a person $j$ receiving a score $k$ in item $i$ is given by the expression

$$P_{i,k}(\theta_j) = P^{+}_{i,k}(\theta_j) - P^{+}_{i,k+1}(\theta_j),$$

where, by definition, $P^{+}_{i,0}(\theta_j) = 1$ and $P^{+}_{i,m_i+1}(\theta_j) = 0$. Notice that if we have a test with $I$ items, each one with $(m_i + 1)$ output classes, then we shall have $\left[\sum_{i=1}^{I} m_i + I\right]$ parameters to be estimated.
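To make the model concrete, the Python sketch below computes the class probabilities of the Graded-Response Model for one workout: the probability of reaching class k or higher is a two-parameter logistic curve with slope a and class difficulty b_k, and the probability of class k is the difference between two adjacent cumulative curves. The parameter values are hypothetical and are not the estimates reported in this study.

```python
import math
from typing import List, Sequence


def cumulative_prob(theta: float, a: float, b: float) -> float:
    """Probability of reaching a class with difficulty b or a higher one (2PL form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def grm_probs(theta: float, a: float, thresholds: Sequence[float]) -> List[float]:
    """Probabilities of classes 0..m for one item, with m = len(thresholds).

    thresholds are the class difficulties b_1 < ... < b_m; the cumulative
    probability is fixed at 1 below class 0 and at 0 above class m.
    """
    cum = [1.0] + [cumulative_prob(theta, a, b) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical workout with five ordered classes (scores 0..4).
a, thresholds = 1.5, [-1.0, 0.0, 1.0, 2.0]
weak = grm_probs(-2.0, a, thresholds)   # poorly conditioned athlete
strong = grm_probs(3.0, a, thresholds)  # well conditioned athlete
print([round(p, 3) for p in weak])
print([round(p, 3) for p in strong])
```

With these illustrative values the poorly conditioned athlete is most likely to fall in the lowest class and the well conditioned athlete in the highest, as the ordering assumption requires.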

Preliminary Analysis of the Data Set
In Figure 4, we can see the grouping by score performed for Workout 19.1. As presented in [3], this workout is described in Figure 8. We begin our analysis with Workout 19.1: because it is truncated by time, the score is given by the number of repetitions executed until t = 900 s. It is therefore necessary to establish new values to represent the output of the athletes: if the athlete fits a specific group, for example Group 1, he or she obtains a score of 5, Group 2 corresponds to a score of 4, and so on.
Thus, after characterizing and ordering Workout 19.1, the data are summarized in Table 2.
19.1 n/a 0 n/a 19193 n/a n/a
n/a = not assessed by the measurement tool.
The parameters of each workout are estimated assuming that $\theta$ follows a normal distribution with $\mu = 0$ and $\sigma = 1$. Values of $a < 1$ indicate that the item has little discrimination capacity, whereas values of $a \geq 1$ mean that the item discriminates well. Table 3 shows that all the workouts present values of $a \geq 1$. Furthermore, as shown in Figure 9, the peak of the curve for each group exceeds 30%, which is positive evidence regarding the information generated by the workout under analysis.
Finally, we may also evaluate the Test Information Function (TIF) and the Standard Error of Measurement (SEM) presented in Figure 10. We can thus verify the degree of precision of the workout set over several ranges of the (0,1) scale and, as can be seen, SEM presents lower values, that is, better precision, in the interval [0,4].

General Analysis of the Data Set
Preliminary analysis is important to describe the process of analyzing and evaluating a set of workouts. However, it is important to understand all the information generated by the competition with their respective numbers of registered athletes. In this sense, Table 1 presents a set of 2 years of competition, 2 types of categories and 14 types of division (gender and age group), thus making it necessary to analyze 56 subsets of data and/or scenarios.
The general analysis, then, consists of an analysis of the main indicators of the quality of the measurement instrument, namely: the parameter a, which provides information on the discrimination power of the workouts; the frequency of respondents, which in this context refers to the number of athletes included in the groups; the characteristic curves of the workouts, given that flat curves or low probability peaks generate little information; and, finally, the TIF and SEM curves.
Initially, all estimates of the parameter a were analyzed and are presented in Figure 11. It is a multidimensional graph showing the estimate of parameter a (y axis) for the 14 divisions (gender and age group), with the two categories and the two years under analysis represented by colors and the size of each point representing the number of athletes in the scenario. Finally, a dotted line marks the value of interest, a ≥ 1. It is noted that the value of interest was reached in all scenarios.
In a second stage, Figure 12 evaluates the occurrence of athletes in each situation; each point in each grid represents a specific workout under analysis. We are interested in identifying, first, the smallest points and, then, the analysis scenario to which they belong. To that end, frequencies greater than 200 were capped at 200, in order to better present (visually) the situations with low frequency.
To guide the analysis, we can, for example, fix the division "Men (18-34)" vertically and observe a lower frequency of athletes in Group 01, workout 03, "Rx'd" type, in the year 2020.
A strategy to avert this situation is to merge groups 1 and 2 and re-estimate the parameters of interest.
In this case, the estimation of parameters after this regrouping did not present a significant difference.
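The two data-preparation steps mentioned in this analysis, capping frequencies at 200 for display and merging groups 1 and 2 before re-estimation, can be sketched in Python as follows. The group counts are hypothetical, and the relabeling rule (later groups shift down by one) is an assumption of this example.

```python
from collections import Counter
from typing import Dict, List


def merge_first_two(groups: List[int]) -> List[int]:
    """Merge Groups 1 and 2 into Group 1 and shift later groups down by one."""
    return [1 if g <= 2 else g - 1 for g in groups]


def capped_counts(groups: List[int], cap: int = 200) -> Dict[int, int]:
    """Athletes per group, with counts above `cap` shown as `cap`."""
    return {g: min(n, cap) for g, n in sorted(Counter(groups).items())}


# Hypothetical scenario with a sparsely populated Group 1.
groups = [1] * 3 + [2] * 40 + [3] * 250
merged = merge_first_two(groups)
print(capped_counts(groups))
print(capped_counts(merged))
```

After the merge, the first group holds enough athletes for its class difficulty to be estimated from more than a handful of observations.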

Discussion
Typically, in world-level sports, the quality of a competition in determining the best athletes or teams is given by the acceptance it enjoys, that is, by the recognition and validation of the competition. Another way of recognizing a competition is based on the rules and norms stated by a higher authority, which in turn validates competitions around the world; an example is the regulation of soccer by FIFA.
Regarding CrossFit, competitions are validated by the entity itself, which sanctions world events that function as qualifying stages for the final competition, called the CrossFit Games. However, due to the proportions reached by the sport and its pace of growth, several competitions classified as amateur aim at determining the fittest athletes and are validated by their own competitors.
In this section, we present an analysis of the events and an assessment of the quality of the "The CrossFit Open" as a mechanism for measuring physical conditioning.
Due to the size of the data set, a preliminary analysis was done regarding the "Rx'd" type, section "Men (18-34)", 2019.

Conclusions
We focused our work on presenting a conceptual arrangement of the main definitions, terms and expressions applied in the CrossFit context and directed towards Samejima's Graded-Response Model of Item Response Theory (Samejima, 1969). In this sense, we accomplished the primary objective of the work by applying the GRM to this context and, as a result, describing the probability of an athlete performing a workout and obtaining a score, given his/her physical conditioning.
In addition, given the fit of IRT to the context presented, it was possible to achieve the second objective and to incorporate into the athletes' performance analyses useful data that seek to identify performance characteristics and to positively evaluate the quality of the CrossFit Open.
Commonly, the evaluation of a measurement tool requires an iterative process of regrouping the results, intended to improve the quality and validation of the measurement instrument. In Item Response Theory, this process means that the response classes are not assessing the analyzed item well enough and that a regrouping is necessary. In the context of CrossFit and of this work, it means that the grouping of the athletes according to their results may not capture all the information that the model would be able to extract and that, with regrouping, it could offer more information about the measurement tool.
Considering this and the 56 scenarios studied, individually evaluating each situation could be costly work. Besides the regrouping analysis, it is still necessary to answer the following questions: Did the score grouping designed by a specialist allocate an adequate number of athletes, and did it do so in the best way? Is the number of specified groups adequate? Given that the grouping elaborated by the specialist for each workout is based on a single scenario, could the remaining scenarios be better regrouped?
These are complex questions that, besides considering the premises and characteristics of the context, require further discussion of how the optimization process should be carried out.
However, this work did not focus on this optimization process.
In addition, Item Response Theory provides other tools that make it possible to infer and measure, qualitatively, predictor factors of performance, for example, a subject of great relevance in the field of Sports Science and Exercise Physiology.
Studies that merge Item Response Theory with the Sports Science and Exercise Physiology context in the manner presented were not found in the literature. There are, on the other hand, studies that merge Sports Science and Psychology and encompass psychometric assessments, some of which employ Item Response Theory.
In conclusion, this work accomplished both objectives proposed, methodological as well as practical, and recognizes the limitations derived from the reduced amount of qualitative data on the topic and the little use of applied probability models.
Future research is needed to build a standardized performance measurement scale that allows for a contextual and practical interpretation of the performance metrics obtained.

Figure 3
Histogram as a function of the number of repetitions performed. Workout 19.1.

Figure 4
Grouping by Workout 19.1 score, truncated by time and performed by a specialist.

Figure 5
Grouping by Workout 20.1 score, truncated by repetitions and performed by a specialist.

Figure 6
Grouping by Workout 20.1 score, truncated by time and performed by a specialist.

Figure 7
Grouping by Workout 20.1 rank, truncated by time and based on frequency.

Figure 9
Workout 19.1 characteristic curve.

Figure 10
Test Information Function (TIF, continuous blue line) and Standard Error of Measurement (SEM, dotted red line).

Figure 11
Estimates of the discrimination parameters.

Figure 12
Multidimensional analysis of the scenarios, groups, workouts and frequency of athletes in each situation.
"NA" represents the group of athletes that did not compute their scores.