The methodology for generating a common scale that links data from different samples across different multi-item instruments starts with collecting item-level data from multiple studies on the same construct. We then synchronize variable names and scoring formats across studies. After harmonization, we locate connections between study samples and measurement instruments. Two samples connect when they use the same instrument; samples can also connect indirectly via a third sample. Connections through shared instruments are natural links because identical items occur in both samples. In the absence of natural links, subject-matter experts identify potential bridges across studies by placing items into groups based on similarity. We use the term “equate group” to refer to a group of items that measure the same feature in (perhaps slightly) different ways.
Figure 1 displays items from three different instruments that measure child development. The instruments contain several common items that appear in multiple instruments, as well as unique items. Common items are part of an equate group, as displayed by the arrows between them. In the example, one common item is equivalent in all three instruments (i.e. "walks alone"). The item "sits without support" occurs in both the first (i.e. blue) and the second (i.e. green) instrument, and the item "claps hands together" appears in the second (i.e. green) and third (i.e. orange) instrument. When each instrument is administered in a different study, we can link the studies by placing common items in the same column and estimating the model with concurrent calibration. However, linking through data reorganization is not always possible. For example, in Fig. 1, the first and second instruments are administered in the same study, so we cannot place the responses on these two instruments into the same column. In situations like these, the equate group method provides a more flexible way to link items in the model. Eekhout & van Buuren use this generic strategy to connect 16 studies measuring child development [15].
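As a minimal sketch of this reorganization, consider two studies that share one item; the toy data and R objects below are ours (1 = pass, 0 = fail), with item names following Fig. 1:

```r
# Two studies that share one item ("walks alone").
study_a <- data.frame(
  sits_without_support = c(1, 0, 1),
  walks_alone          = c(0, 0, 1)
)
study_b <- data.frame(
  walks_alone          = c(1, 0),
  claps_hands_together = c(1, 1)
)

# bind_rows() matches columns by name: the common item "walks_alone"
# lands in a single shared column, and items a study did not administer
# become NA for that sample.
combined <- dplyr::bind_rows(study_a, study_b)
```

A Rasch model fitted to `combined` then performs concurrent calibration through the shared column.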
Both statistical information and subject-matter experts’ input provide the basis for assigning items to eligible equate groups. An equate group can be either passive or active. An active equate group links items across instruments by restricting their item difficulty estimates to be identical. A passive equate group does not enforce this restriction. Equate group status is initially unknown and should be determined as part of the modelling effort. This flexibility in testing and modifying the equate groups is an excellent strategy for improving comparability. In general, high-quality equate groups contain items that function similarly in different instruments. We evaluate similarity based on face validity and statistical measures, such as infit and outfit. In this way, we have the flexibility to assess the quality of an equate group for items that are not quite identical, such as "says three words" and "says five words" in Fig. 1. Since we store those items in separate columns, we can test whether they can be equated without re-organizing the data.
The Rasch model expresses the probability of passing an item as a logistic function of the difference between the person's ability and the item's difficulty, as follows:
$$P_{ni}=\frac{\exp\left(\beta_n-\delta_i\right)}{1+\exp\left(\beta_n-\delta_i\right)}$$
where \(P_{ni}\) is the probability that person n passes item i, \(\delta_i\) is the item calibration that depends on the attributes of item i (i.e. the difficulty of item i), and \(\beta_n\) depends on the attributes of person n (i.e. the ability of the person) [16]. The log-odds that a person with ability \(\beta_n\) answers an item with difficulty \(\delta_i\) correctly is the difference between the person's ability and the item's difficulty, \(\beta_n-\delta_i\).
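For a concrete sense of the scale (with made-up values): a person with ability \(\beta_n=1\) attempting an item of difficulty \(\delta_i=-0.5\) has log-odds \(1-(-0.5)=1.5\), and hence passes with probability \(\exp(1.5)/(1+\exp(1.5))\approx 0.82\).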
We extended the Rasch model to facilitate the use of equate groups. This extension requires that the item difficulties of similar items be identical. Wright and Stone present a simple method to equate the difficulties of two test forms with common item links [16]. In their case, this is done by estimating the shift in difficulty as the weighted average of the difficulty differences of the linking items and using this weighted average to align the difficulties of the test forms. Wright and Stone align the scales after fitting separate Rasch models to each instrument, similar to the transformation methods described previously. We adapt their method for concurrent estimation, leading to a constrained Rasch model. In particular, suppose that \(\delta_i\) is the difficulty of item i in the equate group, l is the number of items in the equate group, and \(w_i\) is the number of respondents with an observed score on item i. We apply the constraint \(\delta_q=\delta_i\) for all i during concurrent model estimation, where
$$\delta_q=\frac{\sum_{i=1}^{l}\delta_i w_i}{\sum_{i=1}^{l}w_i}$$
is the combined difficulty of the items in the equate group.
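As an illustration with made-up numbers: an equate group containing two items with estimated difficulties 0.4 and 0.6 logits, observed by 600 and 400 respondents respectively, has combined difficulty \(\delta_q=(0.4\cdot 600+0.6\cdot 400)/(600+400)=0.48\) logits, which then replaces both item difficulties during estimation.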
The modelling task consists of selecting the items that best fit the Rasch model and activating equate groups that bridge all instruments. We recommend excluding items with fewer than ten observations in the smallest response category. As a modelling strategy, we advise an iterative procedure that balances three aims: preserving the best items, using active equate groups that perform well and bridge all instruments, and obtaining reasonable distributions of the latent scores in the study population. We diagnose model fit through the quality of the equate groups, using fit measures and visualizations, and we also evaluate the plausibility of the latent variable's distribution and the infit of the items. An important assumption underlying equate groups is that the items in a group work in the same way across the different study samples, i.e., there is no differential item functioning. This assumption is critical for active equate groups: when it is not met, restricting the difficulty parameters to be equal across studies may introduce unwanted bias in comparisons between study samples [17, 18].
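A minimal base-R sketch of two of these steps follows: screening items by category counts, and computing the standard Rasch infit and outfit statistics from standardized residuals. The objects `combined`, `beta`, and `delta` are assumed inputs (a harmonized 0/1 response matrix and person/item estimates from a fitted model, aligned with the columns of `combined`); this is not the dmetric implementation.

```r
# 1. Screening: drop items whose smallest response category (0 or 1)
#    has fewer than ten observations.
min_cat <- apply(combined, 2, function(y) min(table(factor(y, levels = 0:1))))
combined <- combined[, min_cat >= 10, drop = FALSE]

# 2. Item infit and outfit. Values near 1 indicate that an item behaves
#    as the Rasch model predicts.
p  <- plogis(outer(beta, delta, "-"))  # expected P(pass), persons x items
v  <- p * (1 - p)                      # binomial variance per response
sq <- (as.matrix(combined) - p)^2      # squared residuals (NA if missing)
outfit <- colMeans(sq / v, na.rm = TRUE)                  # unweighted mean square
infit  <- colSums(sq, na.rm = TRUE) /
          colSums(v * !is.na(combined))                   # variance-weighted
```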
To estimate the constrained Rasch model, we developed new software in an R package called dmetric [19]. This package contains various tools to work with equate groups (see Appendix A for code). The rasch() function in the dmetric package extends the rasch.pairwise.itemcluster() function from the sirt package [20, 21]. The dmetric package also includes functions that calculate fit measures for items and equate groups and that visualise item response curves. At the time of writing, dmetric is not yet available on CRAN. For the time being, please contact the authors for access to the package.
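Until dmetric is released, the unconstrained part of the analysis can be approximated with sirt alone. The sketch below fits a pairwise Rasch model to the harmonized data from earlier; it applies no equate constraints, which are what dmetric::rasch() adds.

```r
library(sirt)

# Pairwise-likelihood Rasch fit on the harmonized person-by-item matrix;
# this is the unconstrained model without equate groups. The equate
# constraint itself requires dmetric::rasch(), available from the authors.
fit_free <- rasch.pairwise(dat = combined)
summary(fit_free)
```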
Simulation
Objective
Previous simulation studies have investigated, for example, the performance of concurrent calibration methods compared to transformation methods, or the performance of concurrent calibration across different ability distributions [12, 22–24]. Here, we conducted a simulation study to investigate the quality of the constrained solution with equate groups as a function of the measurement range of the instruments, the number of equate groups, the location of the equate groups along the scale, and the difference in abilities between samples, under various amounts of misspecification.
Simulation design
Data were simulated for two or three instruments, and each instrument consisted of 10 unique items plus additional items in equate groups. Table 1 presents a summary of the simulation design. The item difficulties (\(\delta_1,\ldots,\delta_l\)) varied across three simulation conditions. The difficulties of the items could overlap, i.e. be in the same range for both instruments, or not overlap, in which case only the items in the equate groups connected the instruments. Where the difficulties did not overlap, they could be close to one another or not. Accordingly, the ranges of item difficulties of the instruments (1) did not overlap and were not close: [-5, -3] and [3, 5]; (2) did not overlap but were close: [-3, -0.1] and [0.1, 3]; or (3) overlapped: [-2, 1] and [-1, 2]. The number of equate groups was 1, 2 or 5. Depending on the number of equate groups, we added items to the second instrument with item difficulties that were present in the first instrument. The sensitivity to the specification of "wrong" equate groups was investigated by gradually increasing the difference between the difficulties of the items in one of the equate groups, from 0 (no deviation) to 2 logits, in steps of 0.1 logits.
We varied the locations of the equate groups. We suspected that the best locations for equate groups would be relatively far apart and would cover a wide range of the scale. Equate groups were placed in the centre of the instruments, in the range of one instrument but not the other, spread equally over both instruments, or at the extreme end of one instrument. Furthermore, we varied two conditions for the person abilities. In the first condition, both samples had the same ability distribution: normally distributed as N(0, 1). In the second condition, the samples had different ability distributions with a mean difference of 2: N(-1, 1) and N(1, 1). In the condition with three instruments, we simulated samples with abilities in distinct ranges: N(-1.5, 1), N(0.5, 1) and N(2.5, 1).
Given the set of items and difficulty parameters, we simulated a data set with 1000 rows per instrument and one column for each item in the instruments. The person ability settings and the item difficulty settings were used in the sim.raschtype function of the sirt package to generate the data (see R code in Appendix A) [20]. A Rasch model was fitted on the full data to obtain the true difficulty parameters for the reference situation in which all items were administered to all respondents. Subsequently, the data were split such that 1000 persons had data for the first instrument, another 1000 for the second, and, if the condition required it, another 1000 for the third instrument. Two additional Rasch models were fitted to these data: one model in which the equate group items had the same difficulty, and another model in which all item parameters were estimated freely (i.e. no equate groups were indicated). The estimated difficulties from these two Rasch models were compared to the true reference item difficulties from the full data.
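The following sketch illustrates this data-generating step for one condition (two instruments, non-overlapping difficulties, one shared linking item, different abilities). The item names, seed, and placement of the linking item are our own simplifications, not the full design of Table 1.

```r
library(sirt)
set.seed(1)

# Item difficulties: 10 unique items per instrument plus one shared
# linking item (here placed at 0 logits).
delta_a <- c(seq(-5, -3, length.out = 10), 0)
delta_b <- c(seq( 3,  5, length.out = 10), 0)

# Person abilities: two samples with a mean difference of 2.
theta_a <- rnorm(1000, mean = -1)
theta_b <- rnorm(1000, mean =  1)

# Generate dichotomous Rasch responses per instrument.
dat_a <- as.data.frame(sim.raschtype(theta_a, b = delta_a))
dat_b <- as.data.frame(sim.raschtype(theta_b, b = delta_b))
colnames(dat_a) <- c(paste0("A", 1:10), "EQ1")
colnames(dat_b) <- c(paste0("B", 1:10), "EQ1")

# Split design: each sample observes only its own instrument plus the
# shared item; unobserved items become NA when the samples are stacked.
dat <- dplyr::bind_rows(dat_a, dat_b)
```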
Table 1
Summary of the conditions in the simulation design

| Parameter | Variation | Number of variations |
| --- | --- | --- |
| Number of instruments^a | 2 or 3 | 2 |
| Difficulty ranges for the items in the instruments (\(\delta_1,\ldots,\delta_l\)) | No overlap: [-5, -3] and [3, 5]; Close: [-3, -0.1] and [0.1, 3]; Overlap: [-2, 1] and [-1, 2] | 3 |
| Number of equate groups | 1, 2, or 5 | 3 |
| Location of equate groups | In the centre of the instruments; In range of one instrument (not the other); Evenly spread over both instruments; At the extreme end of one instrument | 4 |
| Equate misspecification | Difficulty deviation of 0 to 2 logits in steps of 0.1 | 21 |
| Abilities (\(\beta_n\))^b | Equal: N(0, 1); Different: N(-1, 1) and N(1, 1) (2 instruments) or N(-1.5, 1), N(0.5, 1), and N(2.5, 1) (3 instruments) | 2 |

Note: ^a Each instrument contained 10 items, plus additional equate items. ^b Data were generated for 1000 persons per instrument.
Comparing the model performance
The quality of the scaling of the difficulties for the two (or three) instruments was evaluated using two statistical measures: the correlation and the misalignment. The correlation (ρ) was calculated between the true difficulty parameters and the estimated difficulty parameters obtained from the models with and without equate groups. A higher correlation indicates that the modelled estimates are closer to the scale of the true difficulties.
The misalignment (\(\gamma\)) was measured by estimating the vertical distance between two regression lines (Fig. 2). One line is the regression of the true difficulty parameters on the modelled difficulty estimates for instrument A; the second line is the corresponding regression for instrument B. The misalignment coefficient captures whether the estimated difficulty parameters of the instruments are aligned to the same scale, and we estimate it by
$$\delta = c + \beta\hat{\delta} + \gamma z$$
where \(\delta\) contains the true difficulty parameters of the items, c is the intercept, \(\beta\) is the coefficient for the estimated difficulties \(\hat{\delta}\), and \(\gamma\) is the misalignment coefficient for the instrument indicator z. A larger coefficient \(\gamma\) indicates more misalignment of the difficulty estimates between the instruments.
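A minimal sketch of both measures in R, assuming vectors `delta_true` and `delta_hat` (true and estimated difficulties per item) and a factor `instrument` recording which instrument each item belongs to; all object names are ours:

```r
# Correlation between true and estimated difficulties.
rho <- cor(delta_true, delta_hat)

# Misalignment: regress the true difficulties on the estimates plus an
# instrument indicator; the indicator's coefficient is the vertical
# distance gamma between the instrument-specific regression lines.
fit <- lm(delta_true ~ delta_hat + instrument)
gamma <- coef(fit)[["instrumentB"]]  # assumes instrument levels "A" and "B"
```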