In addition to factor analytical techniques from classical test theory (CTT), approaches from item response theory (IRT) were considered for the psychometric evaluation of the candidate items. Within the IRT family of methods, the many-facet Rasch measurement model (MFRM) is a unidimensional model that allows context or situation factors, so-called facets, to be included in addition to the two facets of item difficulty and subject ability considered in the standard Rasch model[40, 43, 44]. The selected facets constitute the model to be tested and enter the calculation of the IRT indices (e.g., person and item estimates). As shown in Figure 3, the defined MFRM model comprised five facets (F1, F4, F5, F6, and F8) and three dummy facets (DF2, DF3, and DF7). The facets are described as follows:
F1 Person: Two persons listening to the exact same audio can differ in their immersive audio experience. For performance assessment applications of IRT, this difference would be attributed to the ability of the individuals. Here, the responsible person trait could be described as receptivity or propensity for immersion. Each participant is considered as an element of this facet.
F4 Piece/Song: Different songs or pieces might contribute differently to the immersive audio experience. This characteristic could be termed potential for immersion. The four pieces of music used constitute the four elements of this facet.
F5 Version: Analogously, different versions, namely, audio formats, could have different potential for immersion. Mono, stereo, and 3D-audio are the three elements of this facet.
F6 Liking: Different degrees of liking a piece of music might influence the immersive audio experience. The four response categories of item C1 are the elements of this facet.
F8 Items: For the same immersive audio experience, a person might respond differently to several items. This is because some items require more of the latent construct than others to achieve the same (high) response category. This characteristic is represented by the item difficulty. The candidate items constitute the elements of this facet.
We introduced dummy facets (DF) into the model to test for interactions between facets and potentially influential variables that are not considered facets in their own right. The dummy facets are described as follows:
DF2 Expertise: Differences in expertise related to music and audio production might influence the immersive audio experience. Participants were assigned to one of three levels of expertise based on their indication of music-related profession. The three levels are the elements that constitute this dummy facet.
DF3 Recruitment: Participants were acquired from a panel provider and via mailing lists. This might have influenced the response behavior. Therefore, the two sources for participants represent the elements of the dummy facet.
DF7 3D Impression: The four response categories of item C2 (from 1 = “strongly disagree” to 4 = “strongly agree”) are the elements of this dummy facet.
The log odds form of the model without dummy facets is given by

\[\mathrm{ln}\left(\frac{{p}_{nimjlk}}{{p}_{nimjlk-1}}\right)={\theta }_{n}-{\delta }_{i}+{\omega }_{m}+{\xi }_{j}+{\lambda }_{l}-{\tau }_{k}\]

where \({p}_{nimjlk}\) is the probability that a person \(n\) responded with category \(k\in \{2,3,4\}\) to item \(i\) when they listened to the piece of music \(m\) in the format \(j\) with a liking of \(l\); \({p}_{nimjlk-1}\) is the probability that a person \(n\) responded with category \(k-1\) under the same conditions; \({\theta }_{n}\) is the receptivity (ability) of person \(n\); \({\tau }_{k}\) is the difficulty of responding with category \(k\) relative to \(k-1\). The difficulty \({\delta }_{i}\) of item \(i\) is the point on the latent variable at which Categories 1 and 4 are equally probable. \({\omega }_{m}\) and \({\xi }_{j}\) are the potential for immersion of piece/song \(m\) and version \(j\), respectively. \({\lambda }_{l}\) represents the influence of response category \(l\) of item C1. The unit of all parameters and, therefore, of the latent dimension is logits, that is, log odds units[40].
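As an illustrative sketch (not the Facets implementation used in the study), the category probabilities implied by this rating-scale log-odds form can be computed by accumulating the log odds across adjacent categories. All parameter values below are made up for illustration.

```python
import numpy as np

def category_probs(theta, delta, omega, xi, lam, tau):
    """Response-category probabilities for the rating-scale MFRM.

    theta: person receptivity; delta: item difficulty; omega, xi, lam:
    piece, version, and liking facet measures; tau: Rasch-Andrich
    thresholds tau_2..tau_K (all in logits). Category 1 is the baseline.
    """
    eta = theta - delta + omega + xi + lam          # common linear part
    # Cumulative sums of the adjacent-category log odds give the
    # unnormalized log probabilities of categories 1..K.
    logits = np.concatenate(([0.0], np.cumsum(eta - np.asarray(tau))))
    p = np.exp(logits - logits.max())               # stabilize before normalizing
    return p / p.sum()

# Illustrative (invented) parameter values on a 4-point scale:
probs = category_probs(theta=0.5, delta=0.2, omega=0.1, xi=0.26,
                       lam=0.0, tau=[-2.0, 0.0, 2.0])
print(probs.round(3))
```

The probabilities sum to 1 by construction; shifting `theta` upward moves probability mass toward the higher response categories.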
To test for the assumed unidimensional structure and the role of other influential variables, we applied a principal component analysis of standardized residuals (PCAR) [40, 45]. Data were preprocessed using Excel and several R packages in RStudio (Version 1.3.959, https://www.rstudio.com; R, Version 4.0.2, https://www.cran.r-project.org; car, Version 3.0-11, https://www.cran.r-project.org/package=car; dplyr, Version 1.0.5, https://www.cran.r-project.org/package=dplyr). Exploratory and confirmatory factor analyses were conducted by means of Jamovi (Version 1.6.23, https://www.jamovi.org/). For the main MFRM analysis, the Facets software (Version 3.83.6, https://www.winsteps.com/facets.htm) was used, and the PCAR was calculated by the software Winsteps (Version 5.0.0, https://www.winsteps.com/winsteps.htm).
Factor analysis of the initial item set
As a test of statistical preconditions, Bartlett’s test of sphericity (χ2 = 76209, df = 300, p < .001) and the Kaiser-Meyer-Olkin measure of sampling adequacy (overall MSA = .984, MSA > .970 for all candidate items) indicated that the data set was suitable for an exploratory factor analysis (EFA)[46]. An EFA with varimax rotation and maximum likelihood extraction revealed that only the first factor (eigenvalue 16.78) had an eigenvalue greater than 1. Thus, according to the Kaiser-Guttman criterion, only this first factor should be extracted[47]. The scree plot also favored a single extracted factor. Although the parallel analysis suggested five factors (model fit: RMSEA = 0.0428, TLI = 0.981, BIC = -372; model test: χ2 = 1087, df = 185, p < .001, total variance explained by the first 5 factors = 76.1%; see Supplementary Figure S2 for simulated eigenvalues), this finding might be the result of psychometrically unsuitable items that distorted the results (see Supplementary Table S3 for the factor loadings). In general, comparing the dimensionality of other immersion-related inventories with that of the IAQI inventory cannot yet be recommended: the dimensionality of overall immersion as a multisensory phenomenon has not been conclusively clarified, existing hierarchical models assert a cause-and-effect relationship for which no data-based evidence is available, and items from other inventories had to be adjusted substantially to meet the needs of the IAQI inventory. When comparing inventories that are evidently different, equality of dimensionality cannot be expected.
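The Kaiser-Guttman decision rule itself (retain only factors whose eigenvalues of the item correlation matrix exceed 1) can be sketched on synthetic one-factor data; this is an illustration of the criterion, not a re-analysis of the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-factor data: 25 items driven by a single latent trait
# (loadings and noise levels are invented for illustration).
n, k = 500, 25
latent = rng.normal(size=(n, 1))
items = 0.9 * latent + 0.3 * rng.normal(size=(n, k))

# Eigenvalues of the item correlation matrix, largest first.
eigvals = np.linalg.eigvalsh(np.corrcoef(items, rowvar=False))[::-1]
n_retain = int((eigvals > 1.0).sum())   # Kaiser-Guttman criterion
print(eigvals[:3].round(2), n_retain)
```

With a single strong latent trait, only the first eigenvalue clears the threshold; parallel analysis would additionally compare each eigenvalue against those of random data.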
Item identification by Many-Facet Rasch Measurement analyses
As the EFA confirmed unidimensionality, in the next step, MFRM analyses were performed. It was assumed that the structure of the 4-point response scale on the latent dimension “immersion” would be the same for all candidate items. Therefore, the rating scale model (RSM) was selected for further analyses rather than the partial credit model (PCM), in which the scale structure would be considered as item-dependent[40,48]. The 5 facets participant, item, piece, version (audio format), and liking and the 3 dummy facets expertise, recruitment, and 3D impression were specified (see Figure 3 for the MFRM model). This model was used for an iterative process to determine the final item set. Based on the two criteria of outfit mean-square statistics and point-measure correlations, outlier participants and items were successively identified and removed from the data set. As a rule of thumb, we decided that no more than 15% of the participants should be excluded as outliers during this process. Generally, mean-square fit statistics indicate the randomness within the probabilistic model and have an expected value of 1.0[49]. Values smaller than the expected value (model overfit) indicate observations that are too predictable, while values larger than 1.0 indicate too little predictability (model underfit); outfit statistics are outlier-sensitive. Mean-square values were used, rather than standardized fit statistics, because with the latter even small deviations from model expectations become significant in larger samples[50]. The point-measure correlation provides information on the correspondence between the observed scores and the model expectation[40]. Therefore, a negative point-measure correlation indicates poor coincidence of model expectations and observations.
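The two trimming criteria can be sketched in a few lines. The formulas below are the standard Rasch fit definitions (outfit mean-square as the average squared standardized residual; point-measure correlation as a plain Pearson correlation); the observed and expected values are invented toy numbers, not output from the study's Facets run.

```python
import numpy as np

def outfit_mnsq(observed, expected, variance):
    """Outfit mean-square: average squared standardized residual.
    Expected value 1.0; >1 signals underfit (noise), <1 overfit."""
    z2 = (observed - expected) ** 2 / variance
    return z2.mean()

def point_measure_corr(observed, measures):
    """Correlation between observations and the corresponding measures;
    a negative value flags responses running against the latent dimension."""
    return np.corrcoef(observed, measures)[0, 1]

# Toy example: responses that track the person measures.
obs = np.array([1, 2, 2, 3, 4, 4], dtype=float)
exp = np.array([1.4, 1.9, 2.3, 2.8, 3.3, 3.7])
var = np.full(6, 0.6)
measures = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
print(round(outfit_mnsq(obs, exp, var), 2),
      round(point_measure_corr(obs, measures), 2))
```

In the toy data the responses follow the measures closely, so the point-measure correlation is strongly positive and the outfit stays well below the exclusion thresholds discussed above.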
The first step of analysis included the complete data set containing all participants (N = 222) and all candidate items (N = 25). The item outfit mean-square values ranged from 0.76 to 3.65, and the Rasch measures explained 55.08% of the variance. To identify potentially disturbing outlier participants, we chose the criterion of an outfit mean-square value >1.75, which is below the rule of thumb of 2.00[51] and close to the recommended sample size-based threshold for dichotomous models of 1.82[52]. As a consequence, 20 participants showed an outfit value of >1.75 and were removed from the data set for the second step of analysis.
In this second step, the analysis of the data set with 91.0% of the participants (n = 202) and all candidate items resulted in 55.18% of explained variance, characterized by an outfit range from 0.78 to 3.05 for the items. To detect outlier participants in this step, we used the point-measure correlation. As a consequence, one participant was excluded from further analyses due to a negative correlation.
After removing a first set of outlier participants, we performed the third step of analysis to identify items with poor model fit. Item 20 showed an outfit of 3.08 while all other items had values ranging from 0.78 to 1.54. Thus, Item 20 was removed from the data.
The subsequent fourth step of analysis resulted in 56.84% of explained variance with item outfit values ranging from 0.81 to 1.63. Again, exclusion of outlier participants was in line with the criteria of an outfit of >1.75 (n = 9) and negative point-measure correlation (n = 2) for the fifth step, with 85.6% of the sample remaining (n = 190). After these steps, data trimming based on person misfit was discontinued.
The sixth step of analysis started from this outlier-adjusted data set (n = 190 participants) and included 24 of the candidate items. In this iteration, the Rasch measures explained 58.17% of the variance, and item outfit values ranged from 0.81 to 1.55. According to the recommended sample size-based threshold for dichotomous MFRM models, item outfit values should be in the range of 0.94 to 1.06[52]. However, these thresholds may be too strict in view of the fact that it was not the first step of analysis[40] and that the data were polytomous rather than dichotomous. Thus, the more lenient criterion of an outfit value of >1.2 was applied to exclude items. According to this criterion, Items 23, 22, 18, 25, 13, 24, and 14 were removed from the item set across seven iterations. In the 13th iteration, the remaining 17 items showed an outfit value between 0.90 and 1.16 and were, thus, considered as psychometrically adequate (see Supplementary Table S4 for details).
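The iterative exclusion logic can be sketched as a loop. Here `estimate_outfits` is a hypothetical stand-in for re-running the MFRM estimation (in the actual study each iteration refits the model in Facets), and the outfit table is invented toy data.

```python
# Toy outfit values keyed by item number (invented for illustration).
TOY_OUTFITS = {23: 1.63, 22: 1.55, 18: 1.48, 25: 1.41, 13: 1.35,
               24: 1.28, 14: 1.22, 15: 0.91, 8: 0.95, 12: 1.00}

def estimate_outfits(items):
    """Stand-in for refitting the MFRM; returns stored toy values."""
    return {i: TOY_OUTFITS[i] for i in items}

def trim_items(items, cutoff=1.2):
    """Remove the worst-fitting item (outfit > cutoff), refit, repeat."""
    items = list(items)
    while True:
        outfits = estimate_outfits(items)
        worst = max(outfits, key=outfits.get)
        if outfits[worst] <= cutoff:
            return items          # all remaining items fit acceptably
        items.remove(worst)

kept = trim_items(TOY_OUTFITS)
print(sorted(kept))
```

One item is removed per iteration rather than all misfitting items at once, because each refit changes the remaining items' fit statistics.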
Final item set
To compile a short final item set, we considered the content of the items as well as their position on the latent dimension of immersion, that is, the item difficulty. The main aim of this last step of analysis was to cover as wide a range as possible on the latent continuum with 10 items, while avoiding large accumulations of items in close vicinity. Therefore, quintiles (20% percentiles) of the item difficulty distribution were used. The authors discussed the items within each quintile and selected two items from each quintile for the final set.
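The quintile binning can be sketched as follows; the item difficulties are made-up values, and the content-based choice of two items per quintile described above is replaced by an arbitrary pick for the sketch.

```python
import numpy as np

# Invented difficulties (logits) for 17 surviving items, keyed by item number.
difficulties = {i: d for i, d in enumerate(np.linspace(-0.9, 0.5, 17), start=1)}
item_ids = np.array(list(difficulties))
measures = np.array(list(difficulties.values()))

# Quintile edges at the 20/40/60/80 percentiles of the difficulty distribution.
edges = np.quantile(measures, [0.2, 0.4, 0.6, 0.8])
bins = np.digitize(measures, edges)      # 0..4 = quintile index per item

final = []
for q in range(5):
    in_q = item_ids[bins == q]
    final.extend(in_q[:2])               # content-based choice in the study
print(sorted(final), len(final))
```

Sampling two items per quintile guarantees coverage across the latent continuum instead of clustering the final set around the middle difficulties.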
Analysis of the final item set
The final item set showed excellent internal consistency (Cronbach’s α = 0.967, SD = 0.903) for the adjusted data set. The quality of this index of internal consistency was comparable to the quality criteria of an intelligence test[53]. A confirmatory factor analysis (CFA) of the adjusted data with the 10 final items as indicators of just one factor resulted in fit measures indicating good or at least adequate fit (model fit: CFI = 0.978, TLI = 0.972, SRMR = 0.0163; see Supplementary Table S5 and Table S6 for details)[54].
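Cronbach's alpha can be computed directly from the person-by-item score matrix; this sketch uses synthetic data (not the study's responses) to illustrate the standard formula.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha from an (n_persons, n_items) score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of the sum score
    return k / (k - 1) * (1 - item_vars / total_var)

# Synthetic, highly consistent responses (invented toy data):
rng = np.random.default_rng(1)
trait = rng.normal(size=(300, 1))
scores = trait + 0.3 * rng.normal(size=(300, 10))
print(round(cronbach_alpha(scores), 3))
```

Because all 10 synthetic items load on the same trait with little noise, alpha comes out high, mirroring the pattern reported for the final item set.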
To check whether the outlier adjusted data adequately fit the specified Rasch model, we considered the standardized residuals[55]. A reasonable fit is indicated when the mean of the standardized residuals is close to 0[33,55] and their standard deviation near 1[33], which was the case (M = –8.15 × 10⁻⁴, SD = 1.01). Furthermore, about 5% or less of the absolute standardized residuals should exceed values ≥ 2, and about 1% or less should have values ≥ 3[33,40], which was also the case with 4.5% being ≥ 2 and 0.9% being ≥ 3.
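These four residual checks can be bundled into one summary function; the sketch below runs it on simulated well-fitting data (toy values, not the study's residuals), for which the standardized residuals are approximately standard normal.

```python
import numpy as np

def residual_checks(observed, expected, variance):
    """Model-fit summary from standardized residuals z = (x - E[x]) / sd."""
    z = (observed - expected) / np.sqrt(variance)
    return {"mean": z.mean(),                       # should be near 0
            "sd": z.std(ddof=1),                    # should be near 1
            "pct_ge_2": (np.abs(z) >= 2).mean() * 100,  # ~5% or less
            "pct_ge_3": (np.abs(z) >= 3).mean() * 100}  # ~1% or less

# Simulated well-fitting data (invented for illustration):
rng = np.random.default_rng(2)
expected = rng.uniform(1.5, 3.5, size=10_000)
observed = expected + rng.normal(scale=0.6, size=10_000)
checks = residual_checks(observed, expected, np.full(10_000, 0.36))
print({k: round(v, 2) for k, v in checks.items()})
```

For truly standard-normal residuals, about 4.6% fall at or beyond |z| = 2 and about 0.3% at or beyond |z| = 3, which is why the reported 4.5% and 0.9% count as acceptable.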
The characteristics of the model as an outcome of iterative MFRM analysis can be summarized in five steps as follows: First, the Rasch measures explained 62.69% of the variance. Second, as shown in Table 3 and Table 4, the outfit values of the items ranged from 0.91 to 1.17 and were, thus, in the targeted range. The position of the items, that is, item difficulty, was almost identical to that from the previous analysis (see Table 3 and Supplementary Table S4) so that a range from –0.81 to 0.40 was covered. Third, Figure 4 shows the resulting Wright map[56] with the facets of participant, piece, version, item, and the (4-point) response scale. As expected, the 3D format was localized slightly higher (0.26 logits) on the latent continuum of immersion than the stereo format (0.14 logits), which was more distinct from the mono format (–0.40 logits; see Supplementary Table S7 for the detailed measurement report of this facet). This means that a 3D audio version was more likely to be rated higher on the immersion scale than the same sound example in stereo or mono format. This finding supports the assumption that 3D audio formats are more likely to actually trigger an increased immersion experience. Fourth, on the piece/song level, comparison of ratings showed only small differences with regard to their localization on the latent dimension (see Supplementary Table S8 for the detailed measurement report of this facet). Therefore, it could be concluded that the experience of immersion was independent of song genre. Fifth, as shown in the category probability curves (Supplementary Figure S3), the response categories of the rating scale (from 1 to 4) were in the correct order. 
The Rasch-Andrich thresholds, which represent the transition points at which adjacent response categories are equally likely to be observed, were each separated by 2 logits, so that no collapsing of categories was necessary[40] (see Supplementary Table S9 for details on the response scale category statistics).
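The category ordering can be illustrated by computing the category probability curves for illustrative thresholds spaced 2 logits apart (invented values, not the study's estimates): with ordered, well-separated thresholds, each of the four categories is the most probable one somewhere along the latent continuum.

```python
import numpy as np

# Illustrative Rasch-Andrich thresholds tau_2..tau_4 (toy values).
tau = np.array([-2.1, 0.0, 2.1])
assert np.all(np.diff(tau) >= 2.0)       # ordered and >= 2 logits apart

theta = np.linspace(-5, 5, 201)          # person-minus-item measure (logits)
# Cumulative adjacent-category log odds -> unnormalized log probabilities.
logits = np.cumsum(theta[:, None] - tau, axis=1)
num = np.exp(np.hstack([np.zeros((201, 1)), logits]))
p = num / num.sum(axis=1, keepdims=True)

# Indices of the modal (most probable) category across the continuum:
modal = np.unique(p.argmax(axis=1))
print(modal)
```

If two thresholds were disordered or too close, one category would never be modal, which is the situation that would call for collapsing categories.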
Table 3 Measurement report for the final 10-item set trimmed for outliers
Item | Total Score | Observed Average | Fair (M) Average | Measure (logits) | Model SE | Outfit MnSq | Outfit ZStd | PtMea | PtExp
15 | 5146 | 2.26 | 2.08 | 0.40 | 0.03 | 0.91 | -2.4 | .79 | .77
8 | 5173 | 2.27 | 2.10 | 0.37 | 0.03 | 0.95 | -1.4 | .78 | .77
12 | 5326 | 2.34 | 2.19 | 0.21 | 0.03 | 1.00 | 0.1 | .77 | .77
21 | 5344 | 2.34 | 2.20 | 0.19 | 0.03 | 0.97 | -0.7 | .79 | .77
19 | 5435 | 2.38 | 2.25 | 0.09 | 0.03 | 1.08 | 2.2 | .76 | .78
6 | 5452 | 2.39 | 2.26 | 0.07 | 0.03 | 1.17 | 4.5 | .75 | .78
11 | 5542 | 2.43 | 2.31 | -0.03 | 0.03 | 1.16 | 4.3 | .75 | .78
5 | 5588 | 2.45 | 2.34 | -0.08 | 0.03 | 0.91 | -2.5 | .80 | .78
7 | 5879 | 2.58 | 2.51 | -0.40 | 0.03 | 1.00 | 0.0 | .79 | .78
4 | 6256 | 2.74 | 2.73 | -0.81 | 0.03 | 0.97 | -0.8 | .78 | .78
Mean | 5514.1 | 2.42 | 2.30 | 0.00 | 0.03 | 1.01 | 0.3 | .78 |
SD | 336.0 | 0.15 | 0.19 | 0.37 | 0.00 | 0.09 | 2.6 | .02 |
Note. N = 190. Total Score = observed raw score; Observed Average = observed raw score divided by the number of observations (2,280); Fair (M) Average = Rasch measure to raw score conversion, producing an average rating for the item that was standardized so that it was fair; Measure = item difficulty in logits; Model SE = model standard error; MnSq = mean-square; ZStd = Z-standardized t-statistic; PtMea = point-measure correlation (correlation between the item's observations and the measures modelled to generate them); PtExp = expected value of the point-measure correlation; SD = standard deviation of the sample (excerpt from Facets Output).
To check the unidimensionality of the 10-item set, we used a principal component analysis of standardized residuals (PCAR) based on the outlier-adjusted data. This revealed contrasts—the principal components—with very similar eigenvalues smaller than 1.6, such that each component had a strength of less than two items (see Supplementary Table S10)[57]. Moreover, the Rasch measures of the items and persons each explained more than two and a half times as much variance as any one of the contrasts. Another indicator of unidimensionality was the high correlation of person measures obtained from clusters of items formed according to their loadings on the components of the PCAR (see Supplementary Table S11 and Table S12).
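The PCAR logic can be sketched as follows: once the Rasch dimension has been removed, the residuals should carry no further structure, so the largest remaining component should stay weaker than about two items (eigenvalue < 2). The residuals below are random toy data, not the study's.

```python
import numpy as np

rng = np.random.default_rng(3)
# Standardized residuals for 190 persons x 10 items (toy data with no
# secondary dimension left after removing the Rasch measures).
residuals = rng.normal(size=(190, 10))

corr = np.corrcoef(residuals, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]   # contrasts, largest first
print(eigvals[:3].round(2))
print("largest contrast < 2 items:", bool(eigvals[0] < 2.0))
```

A large first contrast (well above 2) would instead indicate a cluster of items sharing a secondary dimension, which is what the cluster-correlation check in Supplementary Tables S11 and S12 guards against.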
Table 4 Measurement report and bilingual version of the final 10-item set of the Immersive Audio Quality Inventory.
# | Item | Measure (logits) | Outfit (Mean-Square) | Source
15 | Das Musikhören war mein einziger Wunsch. / My only wish was to listen to the music. | 0.40 | 0.91 | [19]
8 | Ich war oft aufgeregt, weil mich die Musik unmittelbar erreichte. / I was excited because I felt a direct connection with the music. | 0.37 | 0.95 | [19]
12 | Beim Zuhören verlor ich mein Zeitgefühl. / While listening, I lost all sense of time. | 0.21 | 1.00 | [19]
21 | Das Hörerlebnis hat mich stark berührt. / The listening experience moved me. | 0.19 | 0.97 | [25]
19 | Das Hörerlebnis war überwältigend. / My listening experience was overwhelming. | 0.09 | 1.08 | [25]
6 | Ich empfand das Zuhören oft als aufregend. / I often found it exciting to listen to the music. | 0.07 | 1.17 | [19]
11 | Beim Zuhören konnte mich kaum etwas ablenken. / While I was listening, hardly anything could distract me. | –0.03 | 1.16 | [19]
5 | Das Hörerlebnis fesselte mich. / The listening experience captivated me. | –0.08 | 0.91 | [19]
7 | Ich war neugierig auf den weiteren Verlauf des Hörerlebnisses. / While I was listening, I was curious as to how it would continue. | –0.40 | 1.00 | [19]
4 | Ich mochte das Zuhören. / I enjoyed listening. | –0.81 | 0.97 | [19]
Note. For application purposes when using the IAQI, a 4-point rating scale with labeled extremes (1 = “strongly disagree” [„Trifft ganz und gar nicht zu”], 4 = “strongly agree” [„Trifft voll und ganz zu”]) must be used. For additional statistical details of the items, see Table 3.
Application of the IAQI
For the practical application of the IAQI inventory, a scoring procedure is necessary to express the individual answers to the items as one overall value. The scale of the inventory allows response values from 1 to 4. Taking the mean of the answers to all 10 items yields a possible overall score from 1 to 4 in steps of 0.1. To check the admissibility of this scoring procedure in our study, a one-tailed Pearson correlation between the averaged IAQI score across all stimuli and the person measures (logits) was calculated. A high correlation of r(190) = .878, 95% CI [.847, 1.0] was observed. The scatterplot shows a slightly s-shaped arrangement of the data points, as is typical for the relation between raw scores and measures obtained with IRT methods (for details see Supplementary Figure S4). A simple score calculation by averaging the individual response values of the 10 items, without complex individual weighting of items, was therefore considered permissible.
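The scoring rule amounts to a few lines of code; `iaqi_score` is a function name introduced here for illustration, not part of any published implementation.

```python
import numpy as np

def iaqi_score(responses):
    """IAQI overall score: mean of the 10 item responses (1-4 scale)."""
    responses = np.asarray(responses, dtype=float)
    if responses.shape[-1] != 10 or not ((1 <= responses) & (responses <= 4)).all():
        raise ValueError("expected 10 responses coded 1-4")
    return responses.mean(axis=-1)

# Example response vector for the 10 final items:
print(iaqi_score([3, 4, 2, 3, 3, 4, 2, 3, 4, 3]))  # -> 3.1
```

The validity check mirrors the constraint in the note to Table 4 (responses coded 1 to 4 on the labeled-extremes scale); scores for several stimuli can then be averaged per person, as was done for the correlation with the person measures.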