Wrapped Into Sound: Development of the Immersive Audio Quality Inventory (IAQI)

Although virtual reality, video entertainment, and computer games depend on the three-dimensional reproduction of sound (including front, rear, and height channels), it remains unclear whether 3D-audio formats actually intensify the emotional listening experience. There is currently no valid inventory for the objective measurement of immersive listening experiences resulting from audio playback formats with increasing degrees of immersion (from mono to stereo, 5.1, and 3D). The development of the Immersive Audio Quality Inventory (IAQI) could close this gap. An initial item list (N = 25) was derived from studies in virtual reality and spatial audio, supplemented by researcher-developed items and items extracted from historical descriptions. Psychometric evaluation was conducted in an online study (N = 222 valid cases). Under controlled headphone playback, participants listened to four songs/pieces, each in the three formats of mono, stereo, and binaural 3D audio. The latent construct "immersive listening experience" was determined by probabilistic test theory (item response theory, IRT) and by means of many-facet Rasch measurement (MFRM). As a result, the specified MFRM model showed good model fit (62.69% of explained variance). The final one-dimensional inventory consists of 10 items and will be made available in English and German.


Introduction
In the early days of sound transmission and reproduction, one of the main technological aims was the rendition of spatial concert atmospheres over loudspeakers [1]. In the 1950s, when stereo media hit the market [2], a 2-channel recording and reproduction system seemed to be the landmark for high-fidelity playback of music. However, as early as 1940, and in cooperation with the conductor Leopold Stokowski, the entertainment industry initiated the application of rear and elevated speakers for the Walt Disney film Fantasia (1940) and introduced the new spatial audio format Fantasound [1]. This format even included a "voice of God" loudspeaker mounted on the ceiling [3]. In the following decades, a variety of technological developments were necessary to accomplish the evolution from monophonic to 3D sound reproduction with the main aim of creating a spatial illusion. Most of the technical approaches, however, were limited to the listening experience of surround sound in the horizontal plane [4]. In the 1970s, Granville Cooper and Michael Gerzon played a key role in the further development of 3D audio formats. Based on a recording and playback system with four stereo channels, Cooper sought to recreate a concert performance in the home environment [5]. This system was called tetrahedral ambiophony and could fulfill the basic psychoacoustic requirements for a three-dimensional sound field construction with a limited number of four loudspeakers in front, rear, and elevated positions [5][6][7][8]. Since then, additional audio formats such as Auro 3D, Dolby Atmos, and DTS:X have been developed. All of the aforementioned playback technologies can be summed up under the term 3D audio or immersive audio.
However, the question remains whether there is a relationship between the increasing spatiality of sounds and the listener's emotional response. Using the Geneva Emotional Music Scale (GEMS-25) [9], Hahn [10] conducted a first attempt at measuring emotions evoked by 3D audio, surround sound, and stereo. However, the latent construct of immersion could not be investigated with the GEMS inventory. According to Görne [11], the goal of a stereo recording is to place the listener in a virtual acoustic environment. One characteristic of a successful recording is the impression of a virtual space. Following this line of reasoning, a comparison of audio playback formats should be based on the extent to which a listener feels immersed in a virtual acoustic environment (e.g., in stereo, surround sound, and 3D audio). This presupposes an objective tool for measurement that is currently unavailable. The only two inventories that come closest to our research question focus either on the perceptual evaluation of spatial audio technologies [12] or on the development of a consensus vocabulary (and its application) for the perceptual space of venues for music and speech performance [13].
In this context, the key term immersion is an important concept from virtual reality research, which can be "characterized by diminishing critical distance to what is shown and increasing emotional involvement in what is happening" [14]. Other terms related to the conceptual field of immersion are absorption and presence, for which a variety of partially overlapping definitions exist. For example, absorption is defined as "an extreme involvement or preoccupation with one object, idea, or pursuit, with inattention to other aspects of the environment […]" [15], and presence is understood as "the subjective experience of being in one place or environment, even when one is physically situated in another" [16]. For the immersion-related term of presence, the most concise definition is the experience of "being there" [17]. Some studies also assume the existence of social presence. For example, Shin et al. found evidence that 3D sound can play a key role in triggering social presence, thereby positively influencing enjoyment [18]. However, due to the lack of clear definitions and comprehensive concepts, a clear-cut distinction between the various types of presence is difficult. We define immersion as a "psychological state characterized by perceiving oneself to be enveloped by, included in, and interacting with an environment that provides a continuous stream of stimuli and experiences" [16]. In this context, being immersed means being involved in a given context, not only physically but also mentally and emotionally [19]. For our study, we further assume as a working definition that immersion is a continuous latent trait. Its manifestation may depend on innate and learned hearing mechanisms. We presume that psychoacoustic and electrophysiological correlates exist.
Although inventories for the operationalization of these terms already exist, they are predominantly related to the visual domain. These selected existing inventories will serve as a starting point for the development of an audio-specific inventory (see Table 1).

Study aims
The main aim of the study was the development of an inventory for the measurement of subjectively perceived degrees of auditory immersion. This would allow for later comparison of immersive experiences resulting from different audio playback formats. For this purpose, a multi-stage model of test development was applied [20]. As the development of an inventory requires a large number of participants (in our case, N > 200), a laboratory study seemed unrealistic. For this reason, we decided on a web-based approach. Because most participants would not meet the technical requirements for standards-compliant 3D audio playback via loudspeakers (e.g., elevated or upfiring speakers), binaural 3D versions of the musical stimuli had to be created so that a 3D effect could be generated by means of headphones.
As the perceived 3D effect in binaural productions is influenced by many factors, for example, the individual Head-Related Transfer Function (HRTF) [21], the selection of the stimuli remained a particular challenge. To generate a sufficient amount of response variance, we had to confirm that the binaural 3D audio material had the potential to elicit a convincing 3D effect among the participants. This was ensured by extensive pre-testing and additional evaluation of the auditory stimuli by experienced sound engineers. Following data collection, advanced psychometric routines such as confirmatory factor analysis and item response theory (IRT) were applied so that we could decide on the dimensionality of the latent construct immersion and on the validity and reliability of the items [22]. In the end, a short inventory (with a length of about 10 items) was to be made available to the research community for the future evaluation of listening situations in which spatial audio and immersive audio experiences are of interest.

Formulation and selection of items
A mixed strategy of item identification and item generation was applied: In a first step, a literature review was conducted in the databases PsycINFO and ProQuest on the topics of virtual reality, gaming, and spatial audio, focusing on inventories that address the key terms immersion, absorption, involvement, or presence. As the majority of inventories came from the domain of augmented or virtual reality research, the wording of selected items had to be refocused on listening. The original items were mainly used as a source of inspiration and had to be adapted significantly. For example, an item such as "I liked the type of the activity" [19] was reformulated to "I enjoyed listening," and an item such as "I felt detached from the outside world" [23] was adapted to "While listening, I felt as if I were detached from the rest of the world." Additionally, items extracted from historical descriptions of spatial audio effects (Items 22 and 23) and researcher-developed items were added (Items 25, C1, and C2). On this basis, an initial item set of 25 candidate items and two control items was compiled (see Table 1). For the original wording of the items and their adaptation, see Supplementary Table S1. (Sample item: "I had a three-dimensional listening experience while listening to this piece/song.")

Note. RD = researcher-developed item.
The wording of the items was meant to capture the personal listening experience (emotions felt) rather than to describe the technical properties of the sound or music or what it conveys (emotions perceived). Therefore, items were predominantly formulated as first-person statements. In addition, items related to hypothetical situations, performance or learning tasks, control of the situation, or visual aspects were disregarded. In the case of items with similar content from different inventories, the item that could best be adapted to music reception was selected. Identical items from different sources were considered only once.
After the selection and adaptation process, German and English versions of the initial item set were created according to the standards of cross-cultural research methods and test adaptation (e.g., translation, evaluation, and retranslation) [28][29][30]. Table 1 contains all items from the initial list and two additional items to control for the liking of the piece/song and for the impression of three-dimensionality. A 4-point rating scale with labeled extremes (1 = "strongly disagree" ["Trifft ganz und gar nicht zu"], 4 = "strongly agree" ["Trifft voll und ganz zu"]) was used for item responses.

Online Study
An online study was conducted for the psychometric evaluation of the German version of the item set from 28 January to 2 March 2021 using the platform SoSci Survey (www.soscisurvey.de). All standards for the implementation of an internet listening experiment, such as high-hurdle techniques or a check of participants' audio equipment, were observed [31]. In terms of sample size, according to classical recommendations for exploratory factor analysis (EFA), a subject-to-item ratio of about 10:1 can be regarded as a reasonable starting point [32]. This results in a target of about 250 valid cases for the EFA. Given the expected high demands on participants' endurance and audio equipment, this seemed to be a realistic target sample size. Finally, for the scheduled MFRM, a minimum of 30 observations per element (e.g., a participant or an item) and at least 10 observations per response scale category (on the 4-point scale) were necessary for stable estimates of the respective parameters [33]; both requirements were achieved with the sample size required for the factor analysis.
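The observation counts implied by these recommendations can be verified with a quick sketch (counts taken from the study design; the target of 250 participants follows the 10:1 rule applied to the 25 candidate items):

```python
# Sanity check of the sample-size requirements for the planned MFRM
# (a minimal sketch based on the design described above).
n_items = 25            # candidate items
n_stimuli = 4 * 3       # four pieces/songs x three formats (mono, stereo, 3D)
n_participants = 250    # target from the 10:1 subject-to-item rule for EFA

# Fully crossed design: every participant rates every item for every stimulus.
obs_per_item = n_participants * n_stimuli        # observations per item
obs_per_participant = n_items * n_stimuli        # observations per participant

assert obs_per_item >= 30 and obs_per_participant >= 30
print(obs_per_item, obs_per_participant)  # 3000 300
```

With these numbers, each item receives 3,000 observations and each participant contributes 300, comfortably above the 30-observations-per-element minimum.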

Stimuli
Potentially suitable audio material was gathered from a variety of sources. Due to the general methodological approach, mono, stereo, and 3D versions of all pieces were required. As an online study was to be conducted, all 3D audio samples had to be available as binaural versions for headphone usage. In general, three different approaches were used to create these binaural versions. Through an extensive iterative process of external and internal evaluation of the stimuli regarding their degree of immersion, four suitable pieces/songs were selected (for the final stimulus list, see Supplementary Table S2). Based on these four preselected 3D stimuli, three audio engineering experts identified the section of every piece/song with the strongest 3D effect. Stereo and mono versions were added to the stimulus selection as additional formats with predictably lower degrees of immersion. For the online study, all stimuli were normalized to -20 LUFS (integrated). The length of each section was about 60 seconds and was kept constant across all versions of a piece/song. This length is considered sufficient, as the mean initial emotional response time to audio stimuli is around 8.31 s; our sample duration clearly exceeded this minimum requirement [34]. All sound examples were presented in WAV format. Details of the complete stimulus selection process are described in Supplementary Figure S1.
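Normalizing all stimuli to a common integrated loudness amounts to applying a fixed gain equal to the difference between the target and the measured loudness. A minimal sketch (the measurement of integrated loudness itself requires K-weighted gating per ITU-R BS.1770 and is assumed to come from a dedicated meter; the example values are hypothetical):

```python
import numpy as np

def normalize_to_target(samples: np.ndarray, measured_lufs: float,
                        target_lufs: float = -20.0) -> np.ndarray:
    """Apply the linear gain that moves a signal from its measured
    integrated loudness to the target loudness (both in LUFS/dB)."""
    gain_db = target_lufs - measured_lufs
    return samples * 10.0 ** (gain_db / 20.0)

# Example: a stimulus measured at -14.2 LUFS is attenuated by 5.8 dB.
x = np.array([0.5, -0.25, 0.1])
y = normalize_to_target(x, measured_lufs=-14.2)
print(round(float(y[0] / x[0]), 3))  # linear gain factor, 0.513
```

Because the gain is constant across the whole file, normalization changes only the level, not the spatial rendering of the mono, stereo, or binaural 3D versions.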
Procedure

Figure 1 depicts the entire procedure of the online study. On the welcome page of the survey, participants were informed that the study was about music perception and that participation would take about 45 minutes. Information on technical requirements was given (e.g., audio playback equipment and deactivation of sound processing enhancements of the operating system). All attendees were informed that various tests on attentive participation would be embedded and that response time would be recorded. The informed consent of the participants was then requested.
To check for participants' attention, we administered a short calculation task (4 + 5 = ?). Additionally, the input field was not limited to the character length of the solution. This task was designed to exclude those participants using autofill scripts for the completion of questionnaires. The same filter criterion was applied to the input field in which participants were asked to state their age. Next, participants indicated their gender, educational level according to the ISCED [35], and whether they were in a music-related profession.
The Quick Hearing Check (QHC) [36] is a 15-item self-report on hearing loss. According to the QHC instructions, sum scores of 32 or higher generally indicate severe hearing loss; this served as an exclusion criterion in our study. An instructed response item was embedded in the list of original QHC items to detect participants who produced meaningless data through nonattentive response behavior [37].
Next, participants had to indicate the kind of playback device they used in this study from a list of playback devices (i.e., headphones, built-in laptop, smartphone, or tablet speaker(s), speakers in a monitor/TV, or freestanding speakers). Self-reported non-headphone users were informed that the use of headphones was mandatory for this experiment and that usage would be controlled by listening tasks. In the next step, the type (circumaural, supra-aural, intra-aural), the manufacturer, and the model of the headphones used had to be provided. It was then checked whether autoplay and JavaScript were enabled in the browser. For several browser types, brief instructions on how to meet these requirements were given. Windows users were instructed to deactivate all sound processing enhancements.
After the technical requirements were established, participants completed the Headphone and Loudspeaker Test (HALT) [38,39]. HALT comprises tasks for calibrating the playback level, checking the correct assignment of stereo channels, estimating the lower cutoff frequency, and screening for headphone usage. In the original HALT laboratory experiment with various playback devices, participants set an average level of 67.77 dB(A) (test-retest reliability r_tt = .899) with relatively low heterogeneity (SD = 4.29) by using a counting task. The subjectively adjusted sound pressure level was measured with a short section from a pop song (long-term LUFS = -8.4). As headphones of varying quality were also used in the HALT study, we expected a similar setting and reliability of the volume standardization in the development of the Immersive Audio Quality Inventory (IAQI, pronounced "Yuackee").
Spatial Hearing Test

As a manipulation check (perception of differences between the audio formats), a comparison task (2-AFC design) was used: Participants listened to three pairs of sound samples and decided which sound sample of a pair exhibited higher spatiality. Pairs and pair positions were presented in random order and were based on the same 20-second excerpts used for the main study (rendered either in mono or in 3D audio). One pair served as a retest item.
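The comparison task can be sketched as follows (a hypothetical reconstruction: the excerpt names and the placement of the retest pair are assumptions, not the study's actual implementation):

```python
import random

def make_2afc_trials(excerpts, seed=None):
    """Spatial hearing test as a 2-AFC task: each excerpt is presented as a
    mono/3D pair in random within-pair position, the pairs appear in random
    order, and the first pair is repeated as a retest item (assumed placement)."""
    rng = random.Random(seed)
    trials = []
    for name in excerpts:
        pair = [(name, "mono"), (name, "3D")]
        rng.shuffle(pair)             # randomize pair position (left/right)
        trials.append(tuple(pair))
    rng.shuffle(trials)               # randomize pair order
    trials.append(trials[0])          # retest item repeats one pair
    return trials

trials = make_2afc_trials(["excerpt_A", "excerpt_B"], seed=42)
print(len(trials))  # 3 pairs presented, one of them a retest
```

The participant's task per trial is simply to pick the member of the pair with higher spatiality; agreement between the original and the retest pair indicates reliable responding.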
After the participants completed the initial tests, the main part of the study started. A complete (fully crossed) design was used [40]. Because such a design produces no missing values, it leads to the highest precision of model parameter estimates. In our study, all items were presented in random order. To reduce cognitive load, this random order of the candidate items and control items was determined once per participant and then kept constant throughout the entire procedure. Instructed response items were embedded between the original candidate items for each stimulus, which enabled us to check for attentive participation. The stimuli were randomized in two steps for each participant: First, the order of the versions (mono, stereo, 3D) was randomized for each piece/song. Second, the pieces/songs were placed in random order. Each stimulus was first played automatically on a blank questionnaire page. After the stimulus had played once completely, the candidate items and control items were displayed with their 4-point rating scale, along with control buttons for replaying and pausing the stimulus.
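The two-step stimulus randomization can be sketched as follows (piece names are placeholders):

```python
import random

def stimulus_order(pieces, versions=("mono", "stereo", "3D"), seed=None):
    """Two-step randomization of the listening procedure:
    (1) shuffle the version order separately for each piece/song,
    (2) shuffle the order of the pieces/songs themselves."""
    rng = random.Random(seed)
    per_piece = {p: rng.sample(versions, k=len(versions)) for p in pieces}
    piece_order = rng.sample(pieces, k=len(pieces))
    return [(p, v) for p in piece_order for v in per_piece[p]]

order = stimulus_order(["A", "B", "C", "D"], seed=1)
print(len(order))  # 12 stimuli: 4 pieces/songs x 3 versions
```

Each participant thus hears all three versions of a piece in a piece-specific random order, which counterbalances position effects of the formats across the sample.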
Additional criteria for data trimming were predefined to ensure data quality: If a participant answered two instructed response items incorrectly, they were excluded from the survey. Cases in which participants took longer than 5 minutes to answer the items for one stimulus were flagged. If a processing duration of 5 minutes was exceeded a second time for the same case, the flagged participant was excluded from the survey.
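The timeout rule can be expressed as a simple filter (threshold in seconds; the function name is illustrative):

```python
def exceeds_timeout_twice(durations_s, limit_s=300):
    """Flag a participant each time answering the items for one stimulus
    takes longer than 5 minutes (300 s); exclude on the second occurrence."""
    flags = sum(1 for d in durations_s if d > limit_s)
    return flags >= 2

assert exceeds_timeout_twice([120, 310, 95]) is False   # flagged once only
assert exceeds_timeout_twice([310, 80, 330]) is True    # excluded
```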

Participants
Participants were recruited through a commercial sample provider (mo'web GmbH, Germany, https://www.mowebresearch.com/) and through target-group-specific mailing lists. Multiple criteria for the filtering of meaningless data were applied during data collection. As shown in Figure 2, of N = 2,277 commenced questionnaires, only 255 were completed; 2,022 were excluded due to incorrect answering of the instructed response items, high QHC scores, or dropout. Five participants had to be excluded manually due to repeated timeouts. To exclude participants who did not use headphones, we applied the results of the HALT screening procedure. To maximize the percentage of correct classifications, HALT can determine the optimal scoring for the screening procedure for a given prevalence, that is, the proportion of headphone users in the relevant population. We assumed that 75% of the participants who reported using loudspeakers switched to headphone use after receiving instructions to do so. However, 28 participants were classified as loudspeaker users according to the optimal screening method identified for a headphone prevalence of 75% and were therefore excluded. The remaining 222 participants comprised the final sample and were the basis for the next steps of data analysis. Table 2 shows socio-demographic data for this sample and the subsamples grouped by type of acquisition.

Ethical Approval Statement
The study was performed in accordance with relevant institutional and national guidelines [41,42] and with the principles expressed in the Declaration of Helsinki. Formal approval of the study by the Ethics Committee of the Hanover University of Music, Drama and Media was not mandatory, as the study adhered to all required regulations. Anonymity of participants and confidentiality of their data were ensured. Participants were informed about the objectives and the procedure of the survey as well as the option to withdraw from the study at any time without providing reasons and without repercussions. All participants gave their informed consent online, in accordance with the guidelines of the Hanover University of Music, Drama and Media, by ticking a checkbox.

Results
In addition to factor-analytical techniques from classical test theory (CTT), approaches from item response theory (IRT) were considered for the psychometric evaluation of the candidate items. From the IRT family of methods, many-facet Rasch measurement (MFRM) was chosen.

F8 Items: For the same immersive audio experience, a person might respond differently to several items. This is because some items require more of the latent construct than others to achieve the same (high) response category. This characteristic is represented by the item difficulty. The candidate items constitute the elements of this facet.

We introduced dummy facets (DF) into the model to test for interactions between facets and potentially influencing variables that are not considered facets in their own right. The following serves as a description of the dummy facets:

DF2 Expertise: Differences in expertise related to music and audio production might influence the immersive audio experience. Participants were assigned to one of three levels of expertise based on their indication of a music-related profession. The three levels are the elements that constitute this dummy facet.
DF3 Recruitment: Participants were acquired from a panel provider and via mailing lists. This might have influenced response behavior. Therefore, the two recruitment sources represent the elements of this dummy facet.
DF7 3D Impression: The four response categories of item C2 (from 1 = "strongly disagree" to 4 = "strongly agree") are the elements of this dummy facet.
The log odds form of the model without dummy facets is given by

ln(p_nimjlk / p_nimjl(k-1)) = θ_n - δ_i + ω_m + ξ_j + λ_l - τ_k,

where p_nimjlk is the probability that person n responded with category k ∈ {2, 3, 4} to item i when listening to the piece of music m in format j with a liking of l; p_nimjl(k-1) is the corresponding probability for category k - 1; θ_n is the immersion measure of person n; and τ_k is the difficulty of responding with category k relative to k - 1. The difficulty δ_i of item i is the point on the latent variable at which Categories 1 and 4 are equally probable. ω_m and ξ_j are the potential for immersion of piece/song m and version j, respectively. λ_l represents the influence of response category l of item C1. The unit of all parameters, and therefore of the latent dimension, is logits, that is, log odds units [40].
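For a given combination of person, item, piece, version, and liking, the rating-scale model turns these logit parameters into category probabilities. A numerical sketch (all parameter values are illustrative, not the fitted IAQI estimates):

```python
import numpy as np

def rsm_probs(theta, delta, omega, xi, lam, tau):
    """Category probabilities for a rating-scale model with linear predictor
    eta = theta - delta + omega + xi + lam and thresholds tau_2..tau_4.
    All inputs are in logits; returns probabilities for categories 1..4."""
    eta = theta - delta + omega + xi + lam
    # Cumulative log-odds: log psi_k = sum_{h<=k} (eta - tau_h), psi_1 = 0.
    steps = np.concatenate(([0.0], np.cumsum(eta - np.asarray(tau))))
    probs = np.exp(steps - steps.max())     # subtract max for stability
    return probs / probs.sum()

p = rsm_probs(theta=0.5, delta=-0.2, omega=0.1, xi=0.26, lam=0.0,
              tau=[-2.0, 0.0, 2.0])
print(p.round(3), int(p.argmax()) + 1)  # most probable category: 3
```

Raising any of the positive facet parameters (e.g., ξ for the 3D version) shifts probability mass toward the higher response categories, which is exactly the pattern the version facet is meant to capture.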
To test for the assumed unidimensional structure and the role of other influential variables, we applied a principal component analysis of standardized residuals (PCAR) [40,45] (see Supplementary Figure S2 for simulated eigenvalues). One should bear in mind that this finding might be the result of psychometrically unsuitable items that disturbed the results (see Supplementary Table S3 for the factor loadings). In general, a comparison of the dimensionality of other immersion-related inventories and the IAQI could not yet be recommended. The dimensionality of overall immersion as a multisensory phenomenon had not yet been conclusively clarified. Existing hierarchical models assert a cause-and-effect relationship for which no data-based evidence was available. Items from other inventories had to be adjusted significantly in order to meet the needs of the IAQI. When comparing inventories that are evidently different, equality of dimensionality cannot be expected.

Item identi cation by Many-Facet Rasch Measurement analyses
As the EFA confirmed unidimensionality, MFRM analyses were performed in the next step. It was assumed that the structure of the 4-point response scale on the latent dimension "immersion" would be the same for all candidate items.
Therefore, the rating scale model (RSM) was selected for further analyses rather than the partial credit model (PCM), in which the scale structure would be considered item-dependent [40,48]. The 5 facets participant, item, piece, version (audio format), and liking and the 3 dummy facets expertise, recruitment, and 3D impression were specified (see Figure 3 for the MFRM model). This model was used in an iterative process to determine the final item set. Based on the two criteria of outfit mean-square statistics and point-measure correlations, outlier participants and items were successively identified and removed from the data set. As a rule of thumb, we decided that no more than 15% of the participants should be excluded as outliers during this process. Generally, mean-square fit statistics indicate the randomness within the probabilistic model and have an expected value of 1.0 [49]. Values smaller than the expected value (model overfit) indicate observations that are too predictable, while values larger than 1.0 indicate too little predictability (model underfit); outfit statistics are outlier-sensitive. Mean-square values were used rather than standardized fit statistics because, with the latter, even small deviations from model expectations become significant in larger samples [50]. The point-measure correlation provides information on the correspondence between the observed scores and the model expectation [40]. Accordingly, a negative point-measure correlation indicates poor agreement between model expectations and observations.
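The outfit statistic used in this screening is the unweighted mean of squared standardized residuals. A minimal sketch with toy numbers (not study data):

```python
import numpy as np

def outfit_mnsq(observed, expected, variance):
    """Outfit mean-square: unweighted mean of squared standardized
    residuals; expected value 1.0 under the Rasch model."""
    z = (np.asarray(observed) - np.asarray(expected)) / np.sqrt(variance)
    return float(np.mean(z ** 2))

# Toy example: residuals that are small relative to the model variance
# produce an outfit well below 1 (model overfit, i.e., too predictable).
obs = np.array([3.0, 2.0, 4.0, 1.0, 3.0])
exp = np.array([2.8, 2.1, 3.6, 1.4, 2.9])
var = np.array([0.8, 0.9, 0.5, 0.6, 0.8])
print(round(outfit_mnsq(obs, exp, var), 3))  # 0.132
```

Because every squared residual enters with equal weight, a single highly unexpected response can inflate the outfit, which is why it was used to hunt outlier participants and items.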
The first step of the analysis included the complete data set containing all participants (N = 222) and all candidate items (N = 25). The item outfit mean-square values ranged from 0.76 to 3.65, and the Rasch measures explained 55.08% of the variance. To identify potentially disturbing outlier participants, we chose the criterion of an outfit mean-square value >1.75, which is below the rule of thumb of 2.00 [51] and close to the recommended sample-size-based threshold for dichotomous models of 1.82 [52]. As a consequence, 20 participants showed an outfit value >1.75 and were removed from the data set for the second step of the analysis.
In this second step, the analysis of the data set with 91.0% of the participants (n = 202) and all candidate items resulted in 55.18% of explained variance, with item outfit values ranging from 0.78 to 3.05. To detect outlier participants in this step, we used the point-measure correlation. As a consequence, one participant was excluded from further analyses due to a negative correlation.
After removing this first set of outlier participants, we performed the third step of the analysis to identify items with poor model fit. Item 20 showed an outfit of 3.08, while all other items had values ranging from 0.78 to 1.54. Thus, Item 20 was removed from the data.
The subsequent fourth step of the analysis resulted in 56.84% of explained variance, with item outfit values ranging from 0.81 to 1.63. Again, the exclusion of outlier participants followed the criteria of an outfit value >1.75 (n = 9) and a negative point-measure correlation (n = 2) for the fifth step, with 85.6% of the sample remaining (n = 190). After these steps, data trimming based on person misfit was discontinued.
The sixth step of the analysis started from this outlier-adjusted data set (n = 190 participants) and included 24 of the candidate items. In this iteration, the Rasch measures explained 58.17% of the variance, and item outfit values ranged from 0.81 to 1.55. According to the recommended sample-size-based threshold for dichotomous MFRM models, item outfit values should be in the range of 0.94 to 1.06 [52]. However, these thresholds may be too strict given that this was not the first step of the analysis [40] and that the data were polytomous rather than dichotomous. Thus, the more lenient criterion of an outfit value >1.2 was applied to exclude items. According to this criterion, Items 23, 22, 18, 25, 13, 24, and 14 were removed from the item set across seven iterations. In the 13th iteration, the remaining 17 items showed outfit values between 0.90 and 1.16 and were thus considered psychometrically adequate (see Supplementary Table S4 for details).

Final item set
To compile a short final item set, we considered the content of the items as well as their position on the latent dimension immersion, that is, the item difficulty. The main aim of this last step of the analysis was to cover a preferably wide range of the latent continuum with 10 items, but without large accumulations of items in the immediate vicinity of one another. Therefore, quintiles (20% bands) of the item difficulty distribution were used. The authors discussed the items within each quintile, and two items from each quintile were selected for the final set.
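The quintile-based selection can be sketched as follows (the random draw stands in for the authors' content-based discussion, and the evenly spaced difficulties are placeholders, not the estimated values):

```python
import numpy as np

def pick_by_quintile(difficulties, items, per_quintile=2, seed=0):
    """Partition items into quintiles of the difficulty distribution and
    draw two items from each, approximating the selection step described
    above (the actual choice was made by discussion, not at random)."""
    rng = np.random.default_rng(seed)
    edges = np.quantile(difficulties, [0.2, 0.4, 0.6, 0.8])
    bins = np.digitize(difficulties, edges)          # 0..4 = quintile index
    chosen = []
    for q in range(5):
        members = [it for it, b in zip(items, bins) if b == q]
        k = min(per_quintile, len(members))
        chosen += list(rng.choice(members, size=k, replace=False))
    return chosen

# 17 surviving items with placeholder difficulties spanning -0.81 to 0.40
diffs = np.linspace(-0.81, 0.40, 17)
selected = pick_by_quintile(diffs, list(range(17)))
print(len(selected))  # 10
```

Sampling two items per quintile guarantees a spread of difficulties across the latent continuum instead of a cluster around the center.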

Analysis of the nal item set
The final item set showed excellent internal consistency (Cronbach's α = 0.967, SD = 0.903) for the adjusted data set. The quality of this index of internal consistency is comparable to the quality criteria of an intelligence test [53]. A confirmatory factor analysis (CFA) of the adjusted data, with the 10 final items as indicators of a single factor, resulted in fit measures indicating good or at least adequate fit (model fit: CFI = 0.978, TLI = 0.972, SRMR = 0.0163; see Supplementary Table S5 and Table S6 for details) [54].
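Cronbach's alpha can be computed directly from a persons × items score matrix. A minimal sketch with simulated, highly consistent responses (not the study data):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a persons x items score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Simulated data: 200 persons, 10 items on a 1-4 scale; responses driven
# almost entirely by a common trait, so alpha is close to 1.
rng = np.random.default_rng(7)
trait = rng.normal(2.5, 0.6, size=200)
data = np.clip(trait[:, None] + rng.normal(0, 0.2, (200, 10)), 1, 4)
alpha = cronbach_alpha(data)
print(alpha > 0.9)  # True
```

High inter-item correlations inflate the variance of the sum score relative to the summed item variances, which is what drives alpha toward 1.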
To check whether the outlier-adjusted data adequately fit the specified Rasch model, we considered the standardized residuals [55]. A reasonable fit is indicated when the mean of the standardized residuals is close to 0 [33,55] and their standard deviation is near 1 [33], which was the case (M = -8.15 × 10^-4, SD = 1.01). Furthermore, about 5% or less of the absolute standardized residuals should have values ≥ 2, and about 1% or less should have values ≥ 3 [33,40], which was also the case, with 4.5% being ≥ 2 and 0.9% being ≥ 3.
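These residual checks can be sketched as follows (simulated standard-normal residuals stand in for the actual model residuals; 22,800 matches 190 persons × 10 items × 12 stimuli):

```python
import numpy as np

def residual_diagnostics(z: np.ndarray) -> dict:
    """Rasch model-fit checks on standardized residuals: mean ~ 0,
    SD ~ 1, <= ~5% of |z| >= 2, and <= ~1% of |z| >= 3."""
    az = np.abs(z)
    return {
        "mean": float(z.mean()),
        "sd": float(z.std(ddof=1)),
        "pct_ge2": float((az >= 2).mean() * 100),
        "pct_ge3": float((az >= 3).mean() * 100),
    }

z = np.random.default_rng(0).standard_normal(22_800)
d = residual_diagnostics(z)
print(d)
```

Well-behaved residuals resemble a standard normal distribution, for which roughly 4.6% of absolute values exceed 2 and roughly 0.3% exceed 3, matching the reported 4.5% and 0.9%.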
The characteristics of the model resulting from the iterative MFRM analysis can be summarized in five points: First, the Rasch measures explained 62.69% of the variance. Second, as shown in Table 3 and Table 4, the outfit values of the items ranged from 0.91 to 1.17 and were thus in the targeted range. The positions of the items, that is, the item difficulties, were almost identical to those from the previous analysis (see Table 3 and Supplementary Table S4), covering a range from -0.81 to 0.40. Third, Figure 4 shows the resulting Wright map [56] with the facets of participant, piece, version, item, and the (4-point) response scale. As expected, the 3D format was localized slightly higher (0.26 logits) on the latent continuum of immersion than the stereo format (0.14 logits), which in turn was more clearly separated from the mono format (-0.40 logits; see Supplementary Table S7 for the detailed measurement report of this facet). This means that a 3D audio version was more likely to be rated higher on the immersion scale than the same sound example in stereo or mono format. This finding supports the assumption that 3D audio formats are indeed more likely to trigger an increased experience of immersion. Fourth, at the piece/song level, a comparison of ratings showed only small differences with regard to localization on the latent dimension (see Supplementary Table S8 for the detailed measurement report of this facet). Therefore, it can be concluded that the experience of immersion was independent of song genre. Fifth, as shown in the category probability curves (Supplementary Figure S3), the response categories of the rating scale (from 1 to 4) were in the correct order.
The Rasch-Andrich thresholds, which represent the transition points at which adjacent response categories are equally likely to be observed, were each separated by 2 logits, so that no collapsing of categories was necessary [40] (see Supplementary Table S9 for details on the response scale category statistics).

Note. N = 190. Total Score = observed raw score; Observed Average = observed raw score divided by the number of observations (2,280); Fair (M) Average = Rasch-measure-to-raw-score conversion, producing an average rating for the item that was standardized so that it was fair; Measure = item difficulty in logits; Model SE = model standard error; MnSq = mean-square; ZStd = Z-standardized fit statistic; PtMea = point-measure correlation (correlation between the item's observations and the measures modeled to generate them); PtExp = expected value of the point-measure correlation; SD = standard deviation of the sample (excerpt from Facets output).
To check the unidimensionality of the 10-item set, we used a principal component analysis of standardized residuals (PCAR) based on the outlier-adjusted data. This revealed contrasts (the principal components) with very similar eigenvalues smaller than 1.6, such that each component had a strength of less than two items (see Supplementary Table S10) [57]. Moreover, the Rasch measures of the items and persons each explained more than two and a half times as much variance as any one of the contrasts. Another indicator of unidimensionality was the high correlation of person measures obtained from clusters of items formed according to their loadings on the components of the PCAR (see Supplementary Table S11 and Table S12).

Note. For application purposes when using the IAQI, a 4-point rating scale with labeled extremes (1 = "strongly disagree" ["Trifft ganz und gar nicht zu"], 4 = "strongly agree" ["Trifft voll und ganz zu"]) must be used. For additional statistical details of the items see Table 3.
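The core of the PCAR check can be sketched in a few lines (a rough illustration with simulated residuals, not the study's data): one extracts the eigenvalues of the item correlation matrix of the standardized Rasch residuals and checks that the largest contrast has the strength of fewer than two items.

```python
import numpy as np

def pcar_contrast_eigenvalues(std_residuals):
    """Eigenvalues of the item-by-item correlation matrix of
    standardized Rasch residuals (persons x items). In a PCAR check,
    contrasts with eigenvalues clearly below 2 ('less than two items'
    of strength) support unidimensionality."""
    corr = np.corrcoef(std_residuals, rowvar=False)  # items x items
    return np.linalg.eigvalsh(corr)[::-1]            # descending order

# Hypothetical residual matrix: 190 persons x 10 items of pure noise,
# i.e., what residuals should look like if the Rasch model holds
rng = np.random.default_rng(0)
resid = rng.standard_normal((190, 10))
print(np.round(pcar_contrast_eigenvalues(resid)[:3], 2))
```

For a correlation matrix the eigenvalues sum to the number of items, so an eigenvalue of 1.6 out of 10 corresponds to the "less than two items" criterion cited above.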

Application of the IAQI
For practical application of the IAQI inventory, a scoring procedure is needed that expresses the individual answers to the items as one overall value. The scale of the inventory allows response values from 1 to 4. Taking the mean of the answers to all 10 items yields a possible overall score from 1 to 4 in steps of 0.1. To check the admissibility of this scoring procedure in our study, a one-tailed Pearson correlation between the averaged IAQI sum score across all stimuli and the person characteristics (logits) was calculated. A high correlation between the two measures of r(190) = .878, 95% CI [.847, 1.0] was observed. The scatterplot shows a slightly S-shaped arrangement of the data points, typical for measures obtained with IRT methods (for details see Supplementary Figure S4). A simple score calculation by averaging the individual response values of the 10 items, without complex individual weighting of items, was therefore considered permissible.
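The scoring rule described above is a plain mean of the ten responses; a minimal sketch (the response values are made up for illustration):

```python
from statistics import mean

def iaqi_score(responses):
    """Overall IAQI score: the mean of the 10 item responses
    (each on the 4-point scale), yielding values from 1.0 to 4.0
    in steps of 0.1."""
    if len(responses) != 10:
        raise ValueError("IAQI requires responses to all 10 items")
    if any(not 1 <= r <= 4 for r in responses):
        raise ValueError("responses must be on the 4-point scale (1-4)")
    return round(mean(responses), 1)

print(iaqi_score([3, 4, 2, 3, 3, 4, 3, 2, 4, 3]))  # → 3.1
```

No item weighting is applied, in line with the result above that unweighted averaging is permissible.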

Discussion
We successfully developed the Immersive Audio Quality Inventory (IAQI) for the measurement of immersive music experience with high psychometric quality. The manageable number of ten items allows for efficient application in multiple research fields in which audio content plays an important role, such as research in the entertainment industry or virtual reality. Possible limitations of our findings should be considered and might have resulted from the use of binaural headphone mixes as 3D stimuli (instead of loudspeaker playback). In the current state of our research, we cannot rule out that presentation by headphones might underestimate the "true" impact of 3D audio on immersion.
However, the question of the magnitude of the effect size will be subject to forthcoming research. Another possible influencing factor on the strength of the 3D effect could result from a mismatch of head-related transfer functions (HRTFs). The HRTFs used in the stimuli are based on average HRTFs of a large sample of listeners and thus do not match the individual HRTF of any given participant. This can result in suboptimal localization of phantom sound sources, and poor localization could attenuate the experience of immersion. Furthermore, even a matching HRTF cannot preclude an inappropriate headphone-to-ear transfer function (HpTF), which also negatively affects localization. The HpTF is defined as the electroacoustic transfer function of a headphone, measured at the eardrum [58]. Differences occur due to interindividual differences in the physiognomy of the pinna. Another uncertainty in the measurement of immersion experiences may result from differences in bass perception: while strong bass can be a pronounced bodily sensation in loudspeaker reproduction, this effect is largely absent when the listener uses headphones.
However, the binaural approach was pragmatic, as the large number of participants required would have been unrealistic for a laboratory study. In a future laboratory study, the authors will further evaluate the Immersive Audio Quality Inventory by using anchor stimuli from the online study in a loudspeaker setup. This will allow a direct comparison between binaural 3D audio for headphones and for loudspeakers with the same audio material.
We are also aware that the binaural 3D realizations we used are not the only possible ones: current state-of-the-art production tools for 3D audio (e.g., dearVR MUSIC, Dolby Atmos Renderer) allow for a number of degrees of freedom in the adjustment of output parameters such as HRTF types and spatial settings. Based on multiple evaluations of the output, we tried to identify the best possible examples of the binaural approach. Although these sources of variation should be considered as sources of uncertainty in measurement, it seems unlikely that such intervening variables would influence the main effect of differences in immersion experience between the three audio formats mono, stereo, and 3D audio.
Finally, the psychometric quality of the identified unidimensional IAQI scale should be considered. Concerning the question of validity, we first refer to content validity: as the majority of items were derived from previous research (see Table 1), the items used for the construction of the initial IAQI item list were the result of multiple selection and evaluation processes in prior research in the field of virtual reality. Thus, it seems reasonable to assume that the item content reflects the definition of the target construct immersion and is the result of careful selection by expert judges [59][60][61].
Additionally, the Rasch model itself provides evidence of construct validity: the two major threats to construct validity are construct irrelevance and construct underrepresentation [59,60], which are indicated by misfitting items and by large gaps in the coverage of the latent dimension by the items, respectively [62]. Within the iterative MFRM analysis process, we discarded misfitting items and selected ten items from the remaining ones that were located optimally on the latent dimension to cover a wide range, which further supports construct validity. The high but not perfect correlation of the IAQI score with the 3D impression measured by item C2 (Spearman's ρ(2,280) = .718, p < .001; see Supplementary Figure S5) can be regarded as first evidence of convergent validity. However, this finding should be interpreted with care, as both variables were measured using the same method [59]. Future research will have to consider additional forms of convergent and discriminant validity.
The last criterion of psychometric quality is the reliability of the scale. First, we can refer to the aspect of internal consistency.

[Figure: Flowchart of the data filtering process for the online study]

[Figure: Facet model for the MFRM analysis. Note. F1 to F8 represent the 5 facets of the model and DF2 to DF7 the 3 dummy facets (only considered for interaction effects but not for main effects).]

Supplementary Files
This is a list of supplementary files associated with this preprint.