Twenty paid volunteers participated in the experiment (17 men, 3 women; all aged 19–24 years). The sample size was determined before the experiment on the basis of our previous experience. With alpha = 0.05 and power = 0.8, this sample size corresponds to an effect size f of 0.195, as calculated with G*Power 3.1 [29,30]. Participants were undergraduate and graduate students of Toyohashi University of Technology. All had normal or corrected-to-normal vision and were naïve to the purpose of the study.
In all experiments, participants provided written informed consent before taking part. All experiments were approved by the Ethical Committee for Human-Subject Research of Toyohashi University of Technology and were performed in accordance with the committee’s guidelines and regulations.
The visual stimuli were generated and controlled by a computer using Unity Pro and presented on a head-mounted display (HTC Vive Pro: 1,440 × 1,600 pixels, 90 Hz refresh). The participants responded in the task by moving a joystick.
Stimuli and conditions
In the virtual space, a table stood in the center of the room, and either an empty chair or an avatar sitting in a chair was presented in one of three positions: on the left side, the right side, or the far side (front) of the table from the participant’s perspective (Fig. 1A and 1B). Then, a broken circle was presented on the table. The gap in the broken circle pointed in one of eight directions (0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°), like a Landolt ring. We created two conditions for the interval between the presentation of the chair or avatar and the presentation of the circle (short: 200 ms; long: 1,000 ms). There were 96 combinations of trials (2 with/without the avatar, 2 short/long intervals, 3 positions of the avatar and chair, and 8 directions of the gap in the circle). The data were collapsed across the gap directions in the analysis.
Each trial began with a blank black screen for 1,000 ms, which was followed by a red fixation dot. Then, 1,000 ms later, the fixation dot disappeared and the room with the table, chair, and/or avatar appeared. After 200 ms (short interval) or 1,000 ms (long interval), the broken circle was presented on the table. The participants were asked to judge the direction of the gap in the circle from the avatar’s perspective or the empty chair’s position and to respond with the joystick as accurately and quickly as possible (Fig. 2). If the gap direction was 135° counterclockwise from the participant’s point of view (with the 6 o’clock position defined as 0°), as in Figure 2, the participant had to report the direction as seen from the avatar’s perspective; that is, the joystick should have been moved to the 45° counterclockwise position. Participants received no feedback. The next trial began immediately after the joystick response.
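The perspective transformation in this example can be written as a simple rotation of the angular reference frame. The following is a minimal sketch, not the software used in the experiment. It assumes that each seat is described by its direction from the table center in the participant's frame (0° = 6 o'clock, counterclockwise positive), and that the avatar in Figure 2 sits on the right; the text does not state the seat, but this assumption reproduces the 135° to 45° example. The names `SEAT_ANGLE_DEG` and `gap_from_seat` are hypothetical.

```python
# Direction of each seat from the table center, in the participant's frame
# (0 deg = 6 o'clock, i.e. toward the participant; counterclockwise positive).
# These values are assumptions made for illustration.
SEAT_ANGLE_DEG = {"right": 90.0, "front": 180.0, "left": 270.0}


def gap_from_seat(gap_deg: float, seat: str) -> float:
    """Re-express a gap direction in the frame of an observer sitting in `seat`.

    In the observer's frame, 0 deg points toward the observer, so the
    participant-frame angle is rotated back by the seat's own direction.
    """
    return (gap_deg - SEAT_ANGLE_DEG[seat]) % 360.0


# Worked example from the text: 135 deg for the participant, seat on the right.
print(gap_from_seat(135.0, "right"))  # 45.0
```

Under the same assumptions, a gap pointing straight at the observer's seat always maps to 0° in that observer's frame.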
Before the practice trials, participants first performed a direction judgment task from their own perspective. Then, 12 practice trials (2 with/without the avatar, 2 short/long intervals, and 3 positions of the avatar and chair) were presented, in which the participants judged the direction of the gap in the circle from the avatar’s perspective or the empty chair’s position.
In each test session, all 96 combinations of the conditions were repeated twice in a random order, for a total of 192 trials. Each participant completed four test sessions, for a total of 768 test trials. It took approximately 90 minutes for each participant to finish this experiment, including the time required to provide the experimental instructions, to conduct the practice trials and the test sessions, and to take breaks between sessions.
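The trial counts follow directly from the factorial design and can be checked with a short sketch. It is illustrative only: the condition labels, the function name `build_session`, and the choice to shuffle all 192 trials of a session together (rather than within each repetition of 96) are assumptions, since the text specifies only "a random order".

```python
import itertools
import random

AVATAR = ("avatar", "empty_chair")   # with / without the avatar
INTERVAL_MS = (200, 1000)            # short / long interval
POSITION = ("left", "front", "right")
GAP_DEG = tuple(range(0, 360, 45))   # 0, 45, ..., 315


def build_session(rng: random.Random, repeats: int = 2) -> list[tuple]:
    """Return one session: every combination `repeats` times, shuffled."""
    combos = list(itertools.product(AVATAR, INTERVAL_MS, POSITION, GAP_DEG))
    trials = combos * repeats
    rng.shuffle(trials)  # assumption: one shuffle over the whole session
    return trials


rng = random.Random(0)  # fixed seed only so that the sketch is reproducible
sessions = [build_session(rng) for _ in range(4)]
print(len(sessions[0]), sum(len(s) for s in sessions))  # 192 768
```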
Individual mean RTs and error rates were calculated for each of the 12 conditions (i.e., with/without the avatar, short/long interval, and left/front/right position of the avatar/chair). For the analysis, we treated trials in which the participant moved the joystick to within ±22.5° of the correct angle as correct responses. RTs were defined as the time from the onset of the broken circle to the time when the joystick reached the end position (i.e., 2.5 cm from the center of the joystick). Trials for which the RT was shorter than 150 ms (0%) and trials for which the RT was more than three standard deviations above the mean RT of each condition for each participant (1.5%) were excluded from the analysis as outliers. Trials in which the participants made an error were also excluded from the RT analysis (approximately 6.8% of the trials). The RTs and error rates were submitted to a 2 × 2 × 3 repeated-measures analysis of variance (ANOVA) with the avatar existence, interval, and position of the avatar and chair as the within-subject factors. If there was a lack of sphericity, the reported values were adjusted using the Greenhouse-Geisser correction. When performing the multiple comparisons after the ANOVAs, we reported the p-values that were corrected using Shaffer's modified sequentially rejective Bonferroni procedure.
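The scoring and exclusion rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the function names are hypothetical, the ±22.5° boundary is treated as inclusive, "three standard deviations from the mean" is read as the upper cutoff mean + 3 SD, and the mean and SD are computed over all RTs passed in for one participant-by-condition cell.

```python
from statistics import mean, stdev


def angular_error(response_deg: float, target_deg: float) -> float:
    """Smallest absolute difference between two directions (0 to 180 deg)."""
    diff = abs(response_deg - target_deg) % 360.0
    return min(diff, 360.0 - diff)


def is_correct(response_deg: float, target_deg: float,
               tol_deg: float = 22.5) -> bool:
    """Correct if the joystick angle lies within the tolerance of the target.

    Assumption: the boundary itself counts as correct.
    """
    return angular_error(response_deg, target_deg) <= tol_deg


def keep_rts(rts_ms: list[float], floor_ms: float = 150.0,
             n_sd: float = 3.0) -> list[float]:
    """Drop RTs below `floor_ms` or above mean + n_sd * SD of the cell.

    Assumption: mean and SD are computed on the cell before any exclusion.
    """
    m, sd = mean(rts_ms), stdev(rts_ms)
    return [rt for rt in rts_ms if floor_ms <= rt <= m + n_sd * sd]


print(is_correct(350.0, 10.0))  # True: the directions differ by 20 deg
print(is_correct(45.0, 90.0))   # False: the directions differ by 45 deg
```

Because the eight gap directions are 45° apart, the ±22.5° windows tile the full circle, so every response falls into the window of exactly one direction (apart from exact boundary values).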
The ANOVA of RTs showed significant main effects of the existence of the avatar, F(1, 19) = 19.570, p < 0.001, ηp² = 0.507, the length of the interval, F(1, 19) = 73.376, p < 0.001, ηp² = 0.794, and the position of the avatar and chair, F(1.170, 22.222) = 7.864, p = 0.001, ηp² = 0.293. There were also significant interactions between the avatar existence and interval, F(1, 19) = 7.355, p = 0.014, ηp² = 0.279, and between position and interval, F(2, 38) = 12.068, p < 0.001, ηp² = 0.388. The avatar existence × position interaction, F(2, 38) = 3.138, p = 0.055, ηp² = 0.142, and the avatar existence × interval × position interaction, F(1.351, 25.670) = 0.383, p = 0.604, ηp² = 0.020, were not significant.
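The reported partial eta squared values can be cross-checked against the F statistics, because partial eta squared equals F × df_effect / (F × df_effect + df_error). The sketch below applies this standard identity to the main effects reported above; the function name `partial_eta_squared` is hypothetical.

```python
def partial_eta_squared(f_value: float, df_effect: float,
                        df_error: float) -> float:
    """Partial eta squared recovered from an F statistic and its df."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effect of avatar existence: F(1, 19) = 19.570
print(round(partial_eta_squared(19.570, 1, 19), 3))  # 0.507
# Main effect of interval: F(1, 19) = 73.376
print(round(partial_eta_squared(73.376, 1, 19), 3))  # 0.794
# Main effect of position with Greenhouse-Geisser df: F(1.170, 22.222) = 7.864
print(round(partial_eta_squared(7.864, 1.170, 22.222), 3))  # 0.293
```

The Greenhouse-Geisser correction scales both degrees of freedom by the same factor, so it leaves this ratio unchanged.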
RTs were significantly faster in the “with avatar” condition than in the “without avatar” condition, but only in the short interval condition (p < 0.001) (Fig. 3A). In the short interval condition, the RTs were slower in the front position than in the other two positions (ps < 0.01). In the long interval condition, the RTs were faster in the left position than in the other two positions (ps < 0.05). In all conditions, RTs were faster in the long interval condition than in the short interval condition.
The ANOVA of error rates revealed a significant main effect of the existence of the avatar, F(1, 19) = 7.335, p = 0.014, ηp² = 0.279 (Fig. 3B). The participants were more accurate when the avatar was presented than when it was not. The main effect of position was also significant, F(2, 38) = 11.696, p < 0.001, ηp² = 0.381. Participants responded more accurately when the avatar or chair was in the front and left positions than when it was in the right position. No other main effects or interactions were significant.
The humanoid avatar was associated with improved performance in identifying the orientation of a visual stimulus from an imagined position, but only in the short interval condition (200 ms). This suggests that when the avatar was present, the participant’s viewpoint moved quickly to the position of the avatar, whereas this process either did not occur or took longer in the empty-chair condition. In the long interval condition, participants may have had enough time to transform their viewpoint to an arbitrary position regardless of the presence of the avatar.
This facilitation effect of the humanoid avatar was broadly consistent with the findings of Ward et al. and Michelon and Zacks, who also showed that observers determine how a visual stimulus looks from a given position faster when a humanoid avatar occupies that position than when an inanimate object does [7, 15]. However, one may argue that the results were obtained because a humanoid avatar captures attention more easily than an empty chair. To examine this, in Experiment 2, we added a condition in which the humanoid avatar was sitting backwards. We hypothesized that if there is no facilitation effect on the RT in the backward avatar condition, then embodiment in the avatar is important for efficient perspective taking.