Preliminary Experiments
Participants
Eight participants (six men and two women; college students aged 19–21 years) enrolled in the study. The participants were divided into four pairs, and each member of a pair was referred to as a collaborator.
Participation-Agreement Procedures
Informed consent was obtained at the laboratory on the day of the experiment. The participants received a document outlining the experiment and data-handling procedures, which the experimenter explained. All eight participants agreed to participate in the study. The experimental data were recorded under anonymized ID numbers so that they could not be linked to the participants’ names.
Devices
The VR systems were established in two separate rooms. Each system comprised a VIVE Pro Eye HMD, two controllers (VIVE Controller 2018), two base stations (SteamVR Base Station 2.0), and a computer. The VR environment was created using Unity (2021.3.1f1) in a server-client network using “Netcode for GameObjects.” In this environment, paired participants entered the same virtual space and interacted via physical actions. No audio communication was available, and the VR environment featured two avatars, buttons, a display, and a mirror (Fig. 1). The avatars moved based on six-coordinate data (three positions and three rotations) obtained from the HMD and two controllers. These avatars were boxy and lacked personality traits, and their movements were driven using the “Final IK” asset. Red- and green-labeled reaction buttons were placed in front of the avatars in the VR space. The RTs were acquired via collision detection when an avatar touched a button.
The task comprised four phases (Phases 1–4): countdown, fixation-cross presentation, target presentation, and blank, respectively. Phases 1 and 3 are also referred to as the countdown and motion phases, respectively.
Phase 1: A 3-s countdown display.
Phase 2: Presentation of a black fixation cross, “+”, at the center of the display for 1 s.
Phase 3: Presentation of targets (red or green) on either the left or right side of the display until a response is obtained.
Phase 4: A blank interval of 0.5 s before the next countdown begins.
The participants were instructed to touch the button corresponding to the target color, regardless of its location. Each session comprised 16 or 32 consecutive trials, with the target color and position randomized across trials.
Procedure
The participants were allowed to select either the client or host experimental booth. The terms “host” and “client” were used because paired data were transmitted as streamed data from the client to the host PC. Following the instructions of the experimenter stationed at each booth, the participants put on the HMDs and operated the controllers with both hands. Before each session, the participants were briefed on the stimulus colors for which they were responsible. Before commencing the joint task, the participants were instructed to view their collaborators directly and then confirm their avatars in the mirror set in the VR space. The host stood on the right, whereas the client stood on the left. The host operated the buttons in the VR space using the left hand, whereas the client used the right hand. Thus, the right hand was not used on the host side, and the left hand was not used on the client side. The task involved pressing the button labeled with the corresponding color name when the assigned color appeared. The participants were instructed to halt if they felt uncomfortable, to lift their HMDs at the end of each session, and to take breaks as needed.
Sessions
The participants entered the space individually and completed eight practice trials of the Go/No-Go task. During the practice session, feedback indicating a correct answer was shown when the correct button was touched; an incorrect answer was indicated when another button was touched or when a set time elapsed without a touch being detected. If a participant failed all eight trials, the practice session was repeated. After the practice session was completed, the following sessions were conducted.
Session 1: Go/No-Go task—individual sessions for the assigned target in 32 trials.
Session 2: Joint Simon task (host: green; client: red)—connected in the VR space with a human collaborator, each participant touched the assigned color in 32 trials.
Session 3: Joint Simon task (host: red; client: green) in 16 trials.
Session 4: Conformity task—whichever target appeared, both participants touched the corresponding buttons simultaneously (in the same breath) in 16 trials.
Session 5: Competition task—whichever target appeared, each participant attempted to touch the corresponding button before the opponent in 16 trials.
Session 2 involved the procedure shown in the upper section of Fig. 1. The target colors in Session 3 were swapped to minimize the learning effects. Sessions 4 and 5 involved the procedure shown in the lower section of Fig. 1.
Sensor data from Sessions 4 (conformity) and 5 (competition) were used for machine learning and testing, respectively. In the conformity-task session, the participants were instructed as follows: “Whichever target appears, touch the correct button in the same breath as your companion.” In the competition-task session, the participants were instructed as follows: “Whichever target appears, touch the correct button before your companion.”
Data Processing
In the preliminary experiment, we investigated the features necessary for distinguishing between interpersonal behaviors in VR environments. To identify subtle differences in subconscious movements, we used the sensor data during Phase 1, i.e., the time at which the participants were staring at the countdown, as shown in Fig. 1.
The sensing data comprised the gaze angle, eye position, left and right pupil sizes, head position, head rotation, and left- and right-controller positions and rotations, with three-axis (XYZ) data for each. The transmission latency from the client to the host was approximately 0.01 s. Signals were sampled at a variable rate (80 Hz on average); after missing values were removed, the client and host data were linked at intervals of approximately 0.02 s and then used for machine learning. The features used for machine learning were the distance, obtained as the root sum of squares of the XYZ (Euler-angle) values of the position and gyrosensor at each sampling point; the velocity, obtained from the time difference of the distance; and the acceleration, obtained from the time difference of the velocity. After deleting samples with missing values, we used the Python sklearn RandomForestClassifier (n_estimators = 250, random_state = 42) as the random-forest model. A set of decision trees was constructed from randomly sampled subsets of the training data, and predictions based on random subsets of the features were aggregated to obtain the final prediction. Because the classification model using velocity and acceleration data yielded low accuracy in the machine-learning process, whereas the model using distance features yielded relatively high accuracy, we decided to use only the distance features, except for the triaxial gaze data, which were retained because they are considered essential for synchronization. Data from three of the eight participants were used as training data to measure the accuracy during cross-validation, as the data were movement data for each pair.
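To make the feature pipeline concrete, the following Python sketch illustrates the distance, velocity, and acceleration features and the random-forest settings described above. The column names (e.g., hmd_pos_x), the fixed 0.02-s sampling interval, and the use of np.gradient for the time differences are illustrative assumptions rather than the authors’ exact implementation.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def distance_feature(df, prefix):
    # Root sum of squares of the XYZ components at each sampling point.
    xyz = df[[f"{prefix}_x", f"{prefix}_y", f"{prefix}_z"]].to_numpy()
    return np.sqrt((xyz ** 2).sum(axis=1))

def kinematic_features(df, prefix, dt=0.02):
    # Distance, plus velocity and acceleration from successive time differences
    # (np.gradient is used here as a simple stand-in for the time differencing).
    dist = distance_feature(df, prefix)
    vel = np.gradient(dist, dt)
    acc = np.gradient(vel, dt)
    return pd.DataFrame({f"{prefix}_dist": dist,
                         f"{prefix}_vel": vel,
                         f"{prefix}_acc": acc})

# Random-forest settings reported in the text; X_train and y_train are assumed to hold
# the distance features and the pair-activity labels.
clf = RandomForestClassifier(n_estimators=250, random_state=42)
# clf.fit(X_train, y_train)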
Main Experiment
Participants
The participants of this experiment were recruited through a university website, and 20 were assigned to each experimental day. However, owing to the absence of one participant, the final number of participants was 18 (seven men, nine women, and two of other genders). The average age of the participants was 19.83 years, and informed consent was obtained in advance via a web-based questionnaire. This procedure, which differed from that of the preliminary experiment, was designed to avoid any coercion to consent arising from face-to-face situations in the laboratory. After the experiment, the participants completed a questionnaire survey and an interview, after which they received an honorarium of approximately $10 (1,500 yen).
The device and experimental procedures were identical to those used in the preliminary experiments. Two types of avatars, i.e., a box and a human, were designed for other research purposes22.
Sessions
Session 1: Go/No-Go task—individual sessions for the assigned target; 32 trials with box avatars.
Session 2: Joint Simon task (host: green; client: red)—connected in the VR space with a human collaborator, each participant touched the relevant color; 32 trials with a box avatar.
Session 3: Cooperation task—32 trials of the joint Simon task (host: red; client: green) in a human avatar.
Session 4: Conformity task—16 trials in human avatars.
Session 5: Competition task—16 trials in human avatars.
Session 6: Bot-condition joint Simon task (host: green; client: red)—32 trials in a human avatar.
Session 7: Bot-condition joint Simon task (host: red; client: green)—16 trials in a box avatar.
Sessions 3, 4, and 5 were training sessions for determining cooperation, conformity, and competition activities in machine learning; they were conducted using human avatars, and the participants were responsible for the same color targets throughout the three sessions.
Sessions 2 and 6 were test sessions for comparing activities under the condition where the paired partner was a human or bot. In the test sessions, the participants were responsible for color targets different from those in the training sessions.
Sessions 2 and 6 also differed in terms of avatar appearance (box vs. human).
A previous study confirmed that an avatar’s appearance does not affect the JSE or bot cognition22. To confirm that the appearance of the bot avatar had no effect, we compared the bot sessions with different avatars (Sessions 6 and 7), with the avatar condition and the classification probability as repeated factors. The ANOVA results indicated that the main effect of the avatar condition and the interaction between the avatar condition and classification were not significant. Moreover, no significant interaction effects involving the bot-avatar appearance were found.
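As an illustration of this avatar-appearance check, the sketch below sets up a repeated-measures ANOVA with the avatar condition and the classification as within-participant factors. It uses Python’s statsmodels rather than the software actually used in the study, and the data-frame layout, column names, and placeholder values are assumptions.

import itertools
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Placeholder data: one (angular-transformed) probability per participant,
# avatar appearance, and classification category.
rng = np.random.default_rng(0)
rows = [{"participant": p, "avatar": a, "classification": c,
         "probability": rng.uniform(16.54, 78.69)}
        for p, a, c in itertools.product(range(18), ["box", "human"],
                                         ["cooperation", "conformity", "competition"])]
df = pd.DataFrame(rows)

# Within-participant factors: avatar condition (Sessions 6 vs. 7) and classification.
aov = AnovaRM(data=df, depvar="probability", subject="participant",
              within=["avatar", "classification"]).fit()
print(aov)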
Dependent variables
Correct response rate. Correct responses were counted as touching the correct color target in trials requiring a response and not touching any button in trials requiring no response. The correct response rate was then obtained by dividing the number of correct responses by the number of trials.
JSE. The JSE was calculated as the mean RT delay (RTs for incompatible targets − RTs for compatible targets) during the joint Simon task (Sessions 2, 3, 6, and 7) minus that of the Go/No-Go task (Session 1). Correct-response RTs more than two standard deviations from the mean RT were excluded as outliers. Additionally, pairwise data from participants whose RTs could not be measured because of equipment failure were excluded.
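A minimal sketch of this JSE computation is given below. The trial-table columns (rt, compatible) and the helper names are hypothetical; the 2-SD outlier rule and the joint-minus-individual subtraction follow the description above.

import pandas as pd

def trim_outliers(rts):
    # Exclude correct-response RTs more than two standard deviations from the mean.
    m, s = rts.mean(), rts.std()
    return rts[(rts >= m - 2 * s) & (rts <= m + 2 * s)]

def simon_effect(trials):
    # Mean RT delay: RTs for incompatible targets minus RTs for compatible targets.
    incompatible = trim_outliers(trials.loc[~trials["compatible"], "rt"])
    compatible = trim_outliers(trials.loc[trials["compatible"], "rt"])
    return incompatible.mean() - compatible.mean()

# JSE = Simon effect in a joint session (e.g., Session 2) minus that in the
# individual Go/No-Go session (Session 1).
# jse = simon_effect(joint_trials) - simon_effect(gonogo_trials)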
Bot cognition. After the experiment, a questionnaire was administered to determine whether the participants were aware of the bot, followed by face-to-face interviews with the experimenter to confirm this awareness. A participant who perceived the human collaborator as a bot was regarded as unable to discriminate between humans and bots. In the data analysis, binary values of 1 and 0 indicated awareness and unawareness of the bot, respectively.
Sensor data. Following the procedures of the preliminary experiment, the distance was obtained as the root sum of squares of the XYZ (Euler-angle) values of the position and gyrosensor at each sampling point, the velocity as the time difference of the distance, and the acceleration as the time difference of the velocity. Because the accuracy of the classification model using velocity and acceleration data was low during the machine-learning process, we adopted a model that used only distance features. The resulting features were 12 variables: the HMD positions and rotations and the left- and right-controller positions and rotations. The eye-gaze and pupil-size features used in the preliminary experiment were not used because no corresponding sensing data existed for the bot. Under the bot condition, only trace data from Phase 3 were available; thus, data from Phase 3 were used to formulate the classification model. Samples with missing values on the host or client side were deleted. After removing missing values, the number of observations obtained from Sessions 3, 4, and 5 was 16,759 for training and testing the machine-learning model. Sensor data from Sessions 2, 6, and 7 were prepared as files on the host and client sides to compare the human and bot conditions; the total number of observations was 124,193.
We used the MATLAB Classification Learner app for the machine-learning model to compare decision trees, random forests, support vector machines, and neural networks. The results showed that even a single decision tree provided a correct answer rate exceeding 90%, which was comparable to the performance of the other methods. Thus, we adopted a decision-tree model to identify the most important features. The Gini diversity index was used as the splitting criterion.
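The decision-tree model can be approximated as follows. The authors used MATLAB’s Classification Learner app, so this scikit-learn version with Gini impurity is only an assumed equivalent; the synthetic placeholder data stand in for the 12 distance features and the activity labels.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder data: 12 distance features and three activity labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 12))
y = rng.choice(["cooperation", "conformity", "competition"], size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
tree = DecisionTreeClassifier(criterion="gini", random_state=42)  # Gini diversity index
tree.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, tree.predict(X_test)))
print("feature importances:", tree.feature_importances_)  # identifies the key features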
Pair Activity Probability. Using the predict function of the trained model in MATLAB, we applied the classification model to Sessions 2, 6, and 7 as test sessions using the 12 feature variables. The resulting classification probabilities for each observation served as the activity indices of cooperation, conformity, and competition.
These probability values were angularly transformed as arcsin(√p) × 180/π. A probability of 0 was adjusted to arcsin(√0.0833) × 180/π, and a probability of 1 to arcsin(√(1 − 0.0833)) × 180/π; these corrections follow the usual 1/(4N) adjustment for angular transformations. Thus, the minimum and maximum possible values were 16.54 and 78.69, respectively. These values were averaged for each trial and used as the cooperation, conformity, and competition indices of the pair activity. Because the human condition comprised data that switched from the client to the host for comparison with the bot condition, we analyzed the condition effects via multilevel analysis. In the linear mixed model, paired groups were specified as a random-effect factor after centering, in which the mean value of each paired group was subtracted from each indicator.
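The angular transformation and its floor/ceiling adjustment can be written as follows; the function name and the vectorized handling of whole arrays are illustrative assumptions.

import numpy as np

def angular_transform(p, adj=0.0833):
    # arcsin(sqrt(p)) x 180/pi, with the 1/(4N)-style adjustment applied to
    # probabilities of exactly 0 or 1.
    p = np.asarray(p, dtype=float)
    p = np.where(p == 0.0, adj, p)
    p = np.where(p == 1.0, 1.0 - adj, p)
    return np.degrees(np.arcsin(np.sqrt(p)))

# Example: angular_transform([0.0, 0.5, 1.0])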
Synchrony Index. As in the preliminary experiment, the most important feature for classifying paired activities was the rotation of the unused hand (the host-side right-hand rotation). Therefore, the cross-correlation (XCORR) between the sensor data for the host-side right-hand rotation and the client-side left-hand rotation was calculated for each trial using MATLAB’s xcorr function and used as the interpair synchrony index. After the sensor data were normalized for each trial, the maximum value, obtained at lag 0, was used as the XCORR index.
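A possible NumPy analog of this trial-wise synchrony computation is sketched below; MATLAB’s xcorr was used in the actual analysis, so the normalization details and the function name here are assumptions.

import numpy as np

def xcorr_lag0(host_rot, client_rot):
    # Z-normalize each rotation trace within the trial, then take the
    # cross-correlation at lag 0 (equivalent to their correlation coefficient).
    a = (host_rot - host_rot.mean()) / host_rot.std()
    b = (client_rot - client_rot.mean()) / client_rot.std()
    return float(np.dot(a, b) / len(a))

# Example per trial:
# sync = xcorr_lag0(host_right_hand_rotation, client_left_hand_rotation)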