Study1 (Pilot)
Ethics approval, data, and materials: https://osf.io/z26ar/.
Method
Undergraduates judged handwritten truthful and deceptive alibi statements either on deception (control condition; using any cue they liked) or only on verifiability (heuristic condition).
Participants
51 undergraduates of the University of Amsterdam Psychology Department took part in Study1. We excluded seven participants who failed the attention check (see below) and five who were not native Dutch speakers. Of the 39 remaining participants (31 female, 8 male; M age = 19.38 years, SD = 1.68), n = 19 judged deception (any cue possible) and n = 20 judged only the single cue verifiability. Participants received course credit for taking part, and the most accurate participant received a 20-euro bonus.
Procedure
After providing online informed consent, participants were randomly assigned through Qualtrics to the deception judgement or the verifiability judgement condition. In both conditions, participants were asked to evaluate 16 alibi statements, presented one by one in a random order. In the verifiability judgement condition there was no mention of deception or lie detection.
Judging deception, participants were asked to evaluate “How truthful is this statement?”, on a scale from “totally deceitful” (-100) to “totally truthful” (+100), with the definition of truthfulness provided as “a truthful statement is a statement that is true, honest and adheres to the fact of the situation”. Judging verifiability, participants were asked to evaluate “How verifiable is this statement?”, on a scale from “totally unverifiable” (-100) to “totally verifiable” (+100), with the definition that “verifiable activities are activities that are recorded (e.g. a security camera), documented (e.g. payment with a debit card or using a smartphone) or an activity with an identifiable witness present”. This definition arose from the Verifiability Approach (10), but we simplified it to its essence.
An attention check was embedded among the statements. It looked like another alibi statement but instructed participants to ignore the provided statement and instead answer -47 on the scale. After rating the (real and bogus) statements, a manipulation check asked participants about the basis of their judgements (indicating up to 3 cues, from a list of 11, that they had relied on most; 3 of the listed cues referred to verifiability), a single item asked about their motivation to judge the statements accurately (from -100 to +100), and finally participants were asked to provide their age, gender, and mother tongue.
Materials
We used 64 alibi statements (32 truthful, 32 deceptive). To avoid item effects, we created 4 sets of 16 statements (each containing 8 truthful and 8 deceptive statements), and participants were randomly assigned to receive one of the sets. The statements were selected from 72 statements obtained in a previous mock crime study in which participants provided a handwritten statement on their whereabouts on campus in the preceding 15 minutes (1). Participants either truthfully described their activities, or they lied. The lying participants had just enacted the mock theft of an exam but pretended to have been on campus as a regular student. These statements were manually pseudonymized (i.e., all identifiable information, including names of persons, was changed to plausible alternatives). Content coding by trained coders (11) showed that the 32 truthful statements (M = 8.28; SD = 8.67) contained more verifiable details than the 32 deceptive statements (M = 3.47; SD = 4.65), d = 0.69 (95% CI: 0.18; 1.19) (see Supplementary Table 3: https://osf.io/v3kdw/). Below is the English translation of one example statement (all original Dutch statements can be found on https://osf.io/z26ar/):
‘I quietly walked down until the entrance of G/lab. I was in doubt about what to do (stood still for a moment). Then I walked into the corridor of G, saw a cleaner/guy with a cart and read something about using lockers at the UvA. Then I walked to the outside entrance and walked around (back of G) and looked at the kind of butterflies that are now there for the light festival. So then walked further around G. Went back inside (second floor lab) and looked for a moment at university pabo, there is a poster next to the door about participating in brain research for money. When I had read that I walked quietly to this research room.’
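The between-groups effect size for the content coding (d = 0.69) can be recomputed from the reported means and SDs alone. A minimal Python sketch (numpy assumed; the function name is ours, not from the paper):

```python
import numpy as np

def cohens_d_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Between-groups Cohen's d from summary statistics,
    using the pooled (equal-variance) standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m1 - m2) / np.sqrt(pooled_var)

# Verifiable details: 32 truthful vs 32 deceptive statements (values above)
d = cohens_d_from_summary(8.28, 8.67, 32, 3.47, 4.65, 32)
# d is approximately 0.69, matching the reported effect size
```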
Main Analyses (Not Preregistered)
The 2 (Judgement Method: Deception vs Verifiability, between-subjects) x 2 (Veracity: Truthful vs Deceptive, within-subjects) mixed ANOVA on the participants’ judgements showed the predicted interaction between Judgement Method and Veracity, F(1, 37) = 8.43, p = .006, η²p = .19 (Figure 1). To follow up on the interaction, we conducted paired-sample t tests contrasting judgements for truthful and deceptive statements within each judgement method. Lie-truth differences when judging deception were small and non-significant, t(18) = 0.45, p = 0.660, d = 0.10 (95% CI: -0.35; 0.55)[1], BF01 = 3.85[2]. In contrast, lie-truth differences when judging verifiability were significant and large, t(19) = 4.00, p < .001, d = 0.89 (95% CI: 0.36; 1.41), BF10 = 45.65.
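The follow-up contrasts above can be sketched as below; the arrays are toy numbers for illustration, not the study data (the real per-participant judgements are on the OSF page):

```python
import numpy as np
from scipy import stats

def lie_truth_contrast(truthful, deceptive):
    """Paired-sample t test plus a within-subjects Cohen's d,
    computed on the per-participant difference scores."""
    truthful = np.asarray(truthful, dtype=float)
    deceptive = np.asarray(deceptive, dtype=float)
    diff = truthful - deceptive
    t, p = stats.ttest_rel(truthful, deceptive)
    d = diff.mean() / diff.std(ddof=1)  # d of the paired differences
    return t, p, d

# Mean judgement per participant for truthful vs deceptive statements
t, p, d = lie_truth_contrast([35, 12, 48, 25], [20, 5, 39, 18])
```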
Additional analyses
Participants were motivated to provide an accurate judgement (Judging deception: M = 66.32; SD = 29.48; Judging verifiability: M = 66.90; SD = 32.93).
The manipulation check showed that participants made judgements in line with the instructions they received in their respective condition: Participants instructed to judge verifiability more often listed cues related to verifiability as the basis of their judgement (M = 1.75; SD = 0.64) than participants judging deception (M = 0.68; SD = 0.75), t(37) = 4.78, p < .001, d = 1.53, (95% CI: 0.81; 2.24), BF10 = 669.50.
Study2-3
Ethics approval, data, and materials: https://osf.io/z26ar/. Moving from the exploratory stages of research to confirmation, we preregistered our hypotheses, analysis plan, and predictions: https://osf.io/z26ar/.
Method
The procedure of Study2-3 followed that of Study1 with three main differences. First, we preregistered the hypotheses and statistical analyses. Second, we moved from locally recruited undergraduates to online crowdsourcing (Prolific.co participants with Dutch as their first language, paid £2.50 for participation, with the 5 best-performing participants receiving an additional £5.00). Third, we added a heuristic condition in which participants based their judgements on detailedness, using the following definition: ‘Degree to which the message includes details such as descriptions of people, places, actions, objects, events, and the timing of events; the degree to which the message seemed complete, concrete, striking, or rich in details’ (2).
There were a few other, minor changes to the Study1 procedure. During the initial instructions, we provided participants with a map of the campus. We also changed the manipulation check to an open box, asking participants to describe the cue they had relied on most. We added a single item about the experienced difficulty of the judgements (from -100 to +100), and – as an additional attention check – after completing all judgements, a surprise multiple-choice question asked about the core of the last statement (e.g., finding a book). Demographics were obtained from Prolific.
Participants
Study2. 142 participants took part in Study2. We excluded 34 participants who failed either of the two attention checks. The 108 remaining participants (39 female, 69 male; M age = 29.69 years, SD = 10.34; n = 30 judging deception, n = 39 judging verifiability, and n = 39 judging detailedness) were mostly Dutch (73%; Belgian: 24%; other: 3%).
Study3. Participants from Study2 could not take part in Study3. 303 participants took part in Study3. We excluded 73 participants who failed either of the two attention checks. Of the 230 remaining participants (107 female, 119 male, 4 missing; M age = 30.32 years, SD = 10.59; n = 77 judging deception, n = 89 judging verifiability, and n = 64 judging detailedness), 72% were Dutch (Belgian: 26%; other: 2%).
Main analyses (preregistered)
Study2. Lie-truth differences when judging deception were small, d = 0.39 (95% CI: 0.02; 0.76), and lacked evidential value, BF10 = 1.40, but just reached significance at the .05 threshold, t(29) = 2.14, p = 0.041 (see Supplementary Table 3: https://osf.io/v3kdw/). In contrast, lie-truth differences when judging verifiability were again significant and large, t(38) = 5.06, p < .001, d = 0.81 (95% CI: 0.44; 1.17), BF10 = 1854. This was also true when judging detailedness, t(38) = 6.61, p < .001, d = 1.06 (95% CI: 0.66; 1.45), BF10 = 173709. However, the 3 (Judgement Method: Deception vs Verifiability vs Detailedness) x 2 (Veracity: Truthful vs Deceptive) mixed ANOVA showed only a main effect of Veracity, F(1, 105) = 53.03, p < .001, η²p = .34, and not the predicted Judgement Method by Veracity interaction, F(2, 105) = 1.23, p = .297, η²p = .02. The design may, however, have been underpowered to detect the interaction. Of note, the lie-truth difference in the control condition happened to be larger than anticipated (d = 0.39), which we attribute to sampling error related to the modest sample size (3). We therefore ran the study again with more statistical power.
Study3. The 3 (Judgement Method: Deception vs Verifiability vs Detailedness) x 2 (Veracity: Truthful vs Deceptive) mixed ANOVA showed the predicted interaction effect, F(2, 227) = 31.84, p < .001, η²p = .22. Lie-truth differences when judging deception were small and non-significant, t(76) = 0.86, p = 0.394, d = 0.10 (95% CI: -0.13; 0.32), BF01 = 5.60 (see Supplementary Table 3: https://osf.io/v3kdw/). In contrast, lie-truth differences when judging verifiability were again significant and large, t(88) = 11.60, p < .001, d = 1.23 (95% CI: 0.95; 1.50), BF10 = 2.56 x 10^16. This was also true when judging detailedness, t(63) = 9.19, p < .001, d = 1.15 (95% CI: 0.83; 1.46), BF10 = 2.64 x 10^10.
Additional analyses
Because these analyses are highly similar for Study 2 and 3 (that used the same procedure), we aggregated the data of Study 2 and 3 (n = 338; results for each study separately can be found on https://osf.io/z26ar/).
Preregistered additional analyses.
As a manipulation check of the judgement instructions, we coded the open-box responses describing the cue that each participant relied on most. A condition-blind coder scored the responses as referring to verifiability, detailedness, or other cues. Agreement with a second condition-blind rater was moderate to high (Study2, all statements double coded: 84% agreement; Study3, one third of statements double coded: 69% agreement). A chi-square test on the association between Judgement Method (Judge Deception vs Judge Verifiability vs Judge Detailedness) and Reported Cue Use (Verifiability vs Detailedness vs Other) indicated that participants reported using the cue they were instructed to use, χ2(4) = 129.08, p < .001, Cramér’s V = 0.44. Participants judging verifiability most often mentioned a cue related to verifiability (89%; Detailedness: 6%; Other: 5%). Participants judging detailedness most often mentioned a cue related to detailedness (44%; Verifiability: 28%; Other: 28%). Participants judging deception most often mentioned other cues (38%; Verifiability: 24%; Detailedness: 37%).
Non-Preregistered additional analyses.
Participants were motivated to provide an accurate judgement (M = 76.99, SD = 23.89; Judging verifiability: M = 81.47; SD = 22.07; Judging deception: M = 75.49; SD = 22.19; Judging detailedness: M = 72.99; SD = 26.90). Participants found the task moderately difficult (M = 48.29, SD = 38.52; Judging deception: M = 58.75; SD = 36.43; Judging verifiability: M = 43.59; SD = 39.72; Judging detailedness: M = 43.27; SD = 37.27).
Robustness Analyses
Exclusion criteria were preregistered and served to ensure that participants paid attention to each statement, but our findings do not hinge on them. When no participants were excluded, the critical Judgement Method by Veracity interaction remained significant for the combined Study 2 and 3 data, F(2, 442) = 32.77, p < .001, η²p = .13, with large lie-truth differences when judging verifiability (d = 0.95) or detailedness (d = 1.16), but not deception (d = 0.06).
Each participant judged one of 4 series of 16 alibi statements. Splitting the data per stimulus set (bottom rows Supplementary Table 3: https://osf.io/v3kdw/) shows that the benefits of single-cue judgements do not hinge on a specific set of stimuli.
ROC Analyses
For each judgement method, we used the average statement judgement to predict statement veracity. The ROC curve plots sensitivity against 1 − specificity and provides a measure of diagnostic value across all possible cut-off points. The area under the ROC curve ranges from 0 to 1 (1 = perfect classification), with .50 denoting chance level. As shown in Table 2, classification accuracy was above chance for judgements relying on a single cue (either verifiability or richness in detail), but at chance level for deception judgements.
Using Youden’s J (4), we also identified the optimal cut-off point, weighting specificity and sensitivity equally. To avoid overfitting, we used independent validation, the strictest method: we used the data of Study2 to evaluate classification accuracy based on the optimal cut-off derived in Study3 (and vice versa). Accuracy was poor for deception judgements, and moderate for the single-cue judgements, see Table 2.
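The two quantities used here – the area under the ROC curve and the Youden’s J cut-off – can be sketched in a few lines of numpy (our own minimal implementation, not the analysis code behind Table 2; labels code truthful as 1 and deceptive as 0):

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via its Mann-Whitney interpretation:
    the probability that a random truthful statement (label 1)
    outscores a random deceptive one (label 0), ties counting half."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

def youden_cutoff(scores, labels):
    """Cut-off maximizing Youden's J = sensitivity + specificity - 1,
    i.e. balancing sensitivity and specificity equally."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    best_j, best_c = -np.inf, None
    for c in np.unique(scores):
        sens = (scores[labels == 1] >= c).mean()
        spec = (scores[labels == 0] < c).mean()
        if sens + spec - 1 > best_j:
            best_j, best_c = sens + spec - 1, c
    return best_c

# Toy data: average judgements, higher for truthful statements
scores = [-60, -20, 5, 10, 40, 80]
labels = [0, 0, 0, 1, 1, 1]
cut = youden_cutoff(scores, labels)
```

For the independent validation described above, the cut-off would be derived on one study’s data and then applied, unchanged, to the other study’s judgements.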
Study4
Ethics approval, data, and materials of Study4: https://osf.io/z26ar/. Hypotheses, analysis plan, and predictions were preregistered before the start of data collection: https://osf.io/z26ar/.
Method
Crowdsourced participants fluent in German judged interview transcripts (see Materials) either on deception or on richness in detail.
Participants
251 participants took part in Study4. We excluded 59 participants who failed both attention checks. The 192 remaining participants (92 females; M age = 27.59 years, SD = 9.11; n = 104 judging deception, n = 88 judging detailedness) were Polish (26%), German (13%) or had one of 27 other nationalities. Demographics were obtained from Prolific.
Procedure
After providing informed consent, participants were randomly assigned to the deception judgement or richness in detail judgement condition through Qualtrics. In both conditions, participants were asked to evaluate 13 transcripts (including one bogus transcript used as an attention check, but excluded from main analyses), presented one by one, in a random order. The instructions for the judgements were the same as for Studies 1 to 3.
The first attention check concerned the bogus transcript, which looked like just another transcript but instructed participants to ignore it and instead answer -47 on the scale. The second attention check was a surprise recall test after the last transcript, asking participants to select a distinctive utterance from that transcript (e.g., ‘forgot the name of the girl I was looking for’) among six options. Thereafter, participants indicated their motivation and experienced difficulty, and were asked to list, one by one, the cues they had relied on.
Materials
We used 72 transcripts (half truthful, half deceptive). To avoid item effects, we created 6 sets of 12 transcripts[3], and participants were randomly assigned to receive one of the 6 sets. The statements were selected from (5). Participants in that study were native German-speaking undergraduates who were interviewed about the two tasks they claimed to have been doing in the past half hour. Statements were later transcribed verbatim. We selected the transcripts from participants who had been instructed to consistently lie or tell the truth (i.e., the lie-lie and truth-truth conditions), and used only the ‘Find Michelle at the bus stop’ task. This task entailed leaving the lab, crossing the campus to the bus stop, trying to find a girl named Michelle (of whom they had received a photo), taking notes of arriving and departing buses, and returning to the lab within 35 minutes. From the structured interview, we selected only the first response to the interviewer’s instruction to describe the task as accurately and in as much detail as possible. Truth tellers described the task they had enacted (trying to find Michelle at the bus stop). Liars also provided a statement about their search for Michelle, but had not actually enacted that task. We edited the transcripts to correct spelling errors, but retained all utterances and filler words (e.g., ‘Ehm’). Content coding of the entire transcripts by trained coders (13) showed that the 36 truthful transcripts (M = 38.06; SD = 15.36) contained more details than the 36 deceptive transcripts (M = 25.72; SD = 9.66), d = 0.96 (95% CI: 0.47; 1.45)[4]. Below is the English translation of one example statement (all original German transcripts can be found on https://osf.io/z26ar/):
‘Okay, so after I finished the task with the café I went to the stop at the hospital. I didn't know exactly where it was, so I first meandered through here a bit, asked 'uuh I'm doing a task, can you tell me where the stop is?' And I already thought that it was this one and then I went there. Yes, and then I was supposed to look for Michelle. There were two or three people sitting there, three people sitting there, and then I asked them in Dutch if their name was Michelle. Yes, there was no Michelle there, then I sat there for five minutes, looked to see if maybe some bus was coming by where a Michelle got off, but no bus came by at all. And then I came back here and, yes, I didn't complete the task because I didn't find Michelle’.
Main analyses (preregistered)
The 2 (Judgement Method: Deception vs Richness in detail) x 2 (Veracity: Truthful vs Deceptive) mixed ANOVA showed the predicted interaction effect, F(1, 190) = 20.09, p < .001, η²p = .096. Lie-truth differences when judging deception were small and non-significant, t(103) = 0.09, p = 0.929, d = 0.01 (95% CI: -0.18; 0.20), BF01 = 9.17. In contrast, lie-truth differences when judging richness in detail were significant and moderate to large, t(87) = 7.06, p < .001, d = 0.75 (95% CI: 0.51; 0.99), BF10 = 2.53 x 10^7.
Additional analyses
Participants were motivated to provide an accurate judgement, M = 52.92, SD = 39.05 (Judging deception: M = 51.78; SD = 41.30; Judging richness in detail: M = 54.28; SD = 36.39) and rated the task moderately difficult, M = 42.90, SD = 42.94 (Judging deception: M = 48.35; SD = 44.80; Judging richness in detail: M = 36.45; SD = 39.95).
Tracking of the time spent per page[5] showed that the average time to read and evaluate a transcript was about a minute, M = 57.44 sec., SD = 30.81 (Judging deception: M = 58.80 seconds; SD = 33.82; Judging richness in detail: M = 55.82 seconds; SD = 26.92).
Study5
Ethics approval, data, and materials of Study5: https://osf.io/z26ar/. Hypotheses, analysis plan, and predictions were preregistered before the start of data collection: https://osf.io/z26ar/.
Method
Participants judged statements on richness in detail and either were or were not explicitly told that their judgements served to tell lies from truths. Participants in the explicit condition were told that some statements were deceptive and that their goal was to detect the deceptive statements. Participants in the non-explicit condition were given no information about deception or lie detection and were merely asked to evaluate the statements.
Participants
166 fluent Dutch-speaking participants (who had not taken part in Studies 2 to 4) took part in Study5 on Prolific. We excluded 16 participants who failed both attention checks. The 150 remaining participants (83 female; M age = 26.80 years, SD = 8.48; n = 76 in the explicit condition and n = 74 in the non-explicit condition) were Dutch (55.33%), Belgian (31.33%) or had another nationality (13.33%). About half of them (52%) were students. Demographics were obtained from Prolific.
Procedure
After providing informed consent, participants were randomly assigned to the explicit versus non-explicit condition through Qualtrics. In both conditions, participants were asked to evaluate 16 statements (and an additional bogus statement used as an attention check, but excluded from main analyses), presented one by one, in a random order. Instructions for the non-explicit condition were similar to those used in Studies 1 to 4 (judge detailedness), but in the explicit condition participants were informed (i) that some statements were deceptive and (ii) that their goal was to detect those lies.
The first attention check concerned the bogus statement, which looked like just another statement but instructed participants to ignore it and instead answer -47 on the scale. The second attention check was a surprise multiple-choice question after judging the last statement, asking participants to indicate the core of the last statement (e.g., ‘Search for a book’) from six options. Thereafter, participants rated motivation and difficulty. Finally, there were two (open box) manipulation checks, asking about the goal of the study and what cues they had relied on.
Materials
The statements were selected from (13) and are the same as those used in Studies 1 to 3. We used 64 statements (half truthful, half deceptive). To avoid item effects, we created 4 sets of 16 statements, and participants were randomly assigned to receive one of the 4 sets.
Main analyses (preregistered)
Using JASP 0.16 and its default settings, the 2 (Goal of Lie Detection: Explicit vs Non-explicit) x 2 (Veracity: Truthful vs Deceptive) mixed Bayesian ANOVA showed that the data were 2.78 times less likely (BF01) under the model including the interaction than under the model with only the two main effects. Lie-truth differences when judging richness in detail were significant and large when the goal of lie detection was not explicit (as was the case when participants relied on heuristics in Studies 1 to 4), d = 1.02 (95% CI: 0.78; ∞), BF10 = 1.39 x 10^10 (Mtruthful = 40.56, SD = 23.40; Mdeceptive = 22.69, SD = 25.91), but also when the goal of lie detection was made explicit, d = 0.97 (95% CI: 0.74; ∞), BF10 = 5.51 x 10^9 (Mtruthful = 39.56, SD = 24.61; Mdeceptive = 17.33, SD = 28.19).
Additional analyses
A condition-blind researcher (MW) coded whether participants mentioned deception or lie detection in the open-box answer about the study goal. 56 out of 76 (74%) participants in the explicit condition mentioned deception or lie detection, versus only 2 out of 74 (3%) participants in the non-explicit condition, χ2(1) = 79.66, p < .001, Cramér’s V = 0.73.
A condition-blind researcher (MW) also coded whether participants mentioned richness in detail in the open box answer about the cues they had relied on. The vast majority of the participants mentioned richness of detail (137 out of 150 or 91.33%).
Participants were motivated to provide an accurate judgement, M = 58.39, SD = 32.81 (Explicit condition: M = 56.07; SD = 35.72; Non-explicit condition: M = 60.78; SD = 29.58) and rated the task as moderately difficult, M = 35.39, SD = 43.42 (Explicit condition: M = 41.07; SD = 42.13; Non-explicit condition: M = 29.57; SD = 44.23).
Robustness check
The exclusion criteria were preregistered and served to ensure that participants paid attention to each statement, but our findings do not hinge on them. When no participants were excluded, the Bayesian ANOVA showed that the data were 2.34 times less likely (BF01) under the model including the interaction than under the model with only the two main effects, and the lie-truth difference was large for the non-explicit condition, d = 1.04 (95% CI: 0.78; 1.31), as well as the explicit condition, d = 0.97 (95% CI: 0.71; 1.23).
Study6
Study6 was exploratory and therefore not preregistered. Ethics approval, data, and materials of Study6: https://osf.io/z26ar/.
Method
Undergraduate participants either lied or told the truth in a videotaped interview about their whereabouts on the university campus. Immediately after the interview, the interviewers judged the statement on detailedness (using a 0 to 10 scale), with the statement deemed credible for scores of 6 and above. We examined the accuracy of these simple, real-time judgements.
Participants
47 undergraduate participants from the University of Amsterdam took part in return for course credits. Three participants were excluded, two because they did not complete their mission, and one because of suspected intoxication. Of the remaining 44 participants, n = 23 were in the truthful condition (Mage = 19.87, SD = 2.70; 43.5% native English speakers; 78% female, 22% male), and n = 21 were in the deceptive condition (Mage = 19.48, SD = 1.12; 47.6% native English speakers; 52% female, 43% male, 5% non-binary).
Procedure
We recruited participants who were comfortable providing a video statement. Through a brief screening via e-mail, we tried to balance our sample to obtain roughly equal numbers of native and non-native English speakers. The entire procedure was conducted in English. Upon arrival at the lab, participants were welcomed by a first experimenter[6] and provided written informed consent.
Participants were randomly assigned to the deceptive or truthful condition and received written instructions for the theft or study-location mission, respectively. Participants were asked to paraphrase their mission to the first experimenter to ensure it was well understood. In the deceptive condition, participants first went to a building to find a key, then to another building to open a mailbox with that key and steal an exam, and finally to a third building to drop off the stolen exam. In the truthful condition, participants searched for an appropriate study location in several buildings on campus, taking flyers with them to prove they had visited the designated areas. Participants were asked to return within 25 to 30 minutes.
Upon return to the laboratory, participants were informed that they were suspected of the theft and were briefly informed about the innocent mission (allowing those in the deceptive condition to create a realistic lie). They were told that they would be interviewed about their whereabouts during the last half hour by a second experimenter, and that their statement would be checked on verifiability (see the information protocol of Nahari et al., 2014). A reward in course credits was promised for providing a credible statement, and they were given 10 minutes to prepare the statement.
The participants were then guided to another room, where the second (condition-blind) experimenter conducted the video interview. After a brief explanation and short small talk (aimed at building rapport), the interviewer asked them to describe their whereabouts of the last half hour in as much detail as possible. To elicit a rich statement, the interviewer encouraged them to fill 10 minutes. The experimenter had been instructed not to interrupt the interviewee during the interview, and only to encourage the interviewee to speak (by nodding, saying ‘OK’, etc.). When such prompts did not lead the interviewee to say more, a follow-up question was asked (i.e., ‘What proves me you are telling the truth?’). Directly after the interview, the same experimenter scored the interview on detailedness from 0 = not detailed at all to 10 = very detailed, using the DePaulo et al. (2003) definition (‘Degree to which the message includes details such as descriptions of people, places, actions, objects, events, and the timing of events; the degree to which the message seemed complete, concrete, striking, or rich in details’).
After the interview, participants were guided back to the first experimenter, asked to take an English language proficiency test, and asked to honestly answer a few brief questions about the interview experience (single-scale measures of cognitive demand, emotional arousal, motivation, fatigue, and perceived likelihood that the statement would be verified; all from -100 = not at all to +100)[7]. Participants were thanked and received their credits (with a detailedness score of ≥ 6 by the second experimenter leading to the bonus pay).
Main Analyses (Not Preregistered)
Detailedness judgements for truthful statements (M = 7.17, SD = 1.34) were considerably higher than for deceptive statements (M = 4.48, SD = 1.57), t(42) = 6.16, p < .001, d = 1.86 (95% CI: 1.14; 2.56). The diagnosticity of the detailedness judgements for classifying lies versus truths was high, AUC = .91 (95% CI: .82; .99). Using the pre-determined cut-off (i.e., 6), 21 out of 23 (91%) truthful statements and 14 out of 21 (67%) deceptive statements were correctly classified. Overall accuracy was 79%. There was a significant association between statement veracity (truthful versus deceptive statement) and heuristic judgement of veracity (judged truthful versus judged deceptive), χ2(1) = 15.94, p < .001, Cramér’s V = .60.
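The association statistics reported here follow from the 2 x 2 classification table implied by the counts (truthful: 21 correct, 2 incorrect; deceptive: 14 correct, 7 incorrect). A scipy sketch (uncorrected chi-square; the function name is ours):

```python
import numpy as np
from scipy.stats import chi2_contingency

def association(table):
    """Chi-square test of independence (no continuity correction)
    plus Cramer's V for a contingency table of counts."""
    table = np.asarray(table, dtype=float)
    chi2, p, dof, _ = chi2_contingency(table, correction=False)
    v = np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))
    return chi2, p, v

# Rows: actual veracity (truthful, deceptive);
# columns: judged truthful, judged deceptive
chi2, p, v = association([[21, 2], [7, 14]])
```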
The six interviewers were trained in coding detailedness using the Verifiability Approach (VA): they became acquainted with the relevant literature, learned the coding scheme (https://osf.io/k9e8f/), and practiced the coding in a workshop. Each statement was coded independently by two interviewers, who then discussed their coding and came to a consensual score. Using this consensus score, the 23 truthful statements (M = 15.13; SD = 9.67) were found to contain more verifiable details than the 21 deceptive statements (M = 6.24; SD = 6.09), t(42) = 3.61, p < .001, d = 1.09 (95% CI: 0.45; 1.72).
Study7
Ethics approval, data, and materials of Study7: https://osf.io/z26ar/. Hypotheses, analysis plan, and predictions were preregistered before the start of data collection: https://osf.io/z26ar/.
Method
Participants judged truthful and deceptive videotaped interviews (see Materials) either on richness in detail or on eye gaze aversion.
Participants
205 participants took part in Study7. We excluded 34 participants who failed both attention checks. The 171 remaining participants (123 female, 44 male, 4 other) had a mean age of 22.07 years (SD = 5.03). Eighty-six participants judged detailedness, and 85 judged eye gaze aversion. Their mother tongue was Dutch (31%), English (17.5%) or another language (51.5%). Their country of origin was the Netherlands (30%), Germany (8%) or one of 40 other countries. Participants were rewarded with course credits or 7.50 euro, and the 3 best-performing participants received a bonus of 0.50 credit or 5 euro.
Procedure
Participants were recruited via the recruitment portal of the University of Amsterdam. The vast majority of this pool consists of undergraduate students, with the remainder consisting of community members. Participants first provided informed consent, which included the explicit agreement not to download, store or share the video statements. Participants were randomly assigned to judge detailedness or eye gaze behavior through Qualtrics. In both conditions, participants were asked to evaluate 12 videos, presented one by one, in a random order. Instructions for the detailedness judgements were the same as for Studies 1 to 6. For eye gaze aversion, we instructed people to judge ‘Looking away’, explained that this ‘… means the person in the video does not maintain eye contact with the interviewer/camera or looking to the side during the interview’.
One attention check asked about the demeanor of the interviewee (i.e., ‘Please answer the following question on the content of the last statement. Which is true? The interviewee was scratching his hair several times; The interviewee was coughing several times; The interviewee was having hiccups several times; The interviewee was holding his nose several times [correct answer]; The interviewee was laughing several times’), and one attention check asked about the content of the statement (i.e., ‘Which of the following persons did the interviewee mention? Boris Johnson, Joe Biden, Angela Merkel [correct answer], Olaf Scholz, Pope Francis’). Thereafter, participants rated motivation and difficulty, and were asked to name the cues they had relied on. We then asked age, native language, gender, country of origin, country of residence, contact details (in order to provide the bonus pay), and whether they opted for money or credits. Finally, participants were debriefed and told that the interviewees had been instructed to lie versus tell the truth.
Materials
We used 12 video statements obtained in Study6[8]. From the pool of 44 videos, we only used those for which the participants had provided consent to use their video statements in new research, that were below 4 minutes in length after cutting (see below), and in which the interviewee was not wearing a face mask. Finally, we selected the videos so that the truthful and deceptive conditions were balanced in native tongue (English vs. other). All participants watched the same set of 12 videos (6 truthful, 6 deceptive). From each interview, we cut the initial rapport-building phase and the follow-up question at the end, retaining only the response to the interviewer’s instruction to describe, in as much detail as possible while trying to fill the 10 minutes, what the interviewee had done in the last half hour.
Using the detail count by the trained coders of Study6, the selected 6 truthful videos contained more details (M = 6.17, SD = 1.17) than the selected 6 deceptive videos (M = 4.17, SD = 1.17), d = 1.71 (95% CI: 0.30; 3.03). Using a stopwatch, one team member (OKA) had measured the time that the interviewee looked away from the interviewer/camera; this time was converted to a percentage of the entire interview’s duration. The coding of a random subset (20%) of the statements by a second team member (AL) confirmed the reliability of this eye gaze aversion measurement (ICC = .93). Eye gaze aversion in the 6 truthful videos (M = 59.83%, SD = 13.70) did not differ from that in the 6 deceptive videos (M = 61.50%, SD = 9.94), d = 0.14 (95% CI: -1.00; 1.27). Thus, the coding confirmed that detailedness, but not eye gaze aversion, is a diagnostic cue to deception.
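As a check on the stimulus statistics above, both effect sizes can be reproduced from the reported means and standard deviations. The sketch below assumes a pooled-SD Cohen’s d for two equal-sized groups; the function name is ours.

```python
from math import sqrt

def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d for two equal-sized groups, using the pooled standard deviation."""
    pooled_sd = sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

# Detail counts: truthful (M = 6.17, SD = 1.17) vs deceptive (M = 4.17, SD = 1.17)
d_detail = cohens_d(6.17, 1.17, 4.17, 1.17)   # ≈ 1.71, as reported

# Eye gaze aversion %: deceptive (M = 61.50, SD = 9.94) vs truthful (M = 59.83, SD = 13.70)
d_gaze = cohens_d(61.50, 9.94, 59.83, 13.70)  # ≈ 0.14, as reported
```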
Main Analyses (preregistered)
Using JASP version 0.16.2 and its default settings, the 2 (Cue: Richness in Detail vs. Eye Gaze Aversion) x 2 (Veracity: Truthful vs. Deceptive) mixed Bayesian ANOVA showed that the data were much more likely (BF10 = 5.73 × 10^23) under the model that included the interaction than under the model that only included the two main effects. As is clear from inspecting Figure 2, cue diagnosticity matters. Lie-truth differences when judging eye gaze aversion were faint, t(85) = 1.98, p = .05, d = 0.21 (95% CI: -0.01; 0.47), BF01 = 1.29. In contrast, lie-truth differences when judging richness in detail were significant and large, t(84) = 15.94, p < .001, d = 1.73 (95% CI: 1.44; +∞), BF10 = 1.79 × 10^24.
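The two within-subject effect sizes above are consistent with the standard conversion for a paired design, d_z = t / √n, where n = df + 1 is the number of judges per condition. A minimal sketch under that assumption:

```python
from math import sqrt

# Paired-design effect size: d_z = t / sqrt(n), with n = df + 1 judges per condition
d_gaze   = 1.98 / sqrt(86)   # t(85) = 1.98  -> ≈ 0.21, as reported
d_detail = 15.94 / sqrt(85)  # t(84) = 15.94 -> ≈ 1.73, as reported
```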
Additional analyses
Participants were motivated to provide an accurate judgement. On a scale from 0 to 10, they rated their motivation M = 6.14, SD = 1.95 (judging eye gaze aversion: M = 5.97, SD = 2.12; judging richness in detail: M = 6.32, SD = 1.75). Also using a 0 to 10 scale, they rated the task to be moderately difficult, M = 5.06, SD = 2.32 (judging eye gaze aversion: M = 5.22, SD = 2.39; judging richness in detail: M = 4.91, SD = 2.24).
Eighty-five percent of the participants in the detailedness condition reported that they relied on detailedness, and 92% of the participants in the eye gaze aversion condition reported that they relied on gaze aversion in their judgement. That participants indeed relied on the instructed cue and could accurately judge its presence is also apparent from the correlations between the participants’ judgements and the researchers’ coding of cue presence. Detailedness judged by the participants correlated strongly with detailedness as assessed by the trained coders, r = .74, p < .01 (but not with eye gaze aversion as measured with the stopwatch, r = -.32, p = .32). Eye gaze aversion judged by the participants correlated strongly with eye gaze aversion as measured with the stopwatch, r = .94, p < .001 (but not with detailedness as assessed by the trained coders, r = .24, p = .45).
References
1. B. Verschuere, M. Schutte, S. van Opzeeland, I. Kool, The verifiability approach to deception detection: A preregistered direct replication of the information protocol condition of Nahari, Vrij, and Fisher (2014b). Appl. Cogn. Psychol. 35, 308–316 (2021).
2. B. M. DePaulo, et al., Cues to deception. Psychol. Bull. (2003) https://doi.org/10.1037/0033-2909.129.1.74.
3. T. R. Levine, Y. Daiku, J. Masip, The Number of Senders and Total Judgments Matter More Than Sample Size in Deception-Detection Experiments. Perspect. Psychol. Sci. (2021) https://doi.org/10.1177/1745691621990369.
4. W. J. Youden, Index for rating diagnostic tests. Cancer 3, 32–35 (1950).
5. B. L. Verigin, E. H. Meijer, A. Vrij, L. Zauzig, The interaction of truthful and deceptive information. Psychol. Crime Law 26, 367–383 (2020).
[1] Cohen’s d is the standardized mean lie-truth difference.
[2] The Bayes Factor BF01 expresses how much more likely the data are under the null hypothesis of no lie-truth difference than under the alternative hypothesis of a lie-truth difference. BF10 is the inverse of BF01. We report BF10 when the data were more likely under the alternative hypothesis than under the null hypothesis.
[3] Due to a programming error 1 of the 6 sets missed 1 (truthful) statement.
[4] Coders counted the number of perceptual, temporal, and spatial details, and we summed these to provide an index of richness in detail. Coding was based on the entire interview.
[5] This Qualtrics feature was only implemented for Study4.
[6] There were 4 experimenters (undergraduate students) who took the role of Experimenter 2, and each interviewed 3 to 14 participants.
[8] The videos are available from the first author after signing a non-disclosure agreement that stipulates the confidential nature of the videos and that they can only be used for research purposes.