We examined the contribution of informed and non-informed advice-taking to participants’ choice behavior in a sequential reinforcement learning task (Fig. 1). In this section, we first establish a ‘reveal effect’ describing the causal influence of revealing the teacher’s advice on participants’ choice behavior (Fig. 2). This effect disentangles the influence of advice-giving from trial-by-trial independent learning, and we examined it both at the group level and at the individual level (i.e., between-session test-retest estimates). We then describe the computational processes underlying individuals’ choice behavior. Specifically, we used computational modeling to estimate the contribution of three internal latent processes to participants’ choice behavior: (1) individual learning (i.e., learning from experience with the cards, independent of any advice), (2) informed advice-taking (i.e., trial-by-trial learning of the value of following the teacher’s advice), and (3) non-informed advice-taking (i.e., a fixed internal bias to follow advice regardless of choice and outcome history). Finally, we provide further support for the necessity of both advice-taking processes: using participants’ estimated computational parameters, we simulated data in which informed or non-informed advice-taking was independently lesioned, and demonstrated unique regression signatures for each process. Overall, we found strong evidence for a causal effect of advice on participants’ choice behavior, operating via both informed and non-informed advice-taking processes.
Theory-independent analysis examining the influence of receiving advice on compliance. We first assessed whether participants demonstrated a bias towards advice-taking across trials and sessions. To this end, we calculated a coherence rate as the dependent variable, reflecting whether participants chose the same card as the teacher (coded 0/1 when the teacher and student selected different/the same cards, respectively), both when the teacher’s choice was revealed and when it was concealed. The coherence rate in the concealed condition served as a baseline, as it reflects the contribution of individual learning to the likelihood of agreement with the teacher’s choice, i.e., the likelihood that the participant independently reached the same choice as the teacher. This occurs when the participant has learned to identify the more valuable offer, which was often aligned with the teacher’s choice (whether concealed or revealed). This approach enabled us to examine the unique contribution of the teacher’s advice by comparing coherence rates across the revealed and concealed advice conditions, a comparison we termed the reveal effect.
We therefore performed a hierarchical Bayesian logistic regression analysis, predicting participants’ coherence rates as a function of advice presentation (concealed/revealed). Coherence rates in the revealed advice condition were higher than the baseline rates in the concealed advice condition, suggesting a causal influence of advice on compliance. Specifically, participants were more likely to choose the card advised by the teacher when the teacher’s advice was revealed (66% coherence rate) than when it was concealed (48% coherence rate; posterior median of the population-level difference = 0.29, 89% HDI = 0.25 to 0.33, probability of direction (pd) ~ 100%, indicating that virtually the entire posterior distribution was positive; Fig. 2A/B).
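For concreteness, the following is a minimal sketch of how the coherence rate and the reveal effect can be computed from trial-level data. The column names (‘subject’, ‘condition’, ‘student_choice’, ‘teacher_choice’) are hypothetical placeholders, and the sketch is purely descriptive; the estimates reported above come from a hierarchical Bayesian logistic regression rather than this simple difference of proportions.

```python
import pandas as pd

# Descriptive sketch: per-participant coherence rates and the reveal effect.
# Column names are hypothetical placeholders for the trial-level data.
def reveal_effect(trials: pd.DataFrame) -> pd.DataFrame:
    trials = trials.copy()
    # Coherence: 1 if the participant chose the card the teacher chose, else 0.
    trials["coherence"] = (trials["student_choice"] == trials["teacher_choice"]).astype(int)
    # Mean coherence per participant in each advice-presentation condition.
    rates = (trials.groupby(["subject", "condition"])["coherence"]
                   .mean()
                   .unstack("condition"))
    # Reveal effect: coherence when advice is revealed minus the concealed baseline.
    rates["reveal_effect"] = rates["revealed"] - rates["concealed"]
    return rates
```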
To further substantiate the estimated causal influence of advice on participants’ choice behavior, we examined the test-retest reliability of the tendency to follow advice. Participants performed the task in two separate experimental sessions on consecutive days (the visual stimuli of the cards and the artificial teachers were changed across blocks and sessions). A hierarchical Bayesian regression analysis revealed good test-retest reliability (posterior median = 0.73, 89% HDI = 0.64 to 0.81; Fig. 2C/D). Overall, these results demonstrate a strong and reliable causal influence of receiving advice on individuals’ choice behavior in a reinforcement learning task in which advice was given repeatedly. However, these analyses did not allow us to disentangle the effects of informed and non-informed advice-taking. For this purpose, we turned to computational analyses that explicitly model the contribution of different learning and decision-making processes.
Computational modeling. We hypothesized that participants’ choice behavior integrates three sources of information: individual learning, informed advice-taking, and non-informed advice-taking. We formulated a saturated model that included all three sources of information and tested it against simpler nested models. In these models, we predicted the agents’ choices based on reward history and experimental conditions, updating the subjective values of the cards (Q-values) on each trial based on the prediction error signal23. We describe each model below: a baseline model with only an individual learning component (Model 1), models that add non-informed advice-taking (Models 2 and 3) or informed advice-taking (Model 4), and a full model with all three components (Model 5).
Model 1 (null model)
This model assumes that participants learned the cards’ values from their own experience, without considering the presented advice:
(1) \(\delta_{chosen\_card} = (reward - Q_{chosen\_card})\)
(2) \(Q_{chosen\_card} = Q_{chosen\_card} + \alpha \cdot \delta_{chosen\_card}\)
where α is the learning rate (a free parameter) and δchosen_card represents the prediction error for the card selected by the agent (this equation is used in all models). Thus, this model ignores any advice revealed to the participant. To choose between the cards, we used a softmax policy:
(3) p(choice) = \(\frac{\exp(\beta \cdot Q_{chosen\_card})}{\sum_{i}\exp(\beta \cdot Q_{i})}\)
where β is an inverse noise parameter (free parameter), and Qi denotes the Q-value of each card offered in the current trial. Thus, this model has two population-level free parameters (αchosen_card, β).
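As an illustration, a minimal sketch of the Model 1 updating and choice rules is shown below. The function and variable names are ours, and the snippet is a simplified stand-in for the hierarchical implementation described in the Methods.

```python
import numpy as np

# Sketch of the null model (Model 1): Q-learning over cards (Eqs. (1)-(2))
# with a softmax choice rule (Eq. (3)).
def update_q(q: np.ndarray, chosen: int, reward: float, alpha: float) -> np.ndarray:
    delta = reward - q[chosen]        # Eq. (1): prediction error for the chosen card
    q = q.copy()
    q[chosen] += alpha * delta        # Eq. (2): value update
    return q

def softmax_choice_prob(q_offered: np.ndarray, beta: float) -> np.ndarray:
    v = beta * q_offered
    v = v - v.max()                   # subtract the max for numerical stability
    p = np.exp(v)
    return p / p.sum()                # Eq. (3): choice probabilities
```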
Model 2 (fixed non-informed advice-taking)
This model is identical to the baseline model except for an additional fixed bias in favor of the teacher’s advice when it was presented. When the teacher’s advice was revealed, action values were calculated according to
(4) \(Qnet_{advised\_card} = Q_{advised\_card} + \varphi\)
(5) \(Qnet_{unadvised\_card} = Q_{unadvised\_card}\)
where φ is a free parameter (unrestricted and could be positive or negative) describing the individual tendency to follow advice regardless of any choice-outcome history during the task.
These Qnet values were then entered into the softmax:
(6) p(choice) = \(\frac{\exp(\beta \cdot Qnet_{chosen\_card})}{\sum_{i}\exp(\beta \cdot Qnet_{i})}\)
Overall, this model has three population-level free parameters (αchosen_card, β, φ).
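A minimal sketch of the Model 2 bias is shown below; the function and argument names are of our choosing.

```python
# Sketch of the fixed non-informed bias (Eqs. (4)-(5)): when advice is revealed,
# a fixed bias phi is added to the advised card's value before the softmax (Eq. (6)).
def qnet_with_fixed_bias(q_advised: float, q_unadvised: float,
                         phi: float, advice_revealed: bool):
    bias = phi if advice_revealed else 0.0
    return q_advised + bias, q_unadvised   # (Qnet_advised, Qnet_unadvised)
```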
Model 3 (dynamic non-informed advice-taking)
Here, we augmented Model 2 so that on trials in which the participant’s decision was more difficult (i.e., the offered cards’ internal values were similar), the non-informed advice-taking tendency could increase, allowing greater reliance on external information. Specifically, this model assumes that the bias in Eq. (4) changes as a function of the difference between the Q-values of the two cards offered on the current trial.
(7) \(\varphi = \varphi_{intercept} + \varphi_{slope} \cdot |\Delta Q_{cards}|\)
φintercept serves as a general bias factor, and φslope serves as an additional slope parameter that is multiplied by the absolute difference between Qadvised_card and Qunadvised_card. This model includes four free parameters: α, β, and two bias parameters (φintercept and φslope).
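A minimal sketch of the difficulty-dependent bias of Model 3, with names of our choosing:

```python
# Sketch of Eq. (7): the advice bias varies with decision difficulty,
# operationalized as the absolute Q-value difference between the offered cards.
def dynamic_bias(q_advised: float, q_unadvised: float,
                 phi_intercept: float, phi_slope: float) -> float:
    delta_q = abs(q_advised - q_unadvised)
    return phi_intercept + phi_slope * delta_q
```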
Model 4 (informed advice-taking)
This model assumes that instead of having a general preference for the advised card, participants evaluated the advice over the course of the experimental block via Q-learning. The two options, to follow or not to follow the advice, were updated via Q-learning using a prediction error and a learning rate (the same free parameter α as in the other models), as shown in Eq. (8) and Eq. (9):
(8) \(\delta_{follow\_advice} = (reward - Q_{follow\_advice})\)
(9) \(Q_{follow\_advice} = Q_{follow\_advice} + \alpha \cdot \delta_{follow\_advice}\)
Next, we calculated Qnet, which incorporated the Q-values of the cards and of following (or not following) advice, and weighted them using ω:
(10) \(Qnet_{advised\_card} = \omega \cdot Q_{advised\_card} + (1-\omega) \cdot Q_{follow\_advice}\)
We note that the Q-values of the cards were updated using Eq. (1) and Eq. (2), and that the softmax decision function in this model is the same as in Model 2 (Eq. (6)). This model involves three free parameters: a learning rate for the cards and for following advice (α), an inverse noise parameter (β), and a weighting between the Q-values of the cards and of following advice (ω).
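A minimal sketch of the Model 4 computations is given below. One detail is an assumption on our part: that only the value of the option actually taken (following vs. not following the advice) is updated with the obtained reward.

```python
# Sketch of Model 4 (Eqs. (8)-(10)). Assumption: only the advice option that was
# actually taken (follow vs. not follow) is updated with the obtained reward.
def update_advice_values(q_follow: float, q_not_follow: float,
                         followed: bool, reward: float, alpha: float):
    if followed:
        q_follow += alpha * (reward - q_follow)            # Eqs. (8)-(9)
    else:
        q_not_follow += alpha * (reward - q_not_follow)
    return q_follow, q_not_follow

def qnet_informed(q_advised: float, q_follow: float, omega: float) -> float:
    # Eq. (10): weight the advised card's value and the value of following advice.
    return omega * q_advised + (1.0 - omega) * q_follow
```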
Model 5 (informed and non-informed advice-taking)
This model combines Models 2 and 4 and assumes both a general preference for the advised card and an evaluation of the teacher based on reward history (Eq. (11)).
(11) \(Qnet_{advised\_card} = \omega \cdot Q_{advised\_card} + (1-\omega) \cdot Q_{follow\_advice} + \varphi\)
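The corresponding sketch for the full model simply adds the fixed bias term to the informed weighting:

```python
# Sketch of Eq. (11): Model 5 combines the informed weighting (omega) of Model 4
# with the fixed non-informed bias (phi) of Model 2.
def qnet_full(q_advised: float, q_follow: float,
              omega: float, phi: float) -> float:
    return omega * q_advised + (1.0 - omega) * q_follow + phi
```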
Model fitting. We performed a model comparison using a leave-one-block-out approach, as described in the Method section. For each pairwise model comparison, we calculated a difference distribution and estimated the elpd difference and its standard error (using the ‘loo’ package; an elpd difference of at least twice its standard error was considered substantial24). Model 5 was the winning model (see Table 1). Parameter recovery is reported in the SI.
Table 1
Model comparison results – winning model compared to other models.
Model | Expected log predictive density (elpd) difference compared to the winning model (Model 5 – informed and non-informed advice-taking) |
Model 1 (null model) | -3938 (89.4) |
Model 2 (fixed bias) | -189.5 (22.9) |
Model 3 (differential bias) | -213.2 (22.7) |
Model 4 (teacher evaluation) | -916.7 (59.8) |
Note. Elpd (expected log predictive density) was calculated using a leave-one-block-out cross-validation approach. An elpd difference that is larger than 4 and at least twice its standard error is considered substantial evidence in favor of the winning model24. Elpd difference standard errors are noted in parentheses. |
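To make the decision rule described in the table note concrete, the following is a minimal sketch; the inputs are the elpd difference and its standard error as produced by the leave-one-block-out cross-validation (here obtained via the ‘loo’ package in R), and the function itself is illustrative.

```python
# Sketch of the decision rule from the note to Table 1: a difference is treated
# as substantial evidence for the winning model when it exceeds 4 and is at
# least twice its standard error.
def substantial_elpd_difference(elpd_diff: float, se_diff: float) -> bool:
    return abs(elpd_diff) > 4 and abs(elpd_diff) >= 2 * se_diff

# Example with the values reported for Model 4 (teacher evaluation):
print(substantial_elpd_difference(-916.7, 59.8))  # True
```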
Estimated parameters. The estimated parameters for the winning model highlight the contribution of individual learning, informed advice-taking and non-informed advice-taking (Fig. 3). The ω parameter was estimated to be ~ 0.75 (Fig. 3C), indicating a reliance on individual learning, but also on informed advice-taking, which tracked the accuracy of the teacher. This value indicates that a highly accurate teacher could influence the decisions made by participants beyond individual learning, a finding that is in line with the reveal effect discussed above. Parameter φ was estimated to be ~ 0.2 (Fig. 3D), indicating a tendency to follow advice in a manner that is unrelated to the teacher’s accuracy. This fixed contribution is the basis of non-informed compliance and contributed to the reveal effect as well. This tendency increased the likelihood of following advice even when it went against one’s individual experience and/or when the teacher’s accuracy was low.
Associations between the winning model and empirical data. To demonstrate the association between the winning model’s parameters and the model-agnostic results, we estimated the individual bias to follow advice for each participant and for each simulated agent. To do so, we simulated artificial data with the same number of trials as the empirical data, using each individual’s parameter estimates (the posterior mean for each individual and parameter). We then calculated the ‘reveal effect’ for each participant from the empirical data and from the model-based artificial data, and examined the correlation between them. We found a strong positive correlation (Pearson r = 0.89, pd ~ 100%; 89% CI = 0.85 to 0.91; Fig. 3E), showing that the model successfully reproduced the behavioral results.
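A minimal sketch of this posterior-predictive check is shown below; `simulate_participant` and `compute_reveal_effect` are hypothetical helpers standing in for the task simulation and the coherence analysis sketched earlier.

```python
import numpy as np
from scipy import stats

# Sketch: simulate one artificial dataset per participant from their
# posterior-mean parameters, compute the reveal effect in the empirical and
# simulated data, and correlate the two across participants.
def model_vs_data_correlation(participants, posterior_means,
                              simulate_participant, compute_reveal_effect):
    empirical = np.array([compute_reveal_effect(p.trials) for p in participants])
    simulated = np.array([
        compute_reveal_effect(simulate_participant(posterior_means[p.id]))
        for p in participants
    ])
    r, _ = stats.pearsonr(empirical, simulated)
    return r
```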
Computational lesion analyses. Thus far, our modeling results (Table 1) provide clear evidence that individuals use both informed and non-informed advice-taking mechanisms. Specifically, the elpd results indicated that modeling both informed and non-informed processes substantially increased our ability to predict left-out blocks. To further illustrate the existence of both types of advice-taking in the empirical data, we performed a lesion analysis. We first simulated two datasets for each individual, based on that individual’s empirical parameter estimates. In the first set (non-informed advice-taking), we lesioned the informed component by setting ω to 1. In the second set (informed advice-taking), we lesioned the non-informed component by setting φ to 0. This allowed us to examine the reveal effect in datasets that were artificially constructed using only one of the advice-taking processes at a time.
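A minimal sketch of the two lesions, assuming each participant’s posterior-mean parameters are stored in a dictionary with keys 'omega' and 'phi' (our naming):

```python
# Sketch: create two lesioned copies of a participant's parameters before
# re-simulating the task with the winning model.
def lesion_parameters(params: dict) -> dict:
    non_informed_only = dict(params, omega=1.0)  # omega = 1 removes informed advice-taking
    informed_only = dict(params, phi=0.0)        # phi = 0 removes non-informed advice-taking
    return {"non_informed_only": non_informed_only,
            "informed_only": informed_only}
```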
(1) Signature for non-informed advice-taking: We calculated the ‘reveal effect’ using only two trials per participant per block – the first concealed trial and the first revealed trial. We reasoned that at such an early stage of a block, informed advice-taking could not yet produce a reveal effect. Note that each block included a novel teacher, with whom the participant had no prior experience. In the empirical data, we found a substantial reveal effect (median = 0.62, 89% CI = 0.56 to 0.71, pd ~ 100%; see Fig. 4A). We found a similar effect in the non-informed artificial dataset (median = 0.30, 89% CI = 0.22 to 0.38, pd ~ 100%; see Fig. 4C). Importantly, we found evidence in favor of no reveal effect in the informed advice-taking artificial dataset (median = -0.01, 89% CI = -0.09 to 0.07, pd = 59.19%; see Fig. 4B).
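The following sketch illustrates the early-trial version of the reveal effect used for this signature, again with hypothetical column names; the reported estimates come from a hierarchical Bayesian model rather than this descriptive difference.

```python
import pandas as pd

# Sketch: for each participant and block, keep only the first concealed and the
# first revealed trial, then compare coherence between these two trial types.
def first_trial_reveal_effect(trials: pd.DataFrame) -> pd.Series:
    trials = trials.sort_values(["subject", "block", "trial"])
    first = (trials.groupby(["subject", "block", "condition"], as_index=False)
                   .first())
    rates = (first.groupby(["subject", "condition"])["coherence"]
                  .mean()
                  .unstack("condition"))
    return rates["revealed"] - rates["concealed"]
```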
(2) Signature for informed advice-taking: To capture a unique signature of informed advice-taking, we examined the influence of outcome on choice behavior. Specifically, we calculated a coherence repeat dependent variable, which reflected whether participants exhibited the same behavior (following the teacher’s advice or not) across two consecutive trials, n and n + 1 (coded 0/1 for different/same behavior, respectively). We then used Bayesian logistic regression analyses to predict coherence repeat as a function of previous outcome (rewarded vs. unrewarded) and advice presentation (concealed vs. revealed on both trial n and trial n + 1), as well as their interaction. Note that trials in which advice presentation differed between trials n and n + 1 were excluded, to allow a more precise comparison. We reasoned that the previous outcome × advice presentation interaction would be present only for informed advice-taking. Specifically, when advice is presented, a reward on trial n should increase coherence on trial n + 1 relative to unrewarded n trials (as reflected in the updating of the Qfollow_advice value in the winning Model 5). In contrast, on concealed advice trials the value of taking or rejecting advice should not be updated, so the reward on trial n should not affect coherence rates. We also included teacher accuracy on trial n as a fixed effect (main effect only, with no interaction), allowing us to control for overall baseline coherence rates, which might change with teacher accuracy.
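As a sketch of how the trial pairs for this regression can be constructed (hypothetical column names; the reported analysis is a hierarchical Bayesian logistic regression with participant-level effects):

```python
import pandas as pd

# Sketch: build (n, n+1) trial pairs and code the coherence repeat variable.
def prepare_coherence_repeat(trials: pd.DataFrame) -> pd.DataFrame:
    trials = trials.sort_values(["subject", "block", "trial"]).copy()
    nxt = trials.groupby(["subject", "block"]).shift(-1)   # trial n+1 within block
    pairs = pd.DataFrame({
        "subject": trials["subject"],
        # 1 if the same follow/not-follow behavior occurred on trials n and n+1.
        "coherence_repeat": (trials["coherence"] == nxt["coherence"]).astype(int),
        "prev_outcome": trials["reward"],             # outcome of trial n
        "advice_presentation": trials["condition"],   # concealed / revealed
        "teacher_accuracy": trials["teacher_accuracy"],
    })
    # Keep only pairs with the same advice presentation on trials n and n+1.
    keep = (trials["condition"] == nxt["condition"]) & nxt["coherence"].notna()
    return pairs[keep]

# Regression: coherence_repeat ~ prev_outcome * advice_presentation + teacher_accuracy
```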
In the empirical data, we found evidence in favor of a previous outcome × advice presentation interaction (median = 0.06, 89% CI = 0.04 to 0.07, pd ~ 100%; Fig. 4D). As predicted, coherence repeat rates were higher after a rewarded than after an unrewarded trial n, but only when advice was revealed. When the advice was concealed, the outcome of trial n had no influence on coherence repeat rates. A similar (but smaller) positive interaction was found for the informed artificial dataset (in which non-informed advice-taking was lesioned; median = 0.02, 89% CI = 0.01 to 0.04, pd = 99.10%; Fig. 4E). Importantly, the non-informed artificial dataset (in which informed advice-taking was lesioned) did not show evidence for a previous outcome × advice presentation interaction (median = 0, 89% CI = -0.1 to 0.02, pd = 85.35%; Fig. 4F).
Overall, our lesion analyses revealed two unique regression signatures: one predicted only by informed, but not non-informed, advice-taking, and one predicted only by non-informed, but not informed, advice-taking. Importantly, both effects were observed in the empirical dataset, providing further support for our model comparison conclusion that both types of advice-taking are required to explain the data.