**An overview of the behavioural data.**

In Experiment 1, we tested the efficacy of the proposed method and compared it with other methods. To this end, we set three conditions: the Other’s perspective, Dialectical, and Repeated conditions. All participants produced two estimates for each question. In the Other’s perspective condition, participants used our method. In the Dialectical condition, they performed dialectical bootstrapping15. In the Repeated condition, they produced two estimates for each question without further instructions (Table 1). Participants were randomly assigned to one of the three conditions. The stimuli were general-knowledge questions (for example, ‘What percent of the world’s airports are in the United States?’; Table 2).

In Experiment 2, we used the same three conditions (that is, the Other’s perspective, Dialectical, and Repeated conditions). To enable further analyses, we modified the procedure of Experiment 1 as follows: we recorded response times, asked participants for a third (i.e., final) estimate, and asked them to rate their level of confidence (see ‘Methods’ for details).

In Experiment 3, we tested an extended version of the method to examine whether its efficacy increased with the number of estimates. This experiment had a single condition: all participants made five estimates for each question, namely one estimate of their own and four estimates of public opinion.

## Efficacy of our method

Based on the behavioural data from the Other’s perspective condition in Experiments 1 and 2, we analysed the efficacy of our method. First, we compared accuracy among the three estimates, i.e., the first, second, and their averaged estimates, only in the Other’s perspective condition. As an index of the accuracy of the estimates, we used the mean squared error (MSE), where a lower MSE indicates higher accuracy. We calculated the MSEs of the three estimates for each participant.
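The per-participant calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the function names `mse` and `three_mses` and the example responses are hypothetical, and only the correct answers are taken from Table 2.

```python
from statistics import mean

def mse(estimates, truths):
    """Mean squared error of one participant's estimates across questions."""
    return mean((e - t) ** 2 for e, t in zip(estimates, truths))

def three_mses(first, second, truths):
    """Return the MSEs of Estimate 1, Estimate 2, and their per-question average."""
    averaged = [(a + b) / 2 for a, b in zip(first, second)]
    return mse(first, truths), mse(second, truths), mse(averaged, truths)

# Correct answers (in %) to Questions 1-3 of Table 2.
truths = [6.32, 41.29, 32.31]
# Hypothetical responses of a single participant (not real data).
first = [10.0, 50.0, 20.0]
second = [4.0, 35.0, 40.0]

mse_first, mse_second, mse_avg = three_mses(first, second, truths)
```

In this made-up example the two estimates fall on opposite sides of the correct answer for every question, so the averaged estimate has a much lower MSE than either individual estimate.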

Figure 2 shows the results of the analysis. Notably, the averaged estimates had a lower MSE than the participants’ own estimates (Estimate 1) across the two experiments (Experiment 1: *t*(149) = 4.39, *p* < .01, *Cohen’s d* = 0.15; Experiment 2: *t*(29) = 4.23, *p* < .01, *Cohen’s d* = 0.50). Thus, we confirmed that the new method elicited the wisdom of the inner crowd.

Note that we also compared the averaged estimates with two people’s own estimates (see Section S1 of the Supplementary Information) and found that they did not exceed the two people’s estimates. However, we also found that the averaged estimates could be more accurate than 1.5 people’s estimates (1.59 people in Experiment 2; 1.26 people in Experiment 1). To the best of our knowledge, this is the highest efficacy reported for an approach to harvesting the wisdom of the inner crowd14,17,20−22.

As previous studies14,15 showed, the second estimate is not necessarily accurate. Consistent with this, Estimate 2 had a higher MSE than the averaged estimate in both experiments (Experiment 1: *t*(149) = 6.50, *p* < .01, *Cohen’s d* = 0.32; Experiment 2: *t*(29) = 3.49, *p* < .01, *Cohen’s d* = 0.32). In addition, we did not find a significant difference in MSE between Estimate 1 and Estimate 2 (*ps* > .1).
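The relation between the averaged estimate and the two individual estimates can be illustrated with an elementary identity. This is a supplementary sketch rather than part of the reported analysis; the symbols \(x_1\) and \(x_2\) (first and second estimates for a question) and \(\theta\) (the correct answer) are notation introduced here.

```latex
\[
\left(\frac{x_1 + x_2}{2} - \theta\right)^2
= \frac{(x_1 - \theta)^2 + (x_2 - \theta)^2}{2}
- \frac{(x_1 - x_2)^2}{4}
\]
```

Averaged over questions, the MSE of the averaged estimate equals the mean of the two individual MSEs minus one quarter of the mean squared distance between the two estimates. The averaged estimate therefore cannot be less accurate than the worse of the two estimates, and its advantage grows as the two estimates differ more.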

-----Figure 2 about here-----

Table 2

Questions and correct answers used in the experiments (in %). We checked all the answers on 2022/08/05. We used the answers in The World Factbook (52), as in the previous studies (14,15). Experiment 1 used Questions 1–8 (14), and Experiments 2 and 3 used all the questions (15). Note that because we could not confirm the answer to Q10, we used World Bank data (53). In addition, for Q5, we used the latest data from the World Population Review (54) because fertility rates change frequently. All questions were translated into Japanese.

| Number | Question | Answer (%) |
| --- | --- | --- |
| 1 | The area of the USA is what percent of the area of the Pacific Ocean? | 6.32 |
| 2 | What percent of the world’s population lives in either China, India, or the European Union? | 41.29 |
| 3 | What percent of the world’s airports are in the United States? | 32.31 |
| 4 | What percent of the world’s roads are in India? | 7.30 |
| 5 | What percent of the world’s countries have a higher fertility rate than the United States? | 69.84 |
| 6 | What percent of the world’s telephone lines are in China, USA, or the European Union? | 52.09 |
| 7 | Saudi Arabia consumes what percentage of the oil it produces? | 26.62 |
| 8 | What percentage of the world’s countries have a higher life expectancy than the United States? | 22.56 |
| 9 | What percent of the earth’s surface is covered by water? | 70.90 |
| 10 | What percent of the worldwide land mass is not used for agriculture? | 63.10 |
| 11 | What percent of the world’s population is between 15 and 64 years old? | 65.18 |
| 12 | What percent of the world’s population is Christian? | 31.40 |
| 13 | What percent of the world’s population speaks Mandarin Chinese as a first language? | 12.30 |
| 14 | What percent of the world’s population aged 15 years or older can read and write? | 86.30 |
| 15 | What percent of the worldwide gross domestic product (GDP) comes from the service sector? | 63.00 |
| 16 | What percent of the worldwide labor force works in the agricultural sector? | 31.00 |
| 17 | What percent of the worldwide income does the richest 10% of households earn? | 30.20 |
| 18 | What percent of the worldwide gross domestic product (GDP) is re-invested (‘gross fixed investment’)? | 25.70 |
| 19 | What percent of the goods exported worldwide are mineral fuels (including oil, coal, gas, and refined products)? | 14.40 |
| 20 | What percent of the worldwide gross domestic product (GDP) is used for the military (military expenditure)? | 2.14 |

## Comparison of the methods

We compared the efficacy among conditions, based on the data from Experiments 1 and 2. The reduction of MSE was calculated as shown in Eq. 1.

\(\text{Reduction of MSE} = \text{MSE}_{\text{first estimates}} - \text{MSE}_{\text{averaged two estimates}}\) (Eq. 1)

Thus, a larger reduction of MSE indicates that a method is more effective. As Fig. 3 shows, the results support the advantage of our method. In Experiment 1, the Other’s perspective condition had a larger reduction of MSE than the Repeated condition (*pairwise Wilcoxon rank-sum* test: *p* < .05, *Cliff’s delta* = 0.16; the Dialectical and Repeated conditions did not follow a normal distribution, *Kolmogorov–Smirnov* test: *ps* < .05; note that all pairwise tests used the Bonferroni correction). In Experiment 2, the Other’s perspective condition had a larger reduction of MSE than the Dialectical condition (*pairwise Wilcoxon rank-sum* test: *p* < .05, *Cliff’s delta* = 0.40; the Other’s perspective and Repeated conditions did not follow a normal distribution, *Kolmogorov–Smirnov* test: *ps* < .05).

No such differences were found between the Other’s perspective and Dialectical conditions in Experiment 1 (*p* = 1.00) or between the Other’s perspective and Repeated conditions in Experiment 2 (*p* = 0.22). Descriptively, however, the Other’s perspective condition had the largest reduction of MSE among all conditions in both experiments (Experiment 1: Other’s perspective = 51.80; Dialectical = 42.01; Repeated = 15.69; Experiment 2: Other’s perspective = 85.56; Dialectical = 17.38; Repeated = 38.46). In summary, these results suggest that the new method is superior to the previous ones in terms of efficacy (for raw data, see Section S5 of the Supplementary Information).
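For readers unfamiliar with the effect size and correction used in these comparisons, the following sketch shows how Cliff’s delta and a Bonferroni adjustment can be computed. It is an illustration only: the function names, the per-participant values, and the unadjusted *p* values are hypothetical, and the rank-sum test itself is omitted because it would normally be taken from a statistics package.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical per-participant reductions of MSE in two conditions (not real data).
other = [60.0, 40.0, 55.0, 10.0]
repeated = [20.0, 5.0, 45.0, -10.0]
delta = cliffs_delta(other, repeated)

# Hypothetical unadjusted p values for the three pairwise comparisons.
adjusted = bonferroni([0.012, 0.40, 0.30])
```

Because adjusted values are capped at 1, a Bonferroni-corrected comparison can yield an adjusted *p* of exactly 1.00.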

-----Figure 3 about here-----

## Analysis of cognitive load

Methods for collecting the wisdom of the inner crowd can be used on a daily basis. From this perspective, it is important that the method is convenient to use. Therefore, we compared the cognitive load among all conditions.

As an index of the cognitive load, we used response time: in Experiment 2, the laboratory computer recorded the response time. In particular, we examined the response time for the second estimates because, for this estimate, the participants were instructed differently depending on their assigned condition (Table 1). It should be noted that for the first estimates, we did not find any significant differences among the three conditions (*pairwise t*-test: *ps* > .1).

Figure 4 shows the results of the analysis. Most importantly, the Other’s perspective condition had a significantly shorter response time than the Dialectical condition (*pairwise t*-test: *p* < .01, *Cohen’s d* = 0.96). The results indicate that our method is relatively convenient to use. It should be added that the Repeated condition had a shorter response time than the Other’s perspective and Dialectical conditions, presumably because the Repeated condition had no specific instructions (*pairwise t*-test: the Other’s perspective condition, *p* < .05, *Cohen’s d* = 0.84; the Dialectical condition, *p* < .01, *Cohen’s d* = 1.72, respectively).

When considered along with the results presented in the last section, our method could be superior to other methods in terms of efficacy and cognitive load.

-----Figure 4 about here-----

## When the proposed method worked better (or worse)

For further analysis, we investigated the circumstances under which the methods worked better or worse. In Experiment 2, all participants reported their level of confidence in their first estimates. We then analysed the influence of confidence on the efficacy of each method. We conducted mixed-effects analyses41 for each condition with the reduction of MSE as the dependent variable, confidence as the independent variable, and participants and questions as random effects.
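One way to write such a model is given below. This is a sketch under assumptions that this section does not confirm: it assumes crossed random intercepts for participants and questions and no random slopes, and all symbols are notation introduced here. \(R_{ij}\) denotes the reduction in squared error for participant \(i\) on question \(j\), and \(\mathrm{Conf}_{ij}\) denotes the confidence rating for the first estimate.

```latex
\[
R_{ij} = \beta_0 + \beta_1\,\mathrm{Conf}_{ij} + u_i + v_j + \varepsilon_{ij},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
v_j \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\]
```

Under this formulation, the effect of confidence corresponds to a test of \(\beta_1 = 0\), and a positive \(\beta_1\) means that the method reduces error more when confidence in the first estimate is higher.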

The results showed that in the Other’s perspective condition, higher confidence corresponded to a greater reduction of MSE (*F*(1, 531.62) = 10.30, *p* < .01; see also Fig. 5). In other words, the proposed method worked better when participants were confident in their own estimates. Accordingly, it would be better for people to use the method when their confidence is high. For other conditions, we did not find such effects (Dialectical condition: *p* = .96; Repeated condition: *p* = .73). Thus, hereafter, we shall discuss the Other’s perspective condition.

How did these results emerge? In Experiment 2, confidence did not correlate with accuracy in the first estimate (*p* > .1), meaning that there is room for improving estimates even when the participants feel confident. Importantly, confidence in the first estimate tended to be associated with accuracy in the second estimate. We conducted an additional mixed-effects analysis with the MSE of the second estimate as the dependent variable and confidence as the independent variable. The results showed that, although only marginally, higher confidence was associated with a lower MSE in the second estimate (*F*(1, 544.9) = 2.80, *p* = .095). Consequently, the average could lie closer to the true value, which would produce the pattern described above. We shall remark on this issue in the ‘Discussion’ section.

-----Figure 5 about here-----

## Overconfidence in the final estimate

Thus far, we have shed light on the positive aspects of the wisdom of the inner crowd. Here, we point out its negative aspects and limitations.

Previous studies18, 19 have pointed out the possibility that people cannot ‘utilise’ the wisdom of the inner crowd. As we have discussed, the average of the two estimates was accurate in the proposed method for harvesting the wisdom of the inner crowd. However, people might not naturally utilise averages as their final estimates. For example, some people might adopt their first estimate as their final one. We address this problem based on the results of Experiment 2 since all the participants produced final answers based on their own thinking.

Figure 6A shows the results of the analysis, which compared the MSE of the first and final estimates. The final estimate was more accurate than the first estimate only in the Repeated condition (*t*(30) = 2.20, *p* < .05). In the Other’s perspective and Dialectical conditions, we did not find such effects (*ps* > .1). Thus, as previous studies pointed out, people do not always utilise the wisdom of the inner crowd naturally (for how these results emerged, see Section S2 of the Supplementary Information).

Moreover, in Experiment 2, the participants also rated their confidence in their final estimates. Comparing this with their confidence in their first estimates, we found that the participants were more confident in the final estimates than in the first estimates in all conditions (Fig. 6B, Other’s perspective: *t*(27) = 3.29, *p* < .01; Dialectical: *t*(26) = 3.15, *p* < .01; Repeated: *t*(29) = 4.70, *p* < .01). Together with the above results, this suggests that methods for harvesting the wisdom of the inner crowd could, as a whole, lead participants to ‘overconfidence’42–45. We shall remark on this issue in the Discussion.

-----Figure 6 about here-----

## When the number of estimates increased

In the experiments reported thus far, we asked the participants to provide a single public opinion in the Other’s perspective condition. Can the efficacy of our method increase if the number of estimated public opinions increases? Previous research on the wisdom of the inner crowd17,22 has often discussed the effect of the number of estimates. For instance, one study22 examined a case where participants gave five estimates in response to a single question (note that this study provided no specific instructions). The authors compared this case with one in which participants gave two estimates for a question, and the results revealed that increasing the number of estimates did not enhance the wisdom of the inner crowd effect. Thus, to determine the potential of our method, it is important to address whether increasing the number of estimates can enhance the wisdom of the inner crowd effect.

In Experiment 3, all the participants gave five estimates in response to each question: the participants answered their own estimate once and estimated public opinions four times (see more details in the ‘Methods’ section). In the analysis, we calculated the reduction of MSE. In this context, we computed how much the error in the first estimate decreased by averaging all five estimates. Thus, the reduction of MSE was calculated as shown in Eq. 2.

\(\text{Reduction of MSE} = \text{MSE}_{\text{first estimates}} - \text{MSE}_{\text{averaged five estimates}}\) (Eq. 2)

We then compared the results of this analysis with those of the Other’s perspective condition in Experiment 2. As Fig. 7A shows, the reduction of MSE in Experiment 3 was positive (95% CI = [47.20, 126.91]; we used a bootstrap with 10,000 resamples drawn with replacement). In other words, the error in the first estimate decreased to some degree when the participants gave five estimates.
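A bootstrap confidence interval of this kind can be sketched as follows. The sketch assumes a percentile interval for the mean with participants as the resampling unit; neither assumption is confirmed in this section. The function name `bootstrap_ci`, the seed, and the example values are hypothetical.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean, resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical per-participant reductions of MSE (not real data).
reductions = [120.0, 30.0, 95.0, 60.0, 150.0, 10.0, 80.0, 110.0]
ci_low, ci_high = bootstrap_ci(reductions)
```

An interval whose lower bound lies above zero indicates that the reduction of MSE is reliably positive.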

Most importantly, however, we did not find a significant difference between the two experiments (*Welch t*-test: *t*(59) = 0.34, *p* = .73). This indicates that increasing the number of public opinion estimates did not necessarily enhance the efficacy of our method (see the ‘Discussion’ for speculation on how to overcome this limitation).

How did the results emerge? As mentioned in the ‘Introduction’, the wisdom of the inner crowd paradigm aims to make participants produce opinions that differ from their own. We therefore calculated the ‘distance’ in both experiments. Here, distance means the absolute difference between a participant’s own estimate and the guessed public opinion. For Experiment 3, we first averaged the four public opinions and then computed the distance. The results show that Experiment 3 had a smaller distance than Experiment 2 (see Fig. 7B; *t*(59) = 5.08, *p* < .01). That is, compared with Experiment 2, Experiment 3 was less successful in making participants produce opinions that differed from their own.
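The distance measure can be sketched as follows. The function names and the example values are hypothetical and serve only to illustrate the definition given above.

```python
from statistics import mean

def distance_two(own, public):
    """Experiment 2 style: |own estimate - single guessed public opinion|."""
    return abs(own - public)

def distance_five(own, publics):
    """Experiment 3 style: average the four public opinions first, then take the distance."""
    return abs(own - mean(publics))

# Hypothetical estimates in % for one question (not real data).
d2 = distance_two(30.0, 50.0)
d5 = distance_five(30.0, [20.0, 45.0, 25.0, 40.0])
```

In this made-up example, the four public opinions fall on both sides of the participant’s own estimate, so their average stays close to it and the distance is small even though each individual opinion differs from the own estimate.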

Table 3 shows more detailed results of Experiment 3. In this table, we categorised the participants’ own estimates according to their size relative to the four public opinions: the smallest, second smallest, medium, second largest, and largest. We first counted the number of estimates falling into each category for each participant and then added up the numbers for all participants. As a result, the ‘medium’ and ‘second smallest’ categories appeared more frequently than the other categories (95% CI). For the medium category in particular, we can assume that participants set two of the four public opinions to values larger than their own estimates and the other two to values smaller than their own estimates. In other words, the participants’ own first estimates appear to have acted as an anchor46–48. We speculate that, as a result, the average of the four public opinions did not differ much from the participants’ own estimates.
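The categorisation can be sketched as follows. This is an illustration under an assumption that this section does not specify, namely how ties are handled: here, a public opinion equal to the own estimate is counted as not smaller. The names `CATEGORIES`, `categorise`, and `category_counts` and the example responses are hypothetical.

```python
from collections import Counter

CATEGORIES = ["smallest", "second smallest", "medium", "second largest", "largest"]

def categorise(own, publics):
    """Rank the own estimate among the four public opinions."""
    # Number of public opinions strictly below the own estimate (0 to 4).
    rank = sum(1 for p in publics if p < own)
    return CATEGORIES[rank]

def category_counts(responses):
    """Count categories over (own, publics) pairs for one participant."""
    return Counter(categorise(own, publics) for own, publics in responses)

# Hypothetical responses of one participant to three questions (not real data).
responses = [
    (30.0, [20.0, 45.0, 25.0, 40.0]),
    (10.0, [20.0, 45.0, 25.0, 40.0]),
    (42.0, [20.0, 45.0, 25.0, 40.0]),
]
counts = category_counts(responses)
```

Summing such counts over the 20 questions gives one frequency per category for each participant, which is the quantity summarised in Table 3.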

-----Figure 7 about here-----

Table 3

Analysis of the first estimates. Estimates were categorised by size in comparison to the size of all public opinions. We first counted the number of estimates falling into each category for each participant and then added up the number for all participants.

| Category | Frequency in the 20 questions (95% CI) |
| --- | --- |
| Smallest | [2.10, 3.93] |
| Second smallest | [5.68, 7.29] |
| Medium | [5.74, 7.71] |
| Second largest | [2.19, 3.51] |
| Largest | [0.61, 1.35] |