Does performance on a vocabulary test give any meaningful indication of how someone will perform on a math test? Or how quickly they will react to a change in an array of stimuli? Or how well they can mentally rotate an image? Surprisingly, the answer is yes. Performance across all these tasks shows a positive correlation; if a person performs well on one task, they are likely to perform well on another. This effect has been replicated many times, but the most compelling results come from full-scale intelligence quotient (FSIQ) tests because of the number and variety of tasks used. The exact number and type of tasks vary from test to test, but batteries typically include 11-17 measures that assess memory, basic math, spatial reasoning, and analogical reasoning (Carroll, 1993; Johnson et al., 2004). Despite differences in the test batteries and the variety of tasks used, a positive correlation matrix is consistently found (Carroll, 1993; Johnson et al., 2004). When variable-reducing techniques, like principal component analysis (PCA) or factor analysis, are applied to this positive correlation matrix, one factor is consistently extracted that can account for approximately half of the variance (Carroll, 1993; Deary, 2000). All cognitive tasks load positively onto this factor, meaning the factor can account for variance in performance on each task (Carroll, 1993; Deary, 2000). Because this factor is seemingly related to all cognitive abilities, it is referred to as g for “general” (Spearman, 1904).
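For readers unfamiliar with this procedure, the following minimal sketch in Python illustrates how a single dominant component with uniformly positive loadings emerges from a positive correlation matrix. The data are simulated, and the number of tasks, sample size, and loading strengths are illustrative assumptions, not values from any published battery.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_tasks = 500, 11  # hypothetical sample and battery size

# Simulate a shared ability plus task-specific noise; the shared
# component induces the positive correlation matrix described above.
shared_ability = rng.standard_normal((n_subjects, 1))
scores = 0.7 * shared_ability + 0.7 * rng.standard_normal((n_subjects, n_tasks))

# PCA via eigendecomposition of the task correlation matrix;
# the first (largest) component plays the role of g.
corr = np.corrcoef(scores, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
first = np.argmax(eigenvalues)

variance_explained = eigenvalues[first] / n_tasks
loadings = eigenvectors[:, first] * np.sqrt(eigenvalues[first])
loadings *= np.sign(loadings.sum())  # eigenvector sign is arbitrary

print(f"First component explains {variance_explained:.0%} of the variance")
print(f"All tasks load positively: {np.all(loadings > 0)}")
```

With these assumed parameters, the first component accounts for roughly half of the variance and every task loads positively on it, mirroring the pattern reported for FSIQ batteries.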
g has been extracted with a variety of test batteries in a variety of samples, making it one of the most well-replicated results in psychology (Carroll, 1993; Deary, 2000; Johnson et al., 2004; Warne & Burningham, 2019). Despite the ubiquity of g, there are still important parameters to consider when creating a test or test battery to extract this factor. As mentioned earlier, almost all cognitive tasks load onto g, but some tasks have higher loadings than others. The highest loadings on the g factor are found when tasks are complex, novel, and require reasoning, irrespective of the task content or method of delivery (Jensen, 1992; Quiroga et al., 2019). Even though we know what kinds of tasks load highly onto g, no task is a perfect or pure measure of any specific cognitive construct. All cognitive tasks assess g to some degree, but they also assess more specific abilities (Gignac, 2015). Using a large and diverse battery of tests will attenuate task-specific variance, resulting in a more accurate g factor (Major et al., 2011). Another issue with all measures is random error: variables unrelated to the construct of interest that impact performance on the task. Random error can cause performance to differ when participants complete the measure at different time points or respond differently to theoretically similar items. If a measure produces similar results despite random error, it is referred to as a reliable measure (John & Benet-Martinez, 2000). Reliable measures are crucial because more of the variance in performance across individuals is due to differences in the actual cognitive ability the task is measuring rather than differences caused by random error (Bray et al., 1998). Reliability also impacts the correlation matrix: less reliable measures will artificially lower the correlations, which impacts the extraction of the g factor (John & Benet-Martinez, 2000). While it is important that measures are reliable, they also need to be sensitive enough to detect individual differences across people. The g factor accounts for variance in performance across people; therefore, the tasks used should show variability based on true differences in cognitive ability (Hedge et al., 2018). Extracting a robust g factor depends both on using appropriate tasks and on the sample that is being assessed. Highly homogeneous samples of human participants may not have true differences in the construct of interest, which reduces variance and attenuates the subsequent g factor (Sackett & Yang, 2000). It is also best to test many participants, due to how correlational and factor analyses are conducted. While the sample size sufficient to detect a reliable correlation depends on a variety of parameters, it can be in the hundreds (Bonett & Wright, 2000), and for factor analysis a sufficient sample size ranges from as few as 75 to as many as 1,200 participants (Mundfrom et al., 2005, but see de Winter et al., 2009). To summarize, to extract the strongest g factor, a large, heterogeneous sample of people should be given a large variety of cognitive tasks that are reliable and sensitive to individual differences.
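To make the reliability point above concrete, the snippet below works through Spearman's classical attenuation formula with assumed numbers; the true correlation and reliability values are hypothetical and chosen purely for illustration.

```python
import math

# Spearman's attenuation formula: the observed correlation between two
# measures equals the true correlation scaled by the square root of the
# product of their reliabilities. All values below are hypothetical.
r_true = 0.50
for reliability in (0.9, 0.7, 0.5):
    r_observed = r_true * math.sqrt(reliability * reliability)
    print(f"reliability {reliability:.1f} -> observed r = {r_observed:.2f}")
# Prints 0.45, 0.35, 0.25: each drop in reliability shrinks the entries
# of the correlation matrix and, with them, the extracted g factor.
```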
Even though g is consistently replicated, it is still not clear what exactly g is. It is tempting to use intelligence and g interchangeably, since g can be extracted from FSIQ tests and is related to a wide variety of cognitive tasks. Yet g accounts for only about half of the variance in performance on FSIQ tests, which means other factors besides g are related to performance. In addition, the amount of variance g can account for depends on the strength of the correlation matrix (Jensen, 1998). Individuals who perform better on intelligence tests tend to have weaker correlations between tasks, meaning that higher FSIQ scores come with increased differentiation of abilities (Blum & Holling, 2017). For higher-performing individuals, g explains less of the variance in performance (Jensen, 1998). Therefore, it would be incorrect to say that more intelligent people have more g (Detterman, 1991). These results indicate that when we refer to intelligence in a colloquial sense, we are referring to more than g, even though the two concepts are closely related (Jung & Haier, 2007; Stankov, 2017).
With that distinction stated, understanding g is still important given the consistent pattern of correlations across tasks, even among high-ability individuals (Blum & Holling, 2017). Even though g is a single factor or component, that does not mean it is a single causal entity. Instead, g is commonly theorized to be composed of more specific cognitive processes like working memory (WM), short-term memory (STM), processing speed, attention, and associative learning (Conway et al., 2002; Deary, 2000; Jensen, 1998; Kaufman et al., 2009; Sheppard & Vernon, 2008). It is likely that the tasks included in FSIQ tests, particularly complex tasks that load highly onto g, require support from multiple cognitive domains (Chuderski, 2013). Therefore, g could reflect individual differences in how many processes are required to solve a task (Chuderski, 2013). Another theory suggests that differences in one of these abilities could act as a bottleneck, constraining the functioning of all other cognitive domains (Kovacs & Conway, 2019). Under this theory, g primarily reflects differences in a single cognitive ability, though which ability remains unclear. These theories are helpful for understanding the more specific cognitive processes that are involved in intelligence tests and how those processes are used across a large number of tasks (Conway et al., 2002; Deary, 2000; Jensen, 1998; Kaufman et al., 2009; Sheppard & Vernon, 2008). Future research is still needed, however, to fully understand whether there is a relationship between these cognitive processes that could impact the positive correlation matrix (Frischkorn et al., 2019).
At the psychological construct level, g is related to a variety of cognitive processes. Similarly, g and intelligence are correlated with a variety of neurobiological mechanisms, processes, and features (Deary et al., 2010). Thus far there have been two major lines of research. One focuses on what makes individuals different, for example comparing people who have high IQ scores to people who do not. The most robust result from this line of research is the modest positive correlation between brain size and measures of intelligence (Pietschnig et al., 2015). The other line of research focuses on why performance is positively correlated across tasks, irrespective of individual performance. A variety of methods, including functional magnetic resonance imaging (fMRI), positron emission tomography (PET), and studies of lesions due to accident or stroke, have indicated the importance of the frontal cortex in a wide variety of tasks (Jung & Haier, 2007). The dorsolateral prefrontal cortex in particular is active during a variety of WM and reasoning tasks, and similar patterns of activation in other areas of the frontal cortex have been identified for other types of tasks (Colom et al., 2013). Yet brain areas do not function in isolation; rather, different areas are connected, forming functional networks (van den Heuvel & Sporns, 2017). Similar performance across tasks may be partially due to tasks activating whole networks instead of discrete regions; a network connecting the frontal and parietal cortices has been implicated in particular (Fraenz et al., 2021; Jung & Haier, 2007; Zanto & Gazzaley, 2013). Thus, most research indicates that consistent performance could be due to activation of the frontal cortex across a wide variety of tasks (Colom et al., 2013; Jung & Haier, 2007).
Important neurobiological correlates of g and intelligence have been identified in humans, but the techniques used thus far do not support causal interpretations. Nonhuman animal models (hereafter animals) would be ideal for exploring causal manipulations, but it first needs to be established that animals have a g factor similar to what is seen in humans (Matzel et al., 2013). Investigations over the past 20 years have generated promising results that are described in more detail elsewhere (Burkart et al., 2017; Flaim & Blaisdell, 2020; Shaw & Schmelz, 2017), but some key results from mice and avian species are briefly reviewed here. For mice, Matzel and colleagues in particular have consistently explored a general factor using a cognitive test battery that targets different domains of learning (Matzel et al., 2003; but also see Locurto & Scanlon, 1998; Locurto et al., 2003, 2006). Briefly, the test battery includes five tasks and measures non-spatial navigation (Lashley III maze), spatial navigation (Morris water maze), suppression of exploratory behavior to avoid an aversive audiovisual stimulus (passive avoidance), using odor to guide a response (odor discrimination), and using an auditory cue to predict an aversive shock (associative fear learning). Multiple experiments found that performance was positively correlated across all tasks, and the first factor extracted could account for 38-43% of the variance in performance (Kolata et al., 2005, 2007; Matzel et al., 2003, 2006). For these individual experiments, however, the number of subjects ranged from 21 to 56, which is smaller than what is typically used or recommended in human studies (Mundfrom et al., 2005, but see de Winter et al., 2009). When the results from multiple experiments were combined for a total of 241 subjects, the result was replicated, providing robust evidence for a g-like factor in mice (Kolata et al., 2008). Subsequent experiments have shown that performance on this cognitive test battery is positively correlated with measures of WM, similar to what is seen in humans (Kolata et al., 2005). Investigations with avian species have also yielded interesting results. Cognitive test batteries that typically include motor learning, color discrimination, reversal learning, spatial memory, and inhibitory control have been administered to robins (Shaw et al., 2015), spotted bowerbirds (Isden et al., 2013), magpies (Ashton et al., 2018), southern pied babblers (Soravia et al., 2022), and song sparrows (Anderson et al., 2017; Boogert et al., 2011). For robins and spotted bowerbirds, performance across the tasks was mostly positively, though not significantly, correlated, and a factor that could explain 34% and 44% of the variance in performance, respectively, was found (Isden et al., 2013; Shaw et al., 2015). This result should be treated with some caution, since a small number of subjects, 16 robins and 14 bowerbirds, was assessed. More robust results have been obtained with magpies, with 56 subjects assessed (Ashton et al., 2018), and southern pied babblers, with 38 subjects assessed (Soravia et al., 2022). The resulting correlation matrices were uniformly positive, and the first factor extracted accounted for 64% and 60% of the variance, respectively (Ashton et al., 2018; Soravia et al., 2022). Yet similar results were not found in song sparrows, even though 52 (Boogert et al., 2011) and 41 (Anderson et al., 2017) birds were assessed using the same test battery.
Across both experiments, two factors were extracted, and not all tasks loaded onto the first. This may be due to the low reliability of song sparrows' performance on cognitive tasks across years (Soha et al., 2019). While the results from animals thus far are interesting and promising, there are some difficulties in comparing g across species.
Research with many species thus far indicates that g can be found beyond humans, but it is not clear exactly how similar g is across species. This ambiguity is partially due to differences in the test batteries across species. In humans, g has been heavily investigated in relation to processing speed, where more intelligent individuals are consistently faster on simple tasks (Sheppard & Vernon, 2008), yet this relationship has not been investigated or replicated with animal test batteries (see discussion by Flaim & Blaisdell, 2020). In contrast, the relationship between response inhibition and g has rarely been investigated in humans, but response inhibition tasks are almost always included in avian cognitive test batteries (Flaim & Blaisdell, 2020). Even when the cognitive domain does overlap, there are differences in the procedures used for humans versus nonhumans that can impede comparisons. Taking associative learning as an example, in humans an initial investigation using a simple associative learning task, in which children had to learn which picture was associated with a reward, found no relationship with IQ scores (Plenderleith, 1956). More recent investigations have used the word-pairs task, where participants learn up to ten arbitrary pairs of words, like cat-pie, or the three-term contingency task, where one word serves as a cue and the participant must learn three response words (Kaufman et al., 2009). These associative learning tasks show a positive relationship to g that scales with complexity: the more complex three-term contingency task has a stronger relationship with g (Tamez et al., 2008; Williams & Pearlberg, 2006; but see Kaufman et al., 2009). In contrast, for mice and birds, a simple associative learning task, such as learning to discriminate one cue from another to obtain a food reward, is related to the g-like factor extracted in these species (Flaim & Blaisdell, 2020). The finding that associative learning is related to g across species, but that different levels of difficulty are needed to reveal such a relationship, may be related to the experience of the subjects. g is related to complexity, but it is also related to novelty, where novel tasks tend to have a high g loading (Carroll, 1993; Sternberg & Gastel, 1989). If animal subjects are naïve to highly artificial experimental stimuli and procedures, the task may be sufficiently novel to explain why performance is related to g, despite the task's apparent simplicity. In contrast, most humans assessed have had years of experience in an educational setting with materials and task demands similar to the word-pairs and three-term contingency tasks. Therefore, for humans, task difficulty may be a more important factor for investigating associative learning and g. These results could indicate that a task's loading onto g is related to novelty, complexity, and associative learning across species, but further research is necessary to determine whether there is a similar relationship between complexity and g in animals.
While animals may be relatively naïve to cognitive assessments compared to most human samples, there are other issues when comparing across species. In nonhuman research on g, the sample of animals tested is often homogeneous in some way (Shaw & Schmelz, 2017). In mice, only male subjects have been used and a similar ‘home’ environment is used across studies (Kolata et al., 2008; but see Sauce et al., 2018), while for wild bird subjects, like robins, collection is biased towards males and bold individuals (Shaw & Schmelz, 2017). If g is a robust phenomenon in animals, then it should replicate across members of the species that differ from the samples assessed thus far, for example in sex, personality, or environmental conditions. In addition, most experiments assess a small number of subjects. This can be overcome by using a consistent test battery, which makes it possible to pool results from multiple experiments, as demonstrated by Kolata et al. (2008) and sketched below. Utilizing a species that is more commonly investigated, either in the lab or across field sites, could also increase the number of subjects if multiple labs are willing to work together (Shaw & Schmelz, 2017). Thus, there could be improvements in both the test battery and sample characteristics, particularly for avian species. Tasks that assess clear cognitive domains, facilitate cross-species comparisons, and have identified neural correlates should be favored. Species for which it is possible to obtain a large and diverse sample should also be favored, at least in these preliminary investigations of g in animals.
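The following sketch shows one way such pooling can work in practice, assuming hypothetical cohort sizes and a five-task battery; the function, data, and cohort structure are illustrative and are not code from any of the cited studies.

```python
import numpy as np

def zscore_within_cohort(scores: np.ndarray) -> np.ndarray:
    """Standardize each task (column) within one cohort so that
    cohort-level differences in procedure or scaling do not
    masquerade as individual differences in the pooled sample."""
    return (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)

rng = np.random.default_rng(1)
# Three hypothetical cohorts tested on the same five-task battery.
cohorts = [rng.standard_normal((n, 5)) for n in (25, 30, 20)]

# Stack the standardized cohorts into one larger sample (75 subjects),
# from which the correlation matrix for factor extraction is estimated.
pooled = np.vstack([zscore_within_cohort(c) for c in cohorts])
corr = np.corrcoef(pooled, rowvar=False)
```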
Given these arguments and their long history as an animal model in psychology, it is surprising that pigeons have not been given a comprehensive test battery. Pigeons have excellent visual acuity and readily learn to peck visual stimuli in a touchscreen operant chamber, similar to the procedures used to assess human and nonhuman primates (Wright et al., 2018; Zentall, 2021). Investigations of matching, interval timing, reaction time, memory, and many other cognitive domains show there are similarities in performance across pigeons and primates that indicate similar underlying mechanisms at the psychological and neurobiological levels (Colombo & Scarf, 2020; Güntürkün, 2005; Vickrey & Neuringer, 2000; Zentall, 2021). Methods for investigating memory, associative learning, and cognitive flexibility in particular have been well established, and the neural mechanisms supporting performance have been identified on some level. Similar to humans, performance on many cognitive tasks seems to depend on the nidopallium caudolaterale (NCL), the avian equivalent of the mammalian prefrontal cortex (Güntürkün, 2005). For example, when STM is assessed by requiring pigeons to remember a stimulus over a short delay to guide choice behavior, there is sustained neural activity in the NCL that relies on the neurotransmitter dopamine, similar to results found in nonhuman primates (Johnston et al., 2017; Karakuyu et al., 2007). Given this rich history, there are many tasks that could be included in a cognitive test battery for pigeons, but a few were selected as ideal for our purposes.
The tasks in the battery developed here were selected according to how well they assess a specific cognitive domain, whether they facilitate cross-species comparisons, and whether the neural substrates of performance have been identified (Diekamp et al., 2000; Flaim & Blaisdell, 2020; Izquierdo et al., 2017; Johnston et al., 2017; Karakuyu et al., 2007; Lissek et al., 2002; Vickrey & Neuringer, 2000). Ultimately, the pigeon cognitive test battery was designed to assess associative learning, cognitive flexibility, memory, and processing speed. Specifically, there were four tasks: symbolic match to sample (SMTS), serial reversal learning, delayed match to sample (DMTS), and a reaction time (RT) task. All the tasks were sufficiently sensitive to detect individual differences in performance, and all but one subject completed at least two tasks in the battery (Table 1). Surprisingly, the correlation matrix from the test battery was not uniformly positive, and PCAs did not consistently yield a component similar to g. Potential procedural issues, the influence of age and experience, and the possibility that these results reflect a genuine difference between pigeons and other species are discussed.
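As a minimal sketch of the analysis pipeline just described, the snippet below checks the two properties at issue: whether the correlation matrix is uniformly positive and how much variance the first principal component explains. The input file, array orientation, and variable names are hypothetical assumptions for illustration, not the study's actual code.

```python
import numpy as np

# Hypothetical input: one row per pigeon, one column per task measure
# (SMTS, serial reversal, DMTS, RT), oriented so higher = better.
# Subjects missing a task would require pairwise-complete correlations
# rather than np.corrcoef on the full array.
battery = np.loadtxt("battery_scores.csv", delimiter=",")

corr = np.corrcoef(battery, rowvar=False)
off_diagonal = corr[np.triu_indices_from(corr, k=1)]
print("Uniformly positive correlation matrix:", bool(np.all(off_diagonal > 0)))

eigenvalues = np.linalg.eigvalsh(corr)  # ascending order
print("Variance explained by first component:", eigenvalues[-1] / corr.shape[0])
```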
Table 1. All subjects in the test battery and which tasks they completed, where 1 signifies they completed the task and 0 signifies they did not. Experience refers to the number of cognitive tasks completed before or between tasks in the cognitive test battery.
| Age | Experience | Name | Symbolic Match to Sample | Serial Reversal Learning | Delayed Match to Sample | Reaction Time | Total |
|-----|------------|------|--------------------------|--------------------------|-------------------------|---------------|-------|
| 17 | 11 | Vonnegut | 1 | 1 | 1 | 1 | 4 |
| 17 | 10 | Dickinson | 1 | 1 | 1 | 1 | 4 |
| 3 | 3 | Bowser | 1 | 1 | 1 | 1 | 4 |
| 3 | 2 | Peach | 1 | 1 | 1 | 1 | 4 |
| 3 | 2 | Waluigi | 1 | 1 | 1 | 1 | 4 |
| 3 | 1 | Luigi | 1 | 1 | 1 | 1 | 4 |
| 3 | 1 | Mario | 1 | 1 | 1 | 1 | 4 |
| 3 | 1 | Shy Guy | 1 | 1 | 1 | 1 | 4 |
| 3 | 1 | Wario | 1 | 1 | 1 | 1 | 4 |
| 17 | 6 | Estelle | 1 | 1 | 1 | 1 | 4 |
| 16 | 9 | Jubilee | 1 | 1 | 1 | 1 | 4 |
| 11 | 7 | Herriot | 1 | 1 | 1 | 1 | 4 |
| 17 | 9 | Hawthorne | 1 | 1 | 0 | 1 | 3 |
| 11 | 6 | Goodall | 1 | 1 | 1 | 1 | 4 |
| 0.5 | 0 | Athena | 1 | 1 | 1 | 1 | 4 |
| 0.5 | 0 | Wenchang | 1 | 1 | 1 | 1 | 4 |
| 16 | 9 | Gambit | 1 | 1 | 1 | 0 | 3 |
| 12 | 11 | Darwin | 0 | 1 | 1 | 1 | 3 |
| 12 | 6 | Durrell | 1 | 1 | 0 | 1 | 3 |
| 12 | 5 | Cousteau | 0 | 1 | 1 | 1 | 3 |
| 1 | 2 | Itzamná | 0 | 1 | 0 | 1 | 2 |
| 1 | 1 | Odin | 0 | 1 | 0 | 1 | 2 |
| 3 | 2 | Yoshi | 0 | 1 | 0 | 0 | 1 |
| | | Total | 18 | 23 | 18 | 21 | |