Subjects and housing
We captured great tits (N = 66) in three urban areas (a city park in Malmö, 55.6001° 12.9899°, population density in 2020: 4 150 people/km²; and two sites in Lund, 55.7144° 13.2069° and 55.6976° 13.2472°, population density in 2020: 3 535 people/km², source: https://www.citypopulation.de/en/sweden/cities/) and eight rural sites (seven within 10 km of the town of Höör, 55.9346° 13.5278°, and one at Stensoffa in the Svedala region, 55.6947° 13.4494°; population density: <5 people/km²) in Scania, southernmost Sweden (Table S1), using mist nets placed next to bird feeders that we had set up beforehand. The Malmö and Lund sites consisted of an urban matrix of large buildings surrounded by major roads, pedestrian walkways, and lawns interspersed with a mix of native and non-native tree species (for details on species composition in Malmö, see Jensen et al. 2022). The rural sites were in forested areas, with no active farms or inhabited houses near the capture locations. The most common trees in these forest habitats were common oak (Quercus robur), lime (Tilia cordata), elm (Ulmus glabra), birch (Betula sp.), Norwegian spruce (Picea abies) and hazel (Corylus avellana).
We captured and tested 20 birds in 2015 (September to December) and 10 birds in 2016-2017 (December to February). As the results were inconclusive, we resumed the experiment by capturing and testing 36 additional birds in 2021-2022 (September to February). We did not capture birds from March to late August when great tits are breeding and moulting. After capture, we marked each bird with one unique numbered metal ring as well as one or two plastic colour rings for visual identification in the lab. We used plumage characteristics to age and sex them. We then transported the birds in individual cotton bags to an indoor animal facility at the Department of Biology, Lund University. The transport took a maximum of 30 minutes.
We housed the birds in individual 55 × 56 × 36 cm cages that we had positioned on shelves in an enclosed compartment along a wall in the room. The cages were placed two by two so that each bird had visual contact with one neighbour. The room had lighting with an outdoor light spectrum and computer-controlled light and temperature regimes. In the mornings and evenings, an automatic one-hour dimming function simulated dawn and dusk, following outdoor day length patterns. We kept the temperature constant at 14°C, a temperature that works well in this type of experiment (Brodin and Urhan 2014; Brodin and Urhan 2015; Isaksson et al. 2018). Before we started any training or experimental sessions, we allowed the birds to get accustomed to the lab environment for at least two days (i.e. we started the tests no earlier than the morning of their third day in captivity), which is sufficient in our experience from previous studies.
The birds had ad libitum access to a food mixture of seeds and nuts, a suet cake and water that was changed daily. The water was enriched with a commercial vitamin supplement for birds. We cleaned the cages every day. Before each testing session, we visually inspected the birds and made sure that they were in good condition. We avoided handling the birds during the experimental sessions to minimise stress. When we had finished all experimental sessions on a bird, we released it at the location where it had originally been captured, 10 to 26 (mean ± SD = 17.8 ± 4.6) days after capture. Before we released a bird, we checked that it was in adequate body condition (i.e. no injuries or feather damage and sufficient fat reserves). The study complies with Swedish and EU animal welfare legislation and regulations.
Neophobia test
Animals are often wary of new and unknown objects and tend to avoid them, a phenomenon known as neophobia. Hence, there is a risk that an animal’s failure to solve a task reflects neophobia towards the experimental objects rather than an inability to pass the test (Greenberg 2003; Audet et al. 2016). To control for this, we performed a neophobia test two to five days after capture in 2015 and 2021-2022, and following another experiment in 2016-2017 (Isaksson et al. 2018). The test was performed on the birds in their home cages, visually separated from their neighbours. We started the test with a control stage in which we presented a mealworm (Tenebrio molitor) on a ceramic dish (diameter: 10 cm) that the birds had been familiarized with before the test. We repeated this procedure five times with each bird to control for within-individual variation in feeding latencies. If the bird refused to take the mealworm for over 30 minutes in this control stage, we terminated the experiment and repeated it the next day. All but one bird (an urban adult female from 2021) took the five mealworms in the control stage in either the first or the second neophobia test. As we could not calculate a neophobia score for this one bird, it is excluded from models in which neophobia is included as a covariate (see Statistical analyses).
After the fifth session, we presented the mealworm on an unfamiliar plastic plate (diameter: 35 mm in 2015-2017, 85 mm in 2021-2022) that was painted with broad red and green stripes, placed on top of the ceramic plate. A plate with such striking, novel colouration should be likely to elicit neophobic behaviour. The most common neophobic response is that a bird takes longer to approach the novel plate than the familiar one. We observed the bird until it consumed the worm from the coloured, novel plate. One bird (an urban adult male, also from 2021) consumed all five mealworms in the control stage but did not consume the worm in the neophobia stage within 30 minutes; we terminated the test for this bird and considered its latency to be the maximum allowed time, 1800 seconds. We calculated neophobia in the same way as Audet et al. (2016), as the difference in seconds between the latency to take the mealworm from the new, brightly coloured plate and the latency to take it from the old, non-painted plate. For the latter, we used the mean of the five sessions in the control stage. We then log-transformed the neophobia score (after adding 400 seconds to all data points so that all values were positive) to bring the variable closer to a normal distribution.
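The neophobia score described above can be sketched in a few lines. This is a minimal illustration in Python, not the authors' code (the analyses were run in R); the function name, argument names and example latencies are hypothetical, and the natural logarithm is assumed because the text reports log(1201) = 7.091 elsewhere.

```python
import math
from statistics import mean

MAX_NOVEL_LATENCY_S = 1800  # 30-minute cap on the novel-plate stage (from the text)
OFFSET_S = 400              # constant added before log-transformation (from the text)


def neophobia_score(control_latencies_s, novel_latency_s):
    """Log-transformed neophobia score for one bird.

    control_latencies_s: the five feeding latencies (s) from the control stage.
    novel_latency_s: latency (s) to take the worm from the novel plate.
    """
    if len(control_latencies_s) != 5:
        raise ValueError("expected five control-stage latencies")
    # A bird that never took the worm gets the maximum allowed time.
    novel = min(novel_latency_s, MAX_NOVEL_LATENCY_S)
    # Raw score: novel-plate latency minus mean familiar-plate latency.
    raw = novel - mean(control_latencies_s)
    shifted = raw + OFFSET_S
    if shifted <= 0:
        raise ValueError("offset too small to make the score positive")
    return math.log(shifted)  # natural log (assumption)


# Hypothetical bird: mean control latency 30 s, novel-plate latency 90 s.
score = neophobia_score([10, 20, 30, 40, 50], 90)
```

A bird that is faster at the novel plate than at the familiar one yields a negative raw difference, which is why the constant is added before taking the logarithm.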
General experimental protocol
Following the neophobia test, each bird participated in four problem-solving experimental sessions that were performed at least 24 but no more than 48 hours from one another. The problem-solving sessions consisted of two 20-minute tests, the string-pulling test and the plug-opening test. For all birds in 2015 and half of the birds in 2021-2022, the string-pulling test always preceded the plug-opening test; for the other half of the birds in 2021-2022, the order was reversed, i.e. the plug-opening test always preceded the string-pulling test. In 2016-2017, the plug-opening and string-pulling tests were performed separately instead of directly after one another; multiple tests of either or both types were performed on the same day; and there could be gaps of several days (up to 12 days) between two tests of the same kind. In spite of these irregularities in the experimental regime, the 10 birds in this group showed similar learning patterns to the 56 birds tested under the stricter regime (see Results), so we opted to include them in our models.
At the start of the experimental sessions, the focal bird in its home cage was visually (but not acoustically) isolated from the other birds by moving the cage to a separate shelf (2015-2017) or a desk in the same room (2021-2022) and closing off the housing compartment. After turning off all lights, the observer removed all food from the cage and set up the experimental device for the first test. The lights were turned off so that the bird could not see the device being set up. In 2015 and 2022, the perches, except for the one next to the test device, were also removed from the cage; in 2016-2017 and 2021, all perches were kept in the cage. Following this, the observer moved to a booth covered by dark, one-way glass to get out of the bird’s sight and turned on the lights for the bird. Once the bird had solved the task, succeeded in eating the worm by other means (see below), or lost the worm by dropping it where it could not reach it, or once the maximum time of 20 minutes had elapsed, the observer turned off the lights again, replaced the device for the first test with the device for the second test, and repeated the above protocol. Regardless of whether or not a bird solved a problem in the first session, we performed four experimental sessions on all birds to test whether their solving performance improved with each repeat, indicating learning. All neophobia and problem-solving sessions were recorded on camera (type: Toshiba Camileo S20); however, out of the 516 problem-solving sessions, 66 recordings are not available due to technical malfunctions during recording or file saving. See Online Resource 1 for a sample of these videos.
String-pulling
Our test device consisted of a small (35 mm diameter) petri dish (with a bottleneck-like plastic rim attached to it to reduce the risk of the reward accidentally falling out) attached like a hanging bucket to a 17 cm string, hanging inside a vertically positioned transparent plastic tube with the opening facing upwards (Figure 1a). In the dish, we placed the food reward (two mealworms in 2015, reduced to one mealworm from 2016 onward, as one seemed sufficient), which was visible but not directly accessible to the bird until it pulled up the string. We discarded the tests from three birds (two rural males and one rural female) in 2015 because they were presented with a test prototype that had no plastic tube around the string. The remaining 27 birds from 2015 to 2017 had the string hanging into a thin-walled plastic tube crafted from plastic cups and stabilised with a wooden frame. The 36 birds in 2021-2022 had a sturdier plastic tube (150 mm tall and 70 mm wide, with a 3 mm thick wall), mounted on an upside-down ceramic dish, around the string; in 2022, a thinner rim was added to this sturdy tube.
We considered a session solved when a bird pulled up the string and took the worm out of the dish. Out of the 252 trials of 63 birds included in our analyses, 16 had to be terminated early because the mealworm fell out of the dish before the bird could pull up the string (in 12 cases because the bird was shaking the string, and in four cases because the worm crawled past the rim of the dish). These were counted as unsuccessful tests, and in the analyses these trials were given maximal latencies. In eight trials, the birds successfully pulled up the string but lost the worm, dropping it back into the tube. These were counted as successful even though the birds did not get the prey, because they still went through the right set of motions to obtain it. In five trials the birds pulled up the string, dropped it outside the tube, and took the worm from the hanging dish; in one trial the bird stretched downward to reach all the way down to the rim of the dish and pulled it up before taking the worm out. Although these were both unconventional solutions, we still counted them as successful because the bird pulled up the dish in some innovative way. However, in four trials, the bird dived into the tube and ate the worm while in there, then attempted to get out. These trials were counted as unsuccessful despite the bird getting the worm, because this “solution” did not require innovation; three out of four of these birds managed to solve the problem in the conventional way afterwards.
Plug-opening
In this test, we placed a mealworm inside a transparent tube that was closed by a cotton plug at its bottom end (Figure 1b). In 2015-2017 and 2022, we used a 75 mm long and 11 mm wide glass tube; in 2021, we used a slightly larger plastic tube, 100 mm long and 15 mm wide. At the start of each session, we introduced this test device into the bird’s home cage, attached to the cage wall next to a perch. In 2015 and 2022, the tube’s bottom was 41 cm above the cage floor, whereas in 2016-2017 and 2021 it was only 26 cm above the cage floor. If a bird removed the plug, the mealworm would fall to the bottom of the cage and become accessible to the bird.
We considered a test successful when the bird removed the plug so that the worm fell out. Out of the 264 trials of 66 birds, there were six trials where the bird pulled out the cotton plug but lost the worm before eating it: in five trials it fell outside the cage and in one the bird could not find it in the cotton. In a seventh trial, the bird pulled out the cotton, but the worm got stuck in the tube. We counted these trials as successful despite the birds not getting the food reward. In six trials, the bird, instead of pulling the cotton with its beak, grabbed it with its foot and pulled it out. These solutions were also counted as successful. However, in one trial, the bird pulled out the cotton with its foot clearly by accident, as it did not pay attention to the tube and did not eat the worm afterwards; this trial was counted as unsuccessful. In two trials, the cotton fell out of the tube without the bird touching it, and in two other trials, the worm escaped from the tube, squeezing by the cotton plug, before the bird could solve the task. These trials were also counted as unsuccessful.
Statistical analyses
For each task, we quantified problem-solving latency as the time (in seconds) from the start of the test until the bird solved the problem (took out the worm from the dish in the string-pulling test, pulled out the plug so that the worm fell out in the plug-opening test). We decided to use the start of the test rather than the first interaction with the test device because the bird was in a small enclosed space and could inspect the device before interacting with it. For the unsuccessful sessions, we assigned a maximal latency value of 1201 seconds, even if they had to be terminated early due to the bird losing the worm. We assigned a separate latency value for each of the four tests of the same type; therefore, each bird was represented in the model by four trials.
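The latency rule above maps onto the (time, event) pairs that survival models expect. The sketch below is illustrative Python rather than the authors' R code; the function and variable names are hypothetical, and coding the event indicator as 1 for solved and 0 for censored is an assumption about the encoding, although treating unsolved tests as censored is stated in the text.

```python
MAX_TEST_DURATION_S = 1200  # 20-minute tests (from the text)
CENSOR_LATENCY_S = 1201     # latency assigned to unsuccessful sessions (from the text)


def to_survival_record(solved, solve_time_s=None):
    """Return (latency_s, event) for one session.

    event is 1 if the bird solved the task and 0 if the session is
    treated as censored (assumed encoding).
    """
    if solved:
        if solve_time_s is None or not 0 < solve_time_s <= MAX_TEST_DURATION_S:
            raise ValueError("a solved session needs a time within the test")
        return (float(solve_time_s), 1)
    # Unsuccessful sessions get the maximal latency even when they were
    # terminated early, e.g. because the bird lost the worm.
    return (float(CENSOR_LATENCY_S), 0)


# Hypothetical bird: fails twice, then solves in sessions 3 and 4.
sessions = [(False, None), (False, 95), (True, 640), (True, 210)]
records = [to_survival_record(solved, t) for solved, t in sessions]
```

Each bird contributes four such records per test type, one per session.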
Base models
We ran all our statistical analyses in R (version 3.6.1). We analysed problem-solving latency with Cox mixed-effects proportional hazard models (separate models for string-pulling and plug-opening) using the “coxme” R package (Therneau 2012). Survival models like the Cox proportional hazard model simultaneously handle variation in the probability of an event (such as solving success) and variation in latencies, making them well-suited for analysing behavioural latency data when there are individuals who do not show the focal behaviour (e.g. solve the task), as long as the proportional hazard requirement is met (Jahn-Eimermacher et al. 2011; Andersen et al. 2021). Therefore, they are often used in problem-solving studies (e.g. Cook et al. 2017; Preiszner et al. 2017; Prasher et al. 2019). In these models, we used solving latency as the response variable, treating tests with maximal latencies (i.e. tests where the bird did not solve the test) as censored data. We included the following explanatory variables in our model: session number (1 to 4 for the four consecutive test sessions of the same type on the same individual) as a covariate, and habitat type (urban vs rural), sex (male vs female), age (first-year vs older) and year (four levels: 2015, 2016-2017, 2021 and 2022) as factors. The variable “year” also controls for the identity of the experimenter, as it was always the same person within a year but a different person each year except in 2021 and 2022. We treated 2021 and 2022 as separate years because we implemented changes in the methods between December 2021 and January 2022 (see above), whereas 2016-2017 was treated as a single year because there were no such changes in the protocol. We also included bird ID nested within capture site as random factors to control for autocorrelation within individual and within population, respectively. As stepwise model selection based on P-values, despite being frequently used, is also often criticized (Garamszegi et al. 2009), we opted to present the estimates both from the full models and from reduced models where explanatory variables with P-values over 0.1 were eliminated. We refer to explanatory variables with P-values below 0.05 as “statistically significant” and those with P-values between 0.05 and 0.1 as “tendencies” or “trends”. For pairwise comparisons between the four years, we extracted parameter estimates by using the ‘emmeans’ function of the ‘emmeans’ R package (Lenth et al. 2019); we opted to not use the P-value corrections built into the package, as these methods reduce the statistical power of the models (Nakagawa 2004).
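The P-value conventions above amount to a simple labelling rule, sketched here in Python for illustration only. The function names and the example P-values are hypothetical, and the handling of values exactly equal to 0.05 or 0.1 is an assumption, since the text does not specify it. The sketch applies the threshold in a single pass; whether the reduced models were obtained in one pass or stepwise is not stated.

```python
def classify_p_value(p):
    """Label a P-value following the thresholds given in the text."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a P-value must lie between 0 and 1")
    if p < 0.05:
        return "significant"
    if p <= 0.1:  # boundary handling is an assumption
        return "trend"
    return "eliminated"


def retained_terms(p_values):
    """Names of explanatory variables kept in a reduced model.

    p_values: dict mapping variable name to its full-model P-value.
    """
    return [name for name, p in p_values.items()
            if classify_p_value(p) != "eliminated"]


# Hypothetical full-model P-values, purely for illustration.
example = {"session": 0.001, "habitat": 0.08, "sex": 0.45, "age": 0.30, "year": 0.02}
kept = retained_terms(example)
```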
As a sensitivity analysis, we also tested the effect of the above variables on solving success and solving latency in separate models, using the glmmPQL function of the MASS R package (Venables and Ripley 2002). For solving success, we built mixed-effects generalized linear models with binomial error distribution. The response variable in these models was a binary factor in which successful and unsuccessful sessions were coded as 1 and 0, respectively. For solving latency, we built mixed-effects linear models with latency as the response variable, which we log-transformed to bring it closer to a Gaussian distribution. In this model we excluded unsuccessful sessions and included only the subset of sessions where the bird solved the task (therefore, birds that were unsuccessful in all sessions were excluded, reducing the sample size). In both models, the fixed and random effect structure was identical to that of the above Cox models.
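The two response variables used in the sensitivity analysis can be derived from the same session records. The following Python sketch is illustrative only (the models themselves were fitted with glmmPQL in R); the record layout, names and example values are hypothetical, and the natural logarithm is assumed.

```python
import math


def sensitivity_responses(records):
    """Split session records into the two sensitivity-analysis responses.

    records: list of (bird_id, latency_s, solved) tuples, one per session.
    Returns (success, log_latency):
      success     - (bird_id, 0 or 1) for every session
      log_latency - (bird_id, log latency) for solved sessions only
    """
    success = [(bird, 1 if solved else 0) for bird, _, solved in records]
    log_latency = [(bird, math.log(latency))
                   for bird, latency, solved in records if solved]
    return success, log_latency


# Hypothetical records: bird "A" solves twice, bird "B" never solves.
records = [("A", 1201, False), ("A", 400, True), ("A", 150, True),
           ("B", 1201, False), ("B", 1201, False)]
success, log_latency = sensitivity_responses(records)
birds_in_latency_model = {bird for bird, _ in log_latency}
```

In this example bird "B" contributes to the success model but drops out of the latency model entirely, which is the sample-size reduction noted in the text.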
Neophobia
As we could not quantify neophobia for one individual (the one that did not take the food item in the control phase of the neophobia test, see Neophobia test), and excluding this individual would have led to data loss, we opted not to include neophobia in the above models. Instead, we tested for an effect of neophobia on problem-solving success by building separate, extended Cox mixed-effects proportional hazard models, including all the fixed and random variables from the above full models, plus log-transformed values of neophobia as a covariate. As the string-pulling model (but not the plug-opening model) was sensitive to variable order due to the relatively large number of censored data, we kept the variable order the same as in the base model and added neophobia as the last fixed term. As the other variables yielded estimates qualitatively similar to the base model, we do not report the full model output, only the neophobia estimate. To check for potential multicollinearity, we also tested whether neophobia was affected by any of our tested factors in a single linear model with habitat type, sex, age and group as explanatory variables; none of these variables had a significant effect on neophobia (Table S2).
Habitat, sex and age differences in learning speed
We tested whether learning speed (i.e. the change of latencies over the four sessions, included in the model as the variable “session number”) differed between habitat types, sexes and age groups by adding interaction terms between session number × habitat type, session number × sex or session number × age to our Cox models. We tested each of these interactions in separate models rather than all three in the same model to avoid over-parametrization. As with the neophobia models, we only report the interaction estimates; the variables not included in the interaction yielded qualitatively similar results to the base model.
Relationship between performance in the two test types
We tested whether problem-solving performance in the string-pulling test and in the plug-opening test were related to each other by adding solving latency from the test with the other test device (log-transformed for better model fit) as a covariate to our Cox models, in which non-solvers were given maximal latency values of log(1201) = 7.091. In this test, the sessions were paired by session number, i.e. the first string-pulling test with the first plug-opening test, the second string-pulling with the second plug-opening, and so forth. As in our other extended models, we only report the estimates for the effect of one solving latency on the other.
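Pairing the sessions and building the log-transformed covariate can be sketched as follows. This is an illustrative Python fragment, not the authors' R code; the function name and example latencies are hypothetical, and the natural logarithm is used because it reproduces the reported value of log(1201) = 7.091.

```python
import math

CENSOR_LATENCY_S = 1201  # latency assigned to non-solvers (from the text)


def paired_log_covariate(focal_latencies_s, other_latencies_s):
    """Pair the four sessions of two tests by session number.

    Both arguments hold one bird's four latencies (s), in session order,
    with unsolved sessions already set to 1201 s.
    Returns (session_number, focal_latency_s, log_other_latency) rows.
    """
    if len(focal_latencies_s) != 4 or len(other_latencies_s) != 4:
        raise ValueError("expected four sessions per test")
    return [(session, focal, math.log(other))
            for session, (focal, other)
            in enumerate(zip(focal_latencies_s, other_latencies_s), start=1)]


# Hypothetical bird: string-pulling as focal test, plug-opening as covariate.
string_pulling = [CENSOR_LATENCY_S, 640, 300, 120]
plug_opening = [85, CENSOR_LATENCY_S, 40, 30]
rows = paired_log_covariate(string_pulling, plug_opening)
```

Swapping the two arguments gives the covariate for the plug-opening model.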
We also investigated the relationship between the success rates of the plug-opening and string-pulling tests with Pearson’s chi-squared tests: one comparing overall solving success between the two problem-solving tests (each bird that solved a test at least once was counted as “successful” and only those that failed 4 out of 4 times were counted as “unsuccessful”), and one comparing solving success in the first session (only birds that solved the task in the first session were counted as “successful”).
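The two success definitions and the chi-squared statistic can be illustrated with a short Python sketch. It is not the authors' code: the function names and all counts are hypothetical, the layout of the 2 × 2 table (rows for one test, columns for the other) is an assumption because the text does not describe it, and no continuity correction is applied, which may differ from the defaults of the software the authors used.

```python
import math


def overall_success(session_outcomes):
    """True if the bird solved the test in at least one of its four sessions."""
    return any(session_outcomes)


def first_session_success(session_outcomes):
    """True only if the bird solved the test in its first session."""
    return bool(session_outcomes[0])


def pearson_chi2_2x2(table):
    """Pearson's chi-squared statistic and P-value (df = 1) for a 2 x 2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, purely for illustration (not the study's data).
chi2, p = pearson_chi2_2x2([[10, 20], [20, 10]])
```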