The lifespan of pet or companion dogs has been increasing over the years [1] and, with this, behavioural and physical deficits more common in old age have also become more frequent. In recent decades, research on canine ageing has grown considerably because both researchers and the public have recognised the emotional, economic, and scientific value of pet dogs, in the latter case as an animal model species [2–5]. According to these studies, owners often report a decline in the visual and auditory function, social behaviour [6–12], and sleep/wake cycles [13, 14] of ageing pet dogs.
To better understand these phenomena, researchers have developed various behavioural tests that measure the behavioural differences linked to old age in companion dogs. For instance, an open field test showed that the chronological age of dogs is linked to their neophilic behaviour: specifically, young dogs (1-4 years) sniffed and played with novel objects for longer than older dogs (>9 years) [15]. In a similar study, the authors also observed that, when a stranger entered the experimental area, younger dogs (1-4 years) had more physical interactions with the person than older dogs (>9 years) [16]. Discrimination and reversal learning tasks were developed to measure memory, learning, and cognitive flexibility, showing that younger dogs (1.5-6.5 years) learned faster than older dogs (8.0-14.5 years) [4]. These results have been further validated using EEG, which demonstrated a correlation between the intrinsic frequency of sleep spindles (non-REM bursts of activity in the sigma range) and the number of reversal learning training trials [17]. Sleep spindles predict learning in dogs and vary with age [18, 19]. A spatial memory task in which dogs had to rely on short-term memory to find food was also developed, showing that younger dogs (3-6 years) were more efficient than older dogs (9-11 years), committing fewer errors and finding the food more often at their first attempt. External validation was obtained by examining the gut microbiome of the dogs, demonstrating that worse memory performance (more errors) was associated with a higher proportion of Actinobacteria in their faeces [20], in agreement with the high abundance of some Actinobacteria in the gastrointestinal tract of patients with Alzheimer’s disease [21].
Recently, a battery of standardised outdoor behavioural tests (Mini Mental Test, MMT) was developed for the rapid assessment of age-related behavioural differences in family dogs [2]. Older dogs displayed less social interest, poorer spatial memory, and seemed both less interested in and less fearful of a novel, moving object [2]. However, neither test-retest nor interobserver reliability was reported for this test battery, and both must be established before the tests are applied in clinical settings.
The quality of an assessment should be evaluated against five key criteria: purpose, standardisation, reliability, validity, and practicability [22]. A biological measurement is the cumulative result of several factors: the true value of the phenomenon that we intend to measure, biological variation, tool sensitivity, the skill and expectations of the observer and the experimenter, subject-related factors (e.g., hunger, fear), and external factors, such as environmental temperature or visual, olfactory, and auditory stimuli [23]. Standardisation is therefore adopted to reduce the amount of random error [24], so that the effect of the independent variable can be better distinguished from noise. For example, detailed protocols and training of the experimenters and coders are methods implemented for standardisation [25]. In a standardised test, two parameters are assessed to determine whether the test can be considered relevant and accurate: reliability and validity [26, 27].
A measure is considered reliable when it is consistent and stable over multiple measurements [27]. There are three criteria for reliability: 1) intra- and interobserver reliability or agreement, which is the level of consistency within and between observers/coders and reflects the effect of subjective bias on the coding/scoring system [27]; 2) internal consistency, indicating coherence among the components of a scale intended to measure the same phenomenon [28]; and 3) test-retest reliability, indicating that the test yields the same results when repeated on the same subjects under identical conditions [22].
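As an illustration only, the three reliability criteria above can each be expressed as a commonly used statistic. The sketch below assumes Cohen's kappa for interobserver agreement on categorical codes, Cronbach's alpha for internal consistency, and a two-way random-effects, absolute-agreement intraclass correlation, ICC(2,1), for test-retest reliability. These choices, the function names, and the data layouts are assumptions made for the example and are not necessarily the statistics used in this study.

```python
from collections import Counter
from statistics import variance


def cohen_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders on categorical codes."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must score the same, non-empty set of trials")
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Agreement expected by chance from each coder's marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    if expected == 1:
        return 1.0  # both coders used one identical category throughout
    return (observed - expected) / (1 - expected)


def cronbach_alpha(items):
    """Internal consistency; `items` holds one list of subject scores per item."""
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    totals = [sum(scores) for scores in zip(*items)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total scores do not vary between subjects")
    item_var = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_var / total_var)


def icc_2_1(scores):
    """ICC(2,1); `scores` holds one row per subject, one column per session."""
    n, k = len(scores), len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 subjects and >= 2 sessions, no missing values")
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denominator == 0:
        raise ValueError("scores do not vary; ICC is undefined")
    return (ms_rows - ms_error) / denominator
```

With hypothetical data, `icc_2_1([[1, 2], [2, 3], [3, 4], [4, 5]])` falls below 1 even though the two sessions are perfectly correlated, because every subject scored one point higher at retest. This is the reason an absolute-agreement coefficient was chosen for the sketch rather than a plain correlation.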
Validity, in contrast, indicates that the method measures what it is meant to measure, both internally and externally [22, 28]. The conclusions drawn from the results of the measurement are therefore relevant, coherent, and useful [24]. Internal validity relates to the value of the measure itself, and it is assessed through three categories [29]. 1) Content validity, or the scientific relevance of a test, indicates that the method contains only measures that are relevant to its aims. 2) Construct validity shows whether the hypothesised cause explains the test scores. This can be measured through the correlation of the measurement with other measures or methods that are theoretically related (convergent validity), while ensuring that it is independent of others to which it is not related (discriminant validity). 3) Criterion validity (predictability) indicates the predictive ability of the measurement in comparison with a previously validated instrument (a “Gold Standard”). External validity is the degree to which results can be generalised across studies [29].
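Convergent and discriminant validity, as described above, both come down to correlations between measures. The following minimal sketch assumes a Spearman rank correlation, which is a common choice for ordinal behavioural scores but is an assumption made here, not a method taken from the cited studies. All function names and example values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of at least 2 values")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five dogs (illustrative values, not real data).
memory_errors = [1, 2, 3, 4, 5]          # measure under validation
related_memory_task = [2, 3, 5, 7, 11]   # theoretically related measure
unrelated_measure = [2, 5, 3, 1, 4]      # theoretically unrelated measure

convergent_rho = spearman(memory_errors, related_memory_task)
discriminant_rho = spearman(memory_errors, unrelated_measure)
```

Under this reading, a high `convergent_rho` together with a `discriminant_rho` close to zero would support construct validity. What counts as high or close to zero is a judgement that depends on the field and the sample size.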
Behavioural tests are frequently used in various contexts, for example, to assess temperament or personality in pet, working, and shelter dogs [22, 29, 30], either in person or remotely [28]. In ageing research, behavioural testing has been used to investigate the effect of age on various cognitive skills, such as discrimination and reversal learning, memory, and executive function. Several authors observed a significant difference between young and old individuals in discrimination and reversal learning tasks [4, 31, 32].
Despite the widespread use and importance of behavioural testing in ageing research, some shortcomings have been identified. Some of these tests require long training and cannot be repeated over a short time (e.g., [4]), which makes it impossible to use them for monitoring age-related behaviour changes in a longitudinal design [33]. Others rely on social interaction [16, 34], which may be perceived differently by different dogs depending on the partner. For example, test accuracy may be undermined by the different responsiveness of dogs towards male and female experimenters. Previous research indicates that shelter dogs show a stronger decrease in defensive-aggressive behaviours (tendency to look, barking) towards women [35], lower levels of plasma cortisol and a more relaxed posture when petted by women [36], as well as more stress-related behaviours (tendency to look, shorter tail-high periods, lip-licking) when walked on a leash by men [37]. The influence of human gender on behaviour has been understudied in companion dogs [38, 39] and, so far, has never been accounted for as a potential confounding factor in field tests designed to assess age-related behavioural differences. Finally, cognitive tests designed to measure positive affective states have replicability issues and may not be reliable in ageing dogs because of the learning required: for example, studies based on the cognitive bias test, a test for mood based on discrimination choices, showed that older dogs may struggle to learn the discrimination, so it may not be possible to test them [4, 40]. Clinicians still need standardised testing for positive emotions in senior animals.
This study aimed 1) to investigate the reliability (interobserver, inter-experimenter, test-retest) of the MMT [2], and 2) to adapt it to indoor settings to provide a controlled environment with limited distractions. The MMT has demonstrated content and construct validity (internal validity) and a good degree of external validity [2, 20]. However, interobserver and inter-experimenter agreement, as well as test-retest reliability, are yet to be studied. Therefore, we modified the protocol to include two experimenters (a woman and a man) and compared the dogs’ behaviour when tested by each of them. The experimenters and an independent observer also coded the dogs’ behaviour to calculate interobserver reliability. We added a new test to the battery for assessing spatial memory and neophilia, the Novel object recognition test (NOR) [41–43]. The test is widely used with murine models but, to our knowledge, has not yet been applied to dogs [41]. Finally, we tested both old and young dogs and compared their performance on the first occasion (T0) and after one to two weeks (T1) to measure test-retest reliability.