4.1. General comments
Since 2004, many proofs of concept have been published on the ability of dogs or rats to detect diseases. However, there are often great discrepancies among results. While some publications report sensitivities as low as 17% (Gordon et al., 2008) and specificities around 29% (Amundsen et al., 2014), others achieve rates of 100% sensitivity (Horvath et al., 2008; Cornu et al., 2011; Sonoda et al., 2011) and specificity (Sonoda et al., 2011; Yamamoto et al., 2020).
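As a reminder of what these headline figures measure, sensitivity and specificity derive directly from a test run's confusion counts. A minimal sketch with hypothetical counts (not taken from any reviewed study):

```python
# Illustrative only: computing sensitivity and specificity from a
# detection run's confusion counts (hypothetical numbers).
def sensitivity(tp, fn):
    """Proportion of diseased samples correctly indicated (true-positive rate)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of control samples correctly ignored (true-negative rate)."""
    return tn / (tn + fp)

# Example: 30 positive samples, 24 indicated; 70 controls, 63 ignored.
print(round(sensitivity(24, 6), 2))   # 0.8
print(round(specificity(63, 7), 2))   # 0.9
```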
Only a few studies report testing performed in screening conditions (DB2, unforced choice), and those usually enrolled small numbers of animals. This could be explained by the fact that screening conditions in double-blind testing combined with unforced choices are more challenging for the animals, the handlers, and the operators, limiting the amount of data available to validate such a method. These results are discussed in the following paragraphs.
4.2. Considerations about patients’ selection and samples
4.2.1. Patient and control selection: reference test and populations matching
First, careful diagnosis of patients and controls is critical to avoid bias. The presence of the disease of interest in patients is usually confirmed with the gold-standard test, whose accuracy must be high to avoid false-positive inclusions. Confirmed negative samples are equally critical: ideally, all controls should be tested. However, very few studies report having rigorously tested controls. This can be explained by the fact that asking volunteers to undergo non-required diagnostic tests is costly, time-consuming, tedious and invasive, poses ethical issues, and could lead to volunteer disengagement. From a scientific point of view, however, non-tested volunteers could be a source of false-negative samples. Such samples would be detrimental to animal training and testing. Indeed, the animal must be educated with samples of known status; an inaccurate reference test might lead to sample status errors and mislead the detector. For instance, Thuleau et al. (2018) reported educating dogs to detect breast cancer using samples from patients with cancer confirmed by histology and from volunteers with a recent (< 12 months) negative mammography. Even if mammography is reliable, false negatives can occur, or cancer can appear within a few months of the screening. Dogs would then be trained to ignore such samples, which can lead to further mistakes.
If the reference test has poor accuracy, animal training can be impacted. For instance, the results reported by Cornu et al. (2011) show that training a dog with potential “rogue” controls affected final performances (66). Selected controls were patients aged > 50 with elevated Prostate-Specific Antigen (PSA), comparable with cancer patients regarding these characteristics. Control patients had a mean PSA value of 8.3 ± 4.1 [range: 2–16.8]. Given these values, it can be estimated that 20–30% of these control patients with negative prostate biopsies actually had prostate cancer.
Similarly, Willis et al. (2004) reported being concerned that “rogue” control specimens from people with undiagnosed cancer elsewhere in the body might be inadvertently added to pooled samples. Indeed, during training, all dogs unequivocally indicated as positive a sample from a participant recruited as a control on the basis of negative cystoscopy and ultrasonography; after further tests, a transitional cell carcinoma was discovered. As this detection method with animals is not yet validated, not all false positives indicated by animals can be double-checked. More recently, Grandjean et al. (2020) had a similar issue, with two of their supposedly SARS-CoV-2-negative controls turning out to be positive.
Second, the importance of matching the characteristics of patient and control groups, to make sure that animals detect the disease itself and not a confounding factor, is well known (89, 92). Matching has been reported on age, sex, skin colour, other diseases, comorbidities, symptoms, smoking status, and diet.
For instance, Bomers et al. (2012) worked on C. difficile detection with dogs at a hospital. They reported that on the day of the detection round all cases had diarrhoea, compared with 6% of the controls. In such a situation, one may wonder whether the dog successfully indicated the targeted disease (C. difficile) or just the presence of diarrhoea.
To prevent such bias, Willis et al 2004 exposed the dogs to urine from patients presenting with a broad range of transitional cell carcinomas, in terms of grade and stage, to increase their likelihood of recognizing the common factor or factors. They took particular care to train the dogs with control samples containing elements likely to be present in urine from patients with bladder cancer and commonly occurring in other non-malignant pathologies. This way, they could teach the dogs to ignore non-cancer specific odours. This led to the inclusion of urine samples from a variety of patients, such as people with diabetes to control for glucose, those with chronic cystitis to deal with the influence of leucocytes and protein, and healthy menstruating women to control for blood.
Several years later, the same team (Willis et al., 2011) assumed that body fluids, tissues and emissions from young, healthy individuals differ in composition from those of older cancer patients to a greater extent than do samples from age-matched individuals with non-cancerous disease of the same organ. They performed an electronic nose study in which the classification accuracy dropped once more diseased individuals were added to the healthy control group (93). This shows that the choice of controls can markedly affect the level of specificity achieved.
4.2.2. Disease-specific odour and types of body fluids chosen
Research teams hypothesized that a specific odour was present in the samples they chose. However, to our knowledge, in the case of cancer it is not known whether a specific cancer has a specific chemical signature and, if so, what the source of that signature is. Indeed, the odour of cancer could come from the tumour itself, from the modified environment surrounding the tumour, or both. Moreover, it is not yet known whether all cancers share common odours. For instance, McCulloch's group reported good performances of dogs trained to alert to two cancers rather than to discriminate a single cancer. This could mean that there is a general biochemical marker common to all cancers, with individual cancers having additional specific markers (53).
There are different interpretations concerning the localization of disease odour within the body: is it localized, organ-specific or spread? For instance, Horvath et al. (2008) report that one important observation during the training period was that using fat from the same individuals from whom the carcinomas were removed did not increase the number of failures. The dogs' absence of reaction suggests that a general body odour involving all organs does not exist. However, two years later, the same team (Horvath et al., 2010) reported that, for the same cancer (ovarian), dogs trained with tumours could discriminate blood samples and vice versa. Their study strongly suggests that the characteristic odour emitted by ovarian cancer samples is also present in the blood (plasma). Similarly, after observing that canine scent judgement can be used on both breath samples and watery stool samples, Sonoda et al. (2011) concluded that, for colorectal cancer, chemical compounds may be circulating throughout the body.
Murarka et al. (2019) comment that Yoel et al. (2015) found that, after being trained on a breast cancer cell line, dogs were able to detect both skin cancer and lung cancer cell lines, suggesting the possible presence of a general cancer olfactory cue within cancer cell lines. However, that study did not explore whether these dogs could then also detect cancer in patient-derived samples. There is also the possibility that the dogs learnt to disregard control samples (which were probably similar) instead of recognizing malignant cell cultures. This seems in accordance with the observations of Murarka et al. (2019), whose research suggests that after training dogs on cell lines, there was no spontaneous transfer to blood plasma.
From these observations, four situations can be considered depending on odour specificity and localization, which are presented in Table 8.
Table 8. Disease odour localization and specificity hypotheses

|                                           | Disease odour localized                                                              | Disease odour widespread                           |
| Disease odour: specific                   | Sample choice critical; high test specificity                                        | Sample choice less critical; high test specificity |
| Disease odour: common to several diseases | Sample choice critical; localization can narrow the alert to a shortlist of diseases | Sample choice less critical; low test specificity  |
Table 8 shows that the choice of body fluid is critical, and that it also affects the choice of controls. From this table, we see that an odour widespread throughout the body and non-specific to a disease will lead to low-specificity tests. In such situations, the indication of a sample by a trained animal will give little information on which disease to look for, and will therefore have low added value.
Body fluids used in the reviewed articles are dominated by breath and urine (Fig. 4). These have the advantage of being easy to collect (liquid or air, non-invasive) and easy to split into several samples, therefore allowing several training sessions and tests per sample without pollution or odour decrease. Liquids like urine are also easy to dilute, for instance to increase detection difficulty by reducing the amount of VOCs per sample. Such dilutions have also allowed the study of animal detection thresholds (88) and comparisons with GC-MS and e-noses. However, we regret that the reasons that led to the choice of body fluids were not, or only poorly, documented.
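As an illustration of how such dilutions are typically laid out, a serial dilution ladder divides the concentration by a fixed factor at each step. A minimal sketch (the tenfold factor and the step counts are our assumptions for illustration, not values from the reviewed studies):

```python
# Hypothetical serial-dilution ladder (our illustration): each step dilutes
# the previous sample by a fixed factor, to probe the detection threshold.
def dilution_series(initial_concentration, factor=10, steps=5):
    """Return the concentration at each step, starting from the undiluted sample."""
    return [initial_concentration / factor ** i for i in range(steps + 1)]

print(dilution_series(1.0, steps=3))   # [1.0, 0.1, 0.01, 0.001]
```

The animal's detection threshold is then bracketed by the last dilution it still reliably indicates and the first one it misses.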
4.2.3. Sampling protocols
After the choice of body fluid and sampling location, sampling protocols and materials are key to obtaining high-quality samples. Most of the studies report the importance of applying the same sampling procedures to both patients and controls to eliminate potential bias and confounders. For instance, Ehmann et al. (2012) showed that, at first, trained dogs were discriminating not disease state but sampling location, which differed between patients (at the hospital) and healthy volunteers (at home).
If sampling is performed at home with no supervision, errors can occur, leading to poor sample quality. Thuleau et al. (2018) report that, to sample skin secretions, they asked patients and volunteers to shower with an odourless soap before sleeping with a cotton pad on the breast overnight. In this case, researchers cannot be sure that the person followed each step correctly or that no incident occurred. For example, the pad could have fallen off during the night, resulting in pollution and limited contact time between the pad and the skin, and therefore in a limited amount of VOCs. Other odours could also have impregnated the sample, such as those of bedsheets, partners, or pets. Such unsupervised sampling protocols add difficulties and should be controlled as much as possible.
A non-exhaustive list of parameters that can induce bias includes smoking status, sex, age, ethnicity, diet, different sampling locations, different sampling protocols for patients and controls, and treatments. For instance, to limit diet bias, Hackner et al. (2016) report that, for homogeneous sampling, the tested persons were required not to drink, eat or smoke within 90 minutes before breath sample collection.
4.2.4. Odour sampling materials
Not all types of body fluids require odour sampling materials. For instance, urine, faeces, and blood can be collected and presented untransformed to animal detectors. However, breath and skin secretions need optimized materials to capture VOCs without releasing other odours that could disturb detection. Some sampling materials have been presented in Sect. 3.5.2 and Table 3. For instance, Willis et al. (2016) report that the choice of material for their patches came from studies on canine scenting in forensic science. In terms of the greatest variety and quantity of skin-surface VOCs collected and readily released, the optimum fibre appeared at the outset of their study to be 100% cotton, so they employed a widely available, sterile, pure cotton gauze throughout. For the chosen sampling time of 15 minutes, they were again guided by the forensic science literature.
However, such a description is an exception, and, as for the choice of body fluids, we can regret that the choice of materials is poorly documented. The vast discrepancies among material types strongly suggest that this part of the research is still empirical and needs better understanding, characterization, and standardization. In the future, this field would benefit from a better description of material parameters, as is often done in publications reporting VOC detection by GC-MS.
4.2.5. Sample conservation
In chemistry, it is known that temperature variations, light and hygrometry can modify VOC profiles. Such parameters are crucial, yet they are poorly described and no consensus has emerged.
Most of the reviewed studies stored samples at low temperatures (< 0 °C), and only a few stored them at room temperature (see Table 3 and Fig. 5). This choice is usually not justified, except in a few studies. Willis et al. (2011) report that samples were stored primarily at −80 °C, which has been the most desirable temperature for retaining chemical species (86). Mahoney et al. (2012) report that their samples were frozen at −20 °C until the evaluation day (up to seven days). Though there is some controversy surrounding the cellular impact of freezing and thawing sputum, past research suggests that samples may be kept frozen without significant alteration of cell quality or cell counts (94). Little information is given about light; however, most studies report storing samples in a fridge or freezer, where the absence of light is evident. No information has been found about hygrometry or pressure, and conservation time and the number of sample openings lack description. The heterogeneity of VOC conservation procedures shows that this part is still empirical and needs better understanding and evaluation. Guidelines on minimal, maximal, and optimal conservation conditions would undoubtedly help standardization.
4.2.6. Considerations about odour threshold
Selected animals have a superior sense of smell compared to humans (15). For instance, Horvath et al. (2010) observed that trained dogs could detect a quantity of 20 ovarian carcinoma cells on abdominal fat. However, the sense of smell is not unlimited, and it loses efficiency below a certain VOC threshold. This threshold effect has been studied by Sato et al. (2017) (article excluded from this systematic review). Willis et al. (2004) also report that they had to consider the physical state of the urine presented to the dogs. They opted to train one cohort of dogs on wet samples and another on dry samples. When tested, the dogs trained on liquid urine performed significantly better, suggesting that the more volatile molecules are important in the cancer odour signature.
Odour threshold also plays a role in the progression of dog training. Some teams chose to use the same types of samples from the start of training through to testing. Others, on the contrary, started detection work with samples containing a higher amount of VOCs and decreased the intensity step by step; the latter strategy is supposed to be easier for the animals before lowering the threshold. These VOC-richer samples can be (i) bigger (in volume, surface, or quantity); (ii) more concentrated (exhaled air, sweat, etc.); or (iii) other types of samples, such as tumours or materials in direct contact with the tumour. However, the diversity of samples used before the final configuration is not systematically reported within studies.
In addition, there may be differences in odour intensity between diseases, especially between infectious and viral diseases with strong diffusion (likely related to contagiousness) and hidden tumours. Hence the importance of odour sampling procedures and materials, as well as sample conservation.
4.2.7. Sample number of uses: pollution and memory effect
The number of times samples are used is not always well reported. It is evident, however, that some studies reused samples at least for some training. Here, three types of “reuse” are to be considered:
Case 1: The same sample is presented several times to the same animal detector
Case 2: The same sample is presented to several dogs (several times per dog or not)
Case 3: Sample replicates of the same patient are presented to an animal detector
In cases 1 and 2, there is a risk of pollution (by direct contact with the animal or its breath, or by the atmosphere), which leads to sample alteration each time the sample is used. Therefore, once smelled, samples are no longer identical to “new” samples. Moreover, opening a sample several times can decrease the VOC quantity. In cases 1 and 3, samples from the same person are presented several times to an animal, with the risk of training the animal's memory instead of its discrimination. This latter issue has been reported by several teams who saw their results plummet in double-blind situations with only new samples.
Conversely, Willis et al. (2016) report that multiple uses of the same sample during training did not appear to lead to a significant loss of volatile signature, since the dog continued to successfully select known melanoma samples used up to 15 times over a period of 18 months post-collection. With such an observation, however, one may suspect that the dog did not learn to discriminate samples but instead memorized those specific samples.
Ideally, an animal should smell only new (uncontaminated) samples, and only once per patient (to avoid the memory effect). The advantage of urine, faeces, blood and breath is that these body fluids are easy to sample or aliquot, making it possible to obtain several samples very quickly. This way, several dogs can be trained with samples from the same person while preserving sample quality.
In some studies (e.g., Cornu et al., 2011), some control samples were reused during testing. This does not seem to be a problem in an unforced-choice configuration (cf. scent line-up characteristics, part 3.7). However, in a forced-choice configuration, reusing some control samples might reduce the number of new possibilities for the dogs, leading to an easier design and a higher success rate by chance alone.
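This effect can be quantified with a back-of-the-envelope calculation. In a forced-choice line-up containing one positive among n samples, a dog indicating at random succeeds with probability 1/n; if it has learnt to rule out reused controls, the effective chance rate rises. A minimal sketch (our illustration with hypothetical numbers, assuming the dog fully recognizes and discards every reused control):

```python
# Sketch only: chance-level success in a forced-choice line-up with exactly
# one positive sample, when the dog can rule out familiar reused controls.
def chance_success(n_samples, n_known_controls=0):
    """Probability of indicating the positive sample purely by chance."""
    effective_choices = n_samples - n_known_controls
    return 1 / effective_choices

print(chance_success(6))       # 1 in 6 with all-new samples
print(chance_success(6, 3))    # 1 in 3 if 3 controls were reused and recognized
```

Even partial recognition of reused controls would shift the chance rate somewhere between these two bounds, inflating apparent performance.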
4.3. Considerations about animals
4.3.1. Animal choice
Apart from dogs, giant pouched rats have been extensively used by one team working on tuberculosis detection in Tanzania. Little of the literature reports the reasons behind the choice of animal, beyond their acute sense of smell. Dogs are the most used animals worldwide. This choice can be justified by the availability and experience of dog trainers in many countries, for instance for drug and explosive detection. Dogs have the advantage of being adaptable to different fields (battle, airports, rescue, remote scent tracing, contact with humans). However, for remote disease detection only (detection done in a controlled configuration, at a distance from patients), there is no need for such adaptability and, to our knowledge, no validated study justifies preferring dogs over rats. Authors generally report looking for motivated dogs with high olfactory capabilities. However, there seem to be no standard validated tests for dog selection, which so far remains empirical in the absence of clear guidelines.
Gordon et al. (2008) mention an ongoing theory that certain breeds are better at scent detection than others (95). However, studies have shown a greater difference in scenting ability between dogs within a breed than between breeds (96). We observe performance variations in the selected studies both between and within breeds. This has been described by Jamieson et al. (2017), who concluded that, due to individual variation, a dog should not be chosen solely based on its breed (95). In addition, if we consider that the evaluated dogs were mostly selected among the best, under the watchful eye of an experienced professional, we can assume that even greater discrepancies would exist without such selection. There are an estimated 500 million dogs worldwide and, so far, fewer than 200 have been considered potentially suited to disease screening tasks in controlled studies, with varying results. Such a method seems to have huge potential; however, these low numbers preclude extrapolation.
4.3.2. Selection success
In Elliker et al. (2014), only three out of ten dogs initially recruited for the study passed the first stage of training. According to the authors, high failure rates are common when training dogs for specialist roles because of the specific behaviour/temperament attributes required (97, 98).
Despite this low selection rate, 82% of the dogs mentioned in the studies completed all the requested exercises. This number may seem high but hides several caveats:
It is likely that some studies only mention the dogs who performed well and do not mention all the dogs they evaluated before selecting their champions.
Some of the dogs, even after completion of the whole program, have poor results.
The loss rate is greater when the difficulty of the exercise increases (blank runs, double-blind). As most of the studies report forced-choice scent line-ups, more dogs succeed.
Interestingly, Murarka et al. (2019) report that all dogs that left the disease detection program and switched to other odours (for example, narcotics, bed bugs, accelerants, blood plasma) were rapidly and successfully trained. This strongly illustrates the difficulty of disease detection with dogs compared to other odours.
Elliker et al 2014 report that it has been suggested that it may be useful to breed dogs specifically for cancer odour detection (99), which may help to increase the proportion of suitable dogs available for future studies of this type.
4.3.3. Training duration
Considerable differences in training duration are observed across studies, ranging from a few weeks to several years. Such differences can be explained by the type of disease to detect, the difference between patients and controls, the choice of body fluids, the quality of samples, training differences, and animal abilities. No correlation was observed between training duration and success rates among studies. However, Ehmann et al. (2012) identified an improvement of lung cancer identification capabilities over the course of the test series and concluded that an ongoing training effect must be assumed, calling for even longer dog training in future studies.
4.4. Scent line-up
4.4.1. Scent line-up: Number of samples and line vs circle, the distance between samples
The number of samples presented to animals ranges from 2 (Bomers et al., 2012) to 12 (Essler et al., 2021). No justification was provided for these numbers. No study was performed with only one sample, although it has been shown in the literature that dogs are able to perform tests with a single sample (100). In such a test, dogs have to make an absolute choice: they are asked to “evaluate”. On the contrary, when several samples are presented, the dog can perform a discrimination task and is probably more stimulated: it is asked to “search”. All the reviewed studies used the latter configuration.
Samples were presented in a line, in a circle, or randomly (Table 5). The choice of a line can be motivated by the ease of designing “blank” runs: at the end of the line, the dog can indicate that no positive sample was found. Blank runs can also be done in a circle configuration; the advantage of the latter is that there is neither start nor end, so all samples are equivalent.
Spacing between samples is fundamental for several reasons. The most obvious is to preclude cross-contamination between samples. Another, less apparent, reason is that it allows enough time for olfactory latency and persistence. Latency is defined as the time necessary to perceive an olfactive stimulus, estimated at 0.5 seconds for dogs. Persistence is the duration for which the olfactive sensation lasts. If samples are too close together, these durations cannot be respected, and dogs risk either missing a sample or mixing signals.
4.4.2. Scent line-up configurations: forced vs unforced choice
Using forced- vs unforced-choice scent line-ups has a strong influence on performances. Unforced-choice exercises are more complex. In a forced-choice exercise, the animal learns only one configuration: it knows it must “find” the odour of the disease. As a result, it chooses the sample that most resembles the target, or the one that is the odd one out. Moreover, Bomers et al. (2012) report that anticipation of a single positive result could have influenced the trainer's behaviour, thereby unintentionally influencing the dog's response (90). Such a configuration is therefore easier not only for the animal but also for the handler. On the contrary, in an unforced-choice configuration, animals must evaluate each sample and cannot choose by simple comparison. This is a difficulty that not all animals can overcome. An unforced-choice situation is, however, the only one that could be applied to screening.
With the particular configuration reported by Murarka et al. (2019) (see results), the dog has only one sample to evaluate, while the distractor is there for stimulation (58). Such a configuration is an interesting tradeoff between the one-sample and several-sample scent line-ups described above and can easily be applied to screening.
4.4.3. Atmospheric conditions
Atmospheric conditions during training and testing are known to affect dogs' sense of smell, yet they are poorly documented within the reviewed studies. The studies that did report them worked with controlled temperatures between 12 °C and 20 °C (see Table 5). When not under control, these conditions can negatively impact scent detection work: Sonoda et al. (2011), for instance, conducted their tests from 13 November 2008 to 15 June 2009 because the dog's concentration tended to decrease during the hot summer season (68). Likewise, Hackner et al. (2016) observed that high humidity and elevated ambient temperature were detrimental to the dogs' performance, and suggest that testing should not be performed during unfavourable weather conditions (43).
4.4.4. Blind conditions
4.4.4.1. Proofs of principle vs double-blind clinical trials in a screening-like situation
For a potential deployment of disease detection with animals, only double-blind clinical trials in a screening-like situation (i.e. unforced choice) might be useful (see Sect. 3.9 for blinding conditions). To date, only 6 studies meet these expectations (Table 6). In these studies, the results usually decreased at first when shifting to double-blind conditions. This fall between training and double-blind testing has often been explained by the Clever Hans effect (90). To avoid failure, teams must train as much as they can in blind situations, as suggested by Gordon et al. (2008), who report that blinding should be initiated early in training to preclude unintended cues from the trainers that may contaminate the process (53). Willis et al. (2016) reported that, after training the dog in a non-blinded situation, their trainer reported a near 100% success rate in identifying the melanomas; it was decided to begin a series of double-blind tests, but after 13 runs the dog had successfully identified only one of the melanoma samples (44). Implementing blinded conditions is not easy during training because dog handlers need to know when to reinforce positive behaviour. To do so, a non-blinded assistant, hidden from the dog, who can quickly tell the handler when to reinforce, is needed.
4.4.4.2. Rewarding the dogs or not in a screening-like situation: a puzzling question
In a screening-like situation, nobody knows whether the animal's indication is correct, which is an issue for rewarding. Indeed, if the trainer decides not to reward the animal, the latter may gradually lose interest. Conversely, if the animal is rewarded every time, incorrect indications may be reinforced. Several strategies have therefore been adopted by different teams.
For instance, McCulloch et al. (2006) report that, since the experimenters no longer knew the status of the target breath sample, they did not activate the clicker device after a sitting indication by the dog, and therefore the handler did not reward the dog with any food. Bomers et al. (2012), searching for C. difficile infections in hospital wards, confirm that surveillance differs fundamentally from the case-directed diagnosis of their study design, because the dog cannot immediately receive a reward after a positive identification, potentially extinguishing the trained alert. The same solution was adopted by Willis et al. (2011): “Both the trainers and researchers remained blinded throughout the trial, only breaking the sample and positional codes at the very end, meaning that the dogs could not be rewarded for a correct indication immediately after each test run. The trainers reported that, over time, this led to a loss of confidence in the dogs, with a deterioration in their performance”. On the contrary, Elliker et al. (2014) performed two types of tests. In the first, they were in a DB2 situation and decided to reward the dog for each indication. However, during three rigorously controlled double-blind tests involving urine samples from new donors, the dogs did not indicate cancer samples more frequently than expected by chance. The team finally switched to a DB1 situation, in order to reward the dogs only for positive responses. These are exceptions, because most of the studies were conducted in a DB1 configuration, which made it possible to know whether to reward the dog after each line.
According to Biehl et al. (2019), rewarding dogs' work has to be independent of the results achieved and should refer only to the work done. If dogs are only rewarded for positive indications, they will quickly learn to obtain more rewards through positive indications, which could easily lead to more false-positive results. Hackner et al. (2016) attributed their inferior results to the true double-blind and screening-like conditions. They report that this factor posed immense stress on the dogs and their handlers, and therefore suggest positive feedback mechanisms for future study designs. According to them, it seems favourable to confront dogs relatively often with the pattern odours. They suggest that a test situation in which dogs always find an unblinded positive sample and ignore an unblinded negative sample in the line-up would probably be better: the positive sample creates the opportunity to earn a reward and reinforces the dogs' motivation, while the negative sample assures the handler that the dog is still performing well. The other samples in the line-up should be the blinded test samples.
Another similar solution would be to alternate training lines and test lines. It could be decided that a test line is performed only after a number (to be determined) of successful training lines, and another training line could be performed right after the test to ensure the dog is still performing well. Such a pattern is feasible to implement; however, it would slow the testing throughput.
This subject is crucial for implementing such a method, and no consensus or solution has been reached so far.
4.5. Applications / Implementation
Pickel et al. (2004) published a proof of concept with dogs sniffing humans directly. Even if scientifically feasible, such a technique seems hardly applicable in the field. Since then, several studies using remote disease detection have progressively built a new scientific discipline. This review shows that no scientific study has validated that animals can be used as a first-line remote detection tool ahead of existing technologies. Only APOPO, the organization supervising Giant Pouched Rats detecting tuberculosis in Tanzania, has found its place as a second-line screener, which makes sense for tuberculosis detection (101, 102).
4.5.1. How many evaluations are needed to validate a sample?
Most studies report the performance of each animal separately. However, as animals are living organisms, their performance can vary. Biehl et al. (2019) reported that literature data show that some dog trainers included only one dog in scent detection, whereas others had five to six dogs and collected the individual dogs’ data (54). McCulloch et al. (2006) state that the sniffing quality of all dogs was comparable, and therefore the results obtained were similar. However, Ehmann et al. (2012) found differences in hit rates between individual dogs and consequently defined a ‘corporate dog decision’ that required an identical decision from at least 3 out of 5 dogs. Amundsen et al. (2014), as well as Hackner et al. (2016), also showed considerable variations in single dogs’ results. These variations might be due to the dogs’ different sniffing capabilities as well as their different daily conditions and training.
Biehl et al. (2019) report that in their study, single dogs’ results differed greatly, with sensitivities ranging from 22% to 67% and specificities from 71% to 89%. They conclude that it is advisable not to rely on a single dog’s decision but to define a corporate decision to minimize variations arising from single dogs. This choice is not straightforward. Indeed, Mahoney et al. (2012) report that, with only two animals, if a sample is considered positive only when indicated by both animals, test sensitivity will decline but specificity will rise. Conversely, if the indication of only one of the two dogs is required, sensitivity will increase but specificity will fall. The argument extends to more animals and indications. For instance, Gordon et al. (2008) report that, at the time, their study was the only one to incorporate replicates for assessing specificity: there were 3 and 2 replicates (33 and 18 runs) for the prostate and breast arms, respectively. The team adds that any study ultimately attempting to prove canine superiority over conventional cancer screening must include replicates and, in the future, go head to head with standard screening methods. Another example is Mgode et al. (2012), where, for tuberculosis detection, a sample is considered positive if selected by two rats. Such a corporate decision is a tradeoff on which no consensus has been reached yet.
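The tradeoff described by Mahoney et al. can be made explicit under a simplifying (and optimistic) assumption not made in the cited studies, namely that the animals err independently and share the same individual rates. The helper below is a hypothetical illustration of such a k-of-n voting rule, not a method used by any of the cited teams:

```python
from math import comb

def corporate_rates(sens: float, spec: float, n: int, k: int) -> tuple[float, float]:
    """Sensitivity and specificity of a 'corporate decision' that calls a
    sample positive when at least k of n animals indicate it, assuming
    independent errors and identical individual rates (an idealization)."""
    def at_least_k(p: float) -> float:
        # P(at least k successes among n Bernoulli(p) trials)
        return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
    corp_sens = at_least_k(sens)           # each animal flags a true positive w.p. sens
    corp_spec = 1 - at_least_k(1 - spec)   # each animal falsely flags w.p. 1 - spec
    return corp_sens, corp_spec

# Two animals with, say, 80% sensitivity and 90% specificity each:
print(corporate_rates(0.8, 0.9, n=2, k=2))  # both must indicate: sensitivity 0.64, specificity 0.99
print(corporate_rates(0.8, 0.9, n=2, k=1))  # either suffices:    sensitivity 0.96, specificity 0.81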
4.5.2. Number of samples to train a dog and maintain performances
The number of samples available for training is crucial. Indeed, many samples are needed so that animals learn to generalize and do not memorize each individual sample. Quantity is essential to work as often as possible with new (non-polluted) samples and to limit the “novel object preference”. Willis et al. (2011) report that their protocol also avoids the phenomenon of novel-object preference, whereby dogs preferentially choose unfamiliar items over familiar ones (103).
This is not straightforward, as organizing the logistics to gather samples continuously can be challenging. For instance, Gordon et al. (2008) report that it took longer than anticipated to obtain enough samples to prepare for the final testing. This resulted in the training being spread over an extended period of 12–14 months. Possibly, the animals were periodically memorizing individual patients rather than recognizing an “odour signature” for cancer, despite the large number of training samples. An ongoing system of recruitment of patients with cancer and control patients needs to be established so that the dogs have adequate numbers of new samples to maintain their proficiency even after the conclusion of the study. This was also reported by Ehmann et al. (2011), who wrote that during the training, and also later in the testing, every test tube containing a human breath sample was used only once to preclude simple memory recognition of participants’ unique odour signatures.
This need for a continuous supply of new samples is a major limitation. Indeed, if the method is intended to be implemented in countries with low access to diagnostics, the arrival of new samples from screened patients and controls will be limited. This implies continuous logistics and partnerships with hospitals that might not be cost-effective.
4.5.3. Field implementation
If scientifically validated, the implementation of remote scent medical detection will have to overcome several issues. First, if implemented in populations with low access to healthcare, such detection will only make sense if care can follow. As seen above, many known samples, from both patients and controls, are required to train animals. If the method is implemented in an area with low access to the gold standard test, sample recruitment might be compromised.
Routine adoption of such detection raises the question of the number of samples that can be screened every day and of its cost. From the studies reviewed, it seems that one dog, if efficient, could screen roughly a dozen new samples per day. Willis et al. (2016) report that only one new test was conducted per week, with training sessions in between, which is not efficient for mass screening. Rats, however, seem able to screen more samples, as reported by Weetjens et al. (2009): “The use of trained rats to detect tuberculosis is reliable, potentially cheaper and faster than sputum smear microscopy. One evaluation cage can contain more than 12 rats per day, and one rat can screen 140 samples in 40 minutes. The evaluation set-up can therefore process up to 1,680 samples per day, while a microscopist can process up to only a maximum of 40 samples per day (WHO recommends an average of 20 samples per day) (104, 105)”.
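The quoted throughput figures follow from a simple multiplication; a back-of-the-envelope sketch (the function name is hypothetical) reproduces them and the resulting gap with microscopy:

```python
def setup_capacity(rats_per_day: int, samples_per_rat: int) -> int:
    """Back-of-the-envelope model of the throughput quoted above: the
    evaluation cage rotates through rats_per_day rats, each screening
    one batch of samples_per_rat samples."""
    return rats_per_day * samples_per_rat

print(setup_capacity(12, 140))        # 1680 samples/day, as quoted
print(setup_capacity(12, 140) // 40)  # ~42x the 40 samples/day of one microscopist
```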
Another important consideration is the prevalence of the disease to be detected. Indeed, if very few positive samples are present, animals’ motivation and accuracy could decline. Hence the importance of training sessions with a regular supply of new, known positive samples.
As discussed in Sect. 4.2.2, such detection will be helpful only if the odour and/or the sampling location is specific to a shortlist of diseases. If not, then in the case of an alert, medical staff will not know what to look for.
Free-running rapid detection might be useful for infectious diseases. Free-running proofs of concept have been published for the detection of C. difficile infections, with encouraging results (74). However, such detection has not yet been proven to work in the field for other diseases. So far, published articles report successful proofs of concept in remote conditions (as for cancer). Free-running detection has recently been presented as an objective by several teams working on SARS-CoV-2 detection. For instance, Guest et al. (2021) report that their preparatory work indicates that two dogs could screen 300 people in 30 min, for example the time it takes to disembark from a plane, and PCR would only need to be used to test those individuals identified as positive by the dogs. However, no study has demonstrated such an application in real screening conditions, in contact with people in public places, so far. On a different disease, Maureen et al. (2018) report that despite being highly trained, dogs are vulnerable to distractions and other foreign stimuli in a unique social environment (106). Concerning their study, Essler et al. (2021) report that though dogs have previously been shown to be able to discriminate between saliva samples of SARS-CoV-2 positive and negative patients, these studies also used repeated presentations of the same samples. Thus, it is possible that dogs can discriminate between their training set of positive and negative patient samples but are unable to generalize this odour to new samples. These considerations are major limitations that preclude short-term implementation. However, this relatively new field of research is progressing quickly, and future studies may address and overcome these issues.
Finally, disease diagnostics can be expensive and complex to implement because of costly infrastructure and instruments, the need for consumables, and highly skilled professionals (MSc, PhD, MD). In this context, several teams claim that medical remote scent detection with animals might be cheap; however, this has yet to be proven. Cornu et al. (2011) report that in their proof-of-principle study, they tested a limited number of subjects in a costly, long study, which makes it difficult to conceive of extended use of this test in clinical practice (66). Similarly, Sonoda et al. (2011) declare: “it may be difficult to introduce canine scent judgement into clinical practice owing to the expense and time required for the dog trainer and dog education” (68). No socio-economic study on the subject was found.