A Study on the Agreement of Body Temperatures Measured by Infrared Cameras and Oral Thermometry

Background: The COVID-19 pandemic has led to the rapid adoption and rollout of thermal camera-based Infrared Thermography (IRT) systems for fever detection. These systems use facial infrared emissions to detect individuals exhibiting an elevated core-body temperature, which is present in many symptomatic presentations of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). Despite the rollout of these systems, there is little independent research supporting their efficacy. The primary objective of this study was to assess the precision and accuracy of IRT screening solutions in a real-world scenario. Methods and Findings: A single-centre, observational study investigated the agreement of three IRT systems with digital oral thermometer measurements of body temperature. Over five days, 107 measurements were taken from individuals wearing facial masks. During each entry, two measurements of the subject’s body temperature were made with each system to allow for the evaluation of measurement precision, followed by an oral thermometer measurement. Each participant also answered a short demographic survey. This study found that the precision of the IRT systems was wider than the 0.3 °C accuracy claimed by two of the systems. This study also found that the IRT measurements were only weakly correlated with the oral temperature measurements. Additionally, it was found that demographic characteristics (age, gender, skin colour, mask type) impacted the measurement error. Conclusions: This study indicates that using IRT systems in front-line scenarios poses a potential risk, where a lack of measurement accuracy could allow febrile individuals to pass through undetected. Further research is required into methods which could increase accuracy and improve the technique's viability.


Introduction
Core body temperature is one of the four key vital signs regularly assessed in healthcare settings, alongside respiration rate, blood pressure and heart rate 1 . In an in-patient setting, core body temperature can be assessed from different body locations using oral, rectal, tympanic or temporal artery thermometers, or even through urinary or pulmonary artery catheters with in-built temperature sensors 2 . The accuracy, precision, advantages and disadvantages of these temperature measurement devices in clinical settings have been well established 2-4 . In addition, in modern medical practice, every device must be assessed against national and international regulations. To ensure that a device meets appropriate levels of quality, accuracy and safety, strict medical equipment certification standards must be met by the device prior to its use in a clinical setting. The bodies administering these standards for medical devices include the Therapeutic Goods Administration in Australia, the Food and Drug Administration in the USA and the Medicines and Healthcare products Regulatory Agency in the UK 5 . The use of evidence-based assessment to evaluate technologies for use in a hospital setting is a common and critical part of modern healthcare, ensuring patient safety and assisting in the delivery of high-quality care 6 .
The COVID-19 pandemic has led to an unprecedented adoption and rollout of new fever detection technologies 7 . As fever is present in a significant proportion of symptomatic SARS-CoV-2 cases, the goal of the screening is to identify individuals exhibiting an elevated temperature, isolate them, and refer them for a more comprehensive assessment to a health practitioner 8 . Currently, fever screening technologies are typically installed in high-traffic areas, such as train-stations or airports, and also at the entrance of high-risk sites, such as hospitals, where the consequences of an outbreak could be catastrophic. Many of the deployed fever screening solutions have not yet been assessed by regulatory agencies.
The ideal screening technology must be accurate, rapid, widely available, and operate in a way that keeps both the test administrator and subject safe from viral transmission. In addition, an ideal solution would operate without consumables, given the global supply chain shortages of critical resources, such as consumables, personal protective equipment (PPE) and medical devices, which have been experienced during the COVID-19 pandemic 9 . When compared to this ideal screening technology, it is clear that traditional measurement techniques have a variety of limitations which make them poorly suited for use as mass-screening tools. Traditional hospital-grade, contact-based measurement techniques all require close proximity between the test administrator and the subject, and some methods are too invasive, too slow, or too expensive to be widely used. This has led to the increased adoption of infrared thermal detection systems for fever screening applications.
Infrared thermal detection systems operate through the measurement of thermal radiation emitted in the infrared wavelengths of the electromagnetic spectrum 10 . The thermal radiation is converted through a transducer to an electrical signal which can be interrogated and measured on board the device. In fever screening applications, detection systems fall into two main categories: handheld Non-Contact Infrared Thermometry (NCIT) devices, and Infrared Thermography (IRT) systems. These systems meet many of the criteria for a successful mass screening system: they are non-contact, require no consumables, are rapid, and, in the case of IRT systems, the operator can be physically distanced from the subject. While the use of NCIT devices has been explored in a hospital setting through a number of clinical trials and research articles, and many NCIT devices have achieved appropriate medical device approvals, these are not yet established for most IRT systems [11][12][13] .
Despite the widespread use of IRT systems, there remains limited independent evidence demonstrating their efficacy and accuracy when measuring body temperature for fever screening. Some clinical trials have been conducted, but the results have been mixed, and there is a lack of consensus in the literature on the effectiveness of IRT systems. A 2015 study in Singapore found that one system was able to achieve a high level of sensitivity and specificity (89.7% and 92%, respectively) 14 , and a similar result was found in a study conducted in the USA in 2010, which reported a sensitivity of 91.0% and a specificity of 86.0%. However, these results were not found to be broadly repeatable, as revealed by three experiments performed in Hong Kong and New Zealand between 2011 and 2013 [15][16][17] . The most recent of these studies reported a maximum sensitivity and specificity of only 64% and 86%, respectively, when measurements were taken in a comparative manner 15 . International standards such as ISO/TR 13154:2017 18 , which explain and outline current best-practice approaches, suggest a number of considerations to be taken into account when performing fever screening (e.g. measurement location, number of subjects who can be measured simultaneously, and recommended distance to subject). The existing studies did not report that any of the systems were installed according to these standards. This is likely because the finalised version of ISO/TR 13154:2017 was released after the studies were conducted.
As such, this study, conducted in Australia during the COVID-19 pandemic, evaluates the precision and accuracy of three different types of IRT system for human temperature measurements when installed in a real-world scenario with a mask-wearing population. This was performed to determine their efficacy as screening systems, when compared with a certified benchmark temperature measurement device commonly used in hospitals.

Method

Study Design
The investigation was designed as a single-centre observational study comparing the accuracy of three IRT systems to core-temperature measurements taken using certified oral thermometers. Additionally, the precision of each IRT system was determined through repeat measurements.

Setting
The study was conducted at a university over a five-day period in August 2020, from 9 AM to 5 PM each day.

Sampling and Eligibility Criteria
The study was performed on a community sample and used a convenience method for participant recruitment. The study aims, participation requirements, and methods of consent were provided in an email to staff and students in the building. Verbal consent was gathered on the day. These measures were taken to ensure that social distancing guidelines could be maintained during the study. Every employee and student who attended the building during the study period was invited to participate. Due to the nature of the facility, the participants were all over 18 years of age.

IRT Systems
Three IRT systems were selected for use which represented three of the main types of systems that are being sold in the Australian market. These were as follows:
• System 1 - A dual-camera system with a 40 °C external reference temperature device (blackbody), advertised to be able to measure up to 30 subjects simultaneously while performing facial recognition tasks.
• System 2 - A single camera system with laser-assisted autofocus which operates without a blackbody.
• System 3 - A single camera system with a 35 °C blackbody which is deployed in-line with the guidance provided in the ISO/TR 13154:2017 technical standard (apart from the guidance on masks) 18 .
The specifications for each system are included in Table 1.

Experimental Setup
Each of the three IRT systems was loaned to the researchers by the manufacturing companies for the purpose of conducting this experiment. To ensure the independence of this trial, each company provided the equipment free of charge and signed a research services waiver giving the researchers the right to
The conditions of the experimental environment were as follows:
• The area was not directly under any active Heating, Ventilating, and Air Conditioning (HVAC) system.
• The room in which the trial was located was temperature and humidity controlled.
• No lights or thermal radiation sources were directly in view of any of the cameras, and the cameras were not pointed at any reflective surfaces.
• Each IRT system was allowed 30 minutes of temperature stabilization each day prior to the first measurement being taken.
• The room had no direct entry from the outside; each doorway had an airgap which had to be traversed prior to entering the experimental area.
• Systems 2 and 3 were focussed at the beginning of the trial, and focus was checked four times per day (the systems remained in focus throughout the trial). The manufacturer of System 1 indicated that manual focussing was not required.
• Systems 2 and 3 were directly in-line with the subject's face. The manufacturer of System 1 indicated that this was not necessary.
• TV screens were set up for Systems 2 and 3 so that subjects could orient themselves within the camera targeting area. System 1 used a web-based application and a laptop to display the data.

Experimental Procedure
Every step of this experiment was conducted in strict accordance with social distancing guidelines between the researcher and the subject.
As a potential participant entered the building, they were approached by the researcher, who confirmed their knowledge of the study (information which had been provided through email) and asked for their verbal consent. If the participant consented, they verbally completed a questionnaire of demographic data (age, gender, skin tone and mask type) and were also asked about their current health status, in particular:
• "Have you experienced any fever symptoms in the last 24 hours?"
• "If yes, did you take any medications to treat your fever in the last 4 hours?"
• "Have you experienced any symptoms in the last 24 hours of sore throat, cough, runny nose, loss of taste or smell?"
• "Have you had a hot or cold drink in the last 30 mins?"
This data was entered into a case report form for each individual in a secure REDCap (Research Electronic Data Capture) database.
The participant was then asked whether they had been outside the building in the past 5 minutes. If so, they were asked to wait for 5 minutes to acclimatise. The purpose of this was to prevent readings from being impacted by the exterior environment. The participants were also asked to remove hats or glasses to ensure accurate readings.
All participants wore masks during the trial, apart from when using the oral thermometers. During the SARS-CoV-2 pandemic in Victoria, Australia, there was a state-wide mandate that all individuals must wear masks when outside their own homes, meaning that any screening technology used during this time would be required to operate with masked participants 19 .
Each participant was then asked to spend 5 seconds standing in front of each IRT system, facing directly into each camera at a distance of 1.5 m (indicated by a mark on the floor). This was then repeated. At the conclusion of these measurements (6 in total), the participant was provided with a DT-01B oral thermometer (measurement accuracy: ±0.1 °C [35.5 °C-42.0 °C]) and requested to stand in an isolated location (indicated by a mark on the floor), where they proceeded to take an oral thermometer reading.
The two readings from each IRT system and the single oral thermometer reading were then entered into the secure REDCap database case report form.

Analysis
The calculated.
An investigation into the measurement precision was then performed to analyse the difference between the first and second measurements of each IRT system and to determine if the claimed 0.3 °C of accuracy was observed in a multi-measurement precision test. This precision error was also reported as a boxplot to observe the distribution of the quartiles as well as to identify outliers. The frequency distribution of the precision error was then calculated and reported as mean and standard deviation.
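The repeat-measurement precision analysis described above can be illustrated with a minimal sketch. It assumes the precision error is defined as the second reading minus the first for each participant; the function name `precision_summary`, the `tolerance` parameter, and the sample readings are hypothetical and are not taken from the study data.

```python
from statistics import mean, stdev


def precision_summary(first, second, tolerance=0.3):
    """Summarise the repeat-measurement differences for one IRT system.

    `first` and `second` are paired readings (°C) of the same participants.
    The precision error is assumed to be second minus first.
    """
    if len(first) != len(second):
        raise ValueError("first and second readings must be paired")
    diffs = [b - a for a, b in zip(first, second)]
    # Count the pairs whose difference stays inside the claimed tolerance.
    within = sum(1 for d in diffs if abs(d) <= tolerance)
    return {
        "mean_diff": mean(diffs),
        "sd_diff": stdev(diffs),  # sample standard deviation
        "max_abs_diff": max(abs(d) for d in diffs),
        "fraction_within_tolerance": within / len(diffs),
    }


# Hypothetical paired readings (°C), for illustration only.
first_readings = [36.4, 36.7, 36.1, 36.9, 36.5]
second_readings = [36.6, 36.2, 36.3, 36.8, 37.1]
summary = precision_summary(first_readings, second_readings)
print(summary)
```

An ideal system would give a mean and standard deviation of the difference close to 0 °C, with every pair falling inside the tolerance.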
The measurements of the IRT systems were then compared against the reading of the oral thermometer, and the mean and standard deviation were calculated. The Pearson's correlation coefficient (ρ) was also calculated to determine the correlation between the measurements. The thermal camera measurements were then fitted to a linear model against the oral thermometer readings, and the coefficient of determination for each system was calculated. The error of each measurement for the three systems was then calculated. Finally, the accuracy was assessed in relation to each of the demographic attributes (age, gender, skin colour, mask type): the mean and standard deviation of the error were calculated, and tests of statistical significance between the systems were calculated using Welch's t-test (as the sample sizes were unequal 20 ).
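As a rough illustration of the accuracy and correlation calculations, the sketch below computes the measurement error, Pearson's correlation coefficient, a least-squares linear fit, and the coefficient of determination for one system. It assumes the error is defined as the IRT reading minus the oral reading; the function names and the sample values are hypothetical and do not come from the study data.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson's correlation coefficient (written as ρ in the text)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def accuracy_summary(oral, irt):
    """Compare one IRT system against the oral reference readings (°C)."""
    errors = [i - o for o, i in zip(oral, irt)]  # assumed sign: IRT minus oral
    r = pearson_r(oral, irt)
    slope, intercept = linear_fit(oral, irt)
    return {
        "mean_error": mean(errors),
        "sd_error": stdev(errors),
        "r": r,
        "r_squared": r ** 2,  # for a one-predictor linear fit, r² equals r squared
        "slope": slope,
        "intercept": intercept,
    }


# Hypothetical readings (°C), for illustration only.
oral = [36.5, 36.8, 37.0, 36.6, 37.2]
irt = [35.0, 35.3, 35.2, 34.9, 35.6]
result = accuracy_summary(oral, irt)
print(result)
```

A system can have a small mean error and still track core temperature poorly, which is why the mean error and the correlation are reported separately.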

Ethical Considerations
This study was approved by the Deakin Human Ethics Advisory Group (Health) (approval number: HEAG-H 154-2020). All data was anonymised before being stored in a secure double identifier password protected REDCap (Research Electronic Data Capture) database administered by Deakin University with access limited to the study investigators.

Participant Characteristics
Over the five-day period of the study, a total of 107 measurements were taken from the participating subjects within the building. As individuals were able to take part in the study on subsequent days, the number of unique participants was seventy-one.

IRT System Precision Test
The first test performed on the IRT systems was to determine if the systems were able to achieve a precision of within ±0.3 °C through a repeated measurement precision test. The results are shown in Fig 2 and Table 3. As each participant in the study was measured twice by each system within 30 seconds, the ideal system would have had a temperature difference of 0 °C between the two measurements.

IRT System Accuracy and Correlation Test
The second test compared the IRT system readings against the oral thermometer to determine how well the readings from each system correlated with core body temperature. While not required to be exact, an accurate system for detection of elevated body temperature should have a strong degree of correlation with the oral thermometry readings, with an increase in core temperature resulting in an increased IRT measurement. The results of this test were tabulated and are displayed in Table 3. The oral thermometer measurements were similar to those found in previous studies 3,4 . Firstly, there were significant differences between the IRT system and oral thermometer measurements, with the mean difference being 0.26 °C for System 1, -1.90 °C for System 2, and -1.31 °C for System 3. In addition, the correlation coefficients calculated were weak (<0.5), with System 3 being the most strongly correlated (ρ = 0.49). The three IRT systems also presented only weak coefficients of determination (r²), as shown in Table 3, with the strongest being System 3 (r² = 0.24) and the weakest being System 1 (r² = 0.11).

Participant Characteristics and IRT System Error
In addition to examining the participants in aggregate, a set of analyses was also performed to determine if any of the recorded demographic attributes had an impact on the measurement error experienced by the IRT systems compared to the oral thermometer. For each of the recorded characteristics (age, gender, skin colour and mask type), the mean and standard deviation of the error and p-values were calculated, using Welch's t-test (as the sample sizes were unequal 20 ). These results were tabulated and are displayed in Table 4. From the data, it can be seen that the IRT systems' errors were impacted by the demographic factors; however, the systems were not all impacted by the same demographic factors.
System 1 was impacted by the age of the subject as well as by the mask type. Systems 2 and 3 were impacted by the participants' gender and skin colour.
When validated against external temperature reference devices (blackbodies), System 2 was found to report the correct temperature within 0.1 °C, demonstrating its high degree of measurement accuracy.
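The subgroup comparisons in this section rely on Welch's t-test, which does not assume equal variances or equal group sizes. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library; the function name `welch_t` and the two groups of error values are hypothetical, not study data. Obtaining a p-value additionally requires the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean (sample variance / n).
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical measurement errors (°C) for two demographic subgroups
# of unequal size, for illustration only.
errors_group_a = [-1.8, -2.1, -1.6, -2.0, -1.9, -2.3]
errors_group_b = [-1.5, -1.2, -1.7, -1.4]
t_stat, dof = welch_t(errors_group_a, errors_group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The degrees of freedom are generally not an integer and never exceed the pooled value `len(a) + len(b) - 2`.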

Discussion
IRT systems are being installed in a wide variety of locations worldwide in response to the COVID-19 pandemic. Yet, there is limited evidence available in the literature reporting on their accuracy or efficacy in these application spaces, in particular on how they correlate to core-body temperature measurements.
In order to address this research gap, the current study investigated the use of three different IRT systems from different manufacturers in a real-world setting, with a community mask-wearing population.
Firstly, the precision of each of the three systems was determined through repeated measurements taken a short time apart (<30 s). Each of the systems was found to have a precision wider than the 0.3 °C accuracy claimed by two of the systems. Secondly, in regard to measurement accuracy, our results suggest that the IRT systems each experience a deviation from the core temperature measurements.
Thirdly, this study has shown that in our sample, the IRT system measurements have only a weak correlation with the oral temperature measurements, with System 3 having the highest ρ and r², with values of 0.49 and 0.24, respectively. Fourthly, the participant characteristics were associated with changes in the error between the oral measurements and the IRT system measurements; however, these were not consistent across all three IRT systems.
Some existing studies have found IRT systems to be sensitive and specific in assessing the febrile status of subjects 14,21 . These studies were also conducted with participants >18 years of age and with multiple IRT systems. However, neither of these studies was conducted with a mask-wearing population, nor with IRT systems installed at the first stage of entry to a building as in our study, which is a common use-case for these systems in hospitals and workplaces in 2020. In addition, the study by Nguyen et al. 21 , which found the IRT systems to be sensitive and specific, was performed on individuals after they had been registered in the emergency department, which may have allowed for an extended period of acclimatisation time when compared to a regular building entry.
Our study is the only recent study reporting on the repeated-measurement precision of IRT systems. This is a significant result, as it describes the repeatability of each system when measuring the same person multiple times. From our results, it seems unlikely that a 0.3 °C level of accuracy would be achievable with such a wide measurement precision in each of the systems. Indeed, all three of the systems reported measurements with differences of >0.5 °C on the same individuals only 30 seconds apart.
The research conducted by Ghassemi et al. 22 , and the ISO/TR 13154:2017 standard suggest that using an external reference device (blackbody) increases the accuracy of the measurements. Our study results show that this is a good recommendation; System 3, which used a blackbody reference, had a mean difference 0.59 °C closer to the core temperature measurement than System 2, which used a similar camera without a blackbody. In addition, System 3 was found to have a greater correlation to core temperature than System 2, with their respective correlation coefficients being 0.491 and 0.407.
The current ISO standards also recommend that measurements be taken from the inner canthus of the eye rather than the forehead or the face in general 18,22 . Our study found that System 1, which was not specifically taking measurements from the inner canthus, in fact reported the measurements closest to the core body temperature measurements (mean difference of +0.26 °C).
This was an unexpected result, as the literature suggests that the facial skin temperature is generally expected to be lower than the core-body temperature 16,23 . This suggests that System 1 is employing a correction algorithm on the measurement results, which may shift the measurements into a more "acceptable" range. Additionally, the measurements from System 1, which were not taken from the inner canthus, were found to have the smallest coefficient of determination of any of the systems with respect to the core body temperature (r² = 0.11). This suggests that the use of the algorithm does not significantly improve the efficacy of whole-face temperature measurement, and that the current guidance is correct in recommending that measurements be taken from the inner-canthus region rather than from the face in general.
The correlation found in our study (maximum being ρ = 0.49) is generally in agreement with the existing literature, which reported values between 0.3 and 0.5 for well-functioning systems 15,21,24 . However, the study by Chan et al. 15 found that the correlation is higher among febrile populations with core temperatures ≥38 °C, which we were not able to include in our study, so the correlation may have been higher in a population including a large cohort of febrile individuals.
The experiment involving the validation of System 2 against external temperature reference devices (blackbodies) demonstrated that this system is capable of measuring emitted thermal radiation with a high degree of accuracy in the experimental environment. When measuring these near-ideal sources, System 2 reported a measurement value within 0.1 °C of the expected temperature, which is within the margin of error of the reference device. This suggests that the measurement error observed in experiments with human subjects is likely due to the physiological link between core-temperature and facial temperature, rather than inherent technological error.
Our study also found that demographic characteristics had a significant impact on the measurement error of the systems; however, this was not consistent across the three IRT systems. System 1, which measured the whole-face temperature, exhibited an increased error on subjects ≥40 years of age and on those who were wearing thinner masks. Systems 2 and 3, which measured the inner canthus of the eye, had an increase in error on subjects who identified as female and on those with darker skin. Other studies have found gender and age to be factors which impact measurement accuracy 15,21 . To the authors' knowledge, this is the first IRT study which has reported skin tone and face-mask type as demographic factors and investigated their impact on measurement error. Our earlier study using NCIT devices in a hospital setting found that these demographic factors (age, skin tone and gender) also impacted measurement results, so it appears that there is scope for future research to determine their precise impact on the performance of infrared measurement systems in fever-screening scenarios 12 .

Strengths and Limitations
To the authors' knowledge, this is the first study of this type to be conducted on a mask-wearing population; mask-wearing was a mandated intervention in Victoria, Australia during parts of the COVID-19 pandemic. The limitations of the study were as follows: this was a convenience-selected community sample with no febrile subjects; there were no subjects >65 years of age; and the duration of the camera loan agreements and facility agreements dictated a one-week timeframe, which restricted the sample size. Additionally, the variations in the height of individuals impacted the measurement process, as some subjects had to bend, lean, or use chairs in order to be in focus in the measurement. The inclusion of an NCIT device could have allowed for the exploration of the source of the measurement error. Finally, the lack of febrile individuals made it impossible to assess the sensitivity and specificity for individual febrile detection.

Future Work
There is a clear need to expand this study into a setting with more febrile individuals to allow for the assessment of the sensitivity and specificity of fever detection. However, the low correlation values found between each of the measurement sources raise doubt about the efficacy of these systems at detecting individuals with low-grade fevers (37.5-38 °C). In addition, there is clear scope to perform further investigations into the impact of age, gender and skin tone on the measurement results. Finally, improving the precision of these measurement techniques for mass screening remains an open avenue of investigation.
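For reference, sensitivity and specificity at a given screening threshold could be computed along the lines of the following minimal sketch. The 37.5 °C thresholds are an assumption based on the lower bound of the low-grade fever range quoted above, and the function name and readings are hypothetical. When the sample contains no febrile subjects, as in this study, sensitivity is undefined and the function returns `None` for it.

```python
def sensitivity_specificity(reference, screened,
                            ref_threshold=37.5, screen_threshold=37.5):
    """Classify each subject and return (sensitivity, specificity).

    `reference` holds the benchmark readings (°C, e.g. oral thermometer) and
    `screened` the paired screening readings (°C, e.g. an IRT system).
    """
    tp = fn = tn = fp = 0
    for ref, scr in zip(reference, screened):
        febrile = ref >= ref_threshold      # true status from the benchmark
        flagged = scr >= screen_threshold   # decision of the screening system
        if febrile and flagged:
            tp += 1
        elif febrile:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Hypothetical paired readings (°C), for illustration only.
reference = [36.6, 38.1, 37.0, 37.8, 36.9, 38.4]
screened = [36.4, 37.9, 37.6, 37.1, 36.5, 38.0]
sens, spec = sensitivity_specificity(reference, screened)
print(sens, spec)
```

In practice the screening threshold would likely need to differ from the reference threshold, since facial readings tend to sit below core temperature.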

Conclusion
This paper presented the first study assessing the capabilities of IRT systems in a face-masked population in a real-world mass screening scenario. The systems were tested outside of a hospital setting, at the entrance to a research facility within a building, mimicking the installation scenario of many currently operating IRT systems. Our results show that using the systems as a front-line intervention for fever screening poses a potential risk, where the lack of measurement repeatability could negatively impact sensitivity and specificity, possibly allowing febrile individuals to pass through undetected.
Although these systems are currently seeing widespread use due to the COVID-19 pandemic, our results show that further research is required to improve their precision and accuracy so that users can be confident in their operation. There remains an opportunity for new technology to fill this gap.