In an extensive clinical study, we have evaluated the use of IRTs under standardized conditions and collected a wide range of data on facial temperatures and their correlation to oral measurements. These data have yielded valuable insights into IRT-based temperature estimation and fever detection capabilities and the factors that impact system performance.
4.1 Thermographic screening accuracy and standardization
This study was largely based on the two international consensus documents described above23,24 – IEC 80601-2-59 and ISO/TR 13154. The guidance provided by these publications helped ensure that the devices used in this research had a high level of image quality and that the acquisition methods – including instructions to subjects – were optimized to enable accurate measurements. The optimal approaches identified in our study produced results that were equal to or better than most prior relevant works in terms of absolute agreement with, and correlation to, reference measurements, as well as discrimination of febrile subjects.
Our findings showed that the differences between 𝑇𝑟𝑒𝑓 and temperatures of different facial regions were in the ranges of 1.6-2.8 °C for the forehead region, 1.4-2.4 °C for the inner canthi regions, 1.7-1.8 °C for the mouth region and 1.2-1.3 °C for the maximum face temperature. These differences are smaller than those reported by Nguyen et al., who showed differences in the range of 2.1-8.7 °C between 𝑇𝑟𝑒𝑓 and facial maximum temperature for three IRTs17; similarly, Chan et al. showed forehead temperature differences of 3.0 °C and 3.9 °C for febrile and non-febrile subjects, respectively18. Our results also showed strong correlations between IRT-measured temperatures (𝑇𝐼𝑅𝑇) and 𝑇𝑟𝑒𝑓, with both IRTs producing r values as high as 0.75-0.80. These values are much higher than those of several prior studies2,17,39, which found r values between IRT and oral temperatures of no greater than 0.45. Scatter plots of 𝑇𝐼𝑅𝑇 vs. 𝑇𝑟𝑒𝑓 provided in prior studies, such as Chan et al.18, also do not show the strong linear trends seen in our 𝑇𝑚𝑎𝑥 and 𝑇𝐶𝐸𝑚𝑎𝑥 data (Figure 5). It is likely that this improvement in correlation is due to control methods that help to reduce measurement variability, including stability correction with a blackbody, reduction of confounding environmental factors, multi-frame averaging, and the use of canthi regions in thermal images.
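The two agreement metrics discussed above, the mean difference between reference and IRT readings and the Pearson correlation coefficient r, can be illustrated with a minimal sketch. The function names and the paired readings below are hypothetical and chosen only for illustration; they are not data from this study.

```python
import math

def mean_difference(t_ref, t_irt):
    """Mean of (T_ref - T_IRT) over paired readings, in degrees C."""
    return sum(ref - irt for ref, irt in zip(t_ref, t_irt)) / len(t_ref)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical paired readings (oral reference vs. IRT facial temperature).
t_ref = [36.6, 36.9, 37.2, 37.6, 38.1, 38.7]
t_irt = [35.3, 35.8, 35.9, 36.4, 36.8, 37.5]

print(f"mean difference: {mean_difference(t_ref, t_irt):.2f} C")
print(f"Pearson r:       {pearson_r(t_ref, t_irt):.2f}")
```

A small mean difference with a low r can still give poor screening performance, which is why both quantities are reported side by side.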
Strong temperature correlations enabled discrimination between febrile and afebrile subjects to a high degree of accuracy. For a low-grade fever diagnostic threshold of 37.5 °C, 𝑇𝑚𝑎𝑥 data produced AUC values of 0.95-0.97 and Se/Sp values in the 0.85-0.95 range. For a diagnostic threshold of 37.8 °C, Se/Sp values increased to 0.93-0.95. These results for relatively low-grade fever detection, as well as findings at higher diagnostic thresholds shown in Table 5, compare favorably with the literature. In a study of airport travelers, Priest et al. found an AUC of 0.71 (Se/Sp = 0.86/0.71) for a fever threshold of 37.5 °C using full-face maximum temperatures20. Nishiura et al. estimated AUC values of 0.79 and 0.75 for threshold temperatures of 37.5 °C and 38.0 °C, respectively2. Nguyen et al. compared IRT performance for fever screening using images of the face and neck, with 37.8 °C as the fever threshold17. In that study, AUC values of 0.96 and 0.92 were found for two IRTs, yet the corresponding r values of 0.43 and 0.42 do not appear sufficient for high accuracy measurements. Hewlett et al. obtained AUC values of 0.86 and 0.90 for fever thresholds of 37.8 °C and 38 °C, but did not report r values or results for 37.5 °C19. These comparisons provide substantial evidence that an approach based largely on adherence to recently published standards has the potential to advance IRT-based fever screening capability.
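For readers less familiar with these metrics, the following sketch shows one common way to compute Se, Sp and AUC from paired reference and IRT readings. It assumes that a subject is febrile when the reference temperature is at or above the diagnostic threshold and is flagged when the IRT reading is at or above a separate IRT cut-off. The data, function names and cut-off value are hypothetical and do not reproduce the analysis used in this study.

```python
def sensitivity_specificity(t_ref, t_irt, ref_threshold, irt_cutoff):
    """Return (Se, Sp) for a given reference threshold and IRT cut-off."""
    tp = fn = tn = fp = 0
    for ref, irt in zip(t_ref, t_irt):
        febrile = ref >= ref_threshold   # assumption: ">=" defines fever
        flagged = irt >= irt_cutoff
        if febrile and flagged:
            tp += 1
        elif febrile:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

def auc(t_ref, t_irt, ref_threshold):
    """AUC as the probability that a febrile subject has a higher IRT
    reading than an afebrile subject (ties count as one half)."""
    pos = [irt for ref, irt in zip(t_ref, t_irt) if ref >= ref_threshold]
    neg = [irt for ref, irt in zip(t_ref, t_irt) if ref < ref_threshold]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical paired readings in degrees C.
t_ref = [36.5, 36.8, 37.0, 37.2, 37.6, 37.9, 38.3, 38.8]
t_irt = [35.2, 35.6, 35.5, 36.1, 35.9, 36.4, 36.8, 37.2]

print(auc(t_ref, t_irt, ref_threshold=37.5))                      # 0.9375
print(sensitivity_specificity(t_ref, t_irt, 37.5, irt_cutoff=36.0))  # (0.75, 0.75)
```

The AUC is independent of the IRT cut-off, whereas Se and Sp change as the cut-off is moved, which is why both are reported for each diagnostic threshold.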
4.2 Comparison of facial temperatures
The 17 facial temperatures extracted from each subject’s thermal image can be categorized by facial region (forehead, canthi, mouth, entire face) or by measurement location selection method (fixed-location vs. maximum value of a defined region). Analyzing our extensive clinical testing results provided insight into key trends and potential approaches for optimizing IRT-based fever screening.
IRT system performance was highly dependent on measurement location, with the forehead producing lower accuracy than canthi regions. Temperatures determined from five fixed locations on the forehead (𝑇𝐹𝐶, 𝑇𝐹𝑇, 𝑇𝐹𝐵, 𝑇𝐹𝐿 and 𝑇𝐹𝑅) had relatively low correlations (r < 0.50) with 𝑇𝑟𝑒𝑓 and larger pairwise differences. Fixed locations in the canthi region showed moderately strong correlations (r values of 0.51 to 0.63) with 𝑇𝑟𝑒𝑓 and their pairwise differences from 𝑇𝑟𝑒𝑓 were also relatively small. Similarly, the maximum-value data for canthi regions showed better performance than the forehead or mouth regions. This result aligns with a prior comparison of IRT-based eye and forehead measurements43. In our study, the maximum value of the entire face (𝑇𝑚𝑎𝑥) provided better performance than that of the forehead (𝑇𝐹𝐸𝑚𝑎𝑥) in terms of correlation and fever detection; this finding is consistent with a prior study that compared maximum temperatures from the full face and forehead (r values of 0.43 and 0.36, respectively)18. Differences in performance between the forehead and inner canthi are likely due to perfusion of the canthi from the internal carotid (ophthalmic) artery, proximity to large vessels and relatively thin skin32, whereas the forehead is more diffusely perfused and susceptible to convective and evaporative cooling43,44. These findings may shed light on the poor sensitivity values found in some NCIT studies14.
Overall, the maximum value in a region showed better diagnostic performance and correlation with 𝑇𝑟𝑒𝑓 than the value at a fixed location within this region, with the greatest 𝑟 values and AUC values for 𝑇𝑚𝑎𝑥, followed by 𝑇𝐶𝐸𝑚𝑎𝑥 and then 𝑇𝐹𝐸𝑚𝑎𝑥. Prior studies have also found that maximum-value approaches tended to provide greater performance18. 𝑇𝐶𝐸𝑚𝑎𝑥 and 𝑇𝑚𝑎𝑥 yielded similar 𝑟 values and statistically equivalent AUC values, as well as significantly higher AUC values than 𝑇𝐹𝐸𝑚𝑎𝑥. Interestingly, Figure 5 shows that unlike the relatively tight cluster of normal-range data points (𝑇𝑟𝑒𝑓 = 36.4-37.4 °C), data for 𝑇𝐶𝑚𝑎𝑥1 exhibits a tail extending to lower IRT-measured values than other datasets. This feature is also present in the few scatter plots that have been published from clinical IRT data18,43. Additionally, we found that individual hairs on the forehead degraded accuracy. The improved performance observed for maximum region temperatures may be due in part to subject-to-subject variations in facial anatomy and physiology that cause unpredictable nonuniformity in spatial temperature distribution. Taking the maximum temperature of a region affords greater robustness to such variations.
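The difference between a fixed-location reading and a region-maximum reading can be illustrated with a minimal sketch. It assumes a thermal image stored as a nested list of temperatures and a rectangular region of interest. The patch values, the cold pixel standing in for a strand of hair, and the function names are hypothetical and serve only to show why a regional maximum is less sensitive to local cold spots.

```python
def fixed_location(image, row, col):
    """Temperature at a single predefined pixel."""
    return image[row][col]

def region_max(image, top, bottom, left, right):
    """Maximum temperature inside a rectangular region.
    Row and column bounds are inclusive at the start, exclusive at the end."""
    return max(image[r][c]
               for r in range(top, bottom)
               for c in range(left, right))

# Hypothetical 4x5 forehead patch in degrees C; the 31.0 pixel mimics a
# strand of hair lying over the predefined measurement point.
patch = [
    [34.1, 34.3, 34.2, 34.0, 33.9],
    [34.2, 34.6, 31.0, 34.4, 34.1],
    [34.0, 34.5, 34.7, 34.3, 34.0],
    [33.8, 34.1, 34.2, 34.0, 33.7],
]

print(fixed_location(patch, 1, 2))    # 31.0
print(region_max(patch, 0, 4, 0, 5))  # 34.7
```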
As noted above, approaches involving the inner canthi or maximum-temperature locations provided higher levels of performance. Therefore, it is not surprising that 𝑇𝐶𝐸𝑚𝑎𝑥 – which involves both of these features – provided one of the best options of the 17 temperatures tested. The finding that 𝑇𝑚𝑎𝑥 provided slightly better performance than 𝑇𝐶𝐸𝑚𝑎𝑥 is a more unexpected result, because the full-face maximum was not advocated in ISO/TR 13154 as a “robust measurement site”, as the inner canthi were. However, this approach has been used in a number of prior studies17,18,20, likely due to its combination of simplicity and effectiveness. These prior studies achieved relatively high Se/Sp values (0.7-0.9) using this approach. In part, this effectiveness stems from the fact that the inner canthi are a key thermal feature in full face images, as discussed in the following section. In spite of these benefits, there may be unresolved challenges related to the use of 𝑇𝑚𝑎𝑥, such as confounding physiological factors like sinusitis that impact temperature distributions24.
4.3 Distribution of thermal maxima in full face images
In order to better understand the results obtained with 𝑇𝑚𝑎𝑥, we evaluated the distribution of locations where maximum temperatures occurred over 3252 thermal images collected by the two IRTs from the first round of measurements. The locations of thermal maxima in full facial images are summarized in Figure 7 and Table 10. According to Table 10, thermal maxima most commonly appeared (59.5%) in the inner canthi region, followed by oral (21.7%), forehead (8.8%), nasal (4.1%) and temporal (3.6%) regions. The predominance of inner canthi maxima is expected given what is known regarding perfusion in this region. A relatively large fraction of maxima occurred in the oral region, likely due to perfusion from the facial artery, which is closer to the external carotid artery than the vessels that perfuse most facial regions. The forehead maxima were typically along the hairline, likely due to the thermal insulation effect of hair. Some thermal maxima appeared in the temporal region, likely due to the superficial temporal arteries. It was unexpected to find maxima in the nasal/nostril region (bottom); whether these are due to some pathology such as sinusitis24,45 or perhaps exhalation of warm air is not currently known.
Table 10 Spatial distribution of facial temperature maxima

| Region | Number | % |
| --- | --- | --- |
| Inner canthi | 645 | 59.5 |
| Oral (closed) | 235 | 21.7 |
| Forehead (hairline) | 95 | 8.8 |
| Nasal | 44 | 4.1 |
| Temporal | 39 | 3.6 |
| Neck | 17 | 1.6 |
| Other | 9 | 0.8 |
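As a consistency check, the percentages in Table 10 can be recomputed from the tabulated counts. The short sketch below uses only the counts listed in the table; the variable names are illustrative.

```python
# Counts of thermal maxima per facial region, as listed in Table 10.
counts = {
    "Inner canthi": 645,
    "Oral (closed)": 235,
    "Forehead (hairline)": 95,
    "Nasal": 44,
    "Temporal": 39,
    "Neck": 17,
    "Other": 9,
}

total = sum(counts.values())  # 1084
percentages = {region: round(100 * n / total, 1)
               for region, n in counts.items()}

for region, pct in percentages.items():
    print(f"{region:<20} {counts[region]:>4} {pct:>5.1f}")
```

The tabulated counts sum to 1084, and the recomputed percentages match the values given in the table to one decimal place.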
4.4 Quality of a thermographic screening system
The IEC 80601-2-59 standard23 defines a screening thermograph (ST) as a system composed of an IRT and an external temperature reference source (usually a blackbody with known temperature and emissivity), and in some cases, a computer and software for data acquisition, processing and storage. Therefore, most results in this paper, except for the data in Section 3.4, are technically results not of two thermal cameras (IRT-1 and IRT-2) but of two fever screening systems, (IRT-1 + BB) and (IRT-2 + BB). We evaluated these two systems in our previous work28 and found that their uniformity, stability, drift, minimum resolvable temperature difference, and laboratory accuracy all satisfied the standard requirements.
The use of a blackbody for temperature compensation had a moderate impact on IRT screening ability. In our prior study28, such compensation vastly improved the stability of IRT-2, and enabled the system to meet IEC 80601-2-59 performance specifications. In the current study, however, the use of 3-frame averaging may have improved stability to the point where the blackbody was no longer critical, except to measure and compensate for long-term drift. If no frame averaging is used, the use of a blackbody would likely be more critical for fever detection. Additionally, the current study was executed in an environment with relatively stable ambient temperature; it is likely that in a less controlled screening location with larger, more rapid thermal fluctuations, blackbody compensation would be more important.
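The interplay between frame averaging and blackbody compensation can be sketched as follows. This is an illustration under stated assumptions, not the processing chain used in this study: it assumes that compensation is a simple additive offset equal to the blackbody set point minus the IRT reading of the blackbody, and that frames are averaged pixel by pixel. All names and values are hypothetical.

```python
def average_frames(frames):
    """Pixel-by-pixel mean of a list of equally sized frames."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]

def blackbody_offset(bb_reading, bb_setpoint):
    """Additive correction (assumed model): set point minus IRT reading."""
    return bb_setpoint - bb_reading

def compensate(image, offset):
    """Apply the same additive offset to every pixel."""
    return [[px + offset for px in row] for row in image]

# Three hypothetical 1x2 frames in degrees C, showing frame-to-frame noise.
frames = [[[34.0, 35.0]], [[34.5, 35.5]], [[35.0, 36.0]]]
averaged = average_frames(frames)

# Hypothetical case: the IRT reads the 35.0 C blackbody as 34.6 C.
offset = blackbody_offset(bb_reading=34.6, bb_setpoint=35.0)
corrected = compensate(averaged, offset)
print(averaged, round(offset, 2), corrected)
```

In this simplified picture, averaging suppresses short-term noise, while the offset removes a slowly varying bias, which is consistent with the blackbody mattering mainly for long-term drift when averaging is used.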
While inherent IRT instrumentation quality is critical, performance also depends on effective implementation. Control methods such as an absorbing background, multiple-frame imaging, and thermally stable, forward-facing subjects help to reduce confounding factors. Given that many of these confounding factors have been addressed in our study, the results presented here likely indicate a best-case performance level. As the control methods we have implemented are removed – which may be necessary in certain real-world screening situations – it is likely that performance will degrade. The degree to which removal of any specific control will impact results is beyond the scope of the current study but may be important for predicting real-world performance.
4.5 Fever screening during an epidemic
The primary purpose of this study was to facilitate the implementation of IRT systems and practices that enable optimal measurement accuracy and highly effective fever screening during epidemics. However, achieving effective screening can be a complex process, as many factors need to be addressed beyond the physics, instrumentation and acquisition procedures. While our results showed that some facial temperatures had good discrimination abilities with high AUC values, some previous literature claimed that thermography was not highly effective for fever screening during disease outbreaks1,2,46,47. This may have been due to device instability7,11,32,34,48, inappropriate temperature reading locations, or nonstandard calibration and environmental controls28.
The frequency at which fever presents as a symptom is another impediment to successful screening. In the current COVID-19 outbreak, many of those infected are largely asymptomatic and only 73% have exhibited a fever 3; in 2009, only half of H1N1 outbreak cases had temperatures of ≥37.8 °C8; and a 2011 study indicated that none of the 30 subjects identified as being flu-infected had a temperature of 37.8 °C or greater, and only two had a temperature of 37.5 °C20. Therefore, while IRT-based screening can detect individuals with elevated temperature, it is not a viable stand-alone tool for screening for individuals infected with specific diseases49. It may play an adjunct role along with other screening evaluations. Since fever is only one common symptom of infectious disease, an effective screening process should include evaluation of a range of symptoms3,7,8. The future development of an integrative screening system may include thermography along with optical imaging approaches for evaluation of vital signs such as heart and respiration rate as well as other physiological parameters8.
Given that 𝑇𝐶𝐸𝑚𝑎𝑥 and 𝑇𝑚𝑎𝑥 provided the best performance, it is worth considering issues that might influence the decision to implement one approach or the other. Acquiring a full-face region for calculation of 𝑇𝑚𝑎𝑥 would likely be easier to accomplish and could be performed more reliably than determining 𝑇𝐶𝐸𝑚𝑎𝑥 via auxiliary visible light imaging and computationally intensive techniques for co-registration of inner canthi regions. This may be particularly important in a high throughput situation where delays due to computer processing or image co-registration errors could become highly inconvenient. However, implementing an approach that blindly determines the maximum temperature from a full-face thermal image may increase the need to identify confounding pathological/physiological conditions such as sinusitis24,45,50. To accomplish this task rapidly and effectively may require significant screener training, although automated approaches (e.g., deep learning algorithms) could also be developed to augment or replace manual assessments.
Another practical challenge involves the identification of an appropriate reference temperature diagnostic threshold, given the diversity of values that have been implemented. The Centers for Disease Control and Prevention (CDC) has recommended the use of 38 °C51, whereas prior human subject studies have been based on 37.5 °C7,39,40, 37.6 °C52, 37.7 °C32,48, 37.8 °C17,19,20, and 38 °C35. Different thresholds have been used for different outbreaks, such as ≥38 °C for SARS1 and 37.7 °C for adults and 37.9 °C for children in an H1N1 study8. The literature indicates that as the threshold temperature decreases, diagnostic accuracy typically degrades. Uncertainty in normal body temperatures – which can be influenced by gender, age, physical exertion and other factors – can further increase error in screening tasks15,35,39,53. Additionally, a recent study indicates that normal body temperature has decreased on average since the establishment of the 37 °C threshold 150 years ago54. In spite of these obstacles, our results indicated that IRT systems are capable of detecting low-grade fever (37.5 °C) in subjects, which could mean that early-stage infections and those producing only moderate symptoms could be more readily identified. The significance of this ability is demonstrated by the fact that of the subjects with 𝑇𝑟𝑒𝑓 values over 37.5 °C in our study, 60% would not have exceeded the CDC recommended diagnostic threshold of 38 °C.
Even if a suitable diagnostic threshold for fever based on body temperature can be defined, determining the IRT cut-off temperature for fever screening requires a variety of considerations. While we calculated the cut-off temperature that optimized both Se and Sp, this may not represent an optimal value for real-world use. For a severe disease, lower cut-off values may be needed to minimize false negatives in primary screening. Given the typically low prevalence of diseased individuals in a screening population, the proportion of false positives among those flagged in primary screening will be high (and thus the positive predictive value low). On the other hand, it may also be important to balance the burden on the population being screened (e.g., travel delays) and screening personnel (e.g., workload, fatigue, cost to health agencies)17,39.
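The effect of low prevalence on positive predictive value (PPV) follows from Bayes' rule and can be illustrated with a short sketch. The Se and Sp values of 0.90 lie within the range reported above, but the prevalence values are hypothetical and chosen only to show the trend.

```python
def positive_predictive_value(se, sp, prevalence):
    """PPV = true positives / all positives, from Se, Sp and prevalence."""
    true_pos = se * prevalence
    false_pos = (1.0 - sp) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical prevalence values for a screening population.
for prevalence in (0.10, 0.01, 0.001):
    ppv = positive_predictive_value(se=0.90, sp=0.90, prevalence=prevalence)
    print(f"prevalence {100 * prevalence:>4.1f}%  ->  PPV {100 * ppv:.1f}%")
```

With Se = Sp = 0.90, the PPV falls from 50% at a prevalence of 10% to roughly 8% at 1% and below 1% at 0.1%, so most individuals flagged in primary screening would require secondary evaluation to confirm fever.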