Gestures serve as a silent yet potent form of communication, finding utility in fields as diverse as automotive interfaces, healthcare systems, assistive technologies, entertainment, and human-computer interaction [1–2]. This versatility supports effective, contactless communication across many contexts: it underpins immersive virtual reality (VR), provides touchless control over smart devices, and assists individuals with hearing and speech impairments. Gesture recognition also plays a crucial role in advanced domains such as robotic control and medical diagnosis [3–10].
The advent of deep learning has propelled advances in gesture recognition, with methodologies predominantly based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). CNNs are widely recognized for their strength in image classification and feature extraction, whereas RNNs excel at handling sequential data; together, these attributes allow nuanced recognition of complex gestures. V.M. et al. used a low-resolution thermal imaging camera for gesture prediction and introduced a detection technique that integrates a 2D CNN with Temporal Convolutional Networks (TCN), attaining a classification accuracy of 95.9% and a mean Average Precision (mAP) [11]. Wu J. et al. developed a dynamic gesture recognition model that employs data gloves, combining a CNN for local feature capture with Bi-directional Long Short-Term Memory (BiLSTM) networks for temporal feature extraction, culminating in 95.05% accuracy [12]. Lin Z. et al. proposed a dynamic gesture recognition technique using a CNN with a 3D receptive field, achieving 97.5% accuracy [13]. Zhang X.J. et al. leveraged an AlexNet-based CNN model for gesture recognition, achieving an average accuracy of 98% [14]. Tsironi et al. applied a CNN combined with Long Short-Term Memory (LSTM) to dynamic gesture recognition, reaching an overall accuracy of 80.10% [15]. Bao P. et al. advocated direct classification of seven gesture types with a deep CNN, bypassing the segmentation and detection phases, and secured 97.1% accuracy in simple background settings [16]. Oyedotun et al. performed static gesture recognition based on deep learning, proposing a CNN and a stacked denoising autoencoder (SDAE) that achieve 91.33% accuracy [17]. Molina J. et al. employed Deep Neural Networks (DNNs) to classify seven gesture actions captured by Time-of-Flight (TOF) cameras, with a success rate of 94% [18]. Most of these recognition models comprise several convolutional layers, max pooling layers, a variety of regularization layers, and in some cases Transformer architectures. Consequently, they carry large parameter counts, demand more memory, and incur longer inference times.
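To make the recurring CNN-plus-recurrent pattern in the surveyed work concrete, the following is a minimal PyTorch sketch of a CNN + BiLSTM gesture classifier in the spirit of [12]. All layer sizes, the number of classes, and the input resolution are illustrative assumptions, not any cited author's published configuration; the final line shows how the parameter count behind the memory and latency concern above can be measured.

```python
# Minimal sketch of the CNN + BiLSTM family of dynamic gesture
# recognizers; hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # CNN front end: per-frame local feature extraction
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),               # -> (batch*time, 32, 1, 1)
        )
        # BiLSTM back end: temporal modeling across the frame sequence
        self.lstm = nn.LSTM(32, 64, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, num_classes)

    def forward(self, x):                          # x: (batch, time, 1, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).flatten(1)   # (b*t, 32)
        out, _ = self.lstm(feats.view(b, t, -1))       # (b, t, 128)
        return self.fc(out[:, -1])                 # classify from last time step

model = CNNBiLSTM()
logits = model(torch.randn(2, 8, 1, 24, 32))       # two 8-frame clips
print(logits.shape)                                # torch.Size([2, 10])
print(sum(p.numel() for p in model.parameters()))  # rough parameter count
```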
Most existing studies employ deep learning models that rely either on RGB cameras capturing visible light or on depth cameras collecting depth information, which suits a wide range of scenarios. However, recognition accuracy can suffer when RGB image quality degrades under varying lighting conditions, and the high cost of depth cameras restricts their widespread adoption [19–21].
Miniature infrared sensors employing thermal imaging technology can effectively capture gesture information via temperature differentials even in environments with insufficient light, significantly enhancing the adaptability and robustness of gesture recognition systems. By acquiring thermal rather than color images, these sensors ensure user privacy at the hardware level, safeguarding against the inadvertent disclosure of personal identity information.
Thermal imaging cameras are distinguished by their ability to operate independently of ambient light conditions and by their cost-efficiency relative to TOF cameras. This kind of sensor accurately captures the thermal fluctuations induced by gestures by differentiating the warmer human body from the ambient background. After a thorough review, this study selected the MLX90640 infrared sensor for its compact size, affordability, and high precision, characterized by a 32×24 pixel resolution and an adjustable sampling frequency. This study introduces a novel gesture recognition approach that couples a lightweight CNN architecture with a Spatial Transformer Network (STN) module, enhancing the model's ability to handle variations in the input imagery. The effectiveness of this lightweight CNN framework was evaluated against the lightweight classification model FastViT [22], renowned for its superior performance on the ImageNet-1K dataset. The successful deployment of the optimized CNN model on the Raspberry Pi, a low-power computing platform, affirms its practical applicability and operational effectiveness in real-world scenarios.
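As a concrete illustration of the acquisition side, the following sketch reads one 32×24 thermal frame from an MLX90640 on a Raspberry Pi using Adafruit's CircuitPython driver. The chosen refresh rate and the crude temperature threshold for separating the warmer hand from the background are assumptions for illustration, not this study's calibrated pipeline.

```python
# Sketch: one-frame capture from the MLX90640 over I2C on a Raspberry Pi.
import board
import busio
import numpy as np
import adafruit_mlx90640

i2c = busio.I2C(board.SCL, board.SDA, frequency=800_000)
mlx = adafruit_mlx90640.MLX90640(i2c)
mlx.refresh_rate = adafruit_mlx90640.RefreshRate.REFRESH_8_HZ  # adjustable rate

frame = [0.0] * 768                    # 32 x 24 = 768 temperature readings
mlx.getFrame(frame)                    # fills the buffer in degrees Celsius
thermal = np.array(frame).reshape(24, 32)

# Warmer hand pixels stand out against the cooler background
# (2 °C above the frame mean is an assumed, illustrative threshold):
hand_mask = thermal > thermal.mean() + 2.0
```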
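The STN module follows the standard localization-network → affine-grid → grid-sample pattern; a minimal sketch of such a front end for 24×32 thermal frames is shown below. The layer sizes are assumptions rather than this study's exact configuration, while the identity initialization of the affine regressor is the conventional choice for stable training.

```python
# Minimal sketch of a Spatial Transformer Network (STN) front end that
# learns to spatially normalize thermal frames before classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    def __init__(self):
        super().__init__()
        # Localization network regresses 6 affine parameters from the image
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                   # 24x32 -> 12x16
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((3, 4)),      # -> (batch, 16, 3, 4)
        )
        self.fc = nn.Linear(16 * 3 * 4, 6)
        # Start from the identity transform so early training is stable
        self.fc.weight.data.zero_()
        self.fc.bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):                      # x: (batch, 1, 24, 32)
        theta = self.fc(self.loc(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

stn = STN()
warped = stn(torch.randn(4, 1, 24, 32))        # spatially normalized frames
print(warped.shape)                            # torch.Size([4, 1, 24, 32])
```

In a pipeline of this kind, the STN's output feeds the lightweight CNN classifier, letting the network compensate for hand translation and rotation before features are extracted.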