The gait video data were collected at two different locations, in urban and rural environments. The volunteers, both men and women, were of different ethnicities and religions and had a range of body types from slim to heavy. Sequences in which participants wear different clothing, including coats, and carry a case or backpack are considered in this analysis. Subjects walking along straight lines perpendicular to the camera view axis were recorded in the urban and rural environments using long-wave infrared (LWIR) and visible cameras. The rural data consist of 24 subjects and the urban data of 31 subjects. Two walking sequences were recorded per subject: right to left and left to right. In this work, we consider the left-to-right walking sequences.
The frames were extracted from the videos. The experiments were conducted on a machine with an NVIDIA GeForce RTX 2060 GPU and an Intel Core i7-10750H CPU @ 2.60 GHz (6 cores, 12 logical processors).
2.1 Human detection
The algorithms used for human detection are HOG, YOLO, and Mask R-CNN. Among these, YOLO-based object detection outperforms the other methods.
a. Histogram of Oriented Gradients
HOG is a feature descriptor for object detection. The following steps are required to calculate HOG for an object:
1. Image normalization to reduce the influence of illumination effects.
2. Computing the gradient image in x and y to add further resistance to illumination variations.
3. Computing gradient histograms to provide resistance to small changes in pose or appearance.
4. Normalizing across blocks to provide better invariance to illumination, shadowing, and edge contrast.
5. Flattening into a feature vector.
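The steps above can be sketched with scikit-image's HOG implementation; the parameter values below are the standard Dalal-Triggs defaults for a 128x64 detection window, not necessarily those used in this work, and the input frame is synthetic.

```python
# Minimal HOG sketch with scikit-image; parameters are illustrative defaults.
import numpy as np
from skimage.feature import hog

# Synthetic grayscale frame standing in for an extracted video frame.
rng = np.random.default_rng(0)
frame = rng.random((128, 64))

features = hog(
    frame,
    orientations=9,          # gradient histogram bins (step 3)
    pixels_per_cell=(8, 8),  # cell size for the gradient histograms
    cells_per_block=(2, 2),  # block size for normalization (step 4)
    block_norm="L2-Hys",     # normalization across blocks
    feature_vector=True,     # flatten into a single feature vector (step 5)
)
print(features.shape)        # one 1D descriptor for the whole window
```

For a 128x64 window with these parameters, the descriptor has 15 x 7 blocks x 36 values per block, i.e. 3780 features.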
b. You Only Look Once
A single convolutional neural network predicts bounding boxes, class labels, and probabilities directly from full images in one evaluation. The main advantage of YOLO is that it is extremely fast and makes predictions that compare favourably with traditional object-detection methods. YOLO makes less than half the number of background errors, false positives, and false negatives compared with other methods, and its detected bounding box fits the object closely, at approximately the same size as the object. The limitation of YOLO is that it imposes strong spatial constraints and struggles to generalize to objects with unusual aspect ratios or configurations.
c. Mask R-CNN
Mask R-CNN is an instance segmentation method that extends Faster R-CNN beyond bounding-box recognition: it detects objects and generates a segmentation mask for each instance. The results of HOG, YOLO, and Mask R-CNN are shown in Fig. 1.
The bounding box produced by HOG is larger than the object, and its false positives and false negatives are relatively higher than YOLO's. The instance segmentation of Mask R-CNN shows a rectangular boundary effect. YOLO outperforms the other methods, with fewer false positives and false negatives and a compact bounding box around the object.
2.2. Background Subtraction
Background subtraction with the GMM and ViBe methods was applied to check the quality of the resulting foreground images. The results of both methods are shown in Fig. 2. The figure shows that the ViBe results are comparatively better, with fewer artifacts and less clutter than GMM.
2.3 Silhouettes Extraction
Each subject's data are divided into four groups: normal, coat, bag, and suitcase. The normal data consist of 12 silhouette sequences: six walking from left to right and six from right to left. The coat, bag, and suitcase data each consist of four sequences: two walking from left to right and two from right to left. In this work, the left-to-right walking sequences are considered for gait analysis. The silhouette data are divided into training and testing sets.
The training set consists of four sequences of normal silhouettes. The testing sets consist of two sequences each of normal, coat, bag, and suitcase. The silhouettes are shown in Fig. 3.
2.4 Gait Energy Image
The spatio-temporal silhouettes are averaged over the gait cycle to calculate the Gait Energy Image (GEI). The GEIs are shown in Fig. 4.
The Gait Energy Image is defined as
\[
G(x, y) = \frac{1}{N} \sum_{n=1}^{N} B_n(x, y)
\]
where \(B_n(x, y)\) is the pre-processed binary gait silhouette, \(N\) is the number of frames in a gait cycle, \(n\) is the frame number, and \(x\) and \(y\) are the 2D image coordinates.
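The GEI computation reduces to a temporal mean over the cycle. A minimal sketch, assuming the silhouettes of one gait cycle are already cropped, aligned, and binarized to 0/1 (the array sizes below are illustrative):

```python
# GEI as the temporal average of binary silhouettes over one gait cycle.
import numpy as np

rng = np.random.default_rng(0)
N = 30                                 # frames in one gait cycle
silhouettes = (rng.random((N, 64, 44)) > 0.5).astype(np.float64)

gei = silhouettes.mean(axis=0)         # G(x, y) = (1/N) * sum_n B_n(x, y)

print(gei.shape)                       # same size as a single silhouette
```

Each GEI pixel lies in [0, 1] and encodes how often that pixel is foreground across the cycle.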
2.5 Discrete Fourier Transform
The amplitude spectra of the Gait Silhouette Volume (GSV) are calculated by Discrete Fourier Transform (DFT) analysis based on the gait period:
\[
G(x, y, k) = \sum_{n=0}^{N-1} g(x, y, n)\, e^{-j \omega_0 k n}, \qquad A(x, y, k) = \left| G(x, y, k) \right|
\]
where \(A(x, y, k)\) is the amplitude for the temporal axis, \(g(x, y, n)\) is the GSV, \(N\) is the number of frames in a gait cycle, \(\omega_0 = 2\pi/N\) is the base angular frequency for a gait cycle, and \(k\) is the frequency component. The DFT analysis of the gait period is shown in Fig. 5.
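The per-pixel temporal DFT can be sketched with NumPy's FFT; the silhouette volume below is synthetic and its dimensions are illustrative.

```python
# Amplitude spectra of a gait silhouette volume along the temporal axis.
import numpy as np

rng = np.random.default_rng(0)
N = 30                                   # frames in one gait cycle
gsv = (rng.random((64, 44, N)) > 0.5).astype(np.float64)

spectrum = np.fft.fft(gsv, axis=-1)      # G(x, y, k), DFT over n
amplitude = np.abs(spectrum)             # A(x, y, k)

# k = 0 is the DC component: the plain sum of g(x, y, n) over the cycle.
print(amplitude.shape)
```

The low-order frequency components (k = 0, 1, 2, ...) are the ones typically retained as frequency-domain gait features.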
2.6 Principal Component Analysis
Principal Component Analysis (PCA) reduces data by geometrically projecting them from a higher-dimensional space onto lower-dimensional features. By projecting, PCA simplifies the complexity of high-dimensional data while retaining trends and patterns. The gait sequences are represented as GEI and DFT-GEI features, and gait recognition is performed by matching each testing sample to the training sample with the minimal distance to the testing GEI or DFT-GEI. PCA projects the original features onto a subspace of lower dimensionality so that good data representation and class separability can be achieved simultaneously. The reduced-dimension features are used for gait recognition with classifiers.
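A minimal sketch of this projection-and-matching step, implementing PCA via SVD and a nearest-neighbour match; the feature matrix, number of components, and the noisy test sample are all illustrative, not the paper's data.

```python
# PCA dimensionality reduction (via SVD) + minimal-distance matching.
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((20, 2816))               # 20 training GEIs, 64x44 flattened
test = train[3] + 0.01 * rng.random(2816)    # noisy copy of training sample 3

# Fit PCA on the training set: center, then keep top-k right singular vectors.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:10]                         # 10 principal components

train_proj = (train - mean) @ components.T   # reduced training features
test_proj = (test - mean) @ components.T     # reduced test feature

# Recognition: match the test sample to the nearest training sample.
match = np.argmin(np.linalg.norm(train_proj - test_proj, axis=1))
print(match)
```

The nearest-neighbour match here stands in for the classifiers mentioned above; any classifier can be trained on the reduced features instead.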