CGA: a new feature selection model for visual human action recognition

Recognition of human actions from visual content is a budding field of computer vision and image understanding. The problem with such recognition systems is the huge dimensionality of the feature vectors; many of these features are irrelevant to the classification mechanism. For this reason, in this paper, we propose a novel feature selection (FS) model called cooperative genetic algorithm (CGA) to select some of the most important and discriminating features from the entire feature set, in order to improve the classification accuracy as well as the time requirement of the activity recognition mechanism. In CGA, we have made an effort to embed the concepts of cooperative game theory in GA to create a two-way reinforcement mechanism that improves the solutions of the FS model. The proposed FS model is tested on four benchmark video datasets, namely Weizmann, KTH, UCF11, and HMDB51, and on two sensor-based UCI HAR datasets. The experiments are conducted using four state-of-the-art feature descriptors, namely HOG, GLCM, SURF, and GIST. It is found that there is a significant improvement in the overall classification accuracy while considering only a very small fraction of the original feature vector.


Introduction
Human action recognition (HAR) plays an essential role in human-to-human interaction and many interpersonal relations by providing vital information about the identity of a person, their personality, and their psychological state, which are generally challenging to extract [1]. The ability to automatically recognize human activities is one of the leading research topics in computer vision and machine learning. In the fields of artificial intelligence, machine learning, and deep learning, research aimed at understanding human actions has received tremendous attention and has been growing steadily for decades [2,3]. This, in turn, has resulted in a plethora of HAR techniques proposed by various researchers and evaluated on different benchmark datasets containing still images, video sequences, data collected from accelerometers, smartwatch sensors, gyroscopes, and gravity sensors, and even complete motion curves obtained by tracking optical flow features.
During classification, mainly two questions arise: ''What is the action?'' (i.e., what is to be recognized) and ''Which part of the video?'' (i.e., the localization problem) [4]. It is essential to determine the kinetics of a person while classifying a particular action so that a computer can recognize the target action efficiently. Various human activities that arise in a natural manner, such as ''Walking'' and ''Running,'' are tricky to recognize because of the near similarity of the actions performed by a subject. HAR is still considered a challenging task when it comes to classification using benchmark datasets containing video sequences of variable length and resolution. It has already been mentioned that several machine learning-based methods have been proposed on various datasets, but they do not show satisfactory performance to a certain extent. Hence, a better algorithmic approach and further optimization still need to be incorporated in the HAR system.
A recent survey published in 2019 shows that many researchers have focused on the development of hand-crafted action features for solving the problem of visual HAR. In visual HAR, the features are generally extracted from different video classes exhibiting different human actions. Since a video consists of a finite number of frames, it may sometimes happen that features are extracted from frames where either no human action is performed or the human action is already completed. Moreover, in motion-based human actions like ''Walking,'' ''Jogging,'' or ''Running,'' it is also observed that the subject is not present in the last few frames of the respective videos. In these cases, the redundant features extracted from such frames may mislead the classification process. Again, since the features are extracted from continuous video frames, the size of the feature vector grows monotonically. A solution to this problem is to design a feature selection (FS) model that selects the most significant feature subset from the original feature set while ignoring the redundant ones. This, in turn, helps to increase the overall recognition accuracy as well as to reduce the time complexity of the entire HAR procedure. The concept of FS [5-8] has already been implemented for improving the accuracy of different pattern recognition problems such as handwritten numeral recognition [9,10], handwritten word recognition [11,12], online handwritten Bangla character recognition [13], and facial emotion recognition [14,15].
FS is considered to be one of the most challenging tasks in the machine learning domain. In order to get the optimal feature combination for a set of features, every possible combination must be evaluated. So, for an $n$-dimensional feature vector, $2^n$ feature combinations need to be evaluated, which results in an exponential increase in time and resource requirements as $n$ increases. This kind of deterministic approach is not applicable in real-life scenarios, where a feature set may contain several thousand features. In practice, the process of FS proceeds by introducing candidate solutions, evaluating them, and improving them over successive iterations. This mechanism does not guarantee finding the best possible feature combination in the feature space, but it does find a near-optimal solution within a reasonable time. Depending on the criteria of evaluation, FS methods are broadly divided into three separate categories, namely filter [16-18], wrapper [12, 19-22], and embedded [10,11,23] models. Filter methods use the statistical characteristics and intrinsic properties of features to evaluate candidate solutions. In contrast, wrapper methods take the help of a learning algorithm (a classifier) to evaluate the solutions at every iteration. Embedded models take advantage of both filter and wrapper methods to create more robust solutions. Although filter methods may take less time to generate solutions, the solutions produced by wrapper models are of higher quality.
In this paper, a new wrapper FS procedure named cooperative genetic algorithm (CGA) is designed for feature dimensionality reduction over a combination of four state-of-the-art feature descriptors, namely Histogram of Oriented Gradients (HOG) [24], Gray Level Co-occurrence Matrix (GLCM) [25], Speeded Up Robust Features (SURF) [26], and the GIST descriptor [27]. These feature descriptors have been successfully applied for solving typical pattern recognition problems. In the present work, we have considered them for extracting feature vectors, which serve as the input to our proposed FS model. These feature descriptors are applied to four benchmark 2D RGB video datasets having different numbers of action classes, namely Weizmann [28], KTH [29], UCF11 [30], and HMDB51 [31]. These datasets are challenging because they contain complex motions, varied camera angles, and camera motion. To test the robustness of the proposed FS model, we have additionally considered two smartphone sensor-based datasets from the UCI machine learning repository [32,33].
CGA overcomes some major drawbacks present in coalition or cooperative games and makes them applicable to FS. Our main contributions in this paper are mentioned below:
1. Formation of a new FS procedure where the concepts of the coalition game and GA are amalgamated for the first time (to the best of our knowledge), thereby making coalition games practically applicable to the domain of FS.
2. Use of mutual information (MI) to guide the mutation process in GA (which is otherwise done randomly). The new version of GA is called enhanced GA, or EGA.
3. Proposal of a new fitness measure for FS having three different components, namely classification accuracy, number of selected features, and a Shapley value-based score, to perform FS in a multi-objective fashion.
4. Application of the proposed FS method over some standard feature vectors extracted from various HAR datasets.
The rest of the paper is organized as follows: Sect. 2 gives a brief overview of the state-of-the-art machine learning-based models used for HAR as well as some other popular FS methods. Section 3 reviews some of the preliminaries used for defining the CGA model, whereas Sect. 4 presents the detailed procedure of our proposed CGA model. Section 5 describes the feature descriptors briefly, whereas the benchmark action datasets used for evaluation of the proposed model, along with the detailed experimental results, are shown in Sect. 6. Finally, some concluding remarks and future extensions of the current work are provided in Sect. 7.

Previous work
Even though significant advancements have already been made in the field of computer vision, researchers still tend to rely on traditional pattern recognition approaches for finding solutions to the HAR problem. Hand-crafted and logical methodologies are used for understanding the actions performed by humans. Some of the works done for visual HAR are described in brief below.
In [34], the authors have proposed a hierarchical model for HAR, which combines both spatial and spatial-temporal features as a collection of bags-of-features. The model is capable of categorizing human actions on a frame-by-frame basis for any novel video sequence. However, the model cannot classify more complex human actions performed in an unconstrained environment. A new HAR method based on a three-dimensional scale-invariant feature transform (3D-SIFT) descriptor has been proposed in [35] by Scovanner et al. to reflect the 3D nature of video data. The authors have represented videos through a bag-of-words approach and have presented the relationships between spatial and temporal words for better video data description. But these descriptors do not expressly portray the exact spatiotemporal nature of the video information. A probabilistic latent semantic analysis (pLSA) model and a latent Dirichlet allocation (LDA)-based HAR model are proposed in [36]. The method automatically learns the probability distributions of the spatial-temporal words and the domains corresponding to human action classes, and it can categorize and classify the human action present in a video sequence. However, the method fails to include geometric information, such as the spatial and temporal course of local features, along with explicit models for some closely related human actions like ''Running'' and ''Jogging.'' In [37], Ikizler-Cinbis et al. have suggested the use of a multiple-instance learning (MIL) framework based on multiple feature channels consisting of ''person-centric,'' ''object-centric,'' and ''scene-centric'' features. The work concentrates on capturing human, scene, and object properties for action recognition in the wild but ignores the spatial and temporal relationships of these feature channels.
A novel neural network approach is applied by Huang et al. [38] for HAR. It is based on the concept of the self-organizing map (SOM), which is used as a tool for data dimensionality reduction and feature data clustering. The input to the SOM is provided through groupings of human silhouettes, which act as the main feature representation. However, their approach does not incorporate temporal information in the SOM, which is required to express and recognize human actions more efficiently. In [39], Chakraborty et al. have proposed a novel method for the detection of spatiotemporal interest points (STIPs), utilizing surround suppression combined with local and temporal constraints. A vocabulary of visual words is built by using a bag-of-visual-words (BoV) model, which is used for action representation. The method uses a greedy agglomerative information bottleneck (AIB) approach for vocabulary compression, which makes the computation time very high. A histogram of 3D spatiotemporal gradients (3D-STHOG)-based model is proposed in [40] by Reddy et al., which computes the gradients from the spatiotemporal volume. This model is different from 3D-HOG, where the gradients are calculated from 3D spatial coordinates. But the model fails to investigate the difficulties brought about by false alarm tracks in more detail. Likewise, the authors also overlook longer-term time dependencies in HAR signals. In [41], the authors have proposed a new action descriptor for HAR based on STIPs, which uses histograms of interest point locations (HIPLs). A HIPL represents information about the spatial location by rearranging STIPs and is added as a supplement to the bag-of-interest-points feature. To overcome overfitting, the method also includes a novel classifier combination framework for the integration of the different extracted features.
Nevertheless, the focus on including both the spatial and temporal location information of STIPs is found to be missing in the design of the new action descriptor. Yuan et al. [42] proposed a new global feature transform called the R-transform, which captures the detailed geometrical distribution of interest points. They have also proposed a new technique to combine bag-of-visual-words (BoVW) with the R-transform so as to improve recognition accuracy. However, the technique is found to be sensitive to noise and outliers in the video data. In [43], the authors have proposed a new method for HAR based on motionlets (or motion templates), which are tight moving clusters corresponding to different parts of the body. The motionlets are extracted from training videos through a data-driven technique. In order to represent and match motionlets, the authors extract features like spatiotemporal orientation energy (SOE), which is computed for every pixel. This, in turn, makes the feature sensitive to small shifts and increases the computational complexity of the whole method.
In the work described by Sadek et al. in [44], the authors have proposed a method for HAR based on a dynamic affine-invariant shape representation. The method is tested using the SVM classifier and is applied to the Weizmann and KTH datasets. The work fails to consider some real issues like occlusion, target clarification, and substantial background clutter. In [45], each video sequence is represented and classified by a single global descriptor. The global descriptor captures the components at various intervals of the frequency spectrum of a video sequence through a bank of 3D spatiotemporal filters. These filters consist of 68 3D-Gabor filters over 2 scales, with 37 and 31 orientations in the first and second scales, respectively.
Nonetheless, expanding the number of channels in the passband causes the channels to become narrower and the descriptor to give better responses at the cost of higher computational requirements. A new type of video representation is introduced by Wang et al. [46], which provides local motion information using dense trajectories. The features of the trajectories are extracted using differential optical flow based on motion boundary histograms. The model has been applied over nine popular datasets, which include KTH, YouTube, Hollywood2, etc., but the performance of the model is entirely dependent on the quality of the available optical flow. A global representation named Multi-View Super Vector (MVSV) is proposed in [47], which performs kernel averaging over relatively independent components derived from two descriptors. MVSV provides promising outcomes when applied to the HMDB51 and UCF101 datasets during experimentation. However, the MVSV feature descriptor suffers from two significant drawbacks. First, the calculation of MVSV includes matrix multiplications, which may not be fast enough for complex HAR tasks where speed is considered an important factor. Second, the present structure of MVSV cannot be applied to more than two descriptors, beyond which the whole system becomes unwieldy. Wu et al., 2014 [48] proposed a method for HAR that represents each action class by hidden temporal models. The video segment is described by a temporal pyramid model to capture the temporal structures. The method works well with two models per action class, but a useful heuristic for initializing the learning procedure of multiple models remains a significant concern, as does the procedure for training these models. In the work proposed by Zhou et al. [49], a multiple-instance Markov model is used to address the issue of modeling long-range temporal relations for very complex human activities. Multiple instances are used in the model to find the most discriminating set of Markov chains for the representation of the activities. The method suffers from a lot of unwanted background STIPs, which are mostly found in unconstrained videos due to background clutter and camera motion. This, in turn, makes the learning of the parameters of the Markov model inefficient, thereby increasing the computational complexity. A template selection approach based on dynamic time warping (DTW) is proposed by Seto et al. in [50] to avoid complicated feature extraction and domain knowledge requirements. The algorithm has been tested on simulated and real smartphone data. However, the approach fails to consider more complex human activities that need modification of DTW. Again, the approach sometimes overfits due to the presence of redundant templates.
An interesting approach is to treat activity recognition as a maximum subgraph problem, as demonstrated by Chen et al. in [51]. The authors of this paper have created a space-time graph of the test video, which is then searched for the maximum subgraph through localization of action instances. The method offers increasingly robust recognition in noisy backgrounds. Additionally, the high-level feature descriptor performs much better for complex human actions, making it possible to retain the spatiotemporal connections between humans and objects in the test video. Kushwaha et al., 2016 [52] designed a new HAR framework which consists of three steps. First, the subject is detected using background subtraction, followed by extraction of contour-based pose features from human silhouettes. Finally, human actions are classified using a multiclass SVM classifier. The framework is applied to the KTH dataset and achieves lower recognition accuracy compared to the state-of-the-art methods. The work described in [53] by Luvizon et al., 2017 proposes 2D motion templates based on the motion history image (MHI) of the human actions. The MHIs are described with the help of HOG feature descriptors, which are finally classified using a nearest neighbor (NN) classifier. However, the method depends on a few external parameters which need to be tuned carefully to get better results. Sharif et al., 2017 [54] proposed a framework based on uniform segmentation and a combination of Euclidean distance and joint entropy-based feature selection. Their work consists of four steps: segmentation of the subjects by fusing novel uniform segmentation and expectation-maximization, feature extraction using HOG and Haralick features, feature selection using Euclidean distance and a standard entropy-PCA-based method, and feature classification using SVM. The framework ignores the process of removal of background occlusions and is highly sensitive to the light and intensity variations present in the video data.
To address the problem of multiple-view invariance, Singh et al. in [55] have proposed a three-step framework that uses background subtraction, feature extraction, and activity referencing through hidden Markov models. During the feature extraction phase, contour-oriented signal features, optical flow-dependent motion features, and uniform rotation local binary patterns are used. The framework is pose invariant as well as rotation invariant. Furthermore, the target actions can be recognized in cases where humans perform their action(s) facing the side view. Sahoo et al., 2019 [56] proposed a fusion of histogram-based features for action recognition called Bag of Histogram of Optical Flow (BoHOF). After calculation of the features, Sobel edge filters [57] are used, and median filtering is applied to suppress background noise. HOG features are then extracted and mixed with BoHOF, and finally, the result is fed into the SVM classifier. An overall accuracy of 96.70% has been achieved on the KTH dataset. Nonetheless, the proposed feature descriptor fails to deal with issues like the shadow of the human performing the action, the presence of noisy background frames in the video, and the speed of the action.
A sufficient volume of research has also been conducted on sensor-based HAR. In [58], Kolosnjaji et al. focus on developing a user-independent HAR model using a pre-trained neural network with dropout. The method extracts user-independent features and averages out user-dependent features, which are considered as noise. Features are then extracted from the noisy signal using an autoencoder with three hidden layers. The model is implemented using GPUs on both the UCI HAR and WISDM datasets. However, the model is highly dependent on hardware resources and requires a huge training time. Kim et al. [59] have used an ensemble of hidden Markov models to classify human activities in an attempt to overcome the limitation of misclassification caused by intra-class differences and inter-class similarity. The authors applied their model to the UCI HAR dataset by using two essential features, the mean and the standard deviation; these two features are not enough for classifying complex human activities. Moreover, their model heavily depends on the performance of the clustering technique. In [60], the authors have preselected a subset of all the features (from an original feature vector of size 561) present in the UCI HAR dataset and performed classification using the subset in two ways: using the mean square error (MSE) and using threshold-based conditions. The authors do not verify their model on other sensor-based HAR datasets, which is a significant limitation of their work.
The work described in [61] provides an extensive study and comparison of all the features extracted from inertial sensors. It indicates that the features recently recognized to be independent of the orientation of the smartphone are not able to accurately distinguish between the users' activities using the inertial sensors. The proposed study is done on only the UCI HAR dataset, which is a significant limitation. Also, feature selection as well as noise reduction techniques have not been considered in this study. A comparative study is presented in [62], which compares the performance of K-nearest neighbors (KNN) and random forest (RF) classifiers over the UCI HAR dataset. The high classification accuracy obtained by the RF classifier also confirms the excellent quality of the data contained in the dataset. A supervised learning method based on a two-channel convolutional neural network is used in [63] to recognize the human activities present in the UCI HAR dataset. The method extracts both power and frequency signals from the human activity signals. However, the method suffers from low recognition accuracy in the case of some confusing activity classes.
It is very clear from the above literature survey that the problem of visual HAR has been broadly investigated in the computer vision domain. Previous research on visual HAR mainly focuses on developing new and robust feature descriptors such as 3D-SIFT, STIP, 3D-STHOG, BoVW with R-transform, MVSV, and BoHOF. Since these feature descriptors are simple yet effective, they are widely applied in recognizing simple human actions. However, these descriptors are sometimes found to be incapable of detecting complex human actions. This is because some of the feature descriptors fail to capture either the temporal or the spatial relationships between simple human actions. Although some works claim to capture both the spatial and temporal action relationships, the computational as well as the time complexity then needs to be compromised. It is also evident from this survey that the selection of optimal and discriminative features has still been left untouched for the video-based HAR problem.
On the other hand, for sensor-based HAR, researchers generally extract two different types of features: statistical features (such as mean, median, standard deviation, skewness, correlation, and kurtosis) and frequency-based features (such as energy, Shannon entropy, and the mean and median frequency of the power spectrum). Unlike video-based HAR [64,65], we found two significant works [66,67] that motivate us to use the concept of FS for sensor-based HAR tasks. But the application of FS to sensor-based HAR is still in its infancy. Motivated by the above facts, the present work proposes a relatively new FS model named CGA. To validate its robustness in finding the optimal subset of features from the original feature vector, we have tested our CGA-based FS model on four benchmark video-based as well as two sensor-based HAR datasets. The proposed CGA model is also compared with six state-of-the-art FS models.

Genetic algorithm (GA) is a simple metaheuristic approach for solving optimization problems [9, 68-70]. It is inspired by the evolution of biological features like selection, crossover, and mutation.
There are five necessary steps in GA: creation of the initial population, parent selection, crossover, mutation, and replacement of chromosomes by their children. GA starts by creating a random population of a finite number of chromosomes, each having a fixed length. Initially, the individual genes of each chromosome are filled with random values. This serves as the initial population for GA. From this set of chromosomes, some are selected as parent chromosomes, which are then passed through crossover and mutation to create child chromosomes. There are various approaches to select the parent chromosomes, like tournament selection [71], Roulette-wheel selection [72], etc. After child chromosomes are produced, they are evaluated against some fitness measure. If their fitness values surpass those of some chromosomes in the current population, they replace the less fit chromosomes in the population.
The fitness function is defined over the genetic representation and measures the quality of the represented solution. These processes ultimately result in the next generation of chromosomes, which is again passed through the same process of selection, crossover, and mutation to produce subsequent generations. After a certain number of generations, GA converges to a near-optimal solution. A binary version of GA is used in FS, where each chromosome is represented as a vector of ''0''s and ''1''s: a ''0'' means the corresponding feature is not selected, whereas a ''1'' means that it is selected.
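To make the wrapper pipeline concrete, the following is a minimal sketch of binary-GA-based FS, assuming a k-NN evaluator, tournament selection, one-point crossover, and purely random mutation (the baseline behavior that EGA later replaces with a guided one); all parameter values and helper names are illustrative, not taken from the paper.

```python
# Minimal sketch of binary-GA wrapper feature selection (illustrative only).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Wrapper evaluation: mean CV accuracy using only the selected features."""
    if mask.sum() == 0:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

def binary_ga(X, y, pop_size=20, n_gen=30, p_mut=0.02):
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))          # random initial population
    fit = np.array([fitness(c, X, y) for c in pop])
    for _ in range(n_gen):
        # tournament selection of two parents
        i, j = rng.choice(pop_size, 2, replace=False)
        p1 = pop[i] if fit[i] >= fit[j] else pop[j]
        i, j = rng.choice(pop_size, 2, replace=False)
        p2 = pop[i] if fit[i] >= fit[j] else pop[j]
        # one-point crossover
        cut = rng.integers(1, n)
        child = np.concatenate([p1[:cut], p2[cut:]])
        # random mutation: toggle each gene with probability p_mut
        flip = rng.random(n) < p_mut
        child[flip] ^= 1
        # replace the worst chromosome if the child is fitter
        f = fitness(child, X, y)
        worst = fit.argmin()
        if f > fit[worst]:
            pop[worst], fit[worst] = child, f
    best = fit.argmax()
    return pop[best].astype(bool), fit[best]
```

Calling binary_ga(X_train, y_train) returns a boolean mask over the features together with the cross-validated accuracy the mask achieved.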

Present work
In this work, we have used a cooperative game [76,77] and GA in conjunction so that they reinforce each other with their individual qualities. A detailed description of the cooperative game can be found in the Appendix section of the manuscript. When applied to the field of FS, coalition game theory requires a particular formulation. So, first, we discuss briefly how we have adapted the concepts of the coalition game for FS.

Cooperative (coalition) game for FS
Every cooperative game proceeds by creating a coalition of players. In FS, every feature acts as one of the players participating in the game. The contribution of a feature to a coalition is denoted by a special value known as the Shapley value. The Shapley value considers the intrinsic properties of the features to calculate their worth with respect to the coalition.
Consider a coalition game with a total of $m$ players. The entire set of players is defined as $M = \{1, 2, \ldots, m\}$. Say, in the process of the game, a coalition $C$ has been formed, which consists of $n$ players (features). The contribution of any player $f_i$ (not currently included in $C$) to $C$ can be calculated as

$\Delta_i(C) = v(C \cup \{i\}) - v(C)$

where $v(C)$ is the total amount of payoff that the members of $C$ can accumulate by cooperating. The Shapley value for a particular player is the weighted sum of the contributions of that player over all possible coalitions. Let us denote the Shapley value for player $i$ with respect to a set of players $M$ and payoff function $v$ as $S_i(M, v)$. The expression for $S_i$ can be represented as

$S_i(M, v) = \sum_{C \subseteq M \setminus \{i\}} \frac{|C|!\,(m - |C| - 1)!}{m!}\,\Delta_i(C)$

The function $v$, and thereby $\Delta_i(C)$, varies depending on the context in which coalition game theory is being used. For FS, we have used a PCC-based independence metric to find the worth of a feature in a coalition with other features. So, there are certain things we need to specify before applying a coalition game in any context: the contribution of a feature in a coalition ($\Delta_i$) and the Shapley value of a feature ($S_i$). In the following portions, we describe how we calculate $\Delta_i$ and $S_i$, and finally the importance of a feature in the entire feature space.
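For concreteness, the following brute-force sketch evaluates the expression above by enumerating every coalition; payoff stands in for the payoff function v and is an assumed argument. The nested enumeration over all $2^{m-1}$ coalitions of the remaining players is exactly what makes the exact computation intractable, as discussed later.

```python
# Exact Shapley value by brute-force enumeration (illustrative sketch).
from itertools import combinations
from math import factorial

def exact_shapley(i, players, payoff):
    m = len(players)
    others = [p for p in players if p != i]
    s = 0.0
    for k in range(m):                       # coalition sizes 0 .. m-1
        for C in combinations(others, k):    # every coalition excluding player i
            w = factorial(k) * factorial(m - k - 1) / factorial(m)
            s += w * (payoff(set(C) | {i}) - payoff(set(C)))
    return s
```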

Contribution of a feature in a coalition ($\Delta_i$)
FS may be considered an incremental process where we have a set of selected features and another set of unselected features. Now, if we want to move a feature from the unselected set to the selected one, there should be a measure by which we can say that the transfer of the feature will be fruitful. To find that measure, we need to discuss the FS process a bit. In FS, we are trying to select the most discriminating set of features that can help in the classification process. For example, if two features have the same values, both of them are not helping in classification, so only one of them needs to be used; the other one then becomes a redundant feature. Thus, our ultimate objective in FS is to remove those redundant features. Linear dependency is a great way to identify a feature that is redundant with respect to the other features. So, if we want to transfer an unselected feature to the selected feature set, a good strategy is to choose a feature that is linearly independent of the selected feature set. We have used this concept to calculate the worth of a feature in the coalition formation.
The Pearson Correlation Coefficient, or PCC [78-81], is a correlation metric used for finding the linear dependency between two random variables. Say we want to check the correlation between features $f_1$ and $f_2$. We first calculate the PCC value $r_{f_1 f_2}$, and then the final decision is made based on the following fact.
According to the theory of PCC, two variables are correlated if the PCC value is closer to 1 than to 0. The threshold in this work is set to 0.5: if the PCC value between two features is greater than 0.5, it is closer to 1 and the features are considered correlated; otherwise, they are not. The relevant equations and descriptions used to compute the PCC of the features are given in the Appendix section. It is to be noted that although we say that two features are independent when the PCC value between them is less than 0.5, they are not completely independent; rather, their mutual dependency is small enough to be ignored for our purpose. In a coalition $C$, the contribution $\Delta_i(C)$ of a feature is determined by the number of coalition features independent of that feature divided by the number of dependent ones.
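A minimal sketch of this PCC-based contribution follows, assuming features are stored as columns of a NumPy array; the absolute PCC value is compared against the 0.5 threshold discussed above, and the guard against an empty dependent set is our own assumption, since the ratio is undefined there.

```python
# Sketch of the PCC-based contribution of feature i to coalition C:
# (# independent coalition members) / (# dependent coalition members).
import numpy as np

def pcc(a, b):
    """Pearson correlation coefficient between two feature columns."""
    return np.corrcoef(a, b)[0, 1]

def contribution(X, i, coalition, thr=0.5):
    """Delta_i(C): independence of feature i w.r.t. the coalition members."""
    indep = sum(abs(pcc(X[:, i], X[:, j])) < thr for j in coalition)
    dep = len(coalition) - indep
    return indep / dep if dep > 0 else float(indep)  # assumed guard for dep == 0
```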

Shapley value calculation ($S_i$)
To calculate the Shapley value of a feature $f_i$, we need to find the contribution of the feature for every coalition possible in the scenario. So, finally, the Shapley value of $f_i$ becomes

$S_i = \sum_{C \subseteq M \setminus \{i\}} \frac{|C|!\,(m - |C| - 1)!}{m!}\,\Delta_i(C)$

Scoring of features
In our model, we have also used the MI [82,83] values of the features to assign a score value to every feature at each iteration. The final score of a feature $f_i$, $\text{score}(i)$, is given by Eq. 5, which combines $S_i$, the Shapley value of $f_i$, and $I(f_i; \text{class})$, the MI value of $f_i$ with respect to the corresponding pattern classes. The computation process of MI is described in the Appendix section.

Cooperative GA (CGA)
The primary motivation behind proposing this model is a significant drawback present in the coalition game approach followed in FS. In Sect. 4.1.2, we have presented the equation for Shapley value calculation. It can be observed that, in general, it is computed based on the contributions of the feature in every possible coalition. In practice, it is impossible to calculate the contribution in every coalition because, for $n$ features, there are $2^{n-1}$ possible coalitions to consider for each feature. So, for $n$ features, the total time complexity of computing the Shapley values becomes $O(n \cdot 2^{n-1})$, which is exponential. Hence, researchers nowadays are trying to provide alternative approaches to calculate Shapley values.
CGA consists of two different parts: (i) enhanced GA (EGA) and (ii) the coalition game. In CGA, we have made an effort to reduce the computational complexity of calculating the Shapley value and to use that Shapley value to improve the solutions of EGA. Thus, it is a bidirectional reinforcement approach between EGA and the coalition game, where at each iteration of CGA, the Shapley values are computed based on the solutions of EGA, and those values are again used to produce the next solutions of EGA.
Let us discuss both of them one by one.

EGA
EGA uses all the primary mechanisms of GA, like parent selection, crossover, and mutation. In addition, EGA embodies a new fitness function and an MI-guided mutation procedure. Mutation in GA is a random process where some of the features are randomly selected and their states are toggled, i.e., if a feature is in state ''0,'' it is made ''1'' and vice versa. We wanted to use a guided approach in place of this random one. Hence, we have proposed an MI-guided mutation approach where some random features are selected for mutation, but they are not directly toggled; instead, for a randomly selected feature set $S$, the mutation decision for each feature $f_i$ depends on its value $d_i$ and its MI score $I(d_i; \text{class})$ with respect to the pattern classes.
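Since the exact mutation rule of the equation is not reproduced here, the sketch below shows one plausible MI-guided variant consistent with the description above: randomly chosen positions are toggled only when the feature's MI with the class labels supports the change. The median pivot, the function names, and the use of scikit-learn's MI estimator are all assumptions.

```python
# Sketch of one plausible MI-guided mutation rule (NOT the paper's exact
# equation): among randomly chosen positions, switch a feature ON only if
# its MI with the class labels is above the median MI, and OFF only if
# its MI is below the median.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_guided_mutation(chrom, mi, n_sites=5, rng=None):
    rng = rng or np.random.default_rng()
    sites = rng.choice(len(chrom), size=n_sites, replace=False)
    pivot = np.median(mi)
    out = chrom.copy()
    for s in sites:
        if out[s] == 0 and mi[s] >= pivot:   # promising feature: turn on
            out[s] = 1
        elif out[s] == 1 and mi[s] < pivot:  # weak feature: turn off
            out[s] = 0
    return out

# mi = mutual_info_classif(X_train, y_train)  # computed once, as in Fig. 1
```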
We have also used a new fitness evaluation function. At each iteration of EGA, we need to evaluate the candidate solutions, or chromosomes, for the parent selection as well as the child replacement procedures. The fitness function used for the evaluation of chromosomes in EGA is given by Eq. 6. In that equation, $\text{accuracy}(X)$ represents the classification accuracy provided by the feature vector $X$, $X_i$ is the state of the $i$th feature in $X$, and $\text{score}(i)$ is the final score of the $i$th feature as calculated by Eq. 5. $\alpha$ and $\beta$ are two parameters used to provide weightage to the three different components of the fitness function; we have used $\alpha = 0.5$ and $\beta = 0.2$.
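Equation 6 itself is not reproduced above, so the sketch below assumes one consistent reading of its three components: a weighted sum of the classification accuracy, a reward for selecting fewer features, and the mean score (Eq. 5) of the selected features, with the remaining weight $1 - \alpha - \beta$ on the third term.

```python
# Sketch of a three-component fitness under stated assumptions; the exact
# combination in Eq. 6 may differ.
import numpy as np

ALPHA, BETA = 0.5, 0.2   # weights used in the paper

def ega_fitness(mask, accuracy, score):
    """mask: binary chromosome; accuracy: accuracy(X); score: per-feature scores (Eq. 5)."""
    n = len(mask)
    sel = mask.astype(bool)
    size_term = 1.0 - sel.sum() / n                     # fewer features -> higher reward
    score_term = score[sel].mean() if sel.any() else 0.0
    return ALPHA * accuracy + BETA * size_term + (1 - ALPHA - BETA) * score_term
```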

Coalition game
After EGA produces the final population of chromosomes in an iteration, they are passed to the coalition game for the computation of the Shapley values for the next iteration. As discussed before, it is practically impossible to compute the contribution of every feature for all possible coalitions. Hence, instead of considering all coalitions, we have used the final chromosomes of EGA as the feature coalitions over which the Shapley values are computed. From the perspective of EGA, the final population of the algorithm may be considered as the most crucial combinations of features handpicked by the algorithm itself. This reduces the computational complexity of the Shapley value calculation procedure to a large extent: EGA normally uses a very small population size, like 20-50, so in the proposed model, the Shapley value computation requires $O(n)$ time. Moreover, the chromosomes in EGA are guided toward better solutions through the fitness function involving the Shapley values.
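The following sketch illustrates this population-based approximation: each final EGA chromosome serves as one coalition, so every feature's Shapley estimate becomes an average of its PCC-based contributions (reusing the contribution function sketched earlier) over a constant number of coalitions. Uniform averaging over the population is our assumption.

```python
# Sketch of the CGA-style Shapley approximation: the EGA population
# (20-50 chromosomes) supplies the coalitions instead of all 2^(n-1).
import numpy as np

def approx_shapley(X, population):
    """population: iterable of binary chromosomes; reuses contribution() above."""
    n = X.shape[1]
    shap = np.zeros(n)
    counts = np.zeros(n)
    for chrom in population:                    # each chromosome = one coalition
        coalition = np.flatnonzero(chrom)
        for i in range(n):
            members = [j for j in coalition if j != i]
            if members:
                shap[i] += contribution(X, i, members)
                counts[i] += 1
    return shap / np.maximum(counts, 1)         # average over observed coalitions
```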
The entire CGA model is described by the flowchart in Fig. 1. From the figure, we can see that the two segments of the model, namely the coalition game and EGA, interact with each other through the sharing of the score values generated using Eq. 5. The MI values are computed only once and shared with both segments as required.
Brief overview of datasets and feature descriptors

Datasets

KTH

KTH dataset [29] consists of six classes of actions. The dataset consists of 599 videos, which are equally divided among the classes (100 videos each), except Hand-clapping, which consists of 99 videos. Sample frames from the KTH dataset are shown in Fig. 2.

Weizmann
Weizmann dataset [28] consists of 90 videos of ten actions, each performed by nine different persons. Sample frames from the Weizmann dataset are shown in Fig. 3.

UCF11
UCF11 dataset [30] consists of 11 action classes, each divided into 25 groups containing more than four clips per group. This is a very challenging dataset, as it has significant variations in camera motion, object appearance, pose, object viewpoint, scale, background clutter, illumination conditions, etc. Some sample frames taken from this dataset are shown in Fig. 4.

HMDB51
HMDB51 dataset [31] consists of 51 classes, each containing a minimum of 101 video clips. The dataset comprises a variety of realistic videos collected from YouTube and Google videos. Some sample frames from the dataset are illustrated in Fig. 5.

UCI HAR
This dataset has been prepared using a group of 30 volunteers within 19-48 years of age. Each participant performed six activities wearing a Samsung Galaxy S II smartphone. Data obtained from 70% of the volunteers were used for training, and the rest were used for testing. Each record in this dataset contains a 561-dimensional feature vector with time- and frequency-domain variables, an activity label, and an identifier of the subject who carried out the experiment. The dataset contains a total of 10,299 instances and 561 attributes.

UCI HAR_AAL
It is an extension of the UCI HAR dataset described above. More data were collected to perform a social connectedness experiment in Ambient Assisted Living (AAL).
Just like the UCI HAR dataset, it contains 561 attributes but has a total of 5744 instances.

Feature descriptors
As mentioned earlier, a set of four different feature descriptors, namely HOG [24], GLCM [25], SURF [26], and GIST [27], is extracted from the video frames in the present work. These features are computed after performing background subtraction (using one of the previous methods described in [84]) and detection of the minimum bounding box. In this work, 20 frames from each action video are used, a choice based on past experience that is also used as a standard value in the literature. The reasons are mainly twofold: first, we want to reduce the computational complexity; second, some starting and ending frames do not carry much significant information.
After applying the minimum bounding box to each frame, the human body performing the action is resized to a pre-defined dimension of 120 × 60 pixels. This dimension is kept common for all the frames, regardless of the video or dataset they belong to. Since these feature descriptors are well known in the pattern recognition community, we describe them only briefly here. For feature extraction, we have used the built-in implementations of the Image Processing and Computer Vision toolboxes present in MATLAB 2016a. The functions used for the different feature descriptors are mentioned in the following subsections.
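As an illustration of this preprocessing step, here is a sketch using OpenCV, with the MOG2 background subtractor substituted for the method of [84] and the crop resized to 120 × 60 (height × width assumed); none of these specific choices come from the paper.

```python
# Sketch of the per-frame preprocessing: background subtraction, minimum
# bounding box around the largest foreground blob, and resizing the crop.
import cv2

bg = cv2.createBackgroundSubtractorMOG2()        # stand-in for the method of [84]

def preprocess(frame):
    mask = bg.apply(frame)                       # foreground mask
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    crop = frame[y:y + h, x:x + w]
    return cv2.resize(crop, (60, 120))           # dsize is (width, height)
```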

HOG
The HOG is a feature descriptor used in computer vision and image processing domains for object detection [24].
The HOG feature descriptor counts occurrences of gradient orientations in localized portions of an image. A set of 8640 features is extracted (from the 20 selected frames) from each input video by applying the extractHOGFeatures() function implemented in the Computer Vision toolbox of MATLAB 2016a. The number of HOG features ($N_{HOG}$) for an image frame (Img) is calculated as

$N_{HOG} = \text{prod}(\text{blocksPerImage}) \times \text{prod}(\text{blockSize}) \times \text{NumBins}$

where $\text{blocksPerImage} = \lfloor (\text{size}(\text{Img})./\text{cellSize} - \text{blockSize})./(\text{blockSize} - \text{blockOverlap}) + 1 \rfloor$. The values of the parameters blockSize and cellSize are taken as [2, 2] and [16, 16], respectively, and the values of blockOverlap and NumBins are taken as blockSize/2 and nine, respectively. After calculating the HOG features from each of the 20 selected frames, the 100 most useful features are selected from each frame using a sparse filtering technique [85]. The selected features for these 20 frames are then aligned into a row vector; thus, for HOG, a total of 2000 features is selected for each action class.
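The stated dimensionality can be checked in a few lines using the block-count rule above (our reconstruction of the garbled formula, following MATLAB's documented behavior for extractHOGFeatures):

```python
# Quick check of the HOG dimensionality: a 120x60 frame with 16x16 cells,
# 2x2 blocks, 50% block overlap, and 9 bins yields 432 features per frame,
# so the 20 selected frames yield 8640.
import numpy as np

img_size = np.array([120, 60])
cell, block, bins = np.array([16, 16]), np.array([2, 2]), 9
overlap = block // 2
blocks_per_image = np.floor((img_size / cell - block) / (block - overlap) + 1)
n_hog = int(np.prod(blocks_per_image) * np.prod(block) * bins)
print(n_hog, 20 * n_hog)   # 432 8640
```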

GLCM
GLCM was first proposed by Robert M. Haralick, K. Shanmugam, and Its'hak Dinstein (1973) in their work ''Textural Features for Image Classification'' [25]. GLCM with its associated texture features is a method of extracting second-order statistical texture features.
GLCM is a matrix in which the number of rows and columns equals the number of gray levels in the image. From the second-order statistics mentioned above, one can derive some useful properties, and all of them have been used in our experiment. There are a total of 13 features calculated from the GLCM, each containing a definite amount of information about an input image; these features include energy, entropy, homogeneity, correlation, etc. GLCM also requires an ''offset'' parameter, which defines pixel relationships of varying direction and distance; offset values of [0 1; -1 1; -1 0; -1 -1] (represented in matrix form) are used for calculating the feature values from each selected frame. Here, 13 Haralick features are calculated for four different offset values over the 20 selected frames to get a feature vector of 1040 dimensions. In terms of implementation, the 1040 features are extracted by passing the offset vector as a parameter to the graycomatrix() function, which is part of the MATLAB 2016a Image Processing toolbox.
After calculating the GLCM-based features from each of the 20 selected frames, 50 feature values are selected from each frame using a sparse filtering technique [85]. The selected feature values for these 20 frames are then aligned into a row vector; thus, for the GLCM feature, a total of 1000 feature values is selected for each action class.
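As a rough per-frame sketch (not the paper's MATLAB pipeline), the mahotas library computes the 13 Haralick features for four directions at once, matching the 13 × 4 = 52 values per frame and 52 × 20 = 1040 values per video described above:

```python
# Sketch of per-frame Haralick/GLCM features, assuming the mahotas library:
# haralick() returns 13 features for each of 4 directions (0/45/90/135 deg).
import numpy as np
import mahotas

def glcm_features(frame_gray_uint8):
    feats = mahotas.features.haralick(frame_gray_uint8)   # shape (4, 13)
    return feats.ravel()                                  # 52 values per frame

# video_feats = np.concatenate([glcm_features(f) for f in frames20])  # 1040 values
```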

SURF
SURF is a local feature detector first proposed by Bay, Tuytelaars, and Van Gool in their work [26]. It is known as an upgraded version of SIFT. The SURF descriptor is scale and rotation invariant. The interest region is split into smaller 4 × 4 square subregions, and for each one, the Haar wavelet responses are extracted at 5 × 5 regularly spaced sample points. The responses are weighted with a Gaussian filter. The detectSURFFeatures() function of the MATLAB 2016a Computer Vision toolbox is used to extract the final 400 features (from the 20 selected frames) of each input action video by experimentally setting the value of its only parameter, select the strongest, to 20 for each frame.
After calculating the SURF features from each of the 20 selected frames, 20 feature values are selected from each frame using a sparse filtering technique [85]. The selected features for these 20 frames are then aligned into a row vector; thus, for the SURF feature, a total of 400 feature values is selected for each action class.

GIST
GIST features are used for estimating various statistical characteristics of an image [27]. GIST features are calculated by convolving a Gabor filter bank with each of the selected frames at different scales and orientations. In our experiment, we have used a Gabor filter transfer function having orientations per scale of [8 8 4]. The values of each convolution of the filter at each scale and orientation are used as GIST features for a particular frame. The features are obtained using the following process. First, the input frame is convolved with 20 Gabor filters, which produces 20 feature maps having the same size as the input frame. Then, each feature map is divided into 16 regions using a pre-defined grid of size 4 × 4, and the feature values are averaged within each region. Finally, the 16 averaged values of all 20 feature maps are concatenated, resulting in 16 × 20 = 320 GIST features per frame. Thus, a feature vector of dimension 6400 is calculated from the 20 selected input frames of an action video.
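The pooling step described above can be sketched as follows; the Gabor filtering itself is assumed to be done elsewhere, and only the 4 × 4 grid averaging that yields 16 × 20 = 320 values per frame is shown:

```python
# Sketch of the GIST pooling step: each of the 20 Gabor feature maps is
# divided into a 4x4 grid and averaged per cell, giving 16 x 20 = 320 values.
import numpy as np

def gist_pool(feature_maps, grid=4):
    """feature_maps: sequence of 20 filtered frames, each an HxW array."""
    out = []
    for fmap in feature_maps:
        h, w = fmap.shape
        for r in range(grid):
            for c in range(grid):
                cell = fmap[r * h // grid:(r + 1) * h // grid,
                            c * w // grid:(c + 1) * w // grid]
                out.append(cell.mean())
    return np.array(out)        # 320 values for 20 maps
```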
After calculating the GIST features from each of the 20 selected frames, 100 features are selected from each frame using a sparse filtering technique [85]. The selected features for these 20 frames are then aligned into a row vector; thus, for the GIST feature, a total of 2000 features is selected for each action class.
In our work, we have thus reduced, through sparse filtering, the total number of features representing an action video to 5400 (2000 features for HOG, 400 features for SURF, 1000 features for GLCM, and 2000 features for GIST), as mentioned in the respective feature descriptor subsections.
Relation among datasets, feature descriptors, and feature selection

At this point in the manuscript, it may be a bit hard to grasp how the datasets, feature descriptors, and FS are related to each other. That is why we provide a quick overview of the relations among these three components in this section. In this work, we have used four video-based datasets, namely KTH, Weizmann, UCF11, and HMDB51, and two sensor-based datasets, namely UCI HAR and UCI HAR_AAL, for the evaluation of the proposed FS method. The two sensor-based datasets are already available in the form of feature vectors in the UCI machine learning repository [86], but features are extracted from the other four video-based datasets using the feature descriptors mentioned in Sect. 5.2. After the feature vectors are generated, they are divided into training and testing sets. We have applied the proposed FS method, named CGA, over only the training feature vectors of every dataset. The FS provides the features which are relevant for the classification of activities in a particular dataset. The final set of features is then used to classify the activities present in the testing vectors. Classification accuracy is then measured as the percentage of correctly classified test instances.

Results and discussion

This section presents the results obtained by the proposed method over the different feature sets. The focus has been given to two of the most crucial evaluation metrics for any FS model: the classification accuracy obtained by CGA over the feature sets and the number of features used to achieve that accuracy. The accuracy and the number of features selected by CGA are then compared with some benchmark as well as recently proposed metaheuristic algorithms in the domain of FS. In order to provide a neutral environment for comparison, each of these algorithms is run on the feature sets five times, and a multilayer perceptron (MLP) [87] classifier is used to generate the classification accuracies, with the number of hidden neurons set to 70 (MLP-70). Table 1 presents the best, worst, and average accuracies, along with the standard deviation, achieved by CGA over the feature sets used for experimentation. From the table, we can see that the difference between the best and worst accuracies is very small, and the standard deviation is negligible for most of the feature sets.
Only for the KTH and Weizmann feature sets is the standard deviation more than 1 (approximately 4.8 and 7.2, respectively); for the other feature sets, the standard deviation is very small (less than 1). These facts show that CGA provides stable results over different runs, without much deviation. For KTH, CGA can achieve 100% accuracy, whereas for UCI HAR, CGA obtains accuracies of more than 90%. For the remaining datasets, even though CGA does not reach 90% accuracy, the results are quite remarkable when compared to other methods. In order to prove the effectiveness of the proposed method, the results obtained by CGA are compared against some classic and some recently proposed FS models: GA [74,88], BPSO [89], BGSA [90], BGSO [91], WFACOFS [92], and HMOGA [9]. The comparison data are provided in Tables 2 and 3. Table 2 compares the average classification accuracies obtained by the different algorithms over the same sets of features for five runs. The rank of CGA among all the algorithms is provided in the final column of Table 2. From the table, we can observe that CGA achieves the best average classification accuracy on five out of the six feature sets, and for the only other dataset, KTH, it is ranked second. These results clearly show how good CGA is in terms of classification over different feature sets.
Apart from accuracy, the number of features selected by an FS algorithm is also significant. The average number of features selected by each algorithm over the different feature vectors is presented in Table 3. From Table 3, we can see that even though CGA does not always use the smallest number of features among the algorithms, it uses a balanced set of features to achieve the high classification accuracies presented in Table 1. CGA uses the fewest features for the KTH and Weizmann feature sets among all the algorithms. From the table, it can also be seen that WFACOFS is very good in terms of the average number of features; however, as it uses a considerably smaller number of features in most cases, its classification prowess gets compromised. In conclusion, since classification accuracy is always given higher priority over the number of selected features in FS problems, CGA performs the best among all the algorithms used for comparison.
To prove the systematic improvement in the population of the algorithm, which is the heart of any evolutionary algorithm, a comparison of algorithms is provided in terms of convergence of accuracies over iterations for different feature sets.For proper evaluation, three top-ranked algorithms (in terms of overall average classification accuracy), i.e., CGA, WFACOFS, and BGSA, are selected from Table 2. Figures 6, 7, 8, 9, 10, and 11 represent the convergence of accuracies obtained by these three algorithms over UCF11, UCI HAR_AAL, HMDB51, UCI HAR, KTH, and Weizmann feature sets, respectively.
From the convergence graphs, it is visible that CGA achieves a reasonable convergence rate in terms of classification accuracy compared to the other algorithms. This establishes the success of CGA as an evolutionary algorithm, because it can improve the solutions over the iterations in a stable manner. We can clearly state that CGA is an excellent evolutionary algorithm applicable to FS problems, which is evident from the discussion of the results.

Comparison with state-of-the-art HAR methods
To evaluate the quality of the proposed model, the results of some state-of-the-art models, outlined in Sect. 2, are tabulated against the results obtained by the proposed model in Table 4 for different datasets. To prove the effectiveness of the model, the proposed model is compared with some previously proposed models (with no FS method used) that have already been applied to the corresponding datasets. Thus, we compared our CGA-based FS model with four models for the UCF11 and UCI HAR datasets, three models for the UCI HAR_AAL dataset, five models for the HMDB51 and KTH datasets, and six models for the Weizmann dataset. The references, along with their classification accuracies, are provided in Table 4. It can be observed from Table 4 that in the case of two video-based HAR datasets, HMDB51 and KTH, as well as the two sensor-based HAR datasets, UCI HAR and UCI HAR_AAL, the proposed CGA-based FS model performs significantly better than all the other state-of-the-art methods. It is to be noted that the proposed model achieves nearly perfect accuracy (i.e., 100%) for the KTH dataset. However, the model does not perform as well for the remaining two video-based HAR datasets, UCF11 and Weizmann. This is because the Weizmann dataset suffers from some serious issues like the presence of background occlusion, misleading shadows, noisy backgrounds, and changes in illumination due to very low-resolution videos (see Fig. 3 for more detail). Similarly, the videos present in the UCF11 dataset (whose sample frames are shown in Fig. 4) pose additional problems like different viewpoints or camera angles, different sizes of objects or humans, etc. As a result, the four feature descriptors (HOG, GLCM, SURF, and GIST) used for extracting the features fail to capture valuable information from such video frames, which results in comparatively lower classification accuracy. Most of the existing HAR methods used in the comparison for the UCF11 or Weizmann datasets have focused their effort specifically on the corresponding dataset; hence, they have taken care of the problems faced in the case of UCF11 or Weizmann while designing their feature vectors. On the contrary, the present work aims to develop a generalized FS model that is independent of the underlying datasets or feature descriptors and provides satisfactory, if not the best, results over all kinds of datasets. In this regard, we would like to mention that our proposed CGA model provides comparable classification accuracy by utilizing very few features while still capturing the useful information from four very important feature descriptors. This proves the ability of the proposed CGA model to select the optimal feature subset over both video-based and sensor-based HAR datasets. For a better understanding of the comparison outcome per dataset, the graphs are presented in Figs. 12, 13, 14, 15, 16, and 17. Each graph plots the accuracies obtained by different models for a particular dataset. So, for six datasets, six graphs are plotted, illustrating the model-wise comparison of classification accuracies. From the graphs, it is clear that for four (UCI HAR_AAL, HMDB51, UCI HAR, and KTH) of the six datasets, the proposed CGA-based model obtains the best classification accuracy among all the models used in the comparison. For the remaining two datasets (UCF11 and Weizmann), the model does not achieve the best accuracy, but the results are comparable with those of the other models mentioned in Table 4. So, it can be concluded that the proposed model works efficiently on various HAR datasets and provides competitive outcomes when compared with some state-of-the-art models.

Conclusion
From the results and the associated discussion, we can see that CGA has exceptional FS abilities. CGA addresses a significant drawback of coalition games: the massive time requirement for processing the worth of a single feature. Thus, it is a novel attempt to make coalition games applicable to the domain of FS. It also addresses one of the drawbacks present in classic GA, i.e., the absence of a guidance policy in mutation. CGA uses mutual information to guide the mutation so that the solutions improve in quality, and this works better than the random mutation procedure. From the experiments, we can state that there are many redundant features in the HAR datasets: CGA can produce very high accuracy using less than 70% of the features. This level of reduction can hugely affect HAR tasks by making them considerably faster. In addition, by using CGA, we have been able to capture the useful information provided by four important feature descriptors (HOG, GLCM, SURF, and GIST), which would have been non-viable if FS had not been performed, as the dimension of the combined feature vector is quite large and would require a huge amount of time to process without FS. In the future, we want to apply this method to various new fields.


Fig. 1 Flowchart of the CGA model consisting of two different segments, namely EGA and the coalition game, interacting with each other
Fig. 2 Sample frames from the KTH dataset

Fig. 12 Model-wise comparison of performance (in terms of classification accuracy) over the UCF11 dataset

Fig. 13 Model-wise comparison of performance (in terms of classification accuracy) over the UCI HAR_AAL dataset

Fig. 14 Model-wise comparison of performance (in terms of classification accuracy) over the HMDB51 dataset

Fig. 15 Model-wise comparison of performance (in terms of classification accuracy) over the UCI HAR dataset

Fig. 16 Model-wise comparison of performance (in terms of classification accuracy) over the KTH dataset
Fig. 17 Model-wise comparison of performance (in terms of classification accuracy) over the Weizmann dataset

Table 1 Description of the results obtained by CGA over different feature sets

Table 3 Comparison of the average number of features selected by different FS algorithms (CGA, BPSO, BGSA, GA, BGSO, WFACOFS, and HMOGA) over different feature vectors

Table 4 Comparison of the proposed CGA-based methodology with some preceding HAR methods