This section first describes the feature extraction and selection methods for face tracking, and then reports the feature extraction of the eye-tracking data.
3.2.1 Facial Features
In the facial feature extraction step, 34 facial landmarks were extracted frame by frame, through the webcam, from each participant's recorded session. These landmarks cover five regions of the face: eyes, eyebrows, nose, lips, and jaw, and are represented as a pool of feature vectors of x and y coordinates, as shown in Eq. (1). Let \({\text{f}}_{\text{n}}^{\text{i}}\) denote the matrix of landmark coordinates from the i-th to the n-th video frame.
\({\text{f}}_{\text{n}}^{\text{i}}=\left[\begin{array}{cccc}{\text{x}}_{0}^{\text{i}}\,{\text{y}}_{0}^{\text{i}} & {\text{x}}_{1}^{\text{i}}\,{\text{y}}_{1}^{\text{i}} & \cdots & {\text{x}}_{33}^{\text{i}}\,{\text{y}}_{33}^{\text{i}}\\ \vdots & \vdots & \ddots & \vdots \\ {\text{x}}_{0}^{\text{n}}\,{\text{y}}_{0}^{\text{n}} & {\text{x}}_{1}^{\text{n}}\,{\text{y}}_{1}^{\text{n}} & \cdots & {\text{x}}_{33}^{\text{n}}\,{\text{y}}_{33}^{\text{n}}\end{array}\right]\) | (1) |
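As a minimal illustration, the following Python sketch shows how the per-frame landmarks can be stacked into the matrix of Eq. (1); the `detect_landmarks` routine is a hypothetical placeholder for the webcam-based landmark detector and is assumed to return the 34 (x, y) coordinates for one frame.

```python
import numpy as np

def build_landmark_matrix(frames, detect_landmarks):
    """Stack per-frame landmarks into the matrix of Eq. (1).

    `detect_landmarks(frame)` is assumed to return a (34, 2) array of
    (x, y) landmark coordinates; each output row is the flattened
    vector [x0, y0, x1, y1, ..., x33, y33] for one frame.
    """
    rows = []
    for frame in frames:
        landmarks = np.asarray(detect_landmarks(frame))  # shape (34, 2)
        rows.append(landmarks.reshape(-1))               # shape (68,)
    return np.vstack(rows)                               # shape (n_frames, 68)
```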
To explore the temporal variation of muscular activity across landmarks, we computed the Euclidean distance between every pair of landmarks using Eq. (2). This approach is common in the literature for exploring differences between posed and neutral facial expressions [47–50].
\(\text{d}\left[\left({\text{x}}_{1},{\text{y}}_{1}\right),\left({\text{x}}_{2},{\text{y}}_{2}\right)\right]=\sqrt{{\left({\text{x}}_{2}-{\text{x}}_{1}\right)}^{2}+{\left({\text{y}}_{2}-{\text{y}}_{1}\right)}^{2}}\) | (2) |
where \(\left({\text{x}}_{1},{\text{y}}_{1}\right)\) and \(\left({\text{x}}_{2},{\text{y}}_{2}\right)\) are the coordinates of two different facial landmarks. Computing this distance between every pair of the 34 landmarks yields 561 geometric distance features. These were reduced from 561 to 20 features by applying a feature selection method [51]. Feature selection restricts the training samples to the most informative features while maintaining the efficiency of the model; its objective is to reduce computational cost. In this model, we selected the best features from the 561 pair distances. The data samples were first standardized to ensure a comparable range across features. This standardization is computed as follows (Eq. 3):
\(\text{Z}= \frac{{\text{x}}_{\text{i}}-\text{mean}\left(\text{x}\right)}{\text{stdev}\left(\text{x}\right)}\) | (3) |
where Z is the standardized score and stdev is the standard deviation of the data samples. In particular, the standardization subtracts the mean of the samples from each value and divides the result by the standard deviation.
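A minimal Python sketch of this step is shown below; it assumes the landmarks are already available as an array of shape (n_frames, 34, 2), computes the 561 pairwise distances of Eq. (2), and then applies the z-score standardization of Eq. (3).

```python
import numpy as np
from itertools import combinations

def pairwise_distance_features(landmarks):
    """Compute the 561 pairwise Euclidean distances (Eq. 2) per frame.

    `landmarks` has shape (n_frames, 34, 2): the (x, y) coordinates of
    the 34 landmarks in every frame.
    """
    pairs = list(combinations(range(landmarks.shape[1]), 2))  # C(34, 2) = 561 pairs
    features = np.empty((landmarks.shape[0], len(pairs)))
    for j, (a, b) in enumerate(pairs):
        diff = landmarks[:, a, :] - landmarks[:, b, :]
        features[:, j] = np.sqrt((diff ** 2).sum(axis=1))     # Eq. (2)
    return features

def zscore(features):
    """Standardize every distance feature as in Eq. (3)."""
    return (features - features.mean(axis=0)) / features.std(axis=0)
```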
The 34 extracted landmarks, indexed 0–33, are: 0-Right Top Jaw, 1-Right Jaw Angle, 2-Gnathion, 3-Left Jaw Angle, 4-Left Top Jaw, 5-Outer Right Brow, 6-Right Brow Corner, 7-Inner Right Brow Corner, 8-Inner Left Brow Corner, 9-Left Brow Center, 10-Outer Left Brow Corner, 11-Nose Root, 12-Nose Tip, 13-Nose Lower Right Boundary, 14-Nose Bottom Boundary, 15-Nose Lower Left Boundary, 16-Outer Right Eye, 17-Inner Right Eye, 18-Inner Left Eye, 19-Outer Left Eye, 20-Right Lip Corner, 21-Right Apex Upper Lip, 22-Upper Lip Center, 23-Left Apex Upper Lip, 24-Left Lip Corner, 25-Left Edge Lower Lip, 26-Lower Lip Center, 27-Right Edge Lower Lip, 28-Bottom Lower Lip, 29-Top Lower Lip, 30-Upper Corner Right Eye, 31-Lower Corner Right Eye, 32-Upper Corner Left Eye, 33-Lower Corner Left Eye.
To identify the facial features most prominent for differentiating attention from inattention, the geometrical information was estimated pairwise between all landmarks using the Euclidean distance formula in Eq. (2).
To select the best features from the facial features, a threshold distance was used. The threshold distance is a measurement that describes the change between the facial expression in a neutral frame and in an expression frame [52]. The threshold-distance value is an established way of revealing the information embedded in a dataset, and the approach has been successfully applied to differentiate posed emotions from neutral emotions [38, 53]. In this study, the landmark pair distances with the highest threshold values between attention and inattention (Fig. 4) were selected to train several binary classifier algorithms. The parameters describing the feature selection process are listed in Table 1, and a short sketch of the selection step follows the table.
Table 1
Description of parameters used in the feature selection (Fig. 4)
Parameters | Description |
\({\text{f}}_{1..\text{n}}\) | Frame-by-frame landmark detections |
\({\text{f}{\prime }}_{1..\text{n}}\) | Frames annotated as attention |
\({\text{f}{\prime }{\prime }}_{1..\text{n}}\) | Frames annotated as inattention |
\(\text{g}{\text{f}{\prime }}_{1..\text{n}}\) | Geometrical information of the attention frames |
\(\text{g}{\text{f}{\prime }{\prime }}_{1..\text{n}}\) | Geometrical information of the inattention frames |
\(\text{g}\text{f}\) | Geometrical information, represented as the mean value of the landmark coordinates |
\(\text{g}{\text{f}{\prime }{\prime }}_{1..\text{n}}-\text{g}{\text{f}{\prime }}_{1..\text{n}}\) | The difference between the mean geometrical information of the inattention and attention frames |
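The following sketch illustrates the threshold-distance selection summarized in Table 1, under the assumption that the 20 pair distances with the largest absolute gap between the attention and inattention means are retained; the exact threshold used in the study is the one shown in Fig. 4.

```python
import numpy as np

def select_by_threshold_distance(features, labels, n_keep=20):
    """Rank pair distances by the gap between attention and inattention means.

    `features`: (n_frames, 561) standardized pair distances;
    `labels`: 1 for frames annotated as attention, 0 for inattention.
    Following Table 1, the mean geometrical information is computed for
    the attention frames (gf') and the inattention frames (gf''), and
    the n_keep features with the largest absolute difference are kept.
    """
    labels = np.asarray(labels)
    gf_attention = features[labels == 1].mean(axis=0)
    gf_inattention = features[labels == 0].mean(axis=0)
    gap = np.abs(gf_inattention - gf_attention)
    selected = np.argsort(gap)[::-1][:n_keep]   # indices of the top-n_keep features
    return selected, features[:, selected]
```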
3.2.2 Eye-Tracking Features
The gaze-based attentional model is built on six primary eye-tracking features: gaze position, fixation position (FixationX, FixationY), fixation duration, ocular distance, i.e., the head's distance to the screen (DistanceLeft, DistanceRight), pupil size (PupilLeft, PupilRight), and interocular distance. These features are described in Table 2.
Table 2
Description of Gaze-Based Features
Gaze Features | Gaze Sub Features | Description |
1. Pupil Size | PupilLeft | The pupil size of the left eye |
PupilRight | The pupil size of the right eye |
2. Ocular Distance | DistanceLeft | The distance of the participant's left eye to the screen |
DistanceRight | The distance of the participant's right eye to the screen |
3. Fixation Duration | - | The length of time the participant spent looking at the stimulus |
4. Fixation Position | FixationX | The x-coordinate of the eye fixation on the stimulus on the screen |
FixationY | The y-coordinate of the eye fixation on the stimulus on the screen |
5. Gaze Position | GazeLeftx | The x-coordinate of the participant's gaze on the screen through the left eye |
GazeLefty | The y-coordinate of the participant's gaze on the screen through the left eye |
GazeRightx | The x-coordinate of the participant's gaze on the screen through the right eye |
GazeRighty | The y-coordinate of the participant's gaze on the screen through the right eye |
6. InterOcular Distance | - | The distance between the left pupil and the right pupil |
Next, we identified the annotated samples labelled as attention, inattention, or unknown. Samples with the unknown label were removed, leaving only samples labelled attention or inattention. The string values of the annotation column, attention and inattention, were converted to the integers 1 and 0, respectively. Lastly, we standardized each feature to the same scale with the StandardScaler from scikit-learn, which applies the z-score transformation of Eq. (3), to obtain an approximately normalized sample distribution.
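A minimal sketch of this preprocessing is given below; the DataFrame layout and the column name `annotation` are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess_gaze_samples(df: pd.DataFrame, label_col: str = "annotation") -> pd.DataFrame:
    """Drop 'unknown' samples, encode the labels, and standardize the features.

    Assumes `df` contains the gaze features of Table 2 plus a string
    label column with the values 'attention', 'inattention', or 'unknown'.
    """
    df = df[df[label_col] != "unknown"].copy()                 # discard unknown samples
    df[label_col] = df[label_col].map({"attention": 1,         # attention  -> 1
                                       "inattention": 0})      # inattention -> 0
    feature_cols = [c for c in df.columns if c != label_col]
    df[feature_cols] = StandardScaler().fit_transform(df[feature_cols])  # z-score, Eq. (3)
    return df
```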
The best features were selected using an embedded method that exploits the inherent feature-importance estimates of decision-tree algorithms such as random forest and CART [54], an approach that is especially recommended for imbalanced datasets [55]. In this method, features of randomly selected samples were permuted and the resulting percentage increase in the misclassification rate was used to rank and select the best individual eye-tracking features (Fig. 5).
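The sketch below approximates this embedded selection with scikit-learn's permutation importance on a random forest; the drop in held-out accuracy under permutation mirrors the increase in misclassification rate used in the study, but the split size and hyperparameters shown here are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def rank_gaze_features(X, y, feature_names, random_state=0):
    """Rank eye-tracking features by permutation importance.

    A random forest is fit on a training split; each feature is then
    randomly permuted on the held-out split, and the drop in accuracy
    (i.e., the rise in misclassification rate) gives its importance.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state)
    forest = RandomForestClassifier(n_estimators=200, random_state=random_state)
    forest.fit(X_train, y_train)
    result = permutation_importance(forest, X_test, y_test,
                                    n_repeats=20, random_state=random_state)
    order = np.argsort(result.importances_mean)[::-1]
    return [(feature_names[i], result.importances_mean[i]) for i in order]
```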