Literature Review
Prior research has examined the implicit and explicit communication of vulnerable road users. In addition to official traffic regulations, informal rules and communication among road users improve traffic safety [1]. Means of communication among road users can be formal or informal and can indicate an intent for a specific action. Formal means of nonverbal communication are acknowledged and specified by official traffic regulations. For motor vehicles, these include, among others, the rear brake lights, the turn signals and the hazard warning lights. Bicyclist hand gestures are also acknowledged as formal signals by the German Traffic Regulations [2]. Informal means of communication are unofficial signals that result from the shared experience of road users and the prevailing social norms. These may include “looking around”, eye contact or driving dynamics [3].
Human road users are found in many instances to consider vehicles in the traffic scene as social units and predict the evolution of a traffic scenario through observation and interpretation of the behavior of the other road users. In cases of uncertainty in the immediate evolution of a traffic scenario, road users will try to interact with other road users, communicate their intentions and cooperate in order to resolve the traffic situation [4]. In this context, an effort is made to categorize the communication behavior of road users based on their goals in their present traffic situation. Movement-achieving, movement-signaling (or intent-signaling), perception-achieving and perception-signaling are defined as the primary categories of the communication behavior, where a given communication behavior can belong simultaneously to multiple categories. Explicit and implicit communication behavior of road users are then considered a subset of this space [5].
The study of bicyclist communication behavior is becoming increasingly important for the development of automated driving functions. To develop such systems, it is necessary to first understand the interaction between bicyclists and automated vehicles and to be able to predict the intentions of bicyclists [6]. Human prediction of intent is largely based on single strong indicators, e.g. a clear head movement by the bicyclist. Several weaker indicators, which together could form a strong combined indicator or an equivalent indication, are likely to be overlooked or are too complex for people to process in real time. Six implicit communication cues are decisive for the prediction of the bicyclist's behavior: head movement, speed, speed adjustment, leaning, position and pedaling behavior [7]. Several studies have found that hand gestures [8–13], eye contact, head movements and over-the-shoulder head turns [7, 10–14] can be decisive explicit and implicit communication cues that reveal a bicyclist's intentions and can therefore be used to develop a methodology for predicting bicyclist behavior. These cues can be extracted using skeleton-based detection and tracking algorithms [15, 16] coupled with k-means classification for identifying bicyclist gesture archetypes [13].
Previous studies suggest that implicit and explicit communication cues can be used to predict road user behavior. The open question is whether, and to what extent, this information can be used to support autonomous driving applications and to better understand bicyclist behavior. When executing maneuvers, bicyclists often perform hand gestures to indicate their intended direction of travel. As part of the execution of a maneuver, a bicyclist might perform a sequence of implicit and explicit communication cues. These cues can fall into any of the categories defined in [5] and can differ between traffic situations and between road users. They might also be performed multiple times during or before the execution of a maneuver. A sequence of communication cues therefore arises before the execution of a specific maneuver. The same cue might appear multiple times in a communication sequence, while multiple cues can appear in a specific order within one sequence or be part of different communication sequences.
Sequential pattern mining, classifiers and deep learning models can provide insights and discover meaningful patterns to predict intended behaviors. Several approaches have been proposed for the intention prediction of human actions in automated driving and road user behavior research. The vast majority of these approaches address either the trajectory and sequence prediction problem or the classification of specific actions that road users will perform within a future time horizon. Recent work on this topic relies primarily on ensemble methods and classifiers and, for the most part, on deep learning models [17–19]. The proposed methods are tailored to specific data input types, perception methods and detector types. Additionally, the deployment of deep learning models is a time-consuming and complicated process that relies heavily on the specific data input and the dimensionality of the prediction task. Because these methods address specific prediction or classification problems, they are not easily transferable or adaptable to similar problems in other fields or with other types of data streams.
Automated vehicles are already equipped with several cameras and lidar sensors for an almost complete detection of their environment [20]. In the case of bicyclists, automated vehicle sensors can capture key features of their dynamic and operational behavior and make short-term predictions of their possible trajectories. Previous research has already highlighted the importance of bicyclist communication behavior as an additional information layer [13, 16]. A standalone method that exploits the information gain from bicyclist communication cues therefore has the potential to expand and support existing models and functions in the field of automated driving, and to improve traffic efficiency and safety through more accurate forecasting of bicyclist behavior.
Bicycle Simulator Study Design
In this paper we explore and evaluate supervised machine learning methods for predicting bicyclist intention for specific maneuver types at an intersection approach. We exclusively use explicit and implicit communication cues and features that can typically be extracted using video cameras, which are part of the standard set of detectors found on automated vehicles. First, in order to study the explicit and implicit communication behavior of bicyclists, a bicycle simulator introduced at the Chair of Traffic Engineering and Control of the Technical University of Munich is used [21]. No taxonomy currently exists for clustering traffic situations in which two or more road users interact. Situations of interaction between an automated vehicle and one human road user are classified based on discussions and assumptions derived from real situations [1]. Under the German Traffic Regulations [2], road users at an intersection approach can perform the following driving maneuvers: crossing the intersection, turning left and turning right.
The study comprises six scenarios, each with a different combination of conflict vehicles and participant maneuvers (Table 1). Three scenarios are examined for the left turning maneuver, two scenarios for the crossing and one scenario for right turning. In all scenarios, the test subject approaches a four-way intersection of two two-lane roads. Traffic must follow the right-before-left rule. Except for those in scenario 6 (Table 1), each traffic lane is 3.3 meters wide. Adjacent to the traffic lanes, a 2-meter-wide parking lane and then a 3-meter-wide sidewalk are placed. The curb turn radii at the intersections are set to 9.5 meters. The speed limit on all streets is set to 30 km/h. The infrastructure layout and traffic lane widths at each intersection are based roughly on those at the intersection of Oberanger Straße - Rosental Straße in Munich, Germany. Motor vehicles are simulated using SUMO [22], while the traffic flow and behavior are controlled using the TraCI API [23]. The advantage of using SUMO is that simulated vehicles can interact with the test subject inside the simulated environment. Instructions for specific maneuver types are given to the test subjects using on-screen messages appearing at the bottom right corner of the front screen. Each message is programmed to appear when the ego bicyclist is approximately 200 meters away from the intersection and to disappear when they are approximately 100 meters away.
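The display window of the instruction message can be written as a simple distance check. The following is a minimal sketch only: the function name, the constant names and the choice of inclusive or exclusive bounds are assumptions, and only the approximate 200 m and 100 m distances come from the text.

```python
# Hypothetical helper: names and comparison operators are assumptions;
# only the approximate 200 m and 100 m distances come from the text.
SHOW_AT_M = 200.0  # message appears at roughly this distance
HIDE_AT_M = 100.0  # message disappears at roughly this distance


def instruction_visible(distance_to_intersection_m: float) -> bool:
    """Return True while the maneuver instruction should be on screen."""
    return HIDE_AT_M < distance_to_intersection_m <= SHOW_AT_M


# A rider approaching the intersection, sampled at a few distances.
for d in (250.0, 200.0, 150.0, 100.0, 50.0):
    print(d, instruction_visible(d))
```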
An Intel RealSense D435 depth camera is used to collect the implicit and explicit communication features. The camera uses stereo recording to calculate depth and has two depth sensors, an RGB sensor and an infrared projector for collecting the depth information. The PoseNet model [24] is used for extracting the skeleton of the test subjects. Arm gestures (left arm gesture, right arm gesture) are detected using a k-means classification. Head movements are detected using a helmet marker and are then classified (right glance, left glance, left over shoulder glance) using a threshold value method.
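A threshold value method for head movements could look like the sketch below. The thresholds and the sign convention (positive yaw means the head is turned left) are illustrative assumptions; the study does not report the values it used, and the function name is hypothetical.

```python
# All threshold values and the sign convention (positive yaw = head turned
# left) are illustrative assumptions, not values reported in the study.
GLANCE_DEG = 30.0         # hypothetical minimum yaw for a side glance
OVER_SHOULDER_DEG = 90.0  # hypothetical minimum yaw for an over-shoulder look


def classify_head_movement(yaw_deg: float) -> str:
    """Map a head yaw angle in degrees to one of the head movement classes."""
    if yaw_deg >= OVER_SHOULDER_DEG:
        return "left over shoulder glance"
    if yaw_deg >= GLANCE_DEG:
        return "left glance"
    if yaw_deg <= -GLANCE_DEG:
        return "right glance"
    return "none"


for yaw in (0.0, 45.0, 110.0, -40.0):
    print(yaw, classify_head_movement(yaw))
```

Checking the largest threshold first keeps the classes mutually exclusive, so an over-shoulder look is never also counted as a plain left glance.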
Data Collection and Preprocessing
Thirty-one test subjects participated in the simulator study (21 male, 10 female). Of these, two were under the age of 18, one was between the ages of 18 and 24, and three were over the age of 60. The rest of the subjects were between 25 and 59 years old. At the beginning of the simulator study, all test subjects were informed about its objective and were asked to cycle a test track in the simulated environment to familiarize themselves with the simulator. At the end of the simulator test, all test subjects completed an online questionnaire.
The explicit and implicit communication cues were recorded with a 0.1 s resolution. A total of 411 gestures were recorded, of which 357 (86.9%) were implicit and 54 (13.1%) were explicit. Explicit and implicit gesture classes include the left- and right-hand gestures, the left over shoulder look, the left and right look, and classes of body posture with respect to leaning in a specific direction (forward, backward, left and right lean). The states of these gesture classes were one-hot encoded in the dataset. Additionally, metric features of the skeletal model joints of the bicyclists were included in the dataset. These are the lateral and longitudinal body angle, the head yaw angle, the lateral and longitudinal left arm angle, the lateral and longitudinal right arm angle, the left and right elbow angle, and the shoulder twist angle.
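As an illustration of the encoding, each 0.1 s frame can be represented by one binary flag per gesture class. This is a minimal sketch: the class names follow the gesture classes named in the text, while the column order and the helper name are assumptions.

```python
from __future__ import annotations

# Class names follow the gesture classes named in the text; the column
# order and the helper name are assumptions made for this sketch.
GESTURE_CLASSES = [
    "left hand gesture", "right hand gesture",
    "left look", "left over shoulder", "right look",
    "back lean", "forward lean", "left lean", "right lean",
]


def encode_frame(active_cues: set[str]) -> list[int]:
    """Return one 0/1 flag per gesture class for a single 0.1 s frame."""
    unknown = active_cues - set(GESTURE_CLASSES)
    if unknown:
        raise ValueError(f"unknown cue(s): {sorted(unknown)}")
    return [1 if name in active_cues else 0 for name in GESTURE_CLASSES]


# Two cues active in the same frame set two flags.
frame = encode_frame({"left hand gesture", "left over shoulder"})
print(frame)  # [1, 0, 0, 1, 0, 0, 0, 0, 0]
```

Keeping one independent flag per class allows several cues to be active in the same frame.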
Generally, implicit gestures were made more often than explicit gestures. Explicit gestures were given before the maneuvers were carried out, with an average frequency of fewer than two gestures per test subject; however, not all test subjects performed explicit gestures in the corresponding scenarios. Multiple non-mutually exclusive implicit and explicit communication cues might also take place at the same time. With respect to the explicit communication cues (left arm gesture, right arm gesture), no test subject executed such a gesture in the scenarios with a crossing maneuver. In no case was an incorrect hand gesture performed. On average, the test subjects performed 5.2 implicit and/or explicit communication cues in every scenario (Figure 1).
The streams of implicit and explicit communication cues for each test subject in each scenario were extracted from the dataset and combined into a single sequence. Only communication cues taking place 20 m before the intersection approach are considered. A timestamp that corresponds to the estimated arrival time at the start of the intersection approach is associated with the communication cues. In total, 125 sequences of explicit and implicit communication cues were extracted across all maneuver types. Eighty-eight sequences are used for training the maneuver prediction model and 37 for testing it. The sequences are assigned to the training and testing groups at random to avoid bias. In order to incorporate the sequential information and the duration of communication cues taking place in a specific order, the time (s) until reaching the stop line of the intersection approach is also introduced as an additional feature for each bicyclist communication sequence. Finally, the features are normalized by scaling each one according to its individual minimum and maximum values.
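The random 88/37 split and the min-max scaling described above can be sketched in a few lines. The helper names, the random seed and the example values are assumptions made for illustration; only the sequence counts and the scaling rule (value minus minimum, divided by the range) follow the text.

```python
import random


def min_max_scale(column):
    """Scale one feature column to [0, 1] using its own minimum and maximum."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def random_split(items, n_train, seed=0):
    """Shuffle a copy of items and cut it into training and testing parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# 125 sequences, split into 88 for training and 37 for testing.
sequence_ids = list(range(125))
train_ids, test_ids = random_split(sequence_ids, n_train=88)
print(len(train_ids), len(test_ids))  # 88 37

# Made-up values for the "time (s)" feature of one sequence.
time_to_stop_line_s = [4.0, 3.0, 2.0, 0.0]
scaled = min_max_scale(time_to_stop_line_s)
print(scaled)  # [1.0, 0.75, 0.5, 0.0]
```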
Task Formulation, Classifier Model Selection and Deployment
The prediction of the bicyclist maneuver type intention at an intersection approach can be formulated as a typical multiclass classification problem. The three predefined maneuver classes are 1) left-turn, 2) right-turn, 3) crossing. We use scikit-learn [25] for deploying, calibrating and assessing classifiers that are appropriate for multiclass classification problems. These include the k-nearest neighbors classifier (kNN-C), the extra trees classifier (ET-C), the decision tree classifier (DT-C), random forest (RF-C), logistic regression (LR-C), the linear support vector machine (SVM) coupled with SGD training (SVM-SGDC), the SVM coupled with an appropriate C-support value and a linear kernel for vector classification (SVM-C) and the Gaussian naive Bayes classifier (GNB-C). All classifiers are appropriate for multiclass classification tasks with high-dimensional features. Finally, the dataset is split into k = 10 equal subsets using stratified 10-fold cross validation. We then train and test each model on the k-fold samples. The model evaluation is performed using the cross validation accuracy scores averaged over all k folds. The dependent and independent variables used for the classification task are listed in Table 2:
Table 2: Dependent and independent variables.
Input variable | Type
lateral body angle (°) | Continuous
longitudinal body angle (°) | Continuous
left hand gesture (-) | Boolean
right hand gesture (-) | Boolean
left look (-) | Boolean
left overshoulder (-) | Boolean
right look (-) | Boolean
head yaw angle (°) | Continuous
back lean (-) | Boolean
forward lean (-) | Boolean
left lean (-) | Boolean
right lean (-) | Boolean
lateral left arm angle (°) | Continuous
longitudinal left arm angle (°) | Continuous
left elbow angle (°) | Continuous
time (s) | Continuous
lateral right arm angle (°) | Continuous
longitudinal right arm angle (°) | Continuous
right elbow angle (°) | Continuous
shoulder twist angle (°) | Continuous
Output variable | Type
Maneuver Type (left, right, crossing) | Discrete
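The classifier comparison described in this section can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the configuration used in the study: the data are synthetic (20 features and 3 classes, matching the counts in Table 2), and all hyperparameters, including C = 1.0 and the random seeds, are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 20 features and 3 maneuver classes.
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)

# Hyperparameters are placeholders, not the calibrated values of the study.
classifiers = {
    "kNN-C": KNeighborsClassifier(),
    "ET-C": ExtraTreesClassifier(random_state=0),
    "DT-C": DecisionTreeClassifier(random_state=0),
    "RF-C": RandomForestClassifier(random_state=0),
    "LR-C": LogisticRegression(max_iter=1000),
    "SVM-SGDC": SGDClassifier(loss="hinge", random_state=0),  # linear SVM via SGD
    "SVM-C": SVC(kernel="linear", C=1.0),
    "GNB-C": GaussianNB(),
}

# Stratified 10-fold cross validation; accuracy is averaged over the folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
mean_accuracy = {
    name: cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
    for name, clf in classifiers.items()
}

for name, acc in sorted(mean_accuracy.items(), key=lambda kv: -kv[1]):
    print(f"{name:9s} {acc:.3f}")
```

Stratification keeps the share of left-turn, right-turn and crossing samples roughly equal across folds, which matters when one maneuver class is rarer than the others.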
Feature Selection
For machine learning models, feature selection is a necessary step towards reducing the high dimensionality of the problem formulation by selectively discarding features that do not contribute significantly to model performance [26]. Here, prior to training the classifiers, we use scikit-learn [25] to deploy random forests as a feature selection method for identifying the most important features, i.e. those that optimally balance robustness and classification performance [27].
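One way to use a random forest for feature selection in scikit-learn is to rank features by the forest's impurity-based importances and keep those above a cutoff. The sketch below assumes synthetic data and a mean-importance cutoff; the text does not state which selection criterion or threshold was applied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption): 20 features and 3 maneuver classes.
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)

# Fit a forest and keep features whose importance is at least the mean
# importance. The "mean" cutoff is an assumption made for this sketch.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
selector = SelectFromModel(forest, threshold="mean").fit(X, y)

importances = selector.estimator_.feature_importances_
kept = selector.get_support(indices=True)  # column indices of kept features
X_selected = selector.transform(X)

print("kept feature indices:", kept.tolist())
print("reduced shape:", X_selected.shape)
```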