Understanding the Factors Influencing Pedestrian Walking Speed over Elevated Facilities using Tree-Based Ensembles and Shapley Additive Explanations


Accurate estimation of factors affecting pedestrian walking speed is of paramount importance for efficient operation and management of at-grade and grade-separated infrastructures (such as foot over bridges or skywalks). Understanding such factors helps in planning for better circulation of pedestrians within confined elevated passageways as well as evacuation preparedness during emergencies. The walking speed on elevated infrastructure generally depends on microscopic factors (demographic characteristics), macroscopic factors (average flow and density), and geometric factors (obstruction, land use type, length, connectivity, and effective width). The wide variability of these factors and their impact on walking speed makes speed prediction modeling complex. Therefore, the accuracy of such models depends on accurate field data collection, identification of pertinent variables, and implementation of appropriate modeling approaches. With the increase in computational capabilities, tree-based ensembles have gained immense popularity due to their high prediction accuracy in comparison to traditional regression models. Tree-based ensembles provide interpretable results without a huge data requirement and are able to capture complex non-linear relationships. These properties make tree-based ensemble models better candidates for modeling pedestrian walking speed; however, exploration of tree-based ensembles in pedestrian-related research is limited. In the current study, an attempt is made to model and compare seven tree-based models (including ensembles) to suggest the best modeling approach for identifying the dominating factors and accurately predicting pedestrian walking speeds over elevated walkways.
The results of the present study showed that Gradient Boosted Trees (MAE: 9.27) and Light Gradient Boosted Trees (MAE: 9.96) were best in predicting walking speed over the skywalk and foot over bridge facilities, respectively, as these boosting-based methods improved the weak trees (on the basis of accuracy) sequentially. The variable importance of the final models was estimated using SHapley Additive exPlanations (SHAP), which revealed that walking speed was dependent on the average flow, average density, and length of the facility. Moreover, other features such as gender, age, height, and width of the facility also play a significant role in determining the pedestrian walking speeds. The identification of important variables not only provides better insight on factors that affect walking speed over elevated facilities but also provides a valuable source of information to researchers, planners, and policymakers for better design, operation, and management of elevated pedestrian infrastructures.

Speed plays a very significant role in the better circulation of pedestrians (with comfortable levels of service) as well as evacuation preparedness during emergency situations. Moreover, as the speed depends on different individual and geometric factors, it affects the travel time accessibility to elevated facilities as well. As per Table 2, machine-learning tools were mainly used to predict the speed under controlled conditions, while the conventional regression modeling approach was used for estimating speed over sidewalks/crosswalks.
However, in the present study, an attempt is made to predict the factors impacting pedestrian walking speed over elevated pedestrian walkways (foot over bridges and skywalks) using individual characteristics (gender, age, luggage condition, mobile use), group characteristics (average flow and density) and geometric conditions (the type of obstruction present, land use type, type of connectivity, length and width of the facility), through different tree-based machine learning algorithms. The outcomes of the present study can provide researchers, planners, designers, and policymakers with ample justification to account for the factors affecting the walking speed over elevated walkways, and thus come up with better designed user-friendly infrastructures in the future.

Selection Of Survey Locations And Collection Of Data
In order to capture the factors affecting pedestrian walking speed, a videography data collection technique was used. Firstly, the different elevated pedestrian facilities (FOBs and skywalks) were visited across six cities (NCR: National Capital Region, Bengaluru, Kolkata, Gangtok, Guwahati and Mumbai) covering different geographic locations of India. In total 13 FOB locations and 7 skywalk locations were fixed for final data collection. The data was collected using a high definition video camera fixed over a high vantage tripod stand. The duration of data collection over the mid-block section of the elevated facilities was approximately 3 hours, during either morning peak hour (7.30-10.30am) or evening peak hour (4.30-7.30pm). The trap length across different locations varied between 10-15m for both the elevated facilities. Figure 1 shows the position of the camera along with the trap length and effective width for an elevated walkway situated in Gangtok. The details of the locations from where the data were collected across different Indian cities are provided in Table 3. From Table 3, it can be observed that the total length of the elevated walkways varied between 18.5-88.1m (FOBs) and 315-1287m (skywalks). The effective width varied between 1.4-4.7m (for FOBs) and 2-3.4m (for skywalks). The effective width was calculated after deducting the shy away or buffer distance from the actual width. The average sample size per location varied between 679 (for FOBs) and 763 (for skywalks). The dominant categories across both the facilities were male pedestrians (≥ 70%), 21-40 years' age group (≥ 65%), with luggage (≥ 52%), and without mobile phones (≥ 85%). From Table 3 it is also observed that across many FOB locations, the beggars and vendors were prevalent; while across some skywalks, the vendors were present.
Also, the different land-use types considered in the study ranged from public transport terminal (PTT) to commercial, residential, institutional, educational, and shopping locations.

Data extraction
Collected data were processed in the lab using the manual data extraction technique. As the aim of the study was to identify the factors affecting the pedestrian speed, individual parameters (such as age, gender, luggage condition, mobile use) along with other factors (such as obstruction, land use type, time of data collection, average flow, average density, length of facility, connectivity, effective width) were considered while extracting the video data. Table 4 shows the description of the factors extracted and considered for the modeling of pedestrian behavior. The average density was calculated by counting the number of pedestrians within the trap area at every 20-second interval. In one minute, three density readings were taken, and then the average value of density was calculated per minute.
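The per-minute density averaging described above can be sketched as follows. This is a minimal illustration, not the extraction procedure used in the study: the function name `density_per_minute`, the sample counts, and the trap dimensions are hypothetical, and dividing each count by the trap area to obtain ped/m² is an assumption.

```python
from statistics import mean

def density_per_minute(counts, trap_area_m2, readings_per_minute=3):
    """Average density (ped/m^2) per minute from 20-second snapshot counts.

    Assumption: each snapshot count is divided by the trap area to get a
    density reading; three readings (one per 20 s) form one minute.
    """
    densities = [c / trap_area_m2 for c in counts]
    # Only complete minutes are averaged; a trailing partial minute is dropped.
    n_full = len(densities) // readings_per_minute
    return [
        mean(densities[i * readings_per_minute:(i + 1) * readings_per_minute])
        for i in range(n_full)
    ]

# Hypothetical snapshot counts for two minutes of video.
counts = [6, 9, 12, 3, 3, 6]
trap_area = 12.0 * 2.5  # hypothetical trap length (m) x effective width (m)
per_minute = density_per_minute(counts, trap_area)
```

Each entry of `per_minute` is then the average density attached to the pedestrians observed in that minute.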

Data Analysis
The demographic characteristics and speed distribution data were obtained by performing exploratory data analysis on extracted videography data containing 7522 (FOB) and 5325 (skywalk) samples respectively. The final analysis was carried out comparing the speed prediction between the two types of elevated walkways (FOBs and skywalks) under different land-use types (commercial, educational, institutional, public transport terminal, residential, and shopping).

Demographic characteristics
The demographic characteristics such as gender, age, luggage condition, and mobile use are relevant in understanding the existing usage pattern of the elevated walkways. Table 5 presents the demographic characteristics of the pedestrians based on gender, age, and luggage condition for FOB and skywalk facilities. The table also shows the results of the statistical tests (t-test and ANOVA single factor test) between the different pedestrian demographic characteristics. The statistical tests were conducted to check whether a significant difference exists between the different pedestrian categories. The t-test is performed to check whether a significant difference exists between two sub-categories (e.g. gender: male/female, luggage: with/without, and mobile: with/without), while the ANOVA test is performed to compare between two or more pedestrian categories (e.g. age: <10/11-20/21-40/41-60/>60 years). A t-statistic higher than the t-critical value (for the t-test) or an F-statistic higher than the F-critical value (for the ANOVA test) signifies that a significant difference exists between the different pedestrian demographic characteristics (at the 5% significance level). From Table 5 it is observed that the majority of the pedestrians using the FOB and skywalk facilities were male pedestrians (73-81%) in the age group of 21-40 years (73-78%) and with luggage (71-78%). The male pedestrians were observed to walk at higher average speeds in comparison to the female pedestrians over both the elevated walkways by 6-7m/min. The pedestrians in the age group of 21-40 years walked fastest in comparison to the other age categories. The pedestrians without luggage had significantly higher average walking speeds in comparison to the pedestrians with luggage for both the facilities. Further, the proportion of pedestrians using a mobile phone while walking alone over skywalk facilities was double in comparison to FOB facilities.
The higher proportion of mobile users over skywalks could be due to the long traveling length on skywalk facilities, where pedestrians may use mobile phones to overcome boredom. The average walking speeds for mobile users were 3m/min slower than the non-mobile users. The results of the statistical tests (t-test and ANOVA test) showed that for different demography categories, a significant difference exists between gender (male/female), luggage (with/without), mobile (with/without) and age (<10/11-20/21-40/41-60/>60 years).
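The two test statistics used above can be sketched in a few lines. This is an illustration under stated assumptions rather than the study's computation: the source does not say which t-test variant was used, so the unequal-variance (Welch) form is assumed, and the speed samples and the names `welch_t` and `anova_f` are hypothetical. The resulting statistic would then be compared against the critical value at the 5% level.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Two-sample t statistic without assuming equal variances (Welch form)."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error

def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical walking speeds (m/min) for two gender categories.
male = [76.0, 80.0, 84.0]
female = [70.0, 74.0, 78.0]
t_stat = welch_t(male, female)
f_stat = anova_f([male, female])
```

With two equally sized groups, as in this toy data, the F statistic equals the square of the t statistic, which is a quick consistency check between the two tests.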

Speed distribution
Probability density functions were also used to understand the speed variation among the different categories of pedestrians (based on gender, age, and luggage condition) for FOBs and skywalks (refer to Fig. 2). The x-axis represents the walking speed (m/min) while the y-axis represents the probability or relative frequency. Figure 2(a) shows that the male pedestrians walked faster than the female pedestrians for both the facilities by 6m/min. The male pedestrians using skywalks were observed to have higher mean speed than the male pedestrians using FOBs by 7m/min. This increase in speed is observed as skywalks offer a wide path for pedestrians which encourages them to walk at a higher speed than on FOB facilities. Further, as the skywalks are much lengthier than FOBs, pedestrians try to travel faster to cover the longer length quickly.
The age-wise speed distribution (refer to Fig. 2(b)) shows that the child (< 10 years) and elderly (> 60 years) pedestrians over FOB facilities were the slowest pedestrians. The young adult pedestrians (21-40 years) were observed to have the highest walking speed across both the elevated facilities. In comparison to the FOB pedestrians of age 21-40 years, skywalk users in the same age category had higher walking speeds by 7m/min due to the greater available walkway widths, and thus had the freedom to choose higher walking speeds to cover longer distances.
From Fig. 2(c), it was observed that the pedestrians with and without luggage had higher walking speeds on skywalks in comparison to FOBs by 6-10 m/min.
The speeds for pedestrians with and without luggage over FOBs were quite similar, while over skywalks the pedestrians without luggage walked at higher speeds (4m/min) in comparison to the pedestrians with luggage. The reason for similar walking speeds over FOB facilities was that, as the traveling length was smaller, the luggage did not have much impact on their speed. However, over skywalk facilities when pedestrians had to travel longer distances, carrying luggage played a crucial role and significantly reduced the walking speed.
Figure 2(d) shows that for both skywalks and FOBs, the pedestrians without mobile usage had a higher walking speed than the ones with mobile by 3m/min.
Also, similar to other distribution functions, the pedestrians with/ without mobile over skywalks had higher speeds in comparison to FOBs. The main reason for higher speed over the skywalks could be the available higher walkable width and the longer length which pedestrians had to travel over skywalk facilities.

Modelling Approaches
In the present study, an effort was made to understand the best-suited model in terms of prediction accuracy of walking speed over elevated walkway facilities. The tree-based algorithms have several advantages over other machine learning algorithms, which are described in Table 6.

Study Methodology
Algorithm 1 shows the step-by-step methodology of speed prediction for elevated walkways. The study methodology involved literature survey, preliminary site inspection, videography data collection and extraction, followed by speed prediction modeling, and finally extracting the important features for a policy decision.
As explained earlier in Table 3, the data was collected from 13 FOBs and 7 skywalks across different Indian cities. These data must be processed carefully before they are used for training speed prediction models. Initially, the data columns were normalized using a min-max scaler. Further, one-hot encoding was applied to the categorical columns. The final prepared dataset was randomly divided into 80% (FOB: 11332, Skywalk: 4273) for training and 20% (FOB: 2833, Skywalk: 1069) for testing of the developed model respectively. Different modeling algorithms offer different hyper-parameters. Thus, initially, all the selected algorithms were trained using 10-fold cross-validation (CV) with 10 random hyperparameter settings on the 80% train data. The models were ranked in decreasing order of performance based on the Mean Absolute Error (MAE) evaluation metric. The MAE metric was selected for model performance evaluation because it is less sensitive to outliers. Once the top algorithms were identified, they were further tuned with 100 random hyperparameter settings using a 10-fold CV to get more reliable estimates. The tuned model was then finally tested on the remaining 20% test dataset.
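The preparation steps above (min-max normalization, one-hot encoding, a random 80/20 split, and the MAE metric) can be sketched with the standard library alone. This is only an illustration of the operations; the study itself used PyCaret, and the helper names and sample values below are hypothetical.

```python
import random

def min_max_scale(values):
    """Rescale a numeric column to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode a categorical column as 0/1 indicator columns (sorted categories)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly split rows into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values for illustration only.
scaled = min_max_scale([2.0, 3.0, 4.0])            # effective widths (m)
encoded, cats = one_hot(["male", "female", "male"])
train, test = train_test_split(list(range(10)))
mae = mean_absolute_error([80.0, 70.0], [75.0, 78.0])  # speeds in m/min
```

Because MAE averages absolute rather than squared errors, a single badly predicted pedestrian shifts it less than it would shift a squared-error metric, which matches the reason given above for choosing it.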

Speed Prediction Model Development
In the present study, different tree-based modeling approaches (GBM, LGBM, XGBoost, Adaboost, RF, ETR, and DT) were explored to predict the walking speed determinants over elevated pedestrian facilities (regression: continuous outcome) using PyCaret 2.0 (Ali, 2020) machine learning library (through open-source programming language Python version 3.6). The speed models were trained for two separate elevated pedestrian facilities i.e., Foot Over Bridges (FOBs) and Skywalks.

Model Training And Hyperparameters Tuning
In order to train the speed models, the PyCaret 2.0 machine learning library was utilized. The total samples (FOB: 14165, Skywalk: 5342) were randomly split into 80% train (FOB: 11332, Skywalk: 4273) and 20% test dataset (FOB: 2833, Skywalk: 1069). Comparing multiple models and tuning all types of hyperparameters could be time-consuming; thus, initially, a 10-fold CV was performed with default hyper-parameters to get an idea of the overall best-performing model. For FOBs, the results (refer to Table 7) revealed that LGBM topped in the overall performance (MAE: 9.520). Similarly, models were trained using 10-fold CV for skywalks, where GBM was observed to be the best performing model with an MAE of 9.232 (refer to Table 8).
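The 10-fold CV comparison can be illustrated with a small sketch that returns the mean and standard deviation of the validation MAE across folds. It is a simplified stand-in, not PyCaret's implementation: the functions `k_fold_indices` and `cross_val_mae`, the mean-predictor baseline, and the toy data are all hypothetical, and the rows are assumed to be already shuffled because the folds here are contiguous.

```python
from statistics import mean, stdev

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def cross_val_mae(X, y, fit, predict, k=10):
    """Mean and standard deviation of the validation MAE over k folds."""
    scores = []
    for train, val in k_fold_indices(len(y), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in val])
        scores.append(mean(abs(y[i] - p) for i, p in zip(val, preds)))
    return mean(scores), stdev(scores)

# Toy baseline: always predict the mean training speed (ignores the features).
def fit_mean(X_train, y_train):
    return mean(y_train)

def predict_mean(model, X_val):
    return [model for _ in X_val]

X = [[i] for i in range(50)]                # hypothetical feature rows
y = [60.0 + (i % 5) for i in range(50)]     # hypothetical speeds (m/min)
cv_mean, cv_std = cross_val_mae(X, y, fit_mean, predict_mean, k=10)
```

Any of the seven candidate models could be plugged in through `fit` and `predict`; ranking them by `cv_mean` corresponds to the comparison reported in Tables 7 and 8.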

Model Hyper-Parameters Optimization
To obtain the best performing model and to reduce overfitting, a random hyper-parameter search was performed. Random search is faster and computationally less expensive compared to complete grid search [10]. For the FOB speed model (i.e., LGBM), the hyper-parameters were the number of leaves, maximum tree depth, learning rate, number of estimators, minimum split gain, and the regularization terms alpha and lambda. Similarly, for the skywalk model (i.e., GBM) the hyper-parameters were loss, the number of estimators, learning rate, subsample, criterion, minimum samples split, minimum samples leaf, maximum depth, and maximum features. The different hyperparameters, their ranges, and definitions are presented in Table 9.
• num_leaves: it is the maximum number of leaves per tree and the main parameter that controls the complexity of the tree-based model.
• max_depth: it defines how deep a tree is allowed to grow, i.e., the maximum number of levels between the root and the farthest leaf before growth is stopped.
• learning_rate: it is the weighting (shrinkage) factor applied to each new tree added to the model, which slows down the learning.
• n_estimators: the parameter represents the number of trees (boosting iterations) built in the ensemble.
• min_split_gain: it is the minimum loss reduction required in order to make a further partition on a leaf node of the tree.
• loss: it is the loss function to be optimized (e.g., the mean squared error, MSE); boosting minimizes it by gradient descent, updating the predictions at each step according to the learning rate.
• subsample: the parameter controls the proportion of random samples used for each tree. A lower value of subsample helps prevent overfitting.
• criterion: it is the parameter that measures the quality of a split. In the case of regression, it is represented by "friedman_mse", "mse", or "mae".
• min_samples_split: it represents the minimum number of data points or samples required in a node before a splitting operation.
• min_samples_leaf: it represents the minimum number of samples that are required in a leaf node.
• max_features: while splitting a node, it is the size of the random subset of features to be considered in the model.
The hyperparameter tuning results for the FOB model are given in Table 10. In the case of the skywalk model, the hyperparameter tuning showed an MAE of 9.223 with a standard deviation of 0.119 (refer to Table 11). The optimized model hyperparameters for both FOB and skywalk speed models are presented in Table 12. Table 13 shows the model performance summary on the test data set. The performance summary revealed that the overall optimized FOB speed prediction model (i.e., using LGBM) performed well on the unseen/test dataset (MAE: 9.960). Similarly, the skywalk speed prediction model (i.e., using GBM) provided an overall good performance on the unseen/test dataset (MAE: 9.273).
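The random hyper-parameter search described in this section can be sketched as below. This is a generic illustration, not the tuning routine used in the study: `random_search` and `toy_cv_mae` are hypothetical names, the search space values are made up (the actual ranges are those of Table 9), and in practice the score function would be the 10-fold CV MAE of the model trained with the sampled settings.

```python
import random

def random_search(score_fn, space, n_iter=100, seed=0):
    """Sample n_iter configurations from `space` and keep the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        # Draw one value per hyper-parameter, independently and uniformly.
        params = {name: rng.choice(options) for name, options in space.items()}
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space; the real ranges are those listed in Table 9.
space = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7, 9],
    "n_estimators": [50, 100, 200],
}

def toy_cv_mae(params):
    """Stand-in for a cross-validated MAE; a made-up smooth function."""
    return 9.0 + abs(params["learning_rate"] - 0.1) + 0.1 * abs(params["max_depth"] - 5)

best_params, best_score = random_search(toy_cv_mae, space, n_iter=100)
```

The number of model fits is fixed by `n_iter` regardless of how many hyper-parameters are searched, whereas a complete grid grows multiplicatively with every added parameter; this is the cost advantage referred to above.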

Applications Of Tree-based Machine Learning Techniques In Other Areas Of Transportation Engineering And Its Comparison With The Current Study
There are different studies based on the application of advanced soft computing techniques in the transportation engineering domain (refer to Table 14), but very few of them are related to pedestrian-based research. Results of the present study highlighted that boosting-based models could be one of the best choices for predicting pedestrian walking speed over FOBs and skywalks.
Similar to other domains, in pedestrian-based research where accurate prediction of pedestrian macroscopic behavior (speed and flow) is required, these algorithms could provide an accurate solution. They would help in the smooth management of busy facilities such as bus, train or airport terminals. In this regard, the current study tried to fill this gap and showed the effectiveness of such algorithms in pedestrian-based research, which could act as a better alternative when model quality (or accuracy) is the main goal.

Variable Importance Analysis
As shown in Table 13, the LGBM (FOB) and GBM (skywalk) models were found to perform best on the test dataset as well. The main advantage of a tree-based regressor is that it provides the global importance scores of each feature, which explain the contribution of different predictors in the model. Still, these high-end black-box models lack interpretability as they do not provide the direction of impact, i.e., whether the model variables have a positive or negative influence. Thus, to trust a black-box model, an understanding of its inner workings is essential. Lundberg and Lee (2017) proposed the SHAP (SHapley Additive exPlanations) values method, which is fast and offers a high level of interpretability for a model. In the present study, the "shap" python library was utilized to interpret the existing trained models. The Shapley values were estimated for the tuned LGBM (i.e., for FOBs) and GBM (i.e., for skywalks) models on the test data and were plotted using a summary plot (refer to Figs. 3 and 4).
The summary plot not only provides the variable importance in descending order but also illustrates the positive or negative relationship with the outcome variable. The y-axis shows different variables (top five predictors) while the x-axis shows SHAP values ranging from negative to positive. The feature value is illustrated with blue and red color gradients. The red color indicates a high feature value, while blue indicates a low feature value.
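To make the idea behind the SHAP values concrete, the sketch below computes exact Shapley values for a single sample by enumerating feature subsets. It is not the algorithm used by the "shap" library for tree models; it is a brute-force illustration that is only feasible for a handful of features. The function `shapley_values`, the linear `toy_speed` model, its coefficients, and the baseline are hypothetical, and features absent from a subset are assumed to be replaced by baseline values.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating all feature subsets.

    Assumption: a feature that is 'absent' from a coalition takes its
    baseline value instead of its observed value.
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight of a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical speed model: speed falls with density, rises slightly with length.
def toy_speed(z):
    density, length = z
    return 80.0 - 20.0 * density + 0.01 * length

x = [1.0, 300.0]          # sample: density (ped/m^2), facility length (m)
baseline = [0.5, 100.0]   # reference values, e.g. dataset averages
phi = shapley_values(toy_speed, x, baseline)
```

The values in `phi` sum to the difference between the prediction for `x` and the prediction for the baseline, which is the additive property the summary plot relies on; here the high density value receives a negative contribution, which would appear as a red point on the negative side of the x-axis.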
The feature importance plot of the FOB model (refer to Fig. 3) illustrates crucial findings for the top five factors which influence walking speed over FOBs. As per Fig. 3, the total length of the facility, average density, average flow, facility height, and mid-block width are the top five parameters that impact the pedestrian walking speed over the FOB facilities. The most important feature is the length of the facility, which determines the walking speed. In FOBs, after climbing the stairs (in most of the FOBs considered, stairways were the only form of vertical connectivity), pedestrians feel tired. Due to this tiredness, the pedestrian speed was initially observed to be a little slower. However, as the length of the FOB increases and the pedestrians approach the middle section (where the data was collected), this effect fades and the speed does not show much variability. From Fig. 3 it is observed that density values present a wider range and have a negative relationship with walking speed. As the average density increases, the space for faster movement reduces and thus pedestrians' walking speed reduces. The impact of width and height of the facility on the pedestrian speed is not clear, and data over a wider range of facilities is required to establish a concrete relationship. The impact of the flow parameter reflected here (as well as in Fig. 4 for skywalk facilities) is somewhat unclear or contradictory. Such behavior of flow is reflected because a few sites for both elevated facilities (FOBs as well as skywalks) were in congested conditions (i.e. speed increases with increase in flow under the congested regime), as opposed to most of the other sites, which were in free flow condition with lower densities. Observation of pedestrian speed data over a wide range of densities in most sites might resolve this ambiguity.
The feature importance plot of the skywalk model illustrated that the top five parameters impacting the walking speed over skywalks were the average flow, average density, gender (male/female), age (< 10, 11-20, 21-40, 41-60, and > 60 years), and length of the facility (refer to Fig. 4). Males were found to walk faster compared to female pedestrians. Moreover, a higher proportion of male pedestrians leads to higher stream speed. Similarly, pedestrians belonging to the age group of 21-40 (young adults) walked faster than any other age group. The old (> 60 years) and young (< 10 years) pedestrians are observed to negatively impact the relative stream speed. Thus, with a higher proportion of old and young pedestrians, the overall stream speed would be significantly reduced. Although the total length of the facility is found to be significant, its direction of influence on the walking speed is not clear.

Conclusions, Limitations, And Future Recommendations
The current study focuses on the accurate prediction of factors impacting pedestrian walking speed over elevated facilities. The observed data of pedestrian behavior were collected using the videography survey method over 13 Foot Over Bridges (FOBs) and 7 skywalk locations across different land-use types in
ii) The majority of the pedestrians on the skywalk were observed to carry luggage (77.91%) in comparison to pedestrians on FOBs (51.52%). Further, a small proportion of pedestrians (FOBs: 7.81%; skywalks: 14.93%) were observed using a mobile phone while walking on both facilities.
iii) Light Gradient Boosting Machine (with MAE: 9.96) and Gradient Boosting Machine (with MAE: 9.27) provide the best prediction accuracy of pedestrian walking speed models over FOB and skywalk facilities respectively.
iv) Variable importance for both elevated facilities revealed that average flow and average density were extremely important to predict the walking speed.
v) FOBs variable importance plot revealed that the length of the facility influences the walking speed positively while the reduction in available walkway width reduces the pedestrian walking speed.
vi) Skywalk variable importance plot revealed that pedestrian demographics (gender and age) were important predictors for walking speed. Male pedestrians walked faster than female pedestrians, and a higher proportion of male pedestrians led to higher stream speeds. The young adults (aged between 21-40 years) had the highest walking speed among all age groups. Similarly, locations with a higher proportion of young (< 10 years) and old (> 60 years) pedestrians significantly reduced the overall walking speed of the facilities.
The identification of important variables not only provides better insight on factors that affect walking speed over elevated facilities but also provides a valuable source of information to researchers, planners, and policymakers to better design, operate, and manage elevated pedestrian infrastructures.
Similar to other studies, this study also has some limitations. Some of the significant challenges were: duration of data collection (restricted to a single day observation for 3 hours) and the number of locations covered.

Fig. 3. Variable importance plot for FOB walking speed model
Fig. 4. Variable importance plot for skywalk walking speed model