ALF-Score: Network-Based Walkability

Walkability is a term that describes various aspects of the built and social environment and has been associated with physical activity and public health. Walkability is subjective, and although multiple definitions of walkability exist, there is no single agreed-upon definition. Road networks are integral to mobility and should be an important component of walkability; however, they are missing from existing methods. Most walkability measures only provide area-based scores with low spatial resolution, take a one-size-fits-all approach, and do not consider individuals' opinions. Active Living Feature Score (ALF-Score) is a network-based walkability measure that incorporates road network structure as a core component. It also utilizes user opinion to build a high-confidence ground truth that is used in our machine learning pipeline to generate models capable of estimating walkability. We found that combining network features with road embedding and POI features creates a complementary feature set, enabling us to train our models with an accuracy of over 87% while maintaining a conversion consistency of over 98%. Our proposed approach outperforms existing measures by introducing a novel method to estimate walkability scores that are representative of users' opinions, with high spatial resolution, for any point on the map.


Introduction
Road networks serve as a major part of the urban fabric, connecting and dividing different parts of our cities. Most people use roads daily, making road networks a crucial part of our lives. Mobility 1 is defined as having access, and the ability to get, to places necessary for "living a healthy life". According to a study 2 published by Statistics Canada, 12.6 million Canadians reported commuting by car to work in 2016, with an average commute duration of 24 minutes and a median distance to the workplace of 8.7 kilometres. Furthermore, close to 1 million car commuters spent at least 60 minutes travelling to work. Trips to work, the grocery store, or the hospital, and even casual jogs and road trips, occur mainly on walkable or driveable roads.
With the increase in publicly available geographic information systems, road network data is now freely accessible from numerous sources. Due to this availability, road network data can be used as a good source for various types of analysis, such as road importance 3 , 4 , 5 , 6 , road characteristics 7 , 8 , city planning 9 , 10 , 11 , creating walkability or bikeability scores, or examining the association between diabetes and obesity in local communities and the walkability of their neighbourhoods 12 , 13 , 11 . Walkability 14 is a term used to describe aspects of the built and social environment with important population-level impacts on physical activity and health. Only a small amount of work has focused on developing a conceptual definition of walkability. However, the term has been widely used in the academic literature since the late 1990s by urban planners, the general public, and researchers 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 , when researchers began to examine the association between walkability and various factors including travel behavior, urban design, real estate, physical activity, and obesity. Although there are multiple operational definitions of walkability in the literature 23 , 24 , 25 , 26 , 27 , 28 , 29 , there is no single agreed-upon conceptual definition of walkability. Simply put, walkability is a way to express how walkable, connected, and accessible our surroundings are with respect to walking. Walkability is typically used by researchers as a measure to operationalize characteristics of roads, combined with other characteristics, to create meaningful indicators for the environment. Knowing how walkable an area is (i.e., its walkability score) is an important factor in everyone's lives, especially with the spread of COVID-19 restrictions and limitations on how and when we go outside.
Walkability scores provide important insight about the roads and neighbourhoods around us, but current walkability measures do not consider individual preferences and are mainly area-based with low spatial resolution.
Our Active Living Feature Score, or ALF-Score, is a network-based walkability measure that utilizes road network structure alongside user opinion and various other features, through machine learning approaches, to build predictive models capable of generating high spatial resolution, network-based walkability scores. The main contributions of this paper are:
1. High Spatial Resolution - compared to existing walkability measures such as Can-ALE 3 , which provides area-based walkability scores, we show that ALF-Score is capable of producing much higher, point-level resolution, establishing point-specific walkability scores for all points on the map and providing more refined and usable walkability scores.
2. Road Network Structure Incorporation - ALF-Score relies on road network structure and does not treat area-based locations as independent vectors. Roads are connected, and therefore spatial correlations between various roads and points exist. Incorporating road network structure allows similarities to be determined throughout the entire network regardless of physical distance, helping improve the overall accuracy of estimated scores.
3. User Opinion - most existing walkability measures are presumptive and generic. ALF-Score feeds end-users' opinions into its pipeline to ensure an appropriate representation of user opinion in the estimated walkability of each specific point.
4. Data Scalability - the scalability of this work is twofold. First, since only a small fraction of all users provide their opinion, ALF-Score takes measures to ensure an appropriate, fair and consistent distribution of user opinions over the road network. Second, the proposed method can be applied to many cities all over the world.
In this paper, we start by reviewing related work on walkability measurement. We then present our results and discuss our findings. We then describe the various data types (publicly available and crowd-sourced) and methods used in our pipeline, with a brief overview of our crowd-sourcing platform. Additionally, we explore our experimental setup and how we approach the problem using machine learning. Finally, we examine the consistency and validation measures used to verify our results.
There are a number of existing walkability measures in Canada, including the Canadian Active Living Environments Database (Can-ALE) 3 and Walk Score® 4 . Figure 1 shows Walk Score® (left) and Can-ALE scores (right) for Victoria, BC. These measures each have different strengths and limitations. Can-ALE is one of the most prominent and publicly available walkability scores in Canada. It is an area-based measure at the Dissemination Area (DA) level 30 , 31 . A DA is a small, relatively stable geographic unit composed of one or more adjacent dissemination blocks and is the smallest standard geographic area for which all census data are disseminated. A DA normally has a population of between 400 and 700 persons. Can-ALE, widely used for walkability scores in Canada, utilizes intersection density as well as dwelling density to derive a walkability score for each Canadian DA. The measure also includes the number of Points of Interest (POI) within a 1 km buffer around the DA's centroid. According to the Canadian Active Living Environments Database (Can-ALE) User Manual & Technical Document 32 , "Can-ALE measures are based on one-kilometre, circular (Euclidean) buffers drawn from the centre points (centroids) of dissemination areas (DAs)". Can-ALE's only measure of road network importance is a simple count of the number of 3-(or more-)way intersections per square kilometre of the buffer around a dissemination area's centroid. Walk Score® is another well-known walkability tool that uses a proprietary method involving a number of features, including population density and road metrics such as block length and intersection density. According to its authors, it uses a patented system to measure the walkability of any location, focusing on analyzing hundreds of walking routes to nearby amenities. Its walkability score (between 0 and 100) is awarded based on distance to amenities of certain categories (e.g. grocery stores, coffee shops, restaurants, bars, movie theatres, schools, parks, libraries, book stores, fitness centres, drug stores, hardware stores, clothing/music stores). Amenities within a 5-minute walk (0.25 miles) are given maximum points, and a decay function determines the score given to more distant amenities, with no score given for amenities located farther than a 30-minute walk. Walk Score® also measures pedestrian friendliness by analyzing population density and road metrics such as block length and intersection density. Its data sources include Google Maps, Factual, Great Schools, Open Street Map, the U.S. Census, Localeze, and places added by the Walk Score® user community.
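Walk Score®'s exact decay curve is proprietary, so the following is only an illustrative sketch of the general idea: amenities within a 5-minute walk receive full credit, and credit falls off to zero at a 30-minute walk. The linear fall-off and the assumed walking pace of roughly 3 mph (giving a ~1.5-mile cutoff) are our assumptions, not Walk Score®'s published method.

```python
def amenity_points(distance_miles, max_points=1.0):
    """Illustrative distance-decay for amenity scoring.

    Full points within a 5-minute walk (0.25 mi); zero beyond a
    30-minute walk (~1.5 mi at an assumed ~3 mph pace). The linear
    shape between those bounds is a stand-in for the proprietary curve.
    """
    full_credit = 0.25   # 5-minute walk radius (miles)
    cutoff = 1.5         # 30-minute walk radius (miles, assumed pace)
    if distance_miles <= full_credit:
        return max_points
    if distance_miles >= cutoff:
        return 0.0
    # Linear decay between the full-credit radius and the cutoff
    return max_points * (cutoff - distance_miles) / (cutoff - full_credit)
```

For example, an amenity one mile away would receive partial credit, while one two miles away would contribute nothing to the score.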
ALF-Score aims to fill this gap and address major limitations of existing measures. It provides a faster and better walkability scoring system by introducing opinion-based, crowd-sourced user parameters in conjunction with road network characteristics, producing scores that better represent an individual's perspective of walkable areas while utilizing road network structure to improve accuracy and spatial resolution.
A road network is a system of interconnected roads designed to accommodate vehicle, bicycle, and pedestrian traffic. According to Urban Securipedia 33 , road networks consist of "a system of interconnected paved carriageways which are designed to carry buses, cars and goods vehicles". Furthermore, road networks form the "most basic level of transport infrastructure within urban areas, and will link with all other areas, both within and beyond the boundaries of the urban area". For computational purposes, road maps are typically converted into a road network structure. Road networks are a form of complex network in which nodes refer to physical geographical points and edges are the connections between two points, and they can be represented as graphs. We define such a graph as G = (V, E), where E is the list of edges/links and V is the list of vertices/nodes. Points in these networks represent nodes, while roads and their connections, where two or more nodes connect, represent edges. An adjacency matrix and an adjacency list are two common ways to represent the graph data structure. Since road networks are generally sparse and can be vast, the adjacency list is often preferred.
Figure 1. Walk Score® walkability of Victoria, BC (left); dark green is most walkable, dark yellow/orange is least walkable. Can-ALE walkability score of Victoria, BC (right), assigned by dissemination area; dark green is most walkable, dark red is least walkable.
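As a minimal illustration of this representation (not code from the ALF-Score pipeline), a sparse road network G = (V, E) can be stored as an adjacency list built from an edge list:

```python
from collections import defaultdict

def build_adjacency_list(edges):
    """Build an undirected adjacency list from an edge list.
    Nodes stand for geographical points (e.g. intersections);
    each edge stands for a road segment connecting two of them."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj

# Toy road network: 4 intersections, 3 road segments
edges = [(0, 1), (1, 2), (1, 3)]
adj = build_adjacency_list(edges)
# Node 1 has degree 3, i.e. it is a 3-way intersection
```

For a network with millions of nodes but only a handful of neighbours per node, this list stores far fewer entries than a full |V| x |V| adjacency matrix, which is why the list form is preferred for sparse road networks.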
There are a number of data types required for this work, such as user opinion, road network structure, points of interest, and features derived from the road structure such as the edge list, node/edge centrality and node embedding. Since road networks are accessible through many different sources, such as OpenStreetMap (OSM) 34 and Statistics Canada 35 , different data formats are also available for processing. However, our main concern when selecting a data source is data comprehensiveness and accuracy. OSM follows an XML scheme containing node and way elements, with the former defining points in space and the latter defining linear features and area boundaries. The road network files are filtered to only include highway tags 36 , encompassing only walkable and drivable roads. Points of Interest (POI) data for this research was exclusively extracted from OSM, with the "amenity" key 36 being used as the most relevant set of POIs, covering numerous categories such as Sustenance, Education, Transportation, Financial, Healthcare, Entertainment, Arts & Culture, and others.
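A minimal sketch of this kind of filtering, using Python's standard XML parser on a toy OSM fragment, might look as follows; the actual tooling used in the pipeline is not specified here, and the sample XML is fabricated for illustration.

```python
import xml.etree.ElementTree as ET

# Toy OSM XML fragment (fabricated): two nodes and two ways
OSM_XML = """<osm>
  <node id="1" lat="47.56" lon="-52.71">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="47.57" lon="-52.72"/>
  <way id="10">
    <nd ref="1"/><nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
  <way id="11">
    <nd ref="2"/>
    <tag k="waterway" v="stream"/>
  </way>
</osm>"""

root = ET.fromstring(OSM_XML)

# Keep only ways carrying a "highway" tag (walkable/drivable roads)
roads = [w for w in root.iter("way")
         if any(t.get("k") == "highway" for t in w.iter("tag"))]

# Extract POIs: nodes tagged with the "amenity" key
pois = [n for n in root.iter("node")
        if any(t.get("k") == "amenity" for t in n.iter("tag"))]
```

Here the stream (way 11) is dropped because it carries no highway tag, while the cafe node survives as a POI.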

Results
In total, 666 features are used to train various models, with some models trained on a specific subset of these features to examine the resulting accuracies. We use the following input parameters. There are a total of 1,050 user-submitted locations along with their relative ranks. Of these 1,050 locations, 852 are unique, leaving 198 "anchor" locations (defined in the Methods section) appearing in two or more submissions. Each road node v_i (i = 1, 2, . . . , 5) has a unique relative rank r_{v_i} between {1, 2, . . . , 5}. The order of the relative rankings r_{v_i} is of utmost importance, as it defines how users perceive the walkability of each v_i with respect to one another.
Using our Generalized Linear Extension of Partial Orders (GLEPO) algorithm, we were able to successfully convert user opinions that are relative among groups of 5 locations into globally relative scores (among all submissions) with only a small inconsistency (see Methods for a description). GLEPO maintains a high consistency of 98.24% throughout the conversion. Furthermore, after numerous variations and experiments, we determined that the best results were produced using the fully randomized version of the algorithm with at least 30 iterations and with virtual links enabled. A virtual link is a method we developed that establishes associations between user submissions based on their geographical distance (explained further in Methods).
When compared to Can-ALE, a clear variation is observed, especially when it comes to producing a fine-tuned, high spatial resolution ground truth. In Figure 2, we can observe that while Can-ALE (left) assigns a low walkability score to large regions, users have varying opinions (center), ranging from low to high walkability scores, for different points on the map of the same region. GLEPO therefore provides a more representative, high spatial resolution ground truth. This paper is not intended to imply that Can-ALE is incorrect, but rather to point out some of its limitations and shortcomings due to its characteristics. Since Can-ALE is area-based, its scores provide low-resolution coverage, and since there is no use of user opinion in its pipeline, its scores are not representative of users' opinions. As observed in GLEPO's result, not only do our volunteer participants differ in opinion from Can-ALE's walkability scores, but there is also a difference of opinion among the participants themselves, which opens up the possibility for future research on applying GLEPO and ALF-Score to different subgroups of users. A score variation ranging between 16 and 81 is observed within GLEPO, which provides user-level insight into how users actually perceive their neighbourhood, as opposed to what Can-ALE suggests of a neighbourhood to its users. Furthermore, we observe a variation ranging between 20 and 70 in ALF-Score, which provides high spatial resolution rankings that are well refined and specific to each point, as opposed to the single score for an entire area observed in Can-ALE. Note: Can-ALE colors are slightly dimmed due to the adjusted opacity used to visualize the street overlay.
Although GLEPO's results provide an overall well-represented user opinion baseline for walkability scores, while establishing a ground truth for ALF-Score's machine learning pipeline as its y label vector, GLEPO's main contribution may be extended to a wide range of research areas, such as enabling accurate use of crowd-sourced data while reducing the bias that results from differences in opinion.
We used GLEPO's output to train our supervised machine learning models as a regression problem on our 666 features, iterating over multiple variations of data combination subsets. Six major techniques were used, namely: 1) Random Forest Regression, 2) Linear Regression, 3) Support Vector Regression (SVR), 4) Gradient Boosting, 5) Decision Tree Regression, and 6) Multi-Layer Perceptrons (MLP). We found that Random Forest outperforms the other techniques in terms of performance and accuracy, achieving a top prediction accuracy of 87.49% using all features. In Table 1 we take a closer look at all feature combinations used in our experiments as well as their best achieved accuracy. All accuracy values represent the best recorded accuracy over numerous runs. We observed that models trained on only POI features performed similarly to models trained on POI plus network features and POI plus road embedding. However, models trained on network features combined with network embedding appear to perform better than those using the POI combinations previously mentioned (with the exception of linear regression). The highest accuracy was observed when all features were used together. This presents a convincing argument in contrast to the hypothesis of some other walkability measures, which assume high importance for POI features. Furthermore, it conveys an important message: road network structure plays a crucial role when it comes to measuring walkability. The improvement in accuracy observed after adding POI features to the network and road embedding features attests to the complementary positions POI and road network structure hold with respect to one another, which, combined, can provide a more in-depth understanding of our surroundings. In Figure 3 we can observe how ALF-Score (center) compares with two prominent walkability scores, namely Can-ALE (right) and Walk Score® (left), for the walkability of the city of St. John's, NL (Canada).
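This kind of model comparison can be sketched with scikit-learn. The feature matrix below is synthetic and merely stands in for the real 666-feature inputs (centralities, road embeddings, POI features); the scores it produces are illustrative only and are not the study's reported accuracies.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the per-node feature matrix and the
# GLEPO-derived ground-truth walkability scores (the y label vector)
X = rng.normal(size=(500, 20))
y = X[:, 0] * 3 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two of the six techniques compared in the paper, as an example
models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "linear": LinearRegression(),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}  # held-out R^2 per model
```

The same loop extends naturally to SVR, gradient boosting, decision trees and MLPs, and to iterating over feature subsets (network-only, POI-only, combined) as done for Table 1.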
In the walkability produced by Can-ALE, we can observe various regional "blocks" (DAs) in which all nodes carry the exact same score. Some of these regions may cover much larger geographical areas, whereas others may cover a higher population density in a smaller geographical area. This significantly reduces the accuracy, relevancy and spatial resolution of the scores when using Can-ALE. As observed in Figure 3 (center), this issue is no longer the case with ALF-Score walkability ranks: they are much more refined, with clear variability among different nodes' scores, and, compared to Can-ALE data, have significantly higher spatial resolution, right down to every point on the map. Furthermore, ALF-Score walkability rankings are representative of user opinion, with higher accuracy with respect to users' perception of walkability in different regions. We were able to successfully verify the accuracy of our predicted ALF-Score rankings based on local knowledge of the city of St. John's.
Figure 3. Walk Score® walkability of St. John's, NL (left), followed by Can-ALE walkability score (center) and ALF-Score walkability of the same region (right). Dark green is most walkable, dark red is least walkable. Walk Score® is observed not to have sufficient data to cover the entire region, while Can-ALE's scores may be rendered less useful due to their overgeneralization.
Figure 2 further expands on how the predicted walkability scores generated by ALF-Score (right) compare with those of Can-ALE (left), and how ALF-Score is able to provide high spatial resolution walkability scores as opposed to existing area-based walkability measures. We found high variation in ALF-Score walkability rankings while observing an even distribution within the selected region. As locations get closer to the Kelsey Drive region, they are marked as more walkable by ALF-Score. (There are numerous amenities within the vicinity of Kelsey Drive.)
In addition, we observed that areas with greater population density are typically assigned higher Can-ALE scores. This may not always be how users actually perceive walkability. In Figure 4, which presents two examples where ALF-Score and Can-ALE do (right) and do not (left) agree, we found that Can-ALE fails to identify regions such as local in-city parks and trails as walkable areas, as observed in Figure 4 (left). The Signal Hill area is a well-known area commonly visited by the local community, especially hikers, runners, joggers and families. There are numerous amenities nearby, such as convenience stores, restaurants, coffee shops and gas stations, and there are multiple bus stops. Yet, the area was considered not walkable by Can-ALE. The analysis suggests that Can-ALE's approach misses some of the features and area characteristics that people consider important, such as being near ponds, trails, and national monuments. We found that ALF-Score is able to represent users much better and provide more in-depth rankings for this area, as well as many others in our experiments, by better capturing various characteristics while weighting their importance based on user opinion. Note: Can-ALE is represented as a semi-transparent overlay and is therefore visualized with significantly lighter colors; the legends and border lines represent the actual colors.

Discussion
An important aspect of this research is its interdisciplinary contributions, specifically in the fields of computer science and public health. Interdisciplinary research is crucial and typically provides practical applications for many important and immediate challenges. Day-to-day users are among the intended users of ALF-Score, but they are not the only ones. We hope this research can be used as a tool that allows researchers in the public health sector, especially epidemiologists, to better understand how ALF-Score is associated with health outcomes, including physical activity and obesity. Technology is advancing every day, and what better use of it than to improve people's lives and health; we genuinely believe this kind of practical interdisciplinary research can truly and positively impact the world.
The purpose of this research was to fill the gap and address some of the most prominent challenges observed in existing walkability measures by introducing a novel approach to measuring walkability scores with high spatial resolution, precision, accuracy and efficiency. In this section, we discuss some of the main contributions of this research to the computer science and public health communities.
We believe a strong indicator of walkability can be derived from road importance, which requires extensive utilization of road network structure. However, there is very little use of road network structure, if any, in most existing walkability measures. This leads to a significant gap in utilizing one of the most important factors of walkability and hence to lower accuracy. Although road network structures can be quite large, and creating a graph structure while calculating graph-based features (such as centralities and network embeddings) poses various challenges (which we addressed in a previous publication 5 ), our walkability results using road network structure show an important improvement, especially when combined with user opinion in our machine learning pipeline. Road network structure provides crucial information about neighbourhoods and cities, and is also a conduit to propagate the importance of POIs. We further improved on existing walkability measures by incorporating user opinion and perception as one of the crucial contributing components of the pipeline. Existing walkability measures lack user opinion, and we believe generated walkability scores should consider public opinion.
The use of crowd-sourced data in ALF-Score opens the door to many new possibilities, such as building a platform that provides personalized walkability based on each individual's profile. This would not be possible without the use of user opinion. Users' observations and demographics can be incorporated into personalized models that identify user patterns and best fit each individual user profile.

Collecting crowd-sourced user data is expensive, time-consuming, limited and resource-intensive. It is difficult to recruit volunteer participants who know the city and are willing to provide their personal walkability rankings in amounts sufficient to be used as ground truth. It is also important to consider that walkability is subjective, so people will have different opinions. Specifically, when it comes to walkability scores, each user is entitled to their personal opinion, and everyone may, and likely will, have a different opinion on how walkable they consider specific locations. This expected variation can potentially lead to noticeable inconsistencies within the user labels. We addressed this challenge in two phases: 1) developing our own web-based crowd-sourcing tool to maximize data calibration and accuracy, and 2) using GLEPO to further process the collected data to ensure fair distribution within localized relative scores and accurate globalized representation among all users. The accuracy of the ground-truth data is crucial, as it represents the population through only a small sample, and we believe that as more data is collected, accuracy will improve.
The concept and processes behind our data collection method and GLEPO can be applied to many other fields that utilize opinion-based data, as they allow for the collection of small, relative data sets, reducing bias and presumptive measures such as defining a hard-coded range for walkability. For instance, if users were to rank the walkability of locations by assigning a number between 0 and 100, locations would receive varying scores from different users. This assignment is dependent on how each user defines the numerical range and how that aligns, or not, with the researcher's definition. Utilizing relativity, as opposed to absolute scores, reduces the imposed cognitive challenge of assigning a representative score to each location; instead, users provide a relative ordering by indicating whether they consider a location more or less walkable than another, without assigning a fixed numerical value. GLEPO then takes on the responsibility of globalizing the relative ranks to reflect how they would rank within all other user submissions. This gives researchers the flexibility to maintain their own rationale while preserving the integrity of user opinion.
To tie in with the previous points, since ALF-Score relies on crowd-sourced data, in the initial stages only a small amount of data is available to represent two key components: 1) user opinion, representative of a much larger population, and 2) road nodes, representative of much larger geographical regions and road networks. Based on our experimentation and results, our pipeline has proven to accurately generate walkability scores for various cities using only a small data sample.
Furthermore, we were able to significantly improve the spatial resolution of walkability by utilizing road networks, user opinion and machine learning approaches that help generate point-level walkability scores for all nodes within a given road network, as opposed to the area-based approach found in some existing measures. This higher spatial resolution provides much-needed depth for analyses and other research that requires fine-tuned, refined and specific scores for locations that may fall within close proximity and within the same DA.

Methods
The overall ALF-Score pipeline (shown in Figure 5) requires various data inputs as well as processes. However, since ALF-Score is a network-based walkability measure, an important step in this pipeline is to make good use of road networks as well as other road characteristics. To this end, ALF-Score incorporates a map database derived from road network and POI inputs extracted from OpenStreetMap and Statistics Canada. Other GIS features, such as road embedding and various centrality measures, were later generated from the road network structure and used as additional road-network-based features. In addition, crowd-sourced user opinion is collected through our web-based data collection platform, which was developed specifically for this purpose. The web tool collects user opinions within groups of 5 locations, ranked only relative to one another within each group. The crowd-sourced data is then processed through GLEPO in such a manner that the relative structure of each submission is converted into a global view across all user submissions. GLEPO's output of user rankings, alongside our GIS-derived features, is then fed to our supervised machine learning pipeline to train predictive models capable of estimating a walkability score for any given point on the map that falls within the map database coverage.
To further expand on our crowd-sourcing platform and the collected data, which play crucial roles in providing the required user opinion, we will first explore how the data collection takes place, followed by why we chose this approach. To be able to collect accurate user data and ensure data completeness while reducing bias, we developed a web interface capable of collecting various information from volunteer users, such as: the walkability order of 5 randomly selected locations (relative ranking among the 5 locations), users' preferred walkable distance, their age group, gender, whether they live alone, whether they have children, their occupation, and a few other browser agent details that are publicly available (such as the user's browser type, operating system, etc.). The web interface, shown in Figure 6 (left), displays an interactive map with 5 randomly selected locations marked as {A, B, C, D, E}. Every time the web page is reloaded, 5 locations are randomly chosen from a given geographic area. The coverage is tied to a pool of nodes derived from our map database, specifically the road network structure of the region. The randomization is introduced to ensure an evenly distributed coverage of node labels and user opinions across the given geographical area. Users can reorder the ranking of the 5 locations by moving them up or down based on how they perceive the relative walkability of the shown locations within each set. This approach reduces a potential cognitive challenge and allows users to use their intellect and knowledge of the city while exploring a diverse set of potential locations in the region. The relative rankings are then submitted, as groups of 5 locations, to the server for downstream processing. Given the random nature of the web tool, the groups of 5 locations it provides can overlap across different users or even different submissions of the same user. On one hand, the overlap provides common references among submission groups; on the other hand, the submission groups will likely accumulate perceived conflicts across their relative rankings. We can observe in Figure 6 (right) all the locations with 2 or more associated submissions, leading to conflicts.
Figure 5. ALF-Score utilizes various GIS features, such as road network structure and POI, as well as features derived from the road network, such as various centrality measures and road embedding. GLEPO's linear extension of user opinions, which produces a global view of relative user opinions, is then aligned with the GIS features as input to our supervised machine learning processes. Walkability estimates produced by the trained models will have high spatial resolution, be representative of user opinion, and provide better insight into different regions and neighbourhoods.
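Detecting such overlaps, the "anchor" locations that appear in two or more submissions, can be sketched as follows. The submission format (an ordered list of location IDs, least to most walkable) and the toy data are our assumptions for illustration, not the platform's actual payload.

```python
from collections import defaultdict

# Each submission ranks 5 location IDs, least to most walkable
# (toy data; the real study collected 1,050 location rankings)
submissions = [
    ["A", "B", "C", "D", "E"],
    ["C", "F", "G", "B", "H"],   # shares locations B and C with the first
    ["I", "J", "K", "L", "M"],
]

# Record every (submission index, rank) at which each location appears
appearances = defaultdict(list)
for idx, sub in enumerate(submissions):
    for rank, loc in enumerate(sub, start=1):
        appearances[loc].append((idx, rank))

# Anchor locations: those appearing in two or more submissions
anchors = {loc: occ for loc, occ in appearances.items() if len(occ) > 1}
```

A conflict then corresponds to two anchors whose relative order disagrees across submissions (e.g. B ranked below C in one group but above it in another), which is exactly what GLEPO must reconcile.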

Figure 6.
Our interactive web-based data collection platform (left) has been deployed with road data from various cities in Canada; displayed here is St. John's, NL. A total of 1,050 rankings (center) were received from participants, showing a well-distributed data collection. Some locations were ranked multiple times (right) by the same or different participants. We can observe in our collected data that the maximum number of conflicts is 3 per location, with very few occurrences.
The decision to collect the data labels as a combination of relative rankings rather than absolute scores (e.g. scoring each of the 5 locations within a fixed range such as 0-100, where the same score may apply to other locations) was made to reduce bias, conflict and variability in the data, while incentivizing participants to make a precise decision about which of the given 5 locations would be most or least walkable and to order the locations as deemed appropriate. While absolute scores are easier to use, each individual may have a completely different rationale for why a location has been scored the way it is, for example why 65 and not 40. Utilizing relative rankings removes that factor and allows us to focus on the most important question: whether location A is more walkable than location B for a given individual. This way, there will be a concise approach

to determine each individual's perception of walkability when comparing different points against the responses of other users. Conflicts are unavoidable and lead to inconsistencies among user opinions. The challenge is to balance the opinions collected from users to yield a walkability metric that is agreeable to most users. The collected user opinions maintain a unique form, observing only relative orders within each submission of 5 locations, and therefore do not represent a global view of the overall data. Due to this and the nature of the conflict resolution, the problem remains in the NP-complete space, and we therefore need to devise a new approach to handle conflicts and represent user opinions within a global perspective.

Generalized Linear Extension of Partial Orders - GLEPO
Generalized Linear Extension of Partial Orders, or GLEPO, is an algorithm we have devised specifically to process this form of user opinion label collection. GLEPO produces a generalized list of all user opinions in total order, with absolute ranks representing the relative ranks of small submissions within a global observation of the overall data. This conversion is especially important because users rank locations within small groups of 5. The order of places in these small groups is unique and only relative within the same group among the 5 locations. Each node receives a ranking o_i^r between 1 and 5 that is relative only within its own set of 5 nodes. These relative rankings apply to all user submissions within S; each submission S_j maintains 5 nodes, each holding a unique rank between 1 and 5. At this stage, o_i^r is completely localized and bears no relationship to nodes from other submissions. In order to correctly utilize user data in conjunction with their submitted rankings and find the missing link between various submissions, we need to evaluate and establish a unified relationship between all nodes within the user-submitted data.
The first step of GLEPO is to group the submissions into multiple lists S_j, each containing 5 nodes {v_1, v_2, v_3, v_4, v_5}. The next step is to detect anchor nodes. Anchor nodes are defined as those commonly and naturally occurring nodes that repeat across various submissions. For instance, if we have submission x = {n1, n2, n3, n4, n5} and submission y = {n6, n7, n3, n8, n9}, node n3 would be considered an anchor node defining a connection between submissions x and y. The GLEPO algorithm uses anchor nodes to define connections between two or more submissions, based on the idea that if there is an anchor node between two submissions, the position of the anchor node(s) lets us narrow down an approximate positioning between the nodes of the connected submissions. Submissions may have none, one, or more anchor nodes. For instance, in the example above, since n3 is the anchor node and it also happens to be at the centre of both submissions x and y, we can deduce that {n1, n2} must fall before {n8, n9} and, similarly, {n6, n7} must fall before {n4, n5}.
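As a concrete illustration, anchor detection reduces to a membership test over already-processed nodes. The helper below is a hypothetical sketch, not the actual ALF-Score implementation; the node names follow the example above.

```python
def find_anchors(new_sub, processed_nodes):
    """Return the nodes of a new submission that already appear in the
    partially sorted global list -- these act as anchor nodes."""
    seen = set(processed_nodes)
    return [node for node in new_sub if node in seen]

x = ["n1", "n2", "n3", "n4", "n5"]  # an already-processed submission
y = ["n6", "n7", "n3", "n8", "n9"]  # a new submission sharing node n3

anchors = find_anchors(y, x)
print(anchors)  # ['n3']
```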

Algorithm 1 GLEPO: Main Routine
Input: User data organized by submission: subs_grouped
Output: List of sorted user data entries: sorted_entries
1: initialize sorted_entries as an empty list
2: initial randomization by shuffling subs_grouped
3: for every submission group sub_g in subs_grouped do
4:     if sorted_entries is empty then
5:         for every submission sub in sub_g do
6:             append sub to sorted_entries
7:     else
8:         for every submission sub in sub_g do
9:             if a submission node exists in sorted_entries then
10:                add the submission to the list of anchor nodes
11:        if anchor node(s) detected then
12:            pass sub_g and the anchor list to the addToSorted() function
13:        else
14:            pass sub_g to the FindVLink() function
15: return the sorted_entries list

Our main routine (Algorithm 1) iterates through the user data to form one possible variation of a newly sorted global list. When an anchor node is detected, the submission containing the anchor node is passed along to the addToSorted() function to be evaluated and to decide where the node entries within that submission are placed. This process helps appropriately address the anchor nodes' associated relationships with regard to node ranks, while sorting the entries based on their user-submitted order. However, in the case where multiple anchor nodes are detected, special conditions may be invoked within the addToSorted() function to determine the best course of action.
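The control flow of the main routine can be sketched in Python. This is a simplified sketch under the assumption that each submission is a plain list of node IDs; `add_to_sorted` and `find_v_link` are naive stand-ins for the addToSorted() and FindVLink() subroutines, not the real implementations.

```python
import random

def glepo_main(submissions, add_to_sorted, find_v_link):
    """Simplified sketch of GLEPO's main routine (Algorithm 1)."""
    sorted_entries = []
    random.shuffle(submissions)            # first randomization component
    for sub in submissions:
        if not sorted_entries:
            sorted_entries.extend(sub)     # seed the empty global list
            continue
        anchors = [n for n in sub if n in sorted_entries]
        if anchors:
            add_to_sorted(sorted_entries, sub, anchors)   # anchored case
        else:
            find_v_link(sorted_entries, sub)              # constraint-free case
    return sorted_entries

# Naive stand-in subroutines, for illustration only:
def naive_add(sorted_entries, sub, anchors):
    sorted_entries.extend(n for n in sub if n not in anchors)

def naive_vlink(sorted_entries, sub):
    sorted_entries.extend(sub)

result = glepo_main([["a", "b", "c"], ["c", "d", "e"]], naive_add, naive_vlink)
```

Whichever submission is shuffled first seeds the list; the shared node "c" then anchors the other, so every node appears exactly once in the result.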
GLEPO's sorting subroutine (Algorithm 2) parses the passed submission in which one (or more) anchor node(s) has been detected. The core of this algorithm revolves around determining the number of anchor nodes and their positions within the current submission and the partially sorted list, and invoking the case associated with each condition. There are 4 main conditions, two of which are associated with the detection of only a single anchor node positioned either at the beginning or the end of the submission in process. The third condition applies when the anchor node is not at the beginning or the end of the submission in process, and covers two cases: 1) only a single anchor is detected, or 2) multiple anchors are detected but the anchor in process is the last anchor. The final case accounts for all other conditions. Each case determines the segment or segments of the submission in process that require insertion into the global list. Each segment is then passed to another function for insertion into the list.
Algorithm 2 GLEPO: Sorting Subroutine
3: if it is the only anchor and is positioned at the beginning of sub_g then
4:     set segment to all 4 elements to its right
5:     pass segment to the RandomizeInsertion() function
6: else if it is the only anchor and is positioned at the end of sub_g then
7:     set segment to all 4 elements to its left
8:     pass segment to the RandomizeInsertion() function
9: else if it is the last anchor but positioned in the middle of sub_g then
10:    set the left segment to all elements to its left
11:    set the right segment to all elements to its right
12:    pass the left segment to the RandomizeInsertion() function
13:    pass the right segment to the RandomizeInsertion() function
14:    update the current pointer
15: else
16:    set segment to all elements between the current pointer and the anchor's position
17:    pass the segment to the RandomizeInsertion() function
18:    update the current pointer
19: return the sorted_entries list

Due to the structure of the data input and the nature of the algorithm, a different outcome is expected after every run if there is any variation in the order in which the input is fed to the algorithm. For example, the order of submissions and nodes can affect the algorithm's decision-making processes when conflicts are detected. For this reason, we introduced 2 randomization components into the process. The first randomization component is applied within our main routine (Algorithm 1), which randomly shuffles the order of appearance of all submission input. The second randomization component can be observed being called from our sorting subroutine (Algorithm 2); it randomizes the positions at which nodes are inserted into the global list by determining all appropriate positions and randomly choosing one for the node in process. Further details on the randomized insertion can be found in Algorithm 3. With this randomization introduced, the entire GLEPO process must run multiple iterations until the global list starts to converge.
This convergence is associated with the order in which each node appears within the global list over each run. The main process averages the position of each node over the course of all iterations and determines the final order into which each node falls. With this approach, multiple nodes may end up with a similar ranking, which is the expected result. We tested the algorithm over multiple iteration counts, such as 10, 30, 50, 80 and 100. We found that from around 30 iterations onward, the generated global list starts to converge and yields a stable result.
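The averaging step can be sketched as follows. The helper name is illustrative; each entry in `runs` stands for one global list produced by a randomized GLEPO iteration.

```python
from collections import defaultdict

def average_ranks(runs):
    """Average each node's position over multiple randomized runs and
    return the nodes sorted by mean position, plus the means."""
    positions = defaultdict(list)
    for global_list in runs:
        for pos, node in enumerate(global_list):
            positions[node].append(pos)
    mean_pos = {n: sum(p) / len(p) for n, p in positions.items()}
    # ties in mean position correspond to nodes with similar final ranks
    return sorted(mean_pos, key=mean_pos.get), mean_pos

runs = [[1, 2, 3, 4, 5],
        [1, 3, 2, 4, 5],
        [1, 2, 3, 5, 4]]
order, means = average_ranks(runs)
print(order)  # [1, 2, 3, 4, 5]
```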
An important step here is to ensure GLEPO is capable of handling special cases, no matter how rare their occurrence may be. To this end, we created various artificial networks that presented different scenarios within the user submission data set. Each of these artificial networks was converted into a graph and processed by GLEPO, and the resulting output was tested and verified to ensure the algorithm always returns results with the least possible inconsistency (defined later in this chapter). In Figure 7 we demonstrate 3 artificial networks used to verify GLEPO's accuracy and consistency. We can observe that there are a few possible (and correct) outcomes for each of these graphs. For example, in Figure 7, right, node 5 is the highest-ranked node among the two submissions, whereas node 1 is the lowest-ranked. Therefore, based on user opinion and the limited information given by only these two submissions, we know node 5 should be ranked the highest and node 1 the lowest. In this example, all other nodes could form a number of different orders as long as they remain true to their respective submissions. To name a few possible outcomes: [1,2,3,4,6,7,8,5], [1,6,7,8,2,3,4,5], [1,2,6,3,7,4,8,5]. We can observe that node 1 maintains the lowest rank while node 5 maintains the highest in all generated outcomes. We can further observe that nodes from each submission always follow their original orders of [2,3,4] and [6,7,8]. It is important to reiterate the significance of the randomization component and how it helps form a uniform distribution of nodes into the most appropriate positions within GLEPO's output.
To ensure the sorted list remains true and consistent with the user-submitted rankings, we examine its consistency against the original user data. Each user submission consists of 5 road nodes v_1, v_2, v_3, v_4, v_5, where each node v_i (i = 1, 2, 3, 4, 5) has a ranking ("order") denoted o_i^r, between 1 and 5. The rank o_i^r defines the user's assigned order for each node with respect to the other 4 nodes within that submission. When the nodes from all submissions are sorted by the GLEPO algorithm into a unified global list, the most natural approach to check for inconsistencies is to verify the order of all the original user submissions against that of the global list. If the order of the 5 nodes in an original submission remains intact within the global list, then the nodes within that submission are considered fully consistent with the original user's opinion. If the order differs, however, all out-of-order nodes are counted and amount to an inconsistency value. This is done by a pair-wise comparison of each submission appearing in two separate lists: 1) a list in the order submitted by the user, and 2) a list in the order appearing in the sorted list.
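A minimal sketch of this pair-wise check (a hypothetical helper, not the exact implementation): it counts the number of node pairs whose user-submitted order is flipped in the global list.

```python
def submission_inconsistency(submission, global_list):
    """Count out-of-order node pairs of one submission with respect to
    the GLEPO-sorted global list."""
    pos = {node: i for i, node in enumerate(global_list)}
    nodes = [n for n in submission if n in pos]
    return sum(1
               for i in range(len(nodes))
               for j in range(i + 1, len(nodes))
               if pos[nodes[i]] > pos[nodes[j]])  # pair flipped globally

# order preserved -> fully consistent
print(submission_inconsistency(["a", "b", "c"], ["x", "a", "y", "b", "c"]))  # 0
# "a" and "b" flipped -> one inconsistent pair
print(submission_inconsistency(["a", "b"], ["b", "a"]))  # 1
```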

Algorithm 3 GLEPO: Randomized Insertion Subroutine
Input: Current list of sorted entries, minimum insertion range, maximum insertion range, list of elements to insert
Output: Updates sorted_entries by randomly inserting new elements within the given range
1: if the size of the list of elements to insert is less than the range between min and max then
2:     for the size of the list of elements to insert, generate random numbers within the min-max range
3:     sort the generated random indexes
4:     for a range between 0 and the size of the list of elements to insert do
5:         insert an element of the list into sorted_entries at a position within the generated numbers
6: else
7:     for the size of the range between min and max, generate random numbers within the size of the list of elements to insert
8:     sort the generated random indexes
9:     for a range between 0 and the size of the range between min and max do
10:        insert an element of sorted_entries at the selected position into the list of elements to insert, at a position within the generated numbers
11:    replace the elements of sorted_entries within the min-max range with the updated list of elements to insert
12: return the sorted_entries list

Based on our earlier definition of anchor nodes, submissions containing one or more anchor nodes become anchored. In contrast, we define submissions without an anchor node as constraint-free submissions. Based on this definition, we can observe at least two possible challenges with GLEPO pertaining to how well-anchored the submissions are with respect to the user data: 1) over-constraining, and 2) under-constraining. When adding new nodes to the general list, over-constraining occurs when excessive conflicts and inconsistencies are detected through anchor nodes. For example, when the majority of the submissions have multiple conflicts with multiple other submissions, we are faced with over-constraining. With more crowd-sourced user opinion comes more potential conflict and disagreement.
A natural solution to this is GLEPO's embedded randomization over multiple iterations, which ensures a well-distributed input order and node insertion in producing the resulting global list.

Algorithm 4 GLEPO: Virtual Link Subroutine
Input: Current list of sorted entries, submission group sub_g, distance matrix of all nodes
Output: Creates a virtual link for the given sub_g and updates sorted_entries accordingly
1: set the desired threshold (the maximum distance at which to consider a virtual link)
2: initialize the main shortest distance to a very high number
3: for every element in sorted_entries do
4:     find the element's distance to all 5 nodes within sub_g
5:     if the shortest distance among the 5 is shorter than the main shortest distance then
6:         set the main shortest distance to the new distance
7:         set the main shortest node to the associated node
8: if main shortest distance < threshold then
9:     treat the main shortest node as an anchor node
10:    pass sub_g and the main shortest node to the addToSorted() function
11: else
12:    pass the entire sub_g to the RandomizeInsertion() function
13: return the sorted_entries list

Under-constraining, on the other hand, occurs when there is an insufficient number of connections (or anchors) among user submissions. This lack of anchored submissions leads to an inability to establish accurate associations among the various submissions. These associations enable the algorithm to present the data within a global list that reflects the global view of all users with respect to each other. Within our crowd-sourced data, we have observed that the majority of user submissions do not have an anchor node and are constraint-free, leading to under-constraining. Two main approaches are considered for addressing this challenge, both of which are complementary and can be pursued concurrently: first, collecting more user data; second, creating virtual anchor nodes, or virtual links. A virtual link is defined as an artificial connection between two or more submissions, created specifically to associate constraint-free submissions that have no anchor node with submissions already processed by GLEPO.
A virtual link utilizes geographical proximity to establish a virtual connection between a single submission and the partially complete global list, by determining the closest nodes (in terms of physical distance) between the two. Algorithm 4 shows our approach to creating virtual links. This subroutine is called from our main routine whenever a submission without an anchor node is detected.
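Assuming node coordinates are available, the nearest-node search of Algorithm 4 might look like the sketch below; `coords`, the node IDs, and the helper name are all illustrative, not from the ALF-Score code.

```python
import math

def find_virtual_anchor(sorted_entries, sub_g, coords, threshold):
    """Sketch of the virtual-link search: the global-list node closest to
    any of the submission's nodes becomes a virtual anchor if it lies
    within the distance threshold."""
    best_dist, best_node = math.inf, None
    for node in sorted_entries:
        for s in sub_g:
            d = math.dist(coords[node], coords[s])
            if d < best_dist:
                best_dist, best_node = d, node
    if best_dist < threshold:
        return best_node, best_dist    # treat as an anchor node
    return None, best_dist             # fall back to randomized insertion

coords = {"g1": (0, 0), "g2": (10, 10), "s1": (0, 1), "s2": (9, 9)}
anchor, dist = find_virtual_anchor(["g1", "g2"], ["s1", "s2"], coords, threshold=5)
print(anchor)  # 'g1'
```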

Active Living Feature Score - ALF-Score
The Active Living Feature Score, or ALF-Score, is a novel measure of walkability that aims to create a better walkability scoring system, one capable of generating walkability scores at a higher spatial resolution (point-level) that are representative of user opinion while being more informative to most users. ALF-Score takes a user-centric approach instead of the traditional researcher-centred approach. It aims to better utilize road network structure and characteristics, deriving node features that are defined by various road characteristics and their direct and indirect associations, and that capture how the underlying road structure can influence neighbourhood and city structure and walkability. ALF-Score employs user-based parameters, such as user opinion of walkability, to better represent user perception and provide user-derived scores reflecting what users would perceive as various degrees of walkability.
At the core of ALF-Score's pipeline lie various machine learning approaches and methods, with the goal of training machine-learned models capable of estimating numeric values that define how walkable specific points on the map are, based on various feature combinations that include road network, road embedding and POI features. This approach helps estimate walkability scores based on node characteristics. The input maintains a combination of various features derived from road structure and POIs, generated by different methods, while user opinion acts as training labels. The ideal outcome is to avoid calculating a walkability score individually for every single point on the map, and instead to accurately estimate point-specific scores from only a small sample of crowd-sourced labelled data. The output of the machine learning pipeline is a trained model, and the output of the model is walkability score predictions for given nodes. The score is defined as how walkable users may find the selected points, an output that is normalized between 0 and 100.
When it comes to the machine learning pipeline, we are working with a node regression problem, as we seek to estimate a continuous numerical output (walkability scores) based on numerical features (some converted from other types such as categorical, ordinal, etc.). To address a regression problem, there are many possible approaches, such as linear regression, random forest, support vector regression (SVR), and multi-layer perceptron neural network (MLP), which are explored in this research. The prediction input variables describing data features are represented as {x_1, x_2, x_3, ..., x_n}, while the output is represented by y. We use labelled sets such as {features, label}: {x, y} to train the model. The expectation of a trained model is to produce a prediction where, given an unlabelled set {features, ?}: {x, ?}, the ? is replaced with the prediction y′. The models, which are defined by internal parameters learned through the process, map unlabelled example sets to the predicted value y′. Since some of the features are not numerical (e.g. categorical or ordinal), we have used one-hot encoding, where applicable, to convert all features into appropriate numerical entries. The developed machine learning pipeline has five primary components: 1) GLEPO, 2) consistency measure and verification, 3) feature selection, 4) model training, and 5) model validation. The output of the model, a walkability score for any given point within the data set, is robust and refined, with a much higher spatial resolution compared to other walkability measures. The nature of the output and its predictive structure can be used to create interactive interfaces that allow users to visually view walkability scores for any selected area at up to point-level resolution. Furthermore, due to the nature of the model, there is no need to calculate the walkability score for the entire network.
Predictions for small subsets, or even the entire network as needed, can be made available in a fraction of the time required to calculate walkability scores using traditional methods.
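To make the {features, label} to prediction idea concrete, here is a minimal pure-Python sketch with a single illustrative feature. The actual pipeline uses richer feature sets (network, embedding, POI) and models such as random forest or MLP; the toy data below are not from our study.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for one feature: y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

# labelled sets {x, y}: a feature value paired with a walkability label
model = fit_linear([1, 2, 3, 4], [20, 40, 60, 80])
print(predict(model, 5))  # 100.0
```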
Our desired walkability metric is a global scalar that is consistent across all road nodes that have user input, but also coherent across all road nodes with and without user input labels. Our walkability function is defined as w : V → ℝ+, where V is the set of vertices; R denotes the set of user rankings. The performance metric of w is based on consistency with the user rankings R. Specifically, the performance metric considers a user ranking r ∈ R involving a set of 5 nodes {v_1, ..., v_5}. Ranking r gives each node v_i (i = 1, 2, 3, 4, 5) a rank denoted o_i^r (i.e. the "order" of node v_i as it appears in the user ranking). Similarly, walkability w also implies for each v_i (i = 1, 2, 3, 4, 5) a rank denoted o_i^w (i.e. the "order" of node v_i as suggested by w). The loss function of w with respect to r, or the "inconsistency" between r and w, can be defined as

L(r, w) = ∑_{i=1}^{5} |o_i^r − o_i^w|,

while the loss function with respect to R is the aggregate of L(r, w):

L(R, w) = ∑_{r∈R} L(r, w),

that is, the total amount of inconsistency of w as compared to all user-provided rankings. The aggregate function can take other forms too, such as mean or product. There can also be other ways to define such an inconsistency measure, e.g. the l_2-norm, "out-of-order" counts, or the Kendall rank correlation coefficient 38. The goal is to determine the most appropriate w that minimizes L(R, w), the amount of inconsistency between w and R. A crucial step is the conversion from "relative" rankings, each associated with a small localized set of nodes provided by a single user, to a "global" list of scores that represents relativity among all users and their provided opinions within the network, with as little discrepancy among R as possible. We have applied various machine learning techniques and compared their accuracy over various feature set combinations. The goal is to find the most suitable technique, as well as the feature combination, that produces the most appropriate models for predicting accurate walkability scores.
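The inconsistency L(r, w) and its aggregate L(R, w) can be computed directly. The sketch below assumes rank 1 corresponds to the highest walkability score; rankings are dicts from node to user rank (1-5) and w is a dict from node to score, which are representation choices of this sketch rather than the paper's implementation.

```python
def inconsistency(r, w):
    """L(r, w) = sum_i |o_i^r - o_i^w| for one user ranking r."""
    by_w = sorted(r, key=lambda v: -w[v])          # order implied by w
    o_w = {v: i + 1 for i, v in enumerate(by_w)}   # rank 1 = highest score
    return sum(abs(r[v] - o_w[v]) for v in r)

def total_inconsistency(R, w):
    """L(R, w): aggregate (here, the sum) over all user rankings."""
    return sum(inconsistency(r, w) for r in R)

r = {"a": 1, "b": 2, "c": 3}                 # user: a is most walkable
w_good = {"a": 0.9, "b": 0.5, "c": 0.1}      # agrees with r
w_bad = {"a": 0.1, "b": 0.5, "c": 0.9}       # reverses r
print(total_inconsistency([r], w_good))  # 0
print(inconsistency(r, w_bad))           # 4
```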
The techniques used in both supervised and semi-supervised environments were: 1) random forest, 2) linear regression, 3) decision tree, 4) support vector regression (SVR), 5) gradient boosting, 6) polynomial features (a non-linear approach), 7) lasso CV, and 8) multi-layer perceptron neural network (MLP). The feature combinations used were: 1) POI features only, 2) network features only (centrality measures), 3) road embedding features only, 4) POI + road network features, 5) POI + road embedding features, 6) road network + road embedding features, and 7) all features.
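These 7 feature combinations are exactly the non-empty subsets of the 3 feature groups, which can be enumerated programmatically (the group names here are illustrative shorthand):

```python
from itertools import combinations

groups = ["POI", "network", "embedding"]
feature_combos = [list(c)
                  for k in range(1, len(groups) + 1)
                  for c in combinations(groups, k)]
print(len(feature_combos))  # 7
```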
It is important to point out that ALF-Score is a novel and unique approach with no similar measures for direct comparison. Our success metric for determining the accuracy of our models is based on how close the results are to user opinions and general knowledge of the area, using 2 main approaches: 1) using validation and test sets to verify the models, and 2) visual inspection and verification based on local knowledge of various cities. As with most machine learning approaches, various challenges arise, especially in keeping our models as generalized as possible. For example, a lack of sufficient data, bias in data collection or processing, and incorrectly labelled data are some of the challenges that can lead to modelling errors. Overfitting and underfitting are also two very common modelling errors, occurring when the model's function fits the data too closely (too well-learned) in the case of overfitting, or fails to capture the prominent patterns in the data (not enough learning) in the case of underfitting. Overfitting leads to high accuracy on training results (previously seen data), but significantly lower accuracy when applied to new (unseen) data. Underfitting, on the other hand, leads to unpredictable outputs. In both cases, low generalization of the models leads to unreliable predictions. Furthermore, one of the prominent challenges faced in this research is that the crowd-sourced data set is relatively small. Two directly associated issues are: 1) potential bias from a lack of variety in participating user groups, and 2) a lack of appropriate coverage of relative walkability scores associated with various road network structures.
There are many ways to address these challenges. One that we have used in this research is a data split. The core concept of a data split is to separate the labelled data into two sets: 1) a training set, and 2) a testing set. For example, in an 80-20 percent split, 80 percent of the labelled data is assigned to the training set while the remaining 20 percent forms the testing set. Typically a 70-30 or 80-20 percent split is used, but we have also experimented with other variations such as 90-10 and 60-40. The labels from the test set are never shown to the algorithms while training the models. The prediction results are then compared to the actual values to establish a baseline for the accuracy and performance of the various models. Furthermore, a well-balanced split can make a substantial difference in obtaining a well-generalized model. For instance, when working with a small data set, if too much or too little is assigned to either the training or the test set, the model may not perform well due to a lack of observable patterns. K-fold cross validation 39 is used as yet another method to address this challenge. K-fold cross validation is typically done by splitting the data set into k smaller sets. One set is selected as the test data set, and the remaining sets are combined into one training data set. Once the model is built, it is evaluated on the test data set, and the procedure repeats for every set in k to ensure the entire data set is fully utilized in building and evaluating the models (for further information on k-fold cross validation please refer to the supplemental document). Additionally, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) were used to measure each model's error. MAE is defined as MAE = (1/n) ∑_{i=1}^{n} |y_i − x_i|, where x_i is the actual value, y_i is the prediction, and n is the total number of data points.
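The fold construction and error measures can be sketched in a few lines of pure Python. These are illustrative helpers written for this sketch, not the exact routines used in our experiments.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k near-equal contiguous folds; each fold
    serves once as the test set while the remaining folds form the
    training set."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def mae(actual, predicted):
    """Mean Absolute Error: (1/n) * sum |y_i - x_i|."""
    return sum(abs(y - x) for x, y in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: sqrt((1/n) * sum (y_i - x_i)^2)."""
    return (sum((y - x) ** 2 for x, y in zip(actual, predicted))
            / len(actual)) ** 0.5

print(k_fold_indices(10, 3))      # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(mae([0, 0, 0], [1, 2, 3]))  # 2.0
```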