Leveraging the self-transition probability of ordinal patterns transition network for transportation mode identification based on GPS data

Analyzing people's mobility and identifying their transportation modes is essential for cities to create travel diaries. It can help develop technologies to reduce traffic jams and travel times, thus helping to improve citizens' quality of life. Previous studies in this context extracted many specialized features, sometimes reaching hundreds of them; this approach requires domain knowledge. Other strategies focused on deep learning methods, which need intense computational power and more data than traditional methods to train their models. In this work, we propose using information theory quantifiers obtained from the ordinal patterns (OPs) transformation for transportation mode identification. Our proposal has the advantage of needing less data. OP is also computationally inexpensive and has low dimensionality, which is beneficial in scenarios where collecting information is hard, such as Internet-of-Things contexts. Our results demonstrate that OP features enhance the classification results of standard features in such scenarios.


Introduction
Analysis of GPS data is a well-studied problem applied to understand the mobility of different moving entities, such as people [1], vehicles [2,3], animals [4], and meteorological events [5], as well as the environment they inhabit. In particular, to meet the challenge of sustainable development for an increasing population, cities are investing in solutions based on human mobility. They want to understand and characterize how humans commute, and where and why they go (i.e., create travel diaries), to develop sustainable transport technologies capable of, for instance, reducing traffic jams and travel times, which can improve people's quality of life [6][7][8].
Given its interdisciplinary nature and broad scope of real-world applications, the need to identify transportation modes from trajectory data is evident, and several studies analyze different approaches in this field. For instance, [9] and [10] use GPS and wearable sensor data (e.g., body temperature and heart rate) to detect activities such as walking, rowing, and cycling. However, this approach obliges the user to carry several sensors to detect their transportation mode.
Moreover, some studies show that it is possible to combine sensor data with external information. As examples, we can cite [11,12], which classify different forms of transport using GPS data and GIS information, such as transit route information published by transit agencies. This approach can help unveil information that is not accessible from a single source; for instance, [13] can detect the public transport vehicle in addition to the transportation mode. In [14], the authors recognize not only the transportation modes but also the circumstances in which the users performed the activities (e.g., road surface and shoe types). However, although external data may increase the accuracy and the kind of information that can be discovered, it must be collected periodically since city information can change over time. Furthermore, external data may not always be available, which can harm the method's behavior and even make its use unfeasible.
Accordingly, many studies focus on classification using information from a single source. For instance, [15] presented a systematic review of transportation mode detection using mobile phone network data, discussing the data, preprocessing steps, and transport modes identified in more than 20 studies. In this work, we also use a single source to avoid the burden of collecting information from different sources, which might be impossible in some cases. Therefore, we concentrate our efforts on using only GPS data for transportation mode identification (TMI).
Furthermore, mining trajectory data to extract valuable information is challenging due to its unique properties. Trajectories have high dimensionality, heterogeneity, and noise, well-known problems in the Big Data era. Moreover, as time-series data, trajectories depend on ordering, and thus a change in the order could change their meaning. This opposes the assumption of independent and identically distributed observations made by many machine learning (ML) algorithms, such as Naïve Bayes, leading standard classification methods to perform poorly [16].
The literature has explored diverse approaches to overcome these issues. For instance, many studies use traditional ML methods [6,9,17]; however, these methods perform worse than ensemble methods [18]. In this context, random forest (RF) has given appealing results in this field [12,18,19], including on data collected at different frequencies [20] and on mobile phone signaling data [21]. Additionally, it has been applied to detect socioeconomic attributes [22] and travel circumstances [14]. As said before, recognizing such additional information requires external data. Also, these works extract many hand-crafted statistical features related to the motion of transports, sometimes reaching hundreds of them.
Recent papers applied deep learning (DL) methods to avoid the burden of manually extracting such features, since these methods can extract many levels of representation without human interference [23][24][25][26][27]. The main difference between the two approaches is that hand-crafted features generate interpretable models, since we know the features, but require domain knowledge to extract them. On the other hand, humans may not understand the high-level features generated by DL methods; in addition, such methods need intense computational power and more data than traditional ML methods to train their models.
Another research direction that has also been successful in the characterization of time series is transforming time series into graphs. Using this strategy, graphs are constructed that inherit the characteristics of the original time series (for example, periodic series result in regular graphs and random series transform into random graphs). Some examples are the visibility graph [28], the horizontal visibility graph [29], and the vector visibility graph [30]. However, these techniques have limited scalability, as each time-series sample is transformed into a vertex of the graph, making them unfeasible for high-dimensional time-series data.
In this context, this paper investigates the feasibility of information theory (IT) quantifiers, such as those based on ordinal patterns (OPs), for TMI in urban scenarios. Diverse authors have been studying the characterization of other time series using these techniques, such as the behavior of vehicles through their velocities [2] and electric loads [31]; hence, we want to explore whether they can be used in TMI as well. More specifically, this work intends to answer the following research question: Given a trajectory extracted from GPS data, is it possible to obtain a useful characterization that can help the identification of the transportation modes used in urban scenarios?
To answer this question, we investigate how different IT quantifiers affect the identification of transportation modes. Whereas previous work analyzed permutation entropy [32] and multiscale permutation mutual information [33], we extend this analysis to other IT quantifiers. To the best of our knowledge, this is the first work investigating it. We focus on urban scenarios, where acquiring data can be challenging since it depends on active user participation.
Hence, our methodology transforms the latitude and longitude of GPS trajectories into standard motion features, namely distance, speed, and acceleration, a common step in the literature. However, we do not limit the trajectories to a fixed number of points; this situation is closer to reality (i.e., people make trajectories of different sizes, which may also create high-dimensional trajectories). Many approaches, especially in DL, require fixed-length trajectories. Afterward, we calculate the OP transformation from each motion feature. From this transformation, we extract two traditional IT quantifiers, i.e., permutation entropy and statistical complexity. Additionally, we also compute the ordinal patterns transition network (OPTN) [34] from each motion feature. By representing the mobility data in these new domains, we intend to understand how they contribute to TMI. In particular, we propose the use of a new feature derived from the OPTN, called the probability of self-transition (p_ST) [35]. Our results show that it can discriminate transportation modes better than the traditional IT quantifiers, enhancing the results when we apply the three features together. We validate our proposal on real-world data for transportation mode identification: we want to characterize which transportation mode (car, bus, bike, or walk) a given person carrying a GPS device is using.
Our proposal presents the advantage of using less data to perform transportation mode identification. It can benefit diverse city scenarios, such as the Internet of Things (IoT), where data collection is complex. Furthermore, the OP transformation is computationally inexpensive and has low dimensionality. It allows the processing to occur within the IoT devices, which generally present severe computational resource limitations. Hence, we can reduce latency (e.g., we can achieve good metric results using fewer trajectories) and energy consumption. It also enhances privacy since we do not need to send data to a central server. Consequently, OP can be leveraged in a wide range of approaches [1], replacing more specialized features such as jerk [23]. We hope that our study, together with others, can shed light on how the IT field can support the understanding of mobility data in urban scenarios, creating opportunities to improve the quality of life in cities.
To assist in the reading, in Table 1 we present the main notations used throughout this work.
This work is organized as follows. Section 2 presents the fundamentals of the IT methods used in this work, i.e., OP and OPTN, along with the features extracted from them, namely permutation entropy, statistical complexity, and probability of self-transition. Section 3 discusses our methodology. Section 4 relates the results obtained, and finally, Sect. 5 concludes this work.

Ordinal patterns transformation
OP is a simple method of transforming time series that does not require any model assumption about the time series (i.e., it is domain agnostic), making it applicable to any arbitrary time series. To obtain an OP transformation, we observe the sequence that naturally arises from the time series, comparing the values in the same neighborhood within a sliding window and replacing them with a sequence of symbols. It needs two parameters: the embedding dimension D, which determines the sliding window size, and the embedding delay τ, which dictates the interval between the data points. Its advantages are simplicity, speed, robustness, and invariance with respect to nonlinear monotonic transformations [36].
Formally, we define the OP transformation as follows. Let X(t) = {x_1, x_2, ..., x_n} be a time series of size n, and let D ∈ ℕ be an embedding dimension and τ ∈ ℕ an embedding delay. At each time instant t = {1, ..., n − (D − 1)τ}, we have a sliding window w_t ⊆ X of size D, given by

w_t = (x_t, x_{t+τ}, ..., x_{t+(D−1)τ}),

i.e., we sample the time series at evenly spaced intervals, separated by intervals of size τ, in such a way that we obtain each element within the sliding window from the time series at times t, ..., t + (D − 1)τ. For each instant t, we establish an ordinal relationship between the data points in the sliding window, which consists of the permutation needed to sort them in ascending order of their values. Therefore, for each sliding window w_t at a given instant t, the ordinal pattern is the permutation π = (r_0, r_1, ..., r_{D−1}) that sorts the window elements in ascending order. Since this definition does not handle equal values in sequence, to obtain unique results we define that, if a time series has elements such that x_{t−r_i} = x_{t−r_{i−1}}, then we consider r_i < r_{i−1}. Hence, by extracting the aforementioned ordinal relations, the time series is converted into a set of ordinal patterns, Π = {π_1, ..., π_m}, where m = n − (D − 1)τ and each π_i represents a pattern from the set of D! possible permutations [31]. For instance, if D = 3, the possible permutation set is {123, 132, 213, 231, 312, 321}, of size D! = 3! = 6. Thus, if τ = 1 and n = 100, we have m = 98, meaning that the transformed time series has 98 patterns, each extracted from the ordinal relation present in the corresponding sliding window w_t.
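The transformation above can be sketched in a few lines of Python (an illustrative implementation, not the one used in this work; the tie-breaking shown, ranking earlier equal values first, is one common convention):

```python
def ordinal_patterns(x, D=3, tau=1):
    """Map a time series to its sequence of ordinal patterns.

    Each sliding window (x[t], x[t+tau], ..., x[t+(D-1)*tau]) is replaced
    by the permutation that sorts it in ascending order; ties are broken
    by order of appearance within the window.
    """
    m = len(x) - (D - 1) * tau  # number of patterns extracted
    patterns = []
    for t in range(m):
        window = [x[t + i * tau] for i in range(D)]
        # indices that would sort the window; (value, index) keys break ties
        pattern = tuple(sorted(range(D), key=lambda i: (window[i], i)))
        patterns.append(pattern)
    return patterns
```

For example, `ordinal_patterns([4, 7, 9, 10, 6, 11, 3], D=3, tau=1)` yields m = 5 patterns, the first window (4, 7, 9) mapping to the identity permutation (0, 1, 2).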
The choice of D depends on the time-series size and must satisfy the condition n ≫ D!. Hence, the higher the D value, the longer the time series must be to obtain reliable statistics [37]. For instance, [38] show that increasing D while decreasing the time-series size leads to substantial deviations from the expected value of the permutation entropy. Therefore, it is recommended that 3 ≤ D ≤ 7 [36], which we adopt in this work.
For all D! possible permutations π of order D, we can compute the relative frequency as the number of times a particular pattern appears in the time series divided by the total number of patterns, obtaining the probability distribution P ≡ {p(π)}, defined by

p(π) = |s_π| / m,

where |s_π| ∈ {0, ..., m} is the number of observed patterns of type π. From this new representation, it is possible to extract features, such as IT quantifiers, which we can use to characterize the time-series dynamics [37]. Furthermore, we can also explore it in other analyses, such as graph likelihood [39]. Figure 1 illustrates this process of extracting the OP probability distribution from a time series. (i) First, we have the original time series. (ii) We compute sliding windows with given D and τ values; the highlighted sliding window shows how D and τ behave. Simply put, D determines how many data points we consider to calculate the ordinal pattern: when D = 3, we use three points; when D = 4, four points; and so on. τ refers to the spacing between two consecutive points in the sliding window: when τ = 1, we use immediately neighboring points; when τ = 2, points that are two positions apart; and so forth. (iii) After computing the ordinal patterns over the whole time series, we extract the histogram of relative frequencies. (iv) We can then use the probability distribution or the frequency distribution of this histogram, as shown in the last step of the process.
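Computing the distribution p(π) from a pattern sequence is equally direct. The sketch below (illustrative code with our own naming, not the paper's implementation) returns the relative frequency over all D! permutations, including unobserved ones:

```python
from collections import Counter
from itertools import permutations

def op_distribution(patterns, D):
    """Relative frequency p(pi) over all D! possible permutations.

    `patterns` is a sequence of D-tuples (the extracted ordinal patterns);
    permutations never observed get probability 0, so the histogram always
    has D! bins and its values sum to 1.
    """
    m = len(patterns)
    counts = Counter(patterns)
    return {pi: counts.get(pi, 0) / m for pi in permutations(range(D))}
```

With D = 3 the returned dictionary always has 3! = 6 entries, matching the possible permutation set discussed above.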
In this work, we extract two quantifiers, the permutation entropy and statistical complexity, as discussed in the following.

Permutation entropy
The permutation entropy is a measure of the uncertainty associated with the process described by p_π and is defined by

H[p_π] = −Σ_π p(π) ln p(π),

where 0 ≤ H[p_π] ≤ ln D!. This measure is equivalent to the Shannon entropy [31]. Low values of H[p_π] correspond to a permutation distribution concentrated in a few patterns (e.g., sequences of increasing or decreasing values), indicating that the original time series is deterministic; in contrast, high values indicate a completely random system [36,40]. We can define the normalized Shannon entropy, for the case of permutation entropy, as

H_S[p_π] = H[p_π] / ln D!,    (5)

where 0 ≤ H_S[p_π] ≤ 1 [37]. With this, we adjust the values of different entropy measures to a standard scale. It makes them comparable and, in our case, lets us evaluate whether this measure can contribute to differentiating the transportation modes.
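A minimal sketch of the (normalized) permutation entropy, assuming the distribution is given as a mapping from each of the D! patterns to its probability (illustrative code, not the paper's implementation):

```python
import math

def permutation_entropy(p, normalized=True):
    """Shannon entropy of an ordinal-pattern distribution.

    `p` maps each of the D! possible patterns to its relative frequency.
    With `normalized=True` the result is divided by ln(D!), so it lies
    in [0, 1]: 0 for a fully deterministic series, 1 for a random one.
    """
    H = -sum(q * math.log(q) for q in p.values() if q > 0)
    if normalized:
        H /= math.log(len(p))  # len(p) == D!
    return H
```

A uniform distribution over the 6 patterns of D = 3 gives H_S = 1, while a distribution concentrated on a single pattern gives H_S = 0.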

Statistical complexity
Statistical complexity is a measure of the pattern or structure (i.e., regularity) present in systems. In other words, it captures the relationship between the regular components of a system, discounting randomness [41,42]. It is based on the Jensen-Shannon divergence (JS) between the probability distribution of ordinal patterns p_π and the uniform distribution p_u (the trivial case of minimum knowledge about the process). We adopt the following statistical complexity definition:

C_JS[p_π] = Q_JS[p_π, p_u] · H_S[p_π],

where H_S is the normalized Shannon entropy, as defined in Eq. (5), and the disequilibrium Q_JS[p_π, p_u] is given by

Q_JS[p_π, p_u] = Q_0 · JS[p_π, p_u] = Q_0 · { S[(p_π + p_u)/2] − S[p_π]/2 − S[p_u]/2 },

where S is the Shannon entropy measure. Q_0 is a normalization constant defined by

Q_0 = −2 { [(D! + 1)/D!] ln(D! + 1) − 2 ln(2D!) + ln D! }^{−1},

which is equal to the inverse of the maximum value of JS[p_π, p_u]. Moreover, 0 ≤ Q_JS ≤ 1 [31,37,43].
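The definition above can be sketched as follows (illustrative code; the list `p` of D! probabilities and the helper `shannon` are our own naming, not the paper's implementation):

```python
import math

def shannon(p):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(q * math.log(q) for q in p if q > 0)

def statistical_complexity(p):
    """Jensen-Shannon statistical complexity C_JS = Q_JS * H_S.

    `p` is the list of D! ordinal-pattern probabilities.
    """
    N = len(p)               # N = D!
    u = [1.0 / N] * N        # uniform reference distribution p_u
    H_S = shannon(p) / math.log(N)
    # Jensen-Shannon divergence between p and the uniform distribution
    mix = [(pi + ui) / 2 for pi, ui in zip(p, u)]
    JS = shannon(mix) - shannon(p) / 2 - shannon(u) / 2
    # Q_0: inverse of the maximum JS value, so that 0 <= Q_JS <= 1
    Q0 = -2 / (((N + 1) / N) * math.log(N + 1)
               - 2 * math.log(2 * N) + math.log(N))
    return Q0 * JS * H_S
```

Note that C_JS vanishes at both extremes: for the uniform distribution (JS = 0) and for a fully concentrated distribution (H_S = 0); intermediate distributions yield positive complexity.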

Ordinal patterns transition network
Given a sequence of OPs Π ≡ {π_i}, the OPTN represents the relation between consecutive patterns and is defined as a weighted directed graph whose vertices are the ordinal patterns. A directed edge connects two OPs in the graph if such patterns appear sequentially in the original time series, representing a transition between them. The weights w : E → ℝ of the edges represent the probability of a specific transition occurring in Π and are given by

w_{π_i, π_j} = |Π_{π_i, π_j}| / (m − 1),

where |Π_{π_i, π_j}| ∈ {0, ..., m − 1} is the number of transitions between the permutations π_i and π_j, and m − 1 is the total number of consecutive pattern pairs. The transition graph, being constructed from the OP set, inherits some of its properties. The most notable are:

- Low computational cost: Building the network only requires sorting the elements of each window. As [36] recommends D to be at most 7, the sorting involves no more than 7 elements, so the complexity of this strategy depends mainly on the time-series size n [35].
- Scalability: Approaches that use a visibility graph [28], for instance, transform each time-series sample into a vertex of the graph, an impracticable approach for high-dimensional time series due to the storage space required. In contrast, the number of vertices in an OPTN does not depend on the size of the series; it is determined by the embedding dimension D and bounded by D!.
- Robustness: OPs are robust to the presence of noise and invariant with respect to nonlinear monotonic transformations [31,37].
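A sketch of the OPTN edge weights, assuming the pattern sequence is given as a list of tuples (illustrative code, not the paper's implementation):

```python
from collections import Counter

def optn_weights(patterns):
    """Edge weights of the ordinal patterns transition network.

    Each consecutive pair (pi_i, pi_j) in the pattern sequence is a
    directed edge; its weight is the relative frequency of that
    transition, so all weights sum to 1.
    """
    transitions = list(zip(patterns, patterns[1:]))
    counts = Counter(transitions)
    total = len(transitions)  # m - 1 consecutive pairs
    return {edge: c / total for edge, c in counts.items()}
```

Because the vertex set is bounded by D!, the resulting graph stays small regardless of the trajectory length, which is the scalability property discussed above.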

Probability of self-transition
The self-transitions of the transition graph are edges from a vertex to itself, also known as loops. Their presence in the graph represents the occurrence of the same OP consecutively.
In [44], the authors analyzed the entropy computed through the weights of the OPTN after the removal of the self-transition edges. In [35], the authors showed that these self-transitions are directly related to the temporal correlation of the original time series and are a valuable indication of the hidden dynamics. Therefore, we should not discard them.
Hence, we define the probability of self-transition as the probability of observing a sequence of matching patterns within the OP transformation set. We calculate it as

p_ST = Σ_{π} w_{π, π},

i.e., the sum of the weights of all self-transition edges. We base our weight normalization of the graph on [44], where the authors normalize the weights such that they sum to 1. However, in our case, we keep the self-transition edges [35].
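With the pattern sequence in hand, p_ST reduces to the fraction of consecutive pattern pairs that repeat the same pattern. A sketch (illustrative code, not the paper's implementation):

```python
def self_transition_probability(patterns):
    """p_ST: fraction of consecutive ordinal-pattern pairs that repeat.

    Equivalent to summing the weights of all loop edges of the OPTN,
    since each pair contributes 1/(m-1) to exactly one edge weight.
    """
    pairs = list(zip(patterns, patterns[1:]))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a == b) / len(pairs)
```

For instance, the sequence [(0,1), (0,1), (1,0)] has two consecutive pairs, one of which is a repetition, giving p_ST = 0.5.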

Transportation mode identification
Our goal in this work is to identify transportation modes using IT quantifiers. To achieve it, we follow the framework shown in Fig. 3: from each GPS trajectory, we extract three motion features (distance, speed, and acceleration). We transform each feature (including latitude and longitude) into ordinal patterns, from which we extract the IT quantifiers used to feed a traditional ML classifier. We explain each of these steps in detail throughout this section.

Dataset description
In this work, we use the GeoLife 1 data, collected by [1]. This dataset presents GPS trajectories of 182 users over five years (from April 2007 to August 2012), containing latitude, longitude, and altitude information. Among these users, 73 have transportation mode information, which we used to evaluate our proposal. Table 2 describes the transportation modes used in this work, presenting the total distance and duration as well as their average and standard deviation. We have four kinds of transportation: walking, bike, bus, and personal car (car/taxi). Note that we considered only transportation modes with a total duration higher than 500 hours, since we understand that the smaller the time series, the more difficult it is to extract relevant information, which leads to low-quality models.
Here, we define a trajectory as an uninterrupted sequence of GPS points (latitude and longitude) that belong to the same transportation mode. We consider that every user is at the same altitude, thus discarding this measure. Since our focus in this work is to investigate how IT quantifiers help in TMI, we suppose that the trajectories are already segmented. This makes our evaluation similar to other classification problems, where each observation contains only one class, and that class is already known. Therefore, we separate the raw trajectories by user, day, and transportation mode. Some studies do not make this assumption: for instance, [17] use walking information to segment the trajectories into different sizes, and [27] segment the trajectories into intervals of 200 points. Segmentation is essential in practical scenarios but out of our scope; however, one can easily extend our technique to such scenarios. Moreover, the data may contain errors generated during sampling and labeling. Labeling errors arise because this task depends on human effort; hence, people may mislabel some trajectories. This inaccuracy cannot be removed from the data since it is impossible to determine. Consequently, we consider each provided label to be the actual transportation mode used. On the other hand, sampling errors appear as erroneous measurements, which we can identify and remove to avoid corrupt and/or incomplete information that can affect the identification of the transportation modes. Hence, we discard longitude and latitude values that are out of range. Some studies, based on the supplied labels, remove points in which the values exceed a threshold [18,23,27]; we do not remove these points, in order to evaluate the robustness of our proposal. In addition, we discard trajectories with fewer than 10 points to avoid low-quality trajectories.
Table 3 presents some statistics about the obtained trajectories: how many trajectories each transportation mode contains (with its proportion in the dataset); the average (with standard deviation), maximum, and minimum number of points per trajectory by transportation mode; and the average speed, with standard deviation, in kilometers per hour. As in [17], we join cars and taxis as driving since they have fewer trajectories than the other transportation modes. In particular, car encompasses 860 trajectories (9.8% of the dataset), with an average of 627.88 points (±866.97) and maximum and minimum trajectory sizes of 8394 and 12 points, respectively. Taxi consists of 545 trajectories (6.3%), with an average of 470.92 points (±758.70) and maximum and minimum trajectory sizes of 10841 and 11 points, respectively. Also, the average speeds for car and taxi are 32.50 (±13.87) and 30.41 (±12.30) km/h, respectively.
We note that the dataset is imbalanced, with walking containing about half of the trajectories. However, the average size of walking trajectories is smaller, showing that people tend to commute longer when using other transportation modes.
Moreover, we see in Tables 2 and 3 that speed and distance are extremely useful for classification. For instance, using only speed, we can discriminate between walking and car. However, these features cannot describe some urban scenarios, such as traffic jams, in which a car and a pedestrian may reach the same speed. Hence, additional features, such as those derived from the OP transformation, can help identify transportation modes, since we do not know in which context the studied trajectories were generated.

Motion features and feature extraction
This step comprises the transformation of segmented trajectories into features extracted from the OP and OPTN transformations. Trajectory data are usually found as a collection of latitude, longitude, and altitude points. These features, however, are challenging to interpret; transforming them into other features may improve transportation mode identification. Therefore, we transform the trajectories (latitude and longitude information) into three motion features that describe the movement of transportation modes: distance, speed, and acceleration.
For the geographical distance between two succeeding GPS points, we use the geodesic distance, a generalization of straight-line distance to a curved surface. In other words, the geodesic distance is the shortest path between two points on the Earth, modeled as an ellipsoid of revolution. Among the several ellipsoid models, we use the World Geodetic System (WGS) 84, the standard in cartography and satellite navigation (including GPS), as explained in [45].
Speed measures how fast an entity moves from one point to another. We define it as

v = dist(p_1, p_2) / Δt,

where dist is the distance between points p_1 and p_2, as described above, and Δt = t_2 − t_1 is the time difference between the two points (p_1 with time t_1 and p_2 with time t_2). The SI unit for its magnitude is meters per second (m s^−1), which we adopt. Moreover, acceleration determines the rate of change of the speed of an entity with respect to time. We define it as

a = Δv / Δt = (v_2 − v_1) / (t_2 − t_1).

The SI unit for its magnitude is m s^−2.
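The motion features can be sketched as follows. Note that, for self-containment, this illustrative code uses the spherical haversine formula rather than the WGS84 geodesic adopted in this work (the two differ by well under 1% at city scales); the function names are our own.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters (spherical approximation of the
    WGS84 geodesic used in the paper)."""
    R = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

def motion_features(points):
    """From (lat, lon, t) samples, derive the distance (m), speed (m/s),
    and acceleration (m/s^2) series used as inputs to the OP step."""
    dists, speeds = [], []
    for (la1, lo1, t1), (la2, lo2, t2) in zip(points, points[1:]):
        d = haversine_m(la1, lo1, la2, lo2)
        dists.append(d)
        speeds.append(d / (t2 - t1))
    # acceleration: change of speed between consecutive segments
    accels = [(speeds[i + 1] - speeds[i]) / (points[i + 2][2] - points[i + 1][2])
              for i in range(len(speeds) - 1)]
    return dists, speeds, accels
```

A production implementation would instead call a geodesic routine on the WGS84 ellipsoid (e.g., geopy's `distance.geodesic`).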
Hence, from each motion feature, we extract three features: permutation entropy and statistical complexity from the OP representation, and the probability of self-transition from the OPTN. We extract these three features from latitude and longitude as well, totaling 15 features. We describe the extraction step using the OP transformation in Sect. 2.
Besides these steps, more could be performed, such as extracting more features. However, we do not intend to exhaust such possibilities since our goal is to analyze how OP transformation can help the transportation mode field by substituting more specialized features, such as jerk and heading change rate. This kind of analysis can be more generalizable, allowing its use in IoT devices by reducing latency and computational consumption in classification, which can leverage a wide range of approaches such as personal assistance and healthcare applications.

Classification and model evaluation
This step receives the extracted features and classifies them using traditional ML methods. To evaluate whether our proposal can generalize and to validate our results, we use tenfold cross-validation. It is important to note that this cross-validation is performed on the extracted features immediately before classification. After going through the transformations described in this work, such features can be treated as independent and identically distributed, allowing this validation method.
Moreover, since our data are imbalanced, we stratify each class to force each fold to have the same class distribution, i.e., we preserve the percentage of samples of each class in each fold. This ensures that no class is over-represented, which would lead to unrealistically inflated results.
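The stratification idea can be illustrated with a simple round-robin assignment (a didactic sketch with hypothetical names; in practice, a library routine such as scikit-learn's `StratifiedKFold` would be used):

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign each sample index to one of k folds so that every fold
    approximately preserves the overall class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # deal each class's samples round-robin across the folds
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds
```

For example, with 20 "walk" and 10 "bus" samples and k = 5, every fold receives 4 walk and 2 bus samples, mirroring the 2:1 class ratio of the whole dataset.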
Since the idea of this work is to highlight the particularity of each data transformation, little effort was devoted to tuning the classification algorithms. It may be possible to obtain better evaluation metric results by adjusting the classifiers' parameters. However, our objective is to present good results for such metrics and to determine whether our proposal is suitable for the characterization and classification of trajectories. We classified using simple algorithms: k-nearest neighbors (k-NN), with k = 2; support vector machines (SVM), with linear (SVM-L) and radial (SVM-R) kernels; decision tree (DT); RF (300 trees); and gradient boosting decision tree (XGBoost) (also 300 trees).
To evaluate the performance of our model, we used the following measures:

- Accuracy (acc). It measures how well a classifier correctly identifies or excludes a condition. In other words, this measure estimates how close the predictions of a classifier are to the actual labels (the transportation modes in this case). We define acc as

acc = correct predictions / total of observations = (TP + TN) / (TP + TN + FN + FP),    (13)

where TP (true positive) represents the cases in which the predicted class of an observation is indeed its class; FP (false positive) refers to the cases where the predicted class is mistaken, i.e., the model predicts that the observation belongs to a class, but it does not (also known as type I error); TN (true negative) covers the cases where the model predicts that an observation does not belong to a class, and indeed it does not; and, finally, FN (false negative) describes the cases in which the model predicts that an observation does not belong to a class, but it does (also called type II error) [46].

- F1-measure. It is the harmonic mean between precision (pre) and sensitivity (sen), defined as

F1 = 2 · (pre · sen) / (pre + sen),

where precision expresses the proportion of points that the model deems relevant and that indeed are,

pre = TP / (TP + FP),

and sensitivity explains how effectively a classifier identifies positive cases, that is, the ability of our model to identify which observations pertain to a class,

sen = TP / (TP + FN).

In multi-class problems, such as identifying many transportation modes, the evaluation measures are calculated separately for each class in a one-versus-all scheme and then combined. Aggregating the per-class counts before computing the measure yields micro-averaged measures, whereas taking the unweighted mean of the per-class measures yields macro-averaged measures. While macro-averaging treats all classes equally, micro-averaging favors the larger ones; which one to use depends on the goal [47].
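As an illustration of the macro-averaged F1-measure (a didactic sketch with hypothetical names, not the evaluation code used in this work):

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted average of per-class F1 (one-vs-all), suited to
    imbalanced multi-class evaluation because every class counts equally."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        pre = tp / (tp + fp) if tp + fp else 0.0
        sen = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * pre * sen / (pre + sen) if pre + sen else 0.0)
    return sum(f1s) / len(f1s)
```

Because each class contributes 1/L of the final score regardless of its size, a classifier that ignores a minority class is penalized, which is why we adopt macro-averaging for our imbalanced dataset.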
In this work, we consider macro-averaging measures since our dataset is imbalanced, as discussed before.

Algorithm complexity analysis
The cost to extract each motion feature is O(n), which is also the total cost of the feature extraction step. Algorithm 1 shows the pseudo-code for the OP transformation. Lines 3 to 10 iterate over the whole trajectory, performing the following operations: in line 4, we select the indices of the points to transform into an OP, that is, every D points spaced by τ, at cost O(1). Line 5 obtains those point values from the trajectory data, based on the index values, also at cost O(1). The ARGSORT function returns the indices that would sort an array (i.e., the ordinal relation); hence, in line 6 we obtain the ordinal pattern π that represents the current sliding window. Using a simple sorting algorithm, such as merge sort, this has a cost of O(D ln D). Hence, the total complexity of the OP transformation is O(n D ln D). The algorithm stops when it reaches the end of the trajectory, precisely at the last sliding window (as shown in line 3), determined by n − (D − 1)τ, where n is the size of the trajectory. The complexity of the classification step depends on the classification algorithm used. For instance, an RF classifier has cost O(N² × f × t), where N is the total number of trajectories, f the number of features, and t the number of trees.

Implementation
All the implementation in this work was done in Python (version 3.7.3), using the Anaconda distribution (version 4.7.11). The machine on which it was executed had the following configuration: Ubuntu 18.04.3 OS, 20 × Intel(R) Core(TM) i9-9900X CPU @ 3.50 GHz, and 125 GB RAM. The implementation of our framework can be found online at https://github.com/icps/tmc_ordinal_patterns.

Results and discussion
In this section, we present the results for transportation mode identification using features extracted from the OP and OPTN transformations, as described in Sect. 3. First, we evaluate the influence of the OP parameters, the embedding dimension D and the embedding delay τ. Afterward, we examine the classification results. Figure 4 presents the accuracy obtained using the OP features extracted from latitude, longitude, distance, speed, and acceleration, with values of D from 3 to 6. Although [36] also recommends D = 7, this value can be unfeasible in this scenario since, for long trajectories, the transformation can take a long time; therefore, we do not use it.

Influence of the embedding dimension D
We want to evaluate the strength of the OP features, individually and combined, to characterize the transportation modes and contribute to their identification. Hence, Fig. 4a-c displays the accuracy results for each feature classified independently. Figure 4d, e exhibits two combinations of features: the former shows the results for the features extracted from the OP transformation only (permutation entropy and statistical complexity), while the latter presents the OP features combined with the OPTN feature (permutation entropy, statistical complexity, and probability of self-transition). All panels share the same y-axis scale, making the results easier to compare.
When classifying each feature in isolation, we see that H[pπ] achieves the best accuracy, about 75% when D = 5. We also note that this feature is more robust to variations in D than p_st and C_JS[pπ]: H[pπ] achieves similar results when we vary D from 4 to 6.
Moreover, C_JS[pπ] reaches about 73% when D = 5, and p_st achieves the lowest result among the three features, about 68% when D = 4. This may indicate that the sequence of the ordinal patterns in the trajectories is not essential to identify the transportation modes using this basic set of motion features (latitude, longitude, distance, speed, and acceleration); computing their distribution is sufficient. In other words, identifying which patterns occur matters more than knowing when they occurred.
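For reference, the two OP quantifiers compared here can be computed from the pattern distribution using the standard definitions of normalized permutation entropy and Jensen-Shannon statistical complexity; the following is a sketch under those definitions, with helper names of our own choosing:

```python
import math
from collections import Counter
from itertools import permutations

def shannon(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def op_quantifiers(patterns, D):
    """Normalized permutation entropy H[p] and Jensen-Shannon
    statistical complexity C_JS[p] from a sequence of ordinal patterns."""
    N = math.factorial(D)                  # number of possible patterns
    counts = Counter(patterns)
    total = len(patterns)
    # Full distribution over the N = D! possible patterns (zeros included).
    p = [counts.get(pi, 0) / total for pi in permutations(range(D))]
    H = shannon(p) / math.log(N)           # normalized permutation entropy
    # Jensen-Shannon divergence between p and the uniform distribution.
    u = 1.0 / N
    m = [(x + u) / 2 for x in p]
    jsd = shannon(m) - (shannon(p) + math.log(N)) / 2
    # Normalization constant: the maximum attainable divergence.
    q_max = -0.5 * ((N + 1) / N * math.log(N + 1)
                    - 2 * math.log(2 * N) + math.log(N))
    C = (jsd / q_max) * H                  # statistical complexity
    return H, C
```

As a sanity check, a uniform pattern distribution gives H = 1 and C = 0, while a fully degenerate one (a single repeated pattern) gives H = 0 and C = 0.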
Also, for D = 6, the three features present a slight decrease in accuracy, indicating that the relevant relationships in the trajectories exist within smaller sliding windows. In other words, the patterns that identify the transportation modes in the used features span fewer than 5 data points. Further study is needed to understand the importance of the OP sequence information; we do not pursue that direction in this work.
Regarding the classifier, we can see that RF, SVM-R, and XGBoost perform best in the three cases, with similar results. This indicates that the feature space of these trajectories is not linearly separable, requiring a classifier that can explore nonlinear decision boundaries to achieve better results.
Furthermore, combining the two OP features, C_JS[pπ] and H[pπ], gives slightly better accuracy than H[pπ] alone, about 76%, a gain of about 1%. Hence, we can say that, in this setting, C_JS[pπ] and H[pπ] provide similar information; therefore, adding C_JS[pπ] does not substantially enhance the results.
Finally, joining the three features for classification gives an improvement of about 3% in accuracy compared to the OP feature set (C_JS[pπ] and H[pπ]), reaching about 78% when D = 4. Therefore, the aggregation of the OP and OPTN representations brings the best results. In other words, the ordinal pattern distribution and the pattern transitions are both valuable information for identifying the transportation mode.
Additionally, we can see that D = 4 and D = 5 present similar results for the three features; in this case, we prefer the smaller value since the cost of the algorithm is lower. Again, RF performs best, with about a 3% accuracy gain over the second-best classifier, XGBoost.

Influence of the embedding delay τ
Now, we evaluate the influence of τ on transportation mode recognition. We set D = 4, which gave the best results in the previous experiment. The maximum τ value depends on the time-series size n and the dimension D, being limited by n > (D − 1)τ. Hence, the greater the τ value, the more time-series samples are needed. For example, for D = 4 and τ = 1, the time series must contain more than 3 points; for D = 4 and τ = 3, n > 9; for D = 4 and τ = 5, n > 15; for D = 4 and τ = 10, n > 30; for D = 4 and τ = 15, n > 45, and so on. Note that greater D values likewise require longer trajectories to perform the transformation, which we cannot obtain in some cases; hence our preference for smaller D values, which has the additional advantage of decreasing the algorithm's complexity. In Table 4, we present our data set size n as τ increases. When τ = 15, for instance, more than 1000 trajectories are discarded compared to τ = 1; i.e., we lose information when increasing τ.
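The length constraint n > (D − 1)τ can be expressed directly; a small sketch, with a helper name of our own:

```python
def min_series_length(D, tau):
    """Smallest time-series size n that yields at least one ordinal
    pattern, from the constraint n > (D - 1) * tau."""
    return (D - 1) * tau + 1

# The examples from the text, with D = 4:
# tau = 1  -> n > 3,  so n >= 4
# tau = 3  -> n > 9,  so n >= 10
# tau = 15 -> n > 45, so n >= 46
```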
In Fig. 5, we show the values obtained when using D = 4 and τ = {1, 5, 10, 15}. The presentation scheme is similar to the previous experiment: Fig. 5a-c shows each feature classified alone, and Fig. 5d, e presents the OP set and the set containing all three explored features, respectively. All panels share the same y-axis scale.
We see that, for the features classified alone, the accuracy obtained by the OP features, C_JS[pπ] and H[pπ], declines as τ increases. From τ = 1 to τ = 15, there is a loss of about 15% of accuracy in both cases. In contrast, for p_st, as τ increases, the results fall slightly but soon return to the value obtained when τ = 1. This suggests that the ordinal pattern transitions preserve the relationship between the trajectory points captured in the sliding window when τ changes (i.e., p_st is robust to τ variation), which does not happen with the OP distribution.
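For illustration, p_st can be estimated from the pattern sequence as the fraction of consecutive pattern pairs in which the pattern repeats; this is a sketch of ours, not the authors' implementation:

```python
def self_transition_probability(patterns):
    """Probability of self-transition in the ordinal patterns
    transition network: the fraction of consecutive pattern pairs
    (pi_t, pi_{t+1}) in which the pattern repeats."""
    pairs = list(zip(patterns, patterns[1:]))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

For example, the sequence [(0,1), (0,1), (1,0), (1,0), (1,0)] has four transitions, three of which are self-transitions, so p_st = 0.75.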
As before, combining the OP features may add information and enhance the results; however, as τ increases, the accuracy deteriorates as well. Joining the three features brings the best of both transformations, improving the results and adding some robustness to τ variation. Again, the best classifier is RF, with XGBoost second.

Classification results
In our last investigation, we analyze the classification results of the transportation modes in more detail. We selected the RF classifier since it consistently presented the best results in the previous explorations. Additionally, we adopted the following set of motion features as the baseline: distance, speed, and acceleration. We extracted four statistical measures from each feature: minimum, maximum, mean, and standard deviation. Hence, the baseline has a total of 12 features. We chose the most classical features to compose our baseline, which is not state of the art, since one of our goals in this work is to show that the OP transformation can substitute more specialized features. Also, adding a large number of features may negatively impact the results by leading to the curse of dimensionality.

Our proposal consists of the following motion features: latitude, longitude, distance, speed, and acceleration. From each feature, we extracted the three IT quantifiers studied in this work, p_st, H[pπ], and C_JS[pπ], since this combination provided the best accuracy, as we saw earlier. These features are classified together with the standard features mentioned above (i.e., from each motion feature, we also extract the classical statistical features) since we want to investigate how our characterization can contribute to a standard model without exhausting the feature space. Hence, we have 35 features in total.

Table 5 depicts the performance metrics for our model and the baseline, along with the confidence interval (at a 95% confidence level). To evaluate the behavior of our model in scenarios with fewer data, we varied the number of available trajectories. In the first four scenarios, each transportation mode contains the same number of trajectories; for instance, the first one, with 50 trajectories, contains 10 of each transportation mode.
In the fifth scenario, each transportation mode contains 1000 trajectories, except for driving, which contains 1340 observations (both car and taxi contain fewer than 1000 trajectories each). The last scenario uses all the data available in the dataset.
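The 12-feature baseline described above can be sketched as follows, assuming precomputed distance, speed, and acceleration series; the function name is ours:

```python
import numpy as np

def baseline_features(distance, speed, acceleration):
    """The 12 baseline features: minimum, maximum, mean, and standard
    deviation of the distance, speed, and acceleration series."""
    feats = []
    for series in (distance, speed, acceleration):
        s = np.asarray(series, dtype=float)
        feats.extend([s.min(), s.max(), s.mean(), s.std()])
    return feats
```

Extending the same four statistics to all five motion features (latitude, longitude, distance, speed, acceleration) yields the 20 statistical features that, together with the 15 IT quantifiers (3 per motion feature), form our 35-feature model.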
We note that in scenarios with fewer data, our proposal presents the best results in all metrics. For instance, with only 50 trajectories, our model achieves an accuracy of 90%, whereas the standard features reach 74%, a gain of 16%. The other metrics also present good results. As we add more trajectories to the classification, the baseline results improve. However, our model consistently presents better metrics, with an accuracy gain of up to 6% when we have 250 trajectories. Using the whole dataset, we obtain a gain of about 3% in all metrics with our model. This suggests that our features can complement the standard ones when identifying transportation modes.
Next, we adopt D = 4, τ = 1, and the RF classifier to analyze the classification results of different subsets of transportation modes, using the entire dataset with our model (standard, OP, and OPTN features). Table 6 shows these results with a 95% confidence interval. We see that it is more challenging to distinguish between transportation modes that, intuitively, travel at a similar pace, such as walk and bike, or driving (car/taxi) and bus. For more distinct modes, such as walk and driving, we achieved better results. Car/taxi versus bus also presents good results for all metrics, likely due to the frequent stops of buses, whereas cars and taxis travel more freely.
Moreover, the subset with the lowest accuracy is bike and bus. There is no easy explanation for this misclassification. One can suppose that extraordinary circumstances forced one mode to exhibit features similar to the other; it would be necessary to analyze the events influencing the mobility captured by the GPS data to understand this situation.

Figure 6 shows the boxplots of F1, precision, and sensitivity when classifying all transportation modes (bike, bus, driving, and walk). We observe that none of the metrics presents high variability and that outliers are rare. The highest variability occurs in the precision when classifying buses, with a standard deviation of 2.61. This behavior indicates that our classifier tends to be consistent.
In Fig. 7, we show the confusion matrix for the complete set of investigated transportation modes. In general, our model distinguishes satisfactorily between them. The model misclassifies some walk trajectories as bus and bike; confusing bike trajectories with walk makes sense since, as said before, they travel at a similar pace. However, the confusion between bus and walk trajectories needs further investigation. Moreover, confusing driving and bus is reasonable since both are road-based and may have similar features.
Comparing our approach with other works in the literature is an arduous task for several reasons. For instance, other works extract a wide range of features, both manually and automatically, which leads to issues such as unavailability, higher complexity, and higher computational cost. Another essential aspect is segmentation. Most papers segment the trajectories similarly to our work (assuming the data are previously segmented, as in a traditional classification problem). Other works, such as [17], used walking information to slice the trajectories. Therefore, the number of segments extracted from the trajectory data differs even when using the same dataset, and it is not possible to say which is the correct total of segments. In other words, we cannot affirm whether one segment indeed corresponds to one trip, since the dataset is manually labeled, which may lead to errors such as underreporting, as said before. Nevertheless, since the dataset contains ground-truth information (which transportation mode each point belongs to), it is at least guaranteed that each segment contains information about only one transportation mode. Moreover, the literature does not classify a unique set of transportation modes, even when using the same dataset.
For instance, using the GeoLife dataset, we find a multitude of transportation mode sets across different papers [17,24,44], which is one more issue that prevents comparison of works in this field. Therefore, we conclude that it is imperative to develop solutions that unify the methodology of transportation mode identification, providing more comparable works and helping the progress of this area.

Conclusion
In this work, we used the OP transformation to classify transportation modes recorded as GPS trajectories. We transformed the GPS trajectory data into ordinal patterns and, afterward, transformed such patterns into a transition network and into the probability distribution of the pattern frequencies. From the latter, we extracted two well-known IT quantifiers, namely the permutation entropy and the statistical complexity. From the former, we extracted the probability of self-transition, which is directly related to the temporal correlation of the original time series. Our investigation showed that using both transformations enhances the classification results; hence, we can affirm that they satisfactorily characterize the trajectories.
Besides that, the probability of self-transition presented less dependence on the embedding dimension D and the embedding delay τ, the parameters required by the ordinal patterns transformation.
Furthermore, we analyzed our proposal in scenarios where it is hard to collect data, so less data are available to the model. Our results demonstrated that the features extracted from OP and OPTN can benefit this context, where using fewer data and features is vital, for instance, to avoid the curse of dimensionality. This approach makes it possible to achieve good classification results with less computational power, by extracting fewer features from less data.
To the best of our knowledge, this is the first work to apply IT quantifiers derived from the OPTN to transportation mode identification, showing that it is a feasible approach to this kind of problem. Additionally, new research can extend our method with more features or external information, which may further enhance the results.
Note that, although our proposal is validated here for transportation mode identification, we believe the findings may be applied in practical situations as well as in technological applications related to time-series classification in general. In such scenarios, the characteristics of the OP transformation, such as its simplicity, robustness, and speed, can be advantageous.
Despite its many advantages, as previously presented, OP has some drawbacks. For instance, it does not consider the differences in amplitude that the data may have; consequently, windows with different magnitudes map to the same OP symbol, which may discard important information carried by the amplitudes. In addition, OP cannot distinguish sequences containing equal values, since it was designed to handle only inequalities (i.e., sequences with strictly different values), which may introduce bias and lead to errors as well. There are techniques in the literature that try to overcome these issues, but their investigation is out of our scope. Future research exploring these paths is welcome and can positively complement our study.

Conflict of interest
The authors declare that they have no conflict of interest.
Code availability
The implementation of our framework can be found online at: https://github.com/icps/tmc_ordinal_patterns.