PreCLN: Pretrained-based contrastive learning network for vehicle trajectory prediction

Trajectory prediction of vehicles is of great importance to various smart city applications ranging from transportation scheduling and vehicle navigation to location-based advertisements. Existing methods focus on modeling spatiotemporal relations with explicit contextual semantics from labeled trajectory data, and rarely consider the effective use of the large amounts of available unlabeled trajectory data with the assistance of contrastive learning and pre-training techniques. To this end, we develop a novel Pretrained-based Contrastive Learning Network (PreCLN) for vehicle trajectory prediction. Specifically, we propose a dual-view trajectory contrastive learning framework to achieve self-supervised pre-training. A Transformer-based trajectory encoder is designed to effectively capture the long-term spatiotemporal dependencies in trajectories and embed input trajectories into fixed-length representation vectors. Moreover, three auxiliary pre-training tasks, i.e., trajectory imputation, trajectory destination prediction, and trajectory-user linking, are used to assist the training of PreCLN within the dual-view trajectory contrastive learning framework. After pre-training, the resulting trajectory encoder is used to generate trajectory representations for future trajectory prediction. Extensive experiments on two real-world large-scale trajectory datasets demonstrate the significant superiority of PreCLN over state-of-the-art trajectory prediction baselines on all evaluation metrics.


Introduction
With the rapid growth of location-based services and the widespread adoption of GPS-equipped devices, the availability of large-scale trajectory data has been increasing in recent years, including taxi trajectories, human check-ins, traffic camera surveillance data, and more [1][2][3]. Mining spatiotemporal trajectory data has therefore been extensively studied, with applications such as predicting future trajectories [4][5][6], estimating arrival times [7][8][9], traffic prediction [10][11][12], route planning [13][14][15][16][17][18], and location-aware queries [19][20][21][22][23].
Among these research directions, trajectory prediction has attracted a great amount of attention, as it is of crucial importance for a variety of smart city applications, such as transportation scheduling, location-based recommendations, and autonomous driving. For example, trajectory prediction can assist traffic managers in anticipating impending congestion and travel delays, and in executing corresponding traffic control and scheduling strategies. If the driving path of a vehicle can be accurately predicted, an intelligent transportation system can provide the driver with personalized assistance or recommendations, such as dynamic re-routing, personalized risk assessment and prompts, and speed suggestions. Additionally, based on trajectory prediction, customized location-based advertisements can be targeted at those most likely to pass certain commercial stores. Nonetheless, due to the complexity of the road network, the dynamics of urban traffic, and the uncertainty of driving conditions, effective vehicle trajectory prediction faces many challenges.
Early methods towards this goal have made significant efforts on trajectory prediction and next location prediction based on Markov models [24][25][26], Recurrent Neural Network (RNN) models [27][28][29][30], and graph-based models [30][31][32]. Building on the great success of RNN models in sequence modeling, a handful of RNN-based location prediction or recommendation studies have been developed. LSTPM [28] uses a geo-dilated RNN to exploit the geographical relations among non-consecutive locations, conquering the limitation of RNNs in short-term user preference modeling. Li et al. [33] design a multi-layer LSTM encoder-decoder model that leverages a temporal attention mechanism to enhance the model's sequence learning ability. STKG [30] incorporates knowledge graph embedding and Graph Convolution Network (GCN) into RNN-based models for better capturing sequential transition patterns. In recent years, self-attention models have been extensively applied to temporal location modeling [34,35]. Recent state-of-the-art works [35][36][37] feed time interval, user, and location into self-attention based models to explicitly learn spatiotemporal dependencies in trajectories, and stack extra modules [28,37,38] to integrate contextual information.
Despite the effectiveness of existing trajectory prediction methods [5, 25, 30, 35-37, 39, 40], these works have three key limitations. First, RNN-based models can hardly capture the long-term dependencies in long trajectory sequences, as RNNs have a certain degree of forgetfulness when processing long-sequence data [41]. Second, most works do not adequately exploit the potential of pre-training techniques for trajectory prediction on the available large-scale trajectory data; recent spatiotemporal data mining works [37,39] show that pre-trained models effectively improve model performance. Third, most works do not consider strengthening trajectory prediction with the assistance of auxiliary tasks, for example, enhancing the model's ability to capture long-term dependencies through destination prediction, or enhancing its ability to learn user preferences through trajectory-user linking.
To tackle the aforementioned challenges, we develop a novel Pretrained-based Contrastive Learning Network for vehicle trajectory prediction, named PreCLN. To be more specific, we design a dual-view trajectory contrastive learning framework to achieve self-supervised pre-training. The two trajectory representations of an input trajectory learned from two different views (i.e., hierarchical map gridding and road network mapping) are regarded as the query and the positive sample, while trajectory representations learned from other trajectories in the same batch are treated as negative samples with respect to the query. To generate more trajectory samples from the original trajectory data, we employ three different trajectory data augmentation strategies for pre-training. To effectively capture the long-term spatiotemporal correlations in trajectories, we develop a Transformer-based trajectory encoder to embed the input trajectory sequences into fixed-length representations. Moreover, we leverage three auxiliary pre-training tasks, i.e., trajectory imputation, trajectory destination prediction, and trajectory-user linking, to assist the training of the trajectory encoder. After pre-training, the resulting trajectory encoder is used to generate trajectory representations for vehicle trajectory prediction. We conduct extensive experiments on two real-world large-scale trajectory datasets, and the experimental results demonstrate that our PreCLN model obtains significant performance improvements over state-of-the-art trajectory prediction techniques.
We summarize the contributions of this paper as follows:
• We propose PreCLN, a novel pre-trained spatiotemporal contrastive learning network, to make fully effective use of context-aware geographical information and the road network for vehicle trajectory prediction.
• We develop a dual-view trajectory contrastive learning framework that effectively captures long-term spatiotemporal dependencies in trajectories through a Transformer-based trajectory encoder with three auxiliary pre-training tasks.
• Extensive experiments on two real-world large-scale GPS trajectory datasets verify the superiority of our proposed model in vehicle trajectory prediction when competing with state-of-the-art baselines.

Related work
In this section, we mainly discuss related works, including next location prediction and trajectory prediction.

Next location prediction
RNN models are widely used for sequential modeling in most existing next location prediction works [30,31,42,43]. Liu et al. [42] propose the STRNN model, which feeds the temporal and spatial intervals between every two consecutive visits as additional auxiliary information into an RNN model to improve prediction performance. Yao et al. [43] use a recurrent model to jointly learn user temporal preference and semantic contexts for next location prediction. Feng et al. [44] develop the DeepMove model, which learns both long-term periodicities and short-term sequential regularities from correlated trajectories by combining an attention layer and a recurrent layer. Zhu et al. [45] propose the Time-LSTM model, which adds time gates into the LSTM network to capture the time factor and improve the spatiotemporal effect. To capture spatiotemporal dependencies, Li et al. [46] further add spatiotemporal gates to the LSTM structure to capture the spatio-temporal relationships between successive check-ins. Huang et al. [29] propose the ATST-LSTM model, which leverages an attention mechanism to learn weights for each check-in in an LSTM network. Sun et al. [28] develop the LSTPM model consisting of a nonlocal network for long-term preference modeling and a geo-dilated RNN for short-term preference learning. Rao et al. [30] propose Graph-Flashback, which incorporates the learned POI transition graph into RNN-based models for better capturing sequential transition patterns. Recently, Lian et al. [36] propose a self-attention based location recommendation model, GeoSAN, that considers point-to-point interaction within the trajectory. To aggregate all relevant visits from the user trajectory, Luo et al. [35] further propose a Spatio-Temporal Attention Network (STAN), which allows point-to-point interaction between non-adjacent locations and non-consecutive check-ins with explicit spatio-temporal effect. Nonetheless, previous location prediction approaches mainly capture correlations between locations from observed mobility data, but neglect to use pre-training techniques to learn non-trivial spatiotemporal dependencies between locations from large-scale trajectory data to improve model performance.

Trajectory prediction
In the early days, people generally used traditional mathematical models or machine learning methods to address the trajectory prediction problem, including Markov models, hidden Markov models, and Bayesian models. Asahara et al. [24] propose a hybrid Markov chain model for pedestrian trajectory prediction, which considers the relationship between the individual and the whole. Gambs et al. [26] use an extended Markov chain model to personalize the prediction of the subsequent location. Zhao et al. [47] use a Bayesian n-gram model to learn the spatiotemporal dependencies of location sequences. Mo et al. [25] develop a hidden Markov-based model to solve the problem of individual mobility prediction. However, these traditional methods cannot effectively simulate the complex dependencies of spatiotemporal trajectory data. Deep learning models have been proven to better model the spatiotemporal characteristics of trajectory sequences [10,11,48]. Among them, RNN-based models are representative and have achieved satisfactory performance on a variety of prediction tasks [49]. Park et al. [40] propose an LSTM-based encoder-decoder model to predict the next location of the vehicle trajectory on the map grid. Li et al. [33] develop a multi-layer LSTM encoder-decoder model in which a temporal attention mechanism is used to enhance the sequence learning ability for human mobility prediction. Capobianco et al. [50] leverage the attention mechanism to enhance a recurrent network model applied to vessel trajectory prediction. CNN-based models are also used for mobility sequence prediction and trajectory prediction [51,52]. Nonetheless, these RNN models (e.g., LSTM) can hardly capture long-term dependencies in trajectory data [41], and cannot run in parallel.
Given these flaws of RNN-based models, self-attention models have increasingly replaced them, offering better performance, fewer parameters, and parallel computation. Self-attention models were first applied to sequential recommendation [53,54] and are now widely used for location sequence prediction [35,36,55]. Lin et al. [37] propose CTLE, a pre-trained model that applies a Transformer encoder to calculate contextual embeddings for next location prediction. In follow-up work [39], they further propose the TALE pre-training method based on the CBOW framework, which is able to incorporate temporal information into the learned embedding vectors of locations. Fu and Lee [56] propose an RNN-based encoder-decoder trajectory representation learning framework, and then use three tasks (trajectory similarity measure, travel time prediction, and destination prediction) to leverage large-scale trajectory datasets for learning trajectory embeddings. Shao et al. [5] use an attention-fused dynamic GCN to obtain trajectory representations for trajectory prediction, and use two auxiliary tasks (i.e., arrival time estimation and ranking of similarity weights) to optimize the model. However, these models do not adequately exploit the potential of pre-training techniques and contrastive learning for trajectory prediction on large-scale trajectory data.

Problem definition
In this section, we first introduce the basic concepts used in this work, and then formally define the studied problem.
Definition 1 (Road Network) A road network is represented as a directed graph $G = (V, E)$, where $V$ is the collection of vertices, each representing an intersection, and $E$ is the collection of edges between vertices, i.e., the collection of road segments. Each edge $e_{i,j} \in E$ denotes the road segment from vertex $v_i$ to vertex $v_j$.
Definition 2 (Vehicle Trajectory) A trajectory of a vehicle $u_i$ is defined as a sequence of location points in chronological order $Tr_i = \langle (t_1, p_1), (t_2, p_2), \ldots, (t_n, p_n) \rangle$, where $n$ is the number of location points in $Tr_i$, and $p_j$ represents the geographic coordinates of the $j$-th location point, i.e., latitude and longitude.
We use T R i and T R to denote the set of all trajectories of vehicle u i and the set of all trajectories of all vehicles, respectively. Notice that each vehicle has a unique identification, denoted as u i , hereinafter also referred to as the user.
Problem (Vehicle Trajectory Prediction) Given all trajectories of all vehicles $TR$ and the road network $G$, our goal is to learn a model $f$ to predict the future trajectory over the next $\tau$ time steps for any given vehicle:
$$\widehat{Tr}_i^{(n+1):(n+\tau)} = f\big(Tr_i^{1:n}, TR, G\big)$$
The main notations used in this work are summarized in Table 1.

Methodology
In this section, we present the details of our proposed PreCLN, which consists of four major components: (1) a multi-modal embedding module that learns dense embedding vectors for user, time, and location information; (2) a dual-view contrastive learning network that uses Transformer-based encoders to generate dual-view representation vectors of the input trajectory for both the auxiliary pre-training tasks and the main task; (3) a multi-auxiliary task pre-training module that utilizes pre-training techniques with three auxiliary tasks to enhance the representation learning ability of the model and improve the performance of the main task; and (4) a trajectory prediction layer that uses an attention-based predictor to estimate future location sequences from the learned trajectory representations. The overall architecture of the proposed PreCLN is shown in Figure 1.

Multi-modal embedding module
This module contains two sub-modules: trajectory data augmentation and user and time aware location embedding layer.

Trajectory data augmentation
In contrastive learning models, data augmentation is an essential step, and positive samples are obtained through data augmentation. Nevertheless, in our model, the purpose of data augmentation is to obtain a larger amount of more diverse pre-training trajectory samples. Therefore, trajectory data augmentation not only needs to preserve the travel semantics of trajectories, but also needs to consider the differences between different samples.
Based on the above analysis, we explore the following three trajectory augmentation strategies (a code sketch of all three follows the list):
• Multi-hop Augmentation. Multi-hop augmentation samples a multi-hop sub-trajectory from the original trajectory according to the given number of hops. For example, when the hop count is set to 2, the 2-hop sub-trajectory $\langle (t_2, p_2), (t_4, p_4), \ldots, (t_{2\lfloor n/2 \rfloor}, p_{2\lfloor n/2 \rfloor}) \rangle$ is sampled from the original trajectory $Tr_i = \langle (t_1, p_1), (t_2, p_2), \ldots, (t_n, p_n) \rangle$.
• Missing Augmentation. Missing augmentation uses different mask matrices to hide some location points in trajectories, forming missing trajectories. When the missing ratio is set to 30%, we first generate a mask matrix in which each element takes the value 1 with 70% probability. Then, we obtain augmented trajectories according to the mask matrix, i.e., only the trajectory points corresponding to elements with a value of 1 are retained.
• Segmentation Augmentation. Segmentation augmentation truncates a long trajectory according to a certain proportion to obtain multiple sub-trajectory segments. Given a trajectory $Tr_i = \langle (t_1, p_1), (t_2, p_2), \ldots, (t_n, p_n) \rangle$, we can obtain contiguous sub-trajectory segments such as $Tr_i^1 = \langle (t_1, p_1), \ldots, (t_k, p_k) \rangle$ and $Tr_i^2 = \langle (t_{k+1}, p_{k+1}), \ldots, (t_n, p_n) \rangle$ through segmentation augmentation.
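To make these strategies concrete, the following minimal Python sketch implements all three on a trajectory stored as a list of (timestamp, (lat, lon)) tuples. The function names, the random-sampling details, and the single-point fallback in missing() are illustrative assumptions, not the paper's exact implementation.

```python
import random

# A trajectory is a chronologically ordered list of (timestamp, (lat, lon)) tuples.

def multi_hop(traj, hops=2):
    """Sample every `hops`-th point: for hops=2, keep (t2, p2), (t4, p4), ..."""
    return traj[hops - 1::hops]

def missing(traj, missing_ratio=0.3):
    """Hide points at random: each point survives with probability 1 - missing_ratio."""
    kept = [pt for pt in traj if random.random() > missing_ratio]
    return kept if kept else traj[:1]  # fallback: never return an empty trajectory

def segmentation(traj, num_segments=2):
    """Truncate a long trajectory into contiguous sub-trajectory segments."""
    k = max(1, len(traj) // num_segments)
    return [traj[i:i + k] for i in range(0, len(traj), k)]

traj = [(t, (41.15 + 0.001 * t, -8.61)) for t in range(1, 11)]
sub = multi_hop(traj, hops=2)   # 2-hop sub-trajectory
parts = segmentation(traj, 2)   # two contiguous segments
```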

User and time aware location embedding layer
This layer is used to encode user, time, and location and fuse them into dense embedding vectors.

User and Time Embedding
Many previous studies have shown that the user and temporal information of trajectories have an important impact on trajectory sequence prediction tasks [35,36]. User information reflects users' spatiotemporal preferences and is beneficial for personalized modeling; for example, some users have to commute between home and the workplace every day. Meanwhile, the time dimension contains the week, day, hour, minute, and second of each trajectory location, reflecting the periodicity of trajectory data. Therefore, we consider the user information in our model and learn the embedded user representation through:
$$e_{u_i} = W_u u_i + b_u$$
where $u_i$ is the one-hot encoding of the $i$-th user, and $W_u$ and $b_u$ are learnable parameters. Similarly, we first encode the time information into a time vector $t_j = (t_j^w, t_j^d, t_j^h, t_j^m, t_j^s)$, whose components are the week, day, hour, minute, and second, respectively, and then obtain the embedded time representation through:
$$e_{t_j} = W_t t_j + b_t$$
where $W_t$ and $b_t$ are learnable parameters.
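As a small illustration, the PyTorch sketch below realizes $e_{u_i} = W_u u_i + b_u$ and $e_{t_j} = W_t t_j + b_t$. Representing the one-hot user via an embedding lookup (bias omitted) and the five time components as a dense feature vector are implementation assumptions; the dimension 12 follows the experiment settings.

```python
import torch
import torch.nn as nn

class UserTimeEmbedding(nn.Module):
    """Learned user and time embeddings (a sketch of the two linear encoders)."""
    def __init__(self, num_users, emb_dim=12):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, emb_dim)  # row lookup == W_u over one-hot u_i
        self.time_proj = nn.Linear(5, emb_dim)            # W_t t_j + b_t

    def forward(self, user_ids, time_feats):
        # user_ids: (batch,); time_feats: (batch, seq_len, 5) = (week, day, hour, minute, second)
        return self.user_emb(user_ids), self.time_proj(time_feats)
```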
Location Embedding Location information is of great significance for trajectory representation learning, and effective location embedding helps obtain a superior trajectory representation as it captures spatial correlations effectively. The exact locations of vehicle trajectories are usually described by GPS coordinates, i.e., latitude and longitude. Although continuous, raw GPS coordinates are difficult to feed directly into deep learning models because they do not form a discrete vocabulary. On the other hand, vehicle trajectory data is constrained by the road network, and considering the road network information can better characterize the representation of the trajectory [57].
To this end, we present two location encoding methods that map the exact coordinates of a location to a unique id (a quadkey code sketch follows this list):
• Location encoding based on hierarchical map gridding: following [36], we first use a hierarchical map gridding method to map latitude and longitude into a grid. Specifically, each geographic area can be divided into four sub-areas by a cross, and the four sub-areas are encoded with 0, 1, 2, and 3 according to a fixed rule. Each sub-area can be recursively divided into four sub-areas and encoded again. Hence, any GPS coordinate can be mapped to a grid with a quadkey code. For example, if the gridding level is set to 3, the length of the quadkey code for each grid is 3, and the $i$-th digit indicates the grid code at the $i$-th level where the GPS coordinate is located.
• Location encoding based on road network mapping: we use the Fast Map Matching (FMM) algorithm [58] to map vehicle trajectories to the road network, and use the road network node sequence to represent each trajectory. Given a trajectory $Tr_i = \langle p_1, p_2, \ldots, p_n \rangle$, we can obtain a node sequence $Tr_i^R = \langle v_1, v_2, \ldots, v_m \rangle$ using FMM. To consider spatial correlation, we further encode road network nodes by aggregating each node with its connected neighbors, weighted by road segment length:
$$e_{v_i} = W_v \Big( v_i + \frac{1}{K} \sum_{j=1}^{K} \frac{v_j}{|e_{i,j}|} \Big)$$
where $|e_{i,j}|$ represents the length of road segment $e_{i,j}$, $K$ is the number of nodes connected to node $v_i$, and $v_i$ and $v_j$ are the one-hot vectors of nodes $v_i$ and $v_j$, respectively.
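For the gridding view, the sketch below maps a GPS coordinate to a quadkey using the standard Web-Mercator tiling scheme. Treating the paper's hierarchical map gridding as this quadkey construction is an assumption based on the description above; the latitude clamp and default level are illustrative.

```python
import math

def latlon_to_quadkey(lat, lon, level=20):
    """Map a GPS coordinate to a `level`-digit quadkey: each digit (0-3) selects one of
    the four sub-areas at that gridding level."""
    lat = max(min(lat, 85.05112878), -85.05112878)  # Web-Mercator latitude bounds
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    n = 1 << level                                   # number of grid cells per axis
    tx = min(n - 1, max(0, int(x * n)))
    ty = min(n - 1, max(0, int(y * n)))
    digits = []
    for i in range(level, 0, -1):                    # most significant level first
        mask = 1 << (i - 1)
        digits.append(str((1 if tx & mask else 0) + (2 if ty & mask else 0)))
    return "".join(digits)

print(latlon_to_quadkey(41.1579, -8.6291))  # 20-digit grid code for a point in Porto
```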

Embedding Fusion
To incorporate user and time information, we further use a fusion layer to obtain user- and time-aware location embeddings:
$$x_j = W_f^u e_{u_i} + W_f^t e_{t_j} + W_f^p e_{p_j} + b_f$$
where $e_{p_j}$ can be the location encoding based on either hierarchical map gridding or road network mapping, and $W_f^u$, $W_f^t$, $W_f^p$, and $b_f$ are learnable parameters.
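A minimal PyTorch sketch of this fusion layer under the reconstructed form above; the dimensions follow the experiment settings (12 for user/time, 512 for trajectory representations), and the bias placement is an assumption.

```python
import torch
import torch.nn as nn

class EmbeddingFusion(nn.Module):
    """Fuse user, time, and location embeddings into one sequence of input vectors."""
    def __init__(self, emb_dim=12, loc_dim=512, out_dim=512):
        super().__init__()
        self.w_u = nn.Linear(emb_dim, out_dim, bias=False)  # W_f^u
        self.w_t = nn.Linear(emb_dim, out_dim, bias=False)  # W_f^t
        self.w_p = nn.Linear(loc_dim, out_dim)              # W_f^p with bias b_f

    def forward(self, e_u, e_t, e_p):
        # e_u: (batch, emb_dim); e_t: (batch, seq_len, emb_dim); e_p: (batch, seq_len, loc_dim)
        return self.w_u(e_u).unsqueeze(1) + self.w_t(e_t) + self.w_p(e_p)
```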

Dual-view contrastive learning network
In this section, we propose a dual-view trajectory contrastive learning framework that leverages trajectory representations learned from two different views for contrastive learning across auxiliary tasks.

Dual-view trajectory contrastive learning framework
Inspired by [59], we design a dual-view trajectory contrastive learning framework. As shown in Figure 2, the framework aims to achieve self-supervised pre-training. For an input trajectory, two trajectory representations are learned from two different views (i.e., hierarchical map gridding and road network mapping), respectively, and then fed into the same task, which is expected to produce the same result (i.e., maximize the similarity of the outputs $O_i^g$ and $O_i^r$). That is, one view representation is regarded as the query, the other view representation is regarded as the positive sample, and the representations learned from other trajectories in the same batch are treated as negative samples with respect to the query.

Transformer-based trajectory encoder
In this section, we develop a Transformer-based trajectory encoder to capture the long-term dependencies in trajectory sequences. The Transformer-based trajectory encoder stacks multiple self-attention blocks, each consisting of a self-attention layer and a point-wise Feed-Forward Network (FFN). In particular, the self-attention layer takes the embedding matrix $X$ of the location sequence as input and feeds it into an attention module after transforming it through three weight matrices $W^Q$, $W^K$, and $W^V$:
$$S = selfatt(X + P) = att\big((X + P)W^Q, (X + P)W^K, (X + P)W^V\big) \quad (5)$$
where $P$ is the positional embedding matrix. The attention module $att(\cdot,\cdot,\cdot)$ is the scaled dot-product attention:
$$att(Q, K, V) = softmax\Big(\frac{QK^\top}{\sqrt{d}}\Big)V \quad (6)$$
where $d$ is the dimension of the representations. The FFN, a two-layer network, is used to non-linearize the output of the self-attention layer and transform dimensions:
$$FFN(S) = ReLU(SW_1 + b_1)W_2 + b_2 \quad (7)$$
where $W_1$, $W_2$, $b_1$, and $b_2$ are learnable parameters. Residual connections and layer normalization are applied in the FFN and self-attention layers to stabilize and speed up training when stacking multiple self-attention blocks. Notice that we use $H^g$ and $H^r$ to denote the trajectory representation matrices obtained from the hierarchical gridding and road network mapping location encodings, respectively.
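The sketch below mirrors (5)-(7) in PyTorch: each block applies single-head scaled dot-product self-attention and a point-wise FFN with residual connections and layer normalization, and the encoder stacks three blocks as in the experiment settings. Mean-pooling H into the fixed-length trajectory vector is an assumption; the text does not specify the pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    """One encoder block: self-attention (eqs. 5-6) + point-wise FFN (eq. 7)."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # W^Q
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # W^K
        self.w_v = nn.Linear(d_model, d_model, bias=False)  # W^V
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.scale = d_model ** 0.5

    def forward(self, h):
        q, k, v = self.w_q(h), self.w_k(h), self.w_v(h)
        s = F.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1) @ v  # scaled dot-product
        h = self.ln1(h + s)                                              # residual + layer norm
        return self.ln2(h + self.ffn(h))

class TrajectoryEncoder(nn.Module):
    """Stack of blocks over X + P, returning per-step matrix H and a pooled vector."""
    def __init__(self, d_model=512, depth=3, max_len=512):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))  # positional matrix P
        self.blocks = nn.ModuleList(SelfAttentionBlock(d_model) for _ in range(depth))

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        h = x + self.pos[: x.size(1)]
        for block in self.blocks:
            h = block(h)
        return h, h.mean(dim=1)              # H and the fixed-length representation
```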

Multi-auxiliary task pre-training
In this section, we present three auxiliary pre-training tasks employed in our model, i.e., trajectory imputation, trajectory destination prediction, and trajectory-user linking.

Trajectory imputation task
The trajectory imputation task predicts randomly hidden location points from the trajectory context semantics; it has a strong correlation with the main trajectory prediction task and significantly enlarges the training sample space. Specifically, we first generate a random mask matrix $M$ for the input trajectories according to a certain proportion, and then predict the missing location points from the learned trajectory representations.
The hiding operation using the mask matrix is:
$$\widetilde{TR} = TR \odot M$$
where $\odot$ is the Hadamard product, $\widetilde{TR}$ is the input of the task, and $TR$ is the prediction target.
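A one-function sketch of this masking step, assuming trajectories are batched as dense tensors; sampling M element-wise from a Bernoulli distribution is our reading of "a certain proportion".

```python
import torch

def apply_random_mask(tr, missing_ratio=0.3):
    """Hide location points via a Bernoulli mask M and the Hadamard product TR ⊙ M."""
    # tr: (batch, seq_len, feat); an element of M is 1 (kept) with prob. 1 - missing_ratio
    m = (torch.rand(tr.shape[:2]) > missing_ratio).float()
    return tr * m.unsqueeze(-1), m  # masked task input, plus M to locate prediction targets
```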
In this task, we employ a two-layer fully connected network to map the learned trajectory representations to the imputed trajectory representations:
$$H^g_{imp} = FC(H^g), \qquad H^r_{imp} = FC(H^r)$$
where $H^g$ and $H^r$ are the trajectory representations learned from the two views, and $H^g_{imp}$ and $H^r_{imp}$ are the two corresponding imputed trajectory representations.

Trajectory destination prediction task
To enhance the ability of long-term trajectory sequence prediction, we introduce a trajectory destination prediction pre-training task. The historical trajectory data is used as the model input, and the last location point of each trajectory is used as the destination. The grid code of the destination location is obtained through a two-layer FFN:
$$Z^g_{des} = FFN(H^g), \qquad Z^r_{des} = FFN(H^r)$$
where $Z^g_{des}$ and $Z^r_{des}$ are the two destination grid codes learned by the model.

Trajectory-user linking task
The trajectory-user linking task enables the model to identify the correlation between trajectories and their users, which helps our model capture user trajectory preferences in a personalized manner. All trajectory data is used as model input, and the users of the trajectories are used as training targets. We obtain the probability that each trajectory belongs to each user through a classifier:
$$U^g_{tul} = softmax(W_{tul} H^g + b_{tul}), \qquad U^r_{tul} = softmax(W_{tul} H^r + b_{tul})$$
where $U^g_{tul}$ and $U^r_{tul}$ are the probability matrices over users.
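The three task heads can be sketched together as below; the hidden sizes, the ReLU non-linearity, and the vocabulary sizes (grid codes, users) are illustrative assumptions, with each view's representations passed through the same heads to produce the imputed representations, destination codes, and user probabilities.

```python
import torch
import torch.nn as nn

class AuxiliaryHeads(nn.Module):
    """Sketch of the three auxiliary pre-training heads."""
    def __init__(self, d_model=512, num_grids=100_000, num_users=442):
        super().__init__()
        # trajectory imputation: two fully connected layers over per-step representations
        self.imp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))
        # destination prediction: two-layer FFN producing destination grid-code logits
        self.des = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, num_grids))
        # trajectory-user linking: classifier over users
        self.tul = nn.Linear(d_model, num_users)

    def forward(self, h_steps, h_traj):
        # h_steps: (batch, seq_len, d) per-step representations; h_traj: (batch, d) pooled
        h_imp = self.imp(h_steps)                        # imputed representations
        z_des = self.des(h_traj)                         # destination grid logits
        u_tul = torch.softmax(self.tul(h_traj), dim=-1)  # per-user probabilities
        return h_imp, z_des, u_tul
```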

Trajectory prediction layer
To consider the user and temporal information of the target trajectory when predicting the future trajectory, we design an attention-based predictor:
$$Z = att\big(E_{u,t} W^P, H, H\big)$$
where $E_{u,t}$ denotes the fused user and time embeddings of the target trajectory, $att(\cdot,\cdot,\cdot)$ is defined as in (6), and $W^P$ is the learnable weight matrix.
Finally, we use a linear layer to map $Z$ to the future location sequence:
$$\hat{Y}^{(n+1):(n+\tau)} = Z W_l + b_l$$
where $\tau$ is the number of future time steps, and $W_l$ and $b_l$ are learnable parameters. Notice that we can realize multiple transformations from the trajectory representation to the future location sequence by stacking multiple $att(\cdot,\cdot,\cdot)$ layers, yielding a more robust trajectory predictor.
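Because the text leaves the exact wiring of the predictor open, the sketch below makes one plausible choice: the fused user/time embedding acts as the attention query over the encoded trajectory H via the scaled dot-product of (6), and the linear output layer emits τ coordinate pairs. Both the query construction and the coordinate-pair output format are assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):
    """Hypothetical attention-based predictor over the encoded trajectory."""
    def __init__(self, d_model=512, tau=5):
        super().__init__()
        self.w_p = nn.Linear(d_model, d_model, bias=False)  # W^P
        self.out = nn.Linear(d_model, tau * 2)              # W_l, b_l: (lat, lon) per step
        self.tau = tau

    def forward(self, query, h):
        # query: (batch, 1, d) fused user/time embedding; h: (batch, seq_len, d)
        attn = torch.softmax(query @ self.w_p(h).transpose(-2, -1) / h.size(-1) ** 0.5, dim=-1)
        z = (attn @ h).squeeze(1)                           # trajectory context Z
        return self.out(z).view(-1, self.tau, 2)            # future location sequence
```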

Loss function
Trajectory prediction loss The final output of the prediction decoder is the location-level trajectory prediction matrix $\hat{Y}^{(n+1):(n+\tau)} = [\hat{y}_{n+1}, \hat{y}_{n+2}, \ldots, \hat{y}_{n+\tau}]$. We use the mean absolute error (MAE) between the predicted trajectory sequence and the real trajectory sequence as the loss function:
$$L_{tp} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\tau} \sum_{j=n+1}^{n+\tau} \big| \hat{y}_j^i - y_j^i \big|$$
where $N$ denotes the number of trajectories in the training set.

Contrastive learning loss
As described in Section 4.2.1, we treat the two different view representations of each trajectory as the query and the positive sample, and the representations of other trajectories in the same batch as negative samples. These trajectory representations are then fed into the same task. To maximize the similarity between the task outputs of the query and the positive sample, and the dissimilarity between the task outputs of the query and the negative samples, we employ a normalized temperature-scaled cross-entropy loss:
$$\ell_i = -\log \frac{\exp\big(sim(O_i^g, O_i^r)/T\big)}{\sum_{k \in B} \exp\big(sim(O_i^g, O_k^r)/T\big)}$$
where $O$ represents the task output, $sim(\cdot,\cdot)$ is the similarity of two output results, $B$ denotes the set of all samples in the same batch, and $T$ denotes the temperature parameter. The contrastive learning loss is calculated by accumulating the losses of all samples across all batches:
$$L_{cl} = \sum_{i \in B} \ell_i$$
Auxiliary task loss Furthermore, we also need an auxiliary task loss function to complete the auxiliary pre-training and guide the learning of the model. Let $\hat{A}$ denote the auxiliary task prediction result and $A$ denote the ground-truth result. The auxiliary task loss function is formulated as:
$$L_{aux} = \frac{1}{N} \sum_{i=1}^{N} \big\| \hat{A}_i - A_i \big\|$$
Eventually, we integrate $L_{tp}$, $L_{cl}$, and $L_{aux}$ into a joint learning framework through hyperparameters $\alpha$ and $\beta$:
$$L = L_{tp} + \alpha L_{cl} + \beta L_{aux} + \lambda \|\Theta\|_2^2$$
where $\lambda$ represents the hyperparameter for regularization, and $\Theta$ denotes the model parameters.
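A compact PyTorch sketch of the contrastive and joint losses; using cosine similarity for sim(·,·), the temperature value, and taking only road-view outputs as in-batch candidates are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(o_g, o_r, temperature=0.1):
    """NT-Xent over a batch: grid-view outputs are queries, road-view outputs of the same
    trajectory are positives, and the other road-view outputs in the batch are negatives."""
    o_g = F.normalize(o_g, dim=-1)            # cosine similarity as sim(., .)
    o_r = F.normalize(o_r, dim=-1)
    logits = o_g @ o_r.t() / temperature      # logits[i, k] = sim(O_i^g, O_k^r) / T
    labels = torch.arange(o_g.size(0))        # the positive of query i sits at column i
    return F.cross_entropy(logits, labels)

def joint_loss(l_tp, l_cl, l_aux, params, alpha=1.0, beta=1.0, lam=1e-5):
    """Joint objective L = L_tp + alpha * L_cl + beta * L_aux + lambda * ||Theta||^2."""
    reg = sum(p.pow(2).sum() for p in params)
    return l_tp + alpha * l_cl + beta * l_aux + lam * reg
```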

Datasets
We use two publicly available real-world large-scale vehicle trajectory datasets to evaluate our model: Porto Taxi Trajectory Data and T-Drive Trajectory Data.
• Porto Taxi Trajectory Data [60]. This dataset is from the ECML-PKDD competition, containing over 1.7 million complete trajectories collected from 442 taxis running in the city of Porto from 1st July 2013 to 30th June 2014. The sampling frequency is once every 15 seconds.
• T-Drive Trajectory Data [61][62][63]. This dataset is a sample of the T-Drive trajectory dataset that contains one-week trajectories of 10,357 taxis from February 2 to February 8, 2008. The total number of points in this dataset is about 17 million, and the total distance of the trajectories reaches 9 million kilometers.
The basic statistics of the two datasets are summarized in Table 2.

Baseline methods
We compare our method with the following state-of-the-art baselines:
• ST-LSTM [4]: This is a spatial-temporal LSTM model for next location prediction.
• DeepMove [44]: This is an attention-based RNN model, which captures temporal transitions through an RNN and captures multi-level periodic information through an attention model.
• STAN [35]: This is an attention-based model, which explicitly uses relative spatiotemporal information between POIs within the user trajectory.
• CTLE [37]: This is a contextual location embedding method built upon a bidirectional Transformer framework.
• TALE [39]: This is a time-aware location embedding method that incorporates temporal information through a novel temporal tree structure for hierarchical softmax calculation.
• Graph-Flashback [30]: This method uses a GCN on the learned POI transition graph to learn POI representations, and then feeds them into RNN-based models for location prediction.

Evaluation metrics
In experiments, we use Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) to evaluate the effectiveness of models, which are defined as:
$$MAE = \frac{1}{N'} \sum_{i=1}^{N'} dist(\hat{y}_i, y_i), \qquad RMSE = \sqrt{\frac{1}{N'} \sum_{i=1}^{N'} dist(\hat{y}_i, y_i)^2}$$
where $N'$ denotes the number of trajectories in the test set, and $dist(\cdot,\cdot)$ represents the distance between two latitude-longitude location points.
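A small sketch of these metrics, assuming dist(·,·) is the haversine great-circle distance between coordinate pairs (the text does not name the distance function).

```python
import math

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

def mae_rmse(pred, true):
    """MAE and RMSE over per-point distances between predicted and true trajectories."""
    d = [haversine_m(p, q) for tp, tt in zip(pred, true) for p, q in zip(tp, tt)]
    mae = sum(d) / len(d)
    rmse = (sum(x * x for x in d) / len(d)) ** 0.5
    return mae, rmse
```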

Experiment settings
In experiments, we randomly select 70%, 20%, and 10% of the data in each dataset as training, testing, and validation sets, respectively.
For baselines, we use the source code released by their authors, adopt the parameter settings recommended in their papers, and fine-tune them to be optimal. For our model, we adopt Adam with the default parameter setting to optimize our objective functions. The depth of self-attention is set to 3 on both datasets. We set the grid size to 80 meters (i.e., the gridding level is 20), the learning rate to 0.001, and the batch size to 64. The embedding dimensions of user and time are both set to 12, and the representation dimension of trajectories (i.e., H) is set to 512. The blocking ratio for the mask matrix is set to 30%. The multi-hop augmentation is set to two hops, the sampling ratio of the missing (down-sampling) augmentation is set to 30%, and the segmentation augmentation adopts random truncation. The number of training epochs is set to 100 and the dropout rate to 0.2; an early stopping mechanism is used to avoid overfitting. All models are trained on a Linux server with a 2.60GHz Intel(R) Xeon(R) Gold 6126 CPU and 2 × 16GB NVIDIA Tesla V100 GPUs.

Overall performance
The comparison of PreCLN with all baselines on two real-world datasets is shown in Table 3, where the best is shown in bold, and the second best is underlined.
From the evaluation results, we observe that our PreCLN framework achieves the best prediction results compared to the state-of-the-art baselines. In particular, the relative performance improvement of PreCLN over the best-performing baseline Graph-Flashback is 6.23% and 8.58% in terms of MAE and RMSE on the T-Drive dataset. Although Graph-Flashback incorporates knowledge graph embedding and GCN to learn location representations and transition patterns between locations, it relies solely on an RNN model to learn temporal dependencies. Our PreCLN not only uses a Transformer-based trajectory encoder to efficiently learn the mid- and long-term spatiotemporal dependencies in trajectories, but also utilizes contrastive learning to enhance the representation learning ability of the model. In addition, PreCLN significantly outperforms the pre-trained baseline TALE, with average improvements of 14.91% and 52.36% in terms of MAE and RMSE on the two datasets, respectively. The main reason is that our model combines dual-view contrastive learning and three auxiliary pre-training tasks to effectively capture the spatiotemporal dependencies in trajectories that contribute to trajectory prediction, while TALE only uses pre-training techniques to enhance location representations that incorporate contextual semantics such as user and time. Overall, PreCLN's consistent superiority over all competing methods validates the effectiveness of our pre-trained model with the integration of the dual-view contrastive learning framework and the auxiliary pre-training tasks.

Ablation study
To verify the contribution of each component of PreCLN, we conduct an ablation study on the model components with five carefully designed variations:
• TE - This variation removes the time encoder module and directly uses one-hot encoding to represent time.
• UE - This variation removes the user encoder module and directly uses one-hot encoding to represent users.
• TUE - This variation removes both the time and user encoder modules.
• HMP - This variation only uses the hierarchical map gridding view for trajectory representation learning (single-view).
• RNM - This variation only uses the road network mapping view for trajectory representation learning (single-view).
The performance of all variations on the Porto and T-Drive datasets is shown in Figure 3. As we can see, all variants perform significantly worse than PreCLN, demonstrating the effectiveness of each component of our model.
Variants TE, UE, and TUE perform worse than PreCLN, and TUE further performs worse than TE and UE, which indicates that the user and time aware location embedding layer has a positive impact. User embedding takes user information into account and therefore incorporates personalized preferences; time embedding encodes the time information, capturing temporal periodicities. RNM and HMP concern the structural settings of the model. From the comparison between RNM and HMP, trajectory representation learning based on hierarchical map gridding performs better than that based on road network mapping. The reason is that the hierarchical map gridding view provides fine-grained trajectory representation learning that contains more accurate spatiotemporal information, while the road network mapping view provides coarse-grained trajectory representation learning that only considers high-level road route information. Nonetheless, both single-view variants (i.e., HMP and RNM) are significantly worse than our PreCLN, further validating the rationality and superiority of our dual-view contrastive representation learning framework.
To verify the effect of each auxiliary task used in PreCLN, we further conduct an ablation study on the auxiliary tasks, comparing our model with eight carefully designed variations. Apart from the changed part(s), all variations share the same framework structure and parameter settings, including:
• IMP - This variation only considers trajectory imputation as the auxiliary task during pre-training.
• DP - This variation only considers trajectory destination prediction as the auxiliary task during pre-training.
• TUL - This variation only considers trajectory-user linking as the auxiliary task during pre-training.
• TUL+IMP, TUL+DP, and IMP+DP - These variations consider the corresponding pairwise combinations of two auxiliary tasks during pre-training.
The performance of all variations on the two trajectory datasets is shown in Figure 4. As we can see, all variants perform significantly worse than PreCLN, which fully validates the effectiveness of the auxiliary pre-training tasks used in our model in assisting trajectory prediction.
The variations with two tasks are generally better than those with one task (i.e., TUL+IMP/TUL+DP/IMP+DP vs. TUL/IMP/DP), which verifies the effectiveness of pairwise combinations of the three pre-training tasks. More specifically, comparing the three single-task variations, both IMP and DP are significantly better than TUL on both metrics on the two datasets, which reflects that the trajectory imputation and destination prediction tasks assist trajectory prediction better than the trajectory-user linking task. That is, for trajectory prediction, enhancing short- and long-term spatiotemporal dependency learning matters more than enhancing user preference learning. This may be because user information is already effectively fused through the multi-modal embedding layer in trajectory representation learning, and it is also verified by IMP+DP being significantly better than the TUL+IMP and TUL+DP variations. The full PreCLN model with all three auxiliary tasks further improves trajectory prediction performance, indicating that the three pre-training tasks promote each other and strengthen the model's ability to learn effective trajectory representations from different aspects.

Parameter study
We finally investigate the sensitivity of our PreCLN with respect to the important parameters, including grid size and coefficients α and β. We report MAE and RMSE on two datasets with different parameter settings in Figures 5, 6 and 7.
As shown in Figure 5, the prediction error of PreCLN first decreases as the grid size increases, and then begins to increase once the grid size exceeds 80 meters. This is mainly because setting the grid size too large loses spatial information of trajectory locations, since many GPS points share one grid code. If the grid size is set too small, a large number of grids are generated, which increases the grid code length and weakens the spatial correlation between adjacent GPS points, in turn leading to poor prediction performance.
We further study the coefficients α and β, which balance the weights between the main task and the contrastive and auxiliary pre-training objectives. As we can observe, the prediction error first drops and then rises as α and β increase on both datasets. Moreover, our model achieves relatively low prediction errors when α and β fall into a certain range, whereas the performance of PreCLN degrades when α and β approach 0 or 2. This illustrates that trajectory prediction performance can be significantly improved by considering the contrastive learning loss and/or the auxiliary task loss, but placing too much weight on contrastive pre-training also hurts the trajectory prediction performance of the model. Additionally, compared with the auxiliary task loss (i.e., β), the contrastive learning loss (i.e., α) reduces the model prediction error faster in both MAE and RMSE, reflecting that contrastive learning integrates the high-level travel semantics of trajectories with the fine-grained gridding, which effectively enhances the learning ability of trajectory representation.

Effect of Augmentation Strategy
We now evaluate the effect of the employed three trajectory data augmentation strategies. We report the experimental results in Table 4.
Based on the results on the two datasets in Table 4, we have two observations: (1) All trajectory data augmentation strategies help improve model prediction performance. (2) The missing (down-sampling) augmentation performs better than the other two augmentation methods. The reason may be that it hides location points by random sampling, so more diverse sub-trajectory samples can be obtained, which leads to better results.

Effect of Trajectory Encoder Selection
We also evaluate the selection of different views of trajectory encoder. The experimental results on two datasets are shown in Table 5. We can conclude that the trajectory prediction performance of the trajectory encoder based on hierarchical map gridding is significantly better than that of the trajectory encoder based on road network mapping. This may be because the location encoding based on hierarchical map gridding uses more fine-grained spatial information, while the location encoding based on road network mapping only preserves the high-level transition patterns of trajectories. In addition, the goal of the trajectory prediction task is to predict fine-grained location points, and it is more reasonable to choose the trajectory encoder based on fine-grained location encoding view, i.e., hierarchical map gridding.

Conclusion
In this work, we propose a novel Pretrained-based Contrastive Learning Network, PreCLN, for vehicle trajectory prediction. It effectively captures the complex spatio-temporal dependencies in vehicle trajectories through the designed dual-view trajectory contrastive learning framework with the assistance of three auxiliary pre-training tasks.
The experimental results on two real-life large-scale vehicle trajectory datasets demonstrate the effectiveness and superiority of our proposed model for trajectory prediction.