Durable queries over non-synchronized temporal data

Temporal data are ubiquitous nowadays, and efficient management of temporal data is of key importance. A temporal data typically describes the evolution of an object over time. One of the most useful queries over temporal data is the durable top-k query. Given a time window, a durable top-k query finds the objects that are frequently among the best. Existing solutions to durable top-k queries assume that all temporal data are sampled at the same time points (i.e., at any time, there is a corresponding observed value for every temporal data). However, in many practical applications, temporal data are collected from multiple data sources with different sampling rates. In this light, we investigate the efficient processing of durable top-k queries over temporal data with different sampling rates. We propose an efficient sweep line algorithm to process durable top-k queries over non-synchronized temporal data. We conduct extensive experiments on two real datasets to test the performance of our proposed method. The results show that our method outperforms the baseline solutions by a large margin.


Introduction
Temporal data are ubiquitous nowadays. They can be found in many areas such as health [1,2], energy [3,4], traffic [5,6], environment [7,8], etc. In fact, with the fast development and wide deployment of data collection devices, many practical data are temporal in nature. Generally speaking, a temporal data describes the evolution of some data object over time [9]. In many practical applications, a data object has a value at any time during its evolution. For example, a stock may have a closing price every day; a power generator may run at a varying load depending on the demand for power; a traffic sensor by the road may record a traffic flow every hour. In such applications, users often show particular interest in the top data objects (i.e., the data objects with the best scores) within a time period.
In fields such as data management and information retrieval, there have been extensive studies on querying temporal data. The queries can be roughly classified into two categories: point-wise queries and period-wise queries. Point-wise queries measure a data object at each time point. For example, Lee et al. [10] studied consistent top-k (CTop-k) queries over temporal data, aiming at finding all data objects whose scores are always among the k highest ones at each time point. In contrast, period-wise queries focus on some aggregated measurement (e.g., average score, total score, etc.) of a data object over a time period. For example, Jestes et al. studied aggregate top-k queries on temporal data [11], aiming at finding the k data objects with the highest aggregation scores.
Durable top-k (DTop-k) queries are a relatively novel type of query over temporal data. Similar to consistent top-k queries, a durable top-k query is particularly interested in data objects with the best k scores at each time point. Nonetheless, unlike the case with CTop-k queries, for a data object to be a DTop-k result, its score need not remain among the top k at every time point. Instead, within a given time period, any data object that becomes one of the top k for sufficiently many time points is interesting enough to a DTop-k query. In this sense, DTop-k queries can be viewed as an extension of CTop-k queries.
U et al. [12] first proposed and studied durable top-k queries on document archives. A document in an archive may have different versions over time. For example, a Wikipedia document may be edited by different users from all over the world. Specifically, U et al. wanted to identify the documents matching some given keywords for sufficiently many time points. Wang et al. [13] studied DTop-k queries over time series data. They argue that, compared to archived documents (which are piecewise constant), time series are more dynamic, and thus dedicated techniques are required for efficient processing of DTop-k queries.
A key assumption in [13] is that the time series are synchronized, i.e., each time series has an observation/score at every time point; hence, the top-k objects at every time point can be precomputed and organized in certain index structures. However, in many practical applications, massive temporal data are collected from multiple data sources, and different data sources may use different sampling rates [14,15]. For example, cars with different GPS devices or speedometers may generate speed data at different frequencies using different mechanisms. Therefore, it is very unlikely that such temporal data could be synchronized. Instead, practical temporal data are more likely to be non-synchronized [16].
In this paper, we investigate efficient processing of DTop-k queries over non-synchronized temporal data. The main challenge arising from non-synchronicity is that time is continuous instead of discrete. Existing index structures (e.g., [13]) for synchronized, thus discrete, data are hardly applicable. Therefore, we propose a novel sweep line algorithm (SLA) to efficiently deal with the non-synchronicity. The key insight of SLA is the property of intersections: When two temporal data intersect, they change their relative order, and vice versa. In light of this property, SLA answers DTop-k queries by tracking snapshot top-k objects. To sum up, we make the following contributions:
1. To the best of our knowledge, we are the first to investigate the problem of efficient DTop-k processing over non-synchronized temporal data.
2. We propose a novel sweep line algorithm (SLA) to efficiently answer DTop-k queries.
3. We conduct extensive experiments using two real datasets. The results show that SLA outperforms its competitors by a large margin.
The rest of the paper is organized as follows. Section 2 introduces some preliminary knowledge (such as concepts and existing methods). Section 3 presents the solutions to DTop-k queries, including a straightforward solution (Section 3.1) and our main proposal SLA (Section 3.2). Then, Section 4 shows the experimental results. After that, Section 5 summarizes research work closely related to ours. Finally, Section 6 concludes the paper.

Concepts and problem definition
Definition 1 (Temporal data) A temporal data describes the evolution of a data object over time. Formally, the j-th data object in a temporal database D can be represented as a sequence o_j = ⟨(t_j,1, v_j,1), (t_j,2, v_j,2), · · · , (t_j,T_j, v_j,T_j)⟩, where v_j,ℓ is the score of object o_j at time t_j,ℓ (ℓ = 1, 2, · · · , T_j) and T_j is the recorded length of the temporal sequence of object o_j.
Note that Definition 1 offers a general representation that naturally allows asynchronicity in D. Indeed, given two data objects in the temporal database o_i, o_j ∈ D, it is allowed that the time points t_i,1, t_i,2, · · · , t_i,T_i and t_j,1, t_j,2, · · · , t_j,T_j are not aligned (and even that T_i ≠ T_j).

Definition 2 (Snapshot Top-k Object) Given a data object o ∈ D and a time t, o is a snapshot top-k object at time t if its score o(t) is among the k highest scores in D at time t. We denote by Top^k_t(D) the set of all snapshot top-k objects at time t.
A clear implication in Definition 2 is that, given T discrete samples in time, we also view the temporal data o = ⟨(t_1, v_1), (t_2, v_2), · · · , (t_T, v_T)⟩ as a concatenation of T − 1 line segments. Thus o(t), the score of object o at time t, is a linear interpolation on the line segment within [t_ℓ, t_{ℓ+1}], with the endpoint cases o(t_ℓ) = v_ℓ and o(t_{ℓ+1}) = v_{ℓ+1}. This view is closely related to the piecewise linear representation of temporal data [17,18], which may lose accuracy because actual temporal data are usually smoother. Nonetheless, in this work, we do not replace or remove original data points to obtain any approximation of the temporal data. We leave the accuracy issue to the data generation mechanism (which is beyond the scope of this work) and assume that any data o ∈ D, when viewed as a concatenation of line segments, is accurate enough for any potential application. Hence, in this work, we alternatively represent a temporal data o = ⟨(t_1, v_1), (t_2, v_2), · · · , (t_T, v_T)⟩ as o = ⟨s_1, s_2, · · · , s_{T−1}⟩, where s_i is the line segment with endpoints (t_i, v_i) and (t_{i+1}, v_{i+1}).
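Concretely, under this piecewise-linear view, o(t) can be evaluated by locating the segment that contains t and interpolating. The following is a minimal Python sketch; the function name score_at and the list-of-pairs representation are our own illustrative choices, not notation from this paper:

```python
from bisect import bisect_right

def score_at(obj, t):
    """Score o(t) of a temporal object at time t by linear interpolation.

    obj is a list of (time, value) samples sorted by time; t must lie
    within [obj[0][0], obj[-1][0]].
    """
    times = [p[0] for p in obj]
    i = bisect_right(times, t) - 1          # index of the segment containing t
    if i == len(obj) - 1:                   # t equals the last sample time
        return obj[-1][1]
    (t0, v0), (t1, v1) = obj[i], obj[i + 1]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
```

Note that evaluation at a sample time t_ℓ returns v_ℓ exactly, matching the endpoint cases above.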
Given a data object o ∈ D, define the durability of o ∈ D with respect to W and k as dur(o) = |{t ∈ W : o ∈ Top^k_t(D)}| / |W|, where | · | denotes the total length of a set of time intervals. A data object o is then a DTop-k result if and only if dur(o) ≥ γ, the durability threshold of the query. Table 1 summarizes the notation frequently used in the rest of the paper.
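As a sketch of how dur(o) could be computed once the stretches of time during which o is a snapshot top-k object are known; the interval-list representation used here is our own assumption for illustration:

```python
def durability(topk_intervals, window):
    """Durability of an object: the fraction of the query window W during
    which the object is a snapshot top-k object.

    topk_intervals: disjoint (start, end) intervals where the object is top-k.
    window: (t_begin, t_end) of W.
    """
    t_begin, t_end = window
    covered = 0.0
    for s, e in topk_intervals:
        s, e = max(s, t_begin), min(e, t_end)   # clip the interval to W
        if s < e:
            covered += e - s
    return covered / (t_end - t_begin)
```

An object is then a result when this fraction is at least the threshold γ.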

The essence of intersections
Previous studies have discovered the relationship between intersections of line segments and snapshot top-k objects (see, e.g., [13]). We summarize the main result as follows.
Lemma 1 (Intersection lemma) Given the temporal database D and any k, the snapshot top-k set Top^k_t(D) changes at time t only if two line segments in D intersect at time t.
We omit the proof of the intersection lemma, as it is quite intuitive: for a non-top-k object o to become a top-k object, it must at least go up to beat the k-th best object o_k. In fact, the converse of Lemma 1 is also true: for any intersection (v, t) of two line segments, there must exist a particular k such that the snapshot top-k set Top^k_t(D) changes at time t. Therefore, intersections of line segments play an important role in DTop-k query processing.

A straightforward method
In light of the intersection lemma (Lemma 1), in order to process a DTop-k query q = ⟨W, k, γ⟩ over non-synchronized temporal data, it suffices to compute the snapshot top-k sets at every intersection within the interval W. With all the snapshot top-k sets computed, it is relatively trivial to compute dur(o) for each candidate data object o.
Algorithm 1 summarizes the above idea. Algorithm 1 first finds the set of all intersections within W (Lines 1-2), P = {p_1, p_2, · · · , p_M}. Note that the size of P (i.e., the number M) might be quadratic in the total number of line segments within W. Algorithm 1 then finds the snapshot top-k set S_i = Top^k_{p_i.t}(D) at every intersection (Lines 3-4). After obtaining all the snapshot top-k sets, the rest of the algorithm becomes trivial. It uses a counter c to maintain the state of each object o. The counting procedure contains two cases. It remains to be clarified how to obtain the intersections (Line 1) and the snapshot top-k sets (Line 4).
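A hedged Python sketch of this straightforward method follows. Here snapshot_topk is a placeholder for any snapshot top-k routine (such as an index-based one), and evaluating it at sub-interval midpoints is our own simplification to sidestep ties exactly at intersection times:

```python
def straightforward_dtopk(D, intersections, window, k, gamma, snapshot_topk):
    """Sketch of the straightforward method: between two consecutive
    intersections the snapshot top-k set cannot change, so one snapshot
    query per sub-interval suffices.

    intersections: intersection times; window: (t_begin, t_end) of W;
    snapshot_topk(D, t, k): returns the set of top-k object ids at time t.
    """
    t_begin, t_end = window
    times = sorted({t_begin, t_end,
                    *(t for t in intersections if t_begin < t < t_end)})
    counter = {}                              # object id -> top-k time within W
    for t0, t1 in zip(times, times[1:]):
        mid = (t0 + t1) / 2                   # interior point avoids ties at t0
        for obj in snapshot_topk(D, mid, k):
            counter[obj] = counter.get(obj, 0.0) + (t1 - t0)
    threshold = gamma * (t_end - t_begin)
    return {obj for obj, c in counter.items() if c >= threshold}
```

The quadratic number of snapshot queries in the worst case is exactly the inefficiency analyzed later.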

Finding the intersections
Given a set of line segments, finding all the intersections is a fundamental task in computational geometry. It can be efficiently solved via a sweep line algorithm [19]. Specifically, let D = {o_1, o_2, · · · , o_N} be the temporal database and let S be the set of all line segments of the objects in D.
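For illustration, the crossing time of one pair of score segments can be computed as follows (a brute-force pairwise application of this helper is quadratic; the sweep line algorithm [19] avoids checking most non-adjacent pairs). The helper below is our own sketch and exploits the fact that temporal data are functions of time:

```python
def segment_crossing(seg_a, seg_b):
    """Time at which two score segments cross, or None.

    Each segment is ((t0, v0), (t1, v1)) with t0 < t1.  Two segments
    intersect where their linear score functions coincide within the
    overlap of their time spans.
    """
    (a0, va0), (a1, va1) = seg_a
    (b0, vb0), (b1, vb1) = seg_b
    lo, hi = max(a0, b0), min(a1, b1)      # overlapping time span
    if lo >= hi:
        return None                        # no common time span
    sa = (va1 - va0) / (a1 - a0)           # slope of segment a
    sb = (vb1 - vb0) / (b1 - b0)           # slope of segment b
    if sa == sb:
        return None                        # parallel: no unique crossing
    # solve va0 + sa*(t - a0) = vb0 + sb*(t - b0) for t
    t = (vb0 - sb * b0 - va0 + sa * a0) / (sa - sb)
    return t if lo <= t <= hi else None
```

The sweep line algorithm applies such a test only to segments that become adjacent in the status structure.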

Finding the snapshot Top-k
Efficient solutions to snapshot top-k queries have been studied in the literature. Li et al. [20] proposed an SEB-tree index to support snapshot top-k queries (which they term "top-k(t) queries"). The SEB-tree is a randomized index structure with deterministic correctness guarantees. Specifically, the SEB-tree is based on p-samples of S. A p-sample S_p ⊆ S is constructed by selecting each line segment s ∈ S with probability p. To construct the SEB-tree index, Li et al. first construct a sequence of p_i-samples S_{p_i} ⊆ S for p_i = 2^{−i} (i = 1, 2, · · · ), and then build a B-tree based on each S_{p_i}. A snapshot top-k query can then be answered using the SEB-tree index in O(log |S| + k) time. The straightforward method thus spends O(M (log |S| + k)) time on snapshot top-k queries alone, plus the cost of counting over the candidates, where C is the total number of candidates (i.e., the number of objects that are ever a top-k at any time in W).

A sweep line method
Through the analysis in Section 3.1.3, we see that a major drawback of the straightforward method is that it issues many snapshot top-k queries, which is quite time-consuming. We argue that such a computational cost can be largely avoided, as an intersection x = (v, t) in fact carries much useful information.
In light of the intersection lemma (Lemma 1), an alternative solution to processing a DTop-k query q = ⟨W, k, γ⟩ over non-synchronized temporal data is to track the changes of the snapshot top-k set within the indicated time interval W = [t_begin, t_end]. Specifically, we may first find the snapshot top-k set Top^k_t(D) for t = t_begin and then scan all the intersections in temporal order to see whether and how Top^k_t(D) changes as t goes to t_end. Algorithm 2 summarizes the above idea, following the paradigm of sweep line algorithms [19]. It remains to be clarified how to maintain the snapshot top-k set during the line sweeping process, i.e., how to efficiently retrieve the k-th best object (Line 15).
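The sweeping idea can be sketched as follows. As an illustrative simplification of our own (not the exact event encoding of Algorithm 2), each intersection event carries the pair of objects involved, with `below` overtaking `above` at time t:

```python
def sla_dtopk(topk_init, events, window, gamma):
    """Sketch of the sweep line algorithm (SLA).

    topk_init: the snapshot top-k set at W.t_begin.
    events: time-ordered intersections (t, below, above), meaning object
    `below` overtakes object `above` at time t.  Top-k membership changes
    only when `above` is in the set and `below` is not.
    """
    t_begin, t_end = window
    topk = set(topk_init)
    counter = {o: 0.0 for o in topk}          # object -> accumulated top-k time
    prev = t_begin
    for t, below, above in events:
        if not (t_begin < t < t_end):
            continue
        if above in topk and below not in topk:
            for o in topk:                    # credit the elapsed sub-interval
                counter[o] = counter.get(o, 0.0) + (t - prev)
            prev = t
            topk.remove(above)
            topk.add(below)
    for o in topk:                            # credit the final sub-interval
        counter[o] = counter.get(o, 0.0) + (t_end - prev)
    threshold = gamma * (t_end - t_begin)
    return {o for o, c in counter.items() if c >= threshold}
```

Events that do not affect top-k membership are skipped without any crediting, since the set is unchanged across them.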

Maintaining the snapshot Top-k
Considering the fact that the parameter k is query-dependent, keeping track of the k-th best object in general amounts to maintaining the entire ordering of the temporal database D. Since we focus on top-k objects where k ≪ |D|, in practice we may instead maintain the top-k_max set, where k_max ≪ |D| is the maximum possible query parameter.
Let x_i = (t_i, v_i) (i = 1, 2, · · · , M) be all the intersections in D in temporal order. It is easy to see that the snapshot top-k set remains unchanged within every interval [t_i, t_{i+1}], since there is no intersection between x_i and x_{i+1}. Therefore, we may precompute the snapshot top-k_max sets for each interval [t_i, t_{i+1}], organizing the results in an index structure (e.g., a hash table). In this way, Line 15 of Algorithm 2 can be executed in constant time. Figure 2 illustrates the above idea.
From Figure 2, it is also clear that the time efficiency comes at the cost of index size: the index requires O(M · k_max) space. Although k_max ≪ |D| by assumption, the total number of intersections M is usually of the magnitude of O(|D|^2). Thus the hash table index of all snapshot top-k_max sets takes quadratic space.
To strike a balance between the space cost and time efficiency, we may use a tree-based data structure similar to the status data structure DS_status (Section 3.1.1) to facilitate the line sweeping. Specifically, we use a max-heap as an auxiliary data structure. The heap is initialized with the snapshot top-k ranking ⟨(1, o_1), (2, o_2), · · · , (k, o_k)⟩. Without loss of generality, here we assume that there is no tie in the ranking; ties can be trivially handled in such a data structure. Using the auxiliary data structure, it takes constant time to retrieve the k-th best object, but an additional O(log k) time to maintain the max-heap property. Thus, the overall time for processing an intersection is O(log k), and precomputing the snapshot top-k_max sets is needed only when the quadratic index is used. Comparing to the analysis in Section 3.1.3, we see that Algorithm 2 is much more efficient in processing DTop-k queries.
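The text above describes a max-heap over ranks; an equivalent sketch using Python's heapq (a min-heap keyed by current score, so that the k-th best object sits at the root) might look as follows. The helper names are ours:

```python
import heapq

def make_topk_heap(ranking):
    """Build the auxiliary heap from the initial top-k ranking, given as a
    list of (score, object) pairs.  A min-heap on score keeps the k-th best
    object at the root, so it is retrieved in O(1)."""
    heap = list(ranking)
    heapq.heapify(heap)
    return heap

def process_overtake(heap, new_score, new_obj):
    """Handle an intersection where an outside object challenges the k-th
    best: replace the root if the newcomer now scores higher, in O(log k).
    Returns the evicted object, or None if the top-k set is unchanged."""
    kth_score, kth_obj = heap[0]              # current k-th best object
    if new_score > kth_score:
        heapq.heapreplace(heap, (new_score, new_obj))
        return kth_obj                        # object leaving the top-k
    return None
```

Ties would need a secondary key in the heap entries, as noted above.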

Early termination
Given a DTop-k query q = (k, W, γ), it is possible for Algorithm 2 to terminate before actually sweeping the entire query interval W. Let t be the current position of the sweep line and r_t = W.t_end − t the remaining sweep time. For any object o with a counter c, if c + r_t < γ · |W|, then o can never reach the durability threshold and can be safely pruned; conversely, once c ≥ γ · |W|, o is guaranteed to be a query result. The sweep can stop as soon as every object is decided.
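Under our reading of this early-termination rule (the exact formulation is abbreviated above), the per-object decision can be sketched as:

```python
def decide(c, t, window, gamma):
    """Decide an object's status at sweep position t.

    c: accumulated top-k time so far.  With remaining time r_t, an object
    that already meets the threshold is a confirmed result, and one that
    cannot reach it even if it stays top-k until t_end is pruned.
    """
    t_begin, t_end = window
    r_t = t_end - t                       # remaining sweep time
    need = gamma * (t_end - t_begin)      # required top-k time gamma * |W|
    if c >= need:
        return 'result'
    if c + r_t < need:
        return 'pruned'
    return 'undecided'
```

Once no object is 'undecided', the sweep may terminate early.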

Experiments
In this section, we evaluate our proposed methods. We simulate durable top-k queries over two real datasets and compare our proposed method with straightforward solutions. We also study the impact of query parameters.

Datasets
We use Stock Market Data (SMD) 1 and Gowalla 2 as our datasets for the experiments. SMD records the daily stock market prices for all NASDAQ listed companies. Specifically, there are 1,574 companies. Each company is a temporal data from the day it was listed on NASDAQ until Aug 29, 2022. The earliest company was listed on Jan 2, 1970, whereas the latest one was just listed on Aug 24, 2022. So, the SMD dataset has a timespan of over 50 years, and the temporal data in SMD are of various lengths. Each temporal data has 5,509.45 values on average, with the maximum and minimum lengths being 13,283 and 4. The distribution of temporal data lengths in SMD is shown in Figure 3a.
Gowalla was a popular location-based social network service from 2007 to 2012. The dataset, released by Cho et al. [21] in 2011, contains 6,264,203 check-in records from 196,591 users to 1,280,956 places. The Gowalla dataset contains data from Feb 2, 2009 (Day 0) to Oct 23, 2010 (Day 626), covering a total number of 627 days. Figure 3b shows the daily number of check-in records in the dataset. Since there are in total ∼10^6 places but only ∼10^4 daily check-ins, we can infer that many places have no recorded visit on any given day. The original SMD and Gowalla datasets are synchronized on a daily basis. We randomly perturb the datasets to generate non-synchronized temporal datasets. Specifically, given a temporal object o = ⟨(t_1, v_1), (t_2, v_2), · · · , (t_T, v_T)⟩, we add a random time shift Δt_i to each time t_i (i = 1, 2, · · · , T). The time shift Δt_i is chosen uniformly at random in the range of ±1440 minutes. In this way, we turn the original synchronized data into non-synchronized data without much affecting the semantics of the data.
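This perturbation can be sketched as follows; re-sorting after the shift, to keep samples in temporal order when shifted timestamps swap, is our own addition:

```python
import random

def desynchronize(obj, max_shift=1440.0, rng=random):
    """Perturb a synchronized temporal object into a non-synchronized one:
    add a uniform random shift in [-max_shift, +max_shift] minutes to
    every timestamp, as in the experimental setup described above.
    """
    shifted = [(t + rng.uniform(-max_shift, max_shift), v) for t, v in obj]
    shifted.sort(key=lambda p: p[0])      # keep the samples in temporal order
    return shifted
```

Applying this to every object in a dataset yields temporal data whose sample times are no longer aligned across objects.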

Efficiency of SLA
In this section, we compare SLA with some baseline methods. Specifically, we consider the following methods in this experiment.
- Baseline (Section 3.1): the straightforward method that issues a snapshot top-k query at every intersection.
- SLA-Q: the SLA algorithm with the quadratic index (with all the top-k objects precomputed between any two intersections).
- SLA-H: the SLA algorithm with a heap structure to facilitate the line sweeping process.
We vary k in the range 5, 10, 15, 20, and 25. For each k, we randomly generate 1,000 DTop-k queries with a fixed γ = 0.75. The length of the query interval W is also fixed to a week (i.e., 7 days, or 10,080 minutes), but the position of the interval W may vary from the first day to the last. For each k, each of the 1,000 queries is answered by each method 10 times to obtain an average time cost. Then, the performance of a method is measured by the average of its performance over the 1,000 queries. Figure 4 shows the results. As can be seen, the time costs of all methods increase as k gets larger. Baseline increases faster than SLA does. This is because, when k increases, the snapshot top-k computation becomes more expensive. In addition, it can be seen that SLA outperforms Baseline by orders of magnitude because SLA avoids snapshot top-k queries. Moreover, SLA-Q slightly outperforms SLA-H due to the quadratic index of precomputed top-k rankings. Nonetheless, the improvement of SLA-Q over SLA-H is not very significant, which implies that a wise choice should be made to balance the space and time costs in practice. Comparing the results in Figures 4a and b, we see that it generally costs more to process DTop-k queries in Gowalla. This is because Gowalla has much more temporal data to be processed at most timestamps.

Sensitivity to the length of query interval
In this section, we investigate the impact of |W |, the length of the query interval, on the performance of SLA. We fix k = 5 and vary the length |W | in the range 7, 14, 21, and 28 days. This range covers the time length from one week to approximately one month. Again, we randomly generate 1,000 DTop-k queries for each choice of |W |. Each query is answered by each algorithm 10 times. The average time costs are recorded and shown in Figure 5.
From Figure 5, it is clear that, as the query interval W gets larger, the time cost of Baseline increases much faster than that of SLA. This is expected because the number of intersections may be quadratic in the number of line segments in W. In addition, we see that SLA-Q and SLA-H are not as sensitive to |W| as Baseline is. Indeed, compared to the initial snapshot top-k computation, the line sweeping process over W is relatively cheap (either reading a precomputed top-k list or updating a size-k heap). In addition, the difference between SLA-Q and SLA-H is very small, which again informs the choice of implementation details of SLA in practice.

Types and representations of temporal data
There are various types of temporal data [9]. A (spatio-)temporal point data is essentially a position in space and time. For example, a Twitter post (i.e., a tweet) can be viewed as an instance of point data, which is associated with a timestamp, a geo-tag, and some content [22]. A trajectory [23-26] is a traced path of some object moving in space over time. A time series is a series of points organized in temporal order [13,27-29]. Stream data are data continuously generated and collected over time [30,31].
In this paper, we focus on time series data. In particular, we target a general problem setting in which the time series are non-synchronized. This setting is closely related to piecewise linear approximations or segmentations of time series [18,32], which are typically used to reduce the storage cost of very long time series.

Queries over temporal data
Li et al. [20] proposed an index structure for snapshot top-k queries, finding the top-k temporal objects at a given timestamp. The index works with piecewise linear time series. As stated in Section 3, their solution could be a building block of our solution to DTop-k query processing. Jestes et al. [11] studied aggregate top-k queries on temporal data, where the aggregations are, e.g., the average or sum over a given time interval. Jiang and Pei [33] studied interval skyline queries, which find skyline objects (i.e., time series that are not dominated by others) within a given time interval. The foci of [11] and [33] are different from ours, and their solutions cannot be used for DTop-k queries. Lee et al. [10] studied consistent top-k queries, which are the special case of DTop-k with a fixed durability threshold γ = 1.0. As pointed out by Wang et al. [13], although their solution can be extended to DTop-k queries, the time efficiency of the extended solution is a major issue in practice. Wang et al. [13] first investigated DTop-k queries over time series data and proposed top-k event scanning (TES) algorithms. Gao et al. [34] reduced the problem of DTop-k query processing to 3D halfspace reporting problems and proposed more efficient solutions. However, both [13] and [34] focused on synchronized time series. Thus, their solutions are not directly applicable to our problem.
There are also studies of durable top-k queries over temporal data other than time series. For example, U et al. [12] studied DTop-k queries over archived documents. Gao et al. [35] studied DTop-k queries over instant-stamped temporal records. Chen et al. [36-38] studied the problems of top-k term publish/subscribe/search over geo-textual streams. The research problems of these works are clearly different from ours, and thus their solutions are not directly applicable to DTop-k query processing.

Conclusions
In this paper, we study the problem of DTop-k queries over non-synchronized temporal data. The major challenge brought by non-synchronicity is that the time space becomes continuous, and thus solutions based on a discrete time space are no longer efficient. We propose an efficient sweep line algorithm, SLA, to process DTop-k queries over non-synchronized temporal data. The key insight of SLA is the property of intersections: When two temporal data intersect, they change their relative order, and vice versa. Using this property, SLA answers DTop-k queries by tracking snapshot top-k objects. We conduct extensive experiments on two real datasets to test the performance of our proposed method. The results show that our method outperforms the baseline solutions by a large margin.