Top-k Approximate Selection for Typicality Query Results over Spatio-textual Data

Abstract Spatial keyword query is a classical query processing mode for spatio-textual data, which aims to provide users with the spatio-textual objects that have the highest spatial proximity and textual similarity to a given query. However, the top-k result objects obtained with the spatial keyword query mode are often similar to each other, while users hope that the system can pick the top-k typicality results from the candidate query results so that users can understand the representative features of the candidate result set. To deal with the problem of typicality analysis and typical object selection over spatio-textual query results, a typicality evaluation and top-k approximate selection approach is proposed. First, the approach calculates the synthetic distances between all spatio-textual objects on the dimensions of geographic location, textual semantics, and numeric attributes. Then, a hybrid index structure that can simultaneously support location, text, and numeric multi-dimensional matching is presented in order to expeditiously obtain the candidate query results. According to the synthetic distances between spatio-textual objects, a Gaussian kernel probability density estimation-based method for measuring the typicality of query results is proposed. To facilitate query result analysis and top-k typical object selection, Tournament strategy-based and Local neighborhood-based top-k typical object approximate selection algorithms are presented, respectively. The experimental results demonstrate that the text semantic relevancy measuring method for spatio-textual objects is accurate and reasonable, and that the Local neighborhood-based top-k typicality approximate selection algorithm is efficient and effective.


Introduction
With the rapid development of GPS and the universal application of intelligent devices, a large number of spatio-textual objects containing location information and descriptive information (such as POIs, check-ins, etc.) have emerged on the Web and gradually formed large-scale spatio-textual data. Spatial keyword query is becoming a research hotspot in the fields of Location-Based Services (LBS), spatio-temporal data mining, POI recommendation, and spatial database query [1][2][3][4]. For example, LBS platforms such as Meituan, Ctrip, Twitter, and Google Maps all need the support of spatial keyword query technology to obtain spatio-textual objects that are close to the query location, semantically related to the query keywords, and interesting to users. A spatio-textual object (hereinafter referred to as a spatial object) mainly includes two parts: location information and descriptive information. Location information represents the geographical position of the spatial object (for a two-dimensional spatial object, it is represented by longitude and latitude), and descriptive information is the textual and numeric description of the spatial object's characteristics (for example, hotel name, facilities, surrounding environment, user comments, prices, user ratings, etc.). Given a set of spatio-textual objects, a spatial keyword query q is of the form q(loc, keywords, k, α), where q.loc represents the query position, q.keywords is the set of query keywords, k is the number of returned result objects, and α ∈ [0, 1] is a weight coefficient. The commonly used similarity measure between a spatial object o and a query q is

Score(o, q) = α · SimLoc(o.loc, q.loc) + (1 − α) · SimDoc(o.doc, q.keywords)   (1)

where o.loc and o.doc represent the location information and descriptive information of the spatial object, respectively, SimLoc and SimDoc represent the normalized spatial proximity and textual similarity between o and q, respectively, and α is used to adjust the proportion of spatial proximity and textual similarity in the overall similarity.
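As a concrete illustration of Eq. (1), the following sketch combines a normalized spatial proximity and a textual similarity with the weight α. The Euclidean-distance normalization radius and the Jaccard keyword overlap are illustrative assumptions, not the measures fixed by the paper:

```python
import math

def sim_loc(o_loc, q_loc, max_dist=10_000.0):
    # Illustrative proximity: Euclidean distance normalized by an
    # assumed maximum search radius, so that closer means a higher score.
    d = math.dist(o_loc, q_loc)
    return max(0.0, 1.0 - d / max_dist)

def sim_doc(o_terms, q_keywords):
    # Illustrative textual similarity: Jaccard overlap of keyword sets.
    o, q = set(o_terms), set(q_keywords)
    return len(o & q) / len(o | q) if o | q else 0.0

def score(obj, query, alpha=0.5):
    """Combined similarity of Eq. (1): weighted sum of normalized
    spatial proximity and textual similarity (both in [0, 1])."""
    return alpha * sim_loc(obj["loc"], query["loc"]) + \
           (1 - alpha) * sim_doc(obj["doc"], query["keywords"])
```

An object whose location coincides with q.loc and whose keywords fully overlap q.keywords scores exactly 1 under this sketch.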
Since a large number of spatial objects are contained on the Web, there can be too many matching candidate query results for a non-selective spatial keyword query. For example, searching for "Thai Food" near "706 Mission St, San Francisco" on the Yelp website returns about 500 Thai restaurants and related restaurants (for example, Vietnamese food, Hunan food, etc.), and these result objects are presented to users in a comprehensive order according to their location similarity and textual similarity to the query. However, the restaurants in the top-k result set tend to be similar to each other and may not be representative of the whole candidate query results. In real applications, users want to know the top-k most representative restaurants in terms of location, dishes, and user ratings over the whole query result set. Additionally, a user may only specify a query region and aim to search for the most representative top-k restaurants in that region. In this case, if the system can automatically find the top-k typical results for users to choose from, according to the distribution of spatial objects in the target set in terms of geographical location, textual semantics, user ratings, and other information, it will greatly reduce the burden of users' selection and strengthen users' understanding of the candidate query result set.
To solve the above problems, this paper proposes a typicality measuring method for spatial objects and top-k typical object approximate selection algorithms. The method first calculates the synthetic distances between all spatial objects on the dimensions of geographical location, textual semantics, and numeric attributes, and then builds a hybrid index structure supporting location-text-numeric matching to quickly obtain candidate query results. Then, the top-k typical objects are obtained from the candidate query results by using a probability density estimation method based on the Gaussian kernel function. When the size of the candidate query results is large, the time complexity of selecting the precise top-k typical objects is high. Therefore, to facilitate query result analysis and top-k typical object selection, this paper presents approximate selection algorithms based on the Tournament strategy and the local neighborhood, respectively, which are used to expeditiously provide the top-k typical objects.

Related work

Spatial keyword query processing mode
There are generally six spatial keyword query processing modes for spatio-textual data: Boolean range query [5], Boolean kNN query [6], top-k range query [7], top-k kNN query [8][9][10], top-k GNN query [3,11], and reverse kNN query [12]. Boolean range query returns the spatial objects within the query region that contain all query keywords. Boolean kNN query returns the top-k spatial objects that contain all query keywords and are closest to the query location. Top-k range query returns the top-k spatial objects that are within the query region and most relevant to the query keywords. Top-k kNN query returns the top-k spatial objects with the highest comprehensive similarity to the query location and query keywords. Top-k GNN query returns k groups of spatial objects covering all query keywords and close to the query location, where each group of spatial objects is treated as one query result unit. Reverse kNN query returns a set of objects, each of which contains the query location in its k-nearest-neighbor set. The first two query modes rely on strict query matching, which requires the text information associated with spatial objects to contain all query keywords. The third and fourth query modes focus on ranking, ordering the results according to the spatial proximity and textual similarity between the spatial objects and the query. The last two query modes extend the traditional spatial keyword query and are used to obtain nearest-neighbor group objects and reverse-nearest-neighbor spatial object sets. Table 1 shows the characteristics of the different spatial keyword query processing modes.

Table 1 Characteristics of spatial keyword query processing modes

Query type             | Spatial index structure    | Ranking strategy
1. Boolean range query | R*-tree, R-tree, Quad-tree | No ranking
2. Boolean kNN query   | Quad-tree, R-tree          | Spatial proximity first
3. Top-k range query   | S2I, R-tree                | Textual similarity first
4. Top-k kNN query     | Quad-tree, IR-tree         | Spatial proximity and textual similarity
5. Top-k GNN query     | Grid, Quad-tree, IR-tree   | In-group density, spatial proximity and textual similarity
6. Reverse kNN query   | Quad-tree, IR-tree         | Spatial proximity and textual similarity

The top-k typicality query processing mode proposed in this paper aims to obtain the top-k objects with the highest typicality from the candidate query result set, so that they represent the overall characteristics of the whole candidate query result set; it is a new spatial keyword query processing and result analysis mode.

Index structure of spatio-textual data
The index structures for spatial data mainly include R-tree-based indexes (such as R-tree [13], R*-tree [14], S2I [15], etc.), Grid-based indexes (such as TS [16], IG-tree [17], etc.), and Quad-tree-based indexes (such as I3 [18], ILQ [19], etc.). The index structures for text data mainly include the Inverted file [20], Signature file [21], and Bitmap index [22]. An index for spatio-textual data combines a spatial index with a text index into a spatio-textual hybrid index structure. The main hybrid index structures and their characteristics are shown in Table 2.
As can be seen from Table 2, the IR-tree is the most representative hybrid index structure. Its basic idea is to use an R-tree to match the query region and leverage an inverted file to match the query keywords. In real applications, however, the descriptive information of spatio-textual objects (such as hotel name, facilities, price, user rating, etc.) contains both text information and numeric information. Existing spatio-textual hybrid index structures can only deal with spatial information and text information, while users usually query the numeric attributes by specifying preference weights (such as preferring hotels with low prices and high ratings) instead of explicit numeric intervals; existing hybrid index structures cannot effectively handle such queries. Therefore, based on the IR-tree, this paper constructs a hybrid index structure that simultaneously supports spatial proximity, text similarity, and numeric preference processing, so as to expeditiously obtain the candidate query results.

Top-k query result selection
For the top-k query result selection of spatial keyword queries, the factors considered in the query result ranking function can be divided into three categories: spatial proximity, textual similarity, and an overall score of spatial proximity and textual similarity. The top-k selection of spatial keyword query results is mainly based on two strategies. One is to locate candidate query results using spatio-textual index structures such as the IR-tree, and then rank the candidate query results according to a scoring function combining spatial proximity and textual similarity [15,23,24]. The other is based on the Threshold Algorithm (TA), which obtains the top-k results whose comprehensive similarity is higher than a given threshold by dynamically adjusting the threshold [25,26].

Table 2 Hybrid index structures and their characteristics

Index                 | Spatial index | Text index    | Advantages / Disadvantages
...                   | ...           | ...           | Disadvantages: high storage cost, high computational cost for update operations
I3 [18], ILQ [19]     | Quad-tree     | Inverted file | Advantages: high efficiency for region searching; Disadvantages: unbalanced tree structure and high storage cost
TS [16], IG-tree [17] | Grid          | Inverted file | Advantages: high efficiency for region searching; Disadvantages: high storage and update costs

However, the top-k result objects returned by existing methods are usually similar to each other and not representative, so it is all the more necessary to choose top-k typicality results. The concept of typicality analysis originated in cognitive science [31] and has gradually been applied to numeric data analysis [32], image processing and recognition [33], recommendation systems [34], and other fields in recent years. Research in these fields has shown that selecting typical samples is a very effective way for users to understand and analyze a large dataset. In this paper, the idea of typicality analysis from cognitive science is applied to the selection of top-k typicality query results for spatio-textual data. The purpose is to select representative query results from a large number of candidate query results, so as to enhance users' understanding of the main features of the candidate query result set. However, unlike image data and numeric data, spatial objects contain location, text, and numeric data, and the typicality evaluation of an object depends on the distribution of the objects around it. Therefore, the difficulties in typicality evaluation and selection of query results lie in the comprehensive similarity evaluation among spatial objects on the location, text, and numeric dimensions, and in the precision and execution efficiency of the top-k approximate selection algorithm.
In recent years, diversity queries, which are closely related to typicality queries, have attracted much attention in the fields of advertisement search [35], spatio-temporal news push [36], RDF data query [37], spatio-textual data query [38], road network query [39], etc. The above work mainly adopts methods such as text semantic similarity measuring, density-based clustering, and the maximum coverage model. Typicality queries and diversity queries differ in that the former considers both the typicality of the selected results within the candidate query result set and the differences between them, while the latter focuses on maximizing the differences between the selected results.

Textual similarity measuring
Textual similarity measuring methods can be roughly divided into four categories. The first category is based on term statistics, such as the Bag of Words (BOW) model [40] and the Vector Space Model (VSM) [41]. These methods measure the similarity between terms according to their co-occurrence frequency in the document set, but cannot effectively measure the semantic relationships between terms. The second category is Knowledge Base (KB) methods [42], which introduce external dictionaries (such as HowNet [43], WordNet [44], Probase [45], Wikipedia [46], etc.) and use the hierarchical path relationships between concepts or entries in the dictionaries to determine the similarity between entries. However, these methods cannot calculate the similarity between entries that are not covered by the knowledge base. The third category comprises methods based on topic models (such as LDA and LSI) [47]. The basic idea is to calculate the semantic correlation between terms according to the document-topic-word distribution. However, most of the comment texts of spatial objects are short texts. As pointed out in [48], short texts do not contain sufficient statistical information: the number of times a word is assigned to a topic and the number of words occurring in a topic are difficult to converge, and some users' comment syntax is arbitrary. Therefore, topic models are not suitable for semantic similarity measuring between user comments. The fourth category is based on word embedding models, which map the entries in documents to a fixed-length, low-dimensional dense vector space according to the size of the vocabulary. Word vectors can preserve semantic relationships, grammatical information, and contextual information. At present, the mainstream word embedding models are Word2Vec [49,50] (such as Skip-gram and CBOW), SENNA [51], GloVe [52], and FastText [53]. Word embedding is currently the most popular way to measure the semantic similarity of texts. In this paper, a word embedding and Convolutional Neural Network (CNN) model is used to extract the deep semantic features of comment texts to calculate the semantic similarity between spatial objects.
Problem definition and solutions

Problem definition
Spatial objects mainly include spatial information (such as latitude and longitude), text information (such as feature descriptions and user comments), and numeric information (such as user ratings and prices), so the features describing spatial objects mainly include location, text, and numeric attributes. The definition of "typicality" is given as follows.

Definition 1 (Typicality). Given a set of spatial objects D = {o1, o2, ..., on} described by m attributes (A1, A2, ..., Am), the spatial objects in D can be regarded as an independent and identically distributed sample subset of the sample space Z composed of the Cartesian product of the attribute domains (Dom(A1), Dom(A2), ..., Dom(Am)), and each spatial object (i.e., sample point) is an m-dimensional vector in Z. The typicality degree of a spatial object oi in D can be expressed as

T(oi, D) = L(oi | Z)

where L(oi | Z) denotes the likelihood of oi, i.e., the probability of the occurrence of oi in Z. Since the data distribution of Z is unknown in real applications, an approximate estimate of L(oi | Z) is needed, and density-based estimation is the main method to measure the typicality of samples. Research in psychology and cognitive science further shows [31] that the greater the probability of an object o appearing in the set D, the more typical o is. The probability of o appearing in D is affected by the distribution of the objects around it: the denser the objects distributed around o, the higher the probability density of o and the more representative/typical it is.
Typicality analysis is closely related to cluster analysis, but there are essential differences. Typicality analysis selects representative/typical objects according to the distribution of the objects in a given set (such as the spatial distribution, semantic feature distribution, and user rating distribution), whereas clustering analysis divides the objects in a given set into several categories according to the distances between them, making the distance between objects in the same category as small as possible. In some research work, the Mean or Centroid point of each cluster is regarded as the representative of the class, but in some cases the typical points are neither Mean nor Centroid points. Figure 1 shows a set of spatial objects together with the Typical point A, Mean point B, and Centroid point C of this set. Point A is more representative than B and C because the points distributed around A are denser, so A is a typical point of the set. According to the above analysis, this paper takes advantage of kernel function-based probability density estimation to measure the typicality degree of spatial objects in a given set. This paper aims to construct a hybrid index structure that supports spatio-text-numeric multi-dimensional matching, use the hybrid index structure to quickly obtain candidate query results for a spatial keyword query, and finally conduct the typicality degree analysis and top-k approximate selection of the candidate query results.
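The density-based view of typicality described above can be sketched with a Gaussian kernel density estimate over the pairwise distances; the bandwidth h, the exact normalization, and the exhaustive top-k scan are illustrative assumptions, with `dist` standing for the matrix of integrated distances between objects:

```python
import math

def typicality(i, dist, h=0.3):
    """Gaussian-kernel density estimate of object i's typicality from the
    pairwise distance matrix `dist` (distances normalized to [0, 1])."""
    n = len(dist)
    s = sum(math.exp(-(dist[i][j] ** 2) / (2 * h * h))
            for j in range(n) if j != i)
    return s / ((n - 1) * h * math.sqrt(2 * math.pi))

def top_k_typical(dist, k):
    # Exact (brute-force) selection: score every object, keep the k
    # objects lying in the densest regions of the set.
    scores = [(typicality(i, dist), i) for i in range(len(dist))]
    return [i for _, i in sorted(scores, reverse=True)[:k]]
```

Objects surrounded by many close neighbors receive high density scores, matching the intuition that point A in Figure 1 is typical because the points around it are dense.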

Solution
Figure 2 shows the solution for top-k approximate selection of typicality query results over spatio-textual data. The solution includes an offline pre-processing step and an online processing step.
(i) Offline pre-processing. Firstly, we compute the normalized distances between all spatial objects on the location, text, and numeric attributes, respectively, and store these distances in files for calculating the typicality degrees of query results during the online processing stage. There are two methods for measuring text semantic similarity: one is based on the keyword coupling relationship, and the other is based on word embedding and a Convolutional Neural Network. These two methods are used to deal with the description texts and user comments associated with spatial objects, respectively.
(ii) Online processing. For a given spatial keyword query, the query is first normalized by decomposing the query keywords and unifying the letter case. Then, the candidate query results are obtained from the spatial object set by using the spatio-text-numeric hybrid index structure. For the candidate query result set, we take advantage of the Gaussian kernel function-based probability density estimation method and the integrated distances between spatial objects on each attribute (including location, text, and numeric value) calculated in the offline stage to compute the typicality degrees of the spatial objects. Lastly, the top-k typical results are obtained from the candidate query result set by using an approximation algorithm.

Integrated similarity measuring for spatial objects

Spatial objects mainly include three kinds of attributes: location, text, and numeric. To measure the typicality degree of spatial objects with the probability density function, it is necessary to quantify the similarity between two objects on each attribute and then calculate the integrated similarity over these attributes. The similarities of spatial objects on the location and numeric attributes can simply be calculated with the Euclidean distance. The text information associated with spatial objects mainly includes description text and comment text. Description text usually corresponds to attributes such as "hotel name: hilton" and "facilities: gym, washing room, wifi". Comments are mostly short natural-language texts and sentences issued by users. To measure the text semantic similarity between spatial objects, we leverage different text similarity measuring methods for description text and comment text. In this paper, the coupling relationship between keywords is used to measure the semantic similarity between the description texts of spatial objects, while the combination of word embedding and a convolutional neural network is used to measure the semantic similarity between the comment texts associated with spatial objects.

Semantic similarity measuring for description text based on keyword coupling relationship
The description text of a spatial object usually corresponds to different attributes (such as name, category, and facility service). The text attributes and their corresponding keywords can be expressed in the form of <key: value> pairs. For example, Table 3 shows the locations and descriptions of four spatial objects. Spatial object o1 contains the set of key-value pairs {<name: full season>, <facility service: pool>, <facility service: wifi>, <facility service: gym>}. For spatial objects o2 and o3, the similarity between them would be 0 if we used a traditional textual similarity calculation method (such as the vector space model), because there is no intersection between the sets of key-value pairs they contain. However, the values "swimming pool", "parking", and "gym" corresponding to the attribute "facility service" appear together in o4, and the values "swimming pool" and "gym" appear together in both o1 and o4, which shows that there is an inter-correlation between "swimming pool" and "gym" and further indicates that the texts of o2 and o3 have a coupling correlation. Based on this observation, this paper takes advantage of the coupling relationship between keywords to measure the text semantic similarity between spatial objects.
In this paper, the description text of each spatial object is regarded as a document, and the description texts associated with all spatial objects constitute a document collection (denoted by T). We first use text processing tools to extract all the distinct keywords in the document collection and normalize them (e.g., unifying upper and lower case letters, unifying word tenses, word segmentation, and removing meaningless words); each document can then be represented by a group of keywords in the form of <attribute: keyword>. On this basis, we can take advantage of the keyword coupling relationship to measure the semantic similarity between the description texts of spatial objects.

Keyword coupling relationship measuring
In a document collection, the relationships between all distinct keywords can be represented by a graph structure. Figure 3 shows the coupling relationships between three keywords {A, B, C}, where each vertex represents a keyword and each edge represents an intra-coupling relationship between two keywords. As can be seen from Figure 3, the coupling relationship between vertices can be divided into the intra-coupling relationship and the inter-coupling relationship. If two vertices are directly connected by an edge, they have an intra-coupling relationship (such as A and B); if two vertices are indirectly connected, they have an inter-coupling relationship (such as A and C, which are connected through B). The weight on an edge represents the normalized intra-coupling relationship between the keywords. For a given keyword, its outgoing and incoming weights may differ; for example, the weight from B to A is 0.5 while the weight from A to B is 1. The coupling relationship between two keywords is a linear combination of their intra-coupling and inter-coupling relationships.

Fig. 3 Keyword coupling relationship

(i) Keyword intra-coupling relationship measuring
Given a pair of keywords (ti, tj), their intra-coupling relationship IaR(ti, tj) can be calculated from the number of documents in T in which ti and tj co-occur. Since ti and tj may also co-occur with other keywords, it is necessary to normalize the intra-coupling relationship between ti and tj as

δIntra(ti, tj) = IaR(ti, tj) / Σ_{k=1, k≠i}^{n} IaR(ti, tk)

where n represents the number of distinct keywords in T. The normalized intra-coupling relationship of (ti, tj) is the proportion of the intra-coupling relationship between ti and tj in the sum of the intra-coupling relationships between ti and all other keywords. For any pair of keywords (ti, tj), δIntra(ti, tj) ≥ 0 and Σ_{j=1, j≠i}^{n} δIntra(ti, tj) = 1.

(ii) Keyword inter-coupling relationship measuring

If there is no direct relationship between ti and tj, but there is at least one keyword tc in the document set T such that δIntra(ti, tc) > 0 and δIntra(tj, tc) > 0, then (ti, tj) have an inter-coupling relationship, and the inter-coupling relationship generated by their common keyword tc is

δInter(ti, tj | tc) = min(δIntra(ti, tc), δIntra(tj, tc))

where δIntra(ti, tc) and δIntra(tj, tc) represent the intra-coupling relationships between ti and tc and between tj and tc, respectively.

Since there are usually several common keywords between ti and tj and each common keyword has a different importance in the document set, we use the IDF weighting method to measure the weight of each keyword in the document set and normalize the weights of all keywords by the maximum IDF. Let C denote the common keyword set of ti and tj, that is, C = {tc | δIntra(ti, tc) > 0 and δIntra(tj, tc) > 0}. Then the inter-coupling relationship of ti and tj through all the common keywords in C can be calculated as

δInter(ti, tj) = Σ_{tc∈C} w(tc) · δInter(ti, tj | tc) / |C|

where w(tc) is the normalized IDF weight of tc.

(iii) Coupling relationship between keywords

Given a weight coefficient θ ∈ [0, 1], the coupling relationship between keywords ti and tj is obtained by linearly combining their intra- and inter-coupling relationships:

R(ti, tj) = θ · δIntra(ti, tj) + (1 − θ) · δInter(ti, tj)
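A minimal sketch of the coupling computation, under stated assumptions: the raw intra-coupling is taken to be document co-occurrence counts, the inter-coupling combines common-neighbour strengths with a min operator, and uniform weights stand in for the paper's normalized IDF weights:

```python
from collections import defaultdict

def intra_coupling(docs):
    """Normalized intra-coupling: the share of t_i's co-occurrences that
    involve t_j (co-occurrence counts are an assumed raw measure)."""
    co = defaultdict(int)
    for doc in docs:
        terms = sorted(set(doc))
        for a in terms:
            for b in terms:
                if a != b:
                    co[(a, b)] += 1  # ordered pair: weights may be asymmetric
    totals = defaultdict(int)
    for (a, _), c in co.items():
        totals[a] += c
    return {(a, b): c / totals[a] for (a, b), c in co.items()}

def inter_coupling(ti, tj, intra, vocab, weight=None):
    """Inter-coupling through common neighbours t_c; min-combination and
    uniform weights are illustrative assumptions."""
    weight = weight or (lambda t: 1.0)
    common = [t for t in vocab
              if intra.get((ti, t), 0) > 0 and intra.get((tj, t), 0) > 0]
    if not common:
        return 0.0
    return sum(weight(t) * min(intra[(ti, t)], intra[(tj, t)])
               for t in common) / len(common)

def coupling(ti, tj, intra, vocab, theta=0.5):
    # Linear combination of intra- and inter-coupling, weighted by theta.
    return theta * intra.get((ti, tj), 0.0) + \
           (1 - theta) * inter_coupling(ti, tj, intra, vocab)
```

With this sketch, "gym" and "wifi" that never co-occur directly still obtain a positive coupling if both co-occur with "pool", mirroring the o2/o3 example above.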

Semantic similarity measuring based on keyword coupling relationship
Based on the keyword coupling relationship, the semantic similarity measuring method for the description texts associated with spatial objects is divided into the following three steps.

(i) Document vectorization

Let Γ be a sequence containing all the distinct keywords in the description text set associated with all spatial objects, and let m be the number of keywords in the sequence, that is, m = |Γ|. Γ[i] denotes the i-th keyword in the sequence, where i ∈ {1, ..., m}. Based on the order of the keywords in Γ, if the document d of a spatial object contains the keyword corresponding to Γ[i], then the i-th component of d's vector is set to 1; otherwise, it is set to 0.

(ii) Semantic matrix construction

According to the coupling relationships between keywords, the correlations between all keywords in Γ are assembled into a matrix S of size m × m, where each element S(i, j) represents the coupling relationship between keywords ti and tj.
(iii) Semantic similarity measuring between documents

According to the matrix constructed in step (ii), the semantic similarity between two documents di1 and di2 with vectors vi1 and vi2 can be defined as

simT(di1, di2) = (vi1 · S · vi2ᵀ) / (‖vi1‖ · ‖vi2‖)

where simT indicates the similarity between the vectors corresponding to documents di1 and di2, and the matrix S preserves the coupling relationships between the keywords contained in the two documents.

Semantic similarity measuring for comment text based on word embedding and CNN

(i) Word vector training

We keep the latest 50 comment texts for each spatial object and put them together to form an integrated comment text. Each spatial object corresponds to one integrated comment text, and all the integrated comment texts together form a comment text set. The Skip-gram model is used to train the distinct keywords in the comment text set, and each keyword is transformed into a d-dimensional dense vector.

(ii) Matrix representation of comment text
The comment text corresponding to a spatial object usually contains multiple keywords, each of which can be represented by a vector; thus, each comment text can be spliced into a matrix of keyword vectors. Suppose the vector dimension of each keyword is d and the maximum number of keywords contained in a comment text is n; then each comment can be represented by an n × d matrix. It should be pointed out that, for comments with fewer than n keywords, zero-vector padding is adopted to bring the number of word vectors up to n.
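The zero-padding step can be sketched as follows; `vectors` is an assumed token-to-vector lookup (e.g., from a trained Skip-gram model), and mapping out-of-vocabulary tokens to the zero vector is an illustrative choice:

```python
def comment_matrix(tokens, vectors, n, d):
    """Stack word vectors into a fixed n x d matrix (list of rows),
    zero-padding short comments so every CNN input has the same shape."""
    zero = [0.0] * d
    rows = []
    for tok in tokens[:n]:
        # Out-of-vocabulary tokens keep a zero row, like the padding below.
        rows.append(list(vectors.get(tok, zero)))
    while len(rows) < n:
        rows.append(list(zero))  # pad short comments up to n rows
    return rows
```

Comments longer than n keywords are truncated to the first n rows, so every comment yields exactly the n × d input the convolution layer expects.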
(iii) Feature extraction for comment text

A CNN can control the fitting ability of the model through the sizes of the convolution kernels, the pooling, and the output feature vectors. As shown in Figure 4, the input of the neural network is the n × d matrix corresponding to the given comment text, and filters of sizes (3, d), (4, d), and (5, d) are used for the convolution operations. An activation function is then used for nonlinear processing. Next, the obtained results are processed by max pooling, dropout, and flattening, and finally the feature vector representation of the comment text is output. In Figure 4, the convolution layer is used to extract features. Since each vector represents a keyword and keywords cannot be split, the convolution kernel can only slide in the vertical direction; that is, the width of the convolution kernel is equal to the dimension d of the word vectors, and the height h of the convolution kernel indicates the number of keyword vectors covered by the convolution window. The parameter h can take different values; by combining the feature vectors extracted by different convolution windows, the effective semantic features of the document can be better reflected. Let ai be the feature generated from the keyword window x_{i:(i+h−1)}; then ai = f(w · x_{i:(i+h−1)} + b), where w is the weight matrix of the convolution kernel, b is the bias vector, and f(·) is the ReLU activation function. After the convolution operation, the feature set {a1, a2, ..., a_{n−h+1}} is generated from the keyword windows x_{1:h}, x_{2:(h+1)}, ..., x_{(n−h+1):n} of the document.
The pooling layer is connected to the convolution layer. In this model, max pooling is used: the maximum value of the feature map generated by each convolution kernel is obtained by the max pooling operation, and these maxima are spliced together to obtain the feature vector of the convolution structure. The purpose of max pooling is to extract the most representative feature generated by each convolution kernel. The dropout layer randomly disables neurons during forward propagation in order to avoid over-fitting during training. The Flatten layer splices and smooths the feature values extracted by the convolution and pooling layers into a single vector. The vector output by the Flatten layer is the feature vector representation of the comment text corresponding to the spatial object.
(iv) Semantic similarity measuring for comment texts. The feature vector representation of the comment text of each spatial object can be obtained through step (iii). Given a pair of comment texts corresponding to two spatial objects, the semantic similarity between them is obtained by calculating the Cosine similarity of their feature vectors.
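Steps (iii) and (iv) can be sketched in NumPy as follows. This is a minimal illustration, not the trained model: a single randomly initialized kernel stands in for each height (the paper uses 64 trained kernels per height), and dropout, which only matters during training, is omitted.

```python
import numpy as np

def textcnn_features(embeds, kernels, heights=(3, 4, 5)):
    """Slide a kernel of each height vertically over the (n, d) matrix of
    word vectors, apply ReLU, then max-pool each feature map."""
    n, d = embeds.shape
    feats = []
    for h in heights:
        w, b = kernels[h]                        # w: (h, d) weight matrix, b: bias
        fmap = np.array([np.sum(w * embeds[i:i + h]) + b
                         for i in range(n - h + 1)])
        fmap = np.maximum(fmap, 0.0)             # ReLU activation
        feats.append(fmap.max())                 # max pooling per kernel
    return np.array(feats)                       # flattened feature vector

def comment_similarity(e1, e2, kernels):
    """Cosine similarity of the CNN feature vectors of two comment texts."""
    u, v = textcnn_features(e1, kernels), textcnn_features(e2, kernels)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```

With trained kernels, `comment_similarity` plays the role of the step-(iv) measure; identical inputs yield a similarity of 1.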

Integrated distance measuring for spatial objects
Given a set D = {o_1, o_2, ..., o_n} containing n spatial objects, according to the above similarity measuring methods, the similarities between any pair of objects in D in the dimensions of geographical location, description text, comment text, and numeric attributes are normalized values (i.e., values in the interval [0, 1]), so each distance can be obtained by subtracting the normalized similarity from 1. Consequently, the integrated distance between objects o_i and o_j can be calculated by the following formula,

d(o_i, o_j) = Σ_{l=1}^{m} w_l · d_l(o_i, o_j) (8)

where m represents the number of attributes of spatial objects (including geographical location, description text, comment text, and numeric attributes), d_l(o_i, o_j) represents the normalized distance between o_i and o_j on attribute l, and w_l represents the weight of the distance corresponding to attribute l in the integrated distance. In this paper, all attributes are regarded as equally important, that is, w_l = 1/m.
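Under the equal-weight setting w_l = 1/m, the integrated distance reduces to the mean of the per-attribute distances. A minimal sketch (the attribute order is illustrative):

```python
import numpy as np

def integrated_distance(similarities, weights=None):
    """Integrated distance between two spatial objects, given their
    per-attribute normalized similarities (each in [0, 1]), e.g.
    [location, description text, comment text, numeric attributes].
    Each distance is 1 - similarity; they are combined as a weighted sum."""
    sims = np.asarray(similarities, dtype=float)
    m = sims.size
    w = np.full(m, 1.0 / m) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sum(w * (1.0 - sims)))
```

Two identical objects (all similarities 1) get distance 0, and fully dissimilar objects get distance 1, so the integrated distance stays normalized in [0, 1].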

Hybrid Index Structure and Query Matching
To quickly obtain the matching candidate query results for a given spatial keyword query, we propose a spatio-text-numeric hybrid index structure that prunes the search space and thereby improves query matching efficiency.

Hybrid index structure
The hybrid index structure combining spatio-text-numeric matching is called the NIR-tree; its structure is shown in Figure 5. Each node of the NIR-tree records the spatial information, a text information summary (the set of all distinct keywords extracted from the text information in the node), the numeric attribute tuple information, and the pointers of all objects in the subtree rooted at the node. The information of each node in the NIR-tree is divided into three parts: the first two parts are pointers to the inverted file (InvFile) and the numeric attributes file (NumFile) of the node, and the third part is the collection of entries in the node. Each non-leaf node and leaf node may contain multiple entries, but the number of entries in each node is bounded and cannot be less than half of the pre-defined maximum number of entries.
Skyline is the collection of tuples that are not dominated by any other tuple in the dataset. Let R be a relation containing a set of attributes A = {A_1, A_2, ..., A_m} and n tuples. For a pair of tuples <p, q> of R, tuple q dominates p if q is superior to p in at least one attribute and not inferior to p in any other attribute. If tuples p and q do not dominate each other, then both p and q belong to the Skyline. For example, when looking for hotels, users usually consider factors such as price, hotel grade, and the number of parking spaces, so hotels with lower prices, higher grades, and more parking spaces are obviously good choices. The numeric attribute information of each spatial object can be regarded as a numeric attribute tuple. If a tuple p describing the numeric attribute information of a hotel is contained in the Skyline, there is no other tuple with a lower price, a higher grade, and more parking spaces than p. Thus, the Skyline is helpful for obtaining high-quality query results. The NIR-tree index structure leverages the Skyline to process the numeric attribute information associated with spatial objects. Algorithm 1 presents a Skyline computation algorithm for the numeric attribute information associated with spatial objects.
For a given numeric attribute tuple list list_num, Algorithm 1 first sorts list_num in ascending order by the value of the first numeric attribute. Then, every tuple that is dominated by another tuple in list_num (i.e., no better on every attribute and worse on at least one) is deleted, repeating until no dominated tuple remains. Lastly, the Skyline collection of list_num is obtained. By using Algorithm 1, the Skyline collection of all spatial objects can be computed. During query processing, we can treat the candidate query result objects that also fall in the Skyline collection as the preferred results. Each entry of a non-leaf node is a quadruple of the form <pN, Rect, N.pid, N.aid>, where pN is the address of the child node N within this node, Rect is the MBR that contains all child nodes under this node, N.pid is the text information identifier pointing to the InvFile that contains the text keywords of all child nodes under this node, and N.aid is the numeric attribute tuple identifier pointing to the NumFile that contains the Skyline collection of the numeric attribute tuples of all child nodes under this node.
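The dominance filtering of Algorithm 1 can be sketched as follows. This is a minimal version that assumes larger attribute values are preferable; attributes where smaller is better (e.g., price) can simply be negated before the call.

```python
def dominates(q, p):
    """q dominates p if q is at least as good on every attribute and
    strictly better on at least one (larger values assumed better)."""
    return all(a >= b for a, b in zip(q, p)) and any(a > b for a, b in zip(q, p))

def skyline(tuples):
    """Skyline of a list of numeric attribute tuples: sort by the first
    attribute (as the paper describes), then keep only the tuples that
    no other tuple dominates."""
    ordered = sorted(tuples, key=lambda t: t[0])
    return [p for p in ordered
            if not any(dominates(q, p) for q in ordered if q is not p)]
```

For example, with grade/parking-style tuples `[(1, 5), (2, 4), (3, 3), (2, 2), (0, 1)]`, the tuples `(2, 2)` and `(0, 1)` are dominated and filtered out.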
By using the above spatio-text-numeric hybrid index structure building method, we can generate the NIR-Tree (shown in Figure 6) for the spatial objects in Table 4.

Query Matching Algorithm Based on NIR-Tree
For a given spatial keyword query, the process of obtaining candidate query results by using NIR-Tree is as follows: (i) Starting from the root node of NIR-Tree, the algorithm first checks whether each branch node matches the spatial constraint of the query in turn.If the node satisfies the spatial constraint, then it checks whether the InvFile of this node contains the query keywords.
(ii) For the matching branch nodes, the objects in the Skyline collection are taken as the preferred query results (the objects in the Skyline collection have better numeric attribute values than those not in it).
The query matching algorithm based on NIR-Tree is shown in Algorithm 2.

Fig. 6 An example of NIR-Tree
Algorithm 2 works as follows. First, it adds the root node to the maximum heap. Then, for each element N popped from the heap: if N is a spatial object (i.e., it comes from a leaf node), N is added to the result, and S1, S2, and the integrated Score are calculated. Otherwise, N is a non-leaf node; for each entry E in N, the algorithm checks whether E contains the query keywords and satisfies the spatial constraint of the query. If not, the branch is pruned and not explored further. Otherwise, the values of S1, S2, and Score are calculated after obtaining the Skyline collection for the node, and E is added to the heap. This iteration continues until the heap is empty. Lastly, the set of matching results is returned.
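The pruning logic of this traversal can be sketched as follows. This is a simplified sketch: the score-based heap ordering and the Skyline/NumFile handling are omitted, and the hypothetical node fields `mbr`, `keywords`, `children`, and `objects` stand in for the NIR-tree entries, InvFile, and child pointers.

```python
class Node:
    def __init__(self, mbr, keywords, children=None, objects=None):
        self.mbr = mbr                # (min_x, min_y, max_x, max_y)
        self.keywords = keywords      # keyword summary of the subtree (InvFile stand-in)
        self.children = children or []
        self.objects = objects or []  # spatial objects stored in a leaf node

def contains(mbr, loc):
    x, y = loc
    return mbr[0] <= x <= mbr[2] and mbr[1] <= y <= mbr[3]

def nir_search(root, q_loc, q_keywords, k):
    """Traverse the tree, expanding only branches whose MBR covers the
    query location and whose keyword summary covers the query keywords;
    all other branches are pruned without being visited."""
    results, frontier = [], [root]
    while frontier and len(results) < k:
        node = frontier.pop()
        if node.objects:              # leaf node: collect matching objects
            results.extend(o for o in node.objects
                           if q_keywords <= o['keywords'])
        else:                         # non-leaf node: prune non-matching branches
            for child in node.children:
                if contains(child.mbr, q_loc) and q_keywords <= child.keywords:
                    frontier.append(child)
    return results[:k]
```

A query for `{'cafe'}` at a location inside the hotel subtree's MBR never descends into the hotel leaf, because its keyword summary does not cover the query keywords.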
Algorithm 2 Query matching algorithm based on NIR-Tree: search(q.loc, q.keywords′, weights, α, β)
Input: the set of spatial objects D, spatial keyword query q(loc, keywords′, weights, α, β), index file NIR-Tree
Output: query result
1: result ← φ, heap ← φ /* maximum heap */
2: heap.add(root) /* root is the root node of the NIR-Tree */
3: while heap ≠ φ and result.size() < k do
4:   N ← heap.pop()
5:   if N is an object then /* N comes from a leaf node */
6:     result.add(N)
7:     compute S1, S2 and the integrated Score of N
8:   else /* N is a non-leaf node */
9:     for each entry E in N do
10:      if q.keywords′ ⊆ E.getKeywords() and q.loc in E.Rect then
11:        obtain the Skyline collection of the node
12:        compute S1, S2 and Score of E
13:        heap.add(E)
14: return result

Typicality measuring of spatial objects
In this paper, all objects in the spatial object set are regarded as a set of points in a high-dimensional space, so that the traditional probability density estimation method can be properly adapted to measure the typicality of a spatial object. The probability density estimation method based on the Gaussian kernel function is adopted, which can effectively estimate the probability density when the data distribution is unknown [54]. Given a set of spatial objects D = {o_1, o_2, ..., o_n}, the typicality of an object o ∈ D can be defined by the probability density function f(o) as follows:

f(o) = (1/n) · Σ_{i=1}^{n} G_h(d(o, o_i)),  with  G_h(d) = (1/(√(2π)·h)) · e^(−d²/(2h²)) (9)

where d(o, o_i) represents the integrated distance between objects o and o_i (calculated by Equation 8), G_h is the Gaussian kernel function (here h = 1.06·s·n^(−1/5), and s represents the standard deviation of the integrated distances between all spatial objects in D), and n represents the number of spatial objects in D. Figure 7 shows the spatial object (represented by the red star) with the highest typicality in the geographical location dimension. From the figure, it can be seen that the points distributed around the typical point are very dense, which means that the probability density of this point is the highest and the point is the most representative within the given set.
Fig. 7 The visualization of the most typical object in the set of spatial objects

For the candidate query result set, the method for selecting the exact top-k typical objects is to calculate the typicality of each object and then select the k objects with the highest typicality as the final result. The implementation of this method is shown in Algorithm 3.
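The exact selector of Algorithm 3 amounts to scoring every object with the kernel density of Equation 9 over a precomputed pairwise integrated-distance matrix and keeping the k highest. A sketch (the bandwidth h is passed in directly rather than derived via 1.06·s·n^(−1/5)):

```python
import math

def typicality(idx, dist, h):
    """Gaussian-kernel density estimate of object idx over the whole set
    (Equation 9): the average kernel contribution of every object."""
    n = len(dist)
    g = lambda d: math.exp(-d * d / (2 * h * h)) / (math.sqrt(2 * math.pi) * h)
    return sum(g(dist[idx][j]) for j in range(n)) / n

def exact_top_k(dist, k, h=1.0):
    """Exact top-k typical objects: score every object and keep the best k.
    This costs O(n^2) once the pairwise distance matrix dist is available."""
    n = len(dist)
    scores = [(typicality(i, dist, h), i) for i in range(n)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]
```

On a tight cluster with one far outlier, the most typical object is the one in the densest part of the cluster, never the outlier.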
The time complexity of Algorithm 3 is O(n(n−1)/2) ≈ O(n²), where n represents the number of objects in the candidate query result set. For a large dataset, however, the query response time of Algorithm 3 is unacceptable, so it is necessary to investigate an approximate selection algorithm that finds good approximate query results as quickly as possible.

Top-k approximate selection algorithm of typicality results
Because of the high time complexity of the exact algorithm for finding top-k typical objects, this paper proposes two approximate selection algorithms, the Tournament strategy-based and the Local neighborhood-based approximate selection algorithms, in order to reduce the time complexity while maintaining query accuracy.

Tournament strategy-based approximate selection
The basic idea of Tournament strategy-based approximate selection is to divide the set of spatial objects into several small groups, pick the locally most typical object from each group, and then iteratively select global typical objects from the local typical objects. Given a set of spatial objects D = {o_1, o_2, ..., o_n}, the implementation procedure is as follows: (i) Randomly divide the set D′ (initially D′ = D) into several small groups, each containing u objects (u is a small positive integer). For each group, the typicality of each spatial object is calculated within the group by using Equation 9, and the object with the highest typicality is selected (called the group typical object). The typical objects of all groups form a new set, while the other objects are removed from D′.
(ii) For the new set, the algorithm repeats step (i) until D′ contains only one object, which is then added as a candidate typical object to the candidate typical object collection C. Figure 8 shows the processing procedure of steps (i) and (ii), which is referred to as a candidate typical object selection procedure. It should be noted that the candidate typical object may not be the final global typical object.
(iii) The candidate typical object selection procedure is repeated v times, so the candidate typical object set C contains v objects. For each object of the set C, the algorithm computes its typicality over the whole of D, and finally only the one object with the highest typicality (called the approximate typical object) is retained; it is added to the approximate typical object set S as one of the final results and removed from D. Steps (i)∼(iii) above are referred to as one round of the approximate typical object selection process.
(iv) By repeating steps (i)∼(iii) for k rounds, we obtain an approximate typical object set S containing k approximate typical objects that are close to the exact typical objects.
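Steps (i)∼(iv) can be sketched as follows. The defaults `u`, `v`, `h`, and the fixed random seed are illustrative choices, not the paper's tuned settings; `dist` is a precomputed pairwise integrated-distance matrix over object indices.

```python
import math, random

def kernel(d, h=1.0):
    return math.exp(-d * d / (2 * h * h)) / (math.sqrt(2 * math.pi) * h)

def most_typical(group, dist, h=1.0):
    """Object of `group` with the highest typicality measured within the group."""
    return max(group, key=lambda o: sum(kernel(dist[o][p], h) for p in group))

def tournament_top_k(objects, dist, k, u=10, v=3, h=1.0, seed=0):
    """Tournament-based approximate top-k selection: repeatedly reduce
    random groups of size u to their group-typical object until one
    candidate remains; run v such passes per round, re-score the v
    candidates over the remaining set, and keep the best as a winner."""
    rng = random.Random(seed)
    remaining, winners = list(objects), []
    for _ in range(k):
        candidates = []
        for _ in range(v):                      # v candidate-selection passes
            pool = list(remaining)
            while len(pool) > 1:
                rng.shuffle(pool)
                groups = [pool[i:i + u] for i in range(0, len(pool), u)]
                pool = [most_typical(g, dist, h) for g in groups]
            candidates.append(pool[0])
        best = max(candidates,                  # re-score candidates globally
                   key=lambda o: sum(kernel(dist[o][p], h) for p in remaining))
        winners.append(best)
        remaining.remove(best)
    return winners
```

Each round touches only the group members plus one global re-scoring pass per candidate, which is what makes the scheme cheaper than the exact O(n²) scan.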

Local neighborhood-based approximate selection
Following the top-k approximate selection method of representative tuples proposed in [55] for relational data, this paper presents a local neighborhood-based top-k approximate selection method for typical objects. The basic idea is as follows.
We assume that each object in the set D is a point in an m-dimensional space. For two objects a and c in D, according to Equation 9, the contribution of object c to the probability density of a is (1/n) · (1/(√(2π)·h)) · e^(−d(a,c)²/(2h²)), where n represents the number of objects contained in D. The contribution of c to a decays exponentially with the distance between c and a; therefore, the farther c is from a, the smaller its contribution to the probability density of a. For three objects a, b, and c in D, the triangle inequality gives |d(a, c) − d(b, c)| ≤ d(a, b), so d(a, c) ≈ d(b, c) holds if d(a, c) ≫ d(a, b). Furthermore, objects that are far away from a and b make almost the same, and very small, contributions to the probability densities of a and b. Thus, given a spatial object set D and one of its subsets S ⊆ D (where the integrated distances between objects in S are relatively small), the global probability densities (i.e., typicality degrees) in D of the objects in S can be approximately calculated by using the local neighborhood of S. Definition 2 (Local Neighborhood). Given an object set D, one of its subsets S ⊆ D, and a neighborhood threshold σ, the local neighborhood of S in D is

N(S, D, σ) = {o ∈ D | ∃ o′ ∈ S, d(o, o′) ≤ σ}

that is, the set N is composed of the objects in D whose distance to at least one object in S does not exceed the threshold σ.
Based on the local neighborhood, the local typicality of an object in S can be calculated within the local neighborhood of S. For an object o in S, its local typicality is

T_Local(o, S, D, σ) = (1/n) · Σ_{o_i ∈ N(S, D, σ)} G_h(d(o, o_i))

where G_h(d) = (1/(√(2π)·h)) · e^(−d²/(2h²)) denotes the Gaussian kernel used in Equation 9. According to the above ideas, the local typicality of a spatial object can be used to approximate its global typicality. The error range between the local typicality and the global typicality of a spatial object calculated on the basis of the local neighborhood is discussed below. Theorem 1. Given a set D, its subset S ⊆ D, and a neighborhood threshold σ, assume õ = arg max_{o_i ∈ S} {T_Local(o_i, S, D, σ)} is the object with the maximum local typicality in S and o = arg max_{o_j ∈ S} {T(o_j, D)} is the object with the maximum global typicality in S; then we have

T(o, D) − T(õ, D) ≤ ((n − |N(S, D, σ)|)/n) · G_h(σ)

Furthermore, for any object p ∈ S, we have

0 ≤ T(p, D) − T_Local(p, S, D, σ) ≤ ((n − |N(S, D, σ)|)/n) · G_h(σ)

The proof of the theorem is given below.
Proof. For any object p ∈ S, where S ⊆ D, every object outside N(S, D, σ) is at distance greater than σ from p, so by Equation 9 its kernel contribution to T(p, D) is less than G_h(σ)/n. Summing over the at most n − |N(S, D, σ)| such objects, and noting that all kernel contributions are non-negative, bounds the gap between the global and local typicality of p. Applying this bound to o and õ, and using the fact that õ maximizes the local typicality in S (so T_Local(õ, S, D, σ) ≥ T_Local(o, S, D, σ)), we obtain T(o, D) − T(õ, D) ≤ T(o, D) − T_Local(o, S, D, σ), which does not exceed the same bound. This completes the proof of Theorem 1.
Given a neighborhood threshold σ, for any object p ∈ D, we first compute the σ-local neighborhood of {p}, i.e., N({p}, D, σ), and the local typicality T_Local(p, {p}, D, σ). Then, we use the local typicality of p to approximate its global typicality degree T(p). On this basis, the k objects with the highest local typicality are selected as the approximate top-k global typical objects. According to Theorem 1, for the spatial object set D, the error between the total typicality of the top-k approximate result objects obtained by the local neighborhood algorithm and the total typicality of the exact top-k result objects obeys Inequality 12.
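This per-object procedure can be sketched as follows, with a precomputed pairwise integrated-distance matrix and a naive linear scan for the σ-neighborhood (the VP-tree acceleration is a separate concern):

```python
import math

def kernel(d, h=1.0):
    return math.exp(-d * d / (2 * h * h)) / (math.sqrt(2 * math.pi) * h)

def local_top_k(dist, k, sigma, h=1.0):
    """Local-neighborhood approximate top-k selection: score each object
    only against the objects within distance sigma of it (its sigma-local
    neighborhood), still normalizing by n as in Equation 9."""
    n = len(dist)
    scores = []
    for i in range(n):
        neigh = [j for j in range(n) if dist[i][j] <= sigma]
        scores.append((sum(kernel(dist[i][j], h) for j in neigh) / n, i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]
```

When sigma is at least the largest pairwise distance, every neighborhood covers the whole set and the result coincides with the exact selector, which is exactly the degenerate case discussed in the text.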
It should be pointed out that calculating the σ-local neighborhood of every object in D is very time-consuming. To improve the processing efficiency, we take advantage of a VP-tree to accelerate the σ-local neighborhood search of a given object. The VP-tree is a binary spatial partition tree. Its basic idea is to select a certain spatial object as the vantage point, calculate the distances from the other objects to the vantage point, and divide the spatial object set according to a given distance threshold [56]. Firstly, following the idea from [56], candidate vantage points are selected by randomly sampling a group of objects from D. The remaining objects are then used to evaluate these candidates, and eventually the object that yields the most balanced VP-tree is selected as the vantage point. Next, the objects in D whose distances from the vantage point do not exceed the given threshold are placed in the left subtree, while the objects whose distances exceed the threshold are placed in the right subtree. The left and right subtrees are divided recursively until each node contains only one object. The time complexity for building the VP-tree is O(|D| log |D|). Figure 9 shows an example of building a VP-tree. Suppose we pick the point <5, 6> as the vantage point and calculate the Euclidean distances from the other four objects to it. The objects whose distances from the vantage point are less than the given threshold (assuming it is 4) go to the left subtree (which contains the points <4, 3> and <2, 4>), while the others go to the right subtree (which contains the points <1, 1> and <3, 2>).
Given an object in the spatial object set D, the process for searching the σ-neighborhood of the object with the VP-tree is as follows. It compares the integrated distance between each visited node of the VP-tree and the given object from top to bottom, descending only into the subtrees that may still contain objects within distance σ of the object. In the worst case, if the neighborhood threshold is set to the maximum integrated distance between all objects in D, the neighborhood of the given object contains all objects in D; the algorithm then degenerates into calculating the global typicality of objects as in the exact selection method, and its time complexity is O(n²). In the experimental part, we will discuss the impact of the neighborhood threshold on the error rate of the algorithm in detail.
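The construction and the σ-range search can be sketched as follows. This is a simplified sketch: the vantage point is chosen uniformly at random instead of by the sampling-and-balance evaluation of [56], the split threshold is the median distance, and the pruning relies on the triangle inequality.

```python
import random

def build_vp_tree(points, dist, rng=None):
    """Build a VP-tree: pick a vantage point, split the remaining points
    by the median distance to it, and recurse on both halves."""
    rng = rng or random.Random(0)
    if not points:
        return None
    pts = list(points)
    vp = pts.pop(rng.randrange(len(pts)))
    if not pts:
        return {'vp': vp, 'mu': 0.0, 'near': None, 'far': None}
    mu = sorted(dist(vp, p) for p in pts)[len(pts) // 2]   # median split threshold
    near = [p for p in pts if dist(vp, p) <= mu]
    far = [p for p in pts if dist(vp, p) > mu]
    return {'vp': vp, 'mu': mu,
            'near': build_vp_tree(near, dist, rng),
            'far': build_vp_tree(far, dist, rng)}

def range_search(node, q, sigma, dist, out):
    """Collect every point within distance sigma of q; the triangle
    inequality prunes subtrees that cannot contain such a point."""
    if node is None:
        return
    d = dist(node['vp'], q)
    if d <= sigma:
        out.append(node['vp'])
    if d - sigma <= node['mu']:      # the near subtree may still hold hits
        range_search(node['near'], q, sigma, dist, out)
    if d + sigma > node['mu']:       # the far subtree may still hold hits
        range_search(node['far'], q, sigma, dist, out)
```

The pruning is exact: a point in the near subtree has distance at most mu from the vantage point, so if d − σ > mu the triangle inequality rules out every hit there, and symmetrically for the far subtree.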

Experimental settings
The experiments run on a computer with the Windows 10 operating system, an Intel i5 2.30-GHz CPU, and 8 GB of RAM. All algorithms were implemented in Python. The Yelp and Foursquare datasets are used to evaluate the performance of our methods. For the Yelp dataset, 53,516 spatial objects with longitude between -115.0 and -110.9 and latitude between 32.3 and 35.6 are extracted as the test dataset; each spatial object contains location information, numeric information, description text, and user comments. The description text includes name, city, categories, postal code, facilities, and other attributes. The numeric attribute information is composed of 5 random numbers between 0 and 1. The comment text of each spatial object is composed of its latest 50 comments, and the combined comment text does not exceed 200 keywords. For the Foursquare dataset, 100,285 spatial objects with longitude between -175.2 and 178.6 and latitude between 36.5 and 40.6 are extracted as the test dataset; each spatial object contains location information, numeric information, and description text. The description text includes 5 attributes: name, address, city, state, and categories. The numeric information includes 3 attributes: check-in number, checked-in user number, and ratings. After the above processing, the integrated distance between spatial objects is measured in the dimensions of geographical location, description text, user comments, and numeric attributes. The statistics of the experimental datasets are shown in Table 5; among them, the average number of keywords in the description text of each spatial object is 7 for Yelp and 6 for Foursquare, and the upper bound on the number of keywords in the comment text of each spatial object is 200 for Yelp (not applicable for Foursquare).

Effect of textual similarity measuring method for spatial objects
This experiment aims to test the effect of the Word2Vec-CNN-based comment text similarity measuring method. The parameter settings of Word2Vec-CNN are as follows: every keyword in a comment is trained into a 128-dimensional word vector, and the comment text of each spatial object is padded to the same length. In the CNN model, 64 convolution kernels of each height in {3, 4, 5} are used for the convolution operation, the sliding step is 1, the activation function is ReLU, and the Dropout rate is set to 0.2. We adopted a user survey strategy to conduct the experiment. From the Yelp dataset, we randomly selected 31 spatial objects together with their comment texts, and one of the 31 comment texts was picked as the test text. The semantic similarities between the test text and the other 30 comment texts were calculated by TF-IDF, LDA (Latent Dirichlet Allocation), and Word2Vec-CNN, respectively, and for each method the top 10 comment texts with the highest similarity to the test text were selected. After this, we invited three classes of undergraduates (93 students in total) to participate in the survey. They were asked to mark, among the 30 comment texts, the top 10 most similar to the test text. To ensure marking quality, we asked the students to measure the similarities mainly considering the following two aspects: i) whether there are duplicate keywords between the test text and a comment text, and ii) whether the keywords in the test text and a comment text are semantically related. On this basis, we evaluate the overlap between the results obtained by the three text semantic similarity calculation methods and the user-marked results. Figure 10 shows the average overlap rate between the top 10 comment texts marked by the students and the results returned by the three text semantic similarity measuring methods.
The experimental results showed that the average overlap rates between the user-marked results and the TF-IDF, LDA, and Word2Vec-CNN algorithms are 69.9%, 76.3%, and 87.6%, respectively. Therefore, the Word2Vec-CNN method proposed in this paper achieves high user satisfaction, and its similarity calculation results agree better with human measuring criteria.
To further test the correlation between the comment text semantic similarity measuring method and human cognition, 300 pairs of comment texts were randomly picked from the comment texts associated with spatial objects, and the similarities of these 300 pairs were marked by students. On this basis, the correlation coefficients between the user-marked similarity values and the similarity values obtained by TF-IDF, LDA, and Word2Vec-CNN were calculated, respectively. The correlation coefficient is computed as follows,

Correlation(x, y) = Cov(x, y) / √(D(x) · D(y))

where Cov(x, y) = E((x − E(x))(y − E(y))) represents the covariance of the two random variables (i.e., the similarities marked by users and the similarities calculated by an algorithm), E(x) and E(y) represent their expectations, and D(x) and D(y) represent their variances, respectively. The larger the value of Correlation, the stronger the relationship between the two kinds of similarities. Table 6 gives the correlation coefficients between the user-marked similarities of comment texts and the similarities calculated by each algorithm when taking {50, 100, 150, 200, 250, 300} pairs of comment texts, respectively. It can be seen from Table 6 that the correlation coefficient between the similarities calculated by Word2Vec-CNN and the user-marked similarities is the highest, which further demonstrates the rationality of the comment text similarity measuring method.
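The coefficient above is the standard Pearson correlation; a minimal sketch over two lists of similarity scores:

```python
import math

def correlation(x, y):
    """Pearson correlation between user-marked similarities x and
    algorithm-computed similarities y: Cov(x, y) / sqrt(D(x) * D(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    dx = sum((a - mx) ** 2 for a in x) / n
    dy = sum((b - my) ** 2 for b in y) / n
    return cov / math.sqrt(dx * dy)
```

Perfectly agreeing rankings give a coefficient of 1, reversed rankings give -1, and partial agreement falls strictly in between.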

Experiments on the effect of typicality measuring of spatial objects and top-k approximate selection algorithm
(1) Comparison on the typicality of the centroid, mean, and typical points. The aim of this experiment is to test the typicality differences among the centroid point, the mean point, and the typical point on the same dataset. In this experiment, we select datasets containing {500, 1000, 5000, 10000, 15000, 20000, 25000, 30000} objects from the Yelp and Foursquare datasets, respectively. For each dataset, we first find the centroid point and the mean point, and then their typicality degrees are calculated by using Equation 9. We also compute the typicality degrees of all spatial objects and find the most typical point in each dataset. Tables 7 and 8 show the comparison of the typicality degrees of the centroid point, mean point, and typical point on different sizes of the Yelp and Foursquare datasets, respectively. It can be seen from Tables 7 and 8 that the typical point, mean point, and centroid point are not the same objects on any of the Yelp and Foursquare datasets.

It can be seen from Figure 12 that on the Yelp and Foursquare datasets, the error rate of the algorithm decreases slightly with the increase of the value u, which demonstrates that the algorithm is not sensitive to the parameter u. However, as the value of u increases, the execution time of the algorithm increases markedly. Accordingly, we set the number u of objects in each group of the Tournament strategy-based algorithm to 10 (i.e., u = 10) in the follow-up experiments.
(2) The impact of the neighborhood threshold σ on the error rate of the local neighborhood-based algorithm. The aim of this experiment is to test the error rate of the algorithm and the average number of objects contained in the neighborhoods of the result objects when the neighborhood threshold σ takes different values. We first randomly selected three groups of 10000 objects from the Yelp and Foursquare datasets as the test datasets. The number k of results is set to 10. The average error rate is computed over the three groups of test datasets. Tables 9 and 10 show, for the Yelp and Foursquare datasets respectively, the error rate of the algorithm under different neighborhood thresholds σ, together with the average number of objects contained in the neighborhoods of the top-10 result objects. It can be seen from Tables 9 and 10 that, for both datasets, as the local neighborhood threshold σ increases, the average number of objects in the local neighborhoods of the top-10 result objects gradually increases, while the error rate first declines sharply and then stabilizes at a low level. By further observing the relationship between the average neighborhood size and the error rate, it can be found that the error rate decreases the most when the number of objects in the neighborhood increases from 3 to 25 on the Yelp dataset (respectively from 4 to 23 on the Foursquare dataset), which demonstrates that the closer an object's integrated distance to the given object, the greater its impact on the typicality degree, consistent with the idea of Equation 9.
When the neighborhood thresholds are set to 0.5h and 3.2h on the Yelp and Foursquare datasets respectively, the error rates of the algorithm reach their lowest values (2.76% and 0.82%, respectively), and the average numbers of objects in the corresponding neighborhoods are 71 and 118, respectively. However, as the neighborhood threshold continues to increase, the error rate fluctuates slightly but remains at a very low level. The reason is that a larger neighborhood threshold increases the number of objects in the neighborhood, and some objects that are not close to the given object may also join in, bringing some noise to the typicality degree calculation of the given object. In real applications, therefore, a larger neighborhood threshold does not imply better performance, while it does lead to a higher execution cost of the algorithm. It should further be pointed out that, due to the different data distributions of different datasets, the values of h are usually different. Thus, it is necessary to test the neighborhood threshold over a wide numeric interval first and then choose an appropriate sub-interval for a finer search so that the optimal neighborhood threshold can be determined. In the following experiments, the neighborhood thresholds on the Yelp and Foursquare datasets are set to 0.5h and 3.2h, respectively.
(3) The impact of the result number k on the error rate of each algorithm. The aim of this experiment is to test the error rates of the Tournament strategy-based and the local neighborhood-based algorithms when the number of results k is {5, 10, 15, 20, 25, 30, 35, 40, 45, 50}. We first randomly selected three groups of 10000 objects from the Yelp and Foursquare datasets as the test datasets. The other parameters of each algorithm are the same as above. The average error rate of each algorithm is computed over the three groups of test datasets. Figure 13 shows the error rates of the two algorithms under different values of k on the Yelp and Foursquare datasets. As can be seen from Figure 13, the error rate of the local neighborhood-based algorithm is clearly lower than that of the Tournament strategy-based algorithm on both datasets. The error rate of the Tournament strategy-based algorithm is relatively low on both datasets when k = {5, 10}. Its error rate fluctuates as k increases, but the change is not substantial, which indicates no clear correlation with the value of k. This is because the essence of the algorithm is to repeatedly execute k rounds on the dataset to obtain the top-k results, and each round selects the object with the highest typicality degree by means of the Tournament strategy, so the algorithm is insensitive to the parameter k. In contrast, the error rate of the local neighborhood-based algorithm increases gradually with k. The reason is that as more typical objects are requested, more approximate typical objects must be selected, so the overlap between the typical object set obtained by the approximate algorithm and that obtained by the exact algorithm gradually decreases, and the error rate gradually increases.
(4) The impact of dataset size on the error rate of each algorithm. The aim of this experiment is to test the error rate of each algorithm when the size of the dataset changes. In the experiment, we randomly selected datasets containing {5000, 10000, 15000, 20000, 25000, 30000} objects from Yelp and Foursquare, respectively. The parameters of each algorithm are the same as above. On this basis, we compute the top-10 typical objects by using the two approximation algorithms and their corresponding error rates (the experimental results are shown in Figure 14). As can be seen from Figure 14, the error rate of the Tournament strategy-based algorithm gradually increases with the dataset size. The reason is that, with the number of objects per group fixed, the gap between the group size and the dataset size grows as the dataset grows, so the deviation between the typical objects selected within groups and the global typical objects may also increase, making the error rate grow gradually. On the contrary, the error rate of the local neighborhood-based algorithm decreases as the dataset size increases. The reason is that, as the dataset grows, the data distribution becomes denser and the number of objects in each neighborhood increases under the same neighborhood threshold, so the typicality degree calculated in the local neighborhood gets closer to that in the global neighborhood; thus the error rate decreases with the increase of the dataset size.

Experiments on testing the performance of top-k approximate selection algorithm
(1) The impact of the neighborhood threshold σ on performance The aim of this experiment is to test the correlation between the neighborhood threshold and the execution efficiency of the local neighborhood-based algorithm.We first randomly selected three groups of 10000 objects from Yelp and Foursquare datasets as the test datasets.The average execution time is computed on the three groups of datasets under the same parameters.Figure 15 shows the execution time of selecting top-10 objects for the different neighborhood thresholds σ on Yelp and Foursquare datasets.As can be seen from Figure 15, with the increase of neighborhood threshold σ, the execution time of the local neighborhood-based algorithm gradually increases.This is because the number of objects in each object neighborhood would increase with the increase of neighborhood threshold, which leads to an increase in the time for calculating the typicality degree of each object, and consequently the execution time for selecting top-10 typical objects from the whole dataset would increase.
(2) The impact of the result number k on the execution efficiency of each algorithm. The aim of this experiment is to test how the execution times of the Tournament strategy-based algorithm and the local neighborhood-based algorithm vary with the result number k. We first randomly selected three groups of 10000 objects from the Yelp and Foursquare datasets as test datasets. The parameter values of the two algorithms are the same as those mentioned above. The execution time is averaged over the three groups of test datasets. Table 11 shows the execution time of each algorithm for k = {5, 10, 15, 20, 25} (in each column of Table 11, the former value is the average execution time on Yelp and the latter is that on Foursquare).
As can be seen from Table 11, the execution time of the local neighborhood-based algorithm is much less than that of the Tournament strategy-based algorithm. This is because, with the neighborhood threshold set to 0.5h and 3.2h on the Yelp and Foursquare datasets, respectively, the number of objects contained in each object's neighborhood is small, so the typicality degree of each object can be calculated quickly. For the Tournament strategy-based algorithm, the execution time increases markedly with k, because the algorithm must run an additional round of tournament computation over the whole dataset for each increase in k. In contrast, the execution time of the local neighborhood-based algorithm barely changes with k: whatever the value of k, the algorithm has to calculate the typicality degree of every object in the dataset within its local neighborhood and then select the top-k result objects.

(3) The impact of dataset size on the efficiency of each algorithm. The aim of this experiment is to test how the execution times of the Tournament strategy-based algorithm and the local neighborhood-based algorithm vary with the dataset size. In the experiment, we randomly selected datasets containing {5000, 10000, 15000, 20000, 25000, 30000} objects from Yelp and Foursquare as test datasets. The number of results is set to k=10. The other parameters of the two algorithms are the same as those mentioned above. Table 12 shows the execution times of the two algorithms for the different sizes of the Yelp and Foursquare datasets (in each column, the former value is on Yelp and the latter is on Foursquare).
As can be seen from Table 12, with the increase of the dataset size, the execution time of the Tournament strategy-based algorithm increases linearly. This is because a larger dataset increases the computation time of each round of the tournament. The execution time of the local neighborhood-based algorithm also increases linearly with the dataset size, because a larger dataset increases the number of objects contained in each object's neighborhood, and the number of objects whose typicality degree must be calculated is also proportional to the dataset size.
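For concreteness, the Tournament strategy analyzed above can be sketched as follows. This is a hedged illustration, not the paper's exact algorithm: the group-wise Gaussian-kernel typicality, the 1-D placeholder distance `dist`, and the removal of each winner before the next round are assumptions made for the example.

```python
import math

def dist(a, b):
    """Placeholder synthetic distance: absolute difference of 1-D values
    (the paper uses a multi-dimensional integrated distance)."""
    return abs(a - b)

def group_typicality(group, obj, h=1.0):
    """Typicality of obj estimated only from the objects in its group,
    via a Gaussian kernel (bandwidth h is an assumption)."""
    if len(group) <= 1:
        return 0.0
    s = sum(math.exp(-dist(obj, o) ** 2 / (2 * h * h))
            for o in group if o is not obj)
    return s / (len(group) - 1)

def one_tournament(objects, v):
    """One tournament: split the pool into groups of size v, keep each
    group's most typical object, and repeat until one winner remains."""
    pool = list(objects)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool), v):
            group = pool[i:i + v]
            winners.append(max(group, key=lambda o: group_typicality(group, o)))
        pool = winners
    return pool[0]

def tournament_topk(objects, k, v):
    """Select k typical objects by running k tournaments, removing each
    winner before the next round -- one extra full pass per unit of k,
    matching the roughly linear growth with k seen in Table 11."""
    remaining = list(objects)
    result = []
    for _ in range(min(k, len(remaining))):
        winner = one_tournament(remaining, v)
        result.append(winner)
        remaining.remove(winner)
    return result
```

Because each winner is chosen from typicality computed inside small groups, a fixed group size v on a growing dataset widens the gap between group-local and global typicality, which is consistent with the error-rate trend in Figure 14.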

Conclusion
To deal with the problem of typicality analysis for query results over spatio-textual data, this paper proposed an approach for typicality evaluation and top-k approximate selection of spatial keyword query results. To expeditiously obtain the candidate query results, a hybrid index structure that supports spatio-text-numeric multi-dimensional matching is designed. To measure the typicality degree of query result objects, an integrated distance measuring method for spatial objects over the dimensions of geographical location, description text, comment text, and numeric attributes is proposed. Based on this integrated distance, a probability density estimation-based method for measuring the typicality degree of spatial objects is proposed. To speed up the retrieval efficiency and precision of typical objects, top-k approximate selection algorithms based on the Tournament strategy and the local neighborhood are designed, respectively. Experimental results demonstrated that the proposed method for measuring the semantic similarity of comment text has high accuracy, and that the local neighborhood-based top-k approximate selection algorithm achieves a low error rate and fast execution. It should be noted that although the local neighborhood-based algorithm is superior to the Tournament strategy-based algorithm in result error rate and execution efficiency, the Tournament strategy-based algorithm remains stable even when the data distribution is unknown or imbalanced, whereas the error rate of the local neighborhood-based algorithm is strongly affected by changes of the neighborhood threshold. The typicality measuring method proposed in this paper can be applied to typicality analysis in different application scenarios by combining it with clustering and other methods. For example, the target set can be further divided into several subsets by clustering or querying, and the typical objects in each subset can then be obtained. The typical objects of the multiple subsets constitute a diverse typical object set, so that the query results can represent the main characteristics of the different objects in the dataset. Therefore, diverse typicality analysis and its approximate selection algorithms will be investigated in future work.

Fig. 1
Fig. 1 Differences between center point, mean point and typical point

Fig. 2
Fig. 2 Solution of top-k approximate selection for typicality query results over spatio-textual data

Fig. 4
Fig. 4 The Word2Vec-CNN-based model for feature extraction of comment text associated with spatial objects

Algorithm 1 Skyline set computation algorithm for numeric attribute tuples in spatial objects
Input: Numeric attribute tuple list list_num
Output: Skyline collection of list_num
1: the tuples in list_num are sorted by the value of the first numeric attribute in ascending order
2: for each tuple in list_num do
3:    let i = 0
4:    while i < tuple.size() do
5:       compare tuple[i] with tuple1[i] of the other tuples tuple1 in list_num
6:       if tuple[i] <= tuple1[i] then
7:          list_num.remove(tuple1) /* remove tuple1 from list_num */
   ...
   list_num.remove(tuple)
11: return the Skyline of list_num

Algorithm 3
Algorithm for selecting the exact top-k typical results
Input: Candidate query result set R = {o_1, o_2, ..., o_n}, positive integer k
Output: k objects with the highest typicality in R
1: for all objects o ∈ R do
2:    let T(o, R) = 0
3: for i = 1 to n do
4:
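The exact selection that Algorithm 3 begins can be sketched end to end as follows. This is a hedged illustration: it assumes the typicality degree T(o, R) is the Gaussian kernel density estimate mentioned in the abstract, and the bandwidth `h` and distance function are assumptions for the example.

```python
import math

def exact_topk_typical(objects, k, dist, h=1.0):
    """Exact top-k typical object selection in the spirit of Algorithm 3:
    compute every object's typicality over the full candidate set R via
    a Gaussian kernel density estimate, then keep the k highest.

    dist: pairwise synthetic distance function (an assumption here).
    """
    n = len(objects)
    scores = []
    for o in objects:
        # T(o, R): average kernel contribution from all other objects
        t = sum(math.exp(-dist(o, p) ** 2 / (2 * h * h))
                for p in objects if p is not o) / max(n - 1, 1)
        scores.append((t, o))
    scores.sort(key=lambda x: -x[0])
    return [o for _, o in scores[:k]]
```

This exact method needs all O(n^2) pairwise distances, which is the cost that the Tournament strategy-based and local neighborhood-based approximations are designed to avoid.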

Fig. 8
Fig. 8 One-time Tournament strategy-based candidate typical object approximate selection process

Fig. 9
Fig. 9 An example of VP-Tree

Fig. 10
Fig. 10 Comparison of the average overlap rate between the user-marked results and the results returned by TF-IDF, LDA and Word2Vec-CNN methods

Fig. 11
Fig. 11 Error rate of Tournament strategy-based algorithm varied with the values of v on Yelp and Foursquare datasets

Fig. 13
Fig. 13 Error rate of two algorithms varied with different values of k on Yelp and Foursquare datasets

Fig. 14
Fig. 14 Error rate of two algorithms varied with different sizes of Yelp and Foursquare datasets

Fig. 15
Fig. 15 Execution time of local neighborhood-based algorithm varied with different neighborhood thresholds σ on Yelp and Foursquare datasets

Table 2
Hybrid index structures of spatio-textual data and their characteristics

Table 3
Location and description information of spatial objects

Table 4
Example of spatial objects

Table 5
Statistics for experiment datasets

Table 6
Correlation coefficients between similarities calculated by different similarity measuring methods and user-marked results

Table 7
Comparison on typicality of centroid point, mean point and typical point on different sizes of Yelp datasets

Table 8
Comparison on typicality of centroid point, mean point and typical point on different sizes of Foursquare datasets

Table 9
Error rate and average number of objects in the local neighborhood varied with different neighborhood threshold σ on Yelp

Table 10
Error rate and average number of objects in the local neighborhood varied with different neighborhood threshold σ on Foursquare

Table 11
Execution time of two algorithms for different values of k on Yelp and Foursquare datasets

Table 12
Execution time of two algorithms on different sizes of datasets