How to Personalize and Whether to Personalize? Candidate Documents Decide

Personalized search plays an important role in satisfying users' information needs owing to its ability to build user profiles based on users' search histories. Most existing personalized methods build dynamic user profiles by emphasizing query-related historical behaviors rather than treating each historical behavior equally. However, the ambiguity and short nature of queries sometimes make it difficult to understand the potential query intent exactly, and the query-centric user profiles built in these cases will be biased and inaccurate. In this work, we propose to leverage candidate documents, which contain richer information than the short query text, to help understand the query intent more accurately and thereby improve the quality of user profiles. Specifically, we intend to better understand the query intent through candidate documents, so that more relevant user behaviors can be selected from the history to build more accurate user profiles. Moreover, by analyzing the differences between candidate documents, we can better control the degree of personalization in the ranking of results. This controlled personalization is also expected to further improve the stability of personalized search, as blind personalization may harm the ranking results. We conduct extensive experiments on two datasets, and the results show that our model significantly outperforms competitive baselines, which confirms the benefit of utilizing candidate documents for personalized web search.


Introduction
Search engines play an important role in the process of obtaining information in our daily life. However, for the same query, existing search engines usually return the same results to different users, which hardly meets the needs of different people. For example, for the query "Apple", a computer enthusiast tends to seek information related to "Apple computer" while a farmer might prefer "Apple fruit". Personalized search has been proposed to cope with this problem by re-ranking candidate documents based on the user's interests. Traditional personalized search studies [1][2][3][4][5][6] mainly focused on extracting human-designed personalized features from users' query logs to predict users' intents. In recent years, with the rapid development of deep learning, many neural models [7][8][9][10][11] have been proposed that build user profiles with neural networks to improve personalization quality.
Existing personalized neural models mainly build user profiles by exploiting personalized signals, especially query-related search behaviors, from users' query logs. For example, HRNN [12] used the attention mechanism to highlight query-related historical behaviors to build dynamic user profiles. Along this line, we notice that the quality of such a "query-centric" user profile is highly dependent on the representation of the current query. However, due to the ambiguity and short nature of the query [13,14], it is often difficult to create an accurate query representation that reflects the potential intents of users, leading to an inaccurate and biased user profile. For example (see Figure 1(A)), a programmer is interested in the new programming language Go and issues the single-term query "go". Since the word "go" is quite general and its topic is very broad, it is hard to guarantee that its representation covers all related intents. If the programming-related subtopic is ignored in the representation, we have no way to find relevant user histories and build a correct profile.
To alleviate the above problem, we attempt to enhance the representation of the query with its retrieved candidate documents, thereby improving the quality of the user profile. Candidate documents under a query are often regarded as a summary of the intent corresponding to the query [15][16][17]. Compared with the short query text, they provide richer information for better understanding the potential query intents for personalization. Let us analyze this benefit from two aspects: how to personalize and whether to personalize. For the first aspect, similar to some query classification studies [18,19], candidate documents provide richer information on potential query intents to enhance the query representation, so that more comprehensive and useful behaviors can be extracted from user history logs (see Figure 1(B)). For the "go" example mentioned above, the information in the candidate documents indicates that "go" potentially has multiple intents: the Go programming language, the board game Go, and so on. Therefore, we can use this information to enrich the representation of the query "go". With the enriched query representation, more useful historical behaviors (especially programming-related ones) can be identified to build a more accurate user profile. For the second aspect, the semantic difference between candidate documents indicates whether personalization should be considered in document ranking. Intuitively, if candidate documents are semantically similar to one another or cover related topics (which often occurs when the query intent is very clear), there is no need to personalize the results. For example, the candidate documents of the query "go programming tutorial" may include "Go Programming Language Tutorials", "How to Learn Go Programming", and "Getting Started with Go Programming". Due to the small semantic differences between these documents, personalizing their ranking is unnecessary. Blind personalization may even have side effects [20,21] for such queries. This indicates that there should be a strong correlation between the level of personalization and the semantic differences between documents. Consequently, we intend to adjust the degree of personalization by measuring the semantic difference across candidate documents.
Although candidate documents can provide rich query intent information, blindly depending on all of them to improve the query representation may introduce the additional issue of topic distribution bias. For example, when multiple topics exist in the candidate documents, the numbers of documents corresponding to each topic may vary greatly. As a result, the query representation is dominated by the major subtopics with more documents and neglects the user's real interest in minor subtopics. To address this problem, we propose to select a diverse set from the candidate documents to improve the diversity of the query representation, so that it can cover as many potential subtopics as possible.
Specifically, we propose a Documents Enhanced Personalized Search framework (DEPS), which mines the semantic information of candidate documents to improve personalization from two aspects. For the first aspect, we use a diverse document set selected from all the candidate documents to enhance the representation of the current query based on a Transformer, so that the query representation covers the user's potential intents. Then, we use the enhanced query representation to highlight relevant historical behaviors through an attention mechanism to build the user profile. For the second aspect, we design a difference-aware self-attention mechanism that helps to measure the semantic difference between the candidate documents, and use the difference to control the weight of personalization in the final ranking score.
We conduct extensive experiments with two widely used datasets for personalization, namely the AOL dataset and a query log dataset from a commercial search engine. Experimental results show that our model outperforms all existing baseline models, and confirm the benefit of utilizing candidate documents for both refining user profiles and controlling the degree of personalization.
The main contributions of this paper are summarized as follows:
(1) We propose incorporating candidate documents to improve personalized search through the following two questions: how to personalize and whether to personalize.
(2) We utilize a diverse document set to enhance the query representation, so as to build a more accurate user profile.
(3) We propose dynamically adjusting the degree of personalization according to the semantic difference between the candidate documents.
The rest of the paper is organized as follows. We first summarize the previous studies related to our paper in Section 2 and introduce the proposed method in Section 3. Then, experimental settings are described in Section 4, and the results are analyzed in Section 5. Finally, the paper is concluded in Section 6.

Personalized Search
Due to the ability of personalized search to meet the personalized information needs of different users, various personalization-related studies have been conducted. Some traditional models used click-based features to calculate the relevance score of a candidate document. Dou et al. [20] proposed the P-Click model to predict a user's intent by counting the number of clicks under the same query. Teevan et al. [3] also extracted such click-based features from query logs to forecast users' future navigational behaviors. Apart from that, some studies [22,23] tried to apply topic-based features extracted from the documents to model users' interests. Some other studies used feature engineering to improve the quality of personalized search: they extracted click-based features, query entropy, and other features from the current query and the user's query log, and then used the learning-to-rank algorithm LambdaMART [24] to combine these features and train the ranking model.
In recent years, deep learning has been widely applied in information retrieval due to its powerful representation learning capability. For personalized search, it is usually applied to predict users' interests [8,[10][11][12][25][26][27]. Song et al. [2] used an adaptive ranking model to build dynamic user profiles. Li et al. [28] used the semantic features of in-session contexts to improve the ranking results. In addition, many studies apply various network structures in personalized search. Ge et al. [12] used hierarchical recurrent neural networks with the attention mechanism to model user interests. Ma et al. [25] proposed a fine-grained time-information-enhanced model based on LSTM to build a more accurate user profile. Zhou et al. [10] used the context of history to learn a better semantic representation of the current query. Deng et al. [29] applied a dual-feedback network that incorporates users' positive/negative behaviors to better understand the user's intent. These methods extensively used the attention mechanism to filter a user's historical behaviors based on the current query to build "query-centric" user profiles. Since the user history contains rich information, filtering historical behaviors through the current query has been proven to be an effective method. In this work, in addition to the current query, we further consider using its corresponding candidate documents for more accurate historical behavior selection. We believe that this method can help build more accurate and stable user profiles.

Pseudo-Relevance Feedback
Pseudo-Relevance Feedback is a technique used in the field of information retrieval to improve the results of a search query. Its basic idea is to use the top-ranked documents to update the query language model, so as to improve the ranking results. It has been applied in many retrieval models [30][31][32]. For example, Zhai and Lafferty [31] enhanced the original query model by extracting topic information from feedback documents. Ai et al. [32] proposed to use the top retrieved documents to learn a deep listwise context model that helps to re-rank the ranked list. In this paper, we also use the candidate documents as a kind of pseudo-relevance feedback to enhance the query representation. Different from many existing approaches that use all the candidate documents as feedback, we select a diverse set from the candidate documents to avoid the problem of topic distribution bias.

Modeling Interaction of Documents for Ranking
Recently, modeling the interaction of candidate documents has been proved to be effective for ranking in IR. Some studies [33,34] revealed that the inter-relationship between candidate documents helps model query-document relevance. Ai et al. [32,35] take multiple documents as the input of the scoring function and predict their ranking scores together. Besides, some researchers [36,37] tried to capture cross-document comparative information based on the self-attention mechanism to re-rank the documents. Qin et al. [38] proposed a supervised diversification framework that uses self-attention to model the interactions between all candidate documents globally. The success of these studies shows that the difference information between candidate documents helps improve the ranking quality. In this paper, we design a difference-aware self-attention mechanism to better capture the semantic differences between candidate documents for personalized search.

Proposed Method: DEPS
Personalized search mainly improves ranking results by modeling user profiles based on users' search logs. As we stated in Section 1, the shortness and ambiguity of the query make its representation fail to reflect the potential intents of users, leading to deviation of the user profile. Besides, personalization that is incompatible with the semantic differences between documents may degrade the ranking quality. In this paper, we propose to leverage the semantic information hidden in the candidate documents to address these two issues. Specifically, we use a diverse document set selected from the candidate documents to enhance the query representation. With the enhanced query representation, a more accurate user profile can be built based on the attention mechanism. Furthermore, we devise an attention-based method to measure the semantic difference between the documents and adjust the degree of personalization in the final ranking.

Table 1 Notations

q_i — user query at time step i
d+_{i,j} — the j-th clicked document under q_i
q — current user query
D — candidate document set of q
d_i — the i-th candidate document of q
D_div — diverse document set selected from D
w^p_i — personalized score weight of d_i
q_e — enhanced representation of q
U_e — enhanced user profile

To begin with, we formulate the problem with notations (listed in Table 1). The search history H records the user's historical behaviors, including query requests and corresponding click actions. We represent the user's search history as a sequence H = {(q_1, {d+_{1,j}}), . . ., (q_{t−1}, {d+_{t−1,j}})}, where t is the current time and d+_{i,j} refers to the j-th clicked document under query q_i. Given the current query q, we use D = {d_1, d_2, . . ., d_N} to represent the corresponding candidate documents retrieved by the search engine. Our task is to score each candidate document based on the current query and the user's search history. Different from previous studies in which the candidate documents are only used for similarity matching, we attempt to mine the semantic information and the relationships hidden in the candidate document list to improve the ranking results. In this paper, in addition to the current query and the user's search history, we also use the candidate documents as additional data to calculate the final personalized score. The final score of the i-th candidate document can be computed as:

score(d_i) = φ(Pscore(d_i | q, H, D), Ascore(d_i | q)),

where Pscore(d_i | q, H, D) represents the personalized score of the i-th document and Ascore(d_i | q) is the ad-hoc score between the query and the i-th document. The function φ(•) is a multilayer perceptron (MLP) using tanh(•) as the activation function. The structure of our model is shown in Figure 2. Next, we will introduce each part of our model in detail.
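To make the score combination above concrete, the following sketch shows how φ(•) can merge the two component scores through a one-hidden-layer MLP with tanh activation. All dimensions and weights here are illustrative placeholders, not the paper's trained parameters.

```python
import numpy as np

# phi(.): a one-hidden-layer MLP with tanh activation that combines the
# personalized score Pscore and the ad-hoc score Ascore into one scalar.
def phi(features, W1, b1, w2, b2):
    h = np.tanh(features @ W1 + b1)   # hidden layer with tanh activation
    return float(h @ w2 + b2)         # final ranking score

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # illustrative weights
w2, b2 = rng.normal(size=8), 0.0

pscore, ascore = 0.7, 0.4             # hypothetical component scores for d_i
final_score = phi(np.array([pscore, ascore]), W1, b1, w2, b2)
```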

How to Personalize: Documents-based User Intent Understanding
As we discussed in Section 1, some queries lack semantic information due to their shortness and ambiguity, which hinders us from understanding the potential query intents. To cope with this problem, we try to select a diverse document set from the candidate documents to enhance the potential topic coverage of the query representation.

Fig. 2 The architecture of DEPS. Given the current candidate documents, a diverse document set is selected from them to enhance the topic coverage of the current query based on a Transformer, so that a better user profile can be built with the enhanced query. Then, a difference-aware self-attention is designed to help measure the semantic difference between candidate documents and calculate a personalized score weight for each document. Finally, the ranking score is calculated with the assistance of several other features.
Then we use the enhanced query representation to filter historical behaviors based on the attention mechanism to better understand the user's intent. Specifically, we divide the whole process into four parts: (1) query/document encoding; (2) diverse documents selection; (3) documents enhanced query representation; (4) user profile building. We will introduce the details of each part in the following.

Query/Document Encoding
To get the embeddings of the current query and the corresponding candidate documents, we initialize a global word embedding matrix M ∈ R^{|V|×m}, where |V| is the vocabulary size and m is the dimension of the word vectors. We use this matrix to convert each word in the query and documents into vectors.
For the current time t, we use q^E and D^E = {d^E_1, . . ., d^E_N} to represent the embeddings of the current query and the corresponding candidate documents. Then we learn their context-aware representations with a Transformer [39] over the entire text, which are defined as:

q = Trm_sum(q^E),  d_i = Trm_sum(d^E_i),

where Trm_sum(•) denotes the sum of the outputs of the Transformer over all positions. The obtained context-aware representations of the candidate documents are denoted as D = {d_1, . . ., d_N}.
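The sum-pooled encoding can be sketched as below. This is a minimal single-head, single-layer self-attention stand-in for the paper's Transformer encoder; the projection matrices and dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # single-head scaled dot-product self-attention over token embeddings X
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return softmax(scores) @ V

def trm_sum(X, Wq, Wk, Wv):
    # Trm_sum: contextualize the tokens, then sum over all positions
    return self_attention(X, Wq, Wk, Wv).sum(axis=0)

rng = np.random.default_rng(0)
m = 100                                   # word-vector dimension
Wq, Wk, Wv = (rng.normal(size=(m, m)) * 0.1 for _ in range(3))

query_tokens = rng.normal(size=(3, m))    # embeddings of a 3-word query
q = trm_sum(query_tokens, Wq, Wk, Wv)     # context-aware query representation
```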

Diverse Documents Selection
As we stated in Section 1, enhancing the query representation with all candidate documents could make it ignore the minor subtopics in which the user's real intent may lie. For example, for the query "apple", its top result list contains significantly more results about the subtopic "Apple company" than about the subtopic "apple fruit". If each result contributes equally to enhancing the query representation, the resulting query representation will be overwhelmed by "Apple company" because of the extreme imbalance in result numbers, and it will hardly capture the topic "apple fruit" accurately. When a fruit farmer issues the query "apple", the biased query representation will likely fail to identify relevant information about "apple fruit" in the user's search history, and it will yield an inaccurate user profile. Therefore, it is necessary to reduce the redundancy of the candidate documents to balance the number of documents corresponding to different subtopics, so that the enhanced query representation can cover each subtopic more accurately. This benefits different users with diverse information needs. In this section, we select diverse documents from the candidate documents based on the Maximal Marginal Relevance (MMR) algorithm [40].
MMR is a greedy algorithm that strives to reduce document redundancy while maintaining query-document relevance in the search result diversification task. Its selection of the next document can be formulated as:

d_s = argmax_{d_i ∈ D\D_div} [ λ · Sim(d_i, q) − (1 − λ) · max_{d_j ∈ D_div} Sim(d_i, d_j) ],

where D is the candidate document set; D_div is the diverse document set already selected from D; D\D_div represents the set difference, i.e., the documents that have not yet been selected from D; Sim(•, •) is the similarity metric used in matching and is implemented as cosine similarity in this work; and λ adjusts the trade-off between query-document relevance and document diversity in the selected set. To ensure the diversity of the selected document set, we set a document similarity threshold θ: when the minimum similarity between the selected documents D_div and the unselected documents D\D_div exceeds θ, we stop the document selection. We tune the parameter λ and the threshold θ by grid search and finally set them to 0.3 and 0.75 respectively in this paper. The overall process of our diverse documents selection is summarized in Algorithm 1.
After the selection, we obtain the diverse document set D_div, which will be used to enhance the query representation in the next stage.
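The greedy selection described above can be sketched as follows, using cosine similarity and the paper's reported λ = 0.3 and θ = 0.75. The vector dimensions and the stopping check's exact placement are illustrative assumptions.

```python
import numpy as np

def cos(a, b):
    # cosine similarity Sim(., .) between two document/query vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def mmr_select(q, docs, lam=0.3, theta=0.75):
    """Greedy MMR diverse-document selection (Algorithm 1 sketch): pick the
    document trading off query relevance against redundancy with the already
    selected set; stop once even the best remaining candidate is too similar
    (> theta) to the selected documents."""
    selected = [0]                               # initialize with d_1
    remaining = list(range(1, len(docs)))
    while remaining:
        def mmr_score(i):
            redundancy = max(cos(docs[i], docs[j]) for j in selected)
            return lam * cos(docs[i], q) - (1 - lam) * redundancy
        s = max(remaining, key=mmr_score)
        if max(cos(docs[s], docs[j]) for j in selected) > theta:
            break                                # only near-duplicates remain
        selected.append(s)
        remaining.remove(s)
    return selected
```

For instance, with a query vector [1, 0] and candidates [1, 0], [1, 0.01], [0, 1], the near-duplicate second candidate is skipped and the orthogonal third one is kept.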

Documents Enhanced Query Representation
Algorithm 1 Diverse documents selection based on MMR
Input: candidate document set D; document similarity threshold θ.
Output: diverse document set D_div.
1: D_div ← {d_1} // initialize D_div with d_1
2: while D\D_div ≠ ∅ do
3:   select d_s from D\D_div by the MMR criterion
4:   if max_{d_j ∈ D_div} Sim(d_s, d_j) > θ then
5:     break
6:   end if
7:   D_div ← D_div ∪ {d_s}
8: end while
9: return D_div

As we have discussed in Section 1, the ambiguity and short nature of the query hinder it from accurately representing users' potential intents. In this part, we use the diverse document set to enhance the query representation based on a Transformer, so that the query representation can more accurately reflect the potential intents of users. We feed the query representation q and the diverse document set D_div = {d^div_1, . . ., d^div_n} into the Transformer.
The enhanced representation is computed as

q_e = Trm_f(q, d^div_1, . . ., d^div_n),

where Trm_f(•) means taking only the output at the first position; q_e is the enhanced query representation and will be used to build the user profile in the following section.

User Profile Building
Now that we have obtained the enhanced query representation, we attempt to use it to filter the historical search behaviors to build an accurate user profile. Ge et al. [12] revealed that different search behaviors can contribute differently to building user profiles and that their weights should be determined by their relevance to the current query. In this part, we design a user profile building module based on a query-aware attention mechanism. Formally, for each query in the search history, we concatenate the word embeddings of the query and the corresponding clicked documents, with the "[SEP]" token as the separator. Then we feed them into the Transformer and sum the outputs to get the representation of the i-th historical behavior, denoted as h_i.
Formally, h_i = Trm_sum(q_i, [SEP], d+_{i,1}, . . ., d+_{i,C}), where C refers to the number of clicked documents under q_i. Then we calculate attention weights {w_1, . . ., w_{t−1}} for the historical behavior vectors {h_1, . . ., h_{t−1}} according to their relevance to the enhanced query q_e, and compute the enhanced user profile U_e as a weighted linear combination:

U_e = Σ_{i=1}^{t−1} w_i · h_i.

Compared with the enhanced query representation q_e, we believe that the original query representation q still contains some useful information. Thus, we also use q to build an original user profile U in the same way as U_e, and the two user profiles will contribute to computing the final personalized score in Section 3.3.
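The query-aware weighting can be sketched as below. The dot-product scoring function is an assumption for illustration; the paper's exact attention form may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def build_profile(q_e, history):
    """Query-aware attention over historical behavior vectors: each h_i is
    scored against the (enhanced) query q_e, and the profile is the weighted
    linear combination U_e = sum_i w_i * h_i."""
    w = softmax(np.array([q_e @ h for h in history]))  # weights w_1..w_{t-1}
    return sum(w_i * h for w_i, h in zip(w, history))

q_e = np.array([1.0, 0.0])                 # enhanced query representation
history = [np.array([1.0, 0.0]),           # behavior matching the query intent
           np.array([0.0, 1.0])]           # unrelated behavior
U_e = build_profile(q_e, history)          # profile leans toward the first
```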

Whether to Personalize: Personalized Weight Modeling
As we stated in Section 1, the degree of personalization should be adaptive to the semantic difference between the candidate documents. In this section, for each candidate document, we intend to measure its semantic difference from the other candidate documents, thereby adjusting the effect of personalization on its ranking. Specifically, we design a difference-aware self-attention mechanism (denoted as DifAttn) that takes the representation of each document as input and aggregates the semantic representations of the other documents based on Euclidean distance (see Figure 3). Then we compute a personalized weight based on the semantic difference between each document and the corresponding aggregated semantics. The details of the implementation are as follows:

Difference-aware Self-attention Mechanism
The traditional self-attention mechanism mainly propagates similar information across the sequence through the dot product. In this part, to capture the documents with greater semantic difference from each document, we replace the dot product with a Euclidean distance based function f(•, •):

DifAttn(Q, K, V) = softmax(f(Q, K)) V,

where Q, K, and V denote the query, key, and value matrices of the attention mechanism. Following previous studies [36,39], we use multi-head self-attention (denoted as MS) in order to learn multiple aspects of different documents. MS first projects the inputs into h subspaces with dimension Ê = E/h and employs DifAttn(•) in each head; the final output is obtained by concatenating the heads.
The result is D_agg = {d^agg_1, . . ., d^agg_N}, the aggregated semantic representations, where D = {d_1, . . ., d_N} are the representations of all the candidate documents obtained in Section 3.1.1, and the projection matrices of each head, W^Q_i, W^K_i, W^V_i, and W^O, are parameters learned during training. In order to remove each document's attention to itself when computing semantic differences, we mask out its attention value (setting it to −∞) in the input of the softmax.

Personalized Score Weight
As we discussed in Section 1, the more significant the semantic difference between documents, the greater the degree of personalization should be. In this part, we compute a weight for each document to adjust its final personalized score. Formally, for the i-th candidate document, its personalized weight w^p_i is calculated from the Euclidean distance between d_i and d^agg_i, and we use sigmoid(•) to map it into the interval (0, 1):

w^p_i = sigmoid(||d_i − d^agg_i||_2).
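The two steps above can be sketched together as follows: a single-head DifAttn with identity projections (a simplification of the paper's learned multi-head version), followed by the sigmoid-mapped personalized weight.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dif_attn(D):
    """Single-head difference-aware self-attention sketch: attention scores
    are euclidean distances, so each document aggregates most strongly from
    the documents it differs from; self-attention is masked out (-inf)."""
    n = D.shape[0]
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
    dist[np.arange(n), np.arange(n)] = -np.inf   # remove attention to self
    return softmax(dist, axis=-1) @ D            # aggregated semantics D_agg

def personalized_weights(D):
    # w_p_i = sigmoid(||d_i - d_agg_i||): a larger semantic difference from
    # the other candidates yields a stronger personalization weight
    diff = np.linalg.norm(D - dif_attn(D), axis=-1)
    return 1.0 / (1.0 + np.exp(-diff))
```

Note that when all candidates are identical, the distances vanish and every weight collapses to sigmoid(0) = 0.5, the minimum of this mapping, so personalization is damped for unambiguous result lists.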

Re-ranking
The final ranking score of each candidate document consists of two parts. For the personalized relevance, we compute the similarities between the document representation d_i and the two user profiles obtained in Section 3.1, and multiply them by the personalized weight w^p_i, where U_e and U denote the user profiles built with the enhanced query and the original query, respectively, in Section 3.1.4. For the ad-hoc relevance, we divide it into two parts: (1) we consider the interaction-based and representation-based similarities between the query and the document, computed over both the original embeddings q^E and d^E_i and their context-aware representations q and d_i;
(2) following [10], we extract some additional features f_{q,d_i} for each candidate document, including click features, topic features, and some neural matching features, which are also fed into an MLP to compute a relevance score. The interaction-based similarity s_I(•) and the representation-based similarity s_R(•) are implemented as the KNRM model and cosine similarity respectively, and φ(•) is implemented as a multilayer perceptron.
We adopt the pairwise learning-to-rank algorithm LambdaRank [41] to train our model. We construct a training pair with a positive sample (a clicked document) and a negative sample (an unclicked document). Given a positive sample d_i and a negative sample d_j, the probability that d_i is more relevant than d_j is computed as

P_ij = sigmoid(score(d_i) − score(d_j)),

where score(•) calculates the final score of the document. The final loss function is defined as the weighted cross entropy between the ground truth P̄_ij and the predicted probability P_ij:

L = −λ_ij [ P̄_ij log P_ij + (1 − P̄_ij) log(1 − P_ij) ],

where the weight λ_ij is the change in the ranking metric when swapping the positions of d_i and d_j.
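The pairwise objective can be sketched as below; the pair weight lam_ij (the metric change from swapping the pair) is passed in as a given value here rather than computed from a ranking metric.

```python
import numpy as np

def pair_prob(s_i, s_j):
    # P_ij = sigmoid(score(d_i) - score(d_j)): probability that the clicked
    # document d_i should be ranked above the unclicked document d_j
    return 1.0 / (1.0 + np.exp(-(s_i - s_j)))

def pair_loss(s_i, s_j, lam_ij=1.0, target=1.0):
    # weighted cross entropy between the ground truth label and P_ij;
    # lam_ij is the |delta-metric| weight from swapping d_i and d_j
    p = pair_prob(s_i, s_j)
    return -lam_ij * (target * np.log(p) + (1 - target) * np.log(1.0 - p))
```

As expected, the loss shrinks as the score margin between the clicked and unclicked document grows.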
4 Experimental Setup

Dataset
We use the AOL search log [42] and a dataset from a commercial search engine (abbreviated as the Commercial dataset in the following) to conduct our experiments. Table 2 shows the detailed statistics of both datasets. The AOL dataset is a publicly available dataset that includes three months (from 1st March 2006 to 31st May 2006) of user query and click data. Since the dataset only contains the documents that the user clicked on, we select the candidate documents from the top documents recalled by the BM25 algorithm [43]. Following [44], we split the query log into sessions whose boundaries are decided by the similarity between two consecutive queries. Each piece of data contains an anonymous user ID, a query text, and the associated click records [45,46]. We only use the document title to calculate the relevance between the query and the document, and we remove users who do not have historical or training data.
The Commercial dataset contains search logs from January to February 2013, collected without applying personalization technology. Each piece of data contains a user ID, a query text, the time when the query was issued, the URLs of the top-20 retrieved documents, the click label of each URL, and the corresponding dwell time. The dataset differs from the AOL dataset in a few ways. Firstly, the candidate documents are directly retrieved by the search engine, making the original ranking quality much higher than BM25. Secondly, we crawl the content of each document according to its URL to represent the document, which makes the document representation more accurate than using the document title alone. Lastly, this dataset contains click dwell time, so we regard a click whose dwell time exceeds 30s, or the last click, as a satisfied click. We regard 30 minutes of inactivity as the session boundary, based on which we segment the search log into different sessions.

Baselines
For the AOL dataset, the original rankings are generated by the classical BM25 algorithm. For the Commercial dataset, the original rankings are directly returned by the commercial search engine. In addition to the original rankings, we also compare our model with several ad-hoc search baselines and personalized search baselines. The details of these baselines are as follows:

KNRM [47]. KNRM is a kernel-based neural ranking model. It builds a word-level similarity matrix between the query and the document, and uses a kernel pooling technique to extract multi-level soft match signals from it. Then a learning-to-rank algorithm is used to map these features into the final ranking score.
Conv-KNRM [48]. Conv-KNRM is built on the KNRM model. It first utilizes convolutional neural networks to model n-gram soft matches for ad-hoc search. Then kernel pooling and a learning-to-rank algorithm are applied to calculate the final ranking score.
BERT [49]. This model matches the query and the document based on the pre-trained BERT model. It takes the concatenated query-document sequence as input and regards the features of the "[CLS]" token as the matching signals.
P-Click [20]. P-Click re-ranks the candidate documents based on the number of times the user clicked them under the same query in the search history, inspired by users' re-finding behaviors during the search process.
HRNN [12]. It employs a hierarchical recurrent neural network to model the sequential information underlying the user history and dynamically generates the user profile based on a query-aware attention mechanism. Then, it re-ranks the candidate documents based on their relevance to the short-term and long-term user profiles.
PSGAN [7]. PSGAN is a personalized framework for overcoming the problem of noisy training data based on a generative adversarial network. It can generate queries that better match users' search intents and select better document pairs for modeling user interests. We use the trained discriminator for the re-ranking task.
RPMN [11]. This study constructs memory networks to identify complex re-finding behaviors. It builds a fine-grained user model dynamically based on the current information and uses the model to re-rank the documents.
HTPS [10]. This model applies a hierarchical Transformer to encode the search history and disambiguate the user's query in multiple stages. Besides, a personalized language model is designed to predict the user intent accurately.
PEPS [8]. The PEPS model uses historical data to train a personalized word embedding for each user. It improves the performance of personalized search through better data representations instead of the user profile.

Evaluation Metrics
For the AOL dataset and the Commercial dataset, we regard the clicked documents and the satisfied-click documents, respectively, as relevant documents and label the others as irrelevant. We apply three common evaluation metrics: mean average precision (MAP), mean reciprocal rank (MRR), and precision@1 (P@1). Considering that users' click behavior may be influenced by the original order and that some relevant documents may be ignored due to their low rankings, we use a more credible metric, P-improve [7], as the fourth evaluation metric to measure the ranking results more objectively. We calculate P-improve as the ratio of correctly improved document pairs compared with the original ranking results; a more detailed explanation can be found in [7]. Since the candidate documents in the AOL dataset are recalled by BM25 and were not presented to users, we only use this metric on the Commercial dataset, whose candidate documents are directly retrieved by the search engine.
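For reference, the three standard metrics can be computed per query from a ranked list of 0/1 relevance labels as follows (MAP then averages the per-query average precision over all queries):

```python
def precision_at_1(rels):
    # P@1: is the top-ranked document relevant?
    return float(rels[0])

def mrr(rels):
    # reciprocal rank of the first relevant document (0 if none)
    for rank, r in enumerate(rels, start=1):
        if r:
            return 1.0 / rank
    return 0.0

def average_precision(rels):
    # average of precision values at each relevant position
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

For example, the ranked label list [0, 1, 0, 1] yields P@1 = 0.0, MRR = 0.5, and average precision = 0.5.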

Implementation Details
We implement our model with PyTorch and carry out a series of experiments to determine its parameters. The word embedding size is set to 100. As for the Transformer, the hidden size is 512 and the number of heads in the multi-head attention mechanism is set to 8. The whole model is optimized by Adam, with a batch size of 32 and a learning rate of 9e-5.
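For concreteness, the Adam update with the learning rate above can be sketched as a single parameter step in plain numpy; PyTorch's `torch.optim.Adam` performs this internally, and the default betas and epsilon are assumed here since the paper does not state them:

```python
import numpy as np

# Learning rate from the paper; betas/eps are PyTorch's Adam defaults.
LR, BETA1, BETA2, EPS = 9e-5, 0.9, 0.999, 1e-8

def adam_step(param, grad, m, v, t):
    """One Adam update for a scalar parameter at step t (1-indexed)."""
    m = BETA1 * m + (1 - BETA1) * grad        # first-moment estimate
    v = BETA2 * v + (1 - BETA2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - BETA1 ** t)              # bias correction
    v_hat = v / (1 - BETA2 ** t)
    param = param - LR * m_hat / (np.sqrt(v_hat) + EPS)
    return param, m, v

# First step with unit gradient: the update is approximately LR.
p, m, v = adam_step(1.0, 1.0, 0.0, 0.0, 1)
```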

Overall Performance Comparison
The overall results of different models on the two datasets are shown in Table 3. It can be observed that: (1) Comparison between our model and the baselines. Our DEPS model outperforms all the baseline models on both datasets. Compared with the state-of-the-art models PEPS and RPMN, DEPS shows significant improvements in terms of all the evaluation metrics (paired t-test at the p < 0.05 level). Concretely, compared to the best baseline PEPS on the AOL dataset, DEPS increases MAP by 5.6%. (2) Comparison between the datasets. Compared with the AOL dataset, the Commercial dataset has a much higher original ranking quality, which makes the ad-hoc search baselines perform worse than the original ranking. On the AOL dataset, with rich interactive matching signals between the query and the document, HTPS and PEPS outperform RPMN significantly, while on the Commercial dataset, RPMN performs better. This suggests that the AOL dataset mainly evaluates methods for modeling user interests and query-document matching, whereas the Commercial dataset focuses on a model's capability to capture personalized signals. Our model outperforms PEPS and RPMN on both datasets, which further confirms the robustness of DEPS.
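The significance claims above rest on a paired t-test over per-query metric values (e.g. average precision per query). A minimal sketch of the underlying statistic; the scores in the usage example are illustrative, not from the paper:

```python
import numpy as np

def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic over per-query metric values of two models.
    A |t| above the critical value for n-1 degrees of freedom indicates
    a significant difference at the chosen level (e.g. p < 0.05)."""
    d = np.asarray(scores_a) - np.asarray(scores_b)  # per-query differences
    n = len(d)
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))

# Illustrative per-query AP values for two models
t = paired_t_statistic([0.6, 0.7, 0.8, 0.9], [0.5, 0.65, 0.7, 0.8])
```

In practice `scipy.stats.ttest_rel` computes the same statistic together with the p-value.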
In summary, the experimental results indicate that the candidate documents can further improve personalization by enhancing the query representation and adjusting the personalized scores. For a more detailed analysis of our model, we conduct a series of supplementary experiments: ablation studies and experiments on different query sets.

Ablation Experiments
Our DEPS model includes several main components: the enhanced query representation q_e, the personalized weight w_p, and the diverse documents selection module. To analyze the role of each part, we conduct several ablation experiments on the two datasets. The ablation models are as follows: DEPS w/o. EQR. We abandon the enhanced query representation q_e and only use the original query representation q to build the user profile and compute the personalized score.
DEPS w/o. PW. We discard the personalized weight and calculate the personalized scores only by matching the user profiles to the documents. DEPS w/o. DDS. We remove the diverse documents selection module and use all the candidate documents to enhance the query.
The experimental results are shown in Table 4. Removing any component of our model damages the results on both datasets. Specifically, abandoning the enhanced query representation q_e (EQR) causes the largest decline in every metric, which confirms that the candidate documents can effectively enhance the query representation. Besides, without the personalized weight w_p (PW), the MAP, MRR, and P@1 metrics drop by 0.8%, 0.7%, and 0.9% respectively on the AOL dataset, which proves the effectiveness of controlling the degree of personalization based on the semantic difference between documents. Furthermore, considering the time cost of the diverse documents selection module (DDS), we attempt to remove it and test the performance of our model. The results show that the gap between our model and the SOTA baseline remains large, even though the model loses 0.4% in MAP on the AOL dataset.

Performance on Different Query Sets
To further explore how our model improves the ranking results, we divide the test data into different subsets under several scenarios and compare the improvement in MAP across models on the AOL dataset. The details are as follows.
Ambiguous and non-ambiguous queries. In this part, we investigate the model's performance on ambiguous and unambiguous queries respectively. An ambiguous query (such as "Apple") usually has multiple subtopics, and different users may have different intents. As we discussed in Section 1, ambiguous queries have more potential for personalization, while applying personalization to unambiguous queries, i.e., those with low click entropy, may hurt the search quality. We measure query ambiguity by click entropy: we compute the click entropy of all queries and divide the whole query set into an unambiguous query set (click entropy < 1) and an ambiguous query set (click entropy ≥ 1).
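Click entropy can be computed directly from the click logs. A small sketch, where `clicked_docs` (a name of our own) is the list of documents clicked across all occurrences of one query; the threshold of 1 then separates the two query sets:

```python
import math
from collections import Counter

def click_entropy(clicked_docs):
    """Click entropy of a query: base-2 entropy of the distribution of
    clicks over documents. Low entropy means most users click the same
    result (unambiguous query); high entropy signals ambiguity."""
    counts = Counter(clicked_docs)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())

# All users click the same document -> entropy 0 (unambiguous).
# Clicks split evenly over two documents -> entropy 1 (ambiguous).
unambiguous = click_entropy(["d1", "d1", "d1", "d1"])
ambiguous = click_entropy(["d1", "d2"])
```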
We choose PSGAN, HTPS, and PEPS as the baselines; the experimental results are shown in Figure 4. The delta MAP represents the improvement relative to the original ranking. All the models improve MAP on both query sets, which shows that proper personalization is effective for both kinds of queries. Besides, our model outperforms the best baseline PEPS, especially on the ambiguous query set. This indicates that candidate documents can enrich the potential subtopics of the current query, and that the quality of query-centric user profiles can also be improved.
Repeated and non-repeated queries. We also categorize the query set into repeated and non-repeated queries. For repeated queries, a more accurate user profile can be built from the click behaviors on the same query in the past. For non-repeated queries, however, there is no identical historical search behavior to refer to, which makes predicting the user intent more difficult. The experimental results are shown in Figure 5.
The results indicate that all models perform better on repeated queries. This demonstrates that most personalized models can improve personalization by capturing the user's re-finding behaviors. Besides, our DEPS outperforms all the models on both query sets, and the improvements on non-repeated queries are more obvious. This means that when facing new queries, our model can better improve the ranking results by mining the semantic information hidden in the candidate list.
Queries of different lengths. According to our statistics, about 45.5% of issued queries contain only one or two words. The shorter the query is, the less intent information its representation contains. As we stated in Section 1, candidate documents provide sufficient information to reveal the potential query intents. To further demonstrate the effects of our model, we test it on queries of different lengths. We choose PEPS as the baseline, and the comparison results are shown in Figure 6. We observe that our model performs better on short queries, especially those with a length of 1 or 2. This is because short queries often lack semantic information, so enhancing their semantics with candidate documents yields larger personalized improvements. Another observation is that our DEPS outperforms the baseline PEPS on almost all query lengths, which further demonstrates the effectiveness of our model.

Conclusion
In this work, based on the candidate documents, we designed a personalized search framework that explores two questions worth considering in the field of personalization: how to personalize and whether to personalize. For the first question, we proposed using candidate documents to broaden the topic coverage of the current query, so that a more accurate user profile can be built based on the enhanced query. For the second question, we designed a difference-aware self-attention mechanism to capture the semantic differences between candidate documents and calculate a personalized weight that adjusts the final personalized score for each document. Our experiments confirmed the effectiveness of our framework for personalized search. To further improve the framework, we could replace the user profile building module with a more complex structure; for example, large language models (LLMs) could be applied to better analyze users' search behaviors and build more comprehensive user models. Besides, how to utilize the candidate documents in the same session to enhance the query representation is also worth further study.

Fig. 3 An example of using DifAttn for semantic aggregation.

Fig. 5 Experimental results on the repeated query set and non-repeated query set.

Table 1 Notations in our approaches.

Table 2 Statistics of the two datasets.
the time when the query is issued, a candidate document, and a click tag. Since personalized search relies on the search history, we divide the whole dataset into two parts: historical data and experimental data. Specifically, the first five weeks of data serve as historical data, which contributes to personalization. The last eight weeks of data are considered experimental data, which is further divided into training data and test data at a ratio of 5:1. For each query, we sample 5 candidate documents for training and 50 candidate documents for testing following

Table 3 The performance comparison of all models on the AOL dataset and the Commercial dataset. Note that since the candidate documents of the AOL dataset are not presented to users, it is not suitable to use the P-improve metric on the AOL dataset. The percentage represents the improvement over the SOTA baseline. "†" indicates the model outperforms all baselines significantly with a paired t-test at the p < 0.05 level. The best results are in bold.
DEPS (Ours): .8322† (+1.0%), .8423† (+1.0%), .7394† (+1.2%), .2802 (+5.5%)

Table 4 The experimental results of ablation models on the AOL dataset and the Commercial dataset.
0% on the Commercial dataset. The reason for the smaller improvement is that the original ranking quality of the Commercial dataset is much higher than that of the AOL dataset, making it difficult to improve the results obviously. Therefore, the P-improve metric, on which our DEPS increases by 5.5%, is more credible. The significant performance improvements on the two datasets prove that making use of the semantic information of candidate documents is effective for improving search quality.