Tell me What you Like: Introducing Natural Language Preference Elicitation Strategies in a Virtual Assistant for the Movie Domain

Preference elicitation is a crucial step for every recommendation algorithm. Traditional interaction strategies for eliciting users' interests and needs range from button-based interfaces, where users select what they like among a set of fixed alternatives, to more recent conversational interfaces, where users reply to questions formulated by the algorithm about their preferences. However, none of these strategies mimics the dynamics of real-world open-ended interactions or allows users to express their needs with an adequate level of expressiveness and control. In this paper, we present a strategy that allows users to express their preferences and needs through natural language statements. In particular, our natural language preference elicitation pipeline allows users to express preferences on objective movie features (e.g., actors, directors, etc.) that are extracted from a structured knowledge base, as well as on subjective features that are collected by mining user-written movie reviews.


Introduction
The rise of Virtual Assistants (VAs) [1] is one of the most interesting trends in the area of software development. This tendency, which is also confirmed by the huge investments made by big companies in that direction, has led to the spread of systems such as Google Assistant, Siri, and Amazon Alexa.
Even if these systems have proved very effective in fulfilling a broad range of information needs, ranging from booking flights to playing music, there is still room for improvement. As an example, the development of personalization and recommendation strategies for VAs is at a very embryonic stage. As stated in [2], even commercial VAs do not encode sufficient knowledge of the user (i.e., what she likes, what she needs) and provide very basic suggestions, based either on heuristics (e.g., the popularity of an item) or on the user's previous behavior. As a consequence, a significant research effort is currently devoted to exploring the potential of VAs as a platform for providing personalized recommendations [3].
In this setting, one of the main issues VAs have to face is preference elicitation. Generally speaking, preference elicitation [4] refers to the problem of acquiring and modeling user preferences as accurately as possible. As is already acknowledged for recommender systems [5], preference elicitation is a fundamental requisite for good recommendations, and the same holds for VAs as well.
However, the preference elicitation methods currently implemented in these systems are not particularly satisfying, since they tend to trade off user effort against the richness of information needed by the algorithms to provide satisfying recommendations; this often leads to a weak representation of user interests and, in turn, to bad recommendations [6].
Traditional interaction strategies to elicit users' interests in chatbots and VAs rely on button-based interfaces. In this case, users have to provide explicit ratings on a small subset of items picked from the catalogue, or have to indicate what they like among a set of fixed alternatives. Even if this preference elicitation strategy is widespread in recommendation algorithms, it often provides a sub-optimal representation of interests and needs [7]. Indeed, user requirements are often time- and context-sensitive, and the analysis of past interactions is not a good source for providing accurate recommendations. For example, if a user who rarely spends time with kids needs a recommendation for a family-friendly movie, information coming from her previously watched movies is not useful. Moreover, user requirements may often depend on specific features of the items rather than on the items in their entirety. As an example, a user may like a specific genre or a specific director, so basing the user profile only on previously liked items results in a weak representation of her preferences.
Accordingly, research moved towards conversational elicitation of user preferences and needs. This form of preference elicitation is currently exploited in Conversational Recommender Systems (CoRSs) [8], which feature a dialogue-based interaction to gather users' preferences. However, the strategies typically implemented in CoRSs rely on system-driven dialogues, where the algorithm itself decides the most relevant questions to be asked. This leads to long multi-turn dialogues where the user takes on a merely reactive role, thus making the interaction less natural [9]. Indeed, these strategies neither mimic the dynamics of real-world open-ended interactions nor allow users to express their needs with an adequate level of expressiveness, since users are bound by the questions decided by the algorithm.
A different vision of the preference elicitation process is proposed by Bogers et al., who introduced in [7,10] the concept of narrative-driven recommendations (NDR). In particular, NDRs better mimic the dynamics of people-to-people interactions, since the preference elicitation process relies on a rich and articulated natural language description of the needs of the user and of the qualities of the desired item. Differently from what typically happens in CoRSs, here the user has complete control over the preference elicitation process: as an example, a user may tell the system 'I like The Matrix and movies with plot twists', and a suitable recommendation is identified accordingly. In this way, the user is provided with maximum expressiveness to articulate her preferences and requirements, and the whole process follows the dynamics of real interactions (e.g., just think about the discussions about movie recommendations happening on online forums, such as Reddit).
Another relevant aspect to be investigated concerns how users express their preferences, i.e., what lexicon they typically use. As shown in the literature [11], users tend to express two types of preferences: items that they like or dislike, and properties that the item should or should not have. In turn, properties can be objective or subjective. Objective features concern non-controversial characteristics of an item (e.g., the actors of a movie), while subjective features are based on opinions about the item (e.g., a movie with a "great plot") and can be obtained by analyzing users' reviews [12].
Based on these intuitions, in this paper we fit into this research line and present a method to introduce natural language preference elicitation in a VA for the movie domain. To this end, we designed a VA that understands narrative descriptions of users' preferences and needs as in [10], and returns a recommendation accordingly. Narrative descriptions can contain liked items (e.g., "I like The Matrix") as well as objective and subjective desired properties (e.g., "I like Keanu Reeves" or "I like movies with a great plot"), thus providing users with maximum expressiveness.
Even if the idea of using natural language to elicit users' preferences is not completely new [11,13,14], to the best of our knowledge the adoption and evaluation of this strategy in a VA is under-investigated. Moreover, little evidence is available regarding the impact of natural language preference elicitation on the accuracy of the recommendations, as well as on the user's perception of the suggestions. To this end, we carried out a user study evaluating two variants of our preference elicitation strategy: one that only allows users to talk about objective properties, and one that also includes subjective properties. Our results showed that people were more comfortable expressing their preferences by using objective properties. However, when information about subjective preferences is also acquired, its combination with objective features generally leads to better recommendations. To sum up, our contributions can be sketched as follows:
• We designed a workflow to gather subjective and objective features describing the items in a catalogue, and to integrate them into a VA for the movie domain;
• We introduced a methodology for natural language preference elicitation that allows users to express their preferences on items they like, as well as on subjective and objective properties;
• We performed a user experiment to evaluate the effect of the different types of properties on the quality of the interaction and on the accuracy of the recommendations.
The next sections are organized as follows: Section 2 describes related work, Section 3 describes in detail the preference elicitation strategies and the general workflow, and Section 4 describes the setup and results of the user experiments. Finally, Section 5 draws the conclusions and outlines future work.

Related Work
In this paper, we propose a strategy that exploits natural language to elicit user preferences. Accordingly, in this section we present related work about strategies for natural language preference elicitation in VAs and recommender systems (RSs), and we emphasize the distinctive traits of our methodology.

Preference Elicitation in Recommender Systems
Techniques for preference elicitation have been largely investigated in RS research. Indeed, in order to avoid the cold-start problem [15], an RS must acquire a sufficient amount of data to build a profile of the user and to provide accurate recommendations [16]. Early work in the area relied on coarse-grained preferences based on item categories [17] and generated trivial recommendations based on the categories liked by the users. Next, research moved towards explicit preference elicitation strategies. Popular approaches tackle this problem by asking users to rate a subset of relevant items, or by relying on item comparison, where people specify their preference between pairs of items [18].
Instead of gathering information about liked items in their entirety, some approaches implement the preference elicitation process through the acquisition of desired properties. In this case, as proposed by Thompson et al. in [19], the task is tackled as a constraint-satisfaction problem. Recently, Lei et al. [20] presented a CoRS that features a preference elicitation process exploiting property-based questions. At each point of the conversation, reinforcement learning is used to decide which questions to ask and when to generate recommendations. A similar strategy is also presented by Sun and Zhang in [21].
However, despite their effectiveness, all these traditional methods share the common issue of making the elicitation process particularly long, boring, and conceptually far from the dynamics of real-world interactions.
In order to reduce the training time and, at the same time, maximize the information gathered from each answer, some strategies to automatically select the items to be rated have been proposed. These attempts rely on information theory or entropy [22], or exploit Active Learning for the automatic selection of the items to be rated [16]. Even if these strategies can effectively speed up the preference elicitation process, they all bind the user to a merely reactive role, since the control of the elicitation process is retained by the algorithm.
Differently from all these works, the distinctive traits of our methodology lie in the fact that: (1) we designed a natural language interaction strategy to elicit user preferences, which is more natural than the traditional acquisition of explicit ratings; (2) we do not select or suggest the items to be rated. Conversely, we put the user in control of the whole elicitation process, and we allow her to freely express interests, preferences, and information needs.

Natural Language Preference Elicitation
The idea of mimicking people-to-people interaction to elicit users' preferences in VAs and CoRSs is not completely new. As an example, in [13] the authors investigate how NLP techniques can be used to acquire user preferences in the form of natural language statements. By following this trend, Rafailidis and Manolopoulos [3] recently proposed the idea of using VAs as a means to deliver personalized recommendations. All these intuitions support the idea of introducing methodologies for eliciting preferences and needs in a VA by using natural language, which is the core of this research.
One of the first studies investigating how natural language is used in preference elicitation scenarios is due to Nunes et al. [14]. As for the use of natural language to elicit user preferences in VAs and RSs, Kang et al. presented in [11] the results of a study analyzing the use of natural language in a movie RS. Specifically, the aim of the study was to understand what types of queries are usually written by users, and to categorize the queries based on their goals. In particular, the authors identified objective, subjective, and navigation goals. Objective goals are invoked when users search for movies based on non-controversial attributes, such as the director or the release year. On the other side, subjective goals are based on opinionated attributes, such as a perceived quality of a movie. Finally, navigation goals are invoked when users search for a movie by name.
It should be emphasized that our work inherits some of the concepts presented by Kang et al., since we also split the descriptive features of the items into objective and subjective features, and we developed two different pipelines to extract them from structured and unstructured content. However, it is important to point out that we propose our work as a continuation of the research presented in [11]. Indeed, while Kang et al. just present an analysis of how people use language to express their needs, in our work we applied this concept in practice, since we: (1) designed a strategy that allows users to express their interests on both objective and subjective features in a preference elicitation process; (2) implemented our strategy in a VA for the movie domain. Moreover, we also investigated how the different groups of features affect the overall quality of the recommendations.
Another relevant piece of work in the area of natural language preference elicitation strategies is presented in [7], where the concept of narrative-driven recommendations is introduced. As pointed out by the authors, the user profile in NDRs is seen as a narrative description of the user's current needs. This narrative description can contain items and properties that help identify relevant recommendations. It should be pointed out that our research completely inherits this vision of the preference elicitation process as a narrative description of preferences and needs.
The effectiveness of NDRs is evaluated in [10,23], where two experiments in the books and movies domains are carried out. In both cases, the accuracy is assessed through an in-vitro experiment, whose ground truth is based on content extracted from Reddit discussions (i.e., the movies suggested by Reddit users as forum replies). Both experiments confirm that the information coming from narratives improves the quality of the recommendations. Even if our work inherits some of the ideas evaluated in [10] and [23], some distinctive traits of our research are worth emphasizing:
1. First, both [10] and [23] focus on the recommendation phase, that is to say, they used the information extracted from narratives to enrich content representation and feed recommendation algorithms. Beyond this intuition, which is common to our work, the novelty of our approach lies in the exploitation of narratives for the preference elicitation phase as well, since we introduce a strategy to automatically recognize and extract user preferences from narrative descriptions of user needs. This is done by designing Natural Language Understanding (NLU) algorithms that are integrated in a VA. To sum up, all the aspects concerning preference elicitation and natural language interaction, left out from both [10] and [23], are crucial in this work.
2. Next, differently from the aforementioned work, we carried out an in-vivo experiment involving real users. In our opinion, our experimental setting might provide more reliable results than an in-vitro setting. Indeed, previous experiments exploit a ground truth based on the automatic analysis of Reddit posts. This represents a lower bound for the accuracy of the recommendations, since many other relevant items not mentioned among the Reddit answers might exist, but in-vitro evaluations ignore them. Conversely, our methodology is based on feedback provided by real users, thus it may provide a more precise and accurate picture of the overall effectiveness of the model.
3. Finally, we also used standard questionnaires to provide a quantitative assessment of the quality of the interaction and of the recommendations.
All the details will be provided in the following Sections.

Description of the Methodology
Figure 1 describes the general workflow carried out by our strategy for natural language preference elicitation. Roughly speaking, the workflow can be split into two phases: a knowledge extraction phase, whose goal is to extract descriptive features from structured and unstructured content and to store them into a knowledge base (KB), and a knowledge exploitation phase, where the features stored in the KB are made available to the VA, which interacts with the user in natural language and uses this information for preference elicitation and recommendation.

Knowledge Extraction
The knowledge extraction phase is a mandatory step to implement our strategy for natural language preference elicitation. Indeed, before allowing the VA to correctly recognize the preferences and information needs of the users, some knowledge concerning which preferences and which information needs the users can express has to be made available to the VA. In other terms, we need to: (1) collect descriptive features that characterize the items in the catalogue; (2) store the features into a KB; (3) make this knowledge available to the VA. As we aim to provide users with maximum expressiveness, we extracted both objective and subjective properties of the items. To this end, the knowledge extraction process is further split into two pipelines: the first focuses on structured knowledge sources, and aims to gather objective features from knowledge graphs such as Wikidata [24] and DBpedia [25], while the second runs opinion mining techniques [26] on unstructured knowledge sources (e.g., users' reviews) to extract subjective features.

Collecting Objective Features from a Knowledge Base
Objective features describe non-controversial characteristics of the items, such as the director of a movie or the year a book was released.These features allow users to express preferences concerning the genres (e.g., I like horror movies) or the actors (e.g., I like Leonardo DiCaprio) they like, and can be easily gathered from freely available knowledge bases and knowledge graphs.
In order to collect objective features, we designed a strategy for knowledge graph mapping (see Figure 1), which is structured as follows. Formally, given the set of items I in the catalogue, for each i ∈ I we define a mapping function map(i) that takes as input some textual metadata describing i and returns the URI of the corresponding resource. In our setting, we used the title of each movie, but more complex procedures based on different metadata can be implemented. In other terms, the goal of the mapping function is to match the logical entities that can be recommended, i.e., the items in the catalogue, with the corresponding physical entities in a knowledge graph. Such a mapping function is implemented as a SPARQL query that returns as output the URI of the corresponding resource. In order to manage approximate matching, we also exploited the Levenshtein distance: in this case, we mapped the item to the resource having the minimum distance from the original query. Some statistics about the mapping procedure will be provided next, together with the description of the experimental protocol.
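The mapping step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the candidate labels would in practice come from a SPARQL query against Wikidata or DBpedia, whereas here they are a small hypothetical in-memory dictionary, and the Levenshtein distance is computed with a plain dynamic-programming routine.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def map_item(title: str, candidates: dict) -> str:
    """Return the URI of the candidate whose label is closest to `title`.

    `candidates` maps resource labels to URIs; in the real pipeline these
    labels would be returned by a SPARQL query, not a local dict.
    """
    if title in candidates:  # exact match first
        return candidates[title]
    best = min(candidates,
               key=lambda label: levenshtein(title.lower(), label.lower()))
    return candidates[best]

# Hypothetical catalogue-to-knowledge-graph mapping
kg = {
    "The Matrix": "http://dbpedia.org/resource/The_Matrix",
    "Inception": "http://dbpedia.org/resource/Inception",
}
print(map_item("the matrix", kg))  # → http://dbpedia.org/resource/The_Matrix
```

The approximate match resolves the lowercase query to the correct resource, mirroring the fallback described in the text.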
It is worth emphasizing that the mapping step is necessary to get an entry point to the knowledge graph. Once the mapping is completed, it is possible to gather the features describing the items in the catalogue. In our case, for each movie, we gathered information about actors, directors, genres, screenwriters, and producers. These properties are domain-dependent, and were manually selected based on domain knowledge and some heuristics. For different domains, other properties should be chosen. An example of the output of this step is reported in the left box of the portion of knowledge base presented in Figure 1.

Collecting Subjective Features from Users' Reviews
On the other side, subjective properties are more related to the perception of the item itself, since they refer to characteristics that involve a degree of judgement or uncertainty [11]. As an example, users can use subjective properties in preferences such as "I like movies with interesting characters" or "I love movies with a great plot". Moreover, subjective properties can also refer to the emotional state the desired item can evoke (e.g., "I like movies with a funny ending"). As stated in [23], differently from objective features, which refer to precise and unambiguous characteristics (e.g., movies starring Brad Pitt), subjective features are very useful when users have vague ideas about the characteristics of the desired item, and prefer to express preferences related to their personal perceptions or emotions.
In order to collect these properties, we designed an opinion mining pipeline to extract subjective features from users' reviews. The choice of analyzing users' reviews to obtain subjective features is quite natural, since reviews give a clear picture of what people like and what they think about the items they consumed in the past [27], and this can be used to capture characteristics related to the perception of the items or to the emotions they can evoke. Our strategy relies on an unsupervised aspect extraction algorithm inspired by previous work in the area [28], which identifies uni-grams and bi-grams that frequently appear in the reviews of an item. In a nutshell, the subjective features of an item are the uni-grams and bi-grams that are frequently mentioned with a positive sentiment in the reviews of the item. Our methodology can be further split into two phases: selection and ranking.
Selection Phase. As a first step, we collect a set of reviews discussing a particular item, and we split each review into sentences. Given each sentence, we use sentiment analysis [29] to determine the sentiment conveyed by that particular sentence, and filter out all sentences expressing a negative or neutral sentiment score, thus maintaining only positive sentences. This choice is very common in the literature [30], and is based on the intuition of highlighting only the positive aspects of the items to be recommended. Next, once the set of sentences has been filtered, we tokenize and lemmatize each sentence, and we retrieve the POS tags. After this step, all the lemmas mentioned in all the positive reviews of the item are obtained.
Next, the available lemmas are analyzed in order to obtain candidate uni-grams and bi-grams. As for uni-grams, we filter nouns (except for proper names) and adjectives (except for comparatives and superlatives). As for bi-grams, we identify noun-noun and adjective-noun pairs. In this case, our choices are based on previous research [31], showing that relevant aspects describing the items are usually represented using nouns. As an example, starting from this positive sentence from a review of "The Matrix": "The movie is intriguing, full of philosophical and religious allegories, and the action is probably some of the best cinema has ever seen with the slow motion gun fights and masterful stunts", we obtain the following set of candidate uni-grams and bi-grams: movie, allegory, action, cinema, motion, gun, fight, stunt, intriguing, full, philosophical, religious, slow, masterful, religious allegory, slow motion, motion gun, gun fight, masterful stunt.
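The candidate-extraction rules above can be sketched as follows, assuming the sentence has already been lemmatized and POS-tagged by a standard NLP toolkit (the tags below follow the Penn Treebank convention; restricting to the plain NN/JJ tags naturally excludes proper nouns, comparatives, and superlatives, as the text requires).

```python
# Extract candidate uni-grams and bi-grams from a lemmatized, POS-tagged
# positive sentence, following the selection rules described in the text.

def candidates(tagged):
    unigrams, bigrams = [], []
    for lemma, tag in tagged:
        if tag in ("NN", "JJ"):          # common nouns and plain adjectives
            unigrams.append(lemma)       # NNP, JJR, JJS are excluded
    for (l1, t1), (l2, t2) in zip(tagged, tagged[1:]):
        if t2 == "NN" and t1 in ("NN", "JJ"):  # noun-noun / adjective-noun
            bigrams.append(f"{l1} {l2}")
    return unigrams, bigrams

# A fragment of the running example from "The Matrix" review
sentence = [("slow", "JJ"), ("motion", "NN"), ("gun", "NN"),
            ("fight", "NN"), ("be", "VB"), ("masterful", "JJ")]
uni, bi = candidates(sentence)
print(uni)  # → ['slow', 'motion', 'gun', 'fight', 'masterful']
print(bi)   # → ['slow motion', 'motion gun', 'gun fight']
```

Note how the bi-grams "slow motion", "motion gun", and "gun fight" match the candidates listed in the running example.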
Ranking Phase. The selection phase is repeated over all the sentences of all the reviews that discuss a particular item. Accordingly, when the set of reviews is particularly large, the number of candidate uni-grams and bi-grams can be very large as well, thus it is necessary to design a strategy to rank all of them in order to select the most relevant subjective features.
In order to remove some noise, we start the ranking phase by removing common and domain-specific stop words. Next, we calculate the TF-IDF score of each candidate uni-gram and bi-gram, by considering all the reviews of an item as a single document. Generally speaking, the TF-IDF score measures how frequently a particular uni-gram or bi-gram is mentioned with a positive sentiment in the reviews of the item, discounted by how common that term is across the whole catalogue. Next, we extract the top-100 uni-grams and bi-grams based on their TF-IDF score, and we further filter out those appearing in less than 0.25% of the items. All these values have been empirically set after a thorough analysis based on several runs of the algorithm; the automatic tuning of such parameters is left as future work. Based on the same example provided in the selection phase, once stop words are removed and the TF-IDF ranking is applied, we get the following subjective features: action, stunt, intriguing, philosophical, religious, slow motion, gun fight. As shown in the right box of the KB depicted in Figure 1, these features represent the final output of the pipeline and are encoded in our knowledge base as subjective features. As future work, we will consider the implementation and evaluation of more sophisticated aspect extraction strategies based on deep learning techniques. To sum up, the combination of subjective and objective features provides the items with a richer representation of their descriptive features, which can be exploited for both preference elicitation and recommendation.
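The ranking phase can be sketched as below. This is a minimal illustration under stated assumptions: the toy catalogue, the stop-word set, and the top-k value are invented for the example (the paper uses top-100 plus a 0.25% support filter), and the "document" for each item is simply the multiset of candidate terms collected from its positive reviews.

```python
import math
from collections import Counter

def tfidf_rank(docs, stop_words=frozenset(), top_k=100):
    """Rank candidate terms per item by TF-IDF.

    `docs` maps an item id to the list of candidate uni-/bi-grams collected
    from all of its positive reviews (one 'document' per item, as in the text).
    """
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for terms in docs.values():
        df.update(set(terms))
    ranked = {}
    for item, terms in docs.items():
        tf = Counter(t for t in terms if t not in stop_words)
        scores = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        ranked[item] = [t for t, _ in sorted(scores.items(),
                                             key=lambda x: (-x[1], x[0]))][:top_k]
    return ranked

# Hypothetical candidate terms extracted from positive reviews
docs = {
    "matrix":    ["action", "action", "gun fight", "movie"],
    "titanic":   ["romance", "ship", "movie"],
    "john_wick": ["action", "gun fight", "movie"],
}
top = tfidf_rank(docs, stop_words={"movie"}, top_k=2)
print(top["matrix"])  # → ['action', 'gun fight']
```

Terms mentioned in every item's reviews (like the stop word "movie") contribute nothing, while terms frequent in one item but rare across the catalogue rise to the top.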

Knowledge Exploitation
Once the knowledge extraction pipeline has completed its execution, the knowledge base is populated with both objective and subjective features describing the items in the catalogue. Next, the knowledge exploitation pipeline is carried out. The components involved in this pipeline are designed to make the VA able to: (1) correctly catch the mentions of objective and subjective features throughout the preference elicitation process; (2) make the features available to the recommendation algorithm; (3) start the recommendation process when enough preferences have been collected. The specific workflow carried out by the knowledge exploitation pipeline is presented in Figure 2. It is a specialization of the components already presented in Figure 1 that provides more details on how the NLU and recommendation modules interact and exchange information with the knowledge base.
As shown in the figure, the process is started by the user, who interacts with the VA through a chatbot by writing her messages in natural language. As we explained throughout the paper, the first distinctive trait of the methodology lies in the fact that the preference elicitation is completely guided by the user, while the VA takes on a simply reactive role. This choice is supported by the findings of recent work [32], which showed that people are more satisfied with services they feel they have more control over.
Next, all the messages written by the user are received by the VA, whose goal is to correctly catch the preferences expressed by the user. As an example, if a user tells the VA "I like The Matrix and I love sci-fi movies", the VA shall recognize the items (i.e., The Matrix) as well as the descriptive properties (i.e., sci-fi) mentioned in the text. When enough evidence is collected, the VA starts the recommendation process based on the preferences of the user and generates a suggestion. Finally, the user can provide feedback, and the recommendation process is repeated until a satisfactory recommendation is generated (Figure 3 shows an example of interaction between a user and the VA). Obviously, both the NLU and the recommendation modules play a key role in this pipeline. In the following Section, we provide more details about the algorithms we exploited in this work. Throughout this Section we refer to the dialogue presented in Figure 3 as a running example.

Natural Language Understanding Modules
In order to understand the semantics conveyed by the users' messages and to correctly manage the interaction with the user, the VA must be provided with some NLU capabilities. To this end, we integrated into our system three fundamental components: an Intent Recognizer (IR), a Named Entity Recognizer (NER), and a Sentiment Analyzer (SA).
Intent Recognizer. The goal of the IR is to correctly understand the goals or the actions the users have in mind when they interact with the VA. Operationally, the IR takes as input a message written by the user and classifies it against a fixed set of categories called intents. In our implementation, the IR classifies each message against three different intents, i.e., 'provide preferences', 'ask for recommendation', and 'feedback on recommendation'. These intents are directly inherited from similar approaches for CoRSs already presented in the literature [8].
As an example, by referring to Figure 3, the first three messages written by the user (highlighted in yellow) can be classified with the 'provide preferences' intent, while the intents of the fourth and fifth messages (highlighted in blue and red) are 'ask for recommendation' and 'feedback on recommendation', respectively.
The IR currently implemented in our VA is based on DialogFlow and relies on machine learning. Hence, for each intent, the system needs to be trained by feeding the IR with an adequate set of input examples covering the different utterances that can be used to express preferences or to ask for recommendations. In our case, we fed the Intent Recognizer with approximately 50 examples for each intent. Unfortunately, due to space reasons, it is not possible to provide further details regarding the intent recognition process; for further reading, we suggest referring to a recent survey [8], since most of the underlying concepts are covered by that work.
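To make the input/output contract of the IR concrete, the sketch below classifies a message by its token overlap with a handful of seed utterances. This is emphatically not DialogFlow's proprietary ML classifier: the seed examples and the Jaccard-overlap scoring are invented stand-ins that only illustrate how a message maps to one of the three intents.

```python
# Minimal stand-in for the intent recognizer. The real system trains a
# DialogFlow classifier on ~50 example utterances per intent; here each
# intent gets a couple of hypothetical seeds, and a message is assigned
# to the intent whose best seed shares the most tokens with it.

SEEDS = {
    "provide preferences":        ["i like action movies",
                                   "i love keanu reeves"],
    "ask for recommendation":     ["what should i watch",
                                   "recommend me a movie"],
    "feedback on recommendation": ["i did not like that suggestion",
                                   "great suggestion thanks"],
}

def classify_intent(message: str) -> str:
    tokens = set(message.lower().split())
    def score(intent):
        # Jaccard overlap with the closest seed utterance of the intent
        return max(len(tokens & set(ex.split())) / len(tokens | set(ex.split()))
                   for ex in SEEDS[intent])
    return max(SEEDS, key=score)

print(classify_intent("I like sci-fi movies"))       # → provide preferences
print(classify_intent("Can you recommend a movie"))  # → ask for recommendation
```

A trained classifier generalizes far better than token overlap, but the interface is the same: one message in, one of the three intent labels out.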
Named Entity Recognizer. Every time the IR classifies a message of the user as a message to 'provide preferences', our strategy for natural language preference elicitation is activated. Indeed, in this phase it is very important to analyze every single message to identify all the elements (i.e., movies, properties) that are mentioned, and to populate the profile of the user accordingly. In this setting, the NER module plays a fundamental role.
Given a message written by the user, the goal of the NER is to identify all the mentions of the entities contained in the text. In our case, the term entities stands for items, subjective properties, and objective properties. As an example, by referring again to Figure 3, two subjective properties are mentioned in the first message and two objective properties (i.e., actors) are mentioned in the second message. Finally, the third message contains two movies liked by the user.
In order to recognize all the mentions of these terms, the NER also needs to be trained. In particular, our NER is based on a state-of-the-art implementation that first exploits Conditional Random Fields to go through the text and identify potential entities. Then, fuzzy string matching is used to map candidate entities to the elements encoded in the knowledge base (i.e., items, objective and subjective properties). If a match is obtained, the entity mentioned in the text is recognized, extracted, and stored in the profile of the user as a preference.
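The fuzzy-matching step can be sketched as follows, assuming the CRF tagger has already produced candidate spans. The KB slice, the similarity cutoff, and the `link_entity` helper are hypothetical; the sketch only illustrates linking a (possibly misspelled) mention to a KB entry, here via `difflib` rather than the paper's specific matcher.

```python
import difflib

# Hypothetical slice of the knowledge base: surface labels of items and
# objective/subjective properties that the NER can link mentions to.
KB_ENTITIES = {
    "the matrix":   ("item", "The Matrix"),
    "keanu reeves": ("objective", "Keanu Reeves"),
    "great plot":   ("subjective", "great plot"),
}

def link_entity(candidate: str, cutoff: float = 0.8):
    """Fuzzy-match a candidate span (e.g. from a CRF tagger) to a KB entry.

    Returns (entity type, canonical label), or None when no KB element is
    similar enough -- in which case the span is discarded.
    """
    match = difflib.get_close_matches(candidate.lower(), KB_ENTITIES,
                                      n=1, cutoff=cutoff)
    return KB_ENTITIES[match[0]] if match else None

print(link_entity("Keanu Reves"))  # misspelled mention still resolves
print(link_entity("tax returns"))  # no match → None
```

Only linked entities reach the user profile; unmatched spans are silently dropped, which keeps noisy CRF candidates out of the preference model.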
Sentiment Analyzer. Finally, it is also important to emphasize that a VA should be able to correctly understand the sentiment conveyed by the messages written by the users. As an example, as shown in the second sentence highlighted in grey in Figure 3, the user could mention two entities in the same message, one with positive sentiment and one with negative sentiment. In order to correctly manage these situations, we also integrated a Sentiment Analyzer into our VA, whose goal is to process each sentence written by the user and to associate the correct sentiment with each fragment of text. By referring again to the previous example, after running the SA over the user's input, only the evidence about the preference of the user for Keanu Reeves is collected and stored in the profile, while the mention of Tom Cruise is used to identify items the user does not like.
To sum up, the combination of IR, NER and SA allows users to interact with the VA and carry out an open-ended natural language preference elicitation process. All the information extracted throughout this process is then stored in the profile and exploited by the recommendation algorithm.

Recommendation Services
When a sufficient number of preferences has been collected, users can ask the system to generate a recommendation (green sentence in Figure 3). In this case, all the information stored in the profile of the user is passed to a recommendation algorithm, which in turn provides the user with a suitable suggestion. Next, users can express feedback on the recommendation (yellow sentence in Figure 3), and whenever the suggestion is not liked, a new recommendation cycle starts.
As for the recommendation algorithm, in this work we compared two different implementations: a graph-based algorithm based on PageRank with Priors [33], and a content-based algorithm exploiting Doc2Vec [34]. In the following, we introduce both recommendation methods as well as the underlying data models.
Graph-based Recommendations. Our first strategy to provide users with recommendations is based on a graph-based algorithm exploiting PageRank with Priors (PPR) [33]. The choice of PPR is motivated by recent research in the area [35][36][37], which confirmed that PPR provides results in line with the most effective recommendation strategies. Moreover, one of the strengths of this algorithm lies in the fact that it can easily manage information gathered from knowledge bases and knowledge graphs.
For each configuration, we encoded items, subjective properties and objective properties as nodes. Next, we connected each item to all the objective and subjective properties that describe it. Given such a data model, we run PPR over the graph and return as recommendation the item node with the highest PageRank score. Differently from the original PageRank, the distinctive trait of PPR is the adoption of a non-uniform personalization vector, which assigns different weights to different nodes in order to bias the random walk towards some of them. In our case, we biased the random walk to give more importance to the items and properties (that is to say, to the nodes) previously liked by the user. More details about the specific configuration of the algorithm are provided next.
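The graph construction and biased random walk described above can be sketched with a minimal pure-Python power iteration; the graph, node names and liked set below are toy examples (our actual implementation relies on the Jung framework, as detailed later):

```python
def personalized_pagerank(edges, prior, alpha=0.85, iters=100):
    """Power iteration of PageRank with Priors on an undirected graph."""
    neigh = {}
    for u, v in edges:
        neigh.setdefault(u, []).append(v)
        neigh.setdefault(v, []).append(u)
    rank = dict(prior)
    for _ in range(iters):
        # Teleport back to the prior with probability (1 - alpha) ...
        new = {n: (1 - alpha) * prior[n] for n in neigh}
        # ... otherwise follow a random edge from the current node.
        for n in neigh:
            share = alpha * rank[n] / len(neigh[n])
            for m in neigh[n]:
                new[m] += share
        rank = new
    return rank

# Items connected to the objective/subjective properties describing them.
edges = [
    ("The Matrix", "Keanu Reeves"), ("The Matrix", "sci-fi"),
    ("John Wick", "Keanu Reeves"), ("John Wick", "violent"),
    ("Top Gun", "Tom Cruise"), ("Top Gun", "action"),
]
nodes = {n for e in edges for n in e}
items = {"The Matrix", "John Wick", "Top Gun"}
liked = {"Keanu Reeves", "sci-fi"}  # nodes the user expressed preferences for

# Non-uniform personalization vector: 80% of the prior mass on liked
# nodes, 20% evenly spread over the rest (the split used in our setup).
prior = {n: 0.8 / len(liked) if n in liked
         else 0.2 / (len(nodes) - len(liked)) for n in nodes}

scores = personalized_pagerank(edges, prior)
best_item = max(items, key=scores.get)
print(best_item)
```

Here the item connected to both liked nodes accumulates the most random-walk mass and is returned as the recommendation.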
Content-based Recommendations. Besides graph-based recommendations, we also included in our VA a content-based recommendation strategy exploiting Doc2Vec [34]. In this case, we used Doc2Vec embeddings to build a vector-space representation of both items and users, which is used to identify suitable recommendations.
Given an item, we first query the KB to collect its subjective and objective features, then we use Doc2Vec to learn an embedding of the item based on these features. Similarly, the vector-space representation of the user is obtained by collecting the features describing her interests that were previously elicited by the VA, and by passing this information to Doc2Vec, which learns the embedding representing the user as the centroid of the embeddings of the movies and properties she previously liked. Given such a representation of users and items, the cosine similarity between the profile of the user and all the items in the catalogue (except those the user has already rated) is calculated, and the item with the highest similarity is returned as the recommendation.
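Assuming the item embeddings have already been learned (by Doc2Vec in our pipeline; the 3-dimensional vectors below are purely illustrative), the centroid-and-cosine ranking step can be sketched as:

```python
from math import sqrt

# Hypothetical pre-trained embeddings of items (and one liked property).
item_vecs = {
    "The Matrix": [0.9, 0.1, 0.2],
    "John Wick":  [0.8, 0.2, 0.1],
    "Top Gun":    [0.1, 0.9, 0.3],
}
liked = [item_vecs["The Matrix"], [0.7, 0.0, 0.3]]  # movie + property

# The user profile is the centroid of the liked embeddings.
user = [sum(xs) / len(liked) for xs in zip(*liked)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Rank every unrated catalogue item by similarity to the profile.
rated = {"The Matrix"}
scores = {name: cosine(user, v)
          for name, v in item_vecs.items() if name not in rated}
recommendation = max(scores, key=scores.get)
print(recommendation)
```

The unrated item whose embedding lies closest to the user centroid is returned, mirroring the behavior described above.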
Clearly, the behavior of both PPR and content-based recommendations is strongly influenced by the underlying data model (that is to say, by the features used to describe users and items). As the data model of the RS changes (i.e., only objective features, objective and subjective features, etc.), the resulting representation of users and items changes as well, and this leads to different recommendations. One of the goals of our experiment is to assess, beyond the effectiveness of our strategy for natural language preference elicitation, the effect of the different groups of features on the perceived quality of the recommended items.
To conclude, it is also important to emphasize that the choice of the recommendation model adopted by the VA is not crucial here, since we mainly aim to introduce a strategy to elicit user preferences through natural language. However, we decided to implement two different algorithms to assess whether the effectiveness of our preference elicitation strategy is consistent across different recommendation algorithms. Moreover, it should be pointed out that the pipeline described above can work with any recommendation algorithm, as long as objective and subjective features are encoded into the recommendation model. The analysis of further algorithms is left as future work.

User experiment
In our user study (N = 249), we deployed our VA and asked users to interact with the system by providing their preferences in natural language and by evaluating the recommendations they received. Through our user study, we collected data concerning how people use natural language to express preferences and needs, how accurate the resulting recommendations are, and how users perceive the interaction strategy we propose.

Research Questions
Through our study, we aim to answer the following research questions: RQ1 - Interaction: How does our natural language preference elicitation strategy impact the way users interact with a VA? In particular, we are interested in both quantitative (e.g., amount of preferences expressed) and qualitative analyses (e.g., the lexicon mainly used by the users). Moreover, we want to assess to what extent users feel confident with our preference elicitation strategy.
RQ2 - Recommendation: How does our natural language preference elicitation strategy impact the perceived quality of recommendations? In this case, we observe users' feedback to discover how accurate the recommendations are and how the users perceive the suggestions in the different experimental conditions.
RQ3 - Data Representation: How do different variants of the strategy impact the quality of the recommendations and the effectiveness of the preference elicitation process? We are interested in understanding whether the injection of subjective features increases the accuracy of the recommendations and how it impacts the preference elicitation process.

Experimental Setup
To answer these questions, we carried out a user study involving 249 users (72.7% men; 79.5% aged 21-30; 68.2% average or advanced computer users; 58.2% had already used a RS; 49.4% high interest in movies; 44.0% medium interest in movies; 22.9% regularly or moderately used a VA), recruited following the common availability sampling strategy. Our experiment followed a between-subjects protocol, i.e., each user was randomly assigned to one of the experimental conditions and interacted only with a specific configuration of the VA. Of course, the user was not aware of the specific configuration she was interacting with. Experimental conditions were designed to evaluate both the informative power of the different groups of features and how the features impact the quality of the recommendations. In particular, we defined four different experimental conditions. It should be pointed out that we discarded the configuration based only on subjective features. This choice was made after some preliminary tests with the VA, which showed that users did not feel at ease when constrained to express their preferences using subjective features alone, because these were more complicated to formulate. Accordingly, we only considered them in combination with objective features.
Before the experiment, participants were informed about the goal of the experiment and gave their consent to the treatment of their data for the study. The experiment was carried out as follows:
1. Each participant is randomly assigned to an experimental condition;
2. Participants undergo a training phase to learn how to interact with the system and how to express preferences;
3. Each participant interacts with the VA and expresses her preferences. At least 5 preferences are needed;
4. When the minimum amount of preferences is collected, the user can ask for a recommendation;
5. The system returns a recommendation, and the participant must express a binary feedback;
6. In case of positive feedback, the experiment ends. Otherwise, participants must provide new preferences, after which they can ask for a new recommendation. The process ends in any case after three negative responses;
7. At the end of the experiment, each user answers a post-usage questionnaire.

Implementation Details
Knowledge Extraction. Our catalogue of items was built by gathering all the elements belonging to the 'Movie' category from Wikidata. Next, objective properties were again extracted from Wikidata, following the mapping procedure presented in Section 3.1.1. For each movie, we extracted the actors, directors, genre, screenwriters and producers. As previously stated, these features were chosen based on domain knowledge and some heuristics.
Next, subjective properties were extracted by processing a subset of the popular Amazon Movie Reviews Data7 with the opinion mining pipeline described in Section 3.1.2. To implement both Sentiment Analysis and the Natural Language Processing tasks, we used the CoreNLP library8.
Natural Language Understanding. The VA is implemented as a Telegram bot. Intent Recognition relies on Google Dialogflow9, while Named Entity Recognition is implemented using a custom-trained NER model from Stanford CoreNLP10. Finally, the Sentiment Analyzer relies on the CoreNLP Sentiment Tagger.
Recommendation Algorithms. As for PPR, our recommendation algorithm relies on the Java implementation available in the Jung framework11. In particular, PPR was run by adopting the default distribution of the weights, i.e., 80% of the total weight is evenly distributed among the items liked by the user and 20% is evenly distributed among the remaining nodes. As for content-based recommendations, we exploited the Python implementation of Doc2Vec available in the Gensim package12. The dimension of the vectors was set to 300. No further parameter was tuned.
Finally, in Table 1 we report some statistics about the knowledge base we employed in our experiment. As shown in the table, the set of objective properties is almost seven times larger than the set of subjective properties. However, the distribution of the latter is significantly less sparse, since the number of subjective properties extracted for each movie is much higher than the number of objective properties (96.03 vs. 18.23, on average). This means that each subjective property describes, and is connected to, a higher number of movies on average. In the following, we analyze how these characteristics impact the preference elicitation and the accuracy of the recommendations.

Metrics and Questionnaire
During the testing sessions, we collected the following metrics:
• Preference Count (PR): average number of preferences expressed by each user throughout the preference elicitation process. The value is further split into objective and subjective preferences;
• Query Density (QD): adapted from Glass et al. [38], it measures the mean number of new concepts (items, subjective/objective properties) introduced in each user message. It is computed as QD = (1/N_d) Σ_i N_u(i)/N_m(i), where N_u(i) is the number of distinct concepts introduced by the user u in the dialogue i, N_m(i) is the number of messages of the user in the dialogue i, and N_d is the number of dialogues;
• Conversation Length (CL): average number of interaction turns before the user finds a satisfactory item or the experiment is concluded;
• Hit Rate@K (HR@K): average number of hits (i.e., satisfactory recommendations) recorded within the top-K recommendations made by the system.
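As an illustration of the Query Density formula, the metric can be computed from dialogue logs as follows; the toy dialogues and their representation as (concepts, message count) pairs are hypothetical:

```python
def query_density(dialogues):
    """QD = (1/N_d) * sum_i N_u(i) / N_m(i): mean number of distinct
    concepts per user message, averaged over all dialogues."""
    return sum(len(set(concepts)) / n_messages
               for concepts, n_messages in dialogues) / len(dialogues)

# Each dialogue: (concepts mentioned by the user, number of user messages).
dialogues = [
    (["Keanu Reeves", "sci-fi", "Keanu Reeves"], 2),  # 2 distinct / 2 msgs
    (["Tom Cruise"], 1),                              # 1 distinct / 1 msg
]
print(query_density(dialogues))  # (1.0 + 1.0) / 2 = 1.0
```

Note that repeated mentions of the same concept within a dialogue do not increase the count, since only distinct concepts contribute to N_u(i).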
For each experimental condition, the final score is calculated by averaging the scores obtained by the single users. Statistical significance between experimental conditions was assessed using a t-test.
At the end of the experiment, we also asked the users to fill in a post-usage questionnaire based on the standard ResQue model [39], which aims to assess the effectiveness of the interaction and the quality of the recommendations. Questions are organized into nine constructs: Ease of Use, Control and Transparency, Interaction Adequacy, Recommendation Accuracy, Novelty, Intention of Use, Perceived Usefulness, Confidence and Trust, and Overall Satisfaction. Answers are provided on a 5-point Likert scale, with 1 meaning Strongly Disagree and 5 meaning Strongly Agree. A detail of the questionnaire, along with the constructs each question refers to, is provided in Table 4 and Table 5.

Results
In this section, we report the results collected during the experiment. To answer RQ1, we first analyze how many preferences were gathered and what characteristics they have.
Throughout the user study, the VA recognized mentions of 1,621 entities (6.51 per user, on average). A first interesting outcome emerged when splitting the results by the type of preference expressed. Indeed, 76.95% of the entities belong to objective properties (5.04 per user, on average), while only 23.05% refer to a subjective property (1.81 per user, on average). This finding is probably due to the fact that users are more comfortable expressing preferences in the form of objective properties (e.g., "I like movies by Quentin Tarantino") than indicating complex and more articulated subjective characteristics of the items.
To deepen the analysis, in Table 2 we also report the top-10 properties the users mentioned in their messages during the preference elicitation phase. As shown in the left column, users very frequently indicate the actors they prefer and the movies they like. Contrary to expectations, the use of genres is not particularly common. On the other side, subjective features are used to refer to more specific characteristics, such as movies having a good soundtrack or good photography. Moreover, as expected, subjective features were also used to indicate emotions the requested movie should evoke, such as romantic or violent. Finally, very fine-grained characteristics (e.g., plot twist, witty humor) are also commonly elicited by our VA. Overall, we can state that this analysis confirmed the expectation that guided the design of our natural language elicitation strategy. Indeed, the combination of objective and subjective features allows users to express preferences on a broader range of heterogeneous characteristics of the movies. In the next section, we also analyze what impact such characteristics have on the overall accuracy of the recommendations.
Next, we calculate the average scores for QD and CL in the different experimental conditions, in order to assess whether the use of subjective and objective properties also impacted the length of the conversations and the amount of information recognized by our VA at each turn. Results are presented in Table 3.
As shown in the table, the amount of concepts recognized during each turn of the conversation does not significantly change across the specific configurations. However, an interesting outcome can be noted for PPR, since the PPR+All configuration obtained a lower QD than PPR+Obj. In other terms, when users can indicate subjective properties in their messages, a slightly lower number of concepts is recognized, on average. This result is in line with the previous findings, since it confirms that users experience more difficulty in using the correct lexicon to indicate their preferences on subjective properties. In light of this result, it is not surprising that just a small portion of subjective properties (only 23.05% of the total preferences, as previously reported) was collected by our VA. As for the CL, both algorithms confirmed that the use of subjective properties leads to longer conversations. This further confirms that the use of subjective properties makes it more difficult to express preferences, since more turns are required on average to correctly elicit user preferences and to obtain satisfying recommendations.
To complete the analysis, we also report the results obtained by the subset of the questionnaire evaluating how users perceive our strategy for natural language preference elicitation. Scores are reported in Table 4. It should be pointed out that only two configurations, that is to say, objective and objective+subjective, are reported in the table. In this case, we did not split the results based on the underlying recommendation algorithm, since the RS does not play any role in the preference elicitation process.
As shown in the table, the use of subjective features led to a decrease in interaction adequacy. Even if the gaps are tiny and not statistically significant, the adoption of a richer lexicon did not improve the quality of the interaction. Conversely, the users experienced more difficulties in telling the system what they like. However, a tiny increase was noted in terms of ease of use and control, thus it is likely that users appreciate the opportunity to express their preferences through a richer and more articulated lexicon, even if they are not familiar with it.
To sum up, this part of the study confirmed that the combination of subjective and objective features provides users with more opportunities to express their preferences. However, the fact that users are not particularly familiar with the lexicon to use hinders a complete exploitation of subjective features. It is likely that users need to be better informed (e.g., during the training phase, or by means of example messages) about which lexicon they can use before they become more confident. This aspect will be further investigated in future work.
Next, to answer RQ2 and RQ3, we analyze the average HitRate obtained in the different experimental conditions. First of all, we can state that our natural language preference elicitation strategy allows users to get good recommendations regardless of the specific configuration. Indeed, as shown in Figure 4, more than 80% of the users obtain a good recommendation after two turns of conversation, and the value further increases to between 96% and 98% at turn three. Accordingly, we can state that the NLU components successfully caught the mentions of the entities the users indicated in their messages. This leads to a precise representation of the interests of the users, which in turn allows the system to generate accurate recommendations.
Next, by splitting the results based on the different experimental conditions, more controversial findings emerged. As for Doc2Vec, results showed that the injection of subjective features led to more accurate recommendations, especially for HitRate@1. This means that, when subjective features are encoded into the profile and exploited by the recommendation algorithm, a larger share of users get a recommendation even at the first turn. The same outcome is confirmed for HitRate@2 and HitRate@3, even if the gaps are smaller. This analysis shows that the presence of subjective features allows our content-based algorithm to generate more precise recommendations, and this happens regardless of the difficulties that the users experienced in the preference elicitation phase.
Unfortunately, this finding is not confirmed by the configuration based on PPR. Indeed, in this case, the adoption of subjective features led to an overall decrease of the HitRate. However, it should be pointed out that the results obtained by PPR are always lower than those obtained by D2V. Accordingly, it is likely that the poor performance of PPR mainly depends on the underlying data model rather than on the informative power of the features. This result is not particularly surprising, since other work has already shown [40] that PPR often needs automatic feature selection of the properties to obtain the best results. In future work, we will run the experiment again while also filtering the properties, in order to see whether a different data model leads to a higher accuracy. At the moment, we can just state that subjective features do not increase the quality of the recommendations when PPR is used. Finally, in Table 5 we report the results we collected for the remaining questions of the post-usage questionnaire. These questions are more focused on the quality of the recommendations, thus we also split the results based on the underlying algorithm. Generally speaking, the results confirmed the findings of the quantitative analysis based on HitRate. Indeed, the use of subjective features, which positively impacts the quality of the recommendations with D2V, also positively affected the users' perception of the suggestions. Results are particularly significant in terms of perceived usefulness and control of the recommendation process. On the other side, as for PPR, we obtained mixed results again, since the gap between the configurations was really small for most of the comparisons. As a side effect of our methodology, we noted a positive impact of subjective features on the novelty of the recommendations. Even if this aspect is not directly investigated in this paper, we noted an increase in novelty for both PPR and D2V. This is probably due to the introduction
of novel and unexpected connections that follows the injection of subjective features. Thanks to these new properties, more connections between the items are created, thus it is more likely that the user receives as recommendation something new and unexpected.
To conclude, we can state that the introduction of subjective features has a positive impact on the overall quality of the recommendations, especially for content-based techniques. It should be emphasized that, despite the reluctance and the difficulties users showed in expressing preferences in terms of subjective features, when they do adopt such a more sophisticated lexicon, better recommendations are generally returned.

Discussion
Based on the aforementioned results, we can now answer the Research Questions defined in Section 4.1.
Table 5 Results collected in our post-usage questionnaire, comparing the PPR and D2V algorithms and configurations based on objective features to those based on subjective and objective features (all). For each question, the best-performing configuration for each algorithm is reported in bold, while the overall best configuration is also underlined. Gaps below 0.05 are not highlighted.

RQ1 - Interaction: How does our natural language preference elicitation strategy impact the way users interact with a VA? Users generally appreciated
our strategy for natural language preference elicitation. The adoption of a more natural mechanism to gather preferences and needs, inspired by narrative-driven recommendations, allows users to express their interests in a way that better mimics human-to-human interactions. Indeed, the answers to the post-usage questionnaire were generally high for all the constructs, so we can state that no particular issue emerged in terms of user interaction. As for the specific behavior of the users, we noted that they tend to use objective preferences more frequently than subjective ones. This is probably due to the fact that users are not familiar with the lexicon of subjective properties that can be indicated, thus they need some sort of training or suggestions in order to exploit these features more effectively. This finding is also confirmed by the quantitative results we obtained for QD and CL, which showed that the use of subjective properties led to a slightly lower precision in the recognition and an increase in the average length of the conversations before a good recommendation is generated. RQ2 - Recommendation: How does our natural language preference elicitation strategy impact the perceived quality of recommendations? The adoption of a strategy for natural language preference elicitation allowed almost all users to receive a good recommendation within three turns of conversation. These results are even higher when a content-based approach is used. This is an encouraging quantitative result, which confirms that our NLU modules are able to collect all the information needed to generate a good recommendation. Of course, more work is necessary to further improve the recognition quality as well as the methods to inject objective and subjective features into recommendation models. This is also confirmed by the answers to the post-usage questionnaire, which showed that the majority of users was satisfied by both the interaction with the VA and the overall quality of
the recommendations.
RQ3 - Data Representation: How do different variants of the strategy impact the quality of the recommendations and the effectiveness of the preference elicitation process? This is probably the most controversial and interesting part of our study, since the analysis of the outcomes concerning the adoption of subjective features led to mixed findings. On one side, we noted that users experience several difficulties in expressing preferences and needs using subjective features. Indeed, just 23% of the preferences we collected in our study fall into this category. As previously stated, it is likely that some sort of training is needed to help users in their first interactions with the VA, in order to allow them to discover which lexicon they can use to express their interests. However, when some preferences expressed in terms of subjective features are collected, the recommendation algorithms are able to make the best out of our strategy. In particular, this statement holds true for content-based strategies, whose HitRate obtained an increase of 10% when subjective features were injected into the model. However, this behavior was not confirmed by PPR, thus a more thorough analysis shall be conducted to understand how to improve the quality of the suggestions with different recommendation paradigms.
To conclude, we can state that the introduction of subjective features, which is particularly new for VAs and conversational agents, provided us with interesting and novel insights that open the way to further research in the area.

Conclusion
In this paper, we introduce a strategy to elicit natural language user preferences in a VA for the movie domain. Our approach is based on a knowledge extraction pipeline, where both objective and subjective features are extracted from structured and unstructured knowledge sources, followed by a knowledge exploitation pipeline, where the information previously extracted is exploited by the NLU modules and the recommendation algorithm.
The proposed pipeline is then evaluated in a large user study (N = 249), whose results showed that the proposed approach allows users to easily express their preferences and to receive accurate recommendations. As for the use of the features, results showed that people tend to mainly express their preferences in terms of objective features and to discard subjective features. However, when users are able to use subjective features to tell the system what they like, better recommendations are usually generated.
As future work, we propose to expand the experimental study by increasing the sample size, which should yield more statistically robust results. Moreover, we aim to further increase the external validity by evaluating our approach in a different domain. Finally, the implementation of the workflow described in Section 3 can be improved by adopting more sophisticated techniques for opinion mining, recommendation and NLU.

Fig. 1
Fig. 1 Workflow of our strategy for natural language preference elicitation

Fig. 2
Fig. 2 High-level Workflow of the Main Components of our VA
• PPR + Objective Features: Users can express their preferences for movies and objective properties. Recommendations are based on PPR;
• PPR + All Features: Users can express their preferences for movies, objective properties and subjective properties. Recommendations are based on PPR;
• D2V + Objective Features: Users can express their preferences for movies and objective properties. Recommendations are based on our content-based algorithm exploiting Doc2Vec;
• D2V + All Features: Users can express their preferences for movies, objective properties and subjective properties. Recommendations are based on our content-based algorithm exploiting Doc2Vec.

Fig. 4
Fig. 4 Results in terms of HitRate. Different algorithms are reported with different colors. Dotted lines indicate configurations based on objective and subjective features, while solid lines indicate configurations based on objective features alone.

Table 1
Knowledge base statistics

Table 2
Detail of the top-10 most frequent objective and subjective properties recognized by the VA

Table 3
Results obtained for Query Density (QD) and Conversation Length (CL). Higher results are reported in bold.

Table 4
Average answers collected in the post-usage questionnaire.Best-performing configuration is reported in bold.