Research on Implicit Intent Recognition Method Based on Prompt Learning

As one of the core modules of a dialogue system, intent recognition plays an important role in human-computer interaction. Most existing intent recognition research is limited to simple, direct, and explicit intents. However, natural human-computer interaction is flexible and diverse, and users often express their intentions implicitly and euphemistically, which poses new research challenges for the field. This paper introduces a Chinese Implicit Intent Dataset, CIID, which covers 7 common intents from different fields and consists of text containing users' implicit intents. Based on this corpus, prompt learning is employed for implicit intent recognition for the first time: by constructing a suitable prompt template, the model obtains "relevant hints" that help it uncover the user's true intention. Finally, this paper evaluates a range of classification models on the CIID dataset. Experimental results show that the proposed model reaches a recognition accuracy of 97.6%, which is the state-of-the-art result on this dataset. Furthermore, since implicit intent data is difficult to collect, this paper also explores the performance of these classification models on CIID in few-shot settings. When the training data is reduced to 4.7% of its original size, the proposed model still achieves an accuracy of 92.4%, significantly higher than the baseline models, which further shows that the proposed method is effective and robust.


Introduction
Human intentions can be divided into explicit and implicit intentions according to the way they are expressed [5]. An explicit intent means that the user clearly states his or her needs by speech or text. An implicit intent means that the user does not express the requirement directly, so the true intention must be inferred by analyzing the utterance [16,23]. For example, for the hotel booking intent, an explicit expression is "book a hotel near the train station for one night", while an implicit expression may be "I missed the train and have to find a place to live nearby". As these examples show, implicit expressions are more in line with people's daily speaking habits than the mechanical and monotonous expressions of explicit intent. However, due to the difficulty of implicit intent recognition and the lack of implicit intent corpora, most existing intent recognition methods focus on simple and clear utterances with explicit intents. With the increasing application of artificial intelligence in daily life, computers urgently need to infer implicit human intents in addition to understanding simple and direct service instructions. Raising the intelligence level of human-computer interaction in this way allows intelligent systems to provide more considerate services. Therefore, this paper focuses on implicit intention recognition in the field of human-computer dialogue.
Human intentions are usually subjective, and the intention expressed by the same sentence varies from person to person, so generalized intention recognition is very difficult to realize. Existing work manually divides a dataset into a limited number of intent categories and then treats intent recognition as a text classification problem. The most commonly used datasets for explicit intent recognition are the air travel dataset ATIS [11] and SNIPS [6], which was collected from the Snips personal voice assistant. In addition, there are the banking intent recognition dataset BANKING77 [3], the Chinese intent recognition dataset CAIS [20], etc. Compared with explicit intent recognition, research on implicit intent recognition is scarce and there is no public dataset. For this reason, this paper proposes a Chinese Implicit Intent Dataset, CIID. Social media such as Weibo and Facebook provide a platform for users to express their thoughts, wishes and demands [21]. With the popularity of these platforms, more and more users post their status information there, and many of these posts express intentions explicitly or implicitly. Therefore, inspired by the literature [15], this paper collects data from Sina Weibo, the biggest Chinese social media platform, and then filters, cleans and annotates the data. The implicit intent dataset is constructed with seven categories: ordering food, checking weather, shopping, traveling, booking movie ticket, setting reminder and booking hotel.
Implicit intent data usually contains deep-level semantic information that general classification models find difficult to capture, which makes it hard for them to accurately identify the true intent of users. Therefore, this paper proposes an implicit intent recognition method based on prompt learning to mine users' implicit intents. Prompting refers to processing the input text and reformulating the task according to a specific template so that the pre-trained language model can be used more fully for feature extraction. On the one hand, this method gives the pre-trained model a prompt that makes it recall the knowledge learned in the pre-training stage, so as to maximize the potential of the pre-trained model. On the other hand, by adding a suitable natural language template, the method provides "intent hints" that help the model infer the user's true intent. The proposed method is verified on the implicit intent dataset CIID, and the experimental results show that it achieves state-of-the-art recognition results.
The main contributions of this paper are listed as follows: (1) A cross-domain Chinese implicit intent dataset CIID is constructed, which contains 7 common service intents and a total of 5,042 annotated utterances. Few previous works address implicit intent recognition, although it is of great significance for more intelligent and natural human-computer interaction. The proposed dataset CIID can therefore promote the development of implicit intent recognition.
(2) An implicit intent recognition method based on prompt learning is proposed. This method enables the model to fully mine the implicit semantics in the utterance by constructing a prompt template. It is the first attempt of prompt learning applied to implicit intent recognition. The method is also evaluated in few-shot settings. The experimental results demonstrate the proposed method is advanced.
(3) Comprehensive experiments are carried out on a range of benchmark classifiers. Compared with the general pre-trained classification model BERT, the recognition accuracy of our proposed model is improved by 1.6%, and the F1 score is improved by 1.7%, achieving the state-of-the-art experimental results.
The rest of this paper is organized as follows: Section 2 discusses related work on explicit and implicit intent recognition; Section 3 introduces the proposed implicit intent recognition dataset CIID and its construction method; Section 4 describes the proposed implicit intent recognition method in detail; Section 5 gives the experimental results and analysis; Section 6 presents conclusions and future work.

Related works
Intention recognition as the basis of human-computer interaction has been widely studied. Early intention recognition relies on rules written by the domain experts. Ramanand et al. [24] proposed a method based on rules and graphs to obtain the intention templates for consumption intentions, which achieved a good classification result. Their method is easy to understand, but requires different experts to manually formulate feature rules for different fields, and the recognition results heavily rely on the expert rules, so it has certain limitations. The researchers then turned their focus to statistical learning methods. Peng et al. [22] proposed a naive Bayesian approach for intent recognition. The support vector machine model proposed by Haffner et al. [10] has a great performance in the field of classification and is widely used in the task of intent recognition. However, the semantic features based on statistical learning methods are usually determined manually and empirically, and do not consider the contextual information of conversational utterances, so the deep semantics cannot be accurately identified.
Current mainstream intent recognition methods are based on deep learning. Kim et al. [14] first used a CNN for text classification tasks, proposing the textCNN model, and achieved good results. Ravuri et al. [25] used an RNN and an LSTM respectively for intent recognition and verified them on the ATIS dataset. The results show that the recognition performance of the LSTM is superior to that of the RNN. Balodis et al. [1] proposed an intent detection system based on FastText word embeddings and neural network classifiers, and achieved state-of-the-art results on three datasets. Since pre-trained language models such as BERT [7] were proposed, the performance of NLP tasks has been greatly improved. Chen et al. [4] proposed a BERT-based model for joint intent classification and slot filling. Compared to attention-based recurrent neural network models, the model achieves significant improvements in intent classification accuracy, slot filling F1-score, and sentence-level semantic frame accuracy on multiple public benchmarks.
Most existing works focus on explicit intent; in contrast, few studies address implicit intent recognition. Fu et al. [9] made a preliminary exploration in this field and proposed a method for automatic recognition of implicit consumption intention in social media. They regarded implicit consumption intention recognition as a multi-label classification problem that comprehensively uses the user's attention behavior, intention attention behavior, intention forwarding behavior and other features. Li et al. [15] used an attention-based encoder-decoder model for implicit intent detection: they first constructed a parallel corpus of implicit-explicit intents to train the model, and then recognized the converted explicit intent. Jia et al. [12] used transfer learning to integrate data from the Jing Dong Q&A platform with a small amount of labeled Weibo data to construct the training set, and proposed a bidirectional long short-term memory (BiLSTM) neural network with an attention mechanism to identify users' implicit consumption intentions.
Current research on intent recognition is dominated by data-driven deep learning methods, so the lack of implicit intent datasets is one of the main factors restricting the development of this field. Most currently public datasets are for explicit intent recognition; some of them and their attributes are listed in Table 1. ATIS [11], the earliest of them, is a commonly used intent recognition dataset drawn from an air travel corpus, and contains 5,371 utterances and 21 intents. Banking-77 [3] is also a single-domain intent recognition dataset, derived from online queries of bank customers, with 13,083 utterances and 77 fine-grained intents. Both SNIPS [6] and CAIS [20] are cross-domain intent recognition datasets, containing 13,784 and 10,013 utterances respectively. The above datasets only involve explicit intent information, which is monotonous and straightforward. However, daily human communication is flexible and complex, and often carries many overtones. Improving the ability of robots and computers to understand implicit intentions is therefore an essential and critical step towards human-like intelligence. To this end, this paper proposes a Chinese implicit intent dataset CIID, which contains 7 common user intents across different domains, with a total of 5042 utterances.
For the proposed CIID dataset, we adopt a prompt-based method to identify the implicit intents in the utterances. We first construct prompt templates for the input, which convert the original classification task into a masked language modeling task, then use the pre-trained language model to predict the masked token in the template, and finally map it to the true label. For implicit intent recognition tasks, appropriate prompts can provide semantic hints that assist the model in inferring the user's true intent. For example, for the utterance "I missed the train and have to find a place to live nearby", the sentence after constructing the prompt could be "I missed the train and have to find a place to live nearby, so I want to ", which helps the model recognize the user's implicit intent: booking a hotel.

CIID dataset
In order to enable robots to understand the deep semantics of the users and develop the research of intent recognition, this paper proposes a Chinese Implicit Intention Dataset CIID. To the best of our knowledge, no such dataset has been proposed in previous studies. The overview of this dataset and the construction process will be briefly described.

Overview
The CIID dataset contains a total of 5042 utterances covering 7 service intents that are commonly used in human-computer interaction: ordering food, checking weather, shopping, traveling, booking movie ticket, setting reminder and booking hotel. The distribution of the categories is shown in Fig.1. Shopping accounts for the highest proportion, because users are more inclined to share their shopping intentions or product-related information on social media. In contrast, hotel reservations may involve user privacy, so fewer related utterances are collected. Example utterances are listed in Table 2. An implicit intent text expresses the corresponding intent without containing any explicit intent instruction. For example, the utterance "I'm so hungry, but I don't want to cook." in Table 2 contains no expression related to ordering food, yet that is the user's true intention.

Dataset construction
The construction of the CIID dataset includes two stages. In the first stage, a small-scale seed dataset is manually constructed. In the second stage, implicit intention utterance templates are abstracted from the obtained seed dataset, and social media data is crawled based on these templates to expand the dataset. The specific construction process of CIID is shown in Fig.2.

Table 2. Example utterances in CIID and their intent labels
Utterance | Intent
I'm so hungry, but I don't want to cook. | Ordering Food
It has been raining for several days, when will the sun come out? | Checking Weather
I want to wear a skirt tomorrow, will it be cold? | Checking Weather
It's almost New Year's Eve, it's time to start stocking up. | Shopping
I need a new cup with a straw!! | Shopping
The world is so big and I want to see it. | Traveling
I've been waiting for this movie for a long time, finally released! | Booking Movie Ticket
My poor memory... I forgot to sign in for a few days. | Setting Reminder
I have missed the assignment submission time for the third time. | Setting Reminder
I have to stay outside for two nights for the postgraduate entrance examination next week. | Booking Hotel
Are there any places to live around Disneyland? | Booking Hotel

Construct the seed dataset
In order to promote the intelligence in human-computer interaction, the dataset selects 7 common user services of the intelligent assistants as intent labels. We first recruit a number of volunteers and let them express their service requirements in a variety of ways, but these utterances cannot contain the keywords directly related to the intention. For example, for checking weather intent, participants could not directly ask "what is the weather?" and even could not say the keyword "weather". Finally, we collected more than 300 utterances from all the participants, and obtained a small-scale seed dataset after further deduplication and annotation operations.

Dataset augmentation
Because the manually collected dataset is limited in scale, this paper makes full use of the rich data resources of social media to expand the seed dataset. Compared with other data sources such as movie scripts and personal voice assistants, Weibo not only contains richer expressions of user demands, but is also easier to obtain. Therefore, this paper collects a corpus from the popular social media platform Sina Weibo to obtain more posts containing implicit intentions.
Specifically, we abstract a number of implicit intention utterance templates that represent the utterances in the seed dataset. To ensure the richness of the data, more than 20 templates are built for each category. Weibo posts from the past three years are then crawled with the Scrapy crawler according to the generated templates, yielding about 20,000 posts, from which we keep those that really contain implicit intents to expand the dataset. Fig.3 shows an example of the data augmentation process. "I forgot to clock in again this morning" is an utterance in the seed dataset whose intent category is setting reminder. First, we analyze the keywords in the sentence that determine the intention, and then construct the implicit intention utterance template "forgot to ... again" from the obtained keywords. Finally, we crawl the Weibo posts that have similar semantics to the seed data. Since the crawled posts are very noisy and contain a lot of redundant information such as external links and emoticons, we further filter out the posts that are semantically inconsistent with the seed dataset. For example, one retrieved post is "I forgot what I dreamt about this morning again!". Although it conforms to the pattern of the template in Chinese, its semantics differ from those of the seed data. After filtering, we continue to clean the data to remove remaining redundant information such as emoticons, as well as information that may involve user privacy such as personal names. Finally, a total of 5042 annotated implicit intent utterances are obtained, as shown in Fig.1.
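The template-matching and cleaning steps described above can be illustrated with a small sketch. It is not the crawler used for CIID: the regular expression is a hypothetical English rendering of the "forgot to ... again" template, and the cleaning rules (links and @mentions) are assumptions chosen for illustration. It shows why a semantic filtering pass is still needed after matching, since both the seed-like post and the "dream" post satisfy the surface pattern.

```python
import re

# Hypothetical English rendering of one utterance template ("forgot ... again").
TEMPLATE = re.compile(r"\bforgot\b.+\bagain\b", re.IGNORECASE)
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")


def clean(post: str) -> str:
    """Strip external links and @mentions, then collapse whitespace."""
    post = URL.sub(" ", post)
    post = MENTION.sub(" ", post)
    return " ".join(post.split())


def matches_template(post: str) -> bool:
    """Surface-level check only; it says nothing about the post's intent."""
    return TEMPLATE.search(post) is not None


posts = [
    "I forgot to clock in again this morning http://t.cn/abc",
    "I forgot what I dreamt about this morning again!",
    "Nice weather today @friend",
]

# Candidates still have to be checked for semantic consistency with the seed data.
candidates = [clean(p) for p in posts if matches_template(p)]
```

Here `candidates` keeps the first two posts, so the second one would have to be removed by the semantic filtering step.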

Prompt learning
Prompt learning is the new paradigm of natural language processing, which can prompt the pre-trained model to recall the learned knowledge in the pretraining stage by constructing a suitable prompt, so as to unify the downstream tasks with the pre-training tasks [18]. The literature [26] is an earlier work on prompt learning and proposes a PET method which is a semi-supervised training process. PET reformulates the input example into a cloze style sentence to help the language model understand the given task. Jiang et al. [13] used a paraphrase-based method to automatically generate high-quality and diverse prompt templates, while using an ensemble approach to combine answers from different prompt templates. The literature [19] improved the prompt template, and proposed to use some pseudo-prompts, such as the [unused] token in the vocabulary, to replace the manually-defined prompts, so that the prompt tokens can automatically learn and update during training stage. This paper extends prompt learning to the field of implicit intent recognition for the first time. We infer that a suitable prompt can not only give hints to the pre-trained model, but also provide semantic hints to the entire recognition model, so as to help the model understand the implicit intent better.

Implicit intent recognition method
This paper proposes an implicit intent recognition method based on prompt learning to better understand implicit intents. The architecture is shown in Fig.4. First, we construct different prompt templates $\mathrm{prompt}_1, \dots, \mathrm{prompt}_n$ for the input; these templates contain the masked tokens that need to be predicted. At the same time, the mapping between the predicted words and the labels is established. Next, a pre-trained Chinese BERT is adopted to predict the words at the masked positions, which are then mapped to the actual labels. Finally, the $\mathrm{prompt}_i$ ($1 \le i \le n$) with the best performance is selected for intent category prediction.

Construction of prompt and verbalizer
To explore how giving the model an explicit prompt affects the identification of implicit intents, this paper builds the model with manually created prompts and a manually created verbalizer, which is the most natural and direct way. Specifically, for the input text $X$, the prompt is defined as a function $P$, and the input after constructing the prompt is $X' = P(X)$, where $X'$ is a sentence that contains $X$ and the masked tokens. We denote the class labels as $L = (l_1, l_2, \dots, l_k)$ and the predicted vocabulary as $V = (v_1, v_2, \dots, v_m)$; the verbalizer represents the mapping $L \to V$. Since the selection of prompt templates is very important to the performance of the whole model, this paper designs different explicit intent prompt templates for the input based on human knowledge and experience. Table 3 shows the four prompts constructed in this paper and a verbalizer example corresponding to prompt $P_4$. The predicted words in the verbalizer change with the prompt template.
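As a minimal sketch of the prompt function $P$ and the verbalizer, the snippet below wraps an utterance in a template with two mask slots and maps each label to candidate words. The template wording follows the example sentence shown with Fig. 5 ("The topic of this sentence is about ____"). The label words are hypothetical two-character Chinese words chosen for illustration, not the entries of Table 3.

```python
MASK = "[MASK]"

# Template wording taken from the Fig. 5 example; two mask slots because
# each predicted word is assumed to consist of two Chinese characters.
TEMPLATE = "{text} The topic of this sentence is about {mask}{mask}"


def build_prompt(text: str) -> str:
    """P(X): wrap the input utterance X in the template to obtain X'."""
    return TEMPLATE.format(text=text, mask=MASK)


# Hypothetical verbalizer (label -> candidate words); the real one is in Table 3.
VERBALIZER = {
    "ordering food": ["美食"],
    "checking weather": ["天气"],
    "shopping": ["购物"],
    "traveling": ["旅游"],
    "booking movie ticket": ["电影"],
    "setting reminder": ["提醒"],
    "booking hotel": ["酒店"],
}

x_prime = build_prompt("It's so stuffy outside. Is it going to rain?")
```

Changing `TEMPLATE` gives a different prompt, and in that case the words in `VERBALIZER` would normally be adjusted so that they fit the new template grammatically.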

Prediction and answer mapping
By constructing the prompt, we obtain the reconstructed text $X'$ containing the masked tokens, so the intent classification task is transformed from assigning a meaningless label to the input into selecting the most suitable word for the masked positions. BERT is therefore employed to predict the masked tokens in $X'$. Since the Chinese BERT predicts character by character, this paper predicts the two masked tokens $m_1$ and $m_2$ separately and obtains the predicted score distributions $s_1$ and $s_2$, where $s_i$ collects the scores $\mathrm{MLM}(w \mid m_i)$ of the characters $w \in v_i$ for $i = 1, 2$.
$\mathrm{MLM}(w \mid m)$ represents the unnormalized score assigned to the character $w$ at the mask position $m$, $v_1$ is the set of the first characters of all predicted words, and $v_2$ is the set of the second characters.
Then, the scores of all predicted words in the verbalizer are calculated from the obtained character score distributions. Here, the word score is calculated by simply multiplying the scores of its two characters. After calculating the score distribution $s$ of the predicted words, this paper normalizes it to obtain the probability distribution $p$ over all predicted words. The probabilities of the predicted words belonging to the same category are then averaged, and the resulting distribution $\hat{p}$ represents the corresponding label distribution. The prediction process is shown in Fig.5.
Once the probability distribution $\hat{p}$ over the class labels is obtained, the label with the maximum probability is selected as the final predicted class $\hat{l}$ of the model:
$\hat{l} = \arg\max(\hat{p})$ (7)
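The answer-mapping steps above can be traced on toy numbers with the following sketch. The character scores and the two-label verbalizer are invented for illustration. The use of a softmax for the normalization step is an assumption, since the text only states that the word scores are normalized.

```python
import math


def predict_label(s1, s2, verbalizer):
    """Map character scores at the two mask positions to a label.

    s1, s2: dicts from character to unnormalized score at mask 1 / mask 2.
    verbalizer: dict from label to a list of two-character label words.
    """
    # Word score = product of the scores of its two characters.
    word_scores = {
        word: s1[word[0]] * s2[word[1]]
        for words in verbalizer.values()
        for word in words
    }
    # Normalization over all label words (softmax is an assumption here).
    z = sum(math.exp(s) for s in word_scores.values())
    p = {word: math.exp(s) / z for word, s in word_scores.items()}
    # Label probability = mean probability of the words mapped to that label.
    p_hat = {
        label: sum(p[w] for w in words) / len(words)
        for label, words in verbalizer.items()
    }
    # Final prediction = label with the maximum averaged probability.
    return max(p_hat, key=p_hat.get), p, p_hat


# Toy scores and a toy verbalizer (hypothetical values).
toy_verbalizer = {"checking weather": ["天气"], "ordering food": ["美食", "外卖"]}
toy_s1 = {"天": 2.0, "美": 1.0, "外": 0.5}
toy_s2 = {"气": 2.0, "食": 1.0, "卖": 1.0}

label, p, p_hat = predict_label(toy_s1, toy_s2, toy_verbalizer)
```

With these toy scores the word "天气" dominates after normalization, so `label` is "checking weather". Averaging within a label keeps categories with many label words from being favored merely because they have more candidates.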

Training loss
In the training phase, the probability distribution $p(v \mid X')$ over all predicted words is produced by the pre-trained model, and the mean probability of the predicted words belonging to the same category is taken as the corresponding label probability $\hat{p}(l \mid X)$. The loss function is defined on $\hat{p}(l \mid X)$ and the ground truth $y$, where $k$ is the number of labels.

Fig. 5 The process of prediction for the example input "It's so stuffy outside. Is it going to rain? The topic of this sentence is about ____". First, the pre-trained language model predicts the character scores at the masked positions; then the character scores are multiplied to obtain the word scores; finally, the word probabilities are calculated for category prediction.
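A loss consistent with the description of the training phase (a label distribution $\hat{p}(l \mid X)$, $k$ labels and a ground truth $y$) is the standard cross-entropy. The form below is an assumption written in the notation of this section, with $y$ taken to be one-hot over the $k$ labels.

```latex
\mathcal{L} = -\sum_{i=1}^{k} y_i \,\log \hat{p}(l_i \mid X),
\qquad
y_i =
\begin{cases}
1, & \text{if } l_i \text{ is the ground-truth label of } X,\\
0, & \text{otherwise.}
\end{cases}
```

Under this assumption only the term of the correct label contributes, so minimizing $\mathcal{L}$ raises the averaged probability of the label words that the verbalizer assigns to the true intent.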

Experiments
To validate the constructed CIID dataset and proposed implicit intent recognition method based on prompt learning in this paper, comprehensive experiments are carried out on a range of classification models. In this section, the experimental setup, experimental results and analysis are presented.

Baselines
(1) Non-pretrained models. This paper uses non-pretrained models such as BiLSTM [17], FastText [1], TextCNN [14] as baseline models. First, the input sentences are encoded by pre-trained word vectors, then features are extracted through the neural network layers, and finally intent classification is performed through a softmax function.
(2) Pre-trained model. The pre-trained model BERT [7] is adopted as the baseline model. After fine-tuning the BERT model on the proposed dataset CIID, the semantic representation vector [cls] is extracted, and then the intent classification is performed after the linear transformation.

Experimental settings
The hardware platform is an Intel® Core™ i9-10900 CPU at 2.8 GHz with an NVIDIA GeForce RTX 3090 GPU, running Ubuntu 18.04.3. The proposed implicit intent recognition model is built on the OpenPrompt [8] toolkit provided by the Natural Language Processing Laboratory of Tsinghua University, and the other models are implemented with the PyTorch framework. The dataset is divided into training, validation and test sets of 4242, 300 and 500 utterances, and the detailed statistics are shown in Table 4. For the models evaluated on the CIID dataset, the learning rates of the non-pretrained models, the pre-trained model and our proposed model are set to 1e-3, 5e-5 and 3e-5, and the numbers of epochs are 100, 10 and 10 respectively. To avoid overfitting, all models adopt an early stopping strategy. The evaluation metrics are F1 score and accuracy.
Table 5 shows the experimental results of classification with different prompt templates on the proposed CIID dataset. It can be seen that different prompts affect the recognition accuracy of the model. However, no matter which prompt template is used ($P_1$, $P_2$, $P_3$ or $P_4$), the recognition rate of the corresponding model is higher than that of the other five classification models in Table 6. Among these prompts, $P_4$ obtains the best performance, so this study uses the results of $P_4$ for the comparisons with other models below.
Table 6 shows the experimental results of all the evaluated models on the CIID dataset, where Precision, Recall and F1 refer to the macro average. As can be seen from Table 6, the F1 score and accuracy of our proposed model are 0.975 and 0.976 respectively, which are the state-of-the-art results. From the results we can conclude that:
(1) The BERT model has certain advantages over non-pretrained models such as BiLSTM [17], FastText [1] and TextCNN [14], because it has a general language representation obtained by pre-training on a large-scale corpus.
(2) Compared with the BERT [7] model, our prompt-based model can more fully utilize the knowledge learned in the pre-training stage; furthermore, the prompts also play an important role in understanding text semantics.
Fig.6 analyzes the F1 scores of the baseline models BiLSTM+Att [27] and BERT [7] and of the proposed prompt-based model on each category. The BiLSTM+Att model adds an attention layer to the BiLSTM model, which assigns weights to the words to enhance the performance and interpretability of the model. However, the BiLSTM+Att model has the lowest F1 score in each category in Fig.6, which shows that optimizing the neural network alone cannot fully capture the implicit semantics in a sentence. The BERT model is pre-trained on a large-scale corpus and acquires general knowledge, which is helpful for understanding implicit semantics. Therefore, its F1 scores in all categories are higher than those of the BiLSTM+Att model. Compared with the above models, our proposed prompt-based model achieves the highest F1 score in almost all categories, which shows that our method can not only understand implicit intents correctly but is also applicable to different fields.
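For reference, the two evaluation metrics can be computed as in the following sketch: accuracy over all test utterances, and macro-averaged F1 as the unweighted mean of the per-category F1 scores. The toy labels are invented and the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted intent equals the gold intent."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with two categories (hypothetical labels).
gold = ["shopping", "shopping", "hotel", "hotel"]
pred = ["shopping", "hotel", "hotel", "hotel"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
```

Because every category has the same weight in the macro average, a small category such as booking hotel influences the score as much as a large one such as shopping.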

Ablation experiments for few-shot settings
The lack of data sources has always been an important problem for intention recognition, especially for implicit intention recognition. Because implicit intent data is very difficult to acquire and annotate, the model needs to show good recognition performance even when the training set is small. Therefore, this paper explores the performance of the proposed model with training sets of different sizes. The experimental results are shown in Table 7.
The experimental results in Table 7 show that when the training set is reduced to about 47% of the original, the accuracy of our model is 96.4%, and when the training set is reduced to 4.7% of the original, the accuracy of our model still reaches 92.4%. The results indicate that the proposed model remains effective with limited training data. To make the conclusions more convincing, this paper conducts comparative experiments on different models to explore the influence of the training set size; the results are shown in Fig.7. When the training samples are reduced to 1000, the recognition accuracy of all evaluated models decreases slightly. However, when the training samples are reduced to 500, the recognition accuracy of the proposed model is 94%, while that of the other models drops significantly. As the training samples are reduced further, the recognition accuracy of the traditional neural network models drops sharply. Compared with these neural network models, BERT [7] shows a smaller decrease in accuracy owing to its prior knowledge. The prompt-based model obtains stable recognition accuracy even when the training set is extremely small, which shows that the proposed model has the highest recognition accuracy and a strong few-shot learning ability.
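A reduced training set of the kind used in these experiments can be drawn as in the sketch below. Uniform random sampling with a fixed seed is an assumption, since the sampling procedure is not described here, and the integer list stands in for the 4242 training utterances.

```python
import random


def subsample(train_set, fraction, seed=0):
    """Draw a random subset containing roughly `fraction` of the training set."""
    n = round(len(train_set) * fraction)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(train_set, n)


# Placeholder for the 4242 training utterances.
full_train = list(range(4242))

small = subsample(full_train, 0.047)   # about 4.7% of the training set
medium = subsample(full_train, 0.47)   # about 47% of the training set
```

If some categories are rare, a per-category (stratified) draw may be preferable so that every intent keeps at least a few examples in the smallest settings.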

Conclusions and future works
This paper studies implicit intention recognition. First, a new implicit intention dataset, CIID, is built, which contains 7 common and representative intents in human-computer interaction, with a total of 5042 utterances. Second, a prompt-based approach is proposed to identify the implicit intents in the CIID dataset and is compared with a series of benchmark models. The experimental results show that the proposed model achieves the best performance among the evaluated models, since constructing a suitable prompt enhances the ability to infer users' true intents. Finally, we evaluate the models in few-shot settings, and the prompt-based model performs well even when the training samples are insufficient.
In the future, a greater variety of intents will be added according to the daily requirements of human-computer interaction, covering as many aspects of daily life as possible. In addition, since human intentions are often subjective and vary from person to person, we will further explore the fusion of multimodal information and combine it with user portraits to achieve better recognition performance.

• Human and Animal Ethics
Not Applicable
• Availability of supporting data
The data that support the findings of this study are available from the corresponding author on request.