A response generator with response-aware encoder for generating specific and relevant responses

Dialogue data usually consist of pairs of a query and its response, yet no previous response generator has exploited the responses explicitly during training, even though a response provides significant information about the meaning of its query. Therefore, this paper proposes a sequence-to-sequence response generator with a response-aware encoder. The proposed generator exploits golden responses by reflecting them in the query representation. For this purpose, the response-aware encoder adds a relevancy scorer layer to the transformer encoder that calculates the relevancy of query tokens to a response. However, golden responses are available only during training of the response generator and unavailable at inference time. As a solution to this problem, the joint learning of a teacher and a student relevancy scorer is adopted. That is, at training time, both the teacher and the student relevancy scorers are optimized, but the decoder generates a response using only the relevancy of the teacher scorer. At inference time, the decoder uses that of the student scorer instead. Since the student scorer is trained to minimize its difference from the teacher scorer, it can be used to compute the relevancy of a prospective response. The proposed model is the first attempt to use a golden response directly for generating a query representation, whereas previous studies reflected responses only implicitly and indirectly. As a result, it achieves higher dialogue evaluation scores than the current state-of-the-art models on the Reddit, Persona-Chat, and DailyDialog data sets.


Introduction
A number of studies have shown that the generation of a response unrelated to a query is one of the most critical problems in response generation (Griol and Molina 2016; Serban et al. 2017; Huang et al. 2020). The generation of ungrammatical responses is decreasing thanks to recent pre-trained language models, while illogical responses remain nearly inevitable in the end-to-end architecture (Wang et al. 2020). As a result, the task of response generation now focuses mainly on reducing over-general and irrelevant responses (Ling et al. 2021; Sun et al. 2021).
One reason for general responses is that traditional loss functions such as the negative log-likelihood tend to assign a high probability to safe responses. To prevent a response generator from preferring general responses to specific ones, Li et al. modeled the bidirectional influence between a query and its response by proposing maximum mutual information as a loss function, and Wu et al. added a max-marginal ranking regularization term to the cross-entropy loss (Wu et al. 2018). However, since these methods focus only on avoiding safe responses, they are ineffective in reducing other types of unrelated responses.
Some recent studies have found that resolving a lack of background knowledge helps a response generator avoid irrelevant responses. Thus, these studies tried to inject external knowledge into a response generator. Ghazvininejad et al. adopted unstructured texts (Ghazvininejad et al. 2018) and Wu et al. chose a structured graph as external knowledge for supplying background knowledge to dialogues. Retrieving knowledge adequate for a query is a must in this approach, but such retrieval is difficult when the query is short, as in many dialogue-related tasks.
Both irrelevant and over-general responses actually originate from a failure to capture the core meaning of a query (Song and Park 2018; Khattak et al. 2021). Existing response generators usually capture the core meaning from the query alone and do not consider the user response, even though user responses are available during training. It is also known that a human utterer designs an utterance to induce a certain response from the hearer, and the hearer's response is made by considering the core meaning of the utterance (Grice 1969). Therefore, both a query and its prospective response are required for encoding the query in order to generate an appropriate response to it. However, at least to our knowledge, no study has modeled prospective responses explicitly for query representation.
This paper proposes a novel response generator that generates a relevant and specific response to a query. The proposed generator reflects the response of a query in the query representation. For this, its encoder assigns importance weights to query tokens according to their relevancy to a response. The key problem here is that the response is unavailable at inference time. As a solution, the joint learning of teacher and student networks (Lian et al. 2019) is adopted, where the input to one network (the teacher) is a pair of a query and a response, and the input to the other (the student) is just a query. The two networks are trained to minimize the difference between their posterior distributions. Thus, the response generator uses the teacher network at training time and the student network at inference time to obtain a query representation. As a result, a query is encoded as a vector attended by its real response at training time and by a prospective response at inference time. The effectiveness of this query representation is shown by applying the response-aware encoder to pre-trained sequence-to-sequence models such as MASS (Song et al. 2019) and BART (Lewis et al. 2019).
The main contributions of this work are:

• This paper is the first attempt to use a golden response in generating a query representation to improve the performance of a response generator. Since the golden response can be used only in the training step, the teacher-student framework is adopted. In addition, an attention mechanism is used to reflect the golden response in the query representation.

• This paper adopts a golden response to understand the intention of the utterer. Since the utterer designs the utterance to induce a certain response, using a golden response helps understand the utterer's intention.

• The proposed response-aware encoder creates a query representation that reflects a golden response through a relevancy scorer. The decoder of the proposed response generator generates a response similar to the golden response through this response-aware query representation.
The rest of this paper is organized as follows. Section 2 introduces previous studies on response generation, the teacher-student framework, and implicit reflection of responses. Section 3 describes why a golden response is needed for query representation. Section 4 presents the proposed response-aware encoder and how to train it, and Sect. 5 introduces two methods of response generation using the response-aware encoder. Section 6 gives the experimental results, and finally Sect. 7 draws conclusions and future work.

Related work
This section presents the previous studies. It begins with how to avoid generating inappropriate responses. One promising way to generate a relevant response to a query is to use a golden response. Thus, in the following subsection, a teacher-student framework that allows a model to use a golden response is introduced. Afterward, the previous studies using the teacher-student framework to exploit the golden response implicitly are shown.

Inappropriate response avoidance
A number of efforts have been made to minimize inappropriate responses since the sequence-to-sequence model was introduced to response generation. Wu et al. found that the standard loss for training a response generator prefers high-frequency tokens to low-frequency ones (Wu et al. 2018). As a result, new losses compensating for low-frequency tokens have been proposed. For instance, Jiang et al. proposed the frequency-aware cross-entropy loss (Jiang et al. 2019). However, these studies focused only on avoiding general responses, while inappropriate responses fall into four types: ungrammatical, irrelevant, illogical, and over-general.
One reason for irrelevant responses is a lack of background knowledge in the response generator. The representative approach to this problem is to provide external information to the generator (Liu et al. 2019). For instance, Ghazvininejad et al. used unstructured text as external knowledge for a fully data-driven neural dialogue system (Ghazvininejad et al. 2018; Paranjape et al. 2022), while Zhou et al. and Wu et al. adopted a knowledge graph to provide common knowledge to a dialogue response generator (Bai et al. 2021; Wang et al. 2021). Another way to provide background knowledge is to use a dialogue corpus which contains a specific type of information. Methods using persona information (Chan et al. 2021; Cho et al. 2022) and methods using empathetic information (Rashkin et al. 2019; Ando et al. 2021; Wei et al. 2021; Li et al. 2021) are representative.

Teacher-student joint-learning framework
Hinton et al. first proposed the teacher-student framework to reduce the resource consumption of deep neural networks (Hinton et al. 2014). Since then, a number of studies have delivered the knowledge of a teacher network to a student network (Menon et al. 2021), but the framework has been used for a different goal in response generation. Since a student network is trained to mimic its teacher under this framework, Lian et al. used it to allow a model to exploit information available only at training time (Lian et al. 2019). Following this study, Wu et al. used a teacher network to guide a student network for knowledge selection using both a query and a response, and Bai et al. did so using a dialogue goal and dialogue history (Bai et al. 2021).
ConKADI adopts the teacher-student framework to utilize a golden response for selecting commonsense knowledge related to the response, and shows state-of-the-art performance on the Reddit data set. However, it employs the golden response for retrieving response-aware commonsense knowledge, not for response generation itself. MRBD (Feng et al. 2021) is another dialogue-related model that adopts the teacher-student framework, using it to solve various subtasks of dialogue generation. Even though it showed state-of-the-art performance on the Persona-Chat and DailyDialog data sets, it does not utilize a response at all in dialogue generation.

Implicit reflection of responses
The fact that most end-to-end response generators are trained with pairs of a query and its response implies that they are taught implicitly to reflect a response in some way. MASS (Song et al. 2019), BART (Lewis et al. 2019), and T5 (Raffel et al. 2020) learn the relation between a query and a response at the cross-attention layer of the decoder, while DialoGPT and UBAR learn it at the self-attention layer. However, the effect of using a response is limited due to such implicit reflection.
Nevertheless, to the best of our knowledge, there is as yet no study that exploits responses explicitly.

Factors affecting query representation
The representation of a query is affected by two kinds of factors. One is the relative importance among the words within a query. A query, in general, contains a few important words that represent its core meaning. Thus, many previous studies tried to find the important words with an attention mechanism (Bahdanau et al. 2015; Shan et al. 2018). However, a query by itself is insufficient to fully capture its own core meaning. According to Grice, a query is designed by considering its prospective response (Grice 1969). Thus, the other factor that affects the query representation is the prospective response of the query. Table 1 enumerates three example pairs of a query and its response extracted from the Reddit data set to show the insufficiency of a query for capturing its core meaning in a single-turn dialogue. Q and R in this table indicate a query and its response, respectively, and the important words of a query are marked in boldface. In the first example, it is easy to see that the utterer of Q expects the respondent to answer with 'orange' or 'gold'. Thus, 'orange' and 'gold' should be regarded as important words. Note that the actual response in R is made with 'orange'. In addition, the word 'choice' in R is strongly associated with 'pick' in Q. Therefore, 'pick' should also be an important word.
The other examples show a similar phenomenon. For instance, R contains the expressions 'gave it a try' and 'still waiting'. Thus, the expressions 'gave it a shot' and 'haven't received' in Q should be considered important. That is, the words in a query that induce a response are the important ones. Therefore, such words should be focused on when generating a query representation.

Response-aware encoder
The proposed response generator is basically a sequence-to-sequence model whose overall structure is explained in detail in the following section. Its main distinction from other sequence-to-sequence models is its response-aware encoder, which reflects both a query and a response in the query representation. Thus, the explanation of the proposed generator begins with how the response-aware encoder is structured and how it is trained.
Let D = {(q_i, r_i)} denote a single-turn dialogue data set which consists of pairs of a query q_i and a response r_i. A query is a token sequence denoted as q_i = q_i1, q_i2, ..., q_in, where n is the query length. A response is also a token sequence denoted as r_i = r_i1, r_i2, ..., r_im, where m is the response length. A sequence of embedding vectors for a query q_i, q_i = q_i1, q_i2, ..., q_in, is obtained by leveraging a transformer encoder, since transformer encoders have shown remarkable performance in many dialogue tasks (Oluwatobi and Mueller 2020). That is, the vector q_i is computed by

q_i = Encoder(q_i),   (1)

where Encoder(·) is a transformer encoder. The transformer encoder resolves the relative importance among query tokens, but does not reflect a prospective response at all, since Encoder(·) does not consider any response to the query. As a result, q_i does not represent the core meaning of the query completely. To solve this problem, the tokens in a query that correspond with a response should be emphasized more.
Let r_i = r_i1, r_i2, ..., r_im be the vector sequence for r_i, where each r_ij is encoded by the transformer encoder in Eq. (1). The response r_i can be represented as a single vector r^e_i by applying average pooling to r_i. That is,

r^e_i = (1/m) Σ_{j=1..m} r_ij.

Then, following the work of Cai et al. (2019), the relevancy of a query token q_ij to r^e_i is obtained by

e^t_ij = φ(W_t [q_ij; r^e_i]),   (2)

where φ is the ELU activation function and W_t is a learnable parameter. The relevancy of q_i to r_i, denoted as R^t_i = R^t_i1, R^t_i2, ..., R^t_in, is computed by applying the softmax to the e^t_ij's. That is,

R^t_ij = exp(e^t_ij) / Σ_{k=1..n} exp(e^t_ik).   (3)

Since R^t_ij is a scalar value which represents the importance weight of q_ij with respect to r_i, the final response-aware representation of q_i is obtained by multiplying every R^t_ij into its corresponding q_ij. That is, the final query representation q̂^t_i is

q̂^t_i = q̂^t_i1, q̂^t_i2, ..., q̂^t_in,   (4)

where q̂^t_ij = R^t_ij · q_ij.

Note that the r_i's are available only at training time; there is no response to a query at inference time, so it is impossible to calculate the relevancy R^t then. As a solution to this problem, inspired by the work of Wu et al., a teacher-student framework is adopted to distill the knowledge of scoring R^t into a student relevancy scorer. Let the module scoring R^t be the teacher relevancy scorer. Then, the student relevancy scorer measures the relevancy of each query token under the circumstance in which only queries are available, without their responses. For each query token vector q_ij, the student scorer determines the relevancy of q_ij by

e^s_ij = φ(W_s q_ij),   (5)

where W_s is a learnable parameter. The relevancy score R^s_i = R^s_i1, R^s_i2, ..., R^s_in is then obtained by applying the softmax function to the e^s_ij's. That is,

R^s_ij = exp(e^s_ij) / Σ_{k=1..n} exp(e^s_ik).   (6)

Although R^s_i estimates the relevancy of q_i to r_i, it is computed only from the query, which can lead to a large discrepancy from R^t_i. To minimize this discrepancy, the student relevancy scorer is trained with the Kullback-Leibler loss

L_kd = KL(R^t_i || R^s_i) = Σ_{j=1..n} R^t_ij log (R^t_ij / R^s_ij).   (7)

The smaller L_kd is, the more similar R^s_i becomes to R^t_i. As a result, if the student relevancy scorer is trained to minimize the loss in Eq. (7), it becomes able to compute the relevancy of q_i to r_i using only q_i. Then, the final query representation q̂^s_i by the student relevancy scorer is

q̂^s_i = q̂^s_i1, q̂^s_i2, ..., q̂^s_in,   (8)

where q̂^s_ij = R^s_ij · q_ij.

The structure of the proposed response-aware encoder is shown in Fig. 1. The encoder consists of two modules: a transformer encoder layer and the relevancy scorers. The transformer encoder layer transforms a query q_i and its response r_i into the vector representations q_i and r_i. After that, the relevancy of q_i to r_i is applied to q_i, where the relevancy is computed by two scorers. The teacher scorer computes the relevancy R^t_i with Eqs. (2) and (3) using both q_i and r_i, while the student scorer computes R^s_i with Eqs. (5) and (6) using only q_i. Then, the final query representation for q_i is obtained by both scorers: the teacher scorer represents q_i as q̂^t_i in Eq. (4) and the student scorer expresses it as q̂^s_i in Eq. (8). Note that both scorers are trained at training time, but only the student scorer is used to represent q_i at inference time.

There are two kinds of models according to how R^t_i and R^s_i are utilized. One model makes the encoder use them, as explained above; Fig. 2a and b correspond to this model. In this model, the transformer decoder receives the query representation from the response-aware encoder at the encoder-decoder attention layer as a context vector. Since a response r_i for every query q_i is available at training time, q̂^t_i from the teacher relevancy scorer is used as the input for the decoder (see Fig. 2a). However, at inference time, there is no response for q_i, so q̂^s_i from the student relevancy scorer is used, as shown in Fig. 2b.
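To make the two scorers concrete, the following is a minimal pure-Python sketch of the teacher and student relevancy scorers and the distillation loss. The concatenation [q_ij; r^e_i] inside the teacher score and the single weight vectors `w_t` and `w_s` are simplifying assumptions for illustration; the actual model uses learnable parameters inside a transformer encoder.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def elu(x, alpha=1.0):
    # ELU activation, the phi in Eqs. (2) and (5)
    return x if x >= 0 else alpha * (math.exp(x) - 1)

def teacher_relevancy(q_vecs, r_vecs, w_t):
    # r^e_i: average pooling over response token vectors
    d = len(r_vecs[0])
    r_e = [sum(v[k] for v in r_vecs) / len(r_vecs) for k in range(d)]
    # e^t_ij = ELU(w_t . [q_ij ; r^e_i]), then softmax over query tokens (Eqs. 2-3)
    scores = [elu(sum(w * x for w, x in zip(w_t, q + r_e))) for q in q_vecs]
    return softmax(scores)  # R^t_i

def student_relevancy(q_vecs, w_s):
    # e^s_ij = ELU(w_s . q_ij): scores computed from the query alone (Eqs. 5-6)
    scores = [elu(sum(w * x for w, x in zip(w_s, q))) for q in q_vecs]
    return softmax(scores)  # R^s_i

def kd_loss(r_t, r_s):
    # KL(R^t || R^s): distillation target for the student scorer (Eq. 7)
    return sum(t * math.log(t / s) for t, s in zip(r_t, r_s))

def reweight(q_vecs, rel):
    # q-hat_ij = R_ij * q_ij: response-aware query representation (Eqs. 4 and 8)
    return [[r * x for x in q] for q, r in zip(q_vecs, rel)]
```

At training time, `reweight` with the teacher relevancy feeds the decoder while `kd_loss` pushes the student toward the teacher; at inference time only `student_relevancy` and `reweight` are needed.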

Response generation by response-aware encoder
The other model utilizes R^t_i and R^s_i at the decoder; Fig. 2c and d show this model. This model regards R^t_i and R^s_i as an additional attention over the encoder output that comes from a response. Let Q, K, and V be the query, key, and value matrices for the multi-head attention, respectively, and let r be a relevancy score vector. Then, the scaled dot-product attention at the encoder-decoder attention layer is defined as

Attention(Q, K, V) = (softmax(QK^T / √d_k) ∘ r) V,

where d_k is the embedding dimension and ∘ is the Hadamard product between a matrix and a vector. The vector r in this equation is R^t_i at training time and R^s_i at inference time.
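The relevancy-weighted attention above can be sketched as follows, under the assumption that the relevancy vector scales the attention weights without renormalization (the paper does not state whether the weights are renormalized after the Hadamard product).

```python
import math

def softmax(xs):
    # numerically stable softmax
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def relevancy_attention(Q, K, V, r):
    # softmax(QK^T / sqrt(d_k)) with each attention row scaled element-wise
    # by the relevancy vector r before mixing the value rows
    d_k = len(K[0])
    out = []
    for q in Q:
        logits = [sum(qa * ka for qa, ka in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        attn = softmax(logits)
        attn = [a * ri for a, ri in zip(attn, r)]  # Hadamard product with r
        out.append([sum(a * v[j] for a, v in zip(attn, V))
                    for j in range(len(V[0]))])
    return out
```

With r set to all ones this reduces to standard scaled dot-product attention; a zero entry in r masks the corresponding encoder position entirely.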
The label-smoothed cross-entropy loss L_ce (Szegedy et al. 2016) is used for training the response generator:

L_ce = -(1/I) Σ_{i=1..I} Σ_k q_i(k) log p_i(k),

where I is the number of data, q_i is the smoothed ground-truth label distribution, and p_i is the predicted output distribution. Thus, the loss for training the whole response generator is a combination of L_kd in Eq. (7) and L_ce: L_kd forces the student relevancy scorer to distill the knowledge of the teacher relevancy scorer, while L_ce compels the response generator to generate an answer similar to the golden response. That is, the whole loss for the proposed response generator is

L = L_ce + α · L_kd,   (9)

where α is a hyper-parameter.
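The combined objective can be sketched per token as follows. Spreading the smoothing mass `ls` uniformly over the non-gold tokens follows the standard formulation of Szegedy et al.; the default α = 0.4 matches the value selected in the experiments, and the function names are illustrative.

```python
import math

def label_smoothed_ce(pred_probs, target_idx, ls=0.1):
    # smoothed target distribution: (1 - ls) on the gold token,
    # ls spread uniformly over the remaining k - 1 tokens
    k = len(pred_probs)
    loss = 0.0
    for j, p in enumerate(pred_probs):
        q = (1.0 - ls) if j == target_idx else ls / (k - 1)
        loss -= q * math.log(p)
    return loss

def total_loss(l_ce, l_kd, alpha=0.4):
    # L = L_ce + alpha * L_kd  (Eq. 9)
    return l_ce + alpha * l_kd
```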

Experiments
This section presents the experimental setups including data sets and implementation details, and the empirical evaluations of the proposed model.

Data sets
Three benchmark data sets are used to verify the performance of the proposed response generator: the Reddit data set, the Persona-Chat data set, and the DailyDialog data set. All data sets are composed of open-domain dialogues. Each dialogue of the Reddit data set is a single-turn query-response pair, while Persona-Chat and DailyDialog consist of multi-turn dialogues. Queries and responses shorter than four words or longer than 30 words were excluded from the Reddit data, following previous work. After this pre-processing, the Reddit data set contains 1,352,961 training pairs, 40,000 validation pairs, and 40,000 test pairs. The multi-turn dialogues of Persona-Chat and DailyDialog are converted to single-turn ones as done in the study of Feng et al. (2021). Table 2 summarizes the statistics of the data sets used in the experiments below.
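The length filter described above can be sketched as below, assuming whitespace tokenization for word counting (the paper does not specify the tokenizer used for this step).

```python
def keep_pair(query, response, min_len=4, max_len=30):
    # Retain only pairs where both the query and the response
    # contain between min_len and max_len words (inclusive)
    return all(min_len <= len(s.split()) <= max_len
               for s in (query, response))
```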

Implementation details
The hyper-parameters for the transformer encoder layer and the decoder are equivalent to those of MASS or BART: RA-MASS follows the settings of MASS-base-uncased and RA-BART those of bart.base. In addition, the recommendations of both models are followed for all other implementation issues. RA-MASS and RA-BART share many parameter values. For instance, the dimension of embedding vectors is 768, and that of the inner layers of the feed-forward networks is 3,072. The number of heads in multi-head attention is twelve, and the number of transformer layers in both the encoder and the decoder is six. The batch size for training and validation is 32 and the learning rate is 0.0001. The Adam optimizer (Kingma and Ba 2015) is adopted with β1 = 0.9 and β2 = 0.98. In addition, label smoothing of ls = 0.1 is used with the cross-entropy loss. RA-MASS uses an attention dropout of 0.1 and an activation dropout of 0.1, but RA-BART does not. For the implementation of RA-MASS and RA-BART, Python 3.7.3 and PyTorch 1.8.0 were used; MASS is implemented with Fairseq 0.8.0 and BART with Fairseq 0.10.2. Lastly, the environment for training and running the proposed model is a PC with one RTX 2080ti GPU, 128 GB RAM, and an Intel Xeon CPU.

Evaluation metrics
The performance of a response generator is measured in four aspects: embedding, overlap, diversity, and informativeness. Embedding is measured with Emb_avg and Emb_ex, overlap with BLEU-1 and BLEU-2, diversity with Dist-1 and Dist-2, and informativeness with Entropy. Emb_avg (Liu et al. 2016) is the similarity between the average of all token embedding vectors of a golden response and that of the generated response, while Emb_ex measures the similarity between embedding vectors using vector extrema. BLEU-1 and BLEU-2 are the ratios of uni-gram and bi-gram overlaps, respectively, and Dist-1 and Dist-2 are the ratios of distinct uni-grams and bi-grams in all generated responses. Entropy is the average word-level entropy (Mou et al. 2016), which indicates how informative a generated response is.
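Dist-n and the entropy metric can be sketched as follows. The entropy here is computed from the corpus-level unigram distribution of the generated responses, a simplification of the word-level entropy of Mou et al. (2016); whitespace tokenization is assumed.

```python
import math
from collections import Counter

def distinct_n(responses, n):
    # Dist-n: ratio of unique n-grams to all n-grams over the generated responses
    grams = []
    for r in responses:
        toks = r.split()
        grams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

def word_entropy(responses):
    # entropy (in bits) of the unigram distribution of the generated responses
    counts = Counter(t for r in responses for t in r.split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```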

Baselines
The performance of the proposed model is compared with the following baselines:

• AS2S: a standard Seq2Seq RNN model with an attention mechanism (Bahdanau et al. 2015).

• COPY: a Seq2Seq-based model with the copy mechanism (Zhu et al. 2017).

• ConKADI: the commonsense knowledge-aware response generator, which shows state-of-the-art performance on the Reddit data set.

• MRBD: a training framework with multi-view feature representation and bidirectional distillation, which shows state-of-the-art performance on the Persona-Chat and DailyDialog data sets.

Utilization of relevancy score
There are two ways of utilizing the R^s_i's: one is to use them at the encoder and the other is to use them as attention for the decoder. Table 3 compares the two ways empirically on the Reddit data set. Encoder in this table means the utilization of R^s_i's at the encoder, while Decoder indicates their utilization at the decoder. For both pre-trained models MASS and BART, Encoder outperforms Decoder on all metrics. In particular, Encoder achieves much higher BLEU-1 than its corresponding Decoder. To sum up, Table 3 shows that it is more effective to apply the relevancy scores directly to the query representation than to use them indirectly through the encoder-decoder attention. Therefore, R^s_i's are used by the encoder in all experiments below.

Figure 3 shows how BLEU-1 changes according to α in Eq. (9) with RA-MASS trained on the Reddit data set. Note that α regulates how strongly RA-MASS imitates the output of the teacher relevancy scorer. RA-MASS fails to predict prospective responses properly if it does not imitate the teacher scorer; thus, it achieves low BLEU-1 when α is small. On the other hand, an overly large α also hurts BLEU-1 because the effect of the cross-entropy loss is relatively ignored. Therefore, α = 0.4 is adopted for the following experiments.

Table 4 reports the experimental results on the performance of the proposed response generator. The results show that the proposed generator, represented as RA-MASS and RA-BART, outperforms the current state-of-the-art models as well as the baselines with a confidence level of 95% (except MRBD, whose code is not publicly available). The best model for all data sets is the proposed RA-MASS, which shows the highest performance on all metrics except Dist-1 and Dist-2 on the Reddit data set. Higher embedding and BLEU scores imply that a model generates more relevant responses, and higher entropy means that a model produces more specific responses. Thus, RA-MASS is concluded to generate the most relevant and specific responses. The model with the highest diversity on the Reddit data set is ConKADI. Since ConKADI exploits external knowledge to achieve high diversity, it could obtain higher diversity than RA-MASS, but it shows lower performance on all other metrics. The current state-of-the-art model for the Persona-Chat and DailyDialog data sets is MRBD, but its performance is worse than that of RA-MASS on all metrics except Dist-1 and Dist-2 on the DailyDialog data set.

The architectures of MASS and BART are identical, while their training methods differ. Since BART is trained with more diverse denoising objectives than MASS, it achieves higher performance than MASS on many tasks. However, MASS shows higher performance than BART on this task. The lower performance of BART seems to originate from the characteristics of dialogue data such as word omission and the wide use of demonstrative pronouns. In addition, single-turn dialogues are usually short. As a result, some denoising objectives of BART rather act as an obstacle and degrade performance, which yields the lower performance of RA-BART compared to RA-MASS.

Case study

Table 5 shows two examples of how various response generators respond to a query. The queries are selected from the Reddit data set. For the second query, the generated responses are as follows:

MASS: i don't think you understand how public works.
BART: they aren't, they're just not the same thing.
RA-MASS: you're right, but it's still a good way to sell stock.
RA-BART: no, they're the only way to sell stock.

The graph in each example depicts the importance weights of query tokens by the teacher relevancy scorer (red), the student relevancy scorer (blue), the self-attention of vanilla MASS (cyan), and the self-attention of vanilla BART (brown). Both relevancy scorers assign similar weights to query tokens, which implies that the student scorer trained with Eq. (7) mimics its teacher scorer quite well. Therefore, it is safe to use only the student scorer at inference time.

According to these graphs, MASS and BART fail to understand the core meaning of the queries. In the first query, the teacher relevancy scorer regards "annoyed", "barbecue on", "father", and "grave" as the expressions carrying the core meaning.
However, MASS focuses on the self-attentive word "barbecue" but does not pay any attention to "grave". As a result, it generates an inappropriate response. BART generates a very general response due to its strange self-attention. On the other hand, RA-MASS and RA-BART generate a relevant and specific response to the query due to the student scorer. That is, they generate their responses using the expressions stressed by the student scorer. As a result, the response by RA-MASS contains the expressions of "pissed" and "bbq on my dad's grave", and that by RA-BART includes an expression of "bbq on their father's grave".
Similarly, in the second query, the graph indicates that the key expressions are "markets" and "way to sell stock". However, MASS focuses on "public" and generates an inappropriate response. BART does not capture the meaning of the query at all and also generates an inappropriate response. On the other hand, RA-MASS and RA-BART focus more on the expressions with high weights by the student scorer. As a result, they produce the expression of "way to sell stock" in their responses.

Error analysis
The errors by RA-MASS are analyzed for a better understanding of the limitations of the proposed response generator. For this, two hundred responses each by MASS and RA-MASS are randomly selected from the test set of the Reddit data. Two annotators were recruited to assign one of three labels (perfect, good, and bad) independently to every generated response, following the study of Wang et al. (2020). The inter-annotator agreement measured with the kappa statistic is 0.68, which indicates substantial agreement. The bad responses are further categorized as ungrammatical, irrelevant, illogical, or over-general. Figure 4 compares the responses by MASS and RA-MASS. Overall, the ratio of bad responses drops from 37% for MASS to 28% for RA-MASS, and accordingly the ratio of perfect responses increases for RA-MASS, which shows that RA-MASS produces better responses. Among the bad responses, ungrammatical errors account for the smallest portion for both MASS and RA-MASS thanks to the generation power of the pre-trained sequence-to-sequence models. On the other hand, RA-MASS shows quite different proportions of irrelevant and over-general errors from MASS: irrelevant responses take only 4% for RA-MASS while MASS produces 8%, and the ratio of over-general responses is reduced from 15% for MASS to 12% for RA-MASS. These results show that the proposed generator is effective in generating relevant and specific responses.
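The agreement statistic can be sketched as follows, assuming the reported kappa is Cohen's kappa computed over the two annotators' labels.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # observed agreement versus chance agreement derived from
    # each annotator's marginal label distribution
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```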

Implementation pros and cons
In addition to the performance improvement, the proposed relevancy scorer has two implementation advantages. First, it is simple to implement and independent of its base pre-trained model, BART or MASS; as a result, it can easily be plugged into a transformer encoder. Second, it retains the advantages of its base pre-trained model. On the other hand, it has the disadvantage that training takes a long time, mainly because the proposed training process consists of two steps.

Conclusions
This paper has proposed a novel response generator that explicitly reflects a prospective response in the query representation through a response-aware encoder. Contemplating the response allows the encoder to capture the core meaning of a query better. To do this, this paper proposed a relevancy scorer based on an attention mechanism. The relevancy scorer receives as input the query and response representations generated by the transformer encoder, calculates the relevancy score between the query and the response, and outputs the final query representation by applying the relevancy score to the query representation. As a result, the decoder employing the final query representation generates a more relevant and specific response.
The key problem of the proposed model is that no response is available at inference time. To tackle this problem, a teacher-student framework was employed for the relevancy scorer. That is, the teacher scorer was trained with pairs of a query and a response, while the student scorer was taught only with queries. The generator with the teacher scorer was trained to produce an output similar to the golden response, as ordinary response generators are, and the student scorer was additionally trained to imitate the output of the teacher scorer.
The experimental results on three benchmark data sets showed that it is effective to use a prospective response of a query in encoding the query. The proposed RA-MASS was evaluated with four metrics about relevancy and specificity. According to the experimental results, RA-MASS shows the best performance for all metrics except for diversity in the Reddit data set, and outperforms its baselines for all metrics in the Persona-Chat data set. That is, it was verified that RA-MASS generates the relevant and specific responses. It was also shown through the case study that the proposed response-aware encoder captures the important words in a query accurately and the student relevancy scorer imitates the teacher relevancy scorer well.
One advantage of the proposed model is that the response-aware encoder can be applied to any sequence-to-sequence response generator. Because it addresses the fundamental problem of generating irrelevant and general responses, its application is not limited to single-turn response generation, but extends to any response-related task such as commonsense knowledge-based response generation and multiple response generation.
Future work is twofold. One direction is to overcome a limitation of the proposed relevancy scorer caused by its placement outside the transformer encoder, which allows a golden response to be reflected only once in the query representation. Thus, a model that reflects a golden response at every layer of the transformer encoder will be studied. The other direction is to widen the applications of response-awareness. The idea of response-awareness is believed to be applicable to other sequence-to-sequence tasks such as question answering, text summarization, and machine translation.

Author Contributions
All authors contributed to the study conception and design. Material preparation and data collection were performed by So-Eon Kim, and data analysis was performed by Hyun-Je Song. The first draft of the manuscript was written by So-Eon Kim, and reviewing and editing of the manuscript were performed by Seong-Bae Park. All authors read and approved the final manuscript.

Data availability and materials

The datasets generated and/or analysed during the current study are available in the ACL2020-ConKADI repository, https://github.com/pku-sixing/ACL2020-ConKADI.