Chinese Medical Paraphrase Generation: Based on Neural Machine Translation

Background: As people increasingly turn to the internet for medical knowledge, intelligent medical question-answering systems based on question matching have attracted growing attention, especially in China. However, the lack of paraphrase corpora for medical questions has limited progress in this field. Objective: We propose a paraphrase generation method suited to the Chinese medical field and, for the first time, use deep learning models instead of manual evaluation. The method is designed to automatically construct high-quality Chinese medical paraphrases. Methods: Validation experiments were carried out on two Chinese paraphrase datasets (one general, one medical). Neural machine translation is used to generate paraphrases: a sentence is translated into another language and then translated back into the original language to obtain the corresponding paraphrase. BLEU and ROUGE are used as quantitative evaluation metrics. Three deep text matching models, rather than human annotators, are used to evaluate the generated paraphrases. Precision, Recall, F1 and AUC are used as qualitative evaluation metrics. Results: 49908 and 4062 paraphrases were generated on the two datasets, with generation efficiency of 97.03% and 98.38%, respectively. For both fields, the generated and original paraphrase pairs are very similar on the quantitative and qualitative evaluation metrics, especially in the medical field. Taking the medical data as an example, the BLEU scores of the generated and original paraphrase pairs are 0.556 and 0.626, respectively; the mean difference in AUC between the two groups was 0.015. Conclusions: We propose, for the first time, a paraphrase generation method based on neural machine translation and use deep text matching models instead of manual evaluation to assess the generated paraphrases. By analyzing the evaluation metrics, it can be concluded that the paraphrase generation method has reached or even exceeded the level of manual construction at the semantic level, especially in the medical field, and that deep text matching models can replace manual evaluation and enable automated paraphrase generation. This is of great significance to the development of Chinese medical paraphrase generation.


Introduction
With rising living standards, people are paying more and more attention to physical health and hope to obtain medical information conveniently online [1], especially during the COVID-19 epidemic. Well-known online medical websites in China include 'haoyisheng.com' and '120ask.com', while similar well-known websites in other countries include 'DailyStrength' and 'MD-Junction'. As time goes on, the question-answer records of these websites accumulate into big data, which is the product of wide public participation and contains a large number of real cases with high potential medical value [2]. With the constant growth of medical data, we are all confronted with the problem of how to find answers to the questions we have [3]. Meanwhile, the large number of users, many of whom ask similar if not identical questions, places a tremendous burden on doctors and makes timely replies nearly impossible [4]. Thus, it is essential to develop techniques that can efficiently address the problem of medical question answering.
Question answering (QA) is an application of natural language processing (NLP) that tries to fulfill that need and has been receiving a lot of attention since the late 1990s through evaluation campaigns such as TREC [5]. It is a specialized type of information retrieval that returns precise short answers to queries posed as natural language questions [6][7][8]. The relevance and trustworthiness of the returned answers are of utmost importance in QA systems, the latter especially in the clinical domain [3]. Meanwhile, QA systems are often sensitive to the way questions are asked [9]. Thus, QA systems based on question matching are receiving more and more attention; that is, systems that automatically select, from existing medical answer records, the answer whose question best matches a user's question. However, due to the lack of training data, the development of such systems has been greatly restricted. At present, large-scale similar-question data is lacking, especially in the Chinese medical field. This problem can be readily addressed by paraphrase generation [10].
Paraphrases are texts with the same meaning but different expressions. For example, 'can "bailing capsule" be taken for a long time' and 'can "bailing capsule" be taken for a long period' form a paraphrase sentence pair. Paraphrase generation is the task in which, given a sentence, the system creates paraphrases of it [11]. It is an important NLP task and can be a key technology in many applications, especially QA systems. Traditionally, paraphrase generation has been addressed with four kinds of methods: rule-based methods [12], thesaurus-based methods [13,14], grammar-based methods [15] and statistical machine translation (SMT)-based methods [16,17]. Recent advances in deep learning, in particular neural networks based on sequence-to-sequence (Seq2Seq) learning, have achieved remarkable success in various NLP tasks, including machine translation [18] and paraphrase generation [19,20]. Zichao Li et al. [11] proposed a new framework for paraphrase generation that consists of a generator and an evaluator, both of which are learned from data. Ankush Gupta et al. [10] proposed a method that combines deep generative models (VAE) with Seq2Seq models (LSTM) to generate paraphrases for a given input sentence.
These studies focused on building deep learning models for paraphrase generation and achieved good results. However, such models are supervised, which means that building them requires a large amount of training material (paraphrase data). In the Chinese medical field, paraphrase data that can be used for this purpose is lacking.
Instead of building a new generation model, we propose to use mature neural machine translation (NMT) for the paraphrase generation task. NMT is a Seq2Seq learning model for automated translation [18]. Compared with SMT, NMT has an overwhelming advantage: it not only scores better in manual evaluation but also reduces morphological, lexical and word order errors [21][22][23].
Meanwhile, NMT has demonstrated its performance in a real medical environment: Khoong et al. [24] assessed the usefulness of machine translation in helping patients understand discharge instructions.
Another challenge in paraphrase generation lies in the definition of the evaluation measure [11]. Traditionally, metrics such as ROUGE and BLEU have been used, but these can fail to capture semantic similarity. To quantify the aspects that are not addressed by automatic evaluation metrics, human evaluation becomes necessary. However, human evaluation is labor-intensive, and its results are easily affected by subjectivity. Hence, we propose to use deep text matching models as an alternative to human evaluation. In recent years, with the development of NLP, a variety of matching models based on neural networks have emerged and achieved good performance. The core of these models is similarity calculation, not only at the character level but also at the sentence level [25][26][27].
In this study, we propose to use neural machine translation (NMT) for the Chinese medical paraphrase generation task. The approach was verified on two Chinese paraphrase corpora, one general and one medical. BLEU and ROUGE metrics were used to evaluate the results of the approach. In addition, it is worth noting that we use deep text matching models, rather than humans, to evaluate similarity, which is a novel aspect of this work.

Approach Description
The core of the approach is machine translation (MT). As the name suggests, machine translation is a technique that uses computers to translate human languages automatically. MT models can be divided into two categories: statistical machine translation (SMT) and neural machine translation (NMT). NMT, which models the direct mapping between source and target languages with deep neural networks, has achieved a major breakthrough in translation performance and has even approached human-level translation quality, notably reaching parity on Chinese-to-English translation [28,29]. At present, there are many translators based on NMT; for example, Google Translate (GT) [30] delivers roughly a 60% reduction in translation errors on several popular language pairs [18].
Based on the above, we use NMT as the generative model for the Chinese medical paraphrase generation task. The procedure consists of two steps. In the first step, the NMT-based Google Translate (GT) translates the original Chinese sentence into an English interlanguage sentence. In the second step, GT translates the interlanguage sentence back into Chinese to obtain the paraphrase sentence (Figure 1).
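The two-step procedure above can be sketched in a few lines of Python. This is only an illustrative sketch: the `translate` callable, its `(text, source_lang, target_lang)` signature, the language codes and the toy lookup table are all assumptions introduced here, not the interface of Google Translate or the code used in this study. The Chinese sentences are made-up examples.

```python
from typing import Callable, List, Tuple

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
# In practice this would wrap a real NMT service; here it is an assumption.
Translator = Callable[[str, str, str], str]


def back_translate(sentence: str, translate: Translator,
                   source_lang: str = "zh", pivot_lang: str = "en") -> str:
    """Generate a paraphrase by round-trip translation through a pivot language."""
    # Step 1: original sentence -> interlanguage (pivot) sentence.
    pivot_sentence = translate(sentence, source_lang, pivot_lang)
    # Step 2: interlanguage sentence -> back to the original language.
    return translate(pivot_sentence, pivot_lang, source_lang)


def generate_paraphrases(sentences: List[str],
                         translate: Translator) -> List[Tuple[str, str]]:
    """Return (original, paraphrase) pairs for a list of sentences."""
    return [(s, back_translate(s, translate)) for s in sentences]


# Toy stand-in for an NMT system so the sketch runs offline (made-up data).
_TOY_TABLE = {
    ("百令胶囊可以长期服用吗", "zh", "en"): "Can Bailing capsules be taken for a long time",
    ("Can Bailing capsules be taken for a long time", "en", "zh"): "百令胶囊能长时间服用吗",
}


def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    return _TOY_TABLE[(text, source_lang, target_lang)]


pairs = generate_paraphrases(["百令胶囊可以长期服用吗"], toy_translate)
```

Passing the translator in as a parameter keeps the round-trip logic independent of any particular NMT service, so a different translator can be swapped in without changing the pipeline.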

Data Sources
We evaluate our approach on two datasets: CCKS2018_Task, a paraphrase dataset from a non-medical field, and Chinese_Covid, a dataset from the medical field.

CCKS2018_Task. This is the dataset of the 'CCKS2018 WeBank Intelligent Customer Question Matching Contest', a real-scenario sentence intention matching task organized by the Intelligent Computing Research Center of Harbin Institute of Technology (Shenzhen). The dataset consists of 100K lines of Chinese question paraphrase pairs. Each line is composed of a source sentence, a reference sentence and a label. The label represents the similarity of the pair and takes the value 0 or 1 (0 means dissimilar, 1 means similar).
Chinese_Covid. This is a Chinese medical dataset from 2020. The dataset consists of over 10K lines of question paraphrase pairs. Each line contains the IDs of the two questions in the pair, the full text of each question, and a binary value that indicates whether the two questions are truly similar. A value of 1 means the question pair is similar.
According to the label, the above data can be divided into similar and dissimilar sentence pairs: pairs with label 1 are denoted as positive question pairs, and pairs with label 0 as negative question pairs.
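As a small illustration of this split, the following sketch separates labelled pairs into positive and negative question pairs. The `(source, reference, label)` tuple layout and the sample rows are assumptions for illustration; the actual file formats of the two datasets are not reproduced here.

```python
from typing import List, Tuple

# Assumed row layout: (source sentence, reference sentence, label).
Pair = Tuple[str, str, int]


def split_by_label(pairs: List[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into positive (label 1) and negative (label 0) question pairs."""
    positive = [p for p in pairs if p[2] == 1]
    negative = [p for p in pairs if p[2] == 0]
    return positive, negative


# Made-up rows, only to show the shape of the data.
rows = [
    ("question A", "question A reworded", 1),
    ("question B", "an unrelated question", 0),
    ("question C", "question C reworded", 1),
]
positive, negative = split_by_label(rows)
```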

Evaluation
Quantitative Evaluation Metrics. For quantitative evaluation, we use two well-known automatic evaluation metrics: BLEU [31] and ROUGE [32]. Previous work has shown that these metrics perform well for the paraphrase recognition task [33] and correlate well with human judgments in evaluating generated paraphrases [34]. Both scores lie in the range 0 to 1 (or 0 to 100), and higher BLEU and ROUGE scores are better.
BLEU (Bilingual Evaluation Understudy) considers exact matches between the reference paraphrase(s) and the generated paraphrase(s) using the concepts of modified n-gram precision and brevity penalty. Its score ranges from 0 to 1; the closer the score is to 1, the higher the translation quality.
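$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
$$BP = \begin{cases} 1, & c > r \\ e^{1 - r/c}, & c \le r \end{cases}$$
In the formula, $p_n$ is the precision of the n-gram, $BP$ is the penalty factor, and $w_n$ is the weight of the n-gram, which is generally set to a uniform weight, that is, $w_n = 1/N$ for any $n$.
$c$: the length of the generated sentence. $r$: the length of the shortest reference.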
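To make the BLEU definition concrete, the following is a minimal sketch that computes modified n-gram precision with clipping, uniform weights 1/N and a brevity penalty based on the shortest reference, as described above. Treating each character as a token and defaulting to N = 4 are assumptions made here for illustration; this is not the evaluation script used in the study, and it applies no smoothing, so the score is 0 whenever some n-gram order has no match.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: Sequence[str], references: List[Sequence[str]],
         max_n: int = 4) -> float:
    """Unsmoothed BLEU with uniform weights w_n = 1 / max_n."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        total = sum(cand.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        if clipped == 0:
            return 0.0  # log(0) is undefined; no smoothing in this sketch
        log_precision_sum += (1.0 / max_n) * math.log(clipped / total)
    c = len(candidate)
    r = min(len(ref) for ref in references)  # shortest reference, as in the text
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_precision_sum)


# Character-level tokens (an assumption that avoids Chinese word segmentation).
score = bleu(list("abcd"), [list("abce")], max_n=2)
```

In the usage line, the unigram precision is 3/4 and the bigram precision is 2/3, so with N = 2 the score is the geometric mean of the two, and the brevity penalty is 1 because both sequences have length 4.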
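ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is mainly based on recall and includes ROUGE-N, ROUGE-L, etc.
$$\mathrm{ROUGE\text{-}N} = \frac{\sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{gram_n \in S} Count(gram_n)}$$
$$R_{lcs} = \frac{LCS(C, S)}{len(S)}$$
$$P_{lcs} = \frac{LCS(C, S)}{len(C)}$$
$$F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}} \tag{6}$$
ROUGE-N mainly counts the recall on N-grams. The denominator of the formula is the number of N-grams in the reference, and the numerator is the number of N-grams shared by the reference and the generated sentence. ROUGE-L is computed from the longest common subsequence of the generated sentence $C$ and the reference $S$; the L stands for longest common subsequence (LCS). In the formula, $R_{lcs}$ is the recall, $P_{lcs}$ is the precision, and $F_{lcs}$ is ROUGE-L.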
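The ROUGE-N and ROUGE-L definitions can be sketched as follows. ROUGE-N is computed as n-gram recall against a single reference, and ROUGE-L as the F-measure of LCS-based recall and precision. Character-level tokens and the default `beta = 1.0` are assumptions made here, since the text does not state the value of the weighting parameter; this is an illustrative sketch rather than the scoring code used in the study.

```python
from collections import Counter
from typing import Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: Sequence[str], reference: Sequence[str], n: int = 1) -> float:
    """Recall of reference n-grams that also occur in the candidate."""
    ref = ngram_counts(reference, n)
    cand = ngram_counts(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: Sequence[str], reference: Sequence[str],
            beta: float = 1.0) -> float:
    """F-measure combining LCS recall and LCS precision."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)


cand, ref = list("abcd"), list("abce")
r1 = rouge_n(cand, ref, n=1)   # 3 of 4 reference unigrams are matched
r2 = rouge_n(cand, ref, n=2)   # 2 of 3 reference bigrams are matched
rl = rouge_l(cand, ref)        # LCS "abc" has length 3
```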

Qualitative Evaluation Metrics.
To quantify the aspects that are not addressed by automatic evaluation metrics, human evaluation becomes necessary. However, human evaluation is labor-intensive, and its results are easily affected by subjectivity. Hence, we propose to use deep text matching models as an alternative. The main steps of this qualitative evaluation method are summarized in Figure 2.

Deep text matching model
In this paper, we used deep text matching models (K-NRM [35], MVLSTM [26] and Pyramid [27]) as an alternative to human evaluation of the similarity between the source sentence and the generated sentence.
The three models calculate semantic similarity in essentially the same way, which can be divided into three steps: word representation, feature extraction and multi-layer perceptron.
Word Representation. A computer cannot process a sentence directly; words first need to be represented as vectors or matrices. Usually, each word in the sentence is represented by a fixed-length word vector, which is called a word embedding, such as Word2Vec [36] or GloVe [37].
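The word representation step can be illustrated with a toy lookup table that maps each token to a fixed-length vector, turning a sentence into a matrix. This is only a sketch under stated assumptions: random vectors stand in for pre-trained embeddings such as Word2Vec or GloVe, the dimension of 4 is arbitrary, tokens are single characters, and unknown tokens are mapped to a zero vector. None of these choices are taken from the study.

```python
import random
from typing import Dict, List, Sequence


def build_embeddings(vocab: Sequence[str], dim: int = 4,
                     seed: int = 0) -> Dict[str, List[float]]:
    """Assign each vocabulary item a random fixed-length vector (toy stand-in)."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocab}


def embed_sentence(tokens: Sequence[str], table: Dict[str, List[float]],
                   dim: int = 4) -> List[List[float]]:
    """Represent a sentence as a (number of tokens) x dim matrix."""
    unknown = [0.0] * dim  # assumed handling of out-of-vocabulary tokens
    return [table.get(token, unknown) for token in tokens]


tokens = list("头痛怎么办")  # character-level tokens, a made-up example
table = build_embeddings(sorted(set(tokens)))
matrix = embed_sentence(tokens, table)
```

The resulting matrix has one row per token, which is the form of input that the later feature extraction step would operate on.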

Overview of generated sentences
For the method proposed in this article, we have verified it on the two datasets CCKS2018_Task  Table 2.
Chinese_Covid contains 10K question pairs, of which 4062 are positive samples (sentence similarity = 1). After the same process, 3996 valid paraphrase question pairs were finally constructed, an effective ratio of 98.38% (Table 3).
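The effective ratio reported for Chinese_Covid can be checked directly from the two counts given above, taking it to be the number of valid generated pairs divided by the number of positive samples. The function name is introduced here for illustration only.

```python
def effective_ratio(valid_pairs: int, positive_pairs: int) -> float:
    """Percentage of positive samples that yielded a valid paraphrase pair."""
    if positive_pairs <= 0:
        raise ValueError("positive_pairs must be a positive integer")
    return 100.0 * valid_pairs / positive_pairs


ratio = effective_ratio(3996, 4062)
print(f"{ratio:.2f}%")  # prints 98.38%
```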

Evaluation of the Performance
We conducted experiments on the aforementioned datasets and report the qualitative and quantitative results of our method.

Conclusions
In this paper, we propose, for the first time, a method of Chinese medical paraphrase generation based on neural machine translation and use deep text matching models instead of manual evaluation to assess the generated paraphrases. Validation experiments were carried out on two Chinese paraphrase datasets. By analyzing evaluation indicators such as BLEU, ROUGE and AUC, it can be concluded that the paraphrase generation method has reached or even exceeded the level of manual construction at the semantic level, especially in the medical field, and that deep text matching models can replace manual evaluation and enable automated paraphrase corpus construction. This is of great significance to the development of this field. The method can quickly and automatically construct a high-quality medical paraphrase corpus, which helps promote the development of intelligent medical question-answering systems based on question matching.

Discussion
In this study, we propose a medical paraphrase generation method based on NMT and verify its performance on Chinese test sets. In view of the shortcomings of current research on paraphrase generation methods, this study mainly makes the following contributions: Using mature NMT for paraphrase generation without training data. At present, most paraphrase generation models are end-to-end models based on deep learning. Building such a model requires a large amount of high-quality paraphrase corpus for training. In the Chinese medical field, such data is lacking. In view of this situation, we propose for the first time to use NMT (GT in this study) as a substitute for paraphrase generation models. There are also some limitations in this study, such as the use of GT as the only neural machine translator and the inaccurate translation of medical terms. In future work, we will use other neural machine translators.

Figure 1 Illustration of paraphrase generation based on neural machine translation (NMT). Neural machine translation: a Seq2Seq model following an encoder-decoder framework that usually includes two neural networks.

Figure 2 The main steps of the qualitative evaluation method.

Figure 3 Illustration of the similarity calculation.

Figure 4 ROC curves of CCKS2018_Task and Chinese_Covid. ROC: receiver operating characteristic curve.