QAAskeR+: a novel testing method for question answering software via asking recursive questions

Question Answering (QA) is an attractive and challenging area in the NLP community. With the development of QA techniques, plenty of QA software has been applied in daily human life to provide convenient access to information. To investigate the performance of QA software, many benchmark datasets have been constructed to provide various test cases. However, current QA software is mainly tested in a reference-based paradigm, in which the expected outputs (labels) of test cases must be annotated with much human effort before testing. As a result, neither the just-in-time test during usage nor the extensible test on massive unlabeled real-life data is feasible, which keeps the current testing of QA software from being flexible and sufficient. In this work, we propose a novel testing method, QAAskeR+, with five new Metamorphic Relations for QA software. QAAskeR+ does not refer to the annotated labels of test cases. Instead, based on the idea that a correct answer should imply a piece of reliable knowledge that always conforms with any other correct answer, QAAskeR+ tests QA software by inspecting its behaviors on multiple recursively asked questions that are relevant to the same or some further enriched knowledge. Experimental results show that QAAskeR+ can reveal quite a few violations that indicate actual answering issues on various mainstream QA software without using any pre-annotated labels.


Introduction
With the booming development of Natural Language Processing (NLP) techniques, machines have become able to handle many tasks. Among them, Question Answering (QA) is an attractive but very challenging task that requires the machine to understand human language and infer information from it as humans do (Nguyen et al. 2016; Zhang et al. 2020). As illustrated in Tables 2 and 3, respectively, given a question, QA software comprehends the relevant information from a huge knowledge base or one lengthy textual passage and returns the deduced answer. QA software is now widely used in daily human life to provide convenient access to information. For instance, many intelligent devices are equipped with a virtual assistant, such as Siri from Apple and DuerOS from Baidu, which can provide the QA service.
Recently, many algorithms have been proposed to improve the performance of QA software. At the same time, various benchmark datasets with different topics and task formats have been constructed to evaluate how well machines can answer textual questions by referring to the information stored in knowledge bases (Berant et al. 2013; Yih et al. 2016; Trivedi et al. 2017; Bao et al. 2016) or implied in textual materials (Rajpurkar et al. 2018; Clark et al. 2019; Kwiatkowski et al. 2019; Rajpurkar et al. 2016). Nevertheless, the testing methods for QA software are still primitive and limited. Specifically, current QA testing practices mainly adopt the reference-based paradigm. When performing a reference-based test, the researchers or engineers first have to manually annotate the labels (correct answers) for the test cases (assigned questions), which requires much human effort (Clark et al. 2019; He et al. 2018). Afterwards, QA software is tested by comparing its outputs with the annotated labels. As a result, these testing practices necessarily rely on the existing well-annotated datasets.
However, the reference-based test paradigm has some limitations due to its reliance on the pre-annotated labels. First, it cannot support the "just-in-time test" for QA software, which requires an immediate issue detection on the returned answers to unlabeled questions. This kind of testing is in fact unavoidable and necessary in daily usage. Let us consider the common usage scenario of QA software, where the user inputs a question for which she is seeking the answer. After getting an answer from QA software, she needs to make a quick decision on whether to trust this answer or not, without any pre-annotated labels. This process can be seen as one test execution followed by an immediate issue detection on which the decision is based, but the current reference-based test paradigm is not designed to support such a process. In real-life usage, users very commonly just assume that the QA software has passed this test and returned a reliable answer, because they have barely any clue about the correctness of the answer. However, since the reliability of QA software is not always guaranteed because of the complexity and intractability of neural networks, it can be very risky to trust the outputs without any inspection. Secondly, the reference-based test can only be performed on the existing well-annotated benchmarks, which may confine the test sufficiency on
This paper extends our preliminary work (Chen et al. 2021a), which was presented as a research paper at the ASE 2021 conference. In particular, this paper enriches the preliminary work in the following aspects:
• We extend our methodology and formulate a new method called QAAskeR+. Besides recursively raising new questions strictly based on the source input and output, QAAskeR+ further excavates some additional facts relevant to the source output to enrich the synthesized knowledge. This helps us obtain more diverse and complicated new questions as follow-up inputs, which may contribute to a better fault detection ability. Particularly, we design and implement two new Metamorphic Relations (MRs) in QAAskeR+ based on this idea.
• We briefly introduce the taxonomy of QA software, where two mainstream categories of QA software are reviewed. According to this taxonomy, we add three new test objects to evaluate the effectiveness of our method on diverse QA software more comprehensively. Two of them belong to a new category that was not considered in our preliminary work.
• We perform comprehensive experiments to evaluate the effectiveness of our new method on the above test objects. We found that the new MRs have a higher violation detection rate than the existing MRs. We also reveal many new issues and summarize the shortcomings of the added test objects. Promising results on all test objects confirm the usefulness of QAAskeR+ on mainstream QA software.
• We introduce a new research question to compare the issue-detecting ability of our recursive MRs, which are built on the novel idea of recursive asking, with two representative non-recursive MRs that generate semantically equivalent questions as follow-up inputs via synonym replacement and back translation. Results demonstrate the superior fault detection effectiveness of our MRs.
• We try one new repair strategy, which exploits recursive asking based on the knowledge inferred from the actual test output, to fix the test objects against the revealed answering issues. Results demonstrate that this method also achieves a fairly good repair effect.
In summary, the contributions of this work, which are a superset of the contributions in our preliminary work, are as follows:
• We propose a method named QAAskeR+ to test Question Answering software via Recursively Asking multiple questions relevant to the same or some further enriched knowledge. It removes the dependency on the manually annotated ground truth labels of test cases and therefore enables both the flexible just-in-time test during usage and the extensible test with massive real-life unlabeled data for QA software.
• We design and implement five novel Metamorphic Relations in QAAskeR+. They generate the new question (follow-up input) on the basis of the knowledge synthesized from the existing question (source input) and the answer QA software returns for this question (source output). Two of them further enrich the knowledge with some correct information about the source output. The generated questions are of distinct types (e.g., general questions and wh-questions) or ask for different objects in the knowledge. The implementation of these MRs has been released.
• We carry out comprehensive experiments to evaluate the effectiveness of QAAskeR+. Results demonstrate that QAAskeR+ can successfully reveal many valid violations on various QA software. Moreover, we found that our MRs have better fault detection effectiveness than the MRs that generate follow-up questions via synonym replacement and back translation. We have also identified a few actual answering issues according to the revealed violations and designed methods to fix the revealed issues based on the proposed MRs, which may inspire developers to further improve the tested QA software.
• We demonstrate and discuss the usage of QAAskeR+ on real-life QA software through an initial trial on the Google Search service.
The tool, replication package, and specific implementation details for this paper are available online at https://github.com/imjinshuo/QAAskeR-plus. The rest of this paper is structured as follows. Section 2 states the motivation of this work. Section 3 introduces QA software and Metamorphic Testing, which are the test object and the basis of our method, respectively. Section 4 elaborates the details of QAAskeR+, with five novel MRs proposed. Afterwards, Sect. 5 and Sect. 6 describe the settings and the results of the evaluation on QAAskeR+, respectively. Next, Sect. 7 discusses the real-life usage of QAAskeR+. Section 8 presents the threats to validity and Sect. 9 lists the related works. Finally, Sect. 10 draws a conclusion and lists our future work.

Motivation
As introduced above, QA software has been widely used in daily human life, so there is an urgent demand to assure the quality of its returned answers and reveal its undisclosed defects. But currently, almost all NLP models, including the core models in QA software, are mainly tested in the reference-based paradigm (Ribeiro et al. 2020). As explained in Sect. 1, under this test paradigm, the testers must first obtain a well-annotated benchmark dataset, which means that manually annotated reference answers are mandatory when testing QA software. Once presented with the output answers to unannotated questions, current testing methods cannot automatically decide whether there is any problem in the output answers.
In fact, such a decision is unavoidable, and supporting it is highly necessary. Let us first consider a real-life story based on the QA task shown in Table 1. Suppose that Tom is a junior entertainment editor. One day, Tom is asked to collect the information about all the recipients of the Academy Award for Best Actress over the years. For convenience, Tom uses a piece of QA software to retrieve the relevant information. For instance, to retrieve the recipient of the Academy Award for Best Actress in 1963, he inputs a question "Which actress won the Academy Award for Best Actress in 1963?" (Question 1). After a few seconds, he receives the answer "Anna Magnani" (Answer 1). He collects all the results that he needs in this way. Without off-the-shelf ground truth labels to verify the obtained answers, Tom chooses to trust them and reports them directly to his leader. In the end, he is severely reprimanded because there are quite a few mistakes in his report, including wrongly taking Anna Magnani as the recipient of the 1963 Academy Award for Best Actress. Such an experience brings negative outcomes to Tom and greatly dampens his trust in the QA software. But if we consider Tom's question a test case, it is indeed not easy to verify the output answer automatically and quickly, since no ground truth label is available. Therefore, to avoid such awkward experiences and even more serious issues in other critical application domains, a new testing method that does not require the label is desired to support the immediate issue detection on the returned answers to the unlabeled questions for QA software.
Besides, such a method is also necessary for more comprehensive tests of QA software, even though a few benchmark datasets exist. As mentioned in Sect. 1, the benchmarks are found to be imperfect. Specifically, many existing benchmark datasets are found to be biased in topics and task formats and therefore may hinder the understanding of real-life performance (Gardner et al. 2020; Ribeiro et al. 2020). As a result, relying solely on the existing finite benchmarks is far from sufficient to test QA software. Meanwhile, it can also be very expensive to construct new well-annotated benchmarks because it requires much human effort to annotate the correct answers for the new questions in them (Clark et al. 2019; He et al. 2018). In addition, manual annotation can also introduce errors (Northcutt et al. 2021) and thus hurt the accuracy of the test results.
Therefore, in this work, we aim to propose a testing method that meets these requirements. The proposed method should not depend on the annotated ground truth labels, so that it can provide a just-in-time test with efficient and effective issue detection and leverage massive unlabeled data to perform extensible and abundant tests for QA software.

Question answering software
Question Answering (QA) has been a hot research topic for a long time. It is omnipresent in various domains in our daily life, such as virtual assistants (Nguyen et al. 2016), E-commerce services (Gupta et al. 2019), and healthcare (Jin et al. 2019; Suster and Daelemans 2018a). According to the form of the information source for deducing answers, QA software can first be divided into two categories, namely the ones using information in the knowledge base (KBQA) and the ones depending on information in raw textual materials (TBQA). Judging from whether each question is equipped with a specified knowledge base or textual material to deduce the answer, each of the two categories can be further divided into the closed-world ones and the open-world ones (Gupta et al. 2019). In this work, we apply our method to test several representative QA software in these categories. In the following, we introduce the details of these categories of QA software, especially the ones we use as the test objects in this work.

QA Software based on information in knowledge base
With the development of knowledge base techniques such as the knowledge graph, the information about various objects can be stored in the knowledge base in a structured form. In such structured information, one object is represented as an entity and its properties are indicated by this entity's relations to other entities. In this way, large-scale information can be properly stored. At the same time, such structured information can be easily processed by machines. As a result, people are inspired to design QA software based on the knowledge base (KBQA software) to automatically answer the textual questions raised by users according to the abundant information in the knowledge base. Table 2 shows an example of a KBQA task. Owing to the large capacity of knowledge bases, communities have constructed several general knowledge bases, such as Freebase (Bollacker et al. 2008), DBPedia (Lehmann et al. 2015), and Wikidata (Tanon et al. 2016), to hold a great amount of universal knowledge on all aspects. Existing KBQA studies mainly focus on providing QA services based on these general knowledge bases. Since the information in general knowledge bases is not tailored to answering any specific question, KBQA is considered to be mainly performed in the open-world manner, which means that the KBQA software requires the question as the only input and automatically retrieves relevant information from one unified general knowledge base to answer all questions (Chen et al. 2017; Han et al. 2020).
Table 2 An example of a KBQA task. The information "entity1".propertyName = {"entity2", "entity3"} indicates that the value of a property named propertyName of entity "entity1" contains two entities, i.e., "entity2" and "entity3". The entities denoted as "x.xxxxxx" are the virtual compound entities synthesized by the knowledge base to represent some abstract objects.
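The entity and property notation used for Table 2 can be illustrated with a small in-memory structure. This is only a sketch under stated assumptions: the entity and property names are the placeholders from the table caption plus a made-up `otherProperty`, and the helper names `lookup` and `two_hop` are hypothetical rather than part of any real knowledge base API.

```python
from typing import Dict, Set

# Toy knowledge base: entity -> property name -> set of related entities.
# All names are placeholders, not entries from Freebase, DBPedia, or Wikidata.
KnowledgeBase = Dict[str, Dict[str, Set[str]]]

kb: KnowledgeBase = {
    "entity1": {"propertyName": {"entity2", "entity3"}},
    "entity2": {"otherProperty": {"entity4"}},
}


def lookup(kb: KnowledgeBase, entity: str, prop: str) -> Set[str]:
    """Return the entities stored under entity.prop (empty set if absent)."""
    return kb.get(entity, {}).get(prop, set())


def two_hop(kb: KnowledgeBase, entity: str, first: str, second: str) -> Set[str]:
    """Follow two relations in sequence, as a multi-hop question would require."""
    result: Set[str] = set()
    for middle in lookup(kb, entity, first):
        result |= lookup(kb, middle, second)
    return result
```

An open-world KBQA system has to find such paths itself from the question text alone; the helpers above only show what the stored information looks like once the relevant entities and relations are known.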
Considering that open-world KBQA is the mainstream KBQA manner and that many relevant methods and datasets are available, we mainly focus on open-world KBQA software in this work. The two main categories of methods to implement open-world KBQA software are commonly known as the semantic parsing-based ones and the information retrieval-based ones. The semantic parsing-based methods first parse a question into a query in logic form and then execute it over the knowledge base to find the answers. When generating the search query, such methods need to locate the accurate search space effectively. To tackle this problem, one state-of-the-art method of this kind, proposed by Lan and Jiang (2020), uses a staged approach to generate the necessary query graphs effectively. The other mainstream category of KBQA methods is based on information retrieval. Given a question, these methods first retrieve a question-specific sub-graph from the knowledge base based on the identified query intent of the question and then apply ranking algorithms to select the most appropriate entities as the answer. A major challenge for the retrieval-based methods is the lack of supervision signals at intermediate steps. To address this challenge, He et al. (2021a) propose a method called NSM+h that adopts the teacher-student learning framework to reach state-of-the-art performance. NSM+h contains a student network focusing on the KBQA task itself and a teacher network learning to provide supervision signals for improving the reasoning ability of the student network. In this work, we use these two state-of-the-art KBQA methods to build corresponding KBQA test objects as representatives.

QA Software based on information in textual materials
Though the structured information in a knowledge base can be easily understood and processed by the machine, it has inherent limitations, such as incompleteness and fixed schemas. It is also costly to construct, maintain, and update the knowledge base. As a result, recent studies focus on another QA manner, namely TBQA, which directly extracts information from unstructured textual materials, such as the textual passages in Wikipedia (Chen et al. 2017). Such QA software can effectively leverage the information in any textual materials that humans read and is thus much more flexible and extensible. Table 3 gives an example of a TBQA task. But compared to the structured information in a knowledge base, the information in textual materials is considerably harder to comprehend and extract. The comprehension of textual materials is a challenging key problem in TBQA and has attracted much work on improving its performance.
Since extracting information from textual materials is non-trivial, most of the existing TBQA studies focus on the relatively simpler closed-world manner. In this manner, a specific textual reference passage is attached to the given input question, and TBQA software should answer the question based on the attached passage. Considering that many relevant algorithms and datasets are available, we mainly target closed-world TBQA software in this paper. In fact, closed-world TBQA still plays the core role in open-world TBQA systems, which automatically retrieve relevant textual materials from the web (Chen et al. 2017). Therefore, our method can also work on open-world TBQA software. We briefly discuss such usages in Sect. 7.
Closed-world TBQA tasks can be solved with various methods. For example, we can simply fine-tune pre-trained language models, such as RoBERTa and T5 (Raffel et al. 2020), to obtain relatively good performance on several closed-world TBQA task formats like span extraction and boolean judgment. Earlier studies mainly solve the TBQA tasks of different formats using distinct models, while recent works move to building unified and effective methods for general format-agnostic TBQA systems. Khashabi et al. (2020) pioneered a method, UnifiedQA, which builds a single model to solve closed-world TBQA tasks in distinct formats. UnifiedQA first trains a text-to-text model on some seed QA datasets of multiple task formats, during which the textual materials and the questions of different task formats are taken as input without using format-specific prefixes. Users should then fine-tune this pre-trained model into specialized models for better performance on specific QA tasks. The UnifiedQA model has shown promising performance on par with or better than the format-agnostic models on many benchmark datasets. Afterwards, Tafjord and Clark (2021) propose an improved version of UnifiedQA, named Macaw, to realize versatile and zero-shot closed-world TBQA. Zero-shot means that the Macaw model can deliver fairly good performance on various datasets without being further fine-tuned on any target corpus. Macaw also first trains the model on several seed QA corpora, but it applies more training tasks, such as teaching the model to raise questions for specified answers. Besides, Macaw further fine-tunes the model on several science questions to improve its zero-shot ability. Some officially trained Macaw models are publicly released to the community as high-quality off-the-shelf TBQA software. In this work, we adopt these two state-of-the-art methods to prepare representative TBQA test objects.

Metamorphic testing
To remove the dependency on annotated labels in testing QA software, we design our testing method based on the idea of Metamorphic Testing. Metamorphic Testing (MT) is a suitable candidate solution to bypass the labels of test cases, since it was proposed to reuse the passed test cases and alleviate the oracle problem during software testing (Chen et al. 1998, 2018). MT does not require any inspection of the correctness of each individual output. Instead, it checks whether multiple outputs satisfy the specified relations, namely the Metamorphic Relations (MRs). One well-known subject of MT is the sin function. Verifying the correctness of sin(x) given an arbitrary x is very expensive. In other words, we encounter the oracle problem when testing the sin function. But checking the relation sin(x) = −sin(−x) is straightforward. In this example, sin(x) = −sin(−x) is called the Metamorphic Relation (MR), which can also be rephrased as: if x (the source test input) is negated to −x (the follow-up test input), their outputs are also opposite to each other. MT has been used to test various software and systems, such as supervised classifiers (Xie et al. 2011) and unsupervised clustering systems (Xie et al. 2020). Zhou et al. (2016) adopt MT to perform a system/service level validation for search engines, where they mainly construct the follow-up inputs by considering the information in the source outputs as additional query restrictions. Recently, MT has also been widely used for testing many deep learning applications, such as autonomous driving systems (Tian et al. 2018; Zhou and Sun 2019; Wang and Su 2020) and language translation services (He et al. 2020; Gupta et al. 2020; He et al. 2021b; Yan et al. 2019; Sun et al. 2020, 2022).
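The sin example can be written as a short executable check. The sketch below is illustrative only: `buggy_sin` is a hypothetical faulty implementation introduced to show a violation, and the tolerance value is an assumption rather than something prescribed by the MT literature.

```python
import math
import random
from typing import Callable, Iterable, List


def mr_violations(f: Callable[[float], float],
                  inputs: Iterable[float],
                  tol: float = 1e-9) -> List[float]:
    """Return the source inputs x for which f(x) = -f(-x) does not hold.

    No expected value of f(x) is needed: only the relation between the
    source output f(x) and the follow-up output f(-x) is checked.
    """
    return [x for x in inputs
            if not math.isclose(f(x), -f(-x), abs_tol=tol)]


def buggy_sin(x: float) -> float:
    """Hypothetical faulty implementation: a constant offset breaks the odd symmetry."""
    return math.sin(x) + 0.01


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.uniform(-100.0, 100.0) for _ in range(1000)]

print(len(mr_violations(math.sin, samples)))   # violations found for math.sin
print(len(mr_violations(buggy_sin, samples)))  # violations found for buggy_sin
```

A reported violation says that at least one of the two outputs is wrong without saying which one, which is the same kind of evidence the recursive questions provide for QA software.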

A recursive metamorphic testing method for QA software
Let us revisit the motivating example in Table 1. Suppose that Jack and Merry are two other senior entertainment editors. They also ask the same question (Question 1) as Tom does and get the same answer (Answer 1) from QA software. They do not have the ground truth label to verify this output answer either. But unlike Tom, on seeing this answer, Jack further asks the QA software one new question "In which years did Anna Magnani win the Academy Award for Best Actress?" (Question 2). For this question, the QA software replies "1955" (Answer 2). At the same time, by further considering a relevant fact that Anna Magnani is from Italy, Merry also asks a new question, "In which years did the actress from Italy win the Academy Award for Best Actress?" (Question 3). And she obtains the answer "1955 and 1961" (Answer 3). By comparing Answer 1 and Answer 2, Jack is then confused. And Merry also senses something wrong when comparing Answer 1 and Answer 3. This is because "1963" is not included in either follow-up answer, as they would have expected. But the good thing is: even if Jack and Merry may not be clear about which answer is wrong, they have found clues that the QA software is not reliable and that at least one of the returned answers is incorrect. What Jack and Merry do in the above example illustrates the basic idea of the method we propose in this paper.
To break the dependency on annotated labels in testing QA software, we propose a method named QAAskeR+. Its core idea is to test QA software via Recursively Asking multiple relevant questions and checking the relation among the output answers for these questions. The input of QAAskeR+ is the QA software under test (SUT) and a list of unlabeled questions; no manually annotated ground truth labels for the questions are required. The output of QAAskeR+ is a list of the revealed suspicious issues that the tested QA software has made.
By leveraging Metamorphic Testing (MT), QAAskeR+ tests the SUT by checking whether its multiple outputs violate the expected Metamorphic Relations (MRs) instead of comparing each individual output with the ground truth label. The MRs in QAAskeR+ are based on the idea that a correct answer should imply a piece of reliable knowledge that always conforms with any other correct answer.
More specifically, given an input question $q$ (known as the "source input"), let us denote the answer that the SUT replies for $q$ as $a$ (known as the "source output"). Then, a piece of knowledge $k$ can be synthesized based on $q$ and $a$. Taking Question 1 and Answer 1 in Table 1 as the example, by synthesizing Question 1 and Answer 1, QAAskeR+ can obtain a piece of knowledge "Anna Magnani won the Academy Award for Best Actress in 1963". Obviously, if the answer $a$ is correct, then the knowledge $k$ is a true fact that should always hold. Thereby, on the basis of this synthesized knowledge $k$, QAAskeR+ next recursively raises one new question $q'$ (known as a follow-up input) relevant to $k$. Let us denote the answer that the SUT replies for $q'$ as $a'$ (known as the corresponding follow-up output). As mentioned above, if $a'$ is also correct, it should conform with the knowledge $k$ with regard to $q'$. Otherwise, a violation is revealed to indicate an answering error, because at least one of $a$ and $a'$ must be wrong.
Just like what Jack and Merry have respectively done in the above example, QAAskeR+ raises the new question in two ways. Following Jack's operation, QAAskeR+ can directly raise a new question based on $k$. In the above example, Question 2 "In which years did Anna Magnani win the Academy Award for Best Actress?" can be considered as directly constructed based on $k$. Answer 2 "1955" clearly does not conform with $k$ with regard to Question 2, so one answering error is revealed. Besides, QAAskeR+ can also follow Merry's operation and further enrich $k$ with a piece of correct information (also considered as one fact) about the source output before raising the new question. In the above example, the information "Anna Magnani is from Italy" is adopted to enrich $k$, which yields the enriched knowledge $k^{+}$ "An actress from Italy won the Academy Award for Best Actress in 1963". Then, Question 3 "In which years did the actress from Italy win the Academy Award for Best Actress?" can be considered as constructed based on $k^{+}$. Similarly, Answer 3 "1955 and 1961" conforms with neither $k^{+}$ nor $k$ with regard to Question 3, so an answering error is revealed.
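The Table 1 walk-through can be replayed in a few lines of code. The following is a minimal sketch, not the released tool: the SUT is replaced by a lookup table holding the three question and answer pairs quoted above, the follow-up questions come from a hand-written template (the actual method generates them automatically), and `FACTS`, `TEMPLATE`, and `follow_up_violations` are hypothetical names.

```python
from typing import Callable, Dict

SOURCE_Q = "Which actress won the Academy Award for Best Actress in 1963?"

# Canned answers copied from the Table 1 example; a real SUT would be a QA
# model or service rather than a lookup table.
CANNED_ANSWERS: Dict[str, str] = {
    SOURCE_Q: "Anna Magnani",
    "In which years did Anna Magnani win the Academy Award for Best Actress?": "1955",
    "In which years did the actress from Italy win the Academy Award for Best Actress?": "1955 and 1961",
}


def toy_sut(question: str) -> str:
    return CANNED_ANSWERS.get(question, "<NoAnswer>")


# Hypothetical fact store used to enrich the knowledge about a source output.
FACTS: Dict[str, str] = {"Anna Magnani": "the actress from Italy"}

# Hand-written follow-up template for this one example (an assumption).
TEMPLATE = "In which years did {subject} win the Academy Award for Best Actress?"


def follow_up_violations(sut: Callable[[str], str],
                         source_question: str,
                         target_answer: str) -> Dict[str, bool]:
    """Ask the direct (k) and enriched (k+) follow-ups; True marks a violation."""
    source_answer = sut(source_question)
    follow_ups = {"direct (k)": TEMPLATE.format(subject=source_answer)}
    if source_answer in FACTS:
        follow_ups["enriched (k+)"] = TEMPLATE.format(subject=FACTS[source_answer])
    # The target answer "1963" is the part of k that the follow-up asks for.
    return {name: target_answer not in sut(q) for name, q in follow_ups.items()}


report = follow_up_violations(toy_sut, SOURCE_Q, "1963")
print(report)
```

Both follow-ups are flagged here because neither "1955" nor "1955 and 1961" contains "1963"; no ground truth label for Question 1 is consulted at any point.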
We can see that the above process does not require the label of $q$. Instead, our recursive MT method QAAskeR+ automatically formulates oracles to inspect the test outputs. As a result, the method effectively breaks the dependency on manually annotated labels during testing and therefore provides a possible solution to testing QA software on unlabeled data.
Based on the above idea of recursive asking, we design five novel MRs in QAAskeR+ by considering the consistency among the input question and output answer pairs related to the same or some further enriched knowledge, where the questions are of different types (i.e., general questions, alternative questions, and wh-questions) or ask for different objects in the knowledge. Among the proposed MRs, three directly generate a new question based on the synthesized knowledge and the other two raise a new question based on a piece of further enriched knowledge. QAAskeR+ realizes these MRs with three components, namely the synthesis of a knowledge declaration from the given question and answer, the generation of a question from the given knowledge, and the violation measurement on the source and follow-up cases. In the following sections, we elaborate the design of the MRs and the components in detail.
Note that the core idea of our method, i.e., keeping the knowledge consistent, is independent of the QA software under test. Therefore, our method can be generalized to test various QA software. In this work, we have adopted it to test two major categories of QA software, i.e., KBQA and TBQA software. The methodology is the same when applying the idea to these categories; there are only a few differences in the implementations of information extraction and question generation, which will be introduced in Sects. 4.3.4, 4.4.2, and 6.1. We believe these adaptations are also what readers will need most when applying this idea to testing new QA software or a new category of QA software.

Proposed metamorphic relations
As mentioned above, in this paper, we propose five MRs based on the idea of recursive asking (we call them "recursive MRs"). They test the QA software by checking its behaviors on multiple recursively asked questions that are relevant to the same or some further enriched knowledge. In this section, we introduce the overall idea of each MR. The specific implementation of these MRs is introduced in detail in the following sections. To simplify the presentation, we denote the source and follow-up input questions as $q_{\mathrm{TYPE}}$ and $q'_{\mathrm{TYPE}}$, respectively. The source and follow-up outputs of the SUT are denoted as $a_{\mathrm{TYPE}}$ and $a'_{\mathrm{TYPE}}$ in the same way. TYPE takes one value from {WH, GEN, ALT}, as explained in Table 4.
We first introduce the three MRs that generate the new questions directly based on the synthesized knowledge.

MR1: Answering a new follow-up wh-question that is raised based on the knowledge regarding one existing source wh-question and the model's source output answer to it.
This MR is eligible for the test inputs with a wh-question on which the SUT's output is not "<NoAnswer>". As shown in Fig. 1a, given a wh-question $q_{\mathrm{WH}}$, we first obtain $a_{\mathrm{WH}}$ from the SUT. Then, a declarative sentence $k$ (i.e., the knowledge) is synthesized from $q_{\mathrm{WH}}$ and $a_{\mathrm{WH}}$ with the declarative sentence synthesis (DSS) module. After that, we leverage the question sentence generation (QSG) module to generate the new wh-question and the corresponding target answer based on $k$. As there may be more than one wh-question available for $k$, we randomly pick one of them as $q'_{\mathrm{WH}}$ and denote its target answer as $a^{t}_{\mathrm{WH}}$. Next, we run the SUT with $q'_{\mathrm{WH}}$ to obtain $a'_{\mathrm{WH}}$ and perform the output checking. If the SUT is correct, both $a_{\mathrm{WH}}$ and $a'_{\mathrm{WH}}$ should be correct, and the knowledge $k$ is a true fact. Since $a^{t}_{\mathrm{WH}}$ is deduced from $k$, we expect $a^{t}_{\mathrm{WH}}$ to be part of $a'_{\mathrm{WH}}$. If $a^{t}_{\mathrm{WH}}$ does not exist in $a'_{\mathrm{WH}}$, at least one of $a_{\mathrm{WH}}$ and $a'_{\mathrm{WH}}$ is wrong.
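The output check of MR1 can be sketched as a containment test between the target answer and the follow-up output. The version below is an assumption for illustration: it uses exact matching on lowercased word tokens, whereas the violation measurement described later in the paper may be more tolerant, and the function names are hypothetical.

```python
import re
from typing import List

NO_ANSWER = "<NoAnswer>"


def normalize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_target(target_answer: str, follow_up_answer: str) -> bool:
    """True if the target's tokens occur contiguously in the follow-up answer."""
    target, answer = normalize(target_answer), normalize(follow_up_answer)
    if not target:
        return False
    return any(answer[i:i + len(target)] == target
               for i in range(len(answer) - len(target) + 1))


def mr1_violated(source_answer: str, target_answer: str, follow_up_answer: str) -> bool:
    """MR1 check: the target answer deduced from k must be part of the follow-up output."""
    if source_answer == NO_ANSWER:
        raise ValueError("MR1 is not eligible when the source output is <NoAnswer>")
    return not contains_target(target_answer, follow_up_answer)
```

Matching on whole tokens rather than raw substrings avoids accepting "19630" as containing "1963", at the cost of rejecting paraphrased but correct answers.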

MR2: Answering a new follow-up general question that is raised based on the knowledge regarding one existing source wh-question and the model's source output answer to it.
This MR is also eligible for the test cases whose question is a wh-question with a non-"<NoAnswer>" SUT output. Figure 1b shows the overall process of this MR. Similar to the operation in MR1, we first synthesize the declarative sentence k from q WH and a WH with DSS. Next, we use QSG to generate a new general question q ′ GEN and the corresponding expected target answer a t GEN , based on k. Since q ′ GEN asks whether k holds, a t GEN should be "Yes". We then run the SUT with q ′ GEN to obtain a ′ GEN and perform the output checking. If the SUT is correct, both a WH and a ′ GEN should be correct, and k should be a true fact accordingly. As a result, a ′ GEN should be consistent with a t GEN , that is, "Yes" (or other sentences that express an affirmation). Otherwise, an issue is found because there must be at least one error in a WH and a ′ GEN .

MR3: Answering a new follow-up wh-question that is generated based on the knowledge regarding one existing general or alternative question and the model's source output answer to it.
This MR is eligible for the test inputs that have a general question or an alternative question. As shown in Fig. 1c, we first use DSS to transform the given q GEN (or q ALT ) into its declarative form k according to a GEN (or a ALT ). After that, as in MR1, we obtain a new wh-question q ′ WH as well as its expected target answer a t WH based on k, and run the SUT with q ′ WH to obtain a ′ WH . Next, we perform the output checking. If the SUT is correct, then both a GEN (or a ALT ) and a ′ WH should be correct, and k is a true fact accordingly. As a consequence, a t WH should exist in a ′ WH . The absence of a t WH in a ′ WH would indicate that at least one of a GEN (or a ALT ) and a ′ WH is erroneous.
The other two MRs, which generate a new question according to further enriched knowledge, are based on MR1 and MR2, respectively. In this paper, we solely consider enriching the synthesized knowledge in MR1 and MR2, since it is considerably easier to extract abundant information about the output answer of a wh-question. We leave the task of enriching the knowledge regarding more types of questions as our future work.

MR1+: Answering a new follow-up wh-question that is generated based on the knowledge regarding one existing source wh-question and some extra information about the model's source output answer to it.
Just like MR1, MR1+ is also eligible for the test inputs with a wh-question on which the SUT's output is not "<NoAnswer>". As shown in Fig. 1d, the overall process of MR1+ is similar to that of MR1. The difference is that we leverage DSS to synthesize a further enriched knowledge in MR1+. More specifically, for a wh-question q WH and the SUT's output a WH , we first leverage DSS to retrieve a piece of correct information about a WH from the given knowledge base or textual materials. Next, we use DSS to synthesize an enriched knowledge k + based on q WH and the retrieved correct information about a WH . Afterwards, we adopt QSG to generate a new wh-question q ′ WH and its target answer a t WH based on this enriched knowledge k + . The remaining steps are the same as those in MR1. We run the SUT with q ′ WH to obtain a ′ WH and perform the output checking. If the SUT is correct, both a WH and a ′ WH should be correct, and the enriched knowledge k + is a true fact. Since a t WH is deduced from k + , a t WH should be part of a ′ WH . If a t WH does not exist in a ′ WH , at least one of a WH and a ′ WH is wrong.

MR2+: Answering a new follow-up general question that is raised based on the knowledge regarding one existing source wh-question and some extra information about the model's source output answer to it.
MR2+ is also eligible for the test cases whose question is a wh-question on which the SUT gives a non-"<NoAnswer>" output. As shown in Fig. 1e, just as MR1+ differs from MR1, MR2+ differs from MR2 in the synthesis of knowledge. More specifically, for a wh-question q WH and the SUT's output a WH , we adopt the same operation as in MR1+ to collect a piece of correct information about a WH and synthesize an enriched knowledge k + based on q WH and the collected correct information about a WH . Next, we apply QSG to generate a new general question q ′ GEN against k + . The remaining steps are the same as those in MR2. The expected target answer a t GEN for q ′ GEN is "Yes". We run the SUT with q ′ GEN to obtain a ′ GEN and perform the output checking. If the SUT is correct, both a WH and a ′ GEN should be correct, and k + should be a true fact accordingly. As a result, a ′ GEN should be consistent with a t GEN , that is, "Yes" (or other sentences that express an affirmation). Otherwise, an issue is found because there must be at least one error in a WH and a ′ GEN .
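To make the control flow of these relations concrete, the following is a minimal sketch of the MR2 check, written under stated assumptions. The function `check_mr2`, its callable parameters (`sut`, `dss`, `qsg_general`, `is_affirmative`), and the toy data are all hypothetical stand-ins introduced for illustration; they are not the interfaces of the actual tool.

```python
from typing import Callable, Optional

NO_ANSWER = "<NoAnswer>"


def check_mr2(
    q_wh: str,
    sut: Callable[[str], str],             # hypothetical: software under test
    dss: Callable[[str, str], str],        # hypothetical: (question, answer) -> declarative k
    qsg_general: Callable[[str], str],     # hypothetical: k -> general question
    is_affirmative: Callable[[str], bool]  # hypothetical: affirmation measurement
) -> Optional[bool]:
    """Return True if MR2 is violated, False if it holds, None if ineligible."""
    a_wh = sut(q_wh)                 # source output
    if a_wh == NO_ANSWER:
        return None                  # MR2 only applies to answered wh-questions
    k = dss(q_wh, a_wh)              # knowledge implied by the source answer
    q_gen = qsg_general(k)           # follow-up general question about k
    a_gen = sut(q_gen)               # follow-up output
    # The expected target answer is "Yes"; anything else is a violation.
    return not is_affirmative(a_gen)


# Toy stand-ins (illustrative only): an inconsistent SUT that contradicts itself.
toy_answers = {
    "Who wrote Hamlet?": "Shakespeare",
    "Did Shakespeare write Hamlet?": "No",
}
toy_sut = lambda q: toy_answers.get(q, NO_ANSWER)
toy_dss = lambda q, a: f"{a} wrote Hamlet."
toy_qsg = lambda k: "Did Shakespeare write Hamlet?"
toy_aff = lambda a: a.strip().lower().startswith("yes")

violated = check_mr2("Who wrote Hamlet?", toy_sut, toy_dss, toy_qsg, toy_aff)
```

The wh-question relations (MR1, MR3, MR1+) would follow the same shape, with the final comparison replaced by a check that the target answer exists in the follow-up output.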

Declarative sentence synthesis
In this section, we introduce the methods to synthesize the declarative sentence (the knowledge k or the enriched knowledge k + ) from a pair of question and SUT's corresponding output answer. The question could be one of three types, namely general questions, alternative questions, and wh-questions.

Declarative sentence synthesis based on general question and its answer
For a general question q GEN and the SUT's answer a GEN , three steps are needed to synthesize the corresponding declarative sentence k. Figure 2a shows this process. Specifically, we first use spaCy toolkit 2 to analyze the token dependency and locate the auxiliary (aux) 3 in q GEN (step 1). When aux is a form of be or a modal auxiliary, we then move it to the location before the predicative verb (verb(ROOT)) of the sentence (step 2-1). If aux is a form of do, we remove aux and transform verb(ROOT) into the tense and number of aux with Pattern Library (Smedt and Daelemans 2012) (step 2-2). After that, the declarative sentence k corresponding to q GEN is prepared. Finally, if a GEN is not an affirmation (e.g., "No"), k is further negated (step 3). The final k is returned as the declarative sentence for q GEN and a GEN .

Declarative sentence synthesis based on alternative question and its answer
The synthesis of the declarative sentence from the given alternative question q ALT and the SUT's answer a ALT also needs three steps. The whole process is shown in Fig. 2b. The first two steps are similar to the operations in Sect. 4.3.1. The difference is that the obtained k after step 2 still contains the alternatives (text in blue). So we adopt the Berkeley Neural Parser (Kitaev and Klein 2018) to parse the syntax tree of k and then use a ALT to replace the sub-tree rooted at the parent node of "or" (step 3). Finally, the obtained k is returned as the declarative sentence for q ALT and a ALT .
[Figure 2 legend: WH: wh-words like "what", "how", "when", "who", etc.; modal: modal words like "can", "must", "would", etc.; verb/verb: verb phrase and its adaptation to the tense and number of the auxiliary.]

Declarative sentence synthesis based on Wh-question and its answer
To obtain a fairly reliable declarative sentence from the given wh-question q WH and SUT's answer a WH , we design numerous heuristic rules to process every distinct form of q WH . Due to the limited space in this paper, we only list four basic operations in Fig. 2c. The detailed rules (e.g., to adapt preposition) could be found in our online supplementary material. These operations are also performed based on the token dependency and the Part-of-Speech Tags analyzed with spaCy toolkit on q WH .

Declarative sentence synthesis based on Wh-question and extra information about its answer
Given a wh-question q WH and the SUT's answer a WH , three steps are required to synthesize the declarative sentence of the enriched knowledge with respect to q WH and a piece of correct information about a WH . Figure 2d presents the overall process. The first step is to retrieve a piece of correct information about a WH . The knowledge bases or textual materials are expected to store a great number of facts about a WH . For KBQA software, we search its knowledge base for such information. We first locate a WH and randomly pick one of its properties as such information (step 1-1a). Next, we use some heuristic rules to generate a sentence to describe the picked information (step 1-1b). For TBQA software, we directly pick a sentence including a WH from the textual material as such information (step 1-2). After obtaining the correct information about a WH , the second step for testing both KBQA and TBQA software is to transform q WH into a nominal clause (step 2). We also prepare quite a few heuristic rules to achieve this transformation. Finally, we substitute a WH in the sentence obtained in the first step with the nominal clause obtained in the second step (step 3). After that, the declarative sentence of the enriched knowledge is obtained. The detailed heuristic rules in the above steps can be found in our online supplementary material.

Follow-up question sentence generation
In this section, we introduce the methods to generate follow-up input question sentences from the declarative sentences synthesized by the above module. Two types of questions, namely general questions and wh-questions, could be generated.

General question sentence generation
The generation of general questions can be seen as the opposite process to Sect. 4.3.1. As shown in Fig. 3a, given a declarative sentence k, we first locate the predicative verb (verb(ROOT)) in k (step 1). Then, we check if there is an auxiliary (aux) before verb(ROOT). If there is, we move the aux to the beginning of the whole sentence, and the corresponding general question q ′ GEN is generated (step 2-1). Otherwise, we use Pattern Library (Smedt and Daelemans 2012) to recognize the tense and number of verb(ROOT) and insert a do with suitable tense and number at the beginning of the sentence. The q ′ GEN is then obtained (step 2-2).
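As a rough illustration of step 2-1 only, the sketch below fronts an auxiliary found in a declarative sentence. It is a simplification under explicit assumptions: whitespace tokenization and a fixed word list (`AUXILIARIES`, a hypothetical name) stand in for the dependency analysis described above, and the do-insertion of step 2-2 is not attempted, so the function returns `None` in that case.

```python
from typing import Optional

# Assumption: a small closed list replaces the dependency-based aux detection.
AUXILIARIES = {
    "is", "are", "was", "were", "am",
    "can", "could", "must", "will", "would", "shall", "should", "may", "might",
}


def to_general_question(k: str) -> Optional[str]:
    """Front the first auxiliary of a declarative sentence (step 2-1 only).

    Returns None when no auxiliary is found, i.e. when step 2-2
    (inserting a suitably inflected "do") would be required.
    """
    tokens = k.rstrip(" .").split()
    for i, tok in enumerate(tokens):
        if tok.lower() in AUXILIARIES:
            aux = tokens.pop(i)
            # Casing of the remaining tokens is left untouched, because a
            # sentence-initial word may be a proper noun.
            return " ".join([aux.capitalize()] + tokens) + "?"
    return None


question = to_general_question("Cairo is the capital of Egypt.")
```

Picking the first auxiliary in the token list is a shortcut; a parser-based implementation would select the auxiliary attached to the root verb instead.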

Wh-question sentence generation
Generating wh-questions based on the given knowledge is fairly complicated and challenging. It generally consists of two major steps, i.e., choosing proper target answers and producing the corresponding questions. Figure 3b gives an example of this process. To choose proper target answers from the given declarative sentence k, we first extract noun phrases and adjective phrases from k because they are usually used as the answers of QA software according to our observation on benchmark datasets (step 1). This process is performed based on the Part-of-Speech Tag of each token in k, which is labeled with spaCy toolkit. As a reminder, some unsuitable answers, such as phrases with demonstrative pronouns and a WH , are excluded from the set of potential target answers. And when we test the KBQA software that can only retrieve the entities from its knowledge base as the answers, we remove the candidate answers that cannot be found in the knowledge base. With the potential target answers in this set, qaaskeR + then raises a reasonable question for each of them according to k (step 2). To handle various expression phenomena, we turn to a DL-based end-to-end Language Model, UniLM (Dong et al. 2019), instead of designing complicated heuristic rules. UniLM has shown fairly promising performance of question generation on the SQuAD1 dataset (Rajpurkar et al. 2016). Specifically, we load the UniLM question generation model that is trained on SQuAD1 and released publicly. qaaskeR + then inputs k together with each answer ta j in the set to the trained model and obtains the corresponding new question set ℕℚ = {nq 1 , nq 2 , … , nq n }.
However, the quality of the questions raised by UniLM is not always guaranteed. For example, the questions generated for the potential target answers 2) and 6) in the example are unreasonable. These questions may lead to potential false positive issues since the MRs do not hold when the input is invalid. To avoid such situations, we further design a method to sift the fairly clean and reliable questions from ℕℚ (step 3). Our basic idea is that for each nq j , if it is a reasonable question for ta j according to k, the declarative sentence k ′ j created with nq j and ta j should be very similar to k. Therefore, we adopt a widely-used sentence similarity metric, ROUGE (Lin 2004), to provide a similarity score 4 s R j between k and each k ′ j . If s R j is no greater than the pre-defined threshold R , ta j and nq j will be erased from the set of potential target answers and ℕℚ , respectively. According to the result of our preliminary experiments, we set R to be 0.7 in qaaskeR + . The potential target answers 2) and 6) and their questions in this example are hence filtered out.
Finally, the valid new questions remaining in ℕℚ and their corresponding target answers will be returned as the candidate new wh-questions for MR1, MR1+, and MR3 (step 4).
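The sifting step can be sketched as follows. This is an illustration under assumptions: the text above only names ROUGE, so the choice of the ROUGE-L F-measure over lowercased whitespace tokens is ours, and the helper names (`lcs_length`, `rouge_l_f1`, `sift_questions`) and the toy sentences are hypothetical.

```python
from typing import List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F-measure (assumed variant) on lowercased whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def sift_questions(
    k: str,
    candidates: List[Tuple[str, str, str]],  # (ta_j, nq_j, k'_j)
    threshold: float = 0.7,                  # value reported in the text
) -> List[Tuple[str, str]]:
    """Keep (ta_j, nq_j) only if k'_j scores strictly above the threshold."""
    return [(ta, nq) for ta, nq, k_prime in candidates
            if rouge_l_f1(k, k_prime) > threshold]


k = "shakespeare wrote hamlet in london"
candidates = [
    ("hamlet", "What did Shakespeare write in London?", "shakespeare wrote hamlet in london"),
    ("london", "Where is Hamlet?", "hamlet is london"),
]
kept = sift_questions(k, candidates)
```

In this toy run the second candidate is dropped because its reconstructed sentence shares only two in-order tokens with k, which gives a score of 0.5.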

Violation measurement
In this section, we introduce the methods designed to measure whether SUT violates the MRs on the given test case. As mentioned in Sect. 4.2, we need to measure if a ′ WH contains a t WH (MR1, MR1+, and MR3) and if a ′ GEN expresses the affirmation (MR2 and MR2+). qaaskeR + achieves the measurement by considering the sentence semantic similarity. Specifically, we use the semantic overlap between a ′ WH and a t WH to indicate the existence of a t WH in a ′ WH , and the affirmation is measured with the semantic similarity between a ′ GEN and some affirmative expressions like "Yes".

Existence measurement
Whether a t WH exists in a ′ WH is measured via checking if there exist words in a ′ WH sharing semantically similar embedding vectors with every word in a t WH . Considering the stop words often contain limited semantic information, we do not consider them in this process.
Let us consider an example whose a t WH is "the president of egypt" and a ′ WH is "egyptian president". Table 5 shows the analysis on this example. Specifically, we first discard the stop words "the" and "of". Then, for each word in a t WH (second and third rows), we calculate the cosine similarity between it and all the words in a ′ WH (second and third columns) as suggested by Řehůřek and Sojka (2010). After that, the maximum similarity for each word in a t WH is calculated as shown in the rightmost column. With this method, although "egypt" from a t WH is not in a ′ WH , a ′ WH is still considered to contain a t WH and not a violation, as it contains "egyptian" that shares a similar word embedding vector and expresses similar semantic meaning with "egypt". Finally, we average all the word-wise maximum similarity into an overall score s exi avg to indicate the existence of a t WH in a ′ WH . It will be next compared against a pre-defined threshold exi . If s exi avg is no greater than exi , a t WH is considered absent from a ′ WH and a violation will be reported. We set exi to be 0.6000 according to our preliminary experimental results. In this example, s exi avg is calculated as (1.0000+0.7443)/2=0.8722. The SUT is thus considered to pass the test on this test case.
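The averaging of word-wise maximum similarities can be sketched as below. The three-dimensional vectors are made-up toy embeddings chosen only so that "egypt" and "egyptian" are similar; they are not the embeddings used in the experiments, so the resulting score differs from the 0.8722 reported above. The stop-word list and the function names are likewise assumptions.

```python
import math
from typing import Dict, Sequence

# Assumption: a tiny illustrative stop-word list.
STOP_WORDS = {"the", "of", "a", "an", "it", "is"}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def existence_score(target: str, output: str,
                    emb: Dict[str, Sequence[float]]) -> float:
    """Average, over target words, of the max cosine similarity to any output word."""
    t_words = [w for w in target.lower().split() if w not in STOP_WORDS]
    o_words = [w for w in output.lower().split() if w not in STOP_WORDS]
    if not t_words or not o_words:
        return 0.0
    maxima = [max(cosine(emb[t], emb[o]) for o in o_words) for t in t_words]
    return sum(maxima) / len(maxima)


def violates_existence(target: str, output: str,
                       emb: Dict[str, Sequence[float]],
                       threshold: float = 0.6) -> bool:
    """A violation is reported when the score is no greater than the threshold."""
    return existence_score(target, output, emb) <= threshold


# Toy embeddings (hypothetical values, for illustration only).
toy_emb = {
    "president": (1.0, 0.0, 0.0),
    "egypt": (0.0, 1.0, 0.0),
    "egyptian": (0.0, 0.8, 0.6),
}
score = existence_score("the president of egypt", "egyptian president", toy_emb)
```

This sketch assumes every non-stop word has an embedding; a practical implementation would need a policy for out-of-vocabulary words.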

Affirmation measurement
To check whether a ′ GEN expresses affirmation to the given general question, we propose to calculate the maximum semantic similarity between a ′ GEN and the word "Yes".
This process is similar to the measurement of existence. Specifically, as shown in Table 6, a ′ GEN in this example is "yep it is black". We first remove the stop words "it" and "is". After that, the word-wise cosine similarity is calculated as shown in the second and third columns. Then, we obtain the affirmative score s aff of a ′ GEN , namely the maximum similarity to "Yes". In this example, s aff is 0.6019. There is also a

Research questions
To evaluate qaaskeR + , we study five research questions:
RQ1: The overall effectiveness of qaaskeR + . In this RQ, we aim to give a global picture of the effectiveness of qaaskeR + in revealing the issues of various QA software without using the annotated ground truth labels. We also discuss the violation detection ability of our MRs and the performance of four test objects based on the test results.
RQ2: Effectiveness comparison with non-recursive Metamorphic Relations. In this paper, we design five novel recursive MRs that generate the follow-up input questions based on both the source input question and the source output answer. We are interested in whether these recursive MRs have better fault detection ability than those non-recursive MRs. So, in this RQ, we compare the effectiveness of our five recursive MRs with two representative non-recursive MRs that generate semantically equivalent questions as follow-up inputs.
RQ3: Validity of the revealed violations. Considering the imperfection in most of the NLP generation and measurement methods (Dong et al. 2019;Lin 2004), it is meaningful to understand the factuality of the revealed violations. Therefore, in this RQ, we perform a deeper inspection on the revealed violations to measure their validity.
RQ4: Analysis on the revealed true violations. To provide an intuitive and constructive impression on the issues revealed by our method, in this RQ, we dive into the analysis on the valid violations by locating the erroneous answers (that is, the source or the follow-up outputs), as well as summarizing the types of the answering issues according to their reasons.
RQ5: Helpfulness to fix the answering issues revealed by MRs. In many studies that propose MT-based testing methods for deep learning software, researchers generate new samples with the proposed MRs to retrain or fine-tune the models, so as to fix the defects revealed by MRs. We are also interested in whether our MRs are helpful to fix the revealed issues. Therefore, in this RQ, we study the performance of the models that are "fixed" by the proposed MRs in two manners.

Data preparation
To evaluate the effectiveness of qaaskeR + on KBQA and TBQA software, we first collect proper datasets as the source of test cases. For KBQA software, we choose two most widely-used benchmark datasets, namely WebQuestionSP and ComplexWebQ.
• WebQuestionSP is a classic KBQA benchmark dataset with wh-questions collected via the Google Suggest API. All its questions can be answered with entities retrieved from the Freebase knowledge base. Most of its questions can be answered by performing simple reasoning on Freebase. It includes 3k question-answer pairs for training and 1.6k for testing.
• ComplexWebQ is a more challenging version of WebQuestionsSP, whose questions are further complicated. Similarly, all these complicated questions are wh-questions and can be answered based on Freebase. But they require more complex reasoning on Freebase to find out the answer. ComplexWebQ has 31.1k samples for training and 3.5k samples for testing.
For TBQA software, we leverage three typical benchmark datasets, namely SQuAD2, BoolQ, and NatQA. These datasets cover the mainstream types of TBQA questions and tasks.
• SQuAD2 is a span extraction dataset, where the answer of each question is a span of words from the textual material without demanding combination and rephrasing. And when the question is unanswerable, the output is expected to be "<NoAnswer>". It contains 140k samples with wh-questions and 2k samples with general or alternative questions in total, which are divided into 130k training samples and 12k test samples.
• BoolQ is a dataset totally composed of general questions obtained from Google Search queries and paired with passages from Wikipedia that are considered sufficient to deduce the answer. The answer is expected to be either "Yes" or "No" (or sentences with similar meanings (Khashabi et al. 2020)). It has 9.4k training samples and 3.3k test samples.
• NatQA is one abstractive QA dataset, which means it requires the model to return answers that are not mere substrings of the textual material. We use the version provided by UnifiedQA where each question is appended with a referential textual material. It includes 98k wh-questions and 299 general and alternative questions, 5 which are then divided into 97k training samples and 11k test samples.
For each dataset, we first run the corresponding KBQA or TBQA software under test to obtain the source outputs of all the test cases in the above datasets. After that, we apply each of the five MRs on its eligible source test samples, respectively. The testing (i.e., violation measurement) is then conducted on the eligible samples.

Test objects
As introduced in Sect. 3.1, in this work, we build our KBQA and TBQA test objects using four representative state-of-the-art methods, namely NSM+h (He et al. 2021a), Multi-hop Complex KBQA (Lan and Jiang 2020), UnifiedQA (Khashabi et al. 2020), and Macaw (Tafjord and Clark 2021). The first two are for KBQA test objects while the latter two are for TBQA test objects. In this section, we introduce how we prepare the test objects. Among the four state-of-the-art QA methods, NSM+h, Multi-hop Complex KBQA, and Macaw have already publicly released their models for the above benchmark datasets. Therefore, we directly load the corresponding official models as our test objects. 6 Meanwhile, UnifiedQA only releases its pre-trained base model and requires us to fine-tune this base model on our target dataset. Considering that UnifiedQA can solve multiple QA tasks with a unified model, we apply it to train one general model based on all three datasets as our test object. Specifically, we collect the training samples from SQuAD2, BoolQ, and NatQA to form a hybrid training set with 236,422 samples. Following Khashabi et al. (2020), we fine-tune the pre-trained T5-large-based UnifiedQA model on this hybrid training set and regard the optimal checkpoint as the final test object. During the training process, the batch size is 3, the learning rate is 2e-5, the loss accumulates every 5 steps, the validation is conducted every 5000 steps, and the early stop tolerance is 10 validations.

RQ1: The overall effectiveness of QAAskeR +
To evaluate the overall effectiveness of qaaskeR + in revealing the answering issues, we report the violation rates 7 (the ratio of the violated test cases in all eligible test cases) of four test objects at each MR and dataset. Tables 7 and 8 respectively show the violation rates of two KBQA software and the other two TBQA software. Based on our test results, we discuss the fault detection ability of different MRs, as well as summarize the characteristics of the performance for each test object.
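The violation rate defined above can be computed as in the following sketch. The encoding of per-test-case outcomes (`True` for a violation, `False` for a pass, `None` for an ineligible case) and the function name are assumptions made for illustration, and the outcome list is toy data rather than an experimental result.

```python
from typing import Iterable, Optional


def violation_rate(outcomes: Iterable[Optional[bool]]) -> float:
    """Ratio of violated test cases among the eligible ones.

    Assumed encoding: True = violation, False = pass, None = ineligible.
    """
    eligible = [o for o in outcomes if o is not None]
    if not eligible:
        return 0.0  # no eligible test case for this MR
    return sum(eligible) / len(eligible)


# Toy outcomes for one MR on one dataset (hypothetical values).
rate = violation_rate([True, False, None, False, True, True])
```

Because ineligible cases are excluded from the denominator, rates for different MRs are computed over different subsets of the same dataset.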
Considering that the two KBQA test objects are only able to answer wh-questions, we solely test them with MR1 and MR1+, which generate new wh-questions based on existing wh-questions. From Table 7, we found that both MR1 and MR1+ have revealed quite a few violations on every dataset. This demonstrates the effectiveness of qaaskeR + in exposing the answering issues of KBQA software. When comparing the violations revealed by different MRs, we can see that MR1+, which is newly designed in this work to further extend and complicate the question, has revealed more violations than MR1. This indicates that generating more diverse and complicated questions as follow-up inputs tends to have a better fault detection ability. We also compare the performance of the two KBQA test objects and found that there is no essential difference between their performance. NSM+h triggers more violations on the relatively easier WebQuestionSP, while it shows slightly better performance on the more complicated ComplexWebQ.
6 The official models can be found in their replication packages as follows: NSM+h: https://github.com/RichardHGL/WSDM2021_NSM ; Multi-hop Complex KBQA: https://github.com/lanyunshi/Multi-hopComplexKBQA ; Macaw: https://github.com/allenai/macaw
7 Our method solves the test oracle problem and thus in its practical application, the size of the source test suite for each MR, as well as the eligible ones, can be adjusted freely. In such a case, we report the violation rate to reflect the average violation detection ability of our MRs as their overall effectiveness.
The two general TBQA test objects can answer wh-questions, general questions, and alternative questions. Therefore, we apply all five MRs to test them. From Table 8, we can also see that all MRs have revealed many violations on all datasets. This demonstrates the effectiveness of qaaskeR + for the TBQA software. The new MR1+ and MR2+, which introduce more diverse and complex follow-up inputs, are also found to reveal more violations than the original MR1-MR3. This again confirms the effectiveness and superiority of our idea to enrich and complicate the input question for detecting more errors. Furthermore, we found that MR2 and MR3, which involve the transformation of question types, have revealed relatively more violations than MR1 on both test objects. This is interesting because UnifiedQA and Macaw have shown promising ability to solve wh-questions and general questions on SQuAD2 and BoolQ, respectively, according to the results of reference-based tests (Khashabi et al. 2020;Tafjord and Clark 2021). But according to our results, they fail to return proper answers to the general question and wh-question related to very similar knowledge on these two datasets. From this point, we conjecture that UnifiedQA and Macaw might overfit the training samples, and thus could only pass the test cases whose question is of the frequent types among the training samples from their corresponding datasets. This indicates the potentially insufficient generalization of UnifiedQA and Macaw to questions of distinct types across datasets, which is vital in unifying the solutions for different TBQA task formats (Khashabi et al. 2020).
We also compare the performance of UnifiedQA and Macaw according to our test results. We see that UnifiedQA triggers fewer violations than Macaw on MR1, MR3, and MR1+, which require test objects to answer wh-questions. Meanwhile, Macaw shows better performance on MR2 and MR2+, which require test objects to answer general questions. Considering that general questions are fairly rare in the datasets, such results indicate that UnifiedQA, which requires being fine-tuned on the target dataset, can better process the test cases of patterns that are common in the dataset, while it is weaker at the relatively rare ones. Macaw, which provides zero-shot QA service on any target dataset, delivers fairly consistent but also limited performance across the test cases with distinct patterns. This suggests using Macaw to provide a preliminary QA service when the training data of expected patterns is limited, and fine-tuning UnifiedQA on sufficient training data of expected patterns for better performance.

RQ2: Effectiveness comparison with Non-recursive metamorphic relations
To understand whether the idea of recursively asking can contribute to better fault detection performance, we would like to compare the effectiveness of our recursive MRs with several non-recursive MRs. However, there are no existing non-recursive MRs for the QA software addressing various kinds of questions. Inspired by our previous work on testing the machine reading comprehension software that only solves general questions, we design two non-recursive MRs to generate semantically equivalent questions as follow-up inputs for the QA software addressing various kinds of questions. These two non-recursive MRs generate follow-up input questions via synonym replacement (SR) and back translation (BT). Specifically, we have:
• MR SR : Following that previous work, given a source input question q, we replace all the adjectives in q with their corresponding synonyms to obtain the follow-up question q ′ . As a consequence, the semantics of q and q ′ are the same. Therefore, for q ′ , this MR expects the SUT to give an answer that is consistent with the answer of q. The synonyms are obtained according to the widely-used WordNet dictionary.
• MR BT : Besides synonym replacement, Liu et al. (2021) apply back translation to obtain the semantically equivalent texts to test the dialogue systems. We borrow this idea to devise this MR. Given a source input question q, we first translate q into its Chinese expression and then translate it back into the English expression to obtain the follow-up question q ′ . The semantics of q and q ′ are the same as well, thus this MR also expects a consistent answer. The translation is automatically performed with Baidu translation service.
As a reminder, following our previous work and Liu et al. (2021), we have not introduced other special designs into MR SR and MR BT to automatically check the quality of the new questions generated by them.
We apply MR SR and MR BT to test the four test objects on all datasets as well. Table 9 presents the test results when using these two non-recursive MRs. Compared to the test results of using our five recursive MRs in Tables 7 and 8, we found that MR SR and MR BT give much lower violation rates than our five recursive MRs in most cases. This demonstrates that our MRs have fairly superior fault detection effectiveness.
We conjecture that the superior fault detection effectiveness of our recursive MRs is due to the higher execution dissimilarity between the source and follow-up test cases. Chen et al. (2003) have revealed that the MRs leading to dissimilar execution manners tend to have a higher fault detection ability. When using qaaskeR + , given one question q, any of our five recursive MRs generates a new question q ′ R that has a different type or asks for a different object. This makes the execution of the QA software when deducing the answer for q ′ R tend to differ from that for q. As a consequence, if the execution of q is faulty, the execution of q ′ R tends to bypass the same fault, produce a different answer, and trigger a violation. And when the execution of q is correct, the execution of q ′ R exercises other execution manners that may touch a fault, which is also more likely to trigger violations. By contrast, the two non-recursive MRs generate a new question q ′ N , which has a query pattern fairly similar to that of q. As a result, if the execution of q is faulty, the execution of q ′ N may also be trapped in a similar fault. An output with a similar error may thereby be obtained, and the violation would not be triggered. Besides, when the execution of q is correct, the execution of q ′ N is also unlikely to touch other faulty execution manners.

RQ3: Validity of the revealed violations
Having revealed quite a few violations on all four test objects in RQ1 and outperformed the non-recursive MRs by large margins in RQ2, we are particularly interested in evaluating the validity of the revealed violations. We perform a manual inspection on the revealed violations. Specifically, if there is at least one incorrect answer in the source and follow-up outputs, we call the corresponding violation "valid". We consider this inspection meaningful, because: (1) Apart from reporting the violation rates, it is also necessary to give a deep understanding of the factuality of the revealed violations.
(2) Generation of wh-questions and measurement of semantic similarity remain challenging tasks and cannot be guaranteed to be 100% perfect and precise with current NLP techniques (Dong et al. 2019;Lin 2004). Therefore, it is necessary to check if the violations are due to the incorrect answers or the imperfection in the sentence generation and similarity measurement.
A co-author of this paper and another volunteer student participate in the manual inspection. Both of them are proficient in English. They are required to perform the inspection independently, without discussing it with each other. Since the implementation of our method and its actual operation for all KBQA software and all TBQA software are the same, respectively, we pick NSM+h and Uni-fiedQA as representatives and investigate the factuality of the revealed violations on them to bypass repetitive workload that should lead to similar conclusions. For either NSM+h or UnifiedQA, given an MR and a dataset, if more than 100 violations are revealed, the inspectors examine the validity of the randomly picked 100 violations. Otherwise, they will check the validity of all the violations.
After the inspection of the randomly picked violations, we compute Cohen's Kappa statistic (Cohen 1960). The resulting coefficient of 0.87 indicates very high agreement between the two inspectors, and the inspectors discuss and settle the remaining disagreements at the end. Thus, we consider that the validity rate of violations in this inspection can serve as a fairly reasonable indicator of the overall validity of our experimental results. If this rate is fairly high, it means that the effectiveness reported in RQ1 and RQ2 is convincing, and that most of the revealed violations are meaningful and should be seriously considered by the developers and the users of QA software.
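For reference, Cohen's Kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies. The sketch below is a minimal, dependency-free illustration; the function name and the "valid"/"invalid" labels in the usage note are assumptions, and the data are made up rather than taken from the inspection.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

As a usage illustration, if one inspector marks 8 of 10 violations as "valid" and the other marks 7 of those same 8 as "valid" (agreeing on the 2 "invalid" ones), the observed agreement is 0.9, the chance agreement is 0.62, and kappa is 0.28 / 0.38.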
We first analyze the inspection results for our MRs. The inspection results over the randomly picked violations revealed on NSM+h and UnifiedQA are presented in the 2nd to 3rd columns of Table 10 and the 2nd to 6th columns of Table 11, respectively. For the representative of KBQA test objects, NSM+h, we found that over 80% of the inspected violations revealed by MR1 and by MR1+ are valid. Meanwhile, for the representative of TBQA test objects, UnifiedQA, we found that the inspected violations revealed by MR2 are all valid, over 80% of the inspected violations revealed by MR1 and MR3 are valid, and more than 70% of the inspected violations revealed by MR1+ and MR2+ are valid. These validity rates are considered acceptable in comparison with the precision of similar testing methods for machine translation software (Gupta et al. 2020; He et al. 2020, 2021b). To conclude, the high validity rates indicate that the effectiveness of qaaskeR + is meaningful and convincing. They also demonstrate the reliability of the design and implementation of qaaskeR + regarding the quality of wh-question generation and semantic similarity measurement.
Besides, we further review and summarize the reasons for the detected invalid violations of different MRs. This helps us understand the limitations of our method implementation and sheds light on future efforts to avoid invalid violations. First, we found that the limited precision of the semantic equivalence measurement causes all of our MRs to reveal some invalid violations. For instance, it is currently not easy to identify the expected answer "yesun temur" as semantically contained in the output answer "yesün temür". Second, we found that the limited capability of the question generation model and of the unqualified-question filter causes the MRs that generate wh-questions (i.e., MR1, MR1+, and MR3) to reveal some invalid violations. For example, the imperfect question "Seven episodes are going to be in what season?" for the knowledge "Seven episodes are going to be in game of thrones season 7" (the target answer is in italic) is generated but not filtered out. Third, we found that a few invalid violations revealed by MR1+ and MR2+ on UnifiedQA result from the imprecise knowledge that we synthesized. This is because it is indeed challenging to extract accurate and clear additional information about the source output from the textual materials to synthesize the extended knowledge. In the future, we can follow the advances in question generation and semantic measurement and use more powerful techniques to avoid the majority of the current invalid violations. It should also be helpful to further refine the heuristic NLP rules that we implement for the previously mentioned operations.
We also compare the validity rate of the inspected violations revealed by our MRs to that of the two non-recursive MRs. The corresponding inspection results on the inspected violations revealed by MR SR and MR BT on NSM+h and UnifiedQA are presented in the last two columns of Tables 10 and 11, respectively. We first notice that the validity rate of the inspected violations revealed by MR BT is much lower than that of our MRs. We found this is mainly because the translation service sometimes produces imperfect questions whose semantics differ from those of the original questions. Since no definite relation can be expected between the outputs for the original questions and these imperfect new questions, some invalid violations are revealed on these imperfect inputs. In contrast, we design specific methods to examine the quality of the generated questions when implementing our MRs. Therefore, our MRs generate fewer imperfect new questions and lead to fewer invalid violations. As a reminder, according to the result of RQ2, MR BT is also weaker than our MRs in terms of violation detection effectiveness. As for MR SR , we found that the validity rate of the inspected violations revealed by it is slightly better than that of our MRs. We consider this is mainly because synonym replacement is relatively easy and does not change the original question much, and is thus less likely to generate many imperfect questions. But as we discussed in RQ2, this also largely limits the violation detection effectiveness of MR SR .

RQ4: Analysis on the revealed true violations
In this RQ, we further study the valid true violations revealed by our method, which are identified in RQ3. Specifically, we first locate the erroneous answers (that is, the source or the follow-up answer) and next summarize the answering issues of two representative test objects.
We first identify the erroneous answer in the true violations. Tables 12 and 13 present the statistics for NSM+h and UnifiedQA, respectively. We found that qaaskeR + can find errors on both source test cases and follow-up test cases for both test objects. More specifically, for NSM+h, more violations are attributed to the source output answer on the ComplexWebQ dataset than on the WebQuestionSP dataset. This is reasonable because the questions in ComplexWebQ are more challenging than those in WebQuestionSP. For UnifiedQA, in the violations revealed by MR2 and MR3 on SQuAD2 and BoolQ, respectively, more issues are attributed to the follow-up answer than to the source answer. Since the follow-up questions generated by MR2 and MR3 are of formats that are fairly infrequent in SQuAD2 and BoolQ, respectively, this further supports our conjecture about the limited generalization of UnifiedQA on question types across datasets, which has been discussed in Sect. 6.1.
To give a systematic and intuitive understanding of the issues that qaaskeR + reveals on the test objects, we further identify and summarize the answering issues according to the revealed true violations. We first analyze the violations revealed on NSM+h. Since the output of the KBQA software is a set of entities, there are only three possible types of errors in a returned answer, namely giving totally wrong entities, missing necessary entities, and returning extra entities. All three types of errors have been identified in the revealed violations on NSM+h. The corresponding examples are listed in Table 14. Note that these types of errors may occur in one case simultaneously. Here we only present examples in which a single type of error occurs, for clarity.
Giving totally wrong entities. We first found that NSM+h sometimes returns totally wrong answers, which means that there is no intersection between the expected and returned answers. This indicates that NSM+h largely misunderstands the intent of the corresponding input questions. For instance, in Example N-1, the question asks for the team that Adrian Peterson played for in his college years. However, NSM+h wrongly returns the team for which he currently plays as a professional athlete.
Missing necessary entities. We also notice that NSM+h may miss some necessary entities and give an incomplete answer, where the output answer is the proper subset of the expected answer. For instance, in Example N-2, the information in the knowledge base clearly indicates that Margaret Hoover has been educated in two institutions. However, NSM+h only retrieves one of them as the final answer.
Returning extra entities. Besides, even when it has correctly returned all the necessary entities, NSM+h may additionally give some unnecessary entities. In Example N-3, the question only asks for the city of Acadia University. However, NSM+h seems not to accurately parse the scope of this query. It returns not only the correct city "Wolfville", but also the unnecessary entity "Nova Scotia", which is the province where Acadia University is located.
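Because a KBQA answer is a set of entities, the three error types above reduce to simple set comparisons. The following sketch illustrates this under stated assumptions: the function name and label strings are hypothetical, exact string matching stands in for whatever entity matching the actual implementation uses, and the no-overlap case is reported as its own category rather than as "missing" plus "extra".

```python
def classify_entity_errors(expected, returned):
    """Classify the errors in a returned entity set against the expected set.

    Returns a set of labels drawn from "totally_wrong", "missing", "extra";
    an empty set means the returned answer matches the expected answer.
    """
    expected, returned = set(expected), set(returned)
    if returned == expected:
        return set()
    if not (expected & returned):
        # No intersection between expected and returned answers.
        return {"totally_wrong"}
    errors = set()
    if expected - returned:
        errors.add("missing")  # some necessary entities were not returned
    if returned - expected:
        errors.add("extra")    # some unnecessary entities were returned
    return errors
```

With the entities from Example N-3, `classify_entity_errors({"Wolfville"}, {"Wolfville", "Nova Scotia"})` reports only an "extra" error. An answer that both drops a necessary entity and adds an unnecessary one receives both the "missing" and "extra" labels.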
Compared to the KBQA tasks whose answer is made up of several entities, the answers for the TBQA tasks are more flexible since their answers are in the form of text without much restriction. As a result, UnifiedQA suffers from more types of issues. We identify and summarize five types of answering issues on UnifiedQA as follows. The corresponding examples are listed in Table 15.
< NoAnswer > for answerable questions. We first found that UnifiedQA cannot answer some answerable questions. In Example U-1, the textual material provides obvious evidence to deduce the answer; the only difference is that the word "fund" in the question is replaced with its synonym "meet the cost". However, the UnifiedQA model fails to find the correct answer and merely outputs < NoAnswer >.
Format mismatch between the answer and the question. The second major issue is that some answers from UnifiedQA are not in the format that corresponds to the given questions. For example, the question in Example U-2-1 is a wh-question, which asks for the concrete name of a film. However, the SUT only returns "No". Meanwhile, the answer to the general question in Example U-2-2 should be either "Yes" or "No", but the SUT wrongly gives an irrelevant verb as the answer.

Table 14 Examples of revealed answering issues on NSM+h

Although successful in recognizing the type of the question, the UnifiedQA model sometimes gives answers with irrelevant content. Example U-3 presents an example of this situation. The model returns an answer "Sky", which is far from the correct answer "Some encrypted broadcasts".
Grammatical error. The UnifiedQA model also returns some answers with grammatical issues, which largely harms the quality of the answers. For instance, in Example U-4, an incomplete sentence is given as the answer.
Missing information in the answer. We also notice that the UnifiedQA model may miss some necessary information and give a partially correct answer. Referring to the model's answer in Example U-5, though it indicates that Shi Bingzhi is someone's father, the pronoun "his" is ambiguous. It is evidently not an accurate answer yet according to the textual material.

RQ5: Helpfulness to fix the answering issues revealed by MRs
In this RQ, we investigate the helpfulness of our method in fixing the issues revealed by MRs for the test objects. In existing studies on MT for deep learning software, there are two manners to fix the revealed issues with the proposed MRs, namely to retrain new models (Tian et al. 2018;Liu et al. 2021) and to fine-tune the existing models (He et al. 2020;Wang and Su 2020) using the samples generated with MRs, respectively. In our preliminary work, we have investigated the effectiveness of retraining new models with three preliminary MRs, i.e., MR1-MR3. In this work, we first report the effectiveness of retraining manner with all five MRs. Next, we further investigate the effectiveness of fine-tuning manner. In this RQ, we solely conduct the evaluation on NSM+h and UnifiedQA as representatives as well, because only they release friendly replication packages to perform retraining and fine-tuning.
Retraining new models from scratch on the training cases expanded by the proposed MRs is a widely-used manner to fix the revealed issues on deep learning models in many studies (Tian et al. 2018;Liu et al. 2021). Following the existing works that use this method, we expand the training samples with the five proposed MRs to retrain new NSM+h and UnifiedQA models. The performance difference between the original and new models will demonstrate the helpfulness of the MRs to fix the answering issues in a retraining manner. Retraining-based repair applies to the scenario where massive training samples are available and a high time cost is acceptable to rebuild a better-performing model from scratch. And it is especially preferable when only a few test cases are available to be extended for fine-tuning because fine-tuning may cause the models to overfit to the few new samples in this case; meanwhile, the model retrained on both the original training samples and the expanded new samples tends to retain the ability learned on the original training samples.
Specifically, for each training sample in a dataset, we adopt the MRs that are eligible on it to generate corresponding new training samples. Considering that the ground truth labels for training samples should have existed for performing supervised training, we decide to directly use the ground truth label of every training sample to synthesize the knowledge for generating new training samples. For every  Tables 16 and 17, respectively. We first found that the reduction in violation rates is substantial against all MRs for both test objects. For UnifiedQA, the improvement on MR2 and MR3, which involve the transformation of question types, is especially significant. These findings indicate that the proposed MRs can help to eliminate many answering issues revealed on the test objects by retraining them on the expanded training samples.
Fine-tuning the original models with new samples generated from the existing test cases is a more recent way to repair the revealed issues in deep learning models (He et al. 2020; Wang and Su 2020). In this work, we also study whether our MRs can help to fix the revealed issues using the new samples generated from the test cases. In particular, considering that the labels of test cases are usually unavailable, we directly generate new samples based on the knowledge synthesized from the actual source outputs, which is exactly the same process as testing with qaaskeR + . The performance difference between the original and fine-tuned models shows the helpfulness of our MRs in fixing the revealed issues via fine-tuning. Compared to retraining-based repair, fine-tuning is less resource- and time-intensive (He et al. 2021b). When the cost of retraining cannot fit a tight update schedule or the original training samples are unavailable, developers can perform fine-tuning to flexibly patch the existing models. But they should use a proper learning setup to avoid overfitting. Specifically, given a dataset, we first divide its test cases into two subsets at a ratio of 1:1. We use the test cases in one subset (denoted as the fine-tuning subset) to generate the new samples for fine-tuning, and leave the test cases in the other subset (denoted as the testing subset) for testing. Next, for each sample in the fine-tuning subset, we apply the MRs that are eligible for it to generate the new samples. Since there are usually no easily accessible ground truth labels for test cases, we directly generate new samples based on the knowledge synthesized from the actual outputs of the models. This helps us efficiently leverage the unlabeled test cases to fix the answering issues. We then collect the new samples together with all the original samples in the fine-tuning subset to form the final fine-tuning set. The final fine-tuning sets of WebQuestionSP and ComplexWebQ for NSM+h include 1,735 and 3,059 samples, respectively, and the hybrid fine-tuning set for UnifiedQA contains 28,866 samples in total. Afterwards, we fine-tune each model with the corresponding fine-tuning set for 20 epochs and take the optimal model obtained during the fine-tuning process as the final test object. Finally, we test each fine-tuned model on the corresponding testing subset.
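The data preparation for fine-tuning can be sketched as below. This is a minimal illustration under assumptions: `split_half`, `build_fine_tuning_set`, `predict`, and `generate_new_samples` are hypothetical names, the shuffle before the 1:1 split is an assumption (the text does not state how the split is drawn), and the MR-based sample generation is passed in as a callable rather than implemented.

```python
import random


def split_half(test_cases, seed=0):
    """Split test cases 1:1 into a fine-tuning subset and a testing subset."""
    cases = list(test_cases)
    random.Random(seed).shuffle(cases)  # assumption: random assignment
    mid = len(cases) // 2
    return cases[:mid], cases[mid:]


def build_fine_tuning_set(fine_tuning_subset, predict, generate_new_samples):
    """Combine the original samples with MR-generated samples.

    `predict(case)` returns the model's actual output for a test case;
    `generate_new_samples(case, output)` applies the eligible MRs to the
    knowledge synthesized from that output (no ground truth label is used).
    """
    fine_tuning_set = list(fine_tuning_subset)
    for case in fine_tuning_subset:
        source_output = predict(case)
        fine_tuning_set.extend(generate_new_samples(case, source_output))
    return fine_tuning_set
```

The testing subset returned by `split_half` is kept untouched, so the violation rates measured after fine-tuning are computed on test cases that contributed no fine-tuning samples.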
The test results for the fine-tuned NSM+h model and UnifiedQA model are presented in Tables 18 and 19, respectively. Somewhat surprisingly, the violation rates of both test objects are reduced considerably as well. More specifically, the improvement on MR1, MR2, MR3, and MR2+ is generally comparable to that under the retraining setup. The improvement on MR1+ is relatively smaller but still considerable. These results suggest that fine-tuning the existing models with the samples generated from the knowledge about the actual outputs can also achieve a fairly good repair effect against the revealed answering issues, and does so in a less resource- and time-intensive manner. To conclude, both retraining new models and fine-tuning existing models with the samples generated with our MRs can help to fix some revealed issues and reduce the violations. However, many violations still exist on the retrained or fine-tuned models. This finding demonstrates that it is not easy to repair all the issues revealed by qaaskeR + . Since qaaskeR + is a testing method per se, we can also argue from this point that our method is necessary for inspecting the correctness of QA software outputs and for revealing in-depth problems of QA software. This again confirms the significance of qaaskeR + as a testing method.

Discussion on real-life usage
With the development of QA algorithms, some industrial products have become able to provide preliminary QA services. For example, the Google Search service 8 can now return an exact answer or one paragraph with the answer span in bold when we input a wh-question as the query. Thus, we try qaaskeR + on the Google Search service to get an initial impression of its usefulness on real-life QA applications. As a search engine, the Google Search service does not require the referential textual material as input but retrieves the necessary information from the web by itself. Therefore, it can also be seen as a representative test object of open-world TBQA software.
According to our observation, the Google Search service can mainly answer wh-questions now. Therefore, we only try MR1 and MR1+ on it. Besides, as the returned results vary in form (e.g., sometimes an exact phrase and occasionally a paragraph with one span in bold), we only perform a small-scale trial by hand as a preliminary exploration. Specifically, for each MR, we first randomly choose 20 wh-questions from an open-world TBQA benchmark, MKQA 9 (Longpre et al. 2020). These wh-questions are used as the source inputs, and we manually collect and unify the answers returned from Google Search as the source outputs. After that, we run qaaskeR + to generate new questions and their target answers based on the source inputs and outputs. We next input the new questions as queries and obtain the search results, from which the follow-up outputs are manually extracted. At last, we perform the violation measurement on all test cases.
Finally, 5 out of 20 test cases regarding MR1 10 and 7 out of 20 test cases regarding MR1+ 11 trigger a violation. Let us present one of the violations revealed by each MR in detail. As shown in Fig. 4a, for MR1, we first query "When was the first railroad built in the United States?" and obtain the source answer "1830" from Google Search. After that, we query "In which country was the first railroad built in 1830?" whose target answer is "the United States". However, Google Search returns an irrelevant answer "The railroad was first developed in Great Britain...", which triggers a violation. Actually, the answer to the source input should be "1827-02-28" according to the annotated label from MKQA. As Fig. 4b shows, for MR1+, we first query "What was the first movie to have color?" and get the source answer "A Visit to the Seaside" from Google Search. With the introduction materials about this movie provided by Wikipedia, we learn that the director of "A Visit to the Seaside" is "George Albert Smith". Then, we recursively query "Who is the director of the first movie to have color?" on Google Search but obtain an unexpected answer "Cecil B. DeMille", which triggers a violation. Actually, the answer to the source question should be "La Vie et la passion de Jésus Christ". These cases demonstrate that qaaskeR + finds truly erroneous answers returned by Google Search. In short, this trial demonstrates the potential of qaaskeR + to reveal real-life bugs in daily QA applications. Moreover, we consider it also suggests that people can leverage our methodology of recursive asking to perform a "just-in-time test" during their usage of QA software, so as to obtain some clues about the correctness of the returned answer.
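The trial procedure above (ask the source question, build a follow-up question with a target answer from the source output, ask again, and compare) can be sketched as follows. This is a simplified illustration and not the qaaskeR + implementation: `recursive_check`, `ask`, and `make_follow_up` are hypothetical names, the question generation is supplied by the caller, and a normalized substring check stands in for the semantic similarity measurement.

```python
def normalize(text):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def is_violation(target_answer, follow_up_answer):
    """Simplified check: the target answer must appear in the follow-up answer."""
    return normalize(target_answer) not in normalize(follow_up_answer)


def recursive_check(ask, source_question, make_follow_up):
    """Run one recursive test; return True if a violation is triggered.

    `ask(question)` queries the QA software under test.
    `make_follow_up(question, answer)` returns (new_question, target_answer).
    """
    source_answer = ask(source_question)
    follow_up_question, target_answer = make_follow_up(source_question, source_answer)
    follow_up_answer = ask(follow_up_question)
    return is_violation(target_answer, follow_up_answer)


# Replay of the MR1 example from the text with canned answers.
canned = {
    "When was the first railroad built in the United States?": "1830",
    "In which country was the first railroad built in 1830?":
        "The railroad was first developed in Great Britain...",
}


def railroad_follow_up(question, answer):
    return ("In which country was the first railroad built in " + answer + "?",
            "the United States")


violated = recursive_check(canned.get, next(iter(canned)), railroad_follow_up)
```

Here `violated` is `True`, matching the violation described for Fig. 4a. Note that no annotated label is consulted at any point; only the two returned answers are compared.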

Threats to validity
The first threat to validity concerns the representativeness of the test objects and the datasets. In this work, we apply our method to four test objects and one real-life application. These four test objects cover both mainstream categories of QA software and achieve state-of-the-art performance. Google Search is a popular and typical QA application used in daily life. Thus, we consider them suitable representative test objects whose test results can reflect the effectiveness of qaaskeR + in general. Actually, qaaskeR + can be considered a black-box testing method that only involves the input and output of QA software and an arbitrary fact about the output. Therefore, qaaskeR + should be generalizable to any other QA software of these mainstream categories, in the same way in which we test the representative objects. As for the datasets, the adopted benchmarks are all classic, have been widely used in the reference-based testing of QA software, and cover the major types of QA tasks (He et al. 2021a; Khashabi et al. 2020). Since we evaluate qaaskeR + on all of them, we consider that the evaluation should generalize fairly well.
The second threat to validity concerns the tools that we adopt to realize the proposed MRs. As illustrated in Sect. 6.3, the wh-question generation and semantic similarity measurement are not yet perfect because of the limitations of current NLP techniques. To assure the validity of the revealed violations, we have designed various methods to avoid false positive violations. We also inspected the factuality of the revealed violations. The result shows that over 70% of the inspected violations are valid. This is acceptable when compared to other MT-based test methods for DL software (Gupta et al. 2020; He et al. 2020, 2021b). We will also keep trying to improve this validity rate in our future work. The last threat to validity comes from the manual inspection and categorization of the revealed violations. To alleviate the bias introduced by differences in subjective judgment, we delivered a tutorial to the inspectors before the inspection. We have also computed Cohen's Kappa statistic and found very high agreement between the two inspectors (0.87). All the disagreements are settled after their discussion.

Related works
In this section, we discuss the related works in two aspects, i.e., the benchmark datasets proposed for testing QA software and the application of Metamorphic Testing for other Deep Learning software.

Benchmark datasets for QA software
To test QA systems as well as to understand whether machines can reason about questions as humans do, many works have proposed various benchmark datasets (Yani and Krisnadhi 2021; Dzendzik et al. 2021). For KBQA, there are various benchmark datasets with different requirements. Some datasets solely contain questions that can be answered by deducing a simple relation in the knowledge base (Chandar et al. 2016; Azmy et al. 2018), and some datasets include questions that require complex reasoning to answer (Bao et al. 2016; Trivedi et al. 2017). For TBQA, benchmark datasets come with diverse forms of task, including filling in the blanks (Onishi et al. 2016; Suster and Daelemans 2018b), judging the correct options (Clark et al. 2019; Khashabi et al. 2018; Lai et al. 2017), extracting the relevant spans (Rajpurkar et al. 2016, 2018; Yang et al. 2018), and returning fluent text answers (Kwiatkowski et al. 2019; He et al. 2018). There are also datasets of samples with adversarial inputs, such as typos (Eger et al. 2020) and irrelevant sentences (Jia and Liang 2017), to test the robustness of QA software. But as mentioned in Sect. 1, these datasets may mainly focus on some specific topics and task formats. As a result, solely testing with the reference-based paradigm on these datasets is not extensible and may be biased and insufficient.
Unlike these works, in this paper, we propose a method to test QA software without the demand of annotated labels via asking recursively. It breaks the reliance on the ground truth labels of test cases and hence enables both the flexible just-in-time test and the extensible test that can leverage the massive unlabeled data in real-life usage to test QA software.

Metamorphic testing for deep learning software
To alleviate the oracle problem during testing various Deep Learning (DL) software, quantities of works leverage MT and propose many novel MRs to test the DL models for different tasks.
Autonomous Driving (AD) systems and Neural Machine Translation (NMT) services are two typical kinds of DL software that have attracted many MT-based testing methods. Tian et al. (2018) and Zhang et al. (2018) propose to test AD against the relation among the steering angles under distinct weather conditions. Zhou and Sun (2019) combine MT and fuzzing and take the LiDAR point-cloud data of AD into consideration during testing. Wang and Su (2020) leverage MT to test the object detection algorithms that are used to build a key component in AD systems. As for NMT services, researchers propose to check their correctness with MT based on structure invariance (He et al. 2020), pathological invariance (Gupta et al. 2020), referential transparency (He et al. 2021b), etc. In addition to being adopted to test NMT services, MT is also found to be helpful in assessing the quality of input data (Yan et al. 2019) and repairing erroneous translations (Sun et al. 2020, 2022) for NMT services.
In this paper, we also leverage MT to test a hot DL application, QA software. Specifically, we propose five novel MRs against the consistency among the input question and output answer pairs that are related to the same or some further enriched knowledge. We also implement three tools, i.e., the declaration synthesis, question generation, and similarity measurement, to realize the proposed MRs.
Besides, we proposed to validate machine reading comprehension (MRC) DL models with MT in our previous work. It aims to provide the MRC models with a systematic and extensible assessment of language understanding capabilities against required linguistic properties. In that work, the follow-up inputs were built merely on the basis of the source inputs and transformations concerning several related linguistic properties such as synonyms and negations.
Different from that, the qaaskeR + devised in this work is a recursive metamorphic testing method that constructs follow-up inputs by considering both the source input and the source output. This could involve some more abstract properties, such as the generalizability on question types. And we also evaluate qaaskeR + on not only the previously investigated TBQA boolean question task, but also the TBQA span extraction and free-form answering tasks. Furthermore, we investigate the effectiveness of qaaskeR + on another mainstream category of QA software, the KBQA software, and further explore the real-life usefulness of qaaskeR + on the Google Search service and its helpfulness to repair the revealed issues.

Conclusion and future work
Question Answering (QA) software has been widely used in our daily life. In this paper, we propose a novel recursive Metamorphic Testing method, qaaskeR + , with five novel recursive Metamorphic Relations. qaaskeR + tests QA software by checking its behaviors on multiple recursively asked questions that are relevant to the same or some further enriched knowledge. It removes the reliance on the pre-annotated labels of test cases, and thus enables both the flexible just-in-time test during usage and the extensible test with massive unlabeled data for QA software, neither of which is supported by the current reference-based test paradigm. We evaluate the effectiveness of qaaskeR + by using it to test four representative state-of-the-art QA software systems that cover two mainstream types of QA software, as well as a popular real-life QA application, the Google Search service. Comprehensive results demonstrate that qaaskeR + can reveal a large number of valid violations that reflect diverse answering issues for various kinds of mainstream QA software. Besides, we also found that our recursive MRs have a better fault detection effectiveness than two representative non-recursive MRs and can even help to fix the revealed issues.
We have planned plenty of future work directions. First, we would like to further evaluate the effectiveness of qaaskeR + on more applications and corpora. It would be interesting and significant to see the defects revealed on other QA software, especially the issues about some essential functionalities like generalization and the problems that may have been concealed by the insufficient reference-based tests. In addition, we will also try to design new MRs by considering more properties of QA software and keep improving the validity of the revealed violations. We are pretty interested to explore and strengthen qaaskeR + on repairing the revealed issues as well.