A microservice regression testing selection approach based on belief propagation

Regression testing is required to assure the quality of each iteration of microservice systems. Test case selection is one of the main techniques to optimize regression testing. Existing techniques mainly involve artifact acquisition, processing and maintenance, and are thus hard to apply in microservice regression testing, since it is difficult to obtain and process the required artifacts from multiple development teams, which is common for microservice systems. This paper proposes a novel approach, MRTS-BP, which takes API gateway logs instead of artifacts as inputs. By mining service dependencies from API gateway logs, MRTS-BP analyzes service change impacts based on a propagation calculation and selects test cases affected by changes based on impact degree values. To evaluate the effectiveness of MRTS-BP, empirical studies on four deployed real-world systems are presented. The retest-all strategy and a regression testing selection approach based on control flow graphs, called RTS-CFG, are compared with MRTS-BP. The results show that MRTS-BP can significantly reduce both the number of test cases and the overall time cost while maintaining the fault detection capability of the selected test suite, and that MRTS-BP can save more time than RTS-CFG with similar safety and precision.


Introduction
Microservice architecture is an effective architectural pattern popularly used in developing cloud applications, in which services are built independently and integrated at run time using container technologies such as Docker [1][2][3]. Microservice architecture supports frequent business expansion and smooth upgrading by facilitating independent development and deployment of services [4]. Whenever a service is modified, regression testing is required to detect any faults potentially introduced by the modification [5]. A common strategy in regression testing is to rerun previously used test cases (referred to as the original test suite), namely the retest-all strategy. The cost of the retest-all strategy, however, might not be acceptable for a microservice system with a large number of deployed services and rapid iterations. For example, WeChat, a social microservice system, has tens of thousands of services deployed and normally takes several months to conduct regression testing with the retest-all strategy [6]. To reduce testing cost, many techniques such as test case prioritization, test suite minimization and test selection have been proposed and applied [7].
Regression testing selection (RTS) reduces testing cost by selecting a subset, the selected test suite, from the original test suite to intentionally cover modules which are changed, or affected by changed modules, in the last iteration [5]. Many approaches first identify targeted modules by change impact analysis and then produce the selected test suite [8][9][10][11]. Artifacts such as requirement specifications, design models and code files are usually required to conduct change impact analysis effectively. For microservice testing, covering scenarios of service invocations is one of the main test objectives [4,8,11], and thus a service invocation chain is usually set as a test case, also called a test path. With such test paths, artifact-based RTS extracts service dependencies from artifacts to analyze which invocation chains may be affected by service modifications, so that corresponding test paths can be identified and selected. In addition, safety and precision are also considered when evaluating RTS approaches [5]. An RTS approach is safe if it retains all test cases that can reveal faults in regression testing, and precise if it does not retain any unnecessary test cases. To assure safety, RTS approaches normally expand the selected test suite, but that decreases precision. Usually, we want to improve precision (i.e., reduce testing cost) while maintaining the safety of test selection at an accepted level. (*Correspondence: Ji Wu, maple_clz@163.com, School of Computer Science and Engineering, Beijing University of Aeronautics and Astronautics, Beijing 100191, China)
However, challenges arise when artifact-based RTS approaches are applied in microservice regression testing: (1) Artifact acquisition. When the microservice system under testing is developed by multiple teams, acquiring artifacts raises several issues: extra communication is required to gather them, which elevates the total cost, and compliance with numerous security policies adds further expense [8]; (2) Artifact processing. The diverse development approaches applied (e.g., different modeling methods and coding frameworks) [12][13][14] increase the difficulty of agreeing on the integrity, comprehensibility and consistency of artifacts, which seriously hinders artifact processing; (3) Artifact maintenance. As the system scale grows under continuous delivery, maintaining artifact versions costs extra effort. Especially when multiple versions of a service need to run together, integrating their artifacts can cause confusion. Hence, these three main challenges indicate that artifact-based RTS approaches are not suitable for microservice regression testing.
Meanwhile, the API gateway layer of a microservice system logs every API invocation at runtime for quality-of-service investigation, including requester, responder, time, status code, etc. [4]. Given a large amount of collected API gateway logs, frequent collaboration patterns among services can be mined to indicate service dependencies. Based on these dependencies, change impact analysis can be conducted via belief propagation [15], which avoids the above challenges. This motivates the approach presented in this paper: microservice regression testing selection based on belief propagation (MRTS-BP).
First, MRTS-BP generates a service dependency matrix (SDM) by mining service dependencies at the business level from API gateway logs. Second, a directed graph of dependencies among services is established from the SDM and fed into a change impact propagation algorithm to measure change impacts quantitatively. Third, existent satisfaction, complete satisfaction and k-existent satisfaction strategies are adopted for different cases to generate the selected test suite. To evaluate the effectiveness of MRTS-BP, we conduct empirical studies on four deployed real-world systems to measure the reduction rate of testing cost, recall, precision and F-measure. The retest-all strategy and a typical artifact-based RTS called RTS-CFG [16] are compared with MRTS-BP. The results show that MRTS-BP can significantly reduce both the number of test cases and the overall time cost while maintaining the fault detection capability of the selected test suite, and that MRTS-BP can save more time than RTS-CFG with similar safety and precision.
The main contributions of this paper are: (a) the data used to select test cases are extracted from API gateway logs; (b) change impacts are quantitatively calculated through a mathematical approach; (c) three selection strategies are proposed to meet practical testing scenarios.
The rest of this paper is structured as follows: the Related work section presents related work on microservice testing, regression testing selection and belief propagation; the Methodology section presents MRTS-BP in detail; the Empirical study and Results and discussion sections present the empirical analysis; and the Conclusions and future work section discusses conclusions and future work.

Microservice testing
Microservice testing comprises unit testing, service testing, and end-to-end testing [4]. Unit testing is used to identify faults in functions or classes, and is supported by mature tools such as xUnit [17] and Mockito [18]. Service testing verifies each service through its exposed interfaces; it is preferred when tests must bypass the user interface and run quickly. End-to-end testing focuses on the behaviors of the entire system. Due to challenges posed by service autonomy, dynamic binding and access restrictions [8,11], service testing and end-to-end testing differ greatly from testing traditional software, which has led to much research and practice [4]. Test cases in this paper cover both service-level and end-to-end behaviors of the microservice system under testing.
Both service testing and end-to-end testing involve multiple services and their invocations. For instance, consumer-driven testing includes consumer, target and stubbed services [4]. Therefore, test cases of these procedures are abstracted as test paths [19][20][21][22], defined as follows:

Definition 2.1 (Test path). Let <s_i, s_j> represent an invocation between two services. A test path is a sequence of <s_i, s_j> elements, where each element of the sequence can be a single invocation or a sequence composed of multiple invocations. A test path can be formally defined by the recursive regular expression tp = <<s_i, s_j> (, <s_i, s_j>)*> | <(tp,)+>.
In our work, since the collected logs of microservice systems present the interfaces exposed by services, the granularity of "service" in Definition 2.1 is a service interface.
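As an illustration, a test path per Definition 2.1 can be represented with ordinary nested sequences. The sketch below is our own illustration (not part of the paper's tooling) and flattens a nested test path into its elementary invocations.

```python
# Illustrative sketch (not from the paper): a test path per Definition 2.1 as
# nested tuples. An invocation <s_i, s_j> is a pair of service-interface names;
# a test path element is either an invocation or a nested sub-path.

def flatten(tp):
    """Yield the elementary invocations of a (possibly nested) test path."""
    for element in tp:
        if len(element) == 2 and all(isinstance(s, str) for s in element):
            yield tuple(element)            # a single invocation <s_i, s_j>
        else:
            yield from flatten(element)     # a nested sub-path

# A path invoking s1 -> s2, then a sub-path (s2 -> s3, s3 -> s4):
tp = (("s1", "s2"), (("s2", "s3"), ("s3", "s4")))
print(list(flatten(tp)))   # [('s1', 's2'), ('s2', 's3'), ('s3', 's4')]
```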

Regression testing selection
The regression testing selection problem is formally defined as follows [5]:

Definition 2.2 (Regression Testing Selection Problem, RTS problem).

Given: a program P, the modified version of P, denoted P′, and a test suite T.

Problem: find a subset T_s of T with which to test P′.
Most existing RTS approaches concentrate on formally representing the change impact scopes from P to P′ and searching for test cases that cover these scopes [5]. Due to the close relationship between RTS approaches and system architectures, various RTS approaches for different architectural patterns have been proposed along with the continuous development of architecture paradigms; they can be divided into two categories: independent-program-oriented RTS and web-service-oriented RTS.
Early application systems have relatively small sets of functionalities and mainly take the form of independent programs. RTS research for them concentrates on change impact analysis over code files, using, for example, data flow analysis, graph traversal and firewall approaches. The data flow analysis approach extracts information detailing the locations of definitions and uses, which an inter-procedural data flow tester needs to guide the selection and execution of test cases [23]. Though applied in regression testing for spreadsheet programs [24], this approach is difficult to apply to code that does not expose data flows. The graph traversal approach relies on graph models such as the control dependency graph [25], program dependency graph [26], system dependency graph [26] and control flow graph [27,28]. It usually includes two phases: analysis and selection. In the analysis phase, graph models at different granularities are established for P and P′ to identify change impact scopes; in the selection phase, relationships between such scopes and test cases are established, and test cases that relate to change scopes are selected. In the firewall approach, proposed for module integration testing [29,30], the integration scopes affected by changes are defined as "firewalls"; based on firewalls computed from code files, test cases covering the firewalls are selected.
With the wide deployment of web applications, many RTS studies concentrate on regression testing selection for web services. Since web service testing concentrates on service compositions [8,11], change impact analysis is conducted based on specifications and behavior models of web services. Treating web service testing as black-box testing, specifications based on the Web Services Description Language (WSDL), describing functions from an end-user's point of view, are required as inputs to determine change scopes and select test cases [31,32]. Analogous to the graph traversal approach, a two-stage RTS based on control flow graphs has been proposed for web service regression testing selection [16] (referred to as RTS-CFG in the following). This approach includes an initialization stage and a key stage: in the initialization stage, artifacts of the system under testing are collected and control flow graphs are established to represent service invocation logic; in the key stage, test cases are selected according to dangerous edges of the control flow graphs. Considering different granularities of services, various RTS approaches have been proposed. For service endpoints, an RTS approach based on path analysis is presented [21]; it compares invocation path changes between service endpoints before and after an iteration and selects test cases covering such changes. For service interfaces, an RTS approach based on service interface contract analysis is proposed [20]; conflicts caused by contract changes are identified, and test cases covering such conflicts are selected. For services as a whole, an RTS approach based on business process modeling is presented [22]; it requires structured business logic specifications as inputs and follows the graph traversal method.
For large-scale systems, one RTS approach injects additional code into services to collect relationship data between Java code and test cases [14], but its implementation depends on Google's infrastructure, which limits its general applicability.
Compared with program-oriented RTS, web-service-oriented RTS approaches tend to select test cases based on change analysis of artifacts such as specifications and models. When such RTS approaches are applied in microservice regression testing, artifact collection, consistency checking, information extraction and modeling are necessary, but they require substantial effort since microservice systems are usually developed by multiple teams using different techniques. In contrast, MRTS-BP replaces artifact processing with the extraction of service dependencies from API gateway logs based on frequent pattern mining [33].

Frequent pattern mining
The general process of frequent pattern mining is: given a frequency threshold, when the frequency of an item set in a transaction set exceeds the threshold, the item set is considered a frequent item set, which can be used to generate association rules [33]. The frequency of an item set in the transaction set is called its "support", computed as the ratio of the number of transactions containing the item set to the size of the transaction set. Frequent item sets with k elements are called k-frequent item sets. Every non-empty subset of a frequent item set must itself be a frequent item set; therefore, the support of a frequent item set must be less than or equal to that of each of its non-empty subsets.
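The support computation described above can be sketched in a few lines (an illustrative example with invented items, not code from the paper):

```python
# Minimal sketch of support computation over a transaction set (illustrative
# names; each transaction is a set of items).

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(support({"a"}, transactions))        # 0.75
print(support({"a", "b"}, transactions))   # 0.5
# Anti-monotonicity: a superset's support never exceeds a subset's support.
assert support({"a", "b"}, transactions) <= support({"a"}, transactions)
```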
There are many types of frequent pattern mining algorithms, such as candidate-set-based algorithms, tree-based algorithms and recursive-suffix-based algorithms [33], which are customized to meet the practical requirements of specific mining problems. In our approach, the mining problem is to extract service dependencies as a basis for change impact analysis.

Belief propagation
The belief propagation (BP) algorithm is an iterative procedure for approximate inference over graph structures. BP has numerous applications, including the forward propagation algorithm, the Viterbi algorithm, and decoding algorithms for low-density parity-check (LDPC) and turbo codes, which are used in different scenarios [15]. In general, the BP algorithm proceeds as follows:
(1) Initialization: set the initial value of each node.
(2) Propagation: update all message values and node confidence values.
(3) Convergence check: determine whether the node confidence values have converged. If so, obtain inference results from the confidence values; otherwise, jump back to step (2) and propagate iteratively.
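The three steps above can be sketched as a generic loop. This is a schematic illustration only: `update` and `belief` are placeholder callbacks for a concrete message-update rule and belief computation, and the names are our own.

```python
# Schematic sketch of the generic BP loop: (1) initialization, (2) message
# propagation, (3) convergence check on successive belief values.
# `update` and `belief` are placeholders supplied by the caller.

def belief_propagation(nodes, edges, update, belief, eps=1e-9, max_rounds=100):
    messages = {e: 0.0 for e in edges}                          # (1) initialization
    beliefs = {n: belief(n, messages, edges) for n in nodes}
    for _ in range(max_rounds):
        messages = {e: update(e, beliefs) for e in edges}       # (2) propagation
        new = {n: belief(n, messages, edges) for n in nodes}
        if all(abs(new[n] - beliefs[n]) < eps for n in nodes):  # (3) convergence
            return new
        beliefs = new
    return beliefs
```

A caller instantiates `update`/`belief` for the inference problem at hand; the loop itself is independent of the message semantics.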
In recent years, studies on the BP algorithm cover both application and optimization. On the application side, scholars predominantly focus on communication coding and signal processing. To reduce the complexity of sparse code multiple access, a dynamic edge selection procedure based on BP has been introduced, detecting range boundaries of nodes through iterative calculation [34]. A nonlinear equalization method uses a neural network in which BP is applied to remove signal noise [35], and for massive multiple-input multiple-output channel detection, BP has been implemented purely on deep neural networks [36]. On the optimization side, researchers primarily focus on implementation and convergence conditions. The computational process of LDPC decoding has been restructured to aid parallelization and to merge memory accesses [37]. The convergence problem of BP has been studied with the numerical polynomial homotopy continuation method, revealing the influence of graph structures, with parameters of graph models solved through fixed points [38].
The literature shows that the BP algorithm has not yet been applied to microservice regression testing selection. Our work aims to analyze change impacts based on service dependencies mined from API gateway logs. When service dependencies are transformed into a directed graph, impact analysis can be translated into impact propagation from some nodes to others, which can be addressed by BP-like methods.

Methodology
MRTS-BP addresses the problem of Definition 2.2 for microservice systems. The problem is tackled in three steps: service dependency mining, change impact analysis and test case selection, as displayed in Fig. 1. The inputs primarily comprise API gateway logs and the original test path set, while the output is the selected test path set to be re-tested.

Service dependency mining
Microservice systems provide user-accessible functions through service cooperation, which leads to data exchanges and service invocations, called service dependencies [4]. API gateway logs record requests among services, from which one can recover the business flows and data flows triggered by users. Thus, our approach mines service dependencies from API gateway logs to generate a service dependency matrix (SDM). This step mainly contains two activities: data preprocessing and service dependency matrix generation. The former establishes user request chains from logs and generates a transaction set, while the latter generates an SDM from the transaction set.

Data preprocessing
To facilitate log mining, raw data should be preprocessed to remove irrelevant items and to form into structured data [39]. API gateway logs may contain requester address, service name, service address assigned by load balancer, status code and so on, though its concrete structure varies from system to system. The first step in data preprocessing is to remove irrelevant items such as self-checking records from API gateway logs. Only required data fields such as requester address, service name and service address will be retained to form a structured data set.
Next, with the cleaned data, our approach takes user requests as starting points to search for associated service invocations, and then assembles them into service invocation chains, each representing a user session; this is called user session extraction. The extraction is implemented based on another key component of microservice systems, "service chain monitoring", which collects, analyzes and displays service invocations while microservice systems run, supporting fault diagnosis and performance optimization (e.g., Twitter's OpenZipkin, Dianping.com's CAT, and Naver Pinpoint). Service invocations associated with the same user request share the same tracing ID; therefore, a list of service invocations representing a user session can be formed by checking the consistency of tracing IDs. Then, a transaction set for mining is generated from the service invocation chains. Considering that service invocations directly represent dependencies between services, our approach takes an invocation as an item and a service invocation chain as a transaction. Formally, let S = {s_j | 0 < j ≤ n} represent the service set of a microservice system and n denote the number of services; then an invocation <s_i, s_j> between two services is defined as an item (Definition 3.1), and the set of distinct items in one service invocation chain as a transaction (Definition 3.2). According to Definitions 3.1 and 3.2, transaction set generation can be implemented by traversing the service invocation chains once, as follows: (1) initialize the transaction set as an empty set; (2) traverse each invocation of each service invocation chain, denoting each invocation as an item; after removing duplicate items, an item set is generated as a transaction and appended to the transaction set; (3) after traversing all service invocation chains, output the transaction set.
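The three-step traversal above can be sketched directly (our illustration; chain and service names are invented):

```python
# Sketch of transaction-set generation: each invocation is an item and each
# service invocation chain (one user session) becomes a transaction with
# duplicate items removed.

def build_transactions(invocation_chains):
    transactions = []                           # (1) initialize as empty
    for chain in invocation_chains:             # (2) traverse each chain,
        transactions.append(frozenset(chain))   #     deduplicating items
    return transactions                         # (3) output the transaction set

chains = [
    [("s1", "s2"), ("s2", "s3"), ("s1", "s2")],   # duplicate invocation dropped
    [("s1", "s2"), ("s3", "s2")],
]
for t in build_transactions(chains):
    print(sorted(t))
```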

Service dependency matrix generation
Service dependencies are mainly derived from two types of sources. (1) Request flows. Request flows are represented as invocation chains of services, which can be decomposed into invocations between services. If an invocation occurs frequently, it can be inferred that there may be a dependency between the corresponding two services, which we define as a "request dependency". Since an invocation is defined as an item in Definition 3.1, request dependencies are represented as 1-frequent item sets. (2) Data flows. On the one hand, data flows may occur directly with a single invocation between services, which also falls into the category of request dependency. On the other hand, data flows may occur indirectly through multiple invocations; two basic cases are shown in Fig. 2, in which no invocation exists between service 1 and service 3. In the left case, service 2 is invoked through Req12 and Req32 respectively. When Req12 changes some persistent data in service 2 and Req32 queries that data, service 3 indirectly exchanges the data with service 1. A typical example is data subscription with the data decoupling patterns proposed in [4], which involves a customer management service (service 1), a subscription management service (service 2) and a report service (service 3). The customer management service sends new customer data incrementally to the subscription management service, and the report service queries subscription data, including customer data, from the subscription management service. In this example, the report service does not directly interact with the customer management service, but the former relies on the latter indirectly through the customer data. When the customer management service changes (for example, the structure of the customer table is changed), such changes may also affect the report presentation of the report service, which needs to be verified in regression testing. Similarly, in the right case of Fig. 2, service 2 invokes service 1 and service 3 through Req21 and Req23 respectively. When parameters of Req23 include data returned by Req21, service 3 may indirectly rely on service 1 through that data. A service dependency generated by indirect data exchange is defined as a "data dependency" in our approach. From the perspective of frequent patterns, data dependencies are expressed as 2-frequent item sets in the transaction set in which the requesters or responders of the two items are the same. From the analysis above, request flows may lead to request dependencies, while data flows may lead to both request dependencies and data dependencies. To measure the possibility of a service dependency quantitatively, the confidence value is defined as follows:

Definition 3.3 (Confidence of service dependencies). Given a frequency threshold c, let F_1 and F_2 respectively represent the set of 1-frequent item sets and the set of 2-frequent item sets in transaction set D, and let count(I) represent the number of transactions containing item set I in D.
Then the confidence value of service s_i depending on s_j is given by the equations of Formula 1. Formula 1 divides the confidence of a service dependency into four situations: the first equation computes the confidence between two services in the same invocation, corresponding to a request dependency, measured by the support of the 1-frequent item set; the second and third equations compute the confidence between two services in different invocations belonging to the same data flow, corresponding to data dependencies, measured by the product of the support of the 2-frequent item set and the confidence of the association rule from the corresponding tuples [33] (the second equation corresponds to the data dependency in the left case of Fig. 2, and the third to the right case). In all other situations, it is considered that there is no dependency between the services, and the confidence is defined as 0. For example, in Fig. 2, supposing that Req12 appears 500 times in logs and the log size is 2000, given c = 0.2, then <s_1, s_2> ∈ F_1 and conf(s_1, s_2) is 0.25 according to the first equation of Formula 1. Supposing that Req13 does not appear anywhere and Req12 and Req32 appear together in the same transaction 400 times, then {<s_1, s_2>, <s_3, s_2>} ∈ F_2, and conf(s_3, s_1) is computed by the second equation of Formula 1. Note that the support of a 2-frequent item set is less than or equal to the support of the corresponding 1-frequent item sets, and that this method does not consider service dependencies whose confidence is less than c. The service dependency matrix is then defined as follows:

Definition 3.4 (Service Dependency Matrix, SDM). Given a service set S = {s_i | 0 < i ≤ n}, the SDM is an n-order square matrix whose element a_ij in row i and column j is defined by Formula 2. Based on Definitions 3.3 and 3.4, an algorithm for generating the SDM from D is proposed as Algorithm 1.
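The running numbers above can be checked directly. The following small calculation (our illustration) applies the first equation of Formula 1 to the request dependency <s_1, s_2>:

```python
# Worked check of the first equation of Formula 1 with the running example:
# Req12 (<s1, s2>) appears in 500 of 2000 logged transactions, so its support,
# and hence conf(s1, s2) for a request dependency, is 0.25.

log_size = 2000
count_req12 = 500
c = 0.2                       # frequency threshold

support_s1_s2 = count_req12 / log_size
assert support_s1_s2 >= c     # <s1, s2> is therefore a 1-frequent item set
conf_s1_s2 = support_s1_s2    # request dependency: confidence equals support
print(conf_s1_s2)             # 0.25
```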
First, the algorithm constructs the 1-frequent item sets, generates candidate sets by Cartesian product, and removes infrequent item sets using c to obtain the 2-frequent item sets (lines 1 to 4). Second, the SDM is initialized as an n-order zero square matrix (line 5); based on the first equations of Formulas 1 and 2, the element of the SDM corresponding to each 1-frequent item is set to its support value (lines 6 to 8). Third, by traversing the 2-frequent item sets, when the tails of two invocations are the same, the corresponding element of the SDM is updated to the maximum of its current value and the result computed with the second equation of Formula 1 (lines 10 to 18); when the heads of two invocations are the same, the third equation of Formula 1 is adopted (lines 19 to 28).

Algorithm 1 Service dependency matrix generation algorithm
For example, an SDM of 9 services generated from logs is shown in Table 1. As Table 1 shows, elements on the diagonal of the SDM are all 0, and elements above 0 indicate dependencies whose confidence exceeds the given threshold. According to Algorithm 1, the non-zero elements of the SDM are determined by F_1, F_2 and c, and therefore by the scale of the logs and the frequency threshold. In theory, the larger the scale of the logs, the more dependencies among services are covered and the more non-zero elements the SDM has; given fixed logs, the smaller the frequency threshold, the more non-zero elements are generated. More non-zero elements indicate a more complete SDM, which can affect the safety of test case selection; this is discussed further in the Empirical study section.
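The construction described for Algorithm 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's exact algorithm: only request dependencies and the shared-responder data-dependency case (the second equation of Formula 1) are shown, the association-rule confidence is taken as support(pair)/support(first item), and the chosen dependency direction is our reading of the left case of Fig. 2.

```python
# Simplified sketch of Algorithm 1 (illustrative). Assumptions: transactions
# are sets of invocation tuples; only request dependencies and the
# shared-responder data-dependency case are handled; the association-rule
# confidence and dependency direction follow our reading of Formula 1.
from itertools import combinations

def generate_sdm(transactions, services, c):
    n, m = len(services), len(transactions)
    idx = {s: i for i, s in enumerate(services)}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / m

    # 1-frequent item sets (single invocations) and pruned candidate 2-item sets
    f1 = {inv for t in transactions for inv in t if support({inv}) >= c}
    f2 = [set(p) for p in combinations(sorted(f1), 2) if support(set(p)) >= c]

    sdm = [[0.0] * n for _ in range(n)]
    for (si, sj) in f1:                       # request dependencies
        sdm[idx[si]][idx[sj]] = support({(si, sj)})
    for pair in f2:                           # shared-responder data dependency
        (a, b), (x, y) = sorted(pair)
        if b == y and a != x:                 # e.g. Req12 and Req32 both hit s2
            conf = support(pair) * (support(pair) / support({(a, b)}))
            sdm[idx[x]][idx[a]] = max(sdm[idx[x]][idx[a]], conf)
    return sdm
```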

Change impact analysis
Change impact analysis also includes two activities: directed graph generation and impact propagation computation. The former builds a directed graph model from the SDM, while the latter measures impacts with an impact propagation algorithm.

Directed graph generation
A directed graph is used to represent the impact propagation network of services: nodes represent services and directed edges represent propagation paths among services. Since each element of the SDM represents the confidence of the corresponding service dependency, the weight of each directed edge can be initialized from it. Based on Definition 3.4, the directed graph is defined as follows:

Definition 3.5 (Directed Graph, DG). A directed graph for impact propagation is a tuple DG = (N, E), where the node set N = S, the edge set E = {e_ij | a_ji ∈ SDM ∧ a_ji > 0}, e_ij represents a directed edge from s_i to s_j, and w(e_ij) = a_ji represents the weight of e_ij.
Based on this definition, an algorithm for generating a directed graph from the SDM is shown in Algorithm 2. The node set of the DG is established from the service set (line 1). Then all elements of the SDM are traversed (lines 2 to 12), and directed edges are established between nodes whose confidence is greater than 0; the direction of each directed edge is opposite to the direction of the corresponding service dependency (lines 5 to 8). For example, the DG corresponding to Table 1 is shown in Fig. 3. It can be seen that the directed edges of the DG are inverted with respect to the non-zero elements of the SDM.
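The edge inversion of Definition 3.5 can be sketched compactly (our illustration, assuming services indexed 0..n-1):

```python
# Sketch of Algorithm 2 (illustrative): each non-zero element a_ji of the SDM
# yields a directed edge e_ij -- inverted with respect to the dependency
# direction -- weighted by the confidence value.

def build_dg(sdm):
    n = len(sdm)
    edges = {}                              # (i, j) -> weight w(e_ij)
    for j in range(n):
        for i in range(n):
            if sdm[j][i] > 0:               # s_j depends on s_i ...
                edges[(i, j)] = sdm[j][i]   # ... so impact flows s_i -> s_j
    return set(range(n)), edges

# s_0 depends on s_1 (conf 0.5); s_2 depends on s_0 (conf 0.3)
sdm = [[0.0, 0.5, 0.0],
       [0.0, 0.0, 0.0],
       [0.3, 0.0, 0.0]]
nodes, edges = build_dg(sdm)
print(edges)   # {(1, 0): 0.5, (0, 2): 0.3}
```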

Impact propagation calculation
Given a service set S and its modified version S′, it is easy to obtain the list of changed services from service registries [4], denoted ∆S. Since services affect each other through service dependencies when changes occur, change impact analysis over the DG can be translated into a quantitative assessment of impact propagation from the nodes in ∆S to other nodes, which can be addressed by BP-like methods. The difference is that messages in belief propagation are used to calculate the probability distributions of nodes, while messages in impact propagation are used to calculate the probability of nodes being affected by changes.
Following the framework of the BP algorithm, an algorithm for node updating and message propagation is proposed below, and the convergence of its iterative calculation process is analyzed.
Node updating. Since there is no constraint on the sum of change impacts over all service nodes, the standard BP algorithm cannot be applied directly. Our approach defines an "impact degree" to measure the change impact on each service node. The impact degree of nodes in ∆S is defined as 1, the upper limit, and the impact degree of nodes not affected by changes is 0, the lower limit. When a node is the end node of a directed edge, its impact degree may be updated by the message passed along that edge: the updated value is the maximum of all messages sent to the node and its current impact degree. The formal definition is as follows:

Definition 3.6 (Impact Degree)
Given a directed graph DG as in Definition 3.5, let N_i represent the neighborhood of s_i, m_ji represent the message from s_j to s_i, and t represent the iteration round of message propagation. The impact degree p_t(s_i) ∈ [0,1] of s_i is recursively calculated as p_{t+1}(s_i) = max( p_t(s_i), max_{s_j ∈ N_i} m_ji ), with p_0(s_i) = 1 if s_i ∈ ∆S and 0 otherwise. Apparently, the impact degree of a node after message propagation is not less than that before message propagation, that is, p_{t+1}(s_i) ≥ p_t(s_i).
Message propagation. In a directed graph, the impact degree of s_i can be propagated to s_j through directed edge e_ij, scaled by the weight w(e_ij), so the message is defined as follows:

Definition 3.7 (Message)
Given a directed graph DG as in Definition 3.5, when e_ij ∈ DG.E, the message m_ij propagated from s_i to s_j is m_ij = w(e_ij) · p_t(s_i). Apparently, since w(e_ij) < 1, the message value is always less than the current impact degree of the sender node, that is, m_ij < p_t(s_i).
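Taken together, Definitions 3.6 and 3.7 yield a simple fixed-point iteration. The sketch below is our reconstruction of that computation (the exact formulas are inferred from the surrounding text): a message scales the sender's impact degree by the edge weight, and a node keeps the maximum of its current degree and all incoming messages.

```python
# Reconstruction sketch of impact propagation (Definitions 3.6 / 3.7, as we
# read them): m_ij = w(e_ij) * p_t(s_i); p_{t+1}(s_i) is the max of the
# current degree and all incoming messages. Monotone and bounded, hence
# convergent in finitely many rounds.

def propagate(nodes, edges, changed, max_rounds=1000):
    p = {n: 1.0 if n in changed else 0.0 for n in nodes}
    for _ in range(max_rounds):
        msgs = {(i, j): w * p[i] for (i, j), w in edges.items()}
        new = {n: max([p[n]] + [m for (i, j), m in msgs.items() if j == n])
               for n in nodes}
        if new == p:        # fixed point reached
            return new
        p = new
    return p

# Changed service 0 affects 1 (weight 0.5), which affects 2 (weight 0.8):
edges = {(0, 1): 0.5, (1, 2): 0.8}
print(propagate({0, 1, 2}, edges, changed={0}))
```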
Convergence analysis. Definitions 3.6 and 3.7 describe the iterative computing process of change impact propagation. When there are no loops in the DG, i.e., it is a directed acyclic graph, the propagation rounds of each node do not exceed the number of edges in the longest path of the DG, so the calculation process must converge. When there are loops in the DG, the number of updating rounds is uncertain; two situations can be distinguished, depending on whether the loop contains changed services: (i) When a node s ∈ ∆S is in a loop, then p_0(s) = 1. Because p_{t+1}(s) ≥ p_t(s) and p_t(s) ≤ 1, we have p_t(s) = p_0(s) = 1, that is, the impact degree of s will never be updated by message propagation. The directed edges ending at s play no role in the computation and can be considered interrupted; the loop can thus be disconnected and transformed into a directed acyclic graph, so the computation converges, as shown schematically in Fig. 4. (ii) When no node of the loop is in ∆S, the calculation proceeds in three phases. ➀ Computing all input messages into the loop; at the same time, the impact degrees of nodes in the loop are assigned with Formula 4. ➁ Finding the disconnection point: through numerical comparison, the node s_m with the maximal impact degree is obtained, denoted p_t(s_m). Since messages propagated in the loop satisfy m_ij < p_t(s_i), the maximal message value in the loop is less than p_t(s_m); therefore, no matter how many rounds messages propagate in the loop, p_t(s_m) does not change. That is, the loop can be disconnected at the directed edge ending at s_m, which means the calculation process converges. ➂ Computing all output messages: according to the results of phase ➁, the output message of each node in the loop is calculated directly with Formula 5.
The schematic diagram of the calculation process in case (ii) is shown in Fig. 5.
To summarize, the computing process of impact degrees determined by Definitions 3.6 and 3.7 is convergent; that is, the impact degree of each node converges to a stable value within a finite number of iteration rounds. The results of impact propagation computing can be stored in a dictionary structure called the change impact table (CIT) for convenient access in our approach. The pseudocode for generating a CIT from a DG is shown in Algorithm 3.
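The iterative process above can be sketched in a few lines of Python. This is a minimal sketch, not Algorithm 3 itself: the function name `build_cit`, the `edges` dictionary shape, and the max-based update rule are assumptions chosen to be consistent with the properties stated in the text (p_{t+1}(s) ≥ p_t(s), p_t(s) ≤ 1, and m_ij = w(e_ij) · p_t(s_i) < p_t(s_i), which guarantees convergence even in loops).

```python
# Hedged sketch of change-impact propagation over a service dependency graph.
# `edges` maps (sender, receiver) -> weight w(e_ij) < 1; the max-based update
# is an assumption consistent with the monotonicity claims in the text.

def build_cit(nodes, edges, changed_services, max_rounds=100, eps=1e-9):
    """Return a change impact table (CIT): service -> converged impact degree."""
    # Changed services start with impact degree 1; all others with 0.
    p = {s: (1.0 if s in changed_services else 0.0) for s in nodes}
    for _ in range(max_rounds):
        updated = False
        for (si, sj), w in edges.items():
            m = w * p[si]            # message m_ij = w(e_ij) * p_t(s_i)
            if m > p[sj] + eps:      # impact degrees never decrease
                p[sj] = m
                updated = True
        if not updated:              # no message changed any node: converged
            break
    # Keep only affected services, mirroring the CIT dictionary structure.
    return {s: d for s, d in p.items() if d > 0.0}
```

Because every edge weight is below 1, messages decay around any loop, so the fixed point is reached in finitely many rounds even for cyclic graphs, matching the convergence argument above.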

Test case selection
This step selects a subset T_s from the test path set T based on a CIT. For any test path tp ∈ T, a service set S_tp can be derived from the invocations that make up tp by means of service registries. By querying a CIT, we can obtain the impact degree of any service s ∈ S_tp, denoted as CIT(s). Among the above strategies, the existent satisfaction strategy is the most relaxed, and the scale of the selected test path set is the largest. In particular, when p is the minimum non-zero value in the CIT, any test path containing a service affected by changes is selected. The complete satisfaction strategy is the strictest, and the scale of the selected test path set is the smallest. The k-existent satisfaction strategy, T_s = {tp | ∃k s ∈ S_tp, CIT(s) ≥ p} (8), lies between the two and can be used to adjust the scale of the selected test path set as needed.

Fig. 4 A node of the loop is a changed service
Corresponding test case selection algorithms can be proposed. The pseudocode of the existent satisfaction strategy is shown in Algorithm 4, in which the original test path set is traversed once (lines 2 to 9). By querying a CIT, whether the current test path is selected into T_s (lines 3 to 8) can be determined. The pseudocode of the other two strategies is similar and will not be described further.
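The three strategies can be sketched as simple filters over the test path set. This is an illustrative sketch, not the paper's Algorithm 4: the CIT is assumed to be a dictionary (services absent from it have impact degree 0) and each test path is represented by its service set S_tp.

```python
# Hedged sketches of the three selection strategies over a CIT dictionary.

def select_existent(test_paths, cit, p):
    # Existent satisfaction: select tp if ANY service in S_tp has CIT(s) >= p.
    return [tp for tp, services in test_paths.items()
            if any(cit.get(s, 0.0) >= p for s in services)]

def select_complete(test_paths, cit, p):
    # Complete satisfaction: select tp only if ALL services have CIT(s) >= p.
    return [tp for tp, services in test_paths.items()
            if all(cit.get(s, 0.0) >= p for s in services)]

def select_k_existent(test_paths, cit, p, k):
    # k-existent satisfaction: select tp if at least k services have CIT(s) >= p.
    return [tp for tp, services in test_paths.items()
            if sum(1 for s in services if cit.get(s, 0.0) >= p) >= k]
```

For the same CIT and threshold p, the existent strategy selects a superset of the k-existent selection, which in turn contains the complete selection, matching the ordering of strictness described above.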

Empirical study
In order to evaluate MRTS-BP and analyze the influence of process parameters on test selection, we implemented the whole process based on Python 3.5 and collected testing data of four microservice systems for experimental analysis. Our empirical study investigates four research questions as follows: RQ1 Whether MRTS-BP is safe or not, and how the value of the frequent threshold c affects its safety.
An RTS technique is safe if it contains all test cases revealing faults in regression testing [5]. Safety determines the availability of RTS techniques. In MRTS-BP, the frequent threshold c is directly related to the number of mined frequent patterns, and thus affects the network structure of the directed graph, which has a great influence on the results of change impact propagation. Therefore, the number of test cases selected by MRTS-BP is related to the value of c. It is necessary to analyze the relationship between c and the safety of MRTS-BP, and to find out the range of c that can ensure safety. Theoretically, MRTS-BP does not rely on artifacts such as specifications, design models and code files. It is completely decoupled from the techniques used to construct the systems under test; that is, the scalability of MRTS-BP is obviously better than that of artifact-based RTS approaches. In order to make a more comprehensive comparison, an RTS approach based on control flow analysis (RTS-CFG) [16] is chosen for comparison with MRTS-BP to reveal the practicability of the two in microservice regression testing. RQ4 How to choose selection strategies of MRTS-BP to optimize time cost.

RQ2
Existent satisfaction strategy, complete satisfaction strategy and k-existent satisfaction strategy are proposed in our approach to meet different testing requirements. To analyze the influence of selection strategies on the efficiency of MRTS-BP, experiments are needed to clarify how the selection strategies affect the number of test cases selected, the safety and the precision of MRTS-BP, which will be helpful to select appropriate strategies in practice.

Case introduction
The following four microservice systems are adopted in our empirical study: (1) m-Ticket: a multi-end ticket system based on SpringBlade (an open source microservice framework available at https://github.com/chillzhuang/SpringBlade), which provides ticket services in various fields such as transportation, accommodation, tourist attractions and movies, supporting service management, monitoring and tracing. (2) z-Shop: a mobile-oriented mall system based on Zheng (an open source microservice framework available at https://github.com/shuzheng/zheng), which provides one-stop management services for goods, stores, content promotion, orders, logistics, etc. (3) Need: a knowledge system based on Spring Cloud, which provides data collection, auxiliary analysis, information extraction, knowledge graph construction, intelligent query and other knowledge graph management services. (4) JOA: an office automation system based on Spring Cloud, which provides comprehensive information display, document circulation, process approval, plan management, organization personnel management, contract management, fund management and material management services for organizations with multiple departments and secret levels. Table 2 shows the numbers of logs, services, versions, test cases and faults of all cases above. The number of faults is collected from the corresponding testing reports, where all faults found in testing are reported. Based on these data, a posteriori method is adopted to set up experiments; that is, the execution results of the test suites are known beforehand, and the main activities are to select test cases and to perform statistical analysis. As shown in Table 2, m-Ticket, z-Shop and Need are relatively small systems, while JOA has many more services. To compare with artifact-based approaches, we also collected design documents, logs and testing data.
Since JOA was developed by multiple teams that did not agree to grant access permission, we failed to collect the corresponding artifacts for JOA. Artifact-based RTS approaches therefore cannot be applied to JOA.

Evaluation metrics
According to common RTS evaluation metrics [11][12][13][14], and considering the problems in our experiments, five metrics are used as follows: (1) Testing time cost saving rate (ET) measures the extent to which RTS approaches reduce regression testing time cost. Let TO represent the total execution time cost of the original test suite, TR the execution time cost of the selected test suite, and TS the time cost of the selection process; then ET is given by Equation (9): ET = (TO − (TR + TS)) / TO × 100%. (2) Percentage reduction of the number of test cases (EN) measures the ability of RTS techniques to save testing cost solely in terms of reducing the number of test cases [12,13]. Let NO represent the number of original test cases and NR the number of selected test cases; then EN is given by Equation (10): EN = (NO − NR) / NO × 100%. (3) Recall (R) indicates the percentage of selected fault-revealing test cases relative to all fault-revealing test cases [11][12][13][14], and is used to measure the safety of RTS techniques. Let NOF represent the number of test cases revealing faults in the original test suite and NSF the number of test cases revealing faults in the selected test suite; then R is given by Equation (11): R = NSF / NOF × 100%. (4) Precision (P) indicates the accuracy with which test cases are selected to be rerun [11][12][13][14]. Since NR represents the number of selected test cases, P is given by Equation (12): P = NSF / NR × 100%. (5) F-measure (F) is a combination of both P and R [11][12][13][14], indicating the combination of safety and accuracy of RTS approaches; F is given by Equation (13): F = 2 × P × R / (P + R). It can be seen that the larger F is, the better the combination of safety and accuracy.
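The five metrics translate directly into code. This is a small sketch assuming the percentage-based formulas implied by the textual definitions; the function names are illustrative.

```python
# Hedged sketch of the five evaluation metrics (percentages for ET, EN, R, P).

def et(to, tr, ts):
    """Testing time cost saving rate: share of original time cost saved."""
    return (to - (tr + ts)) / to * 100.0

def en(no, nr):
    """Percentage reduction of the number of test cases."""
    return (no - nr) / no * 100.0

def recall(nsf, nof):
    """R: selected fault-revealing cases over all fault-revealing cases."""
    return nsf / nof * 100.0

def precision(nsf, nr):
    """P: fault-revealing cases among the selected test cases."""
    return nsf / nr * 100.0

def f_measure(p, r):
    """F: harmonic combination of precision P and recall R."""
    return 2 * p * r / (p + r)
```

For example, if a selection keeps 50 of 100 test cases and all 4 fault-revealing cases are among them, then EN = 50%, R = 100% (the selection is safe) and P = 8%.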

Experiments setup
According to the research questions above, four experiments are set up as follows: • Experiment 1: determine the safety of MRTS-BP and its relationship with the frequent threshold c. For each case, from version v (v ≥ 2), the following steps are carried out.
(1) Taking the logs of version v−1 as input, the global item set and transaction set are generated respectively. After the frequencies of all items are obtained, the minimum and maximum values are taken as the lower and upper bounds of the frequent threshold c. The range of c is then divided into ten equal parts, and the lower bound of each part is taken as a value of c, denoted c_i (i = 1, 2, …, 10). The ten values of c are used respectively to generate SDMs through transaction set mining. (2) Based on the ten SDMs generated in step (1), ten CITs are generated respectively. For each CIT, the minimum non-zero element in the CIT is taken as the selection threshold p, and test cases are selected with the existent satisfaction strategy. (3) NOF is counted from the testing report of the corresponding version. For each selected test suite, NSF and recall R are also counted.
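The derivation of the ten candidate thresholds in step (1) can be sketched as follows; the function name and input shape are illustrative assumptions.

```python
# Hedged sketch of step (1): derive the ten candidate values c_i from the
# observed item frequencies by splitting [min, max] into equal intervals.

def candidate_thresholds(item_frequencies, parts=10):
    """Divide [min_freq, max_freq] into `parts` equal intervals and return
    the lower bound of each interval as a candidate value c_i."""
    lo, hi = min(item_frequencies), max(item_frequencies)
    step = (hi - lo) / parts
    return [lo + i * step for i in range(parts)]
```

Each c_i is then used as the minimum support when mining the transaction set, producing one SDM (and hence one CIT) per threshold.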
• Experiment 2: compare the ability of MRTS-BP and RTS-CFG to save testing cost. For each case, from version v (v ≥ 2), the following steps are carried out.

Data and analysis
Data of Experiment 1 are shown in Table 3. Let R_i (i = 1, 2, …, 10) represent the value of R corresponding to c_i in step (1). It can be seen that R can reach 100% when the frequent threshold c is assigned an appropriate value in each version of each case. This is because the existent satisfaction strategy means that test cases including any service affected by changes will be selected. This indicates that MRTS-BP can ensure safety when c is set appropriately with the existent satisfaction strategy.
To show the trend of R with c intuitively, the line chart in Fig. 6 is drawn from the last version of each case. From Fig. 6, as c changes from its minimum to its maximum value, R gradually decreases. This is because when c becomes larger, fewer frequent patterns are mined, that is, fewer possible service dependencies are obtained, which leads to more zero elements in the SDM and further to fewer directed edges in the directed graph. Since the directed graph has fewer edges, fewer nodes are considered in change propagation computing, which may lead to fewer services being appended to the CIT. Then, fewer test paths are selected with the existent satisfaction strategy, which leads to a smaller value of R. Therefore, in order to ensure the safety of MRTS-BP, the value of c should be close to its lower bound.
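The mechanism behind this trend can be illustrated with a toy sketch: only candidate dependencies whose co-occurrence frequency reaches c survive as SDM entries, so raising c monotonically removes edges. The pair names and counts below are hypothetical.

```python
# Illustrative sketch: a larger frequent threshold c keeps fewer dependencies.

def edges_from_counts(pair_counts, c):
    """Keep only candidate dependencies whose frequency is at least c."""
    return {pair: n for pair, n in pair_counts.items() if n >= c}

# Hypothetical co-occurrence counts mined from gateway logs.
counts = {("gateway", "order"): 9, ("order", "stock"): 5, ("order", "pay"): 2}
```

With c = 3 two edges survive, with c = 6 only one does; fewer edges mean fewer nodes reachable by impact propagation, hence a smaller CIT and a lower R.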
Data of Experiment 2 are shown in Table 4. The EN of MRTS-BP ranges from 40% to 57% with a mean value of 50%; that is, the number of test cases is apparently reduced by applying MRTS-BP. Similarly, the EN of RTS-CFG ranges from 34% to 56% with a mean value of 48%. Data of Experiment 3 are shown in Table 5. From the values of R, MRTS-BP and RTS-CFG both ensure that all fault-revealing test cases are selected for each version of each case; that is, both are safe. To intuitively compare the P and F of the two techniques, line charts are drawn for each case in Fig. 8. From the line charts of cases m-Ticket, z-Shop and Need, the values and trends of P and F for MRTS-BP and RTS-CFG are almost the same. This is due to the similar abilities of the two approaches to reduce the number of test cases and cover the impact scopes of changes. Essentially, MRTS-BP and RTS-CFG both identify change impact scopes based on service dependencies. The difference lies in that the former adopts impact propagation calculation while the latter adopts edge analysis based on control flow models. This also indicates that a BP-like algorithm works in regression testing selection. On the other hand, since artifact-based RTS approaches are not adapted to cases in which artifacts are difficult to obtain, such as JOA, the scalability of MRTS-BP is obviously better than that of RTS-CFG in practice. Data of Experiment 4 are shown in Table 6, and corresponding bar charts are shown in Fig. 9. From the values of EN with different selection strategies in each case, it can be seen that the stricter the selection strategy is, the fewer test cases are selected, and the more testing cost is saved. However, selecting fewer test cases means that more test cases affected by changes are ignored, which can make MRTS-BP unsafe, as shown by the values of R in each case. That is, EN and R form a trade-off under different selection strategies, and one should choose strategies according to the actual case. From Fig. 9, the existent satisfaction strategy can always ensure the safety of MRTS-BP, but its EN and F are not the best, which means it suits cases requiring high safety. The complete satisfaction strategy saves more testing cost than the others, but it is far less safe, which means it can be applied in cases with a tight schedule. The efficiency of the k-existent satisfaction strategy is determined by the value of k. When k is larger, the number of selected test cases is significantly reduced, but the safety becomes worse, or even unacceptable. When k is smaller, MRTS-BP has better safety, but the testing cost reduction rate becomes lower. It is worth noting that when k is 2, R reaches 100% in three cases, and EN and F are better than those of the complete satisfaction strategy. This indicates that the 2-existent satisfaction strategy is applicable and may bring more efficiency.

Threats to validity
As with most empirical studies, there are some risks in directly applying the conclusions of the experiments above, and the threats to validity are mainly manifested in two aspects: (1) Case selection. Although our experiments consider factors such as size, complexity and domain when selecting cases, they may not cover all types of microservice systems. At the same time, the methods of test case generation, and whether fault records are comprehensive or not, may also affect the experimental results. The conclusions need to be validated with more cases from different areas, of different scales and with different data distribution characteristics. (2) Selection of comparison RTS techniques. The method used for comparison, RTS-CFG, comes from related work, which has its own validity risks that are introduced into our experiments. At the same time, though RTS-CFG is a typical RTS technique relying on artifacts, it does not represent all artifact-based RTS techniques, and our approach needs to be validated against more of them.

Conclusions and future work
This paper proposes a microservice regression testing selection approach, MRTS-BP, describes the whole process in detail, and verifies its effectiveness through experiments. MRTS-BP overcomes the challenges of artifact-based RTS approaches by processing API gateway logs instead of artifacts. For the acquisition issue, API gateway logs are automatically and centrally recorded without business-specific data, which avoids additional communication costs and security risks. For the processing issue, API gateway logs are structural and consistent, which allows MRTS-BP to process them automatically. For the maintenance issue, API gateway logs are clearly identified and accurately correspond to each version of the services. Thus, MRTS-BP can be fully automated and is applicable to microservice regression testing in practice.
In future work, two aspects deserve deeper investigation for improvement: the granularity of MRTS-BP and the mining of service dependencies from API gateway logs. In addition, we plan to apply MRTS-BP to systems in different fields and with different architectural patterns, such as service mesh, to collect more cases for empirical study.