A Reinforcement Learning-based Approach to Testing the GUI of Mobile Applications

With the popularity of mobile devices, the software market for mobile applications has been booming in recent years, and Android applications occupy a vast market share. However, applications inevitably contain defects, which may affect the user experience and even cause severe economic losses. This paper proposes ATAC and ATPPO, which apply reinforcement learning to Android GUI testing to mitigate the state explosion problem. The article designs a new reward function and a new state representation, and constructs two GUI testing models (ATAC and ATPPO) based on the A2C and PPO algorithms to save memory space and accelerate training. Empirical studies on twenty open-source applications from GitHub demonstrate that: (1) ATAC performs best in 16 of 20 apps in code coverage and detects more exceptions; (2) ATPPO achieves higher code coverage in 15 of 20 apps.


Introduction
With the rapid development of mobile computing and wireless network technology and the widespread use of mobile devices in the past few years, the scale of mobile applications has grown rapidly. To meet the demands of users and keep pace with the times, mobile applications (APPs) need to be developed and iterated constantly, and many applications may contain defects. Due to the variety of device and operating system versions, mobile apps often face cross-platform and cross-version compatibility problems. If an APP fails, it may lead to a poor user experience and even huge financial losses. For example, the China Merchants Securities and Huaxi Securities trading systems went down in 2022, causing economic losses for investors. In the same year, Chengdu's nucleic acid testing system collapsed, wasting medical resources and citizens' time. Therefore, how to test these applications to ensure their correct operation has become key to addressing compatibility and security at this stage.
There are many operating systems for mobile devices, such as Android, iOS, Harmony, BlackBerry, and Symbian. Among mobile devices, Android devices occupy a massive share of the market, and the number of Android apps keeps increasing. Therefore, this paper studies Graphical User Interface (GUI) testing of Android apps. Android apps are mainly written in Java and compiled into executable dex files. An APP is distributed as an apk file, which contains the dex files, code (if any), and other resources. An APP declares its main components in the AndroidManifest.xml. There are four main component types: activities, services, broadcast receivers, and content providers. An activity is the component responsible for the GUI. An activity corresponds to a UI interface, which includes many UI elements (such as menus, buttons, and images). Developers can control the behavior of activities by implementing callbacks for each lifecycle stage (for example, create, pause, and destroy). The activity responds whenever the user performs an interface action (such as a click), which makes activities the primary target of Android testing tools. The service component can run for a long time in the background. Unlike activities, services have no user interface, so they are not usually the direct target of Android testing tools, although they may be tested indirectly through some activities.
A GUI may contain many widgets, and GUI testing mainly performs functional testing on the application under test (AUT). GUI testing checks the behavior of the APP by interacting with the GUI (for example, clicking, long-clicking, scrolling, and typing strings). If the behavior of an APP deviates from what is expected, the APP contains a defect. However, as APPs develop and iterate continuously, their composition becomes more and more complex, and checking their function and behavior becomes increasingly difficult. Due to limited human resources and time pressure, Android GUI testing is expensive. New challenges arise when trying to replace human testing with automated tools, including the explosive growth of state combinations and the limitations of exploration. Automated testing of Android APPs therefore has great research potential.
GUI testing has attracted massive attention from researchers. Some researchers propose random testing, which generates random events on the GUI. One of the most famous random testing tools is Monkey [1], which Google provides for stability and stress testing. It generates random user events such as key presses and random inputs. However, random tools like Monkey may generate large amounts of invalid and redundant events, which makes them ineffective at exploring more states and detecting failures. Model-based strategies [2][3][4][5][6] build precise or abstract GUI models by static or dynamic methods to generate test cases. Nevertheless, these strategies suffer from two problems. One is the inherent state explosion. The other is that the effectiveness of the generated test cases depends on the completeness of the built model and the representation of the application's state.
In reinforcement learning (RL), an agent learns strategies to maximize returns or achieve specific goals by interacting with the environment. It is extensively used in Android GUI testing. Unlike supervised learning, which requires labeled datasets, RL can learn automatically by interacting with the environment. Though RL has been applied to GUI testing, most approaches are implemented using Q-learning. Q-learning uses a table to record the expected values of actions, called action values, in specific states. The Q-table occupies a lot of memory, so some researchers have considered replacing tabular methods with Deep Neural Networks (DNNs). The agent utilizes a DNN to learn the action-value function automatically from past experience. Driven by a reward function, the agent purposefully explores the AUT, unlike random testing, which explores the AUT without purpose. The agent can predict the optimal action to perform even if the state has never been visited before, which effectively mitigates the state explosion problem and allows us to test GUIs effectively and efficiently. DeepGUIT [7] adopts a Deep Q Network (DQN) to represent the value function, and ARES [8] applies TD3, DDPG, and SAC to fit value functions. However, these methods utilize a replay buffer that records pairs of states and rewards. The replay buffer also occupies a lot of memory, even if less than a Q-table. In reinforcement learning, exploration is one of the most challenging issues. In a mobile application, a function is triggered by a specific operation sequence; the longer the sequence, the more challenging exploration becomes.
In this paper, we propose ATAC (Automatic Testing based on A2C) and ATPPO (Automatic Testing based on PPO), novel approaches based on deep reinforcement learning to test Android GUIs. ATAC applies the A2C algorithm to Android GUI testing to avoid using a replay buffer. The A2C algorithm introduces parallelism: it constructs multiple threads, each of which interacts with the environment independently. Because ATAC does not need a replay buffer, it saves memory space and consumes fewer resources. The neural network it constructs is much smaller than those of DeepGUIT and ARES, so it can work even if the agent is not equipped with a GPU. The PPO algorithm is an improvement of the A2C algorithm. It uses importance sampling so that data collected under a previous policy can be reused, which improves data utilization. ATPPO applies the PPO algorithm to Android GUI testing, likewise avoiding a replay buffer. To reduce interruption of execution sequences, we construct a Finite-State Machine (FSM) during the test process. If the strategy falls into a local optimum, the FSM provides higher-level guidance for the reinforcement learning algorithm's exploration. We evaluate both approaches on twenty apps in an Android 10 environment. Experimental results show that our approaches achieve higher coverage and detect more failures than the state-of-the-art test generation tool Monkey and the RL-based ARES.
To summarize, this paper has the following major contributions:
• We utilize executable GUI elements in XML files to represent states.
• We adopt a new reward function to avoid sparse rewards; the reward function in this paper mainly depends on the change of state and whether errors can be detected.
• We construct two GUI testing models (ATAC and ATPPO) based on the A2C and PPO algorithms, respectively.
• We put forward an exploration strategy based on FSM, which provides high-level guidance for further improving test efficiency.
• Empirical studies on 20 applications show that our approaches achieve significant improvements in code coverage and the number of defects detected.
The remainder of this paper is structured as follows: Section 2 introduces relevant basic concepts and definitions. Section 3 describes our proposed approach in detail. Section 4 presents the evaluation of our approach and discusses the results. Threats to validity are discussed in Section 5. Section 6 gives a general description and summary of related work. Finally, Section 7 concludes our work.

The premise of using model-based RL methods is that the current environment is known and well described. Therefore, model-free methods are usually used to deal with the problem. Existing model-free agents are presented in Section 2.2.

Markov decision process
RL is a branch of machine learning that tries to maximize the reward obtained in a complex and uncertain environment. It consists of an agent and an environment, and it trains the agent by having it constantly interact with the environment. The agent's purpose is to receive as much reward from the environment as possible through trial and error.
The Markov decision process (MDP) is one of the most fundamental theoretical models of reinforcement learning, and most problems can be regarded as or transformed into an MDP. An MDP is represented by a 4-tuple <S, A, P, R>:
• S: the set of all possible states;
• A: the set of possible actions that the agent can perform in all states;
• P (S × A × S → [0, 1]): the state transition function, giving the probability of transferring to some state after taking an action;
• R (S × A × S → R): the reward function, giving the reward received from the environment after taking an action that changes the current state.
Fig. 1 shows the Markov decision process. At a time step t, the agent observes the current state s_t ∈ S, then selects and performs an action from the action space A. A reward is obtained, the environment moves to a new state, and the agent repeats the above process until a terminal state or timeout, then restarts. The process can be represented by a trajectory s_0, a_0, r_1, s_1, a_1, r_2, s_2, ..., where a_t denotes the action performed in state s_t and r_{t+1} the reward received for it. The accumulated return from time step t with discount factor γ ∈ (0, 1] is R_t = Σ_{k=0}^{∞} γ^k r_{t+k}. The goal of the agent is to maximize the expected return from each state s_t. The action-value function Q^π(s, a) estimates the expected return when taking action a in state s and then following the current policy π; the state-value function V^π(s) evaluates the expected return from state s when following π. Eliminating one of the two by substitution yields the Bellman expectation equation. Similarly, the optimal state-value and action-value functions are related by V*(s) = max_a Q*(s, a), from which the Bellman optimality equation can be derived: Q*(s, a) = r(s, a) + γ Σ_{s'} P(s'|s, a) max_{a'} Q*(s', a'). We use the symbol * to indicate the optimal value, and r(s, a) denotes the reward when performing a in the state s.
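For concreteness, the accumulated return can be computed with a single backward pass over a finished trajectory. The following is a minimal illustrative sketch, not part of the paper's implementation:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{k>=0} gamma^k * r_{t+k} for every step t
    by walking the reward sequence backwards."""
    returns = []
    acc = 0.0
    for r in reversed(rewards):
        acc = r + gamma * acc   # R_t = r_t + gamma * R_{t+1}
        returns.append(acc)
    returns.reverse()
    return returns
```

For example, with rewards [1, 1, 1] and γ = 0.5 the returns are [1.75, 1.5, 1.0].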

Model-free agent
In theory, an optimal policy can be found through the Bellman expectation and optimality equations. However, the Bellman equation is difficult to solve directly. Therefore, we need other ways to estimate the optimal value function, such as policy iteration.
There are three types of model-free agents: the Value-Based agent, the Policy-Based agent, and the Actor-Critic agent, which combines the first two. Value-based model-free methods learn the value function so that a policy can be derived from it. Policy-based model-free methods directly parameterize the policy as π(a|s; θ) and update the parameter θ by performing gradient ascent on E[R_t]. The Actor-Critic agent learns both the policy and the value function.
A standard policy-based method [9] updates θ by ∇_θ log π(a_t|s_t; θ) R_t, an unbiased estimate of ∇_θ E[R_t]. A baseline b_t(s_t) is subtracted from the return to reduce the variance of the estimate, so the parameter of policy π is updated by ∇_θ log π(a_t|s_t; θ) (R_t − b_t(s_t)). The estimated value function can be regarded as the baseline b_t(s_t) to further reduce the variance, and R_t − b_t(s_t) can be considered the advantage of performing the action a_t in state s_t under the policy π. As mentioned above, R_t is an estimate of the action-value function Q^π(s_t, a_t), so the advantage can finally be calculated as A(s_t, a_t) = Q(s_t, a_t) − V(s_t). The methods [10, 11] introduced above can be viewed as an Actor-Critic architecture with the policy π as the actor and the baseline V(s_t) as the critic.
Advantage Actor-Critic (A2C) [12] is a typical Actor-Critic method and a synchronous, deterministic variant of Asynchronous Advantage Actor-Critic (A3C). It uses multiple workers to avoid the use of a replay buffer. Fig. 2 shows the mechanism of A2C.
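The synchronous update that distinguishes A2C can be sketched as follows: the global network waits for every worker's gradient, averages them, takes one step, and broadcasts the result back. This is an illustrative sketch only; the learning rate `lr` and the plain gradient-descent step are assumptions, not the authors' code:

```python
import numpy as np

def a2c_sync_update(global_params, worker_grads, lr=0.01):
    """One synchronous A2C iteration: average the gradients uploaded by
    all worker threads, update the global parameters once, and broadcast
    the new parameters back to every worker."""
    avg_grad = np.mean(np.stack(worker_grads), axis=0)
    new_params = global_params - lr * avg_grad                 # single global step
    worker_params = [new_params.copy() for _ in worker_grads]  # broadcast
    return new_params, worker_params
```

Because all workers update at the same moment, every worker always holds identical parameters, unlike A3C's asynchronous scheme.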
Most RL-based work [13][14][15][16] adopts Q-learning as the agent. Q-learning uses a table to record the expected values of actions, called action values, in specific states. Recently, some researchers have replaced tabular methods with Deep Neural Networks (DNNs): DeepGUIT represents the value function by a DQN, and ARES adopts TD3, DDPG, and SAC to fit value functions.

Approach
This paper proposes a method of Android GUI testing based on A2C and PPO. This section describes in detail how reinforcement learning is applied to GUI testing of APPs. The process of testing APPs can be modeled as a Markov decision process, and the A2C and PPO algorithms are used to generate test cases, respectively.
First, Appium extracts specific information about an APP installed on an Android device or emulator. Appium is an open-source automated testing tool for native, mobile, or hybrid APPs. Android devices communicate with Appium through the Android Debug Bridge (ADB). The GUI state composed of GUI elements exists in the form of an XML file, which includes information such as the package name, executable events (for example, whether a widget is clickable, long-clickable, or scrollable), boundary information, and resource-id. The XML file is then preprocessed: since it saves the page information, the information is extracted from the page and all the executable widgets are obtained. Afterward, the page's state (in a form that can be processed by the neural network) is expressed according to the widgets' information and passed to the agent. According to the current state, the agent makes a decision and outputs an action. The numeric action is converted into a specific operation, and the corresponding operation is executed. Once the action is performed, the reward module rewards the agent for the action taken in the current state, and the agent adjusts the parameters of the neural network according to the reward feedback.
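The preprocessing step can be sketched with Python's standard XML parser. The attribute names (`clickable`, `long-clickable`, `scrollable`, `resource-id`) follow the usual Android UI hierarchy dump exposed by Appium's page source; the sketch is illustrative rather than the paper's actual code:

```python
import xml.etree.ElementTree as ET

EXEC_ATTRS = ("clickable", "long-clickable", "scrollable")

def extract_executable_widgets(page_source):
    """Parse a page-source XML dump and keep only the widgets
    that expose at least one executable event."""
    root = ET.fromstring(page_source)
    widgets = []
    for node in root.iter():
        if any(node.get(k) == "true" for k in EXEC_ATTRS):
            widgets.append({
                "class": node.tag,
                "resource-id": node.get("resource-id", ""),
                **{k: node.get(k) == "true" for k in EXEC_ATTRS},
            })
    return widgets
```

The returned dictionaries carry exactly the flags needed for the state encoding described next.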
At a time step t, the environment obtains the current GUI of the Android device through Appium, extracts the state from it, and passes the observation to the agent. Then, the agent (the Actor network following policy π) predicts an action likely to achieve more accumulated reward. Once the action is performed, the agent obtains a reward as feedback from the environment and adjusts the actor and critic networks accordingly. The workflow is shown in Fig. 3.

Representation of States and Actions
In the reinforcement learning process, the current state needs to be abstracted into a form the neural network can understand, combining the parts that are beneficial to the training of the agent. The GUI of an Android application consists of GUI elements, and a GUI interface contains many widgets. Our method considers only the executable GUI elements in the interface to represent states, so that they can be passed to and processed by the neural network. We denote the abstract state by all executable widgets from the GUI of the application. Each widget is represented by a three-dimensional vector w_i, and each state is represented by s_t = (w_1, w_2, ..., w_n). The first dimension of w_i indicates whether the widget i is clickable, the second dimension indicates whether i is long-clickable, and the third dimension indicates whether i is scrollable. If widget i is clickable, long-clickable, or scrollable, the corresponding dimension is marked as 1; otherwise, it is marked as 0. For example, if widget i is clickable and long-clickable but not scrollable, we denote w_i as [1, 1, 0]. At time t, the agent observes a state s_t composed of n widgets, where n is the total number of executable widgets in the GUI interface.
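A sketch of this encoding (the helper names are our own; the widget dictionaries are assumed to carry boolean flags for the three executable attributes):

```python
def widget_vector(widget):
    """Encode one widget as the 3-dimensional vector
    [clickable, long-clickable, scrollable] with 0/1 entries."""
    return [int(bool(widget.get("clickable"))),
            int(bool(widget.get("long-clickable"))),
            int(bool(widget.get("scrollable")))]

def abstract_state(widgets):
    """A state s_t is the sequence (w_1, ..., w_n) over all n executable widgets."""
    return [widget_vector(w) for w in widgets]
```

For a clickable, long-clickable, non-scrollable widget this yields [1, 1, 0], matching the example above.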
Both system-level actions and typical actions are considered in our approach. The system-level actions include switching the internet connection state, screen rotation, and return. Many applications have no explicit return button and may get stuck in a stalemate state, which prevents them from exploring more space. ATAC adopts a similar action representation to ARES [8]. Each action is denoted by a 3-dimensional vector. The first dimension specifies which widget or system-level action will be operated. The second dimension is used when the widget expects an input: it indicates the index of a string in a predefined dictionary. The third dimension acts as a complement: it decides which action to perform when a widget is both clickable and long-clickable, and when a widget is scrollable, it specifies the scrolling direction.
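A hypothetical decoder for this 3-dimensional action vector. The operation names, the string dictionary, and the ordering of system actions below are illustrative assumptions, not the paper's exact encoding:

```python
INPUT_STRINGS = ["hello", "12345", "user@example.com"]        # illustrative dictionary
SYSTEM_ACTIONS = ["toggle_network", "rotate_screen", "back"]  # illustrative ordering

def decode_action(action, widgets):
    """Map a (widget_index, string_index, complement) vector to an operation."""
    idx, str_idx, extra = action
    if idx >= len(widgets):                    # indices past the widget list
        return ("system", SYSTEM_ACTIONS[(idx - len(widgets)) % len(SYSTEM_ACTIONS)])
    w = widgets[idx]
    if w.get("scrollable"):
        return ("scroll", idx, ("up", "down", "left", "right")[extra % 4])
    if w.get("clickable") and w.get("long-clickable"):
        op = "long_click" if extra % 2 else "click"   # complement disambiguates
    elif w.get("long-clickable"):
        op = "long_click"
    else:
        op = "click"
    if w.get("editable"):                       # widget expects text input
        return (op, idx, INPUT_STRINGS[str_idx % len(INPUT_STRINGS)])
    return (op, idx)
```

The point of the sketch is only how the three dimensions divide the work: the first selects a target, the second selects text, and the third resolves the remaining ambiguity.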

Reward Function
The design of the reward function is a crucial step: the agent adjusts its strategy based on the feedback (reward) it receives after performing a particular action in the current state (GUI interface). The value of the reward reflects which behaviors are encouraged and which are discouraged. A positive reward indicates that the current operation is encouraged, and a negative reward indicates that it is discouraged. The larger the reward, the more desirable the action performed in the current state.
The reward function needs to be designed with the ability to detect faults in mind. However, because failures are rare in GUI testing, rewarding only failures leads to sparse rewards. Intuitively, exploring more GUI space may lead to finding more bugs, so the design of the reward function also considers whether more GUI space can be explored. The reward function in this article depends mainly on the change of state and on whether an error can be detected. State change has two main aspects: whether a state can be reached that was not explored during previous testing, and whether an executable widget has never been executed in the current state. When the agent chooses to return, there are two aspects to consider. First, we want it to return as soon as the application enters a deadlock state. Second, we do not want frequent returns, which might interfere with the exploration of the application. According to these particularities of the reward, two kinds of reward functions are defined in this paper.
The reward function is defined for two cases: a) when the agent performs a return action, and b) when the agent performs other actions. The constants in the function satisfy conditions that reflect the considerations above.
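Since the exact formulas are not reproduced here, the following is only an illustrative sketch of such a two-case reward function. All numeric values are placeholders chosen to respect the ordering described above (errors rewarded most, new states next, stagnation penalized); they are not the paper's actual constants:

```python
# Placeholder constants: crash > new state > new widget > no progress.
R_CRASH, R_NEW_STATE, R_NEW_WIDGET, R_NO_PROGRESS = 10.0, 5.0, 2.0, -0.5
R_RETURN_FROM_DEADLOCK, R_NEEDLESS_RETURN = 1.0, -1.0

def reward(is_return, deadlocked, crash_detected, new_state, new_widget):
    """Two-case reward: (a) the agent performed a return action,
    (b) the agent performed any other action."""
    if is_return:                                   # case (a)
        return R_RETURN_FROM_DEADLOCK if deadlocked else R_NEEDLESS_RETURN
    if crash_detected:                              # case (b)
        return R_CRASH
    if new_state:
        return R_NEW_STATE
    if new_widget:
        return R_NEW_WIDGET
    return R_NO_PROGRESS
```

The structure mirrors the two aspects of state change and the two-sided treatment of returns discussed above.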

Advantage Actor-Critic (A2C) based Testing
A2C (Advantage Actor-Critic) is a synchronous variant of A3C (Asynchronous Advantage Actor-Critic). Both maintain a policy network and a value-function network, choose the same advantage function A(s_t, a_t) = r_{t+1} + γV(s_{t+1}) − V(s_t), and use multithreading to compute gradients. The difference between them is when the global policy and value function are updated: A2C is synchronous, and A3C is asynchronous. A2C uses multiple threads to avoid using an experience replay buffer. A2C constructs multiple threads that interact with the environment. In each iteration, the global network waits for the threads to complete their respective turns and updates itself with the gradients uploaded by the threads. The global network then sends the latest network parameters to all threads simultaneously. The detailed algorithm for each thread is shown in Algorithm 1.

Proximal Policy Optimization (PPO) based Testing
Algorithm 1 A2C-based testing (for each thread)
1: repeat
2:   Reset gradients: dθ_π ← 0, dθ_v ← 0
3:   Observe the GUI state s_t
4:   repeat
5:     Perform an action a_t and receive a reward r_t
6:     t ← t + 1
7:     Observe a new GUI state s_{t+1}
8:   until s_t is terminal or timeout
9:   for each step i of the episode, from last to first, do
10:    R ← r_i + γR
11:    Accumulate gradients: dθ_π ← dθ_π + ∇ log π(a_i|s_i; θ_π)(R − V(s_i; θ_v)), dθ_v ← dθ_v + ∂(R − V(s_i; θ_v))²/∂θ_v
12:   end for
13:   Synchronize θ_π, θ_v with the global network
14: until timeout
15: return dθ_π, dθ_v

With the continuous improvement of reinforcement learning algorithms, the PPO algorithm was proposed as an improvement of the A2C algorithm. This paper also uses the PPO algorithm as the training and decision-making algorithm. The PPO algorithm combines the multi-threading idea of the A2C algorithm with the trust-region idea that TRPO uses to constrain the actor. Both A2C and PPO are policy-based reinforcement learning methods; in essence, both are actor-critic methods, comprising a value function and a policy function.
Reinforcement learning can use on-policy or off-policy learning. In on-policy reinforcement learning, the agent interacts with the environment during learning, trains the model with data collected by the current strategy, and uses each piece of data only once. PPO instead reuses data in an off-policy manner: standard policy-based methods perform one gradient update per data sample, whereas PPO reuses the data for multiple minibatch updates. The main idea of PPO is that the updated policy should stay similar to the previous policy, so the scope of the parameter update is limited. Using importance sampling, the model at this stage can be updated with data collected under a different policy, by weighting the policy gradient with the ratio between the new and old policies.
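The limit on the parameter update is usually realized with a clipped importance-sampling ratio. The following is the standard clipped surrogate from the PPO literature, shown here as a numeric illustration rather than as the paper's implementation:

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate L = min(r * A, clip(r, 1-eps, 1+eps) * A),
    where r = pi_theta(a|s) / pi_theta_old(a|s) is the
    importance-sampling ratio between the new and old policies."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)
```

When the ratio drifts outside [1 − ε, 1 + ε], the clipped term removes the incentive to move further, which keeps the new policy close to the old one.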

Construction and Application of FSM
Algorithm 2 Proximal Policy Optimization
Require: initial policy θ
Ensure: up-to-date policy θ
1: for iteration = 1, 2, ... do
2:   for each parallel worker do
3:     Use the old policy π_{θ_old} to interact with the environment within time T
4:     Compute advantage estimates
5:   end for
6:   Optimize θ for K epochs with minibatch data
7:   θ_old ← θ
8: end for
9: return θ

When exploring with a reinforcement learning strategy, some operations need to be executed consecutively to trigger a function. For example, to log in, the user must first enter the user name, then enter the password, and finally click the confirm button. However, each step may be interrupted. The longer the sequence, the higher the probability of interruption, which makes it more challenging to achieve the desired transition, especially when facing a long path. An FSM M can be defined as a five-tuple (S, A, δ, s_0, F), in which S is a finite set of states, A is a set of actions, δ : S × A → S is a set of transitions, s_0 is the initial state, and F is a termination state from which no other state can be reached. During the testing process, the FSM is updated continuously.
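A minimal sketch of such an FSM, built incrementally during testing. The visit counter used below to pick a restart state is our own illustrative addition, anticipating the least-visited-state heuristic described next:

```python
class TestingFSM:
    """FSM M = (S, A, delta, s0, F): states, actions, transitions,
    an initial state; updated after every executed step."""
    def __init__(self, s0):
        self.s0 = s0
        self.states = {s0}
        self.delta = {}                 # (state, action) -> next state
        self.visits = {s0: 1}

    def record(self, state, action, next_state):
        """Register one observed transition and count the visit."""
        self.states.add(next_state)
        self.delta[(state, action)] = next_state
        self.visits[next_state] = self.visits.get(next_state, 0) + 1

    def least_visited(self):
        """Candidate restart state when exploration stalls."""
        return min(self.visits, key=self.visits.get)
```

Keeping only the transition map and visit counts is cheap, so the FSM can grow alongside the agent without the memory cost of a replay buffer.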
When the algorithm may have entered a locally optimal state, we need to find a path to reach a new state in order to resume exploration. The FSM contains essential information such as states, actions, and transitions, which can guide subsequent selection. We choose the least-visited state as the starting point to explore the Android APP again. We use the Floyd algorithm to identify the shortest path that reaches that state and then execute the corresponding operation sequence to guide the agent to the target state and perform the transitions.
The Floyd algorithm is based on greedy and dynamic programming ideas and is similar to the Dijkstra algorithm. The Dijkstra algorithm solves the single-source shortest path problem, while the Floyd algorithm finds the shortest paths between all pairs of nodes in a weighted graph. The Floyd algorithm maintains two matrices: the shortest-distance matrix D_{n×n} of the graph and the shortest-path matrix path_{n×n}. The element d_{i,j} of the shortest-distance matrix represents the distance from node i to node j. When using reinforcement learning to test an Android APP GUI, nodes represent the various states. At the initial stage of the test, every distance is positive infinity. Each time a state transition occurs, the values in the shortest-distance matrix may change, because the distance between two states is updated to the shortest known distance. If the distance between two nodes is shortened, the shortest path also changes. The algorithm updates the shortest-distance matrix and the path matrix according to the state transition equation.
Once the shortest-path matrix is solved, the target state can be reached easily. If the state s_i wants to transfer to the state s_j, it first moves to the state s_{path[i][j]}, and then from s_{path[i][j]} toward the target state s_j. Through this state transition process, the target state can be reached along the shortest path between states, and further exploration can be carried out from there.
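The two matrices and the path reconstruction can be sketched as follows. This is a standard Floyd-Warshall with a next-hop path matrix; the integer indices stand for abstract GUI states:

```python
INF = float("inf")

def floyd(dist):
    """All-pairs shortest paths. Returns the shortest-distance matrix D
    and the path matrix, where path[i][j] is the next state to visit
    when moving from state i toward state j."""
    n = len(dist)
    d = [row[:] for row in dist]
    path = [[j if i != j and d[i][j] < INF else None for j in range(n)]
            for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
                    path[i][j] = path[i][k]   # detour through k: go toward k first
    return d, path

def operation_sequence(path, i, j):
    """Recover the state sequence s_i -> ... -> s_j from the path matrix."""
    if i == j:
        return [i]
    if path[i][j] is None:
        return []                              # j unreachable from i
    seq = [i]
    while i != j:
        i = path[i][j]
        seq.append(i)
    return seq
```

Replaying the actions attached to each consecutive pair of states in the recovered sequence drives the APP back to the chosen restart state.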

Empirical Study
When testing an application, we want to detect exceptions, and in theory more anomalies can only be detected by exploring as many states as possible. In many studies [7, 8, 14, 16, 17], code coverage and the number of detected faults are used as evaluation indicators. Therefore, this section mainly investigates the code coverage and fault detection capabilities of the proposed methods. We evaluate the following metrics: instruction coverage, branch coverage, line coverage, method coverage, and the number of failures. ATAC and ATPPO are compared with the state-of-the-art tools Monkey and ARES. Our empirical study is designed to answer the following research questions:
RQ1: Do ATAC and ATPPO achieve higher instruction coverage, branch coverage, line coverage, and method coverage than state-of-the-art testing tools?
RQ2: Can ATAC and ATPPO reveal more failures than state-of-the-art testing tools?
RQ3: Can the introduction of FSM into the reinforcement learning framework prevent the algorithm from entering a locally optimal state and trigger more functions?

Applications under Test
Although our approach is black-box, in order to compare with other approaches we must obtain the source code to gather coverage information; the experiment uses the JaCoCo [18] plug-in to generate coverage information. The experiment selected 20 open-source F-Droid applications from GitHub for evaluation. The selected applications are also used by other researchers studying mobile application GUI testing, and we excluded many applications that were outdated or no longer operational. Some of the selected applications are still being modified and iterated. Details of the applications under test are shown in Table 1. To facilitate the calculation of coverage information, the apk packages and source code of the AUTs have been uploaded to GitHub.

Evaluation Setup
We compare the proposed methods with the tools Monkey and ARES. Monkey is a random test generation tool, and ARES is based on deep reinforcement learning. The specific experimental settings are introduced below.
We experiment on a computer with the macOS operating system, an M1 processor, and 16 GB of memory. Since the emulator is unstable, the applications in the experiment run on an actual Android device with Android 10 and 6 GB of memory. The framework of ATAC is based on ARES. It realizes the A2C algorithm mainly through OpenAI Gym [19]. Communication between the applications and the agent goes through Appium [20] and the Android Debug Bridge [21]. The Actor and Critic networks have the same structure: ATAC adopts two 2-layer neural networks with 64 neurons in each layer. ATAC and ATPPO choose ReLU as the activation function in both neural networks. Because the number of actions that can be performed changes from state to state, a Gaussian distribution is adopted as the output, representing the probability distribution over actions in the current state. The ATAC and ATPPO parameters are shown in Tables 2 and 3, respectively. The experiment also builds an FSM during the ARES and ATAC tests, respectively. When the experiment may be stuck in a locally optimal state, under the guidance of the FSM we find the least frequently visited state and the shortest transition path to it through the Floyd algorithm; by performing the actions along the shortest route, the agent reaches the target state and continues exploring. The algorithm is considered possibly stuck in a local optimum if no new state is found within 30 steps or the agent stays on the current page after repeating 30 steps.
The length of an episode is set to 250. The agent explores and tests the application within 5,000 time steps, with the time limited to one hour. The same settings are used for ARES. For Monkey, the throttle was set to 200, and each run lasted one hour; the tool was configured to ignore any crash, system timeout, and security exception until the timeout was reached. The running log of the application is analyzed to obtain crash information. Because reinforcement learning involves a certain randomness, we repeated all the experiments, including the comparative experiments, five times and report the average of the five runs as the final result.

RQ1: Do ATAC and ATPPO achieve higher code coverage than state-of-the-art testing tools?
To answer RQ1 and avoid the randomness of the results as much as possible, we repeat all the experiments, including the comparative experiments, five times and use the average of the five executions as the final result. Table 4 and Table 5 show the average coverage of ATAC and ATPPO. Inst represents instruction coverage, bran represents branch coverage, line represents line coverage, and meth represents method coverage. Table 4 compares the coverage of ATAC with that of ARES and Monkey. It can be seen that ATAC covers more instructions, branches, lines, and methods on the 20 APPs. The average instruction coverage of ATAC (50.8%) is higher than Monkey's (35.0%) and ARES's (47.5%); the average branch coverage (36.5%) is higher than Monkey's (21.3%) and ARES's (33.1%); the average line coverage (51.2%) is higher than Monkey's (34.1%) and ARES's (47.9%); and the average method coverage (55.4%) is higher than Monkey's (40.3%) and ARES's (52.1%).
Regarding average instruction coverage and branch coverage, ATAC performs best in 16 of the 20 APPs, ARES performs best in 4, and Monkey performs poorly. In terms of average line coverage, Monkey again has no outstanding performance; ARES performs best on 4 APPs, and ATAC on 16. Regarding average method coverage, ATAC is the best on most apps, ARES is the best on three apps, and Monkey is the best on only two. By applying the A2C algorithm, ATAC achieves higher instruction, branch, line, and method coverage than the state-of-the-art tools Monkey and ARES.
The code coverage results of ATPPO, ARES, and Monkey are shown in Table 5 (Fig. 4 plots the code coverage achieved by Monkey, ARES, and ATAC). The average instruction coverage of ATPPO (50.7%) is higher than Monkey (35.0%) and ARES (47.5%); the average branch coverage (35.9%) is higher than Monkey (21.3%) and ARES (33.1%); the average line coverage (51.1%) is higher than Monkey (34.1%) and ARES (47.9%); and the average method coverage (55.1%) is higher than Monkey (40.3%) and ARES (52.1%). ATPPO achieves higher instruction, branch, line, and method coverage in 15 of the 20 apps.
ATPPO performed best on 15 of the 20 applications and ARES on 5, while Monkey performed poorly, being best only for method coverage on QuickSettings. Monkey does not perform well on the apps in this data set running on Android 10.
In short, ATPPO achieves higher instruction, branch, line, and method coverage than the state-of-the-art tools Monkey and ARES. Both ATAC and ATPPO outperform ARES and Monkey in code coverage, and ATAC performs better than ATPPO in both the number of apps on which it is best and the coverage achieved.

RQ2: Can ATAC and ATPPO reveal more failures than state-of-the-art testing tools?
We ran five explorations on each APP, so the same exception may be found multiple times across iterations. ATAC finds exceptions in 8 of the 20 apps, ATPPO in 7, ARES in 4, and Monkey in 2. Since no exceptions are found in the other 11 apps, this section focuses only on the nine apps in which our methods, ARES, or Monkey found exceptions. ARES and Monkey cannot find the exceptions that ATAC and ATPPO find in Jamendo, AmazeFileManager, AnyMemo, AntennaPod, and Materialistic. Table 6 records the number of the five experiments in which failures are found. ATAC and ATPPO have the best performance and can find exceptions on more apps. All the methods detect some exceptions in all five experiments on zooborns and BatteryDog. Table 6 and Table 7 show that ATAC and ATPPO reveal more exceptions, with ATPPO finding the most. Because many previously discovered exceptions have been fixed and the source code of most apps is still being iterated and modified, the number of exceptions that can be found decreases with each release. The exceptions found by ATAC and ATPPO include RuntimeException, NullPointerException, and NumberFormatException; Table 8 describes them in detail. Compared to ARES and Monkey, ATAC performs best on 16/20 apps and ATPPO on 15/20 apps in code coverage, and both find more exceptions, with ATPPO finding the most types. The experiments show that ATAC and ATPPO achieve higher code coverage and find more, and more varied, exceptions than ARES and Monkey.
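The paper recovers crash information from each application's running log. A hedged sketch of how distinct exception types could be extracted from such a log is shown below; the sample log text and the helper function are illustrative assumptions, since the paper does not specify its log parser.

```python
import re

def distinct_exceptions(logcat_text):
    """Collect the distinct Java exception class names seen in a log dump."""
    # Matches fully qualified classes ending in Exception or Error,
    # e.g. java.lang.NullPointerException.
    pattern = re.compile(r"\b((?:[A-Za-z_]\w*\.)+[A-Za-z_]\w*(?:Exception|Error))\b")
    return sorted(set(pattern.findall(logcat_text)))

# A hypothetical fragment of an Android runtime log.
sample_log = """
E AndroidRuntime: FATAL EXCEPTION: main
E AndroidRuntime: java.lang.NullPointerException: Attempt to invoke ...
E AndroidRuntime: FATAL EXCEPTION: main
E AndroidRuntime: java.lang.NumberFormatException: For input string: "x"
"""
found = distinct_exceptions(sample_log)
```

Deduplicating by exception class in this way is one simple way to avoid counting the same failure several times across the five repetitions.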

RQ3: Can the introduction of an FSM into reinforcement learning trigger more functions?
To answer RQ3, we added an FSM to the testing process of ARES and ATAC, respectively. When exploration reaches a locally optimal state, actions are performed under the guidance of the FSM. Branch coverage and method coverage are used to evaluate how thoroughly each APP is explored; we report the averages over the five experiments. Table 9 compares the branch and method coverage of ARES with and without FSM guidance on the 20 apps. On this data set, FSM-guided ARES achieves higher average branch coverage (36.6% versus 33.1%) and method coverage (55.5% versus 52.1%) than ARES without guidance. FSM-guided ARES performed best on 16 of the 20 applications in branch coverage, whereas unguided ARES was best on only 4. In method coverage, FSM guidance improved ARES on 13 APPs; the two variants tie on MunchLife, Silent ping sms, BirthdayCountDown, and LockPattern, and unguided ARES performs better on the remaining three applications. In general, using the FSM to guide ARES improves branch and method coverage on most apps, makes it possible to trigger more functions, provides high-level guidance for executing long specific event sequences, and alleviates potential local-optimum problems.
Table 10 shows the branch and method coverage of ATAC with and without FSM guidance. Across the 20 APPs, FSM-guided ATAC achieves higher average branch coverage (37.7% versus 36.5%) and method coverage (56.9% versus 55.4%) than unguided ATAC. Considering branch coverage, FSM-guided ATAC is best overall, while unguided ATAC is best only on Silent ping sms, BatteryDog, and Jamendo. Considering method coverage, FSM guidance improves ATAC on 10 APPs; the two variants tie on QuickSettings, MunchLife, Silent ping sms, AnyCut, BirthdayCountDown, Materialistic, BMI, and LockPattern. The improvement from guidance is not apparent on small apps such as BatteryDog and MutiSmsSender, but it is clearer on larger apps. Generally, FSM-guided ATAC improves branch and method coverage. Some functions are only triggered by a specific sequence of operations; introducing the FSM into the reinforcement learning framework makes it feasible to explore such long operation sequences, realize the desired transitions, and trigger and cover more branches and methods, while preventing the algorithm from settling into a locally optimal state and providing higher-level guidance for the exploration of the reinforcement learning agent.
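The FSM guidance described above can be sketched as a small wrapper around the RL agent: when no coverage gain has been observed for several steps (a sign of a local optimum), a precomputed FSM action sequence is replayed instead of the agent's action. This is a minimal sketch under our own assumptions; the state and action names, the stall threshold, and the wrapper interface are all hypothetical, not the paper's implementation.

```python
class FSMGuide:
    """Sketch of FSM guidance for a stalled RL explorer.

    transitions maps (state, action) -> next state; guided paths are
    precomputed action sequences leading toward rarely reached states.
    """

    def __init__(self, transitions, stall_limit=3):
        self.transitions = transitions
        self.stall_limit = stall_limit
        self.steps_without_gain = 0
        self.queued = []  # pending FSM-guided actions

    def observe(self, coverage_gain):
        """Track how long exploration has gone without new coverage."""
        self.steps_without_gain = 0 if coverage_gain else self.steps_without_gain + 1

    def next_action(self, state, rl_action, guided_path):
        """Return the RL agent's action, unless the run has stalled,
        in which case replay the FSM-guided sequence."""
        if self.queued:
            return self.queued.pop(0)
        if self.steps_without_gain >= self.stall_limit:
            self.queued = list(guided_path)
            self.steps_without_gain = 0
            return self.queued.pop(0)
        return rl_action

guide = FSMGuide(transitions={("Main", "open_menu"): "Menu"})
for _ in range(3):                      # three steps with no new coverage
    guide.observe(coverage_gain=False)  # -> the explorer is considered stalled
a = guide.next_action("Main", "random_tap",
                      guided_path=["open_menu", "tap_settings"])
```

Replaying a whole guided path, rather than a single action, is what lets the long specific operation sequences mentioned above be executed end to end.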

Threats to validity
Internal Threats. The main threat to internal validity is the non-deterministic nature of reinforcement learning: code coverage and the number of detected errors may differ between iterations. To reduce this threat, each method was run five times on each application, and the combined results of the five experiments were used for evaluation. In addition, reinforcement learning-based GUI testing is black-box: it finds exceptions through the front-end pages and may not cover back-end logic errors.
External Threats. One threat is the limited number of apps used for evaluation: although a vast number of apps exist, we tested only 20. We selected applications of different categories and sizes to reduce this threat. Another threat is the parameter settings of the reinforcement learning algorithms: many hyperparameters were chosen based on domain knowledge, and better choices may exist for some of them.
6 Related work

Random Exploration Strategy
In its simplest form, a random strategy only produces UI events, is inefficient at generating system-level events, and can react to only a few situations in a given context. An advantage of the random exploration strategy is that it can generate UI events quickly, which makes it suitable for stress testing. Its disadvantage is that it struggles to produce meaningful test cases and yields many redundant ones. Monkey is a tool provided by Google for stability and stress testing. It simulates user-generated events or system events (such as clicking and randomly entering text). It implements the most basic random strategy: the user specifies the number of events to generate, and the test stops when that number is reached. However, it can create many invalid tests that contribute nothing to testing the application.
Dynodroid [22] is also based on a random exploration strategy and can generate both system events and UI events; it creates system events by examining the events relevant to the APP. It uses a novel random algorithm to select a widget, either choosing the least frequently selected event or taking context into account, and allows the user to provide input manually when exploration stalls, for example to authenticate. Some researchers [23,24] have introduced fuzz testing into mobile application testing to generate fuzzed inputs. These tools are designed to create invalid input, crash the APP, and test its robustness. Such methods are quite effective at revealing security vulnerabilities (such as denial of service) and are highly targeted, but less effective at detecting functional defects. Intent Fuzzer [23] combines static analysis with random test generation. DroidFuzzer [25] analyzes the Intent-filter tags in the AndroidManifest.xml file to target activities.
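Dynodroid's frequency-based selection, mentioned above, can be sketched in a few lines: among the currently relevant events, pick uniformly from those fired least often so far. The event names and the interface are hypothetical; this is only an illustration of the strategy, not Dynodroid's code.

```python
import random

def pick_event(events, counts, rng=random):
    """Frequency strategy (sketch): among the relevant events, pick
    uniformly from those that have been selected least often so far."""
    least = min(counts.get(e, 0) for e in events)
    candidates = [e for e in events if counts.get(e, 0) == least]
    choice = rng.choice(candidates)
    counts[choice] = counts.get(choice, 0) + 1
    return choice

counts = {}
events = ["tap_ok", "scroll_list", "rotate_screen"]
# In the first three picks every event is chosen exactly once,
# because an already-fired event is no longer among the least frequent.
first_round = {pick_event(events, counts) for _ in range(3)}
```

Compared with Monkey's uniform randomness, this biases exploration toward under-exercised behavior without needing any model of the app.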

Model-based Exploration Strategy
Some mobile application testing methods first build a GUI model of the application and then generate events based on it to explore the application. Typically, these methods use a finite state automaton as the model, abstracting each activity into a state and events into transitions between states.
GUIRipper [2] dynamically builds the APP model based on a depth-first traversal strategy, adding a list of events each time a new state is found. SwiftHand [6] also generates a finite state automaton model of the application and optimizes the exploration strategy to increase coverage, but it cannot create system events, only simple UI events such as clicking and scrolling. PUMA [26] is an easily extensible framework. It implements the same basic random exploration as Monkey and can be extended with dynamic analyses on top of that strategy. As a model-based approach, PUMA likewise uses finite-state machines, and it supports different exploration methods and state representations; however, it is incompatible with some framework versions. Stoat [4] uses a stochastic finite state automaton model to describe APP behavior; during model construction it identifies the executable events and prioritizes their execution according to event type and execution frequency.
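The state-and-transition abstraction shared by these tools can be sketched as a depth-first model builder: screens are abstract states, and firing an event yields a transition. This is a simplified illustration under strong assumptions (deterministic transitions, a toy app description); real tools such as GUIRipper must also restart the app to return to earlier states.

```python
def explore(start, fire_event, list_events):
    """Depth-first GUI-model construction (sketch): states are abstract
    screens, transitions are fired events.

    fire_event(state, event) -> next state; list_events(state) -> events.
    """
    model = {}          # state -> {event: next_state}
    stack = [start]
    while stack:
        state = stack.pop()
        if state in model:
            continue
        model[state] = {}
        for event in list_events(state):
            nxt = fire_event(state, event)
            model[state][event] = nxt
            if nxt not in model:
                stack.append(nxt)
    return model

# A tiny hand-written app: two screens reachable from "Main".
screens = {
    "Main": {"open_settings": "Settings", "open_about": "About"},
    "Settings": {"back": "Main"},
    "About": {"back": "Main"},
}
model = explore("Main",
                fire_event=lambda s, e: screens[s][e],
                list_events=lambda s: list(screens[s]))
```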

Systematic Exploration Strategy
Some functions cannot be triggered by simple clicks, scrolls, or system-level events; they require specific inputs. Some mobile application testing tools therefore use evolutionary algorithms or symbolic execution to explore applications systematically. Although the systematic exploration strategy can cover functionality well, its scalability needs improvement.
EvoDroid [27] uses evolutionary algorithms, including a fitness function designed for maximum coverage, to explore applications systematically. ACTEve [28] is a concolic testing tool that tracks events from the point in the framework where they are generated to the point where they are handled in the APP; to do so, ACTEve must instrument both the APP and the framework. It supports both system and UI events. Sapienz [29] uses Pareto-optimal multi-objective search to maximize code coverage, reveal more errors, and minimize the length of test sequences. To generate text for specific input fields, it reverse-engineers the APK to obtain statically defined strings. However, generating new test cases through random crossover and mutation can produce invalid sequences, and repeatedly evaluating newly generated test cases also takes a significant amount of time.
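The Pareto-based selection behind Sapienz can be illustrated with a dominance check over its three objectives: higher coverage and more revealed faults are better, shorter test sequences are better. The individuals and numbers below are made up for illustration; this is not Sapienz's NSGA-II implementation, only the dominance relation it builds on.

```python
def dominates(a, b):
    """Pareto dominance for Sapienz-style objectives (sketch): each
    individual is (coverage, faults, sequence_length)."""
    cov_a, faults_a, len_a = a
    cov_b, faults_b, len_b = b
    no_worse = cov_a >= cov_b and faults_a >= faults_b and len_a <= len_b
    strictly_better = cov_a > cov_b or faults_a > faults_b or len_a < len_b
    return no_worse and strictly_better

def pareto_front(population):
    """Keep individuals not dominated by any other individual."""
    return [p for p in population
            if not any(dominates(q, p) for q in population if q != p)]

population = [
    (0.50, 3, 120),   # good coverage, short sequence
    (0.48, 3, 200),   # worse on coverage and length -> dominated
    (0.55, 2, 150),   # trades faults for coverage -> non-dominated
]
front = pareto_front(population)
```

Keeping the whole front, rather than a single best individual, is what lets the search pursue coverage, fault revelation, and sequence length simultaneously.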

Machine Learning based Strategy
Machine learning techniques include supervised learning, reinforcement learning, and active learning, and machine learning-based methods have been widely used in testing. Reinforcement learning does not need labeled data sets: under the guidance of a reward function, the model is trained through trial-and-error exploration to find the most favorable actions in the current situation. SwiftHand [30] applies machine learning to learn a model of the app during testing, uses the model to generate user inputs that visit unexplored states of the app, and then uses the app's execution on the generated inputs to refine the model. More and more researchers [9-13,15] have tried to adopt reinforcement learning for GUI test generation. Most of them [2,16] use Q-learning, where the agent maintains a Q-table to record Q values. Q-testing [14] uses Q-learning and divides states at the granularity of functional scenarios to explore different functionalities efficiently. However, the Q-table requires substantial memory if the state and action spaces are enormous. Other work [7,17] replaces Q-learning with a Deep Q Network, which uses a Q network to predict actions in given states. ARES [8] is a deep RL approach for black-box testing of Android applications that employs the DDPG, SAC, and TD3 algorithms as agents; however, it needs a large replay buffer to store experience.
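The tabular Q-learning used by the Q-table approaches above can be sketched in one update rule. The GUI states, actions, reward, and hyperparameters here are hypothetical; the sketch only illustrates why the table grows with the state-action space, which motivates the Deep Q Network and actor-critic variants.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    q is a table keyed by (state, action) pairs."""
    best_next = max((q[(next_state, a)] for a in actions), default=0.0)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

q = defaultdict(float)      # every unseen (state, action) starts at 0.0
actions = ["tap", "scroll"]
# Hypothetical transition: tapping on screen s0 reaches new code, reward 1.
q_update(q, "s0", "tap", reward=1.0, next_state="s1", actions=actions)
```

Because the table needs one entry per (state, action) pair, GUI testing with many screens and widgets quickly exhausts memory, which is exactly the limitation the deep-network agents address.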

Conclusion
Applications are constantly updated, and their functions and pages expand continuously; they are becoming more and more complex, and limited human, time, and computational resources make them increasingly challenging to test. Aiming at the state-explosion problem and the spatiotemporal limits of exploring an APP, this paper proposes ATAC and ATPPO. Our approaches use a deep neural network as the agent; evaluated on 20 apps, they perform better than Monkey and ARES, achieving higher instruction, branch, line, and method coverage as well as better failure detection. The paper also introduces a Finite-State Machine into the reinforcement learning framework to avoid falling into locally optimal states, which provides high-level guidance for further improving test efficiency. In future work, we will compare our tools with more baselines and explore more apps.

Fig. 1
Markov decision process

2 Preliminaries
Section 2.1 introduces the concept of reinforcement learning and the Markov decision process. There are two main families of methods for solving a Markov decision process: model-based and model-free. The model-based approach is mainly implemented by dynamic programming; however, it presupposes that the environment is known and well described. Therefore, model-free methods, which approximate the model-based ones, are usually used instead. The existing model-free agents are presented in Section 2.2.

Table 1
Target applications for evaluation

Table 2
Specific parameter settings of ATAC

Table 7
The kinds of experiments where failures are found

Table 8
The number of experiments where failures are found

Table 9
Branches and method coverage of ARES and FSM-guided ARES (%)

Table 10
Branches and method coverage of ATAC and FSM-guided ATAC(%)

8 Declarations
Ethical Approval: Not applicable. Competing Interests: The authors declare no conflicts of interest. Consent to Participate: Informed consent was obtained from all individual participants. Authors' Contributions: Chuanqi Tao and Hongjing Guo wrote the main manuscript text; Hongjing Guo and Jerry Gao prepared Figures 1-5 and Tables 1-10. All authors reviewed the manuscript. Funding: Not applicable.