Android malware detection based on sensitive patterns

In recent years, the rapid increase in the number and types of Android malware has brought great challenges and pressure to malware detection systems. As a widely used method in Android malware detection, static detecting has been a hot topic in academia and industry. However, in order to improve detection accuracy, the existing static detecting methods incur excessively high analysis complexity and time cost. Moreover, the correlation between static features leads to a large amount of redundant data. Therefore, this paper proposes a static detecting method for Android malware based on sensitive patterns. It uses an improved FP-growth algorithm to mine frequent combinations of sensitive permissions and API calls in malicious apps and benign apps, which avoids the generation of redundant information. In addition, this paper adopts the multi-layered gradient boosting decision trees algorithm to train the detection model, and proposes a dual similarity combination method to measure the similarity between different sensitive patterns. The experimental results show that the proposed detection method has high accuracy and strong generalization ability.


Introduction
With the rapid development of mobile communication technologies, the usage of various mobile communication devices such as smartphones, tablets, and smart bracelets is also increasing. In order to provide a good user experience, a variety of mobile terminal operating systems have emerged, of which the Android operating system (OS) holds the largest market share. According to the global smartphone operating system market report released by International Data Corporation (IDC), Android leads the way with a market share of 86.7% [1].
The Android system adopts a hierarchical design model, as shown in Fig. 1. From bottom to top, it is divided into the Linux kernel, the hardware abstraction layer, the system local library and Android runtime environment, the Java API framework and the system applications [2]. Each layer contains a large number of submodules or subsystems. The underlying kernel space of the Android system is based on the Linux kernel, while the upper user space is composed of the local system library, the virtual machine operating environment, and the API framework. The kernel space and user space are connected through permission access requests. The advantage of this hierarchical structure is that it hides the specific implementation details of the lower layers. The lower layers provide uniform services for the upper layers and mask the differences between layers. Thus, changes at a lower layer do not affect the upper layers. Each layer can also provide a fixed Service Access Point (SAP) to achieve high cohesion and low coupling.
Fig. 1 The Android system architecture [2]
The hierarchical structure gives the Android system a powerful open-source character and a lax application release verification policy. Developers can easily release their own apps without strict restrictions, and users can download the apps they need from diverse sources, including official app stores and third-party app markets. However, because of this openness, Android has become one of the most vulnerable targets of malware attacks. In the mobile threat report [3] released in the first quarter of 2021, McAfee pointed out that the total number of mobile malware samples had exceeded 40 million, and the growth rate was still very fast.
By observing the malicious behavior of malware, we found that most samples attempt to steal users' private information, such as phone contacts, text messages, emails, personal photos and even bank accounts. As a result, criminals can use malware to collect this private data and engage in unlawful activities to obtain illegal income.
In order to protect users from malware and create a safe and healthy mobile communication environment, researchers in academia and industry have invented some techniques and tools for detection. In the detection of Android malware, static detecting is a widely used method. This technology statically processes the APK files of an Android application, and finds potential security hazards by analyzing the content of the files. At present, almost all static detecting methods are developed around APK files.
Existing static detection methods often adopt the idea of "comprehensive analysis", that is, they parse as many files in the compressed package as possible and extract multiple types of feature information from them to ensure a complete description of the data samples. Although this approach can achieve relatively satisfactory results, it brings other problems. For example, the comprehensive analysis leads to excessive analysis complexity and time cost, and the correlations between features cause data redundancy. Therefore, what to choose as the basis for analysis and how to effectively characterize the data samples are the keys that determine whether a detection method is efficient.
In addition, Android malware forms a huge family. With the evolution and development of malware, new types of malware and their variants continue to emerge. However, the existing malware detection technologies and tools struggle to achieve good results when identifying these unknown instances. Because of this low generalization ability, malware detection systems in actual applications need to be updated frequently, and the cost of updating seriously affects their overall performance. Fortunately, with the rise of artificial intelligence, detection systems have introduced machine learning technology to improve malware detection performance.
Therefore, this paper proposes a novel Android malware detection method based on sensitive patterns by using artificial intelligence. The proposed detection method not only considers the accuracy of static detection, but also considers the generalization performance. It is able to discover the difference between malicious apps and benign apps on the basis of combined information of sensitive permissions and API calls. And the combined information can be used to build a high-performance detection model.
In summary, our major contributions include the following:
(1) To mine frequent itemsets in the proposed detection method, we design an improved FP-growth algorithm. It generates the sensitive patterns of malware and normal software through efficient data mining and reduces the generation of redundant information.
(2) To reduce the feature dimension, this paper combines text similarity and support similarity to carry out hierarchical clustering of sensitive patterns. The samples are then characterized based on sensitive pattern clusters and inclusion degree. That is, we mine sensitive patterns from app samples and then construct feature vectors with them.
(3) We train our detection model with multi-layered gradient boosting decision trees. To the best of our knowledge, this is the first application of this algorithm in the field of Android malware detection. It improves the detection accuracy and generalization ability of the detection model.
The remainder of this paper is organized as follows. In Sect. 2, we discuss the related work. Our method is detailed in Sect. 3, continuing with the evaluation in Sect. 4. We conclude this paper in Sect. 5.

Related work
In order to promote the development of malware detection, a large amount of related work has been carried out [4,5].

Static detecting analysis
The traditional detection methods are mostly based on the signature authentication mechanism [6,7]. They store the signatures of known malicious software in a database. The signature of a sample is then extracted and compared with the database to determine whether the sample is malicious. While this approach is simple and effective, it has two major drawbacks. First, it cannot detect unknown malware, because the corresponding signature does not exist in the database, and it is expensive to create a new signature and publish it. Second, malware can bypass signature-based identification by changing a small amount of code in an application in a way that does not affect its semantics.
To solve the above problems, more static features have been introduced into malware detection. Malware often needs to apply for appropriate permissions when performing malicious behaviors, such as reading contact lists and sending text messages. Therefore, static detecting methods based on permission analysis have been proposed. For example, the authors of [8] compared the usage of permissions between malicious apps and benign apps. They found that common permissions could not be used to effectively distinguish the two classes. However, the difference became obvious when the number of required permissions was small. Shuang Liang et al. [9] compiled a list of all permission combinations that appear frequently in malware and used them to develop a detection model.
As the underlying implementation of application functionality in the Android system, the API largely reflects the behavior characteristics of an application. Thus, some static detection methods based on API analysis have been studied. For example, the authors of [10] disassembled APK files and, through data flow analysis, obtained the API calls that pose high security threats to users. In addition to analyzing permissions and APIs, the authors of [11] analyzed other features such as activities, services, intents and network addresses to improve the accuracy of detection. TaeGuen Kim et al. [12] added opcodes and environment settings to the feature set.
There are also some methods that combine sensitive permissions with API calls. A Feature Centralized Siamese Convolutional Neural Network (FCSCNN) was proposed in [12], in which the benign and malicious mean centers were calculated from the sample database. In [13], permission matching and malware similarity analysis (PMMSA) was used, together with a binary adjacency (BA) oversampling method that expands the number of applications available for similarity analysis of malicious applications. Reference [14] proposed a novel approach for Android malware detection and familial classification based on the Graph Convolutional Network (GCN). To maximize the likelihood of identifying Android malware apps, reference [15] proposed three different grouping strategies for choosing the most valuable API calls: the ambiguous group, the risky group, and the disruptive group.
In addition, some researchers believe that string features are prone to the curse of dimensionality, while structural features are more suitable for processing massive data. For example, Jixin Zhang et al. [16] constructed the Dalvik opcode graph and analyzed its topology features such as node number, probability density and graph distance to characterize malware.

Machine learning
In recent years, with the rapid development of artificial intelligence, machine learning has become a research hotspot in various fields. Its theories and methods have been widely used to solve complex problems in scientific research and engineering applications. For malware detection, its essence is a classification task, which meets the applicable requirements of machine learning. Therefore, improving detection performance by machine learning has become an important development direction for malware detection.
Supervised learning is one of the most widely used model training methods in machine learning. It learns the mapping relationship (i.e., a function) from labeled data, and then performs inference on unlabeled data based on this relationship. Common supervised learning algorithms include Bayesian networks, K nearest neighbors (KNN), support vector machines (SVM), decision trees, etc. However, the acquisition cost of labeled data is relatively high, because labeling requires experienced experts to spend a lot of time. Therefore, unsupervised learning has become another learning method to break through this limitation. In unsupervised learning, the training data is unlabeled. The training process of the model is a "self-learning" process, in which the model itself discovers the relationships within the data. The most typical unsupervised learning algorithm is the clustering algorithm. For example, the work in [17] studied the K-MEANS clustering algorithm. It collected the runtime traffic of Android applications and selected six features to construct feature vectors, including frame length, frame number, connection duration, relative duration (time from the first frame), source port and destination port. The experiments in that work showed that the K-MEANS algorithm had a high accuracy rate in malware detection.
Fig. 2 The overall architecture of the method
With the emergence of neural networks and the advent of the era of big data, deep learning has gradually been applied to malware detection. For example, in [17], the authors correlated the features of static analysis with those of dynamic analysis and used deep belief networks to characterize malware. In [18], the authors used permissions and system calls to build neural network models. In [19], the authors proposed a system for Android malware detection using a convolutional neural network, which used the original opcode sequence of the application as a feature. In [20], the detection system used various classifiers, including deep neural networks, and accepted various kinds of input information, such as intents, permissions, system commands, and API calls.
The application of machine learning makes malware detection systems more intelligent and avoids the influence of subjective factors introduced by manual recognition. However, current research indicates that the combination of machine learning and malware detection still has certain issues. For example, the selected features may not effectively represent the sample data, resulting in failure to cover multiple detection scenarios. In addition, a common problem of current detection systems is that their generalization ability is too low to identify new types of malware and their variants. This leads to frequent updates of detection systems, and for large detection systems, the cost of updating is high.

Proposed method
Aiming at Android malware detection, this paper proposes a static detecting method based on sensitive pattern by using machine learning. Figure 2 shows the overall architecture of the proposed method. First, it disassembles the APK files of the Android apps to obtain information about permissions and API calls. Then, sensitive patterns of normal software and malware are generated by filtering raw information, mining frequent combinations and clustering patterns. Next, we construct the feature vectors with the sensitive patterns. Finally, the malware detection model is trained by machine learning algorithm.
The key of malware detection algorithm is to determine the feature vector. Although there are a lot of similar malware detection algorithms [21][22][23][24], the difference of our detection method is that the selected features are more typical and the redundant information in the features is reduced. Thus, our detection method has higher accuracy.

Extraction of raw data
In our proposed method, we must first retrieve the raw data of Android software before generating sensitive patterns. This requires reverse engineering [25] tools to extract the permission and API call information from the APK files.
The reverse engineering tool used in this paper is Apktool, as shown in Table 1. It provides a large number of commands for compiling and decompiling APK files. In the command line window, we can type "apktool d xx.apk" (xx is the APK file name) to start the decompilation process for the APK file. After the command is executed, a folder with the same name is created in the same directory as the APK file, containing various decompiled files. The folder mainly includes the assets files, the res files, the AndroidManifest.xml file and the smali files. The proposed method focuses on the permission information provided by the AndroidManifest.xml file and the API call information provided by the smali files.
In order to restrict application access to resources such as phone, network, contacts, SMS, and GPS location, Android provides a permission-based security model in the application framework. The developer must declare the required permissions using the <uses-permission> tag in AndroidManifest.xml, as shown in Fig. 3. These permissions are divided into the following three levels [26]: normal permission, dangerous permission and signing permission. Our proposed method completes the extraction of permission information by scanning the AndroidManifest.xml file and matching the keyword ".permission".
(1) Normal This type of permission poses the least risk to the user, system application, or device. Normally, normal permissions are granted by default when an app is installed.
(2) Dangerous This type of permission has the ability to access the private data and important sensors of the device. Thus, dangerous permissions are granted by users themselves when the application is installed.
(3) Signing Signing permissions are available for system applications. They are granted when the requesting app is signed with the same developer certificate as the app that declared the permission.
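The permission extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the manifest has already been decoded to plain XML by Apktool, and the function name extract_permissions and the sample manifest are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used for android:* attributes in a decoded AndroidManifest.xml.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical sample manifest, used only for illustration.
SAMPLE_MANIFEST = (
    '<manifest xmlns:android="http://schemas.android.com/apk/res/android" '
    'package="com.example.app">'
    '<uses-permission android:name="android.permission.READ_CONTACTS"/>'
    '<uses-permission android:name="android.permission.SEND_SMS"/>'
    '<application android:label="demo"/>'
    '</manifest>'
)


def extract_permissions(manifest_xml):
    """Return the sorted, de-duplicated permission names declared in the manifest."""
    root = ET.fromstring(manifest_xml)
    names = set()
    for node in root.iter("uses-permission"):
        name = node.get(ANDROID_NS + "name", "")
        # Keep only entries that match the ".permission" keyword.
        if ".permission" in name:
            names.add(name)
    return sorted(names)


print(extract_permissions(SAMPLE_MANIFEST))
```

Parsing the XML instead of matching raw text is a design choice of this sketch; a plain keyword scan over the file, as the text describes, would collect the same names.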
As the functions of the application are represented by various API calls at the bottom layer, analyzing the API call information helps to understand the behavior characteristics of the application. At the code level, an API is a method or function in a class. Figure 4 depicts the specific implementation of the API in the smali code. Our proposed method completes the extraction of API call information by scanning the smali files and matching the keywords ".method" and "invoke-".
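A keyword scan of this kind might look like the sketch below. The regular expressions, the function name extract_api_calls and the sample smali fragment are assumptions made for illustration; real smali files contain more syntax variants (for example array receiver types) than these patterns cover.

```python
import re

# ".method" lines declare a method; optional modifiers precede the method name.
METHOD_RE = re.compile(r"^\s*\.method\s+(?:[\w-]+\s+)*([^\s(]+)\(")
# "invoke-" lines call an API: invoke-kind {registers}, Lclass;->name(args)ret
INVOKE_RE = re.compile(r"^\s*invoke-[\w/-]+\s+\{[^}]*\},\s*(L[^;]+;)->([^(\s]+)\(")

# Hypothetical smali fragment, used only for illustration.
SAMPLE_SMALI = """
.method public onCreate(Landroid/os/Bundle;)V
    invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getLine1Number()Ljava/lang/String;
    invoke-direct {p0}, Ljava/lang/Object;-><init>()V
.end method
"""


def extract_api_calls(smali_text):
    """Return (enclosing_method, 'Lclass;->name') pairs found in smali text."""
    calls = []
    current_method = None
    for line in smali_text.splitlines():
        declared = METHOD_RE.match(line)
        if declared:
            current_method = declared.group(1)
            continue
        invoked = INVOKE_RE.match(line)
        if invoked:
            cls, name = invoked.groups()
            calls.append((current_method, cls + "->" + name))
    return calls


for caller, api in extract_api_calls(SAMPLE_SMALI):
    print(caller, api)
```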

Sensitive pattern generation
As shown in Fig. 2, the sensitive pattern generation mainly includes three steps: data filtering, pattern mining and pattern clustering.

Data filtering
For the raw data extracted by reverse engineering, there are certain problems in directly analyzing it. First, each sample contains a large number of permissions and API calls, especially the data related to API calls, which range from hundreds to thousands. It greatly increases the complexity of the analysis. Secondly, there is a certain amount of noise in the raw data, that is, the sample contains data that is useless for analysis, such as GET_PACKAGE_SIZE in permissions, <init> in API calls, etc. Therefore, the raw data needs to be filtered.
Relevant research shows that most malicious software involves the operation of user sensitive data when performing malicious behavior [27]. Therefore, we take "sensitive data" as the basic condition for filtering.
For the data of permission categories, Android officially announces permission groups and related permissions with potential security risks, which could contain sensitive data, as shown in Table 2. Such sensitive data include calendar, camera, contacts, location, phone, sensors, SMS, and storage. In addition, permissions such as network, Bluetooth, and accounts are also considered permissions associated with "sensitive data".
For the data of API calls, SuSi [28] provides a list of sensitive APIs. Table 3 shows part of this list. According to the flow of sensitive data, these sensitive APIs are divided into two categories: source-related APIs and sink-related APIs. There are various sources and sinks in the Android security field. For example, a user's location information or address book can be considered a source, while a network connection or a text message sent from the device can be considered a sink. Access to source-related or sink-related data is achieved through specific API calls. For example, an app calls getLastKnownLocation() to obtain the user's last known location, and calls getLine1Number() to obtain the mobile phone number.
Since most malicious behaviors of malware involve operations related to user's privacy, the proposed method in this paper focuses on the permissions and API calls which are directly associated with the sensitive information. Therefore, in order to filter the raw data, we create a database of sensitive permissions and API calls according to network connection, phone status, contact list, text message, location information and so on. Then we remove the non-sensitive data based on this database.
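The filtering step can be illustrated with a short sketch. The contents of SENSITIVE_DB below are a tiny hypothetical stand-in for the database of sensitive permissions and API calls described above; the real database would be built from the official permission groups and the SuSi list.

```python
# Hypothetical miniature "sensitive database"; the real one is far larger.
SENSITIVE_DB = {
    "android.permission.READ_CONTACTS",
    "android.permission.SEND_SMS",
    "android.permission.ACCESS_FINE_LOCATION",
    "getLastKnownLocation",
    "getLine1Number",
}


def filter_sensitive(raw_items, sensitive_db=SENSITIVE_DB):
    """Keep only items found in the sensitive database, preserving order and dropping duplicates."""
    seen = set()
    kept = []
    for item in raw_items:
        if item in sensitive_db and item not in seen:
            seen.add(item)
            kept.append(item)
    return kept


raw = [
    "android.permission.GET_PACKAGE_SIZE",  # noise mentioned in the text
    "android.permission.SEND_SMS",
    "<init>",                               # noise mentioned in the text
    "getLine1Number",
    "getLine1Number",
]
print(filter_sensitive(raw))
```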

Pattern mining
A combination of sensitive permissions and API calls that occurs frequently in malware or normal software is called a sensitive pattern. The proposed method aims to discover the differences between malware and normal software in terms of sensitive patterns. This paper uses data mining methods (e.g., the FP-growth algorithm) to obtain the potential sensitive patterns from frequent itemsets.
Suppose $I = \{i_1, i_2, \ldots, i_m\}$ is the set of global items and $D = \{T_1, T_2, \ldots, T_n\}$ is the transaction dataset. Each transaction consists of several items (that is, $T_i$ is a subset of $I$). The support of an itemset $S$ is defined as the percentage of transactions in the dataset that contain the itemset:
$$sup(S) = \frac{|\{T_i \in D \mid S \subseteq T_i\}|}{|D|}$$
If $sup(S)$ is greater than or equal to the minimum support, $S$ is called a frequent itemset.
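As a small worked example of the support definition, the sketch below computes the support of an itemset over a toy transaction dataset. The item names and the helper names support and is_frequent are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    target = frozenset(itemset)
    hits = sum(1 for t in transactions if target <= frozenset(t))
    return hits / len(transactions)


def is_frequent(itemset, transactions, min_support):
    """An itemset is frequent when its support reaches the minimum support."""
    return support(itemset, transactions) >= min_support


# Toy dataset: each transaction is the set of sensitive items of one app.
transactions = [
    {"SEND_SMS", "getLine1Number", "INTERNET"},
    {"SEND_SMS", "getLine1Number"},
    {"SEND_SMS", "INTERNET"},
    {"SEND_SMS", "getLine1Number", "READ_CONTACTS"},
]

print(support({"SEND_SMS", "getLine1Number"}, transactions))  # 3 of 4 transactions
```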
However, the FP-growth algorithm has limitations when searching for frequent itemset in a dataset. For example, there will be a lot of redundant pattern information in the case of a large number of element items. After analyzing the results generated by FP-growth, we find that a frequent itemset may have one or more supersets with same support as itself, that is, the itemset is non-closed. Based on this finding, we propose an improved FP-growth algorithm to avoid the redundancy of pattern information.
If the frequent itemsets extracted from a conditional FP-tree are all closed, the conditional FP-tree is called a closed conditional FP-tree. Thus, in the improved FP-growth algorithm, we add a pruning strategy to construct a closed conditional FP-tree, as shown in Algorithm 1. In the improved FP-growth algorithm, the conditional pattern base is first traversed, and each element item and its corresponding count value are recorded in the header pointer table. Then we remove from the header pointer table the element items whose conditional frequency is less than the minimum support or equal to the conditional support. Finally, the element items in the conditional pattern base are filtered and sorted by using the header pointer table to complete the filling of the closed conditional FP-tree.
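The header-table pruning step described above can be sketched as follows. This is only an illustration of that single step under stated assumptions, not the authors' Algorithm 1: the conditional pattern base is assumed to be a list of (prefix path, count) pairs, supports are expressed as absolute counts, and the function names are hypothetical.

```python
from collections import Counter


def build_pruned_header(cond_pattern_base, min_count, cond_count):
    """Count items in the conditional pattern base and prune the header table.

    Items are dropped when their conditional frequency is below the minimum
    support count, or when it equals the conditional support count.
    """
    counts = Counter()
    for path, count in cond_pattern_base:
        for item in path:
            counts[item] += count
    return {item: c for item, c in counts.items() if min_count <= c < cond_count}


def filter_and_sort(cond_pattern_base, header):
    """Keep only header items in each path, ordered by descending frequency."""
    filtered = []
    for path, count in cond_pattern_base:
        kept = sorted((i for i in path if i in header), key=lambda i: (-header[i], i))
        if kept:
            filtered.append((kept, count))
    return filtered


# Toy conditional pattern base for some suffix item whose support count is 4.
base = [(["a", "b", "c"], 2), (["a", "c"], 1), (["a", "d"], 1)]
header = build_pruned_header(base, min_count=2, cond_count=4)
print(header)
print(filter_and_sort(base, header))
```

In this toy input, item a occurs in every conditional transaction, so its count equals the conditional support and it is pruned, while d is pruned for being infrequent.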
Although there are many similar FP algorithms [29,30], the improved FP-growth algorithm proposed in this paper has the advantage of adding pruning strategy, which greatly reduces the complexity of the algorithm, speeds up the convergence speed of the algorithm, and increases the efficiency of the algorithm.
The improved FP-growth algorithm can greatly reduce the size of the search space through the pruning strategy. This not only improves the mining efficiency, but also reduces the number of frequent itemsets. We use the improved FP-growth algorithm to mine sensitive patterns in malicious apps and benign apps, and find that there are significant differences between them.
As shown in the figure "The support of sensitive patterns (part) in malicious apps and benign apps", the same sensitive pattern can have very different support in the two classes. For example, one pattern has a support of 0.71 in benign apps, but its support in malicious apps is merely 0.36. In order to ensure the effectiveness of sensitive patterns, we remove those which have similar support in the two kinds of apps.
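Removing patterns with similar support in both classes could be done as in the sketch below. The source does not state how "similar" is decided, so the min_gap threshold of 0.2, the function name and the pattern names are assumptions made purely for illustration.

```python
def keep_discriminative(pattern_support, min_gap=0.2):
    """Keep patterns whose malicious and benign supports differ by at least min_gap.

    pattern_support maps a pattern name to (support_in_malicious, support_in_benign).
    min_gap is a hypothetical threshold, not a value taken from the paper.
    """
    return {
        pattern: (sup_m, sup_b)
        for pattern, (sup_m, sup_b) in pattern_support.items()
        if abs(sup_m - sup_b) >= min_gap
    }


supports = {
    "pattern_A": (0.36, 0.71),  # clearly different support, kept
    "pattern_B": (0.50, 0.55),  # similar support in both classes, removed
}
print(keep_discriminative(supports))
```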

Pattern clustering
By mining frequent combinations of sensitive permissions and API calls, we obtain the sensitive patterns of malicious apps and benign apps. Although we use the improved FP-growth mining algorithm to reduce the redundancy of pattern information, constructing feature vectors directly with these patterns still risks an excessively high feature dimension. Therefore, we use pattern clustering to reduce the feature dimension.
In order to measure the similarity of two sensitive patterns, we analyze from two aspects: text similarity and support similarity.
Since a sensitive pattern is a combination of sensitive permissions and API calls, we can consider it as a special text segmentation result. Thus, the text similarity of two sensitive patterns can be obtained by calculating the Jaro distance of the two texts:
$$d_{Jaro} = \begin{cases} 0, & m = 0 \\ \dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right), & \text{otherwise} \end{cases}$$
where $|s_1|$ and $|s_2|$ are the number of words in the two texts, and $m$ is the number of matching words. If two words from $s_1$ and $s_2$, respectively, are the same and are not more than the matching window size $\left\lfloor \max(|s_1|, |s_2|)/2 \right\rfloor - 1$ apart, they are considered to match each other. The value $t$ is the count of transpositions, that is, half the number of matching words that appear in a different order.
Because the position of a permission or API call in a sensitive pattern does not have a specific meaning, we sort each sensitive pattern in lexicographic order when calculating the Jaro distance. After sorting, matching words appear in the same order in both patterns, so the transposition count is zero. Thus, the Jaro distance of two sensitive patterns $sp_1$ and $sp_2$ with at least one matching word is:
$$d_{Jaro} = \frac{1}{3}\left(\frac{m}{|sp_1|} + \frac{m}{|sp_2|} + 1\right)$$
As mentioned above, a sensitive pattern has different support in malicious apps and benign apps. This distinction can be used to quantify the difference between two patterns. The support similarity $d_{sup}$ of two sensitive patterns is defined in terms of $sup_m$ and $sup_b$, which respectively indicate the support of a sensitive pattern in malicious apps and benign apps.
Considering the text similarity and support similarity of sensitive patterns, we perform hierarchical clustering of sensitive patterns based on the bottom-up merge idea. First, each sensitive pattern is treated as a single cluster. Then the similarities between clusters are calculated, and the two clusters with the maximum similarity are merged. The process of clustering is repeated until the maximum similarity is less than the threshold value. The similarity of two clusters is calculated as follows:
$$sim(C_1, C_2) = \frac{1}{2}\left(sim\_max(C_1, C_2) + sim\_min(C_1, C_2)\right)$$
$$sim\_max(C_1, C_2) = \max(\{sim(sp_i, sp_j) \mid sp_i \in C_1, sp_j \in C_2\})$$
$$sim\_min(C_1, C_2) = \min(\{sim(sp_i, sp_j) \mid sp_i \in C_1, sp_j \in C_2\})$$
$$sim(sp_i, sp_j) = w \cdot d_{Jaro} + (1 - w) \cdot d_{sup}, \quad sp_i \in C_1, sp_j \in C_2$$
where $sim(sp_i, sp_j)$ is a weighted sum of the text similarity and the support similarity. The process of hierarchical clustering of sensitive patterns is shown in Algorithm 2.
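The Jaro distance on sorted patterns and the bottom-up merging can be sketched as follows. This is an illustrative sketch rather than the authors' Algorithm 2. Because the formula of the support similarity is not reproduced here, the pairwise similarity is passed in as a callable; the example call uses the text similarity alone, which corresponds to w = 1. All function names and the threshold of 0.5 are hypothetical.

```python
from itertools import combinations


def jaro_sorted(p1, p2):
    """Jaro similarity of two patterns (item sets) after lexicographic sorting."""
    s1, s2 = sorted(set(p1)), sorted(set(p2))
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    m = 0
    for i, word in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used[j] and s2[j] == word:
                used[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Sorted inputs keep matches in the same order, so transpositions are zero.
    return (m / len(s1) + m / len(s2) + 1.0) / 3.0


def cluster_similarity(c1, c2, sim):
    """Mean of the largest and smallest pairwise similarity between two clusters."""
    values = [sim(a, b) for a in c1 for b in c2]
    return 0.5 * (max(values) + min(values))


def agglomerate(patterns, sim, threshold):
    """Merge the most similar clusters until the best similarity drops below threshold."""
    clusters = [[p] for p in patterns]
    while len(clusters) > 1:
        (i, j), best = max(
            (((a, b), cluster_similarity(clusters[a], clusters[b], sim))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters


patterns = [("A", "B", "C"), ("A", "B", "D"), ("X", "Y")]
print(agglomerate(patterns, sim=jaro_sorted, threshold=0.5))
```

To include the support similarity, one would pass a callable that returns w * jaro_sorted(a, b) + (1 - w) * d_sup(a, b), with d_sup supplied according to its definition.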

Feature vector construction
To represent each Android app sample, we construct the feature vector based on existence and inclusion degree. The feature dimension is the number of clusters of sensitive patterns. If the sample contains any pattern in a cluster, the corresponding feature value is 1. Otherwise, the inclusion degree of the sample to each pattern in the cluster is calculated, and we take the maximum degree as the feature value. Let $PA$ be the set of permissions and API calls in a sample, $C_i$ the i-th cluster and $f_i$ the corresponding feature value. The feature vector of the sample is constructed as follows:
$$f_i = \begin{cases} 1, & \exists\, sp_j \in C_i,\ sp_j \subseteq PA \\ \max\limits_{sp_j \in C_i} inclu(sp_j, PA), & \text{otherwise} \end{cases} \qquad (11)$$
$$inclu(sp_j, PA) = \frac{|sp_j \cap PA|}{|sp_j|} \qquad (12)$$
where the inclusion degree $inclu(sp_j, PA)$ indicates the ratio of the number of identical items in $sp_j$ and $PA$ to the total number of items in $sp_j$.
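The construction of the feature vector can be sketched in a few lines. The function names and the toy clusters are hypothetical. Since a fully contained pattern has an inclusion degree of exactly 1, taking the maximum inclusion degree over a cluster yields the same value as the two-case definition.

```python
def inclusion(pattern, sample_items):
    """Share of the pattern's items that also occur in the sample."""
    pattern = set(pattern)
    return len(pattern & sample_items) / len(pattern)


def feature_vector(clusters, sample_items):
    """One feature per cluster: the maximum inclusion degree over its patterns.

    A value of 1.0 means the sample fully contains at least one pattern of
    the cluster, which matches the first case of the definition.
    """
    items = set(sample_items)
    return [max(inclusion(sp, items) for sp in cluster) for cluster in clusters]


# Toy clusters of sensitive patterns and one sample (names are placeholders).
clusters = [
    [("P1", "A1"), ("P1", "A2")],
    [("P2", "A3", "A4")],
]
sample = {"P1", "A1", "A3"}
print(feature_vector(clusters, sample))
```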

Detection model training
In the fields of data classification and target detection, models built using ensemble learning and multi-layer structures often have high accuracy and robustness. Based on this knowledge, we adopt the multi-layered gradient boosting decision trees algorithm to train our detection model. The multi-layered gradient boosting decision trees (mGBDT) is a hierarchical model with representation learning ability proposed by Ji Feng et al. [24]. It stacks multiple regression GBDT layers as building blocks and trains them jointly with a variant of target propagation. For each layer, the forward mapping $F_i: o_{i-1} \to o_i$ ($o_i$ denotes the output of the i-th layer) has a corresponding pseudo-inverse mapping $G_i$. It can be obtained by minimizing the expected value of the reconstruction loss function:
$$\hat{G}_i = \arg\min_{G_i} \mathbb{E}\left[L^{inverse}\left(o_{i-1}, G_i(F_i(o_{i-1}))\right)\right]$$
$$L_i^{inverse} = \left\| G_i\left(F_i(o_{i-1} + \varepsilon)\right) - (o_{i-1} + \varepsilon) \right\|$$
where the Gaussian noise $\varepsilon$ can enhance the robustness and generalization of the model. As Fig. 6 shows, the training process of mGBDT includes several iterations. At each iteration, it first updates the pseudo-inverse mapping of each layer and calculates the corresponding pseudo-label. Then it takes them as given and updates the forward mapping following a gradient ascent step towards the pseudo-residuals.
Fig. 6 Training process of mGBDT
Algorithm 3 shows the training process of the detection model based on mGBDT.

Evaluation
We evaluate the performance of the proposed method by conducting extensive experiments.

Simulation environment
In this paper, the proposed method is validated on two datasets. The first dataset is CICMalDroid 2020 [31], which was created by the Canadian Institute for Cybersecurity. Its samples span five categories: adware, banking malware, SMS malware, riskware and benign software. The second dataset, DAFG, was created by ourselves. DAFG consists of malware samples from VirusShare, a well-known malware sharing website in the field of network security, and normal software samples. The normal software samples were obtained from multiple official app stores, such as Google Play, 360 Assistant, etc. In DAFG, the malware set is subdivided into two parts according to the time of data collection: 8183 malware samples collected from 2014 to 2016 and 2791 malware samples collected in 2017. DAFG originally contained a total of 9058 normal software samples. In order to ensure the quality of the dataset, this paper uses the VirusTotal [32,33] tool to filter the crawled normal software samples. Thus, the final number of normal software samples used for the experiments in DAFG is 8745. The malware in DAFG is mainly privacy leaking software. Because our algorithm is closely tied to information leakage, we expect it to perform better on DAFG than on the other dataset.
In addition, more than 90% of the samples in the dataset are smaller than 10 MB. About 3% of the samples are larger than 20 MB. The largest sample is 87 MB, and the smallest sample is only 1 KB.
Regarding the experimental platform, all experiments in this paper are completed on a PC with a dual-core 3.7 GHz processor and 8 GB of memory, running Windows 10 (64-bit). The experimental programs are all written in Python and run in Spyder.

Performance of the improved FP-growth algorithm
As for the improved FP-growth algorithm proposed in this paper, we compare its performance with the original FP-growth. In our dataset, the number of sensitive permissions and API calls of each sample is different, ranging from several to hundreds. Table 4 shows the number of frequent itemsets and the mining time of the two algorithms under different minimum supports. Due to the pruning strategy, the number of frequent itemsets mined using the improved FP-growth is less than that using the original one, and this difference becomes more pronounced as the minimum support decreases. Therefore, the improved FP-growth can effectively avoid redundancy when mining sensitive patterns. In addition, the improved FP-growth has higher efficiency, and it can complete mining in a shorter time.
Fig. 7 The performance of different algorithms
In Fig. 7, the original FP-growth algorithm and the improved FP-growth algorithm are applied to malware detection when minSup is 0.3. The effectiveness of the two algorithms is analyzed from three aspects: Accuracy, Precision and Recall. It can be seen that the malware detection method with the improved FP-growth performs better, improving the detection rate by 5%.

Performance of the mGBDT algorithm
In our proposed method, the mGBDT algorithm is used to train the detection model. In this section, we compare the performance of the mGBDT algorithm with other classification algorithms when using sensitive permissions and API calls as classification features. To evaluate its performance, we compare it with traditional machine learning classification algorithms such as Support Vector Machine (SVM), Decision Tree (DT) and Random Forest (RF). We also compare our method with a deep learning algorithm, the multi-layer perceptron (MLP).
In our experiment, the evaluation measures used are Accuracy, Precision and Recall, as shown in formulas (15)-(17).

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{15}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{16}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{17}$$

Among them, TP and FP represent the number of correctly detected malware and the number of normal software that was misjudged as malware, respectively. TN and FN respectively represent the number of correctly detected normal software and the number of malware that was misjudged as normal software. Table 5 depicts the detection results obtained by training the detection model with mGBDT and the comparison algorithms. It can be concluded that mGBDT performs better than the other machine learning algorithms. Compared with traditional machine learning algorithms (SVM, DT, RF), mGBDT improves accuracy by 3-6%, precision by 2-4%, and recall by 2-8%. Compared with the deep learning algorithm (MLP), mGBDT improves accuracy by 2-3%, precision by about 3%, and recall by 2-3%.
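As a worked illustration of the three evaluation measures, the following sketch computes Accuracy, Precision and Recall from the four confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the experiments in this paper.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute (accuracy, precision, recall) from confusion-matrix counts.

    tp: malware correctly detected        fp: normal software flagged as malware
    tn: normal software correctly passed  fn: malware missed
    Returns 0.0 for a measure whose denominator is zero.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts for a test set of 200 apps (100 malware, 100 normal).
acc, prec, rec = detection_metrics(tp=90, fp=5, tn=95, fn=10)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Precision penalizes false alarms on normal software, whereas Recall penalizes missed malware, so the two can move in opposite directions when a detector's threshold changes.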
The main threat to the validity of these results is the type of malware in the dataset. Our algorithm is mainly designed for information-disclosure software. Its performance on the two datasets shows that it performs better on DAFG, where the malware is mainly of the privacy-disclosure type.

The detection performance
To confirm the superiority of the proposed method, we compare it with other similar methods (references [34][35][36][37]). References [34] and [35] used well-known feature types: bytecode image features and function call graph features, respectively. Reference [34] converted the binary file obtained after unpacking the Android APK file into a bytecode image and extracted features from that image, while reference [35] extracted the features of the function call graph by building a call graph between the functions of the Android software. Reference [36] studied the changes in permissions and API updates across different versions of the Android system and proposed a fine-grained malware detection method for different API levels. Reference [37] analyzed the usage of permissions and API calls in malware and normal software, respectively, and extracted 50 highly sensitive API calls as distinguishing features according to the mapping relationship between permissions and APIs.
Although the proposed method and the above-mentioned methods all use API calls and sensitive permissions as detection features, our method differs in three respects. First, it designs an improved FP-growth mining algorithm, which reduces the generation of redundant information and makes the features more accurate. Second, we combine text similarity and support similarity to cluster sensitive patterns hierarchically, which realizes feature dimensionality reduction and further improves detection efficiency. Finally, our method constructs an effective detection model based on the inclusion theory and the multi-layered gradient boosting decision trees (mGBDT) algorithm. Simulation results show that it has high detection efficiency and algorithm performance.
In this section, we compare algorithms using different classification features and their performance. Table 6 depicts the detection performance of different methods on DAFG and CICMalDroid 2020 datasets. On the two datasets, the method proposed in this paper has higher accuracy, precision and recall than other methods, which shows the superiority of the algorithm in detecting malware. The accuracy of our method is improved by 1.3~7.8%, the precision rate is improved by 2.1~7.3%, and the recall rate has increased by 1.8~7.5%.
For the methods that detect malware using sensitive permissions and API calls, Fig. 8 analyzes the accuracy over time on DAFG. By taking the samples collected from 2014 to 2016 as the training set and the samples collected in the first half of 2017 as the test set, we evaluate how well the different detection methods generalize to new malware. Fig. 8 shows the detection results on malware samples released in different months of 2017. As we can see, our method performs better at detecting new malware, which means that the detection system does not have to be updated frequently when our method is applied.
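The time-based evaluation described above can be sketched as follows. This is a minimal illustration under assumed data structures: each sample is a dictionary with hypothetical `collected`, `label` and `features` fields, and the `predict` function is a placeholder rule rather than the trained detection model.

```python
from collections import defaultdict
from datetime import date

def temporal_split(samples, cutoff):
    """Samples collected before `cutoff` form the training set; the rest are the test set."""
    train = [s for s in samples if s["collected"] < cutoff]
    test = [s for s in samples if s["collected"] >= cutoff]
    return train, test

def monthly_accuracy(test_samples, predict):
    """Accuracy of `predict` per (year, month) of collection, in chronological order."""
    hits, totals = defaultdict(int), defaultdict(int)
    for s in test_samples:
        key = (s["collected"].year, s["collected"].month)
        totals[key] += 1
        hits[key] += int(predict(s) == s["label"])
    return {k: hits[k] / totals[k] for k in sorted(totals)}

# Hypothetical samples: label 1 = malware, 0 = normal software.
samples = [
    {"collected": date(2015, 6, 1), "label": 1, "features": {"SEND_SMS"}},
    {"collected": date(2016, 11, 3), "label": 0, "features": {"INTERNET"}},
    {"collected": date(2017, 1, 15), "label": 1, "features": {"SEND_SMS"}},
    {"collected": date(2017, 1, 20), "label": 0, "features": {"SEND_SMS"}},
    {"collected": date(2017, 2, 2), "label": 0, "features": {"INTERNET"}},
]

def predict(sample):
    # Placeholder detector: flags any app that requests SEND_SMS.
    return int("SEND_SMS" in sample["features"])

train, test = temporal_split(samples, cutoff=date(2017, 1, 1))
print(monthly_accuracy(test, predict))
```

Splitting by collection date rather than at random keeps every test sample newer than every training sample, which is what makes the result a measure of generalization to new malware.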

Effectiveness of combining permissions and API calls
In this paper, both permissions and API calls are used to generate the sensitive patterns of Android apps. To demonstrate the effectiveness of combining permissions and API calls, we also conduct experiments that use only permissions or only API calls. As shown in Table 7, a single type of static feature has limitations in distinguishing between malicious apps and benign apps, and the combined method has higher accuracy, precision and recall than detection based on permissions alone or on API calls alone. The operation of Android software does not rely on a single permission or a single API call, but on a combination of the two. In different Android software, the way the two are combined often differs, which appears in this work as different sensitive patterns. Because the sensitive patterns formed by the combined method contain more information, they describe the Android software more accurately. Therefore, combining multiple features effectively improves the performance of the detection model.
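The benefit of combining the two feature types can be illustrated with a small sketch. The `perm:` and `api:` prefixes, the function names and the three toy apps are all hypothetical and only show the idea; they are not the feature encoding used in the experiments.

```python
def build_transaction(permissions, api_calls, mode="combined"):
    """Build the item set of one app from permissions, API calls, or both."""
    perm_items = {f"perm:{p}" for p in permissions}
    api_items = {f"api:{a}" for a in api_calls}
    if mode == "permissions":
        return frozenset(perm_items)
    if mode == "api":
        return frozenset(api_items)
    if mode == "combined":
        return frozenset(perm_items | api_items)
    raise ValueError(f"unknown mode: {mode}")

def matches(pattern, transaction):
    """A pattern matches an app when every item of the pattern occurs in the app."""
    return pattern <= transaction

# Hypothetical apps: (permissions, API calls).
apps = {
    "malicious": ({"SEND_SMS"}, {"sendTextMessage", "getDeviceId"}),
    "benign_a": ({"SEND_SMS"}, {"sendTextMessage"}),
    "benign_b": ({"READ_PHONE_STATE"}, {"getDeviceId"}),
}

# A combined pattern: one permission together with one API call.
pattern = frozenset({"perm:SEND_SMS", "api:getDeviceId"})

combined_hits = [
    name for name, (perms, apis) in apps.items()
    if matches(pattern, build_transaction(perms, apis, "combined"))
]
permission_hits = [
    name for name, (perms, apis) in apps.items()
    if matches(frozenset({"perm:SEND_SMS"}), build_transaction(perms, apis, "permissions"))
]
print(combined_hits, permission_hits)
```

In this toy case, the permission item alone also matches a benign app, while the combined pattern matches only the malicious one, which is the intuition behind the results in Table 7.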

Conclusion
This paper proposes an Android malware detection method based on sensitive patterns. We use an improved FP-growth algorithm to mine frequent combinations of sensitive permissions and API calls in malicious apps and benign apps, which effectively avoids redundancy. Based on text similarity and support similarity, we cluster the generated sensitive patterns to achieve feature dimensionality reduction. In addition, the multi-layered gradient boosting decision trees algorithm is used to build a malware detection model with high accuracy and strong generalization ability. However, the proposed detection method in this paper only considers the frequency of the occurrence of sensitive patterns in malicious software sets and normal software sets. It does not fully consider the difference of sensitive patterns in specific samples. Therefore, in future research work, sample weight can be assigned to sensitive patterns according to the number of API calls and other factors, so as to more accurately characterize the samples.
Author contributions
KL and GZ designed the related algorithms and wrote the main manuscript text. XC and QL designed the main architecture. LP and YL carried out simulation experiments on the proposed detection method. All authors reviewed the manuscript.

Conflict of interest
We declare that we have no financial and personal relationships with other people or organizations that can inappropriately influence our work in this paper.

Ethical approval
We understand that our manuscript and associated personal data will be shared with Research Square for the delivery of the author dashboard.