BotFinder: a novel framework for social bots detection in online social networks based on graph embedding and community detection

With the widespread popularity of online social networks (OSNs), the number of users has increased exponentially in recent years. At the same time, social bots, i.e., accounts controlled by programs, are also on the rise. Service providers of OSNs often use them to keep social networks active. Meanwhile, some social bots are registered for malicious purposes. It is necessary to detect these malicious social bots to present a genuine public opinion environment. We propose BotFinder, a framework to detect malicious social bots in OSNs. Specifically, it combines machine learning and graph methods so that the latent features of social bots can be effectively extracted. Regarding feature engineering, we generate second order features and use encoding methods for variables that have high cardinality. These features make full use of both labeled and unlabeled samples. With respect to the graphs, we first generate node vectors through an embedding method, after which the similarity between the vectors of humans and bots is calculated; then, we use an unsupervised method to diffuse labels, which improves the performance further. To validate the performance of the proposed method, we conduct extensive experiments on a dataset provided by an artificial intelligence contest, which is composed of over eight million user records. Results show that our approach reaches an F1-score of 0.8850, which is much better than the state of the art.


Introduction
Over the past few decades, online social networks (OSNs) have played an increasingly important role in daily life, allowing people to communicate with each other in real time and maintain their social relationships more conveniently and efficiently. Many popular social platforms, such as Facebook and Twitter, connect users all over the globe. Furthermore, OSNs have become the most popular channel for individuals to obtain news compared with traditional media such as newspapers.
Social bots, i.e., accounts controlled by programs, may be used to keep social networks active.
Although there are beneficial social bots in OSNs, the emergence of malicious social bots has harmful effects. For example, some people register a large number of accounts for purposes such as maliciously inflating the number of fans or likes. These malicious behaviors have become an important information security problem that threatens the healthy development of social network platforms [1][2]. Therefore, it is necessary to detect those malicious social bots, a task referred to as social bots detection. In particular, the majority of current studies deal with Twitter and other foreign platforms, whereas few studies investigate OSNs in China.
Hence, many scholars have devoted themselves to the problem of social bots detection. Current work is mainly divided into two categories, i.e., machine learning approaches and graph-based approaches. However, some challenges remain:
1) In general, most methods rely on a single algorithm to identify social bots, which may not be desirable given the diversity of the data.
2) In practice, most of the data is unlabeled, so the number of labels is usually very small. Hence, it is a great challenge to exploit the unlabeled data effectively.
To tackle the above challenges, we jointly consider users' profiles, their behaviors and the relationships among them. Furthermore, we propose an integrated mechanism, BotFinder, which combines feature engineering and graph methods to detect social bots. First, feature engineering is conducted on the dataset to extract global information. Then, we generate node vectors through embedding methods. After that, we calculate the similarity between the vectors of humans and bots. Finally, we adopt an unsupervised method (here, a community detection algorithm) to further improve the performance. With the proposed algorithms, we can detect those machine accounts.
The contributions of this paper are summarized as follows.
1) Firstly, graph algorithms may not perform well when there are many isolated nodes, whereas machine learning methods are incapable of learning topological structure. Hence, we combine machine learning and graph approaches to overcome these problems.
2) Secondly, in feature engineering, we obtain second order features and adopt encoding methods for variables that have high cardinality, i.e., that contain a large number of distinct values. In terms of graphs, we generate node vectors through embedding methods. Then, we exploit an unsupervised method to diffuse labels to improve the performance. These approaches make full use of both labeled and unlabeled samples.
The rest of this paper is organized as follows. In Section 2, we review related work. In Section 3, we present the proposed framework, BotFinder. In Section 4, we describe the studied dataset in detail and analyze the experiments conducted on it. Finally, we conclude our research in Section 5.

Related works
In this section, we review recent research on social bots detection in OSNs, covering machine learning approaches and graph-based approaches.

Machine learning approaches
Among the machine learning approaches, supervised ones are widely investigated. Early anti-cheating algorithms only utilize user profiles or user behaviors to build models. Breno et al. [3] proposed a methodology using artificial neural networks with data preprocessing and mining. Chang et al. [4] proposed a feature selection method followed by decision trees to detect bots. Ganji et al. [5] applied K-Nearest Neighbors (KNN) to credit card fraud detection. Ferrara et al. [6][7] utilized machine learning and cognitive behavioral modeling techniques to analyze social bots in the 2017 French presidential election and the 2017 Catalan referendum for independence. Denis et al. [8] proposed an ensemble learning method for detecting bots on Twitter.
With the development of deep learning methods (LSTM, CNN, etc.), researchers have also developed new methods for detecting social bots, aiming to further improve detection accuracy. By viewing user content as temporal text data, Cai et al. [9] proposed the BeDM method for bot detection. Kudugunta et al. [10] extracted user metadata and tweet text and used these data as the inputs to LSTM deep nets. In practice, most real-world data is unlabeled, so unsupervised learning methods, which usually rely on the common features of social bots, are also widely investigated. Cresci et al. [11][12] proposed a revised approach based on DNA-inspired techniques to model online user behavior. Chen et al. [13] proposed an unsupervised approach to detect Twitter spam campaigns in real time. Jiang et al. [14] proposed CATCHSYNC to detect suspicious nodes using only the topology, without labels. Su et al. [15] proposed IoT-RU. Mazza et al. [16] converted the retweet time series into feature vectors and then clustered them.

Graph-based approaches
Machine learning approaches only consider the features of nodes, whereas the relationships among nodes also contain valuable information. With the development of deep learning and graph algorithms, the topology information of graphs needs to be considered for further improvement.
Social bots tend to aggregate in the graph, and community detection, which can be viewed as a generalized clustering algorithm, is used to discover community structures in a network. Thus, community detection algorithms might be applicable to detecting social bots. Many researchers have devoted considerable effort to this topic. Guillaume et al. [17] proposed a heuristic method based on modularity optimization. Li et al. [18] proposed the WCD algorithm based on a deep sparse autoencoder. For samples with rich features, it is hard to fully mine the information contained in the features. New methods have therefore been proposed that first convert the topology information of nodes into feature vectors and then use machine learning algorithms to train and infer. Examples include PyTorch-BigGraph proposed by Lerer et al. [19], NetWalk proposed by Yu et al. [20], Node2Vec proposed by Grover et al. [21] and Bot2Vec proposed by Pham et al. [22]. Moreover, Kipf et al. [23] proposed Graph Convolutional Networks (GCN), which model both the features of nodes and the network topology, and Aljohani et al. [24] applied GCN to detect bots on Twitter. Li et al. [25] proposed the BPD-DMP algorithm for network immunization. Nie et al. [26] considered the social network and posted content and proposed the DCIM algorithm. Gao et al. [27] characterized dynamic behaviors and proposed a network-based model. Zhu et al. [28] investigated the epidemic spreading process on multi-layer networks. Su et al. [29] proposed IDES to detect malicious nodes in vehicular networks.
Most methods rely on a single algorithm to identify social bots. In terms of both accuracy and other relevant evaluation metrics, the previous identification methods still have significant limitations.

Our Proposed Method: BotFinder
In this section, we present BotFinder, which consists of three steps: 1) we apply feature engineering techniques to tabular data; 2) we derive node embeddings and then measure the similarity between humans and bots; 3) we apply a community detection algorithm to further improve performance.

Overview
Figure 1 illustrates the steps in detail. In Step 1, we exploit feature engineering techniques to generate a feature matrix. In Step 2, we use a graph embedding method to generate a similarity matrix and then merge the two matrices. After that, we adopt LightGBM [30] to train on the merged matrix and infer temporary results. In Step 3, we apply a community detection method to generate partial results and use these results to correct the results of LightGBM.

Second order feature: To represent combinations of categorical variables in a table, we represent the second order feature as the triple $(\mathrm{count}, \mathrm{nunique}, \mathrm{ratio})$.
Here, $\mathrm{count}$ reflects the degree of activity. Specifically, we select a pair of variables (i.e., $v_1$ and $v_2$) and record the number of times this pair occurs in the dataset. We abbreviate it to $\mathrm{count}(v_1, v_2)$. For example, suppose a user gives a thumb-up to someone using the combination of device type ($v_1$) iPhone12,1 and app version ($v_2$) 126.7.0, and this combination appears k times in the dataset. Then, the users who use iPhone12,1 and 126.7.0 will get a $\mathrm{count}$ value of k.
$\mathrm{nunique}$ indicates the diversity within a given scope. We use one variable ($v_1$) as the primary key and record the number of unique categories of the other variable ($v_2$). We abbreviate it to $\mathrm{nunique}(v_1)[v_2]$. For example, suppose that for the users who use device type ($v_1$) iPhone12,1 there are k different app versions in the dataset. Then, the users who use iPhone12,1 will get a $\mathrm{nunique}$ value of k.
$\mathrm{ratio}$ describes the proportion of the count. It is calculated as $\mathrm{count}(v_1, v_2)/\mathrm{count}(v_1)$. For example, suppose the combination of device type ($v_1$) iPhone12,1 and app version ($v_2$) 126.7.0 appears k times, and device type ($v_1$) iPhone12,1 appears v times in the dataset. Then, all the users who use iPhone12,1 and 126.7.0 will get a $\mathrm{ratio}$ value of k/v.
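The three statistics described above (pair count, number of distinct values per key, and count proportion) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the function name `second_order_features`, the row layout (a list of dicts) and the column names `device` and `app` are hypothetical and do not come from the original implementation.

```python
from collections import Counter, defaultdict

def second_order_features(rows, v1, v2):
    """Return one (count, nunique, ratio) triple per row for the pair (v1, v2)."""
    pair_count = Counter((r[v1], r[v2]) for r in rows)  # occurrences of each (v1, v2) pair
    v1_count = Counter(r[v1] for r in rows)             # occurrences of each v1 value
    distinct_v2 = defaultdict(set)                      # distinct v2 values seen per v1 value
    for r in rows:
        distinct_v2[r[v1]].add(r[v2])

    features = []
    for r in rows:
        count = pair_count[(r[v1], r[v2])]
        nunique = len(distinct_v2[r[v1]])
        ratio = count / v1_count[r[v1]]
        features.append((count, nunique, ratio))
    return features

# Hypothetical toy data: device type plays the role of v1, app version of v2.
rows = [
    {"device": "iPhone12,1", "app": "126.7.0"},
    {"device": "iPhone12,1", "app": "126.7.0"},
    {"device": "iPhone12,1", "app": "125.0.0"},
    {"device": "Pixel5", "app": "126.7.0"},
]
features = second_order_features(rows, "device", "app")
```

In this toy data, the first row's pair appears twice, its device has two distinct app versions, and the device itself appears three times, so its triple is (2, 2, 2/3).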

Time interval feature:
The request time interval varies across users. Here, we mainly consider the max, min, median and sum of the time intervals.
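As a rough illustration of these four statistics, the sketch below computes them from one user's request timestamps. It assumes that an interval is the gap between consecutive requests after sorting and that timestamps are numeric (e.g. seconds); the function name `time_interval_features` is hypothetical.

```python
from statistics import median

def time_interval_features(timestamps):
    """Summarise the gaps between one user's consecutive request times."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:  # fewer than two requests: no interval exists
        return None
    return {
        "max": max(gaps),
        "min": min(gaps),
        "median": median(gaps),
        "sum": sum(gaps),
    }

# Hypothetical timestamps in seconds; order of arrival does not matter.
stats = time_interval_features([100, 105, 130, 110])
```

Users with a single request have no interval, so the sketch returns `None` for them; how such users are handled in the original pipeline is not stated.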
Count encoding: Count encoding replaces each category with its count computed on the dataset. However, different categories of a variable may have the same count, which results in a collision: two categories are encoded as the same value. This degrades the performance of the model. Hence, we also introduce a target encoding technique.
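The collision described above is easy to reproduce. The following minimal sketch uses hypothetical IP values and a hypothetical helper `count_encode`; it is not taken from the original code.

```python
from collections import Counter

def count_encode(values):
    """Replace each category with the number of times it occurs in `values`."""
    counts = Counter(values)
    return [counts[v] for v in values]

# Hypothetical values: two different IPs both occur twice.
ips = ["1.1.1.1", "1.1.1.1", "2.2.2.2", "2.2.2.2", "3.3.3.3"]
encoded = count_encode(ips)
```

Here the two distinct IPs that each occur twice receive the same code, so the model can no longer tell them apart from this feature alone.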

K-folds target encoding (or likelihood encoding, impact encoding, mean encoding):
Target encoding converts categorical variables into numbers using the target (label). Here, we replace each category of a categorical variable with the corresponding probability of the target. To reduce target leakage, we apply k-folds target encoding. This is implemented as follows:

Step2: Similarity Calculation
Here, we adopt Node2vec [21] to obtain the node embeddings (vectors) of users and then calculate the cosine similarity between the embedding of a user and that of a labeled user. The similarity value indicates how likely the two users are to share the same label; for example, if the cosine similarity between user1 and user2 is relatively large, they are likely to have the same label.
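For example, let $A$ and $B$ denote two node vectors for users/accounts. The cosine similarity between the two vectors is calculated as
$$\cos(A, B) = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2}\,\sqrt{\sum_{i} B_i^2}},$$
where $A_i$ and $B_i$ denote the $i$-th elements of $A$ and $B$, respectively.
Then, for each node vector $x$ in the training set and testing set, we calculate its maximum and mean cosine similarity to the labeled node vectors:
$$\mathrm{sim}_{\max}(x) = \max_{y \in S} \cos(x, y), \qquad \mathrm{sim}_{\mathrm{mean}}(x) = \frac{1}{|S|} \sum_{y \in S} \cos(x, y),$$
where $x$ and $y$ each represent a node vector and $S$ is the set of labeled node vectors being compared against.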
The process is illustrated in Figure 2.
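A minimal sketch of the maximum and mean cosine similarity features is given below. It assumes the embeddings are plain lists of floats and that the comparison set is a list of labeled vectors of one class (for example, known bots); the names `cosine`, `similarity_features` and `bot_vectors` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # zero vectors get similarity 0 (assumption)

def similarity_features(vector, labeled_vectors):
    """Max and mean cosine similarity of `vector` to a set of labeled vectors."""
    sims = [cosine(vector, other) for other in labeled_vectors]
    return max(sims), sum(sims) / len(sims)

# Hypothetical two-dimensional embeddings of labeled bots.
bot_vectors = [[1.0, 0.0], [0.0, 1.0]]
max_sim, mean_sim = similarity_features([1.0, 0.0], bot_vectors)
```

Computing the same pair of features against labeled humans and against labeled bots would give four similarity columns per user; whether the original work uses exactly this layout is an assumption.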

Step3: community detection
For community detection, we adopt the Louvain method [17], which divides the constructed graph into communities. After that, we label communities with the following rules:
1) If all labeled users in a community share the same label, every user in that community is assigned that label.
2) If no user in a community has a label, or if the labeled users have different labels, we do not make a prediction.
These rules may not cover all users, so their coverage is limited. However, the predictions they do produce are more accurate than those of LightGBM. By combining the two steps above, performance can be further improved.
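The two labeling rules can be sketched as follows, given a community assignment that has already been computed (for instance by a Louvain implementation). The function names, the dict-based inputs, and the choice to let community predictions simply override the classifier output are assumptions made for illustration; the text does not specify how the correction is carried out.

```python
from collections import defaultdict

def diffuse_labels(community_of, known_labels):
    """Propagate labels only inside communities whose known labels all agree."""
    labels_in = defaultdict(set)
    for user, label in known_labels.items():
        if user in community_of:
            labels_in[community_of[user]].add(label)

    predicted = {}
    for user, community in community_of.items():
        seen = labels_in.get(community, set())
        if len(seen) == 1:  # rule 1: a single consistent label in the community
            predicted[user] = next(iter(seen))
        # rule 2: no label or conflicting labels, so no prediction is made
    return predicted

def correct_predictions(model_pred, community_pred):
    """Override classifier outputs wherever a community prediction exists (assumption)."""
    return {user: community_pred.get(user, pred) for user, pred in model_pred.items()}

# Hypothetical toy example: three communities, three labeled users.
community_of = {"u1": 0, "u2": 0, "u3": 1, "u4": 1, "u5": 2}
known_labels = {"u1": 1, "u3": 0, "u4": 1}
community_pred = diffuse_labels(community_of, known_labels)
final = correct_predictions({"u1": 1, "u2": 0, "u3": 0, "u4": 1, "u5": 0}, community_pred)
```

In the toy example only community 0 has consistent labels, so `u2` inherits the bot label from `u1`, while the users in communities 1 and 2 keep their classifier predictions.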

Experiment
To evaluate the performance of the proposed mechanism, we use a dataset from an artificial intelligence contest (https://security.bytedance.com/fe/ai-challenge#/secproject?id=2&active=1). It contains over eight million records consisting of user profiles and user requests (following or liking someone). Basic information about the dataset is shown in Table 1 and Table 2: Table 1 shows the users' personal information (profile), while Table 2 shows the users' behavior (request), including the device and the app version used to initiate the request at that time.
The task is described as follows: given user profiles and their requests, where only a small percentage of users are labeled, build a reasonable, explainable and effective model to detect malicious bots among the users.

Evaluation Metric
To evaluate the performance, we need to take both Recall and Precision into consideration: we want to uncover as many bots as possible (to improve Recall), and at the same time we need to make accurate predictions without harming normal users (to improve Precision). Hence, the traditional F1 score is adopted as the evaluation metric.
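For reference, the F1 score is the harmonic mean of Precision and Recall. The sketch below computes all three from binary labels, with 1 denoting a bot; the function name `precision_recall_f1` and the toy labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels where 1 marks a bot."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # bots correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # humans wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # bots that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative.
precision, recall, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

With two true positives, one false positive and one false negative, Precision, Recall and F1 are all 2/3.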

Data processing
In this section, we describe the data processing in detail. Overall, we merge the user profile matrix into the user request matrix on the variable 'user' and make predictions for individual requests. If any request of a user is predicted as 1, the user is labeled as 1.
For Feature Engineering, the processing applied to the different variables is presented in Table 4. After this operation, we obtain the feature engineering matrix.
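The request-to-user aggregation rule stated above amounts to taking the maximum prediction over each user's requests. A minimal sketch, assuming binary request-level predictions and a hypothetical helper name `aggregate_to_users`:

```python
def aggregate_to_users(request_users, request_predictions):
    """Label a user 1 if any of the user's requests is predicted as 1."""
    user_label = {}
    for user, pred in zip(request_users, request_predictions):
        user_label[user] = max(user_label.get(user, 0), pred)
    return user_label

# Hypothetical request-level predictions for three users.
labels = aggregate_to_users(["u1", "u1", "u2", "u3", "u3"], [0, 1, 0, 0, 0])
```

This "any positive request" rule favors Recall at the user level, since a single flagged request is enough to flag the account.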

Dataset | Applied method | Variables
User requests | Second order feature | [user, request_device_id,
 | Count encoding | [user, request_device_id, request_ip,

For Similarity Calculation, we exploit three kinds of relationships to derive the graphs and then calculate similarity. We obtain the graphs by using the 'request_ip', 'request_device_id' and 'request_target' relationships. For example, there will be an edge between two users if they use the same IP, follow the same target, or share the same device.
1) Constructing the graph by using 'request_ip': Here, we analyze the number of users associated with each IP; the corresponding results are provided in Table 5. We find that in most cases each IP is associated with only one user. Consequently, the constructed graph may not cover all users.
[Table 5: percentiles (0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, Max) of the number of users associated with each IP]
Hence, we exclude IPs with only one user. Furthermore, there are also some IPs with more than 1000 users, which may be public IPs. Such public IPs would lead to a large number of edges in the constructed graph, which weakens the association between users. Hence, we also exclude IPs with more than 1000 users for simplicity. Table 6 shows the quantiles of the number of users associated with each IP after exclusion. There are 26,836,730 edges and 1,953,559 nodes, accounting for 44.23% of all users.
3) Constructing the graph by using 'request_target': If we use the full data, there will be 60,544,191 edges and 2,408,814 nodes; processing such data would require substantial computing resources. As with the IP association, if a target is liked or followed by many users, the target might be a celebrity. Therefore, we exclude targets that are associated with likes from more than 20 users. After this operation, we obtain a graph with 5,121,010 edges and 1,317,012 nodes.
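The graph construction with group-size filtering can be sketched as below: users sharing the same value of a key (an IP, a device or a target) are connected pairwise, and groups that are too small or too large are skipped. The function name `build_edges`, the record layout and the pairwise-clique construction are assumptions for illustration; the thresholds mirror those mentioned in the text (more than one user, at most 1000 users per IP, at most 20 users per target).

```python
from collections import defaultdict
from itertools import combinations

def build_edges(requests, key, min_users=2, max_users=1000):
    """Connect users that share a value of `key`, skipping tiny or huge groups."""
    users_by_value = defaultdict(set)
    for req in requests:
        users_by_value[req[key]].add(req["user"])

    edges = set()
    for users in users_by_value.values():
        if min_users <= len(users) <= max_users:
            # every pair of users in the group gets one undirected edge
            edges.update(combinations(sorted(users), 2))
    return edges

# Hypothetical requests; max_users is lowered to 2 to show the upper filter.
requests = [
    {"user": "u1", "request_ip": "10.0.0.1"},
    {"user": "u2", "request_ip": "10.0.0.1"},
    {"user": "u3", "request_ip": "10.0.0.2"},
    {"user": "u4", "request_ip": "10.0.0.3"},
    {"user": "u5", "request_ip": "10.0.0.3"},
    {"user": "u6", "request_ip": "10.0.0.3"},
]
edges = build_edges(requests, "request_ip", max_users=2)
```

A group of n users contributes n(n-1)/2 edges, which is why very large groups such as public IPs or celebrity targets inflate the edge count so quickly and are worth excluding.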
Figure 3 shows the number of edges and nodes in the different graphs. The number of edges in the IP graph is much higher than that of the other two graphs.

Result analysis
To validate the superiority of the proposed model, extensive experiments are conducted on the considered data. For comparison, we also implement several baseline models using raw data, including Decision Tree [31], AdaBoost [32], XGBoost [33], Random Forest [34], CatBoost [35] and LightGBM [30]. The corresponding results are provided in Table 7.
Figure 5 presents the F1-scores obtained at the different steps, which verify the validity of each step. The F1-score improves by a large margin when Step 2 is included, while Step 3 improves the F1-score only slightly.
For the variable 'request_time' in Figure 8, in addition to the characteristics of aggregation, we also find that the request time of bots shows obvious periodicity, that is, bots are programmed to send requests at regular intervals.

FIGURE 8. Feature Histogram in Request Side
For the second order features in Figure 9 and the similarity features in Figure 10, there is also an obvious difference between the positive and negative samples. We find that some bots may use special devices, resulting in a low  and high  in app_version.

Conclusions
In this paper, we propose a social bots detection method, BotFinder. To validate the performance of the developed approach, we collected a dataset with more than eight million user records. Machine learning and graph methods are applied to extract latent features of social bots from this dataset. In particular, for feature engineering, we generate second order features and use encoding methods for high-cardinality variables. In terms of graphs, we generate node vectors for accounts and then exploit an unsupervised method (here, community detection) to diffuse labels in order to further improve the performance. In experiments conducted on the collected dataset, the effectiveness of the proposed integrated mechanism is demonstrated by an F1-score of 0.8850. The performance is superior to that of existing methods.

FIGURE 1. The Framework of BotFinder

Step1: Feature Engineering
Here, we obtain the second order features, the time interval feature, count encoding and k-folds target encoding. Then, we apply LightGBM to train on the obtained features and infer temporary results.
(a) Split the training data into 10 folds.
(b) Use the mean target of folds #2 to #10 as the encoding value for fold #1, and calculate the encoding values for folds #2 to #10 similarly.
(c) Use the target of the training data to determine the encoding value for the testing data.
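Steps (a) to (c) can be sketched as follows. This is an illustrative sketch rather than the original implementation: the round-robin fold assignment, the fallback to the global target mean for categories unseen in the other folds, and the function name `kfold_target_encode` are all assumptions.

```python
from collections import defaultdict

def kfold_target_encode(categories, targets, n_folds=10):
    """Out-of-fold target encoding; returns (train_encoding, test_mapping, prior)."""
    n = len(categories)
    prior = sum(targets) / n                   # fallback for unseen categories (assumption)
    fold_of = [i % n_folds for i in range(n)]  # round-robin fold assignment (assumption)

    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        for i in range(n):
            if fold_of[i] != fold:             # step (b): statistics from the other folds only
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        for i in range(n):
            if fold_of[i] == fold:
                c = categories[i]
                encoded[i] = sums[c] / counts[c] if counts[c] else prior

    # Step (c): testing rows use the target mean over the whole training set.
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    test_mapping = {c: sums[c] / counts[c] for c in counts}
    return encoded, test_mapping, prior

# Hypothetical toy data with two folds instead of ten.
cats = ["a", "a", "b", "b", "a", "b"]
ys = [1, 0, 1, 1, 1, 0]
encoded, test_mapping, prior = kfold_target_encode(cats, ys, n_folds=2)
```

Because each training row is encoded without looking at its own fold's targets, the encoded value cannot simply memorize the row's label, which is the leakage the procedure is meant to reduce.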

FIGURE 2. Presentation of the Similarity Calculation Process

FIGURE 4. Similarity Calculation

For Community Detection, we only apply the community detection algorithm to the device graph obtained in the above step.

FIGURE 5. F1 Score Obtained for the Testing Set

Figure 6 shows the feature importance generated in Step 1 and Step 2. The score of the target encoding of device types is high, indicating that social bots tend to use fixed types of device. Furthermore, users' personal information, such as login time and register time, also has a high score.

FIGURE 6. Feature Importance

To visually show the differentiation between positive and negative samples in different variables, we apply Kernel Density Estimation (KDE) in Figure 7. We find that social bots show the characteristics of aggregation, that is, they tend to register or log in at a fixed time.

FIGURE 9. Second Order Feature Histogram

TABLE 3. A request with label 1 indicates that the request is blocked and the corresponding user is a bot. We find that the number of bots is significantly smaller than the number of humans.

TABLE 4. The Process in Step 1

edges and 1,953,559 nodes, accounting for 44.23% of all users.

As indicated in Table 7, BotFinder achieves the best performance, with the largest F1-score.

TABLE 7. Results of the Classification Model