An attributed network \(\text{G}=\{\text{V},{\epsilon },\text{X}\}\) is defined as an undirected graph with \(\text{N}=\left|\text{V}\right|\) nodes and edge set \({\epsilon }\), where each node is associated with an \(\text{F}\)-dimensional attribute vector, so that the attribute matrix is \(\text{X}\in {\text{R}}^{\text{N}\times \text{F}}\).
In this paper, we focus on the problem of anomaly detection on attributed networks. The definition of the problem is as follows:
Given an attributed network \(\text{G}=\{\text{V},{\epsilon },\text{X}\}\), our goal is to detect nodes that are rare and differ significantly from the majority of the reference nodes in terms of both their structure and their attribute information. More formally, we aim to learn a score function \(\text{f}:\text{V}\mapsto \text{R}\) and to classify each node \({\text{v}}_{\text{i}}\) by comparing its score with a threshold \({\lambda }\), as shown in Eq. 1:
$$\begin{array}{c}{\text{y}}_{\text{i}}=\left\{\begin{array}{ll}1& \text{if } \text{f}\left({\text{v}}_{\text{i}}\right)\ge \lambda ,\\ 0& \text{otherwise},\end{array}\right. \left(1\right)\end{array}$$
In Eq. 1, \({\text{y}}_{\text{i}}\) is the label of node \({\text{v}}_{\text{i}}\): \({\text{y}}_{\text{i}}=0\) means the node belongs to the normal class and \({\text{y}}_{\text{i}}=1\) means it belongs to the abnormal class.
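As a minimal sketch of the thresholding rule in Eq. 1, the snippet below labels nodes from precomputed anomaly scores; the score values and the threshold value are illustrative.

```python
import numpy as np

def label_nodes(scores: np.ndarray, lam: float) -> np.ndarray:
    """Assign y_i = 1 (anomalous) when f(v_i) >= lambda, else 0 (normal), as in Eq. 1."""
    return (scores >= lam).astype(int)

# Hypothetical anomaly scores for five nodes and an illustrative threshold of 0.5.
scores = np.array([0.1, 0.8, 0.3, 0.95, 0.2])
print(label_nodes(scores, lam=0.5))  # -> [0 1 0 1 0]
```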
1. The Architecture of GraphAEAtt
In order to capture potential cross-modal interactions between network structure and node attributes in graph data, this paper designs a deep learning framework based on auto-encoders and attention mechanisms. The framework extracts features of attributed networks from multiple perspectives and fuses different types of feature information to restore the initial information.
The architecture of the proposed model is shown in Fig. 1. GraphAEAtt is designed to capture the cross-modal interaction of structural and attribute information through feature fusion in the hidden space. Existing approaches use a single embedding representation to reconstruct multiple types of information, but a single embedding has limited capacity to extract the features of the initial data.
The two decoders of the GraphAEAtt framework select the corresponding feature embeddings for deep fusion according to the type of information being reconstructed. For example, when reconstructing attribute features, features are extracted for each attribute dimension and combined with the node features to restore the exact attribute information of each node, thus minimizing the difference between the reconstructed matrix and the original matrix.
The framework consists of two parts: the encoders and the decoders. The method takes as input the node embedding representation learned by the structure encoder combined with the graph attention layer, together with the attribute embedding representation learned by the attribute encoder. During training, the structure decoder and the attribute decoder work together to capture the interaction between the network structure and the node attributes. Finally, the reconstruction errors of the network structure and node attributes are used to measure anomalies in the attributed network.
2. Structure Auto-Encoder
In the structure encoder, the attribute information in the data is ignored and only the structural features of the nodes are extracted.
The goal of the structure encoder is to learn an embedding representation that reflects the global structural features of the network. The input is the adjacency matrix of the graph data, \(\text{A}\in {\text{R}}^{\text{N}\times \text{N}}\). The hidden-layer embedding representation is obtained by a nonlinear feature transformation, as shown in Eq. 2:
$$\begin{array}{c}{\stackrel{\sim}{\text{Z}}}^{\text{s}}=\sigma \left(\text{A}{\text{W}}^{{\text{s}}_{1}}+{\text{b}}^{{\text{s}}_{1}}\right), \left(2\right)\end{array}$$
The result is then fed into a second feature transformation layer, as shown in Eq. 3:
$$\begin{array}{c}{\text{Z}}^{\text{s}}={\stackrel{\sim}{\text{Z}}}^{\text{s}}{\text{W}}^{{\text{s}}_{2}}+{\text{b}}^{{\text{s}}_{2}}, \left(3\right)\end{array}$$
When anomalies are present, overly dense clusters of connections or nearly isolated nodes exhibit obvious features in the graph structure; however, in an attributed graph with many nodes the proportion of anomalous nodes is small, so it is easier to capture the structural features of the majority of normal nodes when generating the low-dimensional embedding. Therefore, the weight parameter matrices can be dynamically optimized by the loss function during model training so that the extracted feature embeddings contain the normal structural features to the greatest extent possible.
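A minimal sketch of the two-layer structure encoder in Eqs. 2-3 is given below; the module name, the ReLU activation, and the layer sizes are illustrative assumptions rather than settings from the paper.

```python
import torch
import torch.nn as nn

class StructureEncoder(nn.Module):
    """Two-layer structure encoder sketch (Eqs. 2-3)."""
    def __init__(self, n_nodes: int, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(n_nodes, hidden_dim)    # W^{s_1}, b^{s_1}
        self.fc2 = nn.Linear(hidden_dim, embed_dim)  # W^{s_2}, b^{s_2}

    def forward(self, adj: torch.Tensor) -> torch.Tensor:
        z_tilde = torch.relu(self.fc1(adj))  # Eq. 2: nonlinear transform of A
        return self.fc2(z_tilde)             # Eq. 3: linear projection to Z^s

# Usage: an adjacency matrix A in R^{N x N} yields a structure embedding Z^s in R^{N x D}.
A = torch.eye(6)
print(StructureEncoder(n_nodes=6, hidden_dim=16, embed_dim=8)(A).shape)  # torch.Size([6, 8])
```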
3. Graph Attention Layer
The goal of the graph attention layer is to extract the features of each node of the network data in the hidden space. The network structure information and node attribute information of each node can be read directly from the real data. Since the information carried by each node varies, an abstract representation of its features can be extracted by combining information from both aspects. Accurate extraction of node features is the key to reconstructing the original network; because a node's own information comes from two sources, the node features reflect both the local connection structure and the attributes of similar nodes, giving each node a unique representation.
The graph attention layer extracts the potential features of each node and generates node embeddings, which are compact, highly representative summaries combining the node's own attributes, the attributes of neighboring nodes, and the node's connectivity relationships. When reconstructing the adjacency and attribute matrices, the adjacency and attribute information of each node is restored according to its specific characteristics.
The input to the graph attention layer is the node feature vectors, whose dimensionality is subsequently reduced again to obtain a more abstract feature representation. To calculate the correlation between two nodes, their features are concatenated and a nonlinear feature transformation maps the concatenated vector to a scalar, a potential representation that unites the features of the two nodes. After the central node performs the same operation with all its neighboring nodes, all potential representations are normalized using a softmax layer so that the importance of each neighboring node lies in the interval [0, 1].
To obtain a representative high-level node feature representation, the observed original node attributes X are first mapped to a vector representation in the low-dimensional potential space by a nonlinear feature transformation with an activation function, yielding the node feature vectors in the potential space, as shown in Eq. 4:
$$\begin{array}{c}{\stackrel{\sim}{\text{Z}}}^{\text{v}}=\sigma \left({\text{X}\text{W}}^{{\text{v}}_{1}}+{\text{b}}^{{\text{v}}_{1}}\right), \left(4\right)\end{array}$$
where \({\sigma }(\cdot )\) is the activation function, and \({\text{W}}^{{\text{v}}_{1}}\) and \({\text{b}}^{{\text{v}}_{1}}\) denote the weights and biases learned by the encoder.
After obtaining the embedding representation \({\stackrel{\sim}{\text{Z}}}^{\text{v}}\) of the graph, it is combined with the adjacency matrix to aggregate the embedding representations of all neighboring nodes by using a shared attention mechanism for the nodes. The adjacency matrix provides information about the neighbors of the nodes. First, the degree of association between two nodes is defined to be represented as \({\text{e}}_{\text{i},\text{j}}\), as shown in Eq. 5:
$$\begin{array}{c}{\text{e}}_{\text{i},\text{j}}=attn\left({\stackrel{\sim}{\text{Z}}}_{\text{i}}^{\text{v}},{\stackrel{\sim}{\text{Z}}}_{\text{j}}^{\text{v}}\right)=\sigma \left({\text{a}}^{\text{T}}\cdot \left[{\text{W}}^{{\text{v}}_{2}}{\stackrel{\sim}{\text{Z}}}_{\text{i}}^{\text{v}}\parallel {\text{W}}^{{\text{v}}_{2}}{\stackrel{\sim}{\text{Z}}}_{\text{j}}^{\text{v}}\right]\right), \left(5\right)\end{array}$$
where \({\text{e}}_{\text{i},\text{j}}\) is the importance weight of node \({\text{v}}_{\text{j}}\) to node \({\text{v}}_{\text{i}}\), independent of the network structure of the other nodes in the graph. \(attn(\cdot )\) denotes the neural network parameterized by \(\text{a}\in {\text{R}}^{\text{D}}\) and \({\text{W}}^{{\text{v}}_{2}}\in {\text{R}}^{\frac{D}{2}\times {D}_{out}^{v}}\), whose weights are shared by all nodes, and || denotes the concatenation of the node feature vectors.
In order to make the correlation coefficients easy to compare and compute, the attention scores between a node and each of its neighbors need to be normalized. The importance weights \({{\gamma }}_{i,j}\) are obtained by normalizing with the softmax function, as shown in Eq. 6:
$$\begin{array}{c}{{\gamma }}_{i,j}=\frac{\mathrm{exp}\left({\text{e}}_{\text{i},\text{j}}\right)}{\sum _{k\in {N}_{i}}\mathrm{exp}\left({\text{e}}_{\text{i},\text{k}}\right)}, \left(6\right)\end{array}$$
The embedding representation of node \({\text{v}}_{\text{i}}\) is then obtained by weighting the embedding representations of its neighboring nodes with the learned importance weights and summing them. The final node embedding \({\text{Z}}^{\text{v}}\) is produced by the graph attention layer, as shown in Eq. 7:
$$\begin{array}{c}{\text{Z}}_{\text{i}}^{\text{v}}=\sum _{k\in {N}_{i}}{{\gamma }}_{i,k}\cdot {\stackrel{\sim}{\text{Z}}}_{\text{k}}^{\text{v}}, \left(7\right)\end{array}$$
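The sketch below assembles Eqs. 4-7 into a single graph attention layer; the class name, the ReLU and LeakyReLU activations, and the dense adjacency masking are illustrative assumptions, not details fixed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Graph attention layer sketch (Eqs. 4-7)."""
    def __init__(self, in_dim: int, attn_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)             # W^{v_1}, b^{v_1} (Eq. 4)
        self.w = nn.Linear(embed_dim, attn_dim, bias=False)  # W^{v_2} (Eq. 5)
        self.a = nn.Parameter(torch.randn(2 * attn_dim))     # attention vector a (Eq. 5)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        z_tilde = torch.relu(self.proj(x))  # Eq. 4: map attributes to the latent space
        h = self.w(z_tilde)                 # W^{v_2} * z_tilde for the attention scores
        n = h.size(0)
        # Eq. 5: e_ij = sigma(a^T [h_i || h_j]) for every node pair
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(pairs @ self.a)
        # Eq. 6: softmax restricted to each node's neighbours via the adjacency mask
        e = e.masked_fill(adj == 0, float('-inf'))
        gamma = torch.softmax(e, dim=-1)
        # Eq. 7: aggregate the neighbours' latent features with the learned weights
        return gamma @ z_tilde

# Usage: attributes X (N x F) and adjacency A (N x N) give node embeddings Z^v.
x, adj = torch.randn(5, 12), torch.ones(5, 5)
print(GraphAttentionLayer(in_dim=12, attn_dim=8, embed_dim=16)(x, adj).shape)  # torch.Size([5, 16])
```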
4. Attribute Auto-Encoder
The goal of the attribute encoder is to obtain an embedding representation of all attribute information in the attributed network in the hidden space. In the attribute encoder, two nonlinear feature transformation layers are used to map the observed attribute data into a potential attribute embedding representation. The formula is as follows:
$$\begin{array}{c}{\stackrel{\sim}{\text{Z}}}^{\text{A}}=\sigma \left({{\text{X}}^{\text{T}}\text{W}}^{{\text{A}}_{1}}+{\text{b}}^{{\text{A}}_{1}}\right), \left(8\right)\end{array}$$
$$\begin{array}{c}{\text{Z}}^{\text{A}}={\stackrel{\sim}{\text{Z}}}^{\text{A}}{\text{W}}^{{\text{A}}_{2}}+{\text{b}}^{{\text{A}}_{2}}, \left(9\right)\end{array}$$
The attribute matrix contains all attribute information of every node. By transposing the attribute matrix, each one-dimensional attribute becomes the object of analysis: the hidden-space information of the attribute itself is extracted by nonlinear feature transformation, and the resulting embedding is used for attribute matrix reconstruction.
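A minimal sketch of the attribute encoder in Eqs. 8-9 follows; the module name, the ReLU activation, and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttributeEncoder(nn.Module):
    """Attribute encoder sketch (Eqs. 8-9): the transposed attribute matrix X^T,
    with one row per attribute dimension, is mapped to the embedding Z^A."""
    def __init__(self, n_nodes: int, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(n_nodes, hidden_dim)    # W^{A_1}, b^{A_1}
        self.fc2 = nn.Linear(hidden_dim, embed_dim)  # W^{A_2}, b^{A_2}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z_tilde = torch.relu(self.fc1(x.t()))  # Eq. 8: operate on X^T, i.e. per attribute
        return self.fc2(z_tilde)               # Eq. 9: Z^A with one row per attribute

# Usage: X in R^{N x F} produces an attribute embedding Z^A in R^{F x D}.
X = torch.randn(6, 10)
print(AttributeEncoder(n_nodes=6, hidden_dim=16, embed_dim=8)(X).shape)  # torch.Size([10, 8])
```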
5. Decoder
The goal of the structure decoder is to reconstruct a new adjacency matrix. To achieve cross-modal information fusion, an inner product operation is performed on the node embeddings and the structure embeddings, as shown in Eq. 10:
$$\begin{array}{c}\widehat{\text{A}}=Sigmoid \left({\text{Z}}^{\text{v}}{\left({\text{Z}}^{\text{s}}\right)}^{\text{T}} \right), \left(10\right)\end{array}$$
\({\text{Z}}^{\text{s}}\) contains global structural features and \({\text{Z}}^{\text{v}}\) contains node features that take into account the similarity of attributes between nodes, and the fusion of the two is able to assign appropriate neighboring relationships to each node and restore the same structure as the initial network.
In the above equation, the node embeddings are \({\text{Z}}^{\text{v}}\in {R}^{N\times D}\) and the structure embeddings are \({\text{Z}}^{\text{s}}\in {R}^{N\times D}\); finally, the matrix is normalized using the sigmoid activation function so that the connection relationships between nodes are pushed as close as possible to 0 and 1, making the reconstructed matrix approach the original matrix.
The role of the attribute decoder is to reconstruct the attribute matrix by performing an inner product operation between the node embeddings and the attribute embeddings, which, by fusing the information in the hidden space, generates attribute information for each node that matches the node's characteristics.
Finally, the attribute decoder takes both the node embeddings \({\text{Z}}^{\text{v}}\) and the attribute embeddings \({\text{Z}}^{\text{A}}\) as inputs to decode the original node attributes, as shown in Eq. 11:
$$\begin{array}{c}\widehat{\text{X}}={\text{Z}}^{\text{v}}{\left({\text{Z}}^{\text{A}}\right)}^{\text{T}}, \left(11\right)\end{array}$$
Here \({\text{Z}}^{\text{v}}\in {R}^{N\times D}\), \({\text{Z}}^{\text{A}}\in {R}^{F\times D}\), and \(X\in {R}^{N\times F}\), so the reconstructed matrix \(\widehat{\text{X}}\) has the same dimensions as the original attribute matrix.
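The two decoding steps in Eqs. 10-11 reduce to inner products followed, for the structure, by a sigmoid; the sketch below illustrates them with hypothetical tensor sizes.

```python
import torch

def reconstruct_adjacency(z_v: torch.Tensor, z_s: torch.Tensor) -> torch.Tensor:
    """Eq. 10 sketch: fuse node embeddings Z^v (N x D) and structure embeddings
    Z^s (N x D) by an inner product and squash the result to (0, 1)."""
    return torch.sigmoid(z_v @ z_s.t())

def reconstruct_attributes(z_v: torch.Tensor, z_a: torch.Tensor) -> torch.Tensor:
    """Eq. 11 sketch: the inner product of node embeddings Z^v (N x D) and
    attribute embeddings Z^A (F x D) restores an N x F attribute matrix."""
    return z_v @ z_a.t()

# Illustrative sizes: N = 6 nodes, F = 10 attributes, D = 8 embedding dimensions.
z_v, z_s, z_a = torch.randn(6, 8), torch.randn(6, 8), torch.randn(10, 8)
print(reconstruct_adjacency(z_v, z_s).shape)   # torch.Size([6, 6])
print(reconstruct_attributes(z_v, z_a).shape)  # torch.Size([6, 10])
```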
6. Anomaly Detection
The goal of the framework is that the trained network reconstructs an attributed network that is as similar as possible to the original network in terms of both network structure and attribute information. Because normal and anomalous features differ markedly and all nodes share a common set of embeddings during reconstruction, the overall reconstruction error is minimized when the embeddings contain, as far as possible, only normal features.
The objective function is to minimize the reconstruction error of the structure and attributes, expressed as the following equation:
$$\begin{array}{c}{\text{L}}_{rec}=\alpha {‖\left(\text{A}-\widehat{A}\right)\odot \theta ‖}_{F}^{2}+\left(1-\alpha \right){‖\left(X-\widehat{X}\right)\odot \eta ‖}_{F}^{2}, \left(12\right)\end{array}$$
where \({\alpha }\), \({\theta }\), and \({\eta }\) are hyperparameters; the parameter \({\alpha }\) controls the weights of the structural reconstruction error and the attribute reconstruction error, \(\odot \) denotes the Hadamard product, and the squared Frobenius norm of the difference matrix is used as the measure of network similarity, covering the structural and attribute information of each node; \({\eta }\) and \({\theta }\) are defined as follows:
$$\begin{array}{c}{{\theta }}_{i,j}=\left\{\begin{array}{ll}1& \text{if } {A}_{i,j}=0,\\ \theta & \text{otherwise},\end{array}\right. \left(13\right)\end{array}$$
$$\begin{array}{c}{{\eta }}_{i,j}=\left\{\begin{array}{ll}1& \text{if } {X}_{i,j}=0,\\ \eta & \text{otherwise},\end{array}\right. \left(14\right)\end{array}$$
In the formula, \({\theta } >1\) and \({\eta } >1\). Since the matrix representing the real dataset is sparse, it is more desirable for attributes or connections that are present in the original matrix to be preserved in the reconstruction.
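A minimal sketch of the objective in Eq. 12 with the penalty matrices of Eqs. 13-14 is given below; the hyperparameter values alpha, theta, and eta are illustrative, not values reported in the paper.

```python
import torch

def reconstruction_loss(A, A_hat, X, X_hat, alpha=0.5, theta=1.5, eta=1.5):
    """Weighted reconstruction error of Eq. 12, with penalty matrices per Eqs. 13-14."""
    theta_mat = torch.where(A == 0, torch.ones_like(A), torch.full_like(A, theta))  # Eq. 13
    eta_mat = torch.where(X == 0, torch.ones_like(X), torch.full_like(X, eta))      # Eq. 14
    struct_err = torch.norm((A - A_hat) * theta_mat, p='fro') ** 2  # structure term
    attr_err = torch.norm((X - X_hat) * eta_mat, p='fro') ** 2      # attribute term
    return alpha * struct_err + (1 - alpha) * attr_err

# Usage: A, A_hat are N x N, X, X_hat are N x F; the result is a scalar training loss.
```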
Anomalous nodes are defined as nodes that deviate significantly from other nodes in structure and attributes. Discovering anomalous connections in the network, nodes with anomalous behavior, nodes containing anomalous information, and so on can all be abstracted as anomaly detection problems on graphs. Anomalous transactions in the data usually correspond to illegal activities such as money laundering, tax evasion, illegal financing and fraud, and they usually differ from other nodes in characteristics such as time step, transaction amount, and the number of inputs and outputs. Based on the above analysis, the anomaly score of a node can be measured by the reconstruction errors of the network structure and attributes.
The embedding reconstruction method can filter out nodes with rare topology or node attributes, because the reconstructed network is generated by generalizing the features of the normal nodes that make up the majority of the original data. Therefore, if the structural and attribute information of a node still approximates its original state after reconstruction, the node is normal with high probability; otherwise it is likely to differ from the majority of nodes, i.e., to be an anomalous node.
In the anomaly detection stage, the anomaly score of each node is calculated from the reconstruction errors of the matrices, combining the node's connection and attribute information. The nodes are arranged in descending order of anomaly score, and nodes with higher scores are more likely to be anomalous. Nodes with scores above the threshold \({\lambda }\) are marked as predicted anomalies and compared with their true labels, and the detection effectiveness of the algorithm is judged using various evaluation metrics. \({\lambda }\) is chosen according to the distribution of anomaly scores.
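As a final sketch, the per-node scoring and ranking step can be written as below; computing each node's score as the weighted sum of its row-wise structure and attribute reconstruction errors is an assumption consistent with the description above rather than a formula given in the paper.

```python
import torch

def anomaly_scores(A, A_hat, X, X_hat, alpha=0.5):
    """Per-node anomaly score from row-wise reconstruction errors; higher = more anomalous."""
    struct_err = torch.norm(A - A_hat, dim=1)  # structure reconstruction error per node
    attr_err = torch.norm(X - X_hat, dim=1)    # attribute reconstruction error per node
    return alpha * struct_err + (1 - alpha) * attr_err

# Rank nodes by descending score; nodes whose score exceeds the chosen threshold are flagged.
# scores = anomaly_scores(A, A_hat, X, X_hat)
# predicted_anomalies = torch.argsort(scores, descending=True)
```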