A subspace constraint based approach for fast hierarchical graph embedding

Hierarchical networks, as a type of complex graph, are widely used in many application scenarios such as social network analysis on the web, human resource analysis in e-government, and product recommendation in e-commerce. Hierarchy-preserving network embedding is a representation learning method that projects nodes into a feature space while preserving the hierarchical property of the network. Recently, much research on network embedding has been devoted to mining hierarchical structures and benefits greatly from them. Among these works, SpaceNE stands out by preserving hierarchy with the help of subspace constraints on a hierarchical subspace system. However, like all other existing works, SpaceNE is based on transductive learning and is hard to generalize to new nodes. Besides, these methods have high time complexity and are hard to scale to large-scale networks. This paper proposes an inductive method, FastHGE, to learn node representations more efficiently and generalize to new nodes more easily. As in SpaceNE, a hierarchical network is embedded into a hierarchical subspace tree. For upper communities, we exploit transductive learning by preserving the inter-subspace proximity of subspaces from the same ancestor. For extension to new nodes, we adopt inductive learning to learn the representations of leaf nodes. The overall representation of a node is retrieved by concatenating the embedding vectors of all its ancestor communities and the leaf node. By learning the basis vectors of subspaces directly, the computational cost of updating the many projection-matrix parameters in SpaceNE is avoided. Our performance evaluation experiments show that FastHGE runs much faster at the same accuracy; for example, in node classification, FastHGE is nearly 30 times faster than SpaceNE. The source code of FastHGE is available online.


Introduction
Graphs are an important abstraction for representing various complex network systems in the real world, such as social networks, biological structures, and chemical compounds, since they reveal the relationships between objects and support further analysis and mining. With the progress of machine learning, many methods such as GNNs can perform effective analysis of graphs, for example classification, clustering, and recommendation. To do this, representation learning on graphs is needed to prepare graph data for building prediction models. Graph embedding is such a learning method: it learns low-dimensional vector representations of a graph under the constraint of preserving various types of graph properties.
Graph embedding has attracted much research and has already made impressive achievements in many important applications such as personalized recommendation and natural language processing [1,2]. The aim of graph embedding is to encode the information contained, explicitly and implicitly, in the graph data. Graph data from different complicated systems carry rich information, including heterogeneity, diversity, and varying scale. Hierarchy is one of the most critical structural properties; consider, for example, a category hierarchy on e-commerce web sites (Lady's fashion → Sweaters → Crewneck) or the leader-follower hierarchy in the social network of a human organization (Company → IT Department → BI project → John). This paper focuses on hierarchy-preserving graph embedding for complicated networks with high efficiency.
To preserve the structural properties of networks, network embedding methods generally focus on vertex-centric context information. The structural properties of networks can be described at three levels:

1. Microscopic level. The microscopic structural properties only reflect the local properties of a vertex and its relationships with other vertices, such as neighbor relationships or similarity between nodes. Representative works include DeepWalk [3], LINE [4], and node2vec [5].
2. Mesoscopic level. The mesoscopic properties reflect the organizational structure and functional components of networks, such as community structure, characterized by dense connections within a community but sparser connections between different communities. The representative work is M-NMF [6], which encodes community structures by nonnegative matrix factorization.
3. Macroscopic level. The macroscopic properties reflect the overall structure of the network and the relationships among its sub-networks (e.g., communities) at different scales. Hierarchical structure is one such macroscopic property. Since hierarchy greatly enriches the network structure by introducing the concept of community, many studies on network embedding are devoted to mining this type of structural relation. Representative works include GNE [7], which encodes hierarchical structures by spherical projection, and SpaceNE [8], which is the first attempt to introduce hierarchical subspaces into network representation learning and stands out among hierarchy network embedding approaches.
In this paper, we propose an inductive learning method, FastHGE (Fast Hierarchy-preserving Graph Embedding), to learn hierarchical network embeddings much more efficiently while preserving the subspace constraints presented by SpaceNE. More specifically, a hierarchical network is embedded into a hierarchical subspace system, where sub-communities from the same community are projected into the same subspace. Considering the stability of the upper communities' structure, the upper communities and leaf nodes are treated differently. For upper communities, transductive learning is used to learn their representations by preserving the inter-subspace proximity of subspaces from the same ancestor. For extension to new nodes, we adopt inductive learning to obtain the representations of leaf nodes. The overall representation of a node is retrieved by concatenating the embedding vectors of all its ancestor communities as well as the leaf node.
The main contributions of our work can be summarized as follows:

1. We propose the FastHGE framework to learn node representations with hierarchical structural information inductively under subspace constraints, which is shown to be faster than existing hierarchical network embedding methods.
2. We design the FastHGE algorithm with an optimization strategy, which can efficiently choose embedding methods for both communities and leaf nodes. The source code is available online.
3. We evaluate our method on real-world network datasets and achieve high performance. In-depth analyses show the effectiveness and high efficiency of FastHGE.
A preliminary version of this work was presented at the conference ICASSP 2021 [9]. This paper makes the following additional contributions:

1. First, we present the complete proof of the objective function (see Lemma 1).
2. Second, we present the optimization strategy for the algorithms (see Algorithm 1).
3. Third, we add ablation experiments for evaluating the objective functions.
4. Fourth, we re-run the performance experiments for the new algorithm with two more data sets and add the corresponding experiments.
5. Fifth, we apply FastHGE to implicit hierarchical networks. The results show the efficiency of our method on both classification and link prediction tasks for implicit hierarchical networks.
6. Moreover, we present the related work in more detail to make the idea clearer.
The rest of the paper is organized as follows. In Sec. 2, we give the problem formulation and definitions on network representation. The architecture of FastHGE is introduced in Sec. 3. We evaluate our model in Sec. 4. Related works are introduced in Sec. 5. Finally, we conclude our work in Sec. 6.

Definition and Notation
Definition 1 (Hierarchical Network). Denote $G = (V, E)$ as an undirected graph with a non-empty vertex set $V$ and an edge set $E = \{e_{ij} \mid v_i \in V, v_j \in V\}$. The hierarchical community structure on $G$ is represented by an $L$-layer hierarchical tree $T$ with node set $C$ (in this paper, we use "vertex" to denote a node in the original network and "node" to denote a node in the corresponding hierarchical tree). Given a node $c_i \in C$, $pa(c_i)$ and $ch(c_i)$ denote the parent and children of $c_i$ in $T$, respectively. $C^l_i$ represents the $i$-th community in the $l$-th layer of $T$, and by definition the leaf layer corresponds to the vertices, i.e., $\bigcup_i C^L_i = V$.

Definition 2 (Subspace Constraint). Let $U \subseteq \mathbb{R}^n$ be a vector space and $U_s$ a subset of $U$. If for any $x, y \in U_s$ and $\alpha, \beta \in \mathbb{R}$ it follows that $\alpha x + \beta y \in U_s$, then we call $U_s$ a subspace of $U$.
Figure 1 shows a network with its corresponding hierarchical tree and subspace constraints. For example, vertices $v_1$, $v_2$ and $v_3$ in Fig. 1(a) belong to community $C^3_1$ in Fig. 1(b); community $C^3_1$ in turn belongs to community $C^2_1$ at the upper level, and $C^2_1$ belongs to the root community $C^1_1$. In Fig. 1(c), the communities $C^3_1$, $C^3_2$, $C^3_3$, and $C^3_4$ are each projected into a line of the subspace, the communities $C^2_1$ and $C^2_2$ are each projected into a plane of the subspace, and the root community $C^1_1$ is projected into the cube of the subspace.

Fig. 1 Illustration of hierarchical network, hierarchical tree and subspace

Denote $U^l_i \in \mathbb{R}^{n \times d_l}$ as the representation matrix for the $i$-th community in the $l$-th layer. The key idea of SpaceNE is to preserve inner-community proximity: nodes within the same community should be close to each other, and they should be projected into the same subspace. Mathematically, for community $C^l_i$ in the $l$-th layer with representation matrix $U^l_i$, the subspace constraints can be formulated as

$$\mathrm{rank}(U^l_i) \le d_l, \quad (1)$$

where $d_l$ is a hyper-parameter representing the dimension of the subspace in the $l$-th layer. Equation 1 makes hierarchy subspace network embedding more difficult. SpaceNE solves it by introducing layer-wise node projection: an auxiliary projection matrix $S^l_j \in \mathbb{R}^{d_l \times d_{l-1}}$ projects the representation matrix $U^l_i$ in the $l$-th layer into the $(l-1)$-th layer by

$$U^{l-1}_j = U^l_i S^l_j. \quad (2)$$

Although node embeddings satisfying the subspace constraints can be obtained by SpaceNE, $O((|V| + C) \times D)$ parameters are required, where $C$ is the number of total communities and $D$ is the dimension of the vertex embeddings. Moreover, SpaceNE has to carry out matrix multiplications from the root node of tree $T$ down to the leaf layer by

$$U^l_i = U^{l-1}_j S^{l\dagger}_j, \quad (3)$$

where $S^{l\dagger}_j$ is the pseudo-inverse of $S^l_j$. It is time-consuming both to update the auxiliary parameters and to compute these matrix multiplications.
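To make the cost of Eqs. 2-3 concrete, the following minimal numpy sketch chains the layer-wise projections; all sizes, the subspace dimensions, and the random matrices are illustrative assumptions, not values from the paper.

```python
# A minimal numpy sketch of the layer-wise projection that makes SpaceNE
# costly (cf. Eqs. 2-3). Sizes and random matrices are illustrative.
import numpy as np

n = 1000                      # number of vertices (assumed)
dims = [128, 64, 32]          # assumed subspace dimensions d_0 > d_1 > d_2

rng = np.random.default_rng(0)
# One auxiliary projection matrix S^l in R^{d_l x d_{l-1}} per transition.
S = [rng.normal(size=(dims[l], dims[l - 1])) for l in range(1, len(dims))]

# Leaf-layer representations live in the deepest subspace.
U_leaf = rng.normal(size=(n, dims[-1]))

# Projecting up to the root space chains matrix multiplications (Eq. 2).
U = U_leaf
for S_l in reversed(S):
    U = U @ S_l               # (n, d_l) @ (d_l, d_{l-1}) -> (n, d_{l-1})

# Going back down requires pseudo-inverses (Eq. 3) -- another chain of
# multiplications, repeated whenever the auxiliary parameters change.
for S_l in S:
    U = U @ np.linalg.pinv(S_l)   # (n, d_{l-1}) @ (d_{l-1}, d_l) -> (n, d_l)
```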

Framework
To tackle the problem of subspace constraints while reducing time complexity, we propose FastHGE. The basic idea is that sub-communities from the same community should share the same information about the upper structure of the network.
Definition 3 (Hierarchical tree). For a hierarchical tree $T$, denote the path from the root node of $T$ to a vertex $v_i$ as $a^0_i, a^1_i, \ldots, a^L_i$, where $a^l_i$ denotes the upper community of node $v_i$ in layer $l$ and $L$ denotes the depth of $T$. In particular, $a^L_i = v_i$ and $a^0_i$ is the root node.
Definition 4 (Hierarchy embedding). For a hierarchical tree $T$, let $\vec{u}_{v_i} \in \mathbb{R}^m$ denote the embedding of $v_i$ with dimension $m$. For convenience, we adopt the concatenation operation to incorporate information from all communities along the path as well as the leaf node:

$$\vec{u}_{v_i} = [g_0(a^0_i), g_1(a^1_i), \ldots, g_L(a^L_i)], \quad (4)$$

where $[\cdot, \cdot]$ is the concatenation operation that directly concatenates embeddings in the feature dimension, and $g_l(\cdot)$ denotes the embedding method applied in layer $l$. This means that any graph embedding method can be used to learn the representation vectors of communities or leaf nodes.
Definition 5 (Hierarchical function). For a hierarchical tree $T$ and a leaf node $v_i$ with path $a^0_i, a^1_i, \ldots, a^k_i, a^{k+1}_i, \ldots, a^L_i$, let $f_{sub}(\vec{u}_{v_i}, l)$ denote the function that extracts the information from the root community $a^0_i$ down to the community at layer $l$ along the path. If a leaf node $v_j$ with path $a^0_j, a^1_j, \ldots, a^k_j, a^{k+1}_j, \ldots, a^L_j$ shares the same ancestors as $v_i$ up to layer $k$, then $f_{sub}(\vec{u}_{v_i}, k) = f_{sub}(\vec{u}_{v_j}, k)$, which means $v_i$ and $v_j$ share the same information from their common ancestors.
Definition 4 presents the learning process of FastHGE and Definition 5 presents the optimization strategy for embedding computation. A two-phase embedding learning framework is designed, as shown in Figure 2.
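The following minimal sketch illustrates Definitions 4 and 5 with toy lookup tables; the embedding tables, node names, and dimensions are assumptions for illustration only.

```python
# A minimal sketch of Definition 4: a node's embedding is the concatenation
# of the embeddings of its ancestor communities and the leaf node itself.
import numpy as np

# Assumed per-layer embedding tables g_l: community/node id -> vector.
g = [
    {"root": np.ones(4)},                                 # layer 0
    {"C1_1": np.full(8, 0.5), "C1_2": np.full(8, -0.5)},  # layer 1
    {"v1": np.zeros(16), "v2": np.ones(16)},              # layer 2 (leaves)
]

def hierarchy_embedding(path):
    """Concatenate g_l(a_i^l) along the path from root to leaf (Eq. 4)."""
    return np.concatenate([g[l][node] for l, node in enumerate(path)])

u_v1 = hierarchy_embedding(["root", "C1_1", "v1"])
u_v2 = hierarchy_embedding(["root", "C1_1", "v2"])

# Definition 5: nodes under the same ancestors share the same prefix, so
# f_sub(u_v1, k) == f_sub(u_v2, k) for every layer k above the leaf.
prefix_len = 4 + 8                  # dims of layers 0 and 1
assert np.array_equal(u_v1[:prefix_len], u_v2[:prefix_len])
```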

Node embedding by incorporation

The first phase learns the representations of communities in their corresponding subspaces, and the second phase learns the representations of leaf nodes in the low-dimensional vector space. By learning the basis vectors of a subspace directly, the computation cost of updating the many projection-matrix parameters in Equations 2 and 3 is avoided. Furthermore, the overall embedding vector of a vertex can easily be retrieved by concatenating the representations along the path, with no additional matrix multiplications. Although the degrees of freedom of the parameters are reduced during optimization, this remarkably accelerates the training process while still satisfying the subspace constraints (Equation 1).

Lemma 1 FastHGE satisfies the subspace constraints defined in Equation 1.

Proof Note that the embeddings of communities at the same layer only vary in $g_l(a^l_i)$. The rank of $U^l_i$ satisfies $\mathrm{rank}(U^l_i) \le 1 + \mathrm{rank}(g_l(a^l_i))$, because the communities at the upper layers share the same embedding vectors, so the concatenated block $[g_0(a^0_i), g_1(a^1_i), \ldots, g_{l-1}(a^{l-1}_i)]$ is identical across rows and its rank reduces to 1. If we carefully choose the dimension of the embedding vectors for each $g_l(a^l_i)$ such that $\mathrm{rank}(g_l(a^l_i)) \le d_l - 1$, then the subspace constraints are satisfied. □
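A small numerical check of the rank argument in Lemma 1 is given below: if all rows share the same ancestor prefix, the prefix block has rank 1, so the rank of the whole matrix is at most one plus the rank of the leaf block. Sizes are illustrative assumptions.

```python
# Numerical check of the rank bound used in the proof of Lemma 1.
import numpy as np

rng = np.random.default_rng(0)
n, d_prefix, d_leaf = 50, 24, 8

prefix = np.tile(rng.normal(size=d_prefix), (n, 1))  # identical ancestor part
leaf = rng.normal(size=(n, d_leaf))                  # varies per node
U = np.hstack([prefix, leaf])

assert np.linalg.matrix_rank(prefix) == 1
assert np.linalg.matrix_rank(U) <= 1 + np.linalg.matrix_rank(leaf)
```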

Learning Procedure
There are two main problems left to be tackled in FastHGE: 1) how to properly choose the embedding methods $g_l(\cdot)$ to embed each community $C^l_i$ and each leaf node $v_i$; and 2) how to optimize the embedding vectors under the FastHGE framework.
Intuitively, the upper community structure is relatively stable, while the leaf node layer may change dynamically as new nodes are added. Thus, we learn the representations of the upper subspaces in a transductive way, and the representations of the leaf nodes in an inductive way. We utilize GraphSAGE [10] to aggregate information from the neighbors of leaf nodes, and DeepWalk [3] is adopted as the transductive learning method. To reduce complexity, we adopt the NCE [7] loss to train the representations of upper communities and leaf nodes simultaneously:

$$\ell_1 = -\sum_{v_i \in V} \sum_{v_j \in N(v_i)} \Big[ \log \sigma(\vec{u}_{v_i}^{\top} \vec{u}_{v_j}) + \sum_{n=1}^{k} \mathbb{E}_{v_n \sim P_n(v)} \log \sigma(-\vec{u}_{v_i}^{\top} \vec{u}_{v_n}) \Big], \quad (5)$$

where $N(v_i)$ is the set of neighbors of vertex $v_i$, $\sigma$ is an activation function, and $k$ is the number of negative samples $v_n$. Besides, we introduce an additional objective to maintain community proximity. Specifically, we calculate community proximity [7] based on common-neighbor similarity:

$$s_{ij} = \frac{A_i^{\top} A_j}{\|A_i\| \, \|A_j\|}, \quad (6)$$

where $A_u$ is the $u$-th column of the adjacency matrix $A$. With community proximity, the objective function can be defined as

$$\ell_2 = -\sum_{pa(c^l_i) = pa(c^l_j)} s_{ij} \Big[ \log \sigma(\vec{u}_{c^l_i}^{\top} \vec{u}_{c^l_j}) + \sum_{n=1}^{k'} \mathbb{E}_{c_n} \log \sigma(-\vec{u}_{c^l_i}^{\top} \vec{u}_{c_n}) \Big], \quad (7)$$

where $pa(c^l_i) = pa(c^l_j)$ denotes the common parent node of $c^l_i$ and $c^l_j$, and $k'$ is the number of negative samples $c_n$.
Finally, the overall optimization objective function with hyper-parameter $\lambda$ is

$$\ell = \ell_1 + \lambda \ell_2. \quad (8)$$
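To make the objective concrete, the following is a minimal PyTorch sketch of Eqs. 5-8. The tensor shapes and the negative-sampling scheme are illustrative assumptions that mirror the symbols defined above; this is not the released implementation.

```python
# A hedged PyTorch sketch of the FastHGE training objective (Eqs. 5-8).
import torch
import torch.nn.functional as F

def nce_loss(u_i, u_pos, u_neg):
    """l_1: NCE loss over nodes, one neighbor each, and k negatives (Eq. 5).
    Shapes (assumed): u_i, u_pos -> (batch, d); u_neg -> (batch, k, d)."""
    pos = F.logsigmoid((u_i * u_pos).sum(-1))                         # (batch,)
    neg = F.logsigmoid(-(u_i.unsqueeze(1) * u_neg).sum(-1)).sum(1)    # k terms
    return -(pos + neg).mean()

def community_loss(u_ci, u_cj, u_cneg, sim_ij):
    """l_2: sibling-community proximity weighted by similarity s_ij (Eq. 7)."""
    pos = sim_ij * F.logsigmoid((u_ci * u_cj).sum(-1))
    neg = F.logsigmoid(-(u_ci.unsqueeze(1) * u_cneg).sum(-1)).sum(1)
    return -(pos + neg).mean()

def total_loss(l1, l2, lam):
    """Overall objective l = l_1 + lambda * l_2 (Eq. 8)."""
    return l1 + lam * l2
```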

FastHGE Algorithm
Algorithm 1 presents the pseudo-code of FastHGE. The algorithm consists of two functions. In the LearnEmbedding function, the node embeddings $U^l$ of each layer are calculated via the DeepWalk algorithm and then concatenated along the tree's hierarchy to generate $U_{initial}$, the initial embedding matrix for graph $G$. The RecursiveOptimization function updates $U$ in a recursive manner. To optimize the embedding vectors, we leverage the NCE loss and Adam. Specifically, a batch of nodes is sampled from graph $G$, and $Z_{leaf}$ is calculated via the GraphSAGE algorithm to update each node's embedding in the leaf layer. With the updated embedding matrix $U$, the loss is calculated via Eq. 8 and minimized via the Adam optimizer. If Iter is not equal to zero, the process steps into the next iteration.
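The following structural sketch mirrors RecursiveOptimization as described above. The helper callables (sample_batch, leaf_encode, loss_fn, opt_step) are hypothetical placeholders for the components named in the text (node sampling, the GraphSAGE pass, Eq. 8, and an Adam step), not the released implementation.

```python
# A structural sketch of Algorithm 1 under stated assumptions.
def recursive_optimization(U, sample_batch, leaf_encode, loss_fn, opt_step,
                           iters):
    """Recursively refine the embedding matrix U."""
    if iters == 0:                    # termination: Iter reaches zero
        return U
    batch = sample_batch()            # sample a batch of nodes from G
    U = leaf_encode(U, batch)         # GraphSAGE pass updates leaf-layer rows
    opt_step(loss_fn(U, batch))       # evaluate Eq. 8 and take an Adam step
    return recursive_optimization(U, sample_batch, leaf_encode, loss_fn,
                                  opt_step, iters - 1)
```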

Model Complexity Analysis
Note that FastHGE gains its efficiency by exploiting transductive learning for the communities at upper layers and inductive learning for the leaf nodes. Assuming the embedding dimensions of the communities in different layers are the same value $D_g$, the number of parameters for the transductive part is $O(C \times D_g)$, where $C$ is the number of total communities. For the inductive part, the complexity of FastHGE is the same as that of GraphSAGE, i.e., $O(\sum_{i=1}^{K} |W^{(i)}|)$, where $|W^{(i)}|$ is the size of the parameter matrix at the $i$-th layer and $K$ is the number of layers, which is practically two or three and set to two in our experiments. It is independent of the scale of the given network. Thus, the total model complexity in the experiment is $O(C \times D_g + \sum_{i=1}^{K} |W^{(i)}|)$.
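The parameter count above can be computed directly, as in the small helper below; the example numbers are illustrative assumptions.

```python
# Parameter count reflecting the analysis above: C x D_g for the
# transductive part plus the sizes of the K GraphSAGE weight matrices.
def fasthge_param_count(num_communities, d_g, sage_weight_shapes):
    transductive = num_communities * d_g
    inductive = sum(rows * cols for rows, cols in sage_weight_shapes)
    return transductive + inductive

# e.g., C = 10 communities, D_g = 16, and a two-layer GraphSAGE (K = 2).
print(fasthge_param_count(10, 16, [(32, 64), (64, 64)]))
```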

Experiment Setup
Datasets. Four real-world datasets, Amherst, Georgetown, Hamilton, and UC, are chosen for performance evaluation. They are four-layer social networks formed by friendship relations at the corresponding American universities on Facebook [11]. Their size statistics are shown in Table 1.
In addition, to further evaluate the effectiveness of FastHGE, we use two datasets without explicit hierarchical structures, Chameleon and Squirrel, extracted from Wikipedia. Both are page-page networks in which each node is an article and each edge is a link between two articles. Table 2 shows the details.
Baseline Methods. The effectiveness of FastHGE is compared with two hierarchical network embedding models, SpaceNE [8] and GNE [7]. Meanwhile, comparisons with traditional network embedding models such as GCN (unsupervised) [12] and GraphSAGE (unsupervised) [10] are also made.
Settings. The grid search strategy is adopted to obtain the optimal hyper-parameters. The dimension of embedding vectors is 64 for all models. We apply Adam [13] to optimize parameters for better convergence.

Ablation Study
To evaluate the effectiveness of our objective function, we conduct several ablation experiments in this section.
The objective function consists of two parts: the node representation objective and the community proximity objective. Therefore, we implement two variations of FastHGE, denoted l1 and l2, which use only $\ell_1$ and only $\ell_2$ as the objective function, respectively, while l1 + λl2 is the overall FastHGE objective. The results on efficiency and accuracy are shown in Fig. 4 and Fig. 5.

Fig. 4 Ablation study on efficiency

From the results we can see that on all datasets the overall objective function usually outperforms the variations, especially on Georgetown; on the Hamilton dataset its performance is close to the best one. For the two variant objective functions, we find that using only $\ell_1$ efficiently reduces the elapsed time, since we can always choose an appropriate embedding method for communities and leaf nodes, which preserves the similarity of neighboring nodes. Meanwhile, the variation with $\ell_2$ achieves better accuracy by utilizing community proximity. Therefore, the two objectives together efficiently help to optimize FastHGE.

Training Efficiency
To demonstrate the efficiency advantage of FastHGE, we compare the running time (the time from the start of training to convergence) of the training procedure with that of SpaceNE. The experiments are compiled and tested on a Linux cluster. The results are shown in Table 4.
The results show that the running time of FastHGE is clearly lower than that of SpaceNE: FastHGE is almost 30 times faster. This is because FastHGE applies GraphSAGE to extract information at the leaf node layer and DeepWalk in the other subspaces, with far fewer parameters. The results indicate that FastHGE can scale to large-scale networks.

Node Classification
Node classification on the four datasets is conducted with different percentages of training samples randomly selected from the original networks. Logistic Regression is applied as the classifier and Accuracy is used as the evaluation metric. Each reported result is the average over 5 runs. The results are shown in Tables 5 and 6.
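A hedged sketch of this evaluation protocol is given below: train a Logistic Regression classifier on varying fractions of labeled nodes and average Accuracy over 5 runs. The arrays `X` (node embeddings) and `y` (labels) are assumed to be given.

```python
# Evaluation protocol sketch: Logistic Regression + Accuracy, 5 runs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate(X, y, train_ratio, runs=5):
    scores = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_ratio, random_state=seed, stratify=y)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, clf.predict(X_te)))
    return float(np.mean(scores))
```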
From the results we can see that in most cases FastHGE achieves the highest performance in the experiments, with SpaceNE slightly inferior to it. This is because our model improves both robustness and effectiveness on networks of various scales. SpaceNE considers subspace constraints on community representations for embedding, so it also performs well in most cases. The other methods employ traditional network embedding models that need to model the whole network, and therefore they do not perform as well in the experiments.

Efficiency Evaluation on Implicit Hierarchical Networks
To further evaluate the efficiency of FastHGE, we apply it to networks whose hierarchical structures are implicit. Since Chameleon and Squirrel do not have explicit hierarchical structures, we need to preprocess the datasets. We first discretize each dataset into five categories with equal-frequency partitioning. Then, for the hierarchical community division, we use the Louvain algorithm to classify the original vertices into different communities. Finally, we obtain a four-layer structure, and we run FastHGE on these datasets. We evaluate the efficiency on both classification and link prediction tasks. The classification results are shown in Tables 7 and 8, and show that FastHGE can efficiently classify the data in the implicit hierarchical networks. Even though the best method is DeepWalk, which utilizes network paths to find the hierarchical relationships between vertices, FastHGE achieves similar performance. This is because our method classifies vertices by their projected subspaces: vertices in the same community are projected into the same subspace.
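The community-division step can be sketched as below with networkx, whose `louvain_communities` function (available since networkx 2.8) implements the Louvain algorithm; the exact layering used in the paper may differ, and the toy graph only stands in for Chameleon or Squirrel.

```python
# A hedged sketch of the Louvain-based community division used to build
# an implicit hierarchy over a page-page graph.
import networkx as nx
from networkx.algorithms.community import louvain_communities

def build_hierarchy(G):
    """Return {community_id: set_of_nodes}; root -> communities -> nodes."""
    communities = louvain_communities(G, seed=0)
    return {f"C_{i}": set(c) for i, c in enumerate(communities)}

G = nx.karate_club_graph()          # stand-in for Chameleon / Squirrel
hierarchy = build_hierarchy(G)
print({cid: len(nodes) for cid, nodes in hierarchy.items()})
```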
Table 9 shows the results on the link prediction task. Even though FastHGE does not achieve the best performance, it is close to the best one. This is because, to speed up the representation, we use inner-community proximity for the nodes projected into the same subspace, so some structural information is ignored by our model. Methods like SpaceNE mainly utilize the graph structure to learn representations, so on the link prediction task they can use this information and achieve good performance. But considering its running time, the performance of FastHGE is acceptable.

Related Works
In this section, we summarize the related work on network embedding in two aspects: attribute-preserving methods and structure-preserving methods.
Attribute-preserving Methods. As attributes widely exist in networks and can enhance node representations, a number of works [14-19] focus on using this rich information for network embedding. Liu et al. [19] treat network embedding as neural machine translation and propose a content-to-node seq2seq model that maps content sequences to the corresponding node sequences to learn node representations. Inspired by TADW [20], which incorporates the text features of vertices into network representation learning, HSCA [18] embeds a network into a single latent representation space to capture the interplay among homophily, structural context, and node content. Meng et al. [21] provide a variational auto-encoder that embeds nodes and attributes into Gaussian distributions for low-dimensional network representations. Zheng et al. [17] consider attribute relations and dependencies in a hierarchical intention embedding network to predict click-through rate. Zhang et al. [22] construct a CNN-based model to fuse multiple conceptual attribute embeddings. Wang et al. [14] combine structure and attribute information in a unified approach for network embedding; to adaptively weigh the strength of interactions between a center node and its neighbors, they design an attention mechanism based on node attribute similarity.
Structure-preserving Methods. Structure-preserving methods [3,5,7,8,23-26] aim to preserve the structure and inherent properties of the network when learning the low-dimensional representations of its nodes. Pioneering network embedding works such as DeepWalk [3], LINE [4], and node2vec [5] encode a vertex together with its neighbor vertices; they focus on preserving the microscopic structural properties.
DeepWalk introduces deep learning into network embedding. It learns latent social representations of vertices from streams of short random walks, exploiting language modeling to define objective functions that capture the neighborhood similarities between vertices. It is parallelizable and scalable to large-scale graph data. LINE encodes arbitrary information networks (undirected, directed, and/or weighted) using first-order and second-order proximities. The algorithm first encodes node representations with first-order and second-order proximities separately, and then obtains the final representations by concatenating them. Asynchronous stochastic gradient descent is used for model optimization, and an edge-sampling method is proposed for model inference to overcome the limitation of SGD on weighted edges.
To preserve the mesoscopic properties of networks, M-NMF [6] is proposed, which preserves both microscopic structures (first- and second-order proximities) and mesoscopic community structures. M-NMF (Modularized Nonnegative Matrix Factorization) incorporates the community structure into network embedding through a unified framework that jointly optimizes the NMF-based embedding model and the modularity-based community detection model.
To preserve the macroscopic properties of networks, GNE [7] and SpaceNE [8] are proposed to encode the community structure and hierarchical structure of the network. GNE (Galaxy Network Embedding) provides hierarchy-preserving network embedding that considers not only the topological relationships between nodes in the same layer of the hierarchical tree but also the relationships between nodes (for example, parent and child) in the lineage hierarchy. GNE formulates the network embedding task as an optimization problem with constraints on both the local community structure (horizontal constraint) and the global hierarchical structure (vertical constraint). Inspired by the galaxy structure, GNE defines a community as a sphere with the parent as the center and proposes a spherical-projection-based embedding method. SpaceNE (Subspace Network Embedding) introduces subspaces into network representation learning to encode the community structures of a network along with their hierarchy. A subspace is a subset of a topological space endowed with the subspace topology, which can be used to approximate higher-dimensional data while keeping only its principal features. Objectives that preserve proximity between pairwise nodes and across communities are designed, along with constraints on subspace dimensions.

Conclusion
In this paper, we propose FastHGE for hierarchical network embedding. The hierarchical structures of networks are projected into hierarchical subspace systems. In our learning framework, we learn not only representations of nodes but also representations of subspaces. Sub-communities in the same community are projected into the same subspace and share the common structural information of their upper communities. The representation of a vertex is obtained by concatenating the related subspace representations and the leaf node representation. Our performance experiments show that the convergence speed of FastHGE is about 30 times higher while achieving node classification accuracy similar to that of the state-of-the-art model SpaceNE.


Table 1
Real social networks used for experiments

Table 2
Implicit hierarchy networks used for experiments

Table 4
Running time(seconds) on different datasets

Table 5
Comparison of node classification with existing methods on Amherst & Hamilton

Table 6
Comparison of node classification with existing methods on Georgetown & UC

Table 7
Efficiency evaluation of classification on Chameleon
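
Table 8
Efficiency evaluation of classification on Squirrel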

Table 9
Evaluate efficiency of link prediction