The Concept Information of Graph Granule with Application to Embedding Learning of Knowledge Graph

Knowledge graph embedding (KGE) is routinely used to represent the entities and relations of a knowledge base with quantitative measures, and triples usually play the role of basic units in KGE learning. Considering that triples alone are sometimes far from adequate and that the knowledge graph itself contains a lot of structural information, this paper employs FCA-based techniques to mine deterministic knowledge, namely formal concepts, from the knowledge graph, and attempts to establish a relationship between knowledge graphs (KG) and formal concept analysis (FCA). Specifically, each set of triples sharing the same head entity is grouped as a graph granule, and the concepts of each graph granule are mined. By further exploration, the maximal concepts are integrated into embedding learning to develop a novel KGE model named TransGr for knowledge graph completion. The model learns a matrix for each maximal concept in a graph granule as well as a vector for each entity and relation. Experiments on link prediction and triple classification demonstrate that the proposed TransGr model is effective on datasets with relatively complete graph granules.


Introduction
A knowledge graph (KG) is a multi-relational graph database that uses nodes to represent entities and directed edges to depict the relations between pairs of connected entities. It has become an important infrastructure for many research fields, and various KGs Auer et al. (2007); Bollacker et al. (2008); Miller (1995); Suchanek et al. (2007) have been established for different tasks. However, the symbolic nature of a knowledge graph makes efficiency a challenge in specific applications, and many knowledge reasoning methods have emerged to address it.
Knowledge graph embedding (KGE), which aims at learning vector representations for symbolic entities and relations, is one of the most explored knowledge reasoning methods. Existing KGE models can be roughly divided into two categories. The first consists of translational distance models Bordes et al. (2013); Feng et al. (2015); He et al. (2015); Ji et al. (2015); Lin et al. (2015b); Wang et al. (2014), such as TransE Bordes et al. (2013), which measure the plausibility of a fact as the distance between two entities. Simplicity is the biggest advantage of such models; however, because of the complexity of the relations in a knowledge graph, they may lack efficiency. Therefore, some researchers try to embed additional information to improve the robustness of KGE models, and the semantic matching models Lacroix et al. (2018); Lin et al. (2015a); Nickel et al. (2011, 2016); Trouillon et al. (2016); Xiao et al. (2016); Yang et al. (2015); Zhang et al. (2019), which take certain semantic information into account, form the second category. For instance, RESCAL Nickel et al. (2011) associates each entity with a vector and each relation with a matrix to capture their latent semantics, while another work Lin et al. (2015a) treats relational paths as semantic compositions of their relation embeddings to benefit embedding learning. These methods regard the triple or the relational path as the basic unit of embedding, and thus lack a mechanism for using additional theories to mine structured information from the knowledge graph, which may limit their practical application to knowledge graphs with obvious structured information.
As a mathematical tool for data analysis and processing, formal concept analysis (FCA) Wille (1982) has been widely used to deal with several variants of the knowledge graph, such as graph and relational data. In an early report Wille and Fachbereich (1997), conceptual graphs and FCA were connected through conceptual structures to obtain a formalization of elementary logic. Relational concept analysis (RCA) Rouane-Hacene et al. (2013) extended FCA to multi-relational datasets; concretely, it constructs a set of concept lattices, one per object sort, through an iterative analysis process that converges to a fixed point. Graph-FCA Ferré (2015); Ferré and Cellier (2019) is a direct application of FCA to knowledge graphs; it introduced projected graph patterns (PGP) and object relations to define graph concepts. Based on the theory introduced by Graph-FCA, Ferré (2017) proposed a symbolic form of k-NN (k Nearest Neighbors) named C-NN (Concepts of Nearest Neighbors), where numerical distances are replaced by graph patterns that provide an intelligible representation of how similar two entities are. Later, Ferré (2019, 2020) applied C-NN to knowledge graphs for link prediction and achieved state-of-the-art results. In addition to link prediction, some scholars applied FCA to information retrieval Balasubramaniam (2015); Cigarrán et al. (2005); Fkih and Omri (2016), ontology learning Jabbari and Stoffel (2018, 2019) and semantic annotation González and Hogan (2018), from text data or semantic networks. For instance, Balasubramaniam (2015) proposed a hybrid FOGA framework using FCA for clustering-based information retrieval, while Jabbari and Stoffel (2019) introduced a pipeline that combines Natural Language Processing (NLP), FCA and ontology engineering techniques to build an ontology from textual data. Although these FCA-based studies involve the concept information of the knowledge graph, there is no discussion of how to use this information to develop specific knowledge graph embedding models.
In FCA, a concept represents the generalization and specialization relationship between entities and is usually written as (X, B), indicating that the objects in X share at least the attributes in B. Concepts therefore carry deterministic information and rich semantics. This fact motivates our idea of establishing a relationship between KG and FCA to develop a novel KGE model, named TransGr, that takes concept information into account. Noting that only some relevant information, rather than the whole knowledge graph, is needed to infer an entity, we first group each set of triples with the same head entity as a graph granule. Concept mining is then carried out in each graph granule instead of the whole knowledge graph, which reduces the search space and improves the efficiency of concept computation. By further exploration, the model TransGr is developed to learn a matrix for each concept as well as a vector for each entity and relation, capturing the interaction between entities through the concepts obtained from graph granules. Considering that the whole knowledge graph contains a large number of concepts and that the maximal concepts cover more tail entities, this paper uses only the maximal concepts for embedding learning.
The paper is organized as follows. Section 2 reviews preliminaries on KG and FCA. The TransGr model is presented in Section 3, and numerical experiments on two standard datasets, Wordnet and Freebase, follow in Section 4. A summary and future work are presented in Section 5.

Preliminaries
In this section, we first recall some related definitions about knowledge graph and formal concept analysis, and then introduce the notion of graph granule.

Knowledge graph and knowledge graph embedding
Knowledge graph was formally proposed by Google in 2012 with the aim of improving the ability of search engines. It encodes structured information about entities and their rich relations, and triples of the form (head entity, relation, tail entity), also called facts, usually play the role of basic units in a knowledge graph. In the literature, it can be defined as follows:

Definition 2.1 Cai et al. (2018) A knowledge graph G = (V, E) is a directed graph whose nodes are entities and whose edges are subject-property-object triple facts. Each edge of the form (head entity, relation, tail entity) (denoted as (h, r, t)) indicates a relation r from entity h to entity t.
Although triples are efficient for representing structured data, their inherently symbolic nature often makes them difficult to apply directly. Knowledge graph embedding is a promising approach to tackle this issue.
Definition 2.2 Given a knowledge graph G = (V, E), knowledge graph embedding converts the entities and relations in G into a continuous low-dimensional vector or matrix space in which the plausibility of each triple is preserved as much as possible.
The plausibility of a triple (h, r, t) is typically reflected in the definition of a score function $f_r(h, t)$. For instance, if the triple (h, r, t) holds, TransE Bordes et al. (2013) requires the embedded vectors to satisfy $h + r = t$. Then $f_r(h, t) = \|h + r - t\|^2_{1/2}$, where the subscript 1/2 denotes the L1 or L2 norm, is employed as a score function so that (h, r, t) receives a lower score than an incorrect triple in the vector space.
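To make the scoring idea concrete, the following minimal numpy sketch scores a triple the TransE way; the function name and the toy embeddings are illustrative, not the paper's implementation.

```python
import numpy as np

def transe_score(h, r, t, norm=2):
    """TransE-style score ||h + r - t||: lower means more plausible.

    h, r, t are embedding vectors; norm selects the L1 or L2 norm,
    mirroring the 1/2 subscript in the score function above.
    """
    return np.linalg.norm(h + r - t, ord=norm)

# Toy usage with random 50-dimensional embeddings.
rng = np.random.default_rng(0)
h, r, t = (rng.normal(size=50) for _ in range(3))
print(transe_score(h, r, t))      # L2 distance
print(transe_score(h, r, t, 1))   # L1 distance
```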
In order to obtain the vectors of entities and relations, a corrupted triple is generated for each triple in the knowledge graph by replacing either the head entity or the tail entity with a random entity, and the following margin-based loss function is constructed:

$L = \sum_{(h,r,t) \in S} \sum_{(h',r,t') \in S'} \max(0, \gamma + f_r(h, t) - f_r(h', t'))$,

where $\gamma > 0$ is a margin hyperparameter and $S$, $S'$ respectively denote the triple set of the knowledge graph and the corresponding corrupted triple set.
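As a rough illustration of the corruption-and-margin scheme, the sketch below reuses transe_score from above; embed_e and embed_r are assumed to be dictionaries from symbols to vectors, and all names are illustrative.

```python
import random

def margin_loss(triples, entities, embed_e, embed_r, gamma=1.0):
    """Margin-based ranking loss over a set of triples.

    For each (h, r, t), one corrupted triple is built by replacing the
    head or the tail with a random entity; the hinge penalizes correct
    triples that do not beat their corruption by the margin gamma.
    """
    loss = 0.0
    for h, r, t in triples:
        if random.random() < 0.5:
            h_neg, t_neg = random.choice(entities), t   # corrupt head
        else:
            h_neg, t_neg = h, random.choice(entities)   # corrupt tail
        pos = transe_score(embed_e[h], embed_r[r], embed_e[t])
        neg = transe_score(embed_e[h_neg], embed_r[r], embed_e[t_neg])
        loss += max(0.0, gamma + pos - neg)
    return loss
```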

The graph granule and its concept lattice
Generally speaking, the number of triples in a knowledge graph is extremely large. To infer an entity accurately, we only need some relevant information rather than the whole knowledge graph. Thus, in this section, we group each set of triples with the same head entity as a graph granule for deterministic information mining.
Definition 2.3 Given a knowledge graph G = (V, E) and a head entity h ∈ V, the graph granule $G_h = (h, T, R)$ is the subgraph of G in which $T \subseteq V$ is the set of tail entities related to h, and R, drawn from the relation part of E, collects all the relations from the head entity h to the tail entity set T.
It is obvious that a graph granule is a subgraph of the knowledge graph consisting of one head entity and a certain number of tail entities related to it. When a graph granule contains only one tail entity, it degenerates into a triple. Since the head entity of a graph granule is unique, it will be used to denote the graph granule in this paper when no confusion arises.
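The grouping itself is straightforward; the sketch below (with made-up toy identifiers) collects, for each head entity, the (relation, tail) pairs that constitute its graph granule.

```python
from collections import defaultdict

def build_granules(triples):
    """Group triples by head entity: h -> list of (relation, tail) pairs.

    A granule with a single (relation, tail) pair degenerates into an
    ordinary triple, as noted above.
    """
    granules = defaultdict(list)
    for h, r, t in triples:
        granules[h].append((r, t))
    return granules

# Toy triples with hypothetical identifiers.
triples = [("e1", "profession", "actor"),
           ("e1", "profession", "director"),
           ("e1", "nationality", "usa")]
print(build_granules(triples)["e1"])
```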
Example 1: Fig. 1 presents a graph granule chosen from the dataset FB15K used in Section 4, in which the middle ball represents the head entity and the surrounding balls represent the tail entities. The strings below the tail entities denote the relations from the head entity to the tails. Each edge denotes a triple, such as (/m/0q9kd, /people/person/profession, /m/02hrhlq), which means there is a relation "/people/person/profession" from entity "/m/0q9kd" to entity "/m/02hrhlq".
A graph granule contains a lot of information, some of which may be uncertain. FCA-based methods can be used to mine the deterministic information in a graph granule; to this end, we transform the graph granule into a formal context.
Definition 2.4 Given a graph granule $G_h = (h, T, R)$, its formal context is the triple (O, R, I), where $O = \{(h, t_1), (h, t_2), ..., (h, t_{|T|})\}$ is a nonempty, finite set of entity pairs, $R = \{r_1, r_2, ..., r_m\}$ is a set of relations between the entity pairs, and I is a binary relation on $O \times R$: if the triple (h, r, t) is a fact, then the value of I on ((h, t), r) is 1, and otherwise 0.
Example 2: Table 1 shows the formal context of the graph granule in Fig. 1.

Similar to classical concept lattice theory, we can define the concept of a graph granule based on the derivation operations in Definition 2.5.

Definition 2.5 Let (O, R, I) be a formal context. For $X \subseteq O$ and $L \subseteq R$, define $f(X) = \{r \in R \mid (o, r) \in I \text{ for all } o \in X\}$ and $g(L) = \{o \in O \mid (o, r) \in I \text{ for all } r \in L\}$.
Definition 2.6 Let the triple (O, R, I) be a formal context. A pair (X, L), with $X \in P(O)$ and $L \in P(R)$, is called a concept of the formal context (O, R, I) if $f(X) = L$ and $g(L) = X$. The entity pair set X and the relation set L are labeled, respectively, the extension and the intension of the concept (X, L).
For two concepts $(X_1, B_1)$ and $(X_2, B_2)$, the order relation between them can be defined as $(X_1, B_1) \le (X_2, B_2) \iff X_1 \subseteq X_2$ (equivalently, $B_2 \subseteq B_1$). The infimum and supremum of the two concepts are then $(X_1, B_1) \wedge (X_2, B_2) = (X_1 \cap X_2, f(X_1 \cap X_2))$ and $(X_1, B_1) \vee (X_2, B_2) = (g(B_1 \cap B_2), B_1 \cap B_2)$. In what follows, $(h, T^*)$ with $T^* = \{t_1, t_2, ..., t_n\}$ is used to represent the set of entity pairs $\{(h, t_1), ..., (h, t_n)\}$ for convenience.
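For concreteness, here is a small sketch of the two derivation operators on a granule's formal context, represented (under the convention that the head is fixed inside one granule, so objects are identified by their tails) as a dict from tails to relation sets; all identifiers are toy values.

```python
def f(X, incidence):
    """Relations shared by every entity pair in a non-empty X."""
    common = None
    for o in X:
        common = incidence[o] if common is None else common & incidence[o]
    return common if common is not None else set()

def g(L, incidence):
    """Entity pairs that possess every relation in L."""
    return {o for o, rels in incidence.items() if L <= rels}

incidence = {"t1": {"r1", "r2"}, "t2": {"r1"}, "t3": {"r2", "r3"}}
X = {"t1", "t2"}
L = f(X, incidence)            # common relations of X: {'r1'}
print(L, g(L, incidence))      # (X, L) is a concept iff g(f(X)) == X
```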
Example 3: According to Definition 2.6, the formal context (O, R, I) in Table 1 has a total of nine concepts. Fig. 2 shows its concept lattice, in which the head entity "26" is omitted.
Obviously, a graph granule contains many concepts, and it is unnecessary to use all of them for embedding learning. From the definitions of the supremum and infimum between concepts, it is easy to see that the maximal concepts connect more tail entities and hence carry more extensive information. Therefore, in this paper, we focus only on the maximal concepts and develop Algorithm 1 to compute them from each graph granule for KGE learning.
Algorithm 1 The computing algorithm for finding the maximal concepts
Input: Knowledge graph K
Output: The maximal concept set C_max
1: Set C_max = ∅
2: Get the graph granule set G_K
3: For each G_h in G_K
4:   For each r in G_h
5:     Get the frequency fre(r) of r
6:   End
7:   Arrange fre in descending order
8:   For each fre(r) in fre
9:     If there is no (A, B) ∈ C_max satisfying g(r) ⊆ A
10:      C_max ← C_max ∪ {(g(r), f(g(r)))}
11:    End
12:  End
13: End
14: Return C_max

It is easy to verify that the worst-case time complexity of Algorithm 1 is $O(2|G_K||R|)$.
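A possible Python rendering of Algorithm 1 restricted to a single granule, reusing the f and g operators sketched earlier; the dict representation of the granule's formal context is an assumption made for illustration.

```python
from collections import Counter

def maximal_concepts(granule):
    """Sketch of Algorithm 1 for one graph granule.

    granule maps each tail entity to its set of relations. Relations are
    visited in descending frequency; the concept generated by relation r
    is kept only if its extent is not contained in a stored one.
    """
    freq = Counter(r for rels in granule.values() for r in rels)
    c_max = []
    for r, _ in freq.most_common():           # descending frequency
        extent = g({r}, granule)              # tails having relation r
        if any(extent <= A for A, _ in c_max):
            continue                          # already covered: not maximal
        c_max.append((extent, f(extent, granule)))
    return c_max

print(maximal_concepts({"t1": {"r1", "r2"}, "t2": {"r1"}, "t3": {"r2", "r3"}}))
```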

Knowledge graph embedding learning via maximal concepts
From the results obtained in the previous sections, we know that a graph granule shares the mathematical structure of a concept lattice. Since the concepts in the concept lattice describe the generalization and specialization relationships between entities, they can be used to capture the connections between entities in a knowledge graph. Thus, in this section, we develop a novel KGE model, TransGr, by considering the concept information in graph granules. Since the whole knowledge graph contains a large number of concepts and the maximal concepts cover more tail entities, TransGr uses only the maximal concepts for embedding learning.
If we use bold $\mathbf{e}$ to represent the vector of e, then, given a graph granule $G_h$, TransGr learns a matrix $C_i$ ($i = 1, ..., n$) for each of its maximal concepts $(X_i, B_i)$ and represents the concept-projected entity as $e_c = C_e e$, where $C_e$ denotes the matrix of the maximal concept covering e. If there is a relation r between entities h and t, the pair (h, t) must belong to some concepts. Since these concepts reflect the relationship between the two entities, TransGr argues that they can project information of one entity from the other. Thus, hoping that the concepts covering an entity gather the related tail entities together in the vector space, it employs $L_r(h, t) = \|C_h t + r - C_t h\|^2_{1/2}$ as a score of the triple (h, r, t). In addition, in order to capture the connection between $e_c$ and e, TransGr defines the following two functions to keep $h_c$ and $t_c$ from drifting too far away from h and t: $L_h = \|h_c - h\|_2^2$ and $L_t = \|t_c - t\|_2^2$. Fig. 3 displays a simple diagrammatic sketch of TransGr. As shown in Fig. 3, the entities h and t are first projected into the relation space by the corresponding concept planes; then $h_c + r = t_c$ is required in the relation space. Finally, the score function is defined as $f_r(h, t) = \|h_c + r - t_c\|^2_{1/2}$. The score function $f_r(h, t)$ is expected to be lower for a correct triple and higher for an incorrect one. For better convergence, the following restrictions are applied to entities and relations: $\|e\|_2 = 1$ for all $e \in V$, and $\|e_c\|_2 = 1$, $\|r\|_2 = 1$ for all $r \in R$.
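Under the reconstruction above (entities projected by the matrices of their maximal concepts and then translated), the score could be sketched as follows; treating C_h and C_t as dense k×k matrices is an assumption of this illustration, not a claim about the paper's exact implementation.

```python
import numpy as np

def transgr_score(C_h, C_t, h, r, t, norm=2):
    """Concept-projected translation score f_r(h, t) = ||h_c + r - t_c||.

    C_h, C_t: matrices of the maximal concepts covering h and t.
    h, r, t: entity and relation vectors of dimension k.
    """
    h_c = C_h @ h               # project the head by its concept matrix
    t_c = C_t @ t               # project the tail by its concept matrix
    return np.linalg.norm(h_c + r - t_c, ord=norm)
```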
In order to obtain the embeddings of entities and relations, we optimize the following margin-based loss function:

$L = \sum_{(h,r,t) \in S} \sum_{(h',r,t') \in S'} \max(0, \gamma + f_r(h, t) - f_r(h', t'))$,

where $\gamma > 0$ is a margin hyperparameter and $S$, $S'$ respectively denote the triple set of the knowledge graph and the corresponding corrupted triple set. To generate reasonable corrupted triples, each triple is randomly corrupted by replacing either the head or the tail entity with an entity that does not violate the semantics of the corresponding relation types Krompaß et al. (2015).
To solve the optimization problem, mini-batch stochastic gradient descent is used to optimize the objective function; see the details in Algorithm 2. During training, the data are first divided into mini-batches; then, in each iteration, one mini-batch is used to calculate the loss function and update the vectors.

Algorithm 2 The learning algorithm of TransGr
Input: Training set S, entity and relation sets E and R, margin γ, embedding dimension k
Output: The embeddings of entities and relations and the matrices of the maximal concepts
1: Get the maximal concept set C_max by Algorithm 1
2: Initialize
3:   Randomly initialize e for each e ∈ E
4:   Randomly initialize r for each r ∈ R
5:   Randomly initialize C_e for each maximal concept
6:   Calculate e_c and normalize e_c, e and r
7: Loop
8:   S_batch ← sample(S, b)  // sample a mini-batch of size b
9:   T_batch ← ∅  // initialize the set of pairs of triples
10:  For (h, r, t) ∈ S_batch
11:    (h′, r, t′) ← sample a corrupted triple
12:    T_batch ← T_batch ∪ {((h, r, t), (h′, r, t′))}
13:  End
14:  Update the embeddings w.r.t. the gradient of the loss over T_batch
15: End
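A compact PyTorch-style rendering of one loop iteration of Algorithm 2 might look as follows; batching the concept matrices into a single (n_entities, k, k) tensor indexed by entity, and all names, are illustrative assumptions.

```python
import torch

def train_step(pos, neg, ent_emb, rel_emb, conc_mat, gamma, opt):
    """One mini-batch update with the margin-based objective (a sketch).

    pos / neg: LongTensors of shape (b, 3) holding index triples (h, r, t)
    and their corruptions; ent_emb / rel_emb: torch.nn.Embedding tables;
    conc_mat: (n_entities, k, k) tensor of concept matrices; opt: an
    optimizer over all three parameter groups.
    """
    def score(batch):
        h, r, t = batch[:, 0], batch[:, 1], batch[:, 2]
        h_c = torch.bmm(conc_mat[h], ent_emb(h).unsqueeze(-1)).squeeze(-1)
        t_c = torch.bmm(conc_mat[t], ent_emb(t).unsqueeze(-1)).squeeze(-1)
        return (h_c + rel_emb(r) - t_c).norm(dim=-1)

    loss = torch.clamp(gamma + score(pos) - score(neg), min=0).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():       # re-normalize, as required in Section 3
        ent_emb.weight.div_(ent_emb.weight.norm(dim=-1, keepdim=True))
        rel_emb.weight.div_(rel_emb.weight.norm(dim=-1, keepdim=True))
    return loss.item()
```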
It is easy to verify that TransGr has $k^2 n_e + k n_e + k n_r$ parameters in total, and its time complexity is $O(k^2)$.
In the literature, there also exist models that learn vector or matrix representations for units beyond a single triple: for example, CTransR Lin et al. (2015b) learns a distinct relation vector for each entity pair cluster, and the path-based model Lin et al. (2015a) learns a vector or matrix for every relational path. Still, neither entity pair clusters nor relational paths contain the concept information of the knowledge graph. Although Guan et al. (2019) proposed a common-sense-concept-based KGE model, it requires multiple training processes due to its dependence on an additional language database. Unlike these, TransGr groups multiple triples with the same head entity into a graph granule to extract the maximal concepts, and uses the knowledge graph's own concept information to learn the embeddings.

Experiments
In this section, we conduct numerical experiments on two popular knowledge graph completion tasks, link prediction and triple classification, to evaluate the proposed TransGr model. For comparison, we chose two typical knowledge graphs: Wordnet and Freebase. Wordnet is a large lexical database of English, in which each entity represents a synset consisting of several words, and a word can belong to different synsets; Freebase is a collaboratively built website that stores general facts about the world. Six sub-datasets were generated from these two knowledge graphs; Table 2 lists their details. Among them, WN18, WN18RR and WN11 are subsets of Wordnet, while FB15K, FB15K-237 and FB13 are subsets of Freebase. Besides, the datasets WN18RR and FB15K-237 contain no inverse relations.

Link prediction aims to predict the unknown entity or relation in a triple, such as predicting the missing head entity Beijing in the triple (*, capital of, China) or the missing relation "capital of" in the triple (Beijing, *, China). This task is conducted on four datasets: WN18, WN18RR, FB15K and FB15K-237. For comparison, we chose several baseline models: RESCAL Nickel et al. (2011), the translation-based models TransE, TransR, TransH and KG2E, the semantic matching models DistMult, ComplEx and ComplEx-N3, as well as TransG and the FCA-related C-NN.

To generate reasonable corrupted triples, we randomly corrupted each triple by replacing either the head or the tail entity with every entity that does not violate the semantics of the corresponding relation types Krompaß et al. (2015). The score of each corrupted triple was then calculated by the score function $f_r(h, t)$ introduced in Section 3, and the entities were ranked in descending order of plausibility. Considering that the correct entity should rank before the incorrect ones, we expect high values for two indicators: the mean of the reciprocal ranks (MRR) of the correct entities and the proportion of testing triples whose rank is not larger than N (Hits@N). Since false-negative triples generated while producing negative triples can influence the experimental results, corrupted triples that already exist in the training, validation or test datasets were filtered out for a more reasonable result; this is called the "Filter" setting, while the original one is called the "Raw" setting.
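The filtered ranking protocol can be sketched as follows; score_fn stands for the trained model's $f_r$, and all names are illustrative.

```python
def filtered_rank(test_triple, all_entities, known, score_fn):
    """Filtered rank of the correct tail entity for one test triple.

    known: the set of all true triples (train + valid + test); corrupted
    candidates that are themselves true are skipped ("Filter" setting).
    Lower scores mean more plausible triples.
    """
    h, r, t = test_triple
    true_score = score_fn(h, r, t)
    rank = 1
    for e in all_entities:
        if e != t and (h, r, e) not in known and score_fn(h, r, e) < true_score:
            rank += 1
    return rank

def mrr_and_hits(ranks, n=10):
    """MRR and Hits@N over a list of ranks."""
    mrr = sum(1.0 / rk for rk in ranks) / len(ranks)
    hits = sum(rk <= n for rk in ranks) / len(ranks)
    return mrr, hits
```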
For comparison, the experimental results of several baselines were taken directly from the existing literature: the MRR values are from Kazemi and Poole (2018) and Tan et al. (2018), and the results of TransG come from Jia et al. (2018). We ran TransGr with learning rate α = 0.3, k = 50, γ = 10, N = 100 on WN18 and α = 0.7, k = 50, γ = 7, N = 100 on FB15K. Table 3 lists the comparison of MRR and HITS@10 on WN18 and FB15K, in which "-" means that we did not find the corresponding result in the existing literature. It can be seen from Table 3 that TransGr ranks first with 82.2% HITS@10 on WN18 and with 28.9% MRR and 53.9% HITS@10 on FB15K under the "Raw" setting. It also significantly outperforms the translation-based models (TransE, TransR, TransH and KG2E), demonstrating that capturing concept information can benefit embedding learning. Compared with the remaining baselines, TransGr produces better results than DistMult and ComplEx in several respects, such as MRR and HITS@10 on FB15K under the "Filter" setting, but worse results than ComplEx-N3 and C-NN in all respects. Since FB15K is a relatively complex knowledge graph with 1345 relations, this may indicate that more expressive models capture complex datasets better.
To verify the impact of inverse relations on the proposed model, we also compared TransGr with the four most effective baselines on two more challenging datasets with inverse relations removed: WN18RR and FB15K-237. We ran TransGr with learning rate α = 0.1, k = 50, γ = 27, N = 100 on the former dataset and α = 0.1, k = 50, γ = 25, N = 30 on the latter. The evaluation results are listed in Table 4. Compared with Table 3, the performance of TransGr is greatly reduced on these two datasets, and the removal of inverse relations has a smaller influence on FB15K than on WN18. These facts indicate that TransGr is sensitive to the relation types and performs more stably on more complex datasets.
For further evaluation, Table 5 separately reports the results by mapping properties of relations on FB15K, in which the results of TransD Ji et al. (2015) are from the original paper and the others are from TransG Xiao et al. (2016). It shows that TransGr, which takes concepts into consideration, outperforms all the chosen baselines except TransG and KG2E. Although TransGr performs worse than TransG in some aspects on 1-n and n-n relations, it gains up to 3.1% when predicting the head entity of n-1 relations and 4.2% when predicting the tail entity of 1-n relations. Besides, TransGr achieves increases of 0.2% and 0.5%, respectively, when predicting the head and tail entities of 1-1 relations, which may be because taking structured information into account makes better use of the information in the knowledge graph.
Triple classification aims to judge whether a given triple (h, r, t) is correct. For this task, we set a threshold δ_r for each relation r in the knowledge graph by maximizing the classification accuracy on the validation set. Then, for a triple (h, r, t), if its score f_r(h, t) is below δ_r, the triple is judged to be correct; otherwise, it is judged incorrect. During training, we searched for the dimension k in {30, 50, 70, 100}, the learning rate α in {0.5, 0.05, 0.005}, the margin γ in {20, 22, 25, 27} and the batch size N in {50, 100, 200}. The optimal configurations are: α = 0.5, k = 50, γ = 22, N = 200 on WN11; α = 0.1, k = 50, γ = 27, N = 100 on FB13; and α = 0.5, k = 50, γ = 7, N = 100 on FB15K. Table 6 lists the results, in which the results of the baselines are from Xiao et al. (2016). We find that TransGr ranks first with 91.9% accuracy on FB15K, but last with 67.9% and 79.3% accuracy on WN11 and FB13, respectively. To investigate these results further, we collected statistics on the graph granules of the three datasets. By comparison, we find that WN11 and FB13, despite having more entities, have smaller graph granules and quite a few missing ones, up to 10% in FB13. Since enough graph granules are needed to obtain the maximal concepts, the missing granules leave the model insufficiently trained; in short, there is not enough training data for those graph granules.
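The per-relation threshold search can be sketched as below; the scores are assumed to come from the trained model on validation triples of a single relation, and the function name is illustrative.

```python
def best_threshold(pos_scores, neg_scores):
    """Pick delta_r maximizing validation accuracy for one relation.

    A triple is classified as correct when its score is below delta_r,
    matching the decision rule described above.
    """
    def accuracy(delta):
        tp = sum(s < delta for s in pos_scores)   # correct triples kept
        tn = sum(s >= delta for s in neg_scores)  # incorrect ones rejected
        return (tp + tn) / (len(pos_scores) + len(neg_scores))

    return max(sorted(pos_scores + neg_scores), key=accuracy)

print(best_threshold([0.2, 0.4, 0.5], [0.6, 0.7, 0.3]))
```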

Conclusion and Future Work
In this paper, sets of triples sharing the same head entity were grouped as graph granules, and the maximal concepts were obtained by a technique from formal concept analysis. By further exploration, a novel model called TransGr was developed to learn a matrix for each maximal concept in addition to a vector for each entity and relation. The experiments demonstrate that the proposed TransGr model is effective on datasets with relatively complete graph granules.
In this paper, a graph granule only includes the tail entities of a head entity and ignores the triples in which that entity appears as a tail. Since those triples also contain rich information, our future work will focus on how to balance these two sources of information to improve the learning ability of the KGE model.

Compliance with Ethical Standards
Funding: This work is partially supported by the National Key R&D Program of China under Grants 2018YFC0831404 and 2018YFC0830605, and the National Natural Science Foundation of China (Nos. 12071131 and 11971211). Conflict of Interest: The authors declare that they have no conflict of interest. Ethical approval: This article does not contain any studies with animals performed by any of the authors.