## 1 Introduction

Graphs naturally exist in a wide diversity of real-world scenarios, e.g., social graph/diffusion graph in social media networks, citation graph in research areas, user interest graph in electronic commerce area, knowledge graph etc. Analysing these graphs provides insights into how to make good use of the information hidden in graphs, and thus has received significant attention in the last few decades. Effective graph analytics can benefit a lot of applications, such as node classification [1], node clustering [2], node retrieval/recommendation [3], link prediction [4], etc. For example, by analysing the graph constructed based on user interactions in a social network (e.g., retweet/comment/follow in Twitter), we can classify users, detect communities, recommend friends, and predict whether an interaction will happen between two users.

Although graph analytics is practical and essential, most existing graph analytics methods suffer from high computation and space costs. A lot of research effort has been devoted to conducting expensive graph analytics efficiently. Examples include distributed graph data processing frameworks (e.g., GraphX [5], GraphLab [6]), new space-efficient graph storage that reduces the I/O and computation cost [7], and so on.

In addition to the above strategies, graph embedding provides an effective yet efficient way to solve the graph analytics problem. Specifically, graph embedding converts a graph into a low dimensional space in which the graph information is preserved. By representing a graph as a (or a set of) low dimensional vector(s), graph algorithms can then be computed efficiently. There are different types of graphs (e.g., homogeneous graph, heterogeneous graph, attribute graph, etc.), so the input of graph embedding varies in different scenarios. The output of graph embedding is a low-dimensional vector representing a part of the graph (or a whole graph). Fig. 1 shows a toy example of embedding a graph into a 2D space at different granularities, i.e., according to different needs, we may represent a node/edge/substructure/whole-graph as a low-dimensional vector. More details about different types of graph embedding input and output are provided in Sec. 3.

In the early 2000s, graph embedding algorithms were mainly designed to reduce the high dimensionality of non-relational data by assuming the data lie in a low dimensional manifold. Given a set of non-relational high-dimensional data features, a similarity graph is constructed based on the pairwise feature similarity. Then, each node in the graph is embedded into a low-dimensional space where connected nodes are closer to each other. Examples of this line of research are introduced in Sec. 4.1. Since 2010, with the proliferation of graphs in various fields, research in graph embedding started to take a graph as the input and leverage the auxiliary information (if any) to facilitate the embedding. On the one hand, some studies focus on representing a part of the graph (e.g., node, edge, substructure) (Figs. 1(b)-1(d)) as one vector. To obtain such an embedding, they either adopt state-of-the-art deep learning techniques (Sec. 4.2) or design an objective function to optimize the edge reconstruction probability (Sec. 4.3). On the other hand, there is also some work concentrating on embedding the whole graph as one vector (Fig. 1(e)) for graph level applications. Graph kernels (Sec. 4.4) are usually designed to meet this need.

The problem of graph embedding is related to two traditional research problems, i.e., graph analytics [8] and representation learning [9]. Particularly, graph embedding aims to represent a graph as low dimensional vectors while the graph structures are preserved. On the one hand, graph analytics aims to mine useful information from graph data. On the other hand, representation learning obtains data representations that make it easier to extract useful information when building classifiers or other predictors [9]. Graph embedding lies in the overlap of the two problems and focuses on learning the low-dimensional representations. Note that we distinguish graph representation learning and graph embedding in this survey. Graph representation learning does not require the learned representations to be low dimensional. For example, [10] represents each node as a vector whose dimensionality equals the number of nodes in the input graph. Every dimension denotes the geodesic distance of a node to each other node in the graph.

Embedding graphs into low dimensional spaces is not a trivial task. The challenges of graph embedding depend on the problem setting, which consists of embedding input and embedding output. In this survey, we divide the input graph into four categories, including homogeneous graph, heterogeneous graph, graph with auxiliary information and graph constructed from non-relational data. Different types of embedding input carry different information to be preserved in the embedded space and thus pose different challenges to the problem of graph embedding. For example, when embedding a graph with structural information only, the connections between nodes are the target to be preserved. However, for a graph with node label or attribute information, the auxiliary information provides graph property from other perspectives, and thus may also be considered during the embedding. Unlike embedding input which is given and fixed, the embedding output is task driven. For example, the most common type of embedding output is node embedding which represents close nodes as similar vectors. Node embedding can benefit node related tasks such as node classification, node clustering, etc. However, in some cases, the tasks may be related to higher granularity of a graph e.g., node pairs, subgraph, whole graph. Hence, the first challenge in terms of embedding output is to find a suitable embedding output type for the application of interest. We categorize four types of graph embedding output, including node embedding, edge embedding, hybrid embedding and whole-graph embedding. Different output granularities have different criteria for a “good” embedding and face different challenges. For example, a good node embedding preserves the similarity to its neighbouring nodes in the embedded space. In contrast, a good whole-graph embedding represents a whole graph as a vector so that the graph-level similarity is preserved.

Having observed the challenges faced in different problem settings, we propose two taxonomies of graph embedding work, categorizing the graph embedding literature by problem setting and by embedding technique. These two taxonomies correspond to what challenges exist in graph embedding and how existing studies address these challenges. In particular, we first introduce different settings of the graph embedding problem as well as the challenges faced in each setting. Then we describe how existing studies address these challenges in their work, including their insights and their technical solutions.

Note that although a few attempts have been made to survey graph embedding ([11, 12, 13]), they have the following two limitations. First, they usually propose only one taxonomy of graph embedding techniques. None of them analyzes graph embedding work from the perspective of problem setting, nor do they summarize the challenges in each setting. Second, only a limited amount of related work is covered in existing graph embedding surveys. E.g., [11] mainly introduces twelve representative graph embedding algorithms, and [13] focuses on knowledge graph embedding only. Moreover, there is no analysis of the insight behind each graph embedding technique. A comprehensive review of existing graph embedding work and a high level abstraction of the insight for each embedding technique can foster future research in the field.

### 1.1 Our Contributions

Below, we summarize our major contributions in this survey.

We propose a taxonomy of graph embedding based on problem settings and summarize the challenges faced in each setting. We are the first to categorize graph embedding work based on problem setting, which brings new perspectives to understanding existing work.

We provide a detailed analysis of graph embedding techniques. Compared to existing graph embedding surveys, we not only investigate a more comprehensive set of graph embedding work, but also present a summary of the insights behind each technique. In contrast to simply listing how the graph embedding was solved in the past, the summarized insights answer the questions of why the graph embedding can be solved in a certain way. This can serve as an insightful guideline for future research.

We systematically categorize the applications that graph embedding enables and divide the applications as node related, edge related and graph related. For each category, we present detailed application scenarios as the reference.

We suggest four promising future research directions in the field of graph embedding in terms of computational efficiency, problem settings, solution techniques and applications. For each direction, we provide a thorough analysis of its disadvantages (deficiency) in current work and propose future research direction(s).

### 1.2 Organization of The Survey

The rest of this survey is organized as follows. In Sec. 2, we introduce the definitions of the basic concepts required to understand the graph embedding problem, and then provide a formal problem definition of graph embedding. In the next two sections, we provide two taxonomies of graph embedding, where the taxonomy structures are illustrated in Fig. 2. Sec. 3 compares the related work based on the problem settings and summarizes the challenges faced in each setting. In Sec. 4, we categorize the literature based on the embedding techniques. The insights behind each technique are abstracted, and a detailed comparison of different techniques is provided at the end. After that, we present the applications that graph embedding enables in Sec. 5. We then discuss four potential future research directions in Sec. 6 and conclude this survey in Sec. 7.

## 2 Problem Formalization

In this section, we first introduce the definition of the basic concepts in graph embedding, and then provide a formal definition of the graph embedding problem.

### 2.1 Notation and Definition

The detailed descriptions of the notations used in this survey can be found in Table I.

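| Notations | Descriptions |
|---|---|
| $\lvert \cdot \rvert$ | The cardinality of a set |
| $G = (V, E)$ | Graph $G$ with node set $V$ and edge set $E$ |
| $\hat{G} = (\hat{V}, \hat{E})$ | A substructure of graph $G$, where $\hat{V} \subseteq V$, $\hat{E} \subseteq E$ |
| $v_i$, $e_{ij}$ | A node $v_i \in V$ and an edge $e_{ij} \in E$ connecting $v_i$ and $v_j$ |
| $A$ | The adjacency matrix of $G$ |
| $A_i$ | The $i$-th row vector of matrix $A$ |
| $A_{i,j}$ | The entry in the $i$-th row and $j$-th column of matrix $A$ |
| $f_v(v_i)$, $f_e(e_{ij})$ | Type of node $v_i$ and type of edge $e_{ij}$ |
| $T^v$, $T^e$ | The node type set and edge type set |
| $N_k(v_i)$ | The $k$ nearest neighbours of node $v_i$ |
| $X \in \mathbb{R}^{\lvert V \rvert \times N}$ | A feature matrix, each row $X_i$ is an $N$-dimensional vector for $v_i$ |
| $y_i$, $y_{ij}$, $y_{\hat{G}}$ | The embedding of node $v_i$, edge $e_{ij}$, and structure $\hat{G}$ |
| $d$ | The dimensionality of the embedding |
| $\langle h, r, t \rangle$ | A knowledge graph triplet, with head entity $h$, tail entity $t$ and the relation $r$ between them |
| $s_{ij}^{(1)}$, $s_{ij}^{(2)}$ | First- and second-order proximity between node $v_i$ and $v_j$ |
| $c$ | An information cascade |
| $G^c = (V^c, E^c)$ | A cascade graph which adopts the cascade $c$ |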

###### Definition 1

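A graph is $G = (V, E)$, where $v \in V$ is a node and $e \in E$ is an edge. $G$ is associated with a node type mapping function $f_v: V \rightarrow T^v$ and an edge type mapping function $f_e: E \rightarrow T^e$.

$T^v$ and $T^e$ denote the set of node types and edge types, respectively. Each node $v_i \in V$ belongs to one particular type, i.e., $f_v(v_i) \in T^v$. Similarly, for $e_{ij} \in E$, $f_e(e_{ij}) \in T^e$.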

###### Definition 2

A homogeneous graph $G_{homo} = (V, E)$ is a graph in which $\lvert T^v \rvert = \lvert T^e \rvert = 1$. All nodes in $G$ belong to a single type and all edges belong to one single type.

###### Definition 3

A heterogeneous graph $G_{hete} = (V, E)$ is a graph in which $\lvert T^v \rvert > 1$ and/or $\lvert T^e \rvert > 1$.
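Definitions 1 to 3 can be read as a small data structure: a graph plus two type mapping functions, with the homogeneous/heterogeneous distinction reduced to counting the distinct types. The sketch below is only an illustration under that reading; the class name `TypedGraph`, its methods, and the example node and edge types are hypothetical and do not come from the surveyed work.

```python
from dataclasses import dataclass, field


@dataclass
class TypedGraph:
    """A graph with a node type mapping and an edge type mapping (hypothetical helper)."""
    node_type: dict = field(default_factory=dict)  # node -> node type
    edge_type: dict = field(default_factory=dict)  # (u, v) -> edge type

    def add_node(self, node, ntype="node"):
        self.node_type[node] = ntype

    def add_edge(self, u, v, etype="edge"):
        if u not in self.node_type or v not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.edge_type[(u, v)] = etype

    def node_types(self):
        return set(self.node_type.values())

    def edge_types(self):
        return set(self.edge_type.values())

    def is_homogeneous(self):
        # Exactly one node type and exactly one edge type.
        return len(self.node_types()) == 1 and len(self.edge_types()) == 1

    def is_heterogeneous(self):
        # More than one node type and/or more than one edge type.
        return len(self.node_types()) > 1 or len(self.edge_types()) > 1


# A social graph with a single node type and a single edge type.
social = TypedGraph()
for user in ("u1", "u2", "u3"):
    social.add_node(user, "user")
social.add_edge("u1", "u2", "follows")
social.add_edge("u2", "u3", "follows")

# A question-answering graph with several node and edge types.
cqa = TypedGraph()
cqa.add_node("alice", "user")
cqa.add_node("q1", "question")
cqa.add_node("a1", "answer")
cqa.add_edge("alice", "q1", "asks")
cqa.add_edge("q1", "a1", "answered_by")

print(social.is_homogeneous(), cqa.is_heterogeneous())
```

Note that a graph without any edges has zero edge types, so this sketch classifies it as neither homogeneous nor heterogeneous; how to treat that corner case is a design choice the definitions leave open.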

###### Definition 4

A knowledge graph $G_{know} = (V, E)$ is a directed graph whose nodes are entities and edges are subject-property-object triple facts. Each edge of the form (head entity, relation, tail entity) (denoted as $\langle h, r, t \rangle$) indicates a relationship of $r$ from entity $h$ to entity $t$.

$h, t \in V$ are entities and $r \in E$ is the relation. In this survey, we call $\langle h, r, t \rangle$ a knowledge graph triplet. For example, the knowledge graph in Fig. 3 contains two triplets. Note that the entities and relations in a knowledge graph are usually of different types [14, 15]. Hence, a knowledge graph can be viewed as an instance of the heterogeneous graph.

The following proximity measures are usually adopted to quantify the graph property to be preserved in the embedded space. The first-order proximity is the local pairwise similarity between only the nodes connected by edges. It compares the direct connection strength between a node pair. Formally,

###### Definition 5

The first-order proximity between node $v_i$ and node $v_j$ is the weight of the edge $e_{ij}$, i.e., $A_{i,j}$.

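Two nodes are more similar if they are connected by an edge with larger weight. Denoting the first-order proximity between node $v_i$ and $v_j$ as $s_{ij}^{(1)}$, we have $s_{ij}^{(1)} = A_{i,j}$. Let $s_i^{(1)} = [s_{i1}^{(1)}, s_{i2}^{(1)}, \cdots, s_{i\lvert V \rvert}^{(1)}]$ denote the first-order proximity between $v_i$ and other nodes. Taking the graph in Fig. 1(a) as an example, the first-order proximity between two connected nodes is the weight of the edge joining them, and $s_i^{(1)}$ records the weights of the edges connecting $v_i$ and the other nodes in the graph, i.e., it is the $i$-th row $A_i$ of the adjacency matrix.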

The second-order proximity compares the similarity of the nodes’ neighbourhood structures. The more similar two nodes’ neighbourhoods are, the larger the second-order proximity value between them. Formally,

###### Definition 6

The second-order proximity $s_{ij}^{(2)}$ between node $v_i$ and $v_j$ is a similarity between $v_i$’s neighbourhood $s_i^{(1)}$ and $v_j$’s neighbourhood $s_j^{(1)}$.

Again, take Fig. 1(a) as an example: $s_{ij}^{(2)}$ is the similarity between $s_i^{(1)}$ and $s_j^{(1)}$. Let us consider cosine similarity, i.e., $s_{ij}^{(2)} = \cos(s_i^{(1)}, s_j^{(1)})$. The second-order proximity between two nodes equals zero if they do not share any common 1-hop neighbour, whereas two nodes that share a common neighbour have a second-order proximity larger than zero.

The higher-order proximity can be defined likewise. For example, the $k$-th-order proximity between node $v_i$ and $v_j$ is the similarity between $s_i^{(k-1)}$ and $s_j^{(k-1)}$. Note that sometimes the higher-order proximities are also defined using some other metrics, e.g., Katz Index, Rooted PageRank, Adamic Adar, etc. [11].
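To make the two proximity measures concrete, the following sketch computes them on a small weighted graph. The adjacency matrix is a made-up example (it is not the graph of Fig. 1(a)), the function names are hypothetical, and cosine similarity is used for the second-order proximity as in the discussion above.

```python
import math


def first_order_proximity(adj, i, j):
    """First-order proximity: the weight of the edge between nodes i and j (0 if absent)."""
    return adj[i][j]


def cosine(u, v):
    """Cosine similarity of two vectors; defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def second_order_proximity(adj, i, j):
    """Second-order proximity: cosine similarity of the two nodes' first-order proximity vectors."""
    return cosine(adj[i], adj[j])


# Hypothetical undirected weighted graph with 5 nodes:
# a triangle 0-1 (1.0), 0-2 (2.0), 1-2 (1.0) and a separate edge 3-4 (1.0).
adj = [
    [0.0, 1.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]

print(first_order_proximity(adj, 0, 2))   # weight of edge 0-2
print(second_order_proximity(adj, 0, 1))  # nodes 0 and 1 share neighbour 2
print(second_order_proximity(adj, 0, 3))  # nodes 0 and 3 share no neighbour
```

Nodes 0 and 1 share the common neighbour 2, so their second-order proximity is positive, while nodes 0 and 3 have no common neighbour and their second-order proximity is zero.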

It is worth noting that, in some work, the first-order and second-order proximities are empirically calculated based on the joint probability and conditional probability of two nodes. More details are discussed in Sec. 4.3.2.

###### Problem 1

Graph embedding: Given the input of a graph $G = (V, E)$, and a predefined dimensionality of the embedding $d$ ($d \ll \lvert V \rvert$), the problem of graph embedding is to convert $G$ into a $d$-dimensional space, in which the graph property is preserved as much as possible. The graph property can be quantified using proximity measures such as the first- and higher-order proximity. Each graph is represented as either a $d$-dimensional vector (for a whole graph) or a set of $d$-dimensional vectors with each vector representing the embedding of part of the graph (e.g., node, edge, substructure).
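As a rough illustration of the node-level version of this problem, the sketch below learns one low-dimensional vector per node so that the inner product of two node vectors approximates their first-order proximity. This is only one possible instantiation, chosen here for brevity: the squared reconstruction objective, the plain gradient descent, and all names and hyperparameters (`embed_nodes`, `steps`, `lr`, `seed`) are assumptions of this example, not a method prescribed by the problem definition.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def reconstruction_loss(adj, emb):
    """Sum over ordered node pairs (i != j) of (edge weight - inner product)^2."""
    n = len(adj)
    return sum(
        (adj[i][j] - dot(emb[i], emb[j])) ** 2
        for i in range(n)
        for j in range(n)
        if i != j
    )


def embed_nodes(adj, d=2, steps=500, lr=0.01, seed=0):
    """Gradient descent on reconstruction_loss; returns one d-dimensional vector per node."""
    rng = random.Random(seed)
    n = len(adj)
    emb = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(n)]
    for _ in range(steps):
        grad = [[0.0] * d for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                err = dot(emb[i], emb[j]) - adj[i][j]
                for k in range(d):
                    # The term for pair (i, j) depends on both emb[i] and emb[j].
                    grad[i][k] += 2.0 * err * emb[j][k]
                    grad[j][k] += 2.0 * err * emb[i][k]
        for i in range(n):
            for k in range(d):
                emb[i][k] -= lr * grad[i][k]
    return emb


# Hypothetical weighted graph: a triangle (nodes 0, 1, 2) and a separate edge (nodes 3, 4).
adj = [
    [0.0, 1.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]

initial = embed_nodes(adj, steps=0)  # the random starting point
trained = embed_nodes(adj)           # 5 nodes, each embedded as a 2D vector
print(reconstruction_loss(adj, initial), reconstruction_loss(adj, trained))
```

The output is a set of 2D vectors, one per node, which matches the "set of $d$-dimensional vectors" case of the problem statement; a whole-graph embedding would instead return a single vector.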

Fig. 1 shows a toy example of graph embedding with $d = 2$. Given an input graph (Fig. 1(a)), graph embedding algorithms are applied to convert a node (Fig. 1(b)), edge (Fig. 1(c)), substructure (Fig. 1(d)) or whole graph (Fig. 1(e)) into a 2D vector (i.e., a point in a 2D space). In the next two sections, we provide two taxonomies of graph embedding, by categorizing the graph embedding literature based on problem settings and embedding techniques respectively.

## 3 Problem Settings of Graph Embedding

In this section, we compare existing graph embedding work from the perspective of problem setting, which consists of the embedding input and the embedding output. For each setting, we first introduce different types of graph embedding input or output, and then summarize the challenges faced in each setting at the end.

We start with graph embedding input. As a graph embedding setting consists of both input and output, we use node embedding as an example embedding output setting during the introduction of different types of input. The reason is that although there exist various types of embedding output, the majority of graph embedding studies focus on node embedding, i.e., embedding nodes to a low dimensional space where the node similarity in the input graph is preserved. More details about node embedding and other types of embedding output are presented in Sec. 3.2.

### 3.1 Graph Embedding Input

The input of graph embedding is a graph. In this survey, we divide graph embedding input into four categories: homogeneous graph, heterogeneous graph, graph with auxiliary information and constructed graph. Each type of graph poses different challenges to graph embedding. Next, we introduce these four types of input graphs and summarize the challenges faced in each input setting.

#### 3.1.1 Homogeneous Graph

The first category of input graph is the homogeneous graph (Def. 2), in which both nodes and edges belong to a single type respectively. The homogeneous graph can be further categorized as weighted or unweighted, and as directed or undirected, as in the examples shown in Fig. 4.

Undirected and unweighted homogeneous graph is the most basic graph embedding input setting. A number of studies work under this setting, e.g., [1, 16, 17, 18, 19]. They treat all nodes and edges equally, as only the basic structural information of the input graph is available.

Intuitively, the weights and directions of the edges provide more information about the graph, and may help represent the graph more accurately in the embedded space. For example, in the weighted graph of Fig. 4(a), a node should be embedded closer to the neighbour it shares a higher-weighted edge with than to a neighbour connected by a lower-weighted edge. Similarly, in the directed graph of Fig. 4(b), a node should be embedded closer to a neighbour it is connected to in both directions than to one connected in a single direction. This information is lost in the unweighted and undirected graph. Noticing the advantages of exploiting the weight and direction properties of the graph edges, the graph embedding community has started to explore weighted and/or directed graphs. Some studies focus on only one graph property, i.e., either edge weight or edge direction. On the one hand, the weighted graph is considered in [20, 21, 22, 23, 24, 25]. Nodes connected by higher-weighted edges are embedded closer to each other. However, this work is still limited to undirected graphs. On the other hand, some work distinguishes the directions of the edges during the embedding process and preserves the direction information in the embedded space. One example of a directed graph is the social network graph, e.g., [26]. Each user has both followership and followeeship with other users. However, the weight information is unavailable for the social user links. Recently, more general graph embedding algorithms have been proposed, in which both weight and direction properties are considered. In other words, these algorithms (e.g., [27, 3, 28]) can process both directed and undirected, as well as weighted and unweighted, graphs.

Challenge: How to capture the diversity of connectivity patterns observed in graphs? Since only structural information is available in homogeneous graphs, the challenge of homogeneous graph embedding lies in how to preserve these connectivity patterns observed in the input graphs during embedding.

#### 3.1.2 Heterogeneous Graph

The second category of input is the heterogeneous graph (Def. 3), which mainly exists in the three scenarios below.

Community-based Question Answering (cQA) sites. cQA is an Internet-based crowdsourcing service that enables users to post questions on a website, which are then answered by other users [29]. Intuitively, there are different types of nodes in a cQA graph, e.g., question, answer, user. Existing cQA graph embedding methods differ from each other in terms of the links they exploit, as summarized in Table II, where the ordered tuple records that a given answer provided by one user obtains more votes (i.e., thumb-ups) than another answer provided by another user for the same question.

| GE Algorithm | Links Exploited |
|---|---|
| [30] | user-user, user-question |
| [31] | user-user, user-question, question-answer |
| [29] | user-user, question-answer, user-answer |
| [32] | users’ asymmetric following links, an ordered tuple |

Multimedia Networks. A multimedia network is a network containing multimedia data, e.g., image, text, etc. For example, both [33] and [34] embed the graphs containing two types of nodes (image and text) and three types of links (the co-occurrence of image-image, text-text and image-text). [35] processes a social curation with user node and image node. It exploits user-image links to embed users and images into the same space so that they can be directly compared for image recommendation. In [36], a click graph is considered which contains images and text queries. The image-query edge indicates a click of an image given a query, where the click count serves as the edge weight.

Knowledge Graphs. In a knowledge graph (Def. 4), the entities (nodes) and relations (edges) are usually of different types. For example, in a film related knowledge graph constructed from Freebase [37], the types of entities can be “director”, “actor”, “film”, etc. The types of relations can be “produce”, “direct”, “act in”. A lot of effort has been devoted to embedding knowledge graphs (e.g., [38, 39, 40]). We will introduce them in detail in Sec. 4.3.3.

Other heterogeneous graphs also exist. For instance, [41] and [42] work on the mobility data graph, in which the station (s), role (r) and company (c) nodes are connected by three types of links (s-s, s-r, s-c). [43] embeds a Wikipedia graph with three types of nodes (entity (e), category (c) and word (w)) and three types of edges (e-e, e-c, w-w). In addition to the above graphs, there are some general heterogeneous graphs in which the types of nodes and edges are not specifically defined [44, 45, 46].

Challenge: How to explore global consistency between different types of objects, and how to deal with the imbalances of objects belonging to different types, if any?

Different types of objects (e.g., nodes, edges) are embedded into the same space in heterogeneous graph embedding. How to explore the global consistency between them is a problem. Moreover, there may exist imbalance between objects of different types. This data skewness should be considered in embedding.

#### 3.1.3 Graph with Auxiliary Information

The third category of input graph contains auxiliary information of a node/edge/whole-graph in addition to the structural relations of nodes (i.e., $E$). Generally, there are five different types of auxiliary information as listed in Table III.

| Auxiliary Information | Description |
|---|---|
| label | categorical value of a node/edge, e.g., class information |
| attribute | categorical or continuous value of a node/edge, e.g., property information |
| node feature | text or image feature for a node |
| information propagation | the paths of how the information is propagated in graphs |
| knowledge base | text associated with or facts between knowledge concepts |

Label: Nodes with different labels should be embedded far away from each other. In order to achieve this, [47] and [48] jointly optimize the embedding objective function together with a classifier function. [49] puts a penalty on the similarity between nodes with different labels. [50] considers node labels and edge labels when calculating different graph kernels. [51] and [52] embed a knowledge graph, in which the entity (node) has a semantic category. [53] embeds a more complicated knowledge graph with the entity categories in a hierarchical structure, e.g., the category “book” has two sub-categories “author” and “writtenwork”.

Attribute: In contrast to a label, an attribute value can be discrete or continuous. For example, [54] embeds a graph with discrete node attribute value (e.g., the atomic number in a molecule). In contrast, [4] represents the node attribute as a continuous high-dimensional vector (e.g., user attribute features in social networks). [55] deals with both discrete and continuous attributes for nodes and edges.

Node feature: Most node features are text, which are provided either as a feature vector for each node [56, 57] or as a document [58, 59, 60, 61]. For the latter, the documents are further processed to extract feature vectors using techniques such as bag-of-words [58], topic modelling [59, 60], or treating “word” as one type of node [61]. Other types of node features, such as image features [33], are also possible. Node features enhance the graph embedding performance by providing rich and unstructured information, which is available in many real-world graphs. Moreover, it makes inductive graph embedding possible [62].

Information propagation: An example of information propagation is “retweet” in Twitter. In [63], given a data graph $G = (V, E)$, a cascade graph $G^c = (V^c, E^c)$ is constructed for each cascade $c$, where $V^c$ are the nodes that have adopted $c$ and $E^c$ are the edges with both ends in $V^c$. They then embed $G^c$ to predict the increment of cascade size. In contrast, [64] aims to embed the users and content information, such that the similarity between their embeddings indicates a diffusion probability. Topo-LSTM [65] considers a cascade not merely as a sequence of nodes, but as a dynamic directed acyclic graph for embedding.
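The cascade graph construction described for [63] amounts to taking the subgraph induced by the adopters of a cascade. A minimal sketch of that step follows; the function name `cascade_graph`, the edge-list representation, and the example data are hypothetical, and the subsequent embedding of the cascade graph is not shown.

```python
def cascade_graph(edges, adopters):
    """Return the adopter set and the edges of the data graph with both ends among the adopters."""
    cascade_nodes = set(adopters)
    cascade_edges = [
        (u, v) for (u, v) in edges if u in cascade_nodes and v in cascade_nodes
    ]
    return cascade_nodes, cascade_edges


# Hypothetical data graph (e.g., follower links) and the users who adopted one cascade.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e")]
adopters = ["a", "b", "d"]

cascade_nodes, cascade_edges = cascade_graph(edges, adopters)
print(sorted(cascade_nodes), cascade_edges)
```

Edges such as ("b", "c") are dropped because only one of their endpoints has adopted the cascade.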

Knowledge base: Popular knowledge bases include Wikipedia [66], Freebase [37], YAGO [67], DBpedia [68], etc. Taking Wikipedia as an example, the concepts are entities proposed by users and the text is the article associated with each entity. [66] uses a knowledge base to learn a social knowledge graph from a social network by linking each social network user to a given set of knowledge concepts. [69] represents queries and documents in the entity space (provided by a knowledge base) so that the academic search engine can understand the meaning of research concepts in queries.

Other types of auxiliary information include user check-in data (user-location) [70], user item preference ranking list [71], etc. Note that the auxiliary information is not just limited to one type. For instance, [62] and [72] consider both label and node feature information. [73] utilizes node contents and labels to assist the graph embedding process.

Challenge: How to incorporate the rich and unstructured information so that the learnt embeddings both represent the topological structure and are discriminative in terms of the auxiliary information? The auxiliary information helps to define node similarity in addition to graph structural information. The challenge of embedding a graph with auxiliary information is how to combine these two information sources to define the node similarity to be preserved.

#### 3.1.4 Graph Constructed from Non-relational Data

The last category of input graph is not provided, but constructed from non-relational input data by different strategies. This usually happens when the input data is assumed to lie in a low dimensional manifold.

In most cases, the input is a feature matrix $X \in \mathbb{R}^{\lvert V \rvert \times N}$ where each row $X_i$ is an $N$-dimensional feature vector for the $i$-th training instance. A similarity matrix $S$ is constructed by calculating $S_{ij}$ using the similarity between ($X_i$, $X_j$). There are usually two ways to construct a graph from $S$. A straightforward way is to directly treat $S$ as the adjacency matrix $A$ of an invisible graph [74]. However, [74] is based on the Euclidean distance and it does not consider the neighbouring nodes when calculating $S_{ij}$. If $X$ lies on or near a curved manifold, the distance between $X_i$ and $X_j$ over the manifold is much larger than their Euclidean distance [12]. To address these issues, other methods (e.g., [75, 76, 77]) construct a K nearest neighbour (KNN) graph from $S$ first and estimate the adjacency matrix $A$ based on the KNN graph. For example, Isomap [78] incorporates the geodesic distance in $A$. It first constructs a KNN graph from $S$, and then finds the shortest path between two nodes as the geodesic distance between them. To reduce the cost of KNN graph construction ($O(\lvert V \rvert^2)$), [79] constructs an Anchor graph instead, whose cost is $O(\lvert V \rvert)$ in terms of both time and space consumption. They first obtain a set of clustering centers as virtual anchors and find the K nearest anchors of each node for anchor graph construction.

Another way of graph construction is to establish edges between nodes based on the nodes’ co-occurrence. For example, to facilitate image related applications (e.g., image segmentation, image classification), researchers (e.g., [80, 81, 82]) construct a graph from each image by treating pixels as nodes and the spatial relations between pixels as edges. [83] extracts three types of nodes (location, time and message) from the GTMS record and therefore forms six types of edges between these nodes. [84] generates a graph using entity mention, target type and text feature as nodes, and establishes three kinds of edges: mention-type, mention-feature and type-type.
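The KNN-graph route can be sketched in a few lines: connect each instance to its nearest neighbours, then take shortest-path lengths in that graph as geodesic distances, in the spirit of the Isomap description above. The code below is a simplified illustration, not the published algorithm: the symmetrisation of the KNN graph, the use of Dijkstra's algorithm, and the toy U-shaped data are assumptions of this example.

```python
import heapq
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_graph(points, k):
    """Symmetric KNN graph as {node: {neighbour: Euclidean distance}}."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted(
            (euclidean(points[i], points[j]), j) for j in range(n) if j != i
        )
        for dist, j in dists[:k]:
            graph[i][j] = dist
            graph[j][i] = dist  # keep the graph undirected
    return graph


def geodesic_distances(graph, source):
    """Shortest-path lengths from source to every node (Dijkstra's algorithm)."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


# Ten points sampled along a U-shaped curve (a toy one-dimensional manifold in 2D).
points = [(0, 3), (0, 2), (0, 1), (0, 0), (1, 0),
          (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]

graph = knn_graph(points, k=2)
geo = geodesic_distances(graph, source=0)
print(euclidean(points[0], points[9]), geo[9])
```

The two ends of the U are close in Euclidean distance but far apart along the curve, which is the situation the manifold argument above refers to: the geodesic distance through the KNN graph follows the curve instead of cutting across it.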

In addition to the above pairwise similarity based and node co-occurrence based methods, other graph construction strategies have been designed for different purposes. For example, [85] constructs an intrinsic graph to capture the intraclass compactness, and a penalty graph to characterize the interclass separability. The former is constructed by connecting each data point with its neighbours of the same class, while the latter connects the marginal points across different classes. [86] constructs a signed graph to exploit the label information. Two nodes are connected by a positive edge if they belong to the same class, and a negative edge if they are from two classes. [87] includes all instances with a common label into one hyperedge to capture their joint similarity. In [88], two feedback graphs are constructed to gather together relevant pairs and keep away irrelevant ones after embedding. In the positive graph, two nodes are connected if they are both relevant. In the negative graph, two nodes are connected only when one node is relevant and the other is irrelevant.
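As one concrete case of these label-driven constructions, the signed graph can be sketched as follows: every pair of instances receives a positive edge when the labels agree and a negative edge otherwise. Connecting all pairs, the dictionary representation, and the function name `signed_graph` are assumptions of this sketch; [86] may restrict or weight the edges differently.

```python
def signed_graph(labels):
    """Return {(i, j): +1 or -1} for every unordered pair i < j of labelled instances."""
    n = len(labels)
    return {
        (i, j): 1 if labels[i] == labels[j] else -1
        for i in range(n)
        for j in range(i + 1, n)
    }


# Hypothetical class labels for four instances.
labels = ["cat", "cat", "dog", "dog"]
signs = signed_graph(labels)
print(signs)
```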

Challenge: How to construct a graph that encodes the pairwise relations between instances and how to preserve the generated node proximity matrix in the embedded space? The first challenge faced by embedding graphs constructed from non-relational data is how to compute the relations between the non-relational data and construct such a graph. After the graph is constructed, the challenge becomes the same as in other input graphs, i.e., how to preserve the node proximity of the constructed graph in the embedded space.

### 3.2 Graph Embedding Output

The output of graph embedding is a (set of) low dimensional vector(s) representing (part of) a graph. Based on the output granularity, we divide graph embedding output into four categories, including node embedding, edge embedding, hybrid embedding and whole-graph embedding. Different types of embedding facilitate different applications.

Unlike embedding input which is fixed and given, the embedding output is task driven. For example, node embedding can benefit a wide variety of node related graph analysis tasks. By representing each node as a vector, the node related tasks such as node clustering, node classification, can be performed efficiently in terms of both time and space. However, graph analytics tasks are not always at node level. In some scenarios, the tasks may be related to higher granularity of a graph, such as node pairs, subgraph, or even a whole graph. Hence, the first challenge in terms of embedding output is how to find a suitable type of embedding output which meets the needs of the specific application task.

#### 3.2.1 Node Embedding

As the most common embedding output setting, node embedding represents each node as a vector in a low dimensional space. Nodes that are “close” in the graph are embedded to have similar vector representations. The differences between various graph embedding methods lie in how they define the “closeness” between two nodes. First-order proximity (Def. 5) and second-order proximity (Def. 6) are two commonly adopted metrics for pairwise node similarity calculation. In some work, higher-order proximity is also explored to a certain extent. For example, [21] captures the $k$-step neighbour relations in their embedding. Both [1] and [89] embed two nodes belonging to the same community closer to each other.

Challenge: How to define the pairwise node proximity in various types of input graph and how to encode the proximity in the learnt embeddings? The challenges of node embedding mainly come from defining the node proximity in the input graph. In Sec 3.1, we have elaborated the challenges of node embedding with different types of input graphs.

Next, we will introduce other types of embedding output as well as the new challenges posed by these outputs.

#### 3.2.2 Edge Embedding

In contrast to node embedding, edge embedding aims to represent an edge as a low-dimensional vector. Edge embedding is useful in the following two scenarios.

Firstly, knowledge graph embedding (e.g., [90, 91, 92]) learns embeddings for both nodes and edges. Each edge is a triplet $\langle h, r, t \rangle$ (Def. 4). The embedding is learnt to preserve $r$ between $h$ and $t$ in the embedded space, so that a missing entity/relation can be correctly predicted given the other two components in $\langle h, r, t \rangle$. Secondly, some work (e.g., [28, 64]) embeds a node pair as a vector feature to either make the node pair comparable to other nodes or predict the existence of a link between two nodes. For instance, [64] proposes a content-social influential feature to predict user-user interaction probability given a content. It embeds both the user pairs and content in the same space. [28] embeds a pair of nodes using a bootstrapping approach over the node embedding, to facilitate the prediction of whether a link exists between two nodes in a graph.

In summary, edge embedding benefits edge (/node pairs) related graph analysis, such as link prediction, knowledge graph entity/relation prediction, etc.
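As a rough illustration of the second scenario, a node-pair feature can be derived from two already-learnt node embeddings with an element-wise operator. The sketch below is an assumption-laden example: the function name `pair_embedding` and the three operators (average, Hadamard product, absolute difference) are chosen for illustration and are not claimed to be the exact procedure of [28] or [64].

```python
from typing import List, Sequence


def pair_embedding(y_i: Sequence[float], y_j: Sequence[float], op: str = "hadamard") -> List[float]:
    """Combine two node embeddings into one node-pair (edge) feature vector."""
    if len(y_i) != len(y_j):
        raise ValueError("embeddings must have the same dimensionality")
    if op == "average":
        return [(a + b) / 2.0 for a, b in zip(y_i, y_j)]
    if op == "hadamard":
        return [a * b for a, b in zip(y_i, y_j)]
    if op == "l1":
        return [abs(a - b) for a, b in zip(y_i, y_j)]
    raise ValueError(f"unknown operator: {op}")


# Hypothetical embeddings of two nodes u and v.
y_u = [1.0, -2.0, 0.5]
y_v = [3.0, 4.0, -1.0]
print(pair_embedding(y_u, y_v, "average"))
print(pair_embedding(y_u, y_v, "hadamard"))
print(pair_embedding(y_u, y_v, "l1"))
```

All three operators are symmetric in their arguments, so on their own they cannot encode the direction of an edge.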

Challenge: How to define the edge-level similarity and how to model the asymmetric property of the edges, if any? The edge proximity is different from node proximity as an edge contains a pair of nodes and usually denotes the pairwise node relation. Moreover, unlike nodes, edges may be directed. This asymmetric property should be encoded in the learnt edge representations.

#### 3.2.3 Hybrid Embedding

Hybrid embedding is the embedding of a combination of different types of graph components, e.g., node + edge (i.e., substructure), node + community.

Substructure embedding has been studied in a number of works. For example, [44] embeds the graph structure between two possibly distant nodes to support semantic proximity search. [93] learns the embedding for subgraphs (e.g., graphlets) so as to define the graph kernels for graph classification. [94] utilizes a knowledge base to enrich the information about the answer. It embeds both the path and the subgraph from the question entity to the answer entity.

Compared to subgraph embedding, community embedding has only attracted limited attention. [1] proposes to consider a community-aware proximity for node embedding, such that a node’s embedding is similar to its community’s embedding. ComE [89] also jointly solves node embedding, community detection and community embedding. Rather than representing a community as a vector, it defines each community embedding as a multivariate Gaussian distribution so as to characterize how its member nodes are distributed.

The embedding of substructure or community can also be derived by aggregating the individual node and edge embedding inside it. However, such a kind of “indirect” approach is not optimized to represent the structure. Moreover, node embedding and community embedding can reinforce each other. Better node embedding is learnt by incorporating the community-aware high-order proximity, while better communities are detected when more accurate node embedding is generated.
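The “indirect” approach mentioned above can be sketched as simple pooling over member nodes. In this minimal example the mean vector plays the role of the aggregated community embedding, and the per-dimension variance is only a crude diagonal stand-in for a Gaussian-style description of how members are spread. The function name `community_summary` and the toy data are hypothetical, and this is not the joint learning procedure of ComE [89].

```python
from typing import Dict, List, Sequence, Tuple


def community_summary(
    node_emb: Dict[str, Sequence[float]], members: List[str]
) -> Tuple[List[float], List[float]]:
    """Return the per-dimension mean and variance of the member node embeddings."""
    vectors = [node_emb[v] for v in members]
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(vec[k] for vec in vectors) / n for k in range(dim)]
    var = [sum((vec[k] - mean[k]) ** 2 for vec in vectors) / n for k in range(dim)]
    return mean, var


# Toy node embeddings; node "d" is not part of the community.
node_emb = {"a": [0.0, 1.0], "b": [2.0, 1.0], "c": [4.0, 1.0], "d": [-3.0, 5.0]}
mean, var = community_summary(node_emb, ["a", "b", "c"])
print(mean, var)
```

Because the node embeddings are fixed before pooling, nothing in this procedure adapts them to the community structure, which is the limitation the paragraph above points out.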

Challenge: How to generate the target substructure and how to embed different types of graph components in one common space? In contrast to other types of embedding output, the target to embed in hybrid embedding (e.g., subgraph, community) is not given. Hence the first challenge is how to generate such kind of embedding target structure. Furthermore, different types of targets (e.g., community, node) may be embedded in one common space simultaneously. How to address the heterogeneity of the embedding target types is a problem.

#### 3.2.4 Whole-Graph Embedding

The last type of output is the embedding of a whole graph, usually for small graphs such as proteins, molecules, etc. In this case, a graph is represented as one vector and two similar graphs are embedded to be closer.

Whole-graph embedding benefits the graph classification task by providing a straightforward and efficient solution for calculating graph similarities [55, 49, 95]. To establish a compromise between the embedding time (efficiency) and the ability to preserve information (expressiveness), [95] designs a hierarchical graph embedding framework. It argues that an accurate understanding of the global graph information requires the processing of substructures at different scales. A graph pyramid is formed where each level is a summarized graph at a different scale. The graph is embedded at all levels and the results are then concatenated into one vector. [63] learns the embedding for a whole cascade graph, and then trains a multi-layer perceptron to predict the increment of the size of the cascade graph in the future.

Challenge: How to capture the properties of a whole graph and how to make a trade-off between expressiveness and efficiency? Embedding a whole graph requires capturing the property of a whole graph and is thus more time consuming compared to other types of embedding. The key challenge of whole-graph embedding is how to make a choice between the expressive power of the learnt embedding and the efficiency of the embedding algorithm.

## 4 Graph Embedding Techniques

In this section, we categorize graph embedding methods based on the techniques used. Generally, graph embedding aims to represent a graph in a low dimensional space which preserves as much graph property information as possible. The differences between different graph embedding algorithms lie in how they define the graph property to be preserved. Different algorithms have different insights of the node(/edge/substructure/whole-graph) similarities and how to preserve them in the embedded space. Next, we will introduce the insight of each graph embedding technique, as well as how they quantify the graph property and solve the graph embedding problem.

### 4.1 Matrix Factorization

Matrix factorization based graph embedding represents graph property (e.g., node pairwise similarity) in the form of a matrix and factorizes this matrix to obtain node embedding [11]. The pioneering studies in graph embedding usually solve graph embedding in this way. In most cases, the input is a graph constructed from non-relational high dimensional data features as introduced in Sec. 3.1.4, and the output is a set of node embeddings (Sec. 3.2.1). The problem of graph embedding can thus be treated as a structure-preserving dimensionality reduction problem which assumes the input data lie in a low dimensional manifold. There are two types of matrix factorization based graph embedding. One is to factorize graph Laplacian eigenmaps, and the other is to directly factorize the node proximity matrix.

#### 4.1.1 Graph Laplacian Eigenmaps

Insight: The graph property to be preserved can be interpreted as pairwise node similarities. Thus, a larger penalty is imposed if two nodes with larger similarity are embedded far apart.

GE Algorithm | $W$ | Objective Function |
---|---|---|
MDS [74] | Euclidean distance between $X_i$ and $X_j$ | Eq. 2 |
Isomap [78] | KNN; $W_{ij}$ is the sum of edge weights along the shortest path between $v_i$ and $v_j$ | Eq. 2 |
LE [96] | KNN | Eq. 2 |
LPP [97] | KNN | Eq. 4 |
AgLPP [79] | anchor graph | |
LGRM [98] | KNN | |
ARE [88] | KNN; the weight involves the images relevant to a query and a parameter that controls the unbalanced feedback | |
SR [99] | KNN; the weight is defined per class | |
HSL [87] | defined via the normalized hypergraph Laplacian | |
MVU [100] | KNN; learnt subject to constraints | Eq. 2 |
SLE [86] | KNN; the weight is defined per class | Eq. 4 |
NSHLRR [76] | normal graph: KNN; hypergraph: defined via the weight of a hyperedge | Eq. 2 |
[77] | | |
PUFS [75] | KNN | Eq. 4 + (must-link and cannot-link constraints) |
RF-Semi-NMF-PCA [101] | KNN | Eq. 2 + (PCA) + (k-means) |

Table IV: Graph Laplacian eigenmaps based graph embedding.

Based on the above insight, the optimal embedding $y$ can be derived from the objective function below [99]:

$$y^* = \arg\min \sum_{i \neq j} (y_i - y_j)^2 W_{ij} = \arg\min y^\top L y \tag{1}$$

where $W_{ij}$ is the “defined” similarity between nodes $v_i$ and $v_j$; $L = D - W$ is the graph Laplacian. $D$ is the diagonal matrix where $D_{ii} = \sum_{j \neq i} W_{ij}$. The bigger the value of $D_{ii}$, the more important $y_i$ is [97]. A constraint $y^\top D y = 1$ is usually imposed on Eq. 1 to remove an arbitrary scaling factor in the embedding. Eq. 1 then reduces to:

$$y^* = \arg\min_{y^\top D y = 1} y^\top L y = \arg\min \frac{y^\top L y}{y^\top D y} = \arg\max \frac{y^\top W y}{y^\top D y} \tag{2}$$

The optimal $y$’s are the eigenvectors corresponding to the maximum eigenvalue of the eigenproblem $W y = \lambda D y$.

The above graph embedding is transductive because it can only embed the nodes that exist in the training set. In practice, it might also be necessary to embed newly arriving nodes that have not been seen in training. One solution is to design a linear function $y = X^\top a$ so that the embedding can be derived as long as the node feature is provided. Consequently, for inductive graph embedding, Eq. 1 becomes finding the optimal $a$ in the objective function below:

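$$a^* = \arg\min \sum_{i \neq j} \lVert a^\top X_i - a^\top X_j \rVert^2 W_{ij} = \arg\min a^\top X L X^\top a \tag{3}$$

Similar to Eq. 2, by adding the constraint $a^\top X D X^\top a = 1$, the problem in Eq. 3 becomes:

$$a^* = \arg\min \frac{a^\top X L X^\top a}{a^\top X D X^\top a} = \arg\max \frac{a^\top X W X^\top a}{a^\top X D X^\top a} \tag{4}$$

The optimal $a$’s are the eigenvectors with the maximum eigenvalues in solving $X W X^\top a = \lambda X D X^\top a$.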
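To make the transductive case concrete, the following minimal sketch solves the generalized eigenproblem $W y = \lambda D y$ behind Eq. 2 with NumPy. It rests on a few assumptions that are not part of the text: the similarity matrix is symmetric with positive degrees, the problem is symmetrised as $D^{-1/2} W D^{-1/2}$ before calling `numpy.linalg.eigh`, and the constant eigenvector with $\lambda = 1$ is discarded because it maps every node to the same point. The function name `laplacian_eigenmap` and the toy graph are hypothetical.

```python
import numpy as np


def laplacian_eigenmap(W, d):
    """Embed nodes with the top non-trivial eigenvectors of W y = lambda D y (Eq. 2)."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)                        # D_ii = sum_j W_ij
    inv_sqrt = 1.0 / np.sqrt(deg)              # assumes every node has positive degree
    # D^{-1/2} W D^{-1/2} z = lambda z is equivalent to W y = lambda D y with y = D^{-1/2} z.
    S = inv_sqrt[:, None] * W * inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(S)             # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]             # largest eigenvalue first
    keep = order[1:d + 1]                      # skip the trivial constant solution (lambda = 1)
    Y = inv_sqrt[:, None] * vecs[:, keep]      # each column satisfies y^T D y = 1
    return vals[keep], Y


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

lams, Y = laplacian_eigenmap(W, d=1)
print(lams)
print(np.sign(Y[:, 0]))   # the sign pattern separates the two triangles
```

In this toy graph the one-dimensional embedding places the two triangles on opposite sides of zero, which matches the intuition that strongly connected nodes should stay close.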

The differences between existing studies mainly lie in how they calculate the pairwise node similarity $W_{ij}$, and whether they use a linear function $y = X^\top a$ or not. Some attempts [85, 81] have been made to summarize existing Laplacian eigenmaps based graph embedding methods using a general framework, but these surveys cover only a limited number of studies. In Table IV, we summarize existing Laplacian eigenmaps based graph embedding studies and compare how they calculate $W$ and what objective function they adopt.

The initial study MDS [74] directly adopted the Euclidean distance between two feature vectors $X_i$ and $X_j$ as $W_{ij}$. Eq. 2 is used to find the optimal embedding $y$’s. MDS does not consider the neighbourhood of nodes, i.e., any pair of training instances is considered as connected. The follow-up studies (e.g., [78, 102, 96, 97]) overcome this problem by first constructing a $k$ nearest neighbour (KNN) graph from the data features. Each node is only connected with its top $k$ similar neighbours. After that, different methods are utilized to calculate the similarity matrix $W$ so as to preserve as much of the desired graph property as possible. Some more advanced models have been designed recently. For example, AgLPP [79] introduces an anchor graph to significantly improve the efficiency of the earlier matrix factorization model LPP. LGRM [98] learns a local regression model to grasp the graph structure and a global regression term for out-of-sample data extrapolation. Finally, different from previous work, which preserves local geometry, LSE [103] uses local spline regression to preserve global geometry.
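The KNN graph construction step can be sketched as follows. This is a minimal example under stated assumptions: similarity is measured by Euclidean distance, edges carry binary weights, and an edge is kept if either endpoint selects the other. The methods in Table IV generally use more refined weights, and the function name `knn_graph` is hypothetical.

```python
import math
from typing import List, Sequence


def knn_graph(X: Sequence[Sequence[float]], k: int) -> List[List[float]]:
    """Build a symmetric 0/1 KNN adjacency matrix from feature vectors."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Rank all other instances by Euclidean distance to X[i].
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )
        for j in others[:k]:
            W[i][j] = 1.0
            W[j][i] = 1.0   # symmetrise: keep the edge if either end selects the other
    return W


# Two well-separated groups of points in the plane.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
W = knn_graph(X, k=1)
for row in W:
    print(row)
```

Unlike the fully connected graph implicitly used by MDS, the resulting matrix only links nearby instances, so the two groups end up in separate components.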

When auxiliary information (e.g., label, attribute) is available, the objective function is adjusted to preserve the richer information. E.g., [99] constructs an adjacency graph and a labelled graph. The objective function consists of two parts: one focuses on preserving the local geometric structure of the datasets as in LPP [97], and the other tries to get the embedding with the best class separability on the labelled training data. Similarly, [88] also constructs two graphs, an adjacency graph which encodes local geometric structures and a feedback relational graph that encodes the pairwise relations in users’ relevance feedback. RF-Semi-NMF-PCA [101] simultaneously considers clustering, dimensionality reduction and graph embedding by constructing an objective function that consists of three components: PCA, k-means and graph Laplacian regularization.

Some other work argues that $W$ cannot be constructed by simply enumerating pairwise node relationships. Instead, they adopt semidefinite programming (SDP) to learn $W$. Specifically, SDP [104] aims to find an inner product matrix that maximizes the pairwise distances between any two inputs which are not connected in the graph while preserving the nearest neighbour distances. MVU [100] constructs such a matrix and then applies MDS [74] on the learned inner product matrix. [2] proves that regularized LPP [97] is equivalent to regularized SR [99] if $W$ is symmetric, doubly stochastic, PSD and of a given rank. It constructs such a similarity matrix so as to solve LPP-like problems efficiently.

#### 4.1.2 Node Proximity Matrix Factorization

In addition to solving the above generalized eigenvalue problem, another line of studies tries to directly factorize the node proximity matrix.

Insight: Node proximity can be approximated in a low-dimensional space using matrix factorization. The objective of preserving node proximity is to minimize the loss of approximation.

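Given the node proximity matrix $W$, the objective is:

$$\min \lVert W - Y {Y^c}^\top \rVert \tag{5}$$

where $Y \in \mathbb{R}^{|V| \times d}$ is the node embedding, and $Y^c \in \mathbb{R}^{|V| \times d}$ is the embedding for the context nodes [21].

Eq. 5 aims to find an optimal rank-$d$ approximation of the proximity matrix $W$ ($d$ is the dimensionality of the embedding). One popular solution is to apply SVD (Singular Value Decomposition) on $W$ [110]. Formally,

$$W = \sum_{i=1}^{|V|} \sigma_i u_i {u_i^c}^\top \approx \sum_{i=1}^{d} \sigma_i u_i {u_i^c}^\top \tag{6}$$

where $\{\sigma_1, \sigma_2, \cdots, \sigma_{|V|}\}$ are the singular values sorted in descending order, and $u_i$ and $u_i^c$ are the singular vectors of $\sigma_i$. The optimal embedding is obtained using the largest $d$ singular values and the corresponding singular vectors as follows:

$$Y = [\sqrt{\sigma_1} u_1, \cdots, \sqrt{\sigma_d} u_d], \qquad Y^c = [\sqrt{\sigma_1} u_1^c, \cdots, \sqrt{\sigma_d} u_d^c] \tag{7}$$

Depending on whether the asymmetric property is preserved or not, the embedding of node $i$ is either $y_i = Y_i$ [21, 50], or the concatenation of $Y_i$ and $Y_i^c$, i.e., $y_i = [Y_i, Y_i^c]$ [106]. There exist other solutions for Eq. 5, such as regularized Gaussian matrix factorization [24], low-rank matrix factorization [56], and adding other regularizers to enforce more constraints [48]. We summarize all the node proximity matrix factorization based graph embedding methods in Table V.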
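The SVD-based solution can be sketched in a few lines of NumPy. The example below assumes a small dense proximity matrix and measures the approximation error in the Frobenius norm. The function name `factorize_proximity` and the toy factors used to build the matrix are hypothetical, and practical systems would typically rely on a sparse or truncated solver instead of a full SVD.

```python
import numpy as np


def factorize_proximity(W, d):
    """Rank-d factorisation W ~ Y @ Yc.T via truncated SVD (Eqs. 6 and 7)."""
    U, s, Vt = np.linalg.svd(np.asarray(W, dtype=float))
    root = np.sqrt(s[:d])            # square roots of the d largest singular values
    Y = U[:, :d] * root              # node embeddings
    Yc = Vt[:d, :].T * root          # context embeddings
    return Y, Yc


# A small asymmetric proximity matrix of rank 2, built from hypothetical factors.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, 1.0]])
B = np.array([[1.0, 1.0], [0.0, 1.0], [2.0, 0.0], [1.0, 3.0]])
W = A @ B.T

Y, Yc = factorize_proximity(W, d=2)
print(np.abs(W - Y @ Yc.T).max())          # near zero: two dimensions recover a rank-2 matrix

Y1, Yc1 = factorize_proximity(W, d=1)
print(np.linalg.norm(W - Y1 @ Yc1.T))      # non-zero: one dimension loses information
```

Keeping `Y` alone corresponds to the symmetric case, while concatenating the rows of `Y` and `Yc` retains the asymmetry of the proximity matrix.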

### 4.2 Deep Learning

Deep learning (DL) has shown outstanding performance in a wide variety of research fields, such as computer vision, language modeling, etc. DL based graph embedding applies DL models on graphs. These models are either a direct adoption from other fields or a new neural network model specifically designed for embedding graph data. The input is either paths sampled from a graph or the whole graph itself. Consequently, we divide the DL based graph embedding into two categories based on whether random walk is adopted to sample paths from a graph.

#### 4.2.1 DL based Graph Embedding with Random Walk

Insight: The second-order proximity in a graph can be preserved in the embedded space by maximizing the probability of observing the neighbourhood of a node conditioned on its embedding.

In the first category of deep learning based graph embedding, a graph is represented as a set of random walk paths sampled from it. The deep learning methods are then applied to the sampled paths for graph embedding which preserves graph properties carried by the paths.

In view of the above insight, DeepWalk [17] adopts a neural language model (SkipGram) for graph embedding. SkipGram [111] aims to maximize the co-occurrence probability among the words that appear within a window $w$. DeepWalk first samples a set of paths from the input graph using truncated random walk (i.e., uniformly sample a neighbour of the last visited node until the maximum length is reached). Each path sampled from the graph corresponds to a sentence from the corpus, where a node corresponds to a word. Then SkipGram is applied on the paths to maximize the probability of observing a node’s neighbourhood conditioned on its embedding. In this way, nodes with similar neighbourhoods (having large second-order proximity values) share similar embedding. The objective function of DeepWalk is as follows:

$$\min_y -\log P(\{v_{i-w}, \cdots, v_{i-1}, v_{i+1}, \cdots, v_{i+w}\} \mid y_i) \tag{8}$$

where $w$ is the window size which restricts the size of the random walk context. SkipGram removes the ordering constraint and treats the context nodes as conditionally independent given $y_i$, so Eq. 8 is transformed to:

$$\min_y -\sum_{-w \le j \le w,\, j \ne 0} \log P(v_{i+j} \mid y_i) \tag{9}$$

where $P(v_{i+j} \mid y_i)$ is defined using the softmax function:

$$P(v_{i+j} \mid y_i) = \frac{\exp(y_{i+j}^\top y_i)}{\sum_{k=1}^{|V|} \exp(y_k^\top y_i)} \tag{10}$$

Note that calculating Eq. 10 is not feasible as the normalization factor (i.e., the summation over the inner products with every node in a graph) is expensive. There are usually two solutions to approximate the full softmax: hierarchical softmax [112] and negative sampling [112].
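The sampling step and the full softmax can be sketched in plain Python. This is an illustrative example rather than the DeepWalk implementation: the function names, the toy adjacency list and the random, untrained embeddings are all assumptions, and a single embedding table is used for both target and context nodes, following the notation of Eq. 10.

```python
import math
import random
from typing import Dict, List


def truncated_random_walk(adj: Dict[int, List[int]], start: int, length: int,
                          rng: random.Random) -> List[int]:
    """Uniformly sample a neighbour of the last visited node until `length` nodes are collected."""
    walk = [start]
    while len(walk) < length and adj[walk[-1]]:
        walk.append(rng.choice(adj[walk[-1]]))
    return walk


def softmax_prob(emb: List[List[float]], i: int, j: int) -> float:
    """Eq. 10: P(v_j | y_i), normalised over every node in the graph."""
    scores = [sum(a * b for a, b in zip(y_k, emb[i])) for y_k in emb]
    m = max(scores)                                  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[j] / sum(exps)


def window_loss(emb: List[List[float]], walk: List[int], pos: int, w: int) -> float:
    """Eq. 9 for the node at position `pos` of one sampled walk."""
    lo, hi = max(0, pos - w), min(len(walk), pos + w + 1)
    return -sum(
        math.log(softmax_prob(emb, walk[pos], walk[c]))
        for c in range(lo, hi) if c != pos
    )


adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)
walk = truncated_random_walk(adj, start=0, length=6, rng=rng)
emb = [[rng.uniform(-1, 1) for _ in range(2)] for _ in adj]   # untrained, random embeddings
print(walk)
print(window_loss(emb, walk, pos=2, w=1))
```

Every call to `softmax_prob` touches all nodes in the graph, which is exactly the cost that the two approximations are meant to avoid.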

Hierarchical softmax: To efficiently solve Eq. 10, a binary tree is constructed in which the nodes are assigned to the leaves. Instead of enumerating all nodes as in Eq. 10, only the path from the root to the corresponding leaf needs to be evaluated. The optimization problem becomes maximizing the probability of a specific path in the tree. Suppose the path to leaf $v_{i+j}$ is a sequence of nodes $(b_0, b_1, \cdots, b_{\log(|V|)})$, where $b_0 =$ root and $b_{\log(|V|)} = v_{i+j}$. Eq. 10 then becomes:

$$P(v_{i+j} \mid y_i) = \prod_{t=1}^{\log(|V|)} P(b_t \mid y_i) \tag{11}$$

where $P(b_t \mid y_i)$ is a binary classifier: $P(b_t \mid y_i) = \sigma(y_{b_t}^\top y_i)$. $\sigma(\cdot)$ denotes the sigmoid function. $y_{b_t}$ is the embedding of tree node $b_t$’s parent. The hierarchical softmax reduces the time complexity of SkipGram from $O(|V|^2)$ to $O(|V| \log(|V|))$.

Negative sampling: The key idea of negative sampling is to distinguish the target node from noise using logistic regression. I.e., for a node $v_i$, we want to distinguish its neighbour $v_{i+j}$ from other nodes. A noise distribution $P_n(v_i)$ is designed to draw the negative samples for node $v_i$. Each $\log P(v_{i+j} \mid y_i)$ in Eq. 9 is then calculated as:

$$\log \sigma(y_{i+j}^\top y_i) + \sum_{t=1}^{K} \mathbb{E}_{v_t \sim P_n}\left[\log \sigma(-y_{v_t}^\top y_i)\right] \tag{12}$$

where $K$ is the number of negative nodes that are sampled. $P_n(v_i)$ is a noise distribution, e.g., a uniform distribution ($P_n(v_i) \sim \frac{1}{|V|}, \forall v_i \in V$). The time complexity of SkipGram with negative sampling is $O(K|V|)$.

GE Algorithm | Random Walk Methods | Preserved Proximity | DL Model |
---|---|---|---|
DeepWalk [17] | truncated random walk | second-order | SkipGram with hierarchical softmax (Eq. 11) |
[34] | truncated random walk | second-order (word-image) | SkipGram with hierarchical softmax (Eq. 11) |
GenVector [66] | truncated random walk | second-order (user-user & concept-concept) | SkipGram with hierarchical softmax (Eq. 11) |
Constrained DeepWalk [25] | sampling with edge weight | second-order | SkipGram with hierarchical softmax (Eq. 11) |
DDRW [47] | truncated random walk | second-order + class identity | SkipGram with hierarchical softmax (Eq. 11) |
TriDNR [73] | truncated random walk | second-order (among node, word & label) | SkipGram with hierarchical softmax (Eq. 11) |
node2vec [28] | BFS + DFS | second-order | SkipGram with negative sampling (Eq. 12) |
UPP-SNE [113] | truncated random walk | second-order (user-user & profile-profile) | SkipGram with negative sampling (Eq. 12) |
Planetoid [62] | sampling node pairs by labels and structure | second-order + label identity | SkipGram with negative sampling (Eq. 12) |
NBNE [19] | sampling direct neighbours of a node | second-order | SkipGram with negative sampling (Eq. 12) |
DGK [93] | graphlet kernel: random sampling [114] | second-order (by graphlet) | SkipGram (Eqs. 11–12) |
metapath2vec [46] | meta-path based random walk | second-order | heterogeneous SkipGram |
ProxEmbed [44] | truncated random walk | node ranking tuples | LSTM |
HSNL [29] | truncated random walk | second-order + QA ranking tuples | LSTM |
RMNL [30] | truncated random walk | second-order + user-question quality ranking | LSTM |
DeepCas [63] | Markov chain based random walk | information cascade sequence | GRU |
MRW-MN [36] | truncated random walk | second-order + cross-modal feature difference | DCNN + SkipGram |

Table VI: Deep learning based graph embedding *with* random walk paths.
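Returning to the two softmax approximations in Eqs. 11 and 12, the sketch below shows how each replaces the full normalisation. Several details are assumptions made for illustration only: the tree is a complete binary tree stored in heap order, a left child is scored with $\sigma(x)$ and a right child with $\sigma(-x)$ (a sign convention the text leaves implicit), and the expectation in Eq. 12 is replaced by a sum over $K$ already-drawn negative samples. All function names and vectors are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def hierarchical_softmax_prob(tree_emb: List[List[float]], y_i: Sequence[float], leaf: int) -> float:
    """Eq. 11 on a complete binary tree stored in heap order.

    tree_emb[p] is the vector of internal node p (1 = root, index 0 is unused);
    leaf number `leaf` sits at heap index len(tree_emb) + leaf.
    """
    node = len(tree_emb) + leaf
    prob = 1.0
    while node > 1:
        parent = node // 2
        sign = 1.0 if node % 2 == 0 else -1.0     # left child: sigma(x), right child: sigma(-x)
        prob *= sigmoid(sign * dot(tree_emb[parent], y_i))
        node = parent
    return prob


def negative_sampling_term(y_i: Sequence[float], y_pos: Sequence[float],
                           y_negs: List[Sequence[float]]) -> float:
    """Eq. 12 with the expectation replaced by K drawn negative samples."""
    positive = math.log(sigmoid(dot(y_pos, y_i)))
    negative = sum(math.log(sigmoid(-dot(y_n, y_i))) for y_n in y_negs)
    return positive + negative


y_i = [0.5, -1.0]
tree_emb = [[0.0, 0.0], [0.3, 0.1], [-0.2, 0.4], [0.7, -0.5]]   # 4 leaves, 3 internal nodes
probs = [hierarchical_softmax_prob(tree_emb, y_i, leaf) for leaf in range(4)]
print(probs, sum(probs))
print(negative_sampling_term(y_i, y_pos=[1.0, -1.0], y_negs=[[-1.0, 1.0], [0.0, 2.0]]))
```

Each leaf probability needs only one sigmoid per tree level, yet the leaf probabilities still form a valid distribution. The negative sampling term gives up normalisation altogether and only scores the neighbour against $K$ noise nodes.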

The success of DeepWalk [17] motivates many subsequent studies which apply deep learning models (e.g., SkipGram or Long-Short Term Memory (LSTM) [115]) on the sampled paths for graph embedding. We summarize them in Table VI. As shown in the table, most studies follow the idea of DeepWalk but change the settings of either the random walk sampling methods ([25, 28, 62]) or the proximity (Def. 5 and Def. 6) to be preserved ([34, 66, 47, 73, 62]). [46] designs meta-path-based random walks to deal with heterogeneous graphs and a heterogeneous SkipGram which maximizes the probability of having the heterogeneous context for a given node. Apart from SkipGram, LSTM is another popular deep learning model adopted in graph embedding. Note that SkipGram can only embed one single node. However, sometimes we may need to embed a sequence of nodes as a fixed length vector, e.g., represent a sentence (i.e., a sequence of words) as one vector. LSTM is then adopted in such scenarios to embed a node sequence. For example, [29] and [30] embed the sentences from questions/answers in cQA sites, and [44] embeds a sequence of nodes between two nodes for proximity embedding. A ranking loss function is optimized in these works to preserve the ranking scores in the training data. In [63], GRU [116] (i.e., a recurrent neural network model similar to LSTM) is used to embed information cascade paths.

#### 4.2.2 DL based Graph Embedding without Random Walk

Insight: The multi-layered learning architecture is a robust and effective solution to encode the graph into a low dimensional space.

The second class of deep learning based graph embedding methods applies deep models on a whole graph (or a proximity matrix of a whole graph) directly. Below are some popular deep learning models used in graph embedding.

Autoencoder: An autoencoder aims to minimize the reconstruction error of the output and input by its encoder and decoder. Both encoder and decoder contain multiple nonlinear functions. The encoder maps input data to a representation space and the decoder maps the representation space to a reconstruction space. The idea of adopting autoencoder for graph embedding is similar to node proximity matrix factorization (Sec. 4.1.2) in terms of neighbourhood preservation. Specifically, the adjacency matrix captures a node’s neighbourhood. If we input the adjacency matrix to an autoencoder, the reconstruction process will make the nodes with similar neighbourhood have similar embedding.

Deep Neural Network: As a popular deep learning model, Convolutional Neural Network (CNN) and its variants have been widely adopted in graph embedding. On the one hand, some of them directly use the original CNN model designed for Euclidean domains and reformat input graphs to fit it. E.g., [55] uses graph labelling to select a fixed-length node sequence from a graph and then assembles nodes’ neighbourhood to learn a neighbourhood representation with the CNN model. On the other hand, some other work attempts to generalize the deep neural model to non-Euclidean domains (e.g., graphs). [117] summarizes the representative studies in their survey. Generally, the differences between these approaches lie in the way they formulate a convolution-like operation on graphs. One way is to emulate the Convolution Theorem to define the convolution in the spectral domain [118, 119]. Another is to treat the convolution as neighborhood matching in the spatial domain [82, 72, 120].

Others: There are some other types of deep learning based graph embedding methods. E.g., [35] proposes DUIF, which uses a hierarchical softmax as a forward propagation to maximize the modularity. HNE [33] utilizes deep learning techniques to capture the interactions between heterogeneous components, e.g., CNN for image and FC layers for text. ProjE [40] designs a neural network with a combination layer and a projection layer. It defines a pointwise loss (similar to multi-class classification) and a listwise loss (i.e., softmax regression loss) for knowledge graph embedding.
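Returning to the autoencoder idea described above, the following NumPy sketch trains a one-hidden-layer autoencoder on the rows of an adjacency matrix. It is a toy illustration under several assumptions that do not come from the text: self-loops are added so that nodes sharing the same closed neighbourhood have identical input rows, the encoder uses `tanh`, the decoder uses a sigmoid, and the loss is a plain mean squared error optimised with full-batch gradient descent. It is not the architecture of SDNE [20], DNGR [23] or SAE [22], and the function name `train_autoencoder` is hypothetical.

```python
import numpy as np


def train_autoencoder(A, d, epochs=1000, lr=1.0, seed=0):
    """Train a one-hidden-layer autoencoder on the rows of A; return (embeddings, loss history)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    W1, b1 = rng.normal(scale=0.1, size=(n, d)), np.zeros(d)   # encoder parameters
    W2, b2 = rng.normal(scale=0.1, size=(d, n)), np.zeros(n)   # decoder parameters
    losses = []
    for _ in range(epochs):
        H = np.tanh(A @ W1 + b1)                     # encoder: adjacency row -> d-dim code
        R = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))     # decoder: code -> reconstructed row
        err = R - A
        losses.append(float(np.mean(err ** 2)))      # reconstruction error
        # Backpropagate the mean squared error through both layers.
        dZ2 = (2.0 / err.size) * err * R * (1.0 - R)
        dZ1 = (dZ2 @ W2.T) * (1.0 - H ** 2)
        W2 -= lr * (H.T @ dZ2)
        b2 -= lr * dZ2.sum(axis=0)
        W1 -= lr * (A.T @ dZ1)
        b1 -= lr * dZ1.sum(axis=0)
    return np.tanh(A @ W1 + b1), losses


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3), with self-loops added.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.eye(6)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

Z, losses = train_autoencoder(A, d=2)
print(losses[0], losses[-1])
print(np.round(Z, 3))
```

Nodes 0 and 1 share the same input row, so they receive the same code, which illustrates how reconstruction ties the embedding to the neighbourhood.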

We summarize all deep learning based graph embedding methods (random walk free) in Table VII, and compare the models they use as well as the input for each model.

GE Algorithm | Deep Learning Model | Model Input |
---|---|---|
SDNE [20] | autoencoder | |
DNGR [23] | stacked denoising autoencoder | PPMI |
SAE [22] | sparse autoencoder | |
[55] | CNN | node sequence |
SCNN [118] | Spectral CNN | graph |
[119] | Spectral CNN with smooth spectral multipliers | graph |
MoNet [80] | Mixture model network | graph |
ChebNet [82] | Graph CNN a.k.a. ChebNet | graph |
GCN [72] | Graph Convolutional Network | graph |
GNN [120] | Graph Neural Network | graph |
[121] | adapted Graph Neural Network | molecules graph |
GGS-NNs [122] | adapted Graph Neural Network | graph |
HNE [33] | CNN + FC | graph with image and text |
DUIF [35] | a hierarchical deep model | social curation network |
ProjE [40] | a neural network model | knowledge graph |
TIGraNet [123] | Graph Convolutional Network | graph constructed from images |

Table VII: Deep learning based graph embedding *without* random walk paths.

Summary: Due to its robustness and effectiveness, deep learning has been widely used in graph embedding. Three types of input graphs (except for graph constructed from non-relational data (Sec. 3.1.4)) and all the four types of embedding output have been observed in deep learning based graph embedding methods.

### 4.3 Edge Reconstruction based Optimization

Overall Insight: The edges established based on node embedding should be as similar to those in the input graph as possible.

The third category of graph embedding techniques directly optimizes an edge reconstruction based objective function, by either maximizing the edge reconstruction probability or minimizing the edge reconstruction loss. The latter is further divided into distance-based loss and margin-based ranking loss. Next, we introduce the three types one by one.

#### 4.3.1 Maximizing Edge Reconstruction Probability

Insight: Good node embedding maximizes the probability of generating the observed edges in a graph.

Good node embedding should be able to re-establish edges in the original input graph. This can be realized by maximizing the probability of generating all observed edges (i.e., node pairwise proximity) using node embedding.

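The direct edge between a node pair $v_i$ and $v_j$ indicates their first-order proximity, which can be calculated as the joint probability using the embeddings of $v_i$ and $v_j$:

$$p^{(1)}(v_i, v_j) = \frac{1}{1 + \exp(-y_i^\top y_j)} \tag{13}$$

The above first-order proximity exists between any pair of connected nodes in a graph. To learn the embedding, we maximize the log-likelihood of observing these proximities in a graph. The objective function is then defined as:

$$O^{(1)}_{max} = \max \sum_{e_{ij} \in E} \log p^{(1)}(v_i, v_j) \tag{14}$$

Similarly, the second-order proximity of $v_i$ and $v_j$ is the conditional probability of $v_j$ generated by $v_i$ using $y_i$ and $y_j$:

$$p^{(2)}(v_j \mid v_i) = \frac{\exp(y_j^\top y_i)}{\sum_{k=1}^{|V|} \exp(y_k^\top y_i)} \tag{15}$$

It can be interpreted as the probability of a random walk in a graph which starts from $v_i$ and ends with $v_j$. Hence the graph embedding objective function is:

$$O^{(2)}_{max} = \max \sum_{\{v_i, v_j\} \in \mathcal{P}} \log p^{(2)}(v_j \mid v_i) \tag{16}$$

where $\mathcal{P}$ is a set of {start_node, end_node} pairs in the paths sampled from the graph, i.e., the two end nodes from each sampled path. This simulates the second-order proximity as the probability of a random walk starting from the start_node and ending with the end_node.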
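The two likelihoods can be compared on a toy example. The sketch below evaluates the first-order objective (sigmoid of the inner product, summed over observed edges) and the second-order objective (softmax over all nodes, summed over sampled start/end pairs) for two hand-made embeddings. The function names, the embeddings and the edge list are hypothetical, and no optimisation is performed: the code only scores fixed embeddings.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def first_order_prob(emb: List[List[float]], i: int, j: int) -> float:
    """Joint probability of the edge (v_i, v_j): sigmoid of the inner product (Eq. 13)."""
    return 1.0 / (1.0 + math.exp(-dot(emb[i], emb[j])))


def second_order_prob(emb: List[List[float]], i: int, j: int) -> float:
    """Conditional probability of v_j given v_i: softmax over all nodes (Eq. 15)."""
    scores = [dot(y_k, emb[i]) for y_k in emb]
    m = max(scores)                                  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[j] / sum(exps)


def first_order_objective(emb: List[List[float]], edges: List[Tuple[int, int]]) -> float:
    """Log-likelihood of the observed edges (Eq. 14)."""
    return sum(math.log(first_order_prob(emb, i, j)) for i, j in edges)


def second_order_objective(emb: List[List[float]], pairs: List[Tuple[int, int]]) -> float:
    """Log-likelihood of the sampled (start_node, end_node) pairs (Eq. 16)."""
    return sum(math.log(second_order_prob(emb, i, j)) for i, j in pairs)


edges = [(0, 1), (2, 3)]
aligned = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]      # connected nodes point the same way
opposed = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]    # connected nodes point apart

print(first_order_objective(aligned, edges), first_order_objective(opposed, edges))
print(second_order_objective(aligned, edges), second_order_objective(opposed, edges))
```

For both objectives the embedding that places connected nodes together receives the higher score, which is the behaviour the maximisation is meant to reward.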

#### 4.3.2 Minimizing Distance-based Loss

Insight: The node proximity calculated based on node embedding should be as close to the node proximity calculated based on the observed edges as possible.

Specifically, node proximity can be calculated based on node embedding or empirically calculated based on observed edges. Minimizing the differences between the two types of proximities preserves the corresponding proximity.

For the first-order proximity, it can be computed using node embedding as defined in Eq. 13. The empirical probability is $\hat{p}^{(1)}(v_i, v_j) = A_{ij} / \sum_{e_{ij} \in E} A_{ij}$, where $A_{ij}$ is the weight of edge $e_{ij}$. The smaller the distance between $p^{(1)}$ and $\hat{p}^{(1)}$ is, the better the first-order proximity is preserved. Adopting KL-divergence as the distance function to calculate the differences between $p^{(1)}$ and $\hat{p}^{(1)}$ and omitting some constants, the objective function to preserve the first-order proximity in graph embedding is:

$$O^{(1)}_{min} = \min -\sum_{e_{ij} \in E} A_{ij} \log p^{(1)}(v_i, v_j) \tag{17}$$
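The phrase “omitting some constants” can be checked numerically. The sketch below computes the weighted loss of Eq. 17 and the KL-style divergence between the empirical edge distribution and the embedding-based probabilities, then shows that they differ only by a scaling with the total edge weight and by a term that depends on the data alone. The edge weights, embeddings and function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def first_order_prob(emb: List[List[float]], i: int, j: int) -> float:
    """Sigmoid of the inner product of two node embeddings."""
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(emb[i], emb[j]))))


def first_order_loss(emb: List[List[float]], A: Dict[Tuple[int, int], float]) -> float:
    """Weighted negative log-likelihood over the observed edges (Eq. 17)."""
    return -sum(w * math.log(first_order_prob(emb, i, j)) for (i, j), w in A.items())


A = {(0, 1): 3.0, (1, 2): 1.0}                    # edge weights A_ij
emb = [[1.0, 0.0], [1.0, 0.5], [0.0, 1.0]]        # toy embeddings
total = sum(A.values())
p_hat = {e: w / total for e, w in A.items()}      # empirical probabilities

# KL-style divergence between the empirical and the embedding-based probabilities.
kl = sum(p_hat[e] * math.log(p_hat[e] / first_order_prob(emb, *e)) for e in A)
# The part of the divergence that does not depend on the embedding.
data_term = sum(p * math.log(p) for p in p_hat.values())

print(kl, first_order_loss(emb, A) / total + data_term)
```

Since `total` and `data_term` are fixed by the input graph, minimising the divergence and minimising the weighted loss lead to the same embedding.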

Similarly, the second-order proximity of $v_i$ and $v_j$ is the conditional probability of $v_j$ generated by node $v_i$ (Eq. 15). The empirical probability $\hat{p}^{(2)}(v_j \mid v_i)$ is calculated as
