Heterogeneous academic network embedding based multivariate random-walk model for predicting scientific impact

The prediction of the current scientific impact of papers and authors has been extensively studied, both to help researchers find valuable papers and recent research directions and to help policymakers make recruitment or funding decisions. However, accurately evaluating their future impact, especially for new papers and young researchers, is the central goal of scientific impact prediction and remains less explored. Existing graph-based methods depend heavily on the global structure of the heterogeneous academic network and ignore local structure and text information, which may provide important clues for identifying influential papers and authors from a novel perspective. In this paper, we propose a hybrid model called ESMR that predicts the future influence of papers and authors mainly by exploiting these two kinds of information. Specifically, we first put forward a novel network embedding-based model that captures not only the local structure but also the text information of papers in a unified embedding representation. Then, the future impact of papers and authors is mutually ranked by integrating the learned embeddings into a multivariate random-walk model. Empirical results on two real datasets demonstrate that the proposed method significantly outperforms existing state-of-the-art ranking methods.


Introduction
Accurately assessing the potential importance of papers and authors has attracted rising research attention and has recently become one of the central issues in scientometrics. It can help researchers keep up with the most recent research directions and guide policymakers in recruitment decisions, funding allocation, or expert finding [1][2][3][4]. So far, most notable works have focused on ranking the current importance of papers and authors [2,5,6] and have proposed more sophisticated metrics, such as the h-index [7] and s-index [8].
These ranking methods can be roughly divided into citation-count-based methods and PageRank-based methods. Citation count is a simple but widely used measure of the popularity of papers and authors [7,9]. Its major limitation is that it considers only popularity and ignores the importance of the citations themselves. To overcome this shortcoming, PageRank-based methods (i.e., univariate random-walk models) rank the authority of papers or authors by iteratively computing over the entire citation or co-author network. In PageRank-based methods, paper prestige propagates through the citation relationships among papers [10], a more reasonable way to rank the literature than counting citations. However, accurately identifying promising papers and researchers and predicting their future impact remains less explored, especially for new papers and young researchers.
Graph-based models, whether univariate or multivariate random walks, are considered the state of the art and are widely used to rank and predict the future impact of papers [3,[11][12][13][14]. Univariate random-walk methods either construct a single homogeneous network or split the heterogeneous academic network (HAN) into several homogeneous ones [12]. Such methods ignore the differing influences among different types of objects. To address this issue, multivariate random-walk models have been proposed to rank multiple types of objects simultaneously [3,5].
A key assumption of these algorithms is that the authority of papers and the reputation of their authors reinforce each other. However, both univariate and multivariate random-walk algorithms capture only global structure information, using it in a simple way to recursively calculate ranking scores. Here, global structure information refers to the topological characteristics of the graph, which are usually used to compute the transition matrix of the random-walk model, while local structure information refers to relations among entities in the graph. The local structure information (i.e., local similarity between entities) and the text information of papers (i.e., paper topics, the words in papers, and the words in topics), which are helpful for identifying influential papers and authors, cannot be directly taken into account [15]. As a result, the prediction accuracy of existing models is relatively poor because these two kinds of information are ignored.
In fact, both types of information are essential for predicting the future impact of authors and papers, especially for newly published papers and young authors. Newly published papers are those published recently, while young authors are researchers who only recently began publishing. Because the structure information of their citations or co-authorships is sparse, prediction performance for them is lower. For newly published papers and young authors, direct and indirect relations to other papers and authors can be established through the topics and words contained in papers and through the local structure information. Specifically, the local structure information can be used to strengthen the representations of links between nodes, and the text information of papers can be used to better capture potential research hotspots. Some studies combine various kinds of information with multivariate random walks or other models to improve prediction accuracy, such as publication time [12,16], author order [16], early citations [17], journal impact factor [17], text features [12,18], and topical authority [17,19]. However, it is difficult to express such information in a unified representation that prediction approaches can use. Cluster-based zero-shot learning can handle many types of data by working in a vector space [20], and recent advances in network embedding [21][22][23] can learn a unified low-dimensional representation for different kinds of entities in a heterogeneous network. Motivated by these, we propose a network embedding-based model that achieves better prediction by simultaneously taking the local structure and text information into account.
In this paper, we propose a novel model called ESMR, which adds the learned Embeddings and global Structure information to a Multivariate Random walk, to predict the future scientific influence of papers and authors. More specifically, a heterogeneous academic network embedding model is first designed to learn the local structure and text information simultaneously. Then the future scientific impact of papers and authors is predicted comprehensively using the multivariate random-walk algorithm, whose transition matrix combines the learned embeddings with global structure information. Extensive experiments on two datasets demonstrate that ESMR performs significantly better than existing state-of-the-art methods.
We summarize our main contributions as follows:
• We design a heterogeneous academic network (HAN) that includes multiple connections between different kinds of entities, especially connections between authors and the research topics or words of their papers. This helps predict the scientific impact of new papers and young authors.
• We propose a network embedding model that learns a unified representation for different types of nodes, which makes the ranking process easier when using a multivariate random walk.
• We conduct extensive experiments on two datasets; the results demonstrate that ESMR can accurately predict the future impact of papers and authors, especially new ones.

Related work
The earliest works on scientific publication ranking are citation-count-based methods. Although very simple, citation count is widely used to measure the importance of papers and researchers. Based on citation count, several more sophisticated metrics have been proposed, such as the h-index [7], c-index [24], and s-index [8]. However, all citation-count-based methods focus only on citation popularity and ignore the available network structure. PageRank-based methods were initially proposed to rank papers or authors on homogeneous citation or co-author networks, propagating paper prestige through the citation relationships among papers [10]. Although they can measure the current influence of papers or authors, predicting future influence is difficult: accurate prediction requires many types of related factors, and homogeneous citation or co-author networks contain only limited information.
In fact, an academic network is heterogeneous, composed of various kinds of networks such as the co-author network, the paper citation network, and the venue-paper network [9,15]. Graph-based models, including univariate and multivariate random-walk techniques, are widely used to predict the future impact of papers [3,[11][12][13][14]. These methods first construct a heterogeneous academic network. Univariate random-walk techniques then usually split the heterogeneous academic network into several homogeneous ones by treating all nodes and edges as the same type (like the PageRank-based methods mentioned above) [12]. Such straightforward methods ignore the differing influences between different types of objects and thus limit the effectiveness of ranking them. For example, the authority of papers can enhance the academic reputation of their authors; conversely, reputable authors can increase the authority of their papers. Thus, multivariate random-walk techniques have been proposed to rank multiple entities simultaneously and identify the future influence of papers and researchers. The Co-Rank algorithm [25] was the first to improve ranking results for both papers and researchers by using the citation network, the co-author network, and the social network of authors. Most later related works followed or extended Co-Rank to simultaneously rank one or more types of entities (such as papers and authors) [5,11,12].
Following these methods, various kinds of information about papers and authors have been integrated into the multivariate random-walk framework to further improve prediction accuracy. Sayyadi et al. [11] and Wang et al. [12] applied publication-time information to the multivariate random-walk ranking model to predict the future citations of papers, under the assumption that newly published papers are more likely to be cited than older ones. Wu et al. [16] proposed TAORank, which considers the mutual influence among scholarly entities, including publication time and author order. Wang et al. [12] incorporated text information, arguing that it is useful for improving prediction results. Chaturvedi and Snigdha [18] analyzed the usefulness of text features and concluded that the most accurate predictions come from combining metadata and text features. Dong et al. [19] and Giovanni et al. [17] found that topical authority and publication venue are crucial for effective prediction. Liang et al. [5] integrated the multinomial multidimensional relationships between papers and authors into the ranking model. However, the limitation of these methods is that the rich information is not expressed in a unified representation that can be adequately used for prediction.
In recent years, network embedding-based methods have received widespread attention for their ability to learn unified low-dimensional vectors for different kinds of entities in a network while preserving its structure information. Various network embedding algorithms have been put forward for multiple tasks, such as link prediction [26], node classification [20,21,23,27], community detection [28], and recommendation [29]. These advances suggest a way to solve our problem, but such methods have not yet been extended to the scientific influence prediction task. In this paper, a novel network embedding-based method called ESMR is developed to predict future scientific impact; it simultaneously captures the local and global structure information and the text information of the HAN.

Methodology
The goal of our proposed ESMR is to predict the future impact of papers and authors by integrating learned embedding representations of the entities in the HAN into a multivariate random-walk model, capturing both the local structure information of the HAN and the rich text information of the papers. The process of ESMR is shown in Fig. 1 and consists of three parts: (1) constructing the heterogeneous academic network (HAN), which includes the paper citation network (PCN), the text information network (TIN), the paper-author network (PAN), and the co-author network (CAN); (2) learning unified embedding representations for all entities of the HAN, in which the local structure information and text information are preserved; (3) predicting the future impact of papers and authors by integrating these learned representations into a multivariate random-walk-based co-ranking model.

Heterogeneous academic network definition
In this subsection, we define the different types of networks that together form the heterogeneous academic network.

Definition 1 Paper Citation Network (PCN).
The paper citation network is denoted as $G_{pp} = (P, E_{pp}, F_{pp})$, where $P$ is a set of papers, $E_{pp}$ is the set of directed edges representing the citation relationships among the papers, and $F_{pp}$ is the set of edge weights.
That is, if paper $p_i$ cites $p_j$, there exists an edge $e_{pp}^{ij} \in E_{pp}$ from $p_i$ to $p_j$ whose weight $f_{pp}^{ij} \in F_{pp}$ depends on the time when $p_i$ cites $p_j$ and the time span between the current time and the citation time; it is calculated by (3). The paper citation network is therefore a time-aware graph in which newly established edges obtain greater weights, because papers frequently cited recently are much more likely to be cited by other papers.

Fig. 1 The framework of ESMR. The process of ESMR consists of three parts: (1) constructing the HAN; (2) learning unified embedding representations of entities in the HAN; (3) predicting the future influence of papers and authors.

The paper citation network captures the citation relationships among papers. However, newly published papers with few citations may not be well represented. The citation relationship between two papers is generally established because of similar research topics or content, so the text information of papers can help learn better paper representations. The words and topics in the papers are used to capture this text information, which is added to the embedding of the paper citation network.

Definition 2 Text Information Network of a Paper (TINp). Given a paper $p_i$, its text information network is denoted as $G_p^i = (V_p^i, E_p^i, F_p^i)$, where $V_p^i$ contains the paper $p_i$ together with its topics and words; $E_p^i = E_{pz}^i \cup E_{pw}^i \cup E_{zw}^i$ is the set of edges between $p_i$ and its topics and words; and $F_p^i$ is the set of weights of $E_p^i$. There exist edges $e_{pz}^{ij} \in E_{pz}^i$ and $e_{pw}^{ik} \in E_{pw}^i$ if topic $z_j$ and word $w_k$ are included in $p_i$, and an edge $e_{zw}^{jk} \in E_{zw}^i$ if topic $z_j$ includes word $w_k$ in paper $p_i$. $F_p^i$ comprises the weights of the edges in $E_{pz}^i$, $E_{pw}^i$, and $E_{zw}^i$, respectively: the weight $f_{pz}^{ij} \in F_{pz}^i$ is calculated by LDA, while the weight $f_{zw}^{jk} \in F_{zw}^i$ is calculated by TF-IDF, as described in detail in the next section. The TINp of a paper $p_i$ thus contains the paper-topic network $G_{pz}^i$, the paper-word network $G_{pw}^i$, and the topic-word network $G_{zw}^i$.

For published papers (especially new ones) with few citations, the topic and word information of $p_i$ provides clues to its context, which helps address the scarcity of available information when ranking their future influence. After defining the PCN and TINp, we can further define the paper citation network with text information (PCNT). Citation relationships between papers are established largely because of similar research topics, and these similarities are further reflected in the text content of articles, so the PCN and TIN are combined into a unified network, the PCNT.

Definition 3 Paper citation network with text information (PCNT) [15]. Let $G_{pp}$ and $G_p^i$ be graphs as in Definition 1 and Definition 2; then the PCNT is defined as
$$G_p = (P \cup Z \cup W,\; E_{pp} \cup E_{pz} \cup E_{pw} \cup E_{zw},\; F_{pp} \cup F_{pz} \cup F_{pw} \cup F_{zw}),$$
where $P$, $Z$, and $W$ respectively denote the sets of papers, their topics, and their words; $E_{pp}$, $E_{pz}$, and $E_{pw}$ are the sets of edges between papers and their references, topics, and words; $E_{zw}$ is the set of edges between topics and words; and $F_{pp}$, $F_{pz}$, $F_{pw}$, and $F_{zw}$ are the corresponding weights of the edges $E_{pp}$, $E_{pz}$, $E_{pw}$, and $E_{zw}$, respectively.
The PCNT helps us accurately predict the scientific impact of new papers with few citations by effectively computing the textual semantic similarities between papers. To capture the relationships among authors and between papers and their authors, we define the co-author network and the paper-author network, respectively.

Definition 4 Co-author network (CAN).
The CAN is denoted as $G_a = (V_a, E_a, F_a)$, where $V_a$ is a set of authors, $E_a$ is the set of undirected edges representing collaborations among authors, and $F_a$ is the set of weights of the edges $E_a$, calculated by (5).
The CAN is also a time-aware weighted network: the weight between authors $a_i$ and $a_j$ takes into account the number of papers they have co-authored and the time span between the current time and the time of each collaboration. The assumption is that an author is more likely to co-author new papers with authors they have recently collaborated with.

Definition 5 Paper-author network (PAN). The PAN is defined as $G_{pa} = (P \cup V_a, E_{pa}, F_{pa})$, where $E_{pa}$ is the set of edges connecting papers and their corresponding authors, and $F_{pa}$ is the set of weights of the edges $E_{pa}$. There exists an edge $e_{pa}^{ij} \in E_{pa}$ between $p_i$ and $a_j$ if $a_j$ is an author of $p_i$, and the weight $f_{pa}^{ij} \in F_{pa}$ depends on the signature order of author $a_j$ in paper $p_i$, as described in detail in the next section.
Multiple networks mentioned above can be further merged into a unified network in which multiple connections between different types of entities are established, especially connections between authors and the research topics or words of their papers. This helps us predict the scientific impact of new papers and new authors.

Definition 6 Heterogeneous academic network (HAN) [15]. Let $G_p$ be the graph in Definition 3, $G_a$ the graph in Definition 4, and $G_{pa}$ the graph in Definition 5; then the HAN is defined as the union of $G_p$, $G_a$, and $G_{pa}$.

The heterogeneous academic network can then be established to learn the embeddings of its entities. For example, consider three papers $p_1$, $p_2$, and $p_3$, whose author pairs are $(a_2, a_1)$, $(a_2, a_3)$, and $(a_1, a_3)$, respectively. There exist citation weights $f_{pp}^{12}$, $f_{pp}^{23}$, and $f_{pp}^{31}$ between $p_1$ and $p_2$, $p_2$ and $p_3$, and $p_3$ and $p_1$, and collaboration weights $f_{aa}^{21}$, $f_{aa}^{23}$, and $f_{aa}^{13}$ between $a_2$ and $a_1$, $a_2$ and $a_3$, and $a_1$ and $a_3$, respectively, while in the PAN $f_{pa}^{12} = 1$. The text edges such as $e_{pz}^{12}$, $e_{zw}^{12}$, $e_{zw}^{13}$, $e_{pw}^{12}$, and $e_{pw}^{13}$ and their weights are calculated by LDA and TF-IDF, respectively. After the representations of the entities are learned, the influences of papers and authors are ranked using a multivariate random walk on the HAN.

Embedding for PCNT
As mentioned above, the PCNT $G_p$ consists of three networks $G_{pp}$, $G_{pz}$, and $G_{pw}$, which are connected through the paper nodes. The empirical distributions of paper $p_i$ in $G_{pp}$, $G_{pz}$, and $G_{pw}$ are uniformly written as $\hat{P}(\cdot|p_i)$, and $P(\cdot|p_i)$ denotes the corresponding conditional probability distributions of the model. To learn the low-dimensional embedding representations $\mathbf{p}_i$, $\mathbf{z}_i$, and $\mathbf{w}_i$ of paper $p_i$, topic $z_i$, and word $w_i$, the objective is to minimize the following KL-divergence between the two distributions:
$$O_p = \sum_{i} \lambda_p^i \, \mathrm{KL}\big(\hat{P}(\cdot|p_i) \,\|\, P(\cdot|p_i)\big), \quad (1)$$
where $v_j$ is one of $p_j$, $z_j$, and $w_j$, and $\lambda_p^i$ is a unified notation for $\lambda_{pp}^i$, $\lambda_{pz}^i$, and $\lambda_{pw}^i$, weights representing the importance of $p_i$ in $G_{pp}$, $G_{pz}$, and $G_{pw}$, which are defined below. $P(v_j|p_i)$ can be estimated by the following softmax function:
$$P(v_j|p_i) = \frac{\exp(\mathbf{v}_j \cdot \mathbf{p}_i)}{\sum_{v_k} \exp(\mathbf{v}_k \cdot \mathbf{p}_i)}, \quad (2)$$
where $v_k$ ranges over the nodes of the same type as $v_j$, while the empirical distribution is
$$\hat{P}(v_j|p_i) = \frac{\omega_{ij}}{\sum_{k \in R(p_i)} \omega_{ik}},$$
where $\omega_{ij}$ is the weight of the edge $(p_i, v_j)$ and $R(p_i)$ is the set of nodes connected to $p_i$. Like $\lambda_p^i$, these quantities have different definitions in $G_{pp}$, $G_{pz}$, and $G_{pw}$.
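As a concrete illustration, the empirical and model distributions described above can be sketched in a few lines of Python; the toy neighbor weights and two-dimensional embeddings here are hypothetical, since the paper does not prescribe an implementation:

```python
import math

def empirical_dist(edge_weights):
    """Empirical distribution: the edge weights of p_i normalized
    over its connected nodes R(p_i)."""
    total = sum(edge_weights.values())
    return {v: w / total for v, w in edge_weights.items()}

def softmax_dist(p_vec, embeddings):
    """Model distribution P(v_j | p_i): a softmax over inner products
    of the learned low-dimensional representations."""
    scores = {v: sum(a * b for a, b in zip(p_vec, e))
              for v, e in embeddings.items()}
    m = max(scores.values())                      # subtract max for stability
    exps = {v: math.exp(s - m) for v, s in scores.items()}
    z = sum(exps.values())
    return {v: e / z for v, e in exps.items()}

p_hat = empirical_dist({"p2": 2.0, "p3": 1.0})    # empirical dist. of p1
p_model = softmax_dist([1.0, 0.0], {"p2": [0.5, 0.1], "p3": [0.2, 0.4]})
```

Minimizing the KL-divergence in (1) then amounts to pulling `p_model` toward `p_hat` for every paper node.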
For $G_{pp}$, $v_j \in P$, $R(p_i)$ is the set of papers cited by $p_i$, and $\omega_{ij}$ ($\omega_{ij} \in F_{pp}$) is the edge weight representing the citation relationship between papers. The paper citation network is clearly dynamic, and citation relationships established in different years have different effects on the future influence of papers. We therefore capture the dynamic properties of the network by assigning weights to citation relations based on when they were established, giving higher weights to more recent citations through an exponential decay function over time.
Thus $\omega_{ij}$ is defined as
$$\omega_{ij} = \exp\big(-\rho\,(T_c - T_{i \to j})\big), \quad (3)$$
where $\rho$ is a predefined decay parameter, $T_c$ is the current time, and $T_{i \to j}$ is the time at which the citation from $p_i$ to $p_j$ occurs. Furthermore, $\lambda_p^i$ is defined as $\lambda_{pp}^i = \sum_{k \in C(p_i)} \omega_{ki}$, where $C(p_i)$ is the set of papers that reference $p_i$; it represents the influence of paper $p_i$ in the paper citation network.
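A minimal sketch of such a time-aware citation weight, assuming the decay takes the exponential form $\exp(-\rho\,(T_c - T_{i\to j}))$ with years as the time unit (the function name and the sample years are illustrative only):

```python
import math

def citation_weight(t_current, t_cite, rho=0.5):
    """Time-aware edge weight for the paper citation network:
    more recent citations receive exponentially larger weights."""
    return math.exp(-rho * (t_current - t_cite))

# A citation made one year ago outweighs one made ten years ago.
w_new = citation_weight(2014, 2013)
w_old = citation_weight(2014, 2004)
```

The parameter `rho` controls how quickly old citations lose influence; a citation made at the current time gets the maximum weight of 1.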
For $G_{pz}$, $v_j \in Z$, $R(p_i)$ is the set of topics most likely touched upon in $p_i$, and $\omega_{ij}$ ($\omega_{ij} \in F_{pz}$) represents the likelihood that topic $z_j$ is included in $p_i$ (i.e., $P(z_j|p_i)$), calculated by the LDA model [30]. Here $\lambda_p^i$ is defined as $\lambda_{pz}^i = \sum_{k \in R(p_i)} \omega_{ik}$; it represents the influence of paper $p_i$ over the topics.
For $G_{pw}$, $v_j \in W$, $R(p_i)$ is the set of words included in $p_i$, and $\omega_{ij}$ ($\omega_{ij} \in F_{pw}$) reflects the importance of $w_j$ in $p_i$, calculated by TF-IDF. Here $\lambda_p^i$ is denoted $\lambda_{pw}^i = \sum_{k \in R(p_i)} \omega_{ik}$; it represents the influence of paper $p_i$ over the words.
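The paper-word weights can be sketched with a plain TF-IDF computation; the variant below (raw term frequency times a log inverse document frequency) and the toy documents are assumptions, and the paper-topic weights $P(z_j|p_i)$ would come instead from a fitted LDA model (e.g., scikit-learn's `LatentDirichletAllocation` or gensim's `LdaModel`), which is not reproduced here:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """omega_ij for G_pw: TF-IDF importance of each word in each paper,
    returned as one {word: weight} dict per document."""
    n = len(docs)
    df = Counter()                       # document frequency of each word
    for doc in docs:
        df.update(set(doc.split()))
    weights = []
    for doc in docs:
        tf = Counter(doc.split())
        total = sum(tf.values())
        weights.append({w: (c / total) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights

docs = ["network embedding ranking", "random walk ranking", "topic model text"]
w = tfidf_weights(docs)
```

Words appearing in fewer papers get larger weights, so distinctive terms dominate a paper's word edges.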

Embedding for CAN
The CAN $G_a$ reflects the influence of authors by mining the collaboration relationships among them. We assume that two authors sharing common co-authors have similar impact. Analogously to the PCNT, the loss function for embedding the co-author network $G_a$ is defined as
$$O_a = \sum_{i} \lambda_a^i \, \mathrm{KL}\big(\hat{P}(\cdot|a_i) \,\|\, P(\cdot|a_i)\big), \quad (4)$$
where $\hat{P}(a_j|a_i) = \omega_{ij}^a / \sum_{k \in N(a_i)} \omega_{ik}^a$, $N(a_i)$ is the set of co-authors of author $a_i$, and $\omega_{ik}^a$ is the weight of the collaboration among co-authors. Although the number of papers co-authored by two authors reflects the closeness of their collaboration, it is unfair to young authors who do not yet have many co-authors. To this end, time information is taken into account, and the weight of the collaboration between authors $a_i$ and $a_j$ is set as
$$\omega_{ij}^a = \sum_{p_k \in Co(a_i, a_j)} \exp\big(-\rho\,(T_c - T_{co}^{p_k})\big), \quad (5)$$
where $Co(a_i, a_j)$ is the set of papers on which author $a_i$ collaborates with author $a_j$, $T_c$ is the current time, and $T_{co}^{p_k}$ is the time when $a_i$ and $a_j$ co-authored paper $p_k$. $\lambda_a^i$ represents the influence of author $a_i$ in $G_a$ and is computed as $\lambda_a^i = \sum_{j \in N(a_i)} \omega_{ij}^a$. The conditional probability $P(a_j|a_i)$ is again calculated by the softmax function defined in (2).
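The time-aware collaboration weight can be sketched as a sum of exponentially decayed contributions, one per co-authored paper; the exponential form and the sample years are assumptions for illustration:

```python
import math

def coauthor_weight(coauthored_years, t_current, rho=0.5):
    """Collaboration weight between two authors: each co-authored
    paper contributes a decayed term, so recent collaborations
    count more than old ones."""
    return sum(math.exp(-rho * (t_current - t)) for t in coauthored_years)

# Two papers co-authored recently outweigh three co-authored long ago.
w_recent = coauthor_weight([2013, 2014], 2014)
w_old = coauthor_weight([2000, 2001, 2002], 2014)
```

This is what makes the scheme fairer to young authors: a small number of fresh collaborations can outweigh a long but stale co-authorship history.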

Embedding for PAN
The paper-author network $G_{pa}$ captures the relationships between a paper and all of its authors, which should be preserved in the PAN embedding. The weight $\omega_{ij}^{pa}$ of an edge $(p_i, a_j)$ linking a paper $p_i$ and its author $a_j$ is regarded as their empirical probability, indicating the closeness between them, while the joint probability $P(p_i, a_j)$ is specified by the low-dimensional representations in the latent space. The embedding for the PAN is therefore learned by minimizing the following objective function:
$$O_{pa} = -\sum_{e_{pa}^{ij} \in E_{pa}} \omega_{ij}^{pa} \log P(p_i, a_j), \quad (6)$$
where $\omega_{ij}^{pa}$ is set to $1/s$, with $s$ the signature order of author $a_j$ in paper $p_i$, and $P(p_i, a_j)$ is defined as $P(p_i, a_j) = \frac{1}{1+\exp(-\mathbf{p}_i \cdot \mathbf{a}_j)}$, where $\mathbf{p}_i \in \mathbb{R}^d$ and $\mathbf{a}_j \in \mathbb{R}^d$ are the $d$-dimensional latent representations of $p_i$ and $a_j$, respectively.
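The two ingredients of the PAN objective, the signature-order weight and the sigmoid joint probability, can be sketched directly (the toy embeddings are hypothetical):

```python
import math

def signature_weight(order):
    """omega^pa = 1/s, where s is the author's signature (byline) order:
    first authors get weight 1, later authors progressively less."""
    return 1.0 / order

def joint_prob(p_vec, a_vec):
    """P(p_i, a_j) = sigmoid(p_i . a_j) over the learned
    d-dimensional latent representations."""
    dot = sum(x * y for x, y in zip(p_vec, a_vec))
    return 1.0 / (1.0 + math.exp(-dot))

w_first, w_third = signature_weight(1), signature_weight(3)
prob = joint_prob([0.2, 0.1], [0.4, 0.3])
```

Minimizing (6) pushes `joint_prob` toward 1 for each observed paper-author pair, with the pull strongest for first authors.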

Embedding for HAN
To embed the HAN by integrating the embeddings on $G_p$, $G_a$, and $G_{pa}$, we combine the objective functions (1), (4), and (6) and jointly minimize their sum:
$$O = O_p + O_a + O_{pa}, \quad (7)$$
where $O_p$, $O_a$, and $O_{pa}$ denote the objectives in (1), (4), and (6), respectively.

Model optimization
Directly calculating the function in (2) [21] is both time-consuming and impractical. To address the computational challenge, the negative sampling approach [31] is adopted: the probability of positive samples is maximized while the probability of negative samples is minimized as far as possible. The objective function $L$ can therefore be expressed as the following formula, which uses the L2-norm to avoid over-fitting and ignores some constraints:
$$L = -\sum_{(i,j) \in E} \omega_{ij} \log \sigma(\mathbf{v}_i \cdot \mathbf{v}_j) \;-\; \beta \sum_{(i,j) \notin E} \log \sigma(-\mathbf{v}_i \cdot \mathbf{v}_j) \;+\; \lambda \sum_{i} \|\mathbf{v}_i\|_2^2, \quad (8)$$
where $\sigma(x) = \frac{1}{1+\exp(-x)}$ is the sigmoid function and $\lambda, \beta \in \mathbb{R}$ are regularization coefficients. The positive samples are modeled by the first term in (8) and the negative samples by the second term. Here $E = E_{pp} \cup E_{pz} \cup E_{pw} \cup E_a$, and $(i,j) \notin E$ denotes a set of randomly sampled edges between $v_i$ and $v_j$ that are not actually included in the HAN.
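A compact sketch of evaluating such a negative-sampling objective, assuming a dictionary of node embeddings, observed edges with weights, and sampled non-edges (all names and toy values here are illustrative, not the paper's implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ns_loss(emb, pos_edges, neg_edges, weights, lam=1.0, beta=1.0):
    """Negative-sampling loss: reward sigmoid(u.v) on observed edges,
    sigmoid(-u.v) on sampled non-edges, plus an L2 penalty on all
    embeddings. Returned as a quantity to be minimized."""
    pos = sum(weights[(i, j)] * math.log(sigmoid(dot(emb[i], emb[j])))
              for i, j in pos_edges)
    neg = sum(math.log(sigmoid(-dot(emb[i], emb[j]))) for i, j in neg_edges)
    reg = sum(x * x for v in emb.values() for x in v)
    return -(pos + beta * neg) + lam * reg

emb = {"a": [0.1, 0.2], "b": [0.3, 0.1], "c": [-0.2, 0.4]}
loss = ns_loss(emb, [("a", "b")], [("a", "c")], {("a", "b"): 1.0})
```

In training, stochastic gradient descent would lower this loss edge by edge instead of evaluating it over the whole graph.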
To optimize the loss function (7), the gradients with respect to $\mathbf{p}_i$ and $\mathbf{a}_i$ are computed and the parameters are updated with the stochastic gradient descent algorithm. Here $d_p$ and $d_a$ respectively denote the dimensions of the vectors $\mathbf{p}$ and $\mathbf{a}$, and we set $d_p = d_a = d$. The gradients for $\mathbf{z}_i$ and $\mathbf{w}_i$, which can be derived in a similar way, are not shown in detail.
For each iteration, we adopt backtracking line search [22] to obtain the most suitable learning rate. The complexity of Algorithm 1 is proportional to the complexity of the gradients of the vertex embeddings. Let $n$ be the number of vertex pairs with edges, $k$ the number of iterations, and $d_p$ and $d_a$ the dimensions of $\mathbf{v}_p$ and $\mathbf{v}_a$, respectively. The complexity is then $O(n d_p d_a k)$, so training can be done in polynomial time. The detailed training process is shown in Algorithm 1 [15].

Predicting the future scientific impact of papers and authors
In this section, we introduce how ESMR predicts the future influence of papers and authors. By integrating all the available information through the HAN embedding, entities with similar potential influence lie closer to each other in the learned latent representation space. Based on the learned embeddings, cosine similarity is used to measure the similarity between entities; for example, the similarity between two papers is calculated as $Sim(p_i, p_j) = \frac{\mathbf{p}_i \cdot \mathbf{p}_j}{\|\mathbf{p}_i\| \times \|\mathbf{p}_j\|}$. For use in the multivariate random-walk model, the transition matrix of the PCN is then represented as
$$M_{pp}(i,j) = \begin{cases} \gamma \dfrac{Sim(p_i, p_j)}{\sum_{k \in N(p_i)} Sim(p_i, p_k)} + (1-\gamma)\dfrac{1}{deg(p_i)}, & e_{pp}^{ij} \in E_{pp} \\ 0, & \text{otherwise} \end{cases} \quad (11)$$
where $N(p_i)$ and $deg(p_i)$ respectively denote the set of neighbors and the out-degree of $p_i$, and $\gamma$ is an adjustable parameter balancing the factors affecting the transition probability. In a similar way, $M_{aa}$, $M_{pa}$, and $M_{ap}$ can easily be obtained. Finally, the intra- and inter-network multivariate random walk on the HAN uses these transition matrices to calculate the future influence of papers and authors. Each iteration is defined by the following equations:
$$\mathbf{p}^{(t+1)} = \alpha_{pp} M_{pp}^{\top} \mathbf{p}^{(t)} + \beta_{pa} M_{ap}^{\top} \mathbf{a}^{(t)}, \quad (12)$$
$$\mathbf{a}^{(t+1)} = \alpha_{aa} M_{aa}^{\top} \mathbf{a}^{(t)} + \beta_{ap} M_{pa}^{\top} \mathbf{p}^{(t)}, \quad (13)$$
where $\mathbf{p}^{(t)}$ and $\mathbf{a}^{(t)}$ are the predicted score vectors at step $t$, $\alpha_{pp}$ and $\beta_{pa}$ are the influence weights of other papers and of authors on a specific paper, and $\alpha_{aa}$ and $\beta_{ap}$ are the corresponding weights for a specific author. The stationary vectors are obtained by iterating (12) and (13) until convergence. For example, for the papers $p_1$, $p_2$, $p_3$ and their authors $a_1$, $a_2$, $a_3$ in the HAN mentioned above, whose embeddings have been learned by network embedding, we can predict their score vectors $v_{p_1}$, $v_{p_2}$, $v_{p_3}$ and $v_{a_1}$, $v_{a_2}$, $v_{a_3}$ using the multivariate random walk with the computed transition matrices $M_{pp}$, $M_{aa}$, $M_{pa}$, and $M_{ap}$, and then rank the score vectors in descending order.
The rankings of their influences are $p_1 > p_3 > p_2$ and $a_2 > a_3 > a_1$ if the rankings of their score vectors are $v_{p_1} > v_{p_3} > v_{p_2}$ and $v_{a_2} > v_{a_3} > v_{a_1}$.
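The ranking stage above can be sketched end to end: a similarity-mixed transition matrix followed by a mutual-reinforcement iteration. This is a toy reconstruction under two assumptions (the transition mixes embedding similarity with a uniform term over out-links, and scores are renormalized each step); the matrices and parameter values are illustrative only:

```python
def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def transition_matrix(emb, adj, gamma=0.3):
    """Transition matrix sketch: for each node, mix normalized embedding
    similarity to its out-neighbors with a uniform jump over them."""
    n = len(adj)
    M = [[0.0] * n for _ in range(n)]
    for i, nbrs in enumerate(adj):
        if not nbrs:
            continue
        sims = {j: cosine(emb[i], emb[j]) for j in nbrs}
        s = sum(sims.values())
        for j in nbrs:
            M[i][j] = gamma * (sims[j] / s if s else 0.0) + (1 - gamma) / len(nbrs)
    return M

def corank(M_pp, M_aa, M_ap, M_pa, a_pp, a_aa, b_pa, b_ap, iters=50):
    """Mutual-reinforcement iteration sketch: paper and author scores
    propagate over the four transition matrices until (approximate)
    convergence, renormalizing at every step."""
    n_p, n_a = len(M_pp), len(M_aa)
    p = [1.0 / n_p] * n_p
    a = [1.0 / n_a] * n_a
    for _ in range(iters):
        p_new = [a_pp * sum(M_pp[k][i] * p[k] for k in range(n_p)) +
                 b_pa * sum(M_ap[k][i] * a[k] for k in range(n_a))
                 for i in range(n_p)]
        a_new = [a_aa * sum(M_aa[k][i] * a[k] for k in range(n_a)) +
                 b_ap * sum(M_pa[k][i] * p[k] for k in range(n_p))
                 for i in range(n_a)]
        sp, sa = sum(p_new), sum(a_new)
        p = [x / sp for x in p_new] if sp else p_new
        a = [x / sa for x in a_new] if sa else a_new
    return p, a

emb = [[1.0, 1.0], [1.0, 0.0]]
M = transition_matrix(emb, [[1], []])
p, a = corank([[0, 1], [1, 0]], [[0, 1], [1, 0]],
              [[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]],
              0.5, 0.5, 0.5, 0.5, iters=10)
```

The final descending sort of `p` and `a` then yields the predicted influence rankings of papers and authors.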

Datasets
ESMR is evaluated on the following two public datasets. The first is the ACL Anthology Network (AAN), the complete collection of computational linguistics papers published by the ACL. It contains 23,766 papers published before 2014, 18,862 authors, and 124,857 citations among these papers.
The other dataset is the Academic Social Network released by AMiner. We use two versions, AMiner2014 and AMiner2020. AMiner2014 includes 2,092,356 papers published before 2014 and their 8,024,869 citations; AMiner2020 includes 4,894,081 papers published before 2020 and 45,564,149 citations. The metadata for each paper contains the following information: paper ID, title, author list, author affiliation, publication year, publication venue, abstract, and the list of references.
The datasets are preprocessed as follows. Papers published after 1998 in AMiner2014 and after 2000 in AMiner2020 were selected for evaluating prediction performance. Papers without sufficient metadata, such as those lacking authors, publication time, references, or citations, are removed, because the impact of such papers is hard to evaluate. Authors are then extracted from the selected papers and their impact is predicted. In the end, we obtain 19,564 papers and 91,498 citations in the AAN dataset; 328,971 papers and 2,732,340 citations in AMiner2014; and 809,392 papers and 10,222,111 citations in AMiner2020.

Ground truth
Owing to the lack of agreed criteria, evaluating the performance of such methods is challenging. Following recent works [3,12], we use the number of future citations as the ground truth.
The dataset is divided into two parts according to a historical time point: a training part and a test part. The training part is used to obtain the estimated ranking lists of papers and authors published before the historical time point. The test part is used to compute the future citation count of each paper, from which the ground-truth lists of papers and authors are obtained by ranking them in descending order. Finally, results are reported by comparing the similarity of the two ranking lists.
In this paper, papers are divided into the training and test sets based on whether they were published before 2009 in AAN and AMiner2014, and before 2015 in AMiner2020.

Evaluation metrics
Two widely used metrics evaluate the performance. The first is recommendation intensity (RI) [12]. The intuition behind RI is that, given two ranking lists $R_1$ and $R_2$ with top-$k$ results, $R_1$ is better than $R_2$ if it returns more objects matching the ground-truth ranking list and the matched objects appear nearer the front of the top-$k$ list. Let $R$ be the top-$k$ objects returned by a ranking approach and $L$ the ground-truth list. For each object $P_i$ in $R$ with rank $o_r$, the recommendation intensity of $P_i$ at $k$ is defined as
$$RI(P_i)@k = \begin{cases} \dfrac{k - o_r + 1}{k}, & P_i \in L \\ 0, & \text{otherwise} \end{cases} \quad (14)$$
Based on each object's recommendation intensity, the recommendation intensity of the top-$k$ list $R$ is defined as $RI(R)@k = \sum_{P_i \in R} RI(P_i)@k$. As noted in [12], RI degenerates to precision when the top-$k$ list $R$ is taken as unordered and $RI(R)@k$ is divided by $k$.
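A small sketch of a recommendation-intensity computation; the exact closed form of the metric in [12] is not reproduced in this extract, so the linear position discount $(k - o_r + 1)/k$ below is an assumption, chosen to be consistent with the remark that dropping the rank order and dividing by $k$ recovers precision:

```python
def ri_at_k(returned, ground_truth, k):
    """Recommendation intensity of a top-k list (assumed form):
    a returned object at rank o_r that also appears in the ground-truth
    top-k contributes (k - o_r + 1) / k; unmatched objects contribute 0."""
    truth = set(ground_truth[:k])
    return sum((k - r + 1) / k
               for r, obj in enumerate(returned[:k], start=1)
               if obj in truth)
```

Matched objects near the front of the list contribute close to 1, so front-loaded correct results score higher than the same matches placed lower.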
The second is Normalized Discounted Cumulative Gain (NDCG), which is commonly used to evaluate rankings. NDCG considers both the relevance of the ranked results and their positions: each result's relevance is graded on multiple levels (the higher the grade, the greater the importance), and results appearing earlier in the list carry more weight. It is calculated as
$$NDCG@k = \frac{DCG@k}{IDCG@k}, \qquad DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i+1)}, \qquad IDCG@k = \sum_{i=1}^{|REL|} \frac{2^{rel_i} - 1}{\log_2(i+1)}, \quad (15)$$
where $rel_i$ represents the relevance of the $i$-th result and $|REL|$ represents the set of top-$k$ results selected after sorting by relevance in descending order. Clearly, the higher the NDCG value, the better the ranking result.
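The standard exponential-gain NDCG computation can be sketched as follows; the graded relevance values in the example are illustrative:

```python
import math

def ndcg_at_k(rels, k):
    """NDCG@k: rels are graded relevances of the returned list in rank
    order; the ideal ordering of the same grades gives the normalizer."""
    def dcg(scores):
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(scores[:k]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0

perfect = ndcg_at_k([3, 2, 1], 3)   # already ideally ordered
worse = ndcg_at_k([1, 2, 3], 3)     # same grades, reversed order
```

A perfectly ordered list scores 1.0, and any misordering of the same relevance grades scores strictly less.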

Baselines
To evaluate ESMR, the following methods are used to compare with it.
• MRCoRank (MR). MRCoRank is the state-of-the-art graph-based method to rank the future influence of papers, authors, and venues simultaneously. It takes time, text and structure information into account when using a mutual reinforcement framework to predict the results [12].
• FutureRank (FR). FutureRank is a representative model to rank the future impact of papers by fusing the relevant information related to papers (like authors, citations, and publication time) [11].
• PageRank (PR). PageRank is a base model for many graph-based ranking methods [32], which is used to compare with ESMR having the same weights of edges.
• LINE+CoRank (LCR). LCR is a method in which the paper and author embeddings are learned by using LINE [21]. Then the learned embeddings are combined with our ranking algorithm described in Section 3.3.
• EOE+CoRank (ECR). ECR is an advanced method that integrates the paper and author embeddings learned by EOE [22] into our proposed ranking model described in Section 3.3.
In addition, ESMR has two variants: ESMR without text information (ESMR-T), which shows the effectiveness of the text information, and ESMR without the network embedding model (ESMR-NE), which studies the necessity of the embedding process for improving the prediction performance.

Parameter sensitivity analysis
There are several parameters to be learned in our model. The vectors of authors, papers, topics and words are randomly initialised: each dimension is generated uniformly between 0 and 1, using the current system time as the seed. The remaining parameters are determined experimentally: we evaluate the performance of our model while varying each parameter over a predefined search range around its default value. In the process of training ESMR, the negative sampling rate is set to 5, and the regularization coefficients λ and β are both set to 1. To study the influence of dimensionality, the dimensions of the paper and author vectors are varied from 20 to 200. The results of top-20 paper ranking and author ranking are shown in Fig. 2a-d. The performance of ESMR varies only slightly across dimensions, and it is reasonable to select 100 by balancing computational complexity and algorithm performance; 100 is therefore used as the default dimension in the following experiments. Furthermore, there are four parameters in the ranking process: α_pp, α_aa, β_pa and β_ap. Taking the AAN dataset as an example, Fig. 3a, b shows the effect of α_pp on papers, and Fig. 3c, d shows the effect of α_aa on authors. α_pp = 0.3 gives the best result, and α_aa = 0.15 is a reasonable choice; values of α_pp or α_aa that are either too large or too small reduce the performance. In the following experiments, α_pp = 0.3, α_aa = 0.15, β_pa = 0.3 and β_ap = 0.85 are the default settings for the AAN dataset. For the two AMiner datasets, ESMR performs best with α_pp = 0.6, α_aa = 0.2, β_pa = 0.35 and β_ap = 0.8. The parameter γ in (11) is set to 0.3 for AAN and 0.6 for the two AMiner datasets.
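The initialisation scheme above can be sketched as follows; the function name and its arguments are illustrative, not part of the original implementation.

```python
import random
import time


def init_vectors(n_entities: int, dim: int = 100) -> list:
    """Randomly initialise embedding vectors for authors, papers, topics
    and words: each component is drawn uniformly from [0, 1), with the
    current system time used as the seed, as described in the text.
    dim = 100 is the default chosen by the sensitivity analysis."""
    rng = random.Random(time.time())
    return [[rng.random() for _ in range(dim)] for _ in range(n_entities)]
```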
The parameters of the baselines are the same as the settings in the corresponding papers. For MRCoRank, α_pp = 0.6, α_aa = 0.5, β_pa = 0.

Ranking results of paper impact
The performance of ESMR is quantitatively compared with all baselines; the ranking results of papers on the AAN dataset are illustrated in Fig. 4, which demonstrates that ESMR performs better than all baselines for different k. ECR and ESMR-T perform better than most of the other baselines on both NDCG@k and RI@k, but they are still inferior to ESMR. For NDCG@50 and RI@50, ESMR outperforms ESMR-T by 21% and 27%, ECR by 20% and 25%, and improves over MRCoRank, which achieves the best performance among all other baselines, by 33% and 42% on the two metrics. A possible explanation is that ESMR-T and ECR fail to capture the text information of papers and MRCoRank does not use network embedding, whereas ESMR combines all of them to improve the performance. ESMR-T is consistent with ECR in most cases, and sometimes better; the possible reason is that both integrate network embedding with co-ranking. The performance of LCR is lower than that of ESMR-T and ECR, because LCR is a graph-based method built on random walk and fails to capture all the relations between entities. However, LCR performs better than MRCoRank on RI@k, and ESMR-NE is inferior to all methods except FutureRank and PageRank. All these results verify that network embedding indeed facilitates effective impact prediction.
Next, we turn to the experiments on the AMiner datasets, where the papers published in the same year and in the same research community are selected to evaluate the prediction results. This is based on the following consideration: after listing the top-100 papers in the ground truth, we discover that the ground truth is obviously biased toward older papers. The results in the fields of Artificial Intelligence (AI) and Database (DB) in different years are therefore selected to evaluate the paper ranking performance, with NDCG@20 and RI@20 taken as metrics.
The results on the AMiner2014 dataset in 2001, 2003, 2005 and 2007 are shown in Fig. 5. They are basically consistent with those on AAN, but there are still some differences. For the ranking results in AI, as shown in Fig. 5a, b, ESMR generally outperforms LCR and ECR, which in turn are better than MRCoRank and FutureRank. ESMR-T does well in most cases, and its performance is competitive with LCR. On NDCG@20 and RI@20, ESMR respectively performs better than ECR by 9. That confirms that ESMR is effective for ranking the future influence of papers. ESMR also generally outperforms ESMR-T, which shows that the text information of papers is useful for improving the accuracy.
For the prediction results in DB, shown in Fig. 5c, d, similar observations can be made by comparing ESMR with its two variations ESMR-T and ESMR-NE. ESMR still performs better than ESMR-NE, but is worse than ESMR-T by 12. The results on AMiner2020 are shown in Fig. 6. The performance of all methods is better than that on AMiner2014 in different years and fields, because AMiner2020 contains more records providing more available information, and the research communities are divided according to the weights provided by the dataset. ESMR is still superior to the best baseline (ECR) by 14.3% and 15.5% in AI, and by 5.7% and 12.9% on average in DB.

Ranking results of author impact
Figure 7 shows the ranking results of authors on the AAN dataset. The performances of our ESMR and ESMR-T are better than those of all other methods at various k. From the overall view of average NDCG and RI, ESMR performs better than FutureRank by 41% and 42%, MRCoRank by 54% and 60%, and ECR by 52% and 63%. Comparing ESMR with ESMR-T, ESMR still outperforms ESMR-T by 13.5% and 16.8% on average, which implies that adding the text information does help to better rank authors. Different from the prediction of paper impact, we have some interesting observations. LCR achieves the worst performance except for PageRank. One possible reason is that the relations on the paths obtained by random walk may be sparse because of the small number of papers published by each author. Compared with MRCoRank and FutureRank, ESMR-NE provides more competitive results, but ECR obtains relatively poor performance. These results indicate that the text topic information and a better ranking method are more important for predicting author impact.
From the above observations and the fact that ESMR achieves the best performance, we can conclude that it is necessary to integrate network embedding, text information and a better ranking method. Similar to the ranking of papers, we only select and rank the authors who began to publish papers in the same research field and year on the AMiner datasets, and also use NDCG and RI as metrics. In AMiner2014, the ranking results in the fields of AI and DB in 2002, 2004, 2006 and 2008 on NDCG@20 and RI@20 are shown in Fig. 8a, b and Fig. 8c, d, respectively.
It can be seen that ESMR achieves the best performance. For NDCG@20 and RI@20 in AI, ESMR is superior to the best baseline (ECR) by 6. The results on AMiner2020 are shown in Fig. 9 and are basically consistent with those on AMiner2014. The performance of all methods is also better than that on AMiner2014 because it includes more available data and the fields may be accurately divided according to the provided field weights. ESMR is still superior to ECR (the best baseline) by 14.3% and 15.5% in AI, and by 5.7% and 12.9% in DB on average. ESMR also significantly outperforms the other baselines.

Case study
A case study of paper prediction results on the AAN dataset is presented in Table 1. The left two columns give the indexes of the top-10 papers returned by the ground truth and their years of publication. For comparison, the rankings of these papers in ESMR and the baseline approaches are listed. Boldface denotes papers whose predicted order falls within the top-10 list of the corresponding approach. Table 1 shows that ESMR gives better prediction results than all other methods: 7 of the top-10 papers returned by ESMR are among the top-10 papers in the ground truth, while the numbers of those hit by ECR, LCR, MRCoRank, FutureRank and PageRank are 6, 5, 6, 4 and 2, respectively. For the influential papers P07-2045 and J07-2003 published in 2007, all the methods fail to identify them, but they are ranked significantly higher by ESMR than by the other methods, which indicates that ESMR improves the impact prediction of new papers.
Then a case study of the ranking results of AI papers published in 2007 is presented. As shown in Table 2, the titles of the top-10 papers returned by the ground truth and their publication venues or journals are listed in the left two columns, and the order of these papers in ESMR and the baselines is listed as well. Table 2 shows that 8 of the top-10 papers returned by ESMR are among the top-10 papers in the ground truth, while those of ECR, LCR, MRCoRank, FutureRank and PageRank are 7, 7, 5, 5 and 4, respectively; ESMR gives better ranking results than all other approaches. The influential paper Graph Embedding and Extensions: A General Framework for Dimensionality Reduction, which ranks 2nd in the ground truth, is also in the top-10 list of ESMR, ECR and LCR, while the other methods fail to identify it. This is because we only use the papers available before 2009 for ranking, so the citations it obtained between 2007 and 2009 are not sufficient. This demonstrates again that ESMR has a powerful ability to discover new papers with large impact.
A case study of the author prediction results on the AAN dataset is presented in Table 3. The top-10 authors returned by the ground truth and the future citation numbers they received are listed in the left two columns, and the predicted order of these authors in ESMR and the baseline approaches is also given. Table 3 shows that 7 of the top-10 authors returned by ESMR are among the top-10 rankings of the ground truth, while 6, 5, 3, 5 and 1 matched authors are returned by ECR, LCR, MRCoRank, FutureRank and PageRank, respectively.
Then, for the AMiner2020 dataset, a case study of the ranking results of authors who began to publish AI papers in 2015 is shown in Table 4. 6 of the top-10 authors returned by ESMR are among the top-10 authors in the ground truth, while the numbers of those hit by ECR, LCR, MRCoRank, FutureRank and PageRank are 5, 4, 3, 3 and 1, respectively. ESMR gives the best prediction results compared with the baselines. For Callison-Burch Chris and Jurafsky Daniel in AAN, and Jeff Kiske and Joel Pazhayampallil in AMiner2020, all the baseline methods fail to accurately predict their scientific impact. In future work, we can construct a more comprehensive HAN that includes much other information, such as publication venue, journal, publisher and the institutions of authors, and extend ESMR to obtain better ranking results.
For AMiner2014, the papers are divided into different research fields in terms of their publication venues. This may be unreasonable, because papers on different topics can be published in the same venue, and papers on the same topic can be published in different venues. We will instead divide them by extracting keywords from the abstract and title, which may refine the field division of papers.
In our experiments, the papers are divided into the training and test sets based on whether they were published before 2009/2015. Since this may affect the prediction results, in follow-up research we will consider dividing the data at other time points and training the corresponding models. In addition, we can replace the two-stage training with collaborative training to explore further improvements in prediction.
On the other hand, although the proposed ESMR cannot completely solve the problem of predicting future scientific impact, it is the first attempt to learn unified entity representations using a network embedding based model, and to integrate richer information into a multivariate random-walk model to improve the prediction performance.

Conclusions
In this paper, a new ranking method, ESMR, was proposed to predict the future scientific influence of papers and authors. A network embedding based model was designed to learn unified, better representations for the various entities in the constructed heterogeneous academic network. The learned embedding representations capture the local structural information and the rich text and time information of papers, which are important for effectively predicting scientific impact. By integrating the learned embeddings and the global structural information into a multivariate random-walk model, the future impact of papers and authors was predicted simultaneously, especially for new papers and authors. The experimental results on two datasets demonstrated that the proposed model outperforms the other baselines.
Abbreviations HAN, Heterogeneous Academic Network; ESMR, Embeddings and Structure information to Multivariate Random-walk; PCN, Paper Citation Network; PCNT, Paper Citation Network with Text information; TINp, Text Information Network of a Paper; PAN, Paper Author Network; CAN, Co-Author Network; AAN, ACL Anthology Network; RI, Recommendation Intensity; NDCG, Normalized Discounted Cumulative Gain; AI, Artificial Intelligence; DB, DataBase.