Temporal word embedding with predictive capability

Semantics in natural language processing is largely dependent on contextual relationships between words and entities in a document collection. The context of a word may evolve. For example, the word "apple" currently has two contexts—a fruit and a technology company. The changes in the context of words or entities in text data such as scientific publications and news articles can help us understand the evolution of innovation or events of interest. In this work, we present a new diffusion-based temporal word embedding model that can capture short- and long-term changes in the semantics of entities in different domains. Our model captures how the context of each entity shifts over time. Existing temporal word embeddings capture semantic evolution at a discrete/granular level, aiming to study how a language developed over a long period. Unlike existing temporal embedding methods, our approach provides temporally smooth embeddings, facilitating prediction and trend analysis better than existing models. Extensive evaluations demonstrate that our proposed temporal embedding model performs better than existing models at sense-making and at predicting future relationships between words and entities.


Introduction
Text data available over the Internet have grown exponentially in the past decade. There are ample techniques to transform an unstructured text collection into a structured representation, allowing us to apply conventional data mining and machine learning algorithms for analytical purposes. An issue with the conventional approach to representing unstructured data is that the contextual changes in the meanings of words as the language evolves are not considered in the models. That is, conventional representation models do not consider text publications as evolving streams of data.
The contextual evolution of a word plays a vital role in its contemporary semantics. For example, the context of the word Cloud changed over the last two decades. In news articles, the word Cloud initially reflected its connection with weather-related terms. From 2008 to 2012, the word cloud started to reflect the context of web-based storage, such as Dropbox and Google Drive. From 2016, the word cloud gradually started to reflect cloud-based computing services, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform, as the services became more affordable and popular. Figure 1 demonstrates that the nearest neighbors of the word cloud changed over the years for a news dataset. Given all these changes in the neighborhood of cloud, how can we predict its neighborhood in the coming years?
We clarify that the scope of this paper is the context of words rather than their meanings. The context may cover multiple meanings, or sometimes a mixture of meanings. For instance, if we look at the neighbors of the word "Cloud," we might see a mix of words from different domains, such as "Google" and "AWS" from the technology domain and "Vapor" and "Rain" from the nature domain, indicating that the word "Cloud" has a mixture of multiple meanings within its context.
To serve a larger set of analytic needs, including the prediction of the mathematical space that words may occupy in the future, modern unstructured data representation techniques are slowly drifting toward the analysis of temporal aspects of text [1-3]. Nevertheless, the ability to represent unstructured text with a temporal context is still in its infancy. In this paper, we present a temporal embedding model for representing text data in a structured way based on the temporal context.
Our observation from timestamped text collections indicates that new concepts and events do not spike on one day and disappear the next; rather, concepts and events evolve over time. Some events might appear and disappear suddenly within a very short time. The paper does not focus on such events; rather, it focuses on evolving events. For example, the concept of COVID started to rise at a fast rate in February and March of 2020, but the topic had started several months before that. Similarly, the topic of Russia's invasion of Ukraine did not occur in one day. In unstructured text representations, discrete timestamps do not help much to reflect the rise and fall of entities. The appearance of a new entity in one timestamp needs to be diffused to earlier timestamps than when it first appeared, to reflect a smooth transition that captures evolution. However, not all words and entities evolve, and there can be some words that appear suddenly and then disappear quickly. The scope of this paper is limited to words and entities that evolve over time, with time units of around months or years.
Fig. 1 The nearest neighbors of the word cloud over time
Temporal smoothness is a crucial aspect of temporal word embedding because it plays a key role in achieving our primary goal of projecting embeddings into future timestamps. The ability to project embeddings into future timestamps allows us to predict the evolution of an entity. We improved the model's ability to make reasonable predictions by focusing on temporal smoothness. To provide a smooth transition for evolving words in their representations, in one of our previous research efforts, we addressed the issue of capturing the evolution of concepts using a diffusion-based time-reflective representation [4]. Diffusion of a word, in [4], was reflected by smoothly incorporating its effect across timestamps, before and after it appears. This time-reflective representation enabled the tracking of the meaning of every word in terms of its neighborhood and captured changes in context over time.
The term document, in this paper, refers to a single news article or an abstract of a scientific publication. One of the challenges of [4] was that each word vector had a length equal to the number of documents in the corpus (see Sect. 4.1.1), which is not practical for analyzing a corpus containing thousands of documents. Since the vectors generated for the words were directly dependent on the documents where those words appeared, vectors for the same words in the future cannot be extrapolated (because the future documents are yet to be seen and are not features of this representation). That means the model did not generate embeddings with prediction capability. To address these challenges, in the current paper, we construct a contextual low-dimensional temporal embedding space that mimics this high-dimensional representation while maintaining the essential diffusion information contained in the vectors. We introduce a neural-network-based framework that generates low-dimensional temporal word embeddings while optimizing for multiple key objectives. Through empirical analysis outlined in Sect. 5.1.8, we determined that an embedding size of 64 is the optimal choice for our approach, which is significantly smaller than the thousands or millions of dimensions of temporal tf-idf representations.
Word embeddings are low-dimensional vector space models obtained by training a neural network using contextual information from a large text corpus. There are several variants of word embeddings with different features, such as word2vec [5, 6], GloVe [7], and BERT [8]. These embedding techniques do not explicitly address dynamic changes in context. The few existing efforts [3, 9-11] to generate dynamic low-dimensional language representations fail to integrate the concept of temporal diffusion into language models effectively. Moreover, these existing models cannot simultaneously capture both the short-term and long-term drifts in the meaning of words. As a result, sharply trending concepts, such as COVID-19 (coronavirus disease 2019), cannot be modeled in the embedding space when long-term drifts are considered. On the other hand, long-range effects, such as the change in the meaning of the word cloud, are not captured when these algorithms take only short-term drifts into account. Moreover, the existing models have only limited capability to generate dynamic embeddings that can be used for predicting future embeddings.
The diffusion-based temporal word embedding model that we propose in this paper is able to capture both short-term and long-term semantic changes in the corpus. It incorporates diffusion into the modeling by integrating the time dimension smoothly into the model's objective function. The temporal embeddings are generated for all words in all timestamps in a connected space, and hence, the vectors can be used to predict embeddings of words in the future, unlike other existing temporal or dynamic embedding techniques. In this paper, we use the phrases temporal word embeddings and dynamic word embeddings interchangeably.
The experimental results in Sect. 5 show that the proposed method performs significantly better than the state-of-the-art temporal embedding models [3, 11, 12] in capturing both short-term and long-term changes in word semantics. Additionally, our model provides embeddings that facilitate more meaningful predictions of the future context of a word.
The contributions of this paper are summarized as follows:
• This paper describes a neural network model that generates low-dimensional dynamic embeddings from high-dimensional time-reflective feature vectors without degrading the quality of word vectors.
• Our proposed model creates a homogeneous embedding space for all timestamps in the data so that each word's temporal embedding can be seen as a multi-variate signal, which conventional signal prediction algorithms can leverage to predict embeddings in a future timestamp.
• We introduce a diffusion mechanism in the objective function in order to smooth the embeddings for better prediction capability.
• We compare the new model with regular temporal co-occurrence-based, dynamic-embedding-based, and time-masking transformer-based models on the task of semantic change detection.
• We introduce the application of predicting a future embedding space for an existing timestamped document collection.

Related work
This section provides a detailed review of the literature on the various aspects of our research work in order to put our contributions in perspective. The related works presented here are divided into two subsections, describing two different tasks: the generation of temporal embeddings and the prediction of a future embedding space.

Generation of temporal embeddings
Semantic evolution: Meanings of words in a language change over time depending on their use [13, 14]. Temporal syntactic and semantic shifts are called diachronic changes [1]. Several probabilistic approaches tackle the problem of modeling the temporal evolution of a vocabulary by converting a set of timestamped documents into a latent variable model [15-18]. Other approaches model diachronic changes using part-of-speech features [19] or using graphs where the edges between nodes (that represent words) are stronger based on context information [20]. However, tracking semantic evolution is not possible using these techniques because they do not generate language models.

Language models:
The state-of-the-art technique for language modeling is word2vec, introduced by Mikolov et al. [5, 6]. This method generates a static language model where every word is represented as a vector (also called an embedding) by training a neural network to mimic the contextual patterns observed in a text corpus. There are several variants of this method, which include probabilistic approaches [21] as well as matrix-factorization-based techniques such as GloVe [7]. A major challenge with static representations is that they do not incorporate any temporal information that can be used for tracking semantic evolution. Our work focuses on incorporating the temporal dimension of text data into text embedding models so that the evolution of a vector space over time can be studied.

Static to dynamic embeddings:
A proposed solution to tracking semantic evolution is to obtain a static representation for each timestamp in a corpus and then artificially couple these embeddings over time using regression or similar methods [1, 3, 12, 22]. However, this approach has several drawbacks. First, it requires having a significant number of occurrences for all words at all times, which is usually not the case since words can gain popularity or appear at different times. Second, the artificial coupling of embeddings across timestamps can introduce artifacts in the model that may lead to wrong conclusions. A potential solution to the sparsity problem is introduced by Camacho et al. [4], which leverages diffusion theory [23] to generate a robust temporal representation. The technique uses a temporal tf-idf representation in which the model changes size with the number of documents and, as a result, is not extensible.

Joint training of temporal embeddings:
The drawbacks of using static word embedding models to generate temporal representations have led to the development of new techniques that can train the embeddings for different timestamps jointly. The models use filters or regularization terms to connect the embeddings over time. Yao et al. [10] propose to generate a co-occurrence-based matrix and factorize it to generate temporal embeddings. The embeddings over timestamps are aligned using a regularization term. Rudolph et al. [11] apply Kalman filtering to exponential family embeddings to generate temporal representations. Bamler et al. [9] use similar filtering but apply it to embeddings using a probabilistic variant of word2vec. According to Bamler et al. [9], using a probabilistic method makes the model less sensitive to noise. All these methods focus primarily on capturing long-term semantic shifts, while our goal is to be able to capture both long- and short-term shifts.

Prediction of a future embedding space
Prediction using linear transformation: Several researchers have focused on studying multidimensional time-series prediction [24-27] using methods such as linear regression and accounting for temporal effects such as seasonality [25]. The authors of [28] discussed linear, homogeneous, and heterogeneous transformations of embeddings with timestamps for estimating embeddings for future timestamps. Linear transformation learns a mapping between the embeddings of two consecutive timestamps. Homogeneous transformation learns a mapping between the stacked embeddings of timestamps 1 to (T − 1) and the stacked embeddings of timestamps 2 to T. Heterogeneous transformation, instead of stacking the embeddings from all the timestamps, learns weight matrices for mapping embeddings between all pairs of consecutive timestamps and uses various smoothing functions to combine the weight matrices. However, the embedding vectors for each timestamp are learned separately, which means the embedding vectors at each timestamp are static. Some other studies [29, 30] also leverage linear transformations of embedding vectors to predict embeddings in future timestamps. The authors of [29] first train a dynamic embedding model and then create a time-context vector to predict estimated embeddings for the next timestamp through a linear transformation of the embeddings of the present timestamp.
However, these models are not well suited for the prediction of high-dimensional nonlinear phenomena, which is the case for semantic evolution.
Prediction using nonlinear modeling: Our generated temporal embedding is a nonlinear sequence of signals. Therefore, our prediction of a future embedding space from existing time-series data requires sequence modeling that can handle nonlinearity.

To model nonlinear phenomena, several neural-network-based sequence modeling techniques have been introduced recently [31-34]. In this paper, we leverage state-of-the-art sequence modeling techniques: recurrent neural networks (RNNs) based on (1) long short-term memory (LSTM) cells [31] and (2) gated recurrent units (GRUs) [32]. We also explore the effect of adding an attention mechanism [33] to both the LSTM- and GRU-based RNNs, which allows the network to focus on the most important elements of the sequence. Finally, we evaluate the non-recurrent attention-based Transformer model [34], which, in contrast to RNNs, can be trained in parallel.

Problem description
In this paper, we focus on timestamped text corpora, such as collections of scientific publications or news articles, that have publication dates. Let D = {d_1, d_2, ..., d_{|D|}} be a corpus of |D| documents and W = {w_1, w_2, ..., w_{|W|}} be the set of |W| entities (names, places, scientific terms) extracted from the text corpus D. We consider each of the entities a word.
Each document d contains words from the vocabulary (W_d ⊂ W) in the same order as they appear in the original document d. Every document d ∈ D is labeled with a timestamp t_d ∈ T, where T is the ordered set of timestamps.
The goal of this paper is twofold.
• Task 1: Constructing a low-dimensional temporal embedding space with predictive capability: The first task is to obtain a temporal word embedding model U from corpus D. For every timestamp t ∈ T, we seek a vector representation u_{it} for every word w_i ∈ W. The word embeddings U are represented as a three-dimensional matrix of size |W| × |T| × |u|, where |u| is a user-given constant that indicates the size of each vector. We use the shorthand U_i to describe the two-dimensional matrix of size |T| × |u| that represents word w_i ∈ W over all the timestamps.
• Task 2: Predicting a future embedding space: The second task is to predict a future embedding space. We aim to train a model P for which P(U_{t_a:t_b}) ≈ U_{t_{b+1}}, i.e., a model that takes as input the temporal embedding vectors of every word between timestamps t_a and t_b and outputs predicted embedding vectors of every word in the vocabulary for timestamp t_{b+1}. The output of P can be used to forecast the future contexts of the words in the vocabulary.
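To make the two tasks concrete, the following minimal sketch shows the shapes involved; the sizes and the persistence baseline are illustrative assumptions, not part of our method.

```python
import numpy as np

# Illustrative sizes; |W|, |T|, and |u| come from the data and the user.
num_words, num_timestamps, emb_size = 3000, 20, 64  # |W|, |T|, |u|

# Task 1 output: U is a |W| x |T| x |u| tensor; U[i] is the |T| x |u|
# trajectory U_i of word w_i over all timestamps.
U = np.zeros((num_words, num_timestamps, emb_size))

def predict_next(U_window):
    """Task 2 signature: embeddings for timestamps t_a..t_b
    (shape |W| x window x |u|) in, predicted embeddings for t_{b+1}
    (shape |W| x |u|) out. A persistence baseline that repeats the last
    timestamp stands in here for the trained sequence models of Sect. 4.2."""
    return U_window[:, -1, :]
```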

Methodology
This section is divided into two subsections describing the two tasks: temporal embedding generation and prediction of a future embedding space, as outlined in the problem description.

Construction of a temporal embedding space
To construct a temporal embedding space that can be used for the prediction of a future embedding space, we design an objective function that satisfies several crucial aspects: similarity between words, weights reflecting relevance between contextual words, temporal diffusion of word amplitudes, and the connection of embedding spaces across timestamps. We use a shallow neural network to accommodate the embedding generation with a complex objective function that models a temporally driven training dataset. The training data generation and each component of the objective function are explained below.

Training data for generating low-dimensional temporal embeddings
To construct the training dataset for generating low-dimensional temporal embeddings, we use the diffusion-based time-reflective representation from our previous research efforts [4].
In this subsection, we first summarize the steps of generating a diffusion-based time-reflective representation, which is constructed over a high-dimensional space and lacks predictive capacity. Afterward, we discuss the process of creating the training data.
As part of our previous research [4], we compute the frequency of words appearing in the documents over time to track the semantic evolution of the words. In our earlier research, we observed that the distribution of the frequency of a word over time was severely irregular, in particular for words or noun phrases that suddenly appeared at a particular timestamp or that are used sporadically. A frequency distribution that is inconsistent or uneven with time does not help capture trends because an evolving trend is considered a slower process compared to a sporadic one.
When examining the frequency distribution of words or entities over time, the presence of an uneven distribution poses challenges in identifying trends because sudden spikes in frequency can overshadow the slower-evolving patterns. For instance, the frequency of the word "war" might sharply increase in news articles when a topic like the "Russia-Ukraine war" breaks out. However, wars do not generally start suddenly; a series of events typically leads to war. To effectively capture trends, we require a smoother frequency distribution. Therefore, to address the uneven frequency distribution of words, we applied a diffusion process to the frequency distribution. Our idea of evolving trends is based on a social science concept known as diffusion theory [23].
Based on diffusion theory [23], which refers to the change of the distributional patterns of a phenomenon over time, the meaning of a word, and consequently its vector representation, diffuses over time. In [4], to generate word vectors for each timestamp based on diffusion theory, we assumed that every document is present in every timestamp but with a higher probability for the timestamps closer to when the document was initially published.
We used a Gaussian filter to diffuse the contribution of the document smoothly before and after the publication date of the document. The contribution of a document d increased the closer its timestamp t_d was to timestamp t. To generate the temporal tf-idf representation, we multiplied the tf-idf weight of each word in every document by the Gaussian filter associated with each respective timestamp [4]. Figure 2 illustrates this process on a synthetic dataset, with the first plot (at the top, with a green line) showcasing the tf-idf values of a word across different documents published between the timestamps of 2016 and 2020. The second row (with red lines) plots the Gaussian filters at various timestamps, while the last row (with blue lines) displays the temporal word embedding vectors obtained by multiplying the static tf-idf values (the green line) with the Gaussian filters (red lines). The equation governing this process is described in Eq. (1), which outlines the calculation for generating temporal tf-idf using the Gaussian filter. In order to construct a training dataset for the current paper, using the temporal tf-idf representation of Eq. (1), we compute the cosine dissimilarity (1.0 − cosine similarity) between every pair of words and store these as a distance/dissimilarity matrix Δ, where each element can be addressed as δ_{ijt} ∈ Δ. This distance/dissimilarity matrix becomes the training data for the expected distance/dissimilarity between a particular pair of words (w_i, w_j) ∈ W at time t ∈ T. We use the notation δ_{ij} to represent a vector of size |T| with the temporal tf-idf-based cosine dissimilarity between (w_i, w_j) ∈ W for all the timestamps. The cosine dissimilarities are later used in the output layer of our proposed neural network.
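As a rough illustration of this construction, the sketch below diffuses static tf-idf weights with a Gaussian filter and derives the pairwise dissimilarity targets δ_{ijt}; the function names and normalization details are our assumptions, not the implementation of [4].

```python
import numpy as np

def gaussian(t, t_d, sigma=1.0):
    # Contribution of a document published at timestamp t_d to timestamp t.
    return np.exp(-((t - t_d) ** 2) / (2.0 * sigma ** 2))

def temporal_tfidf(tfidf, doc_times, t, sigma=1.0):
    """tfidf: |W| x |D| static tf-idf matrix; doc_times: length-|D| array of
    publication timestamps. Returns |W| x |D| temporal tf-idf vectors at
    timestamp t: each word's document profile scaled by the Gaussian filter."""
    weights = gaussian(t, np.asarray(doc_times, dtype=float), sigma)
    return tfidf * weights  # broadcasts over the document axis

def cosine_dissimilarity(u, v, eps=1e-12):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

# Training target delta_{ijt} for a word pair (w_i, w_j) at timestamp t:
# V = temporal_tfidf(tfidf, doc_times, t)
# delta_ijt = cosine_dissimilarity(V[i], V[j])
```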

Simplistic embedding model optimizing for similarity only
One of our objectives is to obtain a low-dimensional word embedding model U such that computing the cosine dissimilarity between the word vectors results in a distance matrix that closely resembles Δ. Equation (2) formulates this objective as ϑ_1. In this case, we are optimizing the vectors in U to minimize the difference between the cosine dissimilarity of each pair of word vectors for every timestamp and the cosine dissimilarity from the temporal tf-idf model (Eq. (1)). Minimizing this difference ensures that our model captures the same similarity as the temporal tf-idf model, but ours provides low-dimensional contextual vectors.
In this paper, the term dist(A, B) refers to the cosine dissimilarity between vectors A and B. The cosine dissimilarity between a pair of word vectors is bounded between [0, 1]. A cosine dissimilarity of 0 between two word vectors means that both words share the same context, while a cosine dissimilarity of 1 means that the vectors are completely orthogonal, and thus, they do not share contextual similarities. The variable α is introduced as a scaling factor.
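Equation (2) is not reproduced in this extraction; a plausible form consistent with the description above (a squared error between the model's dissimilarities and the α-scaled temporal tf-idf targets, summed over word pairs and timestamps) is the following sketch, where the exact placement of α is our assumption:

$$\vartheta_1 = \sum_{t \in T} \sum_{(w_i, w_j)} \big( \operatorname{dist}(u_{it}, u_{jt}) - \alpha\, \delta_{ijt} \big)^2$$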

Weighing relevance: giving more importance to the neighborhood of each word
In our work, we focus on the task of studying the semantic evolution of a word based on changes to its context. Thus, it is more important that our word embedding model correctly captures the relevant neighborhood of a word.
We empirically discovered that each word has a small number of relevant neighbors. To examine this phenomenon, we conducted an analysis of the distribution of neighboring words within a specified distance threshold of the base word. The outcomes, depicted in Fig. 3, specifically focus on a cosine distance threshold of 0.80. The graph illustrates that, on average, approximately 40% of the words have fewer than 10 neighboring words falling within the 0.80 threshold, while around 85% of the words have fewer than 20 such neighbors. These findings lead us to conclude that each word shares context with a small number of words. To take this into account in the objective function, we introduce a penalty when the temporal tf-idf-based cosine dissimilarity δ_{ijt} is small, ensuring that our word embedding model captures the relevant context accurately.
where β is a scaling parameter to increase/decrease the importance given to the samples with a smaller distance. Notice that e^{−β δ_{ijt}} in Eq. (3) imposes a higher penalty on examples with smaller baseline distances. The penalty is less when the dissimilarity from the temporal tf-idf model is large (that is, a lesser penalty for contextually dissimilar words). Equation (3) supports the phenomenon that, for a specific word, most of the words in the vocabulary are at a relatively large distance. The large distances need not be a part of the penalty because the objective function is only concerned with neighbors that appear in the vicinity in the temporal tf-idf model.
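Equation (3) is likewise not reproduced here; a plausible weighted form consistent with the description (the factor e^{−β δ_{ijt}} emphasizing pairs with small baseline distances) is:

$$\vartheta_2 = \sum_{t \in T} \sum_{(w_i, w_j)} e^{-\beta \delta_{ijt}} \big( \operatorname{dist}(u_{it}, u_{jt}) - \alpha\, \delta_{ijt} \big)^2$$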

Fig. 4 The box represents the embedding layer of temporal embedding vectors in our neural network model. When calculating the distance between two words, the Gaussian filter diffuses all the vectors of a word from timestamps neighboring the current timestamp t

Temporal diffusion filter
In connection with diffusion theory [23] (introduced earlier with frequency-based training data generation in Sect. 4.1.1), we assume that the meaning of a word, and consequently its vector representation, diffuses (or drifts) over time. Thus, our model should generate word embeddings that evolve smoothly over time. To introduce this concept into our objective function, we model the effect of every word vector in all timestamps to some degree. We use a Gaussian filter (Eq. (4)) to diffuse the contribution of each vector smoothly before and after the timestamp of the current sample. The filter uses a sliding window, going from the first to the last timestamp. σ is a user-settable parameter representing the standard deviation of the Gaussian distribution. A large value of σ means that the diffusion of word vectors is slow over time. A small standard deviation allows for capturing short-term changes in meaning.
Figure 4 illustrates the diffusion process. Notably, when calculating the distance between two words at timestamp t, we do not simply consider the two vectors at timestamp t. Instead, we multiply the vectors for both words across all timestamps (that is, the U_i and U_j matrices of Fig. 4) by the Gaussian filter specific to time t (Eq. (4)).
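A minimal sketch of this diffusion step follows, assuming the diffused vector at timestamp t is a Gaussian-weighted combination of the word's vectors from all timestamps; the normalization of the filter weights is our assumption, not from the paper.

```python
import numpy as np

def diffuse(U_i, t, sigma=0.5):
    """U_i: |T| x |u| matrix of one word's vectors over all timestamps.
    Returns the diffused vector at timestamp t: a Gaussian-weighted
    combination of the vectors from all timestamps (cf. Eq. (4))."""
    T = U_i.shape[0]
    taus = np.arange(T, dtype=float)
    g = np.exp(-((taus - t) ** 2) / (2.0 * sigma ** 2))
    g /= g.sum()  # normalization is an assumption
    return g @ U_i  # |u|-dimensional diffused vector

# The distance used in the objective at timestamp t would then be computed
# between diffuse(U_i, t) and diffuse(U_j, t) rather than u_{it} and u_{jt}.
```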
Equation (5) presents the updated objective ϑ_3, which includes the temporal diffusion of the word embeddings.

Smoothness penalty: creating a homogeneous temporal embedding space
The second important goal that our word embedding model should achieve is to be spatially smooth over time. Continuous or smooth temporal embeddings are those where the distance (e.g., Manhattan or Euclidean) between two vectors of the same word for consecutive timestamps is small. Equation (6) captures the expected behavior by penalizing significant spatial changes.
Fig. 5 The proposed neural network architecture for temporal embedding generation in the hidden layer
The main issue with this expression is that by forcing consecutive vectors to be very close together, we might lose important information when the vectors drift apart in the original data. Thus, we introduce weights ω_ϑ and ω_ε to control the effect of each objective. The final objective function takes the form of Eq. (7).
Alternative forms of the final objective, F_b and F_c, are given in Eqs. (8) and (9).
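Equations (6)-(9) are not reproduced in this extraction. A plausible form of the smoothness penalty ε (Eq. (6)) and the weighted combination F_a (Eq. (7)), consistent with the description above, is sketched below; the choice of the squared Euclidean norm and the exact rearrangements behind F_b and F_c are our assumptions:

$$\varepsilon = \sum_{w_i \in W} \sum_{t=1}^{|T|-1} \lVert u_{i,t+1} - u_{it} \rVert^2, \qquad F_a = \omega_{\vartheta}\, \vartheta_3 + \omega_{\varepsilon}\, \varepsilon$$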

Implementation
We implemented a neural-network-based model using TensorFlow to generate our low-dimensional temporal word embeddings. An overall view of the architecture of our neural network is shown in Fig. 5. The goal of the neural network is to minimize Eq. (7). The embeddings for all words in all timestamps are generated in the hidden layer. We initialize the weights in the hidden layer in the range [0, 1]. The data used for training the model contain three inputs (one-hot encodings of a pair of words for which the cosine dissimilarity is known, and the timestamp) and one target value (the cosine dissimilarity). The inputs are the indices of two random words w_{it} and w_{jt} at timestamp t. The target value is the expected cosine dissimilarity between w_{it} and w_{jt}, obtained using the temporal tf-idf representations of Eq. (1).
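A simplified TensorFlow sketch of this setup follows; it trains the hidden embedding layer against the δ targets but omits the diffusion and smoothness terms, and all variable names are ours, not the paper's code.

```python
import tensorflow as tf

num_words, num_timestamps, emb_size = 3000, 20, 64  # |W|, |T|, |u|

# Hidden layer: one trainable vector per (word, timestamp), initialized in [0, 1].
embeddings = tf.Variable(
    tf.random.uniform((num_words, num_timestamps, emb_size)))

def predicted_dissimilarity(i, j, t):
    # Cosine dissimilarity between u_{it} and u_{jt} read from the hidden layer.
    u_i, u_j = embeddings[i, t], embeddings[j, t]
    cos = tf.reduce_sum(u_i * u_j) / (tf.norm(u_i) * tf.norm(u_j) + 1e-12)
    return 1.0 - cos

optimizer = tf.keras.optimizers.Adam(1e-3)

def train_step(i, j, t, delta_ijt):
    # Squared error against the temporal tf-idf target; the diffusion and
    # smoothness penalties of Eq. (7) are omitted here for brevity.
    with tf.GradientTape() as tape:
        loss = (predicted_dissimilarity(i, j, t) - delta_ijt) ** 2
    grads = tape.gradient(loss, [embeddings])
    optimizer.apply_gradients(zip(grads, [embeddings]))
    return loss
```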

Sequential model predicting a future embedding space
In this subsection, we explain how to extrapolate the generated temporal embeddings to build an embedding space for a future timestamp.
For predicting a future embedding space, we leverage state-of-the-art sequence modeling techniques: recurrent neural networks (RNNs) based on (1) long short-term memory (LSTM) cells [31] and (2) gated recurrent units (GRUs) [32]. We also explore the effect of adding an attention mechanism [33] to both the LSTM- and GRU-based RNNs, which allows the network to focus on the most important elements of the sequence. Finally, we evaluate the non-recurrent attention-based Transformer model [34], which, in contrast to RNNs, can be trained in parallel. The following subsections provide a detailed description of the four sequential models we used for predicting future embeddings.

LSTM model structure
Most of the progress in RNN-based sequence modeling has been oriented toward machine translation, which uses an encoder-decoder architecture [32, 33, 35]. The objective of the encoder part of the network is to summarize the input data as state vectors. The state vectors pass to the decoder layer, which is in charge of generating the outputs that best fit the input state. In the particular case of machine translation, a complete sentence such as "I love you" would be transformed into a single vector by the encoder, and then a decoder trained to generate text in Spanish would output "Te amo." In our case, we only focus on the encoder part of the model, since our primary goal is to train the RNN in such a way that we can predict the next element in the sequence. Each element of the sequence is a word embedding for timestamp t or, more generally, a fixed-size vector.
Figure 6 illustrates the structure of the LSTM-based network we use for word embedding prediction. In this particular diagram, we are using three LSTM cells. This means that we predict the embedding vector for the next timestamp using the word embeddings of the previous three timestamps. The number of LSTM cells, or sequence length, is a user-defined parameter. The dense layer before the predicted embeddings is required because the output layer of the LSTM is always between −1 and 1 due to the tanh activation of the hidden state.
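A minimal Keras sketch of such an encoder-only predictor, assuming a sequence length of three timestamps and our embedding size of 64 (the LSTM width is illustrative):

```python
import tensorflow as tf

seq_len, emb_size = 3, 64

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len, emb_size)),
    tf.keras.layers.LSTM(128),        # encoder; emits the final hidden state
    tf.keras.layers.Dense(emb_size),  # maps the tanh-bounded state to R^{|u|}
])
model.compile(optimizer="adam", loss="mse")

# X: (num_samples, seq_len, emb_size) windows of word trajectories;
# y: (num_samples, emb_size) embeddings at the following timestamp.
# model.fit(X, y, epochs=50)
```

The GRU-based variant of the next subsection is obtained by swapping the LSTM layer for tf.keras.layers.GRU.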

GRU model structure
Figure 7 illustrates the structure of the GRU-based network used to model the evolution of our temporal word embeddings. In this particular diagram, we are using two GRU cells, but the user-defined sequence length parameter can be used to change this number. This model also requires a dense layer before the predicted embeddings layer because of the tanh activation function in the GRU cell structure.

Transformer-based approach
The sequential nature of the different versions of RNNs prohibits parallel training. The Transformer model [34] gets rid of the recurrence part of the previous networks and relies completely on a self-attention mechanism. This model allows for parallel training, and it is also good at learning long-term dependencies. Furthermore, distant elements can affect each other without running into the vanishing gradients issue [34].
The Transformer model uses a stack of self-attention layers to handle sequential inputs. The idea of self-attention is to be able to generate a compressed representation of a sequence by studying (or attending to) different positions of the input. The original Transformer has an encoder-decoder architecture, but as in the previous cases, we only use the encoder portion of the model.
Figure 9 presents a diagram of the resulting network, which consists of a stack of encoder layers, a positional encoding element, and the inputs and outputs. Each encoder layer contains:
• A multi-head attention element, which is the most important (and complex) element of the encoder.
• A feed-forward dense network, which consists of a dense layer with a ReLU activation function followed by a regular dense layer.
• A normalization of the sum of the residual connection and the output of each of the previous two elements. This is introduced to avoid the vanishing gradients issue.
The positional encoding is required to give the model information about the temporal dimension of the input word embedding vectors. There are different positional encoding functions, but we use the one presented by Vaswani et al. [34], which consists of a vector of sine-cosine pairs at each position that rotate at different frequencies.
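The standard sinusoidal encoding from [34] can be computed as follows (a self-contained sketch assuming an even model dimension; how it is combined with our inputs follows the usual practice of element-wise addition):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding of Vaswani et al. [34]: sine/cosine
    pairs whose rotation frequencies decrease with the dimension index."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angle = pos / np.power(10000.0, 2.0 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe  # added element-wise to the input embeddings

# Example: encodings for a window of 3 timestamps of 64-dimensional vectors.
# pe = positional_encoding(3, 64)
```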

Experimental results
We performed experiments using multiple datasets: a synthetic dataset, the PubMed pandemic dataset, the PubMed COVID dataset, the NyTimes news dataset, and the National Vulnerability Database (NVD) dataset. Experimental analysis is conducted using these datasets to evaluate the two stages described in the problem description, namely (1) the generation of temporal embeddings and (2) the prediction of a future embedding space. Based on these two stages, we present our experimental results in Sects. 5.1 and 5.2.
The datasets that we used in this section are outlined in Table 1.
We generated the synthetic dataset consisting of 10,000 words and ten timestamps. For this dataset, we already know the 10-nearest neighbors of each word in every timestamp. Neighborhoods of larger sizes will contain random words starting at the 11th nearest neighbor.
The PubMed pandemic dataset contains 328,908 abstracts of pandemic- and epidemic-related biomedical publications. The abstracts were published between the years 2000 and 2020. The PubMed COVID dataset contains 374,883 abstracts of biomedical papers related to COVID-19, published between 2020 and 2022. The corpus was collected from the Kaggle COVID-19 Open Research Dataset Challenge [36]. The PubMed CANCER dataset consists of 21 years of data with 613,949 abstracts that contain the keyword cancer.
The New York Times corpus contains 812,857 news articles that were published over 29 years. We collected Russia-Ukraine-related news from NyTimes, comprising 50,000 news articles published between the years 2020 and 2021. The NVD dataset includes 165,552 bulletins published in the last 20 years.
The named entities from the NyTimes and NVD datasets are extracted using spaCy's named entity recognition (NER) model [37], and the biomedical entities from the PubMed abstracts are extracted using scispaCy's biomedical named entity recognition [38].
The analysis of the PubMed and NVD datasets was limited to the top 3,000 entities and words based on their term frequency-inverse document frequency (tf-idf) weights, while the analysis of the New York Times corpus was limited to the top 5,000 entities and words, based on their tf-idf weights.
Our model treats named and biomedical entities as words, but this does not exclude individual words. When a biomedical entity like "acute respiratory syndrome" appears, we generate word embeddings for its constituent words ("acute," "respiratory," and "syndrome") and also generate embeddings for the entity as a whole. Using the entire entity enables us to capture the specific context of the entity, such as "acute respiratory syndrome." In summary, our model considers named and biomedical entities along with all the individual words in the text corpus.

Experiments on the generated temporal embeddings
We evaluate our temporal word embedding method by comparing its performance with that of a regular tf-idf model, the temporal tf-idf model [4], dynamic Bernoulli embeddings [11], temporal word embeddings with a compass (TWEC) [12], and TempoBert [3]. In all our experiments, we used an embedding size of 64.
We seek to answer the following questions through experiments and case studies.
1. What is the effect of introducing different penalty terms in our objective function? (Sect. 5.1.1)
2. How well do the models perform in terms of capturing the neighborhood of entities over time, compared to the temporal tf-idf? (Sect. 5.1.2)
3. How well do the models perform in terms of capturing changes in the neighborhood over time in the respective embedding spaces? (Sect. 5.1.3)
4. How well does our algorithm track the quick evolution of a specific entity, such as COVID, compared to other methods? (Sect. 5.1.4)
5. How well does our algorithm capture the semantic evolution of a general term, such as pandemic, compared to other methods? (Sect. 5.1.5)

Effect of penalty terms
In this experiment, we study the effect of the different versions of our objective function on the quality of the temporal word embedding model, focusing on the task of tracking semantic evolution. The versions under study correspond to ϑ_1 (Eq. (2)), ϑ_2 (Eq. (3)), ϑ_3 (Eq. (5)), F_a (Eq. (7)), F_b (Eq. (8)), and F_c (Eq. (9)). We quantify the quality of the resulting vectors with two different metrics: similarity and continuity. The similarity is measured as the number of intersections between the word neighborhoods obtained using the temporal tf-idf model and those obtained using each of the different versions of our objective function. The goal of the similarity evaluation is to quantify how well our model mimics the temporal tf-idf model. It must be noted that we did not expect a perfect match in the neighborhoods of words since the temporal tf-idf representation does not take into account latent contextual relationships between words.
The continuity is measured using the average, maximum, and minimum mean squared errors (MSE) across consecutive timestamps for the word vectors obtained using the different versions of our objective function.
Figure 10a shows the results for the similarity evaluation with the synthetic dataset described at the beginning of Sect. 5. The objective function labeled F_a in the figure performs significantly better than the other formulations. If we discard F_b and F_c, it is possible to see how the similarity improves with the progression in which we developed our objective function. Furthermore, taking into account that only the top-10 nearest neighbors are known and set as accurate in the synthetic data while the rest of the neighbors are random, having an average of 8 intersections means that our model can correctly capture the semantic evolution of the synthetic dataset. Note that MSE is computed between two vectors of the same word in two consecutive timestamps. An MSE of 0.0 indicates no change in the embedding vector of a word between two timestamps. The avg. MSE of a word is the average of the MSEs of that word over each pair of consecutive timestamps. The avg. MSE on the y-axis of Fig. 10b refers to the average over "the avg./max./min. (aggregated) of MSEs over all pairs of consecutive timestamps" of a set of 1000 randomly chosen words. Also, note that if the average (over 1000 words) of the maximum of the pairwise consecutive-timestamp MSEs is too small, that would indicate that the vectors of those 1000 words are not changing much over time. Conversely, if the average (over 1000 words) of the minimum of the pairwise consecutive-timestamp MSEs is large, that indicates that the word vectors are changing a lot over time. In general, the average (over 1000 words) of the average of the pairwise consecutive-timestamp MSEs can be considered a measure to compare the continuity aspect of different models.
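The continuity statistics described above can be computed as in this sketch (our formulation of the metric, not the authors' code):

```python
import numpy as np

def continuity_stats(U_i):
    """U_i: |T| x |u| trajectory of one word. Returns (avg, max, min) of the
    MSEs between the word's vectors at consecutive timestamps."""
    diffs = U_i[1:] - U_i[:-1]
    mses = np.mean(diffs ** 2, axis=1)  # one MSE per consecutive pair
    return mses.mean(), mses.max(), mses.min()

# Averaging each statistic over a random sample of 1000 words gives the
# values plotted on the y-axis of Fig. 10b:
# stats = np.array([continuity_stats(U[i]) for i in sample_indices])
# avg_of_avg, avg_of_max, avg_of_min = stats.mean(axis=0)
```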
Figure 10a shows how much of the neighborhood each model follows compared to the target temporal tf-idf model, reflecting the similarity aspect with temporal tf-idf. Higher similarity with the training temporal tf-idf model's neighborhood is expected of the vectors generated by a model. Figure 10b shows how much the consecutive vectors change, reflecting the continuity aspect. We want a good model to follow the neighborhood of the temporal tf-idf model as closely as possible, with minimal change in the generated embeddings over time.
We observed that F_c in Fig. 10b has an average MSE of 0.0 (over 1000 words) in terms of average, maximum, and minimum. Even though a low MSE is desirable from a model, if a high similarity (intersection) with the neighborhood of temporal tf-idf is not observed in the neighborhoods of words in the temporal embedding space, the vector outcomes will be useless. F_c in Fig. 10a exhibits a low intersection, and hence F_c is not considered a reasonable model that can reflect the neighborhood of the temporal tf-idf-based training data. Notice that in Fig. 10a, F_a has the best intersection with the temporal tf-idf-based neighborhood, reflecting that F_a best captures the neighborhood of the training data. We need to see if F_a embeddings also change minimally in the temporal embedding space while maximally intersecting with the training temporal tf-idf-based data. F_a in Fig. 10b reflects that the averages (over 1000 words) of the average, maximum, and minimum are smaller than those of the other models: F_b, ϑ_1, ϑ_2, and ϑ_3. This indicates that F_a gives the most continuity and the maximum desired resemblance to the neighborhood of the training data.
Fig. 11 Jaccard similarity in the neighborhoods between temporal tf-idf [4] and each of the three models: TWEC, Bernoulli embeddings, and our temporal word embedding model (PubMed (pandemic) dataset; embedding size = 64)

Capability to capture content neighborhood
A major purpose of any temporal or dynamic word representation modeling is to capture content similarity over time. We compare three models (TWEC, dynamic Bernoulli embeddings, and our temporal word embedding) with temporal tf-idf [4] in Fig. 11, using the PubMed (pandemic) dataset. We use temporal tf-idf [4] for this comparison because it models content smoothly over time. Each line in the figure represents the average set-based Jaccard similarity between the 10-nearest neighbors of 1000 randomly selected entities using temporal tf-idf and the 10-nearest neighbors of the same entities using one of the three models. Figure 11 demonstrates that our embedding model and TWEC have closer similarity to temporal tf-idf than Bernoulli embeddings. Additionally, our model has greater similarity with the neighborhood of temporal tf-idf in the earlier timestamps, compared to both TWEC and Bernoulli embeddings. Bernoulli embeddings over different timestamps do not change much to capture the evolution of words, which results in an almost horizontal line for Bernoulli embeddings in Fig. 11.
Our model smoothly spreads word influence using diffusion over the years. As a result, our embedding model performs significantly better than other methods, even when the vocabulary is smaller in the earlier timestamps.

Capability to detect changes in neighborhood
An objective of a temporal embedding technique is to capture changes in the neighborhood of each word over time. The ability to capture changes allows us to study the evolution of concepts. This subsection provides an experiment to investigate how much change occurs from one year to another in the neighborhood using different models. We quantify the change in terms of set-based Jaccard dissimilarity (1.0 − Jaccard similarity) between the neighborhood of a word in the current year and the neighborhood of the same word in the previous year. The average Jaccard dissimilarity over many words in a certain year for a model gives an overall idea of how much the model can detect changes in the neighborhood.
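For reference, the change measure can be computed as in the following sketch (our formulation of the set-based metric):

```python
def jaccard_dissimilarity(neighbors_curr, neighbors_prev):
    """Set-based Jaccard dissimilarity between a word's nearest-neighbor
    sets in the current and the previous year."""
    a, b = set(neighbors_curr), set(neighbors_prev)
    return 1.0 - len(a & b) / len(a | b)

# Averaging this value over many words in a given year yields one point on
# the per-model curves of Fig. 12.
```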
Figure 12 demonstrates the average Jaccard dissimilarity (change) at each year for five different models (our temporal embedding model, Bernoulli embeddings, TWEC, vanilla tf-idf computed independently at each year, and temporal tf-idf), using 1000 randomly selected entities from the PubMed (pandemic) dataset. The plot shows that our temporal embedding model detects more changes in terms of average Jaccard dissimilarity compared to the other models. Furthermore, the purple line in Fig. 12 represents vanilla tf-idf, which measures the frequency and importance of words without complex modeling. Here, it is notable that the changes in the tf-idf line over time suggest that the meaning or context of a word evolves.
The Bernoulli embeddings capture the least amount of change. Based on further investigation (not covered in this paper), we noticed that Bernoulli embeddings rarely capture any changes. These embeddings capture only a few long-term changes, whereas our temporal embedding model captures both long-term and short-term changes. TWEC captures more changes than Bernoulli embeddings and temporal tf-idf, but fewer changes than vanilla tf-idf. Our temporal word embedding performs even better than vanilla tf-idf. Contextual changes are best captured using our temporal embedding because the objective function of our model spreads the effect of each word smoothly from the current year to other years. As a result, our model captures changes, in terms of average Jaccard dissimilarity, better than the regular tf-idf and temporal tf-idf models.
Our model is clearly superior in terms of the ability to capture changes. In Sect. 5.1.4, we explain how this superiority in the detection of changes in the neighborhood helps in analyzing evolving concepts, such as COVID-19.

Analyzing the neighborhood of COVID-19
In this experiment, we analyze the changes in the neighborhood of the word COVID in the PubMed (COVID) dataset. Figure 13 presents how the similarities between the entity COVID and some of its nearest neighbors (China, epidemic, pandemic, and patients) change over time using (a) the TWEC model, (b) Bernoulli embeddings, (c) temporal tf-idf, and (d) our temporal embedding model. The data contain ranges of two weeks from January to July of 2020. From August 2020, COVID-19 was considered a pandemic, which is a global outbreak rather than a local epidemic [39]. In Fig. 13, we observe that temporal tf-idf (Fig. 13c) and our temporal embedding (Fig. 13d) can detect the rising trend of pandemic and the falling trend of the word epidemic. This observation matches our known knowledge regarding COVID-19. TWEC (Fig. 13a) is able to track this to some degree but with zigzag patterns in the trends. Bernoulli embeddings (Fig. 13b) give higher similarity for pandemic than epidemic with the word COVID, which is correct for July, but the timeline does not demonstrate any rising and falling trends of the words pandemic and epidemic, as it should based on our knowledge about COVID-19.
Our temporal embedding model (Fig. 13d) demonstrates that the word China had high similarity with COVID in the beginning. The similarity started to fall by the end of March. According to our model, starting at the end of March, the word epidemic started to exhibit lower similarity with COVID, while the word pandemic started to show higher similarity. The temporal tf-idf model (Fig. 13c) demonstrates a similar trend. The trends resemble our common knowledge regarding the COVID-19 pandemic. Also, TWEC (Fig. 13a) has an overall downward trend for the word China, but with zigzag movements over the timeline. Bernoulli embeddings (Fig. 13b) do not demonstrate any change and capture a static similarity for the entire timeline. We noticed that the underlying vectors in Bernoulli embeddings change, but the neighbors of a word do not change much. That indicates that the changes in the vectors generated by Bernoulli embeddings might be the result of some scaling effect rather than changes due to the reformation of the neighborhood.
Fig. 14 Cosine similarity of the top nearest neighbor of pandemic in each year using all methods. Nearest neighbors are written at selected peaks. Our temporal embedding method provides better context for pandemic. Embedding size = 64
We know that the number of COVID-infected patients increased over the months of 2020. Our temporal embedding model (as well as the temporal tf-idf) captures the rising similarity of the word patients in the context of COVID quite smoothly (Fig. 13d). TWEC also shows an upward trend, though a less smooth one. The Bernoulli embeddings, however, do not demonstrate any changes in the similarity between the words patients and COVID.
This experiment demonstrates that our temporal embedding model captures the short-term changes in content (as shown by temporal tf-idf) while also capturing the context that we can track smoothly to study the evolution of a concept, such as COVID. In contrast, Bernoulli embeddings construct a context that is intractable in terms of similarity. TWEC provides noisy patterns that are difficult to interpret.

Analysis of the word Pandemic
With the rise of the COVID-19 pandemic, it has become essential to study how biomedical scientists have dealt with pandemics in past years. Such an analysis requires a model that can capture long-term changes. In this experiment, we attempt to track the closest term to the word pandemic in each year of the PubMed (pandemic) dataset, which spans biomedical abstracts from 2000 to 2020.
Each line of Fig. 14 plots the similarity of the top nearest neighbor of the word pandemic in each year. The five lines represent similarities using five different models: Bernoulli embeddings, our temporal embedding model, temporal tf-idf, vanilla tf-idf, and TWEC. Notice that our temporal embedding model demonstrates peak similarities in 2009/2010 and in 2020, when H1N1 influenza (swine flu) and COVID-19, respectively, became prominent. This signal from our temporal embedding model reflects the fact that the worst pandemics in the last 20 years are the H1N1 influenza in 2009 [40] and COVID-19 in 2020 [41]. Note that in other years, words like concerns (2004) and public (2015) are detected as the top nearest neighbors, which are not highly similar to the word pandemic. This indicates that no entities appeared very close to the word pandemic in those years.
TWEC captures influenza H1N1 in the middle of the timeline but fails to capture COVID-related keywords in 2020 as the closest entity to pandemic. In Fig. 14, the Bernoulli model can pick up coronavirus as the nearest neighbor of pandemic, but it was not able to pick up influenza in its trend. Moreover, coronavirus appears in all the years as the top nearest neighbor of pandemic, which is not correct because the coronavirus outbreak started in 2019 and became a pandemic in 2020 [41]. Temporal tf-idf and vanilla tf-idf were able to pick up coronavirus/COVID. Temporal tf-idf and vanilla tf-idf were also able to pick up influenza subtype H1N1 (swine flu), but the respective similarities were not high.
Based on the experiment presented in this subsection, our temporal embedding model has the ability to separate highly contextual words (such as H1N1 and COVID) of a concept (such as pandemic) via similarity peaks. Our model helps in determining prominent neighbors of a concept in the past. Our vectors are able to construct a peak for a prominent nearest neighbor because our method models diffusion. That is, a concept that appears today affects the past and the future to some extent, regardless of whether the concept directly appears in the contents or not.

Comparison with BERT temporal embedding
Based on recent literature, there is a surge in applications using Bidirectional Encoder Representations from Transformers (BERT). BERT provides vectors for each appearance of a word. The vectors of the same word appearing in the same timestamp can be used to create a word embedding vector for that word in that timestamp, leading to temporal word embeddings. The TempoBert model [3] is such a mechanism. In the TempoBert model, the timestamps are added at the start and end of each sentence as a means of training text data for different timestamps. The BERT-generated embeddings are clustered for semantic evolution. In our experiment here, we use a variant of TempoBert, referred to as temporal BERT, which does not cluster vectors but rather calculates the arithmetic mean of the embedding vectors of a word at each timestamp to generate one vector per word per timestamp.
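A sketch of the mean-pooling step just described follows; the aggregation is our variant (TempoBert itself clusters the vectors), and the data layout is an assumption:

```python
import numpy as np
from collections import defaultdict

def temporal_bert_embeddings(occurrences):
    """occurrences: iterable of (word, timestamp, vector) triples, one per
    appearance of a word in the corpus. Returns {(word, timestamp): mean
    vector}, i.e., one embedding per word per timestamp."""
    buckets = defaultdict(list)
    for word, t, vec in occurrences:
        buckets[(word, t)].append(vec)
    return {key: np.mean(vecs, axis=0) for key, vecs in buckets.items()}
```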
A drawback of the BERT-based model is its excessive computation time compared to other embedding models due to the complex structure of the deep learning model. Therefore, for this experimental analysis, we downsample the corpus to 10% of articles per month to ease the computation for BERT. The corpus for this experiment includes 10% of PubMed COVID-related articles (between January 2020 and May 2022) and 10% of NyTimes Russia-Ukraine-related articles (over 24 months between January 2020 and December 2021). In this section, we examine how the neighborhoods of the words COVID and UKRAINE change over time using the word embeddings generated by temporal BERT and our temporal embedding model.
Figure 15 presents the cosine similarity of embeddings of the word "COVID" with its neighboring words at different timestamps. The plot on the left (a) is generated using embeddings from temporal BERT, and the plot on the right (b) is generated using our temporal embedding model. In Fig. 15b, we observe that at the beginning of January 2020, the words "case," "china," "epidemic," and "pandemic" are the closest neighbors of the word "COVID." Then, over time, the similarities of the words "china" and "epidemic" with "COVID" decrease, while the similarities of the words "pandemic," "united states," and "vaccination" with "COVID" increase. The changing neighborhood of "COVID" reflects how COVID spread throughout the world over time, starting from China and eventually becoming the world's most widespread pandemic. Further, vaccination programs were initiated at the beginning of 2021. Our model captures these trends well, as reflected in the plot of Fig. 15b. The plot on the left (Fig. 15a), on the other hand, provides mostly straight lines and fails to capture such changes in the neighborhood of "COVID."
Fig. 15 Cosine similarity of the embedding of "COVID" with neighbor words at different timestamps (temporal BERT vs. our model)
Another experimental result, in Fig. 16, presents the cosine similarity of embeddings of the word "UKRAINE" with its neighboring words at different timestamps. Based on our known knowledge from the news, Russia invaded Ukrainian territory in February 2022 [42]. As part of our experiment, we trained both temporal BERT and our model with news data collected prior to the invasion. The embeddings generated by our model (Fig. 16b) show that the similarities between each of the words "cold war," "invasion," and "Russia" and the word "UKRAINE" rise at the end of the year 2021, prior to the actual invasion. Temporal BERT, however, failed to detect any changes in the neighborhood of the word "Ukraine." This analysis indicates that the embeddings generated by our model have better prediction capability compared to the temporal BERT embedding model.
Fig. 16 Cosine similarity of the embedding of "UKRAINE" with neighbor words at different timestamps (temporal BERT vs. our model)
Temporal BERT does not perform well due to the fact that complex deep neural network models require extensive training on large datasets. To train a BERT model from scratch, it is recommended to use billions of sentences, which is sometimes not available for a specific domain. In our experiment related to COVID (Fig. 15), we used a pre-trained BERT model (clinical-bert [43]) and fine-tuned it with our data. For the Russia-Ukraine-related experiment (Fig. 16), we used bert-small [44, 45], which we fine-tuned with the smaller dataset. In these experiments, we find that despite fine-tuning, the BERT model fails to capture the temporal evolution of words. Our model performs well even with small datasets, requiring fewer training samples.

Stacking cosine similarities
A streamgraph is a stacked area chart widely used in concept visualization [46]. In this subsection, we analyze the entity president with a streamgraph built from the New York Times dataset. Cosine similarities of the nearest neighbors of the entity president are stacked in the streamgraph of Fig. 17.
We observe how the entities Obama and Trump started to get closer to president only a few years before their presidencies. Biden started to get closer to president in the year 2008, when he became the vice president. The cosine similarity between the word Biden and the word president increased in 2020, which matches the actual event of Biden winning the presidential election in 2020. We also observe the particular cases of political families, such as the Bush and Clinton families, that have been relevant to presidential elections for decades.

Fig. 17 Streamgraph of the word President using the New York Times corpus. The width of each band represents the cosine similarity of a word with the word President. Word vectors at different timestamps are generated using our temporal word embedding (embedding size = 64)
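A streamgraph like Fig. 17 can be assembled by stacking the per-year similarity series; the sketch below uses matplotlib's stackplot with the "wiggle" baseline, and the smooth toy series are assumptions in place of real similarities.

```python
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(2008, 2022)
# Toy per-year cosine similarities with "president", one series per neighbor.
sims = {
    "obama": np.exp(-0.5 * ((years - 2012) / 3.0) ** 2),
    "trump": np.exp(-0.5 * ((years - 2018) / 3.0) ** 2),
    "biden": np.exp(-0.5 * ((years - 2021) / 4.0) ** 2),
}

# baseline="wiggle" turns a stacked area chart into a streamgraph.
plt.stackplot(years, list(sims.values()), labels=list(sims.keys()),
              baseline="wiggle")
plt.xlabel("year")
plt.legend(loc="upper left")
plt.show()
```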

Sensitivity analysis for hyperparameters
In this experiment, we evaluate the effect of sweeping over different values for (a) the embedding size, (b) the exponential factor β, (c) the scale factor α, (d) the temporal diffusion filter standard deviation σ, (e) the smoothness penalty factor ω, and (f) the learning rate of our model. We use the average number of intersections per timestamp between the neighborhoods obtained using our method and those generated using the temporal tf-idf method as our accuracy metric.
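Such a sweep can be organized as a simple grid search; the sketch below is schematic, and train_and_score is a hypothetical stand-in for a full training run that returns the intersection-based accuracy. The candidate values are illustrative, not our final grid.

```python
from itertools import product

# Candidate values for each hyperparameter (illustrative only).
grid = {
    "embedding_size": [32, 64, 128],
    "beta": [0.5, 1.0, 2.0],
    "alpha": [0.1, 0.5, 1.0],
    "sigma": [0.5, 1.0, 2.0],
    "omega": [0.001, 0.01, 0.1],
    "learning_rate": [1e-4, 1e-3, 1e-2],
}

def train_and_score(**params):
    """Hypothetical: train our model with `params` and return the average
    number of neighborhood intersections against temporal tf-idf."""
    return 0.0  # placeholder

# Exhaustively evaluate every combination and keep the best one.
best = max(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=lambda params: train_and_score(**params),
)
```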
Figure 18 presents the effect of changing the parameters of interest on the accuracy of our model. The results show that the embedding size, the scale factor (α), and the smoothness penalty factor (ω) have significant effects on the accuracy of neighborhood detection.
Figure 18 presents only the results obtained using the NVD dataset. We performed similar analyses for the other datasets, which resulted in similar findings.
Figure 18a shows that an embedding size of 128 resulted in slightly better performance than a size of 64. However, this slight improvement does not justify the significant increase in training time and computational/storage complexity: training time increased from ∼18 hours to ∼50 hours per experiment on a GPU-enabled cluster when we increased the embedding size from 64 to 128. Therefore, we kept an embedding size of 64. Selecting all the parameters of a neural network with a complex cost function like ours is itself a "big data" challenge; after many iterations, we found that a size of 64 provides reasonably meaningful results in reasonable run times with the available resources. The results in Fig. 18d demonstrate that a smaller standard deviation for the Gaussian filter yields a slight improvement in performance; accordingly, we selected a standard deviation of 0.5. Additionally, we set the smoothness penalty factor to 0.01, as this value significantly enhances the average intersection between the neighborhoods obtained by temporal tf-idf and by our proposed embedding model.

Experiments on prediction capability
One of the major downstream applications of the generated temporal embeddings is the prediction of future embeddings. In this section, we evaluate the selected time-series modeling techniques for generating temporal word embedding predictions. We perform the experiments on three different datasets: (1) PubMed abstracts [47], (2) New York Times articles, and (3) National Vulnerability Database (NVD) bulletins [48].
The temporal word embeddings used as baseline data were generated using the method presented in Sect. 4. We split the embeddings for each dataset into training and test sets based on their timestamps. The word embeddings of the first X years out of |T| constitute the training data. The generated models were tested using the word vectors for the last |T| − X years, which are not part of the training data. The parameter X is user-defined.
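A sketch of the timestamp-based split, assuming the embeddings are held in a dict keyed by year (the matrix layout is an assumption for illustration):

```python
import numpy as np

# Toy stand-in: one (vocab_size, dim) matrix per yearly timestamp.
embeddings = {year: np.zeros((1000, 64)) for year in range(2010, 2022)}

X = 8                                   # user-defined number of training years
timestamps = sorted(embeddings)         # the |T| timestamps in order
train = {t: embeddings[t] for t in timestamps[:X]}    # first X years
test = {t: embeddings[t] for t in timestamps[X:]}     # last |T| - X years
```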
We evaluate the performance of the time-series modeling techniques using two different metrics: (1) the average mean squared error (MSE) between the predicted and the actual word vectors, and (2) the neighborhood similarity (explained next). We define the neighborhood similarity as the average, over the test timestamps and over the neighborhood sizes k ∈ S with S = {1, 2, 4, 6, 16}, of the number of intersections between the neighborhoods generated using the actual word embeddings and those generated using the predicted word embeddings, divided by k. The neighborhood similarity is computed only for the test data timestamps. We formalize the concept of neighborhood similarity as follows:

$$\mathrm{NS}(w) = \frac{1}{|T_{\text{test}}|} \sum_{t \in T_{\text{test}}} \frac{1}{|S|} \sum_{k \in S} \frac{|N_a(w,t,k) \cap N_b(w,t,k)|}{k},$$

where N_a(w, t, k) returns the k nearest neighbors of word w at time t obtained from word embeddings generated using method a, and N_b is defined analogously for method b.
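A direct sketch of this metric, under the assumption that each timestamp maps to a (vocab_size, dim) embedding matrix and words are row indices:

```python
import numpy as np

S = [1, 2, 4, 6, 16]

def neighbors(emb, word, k):
    """Indices of the k nearest neighbors of `word` by cosine similarity."""
    v = emb[word]
    sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v) + 1e-12)
    sims[word] = -np.inf                     # exclude the word itself
    return set(np.argsort(-sims)[:k])

def neighborhood_similarity(actual, predicted, word):
    """Average over test timestamps and k in S of |N_a ∩ N_b| / k."""
    per_timestamp = []
    for t in actual:                         # keys are the test timestamps
        per_timestamp.append(np.mean([
            len(neighbors(actual[t], word, k)
                & neighbors(predicted[t], word, k)) / k
            for k in S
        ]))
    return float(np.mean(per_timestamp))
```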
In this section, we seek to answer the following questions.
1. Which sequence modeling technique is best suited to predict future word embeddings? (Sect. 5.2.1)
2. How sensitive is the selected time-series modeling technique to changes in the hyperparameters? (Sect. 5.2.2)
3. How well does our algorithm predict the evolution of a specific term? (Sect. 5.2.3)

Fig. 19 Neighborhood similarity between the neighborhoods obtained with the baseline temporal embedding method (Eq. 7) and neighborhoods obtained using the predict-next technique with different sequential models for the NVD, PubMed, and New York Times datasets, using dynamic values of K

Model selection for prediction
The main goal of this experiment is to identify the sequence modeling technique with the best performance in predicting the semantic evolution of the given corpora. First, we identify the best hyperparameters by performing a sensitivity analysis for (1) LSTM, (2) GRU, (3) LSTM with attention, (4) GRU with attention, and (5) the Transformer model. Section 5.2.2 describes this sensitivity analysis in more detail.
For each model, we generate predicted word embeddings for every timestamp of the test dataset. We use the neighborhood similarity metric to measure the performance of each model.
Figures 19 and 20 present a comparison, for each dataset, between the best versions of each sequence modeling technique. Figure 19 presents the results in terms of the neighborhood similarity, while Fig. 20 shows the effect of changing the neighborhood size K on the average number of intersections between the baseline and the predicted embeddings.
The results show that the GRU- and LSTM-based networks outperform the more complex sequential models on all datasets. For both GRU and LSTM, the inclusion of the attention mechanism did not result in better performance. This is because the attention mechanism is explicitly designed to model long sequences [34], whereas our text-based embeddings have a limited number of timestamps, resulting in a short temporal sequence. Therefore, GRU and LSTM without the attention mechanism are more appropriate for predicting future embeddings.
Fig. 20 Average number of intersections for different neighborhood sizes k per timestamp between the neighborhoods obtained with the baseline temporal embedding method (Eq. 7) and neighborhoods obtained using the predict-next technique with different sequential models for the NVD, PubMed, and New York Times datasets (embedding size = 64), using dynamic values of K

Fig. 21 Neighborhood similarity between the neighborhoods obtained with the baseline temporal embedding method (Eq. 7) and the neighborhoods obtained using the predict-next techniques with different sequential models for the NVD, PubMed, and New York Times datasets, using dynamic values of K, while changing (a) the batch size and (b) the fraction of timestamps used for training

Sensitivity analysis
In this section, we present the effect of sweeping the hyperparameters of the GRU-based sequential model. We performed similar analyses for the other variants but, for brevity, present only the results obtained with the best model. The evaluated parameters are (a) the batch size, (b) the fraction of timestamps used for training, (c) the input sequence length, (d) the number of encoder units of the neural network, (e) the optimizer, and (f) the learning rate. We use the neighborhood similarity metric to quantify the performance of each parameter combination.
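As a concrete reference for the swept parameters, a minimal PyTorch sketch of a GRU-based predict-next model follows; the layer sizes, window length, and toy data are assumptions, not our tuned configuration.

```python
import torch
import torch.nn as nn

class NextEmbeddingGRU(nn.Module):
    """Predicts the embedding at timestamp t+1 from a window of prior ones."""
    def __init__(self, dim=64, encoder_units=128):
        super().__init__()
        self.gru = nn.GRU(dim, encoder_units, batch_first=True)
        self.head = nn.Linear(encoder_units, dim)

    def forward(self, window):              # window: (batch, seq_len, dim)
        _, hidden = self.gru(window)        # hidden: (1, batch, encoder_units)
        return self.head(hidden[-1])        # predicted next embedding

model = NextEmbeddingGRU()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: windows of 3 consecutive yearly vectors and their successors.
window, target = torch.randn(32, 3, 64), torch.randn(32, 64)
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(window), target)
loss.backward()
optimizer.step()
```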
Figure 21 presents the effect of changing the batch size (Fig. 21a) and the fraction of timestamps used for the training/test split (Fig. 21b) on the neighborhood similarity. Based on the plots, changing the batch size does not have a significant effect on the neighborhood similarity, and changing the train/test split has little effect as well.

Fig. 22 Neighborhood similarity between the neighborhoods obtained with the baseline temporal embedding method (Eq. 7) and the neighborhoods obtained using the predicted embeddings with different sequential models for the NVD, PubMed, and New York Times datasets, using dynamic values of K, while changing (a) the input sequence size and (b) the number of encoder units
Figure 22a presents the effect of changing the input sequence length on the neighborhood similarity. The plot reveals that using an input consisting of the word embeddings for two or more timestamps results in slightly better performance. Figure 22b shows that changing the number of encoder units of the neural network has a negligible effect on the neighborhood similarity.
The plots of Figs. 21 and 22 indicate that, in general, the generation of word neighborhoods from predicted embeddings is not sensitive to the batch size, the training/test split, the input sequence size, or the number of encoder units.
Figure 23 presents the effect of changing the optimizer and the learning rate on the neighborhood similarity. It is important to note that, in these plots, a learning rate of 0.0 on the x-axis actually represents the neighborhood similarity obtained using the dynamic learning rate presented by Vaswani et al. [34]. The results clearly exhibit downward or upward trends with increasing learning rates. This indicates that the neighborhood similarity is sensitive to the learning rate; hence, the model requires tuning with different learning rate values to ensure the optimizer does not get stuck in local minima.
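For reference, the dynamic schedule of Vaswani et al. [34] scales the rate up linearly during a warmup phase and then decays it with the inverse square root of the step count; a small sketch (the warmup length is an assumed default):

```python
def dynamic_lr(step, d_model=64, warmup_steps=4000):
    """Learning rate schedule from Vaswani et al. [34]:
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)                      # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```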

Case study and trend analysis
In this section, we perform a qualitative analysis of the performance of the predicted temporal word embeddings on the task of tracking semantic evolution. We report prediction results using LSTM; similar results are observed using GRU. We used New York Times data for two case studies: how well do the predicted embeddings (1) for words relevant to the word war represent our knowledge about contemporary political tension between different countries, and (2) capture entities in the US political domain when studying the word president?
Figure 24a contains the cosine similarity of the nearest neighbors of the word war. Relevant nearest neighbors ("Russia," "Ukraine," "ISIS," "Syria," and "Taliban") were selected from the predicted embedding for the year 2022. Embeddings from previous years were used for training the LSTM. Figure 24a shows that the embeddings of the words "Syria" and "ISIS" were the two most similar to the embedding of war in the beginning; their similarity with war gradually declined. The prediction for the year 2022 shows the continuing decline of "Syria" and "ISIS" away from the word war. The word "Taliban" became more similar to the word war between 2017 and 2020. The US-Taliban peace deal occurred in 2020 [49], and the Taliban took over the Afghan government in 2021 [50]. The prediction for 2022 reflects the end of a long-lasting "war" by showing that the similarity between "Taliban" and war will decrease.

Fig. 23 Neighborhood similarity between the neighborhoods obtained with the baseline temporal embedding method (Eq. 7) and neighborhoods obtained using the predicted embeddings with different sequential models for the NVD, PubMed, and New York Times datasets, with different optimizers and learning rates. The learning rate of 0.0 represents the dynamic learning rate presented by Vaswani et al. [34]

Fig. 24 Evolution of the neighborhood of the term war in the NyTimes dataset, where the embeddings of words at timestamp 2022 are extrapolated using LSTM. a Cosine similarity between the embedding of the word war and relevant word embeddings at different timestamps. b Rank (position) of relevant words in the nearest neighborhood of the word war at different timestamps
The similarities of the words "Russia" and "Ukraine" with war were declining through 2017-2020 but started to rise between 2020 and 2021. The predicted embeddings for the year 2022 show that "Russia" becomes the topmost nearest neighbor of war. Simultaneously, the similarity between "Ukraine" and war is predicted to increase in 2022.
Figure 24b shows the ranks of the same nearest neighbors of the word war. The smaller the rank value, the more similar a neighbor is to the word war. The vertical axis is on a logarithmic scale for better visualization. Trends similar to those in Fig. 24a are observed in Fig. 24b. For example, "Russia" and "Ukraine" are both coming closer in rank to the word war, which is reflected as a falling pattern in Fig. 24b. It is also noticeable that the prediction of "Syria" for 2022 has an upward direction, indicating that "Syria" is shifting away from the word war. On the other hand, the downward direction of "Russia" in the 2022 prediction supports our knowledge that "Russia" moved closer to war.

Fig. 25 Evolution of the neighborhood of the term President in the NyTimes dataset, where the embeddings of words at timestamp 2021 are extrapolated using LSTM. a Cosine similarity between the embedding of the word President and relevant word embeddings at different timestamps. b Rank (position) of relevant words in the nearest neighborhood of the word President at different timestamps
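The rank curves of Figs. 24b and 25b can be computed as the 1-based position of a target word in the similarity-sorted neighbor list; a sketch under the same matrix layout assumed earlier:

```python
import numpy as np

def rank_of(emb, query, target):
    """1-based rank of `target` among the neighbors of `query` (row indices)."""
    v = emb[query]
    sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v) + 1e-12)
    sims[query] = -np.inf                  # the query word is not its own neighbor
    order = np.argsort(-sims)              # neighbors sorted by similarity
    return int(np.where(order == target)[0][0]) + 1
```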
For the second case study, our word of interest is President. We selected relevant words ("Obama," "Trump," "Biden," "Bernie," "Democrats," and "Republicans") that have been closely associated with the word President based on our knowledge of US politics of the past decade. We used embeddings from 2011 to 2020 for training and extrapolated the embeddings for 2021.
In Fig. 25a, we observe that the embedding of the word "Obama" is the most similar to the embedding of President in 2011. In 2016, the embedding of the word "Trump" gains similarity to the word President, as "Trump" was elected the new president of the USA. On the other hand, the similarities of the words "Biden" and "Bernie" started to increase in 2018, as they were competing to become the presidential candidate for the 2020 election. The predicted embeddings for 2021 show that the similarity of the embedding of the word "Biden" increases substantially, making it the second closest neighbor among all the relevant words in this study. Figure 25b shows that the position of "Biden" among the nearest neighbors of President dropped significantly, from 1000 to 9, in the last 3 years, indicating that "Biden" quickly moved closer to the word President.
These case studies demonstrate that embeddings predicted with an LSTM-based model can capture the trends in the training data and provide well-explainable relationships between entities in a predicted embedding space of a future timestamp. The performance of forecasting the embeddings depends heavily on the ability of the temporal embeddings to capture trends. In Sect. 5.1, we demonstrated that the state-of-the-art models can neither capture changes in the context of words nor produce smooth transitions of word similarity over time. Our temporal embedding model, in contrast, captures changes in context and produces smooth transitions of embeddings; thus, it performs well when extrapolated into a future embedding space.

Conclusions
This paper introduces a new technique to generate low-dimensional temporal word embeddings for timestamped documents and to predict a future embedding space. We compare our temporal word embedding technique with other state-of-the-art techniques. Our temporal embeddings provide a representation that (1) can track changes observed within a short period, (2) evolves the word vectors smoothly over a continuous temporal vector space, (3) uses the concept of diffusion to capture trends better than existing models, (4) is low-dimensional, and (5) performs well in capturing future neighborhoods of words. Unlike previous dynamic embedding models, our proposed model creates a homogeneous space over every timestamp of the embeddings. As a result, the generated vectors perform well in the prediction of a future embedding space using conventional predictive models. In future work, we will expand on fine-tuning large language models (LLMs), such as OpenAI's GPT models, to reflect the temporal aspects of a dataset.

Fig. 2 Generation of temporal tf-idf from static tf-idf using a Gaussian filter

Fig. 3 Percentage of words in the NVD dataset that have X number of neighbors within a cosine distance threshold of 0.80

Fig. 6 Structure of the LSTM-based network used for word embedding prediction

Fig. 7 Structure of the GRU-based network used for word embedding prediction

Figure 8 illustrates a generic version of an attention-based network with RNNs. In our experiments, we replace the generic RNNs with LSTM and GRU cells. The α values in the figure refer to the attention weights. The diagram shows only one line going from one RNN to the next, but, as explained in the LSTM section, it is possible to have more than one state passed from one cell to the next. Similar to the LSTM- and GRU-based networks, it is possible to change the training sequence length, which is set to 5 in this example for illustrative purposes.

Fig. 8 Structure of the attention-based networks [33] used for word embedding prediction, with LSTM and GRU cells instead of vanilla RNNs

Fig. 9 Structure of the network used for word embedding prediction using only the encoder layer from the Transformer architecture [34]

Fig. 10 Study of the effect of the different versions of our objective function on the quality of the temporal word embedding models

Fig. 12 Comparison of the average Jaccard dissimilarity (change) between the 10-neighbors of the current year and the previous year

Fig. 13 Evolution of the word COVID in PubMed COVID-19-related abstracts published in 2020 using four different models: TWEC, Bernoulli embeddings, temporal tf-idf, and our temporal embedding model. Cosine similarity is used to compute the similarity between the vectors of the word COVID and any other word. The results shown in this figure are not reflective of any causal relationships; the temporal nature only reflects the contextual appearance of the entities in the data over time

Fig. 18 Average number of intersections per timestamp between the neighborhoods obtained with temporal tf-idf [4] and our temporal embedding model for the NVD data with changing (a) embedding size, (b) β, (c) α, (d) σ, (e) ω, and (f) learning rate