Unsupervised Embedding of Trajectories Captures the Latent Structure of Mobility

Human mobility drives major societal phenomena, including epidemics, economies, and innovation. Historically, mobility was constrained by geographic distance; in a globalizing world, however, language, culture, and history are increasingly important. Here, we show a mathematical equivalence between the word2vec model and the gravity model of mobility, and demonstrate, using three human trajectory datasets, that word2vec encodes nuanced relationships between locations into a systematic and meaningful vector space, providing a functional distance between locations as well as a representation for studying the many dimensions of mobility. Focusing on the case of scientific mobility, we show that embeddings implicitly learn cultural, linguistic, and hierarchical relationships at multiple levels of granularity. Connecting neural embeddings to the gravity model opens new avenues for the study of mobility.


Introduction
How far apart are two places? The question is surprisingly hard to answer when it involves human mobility. Although geographic distance has historically constrained human movements, it is becoming less relevant in a world connected by rapid transit and global airline networks.
For instance, a person living in Australia is more likely to migrate to the United Kingdom, a far-away country with a similar language and culture, than to a much closer country such as Indonesia 1 . Similarly, a student in South Korea is more likely to attend a university in Canada than one in North Korea 2 . Although geographic distance has been used as the most prominent basis for models of mobility, such as the Gravity 3 and Radiation 4 models, the decreasing relevance of geography calls for alternative ways of conceptualizing "distance" [5][6][7] .
Yet, functional distances are often low-resolution, computed at the level of countries rather than regions, cities, or organizations, and have focused on only a single facet of mobility at a time, whereas real-world mobility is multi-faceted, influenced simultaneously by geography, language, culture, history, and economic opportunity. A low-dimensional distance alone cannot represent the multitude of inter-related factors that drive mobility. Networks offer a way to represent many dimensions of mobility, yet edges encode only simple relationships between connected entities. Capturing the complexity of mobility requires moving beyond simple functional distances and networks to learning high-dimensional landscapes of mobility that incorporate its many facets into a single fine-grained and continuous representation.
Here, we demonstrate that the word2vec model (skip-gram with negative sampling) is equivalent to the gravity law of mobility, a fundamental mobility framework. We then empirically test the resulting representation by its ability to derive the functional distances between locations from mobility trajectories. After validating that it coherently represents real-world data, we leverage unique embedding-based methods to demonstrate word2vec's capacity for encoding rich information relating to geography, culture, language, and even prestige, at multiple scales of analysis. We embed trajectories from three datasets: U.S. passenger flight itinerary records, Korean accommodation reservations, and a dataset of scientists' career mobility between organizations captured in bibliometric records (detailed descriptions are available in the Methods).
We focus in most detail on scientific mobility due to its richness and importance. Scientific mobility-a central driver of the globalized scientific enterprise 8,9 that is strongly related to innovation 10,11 , impact 12,13 , collaboration 14 , and the diffusion of knowledge 10,15 -is not only an important topic in the Science of Science but also ideal for our study thanks to its well-known structural properties, such as the centrality of scientifically advanced countries and the strong prestige hierarchy 16,17 . In spite of its importance, understanding of scientific mobility has been limited by the sheer scope and complexity of the phenomenon 17,18 , further confounded by the diminishing role of geography in shaping the landscape of scientific mobility. Trajectories of scientific mobility are constructed using more than three million name-disambiguated authors who were mobile-having more than one affiliation-between 2008 and 2019, as evidenced by their publications indexed in the Web of Science database (see Methods).
As a scientist's career progresses, they move between organizations or pick up additional (simultaneous) affiliations, forming affiliation trajectories (Fig. 1a). Thus, the trajectories encode both migration and co-affiliation-the holding of multiple simultaneous affiliations, involving the sharing of time and capital between locations-that is typical of scientific mobility 12,14 (see Supporting Information).
Here, we study the skip-gram negative sampling (SGNS), or word2vec, neural-network architecture (see Methods). This neural embedding model, originally designed for learning models of language 19 , has been making breakthroughs by revealing novel insights into texts [20][21][22][23][24][25] , networks [26][27][28] , and trajectories [29][30][31][32][33][34] . It works under the notion that a good representation should facilitate prediction, learning a mapping between words that can predict a target word based on its context (surrounding words). The model is also computationally efficient, easy to use, robust to noise, and can encode relations between entities as geometric relationships in the vector space 22,25,[35][36][37] . When applied to the trajectory data, each location is encoded into a single vector representation, and vectors relate to one another based on the likelihood of locations appearing adjacent to one another in the same trajectory. Also, word2vec can be interpreted as performing metric recovery, recovering the underlying metric of the semantic manifold 37 .
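The skip-gram negative sampling objective described above can be sketched in a few dozen lines. The following toy implementation is our own minimal sketch, not the authors' pipeline; the function name and all hyperparameters are illustrative. It trains in- and out-vectors so that locations adjacent in a trajectory score higher than randomly drawn "noise" locations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sgns(trajectories, dim=16, window=1, neg=5, lr=0.05, epochs=50, seed=0):
    """Minimal skip-gram with negative sampling (SGNS) over location trajectories.
    Each location gets an in-vector (the embedding) and an out-vector; locations
    within `window` of each other form positive pairs, noise draws negatives."""
    rng = np.random.default_rng(seed)
    locs = sorted({loc for t in trajectories for loc in t})
    idx = {loc: i for i, loc in enumerate(locs)}
    n = len(locs)
    counts = np.zeros(n)
    for t in trajectories:
        for loc in t:
            counts[idx[loc]] += 1
    noise = counts / counts.sum()          # unigram noise distribution (gamma = 1)
    v_in = rng.normal(scale=0.1, size=(n, dim))
    v_out = rng.normal(scale=0.1, size=(n, dim))
    for _ in range(epochs):
        for t in trajectories:
            for pos, target in enumerate(t):
                i = idx[target]
                lo, hi = max(0, pos - window), min(len(t), pos + window + 1)
                for ctx in t[lo:pos] + t[pos + 1:hi]:
                    j = idx[ctx]
                    # Positive pair: increase u_j . v_i
                    g = lr * (1.0 - sigmoid(v_out[j] @ v_in[i]))
                    old = v_in[i].copy()
                    v_in[i] += g * v_out[j]
                    v_out[j] += g * old
                    # Negative samples: decrease u_k . v_i
                    for k in rng.choice(n, size=neg, p=noise):
                        g = lr * sigmoid(v_out[k] @ v_in[i])
                        old = v_in[i].copy()
                        v_in[i] -= g * v_out[k]
                        v_out[k] -= g * old
    return v_in, v_out, idx
```

On a toy corpus where location A always co-occurs with B, and C with D, the learned score u_B · v_A ends up higher than u_C · v_A, reflecting the adjacency statistics.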
Our study builds on the gravity model framework 3 , a widely-used, fundamental mobility model [38][39][40][41] that connects the expected flux, $T_{ij}$, between locations based on their populations and distance:

$T_{ij} = C \frac{m_i m_j}{f(r_{ij})}$,

where $m_i$ is the population of location $i$, $f(r_{ij})$ is a decay function with respect to the distance $r_{ij}$ between locations $i$ and $j$, and $C$ is a constant estimated from data (see Methods). For the flight itinerary data, we use as population $m_i$ the total number of unique passengers who passed through each airport; for the Korean accommodation reservation data, we use the total number of unique customers who booked with each accommodation; and for scientific mobility, we use the mean annual number of unique mobile and non-mobile authors who were affiliated with each organization. $T_{ij}$, often referred to as the "expected flux" 4 , is the expected frequency of co-occurrence of locations $i$ and $j$ in the trajectories under the gravity model.
The gravity model dictates that the expected flow $T_{ij}$ (with $T_{ij} = T_{ji}$) is proportional to the locations' populations, $T_{ij} \propto m_i m_j$, and decays as a function of their distance, $f(r_{ij})$. Traditionally, the decay function has been defined in terms of geographic distance, due to its intuitiveness and availability. Here, we also consider the embedding distance, calculated as the cosine distance between location vectors modeled by word2vec, to test its ability to encode the complex relationships in the mobility data. The decay function $f(r_{ij})$ defines the effect of distance, and different decay functions can model fundamentally different mechanisms 42 , such as the cost of traversing a given distance and the spatial granularity of the observation. For geographic distance, we define $f(r_{ij})$ as the standard power-law function, and for the embedding distance, we use the exponential function, each selected as the best performing for its case (Fig. S7 and Fig. S8).
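As a concrete sketch of the two decay choices, consider the following illustration. This is our own toy function, not the paper's fitting code; the constant $C$ and exponent $\alpha$ are placeholders, not fitted values:

```python
import numpy as np

def gravity_flux(m_i, m_j, r_ij, decay="power", alpha=2.0, C=1.0):
    """Expected flux T_ij = C * m_i * m_j / f(r_ij) under the gravity model.
    decay='power': f(r) = r**alpha (used with geographic distance).
    decay='exp':   f(r) = exp(alpha * r) (used with embedding distance)."""
    if decay == "power":
        f = r_ij ** alpha
    elif decay == "exp":
        f = np.exp(alpha * r_ij)
    else:
        raise ValueError(f"unknown decay: {decay}")
    return C * m_i * m_j / f
```

For example, with populations 100 and 200 at distance 10 under a power-law decay with $\alpha = 2$, the predicted flux is $100 \times 200 / 10^2 = 200$.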
Our results show that the embedding distance derived from word2vec significantly improves the ability to predict mobility compared to geographic distance, regardless of the decay function, while also providing a useful representation of locations (see Tables S3, S4, S5). However, we note that geographic distance best models real-world mobility with the power-law decay function, likely owing to that function's suitability for large, complex, and scale-free spatial systems 43 ; the embedding distance, in contrast, models our mobility data best with the exponential decay function, which stems from an underlying connection between that function and word2vec.

Results
word2vec and the gravity model
In the skip-gram model, the probability that location $j$ appears in the context of location $i$ is given by the softmax function

$P(j \mid i) = \frac{\exp(u_j \cdot v_i)}{Z_i}$,

where the denominator $Z_i = \sum_{j' \in A} \exp(u_{j'} \cdot v_i)$ is a normalization constant, and $A$ is the set of all locations. Although word2vec generates two embedding vectors, $v_i$ and $u_i$ -referred to as the in-vector and out-vector, respectively-we follow convention and use the in-vector $v_i$ as the embedding of location $i$.
When the window size is 1 (Fig. S4), the flow can be written as

$T_{ij} = N p(i) P(j \mid i)$,

where $N$ is the total number of location occurrences and $p(i)$ is the fraction of location $i$'s frequency in the data. In general, calculating $Z_i$ is computationally expensive, and there are two common approximations: hierarchical softmax 44 and negative sampling 19 . Due to its simplicity and performance, negative sampling is the most widely used strategy, which we also adopt in our study.
Although negative sampling is the most common approximation, it is a biased estimator 45,46 and fits a different probability model. When taking this bias into account, word2vec with skip-gram and negative sampling fits a probability model given by

$P(j \mid i) = \frac{p^{\gamma}(j) \exp(u_j \cdot v_i)}{Z'_i}$,

where $\gamma$ is the exponent of the noise distribution used for negative sampling, and we redefine the normalization constant as $Z'_i = \sum_{j' \in A} p^{\gamma}(j') \exp(u_{j'} \cdot v_i)$ (see the Methods and Supporting Information for the full derivation).
Parameter $\gamma = 1$ is a special choice that ensures that, when the embedding dimension is sufficiently large, there exist optimal in-vectors and out-vectors such that $u_i = v_i$ 35 . Setting $\gamma = 1$ and substituting $u_i = v_i$ into Eq. 4, the flow predicted by word2vec is given by

$T_{ij} = \frac{f(i) f(j) \exp(v_i \cdot v_j)}{N Z'_i}$,

where $N$ is the sum of frequencies over all locations and $f(i) = N p(i)$ is the frequency of location $i$ in the data.
The flow $T_{ij}$ is symmetric (i.e., $T_{ij} = T_{ji}$) because the skip-gram model neglects whether the context $j$ appears before or after the target $i$ in the trajectory. If we swap $i$ and $j$ in Eq. 5, the numerator remains the same but the denominator can differ. Therefore, to ensure $T_{ij} = T_{ji}$, the denominator $Z'_i$ should be a constant, $Z$.
Taken together, word2vec with negative sampling predicts a flow of the same form as the gravity model:

$T_{ij} = \frac{f(i) f(j)}{N Z \exp(-v_i \cdot v_j)}$.

In other words, word2vec with skip-gram negative sampling is mathematically equivalent to the gravity model, with the mass given by a location's frequency $f(i)$ and the distance entering through an exponential decay of the dot-product similarity of the vectors. While the gravity model describes mobility flows given masses and locations, word2vec estimates the positions in the vector space that best explain the given mobility flow. This makes word2vec a powerful tool for learning models of mobility.
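A minimal numeric sketch of this gravity form follows, using synthetic frequencies and vectors of our own invention, with the normalization treated as the constant $Z$ that the symmetry argument requires:

```python
import numpy as np

rng = np.random.default_rng(42)
n_locs, dim = 6, 8
v = rng.normal(size=(n_locs, dim))         # in-vectors; u_i = v_i when gamma = 1
freq = rng.integers(10, 100, size=n_locs)  # f(i): frequency of each location
N = freq.sum()
Z = 1.0                                    # normalization treated as a constant

# Gravity-form flow predicted by word2vec: T_ij = f(i) f(j) exp(v_i . v_j) / (N Z)
T = np.outer(freq, freq) * np.exp(v @ v.T) / (N * Z)
```

The resulting flow matrix is symmetric and grows with both locations' frequencies, exactly the gravity form derived above.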

Embeddings provide functional distance between locations
To ensure that word2vec learns a systematic representation of mobility that encodes meaningful functional distances, we devise an empirical validation task. Past studies have utilized a range of validation tasks for assessing the accuracy of word2vec, including analogy completion 19 and human surveys 22 in the case of natural language, or link prediction in the case of network embedding 26,27 . Here, we expect that a representation extracted from the mobility data should provide a functional distance that better models the flux between institutions than does geographic distance, which is at present the most widely used measure in the gravity model. We test this notion using three human mobility datasets, showing that word2vec consistently offers a better representation of actual mobility flows than geographic distance, as well as alternative network and direct-optimization approaches.
In the case of flight itineraries, the embedding distance explains more than twice as much of the variation in expected flux between airports ($R^2 = 0.51$, Fig. 1b) as does geographic distance ($R^2 = 0.22$), which has traditionally been used to quantify distance in the gravity model. The embedding distance also produces better predictions of actual flux between airports than does geographic distance (Fig. 1c). In the case of Korean accommodation reservations, embedding distance better explains the expected flux ($R^2 = 0.57$, Fig. 1d) than does geographic distance ($R^2 = 0.25$), and predictions made using the embedding distance outperform those made with geographic distance (Fig. 1e). This performance is consistent for scientific mobility: the embedding distance explains more than twice as much of the expected flux ($R^2 = 0.48$, Fig. 1f) as does geographic distance ($R^2 = 0.22$), and predictions made using the embedding distance outperform those using geographic distance (Fig. 1g). These patterns hold for the subsets of only domestic (within-country organization pairs, Fig. S7 and Fig. S9c) and only international mobility flows (across-country organization pairs, Fig. S9d).
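The $R^2$ values reported here come from fits in log-log space. A simple sketch of that procedure, on synthetic data of our own (not the paper's pipeline), looks like:

```python
import numpy as np

def loglog_r2(distance, flux):
    """Least-squares fit of log(flux) against log(distance) and the
    resulting coefficient of determination R^2."""
    x, y = np.log(distance), np.log(flux)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

For flux generated exactly by a power law of distance, $R^2$ is 1 by construction; real data, mixing geography with the other facets of mobility, fall below that.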
The embedding distance also outperforms alternative diffusion-based network distance measures, including the personalized PageRank scores calculated from the underlying mobility network (Fig. S5, Fig. S11, Fig. S12). The embedding distance derived from neural embedding also explains more of the flux and better predicts mobility flows than simpler embedding baselines, such as distances derived from a singular-value decomposition and a Laplacian Eigenmap embedding 47 of the underlying location co-occurrence matrix, Levy's symmetric word2vec 35 , and even direct optimization of the gravity model (Fig. S5 and Tables S3, S4, S5). In sum, our results demonstrate that, consistently and efficiently, the embedding distance better captures patterns of actual mobility than does the geographic distance.

Figure 1: Neural embedding provides a functional distance that improves the predictive power of the gravity model of mobility across three distinct human trajectory datasets. a. A unique identifier is assigned to each organization, and identifiers are assembled into an affiliation trajectory ordered by year of publication (top). If an author lists multiple organization affiliations within the same year, we shuffle the order within that year in each training iteration (bottom, see Supporting Information). b. Embedding distance (top) better explains the expected flux of passengers between U.S. airports than does geographic distance (bottom). The red line is the line of best fit. Black dots are mean flux across binned distances. 99% confidence intervals are plotted for the mean flux in each bin. Correlation is calculated on the data in log-log scale (p < 0.0001 across all fits). The lightness of each hex bin indicates the frequency of organization pairs within it. c. Predictions of flux between airport pairs made using embedding distance (top) outperform those made using geographic distance (bottom). Box plots show the distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps with y = x. "RMSE" is the root-mean-squared error between the actual and predicted values. Embedding distance consistently produces a powerful functional distance for Korean accommodation reservations (d,e) and global scientific mobility (f,g).
In practice, because we only have limited amounts of noisy data and the optimization may not find the true optimum, the mathematical result may hold only approximately. Indeed, we find that the in- and out-vectors tend to differ and that cosine similarity tends to capture real-world mobility better than inner-product similarity. This result echoes other applications of word embedding, such as word analogy testing 48 , in which cosine distance also outperformed inner-product similarity. Nevertheless, a model with inner-product similarity has the second-best performance after cosine similarity (Tables S3, S4, S5), and the embedding distance still outperforms all alternatives we considered.

Embeddings capture global structure of mobility
In the remainder of the paper, we focus on scientific mobility and interrogate the geometric space generated by the neural embedding to shed light on the multi-faceted relationships between organizations. To explore the topological structure of the embedding, we use a topology-based dimensionality reduction method (UMAP 49 ) to obtain a two-dimensional representation of the embedding space (Fig. 2a). By leveraging the unique characteristics of the representation-learning approach, we are able to show the relationships between individual organizations, rather than aggregates such as nations or cities; this projection constitutes the largest and highest-resolution "map" of scientific mobility to date.
Globally, the geographic constraints are conspicuous; organizations tend to form clusters based on their national affiliations, and national clusters tend to be near their geographic neighbors. At the same time, the embedding space also reflects a mix of geographic, historic, cultural, and linguistic relationships between regions much more clearly than alternative network representations (Fig. S13) that have been common in studies of scientific mobility 8,50 .
The embedding space also allows us to zoom in on subsets and re-project them to reveal local relationships. For example, re-projecting organizations located in Western, Southern, and Southeastern Asia with UMAP (Fig. 2b) reveals a gradient of countries between Egypt and the Philippines that largely corresponds to geography, but with some exceptions seemingly stemming from cultural and religious similarity; Malaysia, with its official religion of Islam, is nearer in the embedding space to Middle Eastern countries than to many geographically-closer South Asian countries. We validate this finding quantitatively with the cosine distance between nations (the centroids of the organization vectors belonging to each country). Malaysia is nearer to many Islamic countries, such as Iraq (d = 0.27), Pakistan (d = 0.32), and Saudi Arabia (d = 0.41), than to neighboring but Buddhist Thailand (d = 0.43) and neighboring Singapore (d = 0.48).
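The country-level distances quoted here are cosine distances between country centroids. A minimal sketch with made-up organization vectors (the data and function names are ours, for illustration only):

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def country_vector(org_vectors):
    """A country's vector: the centroid of its organizations' vectors."""
    return np.asarray(org_vectors).mean(axis=0)

# Hypothetical organization vectors for two countries
country_x = country_vector([[1.0, 0.1], [0.9, 0.0]])
country_y = country_vector([[0.0, 1.0], [0.1, 0.9]])
```

A country compared with itself has distance 0, and countries whose organizations point in different directions of the embedding space have larger distances.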
Linguistic and historical ties also affect scientific mobility. We observe that Spanish-speaking Latin American nations are positioned near Spain (Fig. 2c) rather than Portuguese-speaking Brazil (d = 0.35 vs. d = 0.54 for Mexico and d = 0.39 vs. d = 0.49 for Chile), reflecting linguistic and cultural ties. Similarly, North African countries that were once under French rule, such as Morocco, are closer to France (d = 0.32) than to similarly geographically-distant European countries such as Spain (d = 0.39), Portugal (d = 0.52), and Italy (d = 0.52). Comparable patterns exist even within a single country. For example, organizations within Quebec, Canada are located nearer to France (d = 0.37) than to the United States (d = 0.51).
Mirroring the global pattern, organizations in the United States are largely arranged according to geography (Fig. 2d). Re-projecting organizations located in Massachusetts (Fig. 2e) reveals structure based on urban centers (Boston vs. Worcester), organization type (e.g., hospitals vs. universities), and university systems (the University of Massachusetts system vs. Harvard & MIT). For example, even though UMass Boston is located in Boston, it clusters with other universities in the UMass system (d = 0.29) rather than with the typically more highly-ranked and research-focused organizations in Boston (d = 0.39), implying a relative lack of mobility between the two groups. Similar structures can be observed in other states, such as among New York's CUNY and SUNY systems (Fig. S14), Pennsylvania's state system (Fig. S15), Texas's Agricultural and Mechanical universities (Fig. S16), and between the University of California and California State University systems (Fig. S17).
Just as the embedding space makes it possible to zoom in on subsets of organizations, it is also possible to zoom out by aggregating organizational vectors. In doing so, we can examine the large-scale structure that governs scientific mobility. We define the representative vector of each country as the average of its organizational vectors and, using their cosine similarities, perform hierarchical clustering of nations that have at least 25 organizations represented in the embedding space (see Fig. 3a). The six identified clusters roughly correspond to countries in Asia and North America (orange), Northern Europe (dark blue), the British Commonwealth and Iran (purple), Central and Eastern Europe (light blue), South America and Iberia (dark green), and Western Europe and the Mediterranean (light green). The cluster structure shows that not only geography but also linguistic and cultural ties between countries are related to scientific mobility.
We quantify the relative importance of geography (by region) and language (by the most widely-spoken language of each country) using the element-centric clustering similarity 51 , a method that can compare a hierarchical clustering with disjoint clusterings (geography, language, etc.) at different levels of the hierarchy by explicitly adjusting a scaling parameter that acts like a zooming lens. When the parameter is high, the similarity is based on the lower levels of the dendrogram, whereas when it is low, the similarity is based on the higher levels. Fig. 3b demonstrates that regional relationships play a major role at the higher levels of the clustering (low parameter values), while language (family) better explains the clustering at the lower levels (high parameter values). This suggests that the embedding space captures the hierarchical structure of mobility.
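The clustering step itself can be sketched with SciPy. Average linkage over cosine distances is our illustrative choice here, and the toy "country vectors" are invented; the paper's exact procedure is given in its Methods:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical country vectors forming two obvious groups
country_vecs = np.array([
    [1.0, 0.0], [0.9, 0.1],   # group 1
    [0.0, 1.0], [0.1, 0.9],   # group 2
])
dists = pdist(country_vecs, metric="cosine")  # pairwise cosine distances
Z = linkage(dists, method="average")          # hierarchical clustering
labels = fcluster(Z, t=2, criterion="maxclust")
```

Cutting the resulting dendrogram at different heights corresponds to the coarse-to-fine "zooming" that the element-centric similarity analysis exploits.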

Embeddings capture latent prestige hierarchy
The embedding space can also encode more fine-grained relationships between entities. For example, prestige hierarchies are known to underpin the dynamics of scientific mobility, in which researchers tend to move to similar or less prestigious organizations 16,17 . Could the embedding space, to which no explicit prestige information is given, encode a prestige hierarchy?
We test this question by exploiting the geometric properties of the embedding space with SemAxis 36 . Here, we use SemAxis to operationalize the abstract notion of academic prestige, defining an axis in the embedding space whose poles are defined using known high- and low-ranked universities. As an external proxy of prestige, we use the Times Ranking of World Universities (we also use research impact from the Leiden Ranking 53 , see Supporting Information); the high-rank pole is defined as the average vector of the top five U.S. universities according to the rankings, whereas the low-rank pole is defined using the five bottom-ranked (geographically matched by U.S. census region) universities. We derive an embedding-based ranking of universities based on their position along the geometric spectrum from the high-ranked to the low-ranked pole (see Data and Methods).
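The SemAxis construction reduces to simple vector arithmetic. A sketch with hypothetical pole vectors follows (the function name and data are ours, for illustration):

```python
import numpy as np

def semaxis_score(v, high_pole_vecs, low_pole_vecs):
    """SemAxis: span an axis from the averaged low-pole vector to the averaged
    high-pole vector, then score v by its cosine with that axis."""
    axis = np.mean(high_pole_vecs, axis=0) - np.mean(low_pole_vecs, axis=0)
    return (v @ axis) / (np.linalg.norm(v) * np.linalg.norm(axis))

high = np.array([[1.0, 0.0], [0.9, 0.1]])   # e.g., top-ranked universities
low = np.array([[-1.0, 0.0], [-0.9, 0.1]])  # e.g., bottom-ranked universities
```

Any organization vector can then be projected onto the axis: positive scores lean toward the high-ranked pole, negative scores toward the low-ranked pole, yielding a continuous embedding-based ranking.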
The embedding space encodes a prestige hierarchy of U.S. universities that is coherent with real-world university rankings. The embedding-based ranking is strongly correlated with the Times ranking (Spearman's ρ = 0.73, Fig. 4a). For reference, the correlation between the Times ranking and the publication impact scores from the Leiden Ranking 53 , a bibliometrically-based university ranking, is 0.87 (Spearman's ρ, Fig. 4b). The correlation between the embedding-based ranking and the Times ranking is robust regardless of the number of organizations used to define the axes (Fig. S18), such that even using only the single top-ranked and bottom-ranked universities produces a ranking that is significantly correlated with the Times ranking (Spearman's ρ = 0.46, Fig. S18a). The correlation is also comparable to more direct measures such as node strength (sum of edge weights, Spearman's ρ = 0.73) and eigenvector centrality (Spearman's ρ = 0.76, see Supporting Information) from the mobility network. The strongest outliers that were ranked more highly in the Times ranking than in the embedding-based ranking tend to be large state universities such as Arizona State University and the University of Florida. Those ranked higher in the embedding-based ranking tend to be relatively small universities near major urban areas, such as the University of San Francisco and the University of Maryland, Baltimore County, possibly reflecting exchanges of scholars with nearby high-ranked institutions at these locations. In sum, our results suggest that the embedding space is capable of capturing information about academic prestige, even when the representation is learned using data without explicit information on the direction of mobility (as in other formal models 16 ) or on prestige.

Figure 3: Geography, then language, conditions international mobility. a. Hierarchically clustered similarity matrix of country vectors, aggregated as the mean of all organization vectors within countries with at least 25 organizations. The color of matrix cells corresponds to the cosine similarity between country vectors. The color of country names corresponds to their cluster. The three cell columns separated from the matrix correspond to, from left to right, the region of the country, the language family 52 , and the dominant language. b. Element-centric cluster similarity 51 reveals the factors dictating the hierarchical clustering. Region better explains the grouping of country vectors at higher levels of the clustering. Language family, and then the most widely-spoken language, better explain the fine-grained grouping of countries.
The axes can be visualized to examine the relative position of organizations along the prestige axis and along a geographic axis between California and Massachusetts. Prestigious universities such as Columbia, Stanford, MIT, Harvard, and Rockefeller are positioned toward the top of the prestige axis (Fig. 4c). Universities at the bottom of this axis tend to be regional universities with lower national profiles (yet still ranked by Times Higher Education) and with more emphasis on teaching, such as Barry University and California State University, Long Beach.
By projecting other types of organizations onto the prestige axis, SemAxis offers a new way of representing a continuous spectrum of organizational prestige for entities whose rankings are often low-resolution, incomplete, or entirely absent, such as regional and liberal arts universities (Fig. 4d), research institutes (Fig. 4e), and government organizations (Fig. 4f). Their estimated prestige is speculative, though we find that it significantly correlates with their citation impact (Fig. S22).
We also find that the size (L2 norm) of the organization embedding vectors provides insight into the characteristics of organizations (Fig. 5). Up to a point (around 1,000 researchers), the size of a U.S. organization's vector tends to increase proportionally with the number of researchers (both mobile and non-mobile) with published work; these organizations are primarily teaching-focused institutions, agencies, and hospitals that either are not ranked or have a low ranking.
However, beyond around 1,000 researchers, the size of the vector decreases as the number of researchers increases. These organizations are primarily research-intensive and prestigious universities with higher ranks, research outputs, R&D funding, and numbers of doctoral students (Fig. S23).
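The quantity in question is simply the Euclidean norm of each organization's embedding vector; for completeness, a one-line sketch (our own helper, on toy vectors):

```python
import numpy as np

def org_vector_norms(vectors):
    """L2 norm of each organization's embedding vector (one row per org)."""
    return np.linalg.norm(np.asarray(vectors), axis=1)
```

Plotting these norms against organization size is what produces the concave curve discussed below.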
A similar pattern has been observed in applications of neural embedding to natural language, in which the size of a word vector was found to represent the word's specificity, i.e., the degree to which the word frequently co-appears with particular context words 54 . If the word in question is universal, appearing frequently in many different contexts, it will not have a large norm, due to the lack of a strong association with any particular context. Likewise, an organization with a small norm, such as Harvard, appears in many contexts alongside many different organizations in affiliation trajectories-it is well-connected. The concavity of the curve emerges in part from the relationship between the size of the vector and the expected connectedness of the organization given its size ($R^2 = 0.17$). Large, prestigious, and well-funded research universities such as Princeton and Harvard have smaller vector norms because they appear in many different contexts, compared to more teaching-focused organizations such as NY Medical College and the University of Michigan at Flint. Some universities, such as the University of Alaska at Fairbanks, have considerably smaller vectors, which may be a result of their remote locations and unique circumstances.

Figure 4: a. Un-filled points are the top and bottom five universities used to span the axis. Even when considering only a total of ten organization vectors, the estimate of the Spearman's rank correlation between the embedding-based and Times rankings is ρ = 0.73 (n = 145, p < 0.0001), which increases when more top- and bottom-ranked universities are included (Fig. S18). b. The Times ranking is correlated with the Leiden Ranking of U.S. universities, with Spearman's ρ = 0.87 and p < 0.0001. c-f. Illustration of SemAxis projection along two axes: a latent geographic axis from California to Massachusetts (left to right) and the prestige axis, shown for U.S. universities (c), regional and liberal arts colleges (d), research institutes (e), and government organizations (f). Full organization names are listed in Table S1.
We report that this curve is almost universal across many countries. For instance, China's curve closely mirrors that of the United States (Fig. 5b). Smaller but scientifically advanced countries such as Australia, and other populous countries such as Brazil, also exhibit curves similar to that of the United States (Fig. 5b, inset). Other nations exhibit different curves that lack the portion with decreasing norm, probably indicating a lack of internationally-prestigious institutions. Similar patterns can be found across many of the 30 countries with the most total researchers (Fig. S24; see Supporting Information for more discussion).

Figure 5: a. Size (L2 norm) of organization embedding vectors compared to the number of researchers, for U.S. universities. Color indicates the rank of the university in the Times ranking, with 1 being the highest-ranked university. Uncolored points are universities not listed in the Times ranking. A concave shape emerges, wherein larger universities tend to be more distant from the origin (large L2 norm); however, the more prestigious universities tend to have smaller L2 norms. b. We find a similar concave-curve pattern across many countries, such as the United States, China, Australia, Brazil, and others (inset, and Fig. S24). Some countries exhibit variants of this pattern, such as Egypt, which is missing the right side of the curve. Loess regression lines are shown for each selected country, and for the aggregate of remaining countries, with ribbons mapping to 99% confidence intervals based on a normal distribution. Loess lines are also shown for organizations in Australia, Brazil, and Egypt (inset).

Conclusion
Neural embedding approaches offer a novel, data-driven solution for efficiently learning an effective and robust representation of locations based on trajectory data, encoding the complex and multi-faceted nature of mobility. The approach's unique strength stems from our discovery that word2vec is equivalent to a gravity model, making it a natural and theoretically-grounded tool for modeling mobility. By virtue of this equivalence, word2vec learns systematic representations of mobility across disparate domains, including U.S. flight itineraries, Korean hotel accommodation reservations, and global scientific mobility. Focusing on the case of scientific mobility, we leverage the unique topological structure of the embedding space to reveal how it encodes nuanced aspects of mobility, including global and regional geography, shared languages, and prestige hierarchies, all of which are learned using only the mobility trajectories of individuals.
By revealing the correspondence between neural embeddings and the gravity model, the study of human mobility can move beyond geographic and network-based models and instead leverage the high-order structure of individuals' mobility trajectories using these robust and efficient methods. While we focus on three domains of human mobility, this approach could be applied to many others, such as animal migration, transit-network mobility, and international trade. Once learned, functional distances between locations, such as countries, cities, or organizations, or the embedding model itself, can be published to facilitate re-use and to support reproducibility and transparency when the underlying data is too sensitive to share. Moreover, this approach can be used to learn a functional distance even between entities for which no geographic analog exists, such as between occupational categories based on individuals' career trajectories. In addition to providing a functional distance that supports modeling and predicting mobility patterns, the topology of the embedding space is amenable to a range of unique applications for studying mobility. As we have shown, the embedding space allows visualization of the complex structure of scientific mobility at high resolution and across multiple scales, providing a large and detailed map of the landscape of global scientific mobility. Other operations, such as comparing entities or calculating aggregates, which could be complex and computationally expensive with other methods, are here reduced to simple vector arithmetic. Embeddings also allow us to quantitatively explore abstract notions such as academic prestige, and can potentially be generalized to other abstract axes. Investigation of the structure of the embedding space, such as the vector norm, reveals universal patterns relating an organization's size to its vector norm that could be leveraged in future studies of mobility.
This approach, and our study, also have several limitations. First, the skip-gram word2vec model does not leverage directionality, meaning that the embedding will be less effective at capturing mobility for which directionality is critical. Future studies may consider bi-directional embeddings, such as BERT 55, to incorporate directionality, as well as their correspondence to asymmetric mobility models such as the radiation model 4. Second, the neural embedding approach is most useful for mobility between discrete geographic units such as countries, cities, and businesses; it is less useful for mobility between locations represented by geographic coordinates, as in the modeling of animal movements. Third, neural embedding is an inherently stochastic procedure, and so results may change across iterations. However, in this study we observe all results to be robust to this stochasticity, likely because of the limited "vocabulary" of organizations, airports, and accommodations (several thousand) relative to the massive datasets used to learn representations (several million trajectories). Applications of word2vec to domains where the ratio of data to vocabulary is smaller, however, should be implemented with caution to ensure that findings are not the result of random fluctuations. Finally, the case of scientific mobility presents domain-specific limitations. Reliance on bibliometric metadata means that we capture only long-term mobility such as migration, rather than the array of more frequent short-term mobility such as conference travel and temporary visits. The kinds of mobility we do capture, migration and co-affiliation, although conceptually different, are treated identically by our model. Our data may also suffer from bias in publication rates: researchers at prestigious organizations tend to have more publications, leading to these organizations appearing more frequently in affiliation trajectories.
Mobility and migration are at the core of human nature and history, driving societal phenomena as diverse as epidemics 41,56 and innovation [11][12][13][14][15]. However, the paradigm of scientific migration may be changing. Traditional hubs of migration have experienced many politically-motivated policy changes that affect scientific mobility, such as travel restrictions in the U.S. and U.K. 57. At the same time, other nations, such as China, are growing into major scientific powers and attractors of talent 58.

For example, France, Qatar, the USA, Iraq, and Luxembourg had the most mobile authors (Fig. S2c). However, due to its size, the USA accounted for nearly 40% of all mobile authors worldwide (Fig. S2a), with 10 countries accounting for 80% of all mobility (Fig. S2b).
The countries with the highest proportion of mobile scientists are France, Qatar, the United States, and Iraq, whereas those with the lowest are Jamaica, Serbia, Bosnia & Herzegovina, and North Macedonia (Fig. S2c). In most cases, countries with a high degree of inter-organization mobility also have a high degree of international mobility, indicating that a high proportion of their total mobility is international (Fig. S2d). However, some countries, such as France and the United States, have more domestic mobility than international mobility. While the number of publications has increased year over year, the mobility and disciplinary makeup of the dataset did not notably change across the period of study (Fig. S1).

Embedding
We embed trajectories by treating them analogously to sentences and locations analogously to words. For U.S. airport itineraries, trajectories are formed from the flight itineraries of individual passengers, in which airports correspond to unique identifiers. In the case of Korean accommodation reservations, trajectories comprise the sequence of accommodations reserved over a customer's history. For scientific mobility, an "affiliation trajectory" is constructed for each mobile author by concatenating their ordered list of unique organization identifiers, as demonstrated in Fig. 1a (top). In more complex cases, such as when an author lists multiple affiliations on the same paper or publishes with different affiliations on multiple publications in the same year, the order within that year is randomized, as shown in Fig. 1a (bottom).
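The trajectory construction above can be sketched as follows; a minimal illustration, where the record format and function name are hypothetical:

```python
import random

# Hypothetical sketch: build an "affiliation trajectory" from yearly affiliation
# records, shuffling within each year to remove unclear sequential order.
def affiliation_trajectory(records, seed=None):
    """records: list of (year, [org_id, ...]) pairs for one author."""
    rng = random.Random(seed)
    trajectory = []
    for year, orgs in sorted(records, key=lambda r: r[0]):
        orgs = list(orgs)
        rng.shuffle(orgs)        # within-year order is randomized
        trajectory.extend(orgs)
    return trajectory

traj = affiliation_trajectory([(2010, ["org_A"]), (2012, ["org_B", "org_C"])], seed=1)
```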
These trajectories are used as input to the standard skip-gram negative sampling word embedding model, commonly known as word2vec 19. word2vec constructs dense and continuous vector representations of words and phrases, in which the distance between words corresponds to a notion of semantic distance. By embedding trajectories, we aim to learn a dense vector for every location, such that the distance between vectors reflects the tendency for two locations to occur in similar contexts. The model assumes that the probability of observing location $j$ in the context of location $i$ is given by the softmax

$$P(j \mid i) = \frac{\exp(\mathbf{u}_j \cdot \mathbf{v}_i)}{Z_i},$$

where $\mathbf{v}$ and $\mathbf{u}$ are the "in-vector" and "out-vector", respectively, $Z_i = \sum_{j' \in \mathcal{A}} \exp(\mathbf{u}_{j'} \cdot \mathbf{v}_i)$ is a normalization constant, and $\mathcal{A}$ is the set of all locations. We follow standard practice and use only the in-vector, $\mathbf{v}$, which is known to be superior to the out-vector in link prediction benchmarks [20][21][22][23][24][25]28.
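The softmax above can be computed directly; a minimal numeric sketch with random toy vectors:

```python
import numpy as np

# Minimal sketch of the skip-gram probability model: the chance of observing
# location j in the context of location i is a softmax over the inner products
# of i's in-vector with every location's out-vector.
def context_prob(v_in_i, U_out):
    scores = U_out @ v_in_i        # u_j' . v_i for every location j'
    scores -= scores.max()         # shift for numerical stability
    exps = np.exp(scores)
    return exps / exps.sum()       # divide by the normalization constant Z_i

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 8))        # 4 toy locations, 8-d out-vectors
p = context_prob(U[0], U)          # distribution over contexts of location 0
```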
We used the word2vec implementation in the Python package gensim. The skip-gram negative sampling word2vec model has several tunable hyper-parameters, including the embedding dimension $d$, the size of the context window $w$, the minimum frequency threshold $f_{\min}$, the initial learning rate $\alpha$, the shape of the negative sampling distribution $\gamma$, the number of negative samples drawn $k$, and the number of iterations. For the main results regarding scientific mobility, we used $d = 300$ and $w = 1$, the parameters that best explained the flux between locations, though results were robust across different settings (Fig. S4). Although the original word2vec paper uses $\gamma = 0.75$ 19, here we set $\gamma = 1.0$, though results are only trivially different at other values of $\gamma$ (Fig. S5). We used $k = 5$, the suggested default of word2vec. We use the same settings for the U.S. airport itinerary and Korean accommodation reservation data.
To mitigate the effect of less common locations, we set $f_{\min} = 50$, limiting the vocabulary to locations appearing at least 50 times across the training trajectories; 744 unique airports for U.S. airport itineraries, 1,004 unique accommodations for Korean accommodation reservations, and 6,580 unique organizations for scientific mobility appear in the resulting embeddings. We set $\alpha$ to its default value of 0.025 and iterate five times over all training trajectories. For scientific mobility, in each training iteration the order of organizations within a single year is randomized to remove unclear sequential order.

An implicit bias in the negative sampling
Negative sampling trains word2vec using a binary classification task as follows. For each target word $i$, we sample a context word $j$ from the given data and label it as positive, denoted by $y_j = 1$. Then, we sample $k$ words $\ell$ from a noise distribution $p_0(\ell)$ and label them as negative, denoted by $y_\ell = 0$. In word2vec, the noise distribution is given by $p_0(\ell) \propto p(\ell)^\gamma$, where $p(\ell)$ is the fraction of word $\ell$ in the data, and $\gamma$ is a hyper-parameter. Then, for the sampled words, we fit a logistic regression model by maximizing the log-likelihood

$$J = \sum_{j \in \mathcal{D}} y_j \ln \sigma(\mathbf{u}_j \cdot \mathbf{v}_i) + (1 - y_j) \ln \left(1 - \sigma(\mathbf{u}_j \cdot \mathbf{v}_i)\right),$$

where $\mathcal{D}$ is the set of all sampled context words and $\sigma$ is the logistic function.
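The noise distribution $p_0(\ell) \propto p(\ell)^\gamma$ can be sketched in a few lines; a minimal illustration with toy counts:

```python
import numpy as np

# Sketch of the negative-sampling noise distribution: candidates are drawn with
# probability proportional to their empirical frequency raised to the power
# gamma (gamma = 1.0 in this study, 0.75 in the original word2vec).
def noise_distribution(counts, gamma=1.0):
    freq = np.asarray(counts, dtype=float)
    freq = freq / freq.sum()     # p(l): fraction of each word in the data
    weights = freq ** gamma      # p_0(l) proportional to p(l)^gamma
    return weights / weights.sum()

p0 = noise_distribution([100, 10, 1], gamma=1.0)
```

With $\gamma = 1$ the noise distribution is simply the empirical frequency; lowering $\gamma$ flattens it toward uniform.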
This procedure does not guarantee that the embedding converges optimally, even when increasing the training samples and iterations 45,46. To make this bias explicit, let us consider the unbiased variant of negative sampling, i.e., noise contrastive estimation (NCE) 45,46. NCE is an unbiased estimator for a probability model $P_\phi$ of the form

$$P_\phi(x) = \frac{\phi(x)}{\sum_{x' \in \mathcal{X}} \phi(x')},$$

where $\phi$ is a non-negative likelihood function of data $x$, and $\mathcal{X}$ is the set of all data. NCE fits the logistic regression model

$$P(y_j = 1 \mid j) = \sigma\left(\mathbf{u}_j \cdot \mathbf{v}_i - \ln p_0(j) - c\right),$$

where $c = \ln k + \ln \sum_{x' \in \mathcal{X}} \phi(x')$ is a constant, and maximizes the log-likelihood by calculating the gradients for the embedding vectors $\mathbf{u}_j$, $\mathbf{v}_i$ and iteratively updating them (see Supporting Information for the full derivation). Note that NCE is an unbiased estimator that converges to the optimal embedding in terms of the original word2vec objective function $J$ 45,46 as the number of sampled words and training iterations increase.
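The NCE posterior above follows in two steps; a sketch of the derivation, under the substitution $\phi(j) = \exp(\mathbf{u}_j \cdot \mathbf{v}_i)$ used by this model:

```latex
% With one data sample per k noise samples, the posterior that a sampled word
% came from the data rather than the noise distribution p_0 is
P(y_j = 1 \mid j)
  = \frac{P_\phi(j)}{P_\phi(j) + k\, p_0(j)}
  = \sigma\!\left( \ln P_\phi(j) - \ln k\, p_0(j) \right).
% Substituting P_\phi(j) = \exp(\mathbf{u}_j \cdot \mathbf{v}_i)
%   / \sum_{x' \in \mathcal{X}} \phi(x') then gives
P(y_j = 1 \mid j)
  = \sigma\!\left( \mathbf{u}_j \cdot \mathbf{v}_i - \ln p_0(j) - c \right),
\qquad
c = \ln k + \ln \sum_{x' \in \mathcal{X}} \phi(x').
```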
Let us revisit negative sampling from the perspective of NCE by rewriting the logistic regression model in negative sampling (Eq. 9) in the form of a posterior probability.

Note that the cosine distance is not a formal metric because it does not satisfy the triangle inequality. Nevertheless, cosine distance is often shown to be useful in practice 6,7,60. We compare the performance of this cosine-based embedding distance against those derived using inner-product similarity and Euclidean distance.
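The three similarity variants compared above can be written explicitly; a minimal sketch for a pair of vectors:

```python
import numpy as np

# The three embedding-distance variants compared in the text.
def cosine_distance(a, b):
    # 1 - cos(theta); not a formal metric (violates the triangle inequality)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def inner_product_similarity(a, b):
    # raw dot product; larger means more similar
    return a @ b

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

a, b = np.array([1.0, 0.0]), np.array([0.0, 2.0])  # orthogonal toy vectors
```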
We compare the performance of the embedding distance to many baselines. These include distances derived from simpler embedding approaches, such as singular value decomposition (SVD) and a Laplacian eigenmap embedding performed on the underlying location co-occurrence matrix. We also use network-based distances, calculating vectors using a personalized PageRank approach and measuring the distance between them using cosine distance and Jensen-Shannon divergence (see Supporting Information). Finally, we compare the embedding distance against embeddings calculated through direct matrix factorization, following the approach that word2vec implicitly approximates 35.
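One such baseline, the SVD of the co-occurrence matrix, can be sketched in a few lines; the matrix below is random stand-in data, and using the rows of $U \Sigma$ as vectors is one common convention, not necessarily the paper's exact choice:

```python
import numpy as np

# Hedged sketch of an SVD baseline: factorize the (symmetric) location
# co-occurrence matrix and keep the top-d dimensions as embedding vectors.
rng = np.random.default_rng(0)
cooc = rng.poisson(2.0, size=(20, 20)).astype(float)
cooc = (cooc + cooc.T) / 2           # symmetrize toy co-occurrence counts

U, S, _ = np.linalg.svd(cooc)
d = 5
svd_embedding = U[:, :d] * S[:d]     # one d-dimensional vector per location
```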

Gravity Law
We model the co-occurrences $T_{ij}$ for locations $i$ and $j$ (referred to as flux) using the gravity law of mobility 3. The gravity law of mobility, inspired by Newton's law of gravity, postulates that the attraction between two locations is a function of their populations and the distance between them. This formulation and its variants have proven useful for modeling and predicting many kinds of mobility [38][39][40][41]. In the gravity law of mobility, the expected flux $T_{ij}$ between two locations $i$ and $j$ is defined as

$$T_{ij} = C\, m_i m_j f(d_{ij}),$$

where $m_i$ and $m_j$ are the populations of the locations, defined as the total number of passengers who passed through each airport for U.S. airport itineraries, the total number of customers who booked with each accommodation for Korean accommodation reservations, and the yearly average count of unique authors, both mobile and non-mobile, affiliated with each organization for scientific mobility. $f(d_{ij})$ is a decay function of the distance $d_{ij}$ between locations $i$ and $j$, and $C$ is a constant.
Here, we used the most basic gravity model, which assumes symmetry of the flow, $T_{ij} = T_{ji}$, and of the distance, $d_{ij} = d_{ji}$, though four variants have been proposed 61. There are two popular forms for $f(d_{ij})$: a power-law function of the form $f(d_{ij}) = d_{ij}^{-\lambda}$ ($\lambda > 0$), and an exponential function of the form $f(d_{ij}) = e^{-\lambda d_{ij}}$ ($\lambda > 0$) 43. The parameters for $f(d_{ij})$ and $C$ are fit to the given mobility data using a log-linear regression, where $\tilde{T}_{ij}$ is the actual flow from the data. The gravity law of mobility is sensitive to $\tilde{T}_{ij} = 0$, or zero movement between locations. In our datasets, non-zero flows account for only 4.2% of all possible pairs of the 6,580 organizations for scientific mobility, compared with 76.4% of all possible pairs of the 744 airports for U.S. airport itineraries and 62.5% of all possible pairs of the 1,004 accommodations for Korean accommodation reservations. This value is comparable to other common applications of the gravity law, such as phone calls, commuting, and migration 4. We follow standard practice and exclude zero flows from our analysis.
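The log-linear fit described above can be sketched on synthetic data; a minimal illustration assuming the power-law decay $f(d) = d^{-\lambda}$, so that $\ln T = \ln C + \ln m_i + \ln m_j - \lambda \ln d$:

```python
import numpy as np

# Synthetic noiseless flows generated from a gravity law with C = 0.5, lambda = 1.3.
rng = np.random.default_rng(0)
n = 200
m_i = rng.uniform(1e2, 1e4, n)
m_j = rng.uniform(1e2, 1e4, n)
d_ij = rng.uniform(1.0, 1e3, n)
flux = 0.5 * m_i * m_j * d_ij ** -1.3

# Regress ln(flux / (m_i * m_j)) on ln d to recover ln C and -lambda.
y = np.log(flux) - np.log(m_i) - np.log(m_j)
X = np.column_stack([np.ones(n), np.log(d_ij)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ln_C, neg_lambda = coef
```

On noiseless synthetic data the regression recovers the generating parameters exactly; real flows would add a residual term.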
ranking of universities that we then compare to other formal university rankings using Spearman rank correlation.

Figure 1 :
Figure 1: Neural embedding provides a functional distance that best improves the predictive power of the gravity model of mobility across three distinct human trajectory datasets. a. A unique identifier is assigned to each organization, and identifiers are assembled into an affiliation trajectory ordered by year of publication (top). If an author lists multiple organization affiliations within the same year, we shuffle the order within that year in each training iteration (bottom; see Supporting Information). b. Embedding distance (top) better explains the expected flux of passengers between U.S. airports than does geographic distance (bottom). The red line is the line of best fit. Black dots are mean flux across binned distances. 99% confidence intervals are plotted for the mean flux in each bin. Correlation is calculated on the data in log-log scale (p < 0.0001 across all fits). The lightness of each hex bin indicates the frequency of organization pairs within it. c. Predictions of flux between airport pairs made using embedding distance (top) outperform those made using geographic distance (bottom). Box plots show the distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps with the identity line y = x. "RMSE" is the root-mean-squared error between the actual and predicted values. Embedding distance consistently produces a powerful functional distance for Korean accommodation reservations (d,e) and global scientific mobility (f,g).

Figure 2 :
Figure 2: Projection of the embedding space reveals the complex multi-scale structure of organizations. a. UMAP projection 49 of the embedding space reveals country-level clustering. Each point corresponds to an organization, and its size indicates the average annual number of mobile and non-mobile authors affiliated with that organization from 2008 to 2019. Color indicates the region. The separation of organizations in Quebec from the rest of Canada is highlighted. b. Zooming into (re-projecting) the area containing countries in Western, South, and Southeast Asia shows a geographic and cultural gradient of country clusters. c. Similarly, zooming into the area containing organizations in Spain, Portugal, and South and Central America shows clustering by most widely-spoken majority language group: Spanish and Portuguese. d. Doing the same for organizations in the United States reveals geographic clustering based on state, roughly grouped by Census Bureau-designated regions. e. Zooming in further on Massachusetts reveals clustering based on urban center (Boston, Worcester), organizational sector (hospitals vs. universities), and university systems and prestige (UMass system vs. Harvard, MIT, etc.).

Figure 4 :
Figure 4: Embedding captures latent geography and prestige hierarchy. a. Comparison between the ranking of organizations in the Times ranking and the embedding ranking derived using SemAxis. Unfilled points are the top and bottom five universities used to span the axis. Even when considering only a total of ten organization vectors, the estimate of the Spearman rank correlation between the embedding and Times rankings is $\rho$ = 0.73 (n = 145, p < 0.0001), which increases when more top- and bottom-ranked universities are included (Fig. S18). b. The Times ranking is correlated with the Leiden Ranking of U.S. universities, with Spearman's $\rho$ = 0.87 and p < 0.0001. c-f. Illustration of SemAxis projection along two axes: the latent geographic axis, from California to Massachusetts (left to right), and the prestige axis. Shown for U.S. universities (c), regional and liberal arts colleges (d), research institutes (e), and government organizations (f). Full organization names are listed in Table S1.

Figure 5 :
Figure 5: Size of organization embedding vectors captures the prestige and size of organizations. a. Size (L2 norm) of organization embedding vectors compared to the number of researchers for U.S. universities. Color indicates the rank of the university in the Times ranking, with 1 being the highest-ranked university. Uncolored points are universities not listed in the Times ranking. A concave shape emerges, wherein larger universities tend to be more distant from the origin (large L2 norm); however, the more prestigious universities tend to have smaller L2 norms. b. We find a similar concave-curve pattern across many countries, such as the United States, China, Australia, Brazil, and others (inset, and Fig. S24). Some countries exhibit variants of this pattern, such as Egypt, which is missing the right side of the curve. Loess regression lines are shown for each selected country and for the aggregate of remaining countries, with ribbons mapping to 99% confidence intervals based on a normal distribution. Loess lines are also shown for organizations in Australia, Brazil, and Egypt (inset).
Unprecedented health crises such as the COVID-19 pandemic threaten to bring drastic global changes to migration by tightening borders and halting travel. By revealing the correspondence between neural embedding and the gravity model, and demonstrating their utility and efficacy, our study opens a new avenue in the study of mobility.

Methods

U.S. flight itinerary data

We source U.S. airport itinerary data from the Origin and Destination Survey (DB1B), provided by the Bureau of Transportation Statistics at the United States Department of Transportation. DB1B is a sample of 10 percent of domestic airline tickets between 1993 and 2020, comprising 307,760,841 passenger itineraries between 828 U.S. airports. A trajectory is constructed for each passenger flight itinerary, forming an ordered sequence of unique identifiers of the origin and destination airports. Each itinerary is associated with a trajectory of airports including the

of all authors and 17,700,095 author-affiliation combinations. Mobile authors were associated with 2.5 distinct organizational affiliations on average. Rates of mobility differ across countries.