This study makes use of several datasets. The first consists of official immigration figures. The initial aim was to obtain monthly data for each country in order to analyze the results for different lags (in months) between search and movement. However, these data are not always readily available. The statistical agencies of the different countries were contacted by email with a request for access, but only the representatives of Belgium replied positively. The other countries stated that monthly numbers are not available and referred to the yearly data. Other research has implicitly encountered the same limitation and has successfully used yearly data instead [9, 30]. For yearly figures, OECD data proved to be more complete than Eurostat data when consulted. For instance, entries for Germany after 2008 were missing from the Eurostat data. When necessary, the data were supplemented with numbers from the national statistical agencies. The immigration flow data for the United Kingdom are based on the yearly “International Passenger Surveys”. It should be noted that these survey numbers are estimates rather than exact immigration counts and are rounded to the nearest hundred.
Next, Google Trends data are used (trends.google.com). This tool allows extraction of the relative search frequency of one keyword or a set of keywords entered in Google within a specific geographic entity. The data are freely available from 2004 onwards. The relative frequency is indicated by a number between 0 and 100 (low to high search intensity) and is provided for each month in the specified time series. Absolute frequencies are not made public due to privacy concerns. While Google’s market share in Japan is not as high as in the U.S. or Europe, it is still over 70% for the period January 2009 to December 2021 according to StatCounter (https://gs.statcounter.com/search-engine-market-share/all/japan retrieved on January 8, 2022). Also, according to the International Telecommunication Union, internet access is widespread in Japan.
Google Trends data are investigated for two periods: 2006-2019 and 2011-2019. Similar to previous research, 2006 is selected as a starting point because it coincides with a more widespread adoption of Google. We end with 2019 because, at the time of research, national statistics for 2020 were not yet available. We opt for this double approach because Google implemented an algorithm change in 2011. Depending on the keyword entered, stark differences can be seen between the pre- and post-2011 parts of the time series (see supplementary data). Because Google Trends is a relative index, the post-2011 numbers cannot simply be cut out of the 2006-2019 dataset and analyzed separately: every value is scaled relative to all the data in the requested period. Each new period under investigation therefore requires a fresh generation and extraction of Google search frequencies.
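Why a sub-period has to be extracted again rather than sliced out of a longer series can be illustrated with a small sketch. It assumes, as a simplification of how Google Trends works, that each value is divided by the highest value in the requested window and scaled to 100. The function name and the monthly search volumes are hypothetical and serve only as an illustration.

```python
def relative_index(volumes):
    """Scale a window of search volumes to 0-100 relative to the window's own peak."""
    peak = max(volumes)
    return [round(100 * v / peak) for v in volumes]

# Hypothetical search volumes: an early peak followed by lower values.
volumes = [80, 40, 20, 8, 20]

full_window = relative_index(volumes)       # index for the whole period
late_window = relative_index(volumes[2:])   # index for the later sub-period on its own

print(full_window[2:])  # later months as they appear in the long series
print(late_window)      # the same months after a fresh extraction
```

In this toy example the later months look flat and low inside the long series, while the fresh extraction rescales them against their own peak, which is why the two periods are treated as separate datasets.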
For the sections where only country names were used (see methods), the data for Belgium are only considered until May 2018. Because of the popularity of the World Cup football match between Japan and Belgium on July 2, 2018, any search that only takes into account the Japanese word for ‘Belgium’ shows an excessive peak around this date, which skews all the related data.
Next, we make use of an existing keyword list generated by Böhme et al. This list has been used successfully in other research as well. In this study, the list is modified and translated to suit the specific context of Japanese immigration. Here, the focus is solely on the Japanese language. Despite mandatory English classes in the Japanese education system, English is not routinely used by native Japanese speakers to the extent that it could realistically be captured in online search activity. As a consequence, Google searches in English would primarily capture the search activities of foreign nationals in Japan. Since these people are typically not included in official immigration statistics counting Japanese citizens entering a country, including non-Japanese Google searches in the Google Trends data would add bias to the analysis. The focus is therefore on the Japanese language, specifically to target the searches of Japanese nationals.
In addition, the Japanese language is sufficiently complicated to warrant a standalone investigation as it has several writing systems. 1) Kanji originates from Chinese characters and is mainly used for kango or Sino-Japanese words. Most nouns and parts of adjectives are written in kanji (e.g., ‘music’ 音楽, or the first character of ‘beautiful’ 美しい). 2) Hiragana is primarily used for grammatical suffixes of words (for instance endings to denote the past tense of adjectives or adverbs such as the aforementioned ‘beautiful’ 美しい ・美しかった・美しく) and grammatical elements in sentences (e.g. は can mark the topic of a sentence or indicate contrast). 3) Katakana on the other hand is mostly used for loanwords, scientific words, and other imported terminology such as IT-related jargon. While 4) rōmaji is rarely used by itself, it can be used to input Japanese on digital devices. Several systems of transcribing a Japanese pronunciation to Latin script exist. We only consider the Hepburn and Nihon Shiki systems here. The former is used primarily by non-native speakers, and the latter is the main system used by native Japanese speakers.
The specific difficulty of applying the Japanese language to research with Google Trends is twofold. First, the different writing systems are not always mutually exclusive. For instance, the same Japanese word for ‘beautiful’ can be written in both hiragana and kanji (きれいな or 綺麗な). Both versions of this word are commonly used, although for most words kanji is preferred because of the second complication: Japanese is rife with homophones (see supplementary data for examples). As such, using kanji would be the logical option for searching online, but because kanji is a logographic system rather than a simple alphabet, not all characters are equally well known. Their sheer number can make kanji difficult, so even well-educated Japanese typically have not memorized all of them. In case of ambiguity or uncertainty, one may opt to use hiragana when searching the internet.
The method of clarifying this as it relates to our research is straightforward. Based on the keyword list by Böhme et al., 20 migration-related keywords were selected. This list is supplemented with ten keywords that focus on the specific Japanese migration experience, centering on overseas study (e.g., ‘study’ or ‘scholarship’), expats (e.g., ‘insurance,’ ‘work,’ or ‘tax’), and overseas Japanese communities (e.g., ‘Japanese food’ or ‘Japanese Association’).
Each keyword is entered and compared in Google Trends in as many ways as possible. Concretely, this means that, when possible, the same word was entered in 1) kanji, 2) hiragana, 3) katakana, 4) rōmaji (Hepburn system), and 5) rōmaji (Nihon Shiki system) (see figure 1 for an example). Loanwords in katakana do not have a kanji equivalent, so this option is left out for these words, resulting in two sets of words: a) kango or Sino-Japanese words which have a kanji equivalent (24 words), and b) loanwords that are predominantly written in katakana and do not have a directly corresponding kanji (six words). Next, the time series of the different inputs for every keyword in Google Trends are compared to understand how these different systems affect the data that can be extracted.
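One way such a comparison could be summarized is sketched below. The text does not specify how the time series are compared, so the two summary measures used here (mean index and share of months with a non-zero value) are assumptions, as are the function name and the index values. The sketch also assumes that all variants of a keyword were extracted in a single Google Trends comparison, so that their values share one scale.

```python
from statistics import mean

def summarize_variants(series_by_script):
    """Return (script, mean index, share of non-zero months), sorted by mean index."""
    summary = []
    for script, values in series_by_script.items():
        nonzero_share = sum(1 for v in values if v > 0) / len(values)
        summary.append((script, mean(values), nonzero_share))
    return sorted(summary, key=lambda row: row[1], reverse=True)

# Hypothetical index values for one keyword entered in several scripts.
series_by_script = {
    "kanji": [60, 80, 100, 70],
    "hiragana": [10, 0, 20, 10],
    "romaji_hepburn": [0, 0, 5, 0],
}

for script, avg, nonzero in summarize_variants(series_by_script):
    print(f"{script}: mean={avg:.1f}, non-zero months={nonzero:.0%}")
```

A variant with a low mean and many zero months would yield little usable data, which is the kind of difference between writing systems that the comparison is meant to reveal.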
To predict migration with Google search data, we start with the same keyword list by Böhme et al. Whereas research by Golenvaux et al. successfully used the list unmodified to predict immigration, using it for Japanese migration requires that it a) be adjusted to reflect the specific nature of Japanese migration and b) be translated with the specificity of the Japanese language in mind. Concretely, most words dealing with topics such as ‘asylum’ or ‘smuggling’ were deleted as these are not relevant for Japanese immigration to Europe, and words such as ‘insurance’ or ‘studying overseas’ were added. Also, words such as ‘migration’ and ‘migrating’, while different in English, are differentiated in Japanese only by grammatical constructions (e.g., ijū and ijū suru). The part of the word that carries the meaning does not include these grammatical differentiators, so these keywords are identical in Japanese.
Finally, following the findings from examining the different writing systems, the words are translated and transcribed, resulting in a list of 90 words (table 1). For some words, compound search terms are also constructed, both to raise search frequencies to a level that Google Trends reports where results were otherwise lacking, which in turn enables data extraction, and to address the issue of synonyms. For instance, we combined the words ‘consulate’ and ‘embassy’ and entered the search term as follows in combination with ‘Paris’: パリ 領事館 + パリ 大使館 (‘Paris consulate + Paris embassy’).
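The construction of such a compound term can be sketched as a simple string operation. The helper name below is hypothetical; the sketch relies on the Google Trends convention that terms joined with a plus sign are treated as alternatives (OR).

```python
def compound_term(place, keywords):
    """Join 'place keyword' pairs with ' + ', which Google Trends treats as OR."""
    return " + ".join(f"{place} {keyword}" for keyword in keywords)

# 'Paris' combined with 'consulate' and 'embassy', as in the example above.
term = compound_term("パリ", ["領事館", "大使館"])
print(term)  # パリ 領事館 + パリ 大使館
```

The same helper could be reused for any place name and any group of synonyms, so that every compound term follows one pattern.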
To determine the predictive power of Google Trends, several approaches are examined. As a first step, a straightforward approach is used, following Wanner. The keywords are entered in Google Trends together with the Japanese word for each country. For instance, ‘study (in) France’ would be translated into 留学 フランス. Monthly time series of Google Trends (ranging from 0 to 100) are downloaded for 2006 to 2019 and for 2011 to 2019 and are aggregated for each year $t$, with Japan ($ja$) as the search location. The resulting time series are labeled as bilateral Google Trends indexes ($GTI^{bil}_{ja,t}$). We estimate linear regression models via ordinary least squares (OLS) to examine the relationship between immigration ($y_t$, the number of moves in year $t$) and the relative number of searches conducted in Japan ($ja$) in year $t$, expressed by $GTI^{bil}_{ja,t}$.
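The first step can be sketched in a few lines of Python using only the standard library. The text does not state how the monthly values are aggregated, so taking the yearly mean is an assumption; the function names and all numbers are hypothetical and are not results from this study.

```python
from collections import defaultdict

def yearly_mean(monthly):
    """Aggregate {(year, month): index} to {year: mean index} (the mean is an assumption)."""
    buckets = defaultdict(list)
    for (year, _month), value in monthly.items():
        buckets[year].append(value)
    return {year: sum(vals) / len(vals) for year, vals in sorted(buckets.items())}

def ols_simple(x, y):
    """Fit y = a + b*x by ordinary least squares; return (a, b, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot

# Hypothetical monthly index values (two months per year for brevity) and yearly moves.
monthly = {(2011, 1): 20, (2011, 2): 40, (2012, 1): 40,
           (2012, 2): 60, (2013, 1): 60, (2013, 2): 80}
moves = {2011: 1300, 2012: 1500, 2013: 1700}

gti = yearly_mean(monthly)
years = sorted(gti)
a, b, r2 = ols_simple([gti[t] for t in years], [moves[t] for t in years])
print(f"intercept={a:.1f}, slope={b:.2f}, R^2={r2:.3f}")
```

The toy data are constructed to lie on a straight line, so the fit is perfect here; real series would of course show scatter around the fitted line.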
In a second step, we follow Golenvaux et al. and Böhme et al. and construct an interaction term consisting of additional Google Trends indexes: $GTI^{uni}_{ja,t} \times GTI^{dest}_{ja,t}$. Whereas the aforementioned authors construct one Google Trends index which aggregates the frequencies of all the keywords, we maintain the frequencies per keyword to examine the possible nuances between words. Although the assumption is that the associations of all words should point in the same direction, this needs to be confirmed by considering each word individually. $GTI^{uni}_{ja,t}$ is a predictor containing the Google Trends values of the keywords by themselves for Japan during year $t$ (i.e., not specifying the European destination). $GTI^{dest}_{ja,t}$ is the relative search intensity in Japan for the country names (e.g., ‘France’ but without another keyword). OLS linear regression is used for the periods 2006-2019 and 2011-2019 but with two predictors: $GTI^{bil}_{ja,t} + GTI^{uni}_{ja,t} \times GTI^{dest}_{ja,t}$.
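Written out, the specifications of the first and second step might look as follows. The coefficient symbols $\beta$ and the error term $\varepsilon_t$ are notational assumptions that do not appear in the text. The first line regresses the yearly number of moves on the bilateral index alone, and the second line adds the product of the unilateral and destination indexes.

```latex
\begin{align}
y_t &= \beta_0 + \beta_1\, GTI^{bil}_{ja,t} + \varepsilon_t \\
y_t &= \beta_0 + \beta_1\, GTI^{bil}_{ja,t}
       + \beta_2 \left( GTI^{uni}_{ja,t} \times GTI^{dest}_{ja,t} \right) + \varepsilon_t
\end{align}
```

In this reading, the interaction enters as a single product regressor without separate main effects, which corresponds to the two predictors described above.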
Compared to moving from Germany to France, for instance, migrating from Japan to Europe requires more planning because of the distance involved (Euclidean as well as cultural) and the additional paperwork compared to movement within the Schengen area. To capture this preparation phase, the models are run again with a one-year time lag ($y_{t-1}$) for Germany, France, the Netherlands, and the United Kingdom. Because monthly data are available for Belgium, the number of lags is increased for this country, and delays of three, six, nine, and twelve months between searching and moving are examined.
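Aligning the two series for a given delay can be sketched as follows. The sketch assumes that a lag means pairing the search index of an earlier period with the moves of a later period; the function name and the values are hypothetical.

```python
def lag_pairs(searches, moves, lag):
    """Pair the search index at period t - lag with the moves at period t.

    Both inputs are equally long lists ordered in time; `lag` is a number of periods.
    """
    if len(searches) != len(moves):
        raise ValueError("series must have the same length")
    if lag < 0 or lag >= len(searches):
        raise ValueError("lag must be between 0 and len(searches) - 1")
    if lag == 0:
        return list(zip(searches, moves))
    return list(zip(searches[:-lag], moves[lag:]))

# Hypothetical monthly series.
searches = [10, 20, 30, 40, 50, 60]
moves = [100, 110, 120, 130, 140, 150]

for lag in (0, 3):
    print(lag, lag_pairs(searches, moves, lag))
```

Each lag shortens the usable sample by the number of lagged periods, which matters for the yearly series, where only a handful of observations are available per country.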
In a third step, we focus only on the country and city names. Instead of examining general searches, a built-in Google Trends tool is used that sorts searches into specific categories. The data are extracted based on four categories: 1) all categories, 2) business and industrial, 3) jobs and education, and 4) law and government. The resulting time series only take into account searches related to the specified categories and are thus not limited to exact words. These are analyzed with OLS linear regression with predictor $GTI^{dest}_{ja,t}$.
Next, as a fourth step, the first analysis is repeated but the country names are exchanged for a key city from each country. As reflected in the literature, cities such as Paris, Düsseldorf or Brussels are known within their respective countries and in Japan as featuring a relatively established Japanese community and may serve as a prime destination for Japanese immigrants. We examine whether these city names can serve as proxies for country names. Some keywords make more practical sense at a regional or city level. For instance, when searching for accommodation, it can be assumed that people do this at the level of a city and do not just look for a place to stay anywhere in the country. Here we focus on one predictor, $GTI^{bil}_{ja,t}$, and analyze the predictive strength via OLS linear regression for 2011 to 2019.
Finally, as a fifth step, the search location is changed from Japan to each of the five European countries ($cod$). This translates into examining how frequently Japanese words were searched for in European countries. In this step, only the Japanese keywords are used, without the European country or city name. These are analyzed for both periods, starting in 2006 and 2011, via OLS linear regression with predictor $GTI^{uni}_{cod,t}$. The inspiration for this reversed approach can be found in Connor’s research. We assume that after people have moved, they still need to search for information that may be captured by Google (e.g., where the embassy is to arrange visa formalities, looking for a job, how tax works, and more).
Throughout the above analyses, linear regression is used for a number of reasons. One is that linear regression is used in comparable research [1, 6, 9, 54], and this research also aims to find out how replicable these techniques are in other cases. Another reason is that it conceptually follows the logic of migration aspirations: more people aspiring to migrate means more people searching for information. Increases or decreases in these numbers ought to be followed by increases or decreases in real mobility, potentially after some delay. Whereas other research makes use of a narrower, more targeted range of methods and data, there is no prior research that can be used as a guideline for analyzing Japanese immigration. Consequently, this research opts to explore several ways of searching for predictive strength by using a wide range of Google search terms.
4 The United Kingdom’s Office for National Statistics has monthly numbers, but these are not split by citizenship or country of origin and are consequently not suitable for this paper.
5 The numbers for asylum seekers and refugees are an exception, as these are monitored more closely.
6 The immigration flow from Japan to Germany for 2019 was lacking in the OECD dataset at the time of consultation and was supplemented with data generated by the German Federal Statistical Office (set 12711-0007).
7 Some selection bias is expected due to differences in digital literacy and access to technology. The latter, however, is not a concern when dealing with Japan. According to the ITU (International Telecommunication Union), the United Nations specialized agency for information and communication technologies, Japan has a 3G mobile coverage of 100% and a 4G mobile coverage of 99% of the population (2019 and 2017 respectively), so the basic network is well established. Active subscriptions follow the same trend: 203 active mobile-broadband subscriptions per 100 inhabitants and 34 fixed subscriptions per 100 inhabitants in 2019.
8 These are words with a different meaning but the same or similar sound.
9 Whereas the basic sets (roughly 2000 characters) are learned in school, a complete list of kanji existing in Japanese would range from 40,000 to more than 75,000 unique characters. Diverging proficiency is illustrated by a nationally organized kanji exam (kanji kentei) aimed at Japanese of all ages and levels. 631,521 people registered for the second round in 2020, but only 10.9% could pass the most difficult first grade.
10 For Belgium, the monthly time series are used. For other countries these are aggregated to yearly ones.
11 We only follow the researchers’ principle of constructing Google Trends indexes but not their analysis, since they used Google Trends as part of a model rather than by itself. In this research, we are more concerned with the keywords, so our emphasis differs.