3.1 Overview of the mining process
Figure 1 gives an overview of the mining process that we use. We start by defining the research area and the time period that we would like to study (Step 1 in Fig. 1). In this case the research area was Big Data, and the time period was the 10-year period from 2012 to 2021. These two parameters are sent to the program for bibliometric mining.
Using the Scopus API, we collect all documents with “Big Data” in either the title, list of author-defined keywords or abstract for each of the 10 years (Step 2 in Fig. 1). This is done in 10 steps using the 10 search strings: “TITLE-ABS-KEY ( {big data} ) AND (PUBYEAR = 2012)”,…, “TITLE-ABS-KEY ( {big data} ) AND (PUBYEAR = 2021)”.
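As an illustration, the per-year retrieval in Step 2 can be expressed as a small loop over the ten search strings. The sketch below assumes the Elsevier Scopus Search REST endpoint and a placeholder API key; pagination beyond the first page and error handling are omitted.

import requests

SCOPUS_URL = "https://api.elsevier.com/content/search/scopus"
API_KEY = "YOUR-API-KEY"  # placeholder; a real Scopus API key is required

def fetch_documents(year, start=0, count=25):
    # Build the same search string as in the text for the given publication year.
    query = f"TITLE-ABS-KEY ( {{big data}} ) AND (PUBYEAR = {year})"
    response = requests.get(
        SCOPUS_URL,
        headers={"X-ELS-APIKey": API_KEY, "Accept": "application/json"},
        params={"query": query, "start": start, "count": count},
    )
    response.raise_for_status()
    return response.json()["search-results"]["entry"]

# Step 2: one query per year in the 10-year period 2012-2021.
documents_per_year = {year: fetch_documents(year) for year in range(2012, 2022)}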
For each document that we find with the search strings shown above, we get a record containing the following (and other) fields (Step 3 in Fig. 1); a sketch of a corresponding record structure follows the list:
- Title of the document
- Author-defined keywords for the document (if any)
- Abstract
- Number of citations
- Affiliation country
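For illustration, these fields can be kept in a small record type in the mining program; the structure below is a sketch with field names of our own choosing, not the field names used in the Scopus response.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DocumentRecord:
    # Subset of the Scopus record fields that the mining program uses.
    title: str
    author_keywords: List[str] = field(default_factory=list)  # may be empty
    abstract: str = ""
    citation_count: int = 0
    affiliation_countries: List[str] = field(default_factory=list)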
These records are processed and filtered by the Python program for bibliometric mining that we have developed. One important task for the mining program is to identify popular and fast-growing keywords that reflect important research directions within Big Data. A major challenge is the overwhelming number of author-defined keywords in the retrieved documents (in this case more than 150 000 unique keywords). Another challenge is that some keywords are very general (e.g., “data”, “research” and “future”); such general keywords are in most cases not very useful when identifying research trends and directions. Two approaches have been used to address these challenges: one general automatic approach and one manual, research-area-specific approach.
The automatic approach for reducing the number of keywords and removing keywords that are too general is based on only considering author-defined keywords that are present in at least a certain number of documents. When developing our methodology, we saw that keywords that consist of only one word (e.g., “data”, “research” and “future”) tend to be general and thus less useful than keywords that consist of two or more words (e.g., “deep learning”, “convolutional neural networks” and “cloud computing”). Based on this observation, the need to reduce the number of keywords in general and the number of very general keywords in particular, and some testing, we developed a heuristic rule: for keywords consisting of two or more words, we automatically remove the keyword if it is present in fewer than a certain number (MinimumOccurences) of author-defined keyword lists; for keywords consisting of only one word, we automatically remove the keyword if it is present in fewer than a certain larger number (OneWordFactor*MinimumOccurences) of author-defined keyword lists. This means that for a keyword to be considered, we require OneWordFactor times more occurrences for a one-word keyword than for a keyword consisting of two or more words. By experimenting with different values of OneWordFactor and MinimumOccurences and discussing with research area experts, it was decided that OneWordFactor = 4 and MinimumOccurences = 25 provided a useful list of keywords that could be presented to human experts (Step 4 in Fig. 1). For OneWordFactor = 4 and MinimumOccurences = 25 there were approximately 1500 remaining keywords, i.e., the initial number of keywords was reduced by roughly a factor of 100.
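A minimal sketch of this filtering heuristic, assuming the author-defined keyword lists have already been extracted from the records and converted to lower case:

from collections import Counter

MINIMUM_OCCURRENCES = 25  # MinimumOccurences in the text
ONE_WORD_FACTOR = 4       # OneWordFactor in the text

def filter_keywords(keyword_lists):
    # Count in how many author-defined keyword lists each keyword is present.
    counts = Counter(kw for kws in keyword_lists for kw in set(kws))
    kept = set()
    for kw, occurrences in counts.items():
        limit = MINIMUM_OCCURRENCES
        if len(kw.split()) == 1:
            # One-word keywords tend to be general and need more support.
            limit *= ONE_WORD_FACTOR
        if occurrences >= limit:
            kept.add(kw)
    return kept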
After automatic filtering, which reduced the number of keywords from 150 000 to 1500, human experts developed a blacklist containing remaining keywords that are very general and thus less useful for determining research trends and directions (Step 5 in Fig. 1). Since many keywords are similar or related to the same research direction within the Big Data research area, the experts also created a thesaurus that clusters keywords into groups with similar meaning. These groups can be considered as research directions within Big Data. Some clustering is trivial and based on linguistic aspects, e.g., “neural network” and “neural networks” are put in the same group (all keywords in the documents are converted to lower case), and “health care” and “healthcare” are put in the same group. Some common abbreviations are also trivial to cluster, e.g., “internet of things” and “iot”, and “artificial intelligence” and “ai”. Clustering that requires the research area experts’ knowledge and judgement involves decisions such as putting “parallel processing” and “distributed processing” in the same group and putting “edge computing” and “fog computing” in the same group. Appendix A contains the blacklist and thesaurus used. Keywords that represent groups (we refer to such keywords as research directions) in the thesaurus are capitalized, e.g., “Deep learning” represents a group of keywords including “deep learning” (see Appendix A for details).
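A sketch of how the blacklist and thesaurus could be applied when mapping author-defined keywords to research directions; the dictionary-based representation is our assumption, and the actual thesaurus and blacklist are given in Appendix A.

def map_keyword(keyword, thesaurus, blacklist):
    # thesaurus: e.g. {"privacy-preserving": "Security and privacy", "iot": "Internet of things"}
    # blacklist: e.g. {"research", "data", "future"}
    kw = keyword.lower()
    if kw in blacklist:
        return None                  # too general, ignore
    return thesaurus.get(kw, kw)     # research direction, or the keyword itself if ungrouped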
The output from the program to the experts in Step 4 in Fig. 1 is a list of keywords and numbers indicating the number of documents that contain each keyword, e.g. (<…> represents keywords that are omitted in this example):
<…>
Security and privacy 9807
<…>
research 8009
<…>
privacy-preserving 1101
<…>
The fact that “Security and privacy” starts with a capital “S” shows that this keyword (or research direction) is defined in the thesaurus by the experts. The other two keywords start with lowercase letters and have been automatically extracted from the author-defined keyword lists in the documents. Based on the list of keywords printed by the program, the experts may in Step 5 in Fig. 1 decide to put “research” on the blacklist (because this keyword is very general) and add “privacy-preserving” to the research direction “Security and privacy” in the thesaurus. When Step 4 is repeated, the output from the program to the experts may look like this:
<…>
Security and privacy 10165
<…>
This means that “research” has been removed from the keyword list and that “privacy-preserving” has been included in the research direction “Security and privacy”. Note that, since some documents may contain both “privacy-preserving” and a keyword that was already included in “Security and privacy” in the thesaurus, the number of documents that contain a keyword associated with the research direction “Security and privacy” increases by only 358, from 9807 to 10165 (10165 − 9807 = 358), and not by 1101, which was the number of documents that contained “privacy-preserving” (see above).
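The counting behind this example follows from taking the union of document sets rather than adding keyword counts; a toy illustration with made-up document identifiers:

# Documents already associated with "Security and privacy" (toy identifiers).
security_and_privacy = {"d1", "d2", "d3"}
# Documents containing "privacy-preserving"; "d3" contains keywords from both groups.
privacy_preserving = {"d3", "d4"}

merged = security_and_privacy | privacy_preserving
# The count grows only by the documents that are new to the union (here 1, not 2),
# which is why the text reports an increase of 358 rather than 1101.
print(len(merged) - len(security_and_privacy))  # -> 1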
Steps 4 and 5 are repeated until the research area experts are satisfied with the blacklist and thesaurus. An \(n\times n\) research direction dependency matrix for the n most important research directions is then calculated (what makes a research direction important is defined in Section 3.2.1). The value of n is decided by the experts. However, to avoid an overwhelming number of research directions, n should be on the order of 10 or smaller. Let Ki be the set of all documents that contain, in the title, the author-defined list of keywords or the abstract, a keyword that (via the thesaurus) is associated with research direction i. Entry mi,j in the research direction dependency matrix is obtained as \({m}_{i,j}=\left|{K}_{i}\cap {K}_{j}\right|/\left|{K}_{i}\right|\), i.e., mi,j is a number between zero and one, and it is one when i = j. By looking at the research direction dependency matrix one can see if there is a large overlap between two research directions, i.e., if the number of documents that contain keywords associated with both research directions is relatively high (Step 6 in Fig. 1). If the values mi,j and mj,i are high, the research area experts may decide to merge research directions i and j by modifying the thesaurus (Step 7 in Fig. 1), i.e., the mining program supports the experts in their non-trivial task of creating a thesaurus that reflects important and reasonably non-overlapping directions within the research area. Based on data retrieved from the Scopus database, the thesaurus and the blacklist, the data mining program then generates data that describes important research directions, trends etc. in the research area (Step 8 in Fig. 1).
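A sketch of the dependency matrix calculation, assuming that the document sets Ki have already been collected per research direction:

def dependency_matrix(direction_docs):
    # direction_docs[i] is the set K_i of documents associated with research direction i.
    n = len(direction_docs)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if direction_docs[i]:
                # m[i][j] = |K_i intersect K_j| / |K_i|, so m[i][i] == 1.0.
                m[i][j] = len(direction_docs[i] & direction_docs[j]) / len(direction_docs[i])
    return m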
3.2 Data generated by the mining process
3.2.1 Research directions and research trends
The main result from the mining process is a list of the most important research directions found. The importance of a research direction is based on three criteria. To quantify these criteria, all keywords and research directions remaining after the automatic and manual filtering are numbered in some arbitrary order. At this point author-defined keywords and expert-defined research directions are treated in the same way, and in the definition of the three criteria below we refer to an expert-defined research direction as a keyword. If keyword i is a research direction, the number of documents is calculated in the following way: the list of pi keywords that correspond to the research direction is obtained from the thesaurus. For each of these pi keywords we create a set Aj containing the documents that contain keyword j (1 ≤ j ≤ pi) in the title, author-defined list of keywords or abstract. A set Ai is then created as \({A}_{i}={\bigcup }_{j=1}^{{p}_{i}}{A}_{j}\); the number of documents for research direction i is the cardinality of Ai. The three criteria are as follows (a computational sketch is given after the list):
- The total number of documents (tai) for keyword i (\(1\le i\le m\)) for the entire period (2012–2021).
- The growth rate (gri) for keyword i during the time period. The idea is that if the number of documents that contain keyword i has increased rapidly during the time period, then keyword i is important. This metric is calculated in the following way: let ki,j denote the number of documents published during year j that contain keyword i. Based on this we calculate gri in the following way (we set ki,2011 = 0, and the age factor a = 1.5): \({gr}_{i}={\sum }_{j=2012}^{2021}{a}^{j-2012}\left({k}_{i,j}-{k}_{i,j-1}\right)\)
- The citation count (cci) for keyword i. Let Ki be the set of all documents that contain the keyword (the cardinality of Ki is tai). Also, let cj be the number of citations of document j, then \({cc}_{i}={\sum }_{j\in {K}_{i}}{c}_{j}\).
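A computational sketch of the three criteria; docs_by_year is assumed to map a publication year to the number of documents that contain keyword i (ki,j), and citations is assumed to hold the citation counts of the documents in Ki:

AGE_FACTOR = 1.5  # the age factor a

def total_documents(docs_by_year):
    # ta_i: total number of documents over the entire period 2012-2021.
    return sum(docs_by_year.get(year, 0) for year in range(2012, 2022))

def growth_rate(docs_by_year):
    # gr_i: age-weighted sum of year-to-year increases, with k_{i,2011} = 0.
    gr, previous = 0.0, 0
    for year in range(2012, 2022):
        current = docs_by_year.get(year, 0)
        gr += AGE_FACTOR ** (year - 2012) * (current - previous)
        previous = current
    return gr

def citation_count(citations):
    # cc_i: sum of the citation counts of the documents in K_i.
    return sum(citations)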
Three ranking lists of keywords were created: one based on tai, one based on gri, and one based on cci. The three ranks were added for each keyword, and the n keywords with the lowest rank sum were selected as the most important keywords. After some experimentation and discussions with research area experts, it was decided that this way of selecting keywords and research directions made sense, since each of the three criteria reflects an important aspect that should affect the selection of important research directions within Big Data.
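A sketch of the rank-sum selection, assuming ta, gr and cc are dictionaries mapping each keyword to its criterion value (higher values give better, i.e., lower, ranks):

def select_top_keywords(ta, gr, cc, n):
    def ranks(scores):
        # Rank 1 is given to the keyword with the highest criterion value.
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {kw: rank for rank, kw in enumerate(ordered, start=1)}

    r_ta, r_gr, r_cc = ranks(ta), ranks(gr), ranks(cc)
    rank_sum = {kw: r_ta[kw] + r_gr[kw] + r_cc[kw] for kw in ta}
    # The n keywords with the lowest rank sum are the most important ones.
    return sorted(ta, key=rank_sum.get)[:n]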
To visualize the trend in research direction i, the list of pi keywords that correspond to research direction i is first obtained from the thesaurus. A document is counted as belonging to a research direction if it contains, in the title, list of author-defined keywords or abstract, at least one of the pi keywords associated with the research direction. As a consequence, one document can belong to more than one research direction, and some documents may not belong to any of the n most important research directions.
Documents that are published early will in general have more citations than documents published later, e.g., a document from 2012 will in general have more citations than a document published in 2021. To be able to compare citation counts from different years, a year-normalized citation score (NCS, Normalized Citation Score) was calculated for each document. The NCS for a document is the number of citations for the document divided by the average number of citations for documents in our dataset that were published the same year. By definition, the average NCS for all documents in our dataset is 1.
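A sketch of the NCS calculation, assuming documents is a list of (document id, publication year, citation count) tuples:

from collections import defaultdict

def normalized_citation_scores(documents):
    totals, counts = defaultdict(int), defaultdict(int)
    for _, year, citations in documents:
        totals[year] += citations
        counts[year] += 1
    # Average number of citations per publication year.
    year_average = {year: totals[year] / counts[year] for year in totals}
    # NCS = citations / (average citations of documents published the same year).
    return {doc_id: (citations / year_average[year] if year_average[year] else 0.0)
            for doc_id, year, citations in documents}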
The average NCS was calculated for each research direction i by considering the set Ai of all documents that contain at least one of the pi keywords that, according to the thesaurus, are associated with research direction i. The average NCS for research direction i is the average of the NCS for the documents in set Ai.
3.2.2 Geographic information
For the important research directions, as well as for the research area Big Data as a whole, we plot the number of documents for the major geographic regions (based on affiliation country). We consider four geographic regions: North America (USA and Canada), European Union (taking Brexit into consideration by including the UK until the end of 2019), China and The Rest of the World. A document that has affiliation countries from more than one geographic region is counted proportionally in the corresponding geographic regions, e.g., a document with three authors with affiliation countries China, Sweden and Brazil is counted 1/3 in the region China, 1/3 in the region European Union and 1/3 in the region The Rest of the World.
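A sketch of the proportional counting, assuming each affiliation country has already been mapped to one of the four regions (how duplicate countries within one region are weighted is our assumption):

REGIONS = ("North America", "European Union", "China", "The Rest of the World")

def proportional_region_counts(affiliation_regions):
    # affiliation_regions: one region per affiliation country of the document,
    # e.g. ["China", "European Union", "The Rest of the World"] -> 1/3 each.
    share = 1.0 / len(affiliation_regions)
    counts = {region: 0.0 for region in REGIONS}
    for region in affiliation_regions:
        counts[region] += share
    return counts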
The average NCS was calculated for each region (considering all documents from that region), as well as for each combination of region and research direction. To calculate the average NCS for research direction i in a certain region, the set of all documents from the region is put in a set A (documents are counted proportionally if there are authors from different regions). The list of pi keywords that correspond to research direction i is then obtained from the thesaurus. For each of these pi keywords we create a subset Aj of A such that Aj consists of the documents in A that contain keyword j (1 ≤ j ≤ pi) in the title, author-defined list of keywords or abstract. A set Ai is then created as \({A}_{i}={\bigcup }_{j=1}^{{p}_{i}}{A}_{j}\). The average NCS for research direction i for the region is the average NCS for the documents in Ai.