3.1 Overview of the mining process
Figure 1 gives an overview of the mining process that we use. We start by defining the research area and the time period that we would like to study (Step 1 in Fig. 1). In this case the research area was Big Data, and the time period was the 10-year period from 2012 to 2021. These two parameters are sent to the program for bibliometric mining.
Using the Scopus API, we collect all documents with “Big Data” in either the title, list of author-defined keywords or abstract for each of the 10 years (Step 2 in Fig. 1). This is done in 10 steps using the 10 search strings: “TITLE-ABS-KEY ( {big data} ) AND (PUBYEAR = 2012)”,…, “TITLE-ABS-KEY ( {big data} ) AND (PUBYEAR = 2021)”.
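As an illustration, the per-year retrieval in Step 2 can be expressed as a small loop over the ten search strings. The sketch below assumes the Elsevier Scopus Search REST endpoint and a placeholder API key; pagination beyond the first page and error handling are omitted.

import requests

SCOPUS_URL = "https://api.elsevier.com/content/search/scopus"
API_KEY = "YOUR-API-KEY"  # placeholder; a real Scopus API key is required

def fetch_documents(year, start=0, count=25):
    # Build the same search string as in the text for the given publication year.
    query = f"TITLE-ABS-KEY ( {{big data}} ) AND (PUBYEAR = {year})"
    response = requests.get(
        SCOPUS_URL,
        headers={"X-ELS-APIKey": API_KEY, "Accept": "application/json"},
        params={"query": query, "start": start, "count": count},
    )
    response.raise_for_status()
    return response.json()["search-results"]["entry"]

# Step 2: one query per year in the 10-year period 2012-2021.
documents_per_year = {year: fetch_documents(year) for year in range(2012, 2022)}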
For each document that we find with the search strings shown above, we get a record containing the following (and other) fields (Step 3 in Fig. 1); a sketch of a corresponding record structure follows the list:
- Title of the document
- Author-defined keywords for the document (if any)
- Abstract
- Number of citations
- Affiliation country
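For illustration, these fields can be kept in a small record type in the mining program; the structure below is a sketch with field names of our own choosing, not the field names used in the Scopus response.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DocumentRecord:
    # Subset of the Scopus record fields that the mining program uses.
    title: str
    author_keywords: List[str] = field(default_factory=list)  # may be empty
    abstract: str = ""
    citation_count: int = 0
    affiliation_countries: List[str] = field(default_factory=list)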
These records are processed and filtered by the Python program for bibliometric mining that we have developed. One important task for the mining program is to identify popular and fast-growing keywords that reflect important research directions within Big Data. A major challenge is the overwhelming number of author-defined keywords in the retrieved documents (in this case more than 150 000 unique keywords). Another challenge is that some keywords are very general (e.g., “data”, “research” and “future”); such general keywords are in most cases not very useful when identifying research trends and directions. Two approaches have been used to address these challenges: one general automatic approach and one manual, research-area-specific approach.
The automatic approach for reducing the number of keywords and removing keywords that are too general is based on only considering author-defined keywords that are present in at least a certain number of documents. When developing our methodology, we saw that keywords that consist of only one word (e.g., “data”, “research” and “future”) tend to be general and thus less useful than keywords that consist of two or more words (e.g., “deep learning”, “convolutional neural networks” and “cloud computing”). Based on this observation, the need to reduce the number of keywords in general and the number of very general keywords in particular, and some testing, we developed a heuristic rule: for keywords consisting of two or more words, we automatically remove the keyword if it is present in fewer than a certain number (MinimumOccurences) of author-defined keyword lists; for keywords consisting of only one word, we automatically remove the keyword if it is present in fewer than a certain larger number (OneWordFactor*MinimumOccurences) of author-defined keyword lists. This means that for a keyword to be considered, we require OneWordFactor times more occurrences for a one-word keyword than for a keyword consisting of two or more words. By experimenting with different values of OneWordFactor and MinimumOccurences and discussing with research area experts, it was decided that OneWordFactor = 4 and MinimumOccurences = 25 provided a useful list of keywords that could be presented to human experts (Step 4 in Fig. 1). For OneWordFactor = 4 and MinimumOccurences = 25 there were approximately 1500 remaining keywords, i.e., the initial number of keywords was reduced by roughly a factor of 100.
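A minimal sketch of this filtering heuristic, assuming the author-defined keyword lists have already been extracted from the records and converted to lower case:

from collections import Counter

MINIMUM_OCCURRENCES = 25  # MinimumOccurences in the text
ONE_WORD_FACTOR = 4       # OneWordFactor in the text

def filter_keywords(keyword_lists):
    # Count in how many author-defined keyword lists each keyword is present.
    counts = Counter(kw for kws in keyword_lists for kw in set(kws))
    kept = set()
    for kw, occurrences in counts.items():
        limit = MINIMUM_OCCURRENCES
        if len(kw.split()) == 1:
            # One-word keywords tend to be general and need more support.
            limit *= ONE_WORD_FACTOR
        if occurrences >= limit:
            kept.add(kw)
    return kept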
After automatic filtering, which reduced the number of keywords from 150 000 to 1500, human experts developed a blacklist containing remaining keywords that are very general and thus less useful for determining research trends and directions (Step 5 in Fig. 1). Since many keywords are similar or related to the same research direction within the Big Data research area, the experts also created a thesaurus that clusters keywords into groups with similar meaning. These groups can be considered as research directions within Big Data. Some clustering is trivial and based on linguistic aspects, e.g., “neural network” and “neural networks” are put in the same group (all keywords in the documents are converted to lower case), and “health care” and “healthcare” are put in the same group. Some common abbreviations are also trivial to cluster, e.g., “internet of things” and “iot”, and “artificial intelligence” and “ai”. Clustering that requires the research area experts’ knowledge and judgement involves decisions such as putting “parallel processing” and “distributed processing” in the same group and putting “edge computing” and “fog computing” in the same group. Appendix A contains the blacklist and thesaurus used. Keywords that represent groups (we refer to such keywords as research directions) in the thesaurus are capitalized, e.g., “Deep learning” represents a group of keywords including “deep learning” (see Appendix A for details).
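A sketch of how the blacklist and thesaurus could be applied when mapping author-defined keywords to research directions; the dictionary-based representation is our assumption, and the actual thesaurus and blacklist are given in Appendix A.

def map_keyword(keyword, thesaurus, blacklist):
    # thesaurus: e.g. {"privacy-preserving": "Security and privacy", "iot": "Internet of things"}
    # blacklist: e.g. {"research", "data", "future"}
    kw = keyword.lower()
    if kw in blacklist:
        return None                  # too general, ignore
    return thesaurus.get(kw, kw)     # research direction, or the keyword itself if ungrouped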
The output from the program to the experts in Step 4 in Fig. 1 is a list of keywords and numbers indicating the number of documents that contain each keyword, e.g. (<…> represents keywords that are omitted in this example):
<…>
Security and privacy 9807
<…>
research 8009
<…>
privacy-preserving 1101
<…>
The fact that “Security and privacy” starts with a capital “S” shows that this keyword (or research direction) is defined in the thesaurus by the experts. The other two keywords start with lowercase letters and have been automatically extracted from the author-defined keyword lists in the documents. Based on the list of keywords printed by the program, the experts may in Step 5 in Fig. 1 decide to put “research” on the blacklist (because this keyword is very general) and add “privacy-preserving” to the research direction “Security and privacy” in the thesaurus. When Step 4 is repeated, the output from the program to the experts may look like this:
<…>
Security and privacy 10165
<…>
This means that “research” has been removed from the keyword list and that “privacy-preserving” has been included in the research direction “Security and privacy”. Note that, since some documents may contain both “privacy-preserving” and a keyword that was already included in “Security and privacy” in the thesaurus, the number of documents that contain a keyword associated with the research direction “Security and privacy” increases by only 358, from 9807 to 10165 (10165 − 9807 = 358), and not by 1101, which was the number of documents that contained “privacy-preserving” (see above).
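The counting behind this example follows from taking the union of document sets rather than adding keyword counts; a toy illustration with made-up document identifiers:

# Documents already associated with "Security and privacy" (toy identifiers).
security_and_privacy = {"d1", "d2", "d3"}
# Documents containing "privacy-preserving"; "d3" contains keywords from both groups.
privacy_preserving = {"d3", "d4"}

merged = security_and_privacy | privacy_preserving
# The count grows only by the documents that are new to the union (here 1, not 2),
# which is why the text reports an increase of 358 rather than 1101.
print(len(merged) - len(security_and_privacy))  # -> 1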
Steps 4 and 5 are repeated until the research area experts are satisfied with the blacklist and thesaurus. An \(n\times n\) research direction dependency matrix for the n most important research directions is then calculated (what makes a research direction important is defined in Section 3.2.1). The value of n is decided by the experts. However, to avoid an overwhelming number of research directions, n should be on the order of 10 or smaller. Let Ki be the set of all documents that contain, in the title, the author-defined list of keywords or the abstract, a keyword that (via the thesaurus) is associated with research direction i. Entry mi,j in the research direction dependency matrix is obtained as \({m}_{i,j}=\left|{K}_{i}\cap {K}_{j}\right|/\left|{K}_{i}\right|\), i.e., mi,j is a number between zero and one, and it is one when i = j. By looking at the research direction dependency matrix one can see if there is a large overlap between two research directions, i.e., if the number of documents that contain keywords associated with both research directions is relatively high (Step 6 in Fig. 1). If the values mi,j and mj,i are high, the research area experts may decide to merge research directions i and j by modifying the thesaurus (Step 7 in Fig. 1), i.e., the mining program supports the experts in their non-trivial task of creating a thesaurus that reflects important and reasonably non-overlapping directions within the research area. Based on data retrieved from the Scopus database, the thesaurus and the blacklist, the data mining program then generates data that describes important research directions, trends etc. in the research area (Step 8 in Fig. 1).
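A sketch of the dependency matrix calculation, assuming that the document sets Ki have already been collected per research direction:

def dependency_matrix(direction_docs):
    # direction_docs[i] is the set K_i of documents associated with research direction i.
    n = len(direction_docs)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if direction_docs[i]:
                # m[i][j] = |K_i intersect K_j| / |K_i|, so m[i][i] == 1.0.
                m[i][j] = len(direction_docs[i] & direction_docs[j]) / len(direction_docs[i])
    return m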
3.2 Data generated by the mining process
3.2.1 Research directions and research trends
The main result from the mining process is a list of the most important research directions found. The importance of a research direction is based on three criteria. To quantify these criteria, all keywords and research directions remaining after the automatic and manual filtering are numbered in some arbitrary order. At this point author-defined keywords and expert-defined research directions are treated in the same way, and in the definition of the three criteria below we refer to an expert-defined research direction as a keyword. If keyword i is a research direction, the number of documents is calculated in the following way: the list of pi keywords that correspond to the research direction is obtained from the thesaurus. For each of these pi keywords we create a set Aj containing the documents that contain keyword j (1 ≤ j ≤ pi) in the title, author-defined list of keywords or abstract. A set Ai is then created as \({A}_{i}={\bigcup }_{j=1}^{{p}_{i}}{A}_{j}\); the number of documents for research direction i is the cardinality of Ai. The three criteria are as follows (a computational sketch is given after the list):
- The total number of documents (tai) for keyword i (\(1\le i\le m\)) for the entire period (2012–2021).
- The growth rate (gri) for keyword i during the time period. The idea is that if the number of documents that contain keyword i has increased rapidly during the time period, then keyword i is important. This metric is calculated in the following way: let ki,j denote the number of documents published during year j that contain keyword i. Based on this we calculate gri in the following way (we set ki,2011 = 0, and the age factor a = 1.5): \({gr}_{i}={\sum }_{j=2012}^{2021}{a}^{j-2012}\left({k}_{i,j}-{k}_{i,j-1}\right)\)
- The citation count (cci) for keyword i. Let Ki be the set of all documents that contain the keyword (the cardinality of Ki is tai). Also, let cj be the number of citations of document j, then \({cc}_{i}={\sum }_{j\in {K}_{i}}{c}_{j}\).
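A computational sketch of the three criteria; docs_by_year is assumed to map a publication year to the number of documents that contain keyword i (ki,j), and citations is assumed to hold the citation counts of the documents in Ki:

AGE_FACTOR = 1.5  # the age factor a

def total_documents(docs_by_year):
    # ta_i: total number of documents over the entire period 2012-2021.
    return sum(docs_by_year.get(year, 0) for year in range(2012, 2022))

def growth_rate(docs_by_year):
    # gr_i: age-weighted sum of year-to-year increases, with k_{i,2011} = 0.
    gr, previous = 0.0, 0
    for year in range(2012, 2022):
        current = docs_by_year.get(year, 0)
        gr += AGE_FACTOR ** (year - 2012) * (current - previous)
        previous = current
    return gr

def citation_count(citations):
    # cc_i: sum of the citation counts of the documents in K_i.
    return sum(citations)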
Three ranking lists of keywords were created: one based on tai, one based on gri, and one based on cci. The three ranks were added for each keyword, and the n keywords with the lowest rank sum were selected as the most important keywords. After some experimentation and discussions with research area experts, it was decided that this way of selecting keywords and research directions made sense, since each of the three criteria reflects an important aspect that should affect the selection of important research directions within Big Data.
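A sketch of the rank-sum selection, assuming ta, gr and cc are dictionaries mapping each keyword to its criterion value (higher values give better, i.e., lower, ranks):

def select_top_keywords(ta, gr, cc, n):
    def ranks(scores):
        # Rank 1 is given to the keyword with the highest criterion value.
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {kw: rank for rank, kw in enumerate(ordered, start=1)}

    r_ta, r_gr, r_cc = ranks(ta), ranks(gr), ranks(cc)
    rank_sum = {kw: r_ta[kw] + r_gr[kw] + r_cc[kw] for kw in ta}
    # The n keywords with the lowest rank sum are the most important ones.
    return sorted(ta, key=rank_sum.get)[:n]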
To visualize the trend in research direction i, the list of pi keywords that correspond to research direction i is first obtained from the thesaurus. A document is counted as belonging to a research direction if it contains, in the title, list of author-defined keywords or abstract, at least one of the pi keywords associated with the research direction. As a consequence, one document can belong to more than one research direction, and some documents may not belong to any of the n most important research directions.
Documents that are published early will in general have more citations than documents published later, e.g., a document from 2012 will in general have more citations than a document published in 2021. To be able to compare citation counts from different years, a year-normalized citation score (NCS, Normalized Citation Score) was calculated for each document. The NCS for a document is the number of citations for the document divided by the average number of citations for documents in our dataset that were published the same year. By definition, the average NCS for all documents in our dataset is 1.
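A sketch of the NCS calculation, assuming documents is a list of (document id, publication year, citation count) tuples:

from collections import defaultdict

def normalized_citation_scores(documents):
    totals, counts = defaultdict(int), defaultdict(int)
    for _, year, citations in documents:
        totals[year] += citations
        counts[year] += 1
    # Average number of citations per publication year.
    year_average = {year: totals[year] / counts[year] for year in totals}
    # NCS = citations / (average citations of documents published the same year).
    return {doc_id: (citations / year_average[year] if year_average[year] else 0.0)
            for doc_id, year, citations in documents}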
The average NCS was calculated for each research direction i by considering the set Ai of all documents that contain at least one of the pi keywords that, according to the thesaurus, are associated with research direction i. The average NCS for research direction i is the average of the NCS for the documents in set Ai.
3.2.2 Geographic information
For the important research directions, as well as for the research area Big Data as a whole, we plot the number of documents for the major geographic regions (based on affiliation country). We consider four geographic regions: North America (USA and Canada), European Union (taking Brexit into consideration by including the UK until the end of 2019), China and The Rest of the World. A document that has affiliation countries from more than one geographic region is counted proportionally in the corresponding geographic regions, e.g., a document with three authors with affiliation countries China, Sweden and Brazil is counted 1/3 in the region China, 1/3 in the region European Union and 1/3 in the region The Rest of the World.
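A sketch of the proportional counting, assuming each affiliation country has already been mapped to one of the four regions (how duplicate countries within one region are weighted is our assumption):

REGIONS = ("North America", "European Union", "China", "The Rest of the World")

def proportional_region_counts(affiliation_regions):
    # affiliation_regions: one region per affiliation country of the document,
    # e.g. ["China", "European Union", "The Rest of the World"] -> 1/3 each.
    share = 1.0 / len(affiliation_regions)
    counts = {region: 0.0 for region in REGIONS}
    for region in affiliation_regions:
        counts[region] += share
    return counts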
The average NCS was calculated for each region (considering all documents from that region), as well as for each combination of region and research direction. To calculate the average NCS for research direction i in a certain region, the set of all documents from the region is put in a set A (documents are counted proportionally if there are authors from different regions). The list of pi keywords that correspond to research direction i is then obtained from the thesaurus. For each of these pi keywords we create a subset Aj of A such that Aj consists of the documents in A that contain keyword j (1 ≤ j ≤ pi) in the title, author-defined list of keywords or abstract. A set Ai is then created as \({A}_{i}={\bigcup }_{j=1}^{{p}_{i}}{A}_{j}\). The average NCS for research direction i for the region is the average NCS for the documents in Ai.