Table 1 reports, for each of the ten base articles, the number of articles sampled at every generation of citation and the number of citation contexts collected at each generation from one to five. The average number of citation contexts of the base articles that were extracted from the first-generation articles was 9.8 (maximum=20, minimum=5). A total of 221 (average=22.1, maximum=44, minimum=13) citation contexts of the first-generation articles were extracted from the second-generation articles, and 439 (average=43.9, maximum=61, minimum=22) citation contexts of the second-generation articles were extracted from the third-generation articles. Fourth-generation articles produced 748 (average=74.8, maximum=102, minimum=40) citation contexts of third-generation articles. Similarly, fifth-generation articles produced 1257 (average=125.7, maximum=141, minimum=112) citation contexts of fourth-generation articles. For ease of reporting, citation contexts of the base articles from the first-generation articles were labelled first-generation citation contexts. Similarly, citation contexts of the first-generation articles that were obtained from the second-generation articles were labelled second-generation citation contexts. The same rule applies to the citation contexts from the other generations.
The number of citation context pairs between the citation contexts of first-generation articles and the citation contexts of articles in other generations is displayed in Table 2. The number of citation context pairs from the first- and second-generation citation contexts was 419 (average=41.9, maximum=103, minimum=15). The number of pairs from the first- and third-generation citation contexts was 880 (average=88.0, maximum=194, minimum=36). The number of pairs from the first- and fourth-generation citation contexts was 1520 (average=152.0, maximum=254, minimum=79). The number of pairs from the first- and fifth-generation citation contexts was 2450 (average=245.0, maximum=562, minimum=127).
The number of citation context pairs depends on the number of citation mentions in the citing articles of the two generations in question. The number of citation context pairs between an article with m citation mentions and another article with n citation mentions was obtained as n × m. The lowest number of citation context pairs between the first and the other generations (n=249) was recorded by the second article, while the highest number (n=1,113) was recorded by the ninth article. The total number of citation context pairs for the indirect citation weighting part of this thesis was 5,272.
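The n × m rule can be illustrated with a minimal sketch. The function name and the placeholder context labels below are hypothetical and are not taken from the thesis; the sketch only shows that pairing every context from one article with every context from another yields n × m pairs.

```python
from itertools import product


def citation_context_pairs(first_gen_contexts, later_gen_contexts):
    """Pair every first-generation context with every later-generation context."""
    return list(product(first_gen_contexts, later_gen_contexts))


# Hypothetical example: m = 3 citation mentions in one article, n = 2 in the other.
first = ["ctx_a1", "ctx_a2", "ctx_a3"]
later = ["ctx_b1", "ctx_b2"]
pairs = citation_context_pairs(first, later)
print(len(pairs))  # 3 x 2 = 6 pairs
```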
Table 1: Indirect Citations Statistics
| No. | Article | Times cited | Sample size, Gen 1 | Sample size, Gen 2 | Sample size, Gen 3 | Sample size, Gen 4 | Sample size, Gen 5 | Citation contexts, Gen 1 | Citation contexts, Gen 2 | Citation contexts, Gen 3 | Citation contexts, Gen 4 | Citation contexts, Gen 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Siegel, R. et al (2014) Cancer statistics, 2014. | 9090 | 5 | 10 | 20 | 40 | 69 | 11 | 22 | 22 | 40 | 113 |
| 2 | Bolger, A.M., Lohse, M. and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. | 8864 | 5 | 10 | 20 | 40 | 74 | 5 | 15 | 36 | 71 | 127 |
| 3 | Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. | 8508 | 5 | 10 | 20 | 40 | 73 | 8 | 19 | 41 | 76 | 128 |
| 4 | Love, M. I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. | 6732 | 5 | 10 | 20 | 40 | 78 | 6 | 17 | 41 | 65 | 131 |
| 5 | Ogden, C.I. et al (2014). Prevalence of childhood and adult obesity in the United States, 2011-2012. | 4928 | 5 | 10 | 20 | 40 | 67 | 10 | 15 | 60 | 89 | 112 |
| 6 | Ng., M. (2014). Global, regional, and national prevalence of overweight and obesity in children and adults during 1980-2013: a systematic analysis for the Global Burden of Disease Study 2013. | 4576 | 5 | 10 | 20 | 40 | 68 | 10 | 44 | 61 | 83 | 130 |
| 7 | Go, A., et al (2014). Photovoltaics. Interface engineering of highly efficient perovskite solar cells. | 3905 | 5 | 10 | 20 | 40 | 67 | 7 | 13 | 52 | 76 | 141 |
| 8 | Koln, P., et al (2014). Heart disease and stroke statistics--2014 update: a report from the American Heart Association. | 3880 | 5 | 10 | 20 | 40 | 65 | 12 | 21 | 42 | 102 | 117 |
| 9 | Schizophrenia Working Group of the Psychiatric Genomics Consortium (2014) Black phosphorus field-effect transistors. | 3577 | 5 | 10 | 20 | 40 | 75 | 20 | 30 | 51 | 68 | 137 |
| 10 | Lamouille, S., Xu, J., and Derynck, R. (2014). Solvent engineering for high-performance inorganic-organic hybrid perovskite solar cells. | 3323 | 5 | 10 | 20 | 40 | 74 | 9 | 25 | 33 | 78 | 121 |
| | Total | | 50 | 100 | 200 | 400 | 710 | 98 | 221 | 439 | 748 | 1257 |
Table 2: Number of citation context pairs in the Indirect Citation Dataset
Number of pairs between the first and other generations:

| Article | First-second | First-third | First-fourth | First-fifth | Total |
|---|---|---|---|---|---|
| Article 1 | 34 | 89 | 168 | 266 | 557 |
| Article 2 | 15 | 36 | 71 | 127 | 249 |
| Article 3 | 28 | 59 | 117 | 208 | 412 |
| Article 4 | 21 | 45 | 79 | 151 | 296 |
| Article 5 | 29 | 97 | 171 | 238 | 535 |
| Article 6 | 74 | 122 | 163 | 221 | 580 |
| Article 7 | 19 | 82 | 108 | 202 | 411 |
| Article 8 | 55 | 107 | 251 | 276 | 689 |
| Article 9 | 103 | 194 | 254 | 562 | 1113 |
| Article 10 | 41 | 49 | 138 | 202 | 430 |
| Total | 419 | 880 | 1520 | 2450 | 5272 |
Descriptive statistics of the semantic similarity measure between citation context pairs are presented in Table 3. First, the distributions of the semantic similarity measures were plotted as histograms and inspected visually for normality. The distributions are presented in Figure 3, Figure 4, Figure 5 and Figure 6, and they are symmetrical in shape. Table 3 shows that the averages of the weights of the residual citations received by the base articles from the second, third, fourth and fifth generations are 0.47, 0.43, 0.40 and 0.37, respectively. The average decreased consistently from the second to the fifth citation generation.
Table 3: Averages of the residual citations received from the second to the fifth citation generations
| Article | First-second | First-third | First-fourth | First-fifth |
|---|---|---|---|---|
| article 1 | 0.51 | 0.41 | 0.40 | 0.39 |
| article 2 | 0.31 | 0.33 | 0.30 | 0.28 |
| article 3 | 0.44 | 0.41 | 0.36 | 0.31 |
| article 4 | 0.30 | 0.27 | 0.29 | 0.27 |
| article 5 | 0.50 | 0.44 | 0.43 | 0.40 |
| article 6 | 0.52 | 0.50 | 0.45 | 0.42 |
| article 7 | 0.41 | 0.41 | 0.40 | 0.38 |
| article 8 | 0.62 | 0.57 | 0.53 | 0.48 |
| article 9 | 0.59 | 0.50 | 0.48 | 0.42 |
| article 10 | 0.48 | 0.45 | 0.41 | 0.40 |
| all | 0.47 | 0.43 | 0.40 | 0.37 |
Indirect Citation Weights
The semantic similarity-based residual citation weights were categorized using the thresholds specified in the methods section. Citation context pairs that were not similar (i.e., with a semantic similarity score below 0.51) were allocated a weight of zero. Somewhat similar citation context pairs (i.e., with a score greater than or equal to 0.51 and less than 0.71) were allocated a weight of 0.5. Similar citation context pairs (i.e., with a score greater than or equal to 0.71) were allocated a weight of one.
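The thresholding rule can be written as a short function. This is a minimal sketch that assumes only the 0.51 and 0.71 cut-offs stated above; the function and parameter names are illustrative and do not come from the thesis.

```python
def residual_weight(similarity, somewhat=0.51, similar=0.71):
    """Map a semantic similarity score to a residual citation weight."""
    if similarity >= similar:
        return 1.0   # similar pair
    if similarity >= somewhat:
        return 0.5   # somewhat similar pair
    return 0.0       # not similar


def non_zero_share(scores):
    """Fraction of scores that receive a non-zero weight."""
    weights = [residual_weight(s) for s in scores]
    return sum(w > 0 for w in weights) / len(weights)


# Hypothetical scores, for illustration only.
print([residual_weight(s) for s in (0.30, 0.51, 0.70, 0.71)])  # [0.0, 0.5, 0.5, 1.0]
```

A helper such as `non_zero_share` corresponds to the kind of percentage reported per generation, although the thesis does not state how those percentages were computed.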
Categorization of the weights (Table 4) shows that a weight of 1 was the least frequent weight received by the base articles. Most of the weights received were zero, and the proportion of zero weights increased from the second generation to the fifth generation.
Table 4: Categories of the residual citation semantic similarity scores
| Generation | Weight=1 | Weight=0.5 | Weight=0 | N |
|---|---|---|---|---|
| Second | 4% | 37% | 59% | 100 |
| Third | 0% | 26.5% | 73.5% | 200 |
| Fourth | 1% | 20% | 79% | 400 |
| Fifth | 0% | 10% | 90% | 710 |
The percentages of non-zero weights received by the base articles from the second to the fifth generations are presented in Table 5. The result shows that the overall percentage of non-zero weights received by the base articles decreased from the second generation (43%) to the fifth generation (10%). Article 8 consistently received the highest percentage of non-zero weights at all the generations (jointly with Article 9 at the second generation), with 90% non-zero weights at the second generation and more than 50% non-zero weights at all the generations except the fifth. On the other hand, the worst-performing base articles, Article 2 and Article 4, received no non-zero residual weights at three of the four generations of citations.
Table 5: Non-zero indirect Citation weights
Non-zero residual weights per generation:

| Base articles | 2nd Generation | 3rd Generation | 4th Generation | 5th Generation |
|---|---|---|---|---|
| article 1 | 40% | 15% | 22.5% | 10.14% |
| article 2 | 0% | 5% | 0% | 0% |
| article 3 | 10% | 25% | 12.5% | 2.74% |
| article 4 | 0% | 0% | 5% | 0% |
| article 5 | 30% | 30% | 17.5% | 8.96% |
| article 6 | 80% | 45% | 17.5% | 13.24% |
| article 7 | 20% | 10% | 12.5% | 7.35% |
| article 8 | 90% | 80% | 60% | 36.92% |
| article 9 | 90% | 50% | 37.5% | 18.67% |
| article 10 | 50% | 5% | 12.5% | 5.41% |
| Total | 43% | 26.5% | 19.75% | 10% |
Statistical Test
From the observations in Table 3, the averages of the semantic similarity scores between the citation context pairs decreased from the second generation to the fifth. This observation was corroborated by the percentages of non-zero weights in Table 5, which also decreased from the second to the fifth generation of citations. To find out if the differences in the averages are statistically significant, Hypothesis 60 was stated and tested.
Hypothesis 60: The residual citation score per paper is the same for all the generations of citation.
The averages of the semantic similarity scores between the citation contexts in the first-generation articles and those in subsequent generations decreased as the generations got farther from the base article. In other words, taking the semantic similarity score between citation contexts as a measure of the average knowledge flow from the base article, knowledge flow from the base article reduced continuously as the generations of citations increased. Inferential statistics were therefore performed to confirm whether the observed differences in the averages are significant.
Because the data are continuous, the recommended statistical test is the analysis of variance (ANOVA). The datasets were checked against the other conditions for the ANOVA test. The following conditions were examined:
- Dependent variable (interval data type): semantic similarity scores
- Normally distributed samples: The histograms of the four distributions are displayed in Figure 3, Figure 4, Figure 5 and Figure 6, which show that all the distributions are approximately normal.
- Test of Homogeneity: The result of Levene's test of homogeneity of variances is displayed in Table 6. The null hypothesis of equal variances was rejected because p<0.05, so the variances are not equal. The datasets violated the assumption of homogeneity of variances and are therefore not appropriate for ANOVA. The Kruskal-Wallis test, a non-parametric test, was used as an alternative to ANOVA to test whether the differences between the citation context pairs' semantic similarity scores are significant.
Table 6: Tests of Homogeneity of Variances for the semantic similarity score per paper
| | Levene Statistic | df1 | df2 | Sig. |
|---|---|---|---|---|
| Based on Mean | 5.333 | 3 | 1406 | .001 |
| Based on Median | 5.125 | 3 | 1406 | .002 |
| Based on Median and with adjusted df | 5.125 | 3 | 1367.793 | .002 |
| Based on trimmed mean | 5.345 | 3 | 1406 | .001 |
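For readers who want to see what the "Based on Mean" row measures, the following is a minimal sketch of Levene's W statistic using absolute deviations from each group's mean. It is an illustration of the textbook formula, not the statistical package used in the thesis, and the sample groups are hypothetical. With k groups and N observations the statistic is referred to an F distribution with k-1 and N-k degrees of freedom, which matches df1=3 and df2=1406 for four generations and 1410 scores.

```python
from statistics import mean


def levene_w(groups):
    """Levene's W statistic based on absolute deviations from group means."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # z[i][j] = |x_ij - mean of group i|
    z = [[abs(x - mean(g)) for x in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical groups: the second is far more spread out than the first.
print(round(levene_w([[1, 2, 3], [0, 5, 10]]), 4))
```

The p-value is omitted here because it requires the F distribution, which is not in the Python standard library.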
Kruskal-Wallis Test
The results of the Kruskal-Wallis test are displayed in Table 7 and Table 8. The test showed a statistically significant difference in the semantic similarity score per paper between the generations of citation, χ2(3) = 68.58, p < .001, with a mean rank semantic similarity score of 917.31 for the second-generation citations, 817.79 for the third-generation citations, 731.23 for the fourth-generation citations, and 629.54 for the fifth-generation citations. The mean ranks show that the citation context similarities decreased as the generations went farther from the base article, and this decrease is statistically significant.
Table 7: Mean Rank Statistics
| | Semantic similarity categories | N | Mean Rank |
|---|---|---|---|
| Residual citation weights score per paper from the four generations | Second generation citations | 100 | 917.31 |
| | Third generation citations | 200 | 817.79 |
| | Fourth generation citations | 400 | 731.23 |
| | Fifth generation citations | 710 | 629.54 |
| | Total | 1410 | |
Table 8: Independent-Samples Kruskal-Wallis Test Summary
| | |
|---|---|
| Total N | 1410 |
| Test Statistic | 68.58 (a) |
| Degree Of Freedom | 3 |
| Asymptotic Sig. (2-sided test) | .00 |

a. The test statistic is adjusted for ties.
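The mean ranks and the tie-adjusted test statistic reported above follow the standard Kruskal-Wallis computation, which can be sketched in plain Python. This is an illustrative implementation of the textbook formula with hypothetical data, not the software used in the thesis; the function names are assumptions.

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (H adjusted for ties, mean rank of each group)."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, mean_ranks, start = 0.0, [], 0
    for g in groups:
        r = ranks[start:start + len(g)]
        start += len(g)
        mean_ranks.append(sum(r) / len(g))
        total += sum(r) ** 2 / len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    # (Undefined when every pooled value is identical.)
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    return h / (1 - ties / (n ** 3 - n)), mean_ranks


# Hypothetical similarity scores for two generations.
h, mean_ranks = kruskal_wallis([[0.62, 0.55, 0.48], [0.40, 0.35, 0.30]])
print(round(h, 3), mean_ranks)
```

H is compared against a chi-square distribution with (number of groups - 1) degrees of freedom, which gives the 3 degrees of freedom reported for four generations.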
Given that Hypothesis 60 was rejected because there was a statistically significant difference in the semantic similarity scores between the generations of citation, pairwise comparisons between consecutive generations were examined using the Bonferroni correction. The result of the pairwise comparison test is displayed in Table 9.
Table 9: Pairwise Comparisons of the Semantic Similarity Score categories
| Sample 1-Sample 2 | Test Statistic | Std. Error | Std. Test Statistic | Sig. | Adj. Sig. (a) |
|---|---|---|---|---|---|
| Fifth generation semantic similarity score - Fourth generation semantic similarity score | 101.69 | 25.46 | 4.00 | .00 | .00 |
| Fourth generation semantic similarity score - Third generation semantic similarity score | 86.55 | 35.26 | 2.46 | .014 | .085 |
| Third generation semantic similarity score - Second generation semantic similarity score | 99.53 | 49.87 | 2.00 | .046 | .276 |

Each row tests the null hypothesis that the Sample 1 and Sample 2 distributions are the same.
Asymptotic significances (2-sided tests) are displayed. The significance level is .050.
a. Significance values have been adjusted by the Bonferroni correction for multiple tests.
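The Bonferroni adjustment multiplies each raw p-value by the number of comparisons and caps the result at 1. The sketch below assumes six comparisons, the number of possible pairs among four generations; that factor is an inference from the adjusted values shown (for example .046 becoming .276), not something the text states, and small differences such as .084 versus .085 would arise from rounding of the displayed p-values.

```python
def bonferroni(p_values, n_tests=None):
    """Bonferroni-adjust p-values: multiply by the number of tests, cap at 1."""
    m = n_tests if n_tests is not None else len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Raw p-values as displayed (rounded) for two of the comparisons;
# n_tests=6 is an assumption, see the note above.
adjusted = bonferroni([0.014, 0.046], n_tests=6)
print([round(p, 3) for p in adjusted])  # [0.084, 0.276]
```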
Cascading Citation and the Proposed Indirect Citation Weighting Comparison
The comparison between the cascading citation system and the proposed residual citation weights is in two phases. In the first phase, cascading citation weights were compared to the proposed indirect citation weights for each of the ten base articles at every generation. In the second phase, cascading citation weight per indirect citation was compared to the average semantic similarity score.
Cascading citation weights and the proposed indirect citation weights per base article
The comparison between the cascading citation weights and the proposed indirect citation weights for each of the ten base articles at the second generation is visualized in Figure 8. A total of 20% of the base articles received zero indirect semantic similarity-based citation weights, while all the base articles received equal cascading residual citations. Article 8 and article 9 received the highest residual semantic similarity-based citation weight of 5. Only these two articles received the same value of cascading citation weight and semantic similarity-based citation weight. For three base articles (article 6, article 8 and article 9), at least 80% of the residual citations from their second-generation articles were allocated non-zero semantic similarity-based citation weights. Nevertheless, the weights under the proposed method were lower than those of the cascading citation system, except on two occasions.
The comparison between the proposed citation weights and the cascading citation system at the third generation is visualized in Figure 9. At the third generation, the number of indirect citations to the base articles doubled, though the cascading citation weights remained the same. The number of base articles that got zero residual citations also increased in the third generation. Unexpectedly, the number of non-zero weights fell from the second generation even though the number of indirect citations increased: 50% of the base articles received zero weights when the lowest citation contexts' semantic similarity scores were considered for weight allocation. On the other hand, on two occasions, base articles got more residual citations from the proposed method than from the cascading citation system when the highest citation contexts' similarity scores were considered for weight allocation.
The comparison between the proposed citation weights and the cascading citation system at the fourth generation is visualized in Figure 10. The number of indirect citations to the base articles quadrupled at the fourth generation, though the cascading citation weights remained the same. The number of base articles that got zero residual citations also increased in the fourth generation. The average value of residual citations per base article continued to increase, though the semantic similarity score average reduced.
The comparison between the proposed citation weights and the cascading citation system at the fifth generation is visualized in Figure 11. The number of indirect citations to the base articles increased about sevenfold at the fifth generation (710 compared with 100 at the second generation), though the cascading citation weights remained the same. The number of base articles that got zero residual citations also increased in the fifth generation.
Comparison between cascading citation weight and the highest semantic similarity score per second-generation article
Hypothesis 70, Hypothesis 80, Hypothesis 90, and Hypothesis 100 were stated to guide the study. The hypotheses were stated to determine whether the differences between ½, ¼, 1/8, and 1/16 (the cascading citation weights) and the semantic similarity scores per second-, third-, fourth- and fifth-generation article, respectively, are significant. For instance, Hypothesis 70 was stated to investigate whether there was a significant difference between ½ (the second-generation cascading citation weight) and the semantic similarity scores at the second generation.
Since the distributions of the semantic similarity weights are approximately normal (see Figure 3, Figure 4, Figure 5, Figure 6), a one-sample t-test was considered appropriate to investigate whether there are statistically significant differences between the semantic similarity scores and the cascading citation weights. The result of the one-sample t-test that was performed on the appropriate datasets to investigate the stated hypotheses is displayed in Table 10. The result showed a significant difference between the cascading citation weight and the average residual citation score at every generation: t(99)=-2.47, p=.02 at the second generation; t(199)=20, p<.001 at the third generation; t(399)=46.13, p<.001 at the fourth generation; and t(709)=75.41, p<.001 at the fifth generation.
Hypothesis 70: There is no significant difference between the cascading citation weight and average residual citation score per second-generation article.
Hypothesis 80: There is no significant difference between the cascading citation weight and average residual citation score per third-generation article.
Hypothesis 90: There is no significant difference between the cascading citation weight and average residual citation score per fourth-generation article.
Hypothesis 100: There is no significant difference between the cascading citation weight and average residual citation score per fifth-generation article.
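The comparison behind these four hypotheses can be sketched as a one-sample t-test of the similarity scores against the cascading weight of the same generation. The code below is an illustration under stated assumptions: the function names and sample scores are hypothetical, and the cascading weight is taken to be 1/2^(g-1) for generation g, consistent with ½, ¼, 1/8 and 1/16 for generations two to five.

```python
from math import sqrt
from statistics import mean, stdev


def cascading_weight(generation):
    """Cascading weight for generations 2..5: 1/2, 1/4, 1/8, 1/16."""
    return 1 / 2 ** (generation - 1)


def one_sample_t(scores, mu):
    """Return (t statistic, degrees of freedom) for H0: mean(scores) == mu."""
    n = len(scores)
    standard_error = stdev(scores) / sqrt(n)  # uses the sample standard deviation
    return (mean(scores) - mu) / standard_error, n - 1


# Hypothetical third-generation similarity scores tested against 1/4.
scores = [0.41, 0.45, 0.38, 0.47, 0.44]
t, df = one_sample_t(scores, cascading_weight(3))
print(round(t, 2), df)
```

The degrees of freedom are n-1, which agrees with the reported t(99), t(199), t(399) and t(709) for samples of 100, 200, 400 and 710 articles.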
Table 10: One-Sample T-Test result for the Average Residual Citation Score
| Sample 1-Sample 2 | Test Statistic | Std. Error | Std. Test Statistic | Sig. | Adj. Sig. (a) |
|---|---|---|---|---|---|
| Fifth generation residual citation - fourth generation residual citation | 101.69 | 25.46 | 4.00 | .000 | .000 |
| Fifth generation residual citation - third generation residual citation | 188.24 | 32.60 | 5.78 | .000 | .000 |
| Fifth generation residual citation - second generation residual citation | 287.77 | 43.49 | 6.62 | .000 | .000 |
| Fourth generation residual citation - third generation residual citation | 86.55 | 35.26 | 2.46 | .014 | .085 |
| Fourth generation residual citation - second generation residual citation | 186.08 | 45.52 | 4.09 | .000 | .000 |
| Third generation residual citation - second generation residual citation | 99.53 | 49.87 | 2.00 | .046 | .276 |

Each row tests the null hypothesis that the Sample 1 and Sample 2 distributions are the same.
Asymptotic significances (2-sided tests) are displayed. The significance level is .050.
a. Significance values have been adjusted by the Bonferroni correction for multiple tests.