Experiments on two domains, rainstorm disasters and typhoon disasters, are conducted to evaluate the effectiveness of the proposed focused crawler strategy FCMOPSO. We compare FCMOPSO with five other strategies: the optimal priority search algorithm (OPS) [13], the simulated annealing algorithm considering host information (SA-host) [14], the improved tabu search algorithm incorporating ontology (On-ITS) [15], the focused crawler combining a web space evolutionary algorithm and domain ontology (FCWSEO) [18], and the focused crawler based on ontology learning and multi-objective ant colony optimization (OLMOACO) [19]. FCMOPSO was implemented in Java and executed on a PC with an Intel Core i7-7700 CPU (3.60 GHz) and 8.00 GB of RAM.
7.2 Experimental results and discussion
Figures 3–6 display the Accuracy, RW, Ravg, and Rsd, respectively, of OPS, SA-host, On-ITS, FCWSEO, OLMOACO, and FCMOPSO in the rainstorm disaster and typhoon disaster domains. No results are available for FCWSEO in the typhoon disaster domain, so it is excluded from the comparison there. The results are recorded when DW reaches 15,000, by which point the performance evaluation indices of the crawlers have followed steady trends for some time.
Figure 3 (a) and (b) compare the Accuracy of the different strategies in the rainstorm disaster domain and the typhoon disaster domain, respectively. When DW reaches 15,000, the Accuracy of every strategy except OPS tends to be stable, and FCMOPSO attains the highest Accuracy in both domains. In the rainstorm disaster domain, the Accuracy of FCMOPSO exceeds that of the other strategies when DW reaches about 6,000. The final Accuracy values of OPS, SA-host, On-ITS, FCWSEO, OLMOACO, and FCMOPSO are about 60.4%, 70.6%, 75.2%, 81.1%, 74.2%, and 82.3%, respectively. In the typhoon disaster domain, the Accuracy of FCMOPSO exceeds that of the other strategies when DW reaches 9,000. The final Accuracy values of OPS, SA-host, On-ITS, OLMOACO, and FCMOPSO are about 50.6%, 74.6%, 76.8%, 74.7%, and 84.0%, respectively.
From Fig. 3, it is clear that OPS achieves higher Accuracy than the other strategies in the early crawling stage but plummets in the later stage because of its greedy strategy. Initially, OPS crawls webpages from the seed hyperlinks with the highest priority, which does not help to expand the search range. Once OPS selects a hyperlink with no prospects, the webpage it points to may contain few valuable hyperlinks, and the Accuracy of OPS declines rapidly. Like OPS, SA-host is a greedy strategy, but it relaxes the optimal search by accepting hyperlinks of relatively low priority with a certain probability. However, it surpasses only OPS, because its ability to expand the search range is limited, especially in the later crawling stage, and because it is heavily influenced by parameter settings that are difficult to determine. On-ITS achieves competitive Accuracy: it ranks third in the rainstorm disaster domain and is second only to FCMOPSO in the typhoon disaster domain. On-ITS filters out the visited hyperlinks by modifying the tabu object and acceptance principles: if no sub-hyperlink of a visited hyperlink has a higher priority than the hyperlink itself, the visited hyperlink is set as a tabu object. Although this ensures that the selected hyperlinks have considerable topic relevance, comparing the topical priorities of a visited hyperlink and its sub-hyperlinks involves extensive webpage content analysis and topic relevance calculation, which is time-consuming. Moreover, because of its acceptance principles, some potentially valuable sub-hyperlinks are not fully exploited, which limits the search range. Notably, in the rainstorm disaster domain, FCWSEO maintains an upward trend throughout the crawling process and surpasses the four strategies other than FCMOPSO when DW reaches about 8,000.
FCWSEO and FCMOPSO remain close in Accuracy in the later crawling stage, with FCMOPSO overtaking FCWSEO when DW reaches about 12,500 owing to its dynamic adaptive evaluation strategy. As the parameters of the CPEM are constantly updated, the evaluation of the hyperlinks changes dynamically and tends to be more reasonable, so that FCMOPSO can maintain good Accuracy in the later crawling stage.
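The contrast drawn above between the purely greedy selection of OPS and the probabilistic acceptance of SA-host can be illustrated with a minimal sketch. It is not the implementation of [13] or [14]: the function names, the Metropolis-style acceptance rule, and the `temperature` parameter are assumptions made for illustration, and the host information used by SA-host is not modeled.

```python
import heapq
import math
import random


def push_link(frontier, url, priority):
    """Store a hyperlink in a max-priority queue (heapq is a min-heap, so negate)."""
    heapq.heappush(frontier, (-priority, url))


def pop_greedy(frontier):
    """Greedy choice: always take the waiting hyperlink with the highest priority."""
    neg_priority, url = heapq.heappop(frontier)
    return url, -neg_priority


def accept_candidate(current_priority, candidate_priority, temperature, rng=random):
    """Assumed Metropolis-style rule: a better hyperlink is always accepted,
    a worse one with probability exp(-gap / temperature)."""
    if candidate_priority >= current_priority:
        return True
    gap = current_priority - candidate_priority
    return rng.random() < math.exp(-gap / temperature)
```

With a high temperature, lower-priority hyperlinks are accepted almost always, which widens the search range; as the temperature approaches zero, the rule degenerates into the greedy choice. This sensitivity is one way to read the remark that SA-host depends heavily on parameters that are difficult to set.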
Figure 4 (a) and (b) show the number of topic-relevant webpages RW downloaded by the different strategies in the rainstorm disaster domain and the typhoon disaster domain, respectively. OPS crawls fewer and fewer topic-relevant webpages in the later crawling stage, while the growth of RW for the other strategies remains almost constant. In both domains, when DW reaches 15,000, FCMOPSO has crawled more topic-relevant webpages than the other strategies, and its RW also grows faster in the later crawling stage. The final RW values of OPS, SA-host, On-ITS, FCWSEO, OLMOACO, and FCMOPSO are 9053, 10596, 11280, 12165, 11126, and 12352, respectively, in the rainstorm disaster domain, and those of OPS, SA-host, On-ITS, OLMOACO, and FCMOPSO are 7593, 11192, 11520, 11201, and 12596, respectively, in the typhoon disaster domain.
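The RW values above can be checked against the Accuracy values reported for Fig. 3. The definition of Accuracy is not restated in this section, so the following sketch assumes that Accuracy is the ratio of RW to DW expressed as a percentage; the function and variable names are illustrative. Under that assumption, the reported figures are mutually consistent after rounding to one decimal place.

```python
DW = 15_000  # number of downloaded webpages at which the results are recorded

# Final RW values as quoted in the text above.
RW_RAINSTORM = {"OPS": 9053, "SA-host": 10596, "On-ITS": 11280,
                "FCWSEO": 12165, "OLMOACO": 11126, "FCMOPSO": 12352}
RW_TYPHOON = {"OPS": 7593, "SA-host": 11192, "On-ITS": 11520,
              "OLMOACO": 11201, "FCMOPSO": 12596}


def accuracy_percent(rw, dw=DW):
    """Assumed definition: share of downloaded webpages that are topic-relevant."""
    return round(100.0 * rw / dw, 1)


if __name__ == "__main__":
    for domain, table in (("rainstorm", RW_RAINSTORM), ("typhoon", RW_TYPHOON)):
        for strategy, rw in table.items():
            print(f"{domain:9s} {strategy:8s} {accuracy_percent(rw):5.1f}%")
```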
Figure 5 (a) and (b) compare the average topic relevance Ravg of the webpages downloaded by the different strategies in the rainstorm disaster domain and the typhoon disaster domain, respectively. Overall, the Ravg of FCMOPSO is stable and remains competitive throughout the crawling process. In the rainstorm disaster domain, the final Ravg of FCMOPSO reaches about 0.770, while the Ravg values of OPS, SA-host, On-ITS, FCWSEO, and OLMOACO are about 0.622, 0.663, 0.692, 0.820, and 0.778, respectively. The quality of the webpages captured by FCWSEO is superior to that of the other strategies, followed by OLMOACO and FCMOPSO. In the typhoon disaster domain, the Ravg values of OPS, SA-host, On-ITS, OLMOACO, and FCMOPSO are about 0.606, 0.710, 0.728, 0.700, and 0.734, respectively, so the final Ravg of FCMOPSO ranks first. Overall, these results suggest that FCMOPSO is competitive in Ravg and can obtain web information with high topic relevance.
Figure 6 (a) and (b) compare the standard deviation Rsd of the topic relevance of the webpages downloaded by the different strategies. As illustrated in Fig. 6, OLMOACO has the lowest Rsd when DW reaches 15,000, followed by FCMOPSO. Specifically, the Rsd values of OPS, SA-host, On-ITS, FCWSEO, OLMOACO, and FCMOPSO are about 0.208, 0.195, 0.158, 0.157, 0.138, and 0.145, respectively, in the rainstorm disaster domain, and those of OPS, SA-host, On-ITS, OLMOACO, and FCMOPSO are about 0.209, 0.197, 0.151, 0.075, and 0.149, respectively, in the typhoon disaster domain. Throughout the crawling process, OPS exhibits the greatest volatility, while OLMOACO is the most stable. OPS selects the hyperlinks with the highest priority in every iteration, so valuable sub-hyperlinks are ignored as the network is explored more deeply. As irrelevant webpages are then crawled more or less at random, the topic relevance of the downloaded webpages varies widely, which causes the Rsd of OPS to soar and fluctuate markedly. The Rsd of OLMOACO maintains a downward trend and is the lowest in the later crawling stage in both domains. This is because the ants in OLMOACO accumulate more pheromones as crawling continues, so they find better crawling paths more easily and fetch more topic-relevant hyperlinks.
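To make the relation between the two indices concrete, the following sketch computes an average and a standard deviation of topic relevance for two hypothetical crawls. The relevance scores are invented for illustration and are not experimental data, and the use of the population standard deviation is an assumption, since the exact formula for Rsd is not restated in this section.

```python
from statistics import fmean, pstdev


def relevance_summary(scores):
    """Return (average relevance, population standard deviation) of the scores."""
    return fmean(scores), pstdev(scores)


# Hypothetical topic relevance scores of downloaded webpages (illustrative only).
steady_crawl = [0.75, 0.80, 0.85]
volatile_crawl = [0.60, 0.80, 1.00]

if __name__ == "__main__":
    for name, scores in (("steady", steady_crawl), ("volatile", volatile_crawl)):
        avg, sd = relevance_summary(scores)
        print(f"{name:8s} Ravg={avg:.3f} Rsd={sd:.3f}")
```

Both hypothetical crawls have the same average relevance, but the second has a much larger spread. This is why Ravg and Rsd are reported separately: a crawler can match another in average quality while fetching a far less uniform set of webpages.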
Table 2
Results of different focused crawler strategies when DW reaches 15,000
| Strategy | Rainstorm: Accuracy/% | Rainstorm: RW | Rainstorm: Ravg | Rainstorm: Rsd | Rainstorm: Time/h | Typhoon: Accuracy/% | Typhoon: RW | Typhoon: Ravg | Typhoon: Rsd | Typhoon: Time/h |
|---|---|---|---|---|---|---|---|---|---|---|
| OPS | 60.4 | 9053 | 0.622 | 0.208 | 8.93 | 50.6 | 7593 | 0.606 | 0.209 | 8.5 |
| SA-host | 70.6 | 10596 | 0.663 | 0.195 | 11.48 | 74.6 | 11192 | 0.710 | 0.197 | 11.34 |
| On-ITS | 75.2 | 11280 | 0.692 | 0.158 | 12.21 | 76.8 | 11520 | 0.728 | 0.151 | 11.96 |
| FCWSEO | 81.1 | 12162 | 0.822 | 0.157 | 11.64 | - | - | - | - | - |
| OLMOACO | 74.2 | 11126 | 0.778 | 0.138 | 16 | 74.7 | 11201 | 0.700 | 0.075 | 15 |
| FCMOPSO | 82.3 | 12352 | 0.770 | 0.145 | 6.26 | 84.0 | 12596 | 0.734 | 0.149 | 6.03 |
Table 3
Friedman ranks of different focused crawler strategies for four evaluation indices (Accuracy, Ravg, Rsd, Time) when DW reaches 15,000
| Evaluation indices | Rainstorm: OPS | Rainstorm: SA-host | Rainstorm: On-ITS | Rainstorm: FCWSEO | Rainstorm: OLMOACO | Rainstorm: FCMOPSO | Typhoon: OPS | Typhoon: SA-host | Typhoon: On-ITS | Typhoon: OLMOACO | Typhoon: FCMOPSO |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 6 | 5 | 3 | 2 | 4 | 1 | 5 | 4 | 2 | 3 | 1 |
| Ravg | 6 | 5 | 4 | 1 | 2 | 3 | 5 | 3 | 2 | 4 | 1 |
| Rsd | 6 | 5 | 4 | 3 | 1 | 2 | 5 | 4 | 3 | 1 | 2 |
| Time | 2 | 3 | 5 | 4 | 6 | 1 | 2 | 3 | 4 | 5 | 1 |
| Average | 5 | 4.5 | 4 | 2.5 | 3.25 | 1.75 | 4.25 | 3.5 | 2.75 | 3.25 | 1.25 |
Table 2 summarizes the Accuracy, RW, Ravg, Rsd, and running time of the different strategies when DW reaches 15,000. As shown in Table 2, FCMOPSO outperforms the other focused crawler strategies in Accuracy, RW, and Time in the rainstorm disaster domain and in Accuracy, RW, Ravg, and Time in the typhoon disaster domain. Although FCMOPSO is not the best in Ravg in the rainstorm disaster domain or in Rsd in either domain, its results remain competitive. According to the Time columns in Table 2, FCMOPSO has the lowest time consumption in both domains because of its simple hyperlink selection strategy during crawling. Other multi-objective algorithms perform complex and repetitive calculations over the ever-growing set of hyperlinks in the waiting queue when selecting hyperlinks, so their running time surges as crawling continues. In FCMOPSO, by contrast, MOPSO is used only for parameter optimization on a small volume of data.
It can be seen from Table 2 that FCMOPSO does not achieve the best result in every evaluation index. To further assess the effectiveness and superiority of FCMOPSO, Table 3 reports the Friedman ranks [46]; the Friedman test is a non-parametric statistical test that evaluates the performance of several algorithms by their rankings. The lower the average rank, the better the overall performance of the strategy. The four representative evaluation indices recorded when DW reaches 15,000, i.e., Accuracy, Ravg, Rsd, and Time, are converted to rankings for each strategy. As can be seen, FCMOPSO has the lowest average rank over the four indices in both domains, indicating that it performs best overall among the compared strategies. FCWSEO ranks second in the rainstorm disaster domain and On-ITS ranks second in the typhoon disaster domain, each followed by OLMOACO.
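The conversion from index values to average ranks can be reproduced with a short sketch. The values are transcribed from Table 2; the function and variable names are illustrative, and the assumption is that higher is better for Accuracy and Ravg while lower is better for Rsd and Time. Tie handling, which a full Friedman procedure resolves with averaged ranks, is omitted because no ties occur in these data.

```python
# Values transcribed from Table 2 (DW = 15,000): Accuracy/%, Ravg, Rsd, Time/h.
INDICES = ("Accuracy", "Ravg", "Rsd", "Time")
HIGHER_IS_BETTER = {"Accuracy": True, "Ravg": True, "Rsd": False, "Time": False}

RAINSTORM = {
    "OPS": (60.4, 0.622, 0.208, 8.93),
    "SA-host": (70.6, 0.663, 0.195, 11.48),
    "On-ITS": (75.2, 0.692, 0.158, 12.21),
    "FCWSEO": (81.1, 0.822, 0.157, 11.64),
    "OLMOACO": (74.2, 0.778, 0.138, 16.0),
    "FCMOPSO": (82.3, 0.770, 0.145, 6.26),
}
TYPHOON = {
    "OPS": (50.6, 0.606, 0.209, 8.5),
    "SA-host": (74.6, 0.710, 0.197, 11.34),
    "On-ITS": (76.8, 0.728, 0.151, 11.96),
    "OLMOACO": (74.7, 0.700, 0.075, 15.0),
    "FCMOPSO": (84.0, 0.734, 0.149, 6.03),
}


def average_ranks(results):
    """Rank the strategies on each index (1 = best) and average the ranks."""
    totals = {strategy: 0 for strategy in results}
    for column, index in enumerate(INDICES):
        ordered = sorted(results, key=lambda s: results[s][column],
                         reverse=HIGHER_IS_BETTER[index])
        for rank, strategy in enumerate(ordered, start=1):
            totals[strategy] += rank
    return {strategy: total / len(INDICES) for strategy, total in totals.items()}


if __name__ == "__main__":
    print(average_ranks(RAINSTORM))
    print(average_ranks(TYPHOON))
```

Run on the Table 2 values, this sketch yields the averages listed in the last row of Table 3 for both domains.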
To sum up, the experimental results show that FCMOPSO achieves strong results in most performance evaluation indices and, in particular, prevails over the other five crawlers in crawling accuracy and time consumption. The overall performance of FCMOPSO is better than that of the compared strategies from the literature. More importantly, the experimental results demonstrate the effectiveness of our proposed dynamic adaptive hyperlink evaluation method, which points toward more efficient approaches to hyperlink evaluation in focused crawlers.