Applying multi-objective particle swarm optimization-based dynamic adaptive hyperlink evaluation to focused crawler for meteorological disasters

doi:10.21203/rs.3.rs-2209988/v1

In traditional semantic-based focused crawlers, the topical priority of unvisited hyperlinks is calculated by linearly combining pre-defined topical similarity evaluation metrics with their corresponding weighted factors. However, these weighted factors are determined manually from personal experience, which may bias the evaluation of unvisited hyperlinks and cause topic deviation during crawling. To address this problem, we propose a dynamic adaptive focused crawler, denoted by FCMOPSO, based on multi-objective particle swarm optimization (MOPSO). For topic representation, two domain ontologies of meteorological disasters are constructed. Additionally, we present a comprehensive priority evaluation method (CPEM) of hyperlink that considers both webpage content and hyperlink structure. In MOPSO, the weights of the CPEM metrics are updated in every crawling iteration. Furthermore, we utilize the non-dominant sorting with the nearest farthest candidate solution (NS_NFCS) to ensure the diversity of crawled hyperlinks and expand the search range. Experimental results on the rainstorm disaster and typhoon disaster domains show that, compared with focused crawler strategies in the literature, the proposed FCMOPSO retrieves more topic-relevant webpages with reasonable time consumption.

Focused crawler

Multi-objective particle swarm optimization

Hyperlink priority evaluation

Meteorological disasters

Ontology

In recent years, meteorological disasters have become increasingly extreme, and the damage they cause to lives and property has grown correspondingly. For example, in July 2021, the unprecedented rainstorm disaster in Zhengzhou, China, caused tremendous property loss and affected tens of millions of people [1]. The analytical study by Liu, Gao, Zhao, and Chen [2] revealed that, from 2000 to 2016, tropical cyclones causing typhoon disasters in China occurred with alarming frequency and induced devastating and extensive environmental damage and social threats. One hundred thirty-five tropical cyclone disasters occurred within these 17 years, direct economic losses increased each year, and over 20 provinces were affected. Worse still, Aristizábal et al. [3] pointed out that when precipitation events exceed critical slope stability thresholds, geological hazards such as clusters of landslides and debris flows are triggered, as seen in reports from Colombia of 104 fatalities in Salgar in May 2015 and 400 deaths in Mocoa in March 2017.

Given the increasing severity of meteorological disasters around the world in recent years, timely information on meteorological disasters plays a crucial role in disaster prevention and recovery. In the current digital information age, the internet has grown and expanded rapidly, becoming a warehouse of massive data resources from which the public can obtain information on weather disaster mitigation and emergency response. However, information about meteorological disasters is scattered throughout the vast internet, so it is important to crawl online information about weather hazards quickly and accurately. Traditional manual information filtering methods cannot capture and update meteorological disaster information efficiently. Generic crawlers such as Scrapy, Pyspider, and WebCollector, in turn, suffer from low accuracy when crawling topical information. To address this problem, scholars have started to study focused crawlers [4]-[6], which aim to acquire domain-specific knowledge by retrieving webpages related to a predefined topic on the Internet. The fundamental issues of the focused crawler involve the method of topic representation, the topical priority evaluation of unvisited hyperlinks, and the design of the crawling strategy.

Topic representation is a primary task that describes domain-specific knowledge and also serves as a benchmark model for identifying whether crawled webpages are topic-relevant. The main approaches include keyword-based methods and feature word-based methods. A list of keywords inevitably lacks a complete representation of domain knowledge and neglects polysemy and the relations between keywords. As a result, feature word-based methods, which build on a domain corpus and portray the relations between feature words, have become the mainstream for topic representation; the most popular are the context graph (CG) [7]-[9] and the domain ontology [10][11]. The establishment of a CG depends heavily on the user’s query history, leading to topic deviation due to insufficient and biased user knowledge. Accordingly, the ontology, which can describe concepts and relations, is leveraged to construct a benchmark model for topic representation, so that the focused crawler operates in a semantic way.

Methods for evaluating the topical priority of unvisited hyperlinks can be classified into webpage content analysis-based methods and hyperlink structure-based methods. In this paper, aiming to improve the evaluation of unvisited hyperlinks, we explore a comprehensive model of metrics concerning both hyperlink structure and various webpage text documents. In the calculation of topical priority, most related research simply combines different evaluation metrics with their corresponding weighted factors, and the choice of these factors significantly affects how well the crawling priority of hyperlinks is evaluated and which direction the crawler takes. However, the weighted factors are static and manually pre-determined, and they cannot be updated during the crawling process. Therefore, to optimize the priority weights dynamically, we devise a dynamic adaptive strategy that adjusts the weights of the metrics in the hyperlink priority evaluation.

As for the design of crawling strategies, the most common are the breadth-first search algorithm (BFS) [12] and the optimal priority search algorithm (OPS) [13]. The BFS searches Web resources in first-in-first-out order and ignores the topical priority of unvisited hyperlinks. The OPS is a greedy algorithm, so the crawler may easily commit to a hyperlink with no prospects. For global search, scholars have introduced intelligent optimization methods into the focused crawler, such as the simulated annealing algorithm considering host information (SA-host) [14], the improved tabu search algorithm incorporating ontology (On-ITS) [15], and the ant colony optimization (ACO) algorithm [16][17]. In addition, some multi-objective strategies, such as the focused crawler combining Web space evolution and domain ontology (FCWSEO) [18] and the focused crawler based on ontology learning and multi-objective ant colony optimization (OLMOACO) [19], have been used to fetch topic-relevant webpages, but no intelligent approaches have been applied to dynamic parameter optimization in the hyperlink evaluation. Therefore, we employ an intelligent optimization algorithm to optimize the evaluation of hyperlinks, while devising a strategy for selecting unvisited hyperlinks in a multi-objective way.

In this paper, we propose a dynamic adaptive focused crawler based on multi-objective particle swarm optimization (MOPSO) for meteorological disasters. Experiments on domains of the rainstorm and typhoon disasters are tailored to assess the effectiveness of our proposed focused crawler in comparison with other strategies in the literature. The main contributions of this paper are demonstrated as follows:

(1) A comprehensive priority evaluation method (CPEM) of hyperlink, concerning both webpage content and hyperlink structure, is designed. The evaluation metrics are the topic relevance of the webpage that the hyperlink points to, the average topic relevance of all webpages containing the hyperlink, the topic relevance of the anchor text of the hyperlink, and the improved PR value of the webpage to which the hyperlink points.

(2) A dynamic adaptive hyperlink evaluation method based on the MOPSO is proposed. In the MOPSO, the weighted factors of evaluation metrics of the CPEM can be optimized by learning the features of topic-relevant webpages in every crawling iteration.

(3) For hyperlink selection, the non-dominant sorting with the nearest farthest candidate solution strategy (NS_NFCS) is adopted to choose Pareto-optimal hyperlinks to guide the crawling direction, which ensures the diversity of hyperlinks and broadens the search range.

The remainder of this paper is organized into seven sections. Section 2 reviews related works on the focused crawler. Section 3 presents the construction of two meteorological disaster domain ontologies for topic representation. Section 4 describes the semantic similarity calculation method and the construction of the topical semantic weighted vector. Section 5 constructs the webpage text document feature vector and proposes the comprehensive priority evaluation method (CPEM) of hyperlink. Section 6 designs the focused crawler strategy FCMOPSO and elucidates the hyperlink selection based on the NS_NFCS. Section 7 presents and discusses the experimental results on two meteorological disasters to demonstrate the effectiveness of our strategy. Lastly, Section 8 draws the conclusion and outlines future work.

In this section, the development of focused crawlers is reviewed, which can be divided into three categories: heuristic algorithm-based, conceptual semantic-based, and intelligent optimization algorithm-based focused crawlers.

2.1 Heuristic algorithm-based focused crawlers

Traditional heuristic algorithm-based strategies can be categorized into hyperlink topology analysis methods and web content analysis methods. The hyperlink topology analysis methods evaluate hyperlink authority by considering the relations between hyperlinks. To capture webpages of greater importance, Wang and Ji [20] introduced user interest and topic into the PageRank (PR) algorithm, in which the calculation of the PR value was modified by the browsing time spent on webpages and the number of clicks on hyperlinks. Another representative topology-based method, the hyperlink-induced topic search (HITS) [21], was designed to calculate the hub score and authority score for each webpage and then output the one with the highest score. However, these methods were prone to topic deviation in crawling because they ignored the topic relevance of webpage content.

On the other hand, the web content analysis methods mainly leverage various text documents, such as the anchor text of webpages and the webpage text content, to determine the topical priority of hyperlinks at a fine-grained level. To enhance the scoring mechanism in a semantic way, a word-embedding clustering weighted method [22] was adopted in the shark search algorithm [23]. In terms of calculating the topic similarities of various texts, Liu and Du [24] classified the methods into two types: the vector space model (VSM) and the semantic similarity retrieval model (SSRM). Taking advantage of both the VSM and the SSRM, a semantic similarity vector space model (SSVSM) was designed [25] to merge the cosine similarity and the semantic similarity to describe the topical priorities of unvisited hyperlinks. To sum up, these methods considered the topic relevance of webpage text content but ignored the network topology that depicts the importance of structural relations formed by hyperlinks, which restricted the search range.

To tackle the problem of topic deviation and expand the search range, naturally, scholars have paid attention to the establishment of ranking methods based on the combination of content analysis and hyperlink topology. Prakash and Kumar [26] added the PR value to the shark search. Seyfi, Patel, and Júnior [27] leveraged HTML elements to predict the topical focus of unvisited hyperlinks of webpages and assigned topical priority to them based on the T-Graph hierarchical structure. Moreover, Zhao, Guan, Cao, and Liu [28] developed a hyperlink-based and content-based assessment method, which is called online topical quality estimation (OTQE), to intelligently prioritize the unvisited URLs.

In conclusion, the hyperlink structure-based and webpage text-based methods all play important roles in evaluating the topic-relevance of webpages or hyperlinks, and the focused crawlers should absorb the advantages of these two methods. Therefore, in this paper, we propose a comprehensive priority evaluation method (CPEM) of hyperlink concerning both webpage content and hyperlink structure.

2.2 Conceptual semantic-based focused crawlers

In traditional crawling strategies, topic representation and the topic relevance calculation of various text documents mainly relied on topical keywords [29]. This approach suffers from polysemy and ignores the relations between domain concepts, so the crawler fetches noisy webpages and misses topic-relevant ones. It is therefore important to establish a benchmark topic model that describes topical knowledge as accurately and completely as possible, and conceptual semantic-based crawlers were developed to cope with this problem. Currently, many scholars focus on the context graph (CG) and the domain ontology, which are two useful ways of representing topics at the semantic level.

Among the CG-based methods, Guan and Luo [7] used a concept context graph (CCG) to describe topics, and Du, Li, Hu, Li, and Chen [8] proposed a path trust knowledge graph (PTKG) that relied on the user’s historical searching information. Similarly, the knowledge graph (KG) [9] was presented to portray knowledge. Nevertheless, the CG-based methods depend heavily on personal knowledge, so they inevitably risk topic deviation and deficient topic representation. The other way of representing topics, the domain ontology, can clarify conceptual semantic hierarchies and relationships between concepts. For example, Yang [30] developed an OntoCrawler based on ontology-supported website models, which considered both user requests and domain semantics. Moreover, Wang et al. [10] constructed a domain ontology and built a URL pattern library for predicting the topic relevance of URLs. Currently, the most popular semi-automated method of constructing a domain ontology is formal concept analysis (FCA), whose concept lattice is a semantic network representing the relationships between concepts. Zhu, Yang, Wu, and Feng [11] have detailed the FCA-based construction of a domain ontology of meteorological disasters.

In conclusion, ontology has been widely used in the field of information retrieval (IR) and plays an essential role in topic representation. In this paper, we construct two domain ontologies of meteorological disasters, namely, the rainstorm disaster domain and the typhoon disaster domain, based on FCA for topic representation in our focused crawler.

2.3 Intelligent optimization algorithm-based focused crawlers

To obtain higher crawling accuracy, many scholars have recently developed intelligent optimization algorithm-based focused crawler strategies to improve the searching capacity of crawlers. Breadth-first search (BFS) [12][31] and optimal priority search (OPS) [13] were first utilized for hyperlink selection to guide the direction of crawling. However, these methods can lead to the choice of hyperlinks with no prospects, i.e., some promising hyperlinks whose pages contain topic-relevant sub-hyperlinks are ignored. To solve the problem, scholars have applied global intelligent optimization algorithms to focused crawlers. Jing, Wang, and Dong [32] leveraged the genetic algorithm (GA) to establish an adaptive focused crawler strategy based on a dynamic fitness function. Yan and Pan [33] adapted genetic operators based on users’ browsing behaviours to alter the fitness function, which contained both topic relevance and hyperlink importance, depicted by the vector space model (VSM) and an improved PR value calculation. In addition, an improved SA-based crawling strategy [14] adopted a comprehensive hyperlink priority evaluation method and achieved better experimental results. The web space evolutionary algorithm [18] was also utilized for hyperlink searching and achieved impressive crawling stability. The improved tabu search algorithm incorporating ontology (On-ITS) [15] re-defined the tabu object, the neighbourhood set, and the acceptance principles of the tabu search (TS) algorithm, which had the advantage of selecting better hyperlinks. Aiming at global search, Chen, Zhang, and Zhang [16] introduced ant colony optimization (ACO) into the focused crawling strategy and updated the computation of the pheromone with a certain adaptability. Zheng [17] combined the advantages of the GA and the ACO and presented a genetic and ant algorithm-based focused crawler called GAAA, which leveraged the fast, random, and global convergence features of the GA while improving performance through the parallelism and positive feedback of the ACO. Additionally, the focused crawler based on ontology learning and multi-objective ant colony optimization (OLMOACO) [19] achieved strong performance through machine learning-based topic representation and MOACO-based hyperlink evaluation. Liu and Du [24] applied the cell-like membrane computing optimization (CMCO) algorithm to optimize the weights of various text documents for overall hyperlink priority, which shed light on the difficulty of determining the weighting coefficients; the optimized weights performed better than the previous ones on a database of various topics. Dewanjee [34] used the cuckoo search (CS) algorithm, inspired by the behaviours of cuckoo birds, in the focused crawler and presented a heuristic approach that featured strong intensification and diversification.

To sum up, these intelligent optimization algorithms are mainly used for hyperlink analysis and selection, and few of them focus on the employment of dynamic adaptive optimization. As mentioned above, to address the problem that weights of metrics of the CPEM are difficult to determine manually, we achieve a dynamic adaptive focused crawler by optimizing the weights based on the multi-objective particle swarm optimization (MOPSO) algorithm.

Traditional focused crawlers use topical keywords for topic representation, but such methods ignore the semantic information between concepts, resulting in incomplete and weak topic representation [29]. To establish a semantic-based topic representation method, ontology is introduced to overcome the weak topic representation of topical keywords. The ontology serves as the topic benchmark model for judging whether webpages are on topic. This section introduces the construction of two ontologies of meteorological disasters, namely, the rainstorm disaster domain and the typhoon disaster domain, for topic representation in our focused crawler.

3.1 Ontology-based topic representation

Gruber [35] first proposed ontology to describe the knowledge and defined it as a formal, explicit specification of a shared conceptualization. Ontology enables domain-specific representation at the semantic and knowledge level, addressing the shortcomings of traditional keyword-based methods and enabling more comprehensive and accurate topic representation. The construction steps of ontologies are as follows:

(1) Determine the topical keywords of domains of rainstorm disaster and typhoon disaster. Retrieve the topical academic papers in the China national knowledge infrastructure (CNKI) database using the topical keywords, and extract the titles, abstracts, and keywords that best summarize the content of the papers. The extracted words of each paper are collected in a document.

(2) IK-Analyzer², an open-source word segmentation tool, is utilized to obtain a domain feature word candidate set.

(3) The formal concept analysis (FCA) [35][37] is used to construct a “document-feature word” matrix by domain knowledge, which is then fed into the development tool ConExp³ to generate the concept lattice [39].

(4) The Web Ontology Language (OWL)⁴ is used to describe the semantic and hierarchical relations between the concept lattices to form a hierarchical ontology structure.

(5) Use Protégé⁵ to visualize the ontology.
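The "document-feature word" matrix of step (3) is a binary cross table (a formal context in FCA terms). The following is a minimal sketch of how such a matrix could be assembled once the documents have been segmented. The documents, the candidate feature words, and the helper name `build_formal_context` are hypothetical and serve only as an illustration; they are not taken from the ontologies built in this paper.

```python
from typing import Dict, List, Set


def build_formal_context(
    docs: Dict[str, Set[str]], feature_words: List[str]
) -> Dict[str, List[int]]:
    """Return one binary row per document: 1 if the feature word occurs in it."""
    return {
        doc_id: [1 if word in tokens else 0 for word in feature_words]
        for doc_id, tokens in docs.items()
    }


# Hypothetical, already-segmented documents (title + abstract + keywords).
docs = {
    "paper_1": {"rainstorm", "flood", "waterlogging"},
    "paper_2": {"rainstorm", "landslide"},
    "paper_3": {"flood", "debris flow", "landslide"},
}
# Hypothetical candidate feature words (columns of the matrix).
feature_words = ["rainstorm", "flood", "landslide", "debris flow"]

context = build_formal_context(docs, feature_words)
for doc_id, row in context.items():
    print(doc_id, row)
```

Each row marks which feature words a document contains; a concept lattice is then derived from the groups of documents that share the same sets of feature words.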

In this paper, we have built two domain ontologies, one for the rainstorm disaster and one for the typhoon disaster, for topic representation, each containing multiple topic-relevant feature words and several semantic relations in its domain. Specifically, the rainstorm disaster domain contains 68 topical terms, while the typhoon disaster domain contains 73 topical terms. The semantic relations between two meteorological disaster-related terms encompass synonym, induced-by and is-a. The relation synonym denotes that a term is synonymous with the other one, the relation induced-by denotes that a term is triggered by the other one, and the relation is-a denotes that a term is inherited by the other one. Different semantic relations are assigned different semantic relation indicators in the semantic similarity calculation of the feature words (see subsection 4.1). Figure 1 (a)-(i) exemplifies the hierarchical structure of the rainstorm disaster domain ontology. The root structure is illustrated in Fig. 1 (a), and the extended parts, framed by rectangles in the figures, are displayed in Fig. 1 (b)-(i).

² https://github.com/blueshen/ik-analyzer

³ https://sourceforge.net/projects/conexp/

⁴ https://www.w3.org/TR/owl-features/

⁵ https://protege.stanford.edu/

In this section, we introduce the semantic similarity calculation method and construct the topical semantic weighted vector based on the feature words in our constructed ontology.

4.1 Semantic similarity indicators

Referring to [40], we consider five indicators to compute the semantic similarity between two concepts based on the structural features of the constructed domain ontologies, which are semantic distance indicator (I_Dis), concept density indicator (I_Den), concept depth indicator (I_Dep), concept coincidence degree indicator (I_Coi), and concept semantic relation indicator (I_Rel).

Definition 1

The semantic distance Dis(C₁, C₂) between two concepts C₁ and C₂ is quantified by the shortest path length in the ontology tree, i.e., the minimum number of edges between C₁ and C₂. The semantic distance indicator (I_Dis) of the semantic similarity is given by Eq. (1).

$${I}_{Dis}=\frac{\epsilon }{\epsilon +{Dis}^{2}({C}_{1},{C}_{2})}$$

1

Here, ε denotes an adjusting factor, which is a real number greater than 0.

Definition 2

The concept density between two concepts C₁ and C₂ reflects the density of the region in which they are located in the ontology tree, and it is quantified as the total number of direct child-concepts of their nearest common ancestor concept. Let Den(C) be the total number of direct child-concepts of concept C, and let C_a be the nearest common ancestor concept of C₁ and C₂. The concept density indicator (I_Den) of the semantic similarity is then given by Eq. (2).

$${I}_{Den}=\frac{Den\left({C}_{a}\right)}{Den\left(M\right)}$$

2

Here, Den(M) denotes the maximum number of child-concepts among all concepts in the whole ontology tree.

Definition 3

The concept depth Dep(C) portrays the hierarchical depth of the ontology tree where the concept is located, and it is quantified as the number of edges of the shortest path between the concept C and the root concept in the ontology tree. For concepts C₁ and C₂, the concept depth indicator (I_Dep) can be described by Eq. (3).

$${I}_{Dep}=\frac{1}{2}(\frac{Dep\left({C}_{1}\right)+Dep\left({C}_{2}\right)}{\left|Dep\left({C}_{1}\right)-Dep\left({C}_{2}\right)\right|+2*Dep\left(M\right)}+\frac{Dep\left({C}_{a}\right)}{Dep\left(M\right)})$$

3

Here, Dep(M) denotes the maximum depth among all concepts in the whole ontology tree, and Dep(C_a) denotes the depth of the nearest common ancestor concept C_a of concepts C₁ and C₂.

Definition 4

The concept coincidence degree refers to the number of common ancestor concepts (the same hypernyms) of concepts C₁ and C₂ in the ontology tree. Let Anc(C) denote the set of ancestor concepts of concept C; the concept coincidence degree indicator (I_Coi) of the semantic similarity is then given by Eq. (4).

$${I}_{Coi}=\frac{\left|Anc\left({C}_{1}\right)\cap Anc\left({C}_{2}\right)\right|}{\text{m}\text{a}\text{x}(Dep\left({C}_{1}\right),Dep\left({C}_{2}\right))}$$

4

Here, |Anc(C₁)∩Anc(C₂)| represents the number of common ancestor concepts of concepts C₁ and C₂, max(Dep(C₁), Dep(C₂)) represents the greater hierarchical depth between concepts C₁ and C₂.
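The four structural indicators of Definitions 1-4 can be sketched on a small tree as follows. This is an illustrative sketch under stated assumptions: the toy ontology, the value ε = 1, the convention that a concept counts as its own nearest common ancestor when it lies above the other concept, and the zero-depth guard in I_Coi are all assumptions added here, not values or rules given in the paper.

```python
from typing import Dict, List

# Hypothetical toy ontology tree, stored as child -> parent.
PARENT: Dict[str, str] = {
    "rainstorm": "disaster",
    "typhoon": "disaster",
    "flood": "rainstorm",
    "waterlogging": "rainstorm",
    "landslide": "rainstorm",
    "storm surge": "typhoon",
}
CONCEPTS = set(PARENT) | set(PARENT.values())


def ancestors(c: str) -> List[str]:
    """Ancestors of c, ordered from its parent up to the root."""
    out = []
    while c in PARENT:
        c = PARENT[c]
        out.append(c)
    return out


def depth(c: str) -> int:
    """Dep(C): number of edges between c and the root."""
    return len(ancestors(c))


def den(c: str) -> int:
    """Den(C): number of direct child-concepts of c."""
    return sum(1 for p in PARENT.values() if p == c)


def nearest_common_ancestor(c1: str, c2: str) -> str:
    # Assumption: a concept may be its own "ancestor" for this purpose.
    line2 = {c2, *ancestors(c2)}
    return next(c for c in [c1, *ancestors(c1)] if c in line2)


def indicators(c1: str, c2: str, eps: float = 1.0) -> Dict[str, float]:
    ca = nearest_common_ancestor(c1, c2)
    d1, d2 = depth(c1), depth(c2)
    dis = d1 + d2 - 2 * depth(ca)  # edges on the tree path c1 -> ca -> c2
    den_max = max(den(c) for c in CONCEPTS)    # Den(M)
    dep_max = max(depth(c) for c in CONCEPTS)  # Dep(M)
    common = set(ancestors(c1)) & set(ancestors(c2))
    return {
        "I_Dis": eps / (eps + dis ** 2),                        # Eq. (1)
        "I_Den": den(ca) / den_max,                             # Eq. (2)
        "I_Dep": 0.5 * ((d1 + d2) / (abs(d1 - d2) + 2 * dep_max)
                        + depth(ca) / dep_max),                 # Eq. (3)
        # Eq. (4); the zero-depth guard is an added assumption.
        "I_Coi": len(common) / max(d1, d2) if max(d1, d2) else 0.0,
    }


ind = indicators("flood", "landslide")
print(ind)
```

For the sibling concepts "flood" and "landslide", the path passes through their parent "rainstorm", so Dis = 2 and, with ε = 1, I_Dis = 1/(1 + 4) = 0.2.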

Since the ontology reflects domain knowledge through diverse relations between concepts, we consider different types of concept relations may have different levels of impact on semantic similarity. Specifically, three types of concept relation, i.e., synonym, induced-by and is-a are considered, whose concept semantic relation indicators (I_Rel) are 1, 1/2, 1/3, respectively.

4.2 Topical semantic weighted vector construction

According to the above five semantic similarity indicators, the semantic similarity between two concepts C₁ and C₂ can be expressed by the following Eq. (5).

$$Sim\left({C}_{1},{C}_{2}\right)={\lambda }_{1}\text{*}{I}_{Dis}+{\lambda }_{2}\text{*}{I}_{Den}+{\lambda }_{3}\text{*}{I}_{Dep}+{\lambda }_{4}\text{*}{I}_{Coi}+{\lambda }_{5}\text{*}{I}_{Rel}$$

5

Here, λ₁ ~ λ₅ are the weights of the five indicators and satisfy λ₁ + λ₂ + λ₃ + λ₄ + λ₅ = 1. In this paper, the values of λ₁ ~ λ₅ are set to 0.80, 0.04, 0.06, 0.03, and 0.07, respectively, based on multiple experimental tests and the suggestions of domain experts.
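As a worked instance of Eq. (5), the sketch below combines five indicator values using the weights reported above. The indicator values and the choice of the induced-by relation are hypothetical inputs chosen for illustration; only the λ values and the three I_Rel values come from the text.

```python
from typing import Sequence

# Weights lambda_1..lambda_5 as reported in the text.
LAMBDAS = (0.80, 0.04, 0.06, 0.03, 0.07)

# Concept semantic relation indicators from subsection 4.1.
I_REL = {"synonym": 1.0, "induced-by": 1 / 2, "is-a": 1 / 3}


def semantic_similarity(
    i_dis: float,
    i_den: float,
    i_dep: float,
    i_coi: float,
    i_rel: float,
    lambdas: Sequence[float] = LAMBDAS,
) -> float:
    """Eq. (5): weighted sum of the five semantic similarity indicators."""
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("the indicator weights must sum to 1")
    values = (i_dis, i_den, i_dep, i_coi, i_rel)
    return sum(lam * v for lam, v in zip(lambdas, values))


# Hypothetical indicator values for one pair of concepts.
sim = semantic_similarity(0.2, 1.0, 0.75, 1.0, I_REL["induced-by"])
print(round(sim, 4))  # 0.16 + 0.04 + 0.045 + 0.03 + 0.035 = 0.31
```

Because λ₁ = 0.80, the semantic distance indicator dominates the result: the other four indicators together contribute at most 0.20.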

In our constructed ontology tree, assume that the topical keyword is C and the set of feature words is T = (t₁, t₂, …, t_i, …, t_n), where n denotes the number of feature words of the ontology. The semantic similarity between each feature word and the topical keyword C is then calculated according to Eq. (5). Finally, the topical semantic weighted vector ${W}_{T}=\left({w}_{{t}_{1}},{w}_{{t}_{2}},\dots ,{w}_{{t}_{n}}\right)$ is obtained by Eq. (6).

$${W}_{T}=\left({w}_{{t}_{1}},{w}_{{t}_{2}},\dots ,{w}_{{t}_{n}}\right)=\left(Sim\left(C,{t}_{1}\right), Sim\left(C,{t}_{2}\right), \dots , Sim\left(C,{t}_{\text{n}}\right)\right)$$

6

Here, ${w}_{{t}_{i}}$ is the weight of the i-th feature word in the set T, i.e., the semantic similarity Sim(C, t_i) between the topical keyword C and the feature word t_i.

In this section, we introduce a comprehensive priority evaluation method (CPEM) of hyperlink. Firstly, the webpage text document feature vector is constructed based on different HTML segments, each weighted according to its level. Next, the topic relevance calculation of the webpage text document is introduced. Then, the topic relevance calculation of anchor text and an improved PageRank (PR) algorithm are described. Finally, the CPEM is proposed based on the above evaluation metrics.

5.1 Webpage text document feature vector construction

Hypertext markup language (HTML) has been widely used as information representation on the World Wide Web (WWW) owing to its simplicity and versatility. The content of HTML webpages is presented in the form of various tags, which have different effects on the calculation of topic relevance. Therefore, we consider the different levels of importance indicated by the presence of feature words at different tag positions. In this paper, we divide the tags of HTML into five levels (see Table 1), each of which is assigned a different weighted coefficient.

Table 1

The five levels of HTML segmentation
Level	HTML tags	denotations	W_g
Level 1	<title>, <keyword>, <description>, <h1>	title, keyword, description, first-level heading	2
Level 2	<h2>, <h3>	secondary-level heading, third-level heading	1.5
Level 3	<h4>, <h5>, <h6>, <strong>	fourth-level heading, fifth-level heading, sixth-level heading, bold text	1.2
Level 4	<p>, <td>, <li>	main information	1.0
Level 5	other tags	secondary information	0.2

The webpage text document is then mapped into a feature vector $D=\left({d}_{1},{d}_{2},\dots ,{d}_{i},\dots ,{d}_{n}\right)$ and the corresponding feature weighted vector is ${W}_{D}=\left({w}_{{d}_{1}},{w}_{{d}_{2}},\dots ,{w}_{{d}_{i}},\dots ,{w}_{{d}_{n}}\right)$, where ${w}_{{d}_{i}}$ denotes the weight of the i-th feature word in the webpage text document, calculated by the formula shown in Eq. (7).

$${w}_{{d}_{i}}={\sum }_{g=1}^{G}w{f}_{i,g}*{W}_{g}={\sum }_{g=1}^{G}\frac{{WF}_{i,g}}{{\max}_{1\le {g}^{\prime }\le G}{WF}_{i,{g}^{\prime }}}*{W}_{g}$$

7

Here, wf_i,g represents the normalized word frequency of the i-th feature word in the g-th level of the webpage text document; WF_i,g represents the word frequency of the i-th feature word in the g-th level; maxWF_i,g represents the maximum frequency of the i-th feature word over all G levels; W_g represents the weighted coefficient of the g-th level (see Table 1).
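Eq. (7) can be illustrated with a short sketch that uses the level weights of Table 1. It assumes that the per-level word counts have already been extracted from the HTML; the function name `feature_weight`, the example counts, and the zero weight returned for an absent word are hypothetical.

```python
from typing import Dict

# Weighted coefficients W_g for the five HTML levels (Table 1).
LEVEL_WEIGHTS: Dict[int, float] = {1: 2.0, 2: 1.5, 3: 1.2, 4: 1.0, 5: 0.2}


def feature_weight(freq_by_level: Dict[int, int]) -> float:
    """Eq. (7): freq_by_level maps level g to the raw frequency WF_{i,g}."""
    max_wf = max(freq_by_level.values(), default=0)
    if max_wf == 0:
        # Assumption: a word that never occurs on the page gets weight 0.
        return 0.0
    return sum(
        (wf / max_wf) * LEVEL_WEIGHTS[level]
        for level, wf in freq_by_level.items()
    )


# Hypothetical counts: the word occurs once in <title> (level 1)
# and four times in <p> elements (level 4).
w = feature_weight({1: 1, 4: 4})
print(w)  # (1/4) * 2.0 + (4/4) * 1.0 = 1.5
```

The normalization divides by the word's highest per-level count, so the level in which the word is most frequent contributes its full coefficient W_g.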

5.2 Topic relevance of webpage text document

We utilize the vector space model (VSM) to compute the topic relevance of webpages. Specifically, for the text document of a webpage P, the topic relevance R(P) is calculated as the cosine similarity between the topical semantic weighted vector W_T and the webpage text document feature weighted vector W_D, as shown in Eq. (8).

$$R\left(P\right)=Sim\left({W}_{T},{W}_{D}\right)=\frac{{\sum }_{i=1}^{n}\left({w}_{{t}_{i}}*{w}_{{d}_{i}}\right)}{\sqrt{{\sum }_{i=1}^{n}{w}_{{t}_{i}}^{2}}*\sqrt{{\sum }_{i=1}^{n}{w}_{{d}_{i}}^{2}}}$$

8

Following [14] and our own experimental tests, the webpage topic relevance threshold δ₁ is set to 0.62; i.e., if R(P) > δ₁, the webpage P is considered topic-relevant.
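A minimal sketch of Eq. (8) and the threshold test follows. The two example vectors are hypothetical, and the zero relevance returned for an all-zero vector is an assumption added to avoid division by zero.

```python
import math
from typing import Sequence

DELTA_1 = 0.62  # topic relevance threshold of webpages, as set in the text


def cosine_relevance(w_t: Sequence[float], w_d: Sequence[float]) -> float:
    """Eq. (8): cosine similarity between W_T and W_D."""
    dot = sum(t * d for t, d in zip(w_t, w_d))
    norm = math.sqrt(sum(t * t for t in w_t)) * math.sqrt(sum(d * d for d in w_d))
    # Assumption: a page with no feature words at all has relevance 0.
    return dot / norm if norm else 0.0


def is_topic_relevant(relevance: float, threshold: float = DELTA_1) -> bool:
    return relevance > threshold


w_t = [1.0, 0.5, 0.0]        # hypothetical topical semantic weighted vector
page_off = [1.5, 0.0, 2.0]   # dominated by a word with zero topical weight
page_on = [2.0, 1.0, 0.0]    # proportional to w_t

r_off = cosine_relevance(w_t, page_off)
r_on = cosine_relevance(w_t, page_on)
print(round(r_off, 4), is_topic_relevant(r_off))
print(round(r_on, 4), is_topic_relevant(r_on))
```

Since cosine similarity depends only on the direction of the vectors, a page whose feature weights are proportional to W_T scores close to 1 regardless of its length.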

5.3 Topic relevance of anchor text

The anchor text of a hyperlink is essential for predicting and evaluating the content of the webpage it points to. We calculate the anchor text feature weights of hyperlinks based on the TF-IDF model. The weight ${w}_{{a}_{i}}$ of the i-th feature word in the anchor text is given by Eq. (9).

$${w}_{{a}_{i}}=T{F}_{i}*ID{F}_{i}=\frac{{f}_{i}}{{\sum }_{m=1}^{n}{f}_{m}}*{\log}_{a}\left(\frac{N}{{N}_{i}}+0.01\right),\quad i=1,2,\dots ,n$$

9

Here, a > 1 is the base of the logarithm; f_i represents the word frequency of the i-th feature word in the anchor text; N denotes the total number of crawled webpages; N_i denotes the number of crawled webpages containing the i-th feature word.
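The anchor text weighting of Eq. (9) can be sketched as follows. The paper only requires a > 1, so the base 10 used here is an assumption, as are the example counts and the zero weight assigned when a word is absent from the anchor text or from every crawled webpage.

```python
import math
from typing import List, Sequence


def anchor_weights(
    freqs: Sequence[int],
    doc_counts: Sequence[int],
    n_crawled: int,
    base: float = 10.0,  # assumption: the text only states a > 1
) -> List[float]:
    """Eq. (9): TF-IDF weight of each feature word in one anchor text.

    freqs[i]      -- f_i, frequency of feature word i in the anchor text
    doc_counts[i] -- N_i, number of crawled webpages containing word i
    n_crawled     -- N, total number of crawled webpages
    """
    total = sum(freqs)
    weights = []
    for f_i, n_i in zip(freqs, doc_counts):
        if total == 0 or f_i == 0 or n_i == 0:
            # Assumption: undefined or empty cases are given weight 0.
            weights.append(0.0)
            continue
        tf = f_i / total
        idf = math.log(n_crawled / n_i + 0.01, base)
        weights.append(tf * idf)
    return weights


# Hypothetical anchor text containing words 0 and 1 once each.
weights = anchor_weights(freqs=[1, 1, 0], doc_counts=[10, 50, 5], n_crawled=100)
print([round(x, 4) for x in weights])
```

With equal term frequencies, the word found in fewer crawled webpages (10 of 100 rather than 50 of 100) receives the larger weight.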

Similar to the calculation of the topic relevance of the webpage text document, for the anchor text A_h of hyperlink h, let the anchor text weighted vector be ${W}_{A}=\left({w}_{{a}_{1}},{w}_{{a}_{2}},\dots ,{w}_{{a}_{i}},\dots ,{w}_{{a}_{n}}\right)$; the topic relevance R(A_h) of the anchor text is then obtained by Eq. (10).

$$R\left({A}_{h}\right)=Sim\left({W}_{T},{W}_{A}\right)=\frac{{\sum }_{i=1}^{n}\left({w}_{{t}_{i}}\times {w}_{{a}_{i}}\right)}{\sqrt{{\sum }_{i=1}^{n}{w}_{{t}_{i}}^{2}}\times \sqrt{{\sum }_{i=1}^{n}{w}_{{a}_{i}}^{2}}}$$

10

5.4 Improved PageRank algorithm

The PageRank (PR) algorithm is a classical hyperlink evaluation method developed by Google [41]. It measures the importance of webpages through the structural relations formed by their hyperlinks. For a webpage P, the original calculation formula of the PR value is given by Eq. (11).

$$\text{PR}\left(P\right)=\left(1-d\right)+d*{\sum }_{i=1}^{m}\frac{\text{PR}\left({P}_{i}\right)}{C\left({P}_{i}\right)}$$

11

Here, d is the damping factor, which is 0.2 in this paper; P_i denotes the i-th webpage that links to P (the i-th in-hyperlink webpage of P); m is the number of in-hyperlinks of P among all crawled webpages; PR(P_i) denotes the PR value of P_i; C(P_i) denotes the total number of out-hyperlinks of P_i.

However, this single evaluative method of hyperlink structure may easily lead to topic deviation in crawling. To address the problem, we introduce the topic relevance of anchor text into the original PR value calculation and present an improved PR algorithm. The improved PR value calculation formula is shown in Eq. (12).

$$\text{PR}\left(P\right)=\left(1-d\right)+d*{\sum }_{i=1}^{m}\left[\frac{\text{PR}\left({P}_{i}\right)}{C\left({P}_{i}\right)}*\left(1+\omega *R\left({A}_{i}\right)\right)\right] \tag{12}$$

Here, ω is the adjustment factor, which is set to 0.6 in this paper; R(A_i) represents the topic relevance of the anchor text A_i of the i-th in-hyperlink of P. Notably, as the number of crawled webpages increases, the structure of in-hyperlinks and out-hyperlinks keeps changing, so the improved PR value is continually updated.
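
The improved PR value of Eq. (12) can be sketched in Python as follows. The paper does not state how the recursive definition is evaluated, so the fixed-point iteration, the initial value of 1.0, the iteration count, and the dictionary-based graph representation are all assumptions of this example; only d = 0.2 and ω = 0.6 come from the text.

```python
def improved_pagerank(in_links, out_degree, anchor_relevance,
                      d=0.2, omega=0.6, iterations=50):
    """Iterate Eq. (12) over the crawled link graph.

    in_links[p]             - pages P_i that link to page p
    out_degree[p]           - C(p), number of out-hyperlinks of page p
    anchor_relevance[(s,p)] - R(A_i) of the anchor text on the link s -> p
    """
    pr = {page: 1.0 for page in out_degree}  # assumption: start from 1.0
    for _ in range(iterations):
        updated = {}
        for page in pr:
            total = 0.0
            for src in in_links.get(page, []):
                # Topic-relevant anchor text boosts the contribution of src.
                boost = 1.0 + omega * anchor_relevance.get((src, page), 0.0)
                total += pr[src] / out_degree[src] * boost
            updated[page] = (1 - d) + d * total
        pr = updated
    return pr


# Hypothetical two-page graph: A <-> B, with a fully relevant anchor on B -> A.
in_links = {"A": ["B"], "B": ["A"]}
out_degree = {"A": 1, "B": 1}
pr = improved_pagerank(in_links, out_degree, {("B", "A"): 1.0})
```

With ω = 0 the sketch reduces to Eq. (11), so both pages in this toy graph would keep a PR value of 1; the relevant anchor text raises the value of A above that of B.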

5.5 Comprehensive priority evaluation method of hyperlink

Based on the above evaluation metrics, we design a comprehensive priority evaluation method (CPEM) of hyperlinks that considers both webpage content and hyperlink structure, as illustrated in Fig. 2. For an unvisited hyperlink h, the CPEM integrates four evaluation metrics to obtain the comprehensive topical priority of h: the topic relevance R(A_h) of the anchor text A_h of hyperlink h, the topic relevance of the webpage P_h to which hyperlink h points, the average topic relevance of all webpages containing hyperlink h, and the improved PR value PR(P_h) of the webpage P_h. The calculation formula is shown in Eq. (13).

$$Priority\left(h\right)={\sigma }_{1}*R\left({P}_{h}\right)+{\sigma }_{2}*\frac{1}{r}\sum _{i=1}^{r}R\left({P}_{i}\right)+{\sigma }_{3}*R\left({A}_{h}\right)+{\sigma }_{4}*\text{PR}\left({P}_{h}\right) \tag{13}$$

Here, σ₁ ~ σ₄ are the weighted factors of the four hyperlink priority evaluation metrics, respectively, and satisfy σ₁ + σ₂ + σ₃ + σ₄ = 1; r is the number of webpages that contain hyperlink h; P_i denotes the i-th webpage among all webpages containing hyperlink h. In this paper, the comprehensive priority threshold δ₂ of a hyperlink is 0.20, i.e., if Priority(h) > δ₂, the hyperlink h is deemed topic-relevant and is added to the hyperlink waiting queue; otherwise, the hyperlink h is discarded.
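
A minimal Python sketch of the comprehensive priority and the threshold test is given below. It assumes that the four metric values have already been computed elsewhere; the function names, the zero fallback for a hyperlink with no known parent pages, and the sample values are hypothetical.

```python
DELTA_2 = 0.20  # comprehensive priority threshold stated in the text


def cpem_priority(r_target, r_parents, r_anchor, pr_target, sigma):
    """Weighted sum of the four metrics of one hyperlink h.

    r_target  - R(P_h), topic relevance of the page h points to
    r_parents - R(P_i) for each of the r pages that contain h
    r_anchor  - R(A_h), topic relevance of the anchor text of h
    pr_target - improved PR value of P_h
    sigma     - (sigma_1, ..., sigma_4), must sum to 1
    """
    if len(sigma) != 4 or abs(sum(sigma) - 1.0) > 1e-9:
        raise ValueError("sigma must contain four weights that sum to 1")
    # Assumption: the average is 0 when no parent page is known yet.
    avg_parent = sum(r_parents) / len(r_parents) if r_parents else 0.0
    s1, s2, s3, s4 = sigma
    return s1 * r_target + s2 * avg_parent + s3 * r_anchor + s4 * pr_target


def is_topic_relevant(priority, threshold=DELTA_2):
    """Hyperlinks above the threshold enter the waiting queue."""
    return priority > threshold


# Hypothetical metric values with equal weights.
p = cpem_priority(0.8, [0.6, 0.4], 0.5, 1.0, (0.25, 0.25, 0.25, 0.25))
```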

In most existing semantic-based focused crawler strategies, the hyperlinks to crawl are prioritized by a comprehensive hyperlink evaluation method that linearly integrates topical similarity evaluation metrics and their corresponding weighted factors. The poor operability of this method is attributable to the difficulty of determining the weights beforehand. Furthermore, empirical and predetermined weighted factors ignore the dynamic and ever-changing nature of webpages in the network. To solve this problem, we introduce a dynamic adaptive hyperlink evaluation method based on the multi-objective particle swarm optimization algorithm (MOPSO), which can optimize the weighted factors σ₁ ~ σ₄ in Eq. (13) during the crawling process.

In this section, a novel focused crawler strategy FCMOPSO based on MOPSO is proposed. Firstly, we introduce the generation method of seed hyperlinks. Then, the multi-objective optimization model for hyperlink evaluation is introduced. Next, the method of non-dominated sorting [42][43] with the nearest and farthest candidate solution [44], denoted by NS_NFCS, is introduced. Moreover, the dynamic adaptive optimization of the weighted factors by MOPSO is described. Finally, the focused crawler based on MOPSO (FCMOPSO) is elucidated.

6.1 Generation of seed hyperlinks

The selection of seed hyperlinks has a significant impact on the performance of a focused crawler, since good seeds help capture as many topic-relevant webpages as possible in the initial stage. In this paper, when generating the initial seed hyperlinks, we enter the topical terms into common search engines and select k = 30 seed hyperlinks from the search results. The detailed steps are as follows:

(1) Input the topical keyword in common search engines such as Baidu, Bing, Google, etc., retrieve and filter the top-ranked webpages and add them to the candidate seed webpage database.

(2) Parse the filtered webpage texts, do the word segmentation, calculate the word frequency, and select the high-frequency words as new topical keywords.

(3) Calculate and sort the topic relevance of the webpages in the candidate seed webpage database by Eq. (8).

(4) With the recommendations of domain experts, select the hyperlinks whose corresponding webpages have high topic relevance as seed hyperlinks.

(5) If k = 30 seed hyperlinks have been determined, the generation of seed hyperlinks ends; otherwise, return to step (1) and expand the candidate seed webpage database with the newly added topical keywords until 30 seed hyperlinks are obtained.

6.2 Multi-objective optimization model for hyperlink evaluation

As shown in Fig. 2, the topic relevance of a hyperlink involves four factors, namely, the anchor text, the PR value, all webpages that contain the hyperlink, and the webpage to which the hyperlink points. Since evaluating the topic relevance of a hyperlink involves more than one objective and these objectives may conflict, a multi-objective optimization model can be established for evaluating unvisited hyperlinks. The objective functions are defined in Eqs. (14)-(17).

$$\text{max}\ {f}_{1}\left(h\right)=R\left({P}_{h}\right) \tag{14}$$

$$\text{max}\ {f}_{2}\left(h\right)=\frac{1}{r}\sum _{i=1}^{r}R\left({P}_{i}\right) \tag{15}$$

$$\text{max}\ {f}_{3}\left(h\right)=R\left({A}_{h}\right) \tag{16}$$

$$\text{max}\ {f}_{4}\left(h\right)=\text{PR}\left({P}_{h}\right) \tag{17}$$

Here, f₁(h) represents the topic relevance of the webpage P_h to which hyperlink h points; f₂(h) represents the average topic relevance of all webpages containing hyperlink h; f₃(h) represents the topic relevance of the anchor text A_h of hyperlink h; f₄(h) represents the improved PR value of the webpage P_h. In Eq. (15), r represents the number of webpages that contain hyperlink h.

6.3 Hyperlink selection by NS_NFCS

For multi-objective optimization problems, there is generally no single solution that is optimal for every objective simultaneously. The non-dominated sorting (NS) method [42][43] can obtain the Pareto optimal solution set. In the classical fast non-dominated sorting multi-objective genetic algorithm (NSGA-II) [43], the crowded degree comparison method can only select solutions within a dense space, so the distribution of solutions is limited to a narrow search range. To diversify the distribution of the selected unvisited hyperlinks in the crawling process, we replace the crowded degree comparison method with the nearest and farthest candidate solution (NFCS) method [44]. The resulting hyperlink selection strategy is non-dominated sorting with the nearest and farthest candidate solution (NS_NFCS).
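
To make the notion of a Pareto optimal set concrete, the sketch below extracts the non-dominated hyperlinks from a list of objective vectors (f₁, f₂, f₃, f₄), all of which are maximized. It is a plain pairwise comparison written for illustration, not the fast non-dominated sorting procedure of NSGA-II, and the sample vectors are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and
    strictly better on at least one (all objectives are maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(objectives):
    """Indices of the solutions that no other solution dominates."""
    front = []
    for i, a in enumerate(objectives):
        dominated = any(dominates(b, a)
                        for j, b in enumerate(objectives) if j != i)
        if not dominated:
            front.append(i)
    return front


# Hypothetical objective vectors of four unvisited hyperlinks.
objs = [
    (0.90, 0.10, 0.50, 0.50),
    (0.10, 0.90, 0.50, 0.50),
    (0.05, 0.05, 0.40, 0.40),  # dominated by the first two
    (0.90, 0.10, 0.50, 0.50),  # duplicate of the first, not dominated
]
front = pareto_front(objs)
```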

The distance between two solutions (hyperlinks) H_s and H_t, based on the objective functions in Eqs. (14)-(17), is given by Eq. (18).

$$DIS\left({H}_{s},{H}_{t}\right)=\sqrt{\sum _{i=1}^{l}{\left({f}_{i}\left({H}_{s}\right)-{f}_{i}\left({H}_{t}\right)\right)}^{2}} \tag{18}$$

Here, l represents the number of objective functions, which is four in this paper; f_i(H_s) and f_i(H_t) are the i-th objective function values of H_s and H_t, respectively.

Suppose that the candidate solution set S_C contains p Pareto solutions, from which q (≤ p) solutions need to be selected and added to the optimal solution set S_B. The NFCS method is implemented in the following steps:

(1) Let S_B = ∅. For each objective function f_i(H) (i = 1, 2, …, l), calculate the objective function value f_i(H_j) of each solution H_j (j = 1, 2, …, p) in the candidate solution set S_C. For each objective, select the solution with the greatest objective function value and add it to an optimal solution set S_T.

(2) If q ≤ l, randomly select q solutions from S_T, add them to S_B, and go to step (6).

(3) If q > l, add all the solutions in S_T to S_B and remove them from S_C.

(4) Let q = q - l.

(5) For each solution in S_C, calculate its distance to the nearest solution in S_B by Eq. (18). Select the solution X_f whose nearest distance is the largest, add it to S_B, and remove it from S_C. Let q = q - 1. If q = 0, go to step (6); otherwise, repeat step (5).

(6) Output S_B.
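
The steps above can be sketched in Python as follows. The sketch works on indices into a list of objective vectors and deviates from the listed steps in one labeled respect: when a single solution is the best on several objectives, S_T holds fewer than l members, so the sketch compares q with the actual size of S_T. The function names and sample data are hypothetical.

```python
import math
import random


def distance(a, b):
    """Euclidean distance in objective space, Eq. (18)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nfcs_select(candidates, q, rng=None):
    """Select q indices from the Pareto candidates by the NFCS steps."""
    rng = rng or random.Random(0)
    num_objectives = len(candidates[0])
    # Step 1: the best candidate for each objective forms S_T.
    extremes = []
    for i in range(num_objectives):
        best = max(range(len(candidates)), key=lambda j: candidates[j][i])
        if best not in extremes:
            extremes.append(best)
    # Step 2 (assumption: compare with |S_T| rather than l).
    if q <= len(extremes):
        return rng.sample(extremes, q)
    # Steps 3-4: move S_T into S_B.
    selected = list(extremes)
    remaining = [j for j in range(len(candidates)) if j not in selected]
    # Step 5: repeatedly add the candidate whose nearest selected
    # neighbour is farthest away.
    while len(selected) < q and remaining:
        farthest = max(
            remaining,
            key=lambda j: min(distance(candidates[j], candidates[s])
                              for s in selected),
        )
        selected.append(farthest)
        remaining.remove(farthest)
    return selected  # Step 6


# Hypothetical two-objective Pareto set.
pareto = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.9, 0.1), (0.1, 0.9)]
chosen = nfcs_select(pareto, q=3)
```

In this toy example the two extreme solutions are selected first, and the third pick is the midpoint (0.5, 0.5), since it lies farthest from both extremes; this is the diversity-preserving behaviour that motivates NFCS.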

6.4 Dynamic adaptive weighted factors optimization by MOPSO

Particle swarm optimization (PSO) is a metaheuristic algorithm first proposed by Eberhart and Kennedy [45]. As one of the swarm intelligence methods, it is inspired by the collective behaviours and movement patterns of bird flocks. In PSO, every particle has a velocity vector and a position vector, which are updated in each iteration by learning from the particle's own best position so far and the global best position of all particles. Suppose that at the t-th iteration particle i has velocity vector V_i(t) and position vector X_i(t), and let D denote the dimension of these vectors. The update formulas are given by Eqs. (19)-(20).

$${v}_{ij}\left(t+1\right)=\omega *{v}_{ij}\left(t\right)+{c}_{1}*{rand}_{1}\left(t\right)*\left({pbest}_{ij}\left(t\right)-{x}_{ij}\left(t\right)\right)+{c}_{2}*{rand}_{2}\left(t\right)*\left({gbest}_{j}\left(t\right)-{x}_{ij}\left(t\right)\right) \tag{19}$$

$${x}_{ij}\left(t+1\right)={x}_{ij}\left(t\right)+{v}_{ij}\left(t+1\right) \tag{20}$$

Here, v_ij(t) and x_ij(t) denote the j-th dimension of the velocity vector and the position vector of the i-th particle, respectively; ω is the inertia weight (not to be confused with the adjustment factor ω in Eq. (12)); rand₁ and rand₂ are two random numbers between 0 and 1; c₁ and c₂ denote the cognitive and social parameters, which represent the learning rates towards pbest_ij and gbest_j, respectively. pbest_ij denotes the j-th dimension of the position vector of the current best solution of the i-th particle, and gbest_j denotes the j-th dimension of the position vector of the current best solution among all particles.

In this paper, we present a multi-objective particle swarm optimization method (MOPSO) for dynamic adaptive hyperlink evaluation. To be specific, we adapt MOPSO to optimize the weighted factors σ₁ ~ σ₄ of the CPEM so as to maximize the topical priorities of hyperlinks. In every crawling iteration, the MOPSO procedure is executed on a hyperlink list for optimization, denoted by LinkOpti. At the start of crawling, we add the k = 30 seed hyperlinks to LinkOpti. In the following crawling periods, LinkOpti is updated with the newly found topic-relevant hyperlinks in each iteration, which ensures that the weighted factors are constantly updated.
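
A single application of Eqs. (19)-(20) to one particle can be sketched in Python as below. The text does not say whether rand₁ and rand₂ are drawn once per particle or once per dimension; the sketch draws them per dimension, which is a common convention and an assumption here. The default parameter values are those reported later for Procedure 1.

```python
import random


def pso_step(x, v, pbest, gbest, omega=0.78, c1=2.0, c2=2.0, rng=random):
    """Return the updated (position, velocity) of one particle."""
    new_x, new_v = [], []
    for j in range(len(x)):
        r1, r2 = rng.random(), rng.random()  # assumption: per dimension
        v_j = (omega * v[j]
               + c1 * r1 * (pbest[j] - x[j])   # pull towards own best
               + c2 * r2 * (gbest[j] - x[j]))  # pull towards global best
        new_v.append(v_j)            # Eq. (19)
        new_x.append(x[j] + v_j)     # Eq. (20)
    return new_x, new_v


# When the particle already sits on both best positions, only the
# inertia term remains and the update is deterministic.
x1, v1 = pso_step(x=[0.5, 0.5], v=[0.1, -0.2],
                  pbest=[0.5, 0.5], gbest=[0.5, 0.5])
```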

Procedure 1. MOPSO (LinkOpti)
1:	Initialize constants ω = 0.78, c₁ = c₂ = 2.0, and let t = 1. // t is the iteration period.
2:	Randomly initialize the velocity vector V_i(t)=[v_i1(t), v_i2(t), …, v_iD(t)] and the position vector X_i(t)=[x_i1(t), x_i2(t), …, x_iD(t)] of each hyperlink l_i in LinkOpti; // D represents the dimension of velocity and position vectors and D = 4; v_ij(t), x_ij(t)∊(0,1], j = 1, 2, …, D.
3:	Randomly select X_i(t) from LinkOpti to initialize pbest_i and gbest. // pbest_i represents the position vector of l_i with the highest historical Priority(l_i) and gbest represents the position vector of the hyperlink l with the highest historical Priority(l) in LinkOpti.
4:	For i = 1 to k do // k is the size of LinkOpti and k = 30.
		For j = 1 to D do // j is the j-th dimension of the vector.
				v_ij(t + 1) = ω*v_ij(t) + c₁*rand₁(t)*(pbest_ij(t) - x_ij(t)) + c₂*rand₂(t)*(gbest_j(t) - x_ij(t)); x_ij(t + 1) = x_ij(t) + v_ij(t + 1); σ_j = x_ij(t + 1); // the updated j-th dimension of the position vector is assigned to σ_j.
		End for
		Normalize σ₁ ~ σ₄ and calculate hyperlink priority Priority(l_i) of l_i by Eq. (13). Update pbest_i(t + 1) by assigning the position vector of the hyperlink with the individual highest comprehensive priority.
	End for
5:	Update gbest(t + 1) by assigning the position vector of the hyperlink with the global highest comprehensive priority.
6:	Assign gbest(t + 1) to σ₁ ~ σ₄.
7:	Let t = t + 1.
8:	If t ≥ 500 then
		The procedure ends.
	Else
			Go to step 4.
	End If

The detailed steps of MOPSO are elucidated in Procedure 1. MOPSO treats the hyperlinks in LinkOpti as particles (30 in total), each of which has a four-dimensional velocity vector and a four-dimensional position vector whose values are randomly initialized between 0 and 1. The comprehensive priority of a hyperlink calculated by Eq. (13) is regarded as the fitness value, pbest as the position vector of the individual best solution, and gbest as the position vector of the global best solution. Based on extensive experimental tests, the number of training iterations is set to 500 in this paper. The final global best solution gbest is taken as the weighted factors σ₁ ~ σ₄. In MOPSO, we set the constant inertia parameter ω = 0.78 and the cognitive and social parameters c₁ = c₂ = 2.0.
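
A compact Python sketch of Procedure 1 is given below, under several labeled assumptions: each hyperlink is represented only by its four precomputed metric values, pbest and gbest are initialized from the initial positions rather than by random selection, positions are clamped to [0, 1] so that normalization always yields valid weights, and the sample metric values are hypothetical. It is meant to show the control flow, not to reproduce the original Java implementation.

```python
import random


def normalize(weights):
    """Scale non-negative weights to sum to 1 (uniform if all are zero)."""
    total = sum(weights)
    if total == 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def priority(metrics, sigma):
    """Eq. (13) for one hyperlink whose four metrics are precomputed."""
    return sum(s * m for s, m in zip(sigma, metrics))


def optimize_weights(link_metrics, iterations=500, omega=0.78,
                     c1=2.0, c2=2.0, seed=0):
    rng = random.Random(seed)
    dim = 4
    # Step 2: positions and velocities drawn from (0, 1].
    xs = [[1.0 - rng.random() for _ in range(dim)] for _ in link_metrics]
    vs = [[1.0 - rng.random() for _ in range(dim)] for _ in link_metrics]
    # Step 3 (assumption: start from the initial positions).
    pbest = [x[:] for x in xs]
    pbest_fit = [priority(m, normalize(x)) for m, x in zip(link_metrics, xs)]
    g = max(range(len(xs)), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iterations):
        # Step 4: move every particle and refresh its personal best.
        for i, metrics in enumerate(link_metrics):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][j] = (omega * vs[i][j]
                            + c1 * r1 * (pbest[i][j] - xs[i][j])
                            + c2 * r2 * (gbest[j] - xs[i][j]))
                # Assumption: clamp positions to [0, 1].
                xs[i][j] = min(1.0, max(0.0, xs[i][j] + vs[i][j]))
            fit = priority(metrics, normalize(xs[i]))
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = xs[i][:], fit
        # Step 5: refresh the global best.
        g = max(range(len(xs)), key=lambda i: pbest_fit[i])
        if pbest_fit[g] > gbest_fit:
            gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    # Step 6: the normalized gbest becomes sigma_1 ... sigma_4.
    return normalize(gbest)


# Hypothetical metrics (f1, f2, f3, f4) of three hyperlinks in LinkOpti.
link_metrics = [(0.8, 0.6, 0.5, 1.0), (0.3, 0.7, 0.9, 0.9), (0.6, 0.2, 0.4, 1.1)]
sigma = optimize_weights(link_metrics, iterations=100)
```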

6.5 Focused crawling strategy based on the multi-objective particle swarm optimization

By incorporating the domain ontologies, the CPEM, and MOPSO into the focused crawler, a novel focused crawler strategy, denoted by FCMOPSO, is devised. FCMOPSO first randomly generates k = 30 seed hyperlinks (see subsection 6.1) and adds them to LinkSeed and to LinkOpti, the list used to optimize σ₁ ~ σ₄. Thereafter, by executing Procedure 1 (see step 3 in Algorithm 1), FCMOPSO optimizes σ₁ ~ σ₄ and obtains more reasonable weighted factors for hyperlink evaluation. Next, it calculates the topical priorities of the seed hyperlinks by Eq. (13) and adds the top half to LinkCandidate. Then, the sub-hyperlinks extracted from the webpages to which the hyperlinks in LinkCandidate point are added to LinkChild. The sub-hyperlinks in LinkChild whose topical priorities are greater than δ₂ are stored in LinkWait, from which new seed hyperlinks are chosen by the NS_NFCS strategy for the next crawling period. Additionally, the topic relevance of the webpages to which the hyperlinks in LinkWait point is calculated by Eq. (8). The sub-hyperlinks whose corresponding webpages have topic relevance greater than δ₁ are stored in LinkSave, which supplies the topic-relevant hyperlinks used to update LinkOpti and optimize σ₁ ~ σ₄ in the next crawling period. The detailed process of FCMOPSO is explicated in Algorithm 1.

Algorithm 1. FCMOPSO
Input: seed hyperlinks Output: downloaded webpages
1:	Determine the domain, construct the domain ontology and obtain the topical semantic weighted vector.
2:	Randomly generate k = 30 seed hyperlinks (see subsection 6.1) and add them to LinkSeed and LinkOpti. Initialize δ₁, δ₂, RW, DW, LinkCandidate, LinkChild, LinkWait, LinkSave. // DW denotes the number of downloaded webpages and RW denotes the number of topic-relevant webpages.
3:	Execute Procedure 1 MOPSO (LinkOpti) to update σ₁ ~ σ₄.
4:	Calculate and rank the hyperlink priority Priority(l_seed) of each hyperlink l_seed in LinkSeed by the updated σ₁ ~ σ₄, select the top half of the hyperlinks, and add them to LinkCandidate.
5:	Obtain all the sub-hyperlinks from webpages to which all hyperlinks of LinkCandidate point, add them to LinkChild and remove the duplicate ones. Denote the size of LinkChild by s.
6:	For i = 1 to s do
		Calculate the hyperlink priority Priority(l_child) of each hyperlink l_child in LinkChild.
		If Priority(l_child)≥δ₂ then
			Add l_child to LinkWait and let DW = DW + 1. Calculate the topic relevance R(P_child) of webpage P_child to which l_child points.
			If R(P_child)≥δ₁ then
				Add l_child to LinkSave and let RW = RW + 1.
			Else
				Skip the hyperlink l_child.
			End If
		Else
			Skip the hyperlink l_child and continue.
		End If
		If DW ≥ 15,000 then
			The algorithm ends. Output the downloaded webpages and performance evaluation indices.
		Else
			Continue.
		End If
	End for
7:	Select k hyperlinks from LinkWait to update LinkSeed by the NS_NFCS strategy (see subsection 6.3) and select the k hyperlinks from the end of LinkSave (i.e., newly added k hyperlinks) to update LinkOpti.
8:	Clear LinkChild and LinkCandidate. Go to step 3.
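
Step 6 of Algorithm 1, which sorts the sub-hyperlinks into LinkWait and LinkSave while counting DW and RW, can be sketched in Python as follows. The `ChildLink` record, the function name, and the value used for δ₁ in the example are hypothetical; the priority and relevance of each link are assumed to be computed elsewhere by Eq. (13) and Eq. (8).

```python
from dataclasses import dataclass


@dataclass
class ChildLink:
    url: str
    priority: float   # Priority(l_child), Eq. (13)
    relevance: float  # R(P_child), Eq. (8), of the page the link points to


def filter_children(children, delta1, delta2, dw=0, rw=0, max_downloads=15000):
    """Split LinkChild into LinkWait and LinkSave and update the counters."""
    link_wait, link_save = [], []
    for child in children:
        if child.priority >= delta2:
            link_wait.append(child.url)
            dw += 1                      # the page is downloaded
            if child.relevance >= delta1:
                link_save.append(child.url)
                rw += 1                  # the page is topic-relevant
        if dw >= max_downloads:
            break                        # stopping condition of the crawl
    return link_wait, link_save, dw, rw


# Hypothetical sub-hyperlinks; delta1 = 0.6 is an illustrative value only.
children = [
    ChildLink("a", priority=0.5, relevance=0.9),
    ChildLink("b", priority=0.1, relevance=0.9),
    ChildLink("c", priority=0.3, relevance=0.2),
]
wait, save, dw, rw = filter_children(children, delta1=0.6, delta2=0.20)
```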

Experiments on two domains, rainstorm and typhoon disasters, are conducted to evaluate the effectiveness of the proposed focused crawler strategy FCMOPSO. We compare the performance of FCMOPSO with that of five other strategies, namely, the optimal priority search algorithm (OPS) [13], the simulated annealing algorithm considering host information (SA-host) [14], the improved tabu search algorithm incorporating ontology (On-ITS) [15], the focused crawler combining web space evolutionary algorithm and domain ontology (FCWSEO) [18], and the focused crawler based on ontology learning and multi-objective ant colony optimization (OLMOACO) [19]. FCMOPSO was implemented in Java and run on a PC with an Intel Core i7-7700 CPU (3.60 GHz) and 8.00 GB of RAM.

7.1 Performance evaluation indices

For focused crawlers, the recall rate and the accuracy rate are usually used as performance evaluation indices. However, the recall rate is difficult to calculate, since the total number of topic-relevant pages in the whole network is unknown and ever-expanding. Consequently, we use the accuracy rate as an evaluation index in this paper, as shown in Eq. (21).

$$Accuracy=\frac{RW}{DW} \tag{21}$$

Here, RW denotes the number of topic-relevant webpages in all downloaded webpages, and DW denotes the number of downloaded webpages.

In addition, we use the average topic relevance R_avg (see Eq. (22)) of the downloaded webpages to evaluate the quality of the downloaded webpages and the standard deviation R_sd (see Eq. (23)) of the topic relevance of the downloaded webpages to quantify their amount of variation or dispersion, reflecting the stability of the quality (topic relevance) of the downloaded webpages. The lower the value, the higher the stability.

$${R}_{avg}=\frac{1}{DW}*{\sum }_{i=1}^{DW}R\left({P}_{i}\right) \tag{22}$$

$${R}_{sd}=\sqrt{\frac{1}{DW}*{\sum }_{i=1}^{DW}{\left(R\left({P}_{i}\right)-{R}_{avg}\right)}^{2}} \tag{23}$$

Here, R(P_i) denotes the topic relevance of webpage P_i.
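
The three indices of Eqs. (21)-(23) can be computed from the topic relevance values of the downloaded webpages, as in the minimal Python sketch below. The function name, the threshold value used to decide whether a page counts towards RW, and the sample relevance values are hypothetical.

```python
import math


def crawl_metrics(relevances, delta1):
    """Return (Accuracy, R_avg, R_sd) for the downloaded webpages.

    relevances - R(P_i) of every downloaded webpage (DW values)
    delta1     - relevance threshold above which a page counts towards RW
    """
    dw = len(relevances)
    rw = sum(1 for r in relevances if r >= delta1)
    accuracy = rw / dw                                   # Eq. (21)
    r_avg = sum(relevances) / dw                         # Eq. (22)
    r_sd = math.sqrt(sum((r - r_avg) ** 2 for r in relevances) / dw)  # Eq. (23)
    return accuracy, r_avg, r_sd


# Hypothetical relevance values of four downloaded webpages.
accuracy, r_avg, r_sd = crawl_metrics([0.9, 0.7, 0.5, 0.3], delta1=0.6)
```

Note that Eq. (23) divides by DW rather than DW - 1, i.e., it is the population standard deviation.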

7.2 Experimental results and discussion

Figures 3–6 display the Accuracy, RW, R_avg, and R_sd of OPS, SA-host, On-ITS, FCWSEO, OLMOACO, and FCMOPSO in the rainstorm disaster and typhoon disaster domains, respectively. Notably, no experiments were conducted with FCWSEO in the typhoon disaster domain, so we exclude it from the comparison in that domain. The results are recorded when DW reaches 15,000, because by then the performance evaluation indices of the crawlers have shown steady trends for some time.

Figure 3 (a) and (b) compare the Accuracy of the different strategies in the rainstorm disaster domain and the typhoon disaster domain, respectively. When DW reaches 15,000, the Accuracy of every strategy except OPS tends to be stable, and FCMOPSO attains the highest Accuracy in both domains. In the rainstorm disaster domain, the Accuracy of FCMOPSO exceeds that of the other strategies when DW reaches about 6,000. The final Accuracy values of OPS, SA-host, On-ITS, FCWSEO, OLMOACO, and FCMOPSO are about 60.4%, 70.6%, 75.2%, 81.1%, 74.2%, and 82.3%, respectively. In the typhoon disaster domain, the Accuracy of FCMOPSO exceeds that of the other strategies when DW reaches 9,000. The final Accuracy values of OPS, SA-host, On-ITS, OLMOACO, and FCMOPSO are about 50.6%, 74.6%, 76.8%, 74.7%, and 84.0%, respectively.

From Fig. 3, it is clear that OPS features higher Accuracy than the other strategies in the early crawling stage but plummets in the later crawling stage because of its greedy strategy. Initially, OPS crawls webpages from the seed hyperlinks with the highest priority, which is not conducive to expanding the search range. When OPS selects a hyperlink with no prospects, the webpage it points to may contain few valuable hyperlinks, and the Accuracy of OPS declines rapidly. Similar to OPS, SA-host is also a greedy strategy, but it relaxes the optimal search by accepting hyperlinks with relatively low priority with a certain probability. However, it surpasses only OPS, because it has limited ability to expand the search range, especially in the later crawling stage, and is heavily influenced by parameters that are difficult to set. In addition, On-ITS shows competitive Accuracy: it ranks third in the rainstorm disaster domain and is second only to FCMOPSO in the typhoon disaster domain. On-ITS filters out visited hyperlinks by modifying the tabu object and acceptance principles. If no sub-hyperlink of a visited hyperlink has higher priority than the hyperlink itself, the visited hyperlink is set as a tabu object. Therefore, although On-ITS ensures that the selected hyperlinks have considerable topic relevance, comparing the topical priorities of a visited hyperlink and its sub-hyperlinks involves extensive webpage content analysis and topic relevance calculations, resulting in high time consumption. Moreover, because of its acceptance principles, some potential sub-hyperlinks are not fully exploited, which limits the search range. Notably, in the rainstorm disaster domain, FCWSEO maintains an upward trend throughout the crawling process and surpasses the four strategies other than FCMOPSO when DW reaches about 8,000. FCWSEO and FCMOPSO remain close in Accuracy in the later crawling stage, with FCMOPSO overtaking FCWSEO when DW reaches about 12,500 owing to its dynamic adaptive evaluation strategy. As the parameters of the CPEM are constantly updated, the evaluation of hyperlinks changes dynamically and tends to become more reasonable, so FCMOPSO can maintain good Accuracy in the later crawling stage.

Figure 4 (a) and (b) demonstrate the number of topic-relevant webpages RW downloaded by the different strategies in the rainstorm disaster domain and the typhoon disaster domain, respectively. The OPS strategy crawls fewer and fewer topic-relevant webpages in the later crawling stage, while the growth trend of the number of topic-relevant webpages of the other strategies remains almost constant. In both domains, when DW reaches 15,000, FCMOPSO can crawl more topic-relevant webpages than the other strategies, and its growth rate of the number of topic-relevant webpages is also greater in the later crawling stage. The final RW of OPS, SA-host, On-ITS, FCWSEO, OLMOACO, and FCMOPSO are 9053, 10596, 11280, 12165, 11126, and 12352, respectively in the rainstorm disaster domain and those of OPS, SA-host, On-ITS, OLMOACO, and FCMOPSO are 7593, 11192, 11520, 11201, and 12596, respectively in the typhoon disaster domain.

Figure 5 (a) and (b) compare the average topic relevance R_avg of the webpages downloaded by the different strategies in the two domains. The overall trend shows that the R_avg of FCMOPSO is stable and remains competitive during the whole crawling process. In the rainstorm disaster domain, the final R_avg of FCMOPSO reaches about 0.770, while the R_avg of OPS, SA-host, On-ITS, FCWSEO, and OLMOACO are about 0.622, 0.663, 0.692, 0.820, and 0.778, respectively. The quality of the webpages captured by FCWSEO is superior to that of the other strategies, followed by OLMOACO and FCMOPSO. In the typhoon disaster domain, the R_avg of OPS, SA-host, On-ITS, OLMOACO, and FCMOPSO are about 0.606, 0.710, 0.728, 0.700, and 0.734, respectively, so the final R_avg of FCMOPSO ranks first. Overall, these results suggest that FCMOPSO is competitive in R_avg and can obtain web information with high topic relevance.

Figure 6 (a) and (b) compare the standard deviation R_sd of the topic relevance of the webpages downloaded by the different strategies. As illustrated in Fig. 6, OLMOACO has the lowest R_sd when DW reaches 15,000, followed by FCMOPSO. Specifically, the R_sd of OPS, SA-host, On-ITS, FCWSEO, OLMOACO, and FCMOPSO are about 0.208, 0.195, 0.158, 0.157, 0.138, and 0.145, respectively, in the rainstorm disaster domain, and those of OPS, SA-host, On-ITS, OLMOACO, and FCMOPSO are about 0.209, 0.197, 0.151, 0.075, and 0.149, respectively, in the typhoon disaster domain. During the whole crawling process, OPS exhibits the greatest volatility while OLMOACO has the greatest stability. OPS selects the hyperlinks with the highest priority in every iteration, so valuable sub-hyperlinks are ignored as the network is explored more deeply. Its R_sd soars because the topic relevance of the pages it downloads varies widely: as irrelevant webpages are crawled at random, the average topic relevance becomes more volatile, which results in a distinct fluctuation of R_sd. The R_sd of OLMOACO maintains a downward trend and is the lowest in the later crawling stage in both domains. This is because the ants in OLMOACO accumulate more pheromones as crawling continues, which makes it easier for them to find better crawling paths and fetch more topic-relevant hyperlinks.

Table 2

Results of different focused crawler strategies when DW reaches 15,000
Strategy	Rainstorm disaster domain					Typhoon disaster domain
Strategy	Accuracy/%	RW	R_avg	R_sd	Time/h	Accuracy/%	RW	R_avg	R_sd	Time/h
OPS	60.4	9053	0.622	0.208	8.93	50.6	7593	0.606	0.209	8.5
SA-host	70.6	10596	0.663	0.195	11.48	74.6	11192	0.710	0.197	11.34
On-ITS	75.2	11280	0.692	0.158	12.21	76.8	11520	0.728	0.151	11.96
FCWSEO	81.1	12162	0.822	0.157	11.64	—	—	—	—	—
OLMOACO	74.2	11126	0.778	0.138	16	74.7	11201	0.700	0.075	15
FCMOPSO	82.3	12352	0.770	0.145	6.26	84.0	12596	0.734	0.149	6.03

Table 3

Friedman ranks of different focused crawler strategies for four evaluation indices (Accuracy, R_avg, R_sd, Time) when DW reaches 15,000
Evaluation indices	Rainstorm disaster domain						Typhoon disaster domain
Evaluation indices	OPS	SA-host	On-ITS	FCWSEO	OLMOACO	FCMOPSO	OPS	SA-host	On-ITS	OLMOACO	FCMOPSO
Accuracy	6	5	3	2	4	1	5	4	2	3	1
R_avg	6	5	4	1	2	3	5	3	2	4	1
R_sd	6	5	4	3	1	2	5	4	3	1	2
Time	2	3	5	4	6	1	2	3	4	5	1
Average	5	4.5	4	2.5	3.25	1.75	4.25	3.5	2.75	3.25	1.25

Table 2 summarizes the Accuracy, RW, R_avg, R_sd, and running time of the different strategies when DW reaches 15,000. As shown in Table 2, FCMOPSO outperforms the other five focused crawler strategies in Accuracy, RW, and Time in the rainstorm disaster domain and in Accuracy, RW, R_avg, and Time in the typhoon disaster domain. Although FCMOPSO does not rank first in R_avg in the rainstorm disaster domain or in R_sd in either domain, its results on these indices are still competitive. According to the Time columns in Table 2, FCMOPSO has lower time consumption than the other strategies in both domains because of its simple hyperlink selection strategy. Unlike the other multi-objective algorithms, FCMOPSO uses MOPSO only for parameter optimization on a small amount of data (the hyperlinks in LinkOpti). It thus avoids complex and repetitive calculations over the ever-growing set of hyperlinks in the waiting queue when selecting hyperlinks, the running time of which surges as crawling continues.

It can be seen from Table 2 that FCMOPSO does not achieve the best result on every evaluation index. To further assess the effectiveness of FCMOPSO, Table 3 reports the Friedman ranks [46], which come from a non-parametric statistical test that compares the performance of several algorithms by their rankings. The lower the average rank, the better the overall performance of the strategy. The four representative evaluation indices recorded when DW reaches 15,000, i.e., Accuracy, R_avg, R_sd, and Time, are converted to rankings for each strategy. As can be seen, FCMOPSO has the lowest average rank over the four indices in both domains, indicating that it performs best overall among the compared strategies. FCWSEO ranks second in the rainstorm disaster domain and On-ITS ranks second in the typhoon disaster domain, both followed by OLMOACO.
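
As a worked example, the average ranks in the rainstorm disaster columns of Table 3 can be reproduced from the corresponding values in Table 2 with the short Python sketch below. The helper name and data layout are our own; the ranking routine assumes that no two strategies tie on an index, which holds for these rows.

```python
def rank_values(values, higher_is_better):
    """Rank 1 = best. Assumes there are no ties among the values."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


strategies = ["OPS", "SA-host", "On-ITS", "FCWSEO", "OLMOACO", "FCMOPSO"]

# Rainstorm disaster domain, values copied from Table 2.
# Each entry: (values per strategy, whether a higher value is better).
rainstorm = {
    "Accuracy": ([60.4, 70.6, 75.2, 81.1, 74.2, 82.3], True),
    "R_avg": ([0.622, 0.663, 0.692, 0.822, 0.778, 0.770], True),
    "R_sd": ([0.208, 0.195, 0.158, 0.157, 0.138, 0.145], False),
    "Time": ([8.93, 11.48, 12.21, 11.64, 16, 6.26], False),
}

rank_table = {name: rank_values(vals, better_high)
              for name, (vals, better_high) in rainstorm.items()}
average_ranks = [
    sum(rank_table[name][i] for name in rank_table) / len(rank_table)
    for i in range(len(strategies))
]
```

The resulting averages match the "Average" row of Table 3 for the rainstorm disaster domain, with FCMOPSO obtaining the lowest value of 1.75.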

To sum up, experimental results show that FCMOPSO achieves impressive and satisfactory results in most performance evaluation indices, particularly prevailing over the other five crawlers in the crawling accuracy and time consumption. The overall performance of FCMOPSO is better than the other strategies in the literature. More importantly, the experimental results show the effectiveness of our proposed dynamic adaptive hyperlink evaluation method, which sheds light on more efficient approaches for hyperlink evaluation in the focused crawler.

With the increasing frequency and severity of meteorological disasters, the internet is a substantial channel for the public to learn about disaster mitigation and emergency response in real time. To collect information on the domain of meteorological disasters from the internet, semantic-based focused crawlers can capture domain-specific knowledge with higher accuracy than generic crawlers.

The pivotal issues of a focused crawler are topic representation, hyperlink evaluation, and crawling strategy. In this paper, we propose a novel focused crawler with dynamic adaptive hyperlink evaluation based on MOPSO, denoted by FCMOPSO. In terms of topic representation, two domain ontologies of meteorological disasters, namely, those of the rainstorm and typhoon disaster domains, are constructed. For the topical priority evaluation of hyperlinks, traditional focused crawlers simply integrate different evaluation metrics that portray the topic relevance of hyperlinks with their corresponding weighted factors. However, these weighted factors are manually assigned and cannot adjust themselves dynamically during crawling. To solve this problem, we introduce a dynamic optimization of the weighted factors for hyperlink evaluation based on MOPSO. More importantly, this research is a first attempt to tackle the issue of hard-to-decide parameters in the focused crawler, providing a reference for related research. In addition, a comprehensive model concerning both hyperlink structure and various webpage text documents is introduced. As for hyperlink selection, the NS_NFCS strategy is utilized to select Pareto-optimal hyperlinks, guiding the crawling direction.

Experiments are conducted on two domains, and FCMOPSO is compared with five other focused crawler strategies in the literature, namely, OPS, SA-host, On-ITS, FCWSEO, and OLMOACO. In both domains, FCMOPSO has better overall performance than the other focused crawler strategies, which shows the effectiveness of our strategy. However, the proposed method still has room for improvement. For example, topic-relevant webpages may be discovered if we continue to crawl the discarded irrelevant hyperlinks. Therefore, to improve the recall rate and uncover more topic-relevant webpages in the network, tunnel crawling for potential hyperlinks is worthy of continued research. Moreover, as crawling goes deeper into the network, webpages under a certain web host are easily crawled repeatedly, which may affect efficiency. Consequently, in future work, we will try to incorporate host information into the dynamic adaptive hyperlink evaluation and study host-aware crawling strategies.

Ethics approval and consent to participate

This research does not involve any human participants and/or animals.

Competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability statement

The authors confirm that the data supporting the findings of this study are available within the paper.

Authors’ contribution

Liu Jingfa: Conceptualization, Methodology, Funding acquisition, Supervision; Yang Zhihe: Methodology, Data curation, Writing- Original draft, Formal analysis, Validation

Acknowledgements

This research is financially supported by the Special Foundation of Guangzhou Key Laboratory of Multilingual Intelligent Processing, China (Grant No. 201905010008), the Program of Science and Technology of Guangzhou, China (Grant No. 202002030238), and Guangdong Basic and Applied Basic Research Foundation of China (Grant No. 2021A1515011974).

Wu, X. (2021). Three scenes of heavy rainfall disaster in Henan. Xinmin Weekly, (28), 52-57.
Liu, Q., Gao, L., Zhao, P., & Chen, X. W. (2020). Study on the temporal-spatial characteristics of tropical cyclone disasters in China in 2000-2016. China Flood & Drought Management, 30(05), 50-57.
Aristizábal E., Arango Carmona M.I., Gómez F.J., López Castro S.M., De Villeros Severiche A., Riaño Quintanilla A.F. (2020) Hazard Analysis of Hydrometeorological Concatenated Processes in the Colombian Andes. In: Fernandes F., Malheiro A., Chaminé H. (eds) Advances in Natural Hazards and Hydrological Risks: Meeting the Challenge. Advances in Science, Technology & Innovation (IEREK Interdisciplinary Series for Sustainable Development). Springer, Cham.
Chakrabarti, S., Berg, M. V. D., & Dom, B. (1999). Focused crawling: a new approach to topic-specific web resource discovery. Computer Networks, 31(11), 1623-1640.
Yu, J., & Liu, G. (2015). Survey on topic-focused crawlers. Computer Engineering & Science, 37(2), 231-237.
Deng, S.Q. (2020). Research on the Focused Crawler of Mineral Intelligence Service Based on Semantic Similarity. Journal of Physics: Conference Series, 1575(1), 1-8.
Guan W. G., Luo Y. G. (2016). Design and implementation of focused crawler based on concept context graph. Computer Engineering & Design, 37, 2679-2684.
Du, Y. J., Li, C. X., Hu, Q., Li, X. L., & Chen, X. L. (2016). Ranking webpages using a path trust knowledge graph. Neurocomputing, 269(20), 58-72.
Jia, Z., Pramanik, S., Roy, R. S., & Weikum, G. (2021). Complex temporal question answering on knowledge graphs. In: The Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Queensland, Australia, pp. 792-802.
Wang, J. J., Dang, D. P., Zhou, P. X., Wang, H. J., Jiang, X., & Huang, S. H. (2013). Crawling Strategy Based on Domain Ontology of Emergency Plans. In: Proceedings of the 2013 International Conference on Education Technology and Information System (ICETIS 2013) (pp. 646-649). Hainan, China.
Zhu, G., Yang, J. Y., Wu, X. H., & Feng, M. N. (2017). Research on Construction of Hierarchy Relationship and Ontology of Meteorological Disaster Based on FCA. Journal of Modern Information, 37(5), 79-88.
Wang, Y. (2011). Design and implementation of focused crawler based on breadth-first. Fudan University, Shanghai.
Rawat, S., & Patil, D. R. (2013). Efficient focused crawling based on best first search. 2013 3rd IEEE International Advance Computing Conference (pp. 908-911), Ghaziabad, India, IEEE.
Liu, J. F., Li, F., & Jiang, S. Y. (2019). Focused Annealing Crawler Algorithm for Rainstorm Disasters Based on Comprehensive Priority and Host Information. Computer Science, 46(2), 215-222.
Liu, J.F., Gu, Y.P., & Liu, W.J. (2020). Focused crawler method combining ontology and improved Tabu search for meteorological disaster. Journal of Computer Applications, 40(8), 2255-2261.
Chen Y. B., Zhang Z., Zhang T. (2011). A searching strategy in topic crawler using ant colony algorithm. Microcomputer & its Applications, 30(1), 53-56.
Zheng S. (2011). Genetic and ant algorithms based focused crawler design. In: The Proceedings of the 2011 2nd International Conference on Innovations in Bio-inspired Computing & Applications, Shenzhen, Guangdong, pp. 374-378.
Liu, J. F., Li, X., Zhang, Q. S., & Zhong, G. (2022). A novel focused crawler combining Web space evolutionary and domain ontology. Knowledge-Based Systems, 243, 108495.
Liu, J. F., Dong, Y., Liu Z. X., & Chen, D.B. (2022). Applying ontology learning and multi-objective ant colony optimization method for focused crawling to meteorological disasters domain knowledge. Expert Systems With Applications, 198, 116741.
Wang, C., & Ji, X. H. (2016). Improved page rank algorithm based on user interest and topic. Computer Science, 43(3), 275-278.
Asano, Y., Tezuka, Y., & Nishizeki, T. (2008). Improvements of HITS algorithms for spam links. IEICE Transactions on Information & Systems, 91(2), 200-208.
Cheng, Y., Liao, W., & Cheng, G. (2018). Strategy of focused crawler with word embedding clustering weighted in Shark-Search algorithm. Comput. Digit. Eng, 46, 144-148.
Hersovici, M., Jacovi, M., Maarek, Y. S., Pelleg, D., Shtalhaim, M., & Ur, S. (1998). The shark-search algorithm-an application: tailored web site mapping. Computer Networks & ISDN Systems, 30(1-7), 317-326.
Liu, W., & Du, Y. (2014). A novel focused crawler based on cell-like membrane computing optimization algorithm. Neurocomputing, 123, 266-280.
Du, Y., Liu, W., Lv, X., & Peng, G. (2015). An improved focused crawler based on semantic similarity vector space model. Applied Soft Computing, 36, 392-407.
Prakash, J., & Kumar, R. (2015). Web crawling through shark-search using PageRank. Procedia Computer Science, 48, 210-216.
Seyfi, A., Patel, A., & Júnior, J. C. (2016). Empirical evaluation of the link and content-based focused Treasure-Crawler. Computer Standards and Interfaces, 44, 54-62.
Zhao, W., Guan, Z. Y., Cao, Z. W., & Liu, Z. (2016). Mining and harvesting high quality topical resources from the web. Chinese Journal of Electronics, 25(1), 48-57.
Tan, S., Ma, J., & Wu, Y. Z. (2011). The Application of Topic-Relevance in Web Information Extraction. Journal of the China Society for Scientific and Technical Information, 30(2), 155-159.
Yang, S. Y. (2010). Ontocrawler: a focused crawler with ontology-supported website models for information agents. Expert Systems with Applications, 37(7), 5381-5389.
Vidal, M. L., Silva, A. S., Moura, E. S., & Cavalcanti, J. (2006). Structure-driven crawler generation by example. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. (pp. 292-299), Seattle, Washington, USA.
Jing, W. P., Wang, Y. J., & Dong, W. W. (2016). Research on adaptive genetic algorithm in application of focused crawler search strategy. Computer Science, 43(8), 254-257.
Yan, W., & Pan, L. (2018). Designing focused crawler based on improved genetic algorithm. 2018 Tenth International Conference on Advanced Computational Intelligence (pp. 319-323). Xiamen, China, IEEE.
Dewanjee, J. (2016). Heuristic approach for designing a focused web crawler using cuckoo search. International Journal of Computer Science and Engineering, 4(9), 59-63.
Gruber, T. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5, 199-220.
Peng, Q. Q., Du, Y. J., Hai, Y. F., Chen, S. M., & Gao, Z. Q. (2009). Topic-Specific crawling on the web with concept context graph based on FCA. International Conference on Management & Service Science. Wuhan, China. IEEE.
Du, Y. J., Pen, Q. Q., & Gao, Z. Q. (2013). A topic-specific crawling strategy based on semantics similarity. Data & Knowledge Engineering, 88, 75-93.
Kang, X. P., & Miao, D. Q. (2016). A study on information granularity in formal concept analysis based on concept-bases. Knowledge-Based Systems, 105, 147-159.
Rios-Alvarado, A. B., Lopez-Arevalo, I., & Sosa-Sosa, V. J. (2013). Learning concept hierarchies from textual resources for ontologies construction. Expert Systems with Applications, 40(15), 5907-5915.
Ma, L. L., Li, H. W., Lian, S. W., Liang, R.P., & Chen, H. (2016). A disaster focused crawler strategy based on ontology semantics. Computer Engineering, 42(11), 50-56.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30 (1), 107-117.
Huang, X., Ye, C. M., & Cao, L. (2017). Mixed variation weed optimization algorithm for multi-objective job shop scheduling problem. Journal of Computer Applications, 34(12), 3623-3627.
Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182-197.
Liu, J. F., Liu, S. Y., Liu, Z. X., & Li, B. (2020). Configuration space evolutionary algorithm for multi-objective unequal-area facility layout problems with flexible bays. Applied Soft Computing, 89, 106052.
Eberhart, R., & Kennedy, J. (1995). A new optimizer using particle swarm theory. Proceedings of the Sixth International Symposium on Micro Machine and Human Science, 39-43.
Derrac, J., García, S., Molina, D., & Herrera, F. (2011). A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm & Evolutionary Computation, 1(1), 3-18.
