This section focuses on extracting semantic information from unstructured news text and constructing a domain ontology (DO) framework for network news. A DO can be reused within a specific domain: it defines domain-specific concepts and the relationships between them, including the activities of the domain as well as its main theories and fundamental principles. We therefore need to identify the domain of expertise and scope of use, and to reuse existing ontology classes where possible. Since no news DO is currently available for network news, we construct one according to the structural characteristics of news data and the coverage of news semantic information. Compared with general modeling structures, ontology modeling emphasizes knowledge sharing and reuse. In addition, an ontology structure can reuse established models and provide a unified description language for different application systems during cross-domain knowledge fusion. Moreover, the DO of topic news can convey the semantic information of most news items. A news summary may contain one or more conceptual semantic triples. We try to find as many conceptual semantic triples as possible for each news summary; the conceptual semantic triple with the greatest coverage is then the optimal match when selecting a specific category of the ontology model.
4.1 News instance mapping process
The complete semantic triple pattern is mapped to the semantic minimum pattern:
$$Tri\left({e}_{h},r,{e}_{t}\right)\stackrel{P}{\to }Tri\left({S}_{n},verb,{G}_{n}\right)$$
1
where \(P\) represents the mapping process, \({S}_{n}\) represents a specific noun, \(verb\) represents the action of a news event, and \({G}_{n}\) represents a generalized noun, which may also serve as a specific semantic term or the complete action target of a news event.
Similarly, the mapping of the incomplete semantic triple pattern to the semantic minimum pattern is formulated as follows:
$$Tri\left({e}_{h},r\right)\stackrel{P}{\to }Tri\left({G}_{n},verb\right)$$
2
$$Tri\left(r,{e}_{t}\right)\stackrel{P}{\to }Tri\left(verb,{G}_{n}\right)$$
3
where \({e}_{h}\) in Eq. (2) is a supplementary explanation of the news event; it is mapped to \({G}_{n}\) and associated with \(verb\) in passive form.
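The mapping \(P\) in Eqs. (1)-(3) amounts to a dispatch on which elements of a triple are present. The Python below is an illustrative sketch; the class and function names are ours, not from the paper:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Triple:
    head: Optional[str]  # e_h: head entity; None for the Tri(r, e_t) pattern
    rel: str             # r: the action verb of the news event
    tail: Optional[str]  # e_t: tail entity; None for the Tri(e_h, r) pattern

def map_pattern(t: Triple) -> Tuple:
    """Map a raw triple to its semantic minimum pattern, Eqs. (1)-(3)."""
    if t.head is not None and t.tail is not None:
        return (t.head, t.rel, t.tail)   # Tri(S_n, verb, G_n), Eq. (1)
    if t.head is not None:
        return (t.head, t.rel)           # Tri(G_n, verb), Eq. (2): e_h -> G_n, passive
    return (t.rel, t.tail)               # Tri(verb, G_n), Eq. (3)
```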
4.2 Semantic role annotation
According to the above definition, several minimum semantic units can be found by labeling the semantic roles of news sentences [30]. The annotation rules are as follows:
·First, a news summary is divided into one or more sentences according to full stops. Each sentence contains one or more semantic triples.
·Then the semantic structure of each sentence is marked from the largest layer to the smallest. The semantic triple patterns \(Tri\left({e}_{h},r,{e}_{t}\right)\), \(Tri\left({e}_{h},r\right)\), and \(Tri\left(r,{e}_{t}\right)\) correspond to the functions \(f\left(\text{A}\right)\), \(f\left(\text{B}\right)\), and \(f\left(\text{C}\right)\), respectively, where A, B, and C are the sets of \({e}_{t}\) in the respective patterns. Let a, b, and c denote the minimum semantic units in each set. With \(r\) a set of actions (\(v\in r\)), they form the minimum collocations \(v-a\), \(v-b\), and \(v-c\), denoted \(f\left(a\right)\), \(f\left(b\right)\), and \(f\left(c\right)\), respectively.
·Finally, we obtain a fine-grained annotation of the semantic triple patterns.
If there exists only one \(f\left(a\right), f\left(a\right)\in f\left(\text{A}\right)\), label it directly;
If more than one \(f\left(a\right)\in f\left(\text{A}\right)\) exists, add a term \({A}_{0}\) to each element \(f\left(a\right)\) to form semantic triple instances of multiple minimum semantic units in A, which are independent of one another, that is:
$$f\left(\text{A}\right)=\left\{\begin{array}{c}f\left(a\right), a=A\\ \sum _{n}n{A}_{0}+f\left({a}_{n}\right), {a}_{n}\in A,{a}_{m}{\cap a}_{n}=\varnothing ,m\ne n\end{array}\right.$$
4
If there exists only one \(f\left(b\right), f\left(b\right)\in f\left(\text{B}\right)\), label it directly;
If more than one \(f\left(b\right)\in f\left(\text{B}\right)\) exists, each \(f\left(b\right)\) is a semantic triple instance of a minimum semantic unit in B, and these instances are independent of one another, that is:
$$f\left(\text{B}\right)=\left\{\begin{array}{c}f\left(b\right), b=B\\ \sum _{n}f\left({b}_{n}\right), {b}_{n}\in B,{b}_{m}{\cap b}_{n}=\varnothing ,m\ne n\end{array}\right.$$
5
\(f\left(\text{C}\right)\) has a similar distribution structure to that of \(f\left(\text{B}\right)\), that is:
$$f\left(\text{C}\right)=\left\{\begin{array}{c}f\left(c\right), c=C\\ \sum _{n}f\left({c}_{n}\right), {c}_{n}\in C,{c}_{m}{\cap c}_{n}=\varnothing ,m\ne n\end{array}\right.$$
6
By iterating the above steps, a large number of semantic triples with fewer word structures can be extracted from news summaries.
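The annotation rules above can be illustrated with a minimal sketch: a summary is split into sentences at full stops, and each verb's disjoint minimum units yield independent collocations as in Eqs. (4)-(6). The function names and sample text are ours:

```python
from typing import List, Tuple

def split_sentences(summary: str) -> List[str]:
    """Divide a news summary into sentences at full stops (first rule)."""
    return [s.strip() for s in summary.split(".") if s.strip()]

def collocations(verb: str, units: List[str]) -> List[Tuple[str, str]]:
    """Each disjoint minimum semantic unit forms an independent v-a
    collocation with the verb, per Eqs. (4)-(6)."""
    return [(verb, u) for u in units]
```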
4.3 Optimal matching concept
As a specific noun, \({S}_{n}\) is extracted directly from the semantic roles in the news. As the object or passive complement connected by the verb, \({G}_{n}\) needs to be transformed into the concept of instances. Semantic roles with smaller word-forming structures and clearer semantic structures are chosen to represent the concept of a set of triple instances. To determine the basic concepts of news instances, the conceptual hierarchy can be compacted, since the conceptual level usually covers only the basic concepts of a topic news item. A more fine-grained concept is then obtained by combining the mined news semantic information with the basic concepts, so as to construct news triple instances that meet the requirements of ontology instantiation.
We decompose the news semantic information into a large number of semantic triple instances. The term \({G}_{n}\) of each triple is a single instance which conveys a piece of specific news semantic information. A single instance can produce a collection of one or more concepts, and the more frequently the concepts occur, the more typical they are. Therefore, the typicality between concept and instance is used to find the basic concept of a single instance [31].
Let the instance schema for \({G}_{n}\) be \({\text{G}}_{e}\) and the conceptual schema be \({\text{G}}_{c}\). \(p\left({\text{G}}_{e}∕{\text{G}}_{c}\right)\) represents the typicality of instance \({\text{G}}_{e}\) in news singleton concept \({\text{G}}_{c}\), and \(p\left({\text{G}}_{c}∕{\text{G}}_{e}\right)\) represents the typicality of concept \({\text{G}}_{c}\) in news singleton instance \({\text{G}}_{e}\). The specific formulae are given as follows:
$$p\left({\text{G}}_{c}∕{\text{G}}_{e}\right)=\frac{\text{n}\left({\text{G}}_{c},{\text{G}}_{e}\right)}{\sum _{ci}\text{n}\left({\text{G}}_{e},{\text{G}}_{ci}\right)}$$
7
$$p\left({\text{G}}_{e}∕{\text{G}}_{c}\right)=\frac{\text{n}\left({\text{G}}_{c},{\text{G}}_{e}\right)}{\sum _{ei}\text{n}\left({\text{G}}_{ei},{\text{G}}_{c}\right)}$$
8
where \(\text{n}\left({\text{G}}_{c},{\text{G}}_{e}\right)\) represents the co-occurrence frequency of \({\text{G}}_{c}\) and \({\text{G}}_{e}\) in the semantic triple corpus of news information.
For the semantic information of a piece of topic news, the instance information is readily associated with a concept, and that concept is a basic one. Therefore, the association can be expressed by the following formula:
$$\text{Rep}\left({\text{G}}_{c},{\text{G}}_{e}\right)=p\left({\text{G}}_{c}∕{\text{G}}_{e}\right)·p\left({\text{G}}_{e}∕{\text{G}}_{c}\right)$$
9
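Eqs. (7)-(9) reduce to co-occurrence counting over the triple corpus. The sketch below estimates both typicalities and their product Rep from a toy list of (concept, instance) pairs; the data and all names are illustrative, not from the paper:

```python
from collections import Counter

def typicality_scores(pairs):
    """pairs: iterable of (concept, instance) co-occurrences from the corpus.
    Returns Rep(G_c, G_e) per Eq. (9)."""
    n = Counter(pairs)                              # n(G_c, G_e)
    by_inst = Counter(e for _, e in pairs)          # sum over concepts ci
    by_conc = Counter(c for c, _ in pairs)          # sum over instances ei
    p_c_given_e = {(c, e): k / by_inst[e] for (c, e), k in n.items()}  # Eq. (7)
    p_e_given_c = {(c, e): k / by_conc[c] for (c, e), k in n.items()}  # Eq. (8)
    return {k: p_c_given_e[k] * p_e_given_c[k] for k in n}             # Eq. (9)

pairs = [("company", "Huawei"), ("company", "Huawei"),
         ("company", "Apple"), ("city", "Paris")]
rep = typicality_scores(pairs)
```

In this toy corpus, "Huawei" co-occurs only with "company", so \(p({\text{G}}_{c}∕{\text{G}}_{e})=1\), while "company" splits its instance mass 2:1 between "Huawei" and "Apple".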
Next, the concept of actions is described through action instantiation. Each action has an appropriate word in the news semantics that forms its action instantiation information, so each piece of action instantiation information corresponds to a piece of action conceptualization information in the news semantic triple corpus. The probability of occurrence of such action instantiation information is calculated as
$$p\left(v\right)=\frac{\text{n}\left(v\right)}{\sum \text{n}\left({v}_{i}\right)}$$
10
where \(\text{n}\left(v\right)\) represents the frequency of occurrence of a given action instantiation in the news semantic triple corpus, and \(\sum \text{n}\left({v}_{i}\right)\) is the total count over all action instantiation phrases.
Action conceptualization provides the maximum coverage of action instantiation, the optimal pattern being the corresponding action conceptualization pattern for each action instantiation. We therefore look for the maximum distribution of action conceptualization patterns over action instantiations.
Given the action instantiation information, we find the corresponding action conceptualization pattern \(f\) and maximize it as \(\text{max}\left(f\right)\), where \(f\) consists of two parts, an action part and a conceptual part. Therefore,
$$\text{max}\left(f\right)=\text{max}\left[f\left(v\right)·f\left({\text{G}}_{c}\right)\right]$$
11
where \(f\left(\text{v}\right)\) is the probability distribution of the action, and \(f\left({\text{G}}_{\text{c}}\right)\) is the basic concept of the specific instance. The conditional entropy can be used to express the distribution of action conceptualization over action instantiation, which we call the semantic information concept triple pattern.
$$\text{max}\left(f\right)= \text{max} \text{H}\left({\text{G}}_{c}/v\right)=\sum _{v,{\text{G}}_{c}}p(v,{\text{G}}_{c})\text{log}\frac{1}{p({\text{G}}_{c}/v)}$$
$$=-\sum _{v,{\text{G}}_{c}}p({\text{G}}_{c}/v)p\left(v\right)\text{log}p({\text{G}}_{c}/v)$$
$$=-\sum _{v}p\left(v\right)\sum _{c,e}\text{Rep}\left({\text{G}}_{c},{\text{G}}_{e}\right)\text{log}\text{Rep}\left({\text{G}}_{c},{\text{G}}_{e}\right)$$
12
Through the above steps, we can find the semantic information concept triple pattern that satisfies the maximum coverage of action instantiation information in the news semantic triple. The other categorical semantic triple patterns are also derived using the above steps.
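Under these definitions, Eq. (12) can be evaluated directly from (verb, concept) counts in the corpus, using the factorization \(p(v,{\text{G}}_{c})=p({\text{G}}_{c}/v)\,p(v)\). The sketch below is illustrative; the data and names are ours:

```python
import math
from collections import Counter
from typing import Iterable, Tuple

def action_entropy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Conditional entropy H(G_c / v) over (verb, concept) pairs, Eq. (12)."""
    pairs = list(pairs)
    n_vc = Counter(pairs)                  # joint count n(v, G_c)
    n_v = Counter(v for v, _ in pairs)     # marginal count n(v)
    total = len(pairs)
    h = 0.0
    for (v, c), k in n_vc.items():
        p_v = n_v[v] / total               # Eq. (10)
        p_c_given_v = k / n_v[v]           # conditional concept probability
        h -= p_v * p_c_given_v * math.log(p_c_given_v)
    return h
```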
4.4 Ontology structure updating algorithm
The content of network news is constantly updated, so the content and service objects of the network news DO will change over time. We therefore designed an ontology structure updating algorithm to ensure that the ontology model is automatically updated when its coverage drops below a certain level. The pseudocode (Algorithm 1) is given below.
Algorithm 1. Ontology structure updating algorithm
Require: \({Tri}_{new}\left({e}_{h},r,{e}_{t}\right)\)
Ensure: \({Tri}_{new}\left({S}_{n},verb,{G}_{n}\right)\)
1. if \({Tri}_{new}\left({e}_{h},r,{e}_{t}\right)\nrightarrow Tri\left({S}_{n},verb,{G}_{n}\right)\) then
2.   put \({Tri}_{new}\left({e}_{h},r,{e}_{t}\right)\) into a collection \({TRI}_{new}\)
3. end if
4. if \({TRI}_{new}.length>1000\) then
5.   for each \({Tri}_{new}\) in \({TRI}_{new}\) do
6.     \({Tri}_{new}\left({e}_{h},r,{e}_{t}\right)\stackrel{P}{\to }{Tri}_{new}\left({S}_{n},verb,{G}_{n}\right)\)
7.     if \({Tri}_{new}.verb==Tri.verb\) && \({Tri}_{new}\left({S}_{n},verb,{G}_{n}\right).f>Tri\left({S}_{n},verb,{G}_{n}\right).{f}_{min}\) then
8.       add \({Tri}_{new}\left({S}_{n},verb,{G}_{n}\right)\)
9.     else
10.      create \({Tri}_{new}\left({S}_{n},verb,{G}_{n}\right)\)
11.    end if
12.  end for
13. end if
Other incomplete semantic triple patterns \({Tri}_{new}\left(verb,{G}_{n}\right)\) and \({Tri}_{new}\left({G}_{n},verb\right)\) can also be updated by the above algorithm.
The DO updating algorithm extracts the facts when the network news content of a specific topic changes significantly. Through the extraction of news facts, the ontology database always retains the ability to learn new data structures, thereby implementing the semi-automatic construction of the DO.
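Algorithm 1 can be rendered as a short Python sketch. For simplicity, the buffer here holds already-mapped triples with a frequency field \(f\); the data structures, the handling of triples whose frequency is below the class minimum, and the parameterized threshold are our assumptions:

```python
from typing import Dict, List, Set, Tuple

def flush_buffer(ontology: Dict[str, Set[Tuple[str, str, str]]],
                 buffer: List[Tuple[str, str, str, float]],
                 f_min: float, threshold: int = 1000) -> None:
    """Sketch of Algorithm 1. ontology maps each verb to its set of
    (S_n, verb, G_n) patterns; buffer holds mapped new triples with their
    pattern frequency f."""
    if len(buffer) <= threshold:
        return                                       # steps 1-4: keep collecting
    for s_n, verb, g_n, f in buffer:
        if verb in ontology and f > f_min:
            ontology[verb].add((s_n, verb, g_n))     # steps 7-8: add to class
        elif verb not in ontology:
            ontology[verb] = {(s_n, verb, g_n)}      # step 10: create a class
        # otherwise f is below f_min for an existing class; we drop the
        # triple here, an assumption the pseudocode leaves open
    buffer.clear()
```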
4.5 Ontology model evaluation algorithm
In order to evaluate the effectiveness of our ontology modeling method, we tested two indicators:
·the coverage rate of the ontology model for each piece of news information;
·the correct rate of the ontology model for news information analysis.
The first indicator is the coverage of ontology categories in each piece of news. For example, suppose semantic analysis of a news item yields two instances of \(Tri\left({e}_{h},r,{e}_{t}\right)\), three instances of \(Tri\left(r,{e}_{t}\right)\), and two instances of \(Tri\left({e}_{h},r\right)\). Matching against the ontology base recognizes one instance of \(Tri\left({S}_{n},verb,{G}_{n}\right)\), one of \(Tri\left(verb,{G}_{n}\right)\), and two of \(Tri\left({G}_{n},verb\right)\). The coverage of the ontology category in a piece of news is defined as the ratio of the number of recognized instances to the number of all analyzed instances (4/7 in this example), that is:
$$cov=\frac{n(pattern\_all)}{n(instances\_all)}$$
13
where \(n(pattern\_all)\) represents the number of recognized semantic information conceptual patterns in the ontology base, and \(n(instances\_all)\) represents the number of analyzed semantic triples in each piece of news. The higher the value of \(cov\), the closer the facts described by the news are to the content of the ontology database; the lower the value of \(cov\), the more significantly those facts have changed over time.
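The worked example above, in code: seven analyzed instances of which four are recognized give \(cov=4/7\) by Eq. (13). The function name is ours:

```python
def coverage(n_pattern_all: int, n_instances_all: int) -> float:
    """Eq. (13): ratio of recognized patterns to analyzed instances."""
    return n_pattern_all / n_instances_all

# the example: 1 + 1 + 2 recognized out of 2 + 3 + 2 analyzed instances
cov = coverage(1 + 1 + 2, 2 + 3 + 2)
```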
For the second indicator, a news item is considered matched as long as at least one pattern in the ontology category applies to it. Each type of semantic information conceptual pattern \(f\) is evaluated from the largest to the smallest, and we find the average number of patterns at which the coverage of ontological patterns reaches saturation. The pseudocode (Algorithm 2) is given below.
Algorithm 2. The average number of patterns for batch news data coverage saturation
Require: \(Tri\left({S}_{n},verb,{G}_{n}\right)\), \(Tri\left(verb,{G}_{n}\right)\), \(Tri\left({G}_{n},verb\right)\), \(H\)
Ensure: \(num\)
1. while \((i=1,i<=H.length)\) do // \(H\) is the set of batch news
2.   while \((j=1,j<=\text{max}\left(Tri\left({S}_{n},verb,{G}_{n}\right), Tri\left(verb,{G}_{n}\right), Tri\left({G}_{n},verb\right)\right).length)\) do
3.     if \({Tri\left({S}_{n},verb,{G}_{n}\right)}_{j}\stackrel{P}{\to }{H}_{i}\) then
4.       print (\({Tri\left({S}_{n},verb,{G}_{n}\right)}_{j},j\))
5.       break
6.     end if
7.     if \({Tri\left(verb,{G}_{n}\right)}_{j}\stackrel{P}{\to }{H}_{i}\) then
8.       print (\({Tri\left(verb,{G}_{n}\right)}_{j},j\))
9.       break
10.    end if
11.    if \({Tri\left({G}_{n},verb\right)}_{j}\stackrel{P}{\to }{H}_{i}\) then
12.      print (\({Tri\left({G}_{n},verb\right)}_{j},j\))
13.      break
14.    end if
15.    \(j++\) // if a pattern is not found, the loop fetches the next pattern
16.  end while
17.  \(i++\) // if a pattern is found, the loop fetches the next news item
18. end while
19. \(num=\frac{\sum _{i}j}{i}\)
We calculated the average number in the following way:
$${num}_{news}=\frac{n(news\_all)-n(news\_nocover)}{n(pattern\_cover)}$$
14
where \({num}_{news}\) represents the average number of news items matched by the ontology model per matched pattern, \(n(news\_all)\) represents the number of pieces of news in the test set, \(n(news\_nocover)\) represents the number of pieces of news not matched in the test set, and \(n(pattern\_cover)\) represents the number of different patterns in the ontology that are matched. As the number of ontology patterns increases, a news item may be matched to multiple patterns, so the larger the value of \(n(pattern\_cover)\), the smaller the value of \({num}_{news}\).
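Eq. (14) in code, with illustrative numbers (not results from the paper):

```python
def avg_matched_news(n_news_all: int, n_news_nocover: int,
                     n_pattern_cover: int) -> float:
    """Eq. (14): average number of matched news items per matched pattern."""
    return (n_news_all - n_news_nocover) / n_pattern_cover

# e.g. 100 test items, 20 unmatched, 16 distinct patterns matched
num_news = avg_matched_news(100, 20, 16)
```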