WIKI STREAMS: Wikipedia Article Recent Edit Retrieval System using Hierarchical Stream Clustering

Stream analytics, a new paradigm in data analytics, has gained momentum due to the voluminous generation of stream data. With the huge increase in the edits performed on Wikipedia topics, it is tedious for digital knowledge discovery users to find updates in their domain immediately. Users need to go through a large amount of information and spend considerable time to find the potential data. There is a need to retrieve Wikipedia edits based on the metadata of the article edits for later retrieval. Hence, a clustering technique may be employed to group Wikipedia article edits domain wise. In this paper, hierarchical stream clustering is applied in order to retrieve the edits based on user interest. Data collected from Wikipedia over a period of one month is used as the dataset. Our method is compared with the state-of-the-art clustering system WikiAutoCat, and it is observed that the accuracy is improved by 10% and the clustering time is reduced by 20%.

Introduction
In addition to other online information sources that share knowledge and research findings, online encyclopedias play a major role in delivering domain content to users. Wikipedia is one of the major online encyclopedias and provides information on all domains pertaining to multiple topics. Wikipedia articles are referred to by most online users. These articles are edited by Wikipedia article editors with new data on a topic. Wikipedia article edits are authorized content, and these edits are created from different regions round the clock. Edits are continuously generated on various topics, and on average 3 edits are created per second [1]. Since Wikipedia is one of the trusted sources of information, many users are interested in knowing the recent updates in their domain from Wikipedia articles.
Wikipedia has no provision to maintain article edits grouped by relevant domains, so it is difficult for users with specific domain interests to immediately refer to recent edit content related to their domain [2]. It is useful if methods such as classification, clustering and summarization are applied to the edit contents of the articles. Classification of Wikipedia articles has been performed using domain ontology-based classifiers; this covers only a few domains and requires large training data to cover all domains [3]. Clustering of articles is done using various clustering algorithms and involves grouping all the domain contents into distinct groups [4]. Summarization of Wikipedia articles is performed using a concept model of semantic elements to generate a summary on a specific topic from the article contents [5].
Among these Wikipedia article-based data management systems, clustering is considered the prominent method for grouping article content. Even though clustering of Wikipedia articles has been performed in many existing systems, a system is required to handle the dynamic Wikipedia edits. It is necessary to estimate edit similarity, and the edit contents have to be maintained in a hierarchical structure so that they can be easily accessed for query-based retrieval. Since the edits need to be handled as data streams, it is essential to keep only the metadata of the streams without storing the edit contents offline. All of these create a need to design a system that incorporates the said requirements. Hence, in this paper, a hierarchical stream clustering-based Wikipedia edit retrieval system is proposed. The rest of the paper is organized as follows. Section 2 discusses the works related to stream clustering of Wikipedia articles. Section 3 describes the hierarchical stream clustering and retrieval of Wikipedia article edits. The empirical results are discussed in Section 4. Finally, Section 5 concludes with the performance results and observations of the system.

Related works
Wikipedia is the largest online platform providing knowledgeable content on various topics from different domains. Many articles are available in Wikipedia pertaining to diverse domains. These article contents are updated by interested users when they add or modify content on a specific topic [6]. Edits rapidly accumulate for various topics and need to be collected instantaneously so that the content can be scanned and analysed. This requires concurrent data collection using some form of parallel data crawling; multi-threaded crawlers are used for collecting these Wikipedia edits. The Wikipedia article contents are analysed in various forms [7]. The potential information in these article edits is interesting to users who look for updated information in their domains. It is necessary to process the article edits to derive insights from them. The article edits may be grouped so that users can easily retrieve recent edits from the grouped data.
Clustering is a data mining task for grouping data in an unsupervised manner. Many traditional clustering methods are available for grouping data, but these methods are not appropriate for handling Wikipedia edits. Most Wikipedia article clustering works have used the Wikipedia data repository [8]. Stream clustering has been applied to numerical streams and short text messages, but not to Wikipedia edits; a stream clustering algorithm needs to be applied to Wikipedia edit streams to cluster them dynamically. Hai-long et al. [9] conducted a survey on data stream mining algorithms and observed that stream clustering lacks support for text data and needs a method to handle it. Mihai et al. analysed the problems of extracting event-related information from Wikipedia using clustering of Wikipedia articles [10]. Thomas et al. proposed a system for Wikipedia stream clustering to categorize various article edits performed on Wikipedia; this system failed to handle the streams with identities for the arriving streams instead of storing the edit contents. Jean et al. developed an anytime data stream clustering algorithm using a complex network construction model [11]; this model lacks the ability to handle drifts in stream data. Chunyong [12] proposed a model for anomaly detection based on data stream clustering; the focus was only on limited data, and it is not efficient when applied to large data.
The clustering process considerably affects the retrieval of contents. Andrei et al. [13] proposed a stream clustering method that uses a hierarchical structure for summarizing the streams. Dilip et al. [14] introduced stream clustering on time series data to organize the content with temporal information for easy retrieval with user queries. Dean et al. [15] used adaptive hierarchical stream clustering with MapReduce for handling large data, where hashing is used for easy retrieval. However, domain corpuses are not considered.
The variation in results between plain stream clustering and corpus-based stream clustering points to viable research. A reliable domain content management system should provide content clustering with recent updates on various topics. From the literature, it is observed that there is a lack of work on building a retrieval system for edit-stream-generating platforms such as Wikipedia. Hence, in this work a hierarchical stream clustering-based Wikipedia edit retrieval system is developed. The unique contribution of this work is that live Wikipedia edits are collected instantaneously. Another main contribution is that the streams are not stored for cluster generation; instead, the metadata of the Wikipedia edits is stored in the clusters. Queries are applied to the grouped data, and the stream contents are retrieved instantly using the metadata available in the clusters.

Wikipedia edit stream retrieval system
The Wikipedia edit Stream Retrieval System (WRS) is shown in figure 1. The system consists of three major components: the Wikipedia Edit Stream Data Source, the Wikipedia Edit Stream Clusterer, and the Wikipedia Edit Stream Retriever.
Wikipedia edits are generated for different topics in various domains. Domain-specific corpuses pertaining to 9 different domains are given as input to the keyword extractor. Keyword extraction is carried out by concurrently scanning the Wikipedia edit streams. Stream clustering groups the streams based on the keywords present in them, and the clustering is performed in parallel on the arriving edit streams. Only the metadata of the clustered edit streams is buffered; this metadata is used for accessing the streams from online Wikipedia. The buffer holds the metadata of the edit streams clustered by the hierarchical stream clusterer, and the edit stream retriever retrieves the recent edit contents based on the user-given query.
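The idea of buffering only stream metadata can be sketched as follows. This is a minimal illustration, not the paper's implementation; the field names (article id, title, timestamp, keywords) follow the metadata items mentioned later in the text, and the buffer structure is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class EditStreamMeta:
    """Metadata kept per Wikipedia edit stream (the edit text itself is not stored)."""
    article_id: int
    title: str
    timestamp: str                 # ISO-8601 arrival time of the edit
    keywords: list = field(default_factory=list)

class MetaBuffer:
    """Buffers only the metadata of clustered edit streams, keyed by domain."""
    def __init__(self):
        self._by_domain = {}

    def add(self, domain, meta):
        self._by_domain.setdefault(domain, []).append(meta)

    def lookup(self, domain):
        return self._by_domain.get(domain, [])

buf = MetaBuffer()
buf.add("sports", EditStreamMeta(64903, "Olympic Games",
                                 "2021-07-23T09:00:00Z", ["olympics"]))
print(len(buf.lookup("sports")))   # 1
```

Because only metadata is retained, the buffer stays small regardless of how large the individual edit contents are.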

Similarity Computation in Hierarchical clustering model
The data are arranged in a hierarchical structure of clusters. The main task in hierarchical clustering is deciding how data elements are assigned to form the hierarchical structure. In agglomerative hierarchical clustering, cluster formation starts with each data element as a separate cluster. Subsequent steps proceed by grouping similar data elements with the initial clusters.
Agglomerative clustering starts from a set of data elements O and produces a sequence of groupings S_0, S_1, ..., S_{n-1}, where S_0 places each data element in its own singleton cluster and each subsequent grouping merges the two closest clusters. The distance between clusters is determined by complete linkage, defined as

    d(C_1, C_2) = max { d(x, y) : x in C_1, y in C_2 }

where n1 = number of data elements in Cluster 1 and n2 = number of data elements in Cluster 2. In the centroid method, the dissimilarity is calculated as the distance between the cluster centroids. The corresponding distance update is known as the Lance-Williams formula; for the centroid method it takes the form

    d(C_3, C_1 U C_2) = (n1 / (n1 + n2)) d(C_3, C_1) + (n2 / (n1 + n2)) d(C_3, C_2) - (n1 n2 / (n1 + n2)^2) d(C_1, C_2)

where n1, n2 and n3 are the numbers of data elements in Clusters 1, 2 and 3, respectively. In this proposed work, the dissimilarity of clusters is calculated using the complete linkage method. In addition, the fuzzy set concept is applied to distinguish similar clusters with slightly dissimilar article edit contents. The mathematical description of fuzzy sets is as follows.
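The complete-linkage criterion and one agglomerative merge step can be sketched as follows. This is a minimal illustration on 1-D points under an absolute-difference metric, not the paper's implementation.

```python
# Complete-linkage distance between two clusters: the maximum pairwise
# distance between their members (1-D points, absolute difference).
def complete_linkage(c1, c2):
    return max(abs(x - y) for x in c1 for y in c2)

# One agglomerative step: merge the pair of clusters with the smallest
# complete-linkage distance.
def merge_closest(clusters):
    i, j = min(
        ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
        key=lambda p: complete_linkage(clusters[p[0]], clusters[p[1]]),
    )
    merged = clusters[i] + clusters[j]
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [merged]

clusters = [[1.0], [1.2], [5.0], [5.3]]
clusters = merge_closest(clusters)   # merges the closest pair, [1.0] and [1.2]
print(clusters)                      # [[5.0], [5.3], [1.0, 1.2]]
```

Repeating `merge_closest` until one cluster remains yields the full agglomerative hierarchy.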
Given a set of objects, X = {x1, ..., xn}, a fuzzy set S on X permits each item in X to have a membership degree in the range 0 to 1, i.e., F_S : X -> [0, 1]. The fuzzy set concept is applied to groups: given a set of objects, a cluster is a fuzzy set of objects, called a fuzzy cluster, and a clustering consists of several fuzzy clusters of objects.
Given a set of objects, o_1, ..., o_n, a fuzzy clustering of k fuzzy clusters, C_1, ..., C_k, can be represented using a partition matrix M = [w_ij] (1 <= i <= n, 1 <= j <= k). The partition matrix must satisfy the following three requirements:
1. For each object o_i and cluster C_j, 0 <= w_ij <= 1. This requirement enforces that a fuzzy cluster is a fuzzy set.
2. For each object o_i, the sum over j = 1..k of w_ij equals 1. This requirement ensures that every object participates in the clustering equally.
3. For each cluster C_j, 0 < sum over i = 1..n of w_ij < n. This requirement ensures that every cluster has at least one object with a nonzero membership value, and that no cluster contains all objects with full membership.
Let c_1, ..., c_k be the centers of clusters C_1, ..., C_k, respectively; a center can be either a mean or a medoid. The similarity or distance between the center of a cluster and an object measures how well the object belongs to that cluster: for any object o_i and cluster C_j with w_ij > 0, dist(o_i, c_j) measures the compactness of the object with respect to the cluster. Since an object may belong to more than one cluster, the sum of its distances to the corresponding cluster centers, weighted by its degrees of membership, captures how well the object fits the clustering. For an object o_i, the sum of squared errors (SSE) is

    SSE(o_i) = sum over j = 1..k of w_ij^p * dist(o_i, c_j)^2

where the exponent p >= 1 controls the influence of the membership degrees: the larger the value of p, the larger their influence. Accordingly, the SSE for a cluster C_j is

    SSE(C_j) = sum over i = 1..n of w_ij^p * dist(o_i, c_j)^2

and the SSE of the clustering is SSE(C) = sum over j = 1..k of SSE(C_j).
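The SSE of a fuzzy clustering can be computed directly from the partition matrix. The sketch below uses 1-D objects and squared Euclidean distance for brevity; the membership values are illustrative.

```python
# Weighted SSE of a fuzzy clustering: each object contributes to every
# cluster in proportion to its membership degree raised to the power p.
def sse(objects, centers, W, p=2):
    total = 0.0
    for i, o in enumerate(objects):
        for j, c in enumerate(centers):
            total += (W[i][j] ** p) * (o - c) ** 2
    return total

objects = [1.0, 2.0, 9.0]
centers = [1.5, 9.0]
W = [[1.0, 0.0],   # membership degrees per object; each row sums to 1
     [0.9, 0.1],
     [0.0, 1.0]]
print(sse(objects, centers, W, p=2))
```

Raising p increases the penalty asymmetry: objects with split memberships contribute progressively less to clusters they only weakly belong to.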

Results and Analysis
More than 300000 Wikipedia edits were observed over a 1-week period. Since it is tedious to tabulate the entire results, a sample of 19604 Wikipedia edits, related to 9 domains and collected over a 6-hour interval, was considered for the result analysis. The domain corpuses used here were collected from various data repositories. During the experimentation, each of the domain-related words is compared with the edit streams for keyword extraction. The major domains considered in this work are politics, healthcare, business, sports, education, electronics, nature, software and travel. The politics domain comprises almost 97000 different keywords [17]. The healthcare domain consists of around 60000 keywords [18]. The business domain contains around 600000 related words [19]. The sports domain consists of around 32000 sports-related keywords [20]. The education domain consists of nearly 84000 keywords [21]. The keywords for the education, nature, software and travel domains are taken from Wikipedia hierarchical corpuses [22]. The Stanford POS tagger has been applied for tagging the terms present in the edit streams [19].
The similarity of the stream content is calculated and combined with corpus-based keyword tagging to place the streams into an appropriate cluster. Incorporating stream omission after the similarity check and considering the Wikipedia article edit ID increases the performance of stream clustering. Equal importance is given to edits received at various periods on different topics, while priority is given to the most recent Wikipedia edits, since similar content needs to be checked frequently in order to retrieve it for users. As content-based similarity depends largely on the similar keywords present in the edit streams, the corpus-based keyword extraction and similarity measure are assigned a higher weight in the applied method. The Hierarchical Stream Clustering (HSC) algorithm is given as follows.
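The weighting scheme described above can be sketched as a combined score in which corpus-matched keywords count more than plain content overlap. The 0.7/0.3 split below is an illustrative assumption, not the paper's tuned value, and the function name is hypothetical.

```python
# Combined similarity: corpus-matched keywords get higher weight than
# plain content overlap, reflecting the weighting described in the text.
def stream_similarity(stream_terms, cluster_terms, corpus, w_corpus=0.7):
    stream_terms, cluster_terms = set(stream_terms), set(cluster_terms)
    shared = stream_terms & cluster_terms
    if not shared:
        return 0.0
    content = len(shared) / len(stream_terms | cluster_terms)   # Jaccard overlap
    corpus_part = len(shared & corpus) / len(shared)            # fraction of shared terms in the domain corpus
    return w_corpus * corpus_part + (1 - w_corpus) * content

sports_corpus = {"olympics", "medal", "sprint"}
s = stream_similarity(["olympics", "tokyo", "medal"],
                      ["olympics", "medal", "relay"], sports_corpus)
print(round(s, 3))   # 0.85
```

A stream is then assigned to the cluster maximizing this score, which biases assignment toward domain-indicative terms rather than incidental word overlap.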

Algorithm: Hierarchical Stream Clustering (HSC)
Inputs: streams S_1, S_2, ..., S_n; corpuses c_1, c_2, ..., c_k.
Output: the clustered Wikipedia edit streams.

The clustered Wikipedia edit streams are shown in table 1. Only the streams of 4 sub-domains of 3 domains are listed. The number of streams grouped under each domain, with the prominent keywords, is mentioned in the table. Each cluster contains the streams with their time and date information, which helps in retrieving the streams topic-wise for a specific period.
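Since the algorithm body is described only in prose, the following is a hedged sketch of one interpretation of the HSC loop: extract corpus keywords from each arriving stream, assign the stream to the best-matching domain cluster, and buffer only its metadata. All names and structures here are illustrative assumptions.

```python
def hsc(streams, corpuses):
    """Assign each edit stream to the domain whose corpus it matches best,
    keeping only metadata in the resulting clusters."""
    clusters = {d: [] for d in corpuses}
    for s in streams:
        words = set(s["text"].lower().split())
        # pick the domain whose corpus shares the most keywords with the edit
        domain = max(corpuses, key=lambda d: len(words & corpuses[d]))
        if words & corpuses[domain]:
            # buffer metadata only; the edit text itself is discarded
            clusters[domain].append({"article_id": s["article_id"],
                                     "title": s["title"],
                                     "time": s["time"],
                                     "keywords": sorted(words & corpuses[domain])})
    return clusters

corpuses = {"sports": {"medal", "olympics"}, "politics": {"election", "senate"}}
streams = [{"article_id": 1, "title": "Tokyo 2020", "time": "2021-07-23T09:00Z",
            "text": "Olympics medal table updated"},
           {"article_id": 2, "title": "US Senate", "time": "2021-07-23T10:00Z",
            "text": "Senate election results"}]
out = hsc(streams, corpuses)
print(len(out["sports"]), len(out["politics"]))   # 1 1
```

The paper additionally builds sub-clusters within each domain (major and minor keywords); this sketch shows only the top-level, metadata-only grouping.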
To demonstrate the effectiveness of WRS, the article-id (metadata) based cluster representation is analysed, since stream data is not stored in the clusters. The clusters contain the article title, article id, major keywords and minor clusters. The article id helps in later retrieving the edit streams from the online Wikipedia platform based on the user query. The clusters with the Wikipedia article metadata are illustrated in table 2.
The clusters are generated based on the similarity of the content in the streams. In addition, the main cluster and the sub-clusters are formed based on the content received in the streams during a particular time interval. The cluster generation process is shown for a single domain in figure 2, with the minor keywords at the outer level, the major keywords at the inner level and the cluster in the middle.

Query Processing
Hierarchically clustered Wikipedia edits are more informative than data clustered without hierarchical clustering. The Wikipedia edits have been clustered hierarchically without any pre-specified number of clusters. These hierarchical cluster results are more deterministic and provide higher retrieval accuracy [24].
The clustered metadata is evaluated against a user query asking for the recent edits performed in the last 1-hour period. The result is obtained by parsing the user-given query: the time information is extracted from the query and checked against the clustered metadata. Further, the article IDs corresponding to the particular time interval are identified, and the number of edits is counted under each cluster. The user query and its cluster contents are shown in table 3.
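The time-window filtering step can be sketched as follows. This is a minimal illustration; the metadata field names are assumptions consistent with the cluster contents described in the text.

```python
from datetime import datetime, timedelta

# Return the metadata entries whose edit time falls within the query
# window (here, the last hour relative to the query time).
def recent_edits(metas, query_time, window=timedelta(hours=1)):
    return [m for m in metas
            if query_time - window <= m["time"] <= query_time]

now = datetime(2021, 7, 23, 12, 0)
metas = [{"article_id": 1, "time": datetime(2021, 7, 23, 11, 30)},
         {"article_id": 2, "time": datetime(2021, 7, 23, 9, 0)}]
print([m["article_id"] for m in recent_edits(metas, now)])   # [1]
```

Counting the survivors per cluster then gives the per-cluster edit counts reported in table 3.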
The most-edited Wikipedia topics and domains are identified by parsing the user query and matching it against the number of edits under various topics in each domain. The topics in each domain are identified and shown in table 4.
It is also useful to know the countries from which most of the Wikipedia edits originate. The user-given query is parsed for the keyword 'country'. The IP address and geolocation information of the edits are identified, and the city name, region name and country name are then resolved from the IP address.
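The IP-to-location resolution can be sketched as a lookup against a local table. A real deployment would use a geolocation database or service, which is not assumed here; the prefixes and locations below are illustrative placeholders (drawn from reserved documentation address ranges).

```python
# Illustrative IP-to-(city, region, country) lookup against a small local
# table keyed by /24 prefix; placeholder data, not a real geolocation source.
GEO_TABLE = {"203.0.113": ("Chennai", "Tamil Nadu", "India"),
             "198.51.100": ("Berlin", "Berlin", "Germany")}

def locate(ip):
    prefix = ip.rsplit(".", 1)[0]   # drop the host octet
    return GEO_TABLE.get(prefix, ("unknown", "unknown", "unknown"))

city, region, country = locate("203.0.113.7")
print(country)   # India
```

Aggregating the country field over all edit metadata yields the per-country edit counts of table 5.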
The countries with the most edits are shown in table 5. The Wikipedia edit information with its originating location (geolocation), country, domain, keyword and wiki edit content is displayed in table 6.
The recent Wikipedia edits are clustered, and the cluster-wise edits are shown in figure 3. The Wikipedia edits matching user-given keywords are retrieved from the cluster data; the retrieved results are shown in figure 4 for different user-given queries.

Fig. 3 Cluster wise Wikipedia Recent Edits
Various user queries are posed against the cluster data, and the Wikipedia edit streams are retrieved based on the user-given query. The edit streams are available on the online MediaWiki platform, from where they need to be retrieved for every query. Since the streams are clustered by retaining only their metadata, retrieval is based on this metadata; the article id is the prominent item of metadata used to retrieve an edit stream online. A Wikipedia edit stream cannot be retrieved in a single step just by passing the article id: the article id must be appended to the Wikipedia URL, and retrieval proceeds in two stages, namely checking the article id and then getting the Wikipedia page related to that article id. This further helps in reaching the edit location and returning results to the users. Since these processes are involved, it is necessary to analyse the time taken to process a user-given query and obtain the Wikipedia edit streams. The time taken for processing the various queries is displayed in table 7. The edits retrieved from the user-given query using the article IDs available in the clusters, and the obtained results, are shown in table 8.
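The two-stage, article-id-based retrieval can be sketched as URL construction against the public MediaWiki action API. The endpoint and parameters below follow that API; whether the paper used exactly these calls is an assumption.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

# Stage 1: check that the article id resolves to a page.
def check_url(article_id):
    return API + "?" + urlencode({"action": "query",
                                  "pageids": article_id,
                                  "format": "json"})

# Stage 2: fetch the latest revision content for that article id.
def revision_url(article_id):
    return API + "?" + urlencode({"action": "query",
                                  "pageids": article_id,
                                  "prop": "revisions",
                                  "rvprop": "content",
                                  "format": "json"})

print(check_url(64903))
```

Issuing these two requests per query explains why query time grows with the number of matching article ids, motivating the timing analysis in table 7.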

Comparison with existing methods
The performance of the proposed Wikipedia edit retrieval system (WRS) has been compared with the existing system WikiAutoCat [23]. Precision and recall have been measured, and significant performance improvement has been observed; the comparative results are tabulated in table 9. The WikiAutoCat system performs categorization of Wikipedia articles, but it has been evaluated only with the Wikipedia data repository, not with streams collected at different time periods. These issues have been tackled in the proposed system, which has been evaluated with stream data collected at various time periods and achieves better precision and recall in data retrieval from the clusters. The accuracy has also increased compared with the existing Wikipedia categorization systems; the accuracy results are tabulated in table 10.
The precision of WRS is compared with that of the WikiAutoCat system on four different datasets, and improved precision values are observed on each dataset. Since the existing WikiAutoCat system performs retrieval based on the title and body of the Wikipedia article contents alone, not all relevant articles are retrieved effectively. Because the proposed WRS uses hierarchical clustering, the number of relevant articles retrieved is high. Thus, it is evident from the graph illustrated in figure 5 that WRS achieves higher precision than WikiAutoCat.
The recall of WRS is compared with that of WikiAutoCat on the same four datasets. On each dataset, higher recall is observed for WRS than for WikiAutoCat: a higher number of relevant Wikipedia article edits is present in the retrieved edits. The recall of WRS and WikiAutoCat is plotted in figure 6. Since the proposed system uses effective hierarchical stream clustering, its accuracy is also high compared to WikiAutoCat; the accuracy of WRS and WikiAutoCat on the four datasets is illustrated in figure 7.
It is also necessary to analyse the clustering time of the system. The Wikipedia article edits are clustered under various topics, and the time taken for clustering varies with the number of article edits. The time analysis is shown in figure 8: it is evident from the graph that the proposed Wikipedia edit categorization and retrieval system takes less time than the WikiAutoCat categorization system from the literature. The query time to retrieve the metadata and article id from the clusters has been evaluated with various matching methods [25], as shown in figure 9. The query time reduces dramatically with the hierarchical method (1045 ms) compared to exhaustive search (5640 ms) and other methods.

Conclusion
In this paper, a hierarchical clustering-based Wikipedia edit stream retrieval system is proposed. The recent Wikipedia edit streams are observed, and domain-specific keywords are extracted from the edit streams using various domain corpuses. The edit streams are clustered under several major and sub-categories, and the clusters keep only the metadata of the edit streams. Further, user queries are evaluated and the relevant edit streams are retrieved. Stream clustering has helped to answer user queries for recent edits and to generate the clusters immediately from the arriving streams. The user queries are appropriately answered, and the recent edit streams are effectively retrieved using the cluster data. The WRS system has been compared with the state-of-the-art Wikipedia clustering system WikiAutoCat, and the proposed system achieves reduced clustering time and improved precision, recall and accuracy.

Funding
Not Applicable.

Conflicts of interest
The authors declare that they have no conflict of interest.

Availability of data material
The data has been collected from Wikipedia using its API. We have not stored these data, as it is not possible to store the streams. The experiments have been carried out by implementing the retrieval system as a web application in C# on the .NET environment. If required, kindly contact the authors to get the code repository link.

Code Availability
The code has been uploaded to an online repository. If required, kindly contact the authors to get the code repository link.