SMAFED: Real-Time Event Detection in Social Media Streams

Interactions on social media platforms have made it possible for anyone, irrespective of physical location, to gain quick access to information on events taking place all over the globe. However, the semantic processing of social media data is complicated by challenges such as language complexity, unstructured data, and ambiguity. In this paper, we propose the Social Media Analysis Framework for Event Detection (SMAFED). SMAFED aims to facilitate improved semantic analysis of noisy terms in social media streams, improved representation/embedding of social media stream content, and improved summarisation of event clusters in social media streams. To this end, we employed key concepts such as an integrated knowledge base, ambiguity resolution, semantic representation of social media streams, and Semantic Histogram-based Incremental Clustering based on semantic relatedness. Two evaluation experiments were conducted to validate the approach. First, we evaluated the impact of the data enrichment layer of SMAFED and found that SMAFED outperformed other pre-processing frameworks, with a lower loss of 0.15 on the first dataset and 0.05 on the second. Second, we determined the accuracy of SMAFED at detecting events from social media streams. This second experiment showed that SMAFED outperformed existing event detection approaches, with better Precision (0.922), Recall (0.793), and F-Measure (0.853) scores. The findings of the study present SMAFED as a more efficient approach to event detection in social media.


INTRODUCTION
Event detection in social media streams is a worthwhile research area because it provides the opportunity to learn about current happenings and people's views at the click of a button. Gone are the days when a person could 'kill' news they did not want others to know about by suppressing it through a news agency or organisation. This is no longer possible with the existence of various social media platforms. Once a piece of interesting news gets onto social media, it travels fast and reaches a wide audience, who continue to spread it until a desirable action is taken. Thus, research in this area is important for objectively revealing current events and happenings, such as breaking news, sudden outbreaks, infectious diseases, and terror attacks [1].
We are in the era of social media, with abundant data and opportunities to be exploited. Social media enables us to take advantage of the social nature of human association, making it possible for individuals to express their feelings, become part of a virtual network, and collaborate remotely [2].
Wavelet-based Signals (EDCoW). Authors in [21] employed Term Frequency (TF) and Kullback-Leibler Divergence (KLD) to propose real-time summarisation of scheduled events from Twitter streams. The work addressed the summarisation of tweet content to provide the user with a summed-up stream describing the key sub-events, employing a two-step process: sub-event detection using an outlier-based technique, and selection of tweets related to each detected sub-event to provide a summary. For the summary, the TF and KLD techniques were compared, and KLD was found to perform better. Authors in [15] proposed a framework to detect events in social streams using similarity scores and cluster summarisation techniques; they used content- and network-stream-based clustering for event detection. Mining spatio-temporal information on microblogging streams using a density-based online clustering method was proposed by [22]. The paper investigated the extraction of spatio-temporal features of social media streams by employing an incremental Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to enhance event awareness. A weighting factor called BursT, a sliding-window technique to address concept drift, was employed. However, none of these approaches focused on handling or analysing the SAB terms prevalent in social media streams.
An approach based on the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm was proposed by [23]. The work focused on detecting localised events and tracking their evolution. Spatio-temporal characteristics of keywords were continuously extracted using the entropy of the spatial signature. A single-pass clustering algorithm, BIRCH, was used to group event keywords based on the cosine similarity of their spatial signatures, and the top-k scoring clusters were considered possible event clusters. In the pre-processing stage, stop-words were removed, stemming was applied, and WordNet dictionary lookups were performed; WordNet is a lexical database of English words. Authors in [24] employed multiple social media feed features such as titles, descriptions, textual location, location proximity, and date, along with Term Frequency-Inverse Document Frequency (TF-IDF) and Normalised Mutual Information Frequency, to detect events. In the same vein, [25] traced the German centennial flood in the stream of tweets. The authors applied a density-based clustering method called Ordering Points To Identify the Clustering Structure (OPTICS) to group a set of flood-related tweets with respect to time and location; the result was validated with subsidiary data sources. Fuzzy hierarchical agglomerative clustering was used in the TweetMogaz framework for identifying new stories in social media [26]. The framework used an adaptive method to track relevant tweets, and fuzzy hierarchical agglomerative clustering with term co-occurrence probability as the distance measure to identify hot stories from tweets with enough content. To overcome the problem of duplicate story detection, cosine similarity was computed between the vectors of the two clusters.
Also, real-time entity-based event detection for Twitter was proposed by [16]. The approach identified bursty named entities and then clustered tweets based on the occurrence of the named entities using a cosine distance similarity score. Multiscale event detection in social media was presented by [27]. The authors explored the properties of the wavelet transform and proposed a novel algorithm to compute a data similarity graph at appropriate scales and simultaneously detect events of different scales through a single graph-based clustering process; the clustering is based on comparing common terms between pairs of tweets. [28] proposed a distributed and incremental framework for bursty event detection from microblogs. The paper focuses on detecting events from Weibo (a microblog) on the Spark engine, taking topic drift into consideration, by employing a distributed and incremental temporal topic model, Bursty Event dEtection (BEE+). Online indexing and clustering of social media data for emergency management was presented by [29]. The authors implemented online indexing techniques (incremental TF-IDF, Skewness, and the Learn & Forget Model), and clustering was evaluated using the Silhouette and Davies-Bouldin metrics. Authors in [12] presented three different approaches to merging information from two different social media sources using time-evolving graphs, demonstrating that using information from multiple data streams increases the quality and quantity of detected events. An event detection system that uses inverted indices and incremental clustering algorithms was proposed by [30]. Burst detection based on the volume of tweets without considering their context may be misleading, because co-occurring terms in tweets may not be synonymous once the context in which they are used is taken into consideration. Real-time event detection on social data streams was conducted by [31].
Events were modelled as a list of clusters of trending entities over time using entity co-occurrence, Louvain clustering, and aggregate ranking. The approach only considered the length of the words contained in a tweet and did not consider the local and global importance of such representatives; in addition, SAB terms were not handled. [32] proposed a multimedia big data system that used an incremental clustering event detection approach enriched with the analysis of multimedia content, together with a bio-inspired influence analysis technique, to support alert spreading and situation awareness over the network. Target-aware holistic influence maximisation in spatial social networks was carried out by [33]. The authors proposed a diffusion model that captures both physical and cyber user interactions, along with a spatial social index based on an R-tree that computes users' interest similarity with respect to online keyword queries. Synthetic datasets and three real datasets were used to validate the effectiveness of the proposed model. However, these approaches did not handle the semantic analysis of SAB in social media content. A summary of unsupervised learning approaches to event detection is shown in Table 1.

Semi-supervised Learning for Event Detection in Social Media Stream
Semi-supervised learning models are trained on a combination of unlabelled and labelled data; more specifically, a small proportion of labelled data is combined with a great deal of unlabelled data. Some of the event detection efforts based on semi-supervised learning are now presented.
Authors in [35] identified and characterised social media events by using generic event detection and topic-specific event detection with TF-IDF and Naïve Bayes. Civil unrest prediction with a Tumblr-based exploration was reported by [36]. The authors focused on detecting civil unrest by continuously applying text-based filters (keyword, location, and future-date filters) to the Tumblr data stream. A semi-supervised method for Automatic Targeted-domain Spatio-temporal Event Detection (ATSED) in Twitter using historical and real-life Twitter streams was proposed by [37]; the method was suitable for event detection from historical data but not for real-time event detection. SPOTHOT, scalable detection of geo-spatial events in large textual streams, was proposed by [38]. The authors proposed the SigniTrend event detection system, capable of tracking unusual occurrences of arbitrary words at arbitrary locations in real-time without specifying the terms of interest in advance. None of these methods handled the noisy characteristics inherent in social media data.
Also, [39] implemented various algorithms, such as k-means, hierarchical agglomerative clustering, and Latent Dirichlet Allocation (LDA) topic modelling, on the Twitter stream to analyse real-time Twitter data and empower citizens by keeping them updated about what is happening around the city. In the pre-processing stage, hashtags, stop-words, URLs, and special characters were removed and stemming was applied, but there was no treatment of SAB terms. Authors in [40] proposed a model for detecting and tracking breaking news from Twitter in real-time by employing the Multinomial Naïve Bayes classifier and DBSCAN algorithms. The proposed model could not dynamically learn from the available news sources. The pre-processing stage removed tags, mentions, URLs, and non-ASCII characters but did not address the SAB terms prevalent in social media posts. [17] proposed a framework for detecting news events from the Twitter stream in real-time. The approach used an ANN to classify news-relevant tweets from the stream based on AvgW2V, and Mini-batch clustering to group detected tweets into events. [41] worked on sub-story detection in Twitter with a hierarchical Dirichlet process, proposing a Hierarchical Dirichlet Process (HDP) to address the problem of automatically detecting sub-stories associated with a main story. Like the others, none of these research efforts considered resolving the ambiguity of SAB terms during the analysis of social media content. A summary of semi-supervised approaches to event detection is presented in Table 2.

Supervised Learning for Event Detection in Social Media Stream
Supervised learning models are the class of machine learning algorithms that can extrapolate a prediction or classification function after being trained on labelled sample data. Each training example consists of an input (vector) and an output (supervisory signal). Instances are of the form (x, y), where x is a vector and y is the class or target attribute (a scalar). Supervised learning approaches typically build a model that maps x to y by finding a mapping m(·) such that m(x) = y. Given m(·) learned from the training data, the outcome m(x) of an unlabelled instance x can be computed. Some instances of supervised learning applied to event detection are presented next.
Authors in [42] proposed geo-spatial event detection in the Twitter stream. Machine learning algorithms (Naïve Bayes, multilayer perceptron, and pruned C4.5) were used to analyse whether geo-spatial clusters contain real-life events. The detected events (candidate clusters) were displayed, ranked by individual tweet score in descending order, on a map with their locations in real-time. [43] proposed a graphical model, the location-time-constraint topic model, LTT (an improvement over LDA), to capture social media time, content, and location data. Kullback-Leibler (KL) divergence was used to measure the similarity of uncertain media content, and social events were detected using a hash-based indexing scheme, Variable Dimensional Extendible Hash (VDEH). The LTT model was refreshed after every block of tweets in an incoming time slot to accommodate topic drift. A Transaction-based Rule Change Mining (TRCM) framework that applied association rule mining to extract association rules from tweet hashtags was proposed by [44]. Unexpected changes in the consequent and conditional rules in each time slot were ranked, and the detected hashtags were then compared with the key terms in the ground truth from BBC Sport commentary within the same time frame. [45] studied real-time top-R topic detection on Twitter with topic hijack filtering. The extraction of meaningful topics and the filtering of noisy messages over the Twitter stream were integrated using streaming Non-negative Matrix Factorisation (NMF); there were false detections of hijacked topics due to model misspecification. A Twitter life detection framework was presented by [46] using TF-IDF for similarity scoring and SFPM for classification. TF-IDF directly computes document similarity in the word-count space, which is usually slow for large vocabularies. None of these approaches focused on treating SAB terms in tweets.
A multimodal classification of events in social media using TF-IDF and SVM was presented by [47]. The pre-processing stage removed stop-words, special characters, numbers, emoticons, HTML tags, and words with fewer than four characters. [48] proposed audio-based multimedia event detection using recurrent neural networks. The authors introduced longer-range temporal information with a recurrent neural network for feature representation and classification to determine whether a given event can be traced to a video. In the same vein, multimedia event detection was presented by [49], who proposed algorithms for detecting complex events from web videos using a two-stage convolutional neural network. Our focus in this paper is on social media content, which is very noisy because it is user-generated. These efforts on multimedia event detection did not specifically address the SAB terms and grammatical errors that such data may contain.
Also, [50] proposed an approach to detect foodborne disease from Weibo data using TextRank and SVM, with the SVM used to filter unwanted posts. However, the proposed framework was found to perform poorly in the face of sparsity and concept drift. A deep learning approach to detecting traffic accidents from social media was developed by [51]. Tokens and paired tokens were extracted from over 3 million tweets, and Deep Belief Network and Long Short-Term Memory deep learning models were applied to them to detect traffic accident information. [52] proposed a hate speech detection model to identify hatred against vulnerable minority groups using Amharic text data from Facebook. The Apache Spark distributed platform was used for data pre-processing and feature extraction, Word2Vec was used as the embedding model, and a Gated Recurrent Unit (GRU) was used for the classification stage. Table 3 summarises supervised learning approaches that have been applied to event detection.

Semantic-based Approaches for Event Detection in Social Media Stream
Authors in [53] worked on scalable distributed event detection using Twitter streams. The paper proposed scalable, automatic, distributed real-time event detection by incorporating a lexical key partitioning strategy (hash key grouping borrowed from LSH) to spread the detection process across multiple machines while avoiding partitioning the stream into a series of subsets. The proposed framework was implemented on a Storm topology. No pre-processing was done, even though Twitter streams are noisy, temporal, and full of slang. [53] proposed Locality Sensitive Hashing (LSH) to detect events from Twitter and Facebook. LSH was used twice in the event detection process: first to obtain events from Twitter and Facebook independently, and later to detect cross-over events in the two social media streams. LITMUS, a system that used keywords to extract social media data related to "landslide", was proposed by [55]. The system employed an augmented Explicit Semantic Analysis (ESA) algorithm with a semantic interpreter, extracting a subset of Wikipedia as classification features to classify data into relevant and irrelevant, and used semantic clustering based on semantic distance for location estimation. Only geo-tagged data were considered, not the entire dataset. These semantic-based approaches did not consider the treatment of SAB terms in their analysis.
Authors in [56] presented ArmaTweet, a system that used natural language processing techniques to extract structured information from tweets and then integrated it with RDF data from DBpedia and WordNet. The system used semantic queries to identify tweets matching the user's interest and passed them to an anomaly detection algorithm to determine their correspondence to actual events. This improves on keyword search and is suitable for topic-specific event detection. However, the precision of the pre-processing component was not investigated in the face of the acronyms, slang, abbreviations, and passive words prevalent in social media data.
A framework for event classification in tweets based on hybrid semantic enrichment using TF-IDF, Named Entity Recognition, PageRank, and CfsSubsetEval was proposed by [57]. Semantic enrichment was combined with external document enrichment and named entity extraction to classify tweets. [58] proposed an event detection model based on scoring and word embedding to discover key events from a high volume of data streams. In the pre-processing stage, stop words, modal auxiliary verbs, URLs, and emoticons were removed; Word2Vec was used for embedding, and an improved Expectation-Maximisation algorithm was used for the event detection stage. Word2Vec, however, is limited to calculating word similarities. None of these approaches considered resolving the ambiguity associated with SAB terms in social media. A summary of instances of applying semantic-based approaches to event detection is presented in Table 4. Our review of the literature revealed that existing event detection methods have focused mostly on filtering out SAB terms, removing noisy terms including SAB, or ignoring them entirely during the pre-processing of social streams. They did not perform semantic analysis of noisy terms like SAB to determine their contextual meanings and their impact on the accuracy of results.
These noisy terms include short messages, slang, acronyms, mixed languages, grammatical and spelling errors, and dynamically evolving, irregular, informal, and abbreviated words with improper sentence structure, all of which challenge the efficient performance of learning algorithms [6, 59]. This gap, not addressed by previous research efforts, necessitated our study.
According to [30], social media streams must be represented in a way that preserves the semantics of their content. Hence, using the contextual clues surrounding a social media stream is critical for useful and accurate results. Thus, there is a need to develop an event detection framework that focuses on the semantic analysis of slang, acronym, and abbreviation (SAB) terms in social media streams and the ambiguity associated with their usage, to improve the accuracy of event detection. Most previous research efforts have not addressed this problem, which is where SMAFED seeks to make a difference. A summary of the strengths and weaknesses of existing event detection techniques and their attributes is presented in Table 5.

METHODOLOGY
This paper proposes the Social Media Analysis Framework for Event Detection (SMAFED) as an efficient and integrated social media stream analysis approach incorporating social media stream pre-processing and enrichment to improve event detection results.

Problem Formulation: Event Detection in Social Media Streams
In our approach, detecting events from social media streams consists of 10 main tasks based on the Input-Process-Output model. The input part contains the first task, the second to seventh tasks constitute the process, and tasks eight to ten constitute the output. We now present the formal definitions of the main tasks of our approach as follows. Input:

Definition 3.1 (Data Streams Collection)
A stream S = e_1, e_2, …, e_n is an ordered sequence of objects or points, where e_i denotes the i-th object or point observed by the algorithm. For t > 0, let S(t) denote the first t entries of the stream: e_1, e_2, …, e_t. For 0 < i ≤ j, let S(i, j) denote the substream e_i, e_{i+1}, …, e_j. Define S = S(1, n) to be the whole stream observed up to e_n, where n is, as before, the total number of objects or points observed so far.

Process: Definition 3.2 (Tokenise Tweets)
Given a stream S = e_1, e_2, …, e_n, where e_i represents an individual tweet, tokenise e_i into words w_i ∈ W, i = 1, 2, …, m, where w_i is an individual word and W is the set of all words in e_i.
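As an illustration of Definition 3.2, a tweet e_i can be split into its word list W with a simple pattern match. The regular expression below is a minimal, hypothetical stand-in for the NLTK tokenisation SMAFED actually uses; it keeps hashtags and mentions as single tokens so later stages can strip or analyse them.

```python
import re

# Minimal tweet tokeniser sketch: each match is one token w_i, and the
# returned list is W, the words of tweet e_i (Definition 3.2).
TOKEN_RE = re.compile(r"[#@]?\w+")

def tokenise(tweet: str) -> list[str]:
    """Return the list W of word tokens w_i for a single tweet e_i."""
    return TOKEN_RE.findall(tweet.lower())
```

For example, `tokenise("@user check #news")` keeps `@user` and `#news` intact as single tokens.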

Definition 3.3 (Lemmatise Tweets)
For each word w_i ∈ W, obtain the pair (k_i, r_i), where k_i is the POS tag with its WordNet value and r_i is the root word of w_i.
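A sketch of Definition 3.3: map a Penn Treebank POS tag to its WordNet tag k_i, then look the word up to obtain the root r_i. SMAFED performs this lookup with NLTK's WordNetLemmatizer; the tiny lemma table below is a toy stand-in so the example is self-contained.

```python
# Penn Treebank tag prefixes mapped to WordNet POS values (adjective,
# verb, noun, adverb) -- the k_i of Definition 3.3.
PENN_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

# Toy lemma table standing in for the WordNet lemmatiser lookup.
TOY_LEMMAS = {("running", "v"): "run", ("better", "a"): "good", ("cars", "n"): "car"}

def lemmatise(word: str, penn_tag: str) -> tuple[str, str]:
    """Return (k_i, r_i): the WordNet tag and root word for w_i."""
    k = PENN_TO_WORDNET.get(penn_tag[:1].upper(), "n")  # default to noun
    r = TOY_LEMMAS.get((word, k), word)                  # fall back to the word itself
    return k, r
```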

Definition 3.4 (Filtering SAB)
Given a stream S = e_1, e_2, …, e_n at time t, where e_i represents an individual tweet containing words w_i ∈ W, i = 1, 2, …, m, find each w_i such that w_i ∉ D, where D is the set of English words.
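Definition 3.4 amounts to a dictionary lookup: any token absent from the English word set D is treated as a SAB candidate. SMAFED uses the NLTK `words` corpus as D; the small set below is a stand-in so the sketch is self-contained.

```python
# Toy stand-in for D, the set of English words (SMAFED uses NLTK's corpus).
D = {"i", "am", "a", "when", "it", "comes", "to", "this", "profession"}

def filter_sab(tokens: list[str]) -> list[str]:
    """Return the tokens w_i with no entry in D: the SAB terms."""
    return [w for w in tokens if w.lower() not in D]
```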

Definition 3.5 (Disambiguating SAB)
Let the size of the social media stream context window, 2n + 1, be denoted as N. The local vocabulary (IKB) is the integrated database containing definitions, usage examples, and related terms of SAB. Let the IKB SAB terms in the context window be represented as W_i, 1 ≤ i ≤ N. If the number of IKB SAB terms is less than 2n + 1, all of the IKB SAB terms in the instance serve as the context.
Each SAB term W_i has one or more possible senses. Let the number of senses of SAB term W_i be represented as |W_i|. Each possible combination of senses for the SAB terms in the context window is evaluated; there are |W_1| × |W_2| × … × |W_N| such combinations, each referred to as a candidate combination. A combination score is computed for each candidate combination, and the target SAB term is assigned the sense from the candidate combination that attains the maximum score.
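The candidate-combination search of Definition 3.5 is a Cartesian product: with |W_i| senses per term there are ∏|W_i| combinations to score. The sketch below enumerates them with `itertools.product`; the scoring function is a toy placeholder supplied by the caller, standing in for SMAFED's combination score.

```python
from itertools import product

def best_combination(senses_per_term, score_fn):
    """senses_per_term: one sense list per SAB term in the context window.
    Enumerates the prod(|W_i|) candidate combinations and returns the
    combination that attains the maximum score under score_fn."""
    return max(product(*senses_per_term), key=score_fn)
```

A toy scorer that prefers sense identifiers ending in "b" illustrates the selection step.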

Definition 3.6 (Semantic Tweets Representation)
Given a context (source) embedding and a target embedding for each word w in the vocabulary, with embedding dimension h and k = |V|, the tweet embedding is the average of the context word embeddings of the constituent words, augmented by learned n-grams. The tweet embedding v_S for the current tweet S is modelled as v_S = (1/|R(S)|) Σ_{w ∈ R(S)} v_w, where R(S) designates the list of n-grams, including unigrams, present in sentence S.
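Definition 3.6 can be sketched as a plain average over n-gram vectors. The vocabulary below is a toy stand-in for sent2vec's learned embeddings; out-of-vocabulary n-grams contribute a zero vector, mirroring the Embedder's handling of unknown words.

```python
H = 3  # embedding dimension h (toy value)

# Toy stand-in for the learned sent2vec vectors, including one bigram.
VOCAB = {
    "good": [1.0, 0.0, 1.0],
    "game": [0.0, 1.0, 1.0],
    "good game": [1.0, 1.0, 0.0],
}

def embed_tweet(ngrams: list[str]) -> list[float]:
    """v_S = (1/|R(S)|) * sum of v_w over w in R(S); unknown n-grams
    are zero vectors and so contribute nothing to the mean."""
    zero = [0.0] * H
    vectors = [VOCAB.get(w, zero) for w in ngrams]
    return [sum(col) / len(ngrams) for col in zip(*vectors)]
```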

Definition 3.7 (Semantic Similarity among tweets)
Assume tweet x has m words x_1, x_2, …, x_m and tweet y has n words y_1, y_2, …, y_n. The semantic similarity matrix (SSM) for tweets x and y is the m × n matrix M with entries M_{s,t} = sim(x_s, y_t). The semantic similarity between a word x_s and tweet y is given as sim(x_s, y) = max_{1 ≤ t ≤ n} M_{s,t}. The semantic similarity between tweets x and y is then calculated as sim(x, y) = (Σ_{s=1}^{m} sim(x_s, y) + Σ_{t=1}^{n} sim(y_t, x)) / (m + n). The semantic relatedness of x_s and y_t is calculated by comparing glosses of synsets related to x_s and y_t through explicit relationships of the IKB.
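The matrix construction and two-way max-averaging of Definition 3.7 can be sketched as follows. The `word_sim` function here is a toy exact-match relatedness; SMAFED instead derives word relatedness from IKB gloss comparisons.

```python
def word_sim(a: str, b: str) -> float:
    """Toy word relatedness: 1 for identical words, else 0 (a stand-in
    for the IKB gloss-based relatedness)."""
    return 1.0 if a == b else 0.0

def tweet_similarity(x: list[str], y: list[str]) -> float:
    """Build the m x n SSM, take each word's best match in the other
    tweet, and average both directions over m + n words."""
    ssm = [[word_sim(a, b) for b in y] for a in x]
    x_to_y = sum(max(row) for row in ssm)                      # sum of sim(x_s, y)
    y_to_x = sum(max(ssm[s][t] for s in range(len(x)))         # sum of sim(y_t, x)
                 for t in range(len(y)))
    return (x_to_y + y_to_x) / (len(x) + len(y))
```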

Output: Definition 3.8 (Grouping of semantically similar tweets into clusters)
Given the assumption about the distribution of data stream uploads belonging to an event, the incremental clustering algorithm can be defined as follows. First, a small window W_1 of the data stream, of size |W_1| ≤ N, is clustered. As new data arrive in window W_2, clustering is performed again with |W_2| = 2 × |W_1|.
If certain clusters detected in window W_1 are re-detected in W_2, those clusters are regarded as "stable", and their items are removed from further clustering. For subsequent data stream clustering on window W_3, the items of C_s, the set of stable clusters, are therefore excluded.
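The stable-cluster retirement step of Definition 3.8 can be sketched by comparing the clusterings of two successive windows: clusters re-detected unchanged are marked stable, and their items are excluded from the next round. This is an illustrative simplification, not SMAFED's full windowing scheme.

```python
def stable_clusters(clusters_w1, clusters_w2):
    """clusters_w1/w2: sets of frozensets, the clusterings of windows
    W_1 and W_2 (|W_2| = 2 * |W_1|). Returns the stable clusters and
    the remaining material to cluster in W_3."""
    stable = clusters_w1 & clusters_w2                 # re-detected unchanged
    retired = set().union(*stable) if stable else set()
    remaining = [c - retired for c in clusters_w2 if c not in stable]
    return stable, remaining
```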

Definition 3.9 (Event Cluster Ranking)
The importance (information richness) of a cluster is based on the number of important words it contains. The weight of a cluster C, W(C), is computed as W(C) = Σ_{w ∈ C} count(w), summed over the words w for which count(w) is greater than a given threshold, where count(w) is the count of the word w in the input collection.
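Definition 3.9 reduces to a thresholded sum of word counts, as in this short sketch (the threshold value is illustrative):

```python
def cluster_weight(cluster_words, counts, threshold=1):
    """W(C): sum of count(w) over words in the cluster whose collection
    count exceeds the threshold."""
    return sum(counts[w] for w in cluster_words if counts.get(w, 0) > threshold)
```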

Definition 3.10 (Representative Event Selection)
A candidate representative is selected based on the importance of its constituent words. Let the local importance of word w be given as log(1 + tf(w)), where tf(w) is the frequency of w in the cluster.
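The formula in Definition 3.10 is truncated in the text; assuming it continues as the standard log-scaled frequency log(1 + tf(w)), representative selection can be sketched as scoring each tweet by the summed local importance of its words:

```python
import math

def representative(cluster_tweets):
    """Pick the tweet whose words have the highest summed local
    importance log(1 + tf(w)), tf(w) being the word's frequency in the
    cluster (an assumed completion of the truncated formula)."""
    tf = {}
    for tweet in cluster_tweets:
        for w in tweet:
            tf[w] = tf.get(w, 0) + 1
    return max(cluster_tweets,
               key=lambda tweet: sum(math.log(1 + tf[w]) for w in tweet))
```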

High-Level Overview of SMAFED
We now present the high-level process view of SMAFED. The process workflow of SMAFED (see Figure 1) is divided into four main steps described below.
Step 1: A user interface is built around the underlying API provided by Twitter, using the Python programming language, to collect tweets in English or Pidgin English of Nigerian origin. Python was chosen for its efficiency and suitability for building high-traffic, data-heavy workflows. Collected tweets within each window period are stored in a queue and passed to the pre-processing stage.
Step 2: From the collected data stream, URLs, tags, mentions, and non-ASCII characters are automatically removed using regular expressions. The next data preparation stage performs tokenisation and normalisation.
This basic pre-processing reduces the number of features and addresses the problem of overfitting (Romero & Becker, 2019). After that, slang, acronyms, and abbreviations (SAB) are filtered from the tweets using corpora of English words in the natural language toolkit (NLTK). The filtered SAB terms are then passed to the local vocabulary (IKB) for further processing.
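The cleaning pass in Step 2 can be sketched with a few regular-expression substitutions; the exact patterns below are illustrative, not SMAFED's production rules.

```python
import re

def clean(tweet: str) -> str:
    """Strip URLs, mentions, hashtags/tags, and non-ASCII characters
    from a raw tweet before tokenisation."""
    tweet = re.sub(r"https?://\S+", " ", tweet)   # URLs
    tweet = re.sub(r"[@#]\w+", " ", tweet)        # mentions and tags
    tweet = re.sub(r"[^\x00-\x7F]", " ", tweet)   # non-ASCII characters
    return re.sub(r"\s+", " ", tweet).strip()     # collapse whitespace
```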
Step 3: Meanings of SAB terms are extracted from the IKB. Because several meanings may be attached to each SAB term, the ambiguous terms must be disambiguated and the best sense selected from the meanings provided. This is done by leveraging the Slang, Acronym, and Abbreviation Disambiguation Algorithm (SABDA), based on the ambiguous SAB term's context in the tweet. The choice of SABDA was informed by the peculiarity of the IKB, which provides a rich source of information and improves overall disambiguation accuracy. The data enrichment stage then concludes with spelling correction (using a Python automatic spell-checker library) and emoticon replacement.
Step 4: The enriched tweets produced by the previous stage must be transformed so that the clustering algorithm can build on them. The enriched tweets are transformed into vector form using the sent2vec model; Sent2Vec provides a significant improvement over state-of-the-art supervised and unsupervised methods for sentence or paragraph embedding, as revealed in the literature [60]. The embedded tweets are clustered as they arrive using semantic histogram-based incremental clustering (SHC) [61]. The idea is to have each cluster represent a single event as much as possible; SHC maintains high cohesiveness within clusters, implying a tight distribution of similarities, which motivated its choice. The tweets in each cluster are ranked based on the information richness of their constituent words. Lastly, the representative tweet that best describes each of the top n candidate event clusters is selected.

The Conceptual Architecture of SMAFED
The conceptual architecture defines the structure of components of the SMAFED. It consists of four modules: data collection, data pre-processing, data enrichment, and event detection, as presented in Figure 2.

Data Input Layer
The data layer serves as the input layer of SMAFED and is responsible for streaming tweets from Twitter to SMAFED for processing. A user interface enabled by the Twitter API is used to stream tweets in JSON format. SMAFED can also take as input data at rest, stored in either the Comma Separated Value (CSV) or JavaScript Object Notation (JSON) format.

Data Pre-processing Layer
The data pre-processing layer comprises three sub-layers: the data cleaner, the data transformer, and the data filter. The data cleaner handles the cleaning of responses fetched through the Twitter API: punctuation removal, repeated-character elimination, and substitution. The data transformer and the data filter together perform feature extraction. The data transformer performs tokenisation and normalisation using the Python NLTK library. After that, the data filter extracts the SAB terms from the collected tweets using the corpora of English words in the Python NLTK library; in other words, any normalised token that is not found in the corpora of English words is taken to be slang, an abbreviation, an acronym, or an emoticon. The filtered SAB terms are transferred to the data enrichment layer for additional processing. The tweet being analysed and its SAB terms serve as input to the enrichment layer.

Data Enrichment Layer
This layer has three sub-components: the IKB API, the Disambiguator, and the Spelling Checker. To better represent tweets, there is a need to provide meanings for the slang, acronyms, and abbreviations (SAB) found in tweets, because these noisy terms carry hidden meanings that, once resolved, form part of the rich context of a tweet. The IKB component of SMAFED is a lexicon of SAB that stores all the contents of three knowledge sources in MongoDB: Naijalingo, Urban Dictionary, and Internet Slang. Naijalingo is an online Nigerian Pidgin English and slang reference that gives definitions for Nigerian words and expressions. Urban Dictionary is a crowd-sourced online dictionary of English slang words and expressions. Internet Slang is a dictionary containing a pool of slang terms, acronyms, and abbreviations used on online blogs, Twitter, chat rooms, SMS, and internet forums. The last stage of the enrichment layer is a spelling check on the tweet content. JamSpell (version 1.0.0), a Python spell-checking library, is efficient and effective: it considers word context for better correction (accuracy), can correct up to 5,000 words per second (speed), and is available for many languages (multi-language support). The choice of JamSpell was informed by its better speed and accuracy compared to other spell-check libraries such as Norvig, Hunspell, and Dummy.

The Formal Definition of SABDA Model
The formal definition of the SABDA model is as follows. If G = g_1, g_2, …, g_m and C = c_1, c_2, …, c_n are the usage gloss and the context, respectively, we build their semantic representations V_G and V_C in the semantic space through the addition of the word vectors belonging to them: V_G = g_1 + g_2 + ⋯ + g_m and V_C = c_1 + c_2 + ⋯ + c_n. The measure of relatedness between G and C is a measure of the similarity between V_G and V_C, computed over the pair relations in REL, a reflexive relation given as REL = {(r_1, r_2) | r_1, r_2 ∈ RELS; if (r_1, r_2) ∈ REL, then (r_2, r_1) ∈ REL}, where RELS is a set of relations. To choose the best sense, the maximum of relatedness(V_G, V_C) over the candidate glosses is mapped to the corresponding sense s ∈ S.

The Algorithm for Disambiguation of SAB (SABDA)
The pseudocode of Slang, Acronym, and Abbreviation Disambiguation algorithm (SABDA) is presented in Algorithm 1.

Illustration of SABDA Pseudocode
For clarity, the disambiguation and interpretation of the noisy terms process are illustrated with Example 1.
Example 1: Consider disambiguating the term "baddo" in the tweet: "I am a baddo when it comes to this profession." Given the senses from the IKB shown in Table 6, the algorithm picks the sense with the most word overlap between the context (the tweet in question, sjk) and the usage examples (usage_senses). The overlap between the context and each usage example is shown in Table 7. The tweet under consideration, sjk, "I am a baddo when it comes to this profession", is compared with all the usage examples (usage_senses) in the IKB associated with the word "baddo". The score for each comparison (relatedness(sti, sjk)) is stored in an array (score). The comparison with the highest score is taken as the best score. Usage example 2, with 6 overlaps, is picked as the most appropriate sti. The best usage example is then mapped to its corresponding definition: usage example 2 maps to the second meaning of "baddo". Consequently, the best sense for "baddo" in this context is "someone who is highly respected or seen as very good at what he/she does".
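The overlap scoring in Example 1 can be sketched as below. This is an illustrative reading of the example, not the paper's code: the sense inventory and identifiers are made up for the demonstration, and relatedness is reduced to plain word overlap as in the example.

```python
# Sketch of Example 1: score each usage example by its word overlap with
# the tweet context, then map the best usage example to its definition.

def relatedness(usage: str, context: str) -> int:
    """Word overlap between a usage example and the tweet context."""
    return len(set(usage.lower().split()) & set(context.lower().split()))

def best_sense(term_senses, context):
    """term_senses: list of (usage_example, definition) pairs from the IKB."""
    scores = [relatedness(usage, context) for usage, _ in term_senses]
    best = scores.index(max(scores))    # highest-overlap usage example wins
    return term_senses[best][1]         # map it to its corresponding definition

senses = [
    ("he is a baddo he steals phones", "a bad person or criminal"),
    ("when it comes to coding he is a baddo in this profession",
     "someone who is highly respected or seen as very good at what he/she does"),
]
tweet = "I am a baddo when it comes to this profession"
print(best_sense(senses, tweet))
```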

Event Detection Layer
The event detection layer has four components: Embedder, Event Clusterer, Event Ranker, and Event Summariser.

The Embedder
The embedder converts the enriched tweets into vector form. This is done using sent2vec, a language model developed by Pagliardini et al. (2018). The model uses an unsupervised objective to train distributed representations of phrases/sentences. Words not found in the model's dictionary are represented as the zero vector, which implies that such words make no contribution to the mean vector.
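The out-of-vocabulary behaviour described above can be mimicked with a toy sketch. The real framework uses a pretrained sent2vec model; the two-word vocabulary and two-dimensional vectors here are purely illustrative, and the vectors are simply summed so that the zero vector visibly contributes nothing.

```python
# Toy sketch: OOV words map to the zero vector, so they add nothing to the
# sentence embedding. The tiny vocabulary is illustrative only.
import numpy as np

VOCAB = {"fire": np.array([1.0, 0.0]), "downtown": np.array([0.0, 1.0])}
DIM = 2

def embed(sentence: str) -> np.ndarray:
    """Sum the word vectors; unknown words contribute the zero vector."""
    return sum((VOCAB.get(w, np.zeros(DIM)) for w in sentence.lower().split()),
               np.zeros(DIM))

print(embed("fire downtown xyz"))   # "xyz" is OOV and has no effect
```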

The Event Clusterer
The Event Clusterer of the Event Detection Layer in SMAFED performs the incremental grouping of the embedded tweets into event bins using semantic histogram-based incremental clustering (Gad & Kamel, 2010). Semantic histogram-based incremental clustering is a dynamic incremental method of building clusters that uses the semantic histogram concept to maintain a high degree of cluster coherency. New tweets are compared with each event cluster's histogram to maintain the incremental creation of coherent clusters. If the addition of a new tweet would largely degrade the distribution, the tweet is not added; otherwise, it is added. The quality of event cluster cohesiveness (the semantic histogram) is measured by the ratio of the count of similarities above a certain similarity threshold to the total similarity count. The higher the semantic histogram ratio, the higher the cluster cohesiveness.
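The semantic histogram ratio described above can be sketched as a short function. The threshold value and the treatment of an empty cluster are illustrative assumptions, not values from the paper.

```python
# Sketch of the semantic histogram ratio (SHR): the fraction of pairwise
# similarity counts inside a cluster that lie above a similarity threshold.

def semantic_histogram_ratio(similarities, threshold=0.6):
    """similarities: pairwise tweet similarities within one event cluster."""
    if not similarities:
        return 1.0                       # an empty/new cluster is trivially coherent
    above = sum(1 for s in similarities if s >= threshold)
    return above / len(similarities)     # higher ratio => more cohesive cluster

print(semantic_histogram_ratio([0.9, 0.7, 0.4, 0.8]))  # 3 of 4 above 0.6 -> 0.75
```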

The Event Clusterer Algorithm
The event clusterer algorithm (semantic histogram-based incremental clustering) is presented in Algorithm 2. The computation of semantic similarities between tweets is based on their semantic representations, which are obtained with the Sent2vec model. When a new semantic similarity value between two tweets is determined, it increments the similarity count inside the bin (cluster) where that similarity occurs. To add a new tweet, the tweet is compared against the semantic histogram of each cluster. If the distribution would be degraded, the tweet is not added; otherwise, it is added. At this stage, the issue of concept drift is implicitly taken care of: when an arriving tweet does not fit into any of the existing clusters, a new cluster is created for it. The core acceptance test of Algorithm 2, for a new tweet T and an event cluster E, is:

    for each event cluster E do
        if (SHRnew ≥ SHRold) OR ((SHRnew > SHRmin) AND (SHRold − SHRnew < ε)) then
            Add T to E
            // Exit the inner loop so the same tweet is never assigned to more than one event cluster.
        end if
    end for  // inner loop
    if T was not added to any cluster then
        Create a new event cluster E for T
    end if
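A hedged sketch of the assignment rule in Algorithm 2 follows: a new tweet joins an existing cluster only if the cluster's semantic histogram ratio (SHR) is not degraded, or degrades by less than a tolerance while staying above a minimum; otherwise a new cluster is opened. The use of cosine similarity and all threshold values here are illustrative assumptions, not the paper's settings.

```python
# Illustrative incremental assignment: recompute the cluster's SHR with and
# without the candidate tweet and apply the Algorithm 2 acceptance test.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def shr(sims, threshold):
    """Semantic histogram ratio: share of similarities above the threshold."""
    return 1.0 if not sims else sum(s >= threshold for s in sims) / len(sims)

def assign(tweet_vec, clusters, sim_thr=0.5, shr_min=0.6, eps=0.1):
    """clusters: list of event bins, each a list of tweet vectors."""
    for cluster in clusters:
        old_sims = [cosine(a, b) for i, a in enumerate(cluster)
                    for b in cluster[i + 1:]]
        new_sims = old_sims + [cosine(tweet_vec, v) for v in cluster]
        shr_old, shr_new = shr(old_sims, sim_thr), shr(new_sims, sim_thr)
        if shr_new >= shr_old or (shr_new > shr_min and shr_old - shr_new < eps):
            cluster.append(tweet_vec)
            return clusters          # a tweet joins at most one cluster
    clusters.append([tweet_vec])     # concept drift: open a new event cluster
    return clusters
```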

The Event Ranker
The Event Ranking component of SMAFED orders the clusters based on the importance of their constituent words. Since the event clustering is unsupervised and the number of clusters is not known in advance, it is necessary to determine which clusters should contribute to the representative summary. In other words, the information richness of a cluster is based on the number of important words it contains.

The Event Ranker Algorithm
The event ranker algorithm is implemented (as in Definition 3.7) using Algorithm 3. The focus of the event ranker algorithm is to determine which of the detected event clusters are actual events. For each cluster, the weight is computed by summing the weights of all the important words in that cluster. The event clusters are then sorted in descending order based on their weights. Any event cluster whose weight is greater than a given threshold is considered an event.
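The ranking step above can be sketched directly. The word weights and the threshold value are illustrative assumptions; the paper's Algorithm 3 may derive them differently.

```python
# Sketch of Algorithm 3: a cluster's weight is the sum of the weights of its
# important words; clusters are sorted by weight descending, and only those
# above a threshold are kept as events.

def rank_events(clusters, word_weights, threshold=1.0):
    """clusters: dict of cluster_id -> list of important words."""
    weights = {cid: sum(word_weights.get(w, 0.0) for w in words)
               for cid, words in clusters.items()}
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [(cid, w) for cid, w in ranked if w > threshold]   # events only

clusters = {"c1": ["earthquake", "lagos"], "c2": ["lunch"]}
word_weights = {"earthquake": 0.9, "lagos": 0.6, "lunch": 0.2}
print(rank_events(clusters, word_weights))
```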

The Event Summariser
The Event Summariser component of SMAFED finds a coherent and fluent representative summary for each candidate event cluster using an extractive summarisation approach. Ideally, candidate event clusters contain tweets that belong to the same event, but there is still a need to find a representative that can stand for each individual cluster. The tweet with the highest score based on its local and global importance is selected. The local importance of a word found in a tweet shows how much the word contributes to the central concept of that tweet. The global importance corresponds to the word's contribution to the subtopics spread over the whole cluster of tweets.

The Event Summariser Algorithm
Algorithm 4 presents the event summariser based on Definition 3.8. The event summariser algorithm examines each tweet in each event cluster to find which tweet can serve as its representative. In other words, which of the tweets in each event cluster can be picked and used as a summary of all the tweets in that cluster? The algorithm answers this question by counting the frequency of each important word in a tweet and in the cluster in which the tweet appears. After that, the average local and global importance is computed. This computation is done for each important word in the tweet, and the sum is taken as the weight of the tweet in the event cluster. The weight of every tweet in the event cluster is computed in this way, and the tweet with the highest score is taken as the summary of the event cluster.
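The selection step can be sketched as below. This is a hedged reading of Algorithm 4: local importance is taken as a word's frequency within the tweet and global importance as its frequency across the cluster, averaged per word and summed per tweet; the paper's exact weighting may differ.

```python
# Illustrative extractive summariser: score each tweet by the averaged local
# and global frequencies of its important words; the top-scoring tweet is
# the cluster summary.
from collections import Counter

def summarise(cluster_tweets, important_words):
    cluster_counts = Counter(w for t in cluster_tweets
                             for w in t.lower().split() if w in important_words)
    best_tweet, best_score = None, -1.0
    for tweet in cluster_tweets:
        local = Counter(w for w in tweet.lower().split() if w in important_words)
        # average of local and global importance, summed over important words
        score = sum((local[w] + cluster_counts[w]) / 2 for w in local)
        if score > best_score:
            best_tweet, best_score = tweet, score
    return best_tweet

tweets = ["fire in lagos market", "huge fire at lagos market now", "my lunch"]
print(summarise(tweets, {"fire", "lagos", "market"}))
```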

EVALUATION EXPERIMENT
In this section, we report the evaluation of the SMAFED framework. The evaluation was divided into two parts. The first part gives a detailed summary of the impact of the data enrichment layer of SMAFED, which focused on semantic analysis of SAB compared to when there is no treatment of SAB. The second part focuses on the performance of SMAFED when used for event detection from social media streams.

Experiment I: Impact of the Data Enrichment Layer of SMAFED
SMAFED was evaluated by benchmarking it with the General Social Media Feed Pre-processing Method (GSMFPM) to determine the impact of the enrichment layer of SMAFED. The difference between GSMFPM and SMAFED is highlighted in Table 8.

Dataset Description
The two datasets for the first experiment were the Twitter Sentiment Analysis Training Corpus and Naija-Tweets. A summary of the two datasets is presented in Table 9.

Feature extraction and representation
We extracted two types of features, namely unigrams and bigrams, from the datasets. A summary of the feature extraction and representation is shown in Table 10. Global Vectors for Word Representation (GloVe) was used for the feature representation. GloVe is an unsupervised learning algorithm that obtains vector representations for words by training on aggregated global word-word co-occurrence statistics from a corpus. This results in representations that showcase interesting linear substructures of the word vector space. GloVe is a weighted least-squares log-bilinear model that combines the features of local context window and global matrix factorization methods. The underlying intuition of the model is that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning.
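The unigram and bigram extraction step can be sketched in a few lines (the GloVe representation step is separate and needs pretrained vectors, so it is omitted here; the function name is illustrative):

```python
# Minimal unigram/bigram feature extraction as described above.

def extract_features(text: str):
    tokens = text.lower().split()
    unigrams = tokens
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return unigrams, bigrams

uni, bi = extract_features("Fire outbreak in Lagos")
print(uni)   # ['fire', 'outbreak', 'in', 'lagos']
print(bi)    # ['fire_outbreak', 'outbreak_in', 'in_lagos']
```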

Classifiers
We used supervised learning techniques for text classification, namely, multilayer perceptron (MLP) and convolutional neural networks (CNN). MLP trains on input-output pairs and models the dependencies between the inputs and outputs. CNN is a deep learning architecture model that aims to learn higher-order features present in data through convolutions.

Experiment I: Result and Discussion
The proposed SMAFED was benchmarked against the General Social Media Feed Pre-processing Method (GSMFPM) by testing their impact on two classifiers. The essence of assessing the impact of the general pre-processing method (GSMFPM) and the proposed SMAFED on the classifiers was to determine whether analysing SAB terms and resolving their ambiguity in social media streams can affect event detection results. To do this, we compared the cross-entropy loss of the classifiers (MLP and CNN) when GSMFPM and SMAFED were used; this loss indicates how accurately a classifier predicts the expected outcome. The cross-entropy results for sentiment classification of the Twitter Sentiment Analysis Training Corpus with the Multilayer Perceptron (five epochs) and the Convolutional Neural Network (eight epochs) are shown in Tables 11 and 12, while those for the Naija-Tweets dataset are presented in Tables 13 and 14. It should also be noted that the lowest loss in each of the unigram, bigram, and unigram+bigram settings for both approaches was obtained at epoch 5, meaning that the higher the number of epochs, the better the performance of the classifier.

The cross-entropy loss of the CNN with kernel size 3 and one to four convolution layers, trained for eight epochs, for SMAFED compared with GSMFPM on the Twitter Sentiment Analysis Training Corpus is presented in Table 12. From the table, it can be deduced that the pre-processing coupled with the data enrichment components of SMAFED outperformed GSMFPM in matching the predicted and the actual sentiment. It should also be noted that the cross-entropy loss of the first CNN layer is lower than that of the other layers for both approaches.

The cross-entropy loss of the Multilayer Perceptron with unigram, bigram, and unigram+bigram features, trained for five epochs, for SMAFED compared with GSMFPM on the Naija-Tweets dataset is presented in Table 13. The table shows that SMAFED outperformed GSMFPM with respect to matching the predicted and the actual sentiment. As with the Twitter Sentiment Analysis Training Corpus, the lowest loss on Naija-Tweets for both approaches' unigram, bigram, and unigram+bigram settings was obtained at epoch 5. Table 14 presents the cross-entropy loss of the CNN with kernel size 3 and one to four convolution layers, trained for eight epochs, for SMAFED compared with GSMFPM on the Naija-Tweets dataset. From the table, the pre-processing coupled with the data enrichment components of SMAFED outperformed GSMFPM with respect to matching the predicted and the actual sentiment. Again, the cross-entropy loss of the first CNN layer is lower than that of the other layers for both approaches.

SMAFED Efficiency
The performance of the SMAFED framework was assessed using run-time performance metrics [63] to measure the efficiency and practicability of the framework. We implemented the proposed event detection method using Python (v 3.7). We used Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz processor for testing, with a 12 GB RAM and a 64-bit Windows 10 operating system. The framework was deployed on the cloud using a 4 GB Docker Droplet hosted on DigitalOcean services.
The tweets used for the prototype implementation were sourced from Nigeria. It was found that the average number of tweets from Nigeria per minute is 45 without the application of a filter, showing that the tweet volume from Nigeria is very small compared to the overall average of 350,000 tweets per minute. The part of the processing that took the longest was spell checking. Before the clustering stage, it took 5 seconds to pre-process and enrich 40 tweets, an average processing time of about 0.125 seconds per tweet. This is well within the limits needed to handle the estimated average of about one tweet per second from Nigeria. Even in the improbable circumstance where all tweets are recognised as events, the framework would be able to handle eight times the normal volume of tweets of Nigerian origin. With SMAFED, the lifespan of an event cluster is 4 days, after which it is deleted; this assumption reflects the fact that the potential value of data lies in its freshness. A lifespan of 4 days was chosen in order to: 1) avoid clogging the memory, 2) maintain a limited number of clusters in memory, and 3) limit the number of comparisons to be made. SMAFED's efficiency in terms of tweet streaming from Nigeria and pre-processing is depicted in Figure 3.

Experiment II: Accuracy of SMAFED
This experiment aims to determine how well SMAFED can detect events in social media streams. Three metrics, Precision, Recall, and F-measure, were used to benchmark SMAFED against other existing frameworks; these are standard evaluation metrics for event detection techniques [19]. Precision refers to the proportion of detected events that are actual events, Recall gives the proportion of actual events that the framework can identify, and F-measure is the harmonic mean of Precision and Recall. The formulae for the metrics used to ascertain event detection accuracy are: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F-measure = 2 × Precision × Recall / (Precision + Recall), where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
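The three metrics are standard and can be computed directly from true positives (TP), false positives (FP), and false negatives (FN); the snippet below is a plain restatement of their textbook definitions, not code from the paper.

```python
# Standard event detection evaluation metrics.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Example with the paper's reported Precision and Recall:
print(round(f_measure(0.922, 0.793), 3))   # -> 0.853
```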

Dataset Description
To evaluate SMAFED, the tweet IDs and event relevance judgments (made available in accordance with Twitter's terms of service) provided by [64] were used to obtain the tweet dataset. This was done using Tweepy with the Twitter REST API. Not all 152,900 relevant tweets could be extracted, because some user accounts had been deleted and were no longer accessible. A total of 82,887 labelled tweets and an additional 142,652 irrelevant tweets were collected. The distribution of the final dataset used by SMAFED for evaluation is presented in Table 15.

Experiment II: Result and Discussion
The section presents the results of the enrichment layer of SMAFED and the evaluation of SMAFED as benchmarked with existing frameworks in terms of accuracy. The final result of the enrichment stage is shown in Figure 4.
After the data enrichment stage, with a typical result presented in Figure 5, there is event clustering, ranking and summarisation. The enriched tweets from the enrichment layer are used as input to the event detection module to detect events from tweets. This module has four sub-modules: embedding, clustering, ranking, and summarisation. Sent2vec model vectorizes cleaned and enriched tweets. The Sent2vec model is wrapped in class Sent2VecWrapper and has a method "vectorize_sentences," which returns an array of vectorized sentences. The model is downloaded from the cloud service, DigitalOcean. Vectorized tweets are clustered with a semantic histogram clustering algorithm. The event ranking phase follows this. This task is implemented in the "Ranker" class. The last stage of the event detection stage is an event summary involving cluster summary computation. The resulting sample of the event detection stage, which includes event clustering, ranking and summarisation, is presented in Table 16. The section also presents a report on the accuracy of the event detection by SMAFED as compared with Locality Sensitive Hashing (LSH), Cluster Summary (CS), Entity-based approach and the Repp framework. The difference between SMAFED and four event detection approaches is presented in Table 17.   Table 18 shows the results of four different approaches compared with SMAFED. A cluster is considered a candidate event for the two baselines (LSH and CS) if it contains more than 30 tweets. The entropy-based method used 75 and above tweets with the best run, [17] used 10+ tweets with a mean over 20 runs, and SMAFED considered clusters weight > 100 threshold. Out of 120 million tweets available in the Event2012, McMinn et al. (2015) discovered 152,900 tweets that can be considered event tweets. Instead of using 120 million unlabeled tweets, SMAFED focused on the relevant tweets (150,000+) used as ground truth for benchmarking purposes. 
However, since tweets may be deleted or users may delete their own Twitter accounts, making them unavailable, not all of the 152,900 relevant tweets could be extracted; a total of 82,887 labelled tweets were downloaded. To introduce noise (irrelevant tweets) into the dataset and assess the performance of SMAFED, an additional 142,652 irrelevant tweets were collected from the pool of irrelevant tweets in the 120 million tweets.
FIGURE 5. The F-Measure score for event detection approaches compared.
From Figure 5, it can be deduced that SMAFED performed better than the existing event detection approaches. SMAFED also has the highest F-measure compared to existing methods for event detection, indicating that both Precision and Recall are reasonably high and that SMAFED has an excellent ability to detect events in social media streams. SMAFED was compared most closely against the best of the four event detection frameworks, proposed by [17]. Even though the number of tweets (categorised as irrelevant) from the Events2012 Twitter dataset added to the available 82,887 relevant tweets was larger than that used by Repp, SMAFED still performed better.

CONCLUSION AND FURTHER WORK
In this paper, a Social Media Analysis Framework for Event Detection (SMAFED) that can analyse the rich but hidden knowledge in social media streams to improve the accuracy of event detection was presented. SMAFED, as proposed in this paper, improves on existing event detection approaches with better metric scores in terms of Precision (0.922), Recall (0.793), and F-Measure (0.853). In addition, an evaluation experiment was carried out to determine the impact of the data enrichment layer by benchmarking SMAFED against GSMFPM. The cross-entropy results for sentiment classification of the Twitter Sentiment Analysis Training Corpus and the Naija-Tweets dataset, with the Multilayer Perceptron (five epochs) and the Convolutional Neural Network (eight epochs), showed that SMAFED outperformed GSMFPM. This paper contributes to big data analytics research, particularly event detection in social media streams. More precisely, it addresses the observed limitations of existing event detection approaches by 1) performing semantic analysis of SAB terms along with the ambiguity in their usage, leading to better comprehension and interpretation of noisy terms in social media streams; 2) developing SABDA to disambiguate ambiguous SAB terms; and 3) creating an integrated knowledge base to facilitate semantic analysis of noisy terms in social media streams.
In this paper, SMAFED used only social media stream texts for event detection. The integration of images and correlated text from social media streams would further strengthen the event detection results. While Twitter is a well-known research data source, exploring other social media sources, and/or combining them with Twitter, would lead to more events being detected and to the harmonisation of event detection results from multiple social media streams. This remains open to further research, as few approaches have exploited this direction.