Social media, as a new data provider, has largely contributed to the emergence of new issues related to the modeling and manipulation of data. In this context, the analysis of data issued from social media is a promising research topic that has attracted the attention of many researchers and has given rise to novel analysis areas, such as Social Media Analysis. The works related to this area can be subdivided into two major categories: those that design Data Warehouse (DW) models from a single social media platform (Twitter, Facebook, etc.), and those that address the problem of modeling a data warehouse schema from two social media platforms (Moalla et al. 2017; Kurnia, P.F. 2018; Valêncio et al. 2020).
We start by studying works that design Data Warehouse (DW) models from Twitter only (Bringay et al. 2011; Rehman et al. 2012; Cuzzocrea et al. 2016). The first research on designing a DW from Twitter was conducted by (Bringay et al. 2011). The authors defined a multidimensional star model for analyzing a large number of tweets. To do so, they proposed an adapted measure, called "TF-IDF adaptive", which identifies the most significant words according to the hierarchy levels of the cube (the location dimension). However, the proposed model is dedicated to a particular domain: their case study deals with the evolution of diseases, referring to the MeSH (Medical Subject Headings) thesaurus and adding to the multidimensional model a dimension called MotMesh (MeSH words).
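For reference, the conventional TF-IDF weight on which such an adapted measure can be assumed to build is
tf-idf(t, d, D) = tf(t, d) · log( |D| / |{d′ ∈ D : t ∈ d′}| ),
where t is a term, d a document (here, for instance, the set of tweets aggregated at one member of a level of the location hierarchy), and D the corpus of such documents; the exact adaptive formulation of (Bringay et al. 2011) is not reproduced here.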
(Rehman et al. 2012) proposed a system for warehousing streams from Twitter. Their system relies on an architecture consisting of five layers: i) the data source layer, represented by the available Twitter APIs; ii) the ETL (Extract, Transform and Load) layer, which extracts data from tweets and processes it into a format suitable for the target database; iii) the data warehouse layer, which stores the data issued from tweets; iv) the analysis layer, dedicated to OLAP analyses of the tweets; and v) the presentation layer for the analysis results. The authors then presented an extension of this work (Rehman et al. 2013) in which they focus on integrating extensive natural language processing capabilities into OLAP to perform multidimensional social media analysis. A rather similar approach is proposed by (Cuzzocrea et al. 2016). Considering Twitter, the authors emphasize the role of implicit information that can be derived or discovered in tweets beyond the explicitly available metadata. Hence, they presented an extension of their previous work focusing on the definition of a multidimensional data model for storing tweet data to support OLAP analysis. To do so, the authors defined a data cube whose dimensions are of two types: (i) semantic dimensions, extracted from the Wikipedia knowledge base and making use of the titles of Wikipedia articles and of the Wikipedia category graph; and (ii) metadata dimensions, which capture the information that can be derived from a tweet's metadata, such as timestamp, user, hashtag, and location. In addition, the authors propose a measure that exploits a Wikification service to represent a sentence as a set of Wikipedia concepts. For OLAP purposes, they proposed a summarization algorithm that selects the best tweets to represent the data of each cube cell according to the dimensional fact model.
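As a purely illustrative sketch of this kind of metadata-based cube construction and per-cell summarization (and not the algorithm of Cuzzocrea et al.), tweets can be grouped into cells keyed by metadata dimensions and a representative tweet chosen per cell; field names below are hypothetical:

```python
from collections import defaultdict

# Illustrative only: group tweets into cube cells keyed by metadata
# dimensions (month, location, hashtag) and keep the tweet with the most
# retweets as a naive "representative" of each cell. The semantic dimension
# and summarization algorithm of the cited work are richer than this.
def build_cells(tweets):
    cells = defaultdict(list)
    for t in tweets:
        key = (t["timestamp"][:7], t["location"], t["hashtag"])  # (YYYY-MM, place, tag)
        cells[key].append(t)
    return cells

def representative_per_cell(cells):
    return {key: max(group, key=lambda t: t.get("retweets", 0))
            for key, group in cells.items()}

if __name__ == "__main__":
    tweets = [
        {"timestamp": "2016-03-02", "location": "Rome", "hashtag": "#flu", "retweets": 12, "text": "..."},
        {"timestamp": "2016-03-15", "location": "Rome", "hashtag": "#flu", "retweets": 40, "text": "..."},
    ]
    print(representative_per_cell(build_cells(tweets)))
```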
Another study, with similar objectives, is presented by (Jenhani et al. 2019). An approach based on distributed storage and parallel processing was developed to extract events from streaming social media data at a large scale. To achieve this, Storm was used to process data extracted from Twitter's Streaming API, and Hadoop was used to process large volumes of data for integration into the data warehouse. The authors proposed a snowflake schema for modeling event data, which allows both independent analysis of social media events and their integration with the existing enterprise data warehouse. A bridge table was employed to establish the connection between the Social Media Data Warehouse (SMDW) and the Enterprise Data Warehouse (EDW). In a similar vein, (Girsang et al. 2020) introduced a Business Intelligence (BI) application dashboard that utilizes a data warehouse to provide the journalistic community with a solution for accessing powerful, effective, and extensive news sources. To achieve this, the authors collected data from Twitter using the available API and periodic crawlers. The retrieved data were stored in the database as raw data and subjected to text classification using the Support Vector Machine (SVM) algorithm. The resulting data were stored in the database as analysis data, processed through the Extract-Transform-Load (ETL) process, and placed in the data warehouse. The data warehouse enabled the creation of a BI dashboard that can assist journalists in analyzing news information.
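A minimal sketch of the kind of SVM-based text classification step described above, assuming a scikit-learn pipeline and hypothetical category labels (this is not the authors' exact configuration), could look as follows:

```python
# Label raw posts with an SVM classifier, then hand the labelled records
# on to the ETL stage; categories and training examples are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["great product launch", "service outage again", "new office opened"]
train_labels = ["business", "complaint", "business"]  # hypothetical categories

classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
classifier.fit(train_texts, train_labels)

new_posts = ["another outage reported downtown"]
labelled = [{"text": p, "category": c}
            for p, c in zip(new_posts, classifier.predict(new_posts))]
# `labelled` would then be stored as "analysis data" and loaded into the DW by the ETL process.
print(labelled)
```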
Despite the growing interest in designing data warehouses from social media, few studies focus on modeling a schema from two different social media platforms (Kurnia, P.F. 2018; Valêncio et al. 2020; Moalla et al. 2017, 2022).
For their part, (Moalla et al. 2017) propose a new opinion analysis method based on machine learning that determines the polarity of users' comments shared on different social media. This method is integrated into the ETL (Extract, Transform and Load) process to analyze users' opinions. It relies on the n-grams technique to construct a semi-automatic dictionary of positive and negative keywords, which is used in the learning phase to build the prediction model. In addition, the authors propose a new feature vector specific to social media for classifying comments as positive, negative, or neutral. The evaluation performed on the Stanford Twitter Sentiment (STS) and Sanders datasets showed a high level of accuracy. In 2022, the authors presented an extension of this work (Moalla et al. 2022), proposing a new approach for building a DW from social media for opinion analysis. The proposal consists of four phases: data extraction and cleansing, transformation, loading, and analysis. The data extraction and cleansing steps are intended to model data marts for each social media platform. In the transformation phase, the authors detail the mapping and merging steps used to obtain a generic DW schema, and they summarize the opinion analysis step. They then present the implementation of the DW in a document-oriented NoSQL database. Finally, in the analysis and reporting steps, they perform queries on the resulting DW.
A rather similar approach is proposed in (Valêncio et al. 2020). The authors proposed a constellation schema to model data from Facebook and Twitter. Their normalized model utilizes quantitative attributes from social media posts to reduce duplicate data and minimize the execution time of data mining algorithms. To support this schema, the authors introduced the Configurable Load and Acquisition Social Media Environment (CLASME), which allows data preparation and knowledge discovery from Facebook and Twitter. The ETL phase begins with the selection of public data from pages and accounts, followed by loading the qualitative data, classifying comments as positive, negative, or neutral, and determining opinions about the post; the quantitative data is then loaded. Once the ETL and data warehousing steps are complete, data mining algorithms are applied in the data analysis phase to validate the classification model, and the results are interpreted to facilitate the decision-making process.
In the same context, (Kurnia, P.F. 2018) developed a business intelligence dashboard to assess the performance of topics posted on Facebook and Twitter. To achieve this, they implemented a data warehouse model and business intelligence software in four stages. First, data is collected by extracting information from Facebook and Twitter using the available APIs. Text classification techniques, such as Naive Bayes, decision tree, and SVM, are then applied to assign a category or class to the data based on document characteristics. The data warehouse design stage follows the Kimball method, which proposes a star schema model to represent the number of comments, tweets, likes, etc., for each topic. Finally, the business intelligence design stage involves creating a BI dashboard using the CodeIgniter framework.
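For illustration, a Kimball-style star schema of the kind described above can be sketched as follows; the table and column names, and the use of SQLite, are assumptions made for the example rather than the schema of the cited works:

```python
import sqlite3

# Hypothetical star schema: dimension tables for topic and platform, and a
# fact table holding engagement counts per topic, platform, and day.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_topic    (topic_id INTEGER PRIMARY KEY, topic_name TEXT);
CREATE TABLE dim_platform (platform_id INTEGER PRIMARY KEY, platform_name TEXT);
CREATE TABLE fact_engagement (
    topic_id     INTEGER REFERENCES dim_topic(topic_id),
    platform_id  INTEGER REFERENCES dim_platform(platform_id),
    day          TEXT,
    n_posts      INTEGER,
    n_comments   INTEGER,
    n_likes      INTEGER
);
""")
conn.execute("INSERT INTO dim_topic VALUES (1, 'elections')")
conn.execute("INSERT INTO dim_platform VALUES (1, 'Twitter')")
conn.execute("INSERT INTO dim_platform VALUES (2, 'Facebook')")
conn.execute("INSERT INTO fact_engagement VALUES (1, 1, '2018-05-01', 120, 340, 2100)")
conn.execute("INSERT INTO fact_engagement VALUES (1, 2, '2018-05-01', 45, 510, 3300)")

# Example roll-up: total likes per topic across both platforms.
print(conn.execute("""
SELECT t.topic_name, SUM(f.n_likes)
FROM fact_engagement f JOIN dim_topic t USING (topic_id)
GROUP BY t.topic_name
""").fetchall())
```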
Despite their strengths, the existing approaches also have certain limitations. Notably, most of the proposed design approaches in this domain are confined to a single social media platform (typically Twitter) as a data source and fail to consider other platforms. Additionally, the design process is not presented in detail, with little to no discussion of the rules that govern the construction of the data warehouse schema. To overcome these limitations, our approach leverages multiple social media platforms to gather additional information. Specifically, we utilized two social media platforms that are relevant to the same subject matter. To effectively combine and aggregate the data from these platforms, we developed a schema integration method that generates a unified data warehouse schema.
From this study, we may conclude that most of these works provide specific processing of the data extracted from social media but do not offer tools for decision makers to manipulate the information contained in the combined metadata associated with their posts.
Hence, our aim is to provide a multidimensional model dedicated to the content, metadata, and social aspects of posts that is generic (i.e., independent of special needs pre-defined a priori) and that takes into account the structural specificity and, possibly, the semantics of the data.