Improving Association Discovery through Multiview Analysis of Social Networks

The rise of social networks has brought about a transformative impact on communication and the dissemination of information. However, this paradigm shift has also introduced many challenges in discerning valuable conversation threads amidst fake news, malicious accounts, background noise, and trolling. In this study, we address these challenges by focusing on propagating fake news labels. We evaluate the efficacy of community-based modeling in effectively addressing these challenges within the context of social network discussions using a state-of-the-art benchmark. Through a comprehensive analysis of millions of users engaged in discussions on a specific topic, we unveil compelling evidence demonstrating that community-based modeling techniques yield precision, recall, and accuracy levels comparable to those achieved by lexical classifiers. Remarkably, these promising results are achieved even without considering the textual content of tweets beyond the information conveyed by hashtags. Moreover, we explore the effectiveness of fusion techniques in tweet classification and underscore the superiority of a combined community and lexical approach, which consistently delivers the most robust outcomes and exhibits the highest performance measures. We illustrate this capability with specific network graphs constructed from Twitter interactions related to the COVID-19 pandemic, showcasing the practicality and relevance of our proposed methodology. Demonstrating the performance gained by fusing modalities, the combined lexical and community method achieves improvements of up to 60% in both precision and recall.


Introduction
The advent of social networks has conferred significant importance on these platforms as principal channels for news consumption among a substantial segment of the populace. The interconnectivity of online users within these networks facilitates the swift propagation of information, surpassing the conventional scope of traditional news media outlets like newspapers and television. Nevertheless, this inherent interconnectivity also amplifies the ease with which inaccurate and deceptive information can spread, particularly within the context of users' social network connections. This study examines the potential utility of the structural characteristics of social network user connections in identifying and addressing false information, specifically within the domain of Twitter.
Can we classify a tweet without knowing its content? In this paper, we explore the social network context, Twitter's rich network of interaction (connections, tags, retweets, and mentions), and how it influences the labeling of content. We test the observation that people in the same social network group or discussion thread tend to quote and discuss similar resources and share topic items, shed new light on the challenges posed by social network dynamics, and offer an effective means of tackling them through community-based modeling. We contribute to advancing tweet classification methodologies by demonstrating the comparable performance of community-based approaches to traditional lexical classifiers, uncovering the actual value of the contextual information embedded within social network interactions involving tweet authors and objects. Our research opens up exciting avenues for further exploration and application, paving the way for more sophisticated network selection and fusion methods that leverage both community attributes and lexical modeling to enhance the accuracy and effectiveness of tweet classification in the ever-evolving landscape of social networks. We present tangible evidence of our ability to capture comprehensive information by constructing network graphs that encapsulate crucial features such as retweet, mention, reply, and quote networks.
We propose enriching tweet classification with a network-based analysis of the Twitter network, as illustrated in Figure 1: we relate the content of the tweets using multi-modal lexical analysis, employ community discovery by building networks of retweets, mentions, and hashtags, and apply network analysis to structural data mined from Twitter. Our robust lexical-based analysis of tweet content considers colloquialisms, abbreviations, and OCR text in images. It is part of a scalable data science package that downloads, saves, and analyzes Twitter data at scale, and provides robust content analysis of noisy communities on Twitter, introduced in Nogueira; Nogueira and Tešić (2021); Nogueira (2020). We evaluate the approach on the MediaEval 2020 Fake News task benchmark and the COVID-19 (+) Twitter data set. Pogorelov et al. (2020) demonstrate the value of the author's network in content classification for the MediaEval Fake News Detection Task 2020. Its two fake news detection sub-tasks on COVID-19 and 5G conspiracy topics detect misinformation claims that the construction of the 5G network and the associated electromagnetic radiation triggered the SARS-CoV-2 virus. The benchmark challenge looked only at classification of COVID-19-related tweets in two ways: (1) multi-class labeling: 5G-Corona Conspiracy, Other Conspiracy, and Non-Conspiracy; and (2) binary labeling: Unknown-or-Non-Conspiracy and Any-Conspiracy. This research finds that tweet classification on the author's network only (without analyzing tweet content) performs similarly to tweet content classification.

Motivation and Contribution
Researchers in the machine learning field tend to train models with features derived from one modality without exploiting or exploring others. A singular focus on one modality may limit the model's ability to capture a holistic understanding of how to generalize on unseen data. This paper substantiates the importance of employing community networks to build classifiers for tweet classification. We demonstrate this by utilizing the MediaEval 2020 Fake News task benchmark and the custom COVID-19 (+) Twitter data sets, where we utilize six distinct community network knowledge graphs to classify tweets correctly. In addition, we show that combining the community features with the lexical features yields the best performance on precision, recall, and accuracy metrics. Finally, we take advantage of the user attributes of the tweets, used as input to the Random Forest classifier for classification.

Related Work
This section reviews the related work on fake news detection on Twitter. The prevalence of "fake news" raises significant concerns. Osmundsen et al. (2021) show that fake news sharing is fueled by the same psychological motivations that drive other forms of partisan behavior, including sharing biased news from traditional and credible news sources. Given the widespread proliferation of misinformation online and the growing reliance on social media for news consumption, it is essential to comprehend how people evaluate and engage with posts of low credibility. Geeng et al. (2020) examine users' responses to fake news posts on their Facebook or Twitter feeds, seemingly originating from accounts they follow, and explore how to assist users in assessing the credibility of such posts. The authors conducted semi-structured interviews with 25 participants who regularly use social media for news consumption. Using a browser extension unbeknownst to the participants, they temporarily introduced fake news into the participants' feeds and observed subsequent interactions; the participants then provided insights into their browsing experiences and decision-making processes. The findings highlight various reasons individuals refrain from investigating posts of low credibility, including a tendency to accept content from trusted sources at face value and a reluctance to invest additional time in verification. The study also outlines the investigative techniques participants employed to verify the trustworthiness of posts, encompassing both the functionalities provided by the platform and impromptu strategies. Bovet and Makse (2019) use Twitter data to understand the influence of fake news during the 2016 US presidential election, Ahmed et al. (2020) use Twitter data to analyze COVID-19 and the 5G conspiracy theory, and Sha et al. (2020) use Twitter data to evaluate the influence of the COVID-19 Twitter narrative among U.S. governors and cabinet executives. et al. (2016) shows that the content of the tweet dominates in correct tweet classification, and Zhou and Zafarani (2019) identify writing style and frequency of word usage as relevant features in lexical analysis. Two primary directions of leveraging community information are adapting deep learning techniques to learn the underlying characteristics of tweets in communities (e.g., et al. (2019)) or exploring the structural and sharing patterns of the topic (e.g., et al. (2020)).

[Reply connection network majority prediction example: 5g corona conspiracy; # of edges in the labeled 5g corona conspiracy set: 11; other conspiracy: 0; non conspiracy: 0; tweets in the detected community: 100% from the 5g corona conspiracy data set, 0% other conspiracy, 0% non conspiracy.]
Context Through Connections: Zhou and Zafarani (2019) have shown that community-based modeling of social networks, leveraging the spread of information in social media through retweets and comments, improves NLP-based modeling. Structural and sharing patterns in the Twitter-verse are rich, and the definition of communities on Twitter is multi-dimensional. Users in a community can share geographic proximity and interconnections with mutual friends, groups, and topics of interest. Osmundsen et al. (2021), mapping the psychological profiles of over 2,300 American Twitter users linked to behavioral sharing data and sentiment analyses of more than 500,000 news story headlines, find that individuals who report hating their political opponents are the most likely to share political fake news; they also selectively share helpful content to derogate these opponents. Nguyen et al. (2020) propose the Factual News Graph (FANG), a graphical social context representation and learning framework for fake news detection focusing on representation learning. It captures social context to a degree if the topic is well represented, and it generalizes to related tasks, such as predicting the factuality of reporting of a news medium. Su (2022) uses similar unsupervised graph embedding methods on graphs built from Twitter users' social network connections and finds that users engaged with fake news are more tightly clustered than users only engaged with factual news. The graph-based approaches of Gangireddy et al. (2020) focus on bi-clique identification, graph-based feature vector learning, and label spreading on Twitter. The downside of the existing graph representations is that they do not scale to millions of users and the heterogeneity of the topics examined. Schroeder et al. (2019) developed a framework for capturing and analyzing vast amounts of Twitter data. It consists of the primary data-capturing component (Twitter API), the proxy, the storage, and experiment wrappers, which are connected to the storage and the proxy. The proxy provides quota leasing, an external API allowing users to execute calls with the same syntax, and request caching.
Lexical Aspects: The #MeToo hashtag belongs to a movement that has recently emerged against sexual assault and for advocating women's rights. The lexical aspects of tweets with this tag have been predicted by capitalizing on both textual and visual modalities. Bansal (2020) shows that contextual embeddings and transformer language models were too computationally expensive to include. Many similar works dealing with these same types of modalities have put a pretrained version of BERT and a generic Deep Neural Network (DNN) to use for feature extraction. Suman et al. (2021) developed a profiling system to identify anonymous and potentially nefarious users' genders. Gao et al. (2020) utilized multi-modality for finding disaster tweets. de Bruijn et al. (2020) proposed incorporating contextual hydrology information to classify flood-related tweets effectively. Lim et al. (2020) showed that the pivotal attribute for tweet sentiment analysis is the location features (longitude and latitude) of geotagged tweets; these representations enhance accuracy in classifying sentiment compared to the baseline GloVe model using a convolutional neural network (CNN) and a bi-directional long short-term memory recurrent neural network (LSTM).
Hybrid Analysis: Graph neural networks perform well in multi-modal contexts. Many state-of-the-art graph neural network (GNN) variants have been developed to resolve current issues of vanilla baseline GNNs. Gao et al. (2020) present MM-GNN, a novel framework that addresses inquiries by providing information from images. MM-GNN incorporates visual, semantic, and numeric modalities to represent an image as a graph. The node features are refined by leveraging contextual information from these modalities (using message passing), which improves performance in question-answering tasks. Yang et al. (2021) introduce SelfSAGCN to alleviate over-smoothing when labeled data are severely scarce, using "Identity Aggregation" and "Semantic Alignment" techniques. Wang et al. (2021) design Bi-GCN for limited memory resources; it binarizes the network parameters and input node features and produces results comparable to baseline vanilla models such as GraphSage and GCN. Dai et al. (2021) introduce the NR-GNN variant to deal with sparsely and noisily labeled graphs. Liu et al. (2021) present Tail-GNN, a network inference method that utilizes neighborhood translation to enhance node representations and uncover missing neighborhood nodes. Dai and Wang (2021) show that all graph neural networks suffer from training data bias and vertex feature dependency.
Table 2: The tweet content has all the words, yet the lexical approach misclassified it. The community approach provided enough attributes for the fusion run to identify it correctly.

Content: Explaining why beneficial effects from cannabis on intestine inflammation conditions like ulcerative colitis and Crohn's disease have been reported often. If the endocannabinoid isn't present, inflammation isn't balanced; the body's immune cells attack the intestinal lining.
Ground truth: non conspiracy
Lexical model prediction: 5g corona conspiracy
All-connections network majority prediction: non conspiracy
# of connections in the 5g corona conspiracy dataset: 0
# of connections in the other conspiracy dataset: 129
# of connections in the non conspiracy dataset: 185
% of tweets in the community from the 5g corona conspiracy dataset: 10%
% of tweets in the community from the other conspiracy dataset: 25%
% of tweets in the community from the non conspiracy dataset: 65%

Fake News Detection: Social media platforms have become a vital source of information during the COVID-19 pandemic. The spread of fake data or news through social media has become increasingly prevalent and a powerful tool for information proliferation, and detecting fakes is crucial for the betterment of society. Existing fake news detection models focus on increasing performance and reducing overfitting, but lag in generalizability. Bhatia et al. (2023) is used as a baseline for this work; its robust distance generalization of a transformer-based generative adversarial network (RDGT-GAN) architecture can generalize the model.

Methodology
This paper uses a scalable approach to gather, discover, analyze, and summarize joint sentiments of Twitter communities, extract community and network features, and improve the lexical-based baseline for tweet classification using community information Nogueira and Tešić (2021). The entire pipeline is summarized in Figure 5.

Content Analysis, Transformation, and Feature Selection
The tweets we analyzed had a content capacity of 280 characters. That limit tends to produce a writing style that differs from most corpora. To achieve brevity, users employ a lexicon that includes abbreviations, colloquialisms, hashtags, and emoticons, and tweets may contain frequent misspellings. The context of a tweet is also richer, as it resides in a dense network of retweets and replies. To this end, we employ lexical-based analysis and community analysis for tweet content and context. The Lexical Analysis Pipeline implements the transformation of Twitter content, feature extraction, and modeling to make predictions for the NLP-based task Magill and Tomasso (2020).
In the transformation step, we tested several pre-processing, tokenization, and normalization techniques. We measured the influence of each transformation approach on prediction performance on part of the development set by turning off the feature and comparing performance using 5-fold measures. Removing punctuation, preserving URLs, and normalizing several specific terms (e.g., 'U.K.' to 'UK') in the tweet contributed to better content classification, as expected for short tweet content. Stemming did not influence the classification recall on this small development set, nor did lemmatization. We speculate that the tweet content was too short and the data too small to derive any meaningful conclusion, so we did not apply either. Feature extraction from tweet content was implemented in two ways: encoding terms as vectors representing either the occurrence of terms in the text (Bag-Of-Words) or the impact of terms on a document in a corpus (TF-IDF). We extended the feature set using Optical Character Recognition (OCR) of images embedded in the tweets.
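The two encodings can be sketched as follows. This is a minimal illustration of Bag-Of-Words and TF-IDF rather than the pipeline's actual implementation; whitespace tokenization, a smoothed IDF, and the toy tweets are simplifying assumptions.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Encode each document as raw term counts over the shared vocabulary."""
    vocab = sorted({t for d in docs for t in d.split()})
    return vocab, [[Counter(d.split())[t] for t in vocab] for d in docs]

def tf_idf(docs):
    """Weight each term count by its (smoothed) inverse document frequency."""
    vocab, counts = bag_of_words(docs)
    n = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n / df[j]) + 1.0 for j in range(len(vocab))]
    return vocab, [[c * idf[j] for j, c in enumerate(row)] for row in counts]

# Hypothetical short tweets standing in for the corpus.
tweets = ["5g towers cause covid", "covid vaccine news", "5g conspiracy news"]
vocab, bow = bag_of_words(tweets)
```

Terms that occur in fewer documents (e.g., "cause") receive a larger TF-IDF weight than corpus-wide terms (e.g., "covid"), which is the effect the comparison above evaluates.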

Rich Graph Network Analysis
We apply the Community Analysis Pipeline for community discovery in networks created from user and hashtag connections, constructing seven different networks from the raw Twitter data:
All Users Connections: a network created from the labeled data set, with each vertex being a user and each edge the connection between two users by a retweet, quote, reply, mention, or friendship;
Retweet Connections: like All Users Connections, but with each edge formed by retweets only;
Mention Connections: like All Users Connections, but with each edge formed by mentions only;
Reply Connections: like All Users Connections, but with each edge formed by replies only;
Quote Connections: like All Users Connections, but with each edge formed by quotes only;
Friends Connections: like All Users Connections, but with each edge formed by friendship only; and
Hashtag Connections: a network created from the labeled data set, with each vertex being a hashtag and each edge connecting two hashtags used together in the same tweet.
We have developed an in-house scalable package pytwanalysis Nogueira; Nogueira and Tešić (2021); Nogueira (2020) to collect and save information-rich Twitter data, create networks, and discover communities in the data.
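A minimal sketch of how such connection networks can be assembled from tweet records. The field names (`retweet_of`, `mentions`, `hashtags`) are hypothetical stand-ins for Twitter API metadata, not pytwanalysis's actual schema; the reply, quote, and friendship modes would follow the same pattern as the retweet mode.

```python
# Hypothetical minimal tweet records; real records come from the Twitter API.
tweets = [
    {"user": "a", "retweet_of": "b", "mentions": ["c"], "hashtags": ["5g", "covid"]},
    {"user": "c", "retweet_of": None, "mentions": ["a"], "hashtags": ["covid"]},
]

def build_user_edges(tweets, mode):
    """Collect undirected user-user edges for one connection mode."""
    edges = set()
    for t in tweets:
        if mode == "retweet" and t["retweet_of"]:
            edges.add(frozenset((t["user"], t["retweet_of"])))
        elif mode == "mention":
            for m in t["mentions"]:
                edges.add(frozenset((t["user"], m)))
    return edges

def build_hashtag_edges(tweets):
    """Connect hashtags that co-occur in the same tweet."""
    edges = set()
    for t in tweets:
        tags = t["hashtags"]
        for i in range(len(tags)):
            for j in range(i + 1, len(tags)):
                edges.add(frozenset((tags[i], tags[j])))
    return edges
```

Using frozensets makes the edges undirected, so a mention of c by a and a mention of a by c collapse into one edge, matching the connection semantics described above.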

Community Labeling
We utilized all networks to learn the user attributes and tweets relevant to the community and topic. First, we found communities using an adapted Louvain method Aynaud (2020); Nogueira. We labeled each community with one of the three conspiracy categories (5G, non, other) based on the majority of the tweets for that community. If we found a community with more tweets with the 5G label as opposed to non or other, we assigned the 5G label to unlabeled tweets in that community. Figure 1 demonstrates a simplification of this method. We applied the method to all seven networks for community discovery and assigned seven community labels (from seven networks) to each tweet, listed as features 1 through 7 in Table 3. For the Hashtag Connections network, because one tweet can have multiple hashtags, one tweet could belong to multiple hashtag communities; in that case, majority logic selects the most common community found for that tweet. The remaining tweets that did not belong to any community, or that belonged to a community with tweets strictly originating from the test data set, were assigned as Unknown. Many Unknowns were found because many tweets did not have any connections with other users in the labeled data sets (i.e., no retweets, replies, quotes, mentions, friends, or hashtags). An additional combined label was created from a combination of the other seven labels, listed as feature 8 in Table 3. The combined label first uses the label from the quote network; if the quote network has an unknown value, it uses the value from the reply network, followed by the mention, all-user-connections, retweet, friends, and hashtag networks. The order of use for each network in the combined label was decided based on the evaluation metrics for the predictions coming from each network (Table 9). The community discovery approach can be helpful for data sets in which users are well-connected to each other. User connectivity was also extracted from the graphs created from the development data sets. User connectivity is a feature that shows the degree of connectivity between each user in the All Users Connections network for each of the provided classification labels, driven by the observation that if vertices are well-connected, their content is similar. See features 9 through 12 in Table 3.
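The majority-assignment and combined-label logic above can be sketched as follows. The community contents are hypothetical; the fallback order mirrors the one stated in the text (quote, reply, mention, all-user-connections, retweet, friends, hashtag).

```python
from collections import Counter

def majority_label(community_tweet_labels):
    """Assign each community the majority label of its labeled tweets;
    communities containing only unlabeled (test) tweets stay 'Unknown'."""
    labels = {}
    for cid, tweet_labels in community_tweet_labels.items():
        known = [l for l in tweet_labels if l is not None]
        labels[cid] = Counter(known).most_common(1)[0][0] if known else "Unknown"
    return labels

def combined_label(per_network):
    """Fall through the per-network labels in the stated preference order."""
    for net in ("quote", "reply", "mention", "all", "retweet", "friends", "hashtag"):
        label = per_network.get(net, "Unknown")
        if label != "Unknown":
            return label
    return "Unknown"

# Hypothetical communities: tweet labels, None = unlabeled test-set tweet.
communities = {
    0: ["5g", "5g", "non", None],   # majority 5g -> unlabeled tweet gets 5g
    1: [None, None],                # only test tweets -> Unknown
}
```

For example, a tweet whose quote-network label is Unknown but whose reply-network label is "non" receives "non" as its combined label.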

Attribute Labeling
User attributes in the tweets are also extracted from the Twitter data. The produced networks can contain several disconnected tweets, so we expand the suite of network features and extract four additional user attributes and one tweet attribute as follows: 1. user followers count (Fig. 2); 2. user friends count (Fig. 3); 3. user statuses count (Fig. 4); 4. user verified (Fig. 7); 5. tweet age (days since creation) (Fig. 6). Since the community majority selection predictions generated many unknown assignments, we used an additional classifier to help predict labels for tweets that were disconnected from the network. Since we have different types of features, we used the versatile Random Forest classifier, which works well with a mixture of categorical and numerical features. Community features 1 through 12 from Table 3 and user features 1 through 5 listed above are used as input to the Random Forest classifier. The distribution of data for the features in the labeled data is shown in Figures 2, 3, 4, 6, and 7.
Community features 8 through 20 from Table 3 and user features 1 through 5 are input to the multi-label (5G, non, other) Random Forest classifier.
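A hedged sketch of such a classifier using scikit-learn. The feature values and the manual one-hot encoding of the categorical community label are illustrative assumptions, not the paper's exact feature table.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature vector: a one-hot community label plus numeric user
# attributes (followers, friends, statuses counts, and tweet age in days).
LABELS = ["5g", "non", "other", "Unknown"]

def encode(community_label, followers, friends, statuses, tweet_age):
    onehot = [1 if community_label == l else 0 for l in LABELS]
    return onehot + [followers, friends, statuses, tweet_age]

X = [encode("5g", 120, 80, 4000, 30),
     encode("non", 5000, 900, 100, 200),
     encode("Unknown", 10, 5, 50, 10),     # disconnected tweet, no community
     encode("non", 4500, 850, 120, 190)]
y = ["5g", "non", "5g", "non"]

clf = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                             random_state=0).fit(X, y)
pred = clf.predict([encode("non", 4800, 870, 110, 195)])
```

The one-hot step is why a Random Forest can consume the categorical community label and the raw numeric counts in a single feature vector.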
First, we create three different networks from the raw data: User Connections from the provided data, where each vertex is a user and each edge is the connection between two users by a retweet, quote, reply, or mention; Hashtag Connections from the provided data, where each vertex is a hashtag and an edge exists between two hashtags if they were used together in the same tweet; and User Connections 8M, a network created from the provided data and the auxiliary data set of over 8M tweets, with vertices and edges made the same way as in the User Connections network. Next, we extract the degree of connectivity for each of the provided conspiracy labels (5G, non, and other), driven by the observation that if vertices are well connected, their content is similar. We employ the Louvain community discovery method to discover communities in all three networks and apply information from each analyzed network to specific tweets Nogueira (2020). We labeled each community with one of the three conspiracy categories (5G, non, other) based on the majority of the labels for that community. If we find a community where 5G labels outnumber the others, we use the 5G label for unlabeled tweets in that community. These assignments were done based on the combination of communities in all three networks. Tweets that did not belong to any community, or that belonged to a community with tweets strictly originating from the test data set, were assigned based on their degree of connectivity, and the remaining were assigned as Unknown. Many unknowns were found because many tweets did not have any connections with other users in the given data sets (no retweets, replies, quotes, mentions, or hashtags).

Modality Overlap Analysis
In this subsection, we explore whether the communities derived from different modalities exhibit low overlap, signifying complementary information, or considerable overlap, suggesting redundancy or similar underlying structures. Quantifying this measure may help identify the modalities contributing unique information and inform the design of fusion methods accordingly. For example, it can allow us to determine which modalities should be assigned more weight to get the best performance in classification tasks.
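Quantifying the overlap with the adjusted Rand index might look like the following sketch, where the community assignments are hypothetical. Identical partitions score 1.0 even when the community IDs are relabeled, while unrelated partitions score near 0.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical community assignments of the same six tweets per modality.
retweet_comms = [0, 0, 1, 1, 2, 2]
hashtag_comms = [1, 1, 0, 0, 2, 2]   # same partition, communities relabeled
reply_comms   = [0, 1, 0, 1, 2, 0]   # a largely different grouping

print(adjusted_rand_score(retweet_comms, hashtag_comms))  # 1.0: full overlap
print(adjusted_rand_score(retweet_comms, reply_comms))    # low: complementary
```

A low ARI between two modalities suggests their communities carry complementary signal worth fusing; an ARI near 1.0 suggests one of them is redundant.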

Network Construction
After multiple pre-processing steps, a network was constructed from the COVID-19 (+) data set, which consists of 8 million tweets. First, replies, quotes, and retweets are the selected connection modes of the network. Unlike quotes and retweets, we found that no elaborate information (full text, media URL, etc.) is present for the tweets replied to in COVID-19 (+). Hence, we removed any edge constructed in the replies connection mode whose target node is not found within the 8 million tweets, due to the inability to extract textual and visual features from it. To reduce sparsity in the network, every target node should be connected to at least ten nodes; isolated nodes or node connections falling under this threshold are pruned. Moreover, duplicate edges were eliminated, keeping the first occurrence of each duplicate. As a result, the total numbers of nodes and edges dropped to 3,407,903 and 3,316,523, respectively. For simplicity, every node ID, designated by its tweet ID, was mapped to values ranging from 0 to 3,407,902.
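The pruning steps can be sketched as follows. The edge records are hypothetical, and `min_degree` generalizes the threshold of ten used above so the toy example stays small.

```python
from collections import Counter

def prune(edges, known_nodes, min_degree):
    """Drop reply edges with unknown targets, dedupe edges (keeping the
    first occurrence), then drop edges touching low-degree nodes."""
    seen, kept = set(), []
    for src, dst, mode in edges:
        if mode == "reply" and dst not in known_nodes:
            continue                 # no text/media retrievable for the target
        if (src, dst) in seen:
            continue                 # keep only the first duplicate
        seen.add((src, dst))
        kept.append((src, dst, mode))
    degree = Counter()
    for src, dst, _ in kept:
        degree[src] += 1
        degree[dst] += 1
    return [(s, d, m) for s, d, m in kept
            if degree[s] >= min_degree and degree[d] >= min_degree]

edges = [("a", "b", "reply"), ("a", "b", "reply"), ("a", "x", "reply"),
         ("a", "b", "quote"), ("b", "c", "retweet")]
kept = prune(edges, known_nodes={"a", "b", "c"}, min_degree=1)
```

Here the edge to the unknown target "x" and both duplicates of ("a", "b") are dropped; raising `min_degree` then removes the sparse periphery, as in the 8M-tweet network.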

Visual and Textual Feature Extraction
We find that 154,923 tweets in COVID-19 (+) had images. Some of the tweets were suspended, impeding retrieval of some of the images. We also assigned the name of each image to its corresponding tweet ID, preserving the link between the tweet and the image. A VGG16 model pre-trained on ImageNet was employed as a feature extractor for all the images. Textual embeddings, on the other hand, were produced by a trained adapted version of BERT for COVID tweets called BERTweet by VinAIResearch Nguyen et al. (2020). We utilized the baseline normalizations elaborated in subsection 3.1, but with a few alterations that include removing usernames, all special characters, hashtags, contractions, non-English tweets if present, links (which incorporate not only "https://t.co/" but also "http" and "www"), and emojis. These additional textual normalizations were applied, and BERTweet features were subsequently extracted.
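A rough sketch of the extra textual normalizations. The regular expressions are assumptions rather than the exact ones used, and contraction removal and non-English filtering are omitted for brevity.

```python
import re

def normalize(text):
    """Strip links, usernames, hashtags, emojis, and special characters."""
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)  # links incl. t.co
    text = re.sub(r"[@#]\w+", " ", text)                  # usernames, hashtags
    text = re.sub(r"[^\x00-\x7F]+", " ", text)            # emojis / non-ASCII
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)           # special characters
    return re.sub(r"\s+", " ", text).strip().lower()

print(normalize("Stay safe! 😷 @user https://t.co/abc #Covid19 cases up"))
# -> "stay safe cases up"
```

The link pattern runs first so that URL punctuation is not mangled by the later special-character pass.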

Augmented Network Construction
We seek to obtain an infused network comprising the network above as well as a visual similarity graph. The latter is built by computing the cosine similarity between each node's image DNN features in the pre-processed network; edges are formed between each node and its five most visually similar nodes. The number of edges increased to 4,091,138 in our processed COVID (+) network. The motive behind this is that the GNN will aggregate features from the neighboring nodes, both those from replies, quotes, and retweets and those with visually similar images.
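Building the visual similarity edges might be sketched as follows, with random vectors standing in for the VGG16 features (a brute-force illustration; at 3.4M nodes an approximate nearest-neighbor index would be needed instead).

```python
import numpy as np

def visual_similarity_edges(features, k=5):
    """Add a directed edge from each node to its k most cosine-similar nodes."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                        # cosine similarity of unit vectors
    np.fill_diagonal(sim, -np.inf)       # exclude self-similarity
    edges = set()
    for i in range(len(sim)):
        for j in np.argsort(sim[i])[-k:]:  # top-k neighbours of node i
            edges.add((i, int(j)))
    return edges

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))         # stand-in for VGG16 image features
edges = visual_similarity_edges(feats, k=3)
```

Each node contributes exactly k edges and never links to itself, which is how the infused network's edge count grows from 3,316,523 toward 4,091,138.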

Graph Neural Network Training
To leverage all modalities and aggregate features from neighborhood nodes, the adjacency and feature matrices are fed to an unsupervised GNN framework. The selected model for training the graph neural network is GraphSage Hamilton et al. (2017), which produces an embedding output of 50 dimensions. The hyperparameters are epoch = 1, batch size = 50, layer size = 50, and learning rate = 0.001, with Adam as the optimizer. The choice of this GNN variant is ascribed to the fact that GraphSage utilizes the neighborhood sampling concept, which renders it scalable. The GraphSage GNN has been trained separately on the constructed and visually infused networks, with the same textual feature matrix representing the nodes' features.
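The core GraphSage idea, mean aggregation over a sampled neighborhood, can be illustrated with a single hand-rolled layer. This is a didactic sketch with made-up dimensions, not the trained 50-dimensional framework above.

```python
import numpy as np

def sage_layer(h, neighbors, W_self, W_neigh, sample_size, rng):
    """One GraphSAGE-style mean-aggregator layer with neighborhood sampling:
    h_v' = relu(W_self @ h_v + W_neigh @ mean(h_u for sampled u in N(v)))."""
    out = []
    for v, nbrs in enumerate(neighbors):
        sampled = rng.choice(nbrs, size=min(sample_size, len(nbrs)),
                             replace=False)
        agg = h[sampled].mean(axis=0)            # mean over sampled neighbors
        out.append(np.maximum(0.0, W_self @ h[v] + W_neigh @ agg))
    return np.stack(out)

rng = np.random.default_rng(1)
h = rng.normal(size=(4, 6))                      # toy node feature matrix
neighbors = [[1, 2], [0, 3], [0], [1]]           # adjacency lists
W_self = rng.normal(size=(3, 6))
W_neigh = rng.normal(size=(3, 6))
z = sage_layer(h, neighbors, W_self, W_neigh, sample_size=2, rng=rng)
```

Because each node only touches a fixed-size sample of its neighbors, the cost per node is bounded regardless of graph size, which is the scalability property cited above.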

Clustering
Both networks have been clustered using the Louvain algorithm Blondel et al. (2008). The resulting embeddings, however, have been clustered using HDBSCAN (Hierarchical DBSCAN) Campello et al. (2013), which is faster than regular DBSCAN; the minimum cluster size has been set to 10. Due to the memory constraints associated with clustering high-dimensional textual embeddings and extensive data, the number of dimensions of the text has been reduced to 10 using PCA; the dimensions are kept intact, however, when generating GNN embeddings.

The task at hand deals with highly imbalanced data sets, as outlined in Table 4. Generating fake tweets using the most predictive or most common terms for each class led to over-fitting of most classifiers. We took a different route and adjusted class weights to account for imbalanced data when possible. The MediaEval Fake News Detection Task 2020 looks into tweets for misinformation claims that the construction of the 5G network and the associated electromagnetic radiation triggered the SARS-CoV-2 virus. We received a labeled data set of approximately 6,000 tweets related to COVID-19 and 5G and their corresponding metadata; see details in Table 4. Note that all of our training was done using the development set, which contains 1,120 tweets labeled as 5G-COVID conspiracy, 688 tweets labeled as another conspiracy, and 4,138 non-conspiracy tweets, as shown in Table 4. This data set is small and very imbalanced. Thus, we extended the labeled data set with a new COVID-19 (+) data set that contains tweets related to #Coronavirus, #Covid19, and #Covid-19, collected from March through September 2020, with over 3.2 million users and 8 million tweets Nogueira (2020). From the 8 million tweets, we filtered only the tweets that can make a connection in the existing networks created from the labeled data; after applying the filter, we ended up with 771,203 COVID-19 tweets. The COVID-19 (+) data set was used to augment the feature space for classification. We also extended knowledge about user relationships by using the Twitter API to retrieve a list of friends for each user in the labeled data set. A total of 3,385,981 users were retrieved, but that number does not cover 100% of the users.
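The class-weight adjustment can be sketched with the standard "balanced" heuristic (n_samples / (n_classes * n_c)) applied to the development-set class sizes quoted above; whether the paper used exactly this heuristic is an assumption.

```python
from collections import Counter

def balanced_class_weights(labels):
    """w_c = n_samples / (n_classes * n_c): rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * nc) for c, nc in counts.items()}

# Development-set class sizes from the text: 5G 1,120 / other 688 / non 4,138.
labels = ["5g"] * 1120 + ["other"] * 688 + ["non"] * 4138
weights = balanced_class_weights(labels)
```

The minority "other conspiracy" class ends up with the largest weight and the dominant non-conspiracy class with the smallest, counteracting the imbalance without generating synthetic tweets.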

Measures
We measured the performance of the proposed methods on a small labeled subset of test data in Table 4. MediaEval officially reported that the metric used for evaluating the multi-class classification performance was the multi-class generalization of the Matthews correlation coefficient (MCC) Pogorelov et al. (2020); Chicco and Jurman (2020); Baldi et al. (2000). MCC has advantages in bioinformatics over F1 and accuracy, as it considers the balance ratios of the four confusion matrix categories (true positives, true negatives, false positives, and false negatives). In a social network analysis, we are more interested in missed tweets (false negatives) and true positives. For this reason, we discuss our results from the perspective of precision, recall, and accuracy. We employ the adjusted Rand index (ARI) metric to measure the overlap between modalities and compare the partitions. We tested the lexical classification pipeline with a variety of classifiers: Naive Bayes, Support Vector Machine, Random Forest, Multilayer Perceptron, Stochastic Gradient Descent, and Logistic Regression, and settled on Logistic Regression, which has been shown to perform best for content-based classification in Magill and Tomasso (2020). We compared the performance of the classifiers on validation sets, both for the multi-class and binary classification subtasks. While the TF-IDF vectorizer captures the importance of terms well, we found better results using a Bag-of-Words model in Section 5, likely due to the high occurrence and variety of colloquialisms and abbreviations. Table 5 shows the metrics for the multi-class and binary predictions using the Logistic Regression classifier Magill and Tomasso (2020). This paper's lexical analysis pipeline's baseline results improve upon Data Lab's best multi-class logistic regression (LR) model from its MediaEval 2020 submission Magill and Tomasso (2020) by using cross-validation and regularization. The new best MCC result for the LR used in this paper is 0.435 for multi-class and 0.492 for binary classification.
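The lexical pipeline described above can be sketched as follows: a Bag-of-Words vectorizer feeding a Logistic Regression classifier, scored with the Matthews correlation coefficient. This is a minimal illustration with placeholder tweet texts and labels, not the MediaEval data, and the hyperparameters shown are not the tuned values used in the paper.

```python
# Minimal sketch of the lexical classification pipeline:
# Bag-of-Words features -> Logistic Regression, scored with MCC.
# Texts and labels are illustrative placeholders only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import matthews_corrcoef

train_texts = [
    "5g towers spread the virus",
    "masks reduce transmission",
    "the vaccine is a tracking chip",
    "wash your hands regularly",
]
train_labels = ["conspiracy", "non-conspiracy", "conspiracy", "non-conspiracy"]

# Bag-of-Words worked better than TF-IDF on this task, likely due to
# the frequent colloquialisms and abbreviations in tweets.
pipeline = Pipeline([
    ("bow", CountVectorizer(lowercase=True)),
    ("lr", LogisticRegression(max_iter=1000)),
])
pipeline.fit(train_texts, train_labels)

test_texts = ["5g virus chip", "wash hands and wear masks"]
test_labels = ["conspiracy", "non-conspiracy"]
preds = pipeline.predict(test_texts)
print(matthews_corrcoef(test_labels, preds))
```

The same pipeline applies to the binary and multi-class subtasks; only the label set changes.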

Community Analysis Pipeline
Table 9 shows the metrics for the multi-class and binary predictions using the Louvain community majority assignment for each type of network, with and without the COVID-19 (+) data set. Results are intuitive, as community majority assignments using the combined connections network with the COVID-19 (+) data set perform the best over the range of measures. The table also shows the number of tweets that were classified as unknown when they did not belong to any community. The additional results for the Random Forest classifier are included in the table for comparison. Note that the total for each model is always 2,908, which is the number of labeled tweets in the test set.

Community Contribution Analysis
The MediaEval 2020 development set is small and only captures fragments of the community. The number of unknown community assignments is large. It skews the use of community attributes, as shown by the low performance in section Multi-class with Unknowns in Table 9. Thus, we separate the evaluation of the multi-class community majority assignment into evaluation including the unknowns and evaluation excluding the unknowns. The metrics without the unknowns were calculated separately so that we could evaluate how well we could classify the tweets that did belong to a community, as shown in section Multi-class without Unknowns in Table 9 and Figure 8. Results calculated without the unknowns show comparable performance with the lexical pipeline.
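The community majority assignment can be illustrated with a small sketch: detect Louvain communities on an interaction network, then label every user with the majority class among the labeled members of its community, falling back to "unknown" for users outside any labeled community. The graph, labels, and node names below are toy placeholders, not the MediaEval networks.

```python
# Sketch of the Louvain community majority assignment.
# Nodes, edges, and labels are toy placeholders.
from collections import Counter
import networkx as nx
from networkx.algorithms.community import louvain_communities

G = nx.Graph()
G.add_edges_from([
    ("u1", "u2"), ("u2", "u3"), ("u1", "u3"),   # one dense group
    ("u4", "u5"), ("u5", "u6"), ("u4", "u6"),   # another dense group
])
G.add_node("u7")                                 # isolated user

labels = {"u1": "conspiracy", "u2": "conspiracy", "u4": "non-conspiracy"}

communities = louvain_communities(G, seed=42)

prediction = {}
for comty in communities:
    known = [labels[u] for u in comty if u in labels]
    # Majority vote over labeled members; "unknown" if none are labeled.
    majority = Counter(known).most_common(1)[0][0] if known else "unknown"
    for u in comty:
        prediction[u] = majority

print(prediction["u3"], prediction["u6"], prediction["u7"])
```

Unlabeled members of a labeled community inherit the majority label, while isolated users remain unknown, mirroring the unknown counts reported in Table 9.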
The results in Table 9 show that the performance of community modeling is comparable to the lexical model when unknown assignments are excluded, and they break down the quality of the predictions by network type. Networks created from quotes and replies seem to yield the best results. Our initial premise was that similar topics and news are shared among people who quote each other or participate in the same discussion thread, and this finding confirms that correlation. On the other hand, the hashtag network's predictions do not perform well, as many of the same hashtags are used in both conspiracy- and non-conspiracy-labeled data.
Labeling Considerations: The main challenge of the community approach is scale; the annotations and the topic should be prevalent in the data set to truly benefit from the community-based analysis. The COVID-19 (+) data set was obtained by finding an intersection with our originally mined data set of 8 million Tweets; see Section 4.1. Community-based analysis with the auxiliary data brought out the value of community connections in this analysis; compare model and model+ in Table 9. The COVID-19 (+) data set improved the connectivity in the network, which consequently increased the number of tweets that could be classified. When the same labeled data were analyzed within the more extensive network, the number of unknowns from the all connections network decreased from 198 (All) to 108 (All+), and the MCC score jumped from 0.089 to 0.180. Using the Random Forest classifier over community and attribute labels improves the overall performance of the classification; see Table 9. The classifier can assign values for tweets that could not be classified with the community majority assignments since it uses additional features apart from the community features; see Section 3.2.2.

Table 7
Overlap in the community multi-class predictions by method: the percentage shows the overlap between the predictions of two methods out of the 2,908 test records.
Table 10 summarizes the correct classifications that the network modeling produces that the lexical one does not. The community predictions perform comparably for cases where the Tweet was not isolated from the network. Figure 7 illustrates the overall multi-class detection overlap by method. The highest overlap occurs between the all connections network predictions and the Random Forest model, which is expected since the network predictions were used as features for the Random Forest model. The lexical model overlaps most with the all connections network predictions and the Random Forest model. Other methods with high overlap in their predictions are the all connections network with the friends network, the retweet network with the mention network, and the quote network with the reply network.

Combining Community and Lexical Attributes
In this experiment, we combine the logic of the lexical pipeline, as described in Section 3.1, and the community pipeline, as described in Section 3.2. We use the prediction of the lexical pipeline as a new input feature for the community pipeline that uses the Random Forest classifier. The combination of features that provided the best results was the following: lexical prediction, user followers count, user friends count, user statuses count, user verified, tweet age, lv comty usr all (majority dataset), and lv comty (majority dataset)-combined. Community modeling does not consider the tweet's content beyond hashtags: it models the interactions with the tweet (mentions, quotes, retweets, replies) and with the author (friends). The model trained on community-based and lexical-based features achieved the highest MCC score on the test set, as shown in Table 8. Binary lexical and community classifications (non-conspiracy vs. conspiracy) perform better than the lexical multi-class baseline. Recent work has shown different dispersion patterns regardless of the conspiracy topic Vosoughi et al. (2018), and our combined community and lexical binary model captures this observation well, as it outperforms the alternatives across four different measures of classification efficiency; see Table 8 for details.
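The fusion step above can be sketched as a Random Forest trained on the lexical prediction alongside the community and user attributes. The feature values, encodings, and hyperparameters below are illustrative assumptions, not the actual feature engineering or tuned settings used in the paper.

```python
# Sketch of the fusion model: the lexical pipeline's prediction becomes
# one more feature for a Random Forest over community and user attributes.
# All values are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Columns: lexical_pred (encoded), followers, friends, statuses,
#          verified, tweet_age, community_id (encoded)
X_train = np.array([
    [1, 120,  80,   500, 0, 3.0, 2],
    [0, 9000, 400, 12000, 1, 1.5, 0],
    [1, 45,   60,   300, 0, 4.2, 2],
    [0, 700,  150,  2500, 0, 2.0, 1],
])
y_train = ["conspiracy", "non-conspiracy", "conspiracy", "non-conspiracy"]

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# The Random Forest can still produce a label when the community
# majority assignment would have returned "unknown", because it also
# uses the lexical prediction and user attributes.
X_test = np.array([[1, 100, 70, 400, 0, 3.5, 2]])
print(rf.predict(X_test)[0])
```

Because the lexical prediction is just another column, the fusion degrades gracefully: when community features are uninformative, the forest can still lean on the lexical and user-attribute signals.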

Quantifying Modality Overlap
Table 11 shows that each modality seems to capture specific information that is not relevant for community discovery at a global scale, given the negligible overlap between the modalities. However, communities produced by each modality might have value for specific discovery and mining tasks. The low overlap provides insight into the effectiveness of different modalities in capturing the underlying patterns within multi-modal tweet data and into how much they complement each other.
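The overlap measurement can be sketched with the adjusted Rand index, which compares the community partitions two modalities induce over the same set of tweets; it is 1.0 for identical groupings (even under relabeling) and near 0 for unrelated ones. The partitions below are toy placeholders, not the actual modality outputs.

```python
# Sketch of the modality-overlap measurement with the adjusted Rand
# index (ARI). Partitions are toy placeholders over six tweets.
from sklearn.metrics import adjusted_rand_score

retweet_partition = [0, 0, 1, 1, 2, 2]
hashtag_partition = [1, 1, 0, 0, 2, 2]   # same grouping, relabeled
mention_partition = [0, 1, 0, 1, 2, 0]   # a quite different grouping

# ARI is invariant to cluster relabeling: identical groupings score 1.0.
print(adjusted_rand_score(retweet_partition, hashtag_partition))  # 1.0
# Dissimilar groupings score near (or below) 0.
print(adjusted_rand_score(retweet_partition, mention_partition))
```

The pairwise ARI values over all modality pairs fill a table like Table 11; values near zero are what indicate that the modalities complement rather than duplicate each other.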

Discussion and Outlook
In conclusion, this research highlights the significant influence of community behavior in tweet classification, suggesting that it carries a weight comparable to tweet content. By introducing a community-based approach to tweet classification, we successfully utilized six distinct community network knowledge graphs to classify tweet content accurately. Our findings demonstrate the advantages of incorporating community attributes and models into the lexical baseline for tweet classification. Notably, community networks offer valuable contextual information for understanding tweet communication, and our study reveals that community-only modeling is as informative as content modeling, as it encompasses crucial details regarding social network interactions with the tweet object. Remarkably, our community modeling techniques, implemented on a large-scale real network, achieved precision, recall, and accuracy comparable to a lexical classifier, even without considering tweet content beyond hashtags. Furthermore, we have shown that essential fusion techniques outperform the lexical and network baselines, and that combining the community and lexical approaches produces the most robust outcomes and superior performance measures, as evidenced by the MediaEval Fake News task results. The complex knowledge graph depicted in Figure 7, which encompasses the retweet, mention, reply, and quote networks, illustrates our ability to capture and incorporate comprehensive network information. Moving forward, we plan to explore enhanced network selection and fusion methods in conjunction with lexical modeling and the friends network to improve the accuracy of tweet classification.

Fig. 2
Fig. 2 Distribution of the feature user followers count for the different class labels (5G, non, other).

Fig. 3
Fig. 3 Distribution of the feature user friends count for the different class labels (5G, non, other).

Fig. 4
Fig. 4 Distribution of the feature user statuses count for the different class labels (5G, non, other).

Fig. 6
Fig. 6 Distribution of the feature tweet age for the different class labels (5G, non, other).

Fig. 7
Fig. 7 Distribution of the feature user verified for the different class labels (5G, non, other).

Fig. 8
Fig. 8 Comparison of the multi-class community majority assignment excluding the unknowns for the different types of networks, as detailed in section Multi-class without Unknowns in Table 9

Fig. 9
Fig. 9 Modeling comparisons for multi-class classification on the test set. Community-only classification offers comparable precision and accuracy without even considering tweet text. Fusion of the lexical and community methods offers the best performance across the board.

Fig. 10
Fig. 10 Modeling comparisons for binary classification on the test set. Community-only classification offers comparable precision and accuracy without even considering tweet text. Fusion of the lexical and community methods offers the best performance across the board.

Table 1
Tweet by a user with strong 5G Corona Conspiracy community ties. Community-based detection identified the group and augmented the lexical classification. Content: Does #5G cause #COVID2019 #coronavirus? No, of course not! Does non-ionizing #wireless radiation accelerate viral replication and contribute to #AntibioticResistance? Yes.

Table 3
Community attributes as explained in Section 3.2.1.

Table 4
MediaEval 2020, COVID-19 (+), and friendship data sets.For MediaEval 2020, note that the number of users in each set does not add up to the total number of users, as the same user can have tweets in different data sets.

Table 5
Logistic regression (LR) and logistic regression with OCR (LR-OCR) modeling scores for multi-class and binary labeling of the MediaEval 2020 test set.

Table 6
Ternary (runs 001-004) and binary (runs 011-014) labeling scores returned by the benchmark engine (MCC), and our analysis on the development set (MediaEval 2020) against the released ground truth (MCC, Precision, Recall, Acc). Model abbreviations: LR for logistic regression; LR-OCR for logistic regression with OCR; CL for community labeling; LR-CL for the fusion run. The team placed second in the competition.

Table 8
Modeling comparisons on multi-class and binary results for the test set of MediaEval 2020

Table 9
Predictions for the community labeling using MediaEval development data and the auxiliary COVID-19 (+) data set. Performance measures (MCC, Precision, Recall, Accuracy) were computed for every type of network for multi-class classification including the unknown predictions, for multi-class classification excluding the unknown predictions, and for binary classification.

Table 10
Comparison of the predictions between the community and lexical models. The test data set has 2,908 labeled Tweets. Equal to lexical is the number of predictions for that model that were classified the same as the lexical model. Unique is the number of predictions the model predicted differently from the lexical model.