In this section, we present the details of the proposed video topic detection method, which consists of three phases: data acquisition and video representation, topic detection, and evolution analysis. Specifically, data acquisition and video representation involve acquiring video metadata, transforming the different modalities into high-level semantic text, and representing each video as a 768-dimensional vector (e.g., [0.1444, -0.062, … 0.096, -0.072]) learned from the fused texts by a transformer encoder. Topic detection is mainly performed by the HDBSCAN clustering method, which is compared with hierarchical clustering. We illustrate the evolution of topics in terms of both intensity and content: the intensity is measured by the number of videos, and the content evolution is revealed by the change of keywords ranked by TF-IDF. The workflow is shown in Fig. 1 and the proposed topic detection method is depicted in Algorithm 1.
3.1 Preprocessing
To improve the accuracy and performance of the proposed method, preprocessing is essential for the videos' titles, tags, description texts, audios and covers. Texts such as titles, tags and descriptions are cleaned by removing invalid symbols, anomalies and null values, and are then combined into a single text. Audio is converted into text by automatic speech recognition (ASR); because the converted audio texts are long and contain invalid data, summary extraction is applied to them. Optical character recognition (OCR) is applied to video covers to detect text. The process is shown in Fig. 2a.
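The cleaning and combining steps for the textual metadata can be sketched as follows; the exact symbol whitelist kept by the regular expression is an illustrative assumption, not the paper's specification:

```python
import re

def clean_text(text):
    """Remove invalid symbols, collapse whitespace; treat null values as ''.

    The kept character set (word characters, whitespace, basic punctuation)
    is an assumed choice for illustration.
    """
    if text is None:
        return ""
    # Replace anything outside letters, digits, whitespace and punctuation.
    text = re.sub(r"[^\w\s.,!?;:'\"-]", " ", text)
    # Collapse the repeated whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

def combine_metadata(title, tags, description):
    """Combine cleaned title, tags and description into one text datum."""
    parts = [clean_text(title),
             clean_text(" ".join(tags or [])),
             clean_text(description)]
    return " ".join(p for p in parts if p)
```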
3.2 Data fusion and embedding
As shown in Fig. 2a, data fusion aims to integrate the multiple modal features of a video to provide a more comprehensive description. Considering the misalignment and semantic-level differences among the modalities, the audio and video cover are transformed into text form with richer semantics. During fusion, the separator token ‘SEP’ is inserted between the texts of different modalities, and ‘CLS’ is used as the beginning symbol, i.e., ‘CLS’ + ‘video title text’ + ‘SEP’ + ‘audio text’ + ‘SEP’ + ‘video cover text’ + ‘SEP’.
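The fusion format above amounts to a simple token-delimited concatenation; a minimal sketch, where the literal strings stand in for the tokenizer's own special symbols:

```python
def fuse_modalities(title_text, audio_text, cover_text):
    """Concatenate per-modality texts with BERT-style special tokens.

    '[CLS]' opens the sequence and '[SEP]' closes each modality's text,
    mirroring the fusion format described in the section.
    """
    return "[CLS]" + "[SEP]".join([title_text, audio_text, cover_text]) + "[SEP]"
```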
Video embedding converts videos into feature vectors learned from the multiple modalities. We expect semantic interactions to arise among the fused features, yielding semantically richer representations. The attention mechanism in the transformer model is designed to capture the contextual relationships of words in a text, meaning that each word attends to every other word [32], as shown in Fig. 2b. Therefore, a transformer-based model is used for multimodal fusion so that videos are represented in the same semantic vector space. The similarity between videos can then be measured by the cosine similarity of their vectors.
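Given two embedding vectors produced by the transformer encoder, the similarity measure is standard cosine similarity; a minimal self-contained sketch:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Identical directions score 1.0 and orthogonal vectors score 0.0, so higher values indicate more semantically similar videos.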
3.3 Topic detection
Considering the multimodal features of videos, previous textual topic models cannot be used to detect video topics, so clustering algorithms are applied for video topic detection in this paper. Clusters are formed by videos with high similarity, and each cluster represents a topic. Since the video vectors have high dimensionality, clustering directly on them may not work well [32]. We therefore first reduce the dimensionality and then perform HDBSCAN clustering. Uniform Manifold Approximation and Projection (UMAP) is employed for the reduction. Compared with existing dimensionality reduction methods such as PCA and t-SNE, UMAP preserves more of the local and global structure of high-dimensional data at a lower projection dimension [44]. Moreover, Allaoui et al. [45] demonstrated that reducing high-dimensional embeddings with UMAP improves HDBSCAN's clustering accuracy and runtime. Thus, UMAP is used to reduce the dimensionality of the embeddings.
Introduction of HDBSCAN method. HDBSCAN defines a new distance measure, the mutual reachability distance, which better reflects the density of points and thus handles clusters of varying density. When clustering, HDBSCAN uses a soft-clustering approach that models noise as outliers, preventing irrelevant data from being assigned to any cluster and thereby improving the clustering results [46]. Therefore, HDBSCAN is adopted to cluster similar videos into topics. It can be roughly divided into the following steps: transforming the space, building the minimum spanning tree, building the cluster hierarchy, condensing the cluster tree, and extracting the clusters, as shown in Fig. 3.
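The mutual reachability distance underlying the first step is max(core_k(p), core_k(q), d(p, q)), where core_k(x) is the distance from x to its k-th nearest neighbour. A minimal sketch with Euclidean base distance (the base metric is an assumption):

```python
import math

def euclid(p, q):
    """Plain Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def core_distance(point, points, k):
    """Distance from `point` to its k-th nearest neighbour."""
    dists = sorted(euclid(point, q) for q in points if q != point)
    return dists[k - 1]

def mutual_reachability(p, q, points, k):
    """Mutual reachability distance: max of both core distances and d(p, q).

    Dense regions keep their small pairwise distances, while sparse points
    are pushed apart by their large core distances.
    """
    return max(core_distance(p, points, k),
               core_distance(q, points, k),
               euclid(p, q))
```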
Introduction of HC method. Hierarchical clustering creates a hierarchically nested clustering tree by calculating the distance between data points of different clusters, as shown in Fig. 4a. It does not require the number of clusters to be determined in advance and, as Fig. 4b shows, the hierarchical relationships are easy to discover, similar to HDBSCAN; hierarchical clustering is therefore used for the comparison experiments. Using video similarity as the distance measure between videos, the initial video distance matrix is constructed. The distance matrix is updated after each iteration until no new cluster appears, where the distance between clusters is computed as the average distance.
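The average-linkage agglomeration can be sketched as follows; the distance threshold used as a stopping rule is an assumption standing in for "until no new cluster appears":

```python
def average_linkage_clustering(dist, threshold):
    """Agglomerative clustering with average linkage over a distance matrix.

    `dist[i][j]` is the distance between videos i and j. The closest pair of
    clusters (by average cross-cluster distance) is merged repeatedly until
    no pair is closer than `threshold`.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance over all cross-cluster point pairs.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # no sufficiently close pair remains
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```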
3.4 Topic evolution
The state of topic evolution can be represented by topic intensity and topic content. Topic intensity can be regarded as the degree of attention and discussion a topic receives over a period of time [39]: the more blogs discuss a topic, the hotter the topic is [38]. For video topics represented by clusters, the most intuitive measure of topic intensity is the cluster size. To analyze the evolution of topic content, we examine the change of topic words across time slices. The fused texts of the videos in each cluster are segmented, and the words are ranked by TF-IDF to obtain the top keywords. The content evolution can then be explained by the change of these keywords.
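The keyword-ranking step can be sketched as follows, with documents given as token lists; computing TF over the cluster's pooled text and IDF over the whole corpus, as well as the smoothed IDF form, are assumed choices:

```python
import math
from collections import Counter

def top_keywords(cluster_docs, all_docs, n=5):
    """Rank a topic cluster's words by TF-IDF and return the top n.

    TF counts word occurrences in the cluster's pooled text; IDF is
    computed over the full corpus with add-one smoothing (an assumption).
    """
    pooled = [w for doc in cluster_docs for w in doc]
    tf = Counter(pooled)
    total = len(pooled)
    n_docs = len(all_docs)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in all_docs if word in doc)
        idf = math.log(n_docs / (1 + df)) + 1  # smoothed IDF
        scores[word] = (count / total) * idf
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n]]
```

Applying this per time slice and comparing the resulting keyword lists reveals how a topic's content shifts over time.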