In this section, we present the details of the proposed video topic detection method, which consists of three phases: data acquisition and video representation, topic detection, and evolution analysis. Specifically, data acquisition and video representation involve acquiring video metadata, transforming the different modalities into high-level semantic text, and representing each video as a 768-dimensional vector (e.g., [0.1444, -0.062, … 0.096, -0.072]) learned from the fused texts by a transformer encoder. Topic detection is mainly performed by the HDBSCAN clustering method, which is compared with hierarchical clustering. We illustrate the evolution of topics in terms of both intensity and content: the intensity is measured by the number of videos, and the content evolution is revealed by the change of keywords ranked by TF-IDF. The workflow is shown in Fig. 1 and the proposed topic detection method is depicted in Algorithm 1.
3.1 Preprocessing
To improve the accuracy and performance of the proposed method, preprocessing is essential for the videos' titles, tags, description texts, audios and covers. Texts such as titles, tags and descriptions are cleaned by removing invalid symbols, anomalies and null values, and are then combined into a single text. Audio is converted into text by automatic speech recognition (ASR); because the converted audio texts are long and contain invalid data, summary extraction is applied to them. Optical character recognition (OCR) is applied to video covers to detect text. The process is shown in Fig. 2a.
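The cleaning and combining steps for the textual metadata can be sketched as follows; the exact symbol whitelist kept by the regular expression is an illustrative assumption, not the paper's specification:

```python
import re

def clean_text(text):
    """Remove invalid symbols, collapse whitespace; treat null values as ''.

    The kept character set (word characters, whitespace, basic punctuation)
    is an assumed choice for illustration.
    """
    if text is None:
        return ""
    # Replace anything outside letters, digits, whitespace and punctuation.
    text = re.sub(r"[^\w\s.,!?;:'\"-]", " ", text)
    # Collapse the repeated whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

def combine_metadata(title, tags, description):
    """Combine cleaned title, tags and description into one text datum."""
    parts = [clean_text(title),
             clean_text(" ".join(tags or [])),
             clean_text(description)]
    return " ".join(p for p in parts if p)
```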
3.2 Data fusion and embedding
As shown in Fig. 2a, data fusion aims to integrate the multiple modal features of a video to provide a more comprehensive description. Considering the misalignment and semantic-level differences among the modalities, the audio and video cover are transformed into text form with richer semantics. During fusion, the separator token ‘SEP’ is inserted between the texts of different modalities, and ‘CLS’ is used as the beginning symbol, i.e., ‘CLS’ + ‘video title text’ + ‘SEP’ + ‘audio text’ + ‘SEP’ + ‘video cover text’ + ‘SEP’.
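The fusion format above amounts to a simple token-delimited concatenation; a minimal sketch, where the literal strings stand in for the tokenizer's own special symbols:

```python
def fuse_modalities(title_text, audio_text, cover_text):
    """Concatenate per-modality texts with BERT-style special tokens.

    '[CLS]' opens the sequence and '[SEP]' closes each modality's text,
    mirroring the fusion format described in the section.
    """
    return "[CLS]" + "[SEP]".join([title_text, audio_text, cover_text]) + "[SEP]"
```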
Video embedding converts videos into feature vectors learned from the multiple modalities. We expect semantic interactions to arise among the fused features, yielding semantically richer representations. The attention mechanism in the transformer model is designed to capture the contextual relationships of words in a text, meaning that each word attends to every other word [32], as shown in Fig. 2b. Therefore, a transformer-based model is used for multimodal fusion so that videos are represented in the same semantic vector space. The similarity between videos can then be measured by the cosine similarity of their vectors.
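Given two embedding vectors produced by the transformer encoder, the similarity measure is standard cosine similarity; a minimal self-contained sketch:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Identical directions score 1.0 and orthogonal vectors score 0.0, so higher values indicate more semantically similar videos.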
3.3 Topic detection
Considering the multimodal features of videos, previous textual topic models cannot be used to detect video topics, so clustering algorithms are applied for video topic detection in this paper. Clusters are formed by videos with high similarity, and each cluster represents a topic. Since the video vectors have high dimensionality, clustering directly on them may not work well [32]. We therefore first reduce the dimensionality and then perform HDBSCAN clustering. Uniform Manifold Approximation and Projection (UMAP) is employed for the reduction. Compared with existing dimensionality reduction methods such as PCA and t-SNE, UMAP preserves more of the local and global structure of high-dimensional data at a lower projection dimension [44]. Moreover, Allaoui et al. [45] demonstrated that reducing high-dimensional embeddings with UMAP improves HDBSCAN's clustering accuracy and runtime. Thus, UMAP is used to reduce the dimensionality of the embeddings.
Introduction of HDBSCAN method. HDBSCAN defines a new distance measure, the mutual reachability distance, which better reflects the density of points and thus handles clusters of varying density. When clustering, HDBSCAN uses a soft-clustering approach that models noise as outliers, preventing irrelevant data from being assigned to any cluster and thereby improving the clustering results [46]. Therefore, HDBSCAN is adopted to cluster similar videos into topics. It can be roughly divided into the following steps: transforming the space, building the minimum spanning tree, building the cluster hierarchy, condensing the cluster tree, and extracting the clusters, as shown in Fig. 3.
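The mutual reachability distance underlying the first step is max(core_k(p), core_k(q), d(p, q)), where core_k(x) is the distance from x to its k-th nearest neighbour. A minimal sketch with Euclidean base distance (the base metric is an assumption):

```python
import math

def euclid(p, q):
    """Plain Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def core_distance(point, points, k):
    """Distance from `point` to its k-th nearest neighbour."""
    dists = sorted(euclid(point, q) for q in points if q != point)
    return dists[k - 1]

def mutual_reachability(p, q, points, k):
    """Mutual reachability distance: max of both core distances and d(p, q).

    Dense regions keep their small pairwise distances, while sparse points
    are pushed apart by their large core distances.
    """
    return max(core_distance(p, points, k),
               core_distance(q, points, k),
               euclid(p, q))
```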
Introduction of HC method. Hierarchical clustering creates a hierarchically nested clustering tree by calculating the distance between data points of different clusters, as shown in Fig. 4a. It does not require the number of clusters to be determined in advance and, as Fig. 4b shows, the hierarchical relationships are easy to discover, similar to HDBSCAN; hierarchical clustering is therefore used for the comparison experiments. Using video similarity as the distance measure between videos, the initial video distance matrix is constructed. The distance matrix is updated after each iteration until no new cluster appears, where the distance between clusters is computed as the average distance.
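The average-linkage agglomeration can be sketched as follows; the distance threshold used as a stopping rule is an assumption standing in for "until no new cluster appears":

```python
def average_linkage_clustering(dist, threshold):
    """Agglomerative clustering with average linkage over a distance matrix.

    `dist[i][j]` is the distance between videos i and j. The closest pair of
    clusters (by average cross-cluster distance) is merged repeatedly until
    no pair is closer than `threshold`.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance over all cross-cluster point pairs.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # no sufficiently close pair remains
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```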
3.4 Topic evolution
The state of topic evolution can be represented by topic intensity and topic content. Topic intensity can be regarded as the degree of attention and discussion a topic receives over a period of time [39]: the more blogs discuss a topic, the hotter the topic is [38]. For video topics represented by clusters, the most intuitive measure of topic intensity is the cluster size. To analyze the evolution of topic content, we examine the change of topic words across time slices. The fused texts of the videos in each cluster are segmented, and the words are ranked by TF-IDF to obtain the top keywords. The content evolution can then be explained by the change of these keywords.
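The keyword-ranking step can be sketched as follows, with documents given as token lists; computing TF over the cluster's pooled text and IDF over the whole corpus, as well as the smoothed IDF form, are assumed choices:

```python
import math
from collections import Counter

def top_keywords(cluster_docs, all_docs, n=5):
    """Rank a topic cluster's words by TF-IDF and return the top n.

    TF counts word occurrences in the cluster's pooled text; IDF is
    computed over the full corpus with add-one smoothing (an assumption).
    """
    pooled = [w for doc in cluster_docs for w in doc]
    tf = Counter(pooled)
    total = len(pooled)
    n_docs = len(all_docs)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in all_docs if word in doc)
        idf = math.log(n_docs / (1 + df)) + 1  # smoothed IDF
        scores[word] = (count / total) * idf
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n]]
```

Applying this per time slice and comparing the resulting keyword lists reveals how a topic's content shifts over time.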